
Senior Site Reliability Engineer, Wikimedia Enterprise
Wikimedia Foundation
Posted about 4 hours ago
Base Salary
$117k - $181k/yr
Responsibilities
- Define, track, and improve Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Build and enhance observability systems for proactive detection and troubleshooting.
- Drive reliability engineering practices, including capacity planning and load testing.
- Improve developer experience by enabling self-service infrastructure.
- Partner with engineering teams to embed reliability best practices early in development.
- Design and optimize CI/CD and GitOps workflows for automated deployments.
- Implement secure-by-default infrastructure and enforce best practices.
- Continuously optimize infrastructure cost and efficiency using FinOps principles.
- Establish and track operational metrics to drive continuous improvement.
- Reduce operational toil by implementing automation-first solutions.
- Contribute to and evolve internal platform capabilities for scalability.
- Collaborate with a globally distributed team.
- Mentor peers in technical and operational areas.
Requirements
- Experience with Infrastructure as Code and automation tools like Terraform or Ansible.
- Proficiency in at least one programming language such as Python or Go.
- Experience designing and operating cloud-based systems on platforms like AWS, Azure, or GCP.
- Familiarity with CI/CD pipelines and GitOps workflows.
- Experience with incident response and leading postmortems.
- Strong understanding of SRE best practices, including SLOs and observability.
- Ability to work effectively in a distributed, cross-functional environment.
- Familiarity with Wikimedia or other open-source projects is a plus.
Tech Stack
Ansible, Apache Flink, Apache Kafka, Apache Spark, AWS, Azure, Go, Google Cloud Platform, Kubernetes, Prometheus, Python, Terraform