Responsibilities
- Analyze systemic failure patterns and design improvements to prevent incidents.
- Define and maintain SLO/SLA frameworks, and use error budgets to prioritize reliability investments.
- Build tooling and automation to reduce incident response toil.
- Own Rootly configuration and its integrations with other incident management tools.
- Analyze reliability data and build dashboards to drive action.
- Serve as an on-call Incident Commander for production incidents.
- Develop and deliver training programs for engineering teams.
- Edit and review customer-facing incident documents for clarity and quality.
- Partner with engineering leaders to enhance reliability practices.
Requirements
- 10+ years in SRE, incident management, or reliability engineering.
- Cloud experience with at least one of AWS, GCP, or Azure.
- Deep expertise with incident management tooling such as Rootly or PagerDuty.
- Strong understanding of distributed systems and failure modes at scale.
- Experience with observability tools for diagnosing complex issues.
- Kubernetes and container orchestration experience.
- Familiarity with CI/CD pipelines and release processes.
- Strong written communication skills for documentation.
- Experience navigating reliability programs in large organizations.