about 2 months ago
Remote, Canada
Staff+
Responsibilities
- Analyze systemic failure patterns and design reliability improvements.
- Own Rootly configuration, workflows, and integrations with incident management tools.
- Define and maintain SLO/SLA frameworks and use error budgets to prioritize reliability investments.
- Establish standards and practices for incident response across engineering teams.
- Edit and review customer-facing incident documents for quality and clarity.
- Develop and deliver training programs and coach teams through post-mortems.
- Partner with engineering leaders to elevate reliability practices organization-wide.
Requirements
- 10+ years of experience in SRE, incident management, or reliability engineering.
- Cloud experience with AWS, GCP, or Azure.
- Experience in reliability/incident programs at organizations with 500+ engineers.
- Deep expertise with incident management tooling such as Rootly or PagerDuty.
- Strong understanding of distributed systems and failure modes at scale.
- Experience with observability tooling for metrics, logging, and tracing.
- Experience with Kubernetes and container orchestration.
- Understanding of CI/CD pipelines and release processes.
- Strong written communication skills for design docs and post-mortems.
- Experience driving organization-wide process and cultural changes.