Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

3 months ago

Remote, Canada

Staff+

Responsibilities

Analyze systemic failure patterns and design reliability improvements.
Own Rootly configuration, workflows, and integrations with incident management tools.
Define and maintain SLO/SLA frameworks and use error budgets to prioritize reliability investments.
Establish standards and practices for incident response across engineering teams.
Edit and review customer-facing incident documents for quality and clarity.
Develop and deliver training programs and coach teams through post-mortems.
Partner with engineering leaders to elevate reliability practices organization-wide.

Requirements

10+ years of experience in SRE, incident management, or reliability engineering.
Cloud experience with AWS, GCP, or Azure.
Experience in reliability/incident programs at organizations with 500+ engineers.
Deep expertise with incident management tooling like Rootly or PagerDuty.
Strong understanding of distributed systems and failure modes at scale.
Experience with observability tools including metrics, logging, and tracing.
Kubernetes and container orchestration experience is required.
Understanding of CI/CD pipelines and release processes.
Strong written communication skills for design docs and post-mortems.
Experience driving organization-wide process and cultural changes.