GrepJob
Confluent

Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

about 2 months ago
Remote, Canada
Staff+

Responsibilities

  • Analyze systemic failure patterns and design reliability improvements.
  • Own Rootly configuration, workflows, and integrations with incident management tools.
  • Define and maintain SLO/SLA frameworks and use error budgets to prioritize reliability investments.
  • Establish standards and practices for incident response across engineering teams.
  • Edit and review customer-facing incident documents for quality and clarity.
  • Develop and deliver training programs and coach teams through post-mortems.
  • Partner with engineering leaders to elevate reliability practices organization-wide.

Requirements

  • 10+ years of experience in SRE, incident management, or reliability engineering.
  • Cloud experience with AWS, GCP, or Azure.
  • Experience in reliability/incident programs at organizations with 500+ engineers.
  • Deep expertise with incident management tooling such as Rootly or PagerDuty.
  • Strong understanding of distributed systems and failure modes at scale.
  • Experience with observability tools including metrics, logging, and tracing.
  • Kubernetes and container orchestration experience is required.
  • Understanding of CI/CD pipelines and release processes.
  • Strong written communication skills for design docs and post-mortems.
  • Experience driving organization-wide process and cultural changes.