Confluent

Staff Software Engineer I - SRE

20 days ago
Remote, India
Staff+
H1B Sponsor

Responsibilities

  • Analyze systemic failure patterns and design improvements to prevent incidents.
  • Define and maintain SLO/SLA frameworks, and use error budgets to guide reliability investments.
  • Build tooling and automation to reduce incident response toil.
  • Own Rootly configuration and integrations with incident management tools.
  • Analyze reliability data and build dashboards to drive action.
  • Serve as an on-call Incident Commander for production incidents.
  • Develop and deliver training programs for engineering teams.
  • Edit and review customer-facing incident documents for clarity and quality.
  • Partner with engineering leaders to enhance reliability practices.

Requirements

  • 10+ years in SRE, incident management, or reliability engineering.
  • Cloud experience with at least one of AWS, GCP, or Azure.
  • Deep expertise with incident management tooling like Rootly or PagerDuty.
  • Strong understanding of distributed systems and failure modes at scale.
  • Experience with observability tools for diagnosing complex issues.
  • Kubernetes and container orchestration experience.
  • Familiarity with CI/CD pipelines and release processes.
  • Strong written communication skills for documentation.
  • Experience navigating reliability programs in large organizations.