Stuut

Lead Site Reliability Engineer

about 5 hours ago
San Francisco, CA, USA
Senior / Staff+

Base Salary

$200k - $275k/yr

Responsibilities

  • Define the long-term vision for site reliability, including SLOs/SLIs and operational standards.
  • Architect and maintain resilient, scalable cloud infrastructure across AWS and Kubernetes.
  • Design and evolve monitoring, alerting, and logging systems for actionable insights.
  • Lead incident management practices and drive blameless postmortems.
  • Identify reliability risks and lead efforts around redundancy and capacity planning.
  • Partner with engineering teams to ensure safe and observable deployments.
  • Automate operational tasks and improve developer experience.
  • Guide teams through debugging reliability issues and root cause resolution.
  • Promote reliability-first thinking and shared ownership of production systems.
  • Mentor engineers on reliability principles and operational best practices.

Requirements

  • 7+ years of experience in site reliability engineering or related fields.
  • Experience designing and operating highly available production-grade systems.
  • Fluency in Python and/or TypeScript for building automation and tooling.
  • Deep experience with AWS, Kubernetes, Docker, and cloud-native architectures.
  • Experience implementing observability stacks and creating high-signal alerting.
  • Understanding of SLOs, SLIs, and error budgets.
  • Familiarity with modern stacks like FastAPI, Vue.js, and PostgreSQL.
  • Experience with CI/CD pipelines and infrastructure as code.
  • Ability to balance reliability, velocity, and cost in decision-making.
  • Strong collaboration skills across multiple engineering teams.

Benefits

  • Top-of-market salary and equity package.
  • Medical, dental & vision insurance coverage.
  • 401(k) with match.
  • Flexible PTO.
  • Parental leave.