San Francisco, CA, USA
Senior / Staff+
Base Salary
$200k - $275k/yr
Responsibilities
- Define the long-term vision for site reliability, including SLOs/SLIs and operational standards.
- Architect and maintain resilient, scalable cloud infrastructure across AWS and Kubernetes.
- Design and evolve monitoring, alerting, and logging systems for actionable insights.
- Lead incident management practices and drive blameless postmortems.
- Identify reliability risks and lead efforts around redundancy and capacity planning.
- Partner with engineering teams to ensure safe and observable deployments.
- Automate operational tasks and improve developer experience.
- Guide teams through debugging reliability issues and root cause resolution.
- Promote reliability-first thinking and shared ownership of production systems.
- Mentor engineers on reliability principles and operational best practices.
Requirements
- 7+ years of experience in site reliability engineering or related fields.
- Experience designing and operating highly available production-grade systems.
- Fluency in Python and/or TypeScript for building automation and tooling.
- Deep experience with AWS, Kubernetes, Docker, and cloud-native architectures.
- Experience implementing observability stacks and creating high-signal alerting.
- Understanding of SLOs, SLIs, and error budgets.
- Familiarity with modern stacks like FastAPI, Vue.js, and PostgreSQL.
- Experience with CI/CD pipelines and infrastructure as code.
- Ability to balance reliability, velocity, and cost in decision-making.
- Strong collaboration skills across multiple engineering teams.
Benefits
- Top-of-market salary and equity package.
- Medical, dental & vision insurance coverage.
- 401(k) with match.
- Flexible PTO.
- Parental leave.
