Lead Site Reliability Engineer

Base Salary

$200k - $275k/yr

Responsibilities

Define the long-term vision for site reliability, including SLOs/SLIs and operational standards.
Architect and maintain resilient, scalable cloud infrastructure across AWS and Kubernetes.
Design and evolve monitoring, alerting, and logging systems that produce actionable insights.
Lead incident management practices and drive blameless postmortems.
Identify reliability risks and lead redundancy and capacity-planning efforts.
Partner with engineering teams to ensure safe and observable deployments.
Automate operational tasks and improve developer experience.
Guide teams through debugging reliability issues and resolving their root causes.
Promote reliability-first thinking and shared ownership of production systems.
Mentor engineers on reliability principles and operational best practices.