24 days ago
Base Salary
$201k - $251k/yr
Responsibilities
- Lead the design, implementation, and operation of reliable, scalable cloud infrastructure.
- Define SLI/SLO standards and begin rolling them out across microservices.
- Develop self-service instrumentation tooling enabling engineering teams to own observability.
- Establish observability and monitoring using an open-source (OSS) toolchain.
- Serve as a technical escalation point for critical incidents and perform deep-dive root cause analyses.
- Enhance monitoring, logging, and tracing systems for comprehensive visibility into system health.
- Set the technical direction and best practices for the SRE and engineering organization.
- Mentor mid-level and senior engineers on design patterns, operational rigor, and reliability principles.
Requirements
- 8+ years of progressive experience in Site Reliability Engineering or a closely related role.
- Deep expertise with AWS, Kubernetes, Helm, and observability stacks.
- Fluency in at least one major scripting/programming language for automation and tooling.
- Hands-on engineering mindset, with the ability to instrument services directly.
- Track record of building or improving incident detection and response systems.
- Deep technical familiarity with Kubernetes ecosystems and modern IaC tooling.
- Exceptional communication skills for explaining complex technical issues.
