Staff/Lead Site Reliability Engineer (SRE)

24 days ago

H1B Sponsor

Base Salary

$201k - $251k/yr

Responsibilities

Lead the design, implementation, and operation of reliable, scalable cloud infrastructure.
Define SLI/SLO standards and begin rolling them out across microservices.
Develop self-service instrumentation tooling enabling engineering teams to own observability.
Establish observability and monitoring using an open-source (OSS) toolchain.
Serve as a technical escalation point for critical incidents and perform deep-dive root cause analyses.
Enhance monitoring, logging, and tracing systems for comprehensive visibility into system health.
Set the technical direction and best practices for the SRE and engineering organization.
Mentor mid-level and senior engineers on design patterns, operational rigor, and reliability principles.

Qualifications

8+ years of progressive experience in Site Reliability Engineering or a closely related role.
Deep expertise with AWS, Kubernetes, Helm, and observability stacks.
Fluency in at least one major scripting/programming language for automation and tooling.
Hands-on engineering mindset, with the ability to instrument services directly.
Track record of building or improving incident detection and response systems.
Deep technical familiarity with Kubernetes ecosystems and modern IaC tooling.
Exceptional communication skills for explaining complex technical issues.

AWS, Go, Grafana, Harness, Helm, Istio, Java, Kubernetes, Prometheus, Python, Terraform