Staff Site Reliability Engineer

7 days ago

Remote, United States

Staff+

H1B Sponsor

Base Salary

$150k - $225k/yr

Responsibilities

Architect reliability frameworks and self-service tooling for teams.
Advance AI-driven reliability by automating diagnostics and proactive failure prevention.
Embed SRE practices across engineering via design reviews and operational standards.
Act as Incident Commander during critical events and lead blameless postmortems.
Deliver end-to-end monitoring and profiling to optimize performance.
Mentor engineers across SRE and product teams through knowledge sharing.

Qualifications

8+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
At least 3 years in a Senior+ SRE position.
Strong background in running production SaaS systems at scale.
Proficiency in at least one programming/scripting language (Python, Go, or similar).
Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes.
Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing).
Experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, ELK).
Familiarity with advanced observability techniques (OpenTelemetry, continuous profiling).
Proven incident management experience, including leading high-severity incidents.
Strong troubleshooting skills across the full stack.
Excellent communication and collaboration skills.

AWS, Azure, Datadog, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python