Staff Site Reliability Engineer

3 months ago

Bengaluru, India

Staff+

H1B Sponsor

Responsibilities

Architect frameworks and self-service tooling for service reliability.
Drive the AIOps strategy for automated diagnostics and proactive failure prevention.
Embed SRE practices across engineering through design reviews and operational standards.
Act as Incident Commander during critical events and lead blameless postmortems.
Deliver end-to-end monitoring and profiling to optimize performance.
Mentor engineers across SRE and product teams through technical guidance.

Requirements

8+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
At least 3 years in a Senior+ SRE position.
Strong background in running production SaaS systems at scale.
Proficiency in at least one programming/scripting language (Python, Go, etc.).
Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes.
Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing).
Experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, ELK).
Familiarity with advanced observability tools (OpenTelemetry, continuous profiling).
Proven incident management experience, including leading high-severity incidents.
Strong troubleshooting skills across the full stack.
Excellent communication and collaboration skills.