Staff Site Reliability Engineer

3 months ago

Delhi, India

Staff+

H1B Sponsor

Responsibilities

Architect reliability frameworks and self-service tooling for service ownership.
Drive AIOps strategy for automated diagnostics and proactive failure prevention.
Embed SRE practices through design reviews and operational standards.
Lead incident management as Incident Commander during critical events.
Deliver end-to-end monitoring and profiling to optimize system performance.
Mentor engineers across SRE and product teams through knowledge sharing.

Requirements

8+ years of experience in Site Reliability Engineering or similar roles.
At least 3 years in a Senior+ SRE position.
Strong background in running production SaaS systems at scale.
Proficiency in programming/scripting languages like Python or Go.
Hands-on expertise with cloud platforms such as AWS, GCP, or Azure.
Deep understanding of networking fundamentals like TCP/IP and DNS.
Experience with monitoring and alerting tools like Prometheus and Grafana.
Familiarity with advanced observability techniques.
Proven incident management experience with high-severity incidents.
Strong troubleshooting skills across the full stack.
Excellent communication and collaboration skills.

Tech Stack

AWS, Azure, Datadog, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python
