Sr Site Reliability Engineer

3 days ago

Hyderābād, India

Senior

Responsibilities

Implement SRE frameworks, best practices, and playbooks provided by the CoE.
Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.
Contribute to automation projects that improve reliability and reduce manual work.
Build and maintain monitoring solutions with various observability tools.
Create and refine dashboards, metrics, and alerts for proactive anomaly detection.
Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.
Participate in capacity planning, resiliency testing, and scaling reviews.
Support chaos engineering and reliability validation activities.
Participate in incident response, including on-call rotations for 24x7 coverage.
Assist with root cause analysis and implement corrective actions.
Ensure alignment with ITSM processes for incident, problem, and change management.
Collaborate with various teams to embed reliability practices and share knowledge.

Requirements

7+ years in SRE, Operations, or Infrastructure Engineering.
Strong hands-on experience with monitoring and observability platforms.
Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, Graylog.
Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.
Good knowledge of SLIs, SLOs, SLAs, and error budgets.
Hands-on experience with AWS services (EC2, ECS, EKS, networking, scaling groups).
Proficiency in containers & Kubernetes (Docker, EKS).
Scripting or programming in Python, Go, or shell.
Understanding of networking, distributed systems, and high-availability architectures.
Exposure to ITIL/ITSM processes.

AWS, Datadog, Docker, Elasticsearch, Go, Grafana, Graylog, Kubernetes, Microsoft SQL Server, MongoDB, Prometheus, Python