Hyderābād, India
Senior
Responsibilities
- Implement SRE frameworks, best practices, and playbooks provided by the Center of Excellence (CoE).
- Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
- Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.
- Contribute to automation projects that improve reliability and reduce manual work.
- Build and maintain monitoring solutions with various observability tools.
- Create and refine dashboards, metrics, and alerts for proactive anomaly detection.
- Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.
- Participate in capacity planning, resiliency testing, and scaling reviews.
- Support chaos engineering and reliability validation activities.
- Participate in incident response, including on-call rotations for 24x7 coverage.
- Assist with root cause analysis and implement corrective actions.
- Ensure alignment with ITSM processes for incident, problem, and change management.
- Collaborate with various teams to embed reliability practices and share knowledge.
Requirements
- 7+ years in SRE, Operations, or Infrastructure Engineering.
- Strong hands-on experience with monitoring and observability platforms.
- Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, Graylog.
- Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.
- Good knowledge of SLIs, SLOs, SLAs, and error budgets.
- Hands-on experience with AWS services (EC2, ECS, EKS, networking, Auto Scaling groups).
- Proficiency with containers and Kubernetes (Docker, EKS).
- Scripting or programming in Python, Go, or shell.
- Understanding of networking, distributed systems, and high-availability architectures.
- Exposure to ITIL/ITSM processes.