GHX

Sr Site Reliability Engineer

3 days ago
Hyderābād, India
Senior

Responsibilities

  • Implement SRE frameworks, best practices, and playbooks provided by the Center of Excellence (CoE).
  • Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
  • Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.
  • Contribute to automation projects that improve reliability and reduce manual work.
  • Build and maintain monitoring solutions with various observability tools.
  • Create and refine dashboards, metrics, and alerts for proactive anomaly detection.
  • Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.
  • Participate in capacity planning, resiliency testing, and scaling reviews.
  • Support chaos engineering and reliability validation activities.
  • Participate in incident response, including on-call rotations for 24x7 coverage.
  • Assist with root cause analysis and implement corrective actions.
  • Ensure alignment with ITSM processes for incident, problem, and change management.
  • Collaborate with various teams to embed reliability practices and share knowledge.

Requirements

  • 7+ years in SRE, Operations, or Infrastructure Engineering.
  • Strong hands-on experience with monitoring and observability platforms.
  • Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, Graylog.
  • Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.
  • Good knowledge of SLIs, SLOs, SLAs, and error budgets.
  • Hands-on experience with AWS services (EC2, ECS, EKS, networking, Auto Scaling groups).
  • Proficiency in containers & Kubernetes (Docker, EKS).
  • Scripting or programming in Python, Go, or shell.
  • Understanding of networking, distributed systems, and high-availability architectures.
  • Exposure to ITIL/ITSM processes.

Tech Stack

AWS, Datadog, Docker, Elasticsearch, Go, Grafana, Graylog, Kubernetes, Microsoft SQL Server, MongoDB, Prometheus, Python
