about 3 hours ago
Bengaluru, India
Senior
H1B Sponsor
Responsibilities
- Facilitate blameless post-incident reviews to identify root causes and prioritize reliability improvements.
- Implement chaos engineering practices to validate system resilience and recovery procedures.
- Establish core SRE principles and frameworks across the organization.
- Manage error budgets to balance feature velocity with system reliability.
- Automate repetitive operational tasks to reduce toil.
- Implement capacity planning processes to ensure systems meet SLOs.
- Build observability systems for deep visibility into service health and performance.
- Create SRE dashboards for real-time visibility into reliability metrics.
- Partner with development teams to implement reliability from the design phase.
- Drive continuous improvement through SRE feedback loops and documentation.
Requirements
- 8+ years of experience in DevOps/SRE roles with expertise in SRE principles.
- Deep experience with observability and monitoring platforms like Prometheus and Grafana.
- Strong background in incident management and conducting blameless postmortems.
- Understanding of distributed systems and reliability engineering concepts.
- Experience with Kubernetes, Docker, and service mesh technologies.
- Proficiency in cloud-focused software development, preferably in Go or Python.
- Experience with Infrastructure as Code tools like Terraform or Ansible.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure.
- Ability to communicate effectively with technical and non-technical stakeholders.
- BS degree in Computer Science or equivalent.
Benefits
- Comprehensive benefits including healthcare, life, and retirement options.
- Global access to mental health and financial wellness support.
- Flexible work arrangements with a hybrid work approach.
- Time off for vacation and personal reasons.
Tech Stack
Ambassador, Ansible, AWS, Azure, Datadog, Docker, Go, Google Cloud Platform, Grafana, Istio, Kubernetes, Prometheus, Python, Terraform
Categories
Backend, DevOps