about 14 hours ago
Bengaluru, India
Mid Level / Senior
H1B Sponsor
Responsibilities
- Own incident management practices across all production systems.
- Act as the primary Incident Manager for high-priority production incidents.
- Administer and optimize CI/CD pipelines for safe and frequent deployments.
- Continuously improve incident response runbooks and escalation matrices.
- Drive root cause analysis for major incidents and track action items.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
- Establish and enforce SLA/SLO/SLI frameworks across production services.
- Build automated runbooks and self-healing mechanisms.
- Implement synthetic monitoring to detect customer-facing issues.
- Use Splunk for incident investigation and observability.
Requirements
- 4+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
- Hands-on experience leading incident response for high-severity production incidents.
- Strong background in Linux systems administration and distributed systems troubleshooting.
- Experience defining and managing SLOs, SLIs, and error budgets in production.
Benefits
- Generous time-off policies.
- Top-shelf benefits.
- Education, wellness, and lifestyle support.
