Pune, India
Senior / Mid Level
H1B Sponsor
Responsibilities
- Architect reliability improvements across Kubernetes, GPU infrastructure, ML Ops, networking, and monitoring.
- Lead incident management, blameless post-mortems, and error-budget policies.
- Drive automation, IaC, and reliability tooling at scale.
- Oversee metrics, logs, tracing, and dashboards; ensure actionable alerting.
- Integrate GPU operators/exporters and model lifecycle workflows for inference platforms.
- Mentor junior and mid-level SREs and guide cross-team initiatives.
Requirements
- 5–8 years of SRE or platform engineering experience.
- Expert-level Kubernetes operations experience, plus cloud platform experience (AWS/GCP/Azure).
- Strong networking and security fundamentals.
- Strong coding background (Python, Go, or Java).
- Deep observability knowledge (Prometheus, Grafana, ELK/Fluentd).
