Bengaluru, India
Senior / Mid Level
Responsibilities
- Own production services end-to-end, including reliability, scalability, and operational excellence.
- Participate in on-call rotation and lead incident response.
- Engage with engineering, operations, and product teams to maintain a highly available application platform.
- Collaborate to identify gaps, prioritize, and resolve issues.
- Define, implement, and maintain SLIs and SLOs aligned with customer experience.
- Design and instrument SLIs such as latency, error rates, and availability across critical services.
- Manage and enforce error budgets to balance system reliability with product feature velocity.
- Improve alert quality by reducing noise and focusing on actionable alerts.
- Embed with product teams to review architectures and catch reliability risks early.
- Share knowledge and experience with the engineering organization.
Requirements
- Bachelor's degree in computer science, engineering, or a related field.
- 4+ years of application development experience with Java or an equivalent language.
- Experience with the Spring environment.
- Experience with cloud-based infrastructure (Azure, AWS, GCP, etc.).
- Understanding of performance factors in software applications at multiple layers.
- Hands-on experience with observability tools (Datadog, Prometheus, Grafana, etc.).
- Familiarity with CI/CD pipelines and infrastructure-as-code (Terraform, Helm, Jenkins, GitLab).
- Experience deploying AI systems in production.
- Strong understanding of prompt engineering and evaluation of LLM outputs in reliability workflows.
Benefits
- Comprehensive healthcare coverage.
- Flexible PTO and equity RSUs.
- Annual performance bonus opportunities.
- Retirement account support and 14+ weeks of paid parental leave.
- Career development opportunities and company-paid privacy certification exam fees.
Tech Stack
AWS, Azure, Bash, Datadog, Google Cloud Platform, Grafana, Helm, Java, Jenkins, Kubernetes, Prometheus, Python, Ruby, SQL, Terraform