Bengaluru, India
Staff+
Responsibilities
- Design and build platforms, tools, and frameworks to improve system reliability, scalability, and performance.
- Define and implement SRE best practices, including SLIs/SLOs, error budgets, and reliability metrics.
- Lead incident response efforts, drive root cause analysis, and implement long-term fixes to prevent recurrence.
- Analyze system behavior, identify bottlenecks and saturation points, and implement solutions to improve resilience.
- Partner with engineering teams to embed reliability into the software development lifecycle.
- Evaluate emerging technologies and recommend tools that enhance productivity, observability, and system robustness.
- Drive capacity planning, performance tuning, and cost optimization efforts.
- Collaborate with cross-functional teams to identify gaps, prioritize improvements, and resolve production issues.
- Provide technical leadership and mentorship across the engineering organization.
- Influence senior leadership with insights, metrics, and recommendations to improve system health and operational excellence.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 10+ years of experience in software engineering with a strong focus on backend systems and distributed architecture.
- Extensive experience building and operating Java-based systems using RESTful APIs, Spring Boot, and microservices architecture.
- Strong understanding of distributed systems concepts, including fault tolerance, eventual consistency, and scalability.
- Proven experience with cloud platforms (AWS/Azure/GCP) and cloud-native architectures.
- Expertise in observability tools such as Prometheus, Grafana, ELK, or similar.
- Experience defining and managing SLIs, SLOs, and error budgets.
- Strong knowledge of CI/CD pipelines, automation, and infrastructure as code.
- Hands-on experience with incident management, root cause analysis (RCA), and postmortems.
- Excellent analytical, debugging, and problem-solving skills.
- Strong communication, collaboration, and leadership abilities.
Benefits
- Comprehensive healthcare coverage.
- Flexible PTO.
- Equity RSUs and annual performance bonus opportunities.
- Retirement account support.
- 14+ weeks of paid parental leave.
- Career development opportunities.
- Company-paid privacy certification exam fees.