Responsibilities
- Build and extend platforms to improve system reliability.
- Contribute to team goals that span reliability across the entire company.
- Standardize reliability tools across multiple platforms and organizations.
- Triage, coordinate, and lead stabilization of sev 0–1 incidents.
- Serve as primary oncall, maintaining structured escalation paths.
- Drive platform-wide reliability improvements and shared operational tooling.
- Use AI-driven systems to improve signal detection and accelerate root cause analysis.
- Design and implement safe deployment patterns.
Requirements
- Ability to drive to root cause in systems with many moving parts.
- Demonstrated technical initiative and leadership on previous projects.
- Familiarity with AI-driven tooling for observability and incident analysis.
- Experience running production oncall for high-availability systems.
- Strong incident management skills including structured triage and blameless postmortems.
- Fluency with CI/CD pipelines and rollback automation.
- Monitoring and observability expertise.
- Ability to create and maintain evidence-based maturity assessments.
- Comfort with vendor/dependency management.
- Boundless curiosity, autonomy, and a strong sense of accountability.
- 5+ years of software development experience.
Benefits
- Remote work options.
- Medical insurance.
- Flexible time off.
- Retirement savings plans.
- Modern family planning support.