Menlo Park, CA, USA
Senior / Mid Level
H1B Sponsor
Base Salary
$196k - $230k/yr
Responsibilities
- Drive the long-term reliability and observability strategy across Robinhood’s infrastructure.
- Collaborate with engineers to enhance operational excellence and incident response.
- Lead incident mitigation efforts and facilitate critical decision-making during incidents.
- Develop and maintain incident management processes to minimize customer impact.
- Own incident discovery by defining global dashboards and alerts.
- Evolve incident response tooling and processes, focusing on MTTD/MTTR improvements.
- Drive post-incident governance and establish standards for reviews.
- Design failure mitigation strategies to prevent major outages.
- Improve monitoring and observability frameworks across services.
- Deliver insights and reports to support business decisions on service reliability.
- Mentor team members and contribute to engineering culture.
Requirements
- 5+ years of software engineering experience with production systems.
- 2+ years focused on reliability engineering or production operations.
- Hands-on experience in incident leadership roles.
- Strong communication skills, especially during high-severity incidents.
- Deep knowledge of systems reliability and fault-tolerant architecture.
- Experience with multi-region architectures and failover strategies.
- Familiarity with modern observability tools like OpenTelemetry and Grafana.
- Proven ability to drive improvements in MTTD, MTTR, and service availability.
Benefits
- Challenging, high-impact work to advance your career.
- Performance-driven compensation with bonuses and equity ownership.
- 100% paid health insurance for employees and 90% for dependents.
- Flexible benefits spending account for wellness and learning.
- Employer-paid life and disability insurance, plus fertility and mental health benefits.
- Generous time off policies including holidays, PTO, and parental leave.
- Exceptional office experience with catered meals and events.
Tech Stack
Grafana, Prometheus
Categories
Backend, DevOps