2 days ago
San Jose, CA, USA
Staff+
H1B Sponsor
Base Salary
$210k - $270k/yr
Responsibilities
- Participate in and influence high-impact incident response efforts.
- Define and evolve organization-wide incident practices and reliability tooling.
- Architect and evolve observability platforms that deliver actionable insights.
- Lead the development of reliability and observability practices.
- Guide teams in building resilient, fault-tolerant services.
- Partner with cross-functional teams to ensure new systems are operable.
- Design and implement internal tools for deployment safety and incident coordination.
- Mentor engineers in operational rigor and reliability principles.
Requirements
- 8+ years of experience operating and scaling production infrastructure.
- Deep expertise in incident response and debugging distributed systems.
- Strong knowledge of observability stacks and alerting strategies.
- Experience with fault isolation and chaos engineering practices.
- Proficiency in infrastructure-as-code and configuration management.
- Ability to influence teams through standards and culture.
- Strong communication skills for mentoring and aligning across teams.
Benefits
- Flexible, hybrid work environment.
- Unlimited vacation.
- 100% paid employee health benefit options.
- Commuter benefits.
- 401(k) with employer-funded match.
- Corporate wellness program.
- Sabbatical leave for employees with 5+ years of service.
- Competitive paid parental leave and fertility reimbursement.
- Cell phone reimbursement.
- Catered lunch every day along with beverages and snacks.
- Employee Resource Groups and ZocClubs.
- Great Place to Work Certified.
Tech Stack
Terraform
Categories
DevOps
Security
