3 months ago
San Francisco, CA, USA or Sunnyvale, CA, USA
Senior / Staff+
Base Salary
$180k - $220k/yr
Responsibilities
- Serve as the highest-level escalation point for complex P1/P0 incidents.
- Lead cross-functional root cause investigations involving various technical layers.
- Design and improve node validation and release readiness processes.
- Influence Kubernetes architecture and workload orchestration for stability.
- Troubleshoot AI/ML infrastructure issues and support complex workloads.
- Act as a senior technical advisor during high-risk customer incidents.
- Mentor P3/P4 engineers and define technical standards for support excellence.
Requirements
- 8+ years of experience in SRE, DevOps, HPC, or Cloud Infrastructure roles.
- Advanced expertise in Linux systems.
- Deep operational experience with Kubernetes (CKA-level or higher).
- Strong networking knowledge, including InfiniBand and RDMA.
- Experience supporting AI/ML workloads at scale.
- Proven track record of resolving multi-layer, distributed system failures.
- Strong customer communication and executive-facing presence.
Benefits
- Competitive compensation with Restricted Stock Units.
- Paid time off and paid holidays.
- Comprehensive health, dental, and vision insurance.
- Employer contributions to an HSA.
- Paid parental leave and life insurance.
- Professional development and tuition reimbursement.
- Mental health and wellness support.
- Commuter benefits and cell phone stipend.
- 401(k) retirement plan with company match.
- Volunteer time off.
