5 days ago
Bellevue, WA, USA +2 more
Senior / Mid Level
Base Salary
$139k - $204k/yr
Responsibilities
- Contribute to the roadmap for Applied Training by identifying essential workloads.
- Design and build a complete research cluster experience, including CLI and job configuration.
- Own the Python SDK for sandbox infrastructure and ensure integration with Kubernetes.
- Write documentation for popular OSS training frameworks on CoreWeave.
- Collaborate with infrastructure teams and customers to enhance system design.
Requirements
- 5-8+ years of experience building distributed systems or ML infrastructure.
- Proven experience with Kubernetes, including custom controllers and workload orchestration.
- Strong understanding of researcher productivity and efficient workflows.
- Familiarity with distributed job scheduling and large-scale training challenges.
- Experience shipping production infrastructure relied upon by users.
Benefits
- 100% paid medical, dental, and vision insurance.
- Company-paid life insurance and short/long-term disability insurance.
- Tuition reimbursement and participation in the Employee Stock Purchase Program.
- Flexible PTO and paid parental leave.
- Catered lunch each day and a casual work environment.
