5 days ago
Bellevue, WA, USA +2 more
Staff+
Base Salary
$165k - $242k/yr
Responsibilities
- Contribute to the roadmap for Applied Training by identifying key workload unlocks.
- Design and build a complete research cluster experience, including CLI and job configuration.
- Own the Python SDK for sandbox infrastructure and ensure integration with Kubernetes clusters.
- Write documentation for running popular open-source training frameworks on CoreWeave.
- Collaborate with infrastructure teams and customers to understand their supercomputing needs.
Requirements
- 8-12+ years of experience building distributed systems or ML infrastructure.
- Hands-on Kubernetes experience, including custom controllers and workload orchestration.
- Strong understanding of what makes researchers productive in their workflows.
- Familiarity with distributed job scheduling and large-scale training challenges.
- Proven track record of shipping production infrastructure relied upon by users.
Benefits
- 100% paid medical, dental, and vision insurance.
- Company-paid life insurance and short/long-term disability insurance.
- Flexible Spending Account and Health Savings Account options.
- Tuition reimbursement and participation in the Employee Stock Purchase Program.
- Mental wellness benefits and family-forming support.
- Flexible PTO and catered lunch each day.
