Base Salary
$207k - $275k/yr
Responsibilities
- Contribute to the roadmap for Applied Training to identify essential workloads.
- Collaborate with customers and internal teams to build cloud-native primitives.
- Design and build a complete research cluster experience addressing researchers' challenges.
- Own the Python SDK for sandbox infrastructure, enabling large-scale RL training runs.
- Write documentation for popular OSS training frameworks to assist customers.
- Engage directly with infrastructure teams and customers to enhance system design.
Requirements
- 8–12+ years of experience building distributed systems or ML infrastructure.
- Proven experience with Kubernetes, including custom controllers and workload orchestration.
- Understanding of researcher productivity and the importance of efficient workflows.
- Familiarity with distributed job scheduling and large-scale training challenges.
- Experience shipping production systems relied upon by users.
- Strong communication skills to translate customer needs into system designs.
Benefits
- 100% paid medical, dental, and vision insurance.
- Company-paid life insurance and voluntary supplemental options.
- Short- and long-term disability insurance.
- Flexible Spending Account and Health Savings Account.
- Tuition reimbursement and participation in the Employee Stock Purchase Program.
- Mental wellness benefits and family-forming support.
- Paid parental leave and flexible childcare support.
- 401(k) with generous employer match and flexible PTO.
- Catered lunch in office locations and a casual work environment.
