about 5 hours ago
Bellevue, WA, USA or Sunnyvale, CA, USA
Senior / Staff+
H1B Sponsor
Base Salary
$182k - $242k/yr
Responsibilities
- Contribute to the roadmap for Applied Training to identify essential workloads.
- Collaborate closely with customers and internal teams on cloud-native primitives.
- Design and build a complete research cluster experience, addressing researchers' challenges.
- Own the Python SDK for sandbox infrastructure, enabling large-scale RL training runs.
- Write documentation for OSS training frameworks to assist customers.
- Engage with infrastructure teams and customers to enhance system designs.
Requirements
- 8–12+ years of experience in building distributed systems or ML infrastructure.
- Proven experience with Kubernetes, including custom controllers and workload orchestration.
- Understanding of researcher productivity and the importance of efficient workflows.
- Familiarity with distributed job scheduling and large-scale training challenges.
- Experience shipping production systems that users rely on.
- Strong communication skills to translate customer needs into system designs.
Benefits
- 100% paid medical, dental, and vision insurance.
- Company-paid life insurance and short/long-term disability insurance.
- Flexible Spending Account and Health Savings Account options.
- Tuition reimbursement and participation in the Employee Stock Purchase Program.
- Mental wellness benefits and family-forming support.
- Flexible PTO and catered lunch in office locations.
