Base Salary
$150k - $300k/yr
Responsibilities
- Build and maintain a training and inference stack that supports fast iteration and flexibility.
- Develop benchmarks to identify bottlenecks in the training and inference processes.
- Explore state-of-the-art advances in training and inference and apply them.
- Design systems for scaling model training across multi-node, multi-GPU environments.
- Scale distributed training and inference workloads across large GPU clusters.
- Build tooling and abstractions to help ML engineers transition from experiment to production.
Requirements
- Strong Python skills and a background in systems engineering.
- Experience with Kubernetes and distributed training frameworks.
- Ability to solve complex problems and build from first principles.
- Comfort working in fast-changing, high-growth environments.
- Effective collaboration across technical and non-technical teams.
- Willingness to take full ownership from strategy through execution.
Benefits
- Unlimited PTO for recharging.
- Free daily lunch with teammates at the office.
- Reimbursed transportation costs.
- Generous health insurance covering medical, dental, and vision.
- Health and wellness budget of up to $150/month.
- Flexible parental leave schedule.
