Base Salary
$200k - $400k/yr
Responsibilities
- Design and build distributed training platforms for LLM and multimodal fine-tuning.
- Integrate state-of-the-art training algorithms into production pipelines.
- Own inference architecture and multi-provider routing, including failover and optimization.
- Lead initiatives to improve latency and cost efficiency across the training and serving stack.
- Build evaluation and experimentation infrastructure for rapid iteration.
- Drive technical direction, mentor engineers, and establish best practices for ML infrastructure.
Requirements
- 6+ years building ML infrastructure or production systems at scale.
- Deep experience with distributed training, including multi-node GPU clusters.
- Strong understanding of LLM inference, latency optimization, and serving architecture.
- Proven track record leading complex, multi-quarter technical projects.
Benefits
- Take-what-you-need vacation policy.
- Medical, dental, and vision benefits for you and your family.
- Life insurance and disability benefits.
- Retirement plan (e.g., 401(k), pension).
- Parental leave.
- Fertility and family building benefits through Carrot.
- Daily lunches and snacks in the office.
