1 day ago
Base Salary
$160k - $230k/yr
Responsibilities
- Design and build full-stack solutions from public training APIs to GPU data planes.
- Scale distributed systems for serverless GPU compute with multi-tenant scheduling and fault tolerance.
- Drive end-to-end performance for training, inference, and RL loops under heavy load.
- Productionize research techniques into reliable components for enterprise use.
Requirements
- 5+ years of experience building and shipping production ML systems.
- Strong foundation in distributed systems and infrastructure, particularly on Kubernetes.
- Familiarity with GPU and LLM infrastructure, including PyTorch and CUDA.
- Proven ability to improve system reliability, throughput, and cost efficiency.
- BS in Computer Science or a related field; MS/PhD is a plus.
- Hands-on experience with LLM post-training is a bonus.