Base Salary
$225K - $550K/yr
Responsibilities
- Scale distributed training across large GPU clusters using data, tensor, and pipeline parallelism.
- Optimize communication patterns and gradient synchronization.
- Improve checkpointing, fault tolerance, and job recovery systems.
- Profile and eliminate performance bottlenecks across compute, networking, and storage.
- Enhance experiment reproducibility and orchestration workflows.
- Increase hardware utilization and training throughput.
- Collaborate with Kernels and Research to align model architecture with systems realities.
Requirements
- Strong software engineering and distributed systems fundamentals.
- Experience training large models in multi-node GPU environments.
- Deep understanding of parallelism strategies and performance trade-offs.
- Experience debugging cross-layer issues in production ML systems.
- Strong ownership mindset and ability to operate critical infrastructure.
- Track record of improving performance or reliability of large-scale systems.
Benefits
- Annual salary range: $225K - $550K.
- Equity is a significant part of total compensation, in addition to salary.
- 401(k) plan with 6% salary matching.
- Generous health, dental, and vision insurance for you and your dependents.
- Unlimited paid time off.
- Visa sponsorship and a relocation stipend to bring you to SF, where possible.
