11 months ago
Remote, Worldwide +2 more
Mid Level / Senior
H-1B Sponsor
Responsibilities
- Design and build core systems for fast, reliable large-scale training runs.
- Build scalable distributed training infrastructure for GPU clusters.
- Implement and tune parallelism/sharding strategies for evolving architectures.
- Optimize distributed efficiency through topology-aware collectives and straggler mitigation.
- Build data loading systems to eliminate I/O bottlenecks for multimodal datasets.
- Develop checkpointing mechanisms balancing memory constraints with recovery needs.
- Create monitoring, profiling, and debugging tools for training stability and performance.
Requirements
- Hands-on experience building distributed training infrastructure using PyTorch Distributed (DDP/FSDP), DeepSpeed ZeRO, or Megatron-LM (TP/PP).
- Experience diagnosing performance bottlenecks and failure modes.
- Understanding of hardware accelerators and networking topologies.
- Experience optimizing data pipelines for machine learning workloads.
- Nice-to-have: experience with MoE training and with large-scale distributed training.
Benefits
- Greenfield challenges with high ownership from day one.
- Competitive base salary with equity in a unicorn-stage company.
- 100% coverage of medical, dental, and vision premiums for employees and dependents.
- 401(k) matching up to 4% of base pay.
- Unlimited PTO plus company-wide Refill Days throughout the year.
