
Member of Technical Staff - Pre-Training Infra
Reflection
3 months ago
London, United Kingdom +2 more
Mid Level / Senior
H-1B Sponsor
Responsibilities
- Build and scale distributed training systems for frontier model pre-training.
- Design and operate large-scale training runs in collaboration with research teams.
- Develop infrastructure for efficient training across thousands of GPUs.
- Optimize training throughput, stability, and efficiency for large model workloads.
- Translate experimental ideas into scalable, production-ready training systems.
- Improve performance of distributed training workloads through optimization.
- Build and maintain training pipelines for large-scale datasets and experiment iteration.
- Debug and resolve performance bottlenecks in distributed training stacks.
- Contribute to systems enabling rapid experimentation on new training techniques.
Requirements
- Experience building or operating distributed training systems for large machine learning models.
- Strong experience with modern distributed training frameworks such as Megatron or DeepSpeed.
- Familiarity with large-scale model parallelism strategies.
- Experience optimizing training throughput and GPU utilization in distributed environments.
- Familiarity with GPU communication libraries such as NCCL.
- Strong debugging skills across GPU compute and distributed training systems.
- Experience working with large datasets and training pipelines for foundation model pre-training.
Benefits
- Top-tier compensation with salary and equity to retain the best talent.
- Comprehensive medical, dental, vision, life, and disability insurance.
- Fully paid parental leave for all new parents, including adoptive and surrogate journeys.
- Financial support for family planning.
- Paid time off, relocation support, and additional perks for work-life balance.
- Daily lunch and dinner provided, along with regular off-sites and team celebrations.
Categories
AI & ML, Data Engineering