Member of Technical Staff - Pre-Training Infra

3 months ago

London, United Kingdom +2 more

Mid Level / Senior

H1B Sponsor

Responsibilities

Build and scale distributed training systems for frontier model pre-training.
Design and operate large-scale training runs in collaboration with research teams.
Develop infrastructure for efficient training across thousands of GPUs.
Optimize training throughput, stability, and efficiency for large model workloads.
Translate experimental ideas into scalable, production-ready training systems.
Improve the performance of distributed training workloads through profiling and tuning.
Build and maintain training pipelines for large-scale datasets and experiment iteration.
Debug and resolve performance bottlenecks in distributed training stacks.
Contribute to systems enabling rapid experimentation on new training techniques.

Qualifications

Experience building or operating distributed training systems for large machine learning models.
Strong experience with modern distributed training frameworks such as Megatron or DeepSpeed.
Familiarity with large-scale model parallelism strategies.
Experience optimizing training throughput and GPU utilization in distributed environments.
Familiarity with GPU communication libraries such as NCCL.
Strong debugging skills across GPU compute and distributed training systems.
Experience working with large datasets and training pipelines for foundation model pre-training.

Benefits

Top-tier compensation with salary and equity to retain the best talent.
Comprehensive medical, dental, vision, life, and disability insurance.
Fully paid parental leave for all new parents, including adoptive and surrogate journeys.
Financial support for family planning.
Paid time off, relocation support, and additional perks for work-life balance.
Daily lunch and dinner provided, along with regular off-sites and team celebrations.