Member of Technical Staff - Distributed Training Engineer

11 months ago

Remote, Worldwide +2 more

Mid Level / Senior

H1B Sponsor

Responsibilities

Design and build core systems for fast and reliable large training runs.
Build scalable distributed training infrastructure for GPU clusters.
Implement and tune parallelism/sharding strategies for evolving architectures.
Optimize distributed efficiency through topology-aware collectives and straggler mitigation.
Build data loading systems to eliminate I/O bottlenecks for multimodal datasets.
Develop checkpointing mechanisms balancing memory constraints with recovery needs.
Create monitoring, profiling, and debugging tools for training stability and performance.

Qualifications

Hands-on experience building distributed training infrastructure with PyTorch Distributed (DDP/FSDP), DeepSpeed ZeRO, or Megatron-LM (TP/PP).
Experience diagnosing performance bottlenecks and failure modes.
Understanding of hardware accelerators and networking topologies.
Experience optimizing data pipelines for machine learning workloads.
Nice to have: experience with MoE (Mixture-of-Experts) training and with large-scale distributed training.

Benefits

Greenfield challenges with high ownership from day one.
Competitive base salary with equity in a unicorn-stage company.
100% coverage of medical, dental, and vision premiums for employees and dependents.
401(k) matching up to 4% of base pay.
Unlimited PTO plus company-wide Refill Days throughout the year.