
Machine Learning Engineer - Distributed ML Systems
Pluralis Research
San Francisco, CA, USA · Senior / Staff+
Responsibilities
- Design and implement large-scale distributed training systems optimized for low-bandwidth, high-latency conditions.
- Develop and optimize model-parallel training strategies with custom sharding techniques.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms for training jobs.
- Build monitoring and metrics systems to track training progress and system bottlenecks.
- Architect resilient training systems that can handle node failures and network partitions.
- Design and optimize peer-to-peer topologies for decentralized coordination.
- Implement NAT traversal, peer discovery, and dynamic routing.
Requirements
- 5+ years of experience in building and operating distributed systems in production.
- Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, or Megatron.
- Deep understanding of model parallelism techniques.
- Expert-level Python skills with production experience.
- Strong networking fundamentals including P2P systems and gRPC.
- Experience optimizing GPU workloads and large-scale compute efficiency.
Benefits
- Equity-heavy compensation with meaningful ownership.
- Base salary competitive with senior engineering roles in Australia.
- Visa sponsorship available for exceptional candidates.
- Remote-first work environment with optional access to the Melbourne hub.
- Opportunity to work with a world-class team from top tech companies.