
Machine Learning Engineer - Distributed ML Systems
Pluralis Research
San Francisco, CA, USA · Senior / Staff+
Responsibilities
- Design and implement large-scale distributed training systems optimized for low-bandwidth, high-latency conditions.
- Develop and optimize model-parallel training strategies with custom sharding techniques.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms for training jobs.
- Build monitoring and metrics systems to track training progress and system bottlenecks.
- Architect resilient training systems that can handle node failures and network partitions.
- Design and optimize peer-to-peer topologies for decentralized coordination.
- Implement NAT traversal, peer discovery, and dynamic routing.
Requirements
- 5+ years of experience in building and operating distributed systems in production.
- Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, or Megatron.
- Deep understanding of model parallelism techniques.
- Expert-level Python skills with production experience.
- Strong networking fundamentals including P2P systems and gRPC.
- Experience optimizing GPU workloads and large-scale compute efficiency.
Benefits
- Equity-heavy compensation with meaningful ownership.
- Base salary competitive with senior engineering roles in Australia.
- Visa sponsorship available for exceptional candidates.
- Remote-first work environment with optional access to the Melbourne hub.
- Opportunity to work with a world-class team from top tech companies.