
Machine Learning Engineer - Distributed ML Systems
Pluralis Research · 2 months ago
Sydney or Melbourne, Australia · Senior / Staff+
Responsibilities
- Design and implement large-scale distributed training systems optimized for heterogeneous hardware.
- Develop and optimize model-parallel training strategies with custom sharding techniques.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms.
- Build monitoring and metrics systems to track training progress and model quality.
- Architect resilient training systems for dynamic participant management.
- Design and optimize peer-to-peer topologies for decentralized coordination.
- Profile and optimize communication patterns to reduce latency and bandwidth overhead.
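To give a flavour of the communication-pattern work above: bandwidth-efficient collectives such as ring all-reduce move only 1/n of the gradient per step instead of shipping full copies. The following is a toy, single-process simulation of the ring algorithm (illustrative only, not a production implementation; real systems would use a collective-communication library):

```python
def ring_allreduce(node_data):
    """Toy simulation of ring all-reduce: every node ends with the
    elementwise sum of all nodes' vectors, moving only ~1/n of the
    data per step across 2*(n-1) steps."""
    n = len(node_data)                          # number of nodes in the ring
    m = len(node_data[0])                       # vector length
    chunks = [list(v) for v in node_data]       # each node's working buffer
    bounds = [(c * m // n, (c + 1) * m // n) for c in range(n)]

    # Phase 1: reduce-scatter. After n-1 steps, node i owns the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        msgs = []                               # buffer sends: steps are concurrent
        for i in range(n):
            c = (i - step) % n                  # chunk node i sends this step
            lo, hi = bounds[c]
            msgs.append((i, c, chunks[i][lo:hi]))
        for i, c, payload in msgs:
            dst = (i + 1) % n                   # ring neighbour
            lo, hi = bounds[c]
            for k, v in enumerate(payload):
                chunks[dst][lo + k] += v        # accumulate partial sums

    # Phase 2: all-gather. Each node forwards its finished chunk around
    # the ring until everyone holds the complete reduced vector.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n              # finished chunk node i forwards
            lo, hi = bounds[c]
            msgs.append((i, c, chunks[i][lo:hi]))
        for i, c, payload in msgs:
            dst = (i + 1) % n
            lo, hi = bounds[c]
            chunks[dst][lo:hi] = payload        # overwrite with final values
    return chunks
```

The appeal of the ring topology for decentralized training is that each node talks only to its neighbour, so aggregate bandwidth per node stays constant as the ring grows.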
Requirements
- 5+ years of experience in building and operating distributed systems in production.
- Hands-on expertise with distributed training frameworks such as FSDP, DeepSpeed, or Megatron.
- Deep understanding of parallelism strategies, including data, tensor, and pipeline parallelism.
- Expert-level Python skills with production experience in concurrency and error handling.
- Strong networking fundamentals including P2P systems and NAT traversal.
- Experience optimizing GPU workloads and large-scale compute efficiency.
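One pattern behind the checkpointing and error-handling points above is the atomicity invariant: a recovering process must never observe a half-written checkpoint. A minimal single-file sketch (JSON stands in for real tensor state; the function names are illustrative):

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Atomically persist training state: write to a temp file in the
    same directory, fsync, then os.replace so a reader never sees a
    partial file even if the process dies mid-write."""
    payload = json.dumps({"step": step, "state": state})
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())        # force bytes to disk before rename
        os.replace(tmp, path)           # atomic rename on POSIX and Windows
    finally:
        if os.path.exists(tmp):         # clean up only if rename never happened
            os.remove(tmp)


def load_checkpoint(path):
    """Recover the last durable state, or start fresh if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "state": {}}
    with open(path) as f:
        return json.load(f)
```

The same write-temp-then-rename discipline generalizes to sharded checkpoints, where each rank writes its shard atomically and a small manifest is renamed into place last.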
Benefits
- Equity-heavy compensation with meaningful ownership in a mission-driven company.
- Competitive base salary benchmarked against senior engineering roles in Australia.
- Visa sponsorship available for exceptional candidates.
- Remote-first work environment with optional access to the Melbourne hub.
- Opportunity to work with a world-class team from leading tech companies.