Pluralis Research

Machine Learning Engineer - Distributed ML Systems

Sydney, Australia or Melbourne, Australia · Senior / Staff+

Responsibilities

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware.
  • Develop and optimize model-parallel training strategies with custom sharding techniques.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms.
  • Build monitoring and metrics systems to track training progress and model quality.
  • Architect resilient training systems for dynamic participant management.
  • Design and optimize peer-to-peer topologies for decentralized coordination.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead.
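The checkpointing and recovery responsibility above can be illustrated with a minimal, framework-free sketch. This is a hypothetical pattern, not Pluralis code: the function names and JSON-serializable `state` are illustrative assumptions, and a real system would checkpoint sharded model and optimizer state rather than a small dict.

```python
# Illustrative sketch of crash-safe checkpointing and recovery.
# Assumption: training state is a JSON-serializable dict; names are hypothetical.
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write the checkpoint atomically so a crash mid-write never corrupts it."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp, path)  # atomic: readers see the old or new file, never a partial one
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path: str, default: dict) -> dict:
    """Resume from the last good checkpoint, or fall back to a fresh state."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

The write-to-temp-then-`os.replace` step is the core of the robustness claim: a node that dies mid-checkpoint restarts from the previous consistent snapshot instead of a truncated file.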

Requirements

  • 5+ years of experience in building and operating distributed systems in production.
  • Hands-on expertise with distributed training frameworks like FSDP, DeepSpeed, or Megatron.
  • Deep understanding of parallelism strategies, including data, tensor, and pipeline parallelism.
  • Expert-level Python skills with production experience in concurrency and error handling.
  • Strong networking fundamentals including P2P systems and NAT traversal.
  • Experience optimizing GPU workloads and large-scale compute efficiency.
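As a toy illustration of the data-parallel training the requirements refer to, the sketch below averages per-parameter gradients across workers in pure Python, analogous to the all-reduce collective that frameworks like FSDP or DeepSpeed run over NCCL. Worker count and gradient values are made-up examples.

```python
# Toy data-parallel all-reduce: each worker computes local gradients,
# then every worker applies the same cross-worker average.
def all_reduce_mean(worker_grads: list) -> list:
    """Average gradients element-wise across workers."""
    n = len(worker_grads)
    width = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(width)]

# Two workers, each holding local gradients for a 3-parameter model.
grads = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0]]
print(all_reduce_mean(grads))  # identical averaged update on every worker
```

In production the averaging happens in bucketed, overlapping collectives on GPU interconnects rather than a Python loop, but the invariant is the same: all replicas step with an identical gradient.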

Benefits

  • Equity-heavy compensation with meaningful ownership in a mission-driven company.
  • Competitive base salary for senior engineering roles in Australia.
  • Visa sponsorship available for exceptional candidates.
  • Remote-first work environment with optional access to the Melbourne hub.
  • Opportunity to work with a world-class team from leading tech companies.

Tech Stack

gRPC, Python

Categories

AI & ML, Data Engineering