Pluralis Research

Machine Learning Engineer - Distributed ML Systems

Sydney, Australia or Melbourne, Australia · Senior / Staff+

Responsibilities

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware.
  • Develop and optimize model-parallel training strategies with custom sharding techniques.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms.
  • Build monitoring and metrics systems to track training progress and model quality.
  • Architect resilient training systems for dynamic participant management.
  • Design and optimize peer-to-peer topologies for decentralized coordination.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead.
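The checkpointing and recovery responsibility above can be illustrated with a minimal, framework-free sketch. This is a hypothetical pattern, not Pluralis code: the function names and JSON-serializable `state` are illustrative assumptions, and a real system would checkpoint sharded model and optimizer state rather than a small dict.

```python
# Illustrative sketch of crash-safe checkpointing and recovery.
# Assumption: training state is a JSON-serializable dict; names are hypothetical.
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write the checkpoint atomically so a crash mid-write never corrupts it."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp, path)  # atomic: readers see the old or new file, never a partial one
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path: str, default: dict) -> dict:
    """Resume from the last good checkpoint, or fall back to a fresh state."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

The write-to-temp-then-`os.replace` step is the core of the robustness claim: a node that dies mid-checkpoint restarts from the previous consistent snapshot instead of a truncated file.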

Requirements

  • 5+ years of experience in building and operating distributed systems in production.
  • Hands-on expertise with distributed training frameworks like FSDP, DeepSpeed, or Megatron.
  • Deep understanding of parallelism strategies, including data, tensor, and pipeline parallelism.
  • Expert-level Python skills with production experience in concurrency and error handling.
  • Strong networking fundamentals including P2P systems and NAT traversal.
  • Experience optimizing GPU workloads and large-scale compute efficiency.
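As a toy illustration of the data-parallel training the requirements refer to, the sketch below averages per-parameter gradients across workers in pure Python, analogous to the all-reduce collective that frameworks like FSDP or DeepSpeed run over NCCL. Worker count and gradient values are made-up examples.

```python
# Toy data-parallel all-reduce: each worker computes local gradients,
# then every worker applies the same cross-worker average.
def all_reduce_mean(worker_grads: list) -> list:
    """Average gradients element-wise across workers."""
    n = len(worker_grads)
    width = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(width)]

# Two workers, each holding local gradients for a 3-parameter model.
grads = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0]]
print(all_reduce_mean(grads))  # identical averaged update on every worker
```

In production the averaging happens in bucketed, overlapping collectives on GPU interconnects rather than a Python loop, but the invariant is the same: all replicas step with an identical gradient.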

Benefits

  • Equity-heavy compensation with meaningful ownership in a mission-driven company.
  • Competitive base salary for senior engineering roles in Australia.
  • Visa sponsorship available for exceptional candidates.
  • Remote-first work environment with optional access to the Melbourne hub.
  • Opportunity to work with a world-class team from leading tech companies.

Tech Stack

gRPC, Python

Categories

AI & ML, Data Engineering