Liquid AI

Member of Technical Staff - Distributed Training Engineer

11 months ago
Remote, Worldwide +2 more
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Design and build core systems for fast, reliable large-scale training runs.
  • Build scalable distributed training infrastructure for GPU clusters.
  • Implement and tune parallelism/sharding strategies for evolving architectures.
  • Optimize distributed efficiency through topology-aware collectives and straggler mitigation.
  • Build data loading systems to eliminate I/O bottlenecks for multimodal datasets.
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs.
  • Create monitoring, profiling, and debugging tools for training stability and performance.

Requirements

  • Hands-on experience building distributed training infrastructure with PyTorch Distributed (DDP/FSDP), DeepSpeed ZeRO, or Megatron-LM (TP/PP).
  • Experience diagnosing performance bottlenecks and failure modes.
  • Understanding of hardware accelerators and networking topologies.
  • Experience optimizing data pipelines for machine learning workloads.
  • Nice to have: experience with MoE training and large-scale distributed training.

Benefits

  • Greenfield challenges with high ownership from day one.
  • Competitive base salary with equity in a unicorn-stage company.
  • 100% coverage of medical, dental, and vision premiums for employees and dependents.
  • 401(k) matching up to 4% of base pay.
  • Unlimited PTO plus company-wide Refill Days throughout the year.

Categories

AI & ML, Backend, Data Engineering