Responsibilities
- Build and own the training framework for large-scale LLM training.
- Design distributed training abstractions for efficient model training.
- Improve training throughput and stability on multi-node clusters.
- Develop and maintain tooling for monitoring and debugging.
- Collaborate with infrastructure teams to support high-performance training.
- Investigate and resolve performance bottlenecks across the ML systems stack.
- Build systems that ensure reproducible and debuggable large-scale runs.
Requirements
- Strong engineering experience in large-scale distributed training or HPC systems.
- Deep familiarity with JAX internals and distributed training libraries.
- Experience with multi-node cluster orchestration tools like Slurm or Kubernetes.
- Comfort debugging performance issues across CUDA/NCCL and data pipelines.
- Experience with containerized environments such as Docker.
- A track record of building tools that enhance developer velocity for ML teams.
- Strong collaboration skills to work with infra, research, and deployment teams.
Benefits
- An open and inclusive culture and work environment.
- Weekly lunch stipend, in-office lunches, and snacks.
- Full health and dental benefits, including mental health support.
- 100% parental leave top-up for up to 6 months.
- Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
- Remote-flexible work options with offices in major cities.
- 6 weeks of vacation (30 working days).
Categories
AI & ML, Data Engineering
