Cohere

Senior ML Systems Engineer, Frameworks & Tooling

7 months ago
Toronto, Canada +5 more
Senior
H1B Sponsor

Responsibilities

  • Build and own the training framework for large-scale LLM training.
  • Design distributed training abstractions for efficient model training.
  • Improve training throughput and stability on multi-node clusters.
  • Develop and maintain tooling for monitoring and debugging.
  • Collaborate with infrastructure teams to support high-performance training.
  • Investigate and resolve performance bottlenecks across the ML systems stack.
  • Build systems that ensure reproducible and debuggable large-scale runs.

Requirements

  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals and distributed training libraries.
  • Experience with multi-node cluster orchestration tools like Slurm or Kubernetes.
  • Comfort debugging performance issues across CUDA/NCCL and data pipelines.
  • Experience with containerized environments such as Docker.
  • A track record of building tools that enhance developer velocity for ML teams.
  • Strong collaboration skills to work with infra, research, and deployment teams.

Benefits

  • An open and inclusive culture and work environment.
  • Weekly lunch stipend, in-office lunches, and snacks.
  • Full health and dental benefits, including mental health support.
  • 100% Parental Leave top-up for up to 6 months.
  • Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
  • Remote-flexible work options with offices in major cities.
  • 6 weeks of vacation (30 working days).

Categories

AI & ML, Data Engineering