GrepJob
Cohere

Staff Software Engineer, GPU Infrastructure (HPC)

Cohere
Apply
6 months ago
Toronto, CanadaStaff+
H1B Sponsor

Responsibilities

  • Build and scale ML-optimized HPC infrastructure using Kubernetes-based GPU/TPU superclusters.
  • Collaborate with cloud providers to optimize infrastructure for AI/ML training.
  • Proactively identify and resolve infrastructure bottlenecks and performance issues.
  • Design intuitive self-service tools for researchers to monitor and optimize training jobs.
  • Work closely with AI researchers to develop scalable infrastructure solutions.
  • Advocate for best practices in observability, automation, and infrastructure-as-code.
  • Mentor team members through code reviews and cross-team collaboration.

Requirements

  • Deep expertise in ML/HPC infrastructure and experience with GPU/TPU clusters.
  • Proven ability to deploy and manage cloud-native Kubernetes clusters.
  • Strong programming skills in Python and Go, with a preference for open-source contributions.
  • Familiarity with Linux internals and performance optimization for ML workloads.
  • Experience collaborating with AI researchers or ML engineers.
  • Self-directed problem-solving skills in a fast-paced environment.

Benefits

  • An open and inclusive culture and work environment.
  • Weekly lunch stipend, in-office lunches, and snacks.
  • Full health and dental benefits, including a budget for mental health.
  • 100% Parental Leave top-up for up to 6 months.
  • Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
  • Remote-flexible work options with offices in major cities.
  • 6 weeks of vacation (30 working days).

Tech Stack

Categories

AI & MLData EngineeringDevOps