Staff Software Engineer, GPU Infrastructure (HPC)

6 months ago

Toronto, Canada

Staff+

H1B Sponsor

Responsibilities

Build and scale ML-optimized HPC infrastructure using Kubernetes-based GPU/TPU superclusters.
Collaborate with cloud providers to optimize infrastructure for AI/ML training.
Proactively identify and resolve infrastructure bottlenecks and performance issues.
Design intuitive self-service tools for researchers to monitor and optimize training jobs.
Work closely with AI researchers to develop scalable infrastructure solutions.
Advocate for best practices in observability, automation, and infrastructure-as-code.
Mentor team members through code reviews and cross-team collaboration.

Qualifications

Deep expertise in ML/HPC infrastructure and experience with GPU/TPU clusters.
Proven ability to deploy and manage cloud-native Kubernetes clusters.
Strong programming skills in Python and Go; open-source contributions are preferred.
Familiarity with Linux internals and performance optimization for ML workloads.
Experience collaborating with AI researchers or ML engineers.
Self-directed problem-solving skills in a fast-paced environment.

Benefits

An open and inclusive culture and work environment.
Weekly lunch stipend, in-office lunches, and snacks.
Full health and dental benefits, including a budget for mental health.
100% Parental Leave top-up for up to 6 months.
Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
Remote-flexible work options with offices in major cities.
6 weeks of vacation (30 working days).