Tech Lead, AI Compute Infrastructure

5 months ago

Toronto, Canada +4 more

Senior / Staff+

H1B Sponsor

Responsibilities

Design and implement mechanisms to optimize GPU and cluster utilization across thousands of devices.
Build scalable frameworks for launching and managing large-scale AI jobs.
Develop observability, tracing, and visualization tools for compute clusters.
Collaborate with AI researchers to integrate acceleration techniques into production pipelines.
Champion the adoption of modern cloud and container technologies for distributed systems.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field.
5+ years of experience in large-scale MLOps, AI infrastructure, or HPC systems.
Experience with data frameworks like Ray, Apache Spark, and LanceDB.
Strong proficiency in Python and a high-performance language such as C++.
Hands-on experience with orchestration and distributed computing frameworks like Kubernetes and Ray.
Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX.