5 months ago
Toronto, Canada +4 more
Senior / Staff+
H1B Sponsor
Responsibilities
- Design and implement mechanisms to optimize GPU and cluster utilization across thousands of devices.
- Build scalable frameworks for launching and managing large-scale AI jobs.
- Develop observability, tracing, and visualization tools for compute clusters.
- Collaborate with AI researchers to integrate acceleration techniques into production pipelines.
- Champion the adoption of modern cloud and container technologies for distributed systems.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in large-scale MLOps, AI infrastructure, or HPC systems.
- Experience with data frameworks such as Ray, Apache Spark, and LanceDB.
- Strong proficiency in Python and a high-performance language such as C++.
- Hands-on experience with orchestration and distributed computing frameworks such as Kubernetes and Ray.
- Experience with core ML frameworks such as PyTorch, TensorFlow, or JAX.
Benefits
- Competitive salary and benefits package.
- Dynamic and inclusive work environment.
- Opportunities for professional growth and advancement.
- Collaborative culture that values innovation and creativity.
- Access to the latest technologies and tools.