Toronto, Canada
Staff+
Responsibilities
- Design and operate GPU infrastructure for model hosting, including provisioning and scheduling.
- Build and scale model serving systems supporting real-time inference.
- Implement multi-model routing for various modalities on shared infrastructure.
- Own the model lifecycle from download to deployment and monitoring.
- Drive inference optimization strategies including quantization and caching.
- Build self-service infrastructure platforms for teams to provision resources.
- Implement infrastructure-as-code at scale using tools like Terraform.
- Build observability and reliability tooling for inference systems.
- Define platform standards and governance for resource management.
- Lead architectural design and influence engineering direction.
Requirements
- 8+ years of software engineering experience with 3+ years in infrastructure platforms or ML/AI infrastructure.
- Deep experience with cloud infrastructure (AWS, GCP) and Kubernetes.
- Hands-on experience with GPU workloads and model serving technologies.
- Strong software engineering skills in Python, Go, or C++.
- Experience with infrastructure-as-code tools like Terraform or Pulumi.
- Experience designing self-service platforms or internal developer tooling.
- Understanding of model optimization techniques (e.g., quantization, caching).
- Proven ability to lead complex cross-team technical initiatives.
- Strong communication skills to influence technical direction.
Tech Stack
AWS · C++ · Go · Google Cloud Platform · Kubernetes · Python · Terraform
Categories
AI & ML · Data Engineering · DevOps