Principal Engineer, AI Inference Reliability

9 months ago

Toronto, Canada or Sunnyvale, CA, USA

Senior / Staff+

H1B Sponsor

Responsibilities

Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
Design and implement reliability mechanisms for fault detection and recovery.
Lead large-scale incident management, including postmortems and root-cause analysis.
Architect systems for redundancy, durability, and debuggability.
Develop reliability tooling for chaos testing and load simulation.
Collaborate across software, infrastructure, and hardware teams.
Monitor and communicate reliability metrics through dashboards.
Mentor engineers and set best practices for reliable system design.

Qualifications

Bachelor's or master's degree in computer science or a related field.
7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
Strong programming skills in at least one popular backend language such as Python, C++, Go, or Rust.
Deep experience with reliability principles including SLO/SLI/SLA design.
Excellent communication and cross-functional leadership skills.
Bonus: prior experience building large-scale AI infrastructure systems.

What We Offer

Opportunity to build a breakthrough AI platform beyond GPU constraints.
Ability to publish and open-source cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Experience a simple, non-corporate work culture that respects individual beliefs.