Principal Engineer, AI Inference Reliability
Cerebras Systems
4 months ago
Sunnyvale, CA, USA or Toronto, Canada
Senior / Staff+
H-1B Sponsor
Responsibilities
- Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
- Design and implement reliability mechanisms for fault detection and recovery.
- Lead large-scale incident management, including postmortems and root-cause analysis.
- Architect systems for redundancy, durability, and debuggability.
- Develop reliability tooling for chaos testing and load simulation.
- Collaborate across software, infrastructure, and hardware teams.
- Monitor and communicate reliability metrics through dashboards.
- Mentor engineers and set best practices for reliable system design.
Requirements
- Bachelor's or master's degree in computer science or a related field.
- 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
- Strong programming skills in at least one popular backend language such as Python, C++, Go, or Rust.
- Deep experience with reliability principles including SLO/SLI/SLA design.
- Excellent communication and cross-functional leadership skills.
- Bonus: prior experience building large-scale AI infrastructure systems.
Benefits
- Opportunity to build a breakthrough AI platform beyond GPU constraints.
- Ability to publish and open source cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Experience a simple, non-corporate work culture that respects individual beliefs.
Tech Stack
C++, Go, Python, Rust
Categories
AI & ML, Backend, DevOps