Cerebras Systems

Principal Engineer, AI Inference Reliability

4 months ago
Sunnyvale, CA, USA or Toronto, Canada
Senior / Staff+
H-1B Sponsor

Responsibilities

  • Define and drive reliability strategy: establish SLOs and ensure alignment across engineering.
  • Design and implement reliability mechanisms for fault detection and recovery.
  • Lead large-scale incident management, including postmortems and root-cause analysis.
  • Architect systems for redundancy, durability, and debuggability.
  • Develop reliability tooling for chaos testing and load simulation.
  • Collaborate across software, infrastructure, and hardware teams.
  • Monitor and communicate reliability metrics through dashboards.
  • Mentor engineers and set best practices for reliable system design.

Requirements

  • Bachelor's or master's degree in computer science or a related field.
  • 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems.
  • Strong programming skills in at least one popular backend language such as Python, C++, Go, or Rust.
  • Deep experience with reliability principles including SLO/SLI/SLA design.
  • Excellent communication and cross-functional leadership skills.
  • Bonus: prior experience building large-scale AI infrastructure systems.

Benefits

  • Opportunity to build a breakthrough AI platform beyond GPU constraints.
  • Ability to publish and open source cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Enjoy job stability with startup vitality.
  • Experience a simple, non-corporate work culture that respects individual beliefs.

Tech Stack

C++, Go, Python, Rust

Categories

AI & ML, Backend, DevOps