Deployment Engineer, AI Inference
Cerebras Systems
5 months ago
Sunnyvale, CA, USA or Toronto, Canada
Mid Level / Senior
H1B Sponsor
Responsibilities
- Deploy AI inference replicas and cluster software across multiple datacenters.
- Operate across heterogeneous datacenter environments undergoing rapid growth.
- Maximize capacity allocation and optimize replica placement using constraint-solver algorithms.
- Operate bare-metal inference infrastructure while supporting the transition to a K8S-based platform.
- Develop and extend telemetry, observability and alerting solutions to ensure deployment reliability at scale.
- Develop and extend a fully automated deployment pipeline to support fast software updates and capacity reallocation at scale.
- Translate technical and customer needs into actionable requirements for the Dev Infra, Cluster, Platform and Core teams.
- Stay up to date with the latest advancements in AI compute infrastructure and related technologies.
Requirements
- 2-5 years of experience in operating on-prem compute infrastructure or developing and managing complex AWS plane infrastructure for hybrid deployments.
- Strong proficiency in Python for automation, orchestration, and deployment tooling.
- Solid understanding of Linux-based systems and command-line tools.
- Extensive knowledge of Docker containers and container orchestration platforms like K8S.
- Familiarity with spine-leaf (Clos) networking architecture.
- Proficiency with telemetry and observability stacks such as Prometheus, InfluxDB and Grafana.
- Strong ownership mindset and accountability for complex deployments.
- Ability to work effectively in a fast-paced environment.
Benefits
- Opportunity to build a breakthrough AI platform beyond the constraints of the GPU.
- Ability to publish and open source cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Experience a simple, non-corporate work culture that respects individual beliefs.
Tech Stack
AWS, Docker, Grafana, InfluxDB, Kubernetes, Linux, Prometheus, Python
Categories
AI & ML, DevOps