Base Salary
$241k - $331k/yr
Responsibilities
- Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes.
- Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers.
- Design and execute GPU cluster scaling plans to support larger training runs.
- Build automation and tooling for managing cluster operations at scale (a sketch of this kind of tooling follows this list).
- Drive configuration-as-code practices for reproducible and auditable cluster states.
- Collaborate with AI researchers to understand training workload patterns.
- Own technical issues with vendors and coordinate resolution across multiple partners.
- Contribute to capacity planning and manage cluster expansion across GPU generations.
- Improve operational resilience and write runbooks that capture team knowledge.
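
To give a flavor of the automation work above, here is a minimal sketch of a Slurm node-health check in Python. It is illustrative only, not part of the role definition: it assumes `sinfo` is on the PATH, and the set of states treated as unhealthy is an assumption to adjust locally.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Slurm node-health check, assuming `sinfo` is on PATH.

The unhealthy-state set and output format are illustrative, not prescriptive.
"""
import subprocess

# sinfo %t states that usually mean a node needs operator attention.
UNHEALTHY = {"down", "drain", "drng", "fail", "maint"}


def unhealthy_nodes() -> dict[str, str]:
    # -h: no header, -N: one line per node, -o "%N %t": "<name> <state>".
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        name, state = line.split()
        # Slurm decorates states ("down*" = unreachable, "idle~" = powered
        # down, ...); strip the symbols before matching. A node in several
        # partitions repeats under -N, so a dict dedupes it.
        if state.strip("*~#!%$@^-").lower() in UNHEALTHY:
            flagged[name] = state
    return flagged


if __name__ == "__main__":
    for name, state in sorted(unhealthy_nodes().items()):
        print(f"{name}: {state}")
```

In practice a check like this would feed an alerting pipeline rather than print to stdout, but the shape of the work is the same: query the scheduler, normalize its state strings, and surface what needs triage.
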
Requirements
- 8+ years of AI/ML infrastructure engineering experience with expertise in HPC/Slurm, Kubernetes, or distributed systems.
- Strong Linux systems fundamentals including networking and storage.
- Hands-on experience with Kubernetes and cloud-native infrastructure.
- Experience with HPC workload managers, preferably Slurm.
- Ability to debug complex multi-system failures under pressure.
- Proficiency in Python and Bash for automation; Go, Rust, or C/C++ is a plus.
- Experience with observability stacks like Prometheus and Grafana (a minimal query sketch follows this list).
- Excellent communication skills for technical documentation and incident summaries.
- Bonus: experience with distributed AI training infrastructure.
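
As a flavor of the observability experience above, here is a minimal sketch that pulls per-node GPU utilization from Prometheus's instant-query API. The endpoint URL is hypothetical, and the `DCGM_FI_DEV_GPU_UTIL` metric and `Hostname` label assume a dcgm-exporter deployment; adjust both to the local stack.

```python
#!/usr/bin/env python3
"""Minimal sketch: per-node GPU utilization via Prometheus's instant-query API.

The endpoint is hypothetical; the metric and label assume dcgm-exporter.
"""
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"  # mean GPU util per node


def gpu_utilization() -> dict[str, float]:
    # GET /api/v1/query returns an instant vector:
    # {"data": {"result": [{"metric": {...}, "value": [ts, "<num>"]}, ...]}}
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return {
        r["metric"].get("Hostname", "unknown"): float(r["value"][1])
        for r in payload["data"]["result"]
    }


if __name__ == "__main__":
    for host, util in sorted(gpu_utilization().items()):
        print(f"{host}: {util:.0f}% GPU utilization")
```
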
Benefits
- Generous employer match on employee 401(k) contributions.
- Paid time off to volunteer at an organization of your choice.
- Funding for select family-forming benefits.
- Relocation support for employees who need assistance moving.
