Biohub

Staff AI Infrastructure Engineer

13 days ago
Foster City, CA, USA
Staff+
H1B Sponsor

Base Salary

$241k - $331k/yr

Responsibilities

  • Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes.
  • Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers.
  • Design and execute GPU cluster scaling plans to support larger training runs.
  • Build automation and tooling for managing cluster operations at scale.
  • Drive configuration-as-code practices for reproducible and auditable cluster states.
  • Collaborate with AI researchers to understand training workload patterns.
  • Manage vendor relationships on technical issues and coordinate across multiple partners.
  • Contribute to capacity planning and manage cluster expansion across GPU generations.
  • Improve operational resilience and develop runbooks that capture team knowledge.

Requirements

  • 8+ years of AI/ML infrastructure engineering experience with expertise in HPC/Slurm, Kubernetes, or distributed systems.
  • Strong Linux systems fundamentals including networking and storage.
  • Hands-on experience with Kubernetes and cloud-native infrastructure.
  • Experience with HPC workload managers, preferably Slurm.
  • Ability to debug complex multi-system failures under pressure.
  • Proficiency in Python and Bash for automation; Go, Rust, or C/C++ is a plus.
  • Experience with observability stacks like Prometheus and Grafana.
  • Excellent communication skills for technical documentation and incident summaries.
  • Bonus: experience with distributed AI training infrastructure.

Benefits

  • Generous employer match on employee 401(k) contributions.
  • Paid time off to volunteer at an organization of your choice.
  • Funding for select family-forming benefits.
  • Relocation support for employees who need assistance moving.

Tech Stack

Bash, C, C++, Go, Grafana, Helm, Kubernetes, Linux, Prometheus, Python, PyTorch, Rust

Categories

AI & ML, Data Engineering, DevOps