Base Salary
$241k - $331k/yr
Responsibilities
- Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes.
- Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers.
- Design and execute GPU cluster scaling plans to support larger training runs.
- Build automation and tooling for managing cluster operations at scale (a sketch of this kind of tooling follows this list).
- Drive configuration-as-code practices for reproducible and auditable cluster states.
- Collaborate with AI researchers to understand training workload patterns.
- Own technical issues with vendors and coordinate resolution across multiple partners.
- Contribute to capacity planning and manage cluster expansion across GPU generations.
- Improve operational resilience and write runbooks that capture team knowledge.
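
To give a flavor of the automation work above, here is a minimal sketch of a Slurm node-health check in Python. It is illustrative only, not part of the role definition: it assumes `sinfo` is on the PATH, and the set of states treated as unhealthy is an assumption to adjust locally.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Slurm node-health check, assuming `sinfo` is on PATH.

The unhealthy-state set and output format are illustrative, not prescriptive.
"""
import subprocess

# sinfo %t states that usually mean a node needs operator attention.
UNHEALTHY = {"down", "drain", "drng", "fail", "maint"}


def unhealthy_nodes() -> dict[str, str]:
    # -h: no header, -N: one line per node, -o "%N %t": "<name> <state>".
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        name, state = line.split()
        # Slurm decorates states ("down*" = unreachable, "idle~" = powered
        # down, ...); strip the symbols before matching. A node in several
        # partitions repeats under -N, so a dict dedupes it.
        if state.strip("*~#!%$@^-").lower() in UNHEALTHY:
            flagged[name] = state
    return flagged


if __name__ == "__main__":
    for name, state in sorted(unhealthy_nodes().items()):
        print(f"{name}: {state}")
```

In practice a check like this would feed an alerting pipeline rather than print to stdout, but the shape of the work is the same: query the scheduler, normalize its state strings, and surface what needs triage.
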
Requirements
- 8+ years of AI/ML infrastructure engineering experience with expertise in HPC/Slurm, Kubernetes, or distributed systems.
- Strong Linux systems fundamentals including networking and storage.
- Hands-on experience with Kubernetes and cloud-native infrastructure.
- Experience with HPC workload managers, preferably Slurm.
- Ability to debug complex multi-system failures under pressure.
- Proficiency in Python and Bash for automation; Go, Rust, or C/C++ is a plus.
- Experience with observability stacks like Prometheus and Grafana (a minimal query sketch follows this list).
- Excellent communication skills for technical documentation and incident summaries.
- Bonus: experience with distributed AI training infrastructure.
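
As a flavor of the observability experience above, here is a minimal sketch that pulls per-node GPU utilization from Prometheus's instant-query API. The endpoint URL is hypothetical, and the `DCGM_FI_DEV_GPU_UTIL` metric and `Hostname` label assume a dcgm-exporter deployment; adjust both to the local stack.

```python
#!/usr/bin/env python3
"""Minimal sketch: per-node GPU utilization via Prometheus's instant-query API.

The endpoint is hypothetical; the metric and label assume dcgm-exporter.
"""
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"  # mean GPU util per node


def gpu_utilization() -> dict[str, float]:
    # GET /api/v1/query returns an instant vector:
    # {"data": {"result": [{"metric": {...}, "value": [ts, "<num>"]}, ...]}}
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return {
        r["metric"].get("Hostname", "unknown"): float(r["value"][1])
        for r in payload["data"]["result"]
    }


if __name__ == "__main__":
    for host, util in sorted(gpu_utilization().items()):
        print(f"{host}: {util:.0f}% GPU utilization")
```
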
Benefits
- Generous employer match on employee 401(k) contributions.
- Paid time off to volunteer at an organization of your choice.
- Funding for select family-forming benefits.
- Relocation support for employees who need assistance moving.
