Base Salary
$150k - $220k/yr
Responsibilities
- Architect and maintain the core computing platform using Kubernetes, both on AWS and on-premises.
- Develop and manage infrastructure using Infrastructure-as-Code principles with Terraform.
- Design and optimize AI/ML job scheduling and orchestration systems with Slurm.
- Provision and maintain on-premises bare-metal server infrastructure for GPU computing.
- Implement networking and storage solutions to support high-throughput workloads.
- Develop a comprehensive observability stack for platform health monitoring.
- Collaborate with AI researchers to build tools that accelerate development cycles.
- Automate the life cycle of single-tenant, managed deployments.
Requirements
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
- Hands-on experience building and managing production infrastructure with Terraform.
- Expert-level knowledge of Kubernetes architecture and operations.
- Experience with HPC job schedulers, specifically Slurm, for GPU workloads.
- Experience managing bare-metal infrastructure and server provisioning.
- Strong scripting and automation skills in languages like Python, Go, or Bash.
Benefits
- Medical, dental, and vision benefits.
- Annual wellness stipend and mental health support.
- Unlimited PTO and generous paid parental leave.
- Flexible schedule and 12 paid US company holidays.
- 401(k) plan with company match and tax savings programs.
- Learning and education stipend, plus participation in talks and conferences.
