Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

4 months ago

Remote, Worldwide or New York, NY, USA

Senior

H1B Sponsor

Base Salary

$150k - $220k/yr

Responsibilities

Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
Develop and manage infrastructure using Infrastructure-as-Code principles with Terraform.
Design and optimize AI/ML job scheduling and orchestration systems with Slurm.
Provision and maintain on-premises bare-metal server infrastructure for GPU computing.
Implement networking and storage solutions to support high-throughput workloads.
Develop a comprehensive observability stack for platform health monitoring.
Collaborate with AI researchers to build tools that accelerate development cycles.
Automate the life cycle of single-tenant, managed deployments.

Requirements

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
Hands-on experience building and managing production infrastructure with Terraform.
Expert-level knowledge of Kubernetes architecture and operations.
Experience with HPC job schedulers, specifically Slurm, for GPU workloads.
Experience managing bare metal infrastructure and server provisioning.
Strong scripting and automation skills in languages like Python, Go, or Bash.