Deepgram

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

2 months ago
Remote, Worldwide or New York, NY, USA
Senior
H1B Sponsor

Base Salary

$150k - $220k/yr

Responsibilities

  • Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
  • Develop and manage infrastructure using Infrastructure-as-Code principles with Terraform.
  • Design and optimize AI/ML job scheduling and orchestration systems with Slurm.
  • Provision and maintain on-premises bare metal server infrastructure for GPU computing.
  • Implement networking and storage solutions to support high-throughput workloads.
  • Develop a comprehensive observability stack for platform health monitoring.
  • Collaborate with AI researchers to build tools that accelerate development cycles.
  • Automate the life cycle of single-tenant, managed deployments.

Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
  • Hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations.
  • Experience with HPC job schedulers, specifically Slurm, for GPU workloads.
  • Experience managing bare metal infrastructure and server provisioning.
  • Strong scripting and automation skills in languages like Python, Go, or Bash.

Benefits

  • Medical, dental, and vision benefits.
  • Annual wellness stipend and mental health support.
  • Unlimited PTO and generous paid parental leave.
  • Flexible schedule and 12 paid US company holidays.
  • 401(k) plan with company match and tax savings programs.
  • Learning and education stipend, plus participation in talks and conferences.

Tech Stack

AWS, Bash, GitLab CI/CD, Go, Jenkins, Kubernetes, Python, Terraform
