Scale AI

AI Infrastructure Engineer - Training Platform

about 2 hours ago
New York, NY, USA (and 2 more locations)
Mid Level / Senior
H1B Sponsor

Base Salary

$216k - $270k/yr

Responsibilities

  • Architect and scale a multi-tenant orchestration layer for GPU clusters.
  • Design and implement scheduling primitives for training job optimization.
  • Develop observability and automated health-checking for the training stack.
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem.
  • Collaborate with Finance and Procurement on capacity planning.
  • Participate in the team's on-call process for service availability.
  • Own projects end-to-end in a collaborative environment.

Requirements

  • 5+ years of experience in backend or infrastructure engineering.
  • At least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).
  • Experience with complex compute management systems.
  • Familiarity with distributed training infrastructure and storage systems.
  • Expert-level knowledge of Kubernetes internals.
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code.

Benefits

  • Comprehensive health, dental, and vision coverage.
  • Retirement benefits.
  • Learning and development stipend.
  • Generous PTO.
  • Potential commuter stipend.

Tech Stack

AWS, C++, Go, Google Cloud Platform, Kubernetes, Python, PyTorch, Rust, Terraform

Categories

AI & ML, Backend, DevOps