AI Infrastructure Engineer - Training Platform

about 2 months ago

Seattle, WA, USA +2 more

Mid Level / Senior

H1B Sponsor

Base Salary

$216k - $270k/yr

Responsibilities

Architect and scale a multi-tenant orchestration layer for GPU clusters.
Design and implement scheduling primitives for training job optimization.
Develop observability and automated health-checking for the training stack.
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem.
Collaborate with Finance and Procurement on capacity planning.
Participate in the team's on-call process for service availability.
Own projects end-to-end in a collaborative environment.

Requirements

5+ years of experience in backend or infrastructure engineering.
At least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).
Experience with complex compute management systems.
Familiarity with distributed training infrastructure and storage systems.
Expert-level knowledge of Kubernetes internals.
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code.

Benefits

Comprehensive health, dental, and vision coverage.
Retirement benefits.
Learning and development stipend.
Generous PTO.
Potential commuter stipend.

Tech Stack

AWS, C++, Go, Google Cloud Platform, Kubernetes, Python, PyTorch, Rust, Terraform

Categories

AI & ML, Backend, DevOps