New York, NY, USA (and 2 other locations)
Mid Level / Senior
H1B Sponsor
Base Salary
$216k - $270k/yr
Responsibilities
- Architect and scale a multi-tenant orchestration layer for GPU clusters.
- Design and implement scheduling primitives for training job optimization.
- Develop observability and automated health-checking for the training stack.
- Evaluate and integrate emerging technologies in the CNCF and AI ecosystem.
- Collaborate with Finance and Procurement on capacity planning.
- Participate in the team's on-call process for service availability.
- Own projects end-to-end in a collaborative environment.
Requirements
- 5+ years of experience in backend or infrastructure engineering.
- At least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
- Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).
- Experience with complex compute management systems.
- Familiarity with distributed training infrastructure and storage systems.
- Expert-level knowledge of Kubernetes internals.
- Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code.
Benefits
- Comprehensive health, dental, and vision coverage.
- Retirement benefits.
- Learning and development stipend.
- Generous PTO.
- Potential commuter stipend.
Tech Stack
AWS, C++, Go, Google Cloud Platform, Kubernetes, Python, PyTorch, Rust, Terraform
Categories
AI & ML, Backend, DevOps