about 3 hours ago
New York, NY, USA +2 more
Staff+
H1B Sponsor
Base Salary
$320k - $405k/yr
Responsibilities
- Own the technical strategy and roadmap for node lifecycle management.
- Drive cross-team initiatives to build and scale AI clusters across multiple clouds.
- Design and operate systems for automatic detection and remediation of hardware issues.
- Define infrastructure architecture to solve complex problems.
- Collaborate with cloud providers and internal teams on compute and infrastructure strategy.
- Establish operational excellence practices including incident response and on-call.
- Mentor and coach engineers to support their growth.
Requirements
- Deep expertise in distributed systems, reliability, and cloud platforms.
- Strong proficiency in at least one systems language such as Rust, Go, or Python.
- Hands-on experience with machine learning accelerators like GPUs, TPUs, or Trainium.
- Track record of leading complex technical initiatives across teams.
- Ability to communicate effectively with senior stakeholders.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Collaborative office space.
Tech Stack
AWSAzureGoGoogle Cloud PlatformKubernetesLinuxNode.jsPythonRustTerraform
Categories
AI & MLData EngineeringDevOps