Base Salary
$194k - $291k/yr
Responsibilities
- Scale automated infrastructure-as-code (IaC) pipelines to manage thousands of GPU/CPU nodes.
- Design and optimize workload orchestration to maximize hardware utilization and minimize job wait times.
- Create robust pipelines for extracting and transforming petabyte-scale data.
- Implement feature caching and storage solutions for low-latency access.
- Contribute to a unified ML platform that abstracts complex cloud infrastructure.
Requirements
- 5+ years of experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems.
- Familiarity with infrastructure-as-code tools such as Terraform, Pulumi, or Crossplane.
- Experience with large-scale orchestrators for compute-heavy workloads.
- Proficiency in distributed processing frameworks like Apache Spark or Apache Beam.
- Experience with feature stores and caching layers.
- Strong understanding of distributed systems and high-performance computing.
Benefits
- Eligible for an annual performance bonus.
- Equity options available.
- Competitive benefits package.