7 months ago
San Francisco, CA, USA or New York, NY, USA
Mid Level / Senior
Base Salary
$165k - $259k/yr
Responsibilities
- Design and implement distributed compute infrastructure for ML data processing.
- Improve cluster observability, scheduling, and resource utilization.
- Research and implement cost-efficient compute solutions.
- Develop tools for monitoring and performance tuning of ML workloads.
- Collaborate with ML engineers to enhance training pipelines.
- Stay current with emerging technologies in distributed computing.
Requirements
- Experience with distributed computing frameworks like Ray, Dask, or Celery.
- Strong understanding of parallel computing and job scheduling.
- Ability to identify and resolve performance issues in distributed systems.
- Experience with cloud compute platforms such as AWS, GCP, or Azure.
- Familiarity with ML frameworks like PyTorch, TensorFlow, or JAX.
