Mind Robotics

Machine Learning Infrastructure Engineer

3 months ago
Palo Alto, CA, USA
Mid Level

Responsibilities

  • Build and maintain large-scale model training systems.
  • Own distributed training and core ML infrastructure.
  • Improve iteration speed across hundreds of GPUs.
  • Collaborate with researchers to improve training processes.
  • Focus on reliability and ease of model deployment.

Requirements

  • Experience with large training systems in PyTorch or JAX.
  • Strong understanding of sharding and parallelism.
  • Ability to operate at scale in machine learning environments.
  • Familiarity with performance optimization techniques.

Tech Stack

PyTorch
