
Machine Learning Infrastructure Engineer
Mind Robotics
3 months ago
Palo Alto, CA, USA
Mid Level
Responsibilities
- Build and maintain large-scale model training systems.
- Own distributed training and core ML infrastructure.
- Tighten iteration loops for training runs across hundreds of GPUs.
- Collaborate with researchers to improve training processes.
- Focus on reliability and ease of model deployment.
Requirements
- Experience with large-scale training systems in PyTorch or JAX.
- Strong understanding of sharding and parallelism.
- Ability to operate at scale in machine learning environments.
- Familiarity with performance optimization techniques.