
Staff Machine Learning Infrastructure Engineer
Dyna Robotics
Posted about 1 month ago
Base Salary
$220k - $320k/yr
Responsibilities
- Architect and own the infrastructure for large-scale GPU clusters.
- Implement sharding, activation checkpointing, and memory optimization for training massive multimodal models.
- Build a research codebase and job-scheduling system that prioritize fast iteration and automated retries.
- Design high-throughput pipelines to ingest and transform terabytes of multimodal robot data.
- Build low-latency inference pipelines for real-time robot control.
- Conduct deep systems profiling to optimize GPU utilization and performance.
Requirements
- 7+ years of engineering experience in high-performance computing or ML infrastructure.
- Deep experience with PyTorch and distributed training frameworks.
- Hands-on experience managing cloud GPU environments and container orchestration.
- Fundamental understanding of distributed systems, including memory management and communication.
- Ownership mindset with a focus on designing and operating systems end-to-end.