Base Salary
$295k - $445k/yr
Responsibilities
- Keep large-scale RL training runs moving by addressing urgent engineering and infrastructure problems.
- Debug issues across training systems, inference, orchestration, scaling, and distributed infrastructure.
- Solve technical problems at the intersection of research and engineering.
- Improve reliability and efficiency for RL training runs.
- Assist researchers with infra-heavy integrations like multi-agent capabilities.
- Transform recurring operational issues into better tools and processes.
- Collaborate closely with research and partner teams during model run timelines.
- Investigate failures that span multiple systems and turn the findings into testable hypotheses and concrete improvements.
Requirements
- Strong generalist engineer with experience in ML infrastructure.
- Experience in reinforcement learning, inference, scaling, or training systems.
- Ability to learn quickly and operate across unfamiliar layers.
- Strong debugging skills with high ownership and excellent communication.
- Comfortable working on messy, loosely defined problems under tight timelines.
- Experience with large-scale model training or high-throughput ML infrastructure is a plus.
- Background in performance optimization or production-critical infrastructure is preferred.