Software Engineer, RL Training Infra

about 2 months ago

H-1B Sponsor

Base Salary

$295k - $445k/yr

Responsibilities

Keep large-scale RL training runs moving by addressing urgent engineering and infrastructure problems.
Debug issues across training systems, inference, orchestration, scaling, and distributed infrastructure.
Solve technical problems at the intersection of research and engineering.
Improve reliability and efficiency for RL training runs.
Assist researchers with infrastructure-heavy integrations, such as multi-agent capabilities.
Transform recurring operational issues into better tools and processes.
Collaborate closely with research and partner teams within model run timelines.
Trace failures that span multiple systems, form hypotheses about their causes, and turn the findings into improvements.

Qualifications

Strong generalist engineer with experience in ML infrastructure.
Experience in reinforcement learning, inference, scaling, or training systems.
Ability to learn quickly and operate across unfamiliar layers.
Strong debugging skills with high ownership and excellent communication.
Comfortable working in messy areas with tight timelines.
Experience with large-scale model training or high-throughput ML infrastructure is a plus.
Background in performance optimization or production-critical infrastructure is preferred.