OpenAI

Software Engineer, RL Training Infra

about 5 hours ago
San Francisco, CA, USA
Mid Level / Senior
H1B Sponsor

Base Salary

$295k - $445k/yr

Responsibilities

  • Keep large-scale RL training runs moving by addressing urgent engineering and infrastructure problems.
  • Debug issues across training systems, inference, orchestration, scaling, and distributed infrastructure.
  • Solve technical problems at the intersection of research and engineering.
  • Improve reliability and efficiency for RL training runs.
  • Assist researchers with infra-heavy integrations like multi-agent capabilities.
  • Transform recurring operational issues into better tools and processes.
  • Collaborate closely with research and partner teams during model run timelines.
  • Debug failures across various systems and turn them into hypotheses and improvements.

Requirements

  • Strong generalist engineer with experience in ML infrastructure.
  • Experience in reinforcement learning, inference, scaling, or training systems.
  • Ability to learn quickly and operate across unfamiliar layers.
  • Strong debugging skills with high ownership and excellent communication.
  • Comfortable working in messy areas with tight timelines.
  • Experience with large-scale model training or high-throughput ML infrastructure is a plus.
  • Background in performance optimization or production-critical infrastructure is preferred.

Categories

AI & ML, Backend, Data Engineering