Training Performance Engineer

OpenAI

San Francisco, CA, USA
Mid Level / Senior

Base Salary

$250k - $445k/yr

Responsibilities

  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
  • Optimize GPU utilization and throughput for large-scale distributed model training.
  • Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
  • Implement model graph transforms to improve end-to-end throughput.
  • Build tooling to monitor and visualize MFU (Model FLOPs Utilization), throughput, and uptime across clusters (a minimal MFU sketch follows this list).
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training.
  • Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs.
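
For context, MFU is the ratio of the training FLOPs a run actually achieves to the aggregate peak FLOPs of its hardware. Below is a minimal Python sketch, assuming a dense decoder-only transformer and the common 6 * params * tokens estimate of training FLOPs; all names and constants are illustrative, not OpenAI's internal tooling.

    # Minimal MFU sketch for a dense transformer, using the standard
    # 6 * params * tokens approximation of training FLOPs.
    # All constants and names here are illustrative assumptions.

    def mfu(params: float, tokens_per_step: float, step_time_s: float,
            peak_flops_per_gpu: float, num_gpus: int) -> float:
        """Achieved training FLOPs/s as a fraction of aggregate peak FLOPs/s."""
        achieved = 6 * params * tokens_per_step / step_time_s
        peak = peak_flops_per_gpu * num_gpus
        return achieved / peak

    # Example: 7B params, a 4M-token batch, 6.2 s/step on 64 H100s
    # (~989 TFLOP/s dense BF16 peak each) -> MFU ~= 0.45.
    print(mfu(7e9, 4 * 2**20, 6.2, 989e12, 64))

The 6 * params * tokens approximation counts forward plus backward matmul FLOPs and ignores attention's quadratic term, which is the usual simplification for quick MFU estimates.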

Requirements

  • Strong programming skills in Python and C++ (Rust or CUDA a plus).
  • Experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Ability to debug complex distributed systems and measure efficiency rigorously.
  • Exposure to frameworks like PyTorch, JAX, or TensorFlow.
  • Comfortable collaborating across teams and translating raw profiling data into practical engineering improvements.
  • Familiarity with NCCL, MPI, or UCX communication libraries is a plus (a bandwidth-measurement sketch follows this list).
  • Experience with large-scale data loading and checkpointing systems is a plus.
  • Prior work on training runtime, distributed scheduling, or ML compiler optimization is a plus.
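
To make the NCCL point above concrete, here is a hypothetical single-node micro-benchmark of all-reduce bandwidth using torch.distributed with the NCCL backend; the script name, tensor size, and launch command are assumptions for illustration, not part of the role's actual stack.

    # Hypothetical NCCL all-reduce bandwidth micro-benchmark.
    # Launch (assumed): torchrun --nproc_per_node=8 allreduce_bench.py
    import os
    import time
    import torch
    import torch.distributed as dist

    def bench_allreduce(numel: int = 256 * 2**20, iters: int = 20) -> None:
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        x = torch.randn(numel, device="cuda")  # 1 GiB fp32 payload

        for _ in range(5):                     # warm up communicators/kernels
            dist.all_reduce(x)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(x)
        torch.cuda.synchronize()               # wait for async collectives
        elapsed = time.perf_counter() - start

        # Algorithm bandwidth: fp32 payload bytes per rank per second.
        algbw_gbs = numel * 4 * iters / elapsed / 1e9
        if dist.get_rank() == 0:
            print(f"all_reduce algbw: {algbw_gbs:.1f} GB/s")
        dist.destroy_process_group()

    if __name__ == "__main__":
        bench_allreduce()

Per-rank bandwidth far below what the node's fabric should sustain typically points at topology or configuration problems rather than the model itself.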

Benefits

  • Hybrid work model with three days per week in the office.
  • Relocation assistance for new employees.

Tech Stack

C++, Python, PyTorch, Rust, TensorFlow

Categories

AI & ML, Data Engineering