Training Performance Engineer
OpenAI
4 months ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$250k - $445k/yr
Responsibilities
- Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
- Optimize GPU utilization and throughput for large-scale distributed model training.
- Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
- Implement model graph transforms to improve end-to-end throughput.
- Build tooling to monitor and visualize MFU, throughput, and uptime across clusters.
- Partner with researchers to ensure new model architectures scale efficiently during pre-training.
- Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs.
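The MFU (Model FLOPs Utilization) metric mentioned above measures how much of the cluster's theoretical peak compute a training run actually uses. A minimal sketch of how it is commonly estimated for transformer training, using the standard ~6 FLOPs per parameter per token approximation (the function name and the example figures are illustrative, not from this posting):

```python
def estimate_mfu(n_params: float, tokens_per_sec: float,
                 n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Return MFU as a fraction in [0, 1].

    Uses the common approximation that one training step costs
    ~6 FLOPs per parameter per token (forward + backward pass).
    """
    achieved = 6.0 * n_params * tokens_per_sec   # FLOPs/s actually sustained
    peak = n_gpus * peak_flops_per_gpu           # cluster peak FLOPs/s
    return achieved / peak

# Hypothetical example: a 70B-parameter model sustaining 400k tokens/s
# on 1024 GPUs, each with a ~989 TFLOP/s BF16 peak.
mfu = estimate_mfu(70e9, 4.0e5, 1024, 989e12)
print(f"MFU = {mfu:.1%}")  # prints "MFU = 16.6%"
```

Tooling of the kind described in the role would track this ratio over time, alongside raw throughput and uptime, to surface regressions across clusters.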
Requirements
- Strong programming skills in Python and C++ (Rust or CUDA a plus).
- Experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Ability to debug complex distributed systems and measure efficiency rigorously.
- Exposure to frameworks like PyTorch, JAX, or TensorFlow.
- Comfortable collaborating across teams and translating raw profiling data into practical engineering improvements.
- Familiarity with NCCL, MPI, or UCX communication libraries is a plus.
- Experience with large-scale data loading and checkpointing systems is a plus.
- Prior work on training runtime, distributed scheduling, or ML compiler optimization is a plus.
Benefits
- Hybrid work model of three days in the office per week.
- Relocation assistance for new employees.
Tech Stack
C++, Python, PyTorch, Rust, TensorFlow
Categories
AI & ML, Data Engineering