Training Performance Engineer
OpenAI
4 months ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$250k - $445k/yr
Responsibilities
- Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage.
- Optimize GPU utilization and throughput for large-scale distributed model training.
- Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance.
- Implement model graph transforms to improve end-to-end throughput.
- Build tooling to monitor and visualize MFU, throughput, and uptime across clusters.
- Partner with researchers to ensure new model architectures scale efficiently during pre-training.
- Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs.
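The MFU (Model FLOPs Utilization) metric mentioned above measures how much of the cluster's theoretical peak compute a training run actually uses. A minimal sketch of how it is commonly estimated for transformer training, using the standard ~6 FLOPs per parameter per token approximation (the function name and the example figures are illustrative, not from this posting):

```python
def estimate_mfu(n_params: float, tokens_per_sec: float,
                 n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Return MFU as a fraction in [0, 1].

    Uses the common approximation that one training step costs
    ~6 FLOPs per parameter per token (forward + backward pass).
    """
    achieved = 6.0 * n_params * tokens_per_sec   # FLOPs/s actually sustained
    peak = n_gpus * peak_flops_per_gpu           # cluster peak FLOPs/s
    return achieved / peak

# Hypothetical example: a 70B-parameter model sustaining 400k tokens/s
# on 1024 GPUs, each with a ~989 TFLOP/s BF16 peak.
mfu = estimate_mfu(70e9, 4.0e5, 1024, 989e12)
print(f"MFU = {mfu:.1%}")  # prints "MFU = 16.6%"
```

Tooling of the kind described in the role would track this ratio over time, alongside raw throughput and uptime, to surface regressions across clusters.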
Requirements
- Strong programming skills in Python and C++ (Rust or CUDA a plus).
- Experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Ability to debug complex distributed systems and measure efficiency rigorously.
- Exposure to frameworks like PyTorch, JAX, or TensorFlow.
- Comfortable collaborating across teams and translating raw profiling data into practical engineering improvements.
- Familiarity with NCCL, MPI, or UCX communication libraries is a plus.
- Experience with large-scale data loading and checkpointing systems is a plus.
- Prior work on training runtime, distributed scheduling, or ML compiler optimization is a plus.
Benefits
- Hybrid work model of three days in the office per week.
- Relocation assistance for new employees.
Tech Stack
C++, Python, PyTorch, Rust, TensorFlow
Categories
AI & ML, Data Engineering