ML Systems Engineer

2 months ago

Menlo Park, CA, USA

Mid Level / Senior

H-1B Sponsor

Base Salary

$300k - $400k/yr

Responsibilities

Build rack- and topology-aware scheduling for GPUs across Ray, Slurm, and Kubernetes.
Develop online and offline profilers to identify and optimize bottlenecks in the training and inference stack.
Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks.
Conduct benchmarking to find optimal RL training configurations.
Write and optimize communication and GPU kernels for maximum throughput.
Design zero-copy RDMA weight synchronization for low-latency RL loops.
Create fast sandbox execution environments for rapid rollout of model actions.
Engage with open-source communities to drive upstream improvements that benefit Periodic Labs.

Qualifications

Experience with large-scale inference infrastructure and production-scale serving architecture.
Proficiency in low-level systems programming, including RDMA and network stack optimization.
Familiarity with GPU cluster scheduling across Ray, Slurm, or Kubernetes.
Ability to write and optimize CUDA kernels and distributed training operations.
Experience in profiling and benchmarking distributed ML systems.
Knowledge of checkpoint management and cloud storage integration.
Experience contributing to open-source ML infrastructure projects.
Ability to collaborate with ML researchers on algorithm-infrastructure co-design.