
ML Systems Engineer
Periodic Labs
10 days ago
Menlo Park, CA, USA
Mid Level / Senior
H1B Sponsor
Base Salary
$300k - $400k/yr
Responsibilities
- Build rack- and topology-aware scheduling for GPUs across Ray, Slurm, and Kubernetes.
- Develop online and offline profilers to identify and optimize bottlenecks in the training and inference stack.
- Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks.
- Conduct benchmarking to find optimal RL training configurations.
- Write and optimize communication and GPU kernels for maximum throughput.
- Design zero-copy RDMA weight synchronization for low-latency RL loops.
- Create fast sandbox execution environments for rapid model action rollout.
- Engage with open-source communities to drive upstream improvements that benefit Periodic Labs.
Requirements
- Experience with large-scale inference infrastructure and production-scale serving architecture.
- Proficiency in low-level systems programming, including RDMA and network stack optimization.
- Familiarity with GPU cluster scheduling across Ray, Slurm, or Kubernetes.
- Ability to write and optimize CUDA kernels and distributed training operations.
- Experience in profiling and benchmarking distributed ML systems.
- Knowledge of checkpoint management and cloud storage integration.
- Experience contributing to open-source ML infrastructure projects.
- Ability to collaborate with ML researchers on algorithm-infrastructure co-design.