Periodic Labs

ML Systems Engineer

10 days ago
Menlo Park, CA, USA
Mid Level / Senior
H-1B Sponsor

Base Salary

$300k - $400k/yr

Responsibilities

  • Build rack- and topology-aware scheduling for GPUs across Ray, Slurm, and Kubernetes.
  • Develop online and offline profilers to identify and optimize bottlenecks in the training and inference stack.
  • Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks.
  • Conduct benchmarking to find optimal RL training configurations.
  • Write and optimize communication and GPU kernels for maximum throughput.
  • Design zero-copy RDMA weight synchronization for low-latency RL loops.
  • Create fast sandbox execution environments for rapid model action rollout.
  • Engage with open-source communities to drive upstream improvements that benefit Periodic Labs.

Requirements

  • Experience with large-scale inference infrastructure and production-scale serving architecture.
  • Proficiency in low-level systems programming, including RDMA and network stack optimization.
  • Familiarity with GPU cluster scheduling across Ray, Slurm, or Kubernetes.
  • Ability to write and optimize CUDA kernels and distributed training operations.
  • Experience in profiling and benchmarking distributed ML systems.
  • Knowledge of checkpoint management and cloud storage integration.
  • Experience contributing to open-source ML infrastructure projects.
  • Ability to collaborate with ML researchers on algorithm-infrastructure co-design.

Tech Stack

Kubernetes

Categories

AI & ML, Backend, Data Engineering