Baseten

Software Engineer — GPU Networking & Distributed Systems

2 months ago
San Francisco, CA, USA or New York, NY, USA
Mid Level / Senior

Base Salary

$165k - $330k/yr

Responsibilities

  • Integrate RDMA/RoCE/InfiniBand capabilities into the inference stack.
  • Implement and tune networking layers for efficient Disaggregated KV Cache Offload and WideEP.
  • Enable sub-10-second startup for trillion-parameter models through checkpointing and storage mechanisms.
  • Characterize and validate networking performance on advanced hardware clusters.
  • Design tools for visualizing packet flow and diagnosing distributed system behaviors.
  • Optimize communication libraries and potentially write custom communication kernels.

Requirements

  • Deep experience with high-performance networking protocols like InfiniBand and RoCE v2.
  • Fluency in C++ or Python with a strong understanding of modern NVIDIA architectures.
  • Ability to dive deep into source code and debug complex issues.
  • Knowledge of when to use off-the-shelf solutions versus building custom solutions.
  • Highly preferred: Knowledge of NCCL, NVSHMEM, and UCX.
  • Experience with GPUDirect Storage or high-performance filesystems.

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy including a company-wide Winter Break.
  • Paid parental leave and fertility/family-building stipend.
  • Company-facilitated 401(k) and exposure to various ML startups.

Tech Stack

C++, Kubernetes, Python

Categories

AI & ML, Backend, Data Engineering