2 months ago
San Francisco, CA, USA or New York, NY, USA
Mid Level / Senior
Base Salary
$165k - $330k/yr
Responsibilities
- Integrate RDMA/RoCE/InfiniBand capabilities into the inference stack.
- Implement and tune networking layers for efficient Disaggregated KV Cache Offload and WideEP.
- Enable sub-10-second startup for trillion-parameter models through checkpointing and storage mechanisms.
- Characterize and validate networking performance on advanced hardware clusters.
- Design tools for visualizing packet flow and diagnosing distributed system behaviors.
- Optimize communication libraries and potentially write custom communication kernels.
Requirements
- Deep experience with high-performance networking protocols like InfiniBand and RoCE v2.
- Fluency in C++ or Python, with a strong understanding of modern NVIDIA architectures.
- Ability to dive deep into source code and debug complex issues.
- Knowledge of when to use off-the-shelf solutions versus building custom solutions.
- Highly preferred: Knowledge of NCCL, NVSHMEM, and UCX.
- Experience with GPUDirect Storage or high-performance filesystems.
Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Flexible PTO policy, including a company-wide Winter Break.
- Paid parental leave and fertility/family-building stipend.
- Company-facilitated 401(k) and exposure to various ML startups.
