Responsibilities
- Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost.
- Build large-scale data pipelines that ingest, transform, and curate datasets for training and evaluation.
- Debug complex production issues that arise under real traffic conditions.
- Collaborate with researchers and ML engineers to transition experimental workloads to production.
Requirements
- 5+ years of experience building and operating distributed systems in production.
- Deep experience with large-scale data or compute frameworks such as Ray, Spark, or Flink.
- Strong fluency in Python and at least one systems language such as Go, Rust, or C++.
- Working knowledge of the GPU/accelerator stack and CUDA fundamentals.
- Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
- Proven track record of managing production incidents from diagnosis to resolution.
Benefits
- Flexible work arrangements with in-person collaboration in the Bay Area and a global-first team.
- Annual travel stipend for exploring new countries.
- Weekly meal allowance for take-out or grocery delivery.
- Comprehensive medical benefits and generous paid time off.
