GrepJob
adaption

Distributed Systems Engineer, Data & Inference Platform

Responsibilities

  • Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost.
  • Build large-scale data pipelines that ingest, transform, and curate datasets for training and evaluation.
  • Debug complex production issues that arise under real traffic conditions.
  • Collaborate with researchers and ML engineers to transition experimental workloads to production.

Requirements

  • 5+ years of experience building and operating distributed systems in production.
  • Deep experience with large-scale data or compute frameworks like Ray, Spark, or Flink.
  • Strong fluency in Python and at least one systems language such as Go, Rust, or C++.
  • Working knowledge of the GPU/accelerator stack and CUDA fundamentals.
  • Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
  • Proven track record of managing production incidents from diagnosis to resolution.

Benefits

  • Flexible work arrangements with in-person collaboration in the Bay Area and a global-first team.
  • Annual travel stipend for exploring new countries.
  • Weekly meal allowance for take-out or grocery delivery.
  • Comprehensive medical benefits and generous paid time off.