Palo Alto, CA, USA
Mid Level / Staff+
H-1B Sponsor
Base Salary
$180k - $440k/yr
Responsibilities
- Architect and implement scalable distributed infrastructure for model serving.
- Optimize latency and throughput of model inference under real production workloads.
- Build reliable, high-concurrency serving systems targeting near-perfect uptime and near-zero error rates.
- Benchmark, fine-tune, and accelerate inference engines, including GPU kernel work.
- Develop custom tools to trace, replay, and fix issues across the full stack.
- Create robust CI/CD infrastructure for seamless endpoint deployment and updates.
- Accelerate research on scaling test-time compute and model-hardware co-design.
Requirements
- Deep low-level systems programming experience in C/C++ or Rust.
- Experience with large-scale, high-concurrency production serving.
- Familiarity with GPU inference engines such as vLLM, SGLang, and TensorRT-LLM.
- Strong background in system optimizations such as batching and caching.
- Experience with low-level inference optimizations including GPU kernels.
- Knowledge of algorithmic inference optimizations such as quantization and distillation.
- Experience in testing, benchmarking, and ensuring reliability of inference services.
- Experience designing and implementing CI/CD infrastructure for inference.
Benefits
- Equity in the company.
- Comprehensive medical, vision, and dental coverage.
- Access to a 401(k) retirement plan.
- Short- and long-term disability insurance.
- Life insurance and various discounts and perks.