Palo Alto, CA, USA
Mid Level / Staff+
H-1B Sponsor
Base Salary
$180k - $440k/yr
Responsibilities
- Architect and implement scalable distributed infrastructure for model serving.
- Optimize latency and throughput of model inference under real production workloads.
- Build reliable, high-concurrency serving systems targeting near-perfect uptime and near-zero error rates.
- Benchmark, fine-tune, and accelerate inference engines, including GPU kernel work.
- Develop custom tools to trace, replay, and fix issues across the full stack.
- Create robust CI/CD infrastructure for seamless endpoint deployment and updates.
- Accelerate research on scaling test-time compute and model-hardware co-design.
Requirements
- Deep low-level systems programming experience in C/C++ or Rust.
- Experience with large-scale, high-concurrency production serving.
- Familiarity with GPU inference engines such as vLLM, SGLang, and TensorRT-LLM.
- Strong background in system optimizations such as batching and caching.
- Experience with low-level inference optimizations including GPU kernels.
- Knowledge of algorithmic inference optimizations such as quantization and distillation.
- Experience in testing, benchmarking, and ensuring reliability of inference services.
- Experience designing and implementing CI/CD infrastructure for inference.
Benefits
- Equity in the company.
- Comprehensive medical, vision, and dental coverage.
- Access to a 401(k) retirement plan.
- Short- and long-term disability insurance.
- Life insurance and various discounts and perks.