LLM Inference Engineer

8 months ago

Palo Alto, CA, USA

Mid Level / Senior

H-1B Sponsor

Responsibilities

Design and implement multi-node serving architectures for distributed LLM inference.
Optimize multi-LoRA serving systems.
Apply advanced quantization techniques to reduce model footprint while preserving quality.
Implement speculative decoding and other latency optimization strategies.
Develop disaggregated serving solutions with optimized caching strategies.
Continuously benchmark and improve system performance across various deployment scenarios.

Requirements

Experience optimizing LLM inference systems at scale.
Proven expertise with distributed serving architectures for large language models.
Hands-on experience implementing quantization techniques for transformer models.
Strong understanding of modern inference optimization methods, such as speculative decoding, quantization, and disaggregated serving.
Proficiency in Python and C++.
Experience with CUDA programming and GPU optimization.

Categories

AI & ML, Data Engineering