
LLM Inference Engineer
Hippocratic AI
8 months ago
Palo Alto, CA, USA
Mid-Level / Senior
H-1B Sponsor
Responsibilities
- Design and implement multi-node serving architectures for distributed LLM inference.
- Optimize multi-LoRA serving systems.
- Apply advanced quantization techniques to reduce model footprint while preserving quality.
- Implement speculative decoding and other latency optimization strategies.
- Develop disaggregated serving solutions with optimized caching strategies.
- Continuously benchmark and improve system performance across various deployment scenarios.
Requirements
- Experience optimizing LLM inference systems at scale.
- Proven expertise with distributed serving architectures for large language models.
- Hands-on experience implementing quantization techniques for transformer models.
- Strong understanding of modern inference optimization methods.
- Proficiency in Python and C++.
- Experience with CUDA programming and GPU optimization.