about 1 month ago
London, United Kingdom
Staff+ / Senior
Responsibilities
- Develop and optimize real-time multimodal models and serving frameworks.
- Implement techniques for inference optimization and model acceleration.
- Profile and improve system performance using C++, CUDA, Rust, or optimized Python.
- Manage distributed systems and scaling for high-concurrency environments.
- Take ownership of models from research to production, ensuring reliability.
Requirements
- Deep understanding of modern serving frameworks like vLLM or TRT-LLM.
- Hands-on experience with quantization, distillation, and caching strategies.
- Proficiency in high-performance programming languages and in profiling code.
- Experience with Kubernetes, Ray, and multi-GPU/multi-node inference.
- PhD in CS, Physics, Math, or equivalent practical experience.
Benefits
- Competitive salary range of £140,000 – £200,000.
- Equity and additional benefits included in total compensation.
- Support for open-source contributions and sharing work.
