9 months ago
Responsibilities
- Optimize GPU performance using CUDA and Triton kernels.
- Enhance the serving stack with TensorRT-LLM and Triton Inference Server.
- Implement parallelism techniques such as FSDP and tune NCCL communication.
- Work on quantization and PEFT strategies for model serving.
- Manage systems for observability and autoscaling.
Requirements
- Experience with GPU performance optimization and CUDA.
- Familiarity with serving stacks like TensorRT-LLM and Triton Inference Server.
- Knowledge of parallelism techniques and NCCL tuning.
- Experience with quantization methods such as AWQ and GPTQ.
- Background at infrastructure-heavy startups like Databricks or Roblox.
Benefits
- On-site, in-person team environment in San Mateo.
Categories
AI & ML, Data Engineering
