
Member of Technical Staff - ML Infrastructure & Performance
Embedding VC
San Mateo, CA, USA
Mid Level / Senior
Responsibilities
- Optimize GPU performance using CUDA and Triton kernels.
- Manage the serving stack with TensorRT-LLM and Triton Inference Server.
- Implement parallelism techniques such as FSDP, and tune NCCL communication.
- Apply quantization and parameter-efficient fine-tuning.
- Operate systems such as Ray and Kubernetes for autoscaling and observability.
Requirements
- Experience with GPU performance optimization and CUDA.
- Familiarity with serving stacks like TensorRT-LLM and Triton Inference Server.
- Knowledge of parallelism techniques such as FSDP, plus experience tuning NCCL.
- Experience with quantization methods like AWQ and GPTQ.
- Background in infrastructure-heavy startups is preferred.
Benefits
- On-site, in-person team environment in San Mateo.