
Member of Technical Staff - ML Infrastructure & Performance
Embedding VC
San Mateo, CA, USA
Mid Level / Senior
Responsibilities
- Optimize GPU performance using CUDA and Triton kernels.
- Manage the serving stack with TensorRT-LLM and Triton Inference Server.
- Implement parallelism techniques such as FSDP, and tune NCCL communication.
- Apply quantization and parameter-efficient fine-tuning.
- Operate systems such as Ray and Kubernetes for autoscaling and observability.
Requirements
- Experience with GPU performance optimization and CUDA.
- Familiarity with serving stacks like TensorRT-LLM and Triton Inference Server.
- Knowledge of parallelism techniques such as FSDP, plus experience tuning NCCL.
- Experience with quantization methods like AWQ and GPTQ.
- Background in infrastructure-heavy startups is preferred.
Benefits
- On-site, in-person team environment in San Mateo.