Embedding VC

Member of Technical Staff - ML Infrastructure & Performance

5 months ago
San Mateo, CA, USA
Mid Level / Senior

Responsibilities

  • Optimize GPU performance using CUDA and Triton kernels.
  • Manage the serving stack with TensorRT-LLM and Triton Inference Server.
  • Implement parallelism techniques such as FSDP, and tune NCCL communication.
  • Engage in quantization and parameter-efficient fine-tuning.
  • Utilize systems like Ray and Kubernetes for observability and autoscaling.

Requirements

  • Experience with GPU performance optimization and CUDA.
  • Familiarity with serving stacks like TensorRT-LLM and Triton Inference Server.
  • Knowledge of parallelism techniques such as FSDP, and of NCCL tuning.
  • Experience with quantization methods like AWQ and GPTQ.
  • Background in infrastructure-heavy startups is preferred.

Benefits

  • On-site, in-person team environment in San Mateo.

Tech Stack

  • Argo CD
  • Grafana
  • Kubernetes
  • Prometheus
