AI Inference Engineer - Model Optimization & Deployment

3 months ago

Foster City, CA, USA +2 more

Mid Level / Senior

H1B Sponsor

Base Salary

$242k - $290k/yr

Responsibilities

Optimize large-scale models using advanced quantization and mixed-precision workflows.
Architect and implement model conversion and compilation pipelines for edge deployment.
Perform parity checking, accuracy recovery, and latency benchmarking between frameworks and compiled binaries.
Write and optimize custom CUDA kernels and TensorRT plugins for AI accelerators.
Develop production-level C++ and Python code for real-time inference on vehicle SoCs.

Requirements

Deep expertise in model quantization and mixed-precision inference workflows.
Proven experience optimizing large-scale models using KV-cache optimization and efficient attention mechanisms.
Extensive experience with model conversion/compilation pipelines and benchmarking.
Proficiency in low-level programming for AI accelerators, including CUDA and TensorRT.
Production-level C++ and Python programming skills for real-time inference code.

Tech Stack

C++, Python, PyTorch

Categories

AI & ML, Embedded