4 days ago
San Francisco, CA, USA or Seattle, WA, USA
Mid Level / Senior
Base Salary
$293k - $455k/yr
Responsibilities
- Port and validate key inference and training workloads on new platforms.
- Build benchmarks and stress tests to capture end-to-end behavior of workloads.
- Investigate the performance of distributed training and inference workloads in depth.
- Create repeatable test harnesses for CI/lab environments.
- Collaborate with systems engineers to ensure platform stability and performance.
- Produce clear bug reports and prioritized issue lists for stakeholders.
Requirements
- BS in CS/EE or equivalent practical experience.
- 5+ years in ML systems, performance engineering, distributed systems, or HPC.
- Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
- Experience with large-scale distributed training concepts.
- Proficiency in Python and comfort with performance-critical code (C++/CUDA/HIP is a plus).
- Strong profiling/debugging skills using tools like Nsight and perf.
Tech Stack
C++, Kubernetes, Python, PyTorch
Categories
AI & ML, Data Science, DevOps, Testing