Software Engineer, Workload Enablement

3 months ago

Seattle, WA, USA or San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Base Salary

$293k - $455k/yr

Responsibilities

Port and validate key inference and training workloads on new platforms.
Build benchmarks and stress tests to capture end-to-end behavior of workloads.
Deep-dive into the performance of distributed training and inference workloads.
Create repeatable test harnesses for CI/lab environments.
Collaborate with systems engineers to ensure platform stability and performance.
Produce clear bug reports and prioritized issue lists for stakeholders.

Requirements

BS in CS/EE or equivalent practical experience.
5+ years in ML systems, performance engineering, distributed systems, or HPC.
Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
Experience with large-scale distributed training concepts.
Proficiency in Python and comfort with performance-critical code (C++/CUDA/HIP is a plus).
Strong profiling/debugging skills using tools like Nsight and perf.

Tech Stack

C++, Kubernetes, Python, PyTorch

Categories

AI & ML, Data Science, DevOps, Testing