Member of Technical Staff, Pre-training Systems

4 months ago

San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Base Salary

$225k - $550k/yr

Responsibilities

Scale distributed training across large GPU clusters using data, tensor, and pipeline parallelism.
Optimize communication patterns and gradient synchronization.
Improve checkpointing, fault tolerance, and job recovery systems.
Profile and eliminate performance bottlenecks across compute, networking, and storage.
Enhance experiment reproducibility and orchestration workflows.
Increase hardware utilization and training throughput.
Collaborate with Kernels and Research to align model architecture with systems realities.

Requirements

Strong software engineering and distributed systems fundamentals.
Experience training large models in multi-node GPU environments.
Deep understanding of parallelism strategies and performance trade-offs.
Experience debugging cross-layer issues in production ML systems.
Strong ownership mindset and ability to operate critical infrastructure.
Track record of improving performance or reliability of large-scale systems.

Benefits

Annual salary range: $225K - $550K.
Equity is a significant part of total compensation, in addition to salary.
401(k) plan with 6% salary matching.
Generous health, dental, and vision insurance for you and your dependents.
Unlimited paid time off.
Visa sponsorship and a relocation stipend to bring you to SF, where possible.

Categories

AI & ML Backend
Data Engineering