Magic

Member of Technical Staff, Pre-training Systems

4 months ago
San Francisco, CA, USA
Mid Level / Senior
H1B Sponsor

Base Salary

$225k - $550k/yr

Responsibilities

  • Scale distributed training across large GPU clusters using data, tensor, and pipeline parallelism.
  • Optimize communication patterns and gradient synchronization.
  • Improve checkpointing, fault tolerance, and job recovery systems.
  • Profile and eliminate performance bottlenecks across compute, networking, and storage.
  • Enhance experiment reproducibility and orchestration workflows.
  • Increase hardware utilization and training throughput.
  • Collaborate with Kernels and Research to align model architecture with systems realities.

Requirements

  • Strong software engineering and distributed systems fundamentals.
  • Experience training large models in multi-node GPU environments.
  • Deep understanding of parallelism strategies and performance trade-offs.
  • Experience debugging cross-layer issues in production ML systems.
  • Strong ownership mindset and ability to operate critical infrastructure.
  • Track record of improving performance or reliability of large-scale systems.

Benefits

  • Annual salary range: $225K - $550K.
  • Equity is a significant part of total compensation, in addition to salary.
  • 401(k) plan with 6% salary matching.
  • Generous health, dental, and vision insurance for you and your dependents.
  • Unlimited paid time off.
  • Visa sponsorship and a relocation stipend to bring you to SF, where possible.

Categories

AI & ML, Backend, Data Engineering