Databricks

Staff Software Engineer, AI Runtime

about 5 hours ago
Mountain View, CA, USA or San Francisco, CA, USA
Staff+
H1B Sponsor

Base Salary

$190k - $265k/yr

Responsibilities

  • Drive the architecture and evolution of the AI Runtime (AIR) managed GPU training platform.
  • Solve complex problems in large-scale training, including multi-node orchestration and GPU scheduling.
  • Enhance GPU efficiency and training performance across diverse model architectures.
  • Build resilience and observability foundations for multi-node jobs.
  • Collaborate with product, research, and platform teams to improve developer experience.
  • Lead end-to-end engineering efforts from design to production rollout.
  • Make high-impact contributions to core systems and support new accelerators.
  • Mentor engineers and influence Databricks' technical direction in AI training.

Requirements

  • 10+ years of experience in large-scale distributed systems, particularly in GPU training infrastructure.
  • Hands-on experience with distributed training frameworks like PyTorch and DeepSpeed.
  • Strong understanding of training resilience patterns for long-running jobs.
  • Solid grasp of GPU performance fundamentals and bottlenecks affecting training throughput.
  • Experience with managed, multi-tenant platform products in the cloud.
  • Strong foundation in algorithms, data structures, and system design.
  • Proven ability to deliver complex, high-impact initiatives.
  • Excellent communication skills for collaboration across teams.
  • Strategic mindset with a passion for mentoring engineers.
  • BS in Computer Science or a related field; MS or PhD preferred.

Categories

AI & ML, Data Engineering