Databricks

Senior Software Engineer, AI Runtime

Posted about 5 hours ago
Mountain View, CA, USA or San Francisco, CA, USA
Senior
H1B Sponsor

Base Salary

$160k - $225k/yr

Responsibilities

  • Drive the architecture and evolution of the managed GPU training platform for AI Runtime (AIR).
  • Solve complex problems in large-scale training, including multi-node orchestration and GPU scheduling.
  • Enhance GPU efficiency and training performance across diverse model architectures.
  • Build resilience and observability foundations for multi-node jobs.
  • Collaborate with product, research, and platform teams to improve developer experience.
  • Lead end-to-end engineering efforts from design to production rollout.
  • Contribute to core systems and support the latest accelerators as the fleet grows.
  • Mentor other engineers and contribute to the technical direction of AI training infrastructure.

Requirements

  • 5+ years of experience in building and operating large-scale distributed systems.
  • Experience with distributed training frameworks like PyTorch and DeepSpeed.
  • Strong understanding of training resilience patterns for long-running jobs.
  • Solid grasp of GPU performance fundamentals and bottlenecks affecting training throughput.
  • Experience with managed, multi-tenant platform products in the cloud.
  • Strong foundation in algorithms, data structures, and system design.
  • Proven ability to deliver high-impact initiatives that create customer value.
  • Strong communication skills for collaboration across teams.
  • Customer-focused mindset with a passion for mentoring engineers.
  • BS in Computer Science or a related field; MS or PhD preferred.

Tech Stack

  • Apache Spark
  • Databricks
  • MLflow
  • PyTorch

Categories

  • AI & ML
  • Data Engineering