Staff Software Engineer, AI Runtime
Databricks
about 5 hours ago
Base Salary
$190k - $265k/yr
Responsibilities
- Drive the architecture and evolution of the managed GPU training platform for AI Runtime (AIR).
- Solve complex problems in large-scale training, including multi-node orchestration and GPU scheduling.
- Enhance GPU efficiency and training performance across diverse model architectures.
- Build resilience and observability foundations for multi-node jobs.
- Collaborate with product, research, and platform teams to improve developer experience.
- Lead end-to-end engineering efforts from design to production rollout.
- Make high-impact contributions to core systems and support new accelerators.
- Mentor engineers and influence Databricks' technical direction in AI training.
Requirements
- 10+ years of experience in large-scale distributed systems, particularly in GPU training infrastructure.
- Hands-on experience with distributed training frameworks such as PyTorch and DeepSpeed.
- Strong understanding of training resilience patterns for long-running jobs.
- Solid grasp of GPU performance fundamentals and bottlenecks affecting training throughput.
- Experience with managed, multi-tenant platform products in the cloud.
- Strong foundation in algorithms, data structures, and system design.
- Proven ability to deliver complex, high-impact initiatives.
- Excellent communication skills for collaboration across teams.
- Strategic mindset with a passion for mentoring engineers.
- BS in Computer Science or a related field; MS or PhD preferred.
Categories
AI & ML, Data Engineering