Senior Software Engineer, AI Runtime
Databricks
about 5 hours ago
Base Salary
$160k - $225k/yr
Responsibilities
- Drive the architecture and evolution of the AI Runtime (AIR) managed GPU training platform.
- Solve complex problems in large-scale training, including multi-node orchestration and GPU scheduling.
- Enhance GPU efficiency and training performance across diverse model architectures.
- Build resilience and observability foundations for multi-node jobs.
- Collaborate with product, research, and platform teams to improve developer experience.
- Lead end-to-end engineering efforts from design to production rollout.
- Contribute to core systems and support the latest accelerators as the fleet grows.
- Mentor other engineers and contribute to the technical direction of AI training infrastructure.
Requirements
- 5+ years of experience building and operating large-scale distributed systems.
- Experience with distributed training frameworks such as PyTorch and DeepSpeed.
- Strong understanding of training resilience patterns for long-running jobs.
- Solid grasp of GPU performance fundamentals and bottlenecks affecting training throughput.
- Experience with managed, multi-tenant platform products in the cloud.
- Strong foundation in algorithms, data structures, and system design.
- Proven ability to deliver high-impact initiatives that create customer value.
- Strong communication skills for collaboration across teams.
- Customer-focused mindset with a passion for mentoring engineers.
- BS in Computer Science or a related field; MS or PhD preferred.
Categories
AI & ML, Data Engineering