Base Salary
$320k - $405k/yr
Responsibilities
- Own, operate, and extend the Kubernetes scheduler for accelerator fleets.
- Scale the Kubernetes control plane to support large clusters.
- Design and operate core cluster services such as service discovery.
- Build and maintain custom controllers, operators, and CRDs.
- Collaborate with research and training teams to understand workload requirements.
- Partner with cloud providers for feature development and escalations.
- Lead incident response and design processes to prevent failures.
Requirements
- Significant software engineering experience with production distributed systems.
- Proficiency in systems-appropriate languages such as Go, Python, Rust, or C++.
- Deep hands-on experience with Kubernetes, including scheduler and controllers.
- Ability to debug complex issues across the technology stack.
- Track record of designing reliable and correct systems.
- Strong communication skills for building consensus with stakeholders.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Collaborative office space.