Responsibilities
- Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters.
- Build and maintain compute platform abstractions.
- Optimize utilization of compute resources.
- Develop and improve multi-tenant scheduling strategies.
- Improve reliability and fault tolerance of large-scale distributed jobs and platform components.
- Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
- Contribute to platform tooling, automation, and CI/CD workflows.
Requirements
- 7+ years of experience building and operating distributed systems or infrastructure platforms.
- Strong experience with Kubernetes and container orchestration in production-grade environments.
- Proficiency in developing with Go and Python.
- Experience designing and operating large-scale batch compute systems.
- Strong debugging and problem-solving skills in complex distributed systems.
- Ability to collaborate across teams and communicate technical concepts clearly.
- Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm.