London, United Kingdom (+3 more locations)
Mid Level / Senior
Base Salary
$230k - $405k/yr
Responsibilities
- Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
- Build software abstractions that unify multiple clusters behind a single, seamless interface for training workloads.
- Own node bring-up from bare metal through firmware upgrades for fast, repeatable deployment.
- Improve operational metrics, for example by reducing cluster restart times and accelerating upgrade cycles.
- Integrate networking and hardware health systems for end-to-end reliability.
- Develop monitoring and observability systems to maintain cluster stability under load.
Requirements
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
- Strong knowledge of Kubernetes internals and cluster scaling patterns.
- Proficiency in compute infrastructure concepts including networking, storage, and security.
- Experience in automating cluster or data center operations.
- Bonus: background with GPU workloads, firmware management, or high-performance computing.
Tech Stack
Kubernetes
Categories
AI & ML
Data Engineering
DevOps