Site Reliability Engineer, Frontier Systems Infrastructure
OpenAI
4 months ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$255k - $490k/yr
Responsibilities
- Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
- Build software abstractions that unify multiple clusters so training workloads run seamlessly across them.
- Own the bare-metal node bring-up process, ensuring fast and repeatable deployment.
- Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles.
- Integrate networking and hardware health systems for end-to-end reliability.
- Develop monitoring and observability systems to maintain cluster stability under load.
Requirements
- Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
- Strong knowledge of Kubernetes internals and cluster scaling patterns.
- Proficiency in cloud infrastructure concepts and in automating operations.
- Deep experience operating or scaling Kubernetes clusters in hyperscale environments.
- Strong programming or scripting skills in Python, Go, or similar languages.
- Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.
Tech Stack
Go, Kubernetes, Linux, Python, Terraform
Categories
Backend, DevOps