OpenAI

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI

Apply
4 months ago
San Francisco, CA, USA
Mid Level / Senior

Base Salary

$255k - $490k/yr

Responsibilities

  • Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
  • Build software abstractions to unify multiple clusters for seamless training workloads.
  • Own the bare-metal node bring-up process, ensuring fast and repeatable deployment.
  • Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles.
  • Integrate networking and hardware health systems for end-to-end reliability.
  • Develop monitoring and observability systems to maintain cluster stability under load.

Requirements

  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
  • Strong knowledge of Kubernetes internals and cluster scaling patterns.
  • Proficiency in cloud infrastructure concepts and automating operations.
  • Deep experience operating or scaling Kubernetes clusters in hyperscale environments.
  • Strong programming or scripting skills in Python, Go, or similar languages.
  • Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.

Tech Stack

GoKubernetesLinuxPythonTerraform

Categories

BackendDevOps