Site Reliability Engineer, Frontier Systems Infrastructure

8 months ago

San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Base Salary

$255k - $490k/yr

Responsibilities

Spin up and scale large Kubernetes clusters with automation for provisioning and lifecycle management.
Build software abstractions to unify multiple clusters for seamless training workloads.
Own the bare-metal node bring-up process, ensuring fast and repeatable deployment.
Improve operational metrics, such as reducing cluster restart times and accelerating upgrade cycles.
Integrate networking and hardware health systems for end-to-end reliability.
Develop monitoring and observability systems to maintain cluster stability under load.

Requirements

Experience as an infrastructure, systems, or distributed systems engineer in large-scale environments.
Strong knowledge of Kubernetes internals and cluster scaling patterns.
Proficiency in cloud infrastructure concepts and automating operations.
Deep experience operating or scaling Kubernetes clusters in hyperscale environments.
Strong programming or scripting skills in Python, Go, or similar languages.
Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.

Tech Stack

Go, Kubernetes, Linux, Python, Terraform

Categories