fal

Software Engineer, Site Reliability

6 days ago
San Francisco, CA, USA
Senior / Staff+

Base Salary

$180k - $250k/yr

Responsibilities

  • Own and operate our Kubernetes infrastructure, including cluster lifecycle and upgrades.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Leverage AI to automate analysis and resolution of production issues.
  • Build dashboards, alerting, and anomaly detection across systems.
  • Define and enforce SLOs and develop incident response processes.
  • Manage and improve networking, load balancing, and service mesh configurations.
  • Drive reliability improvements through automation and chaos engineering.

Requirements

  • 5+ years of experience managing critical production systems.
  • Strong production experience with Kubernetes at scale using infrastructure-as-code.
  • Deep knowledge of Linux networking and container networking.
  • Experience building CI/CD systems and GitOps workflows.
  • Proficiency in Python and either Go or Bash for automation.
  • Strong experience with logging, monitoring, and alerting tools.
  • Excellent communication skills and ability to drive technical decisions.
  • Self-starter who executes quickly and seeks constant improvement.

Benefits

  • Interesting and challenging work.
  • Opportunities for learning and growth.
  • Relocation assistance to San Francisco.
  • Health, dental, and vision insurance.
  • Regular team events and offsites.

Tech Stack

Ansible, Bash, Datadog, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
