OpenAI

Site Reliability Engineer, Infrastructure - Analytics Platform

about 4 hours ago
San Francisco, CA, USA
Mid Level / Senior

Base Salary

$230k - $385k/yr

Responsibilities

  • Own infrastructure lifecycle management including provisioning, upgrades, scaling, and decommissioning.
  • Operate and scale ClickHouse clusters with a focus on sharding, replication, and performance tuning.
  • Manage Kafka as the ingestion backbone, enhancing throughput and failure recovery.
  • Improve end-to-end latency and reliability for data-heavy workloads.
  • Build and maintain monitoring and alerting systems, including SLIs/SLOs and actionable runbooks.
  • Define and improve incident response standards and on-call practices.
  • Own backup/restore and disaster recovery strategies, conducting regular drills.
  • Plan and execute safe rollouts across multiple environments.
  • Collaborate with software engineers to integrate reliability into design and release processes.
  • Set quality standards for operational readiness and drive adoption across teams.
  • Enhance CI/CD pipelines for faster and safer releases.
  • Strengthen security posture across infrastructure and delivery systems.

Requirements

  • Proven experience managing production infrastructure for data-heavy, low-latency systems.
  • Hands-on experience with ClickHouse, Kafka, and large-scale data systems.
  • Familiarity with Snowflake workflows and cross-system data architecture.
  • Ability to define and enforce operational standards independently.
  • Strong operational experience with Kubernetes, Terraform, and cloud infrastructure.
  • Excellent communication and collaboration skills across teams.
  • High personal rigor and organization in high-pressure environments.
  • A hands-on mindset for debugging and tuning systems.

Tech Stack

Apache Kafka, ClickHouse, Kubernetes, Snowflake, Terraform

Categories

AI & ML, Data Engineering, DevOps, Security