Site Reliability Engineer, Infrastructure - Analytics Platform


San Francisco, CA, USA

Mid Level / Senior

Base Salary

$230k - $385k/yr

Responsibilities

Own infrastructure lifecycle management including provisioning, upgrades, scaling, and decommissioning.
Operate and scale ClickHouse clusters with a focus on sharding, replication, and performance tuning.
Manage Kafka as the ingestion backbone, improving throughput and failure recovery.
Improve end-to-end latency and reliability for data-heavy workloads.
Build and maintain monitoring and alerting systems, including SLIs/SLOs and actionable runbooks.
Define and improve incident response standards and on-call practices.
Own backup/restore and disaster recovery strategies, conducting regular drills.
Plan and execute safe rollouts across multiple environments.
Collaborate with software engineers to integrate reliability into design and release processes.
Set quality standards for operational readiness and drive adoption across teams.
Enhance CI/CD pipelines for faster and safer releases.
Strengthen security posture across infrastructure and delivery systems.

Qualifications

Proven experience managing production infrastructure for data-heavy, low-latency systems.
Hands-on experience with ClickHouse, Kafka, and large-scale data systems.
Familiarity with Snowflake workflows and cross-system data architecture.
Ability to define and enforce operational standards independently.
Strong operational experience with Kubernetes, Terraform, and cloud infrastructure.
Excellent communication and collaboration skills across teams.
High personal rigor and organization in high-pressure environments.
A hands-on mindset for debugging and tuning systems.

Apache Kafka, ClickHouse, Kubernetes, Snowflake, Terraform