San Francisco, CA, USA
Mid Level / Senior
Base Salary
$230k - $385k/yr
Responsibilities
- Own infrastructure lifecycle management including provisioning, upgrades, scaling, and decommissioning.
- Operate and scale ClickHouse clusters with a focus on sharding, replication, and performance tuning.
- Manage Kafka as the ingestion backbone, enhancing throughput and failure recovery.
- Improve end-to-end latency and reliability for data-heavy workloads.
- Build and maintain monitoring and alerting systems, including SLIs/SLOs and actionable runbooks.
- Define and improve incident response standards and on-call practices.
- Own backup/restore and disaster recovery strategies, conducting regular drills.
- Plan and execute safe rollouts across multiple environments.
- Collaborate with software engineers to integrate reliability into design and release processes.
- Set quality standards for operational readiness and drive adoption across teams.
- Enhance CI/CD pipelines for faster and safer releases.
- Strengthen security posture across infrastructure and delivery systems.
Requirements
- Proven experience managing production infrastructure for data-heavy, low-latency systems.
- Hands-on experience with ClickHouse, Kafka, and large-scale data systems.
- Familiarity with Snowflake workflows and cross-system data architecture.
- Ability to define and enforce operational standards independently.
- Strong operational experience with Kubernetes, Terraform, and cloud infrastructure.
- Excellent communication and collaboration skills across teams.
- High personal rigor and organization in high-pressure environments.
- A hands-on mindset for debugging and tuning systems.
Tech Stack
Apache Kafka, ClickHouse, Kubernetes, Snowflake, Terraform
Categories
AI & ML, Data Engineering, DevOps, Security