SRE, Site Reliability Engineering
Klaviyo
27 days ago
Dublin, Ireland
Mid Level / Senior
H1B Sponsor
Responsibilities
- Build, operate, and improve production systems, with a focus on reliability and performance.
- Automate operational tasks to reduce manual toil.
- Contribute to system design and implementation using SRE best practices.
- Define and measure SLIs and SLOs for supported services.
- Enhance observability through metrics, dashboards, and logging.
- Participate in on-call rotations and respond to production incidents.
- Assist with incident investigations and contribute to post-incident reviews.
- Analyze system behavior and capacity usage.
- Work with teammates to identify and address reliability issues.
- Collaborate with engineers to ship reliable systems.
- Write and maintain operational runbooks and documentation.
Requirements
- Experience operating cloud-native production systems.
- Proficiency in writing production-quality code (e.g., Python, Go).
- Understanding of common failure modes in distributed systems.
- Experience with containerized workloads and platforms (e.g., Kubernetes).
- Comfortable participating in on-call rotations.
- Experience using observability tools and responding to alerts.
- Familiarity with SRE concepts such as SLIs, SLOs, and error budgets.
- Hands-on experience with infrastructure as code (e.g., Terraform).
- Ability to follow incident response processes.
- Eager to learn and improve systems over time.
- Interest in exploring AI tools and workflows.
Tech Stack
Apache Kafka, AWS, Django, FastAPI, Kubernetes, MySQL, Python, RabbitMQ, React, Redis, Terraform
Categories
DevOps, Security