Klaviyo

SRE, Site Reliability Engineering

27 days ago
Dublin, Ireland
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Build, operate, and improve production systems focusing on reliability and performance.
  • Automate operational tasks to reduce manual toil.
  • Contribute to system design and implementation using SRE best practices.
  • Define and measure SLIs and SLOs for supported services.
  • Enhance observability through metrics, dashboards, and logging.
  • Participate in on-call rotations and respond to production incidents.
  • Assist with incident investigations and contribute to post-incident reviews.
  • Analyze system behavior and capacity usage.
  • Work with teammates to identify and address reliability issues.
  • Collaborate with engineers to ship reliable systems.
  • Write and maintain operational runbooks and documentation.

Requirements

  • Experience operating cloud-native production systems.
  • Proficiency in writing production-quality code (e.g., Python, Go).
  • Understanding of common failure modes in distributed systems.
  • Experience with containerized workloads and platforms (e.g., Kubernetes).
  • Comfortable participating in on-call rotations.
  • Experience using observability tools and responding to alerts.
  • Familiarity with SRE concepts such as SLIs, SLOs, and error budgets.
  • Hands-on experience with infrastructure as code (e.g., Terraform).
  • Ability to follow incident response processes.
  • Eager to learn and improve systems over time.
  • Interest in exploring AI tools and workflows.

Tech Stack

Apache Kafka, AWS, Django, FastAPI, Kubernetes, MySQL, Python, RabbitMQ, React, Redis, Terraform

Categories

DevOps, Security