Klaviyo

Software Engineer II, Reliability

about 5 hours ago
Dublin, Ireland
Mid Level
H1B Sponsor

Responsibilities

  • Build, operate, and improve production systems, with a focus on reliability and performance.
  • Automate operational tasks to reduce manual toil.
  • Contribute to the design and implementation of systems using SRE best practices.
  • Define and measure SLIs and SLOs for supported services.
  • Enhance observability through metrics, dashboards, and logging.
  • Participate in on-call rotations and respond to production incidents.
  • Assist with incident investigations and contribute to post-incident reviews.
  • Analyze system behavior and capacity usage.
  • Identify reliability issues and collaborate with teammates to address them.
  • Write and maintain operational runbooks and system documentation.

Requirements

  • Experience operating cloud-native production systems.
  • Proficiency in writing production-quality code (e.g., Python, Go).
  • Understanding of common failure modes in distributed systems.
  • Experience with containerized workloads and platforms (e.g., Kubernetes).
  • Comfortable participating in on-call rotations.
  • Familiarity with observability tools and incident response.
  • Knowledge of SRE concepts such as SLIs, SLOs, and error budgets.
  • Hands-on experience with infrastructure as code (e.g., Terraform).
  • Ability to follow incident response processes.
  • Eager to learn and experiment with AI tools.

Tech Stack

Apache Kafka, AWS, Django, FastAPI, Kubernetes, MySQL, Python, RabbitMQ, React, Redis, Terraform

Categories