Grafana

Staff Software Engineer - Databases SRE | UK | Remote

about 3 hours ago
Remote, United Kingdom
Staff+
H1B Sponsor

Responsibilities

  • Partner closely with product engineering squads.
  • Own production reliability for high-SLA and complex customer environments.
  • Design and implement automation to scale reliability practices.
  • Ensure customers meet SLO targets.
  • Define and evolve per-tenant SLOs and reliability models.
  • Proactively reduce SLO burn to prevent repeat incidents.
  • Serve as a primary escalation point and on-call responder for incidents.
  • Lead customer-impacting incident response and post-incident reviews.
  • Contribute to design docs and code reviews.
  • Influence feature design for production scalability and operability.
  • Build automation to eliminate toil.
  • Improve alert quality and reduce noisy escalations.

Requirements

  • 8+ years of engineering experience, with 4+ years in SRE/CRE/production engineering.
  • Strong Kubernetes experience in AWS, GCP, or Azure.
  • Familiarity with infrastructure-as-code tooling like Helm, Terraform, or Jsonnet.
  • Experience in technical leadership and mentoring other engineers.
  • Experience operating multi-tenant systems in production.
  • Strong experience designing and implementing SLOs.
  • Proficiency in one or more programming languages (e.g., Go, Python, Java).
  • Knowledge of Linux internals, networking, and cloud storage.
  • Excellent problem-solving and troubleshooting skills.
  • Experience in blame-free incident response and writing high-quality post-incident reviews.
  • Ability to reason about performance, scaling, and failure modes.
  • Comfortable working in a self-directed engineering team.
  • Ability to partner deeply with product engineering teams.
  • Intellectual curiosity, transparency, and kindness are highly valued.

Benefits

  • 100% remote work with a global culture.
  • Opportunities for career growth and development.
  • Transparent communication and open decision-making.
  • In-person onboarding for new employees.
  • Global annual leave policy of 30 days, including Grafana Shutdown Days.