Ditto

Senior Site Reliability Engineer

Posted 7 months ago
Remote, Worldwide +4 more
Senior
H1B Sponsor

Base Salary

$156k - $288k/yr

Responsibilities

  • Develop and maintain observability solutions using platforms like Datadog, Prometheus, and Grafana.
  • Lead incident management, coordinating response efforts and troubleshooting issues.
  • Collaborate with product engineering teams to architect reliable systems and recover from incidents.
  • Implement and maintain SLOs, monitoring, and alerting strategies for reliability at scale.
  • Design and implement automation and support tooling to enhance system resilience.
  • Lead the development of runbooks, alert definitions, and incident response procedures.
  • Participate in on-call rotations to provide 24/7 support for critical production systems.

Requirements

  • 6+ years of experience in Site Reliability Engineering or similar DevOps roles.
  • Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog.
  • Experience in at least one systems programming language such as Python, Go, Rust, C/C++, or Java.
  • Expertise with Infrastructure as Code tools like Terraform and Helm.
  • Expertise with at least one major cloud service provider (AWS, GCP, Azure).
  • Strong communication skills for leading incident response and collaboration.
  • Willingness to engage in on-call rotations and emergency response procedures.
  • A high degree of agency and bias towards action in problem-solving.
  • Excellent problem-solving skills and a methodical approach to troubleshooting.

Benefits

  • Competitive salaries and meaningful equity for all team members.
  • Health, dental, vision, life, and disability insurance in the US.
  • 401(k) and flexible spending accounts.
  • Flexible time off for all employees.
  • Access to open office spaces in Atlanta and San Francisco for remote workers.
