Senior Site Reliability Engineer

Posted 7 months ago

Remote, Worldwide +4 more

Senior

H-1B Sponsor

Base Salary

$156k - $288k/yr

Responsibilities

Develop and maintain observability solutions using platforms like Datadog, Prometheus, and Grafana.
Lead incident management, coordinating response efforts and troubleshooting issues.
Collaborate with product engineering teams to architect reliable systems and recover from incidents.
Implement and maintain SLOs, monitoring, and alerting strategies for reliability at scale.
Design and implement automation and support tooling to enhance system resilience.
Lead the development of runbooks, alert definitions, and incident response procedures.
Participate in on-call rotations to provide 24/7 support for critical production systems.

Qualifications

6+ years of experience in Site Reliability Engineering or similar DevOps roles.
Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog.
Experience with at least one programming language used for systems and infrastructure work, such as Python, Go, Rust, C/C++, or Java.
Expertise with Infrastructure as Code tools like Terraform and Helm.
Expertise with at least one major cloud service provider (AWS, GCP, Azure).
Strong communication skills for leading incident response and collaboration.
Willingness to engage in on-call rotations and emergency response procedures.
A high degree of agency and bias towards action in problem-solving.
Excellent problem-solving skills and a methodical approach to troubleshooting.