7 months ago
Base Salary
$156k - $288k/yr
Responsibilities
- Develop and maintain observability solutions using platforms like Datadog, Prometheus, and Grafana.
- Lead incident management, coordinating response efforts and troubleshooting issues.
- Collaborate with product engineering teams to architect reliable systems and recover from incidents.
- Implement and maintain SLOs, monitoring, and alerting strategies for reliability at scale.
- Design and implement automation and support tooling to enhance system resilience.
- Lead the development of runbooks, alert definitions, and incident response procedures.
- Participate in on-call rotations to provide 24/7 support for critical production systems.
Requirements
- 6+ years of experience in Site Reliability Engineering or similar DevOps roles.
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog.
- Experience with at least one programming language such as Python, Go, Rust, C/C++, or Java.
- Expertise with Infrastructure as Code tools like Terraform and Helm.
- Expertise with at least one major cloud service provider (AWS, GCP, Azure).
- Strong communication skills for leading incident response and collaboration.
- Willingness to engage in on-call rotations and emergency response procedures.
- A high degree of agency and bias towards action in problem-solving.
- Excellent problem-solving skills and a methodical approach to troubleshooting.
Benefits
- Competitive salaries and meaningful equity for all team members.
- Health, dental, vision, life, and disability insurance in the US.
- 401(k) and flexible spending accounts.
- Flexible time off for all employees.
- Access to open office spaces in Atlanta and San Francisco for remote workers.
