Staff Site Reliability Engineer

about 2 months ago

Remote, Worldwide

Staff+

H1B Sponsor

Responsibilities

Design and implement comprehensive monitoring, logging, and tracing solutions.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Lead incident management and response during high-impact incidents.
Architect and improve automation to eliminate operational toil.
Optimize performance of large-scale cloud deployments, focusing on Kubernetes.
Debug and harden distributed systems to enhance robustness.
Provide guidance on reliability, scalability, and operational integrity.
Educate and mentor the engineering team on reliability best practices.
Write high-quality, well-tested code in Python or Go.

Requirements

8-10 years of experience in Site Reliability Engineering or similar roles.
Strong programming skills in Python or Go.
Deep understanding of distributed systems and service-oriented architecture.
Experience with container orchestration platforms, specifically Kubernetes.
Proven track record in designing and maintaining monitoring solutions.
Strong incident management skills with experience in complex systems.
Familiarity with infrastructure as code tools like Terraform or Pulumi.
Excellent communication skills for explaining complex concepts.
Strong interpersonal skills for mentoring engineers.

Benefits

Competitive Salary & Equity.
401(k) Program with a 4% match.
Health, Dental, Vision and Life Insurance.
Short-Term and Long-Term Disability.
Paid Parental, Medical, and Caregiver Leave.
Commuter Benefits.
Monthly Wellness Stipend.
Autonomous Work Environment.
In-Office Set-Up Reimbursement.
Flexible Time Off (FTO) + Holidays.
Quarterly Team Gatherings.
In-Office Amenities.

Tech Stack

Datadog, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories