Replit

Staff Site Reliability Engineer

about 6 hours ago
Remote, Worldwide
Staff+
H1B Sponsor

Responsibilities

  • Design and implement comprehensive monitoring, logging, and tracing solutions.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Lead incident management and response during high-impact incidents.
  • Architect and improve automation to eliminate operational toil.
  • Optimize performance of large-scale cloud deployments, focusing on Kubernetes.
  • Debug and harden distributed systems to enhance robustness.
  • Provide guidance on reliability, scalability, and operational integrity.
  • Educate and mentor the engineering team on reliability best practices.
  • Write high-quality, well-tested code in Python or Go.

Requirements

  • 8-10 years of experience in Site Reliability Engineering or similar roles.
  • Strong programming skills in Python or Go.
  • Deep understanding of distributed systems and service-oriented architecture.
  • Experience with container orchestration platforms, specifically Kubernetes.
  • Proven track record in designing and maintaining monitoring solutions.
  • Strong incident management skills with experience in complex systems.
  • Familiarity with infrastructure as code tools like Terraform or Pulumi.
  • Excellent communication skills for explaining complex concepts.
  • Strong interpersonal skills for mentoring engineers.

Benefits

  • Competitive Salary & Equity.
  • 401(k) Program with a 4% match.
  • Health, Dental, Vision and Life Insurance.
  • Short-Term and Long-Term Disability.
  • Paid Parental, Medical, and Caregiver Leave.
  • Commuter Benefits.
  • Monthly Wellness Stipend.
  • Autonomous Work Environment.
  • In-Office Set-Up Reimbursement.
  • Flexible Time Off (FTO) + Holidays.
  • Quarterly Team Gatherings.
  • In-Office Amenities.

Tech Stack

Datadog, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories