Thrive Market

Staff Site Reliability Engineer

about 1 month ago
Remote, Worldwide
Staff+
H1B Sponsor

Base Salary

$180k - $225k/yr

Responsibilities

  • Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Build and maintain monitoring, alerting, and observability systems using tools like Datadog and Prometheus.
  • Establish error budgets to balance feature velocity with reliability investments.
  • Lead incident response efforts and conduct blameless postmortems.
  • Design and implement chaos engineering practices to identify failure modes.
  • Architect and optimize the Kubernetes-based container orchestration platform.
  • Support large infrastructure migrations with minimal disruption.
  • Contribute to potential platform migrations, with a focus on reliability planning.
  • Design automated deployment pipelines for rapid, error-free releases.
  • Develop disaster recovery plans and capacity planning models.
  • Collaborate with product engineering teams to scale infrastructure in AWS.
  • Establish SRE as a practice at Thrive Market.
  • Champion a culture of operational excellence and continuous improvement.
  • Create and maintain technical documentation for operational procedures.
  • Participate in weekly on-call rotations and build sustainable on-call practices.
  • Identify systemic problems and recommend strategic improvements.

Requirements

  • B.S. in Computer Science or equivalent professional experience.
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering.
  • Deep expertise in Kubernetes, including cluster management and service meshes.
  • Strong systems engineering background with advanced Linux administration skills.
  • Advanced scripting and automation skills in languages like Bash and Python.
  • Extensive experience with core AWS services such as EC2 and S3.
  • Strong experience with Infrastructure as Code tools like Terraform.
  • Hands-on experience defining and implementing SLOs and SLIs.
  • Deep understanding of CI/CD pipelines and deployment strategies.
  • Expertise in monitoring and observability platforms like Grafana.
  • Strong knowledge of web application infrastructure and security best practices.
  • Excellent communication skills for leading incident response.

Benefits

  • Comprehensive health benefits including medical, dental, and vision.
  • Competitive salary plus equity.
  • 401(k) plan.
  • 9 observed holidays.
  • Flexible Paid Time Off.
  • Subsidized ClassPass Membership for fitness and wellness.
  • Free Thrive Market membership with employee discount.
  • Coverage for Life Coaching and Therapy Sessions.

Tech Stack

Ansible, AWS, Bash, Chef, Datadog, GitHub Actions, Go, Grafana, Istio, Kubernetes, Prometheus, Puppet, Python, Ruby, Terraform

Categories