
Staff Site Reliability Engineer
Thrive Market
about 1 month ago
Base Salary
$180k - $225k/yr
Responsibilities
- Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Build and maintain monitoring, alerting, and observability systems using tools like Datadog and Prometheus.
- Establish error budgets to balance feature velocity with reliability investments.
- Lead incident response efforts and conduct blameless postmortems.
- Design and implement chaos engineering practices to identify failure modes.
- Architect and optimize the Kubernetes-based container orchestration platform.
- Support large infrastructure migrations with minimal disruption.
- Contribute to potential platform migrations, with a focus on reliability planning.
- Design automated deployment pipelines for rapid, error-free releases.
- Develop disaster recovery plans and capacity planning models.
- Collaborate with product engineering teams to scale infrastructure in AWS.
- Establish SRE as a practice at Thrive Market.
- Champion a culture of operational excellence and continuous improvement.
- Create and maintain technical documentation for operational procedures.
- Participate in weekly on-call rotations and build sustainable on-call practices.
- Identify systemic problems and recommend strategic improvements.
Requirements
- B.S. in Computer Science or equivalent professional experience.
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering.
- Deep expertise in Kubernetes, including cluster management and service meshes.
- Strong systems engineering background with advanced Linux administration skills.
- Advanced scripting and automation skills in languages like Bash and Python.
- Extensive experience with core AWS services such as EC2 and S3.
- Strong experience with Infrastructure as Code tools like Terraform.
- Hands-on experience defining and implementing SLOs and SLIs.
- Deep understanding of CI/CD pipelines and deployment strategies.
- Expertise in monitoring and observability platforms like Grafana.
- Strong knowledge of web application infrastructure and security best practices.
- Excellent communication skills for leading incident response.
Benefits
- Comprehensive health benefits including medical, dental, and vision.
- Competitive salary plus equity.
- 401(k) plan.
- 9 observed holidays.
- Flexible Paid Time Off.
- Subsidized ClassPass Membership for fitness and wellness.
- Free Thrive Market membership with employee discount.
- Coverage for Life Coaching and Therapy Sessions.