3 months ago
San Francisco, CA, USA
Senior / Mid Level
Base Salary
$195k - $240k/yr
Responsibilities
- Instrument services end-to-end using OpenTelemetry metrics and structured logging.
- Develop and maintain SRE standards and patterns for engineering teams.
- Build internal tooling and automation in Python, Bash, and Terraform.
- Design and maintain actionable dashboards for service owners and leadership.
- Tune alerting rules to maximize signal-to-noise ratio.
- Own reliability incident response from detection to resolution.
- Track and run blameless postmortems focusing on systemic factors.
- Continuously improve MTTD and MTTR by integrating incident learnings.
- Collaborate with Customer Success to enhance service reliability.
- Define meaningful SLOs based on user journeys and performance data.
- Eliminate alert fatigue by auditing and deprecating noisy alerts.
Requirements
- 2+ years of full-time experience in an SRE or similar role.
- 3+ years of experience working in AWS with EKS and CI/CD.
- Strong hands-on experience with Git, Python, and Bash.
- Experience establishing SRE practices across multiple teams.
- Built or maintained Prometheus-based monitoring with Grafana.
- Demonstrated experience managing incidents and service outages.
- Hands-on experience integrating AI with SRE efforts.
- Proven track record of collaborating to define SLOs and operationalize alerting.
Benefits
- Hubs in San Francisco and New York City for in-person gatherings.
- Flexible PTO with U.S. holidays and a one-week shutdown in December.
- 100% health insurance coverage for policyholders and 75% for dependents.
- 12 weeks of paid parental leave in the US.
- 401(k) program with a 3% match, vested immediately.
- $500 work-from-home stipend within the first year.
- $600 technology stipend for hybrid/remote team expenses.
- $1,200 per year Health & Wellness Allowance.
