Senior Site Reliability Engineer

3 months ago

Base Salary

$195k - $240k/yr

Responsibilities

Instrument services end-to-end using OpenTelemetry metrics and structured logging.
Develop and maintain SRE standards and patterns for engineering teams.
Build internal tooling and automation in Python, Bash, and Terraform.
Design and maintain actionable dashboards for service owners and leadership.
Tune alerting rules to maximize signal-to-noise ratio.
Own reliability incident response from detection to resolution.
Track and run blameless postmortems focusing on systemic factors.
Continuously improve MTTD and MTTR by integrating incident learnings.
Collaborate with Customer Success to enhance service reliability.
Define meaningful SLOs based on user journeys and performance data.
Eliminate alert fatigue by auditing and deprecating noisy alerts.

2+ years of full-time experience in an SRE or similar role.
3+ years of experience working in AWS with EKS and CI/CD.
Strong hands-on experience with Git, Python, and Bash.
Experience establishing SRE practices across multiple teams.
Built or maintained Prometheus-based monitoring with Grafana.
Demonstrated experience managing incidents and service outages.
Hands-on experience integrating AI with SRE efforts.
Proven track record of collaborating to define SLOs and operationalize alerting.

AWSBashGitGrafanaPrometheusPython Terraform