Site Reliability Engineer

2 days ago

Remote, Worldwide

Senior

Responsibilities

Partner with service teams to define SLIs and SLOs based on customer experience.
Own and evolve the Operational Readiness Review process for new services.
Strengthen the incident-to-improvement pipeline by connecting postmortem findings to operational gaps.
Act as a reliability expert for architecture reviews and resilience design.
Identify operational toil and advocate for automation to eliminate it.
Help teams design sustainable on-call practices and improve alert quality.
Track and report on organizational operational maturity and drive remediation.

Requirements

7+ years of experience in SRE, production engineering, or reliability-focused roles.
A software engineering mindset with the ability to write code and build tools.
Hands-on experience defining and operationalizing SLOs/SLIs at scale.
Deep experience with incident response and postmortem facilitation.
Experience with large-scale multi-tenant systems, preferably managed database platforms.
Proficiency with cloud infrastructure, preferably AWS, and infrastructure-as-code tools.
Strong communication skills to influence without authority in a distributed organization.
Experience working in async or globally distributed teams.