2 days ago
Remote, Worldwide
Senior
Responsibilities
- Partner with service teams to define SLIs and SLOs based on customer experience.
- Own and evolve the Operational Readiness Review process for new services.
- Strengthen the incident-to-improvement pipeline by connecting postmortem findings to operational gaps.
- Act as a reliability expert for architecture reviews and resilience design.
- Identify operational toil and advocate for automation to eliminate it.
- Help teams design sustainable on-call practices and improve alert quality.
- Track and report on organizational operational maturity and drive remediation.
Requirements
- 7+ years of experience in SRE, production engineering, or reliability-focused roles.
- A software engineering mindset with the ability to write code and build tools.
- Hands-on experience defining and operationalizing SLOs/SLIs at scale.
- Deep experience with incident response and postmortem facilitation.
- Experience with large-scale multi-tenant systems, preferably managed database platforms.
- Proficiency with cloud infrastructure, preferably AWS, and infrastructure-as-code tools.
- Strong communication skills to influence without authority in a distributed organization.
- Experience working in async or globally distributed teams.
Benefits
- Fully remote work with a WeWork membership or co-working allowance.
- Equity ownership through ESOP for all team members.
- Tech allowance to set up an ideal work environment.
- 100% health insurance coverage for employees and 80% for dependents.
- Annual off-sites for team connection and collaboration.
- Flexible work hours with an async-friendly approach.
- Annual education allowance for professional development.
