Remote, Worldwide
Staff+
Responsibilities
- Partner with development teams throughout the project lifecycle to ensure reliability is designed in from the start.
- Establish and evolve design reviews, launch checklists, and operational acceptance criteria.
- Define meaningful SLIs, SLOs, and error budgets to guide prioritization decisions.
- Create frameworks and tooling across AWS, GCP, and Azure to simplify reliable engineering practices.
- Influence roadmaps and resolve technical disagreements across teams.
- Lead incident management practices and facilitate blameless postmortems.
- Build relationships with cloud providers to influence their roadmaps and capabilities.
Requirements
- General fluency across AWS, GCP, and Azure.
- Comfort with TypeScript and Ruby on Rails.
- Significant experience as an SRE or production engineer with a focus on reliability.
- Strong software engineering fundamentals and the ability to write production-quality code.
- Proven track record of influencing teams and processes without formal authority.
- Ability to drive ambiguous, high-scope problems to completion with minimal oversight.
- Experience identifying process and technical debt and proposing solutions for it.
- Strong verbal and written communication skills in English.
