Responsibilities
- Design, build, and maintain scalable, fault-tolerant systems.
- Define and enforce reliability processes, SLOs, SLIs, and SLAs.
- Lead complex incident response, participate in on-call rotations, and run postmortems.
- Build automation, tooling, and self-service capabilities.
- Collaborate with engineering, product, and support teams to embed reliability into everything we do.
- Mentor engineers and promote operational excellence across the organization.
Requirements
- 7+ years of experience in SRE, DevOps, or Production Engineering roles, ideally in SaaS environments.
- Deep understanding of distributed systems, failure modes, resiliency patterns, and observability.
- Hands-on experience building and owning monitoring tools.
- Experience with CI/CD and infrastructure-as-code tools.
- Solid experience with cloud platforms, preferably AWS.
- Advantage: Experience with Java.
Benefits
- Flexible hybrid work model that combines working from home and from the office.
- Empowerment and ownership to solve complex problems and make impactful decisions.
