Base Salary
$252k - $308k/yr
Responsibilities
- Set an AI-centered reliability strategy, defining SLIs, SLOs, and error budgets.
- Redesign the incident lifecycle for faster resolution using AI-assisted processes.
- Improve on-call processes through automation and AI-driven tools.
- Integrate AI-first operations into product engineering workflows.
- Architect resilient systems for capacity planning and failure isolation.
- Mentor engineers on reliability practices and establish accessible documentation.
Requirements
- 7+ years in SRE, Software Engineering, or Infrastructure Engineering with a focus on reliability.
- Experience applying AI/LLMs to operational workflows in production.
- Expertise in SLOs/SLIs, error budgets, and incident command in distributed systems.
- Proficiency in software engineering with languages such as Python or Go.
- Deep observability experience with tools like Datadog and CloudWatch.
- Solid infrastructure-as-code skills with Terraform on AWS.
- Familiarity with AI-assisted development tools and their application in workflows.
- Experience in fintech or regulated environments is a plus.