Staff Site Reliability Engineer

Posted about 2 months ago

Mountain View, CA, USA

Staff+

H-1B Sponsor

Base Salary

$252k - $308k/yr

Responsibilities

Set an AI-centered reliability strategy, defining SLIs, SLOs, and error budgets.
Redesign the incident lifecycle around AI-assisted processes to speed up response.
Improve on-call processes through automation and AI-driven tools.
Integrate AI-first operations into product engineering workflows.
Architect resilient systems, with attention to capacity planning and failure isolation.
Mentor engineers on reliability practices and establish accessible documentation.

Requirements

7+ years in SRE, Software Engineering, or Infrastructure Engineering with a focus on reliability.
Experience applying AI/LLMs to operational workflows in production.
Expertise in SLOs/SLIs, error budgets, and incident command in distributed systems.
Proficiency in software engineering with languages such as Python or Go.
Deep observability experience with tools such as Datadog and CloudWatch.
Solid infrastructure-as-code skills with Terraform and AWS.
Familiarity with AI-assisted development tools and their application in workflows.
Experience in fintech or regulated environments is a plus.

Tech Stack

Amazon DynamoDB, Apache Kafka, AWS, Datadog, Go, Kubernetes, Python, Terraform

Categories