Sr. Site Reliability Engineer

3 months ago

Base Salary

$160k - $250k/yr

Responsibilities

Own the availability, latency, and performance of critical production systems.
Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution.
Lead incident response, root cause analysis (RCA), and postmortems.
Design systems that fail gracefully and recover automatically.
Write production-grade Python code to automate infrastructure workflows and build internal reliability tools.
Eliminate manual operational work through automation and self-healing systems.
Design and implement metrics, logging, tracing, and alerting systems.
Build dashboards and tooling for real-time visibility into system health.
Operate and improve systems running on cloud platforms and in containers.
Scale systems to handle enterprise workloads and high-throughput traffic.
Define and enforce SLAs, SLOs, and error budgets.
Conduct load testing and chaos testing.
Partner with product and backend engineers to improve system reliability.

What We Offer

Opportunity to build foundational product features for an AI-first enterprise platform.
Ownership of critical systems that scale to millions of users.
Culture that values craftsmanship, autonomy, and technical excellence.
Competitive compensation, equity, and benefits package.
Collaborative work environment in the Flatiron District, Manhattan.