
Sr. Site Reliability Engineer
Standard Template Labsabout 2 months ago
Base Salary
$160k - $250k/yr
Responsibilities
- Own the availability, latency, and performance of critical production systems.
- Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution.
- Lead incident response, root cause analysis (RCA), and postmortems.
- Design systems that fail gracefully and recover automatically.
- Write production-grade Python code to automate infrastructure workflows and build internal reliability tools.
- Eliminate manual operational work through automation and self-healing systems.
- Design and implement metrics, logging, tracing, and alerting systems.
- Build dashboards and tooling for real-time visibility into system health.
- Operate and improve systems on cloud platforms and containers.
- Scale systems to handle enterprise workloads and high-throughput traffic.
- Define and enforce SLAs, SLOs, and error budgets.
- Conduct load testing and chaos testing.
- Partner with product and backend engineers to improve system reliability.
Requirements
- Strong software engineering background with proficiency in Python.
- Experience operating production systems at scale.
- Familiarity with Kubernetes, Docker, and cloud platforms.
- Experience with on-call rotations and incident response.
- Knowledge of monitoring tools like Grafana and Prometheus.
- Ability to debug production issues under pressure.
- Experience with AI/ML systems or data pipelines is a plus.
Benefits
- Opportunity to build foundational product features for an AI-first enterprise platform.
- Ownership of critical systems that scale to millions of users.
- Culture that values craftsmanship, autonomy, and technical excellence.
- Competitive compensation, equity, and benefits package.
- Collaborative work environment in the Flatiron District, Manhattan.