GrepJob
Standard Template Labs

Sr. Site Reliability Engineer

Posted about 2 months ago

Base Salary

$160k - $250k/yr

Responsibilities

  • Own the availability, latency, and performance of critical production systems.
  • Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution.
  • Lead incident response, root cause analysis (RCA), and postmortems.
  • Design systems that fail gracefully and recover automatically.
  • Write production-grade Python code to automate infrastructure workflows and build internal reliability tools.
  • Eliminate manual operational work through automation and self-healing systems.
  • Design and implement metrics, logging, tracing, and alerting systems.
  • Build dashboards and tooling for real-time visibility into system health.
  • Operate and improve systems running on cloud platforms and in containers.
  • Scale systems to handle enterprise workloads and high-throughput traffic.
  • Define and enforce SLAs, SLOs, and error budgets.
  • Conduct load testing and chaos testing.
  • Partner with product and backend engineers to improve system reliability.

Requirements

  • Strong software engineering background with proficiency in Python.
  • Experience operating production systems at scale.
  • Familiarity with Kubernetes, Docker, and cloud platforms.
  • Experience with on-call rotations and incident response.
  • Knowledge of monitoring tools like Grafana and Prometheus.
  • Ability to debug production issues under pressure.
  • Experience with AI/ML systems or data pipelines is a plus.

Benefits

  • Opportunity to build foundational product features for an AI-first enterprise platform.
  • Ownership of critical systems that scale to millions of users.
  • Culture that values craftsmanship, autonomy, and technical excellence.
  • Competitive compensation, equity, and benefits package.
  • Collaborative work environment in the Flatiron District, Manhattan.

Tech Stack

AWS, Azure, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python

Categories

AI & ML, Data Engineering, DevOps