
Senior Site Reliability Engineer
Hyperbolic Labs (posted 3 months ago)
Responsibilities
- Define and maintain service level objectives for job success rates.
- Build robust incident response systems and manage capacity across the GPU network.
- Implement secure rollout and rollback mechanisms for platform stability.
- Design monitoring and alerting systems for infrastructure visibility.
- Lead incident response and post-mortem processes to improve system resilience.
- Focus on security and infrastructure hardening, ensuring tenant isolation.
Requirements
- Proven experience in site reliability engineering with SLO and SLA management.
- Strong background in capacity planning and resource allocation for distributed systems.
- Experience in incident response and reducing mean time to recovery (MTTR).
- Deep knowledge of deployment systems and automated rollback mechanisms.
- Proficiency in observability practices and tools such as Prometheus and Grafana.
- Strong understanding of infrastructure security and compliance frameworks.