Hyperbolic Labs

Senior Site Reliability Engineer

3 months ago

Responsibilities

  • Define and maintain service level objectives for job success rates.
  • Build robust incident response systems and manage capacity across the GPU network.
  • Implement secure rollout and rollback mechanisms for platform stability.
  • Design monitoring and alerting systems for infrastructure visibility.
  • Lead incident response and post-mortem processes to improve system resilience.
  • Focus on security and infrastructure hardening, ensuring tenant isolation.

Requirements

  • Proven experience in site reliability engineering with SLO and SLA management.
  • Strong background in capacity planning and resource allocation for distributed systems.
  • Experience in incident response and reducing mean time to recovery (MTTR).
  • Deep knowledge of deployment systems and automated rollback mechanisms.
  • Proficiency with observability tools and practices, such as Prometheus and Grafana.
  • Strong understanding of infrastructure security and compliance frameworks.

Tech Stack

AWS, Azure, Google Cloud, Grafana, Prometheus