Senior Site Reliability Engineer — AI Studio (Inference Platform)

7 months ago

Remote, United States +5 more

Senior

Responsibilities

Own the reliability, performance, and observability of the inference stack.
Design and refine telemetry pipelines for actionable insights.
Tune Kubernetes autoscalers for GPU efficiency.
Craft Terraform modules for resilient cluster creation.
Harden request-routing and retry logic to protect the user experience.
Detect, isolate, and remediate incidents using automation and runbooks.
Drive a post-mortem culture that prevents issues from recurring.

Requirements

Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform.
Proficiency in scripting with Python or Bash.
Understanding of alert design and SLOs for high-throughput APIs.
Experience with GPU-heavy workloads and MLOps or model-hosting platforms.
Ability to build self-healing systems and debug performance issues across layers of the stack.
Strong collaboration skills with software engineers.

Benefits

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Flexible working arrangements.
Dynamic and collaborative work environment that values initiative and innovation.

Tech Stack

Bash, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories