Nebius

Senior Site Reliability Engineer — AI Studio (Inference Platform)

Posted 3 months ago
Amsterdam, Netherlands +5 more
Senior

Responsibilities

  • Own the reliability, performance, and observability of the inference stack.
  • Design and refine telemetry pipelines for actionable insights.
  • Tune Kubernetes autoscalers for GPU efficiency.
  • Craft Terraform modules for resilient cluster creation.
  • Harden request-routing and retry logic to protect the user experience.
  • Detect, isolate, and remediate incidents using automation and runbooks.
  • Drive a post-mortem culture that prevents issues from recurring.

Requirements

  • Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform.
  • Proficiency in scripting with Python or Bash.
  • Understanding of alert design and SLOs for high-throughput APIs.
  • Experience with GPU-heavy workloads and MLOps or model-hosting platforms.
  • Ability to build self-healing systems and debug performance across layers.
  • Strong collaboration skills with software engineers.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • Dynamic and collaborative work environment that values initiative and innovation.

Tech Stack

Bash, Grafana, Kubernetes, Prometheus, Python, Terraform

Categories

AI & ML, DevOps