Senior Site Reliability Engineer — AI Studio (Inference Platform)
Nebius
3 months ago
Amsterdam, Netherlands +5 more
Senior
Responsibilities
- Own the reliability, performance, and observability of the inference stack.
- Design and refine telemetry pipelines that yield actionable insights.
- Tune Kubernetes autoscalers for GPU efficiency.
- Craft Terraform modules for resilient cluster creation.
- Harden request-routing and retry logic to protect the user experience.
- Detect, isolate, and remediate incidents using automation and runbooks.
- Drive a post-mortem culture that prevents issues from recurring.
Requirements
- Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform.
- Proficiency in scripting with Python or Bash.
- Understanding of alert design and SLOs for high-throughput APIs.
- Experience with GPU-heavy workloads and MLOps or model-hosting platforms.
- Ability to build self-healing systems and debug performance issues across layers of the stack.
- Strong collaboration skills with software engineers.
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- Dynamic and collaborative work environment that values initiative and innovation.
Tech Stack
Bash, Grafana, Kubernetes, Prometheus, Python, Terraform
Categories
AI & ML, DevOps