Nebius

Senior Site Reliability Engineer (Compute Node Team)

about 1 month ago
Amsterdam, Netherlands or Remote, Worldwide
Senior

Responsibilities

  • Ensure reliability, availability, and performance of compute nodes running VMs.
  • Analyze and debug Linux systems across user space and kernel space.
  • Troubleshoot complex production issues involving CPU, memory, and scheduling.
  • Work hands-on with virtualization and containerization technologies.
  • Design and evolve observability metrics and alerts.
  • Lead incident response and root-cause analysis efforts.
  • Collaborate with platform and infrastructure teams to enhance system operability.

Requirements

  • Strong expertise in Linux with a deep understanding of user space and kernel space.
  • Hands-on experience with QEMU/KVM virtualization.
  • Practical knowledge of containerization, namespaces, and cgroups.
  • Strong debugging skills with a structured approach to incident analysis.
  • SRE mindset with experience in building observability stacks.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Tech Stack

Kubernetes, Linux

Categories

AI & ML, Data Engineering, DevOps