Senior Site Reliability Engineer (Compute Node Team)

6 months ago

Remote, Worldwide or Amsterdam, Netherlands

Senior

Responsibilities

Ensure reliability, availability, and performance of compute nodes running VMs.
Analyze and debug Linux systems across user space and kernel space.
Troubleshoot complex production issues involving CPU, memory, and scheduling.
Work hands-on with virtualization and containerization technologies.
Design and evolve observability metrics and alerts.
Lead incident response and root-cause analysis efforts.
Collaborate with platform and infrastructure teams to enhance system operability.

Requirements

Strong expertise in Linux with a deep understanding of user space and kernel space.
Hands-on experience with QEMU/KVM virtualization.
Practical knowledge of containerization, namespaces, and cgroups.
Strong debugging skills with a structured approach to incident analysis.
SRE mindset with experience in building observability stacks.

What we offer

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Flexible working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.