Senior Site Reliability Engineer (Compute Node Team)
Nebius
about 1 month ago
Amsterdam, Netherlands or Remote, Worldwide
Senior
Responsibilities
- Ensure reliability, availability, and performance of compute nodes running VMs.
- Analyze and debug Linux systems across user space and kernel space.
- Troubleshoot complex production issues involving CPU, memory, and scheduling.
- Work hands-on with virtualization and containerization technologies.
- Design and evolve observability metrics and alerts.
- Lead incident response and root-cause analysis efforts.
- Collaborate with platform and infrastructure teams to enhance system operability.
Requirements
- Strong expertise in Linux with a deep understanding of user space and kernel space.
- Hands-on experience with QEMU/KVM virtualization.
- Practical knowledge of containerization, namespaces, and cgroups.
- Strong debugging skills with a structured approach to incident analysis.
- SRE mindset with experience in building observability stacks.
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
Tech Stack
Kubernetes, Linux
Categories
AI & ML, Data Engineering, DevOps