Nebius

Senior HPC Cluster Engineer

Posted: about 3 hours ago
Location: Remote, United States
Level: Senior

Responsibilities

  • Tune the performance of GPU clusters and InfiniBand networks for optimal operation.
  • Analyze and troubleshoot issues related to GPUs and InfiniBand networks.
  • Integrate new hardware, including new GPU models, into the existing infrastructure.
  • Enhance automation systems for proactive monitoring and issue resolution.
  • Configure and manage GPU devices and InfiniBand fabrics.

Requirements

  • 5+ years of professional experience in system-level software development.
  • 3+ years of hands-on experience with Linux systems.
  • In-depth understanding of server architecture and HPC systems.
  • Strong proficiency in performance-oriented programming languages such as C/C++, Go, or Python.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Tech Stack

C, C++, Go, Linux, Python, PyTorch, TensorFlow

Categories

AI & ML, Data Engineering, DevOps