HPC Specialist Solutions Architect

Nebius

about 4 hours ago
Remote, Worldwide
Mid Level / Senior

Responsibilities

  • Architect and implement scalable HPC clusters optimized for AI and simulation workloads.
  • Design and integrate GPU-accelerated compute infrastructures using NVIDIA technologies.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
  • Design and validate cloud HPC environments focusing on low-latency and high-bandwidth networking.
  • Lead reference architectures for AI/ML model training and MLOps integrations.
  • Collaborate with hardware vendors and cloud providers to optimize HPC technologies.
  • Benchmark system performance and tune resource utilization across compute, network, and storage.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, and CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks.
  • Deep understanding of storage and I/O optimization for large datasets.
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation.
  • Excellent communication and documentation skills.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Tech Stack

Ansible, Bash, Docker, Grafana, Helm, Kubernetes, MLflow, Prometheus, Python, PyTorch, Terraform

Categories

AI & ML, Data Engineering, DevOps