Nebius

HPC Specialist Solutions Architect

Nebius

Apply
22 days ago
Remote, United States
Mid Level / Senior

Base Salary

$225k - $315k/yr

Responsibilities

  • Architect and implement scalable HPC clusters optimized for AI and simulation workloads.
  • Design and integrate GPU-accelerated compute infrastructures using NVIDIA technologies.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
  • Design and validate cloud HPC environments focusing on low-latency and high-bandwidth networking.
  • Lead reference architectures for AI/ML model training and MLOps integrations.
  • Collaborate with hardware vendors and cloud providers to optimize emerging HPC technologies.
  • Benchmark system performance and tune resource utilization across compute, network, and storage.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, and related CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks.
  • Deep understanding of storage and I/O optimization for large datasets.
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation.
  • Excellent communication and documentation skills.

Benefits

  • 100% company-paid medical, dental, and vision coverage for employees and families.
  • Up to 4% company match on 401(k) with immediate vesting.
  • 20 weeks paid parental leave for primary caregivers and 12 weeks for secondary caregivers.
  • Remote work reimbursement of up to $85/month for mobile and internet.
  • Company-paid short-term, long-term, and life insurance coverage.

Tech Stack

AnsibleBashDockerGrafanaHelmKubernetesLinuxMLflowPrometheusPythonPyTorchTerraform

Categories

AI & MLData EngineeringDevOps