HPC Specialist Solutions Architect

Nebius

about 4 hours ago

Remote, Worldwide

Mid Level / Senior

Responsibilities

Architect and implement scalable HPC clusters optimized for AI and simulation workloads.
Design and integrate GPU-accelerated compute infrastructures using NVIDIA technologies.
Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
Design and validate cloud HPC environments focusing on low-latency and high-bandwidth networking.
Lead reference architectures for AI/ML model training and MLOps integrations.
Collaborate with hardware vendors and cloud providers to optimize HPC technologies.
Benchmark system performance and tune resource utilization across compute, network, and storage.

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
Expertise in Linux systems, Kubernetes, and CI/CD practices.
Strong understanding of HPC networking protocols and RDMA stacks.
Deep understanding of storage and I/O optimization for large datasets.
Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
Strong scripting skills in Python or Bash for automation.
Excellent communication and documentation skills.

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Flexible working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.

AnsibleBashDockerGrafanaHelmKubernetesMLflowPrometheusPythonPyTorchTerraform