HPC Specialist Solutions Architect
Nebius
about 4 hours ago
Remote, Worldwide
Mid Level / Senior
Responsibilities
- Architect and implement scalable HPC clusters optimized for AI and simulation workloads.
- Design and integrate GPU-accelerated compute infrastructures using NVIDIA technologies.
- Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
- Design and validate cloud HPC environments focusing on low-latency and high-bandwidth networking.
- Lead the development of reference architectures for AI/ML model training and MLOps integrations.
- Collaborate with hardware vendors and cloud providers to optimize HPC technologies.
- Benchmark system performance and tune resource utilization across compute, network, and storage.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
- Expertise in Linux systems, Kubernetes, and CI/CD practices.
- Strong understanding of HPC networking protocols and RDMA stacks.
- Deep understanding of storage and I/O optimization for large datasets.
- Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
- Strong scripting skills in Python or Bash for automation.
- Excellent communication and documentation skills.
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
Tech Stack
Ansible, Bash, Docker, Grafana, Helm, Kubernetes, MLflow, Prometheus, Python, PyTorch, Terraform
Categories
AI & ML, Data Engineering, DevOps