HPC Specialist Solutions Architect
Nebius
22 days ago
Remote, United States
Mid Level / Senior
Base Salary
$225k - $315k/yr
Responsibilities
- Architect and implement scalable HPC clusters optimized for AI and simulation workloads.
- Design and integrate GPU-accelerated compute infrastructures using NVIDIA technologies.
- Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management.
- Design and validate cloud HPC environments focusing on low-latency and high-bandwidth networking.
- Lead the development of reference architectures for AI/ML model training and MLOps integrations.
- Collaborate with hardware vendors and cloud providers to evaluate and optimize emerging HPC technologies.
- Benchmark system performance and tune resource utilization across compute, network, and storage.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
- Expertise in Linux systems, Kubernetes, and related CI/CD practices.
- Strong understanding of HPC networking protocols and RDMA stacks.
- Deep understanding of storage and I/O optimization for large datasets.
- Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
- Strong scripting skills in Python or Bash for automation.
- Excellent communication and documentation skills.
Benefits
- 100% company-paid medical, dental, and vision coverage for employees and families.
- Up to 4% company match on 401(k) with immediate vesting.
- 20 weeks paid parental leave for primary caregivers and 12 weeks for secondary caregivers.
- Remote work reimbursement of up to $85/month for mobile and internet.
- Company-paid short-term disability, long-term disability, and life insurance coverage.
Tech Stack
Ansible, Bash, Docker, Grafana, Helm, Kubernetes, Linux, MLflow, Prometheus, Python, PyTorch, Terraform
Categories
AI & ML, Data Engineering, DevOps