Staff Engineer, Distributed Storage and HPC & AI Infrastructure

about 2 months ago

H-1B Sponsor

Base Salary

$250k - $300k/yr

Responsibilities

Design multi-petabyte AI/ML storage systems and lead capacity planning.
Optimize RDMA, InfiniBand, and 400GbE networks for maximum throughput.
Build Kubernetes storage operators for automated provisioning and multi-tenant isolation.
Deliver high data throughput to every GPU node by optimizing data paths.
Implement multi-tier caching and optimize data locality.
Design monitoring and disaster recovery strategies to ensure high uptime.
Collaborate with ML/SRE teams and contribute to open-source projects.

Qualifications

8+ years in storage engineering, including 3+ years managing distributed storage at multi-petabyte scale.
Proven experience with high-performance storage for GPU/HPC clusters.
Deep knowledge of Kubernetes and cloud-native storage in production.
Strong coding skills in Go and Python for building production-grade tools.
BS/MS in Computer Science, Engineering, or equivalent experience.
History of technical leadership in improving system performance and reliability.
Expertise in distributed storage systems such as WekaFS or Lustre.
Production experience with object storage such as S3 or Ceph.
Familiarity with Kubernetes storage concepts and optimization for GPU workloads.
Advanced knowledge of the Linux storage stack and observability tools.