
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
Together AI
Posted about 4 hours ago
Base Salary
$250k - $300k/yr
Responsibilities
- Design multi-petabyte AI/ML storage systems and lead capacity planning.
- Optimize RDMA, InfiniBand, and 400GbE networks for maximum throughput.
- Build Kubernetes storage operators for automated provisioning and multi-tenant isolation.
- Deliver high data throughput to each GPU node by optimizing data paths.
- Implement multi-tier caching and optimize data locality.
- Design monitoring and disaster recovery strategies to ensure high uptime.
- Collaborate with ML/SRE teams and contribute to open-source projects.
Requirements
- 8+ years in storage engineering, including 3+ years managing distributed storage at multi-petabyte scale.
- Proven experience with high-performance storage for GPU/HPC clusters.
- Deep knowledge of Kubernetes and cloud-native storage in production.
- Strong coding skills in Go and Python for building production-grade tools.
- BS/MS in Computer Science, Engineering, or equivalent experience.
- History of technical leadership in improving system performance and reliability.
- Expertise in distributed storage systems such as WekaFS or Lustre.
- Production experience with object storage solutions such as S3 or Ceph.
- Familiarity with Kubernetes storage concepts and optimization for GPU workloads.
- Advanced knowledge of the Linux storage stack and observability tools.
Benefits
- Competitive compensation and startup equity.
- Health insurance and other benefits.
- Flexibility in remote work arrangements.