
Staff Engineer, Distributed Storage, HPC & AI Infrastructure
Together AI
4 days ago
Responsibilities
- Design multi-petabyte AI/ML storage systems and lead capacity planning.
- Optimize RDMA, InfiniBand, and 400GbE networks for maximum throughput.
- Build Kubernetes storage operators for automated provisioning and multi-tenant isolation.
- Deliver high data throughput per GPU node and optimize caching and data paths.
- Implement multi-tier caches and optimize data locality.
- Design monitoring and alerting systems to ensure high uptime.
- Collaborate with ML/SRE teams and contribute to open-source projects.
Requirements
- 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale.
- Proven experience deploying high-performance storage for GPU/HPC clusters.
- Deep Kubernetes and cloud-native storage experience in production environments.
- Strong coding skills in Go and Python for building production-grade tools.
- BS/MS in Computer Science, Engineering, or equivalent practical experience.
- History of technical leadership in improving performance and reliability.
- Expertise in distributed storage systems like WekaFS, Lustre, or similar.
- Production experience with object storage solutions like S3 or Ceph.
- Knowledge of Kubernetes storage components and optimization for GPU workloads.
- Advanced knowledge of the Linux storage stack and observability tools.
Benefits
- Hybrid working model with 2 days a week in the Amsterdam office.