Staff Engineer, Distributed Storage, HPC & AI Infrastructure

4 days ago

Amsterdam, Netherlands

Staff+

H1B Sponsor

Responsibilities

Design multi-petabyte AI/ML storage systems and lead capacity planning.
Optimize RDMA, InfiniBand, and 400GbE networks for maximum throughput.
Build Kubernetes storage operators for automated provisioning and multi-tenant isolation.
Deliver high data throughput per GPU node and optimize caching and data paths.
Implement multi-tier caches and optimize data locality.
Design monitoring and alerting systems to ensure high uptime.
Collaborate with ML/SRE teams and contribute to open-source projects.

Qualifications

8+ years in storage engineering, including 3+ years managing distributed storage at multi-petabyte scale.
Proven experience deploying high-performance storage for GPU/HPC clusters.
Deep Kubernetes and cloud-native storage experience in production environments.
Strong coding skills in Go and Python for building production-grade tools.
BS/MS in Computer Science, Engineering, or equivalent practical experience.
History of technical leadership in improving performance and reliability.
Expertise in distributed storage systems such as WekaFS or Lustre.
Production experience with object storage solutions such as S3 or Ceph.
Knowledge of Kubernetes storage components and optimization for GPU workloads.
Advanced knowledge of the Linux storage stack and observability tools.

Ansible, Go, Grafana, Helm, Kubernetes, Prometheus, Python, Terraform