Together AI

Staff Engineer, Distributed Storage, HPC & AI Infrastructure

4 days ago
Amsterdam, Netherlands
Staff+
H1B Sponsor

Responsibilities

  • Design multi-petabyte AI/ML storage systems and lead capacity planning.
  • Optimize RDMA, InfiniBand, and 400GbE networks for maximum throughput.
  • Build Kubernetes storage operators for automated provisioning and multi-tenant isolation.
  • Deliver high data throughput per GPU node and optimize caching and data paths.
  • Implement multi-tier caches and optimize data locality.
  • Design monitoring and alerting systems to ensure high uptime.
  • Collaborate with ML/SRE teams and contribute to open-source projects.

Requirements

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale.
  • Proven experience deploying high-performance storage for GPU/HPC clusters.
  • Deep Kubernetes and cloud-native storage experience in production environments.
  • Strong coding skills in Go and Python for building production-grade tools.
  • BS/MS in Computer Science, Engineering, or equivalent practical experience.
  • History of technical leadership in improving performance and reliability.
  • Expertise in distributed storage systems like WekaFS, Lustre, or similar.
  • Production experience with object storage solutions like S3 or Ceph.
  • Knowledge of Kubernetes storage components and optimization for GPU workloads.
  • Advanced knowledge of the Linux storage stack and observability tools.

Benefits

  • Hybrid working model with 2 days a week in the Amsterdam office.

Tech Stack

Ansible, Go, Grafana, Helm, Kubernetes, Prometheus, Python, Terraform

Categories

AI & ML, Data Engineering, DevOps