Member of Technical Staff, Supercomputing Platform & Infrastructure

over 2 years ago

Remote, Worldwide or San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Base Salary

$200k - $550k/yr

Responsibilities

Design and operate large-scale GPU clusters for training and inference.
Build and maintain infrastructure using Terraform across cloud and hybrid environments.
Deploy, operate, and optimize Kubernetes clusters for AI workloads.
Develop modular, scalable infrastructure-as-code patterns for provisioning.
Improve deployment reproducibility and operational safety.
Optimize networking and storage systems for high-throughput AI workloads.
Automate fault detection and recovery across distributed clusters.
Debug complex cross-layer issues spanning hardware and software.

Requirements

Strong systems engineering fundamentals.
Deep experience with Terraform, including module design and large-scale deployments.
Experience operating production GPU infrastructure or high-performance distributed systems.
Strong understanding of networking and storage systems.
Experience with major cloud platforms such as GCP, AWS, Azure, or OCI.
Track record of owning production-critical infrastructure end-to-end.