Software Engineer, Supercomputing

Thinking Machines Lab

3 months ago

San Francisco, CA, USA

Mid Level / Senior

H1B Sponsor

Base Salary

$350k - $475k/yr

Responsibilities

Operate and automate large GPU clusters including provisioning, imaging, and capacity planning.
Write software that abstracts cluster management and presents a unified interface for training and inference.
Extend scheduling/orchestration for topology-aware placement, preemption, quotas, and fair-share multi-tenancy.
Monitor and improve operational metrics for speed, reliability, and error recovery.
Build reliable storage and artifact paths for datasets, checkpoints, and logs.
Partner with researchers to unblock scale runs and advise on parallelism and performance trade-offs.

Requirements

Bachelor’s degree or equivalent experience in computer science, engineering, or a similar field.
Proficiency in at least one backend language, preferably Python or Rust.
Experience operating large-scale clusters and container orchestration systems like Kubernetes or Slurm.
Comfort operating across the stack and owning projects end-to-end.
Ability to thrive in a highly collaborative environment with cross-functional partners.
A bias for action and the initiative to see projects through to completion.

Benefits

Generous health, dental, and vision benefits.
Unlimited PTO.
Paid parental leave.
Relocation support as needed.

Tech Stack

Kubernetes, Python, PyTorch, Rust, TensorFlow

Categories

AI & ML, Backend, Data Science, DevOps