
Software Engineer, Supercomputing
Thinking Machines Lab
Posted about 1 month ago
Base Salary
$350k - $475k/yr
Responsibilities
- Operate and automate large GPU clusters including provisioning, imaging, and capacity planning.
- Write software that abstracts cluster management and presents a unified interface for training and inference.
- Extend scheduling/orchestration for topology-aware placement, preemption, quotas, and fair-share multi-tenancy.
- Monitor and improve operational metrics for speed, reliability, and error recovery.
- Build reliable storage and artifact paths for datasets, checkpoints, and logs.
- Partner with researchers to unblock scale runs and advise on parallelism and performance trade-offs.
Requirements
- Bachelor’s degree or equivalent experience in computer science, engineering, or a similar field.
- Proficiency in at least one backend language, preferably Python or Rust.
- Experience operating large-scale clusters with container orchestration or workload scheduling systems such as Kubernetes or Slurm.
- Comfort operating across the stack and owning projects end-to-end.
- Ability to thrive in a highly collaborative environment with cross-functional partners.
- A bias for action and the initiative to see projects through to completion.
Benefits
- Generous health, dental, and vision benefits.
- Unlimited PTO.
- Paid parental leave.
- Relocation support as needed.