Thinking Machines Lab

Software Engineer, Supercomputing

about 1 month ago
San Francisco, CA, USA
Mid Level / Senior
H-1B Sponsor

Base Salary

$350k - $475k/yr

Responsibilities

  • Operate and automate large GPU clusters including provisioning, imaging, and capacity planning.
  • Write software that abstracts cluster management and presents a unified interface for training and inference.
  • Extend scheduling/orchestration for topology-aware placement, preemption, quotas, and fair-share multi-tenancy.
  • Monitor and improve operational metrics for speed, reliability, and error recovery.
  • Build reliable storage and artifact paths for datasets, checkpoints, and logs.
  • Partner with researchers to unblock scale runs and advise on parallelism and performance trade-offs.

Requirements

  • Bachelor’s degree or equivalent experience in computer science, engineering, or a similar field.
  • Proficiency in at least one backend language, preferably Python or Rust.
  • Experience operating large-scale clusters with orchestration or scheduling systems such as Kubernetes or Slurm.
  • Comfort operating across the stack and owning projects end-to-end.
  • Ability to thrive in a highly collaborative environment with cross-functional partners.
  • A bias for action and initiative to ensure project completion.

Benefits

  • Generous health, dental, and vision benefits.
  • Unlimited PTO.
  • Paid parental leave.
  • Relocation support as needed.

Tech Stack

Kubernetes, Python, PyTorch, Rust, TensorFlow
