Thinking Machines Lab

Site Reliability Engineer (SRE)

7 days ago
San Francisco, CA, USA
Mid Level / Senior
H1B Sponsor

Base Salary

$350k - $475k/yr

Responsibilities

  • Define and own end-to-end reliability from CI/CD flows to production observability and incident response.
  • Develop Service Level Objectives for distributed training systems that balance reliability and development velocity.
  • Design and implement monitoring and observability across the full training path.
  • Drive incident response for Tinker platform issues, ensuring rapid recovery and thorough incident reviews.
  • Harden multi-tenant isolation and resource scheduling for workload co-scheduling.
  • Collaborate with security teams to address production vulnerabilities.

Requirements

  • Bachelor's degree or equivalent experience in computer science, engineering, or a similar field.
  • Experience in distributed systems, cloud infrastructure, or site reliability engineering.
  • Proficiency in writing software to solve reliability problems, including tooling and automation.
  • Experience with production incident response, postmortems, and reliability improvement.
  • Strong communication skills and a track record of coordination across teams.

Benefits

  • Generous health, dental, and vision benefits.
  • Unlimited PTO.
  • Paid parental leave.
  • Relocation support as needed.

Tech Stack

Kubernetes, PyTorch

Categories

AI & ML, Data Engineering, DevOps, Security