Site Reliability Engineer (SRE)

Thinking Machines Lab

about 2 months ago

San Francisco, CA, USA

Mid Level / Senior

H-1B Sponsor

Base Salary

$350k - $475k/yr

Responsibilities

Define and own end-to-end reliability from CI/CD flows to production observability and incident response.
Develop Service Level Objectives (SLOs) for distributed training systems, balancing reliability against development velocity.
Design and implement monitoring and observability across the full training path.
Drive incident response for Tinker platform issues, ensuring rapid recovery and thorough incident reviews.
Harden multi-tenant isolation and resource scheduling for co-scheduled workloads.
Collaborate with security teams to address production vulnerabilities.

Requirements

Bachelor's degree or equivalent experience in computer science, engineering, or a similar field.
Experience in distributed systems, cloud infrastructure, or site reliability engineering.
Proficiency in writing software to solve reliability problems, including tooling and automation.
Experience with production incident response, postmortems, and reliability improvement.
Strong communication skills and a track record of coordination across teams.

Benefits

Generous health, dental, and vision benefits.
Unlimited PTO.
Paid parental leave.
Relocation support as needed.

Tech Stack

Kubernetes, PyTorch

Categories

AI & ML, Data Engineering, DevOps, Security