
Site Reliability Engineer (SRE)
Thinking Machines Lab
7 days ago
Base Salary
$350k - $475k/yr
Responsibilities
- Define and own end-to-end reliability from CI/CD flows to production observability and incident response.
- Develop Service Level Objectives (SLOs) for distributed training systems that balance reliability and development velocity.
- Design and implement monitoring and observability across the full training path.
- Drive incident response for Tinker platform issues, ensuring rapid recovery and thorough incident reviews.
- Harden multi-tenant isolation and resource scheduling for co-scheduled workloads.
- Collaborate with security teams to address production vulnerabilities.
Requirements
- Bachelor's degree or equivalent experience in computer science, engineering, or a similar field.
- Experience in distributed systems, cloud infrastructure, or site reliability engineering.
- Proficiency in writing software to solve reliability problems, including tooling and automation.
- Experience with production incident response, postmortems, and reliability improvement.
- Strong communication skills and a track record of coordination across teams.
Benefits
- Generous health, dental, and vision benefits.
- Unlimited PTO.
- Paid parental leave.
- Relocation support as needed.