San Francisco, CA, USA or New York, NY, USA
Mid Level / Senior
Base Salary
$165k - $330k/yr
Responsibilities
- Design and architect scalable infrastructure systems for the ML training platform.
- Partner closely with developers and research engineers to translate training requirements into technical solutions.
- Design and architect a global training scheduler.
- Design and architect reinforcement learning systems and continuous learning pipelines.
- Drive long-term improvements to enhance system reliability and development velocity.
- Collaborate with SRE and Capacity teams to optimize training infrastructure.
- Make critical architectural decisions balancing performance and reliability.
- Lead technical discussions and mentor junior engineers on best practices.
- Contribute to the long-term technical strategy and infrastructure roadmap.
Requirements
- Bachelor’s degree in Computer Science or a related field.
- Proficiency in Go, with Python experience preferred.
- Deep expertise with Kubernetes in production environments.
- Extensive experience with major cloud providers like AWS and GCP.
- Advanced understanding of distributed systems concepts and performance tuning.
- Proven experience designing observability systems.
- Experience with ML/AI workloads and MLOps platforms is highly valued.
Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Flexible PTO policy including a company-wide Winter Break.
- Paid parental leave.
- Fertility and family-building stipend through Carrot.
- Company-facilitated 401(k).
- Exposure to a variety of ML startups for learning and networking opportunities.
