Responsibilities
- Build and scale ML-optimized HPC infrastructure using Kubernetes-based GPU/TPU superclusters.
- Collaborate with cloud providers to optimize infrastructure for AI/ML training.
- Proactively identify and resolve infrastructure bottlenecks and performance issues.
- Design intuitive self-service tools for researchers to monitor and optimize training jobs.
- Work closely with AI researchers to develop scalable infrastructure solutions.
- Advocate for best practices in observability, automation, and infrastructure-as-code.
- Mentor team members through code reviews and cross-team collaboration.
Requirements
- Deep expertise in ML/HPC infrastructure and experience with GPU/TPU clusters.
- Proven ability to deploy and manage cloud-native Kubernetes clusters.
- Strong programming skills in Python and Go; open-source contributions are a plus.
- Familiarity with Linux internals and performance optimization for ML workloads.
- Experience collaborating with AI researchers or ML engineers.
- Ability to solve problems independently in a fast-paced environment.
Benefits
- An open and inclusive culture and work environment.
- Weekly lunch stipend, in-office lunches, and snacks.
- Full health and dental benefits, including a budget for mental health.
- 100% parental leave top-up for up to 6 months.
- Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
- Remote-flexible work options with offices in major cities.
- 6 weeks of vacation (30 working days).
