2 days ago
Remote, United States
Staff+
H-1B Sponsor
Base Salary
$253k - $355k/yr
Responsibilities
- Propose, design, and lead the architecture of the next-generation LLM platform.
- Architect highly fault-tolerant training infrastructure for multi-week distributed workloads.
- Design and implement robust pipelines for LLM fine-tuning.
- Build scalable systems for automated model evaluation and regression detection.
- Extend distributed data platforms to handle multimodal datasets efficiently.
- Mentor senior engineers and define technical roadmaps.
Requirements
- 10+ years of experience in production software development or distributed data systems.
- Proven track record in designing and operating large-scale ML systems.
- Hands-on experience with fault-tolerant, petabyte-scale distributed systems.
- Deep understanding of ML orchestration and fine-tuning pipelines.
- Experience with CUDA environments and GPU virtualization/containerization.
- Proficiency with Kubernetes and Docker, and experience writing production-quality code in Python and/or Go.
- Strong organizational and communication skills.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs.
- 401(k) with Employer Match.
- Global Benefit programs that fit your lifestyle.
- Family Planning Support.
- Gender-Affirming Care.
- Mental Health & Coaching Benefits.
- Flexible Vacation & Paid Volunteer Time Off.
- Generous Paid Parental Leave.
Tech Stack
Docker, Go, Kubernetes, MLflow, Python
Categories
AI & ML, Data Engineering, DevOps
