1 day ago
London, United Kingdom
Senior / Staff+
H1B Sponsor
Responsibilities
- Design and operate distributed training infrastructure for neural operator architectures.
- Optimize training pipelines for throughput, fault tolerance, and cost efficiency.
- Build and maintain experiment tracking and observability systems.
- Solve data loading bottlenecks for large-scale mesh datasets.
- Optimize data pipelines for efficient I/O from cloud storage.
- Build serving infrastructure for pre-trained models.
- Design and implement model packaging pipelines for customer deployment.
- Ensure reproducibility of model checkpoints.
- Improve developer experience for the Research team.
- Collaborate with the broader Infrastructure team on shared patterns and standards.
Requirements
- 5+ years of experience building and operating ML infrastructure at scale.
- Deep expertise in distributed training and debugging NCCL hangs.
- Strong systems fundamentals including Linux, networking, and storage I/O.
- Production experience with Kubernetes and SLURM for job orchestration.
- Proficiency in Python and ML frameworks, preferably PyTorch.
- Experience with cloud GPU infrastructure, ideally CoreWeave or similar.
Benefits
- Equity options to share in the company's success.
- 10% employer pension contribution.
- Free office lunches.
- Enhanced parental leave with full pay.
- YellowNest nursery scheme for childcare support.
- 25 days of annual leave plus public holidays.
- Private medical insurance with 100% employee cover.
- Wellhub subscription for access to gyms and wellness apps.
- Eye tests for employee health.
- Dedicated support for personal development.
- Employee Assistance Programme for confidential wellbeing support.
- Bike2Work scheme and season ticket loan for commuting.
- Octopus EV salary sacrifice for sustainable driving.
