Principal Machine Learning Infrastructure Engineer

about 2 months ago

London, United Kingdom

Senior / Staff+

H1B Sponsor

Responsibilities

Design and operate distributed training infrastructure for neural operator architectures.
Optimize training pipelines for throughput, fault tolerance, and cost efficiency.
Build and maintain experiment tracking and observability systems.
Solve data loading bottlenecks for large-scale mesh datasets.
Optimize data pipelines for efficient I/O from cloud storage.
Build serving infrastructure for pre-trained models.
Design and implement model packaging pipelines for customer deployment.
Ensure reproducibility of model checkpoints.
Improve developer experience for the Research team.
Collaborate with the broader Infrastructure team on shared patterns and standards.

Requirements

5+ years of experience building and operating ML infrastructure at scale.
Deep expertise in distributed training, including debugging NCCL hangs.
Strong systems fundamentals including Linux, networking, and storage I/O.
Production experience with Kubernetes and SLURM for job orchestration.
Proficiency in Python and ML frameworks, preferably PyTorch.
Experience with cloud GPU infrastructure, ideally CoreWeave or similar.

Benefits

Equity options to share in the company's success.
10% employer pension contribution.
Free office lunches.
Enhanced parental leave with full pay.
YellowNest nursery scheme for childcare support.
25 days of annual leave plus public holidays.
Private medical insurance with 100% employee cover.
Wellhub subscription for access to gyms and wellness apps.
Eye tests for employee health.
Dedicated support for personal development.
Employee Assistance Programme for confidential wellbeing support.
Bike2Work scheme and season ticket loan for commuting.
Octopus EV salary sacrifice for sustainable driving.

Tech Stack

Grafana, Kubernetes, MLflow, Prometheus, Python, PyTorch

Categories

AI & ML, Data Engineering