PhysicsX

Principal Machine Learning Infrastructure Engineer

1 day ago
London, United Kingdom
Senior / Staff+
H1B Sponsor

Responsibilities

  • Design and operate distributed training infrastructure for neural operator architectures.
  • Optimize training pipelines for throughput, fault tolerance, and cost efficiency.
  • Build and maintain experiment tracking and observability systems.
  • Solve data loading bottlenecks for large-scale mesh datasets.
  • Optimize data pipelines for efficient I/O from cloud storage.
  • Build serving infrastructure for pre-trained models.
  • Design and implement model packaging pipelines for customer deployment.
  • Ensure reproducibility of model checkpoints.
  • Improve developer experience for the Research team.
  • Collaborate with the broader Infrastructure team on shared patterns and standards.

Requirements

  • 5+ years of experience building and operating ML infrastructure at scale.
  • Deep expertise in distributed training, including debugging NCCL hangs.
  • Strong systems fundamentals including Linux, networking, and storage I/O.
  • Production experience with Kubernetes and SLURM for job orchestration.
  • Proficiency in Python and ML frameworks, preferably PyTorch.
  • Experience with cloud GPU infrastructure, ideally CoreWeave or similar.

Benefits

  • Equity options to share in the company's success.
  • 10% employer pension contribution.
  • Free office lunches.
  • Enhanced parental leave with full pay.
  • YellowNest nursery scheme for childcare support.
  • 25 days of annual leave plus public holidays.
  • Private medical insurance with 100% employee cover.
  • Wellhub subscription for access to gyms and wellness apps.
  • Eye tests for employee health.
  • Dedicated support for personal development.
  • Employee Assistance Programme for confidential wellbeing support.
  • Bike2Work scheme and season ticket loan for commuting.
  • Octopus EV salary sacrifice for sustainable driving.

Tech Stack

Grafana, Kubernetes, MLflow, Prometheus, Python, PyTorch

Categories

AI & ML, Data Engineering