Platform Support Engineer (EMEA)

about 2 months ago

London, United Kingdom

Mid Level / Senior

Responsibilities

Partner directly with customer engineering teams running training and inference workloads in production.
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
Act as a technical advisor during high-impact incidents and platform degradation events.
Translate infrastructure-level issues into actionable guidance for ML engineers.
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
Analyze logs, metrics, traces, and system behavior to isolate root causes.
Identify recurring patterns across customer issues and drive long-term reliability improvements.
Build internal tooling, automation, documentation, and runbooks.

Requirements

Strong software engineering and systems troubleshooting background.
Experience with Kubernetes and containerized environments.
Linux systems knowledge, including networking, storage, process management, and performance tuning.
Experience with cloud infrastructure and distributed systems.
Hands-on experience operating machine learning workloads in production or research environments.
Strong communication skills and ability to work directly with highly technical customers and engineering teams.

Benefits

Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
Retirement and financial wellness support (U.S.); pension contribution (U.K.).
Generous paid time off, plus holidays.
Paid parental leave.
Professional development support.
Wellness and work-from-home stipends.
Flexible work environment.