Lightning AI

Platform Support Engineer (EMEA)

about 3 hours ago
London, United Kingdom
Mid Level / Senior

Responsibilities

  • Partner directly with customer engineering teams running training and inference workloads in production.
  • Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
  • Act as a technical advisor during high-impact incidents and platform degradation events.
  • Translate infrastructure-level issues into actionable guidance for ML engineers.
  • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
  • Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
  • Analyze logs, metrics, traces, and system behavior to isolate root causes.
  • Identify recurring patterns across customer issues and drive long-term reliability improvements.
  • Build internal tooling, automation, documentation, and runbooks.

Requirements

  • Strong software engineering and systems troubleshooting background.
  • Experience with Kubernetes and containerized environments.
  • Linux systems knowledge, including networking, storage, process management, and performance tuning.
  • Experience with cloud infrastructure and distributed systems.
  • Hands-on experience operating machine learning workloads in production or research environments.
  • Strong communication skills and ability to work directly with highly technical customers and engineering teams.

Benefits

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
  • Retirement and financial wellness support (U.S.); pension contribution (U.K.).
  • Generous paid time off, plus holidays.
  • Paid parental leave.
  • Professional development support.
  • Wellness and work-from-home stipends.
  • Flexible work environment.

Tech Stack

Grafana, Kubernetes, Prometheus, Python, PyTorch

Categories

AI & ML, Data Engineering, DevOps