Lightning AI

Platform Support Engineer (APAC)

about 5 hours ago
Remote, Worldwide
Mid Level / Senior

Responsibilities

  • Partner directly with customer engineering teams running training and inference workloads in production.
  • Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
  • Act as a technical advisor during high-impact incidents and platform degradation events.
  • Translate infrastructure-level issues into actionable guidance for ML engineers.
  • Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
  • Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
  • Analyze logs, metrics, traces, and system behavior to isolate root causes.
  • Support customers scaling workloads across multi-node GPU systems.
  • Identify recurring patterns across customer issues and drive long-term reliability improvements.
  • Contribute to post-incident reviews and operational improvements.

Requirements

  • Strong software engineering and systems troubleshooting background.
  • Experience with Kubernetes and containerized environments.
  • Linux systems knowledge, including networking, storage, process management, and performance tuning.
  • Hands-on experience operating machine learning workloads in production or research environments.
  • Strong communication skills and ability to work directly with highly technical customers and engineering teams.

Benefits

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
  • Retirement and financial wellness support (U.S.); pension contribution (U.K.).
  • Generous paid time off, plus holidays.
  • Paid parental leave.
  • Professional development support.
  • Wellness and work-from-home stipends.
  • Flexible work environment.

Tech Stack

Grafana, Kubernetes, Prometheus, Python, PyTorch

Categories

AI & ML, Data Engineering, DevOps