
Platform Support Engineer (APAC)
Lightning AI
about 5 hours ago
Remote, Worldwide
Mid-Level / Senior
Responsibilities
- Partner directly with customer engineering teams running training and inference workloads in production.
- Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
- Act as a technical advisor during high-impact incidents and platform degradation events.
- Translate infrastructure-level issues into actionable guidance for ML engineers.
- Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
- Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
- Analyze logs, metrics, traces, and system behavior to isolate root causes.
- Support customers scaling workloads across multi-node GPU systems.
- Identify recurring patterns across customer issues and drive long-term reliability improvements.
- Contribute to post-incident reviews and operational improvements.
Requirements
- Strong software engineering and systems troubleshooting background.
- Experience with Kubernetes and containerized environments.
- Linux systems knowledge, including networking, storage, process management, and performance tuning.
- Hands-on experience operating machine learning workloads in production or research environments.
- Strong communication skills and the ability to work directly with highly technical customers and engineering teams.
Benefits
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.).
- Generous paid time off, plus holidays.
- Paid parental leave.
- Professional development support.
- Wellness and work-from-home stipends.
- Flexible work environment.