
Platform Support Engineer
Lightning AI
about 5 hours ago
Seattle, WA, USA or San Francisco, CA, USA
Mid Level / Senior
Base Salary
$115k - $140k/yr
Responsibilities
- Partner with customer engineering teams running training and inference workloads in production.
- Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
- Act as a technical advisor during high-impact incidents and platform degradation events.
- Translate infrastructure-level issues into actionable guidance for ML engineers.
- Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
- Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
- Analyze logs, metrics, traces, and system behavior to isolate root causes.
- Support customers scaling workloads across multi-node GPU systems.
- Identify recurring patterns across customer issues and drive long-term reliability improvements.
- Build internal tooling, automation, documentation, and runbooks.
Requirements
- Strong software engineering and systems troubleshooting background.
- Experience with Kubernetes and containerized environments.
- Linux systems knowledge, including networking, storage, process management, and performance tuning.
- Experience with cloud infrastructure and distributed systems.
- Hands-on experience operating machine learning workloads in production or research environments.
- Strong communication skills and ability to work directly with highly technical customers and engineering teams.
Benefits
- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.).
- Generous paid time off, plus holidays.
- Paid parental leave.
- Professional development support.
- Wellness and work-from-home stipends.
- Flexible work environment.