Platform Support Engineer (APAC)

Remote, Worldwide
Mid Level / Senior

Responsibilities

Partner directly with customer engineering teams running training and inference workloads in production.
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues.
Act as a technical advisor during high-impact incidents and platform degradation events.
Translate infrastructure-level issues into actionable guidance for ML engineers.
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving.
Analyze logs, metrics, traces, and system behavior to isolate root causes.
Support customers scaling workloads across multi-node GPU systems.
Identify recurring patterns across customer issues and drive long-term reliability improvements.
Contribute to post-incident reviews and operational improvements.

Requirements

Strong software engineering and systems troubleshooting background.
Experience with Kubernetes and containerized environments.
Linux systems knowledge, including networking, storage, process management, and performance tuning.
Hands-on experience operating machine learning workloads in production or research environments.
Strong communication skills and the ability to work directly with highly technical customers and engineering teams.

Benefits

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.).
Retirement and financial wellness support (U.S.); Pension contribution (U.K.).
Generous paid time off, plus holidays.
Paid parental leave.
Professional development support.
Wellness and work-from-home stipends.
Flexible work environment.

Grafana, Kubernetes, Prometheus, Python, PyTorch