Coupang

Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)

2 days ago
Seoul, South Korea
Senior / Staff+
H1B Sponsor

Responsibilities

  • Design and scale telemetry pipelines using Grafana Alloy, Datadog Vector, Grafana Mimir, and Grafana Loki.
  • Define the multi-year vision for GPU infrastructure observability.
  • Optimize telemetry pipelines for high-cardinality labels and burst-heavy workloads.
  • Architect low-latency, high-throughput pipelines for GPU metrics and distributed system logs.
  • Build rich Grafana dashboards for real-time GPU fleet health and tenant-level insights.
  • Drive adoption of SRE principles tailored to GPU workloads.
  • Develop tooling for cross-layer correlation and lead root cause analysis efforts.

Requirements

  • BS/MS in Computer Science or equivalent practical experience.
  • Extensive experience in Observability, SRE, or Distributed Infrastructure.
  • Proven track record building large-scale telemetry pipelines.
  • Strong knowledge of Grafana Alloy, Mimir, Loki, and Datadog Vector.
  • Proficiency in Go or Python.
  • Experience with Kubernetes, Linux internals, and GPU systems.
  • Familiarity with high-performance networking technologies.

Categories

Data Engineering, DevOps, Security