2 days ago
Seoul, South Korea
Senior / Staff+
H1B Sponsor
Responsibilities
- Design and scale telemetry pipelines using Grafana Alloy, Datadog Vector, Grafana Mimir, and Grafana Loki.
- Define the multi-year vision for GPU infrastructure observability.
- Optimize telemetry pipelines for high-cardinality labels and burst-heavy workloads.
- Architect low-latency, high-throughput pipelines for GPU metrics and distributed system logs.
- Build rich Grafana dashboards for real-time GPU fleet health and tenant-level insights.
- Drive adoption of SRE principles tailored to GPU workloads.
- Develop tooling for cross-layer correlation and lead root cause analysis efforts.
Requirements
- BS/MS in Computer Science or equivalent practical experience.
- Extensive experience in observability, SRE, or distributed infrastructure.
- Proven track record building large-scale telemetry pipelines.
- Strong knowledge of Grafana Alloy, Mimir, Loki, and Datadog Vector.
- Proficiency in Go or Python.
- Experience with Kubernetes, Linux internals, and GPU systems.
- Familiarity with high-performance networking technologies.