Lightning AI

Infrastructure Engineer (Observability)

4 days ago
Remote, Worldwide +3 more
Senior

Base Salary

$180k - $200k/yr

Responsibilities

  • Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
  • Drive the productization of observability capabilities for internal teams and external customers.
  • Design multi-tenant observability systems with scoped access and customer-facing visibility.
  • Continuously improve observability systems to keep pace with rapid infrastructure buildouts.
  • Design and operate telemetry pipelines ingesting data from various sources.
  • Build systems to correlate signals across infrastructure layers for faster debugging.
  • Implement streaming and real-time data pipelines using tools like Kafka and OTEL.
  • Design and implement noise-resistant alerting systems to improve signal quality.
  • Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams.
  • Build automated insights for proactive detection and system health visibility.
  • Contribute to broader infrastructure engineering projects beyond observability.
  • Partner with infrastructure and platform teams to embed observability into core systems.
  • Support large-scale, distributed systems across compute, networking, and storage environments.
  • Work closely with customer-facing teams to deliver external observability experiences.
  • Collaborate with engineering, operations, and support teams to improve system transparency.
  • Help define best practices for observability across the organization.

Requirements

  • 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles.
  • Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics.
  • Experience building and operating observability platforms at scale.
  • Proficiency in Python, Go, or Bash for automation and data integration.
  • Familiarity with containerized environments and Kubernetes observability.
  • Experience with streaming telemetry pipelines like Kafka, OTEL, or Promtail.
  • Experience with multi-tenant monitoring architectures.
  • Strong written and verbal communication skills.

Benefits

  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.).
  • Retirement and financial wellness support (U.S.); Pension contribution (U.K.).
  • Generous paid time off, plus holidays.
  • Paid parental leave.
  • Professional development support.
  • Wellness and work-from-home stipends.
  • Flexible work environment.

Tech Stack

Apache Kafka, Bash, Go, Grafana, Kubernetes, Prometheus, Python, PyTorch

Categories

AI & ML, Data Engineering, DevOps