Stack AV

Site Reliability Engineer

about 3 hours ago
Remote, Worldwide or Pittsburgh, PA, USA
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Instrument the systems that schedule and execute large-scale batch workloads across Kubernetes clusters.
  • Diagnose and triage job failures for customers.
  • Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
  • Scale the reliability and velocity of our systems and processes through increased automation.
  • Document actions to build a comprehensive library of runbooks.
  • Participate in an on-call rotation to uphold the SLOs and SLAs of production services.
  • Contribute to platform tooling, automation, and CI/CD workflows.

Requirements

  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
  • Strong experience with Kubernetes and container orchestration in production-grade environments.
  • Understanding of engineering design limitations and ability to provide guidance to teams.
  • Strong experience implementing and debugging cloud-native and open-source tools such as Kubernetes, etcd, Prometheus, and OpenTelemetry.
  • Strong communication skills and the ability to work effectively in a diverse and distributed team.

Tech Stack

Kubernetes, Linux, Prometheus

Categories

AI & ML, Data Engineering, DevOps