Groupon

Principal Site Reliability Engineer (AI-first SRE)

about 2 months ago
Remote, Argentina +6 more
Staff+
H1B Sponsor

Responsibilities

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.
  • Map platform and service behavior to revenue impact and drive improved revenue resilience.
  • Mentor engineers and drive reliability standards across teams.
  • Partner with platform, data, and product teams to ensure stability aligns with business goals.
  • Support major incident response and incident reviews, and participate in on-call rotations.

Requirements

  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
  • Proficiency in Python or Go for automation and tooling.
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
  • Strong communication and influencing skills — data over hierarchy.

Benefits

  • The opportunity to work with cutting-edge technologies in a transformative environment.
  • Professional growth and leadership development pathways tailored to your aspirations.
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems.

Tech Stack

Ambassador, AWS, Go, Google Cloud Platform, Grafana, Istio, Kubernetes, Prometheus, Python, Terraform

Categories

AI & ML, DevOps, Testing