Principal Site Reliability Engineer (AI-first SRE)

3 months ago

Remote, Argentina +6 more

Staff+

H1B Sponsor

Responsibilities

Architect and maintain self-healing systems with 99.9%+ availability targets.
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
Build AIOps-based observability and auto-remediation pipelines.
Apply predictive modeling to forecast failures before they impact users.
Lead chaos, performance, and resilience testing programs.
Map platform and service behavior to revenue impact, and use that mapping to improve revenue resilience.
Mentor engineers and drive reliability standards across teams.
Partner with platform, data, and product teams to ensure stability aligns with business goals.
Support major incident response and incident reviews, and participate in on-call rotations.

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
Proficiency in Python or Go for automation and tooling.
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
Strong communication and influencing skills, with a preference for persuading through data rather than hierarchy.

Benefits

The opportunity to work with cutting-edge technologies in a transformative environment.
Professional growth and leadership development pathways tailored to your aspirations.
A chance to leave a lasting impact by shaping the future of reliable and scalable systems.

Ambassador, AWS, Go, Google Cloud Platform, Grafana, Istio, Kubernetes, Prometheus, Python, Terraform