Zuora

Site Reliability Engineer II

about 1 month ago

San José, Costa Rica

Mid Level / Senior

H-1B Sponsor

Responsibilities

Design and implement intelligent automation for infrastructure lifecycle management.
Apply AI/ML techniques for predictive monitoring and proactive performance optimization.
Lead complex incident response and root cause analysis efforts.
Identify and remove reliability bottlenecks using dynamic scaling and telemetry instrumentation.
Continuously enhance runbooks and playbooks by integrating machine learning insights.
Stay on the cutting edge of AIOps and cloud-native reliability practices.

Requirements

Strong hands-on experience in Linux Administration and Python Development.
Experience with Agentic AI or multi-agent frameworks.
Deep expertise with Docker and Kubernetes.
Familiarity with Kafka, ActiveMQ, MySQL, Oracle, and Redis.
Understanding of AI/ML-based anomaly detection.
Proven ability in incident management and RCA.
Experience designing and maintaining CI/CD pipelines.
Proficiency with Prometheus, Grafana, and OpenTelemetry.
A continuous learning mindset and passion for automation.
1+ years of experience in a SaaS or cloud-native environment.

Benefits

Competitive compensation, bonus opportunities, and retirement programs.
Comprehensive medical, dental, and vision coverage.
Generous, flexible time off.
Paid holidays, wellness days, and a company-wide year-end break.
6 months of fully paid parental leave.
Learning & development stipend.
Opportunities to give back, including volunteer time and donation matching.
Mental wellbeing resources and support.

Tech Stack

Ansible, Apache Kafka, AWS, Docker, Grafana, Jenkins, Kubernetes, MySQL, Prometheus, Puppet, Python, Redis, Terraform

Categories

AI & ML, DevOps