Zuora

Site Reliability Engineer II

Posted about 1 month ago
San José, Costa Rica
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Design and implement intelligent automation for infrastructure lifecycle management.
  • Apply AI/ML techniques for predictive monitoring and proactive performance optimization.
  • Lead complex incident response and root cause analysis efforts.
  • Identify and remove reliability bottlenecks using dynamic scaling and telemetry instrumentation.
  • Continuously enhance runbooks and playbooks by integrating machine learning insights.
  • Stay on the cutting edge of AIOps and cloud-native reliability practices.

Requirements

  • Strong hands-on experience in Linux administration and Python development.
  • Experience with agentic AI or multi-agent frameworks.
  • Deep expertise with Docker and Kubernetes.
  • Familiarity with Kafka, ActiveMQ, MySQL, Oracle, and Redis.
  • Understanding of AI/ML-based anomaly detection.
  • Proven ability in incident management and RCA.
  • Experience designing and maintaining CI/CD pipelines.
  • Proficiency with Prometheus, Grafana, and OpenTelemetry.
  • A continuous learning mindset and passion for automation.
  • 1+ years of experience in a SaaS or cloud-native environment.

Benefits

  • Competitive compensation, bonus opportunities, and retirement programs.
  • Comprehensive medical, dental, and vision coverage.
  • Generous, flexible time off.
  • Paid holidays, wellness days, and a company-wide year-end break.
  • 6 months of fully paid parental leave.
  • Learning & development stipend.
  • Opportunities to give back, including volunteer time and donation matching.
  • Mental wellbeing resources and support.

Tech Stack

Ansible, Apache Kafka, AWS, Docker, Grafana, Jenkins, Kubernetes, MySQL, Prometheus, Puppet, Python, Redis, Terraform

Categories

AI & ML, DevOps