Site Reliability Engineer II
Zuora
about 1 month ago
San José, Costa Rica
Mid Level / Senior
H1B Sponsor
Responsibilities
- Design and implement intelligent automation for infrastructure lifecycle management.
- Apply AI/ML techniques for predictive monitoring and proactive performance optimization.
- Lead complex incident response and root cause analysis efforts.
- Identify and remove reliability bottlenecks using dynamic scaling and telemetry instrumentation.
- Continuously enhance runbooks and playbooks by integrating machine learning insights.
- Stay current with AIOps and cloud-native reliability practices.
Requirements
- Strong hands-on experience in Linux administration and Python development.
- Experience with agentic AI or multi-agent frameworks.
- Deep expertise with Docker and Kubernetes.
- Familiarity with Kafka, ActiveMQ, MySQL, Oracle, and Redis.
- Understanding of AI/ML-based anomaly detection.
- Proven ability in incident management and root cause analysis (RCA).
- Experience designing and maintaining CI/CD pipelines.
- Proficiency with Prometheus, Grafana, and OpenTelemetry.
- A continuous learning mindset and passion for automation.
- 1+ years of experience in a SaaS or cloud-native environment.
Benefits
- Competitive compensation, bonus opportunities, and retirement programs.
- Comprehensive medical, dental, and vision coverage.
- Generous, flexible time off.
- Paid holidays, wellness days, and a company-wide year-end break.
- 6 months of fully paid parental leave.
- Learning & development stipend.
- Opportunities to give back, including volunteer time and donation matching.
- Mental wellbeing resources and support.
Tech Stack
Ansible, Apache Kafka, AWS, Docker, Grafana, Jenkins, Kubernetes, MySQL, Prometheus, Puppet, Python, Redis, Terraform
Categories
AI & ML, DevOps