Site Reliability Engineer II

San José, Costa Rica

Mid Level / Senior

H1B Sponsor

Responsibilities

Design and implement intelligent automation for infrastructure lifecycle management.
Apply AI/ML techniques for predictive monitoring and performance optimization.
Lead complex incident response efforts and root cause analyses.
Improve system reliability through dynamic scaling and automated performance tuning.
Enhance operational runbooks by eliminating manual processes through automation.
Evaluate and adopt emerging AIOps and cloud-native technologies.
Partner cross-functionally to deliver exceptional customer experiences.

Qualifications

2–4 years of experience in Linux systems administration and/or Python development.
Strong Linux administration skills including troubleshooting and performance tuning.
Experience developing Python scripts for operational workflows.
Hands-on experience with Docker and familiarity with Kubernetes.
At least one year of experience supporting SaaS or cloud-native environments.
Working knowledge of messaging platforms and databases like Kafka and MySQL.
Experience contributing to CI/CD pipelines and deployment automation.
Hands-on experience with monitoring platforms such as Prometheus and Grafana.
Experience in incident response and root cause analysis.
A demonstrated passion for automation and operational efficiency.