AI Site Reliability Engineer (SRE)

about 3 hours ago

Sydney, Australia

Mid Level / Senior

H1B Sponsor

Responsibilities

Support and maintain the service quality of the customer-facing SaaS security platform.
Address complex challenges around scalability, reliability, observability, and cost efficiency.
Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines.
Embed into the engineering team to understand the application deeply.
Define service verification strategies and implement them as part of the CI/CD process to meet SLAs.
Improve developer experience by optimizing CI/CD workflows and performance.
Participate in the on-call rotation, providing 24/7 support in coordination with the global SRE team.
Monitor, debug, and optimize production infrastructure and services on AWS/GCP.
Own and evolve the observability stack, including Prometheus/Mimir metrics pipelines and Grafana dashboards.
Define and instrument SLIs/SLOs across services and build alerting strategies.

Requirements

4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS.
Bachelor’s degree in Computer Science or a related field.
Production Kubernetes experience, including authoring Deployments and setting resource limits.
Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD.
Deep hands-on experience with the Grafana observability stack.
Ability to design SLI/SLO frameworks and build alerting rules.
Fluency in PostgreSQL schema design, indexing, migrations, and query optimization.
Experience with asynchronous, queue-based architectures.
Programming proficiency in Python or Go.
Strong ownership mindset and comfort with production on-call responsibility.