
AI Site Reliability Engineer (SRE)
Obsidian Security
about 3 hours ago
Sydney, Australia
Mid Level / Senior
H1B Sponsor
Responsibilities
- Support and maintain the service quality of the customer-facing SaaS security platform.
- Address complex challenges around scalability, reliability, observability, and cost efficiency.
- Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines.
- Embed into the engineering team to understand the application deeply.
- Define service verification strategies and implement them as part of the CI/CD process to meet SLAs.
- Improve developer experience by optimizing CI/CD workflows and performance.
- Participate in the on-call rotation, providing 24/7 support in coordination with the global SRE team.
- Monitor, debug, and optimize production infrastructure and services on AWS/GCP.
- Own and evolve the observability stack, including Prometheus/Mimir metrics pipelines and Grafana dashboards.
- Define and instrument SLIs/SLOs across services and build alerting strategies.
Requirements
- 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS.
- Bachelor’s degree in Computer Science or a related field.
- Production Kubernetes experience, including authoring Deployments and setting resource limits.
- Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD.
- Deep hands-on experience with the Grafana observability stack.
- Ability to design SLI/SLO frameworks and build alerting rules.
- PostgreSQL fluency in schema design, indexing, migrations, and query optimization.
- Experience with asynchronous, queue-based architectures.
- Programming proficiency in Python or Go.
- Strong ownership mindset and comfort with production on-call responsibility.
Tech Stack
Apache Kafka, AWS, Databricks, Elasticsearch, GitLab CI/CD, Go, Google Cloud Platform, Grafana, Helm, Kubernetes, PostgreSQL, Prometheus, Python