Obsidian Security

AI Site Reliability Engineer (SRE)

about 3 hours ago
Sydney, Australia
Mid Level / Senior
H1B Sponsor

Responsibilities

  • Support and maintain the service quality of the customer-facing SaaS security platform.
  • Address complex challenges around scalability, reliability, observability, and cost efficiency.
  • Collaborate with Engineering teams to maintain and enhance Helm charts, application deployment, monitoring, and CI/CD pipelines.
  • Embed into the engineering team to understand the application deeply.
  • Define service verification strategies and implement them as part of the CI/CD process to meet SLAs.
  • Improve developer experience by optimizing CI/CD workflows and performance.
  • Participate in the on-call rotation, providing 24/7 support in coordination with the global SRE team.
  • Monitor, debug, and optimize production infrastructure and services on AWS/GCP.
  • Own and evolve the observability stack, including Prometheus/Mimir metrics pipelines and Grafana dashboards.
  • Define and instrument SLIs/SLOs across services and build alerting strategies.

Requirements

  • 4+ years of experience in a DevOps or SRE role supporting SaaS services on GCP and/or AWS.
  • Bachelor’s degree in Computer Science or related field.
  • Production Kubernetes experience, including authoring Deployments and setting resource limits.
  • Strong proficiency in Kubernetes, microservices architecture, Helm, GitLab CI/CD, and ArgoCD.
  • Deep hands-on experience with the Grafana observability stack.
  • Ability to design SLI/SLO frameworks and build alerting rules.
  • PostgreSQL fluency in schema design, indexing, migrations, and query optimization.
  • Experience with async/queue-based architecture.
  • Programming proficiency in Python or Go.
  • Strong ownership mindset and comfort with production on-call responsibility.

Tech Stack

Apache Kafka, AWS, Databricks, Elasticsearch, GitLab CI/CD, Go, Google Cloud Platform, Grafana, Helm, Kubernetes, PostgreSQL, Prometheus, Python

Categories