Cribl

Staff Site Reliability Engineer

about 3 hours ago
Remote, Australia
Staff+
H1B Sponsor

Responsibilities

  • Engage with teams to improve service delivery and reliability across their entire lifecycle.
  • Measure and monitor all production systems, focusing on availability, latency, and overall system health.
  • Identify causes of errors and instability in production cloud services and drive teams towards operational excellence.
  • Collaborate with product and platform teams to enhance systems for reliability and observability.
  • Innovate and automate to reduce operational toil.
  • Participate in on-call or off-hours duties as required.

Requirements

  • Proven experience in designing and operating observability systems for complex cloud platforms.
  • Familiarity with Configuration Management and Infrastructure as Code tools like Terraform or Ansible.
  • Knowledge of cloud platforms, preferably AWS and Azure, and container orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, and Prometheus.
  • Extensive experience in continuous delivery environments.
  • Development experience with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Background in Linux Systems Engineering and incident response tools.
  • Comfortable working autonomously in a distributed team.
  • Strong knowledge of cloud design patterns and application security best practices.

Tech Stack

Ansible, AWS, Azure, Grafana, JavaScript, Kibana, Linux, Node.js, Prometheus, Splunk, Terraform, TypeScript

Categories

Backend, DevOps, Testing