about 3 hours ago
Remote, Australia
Staff+
H1B Sponsor
Responsibilities
- Engage with teams to improve service delivery and reliability across the entire service lifecycle.
- Measure and monitor all production systems, focusing on availability, latency, and overall system health.
- Identify causes of errors and instability in production cloud services and drive teams towards operational excellence.
- Collaborate with product and platform teams to enhance systems for reliability and observability.
- Innovate and automate to reduce operational toil.
- Participate in on-call or off-hours duties as required.
Requirements
- Proven experience in designing and operating observability systems for complex cloud platforms.
- Familiarity with Configuration Management and Infrastructure as Code tools like Terraform or Ansible.
- Knowledge of cloud platforms, preferably AWS and Azure, and container orchestration technologies.
- Experience with APM and observability tools such as New Relic, Splunk, and Prometheus.
- Extensive experience in continuous delivery environments.
- Development experience with JavaScript/Node.js/TypeScript in a Linux/macOS environment.
- Background in Linux Systems Engineering and incident response tools.
- Comfortable working autonomously in a distributed team.
- Strong knowledge of cloud design patterns and application security best practices.
Tech Stack
Ansible, AWS, Azure, Grafana, JavaScript, Kibana, Linux, Node.js, Prometheus, Splunk, Terraform, TypeScript
Categories
Backend, DevOps, Testing