Responsibilities
- Engage with teams to improve service delivery and reliability.
- Measure and monitor production systems for availability and health.
- Identify causes of errors and drive operational excellence.
- Collaborate with product teams to enhance system reliability.
- Innovate and automate to reduce operational toil.
- Participate in on-call or off-hours duties as required.
Requirements
- Proven experience with observability systems for cloud platforms.
- Familiarity with Configuration Management and Infrastructure as Code tools like Terraform.
- Knowledge of cloud platforms, preferably AWS and Azure.
- Experience with APM and observability tools such as New Relic and Prometheus.
- Extensive experience in continuous delivery environments.
- Development skills in JavaScript, Node.js, and TypeScript on Linux or macOS.
- Experience with incident response tools like PagerDuty.
- Strong background in Linux Systems Engineering.
- Comfortable working autonomously in a distributed team.
- Knowledge of cloud security best practices.