Remote, United Kingdom
Mid Level / Senior
Responsibilities
- Act as a primary or escalation responder in a 24x7 on-call rotation.
- Lead or support Major Incident (MI) response, including triage, mitigation, and resolution.
- Coordinate across Engineering, Infrastructure, Security, and Product teams.
- Execute and improve runbooks, playbooks, and escalation paths.
- Drive blameless post-incident reviews (PIRs) and track corrective actions.
- Own service health monitoring across infrastructure, applications, and dependencies.
- Design and maintain alerting strategies that align with SLIs/SLOs.
- Reduce alert fatigue through signal-to-noise improvements.
- Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch.
- Automate repetitive operational tasks to reduce manual toil.
- Improve mean time to detect (MTTD) and mean time to resolve (MTTR).
- Develop scripts and tools in Python, Bash, Go, or similar to support NOC/SRE workflows.
- Implement self-healing and auto-remediation where possible.
- Partner with engineering teams to improve system design for reliability.
- Support and troubleshoot Linux-based systems, cloud platforms, and Kubernetes environments.
- Assist with capacity planning and availability reviews.
- Ensure operational readiness for production releases.
Requirements
- Strong Linux systems administration skills.
- Experience with incident management and production support.
- Familiarity with cloud infrastructure, preferably AWS.
- Experience with containers and orchestration tools like Docker and Kubernetes.
- Knowledge of monitoring and alerting platforms.
- Scripting or programming experience in Python, Bash, Go, or similar.
- Understanding of networking fundamentals such as DNS, TCP/IP, and load balancing.
- Experience working in 24x7 NOC or production operations environments.
- Ability to handle high-pressure incidents calmly and effectively.
- Strong written and verbal communication skills for incident coordination.
- Comfort working from runbooks and improving them when necessary.
Tech Stack
Ansible, AWS, Azure, Bash, Datadog, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, Splunk, Terraform
Categories
DevOps, Security