GrepJob
NICE

Site Reliability Engineer

about 2 hours ago
Remote, United Kingdom
Mid Level / Senior

Responsibilities

  • Act as a primary or escalation responder in a 24x7 on-call rotation.
  • Lead or support Major Incident (MI) response, including triage, mitigation, and resolution.
  • Coordinate across Engineering, Infrastructure, Security, and Product teams.
  • Execute and improve runbooks, playbooks, and escalation paths.
  • Drive blameless post-incident reviews (PIRs) and track corrective actions.
  • Own service health monitoring across infrastructure, applications, and dependencies.
  • Design and maintain alerting strategies that align with SLIs/SLOs.
  • Reduce alert fatigue through signal-to-noise improvements.
  • Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch.
  • Automate repetitive operational tasks to reduce manual toil.
  • Improve mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Develop scripts and tools in Python, Bash, Go, or similar to support NOC/SRE workflows.
  • Implement self-healing and auto-remediation where possible.
  • Partner with engineering teams to improve system design for reliability.
  • Support and troubleshoot Linux-based systems, cloud platforms, and Kubernetes environments.
  • Assist with capacity planning and availability reviews.
  • Ensure operational readiness for production releases.

Requirements

  • Strong Linux systems administration skills.
  • Experience with incident management and production support.
  • Familiarity with cloud infrastructure, preferably AWS.
  • Experience with containers and orchestration tools like Docker and Kubernetes.
  • Knowledge of monitoring and alerting platforms.
  • Scripting or programming experience in Python, Bash, Go, or similar.
  • Understanding of networking fundamentals such as DNS, TCP/IP, and load balancing.
  • Experience working in 24x7 NOC or production operations environments.
  • Ability to handle high-pressure incidents calmly and effectively.
  • Strong written and verbal communication skills for incident coordination.
  • Comfort working from runbooks and improving them when necessary.

Tech Stack

Ansible, AWS, Azure, Bash, Datadog, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, Splunk, Terraform

Categories

DevOps, Security