Site Reliability Engineer (SRE)

about 2 months ago

Dublin, Ireland

Mid Level / Senior

Responsibilities

Design and implement scalable, reliable, and fault-tolerant systems across cloud environments.
Develop and maintain observability tools, including monitoring, logging, and alerting.
Automate infrastructure provisioning, deployment, and incident response using Infrastructure as Code tools.
Optimize system performance, scalability, and incident response workflows.
Conduct root cause analysis and implement preventative measures to minimize failures.
Ensure high availability by designing and maintaining load balancing and disaster recovery strategies.
Improve CI/CD pipelines to enhance deployment speed while maintaining stability.
Participate in on-call rotations to quickly address system failures.

Requirements

4+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
Strong knowledge of cloud platforms (AWS, Azure, or GCP) and cloud-native architectures.
Experience with observability and monitoring tools like Prometheus, Grafana, and Datadog.
Proficiency in Infrastructure as Code tools such as Terraform or CloudFormation.
Hands-on experience with containerization and orchestration (Docker, Kubernetes).
Strong Linux system administration and networking fundamentals.
Experience with incident management and root cause analysis.
Proficiency in scripting (Bash, Python, or Go) for automation.
Knowledge of load balancing, failover strategies, and distributed systems.
Understanding of security best practices and compliance requirements.