about 5 hours ago
Dublin, Ireland
Mid Level
H1B Sponsor
Responsibilities
- Build, operate, and improve production systems, with a focus on reliability and performance.
- Automate operational tasks to reduce manual toil.
- Contribute to the design and implementation of systems using SRE best practices.
- Define and measure SLIs and SLOs for supported services.
- Enhance observability through metrics, dashboards, and logging.
- Participate in on-call rotations and respond to production incidents.
- Assist with incident investigations and contribute to post-incident reviews.
- Analyze system behavior and capacity usage.
- Identify reliability issues and collaborate with teammates to address them.
- Write and maintain operational runbooks and system documentation.
Requirements
- Experience operating cloud-native production systems.
- Proficiency in writing production-quality code (e.g., Python, Go).
- Understanding of common failure modes in distributed systems.
- Experience with containerized workloads and platforms (e.g., Kubernetes).
- Comfort with participating in on-call rotations.
- Familiarity with observability tools and incident response.
- Knowledge of SRE concepts such as SLIs, SLOs, and error budgets.
- Hands-on experience with infrastructure as code (e.g., Terraform).
- Ability to follow incident response processes.
- Eagerness to learn and experiment with AI tools.