about 3 hours ago
Bangkok, Thailand
Senior / Staff+
Responsibilities
- Lead the technical vision and architecture of new SRE platforms.
- Define and promote SRE best practices across Agoda’s services.
- Design, build, and operate reliability platforms to enhance system resilience.
- Own safe deployment strategies integrated with monitoring.
- Identify and mitigate reliability and scaling risks proactively.
- Lead major incident response and operational excellence initiatives.
- Maintain and evolve incident and observability tooling.
- Advance platform observability using Prometheus and Grafana.
- Define reliability roadmaps and translate business goals into technical requirements.
Requirements
- 8+ years of relevant experience in software engineering.
- Demonstrated ownership of architecting and operating mission-critical systems.
- Proven ability to lead complex cross-team initiatives.
- Expertise in programming languages such as Go, Python, Rust, or Java.
- Deep hands-on experience with the Kubernetes ecosystem.
- Observability and monitoring expertise using Prometheus and Grafana.
- Strong incident management lifecycle experience.
- Experience with reliability engineering patterns such as canary deployments.
- Solid data analysis skills, including SQL and data pipelines.
- Excellent communication and collaboration skills.
Tech Stack
Argo CD, Go, Grafana, Istio, Java, Kubernetes, Microsoft SQL Server, PostgreSQL, Prometheus, Python, Rust, SQL
Categories
DevOps
