Senior Service Reliability Engineer
ThoughtWorks
about 16 hours ago
Responsibilities
- Improve site reliability by building fault-tolerant mechanisms and architectures.
- Drive the integration of observability automation into the CI/CD pipeline.
- Handle production incidents and manage communication with clients.
- Monitor the performance of production systems to meet SLA and SLO targets.
- Advise application development teams on system reliability improvements.
- Enhance system observability to reduce false alarms and improve efficiency.
- Implement chaos engineering practices for regular reliability testing.
- Align site reliability direction with client goals and business needs.
Requirements
- Hands-on experience in programming and scripting languages such as Python, Go, or Bash.
- Good understanding of at least one public cloud platform (AWS, Azure, or GCP).
- Exposure to observability tools like Grafana, Datadog, or ELK Stack.
- Familiarity with DevOps and GitOps practices.
- Knowledge of container-based architecture and orchestration tools like Kubernetes.
- Understanding of technical architecture and modern design patterns.
- Familiarity with the principles of a cloud provider's Well-Architected Framework.
Benefits
- Career development supported by interactive tools and numerous programs.
- A dynamic and inclusive community focused on continuous learning.