Responsibilities
- Maintain and improve observability systems (monitoring, logging, alerting).
- Define, adjust, and maintain Service Level Objectives (SLOs).
- Participate in incident resolution and on-call rotations (at most one week per month).
- Drive proactive reliability improvements across platforms.
- Collaborate with teams to analyze failure scenarios and implement mitigations.
- Create and maintain runbooks for incident response and prevention.
- Eliminate non-value-adding tasks through automation and process optimization.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- 2+ years of experience in DevOps, SRE, or Support Engineering roles.
- Experience with incident management on high-traffic, public-facing platforms.
- Strong scripting skills in Python, Bash, or PowerShell.
- Familiarity with CI/CD tools such as GitHub Actions, Azure DevOps, GitLab, or Jenkins.
- Experience with monitoring/APM tools such as Datadog, New Relic, Dynatrace, Prometheus, or Grafana.
- Basic knowledge of serverless services in AWS, Azure, or GCP.
- Proficiency with Docker and containerized environments.
- Excellent English communication skills (B2 level or higher).
- Experience working in international, cross-cultural teams.
Benefits
- Flexibility, with hybrid work options (country-dependent).
- Learning and development, with access to cutting-edge tools, training, and industry experts.
