Responsibilities
- Own and influence the incident management process end-to-end.
- Maintain and evolve the on-prem observability stack.
- Participate in the on-call rotation to keep production applications running smoothly.
- Develop automations and tools to support platform reliability.
- Contribute to production services with performance and resiliency in mind.
- Collaborate with product engineers to foster SRE principles within the R&D organization.
- Mentor members of the SRE team or product engineers.
Requirements
- Solid programming experience in Python (Django and AsyncIO) and/or Java (Spring Boot).
- Experience maintaining an observability tool suite, specifically the LGTM stack (Loki, Grafana, Tempo, Mimir).
- Experience developing and maintaining Python services in production.
- Strong experience with AWS and Kubernetes.
- Proficiency in working with relational databases (PostgreSQL) and messaging systems (e.g., RabbitMQ, NATS, Kafka).
- Experience as an on-call SRE.
- Enjoyment of hands-on troubleshooting of distributed systems in production environments.
- Strong communication skills and a desire to share knowledge on reliability.
- Proficiency in English, both written and spoken.
Benefits
- Remote-first approach with the option for hybrid work from offices in Kyiv, Warsaw, and Lisbon.
- Long-term collaboration, supported by a choice of employment arrangements.
- Work schedule aligned with EU time zones.
- Honest, open culture that values constructive feedback.
- Opportunities for professional and personal development within a collaborative team.
- A stable yet growing SaaS product with an agile environment and strong technical challenges.
