Responsibilities
- Own and influence the incident management process end-to-end.
- Maintain and evolve the on-prem observability stack.
- Participate in on-call rotation to keep production applications running smoothly.
- Develop automations and tools to support platform reliability.
- Contribute to production services with performance and resiliency in mind.
- Collaborate with product engineers to foster SRE principles within the R&D organization.
- Mentor the SRE team or product engineers.
Requirements
- Solid programming experience in Python (Django and AsyncIO) and/or Java (Spring Boot).
- Experience maintaining an observability tool suite, specifically the LGTM stack (Loki, Grafana, Tempo, Mimir).
- Experience developing and maintaining Python services in production.
- Strong experience with AWS and Kubernetes.
- Proficiency in working with relational databases (PostgreSQL) and messaging systems (e.g., RabbitMQ, NATS, Kafka).
- Experience as an on-call SRE.
- Enjoy hands-on troubleshooting of distributed systems in production environments.
- Strong communication skills and a desire to share knowledge on reliability.
- Proficiency in English, both written and spoken.
Benefits
- Multisport Card for fitness and wellness activities (individual or family plan).
- LuxMed healthcare coverage (individual or family plan).
- UNUM life insurance protection (individual or family plan).
- Onboarding benefit allowance for necessary work equipment and setup.
- 6 self-care days in addition to the standard Polish vacation entitlement.
- Wellness, learning, and development budgets.
- Opportunities to purchase company stock or receive annual bonuses.
