Responsibilities
- Participate in a 24/7 on-call rotation to monitor platform and merchant performance.
- Coordinate the mitigation, recovery, and resolution of high-impact incidents.
- Communicate real-time updates to merchants during incidents.
- Analyze incident trends to identify recurring issues and advocate for long-term fixes.
- Collaborate with Operations, Product, and Engineering teams to improve monitoring strategies.
- Investigate alerts and provide feedback to improve logging and alerting.
- Reduce the risk of merchant impact by acting on alerts in partnership with Engineering teams.
- Lead initiatives and manage projects to develop monitoring automation.
Requirements
- At least 5 years of experience in incident management and platform monitoring operations.
- Experience with problem management practices and root cause investigations.
- Solid communication skills and the ability to build strong working relationships across the organization.
- Willingness to participate in the on-call rotation in a fast-paced environment.
- Experience with monitoring and logging tools like Prometheus, Grafana, and ELK Stack.
- Familiarity with observability platforms such as Datadog, Dynatrace, and Splunk.
- Excellent analytical and problem-solving skills.
- Ability to handle complex situations and multiple responsibilities simultaneously.
- Strong team player with a passion for process improvement.