Responsibilities
- Monitor platform and merchant performance and proactively detect issues.
- Coordinate the mitigation and resolution of high-impact incidents.
- Communicate real-time updates to merchants during incidents.
- Analyze incident trends to identify recurring issues.
- Collaborate with Operations, Product, and Engineering teams to improve monitoring strategies.
- Investigate alerts and provide feedback to make logging more effective.
- Document learnings and contribute to the monitoring playbook.
- Lead initiatives to automate and enhance monitoring capabilities.
Requirements
- At least 5 years of experience in incident and problem management.
- Experience with monitoring and logging tools such as Prometheus and Grafana.
- Strong communication skills for effective client interaction.
- Ability to analyze complex systems and identify root causes.
- Willingness to participate in on-call rotations.
- Experience with observability platforms such as Datadog and Splunk.
- Strong team player with a collaborative mindset.