Responsibilities
- Triage difficult technical problems and implement solutions.
- Enhance the RabbitMQ and Redpanda observability stack by defining SLOs and alerts.
- Improve the reliability of RabbitMQ and Redpanda clients.
- Respond to and resolve incidents promptly, and conduct post-incident reviews.
- Collaborate with development teams to ensure reliability and scalability in new features.
- Monitor system capacity and performance, making recommendations for future growth.
Requirements
- 5+ years of experience in Site Reliability Engineering or similar roles.
- 5+ years of experience with message brokers such as Kafka, RabbitMQ, or Redpanda.
- Proven track record of managing large-scale, high-availability distributed systems.
- Experience designing and implementing SLIs, SLOs, and SLAs with alerting and monitoring.
- Strong ability to work independently and lead large tasks.
- Significant production experience with Kubernetes.
- Proficiency in Go, Prometheus, and Linux.
- Experience troubleshooting message broker performance issues.
Benefits
- Competitive Salary & Stock Options.
- Health Benefits.
- One-time USD $500 stipend for new-hire home-office setup.
- Monthly stipend of USD $150 via a Brex Card.
