Staff Site Reliability Engineer, Streaming

about 1 month ago

Remote, United States

Staff+

H1B Sponsor

Responsibilities

Triage difficult technical problems and implement solutions.
Enhance the RabbitMQ and Redpanda observability stack by defining SLOs and alerts.
Improve the reliability of RabbitMQ and Redpanda clients.
Respond to and resolve incidents in a timely manner, and conduct post-incident reviews.
Collaborate with development teams to ensure reliability and scalability in new features.
Monitor system capacity and performance, making recommendations for future growth.

Requirements

5+ years of experience in Site Reliability Engineering or similar roles.
5+ years of experience with message brokers such as Kafka, RabbitMQ, or Redpanda.
Proven track record of managing large-scale, high-availability distributed systems.
Experience designing and implementing SLIs, SLOs, and SLAs with alerting and monitoring.
Strong ability to work independently and lead large tasks.
Significant production experience with Kubernetes.
Proficient in Go, Prometheus, and Linux.
Experience troubleshooting message broker performance issues.

Go, Kubernetes, Linux, Prometheus, RabbitMQ