Alpaca

Staff Site Reliability Engineer, Streaming

about 1 month ago
Remote, United States
Staff+
H1B Sponsor

Responsibilities

  • Triage difficult technical problems and implement solutions.
  • Enhance the RabbitMQ and Redpanda observability stack by defining SLOs and alerts.
  • Improve the reliability of RabbitMQ and Redpanda clients.
  • Respond to and resolve incidents promptly, and conduct post-incident reviews.
  • Collaborate with development teams to ensure reliability and scalability in new features.
  • Monitor system capacity and performance, making recommendations for future growth.

Requirements

  • 5+ years of experience in Site Reliability Engineering or similar roles.
  • 5+ years of experience with message brokers like Kafka, RabbitMQ, and Redpanda.
  • Proven track record of managing large-scale, high-availability distributed systems.
  • Experience designing and implementing SLIs, SLOs, and SLAs with alerting and monitoring.
  • Strong ability to work independently and lead large projects.
  • Significant production experience with Kubernetes.
  • Proficient in Go, Prometheus, and Linux.
  • Experience troubleshooting message broker performance issues.

Benefits

  • Competitive Salary & Stock Options.
  • Health Benefits.
  • One-time USD $500 for new hire home-office setup.
  • Monthly stipend of USD $150 via a Brex Card.

Tech Stack

Go, Kubernetes, Linux, Prometheus, RabbitMQ
