
Site Reliability Engineer
SingleStore
3 months ago
Responsibilities
- Develop an automation platform to manage infrastructure rollouts across cloud providers.
- Optimize the telemetry platform to identify customer-impacting events and provide relevant data for debugging.
- Partner with the engineering team to optimize the performance of services for the cloud architecture.
- Debug live-site events and conduct follow-up postmortems and root cause analysis (RCA).
- Participate in an SLA-driven on-call rotation, including after-hours and weekend participation.
Requirements
- 5 years of demonstrated experience working as a Site Reliability Engineer.
- Infrastructure automation experience with scripting skills in Python or Bash.
- Experience with the Prometheus monitoring stack; familiarity with Grafana, Mimir, and Loki is a plus.
- Knowledge of Kubernetes and the container ecosystem.
- Strong cross-group collaboration and communication skills.
- Familiarity with at least one of AWS, Azure, or Google Cloud.
- Experience debugging, diagnosing, and troubleshooting complex production software.
- B.S. degree in Computer Science or a related field.