
Staff Site Reliability Engineer
AlphaSense (posted about 1 month ago)
Responsibilities
- Architect frameworks and self-service tooling for service reliability.
- Drive the AIOps strategy for automated diagnostics and proactive failure prevention.
- Embed SRE practices across engineering through design reviews and operational standards.
- Act as Incident Commander during critical events and lead blameless postmortems.
- Deliver end-to-end monitoring and profiling to optimize performance.
- Mentor engineers across SRE and product teams through technical guidance.
Requirements
- 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
- At least 3 years in an SRE position at Senior level or above.
- Strong background in running production SaaS systems at scale.
- Proficiency in at least one programming/scripting language (Python, Go, etc.).
- Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes.
- Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing).
- Experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, ELK).
- Familiarity with advanced observability tools (OpenTelemetry, continuous profiling).
- Proven incident management experience, including leading high-severity incidents.
- Strong troubleshooting skills across the full stack.
- Excellent communication and collaboration skills.