
Staff Site Reliability Engineer
AlphaSense
about 1 month ago
Responsibilities
- Architect reliability frameworks and self-service tooling for service ownership.
- Drive AIOps strategy for automated diagnostics and proactive failure prevention.
- Embed SRE practices through design reviews and operational standards.
- Lead incident management as Incident Commander during critical events.
- Deliver end-to-end monitoring and profiling to optimize system performance.
- Mentor engineers across SRE and product teams through knowledge sharing.
Requirements
- 8+ years of experience in Site Reliability Engineering or similar roles.
- At least 3 years in a Senior+ SRE position.
- Strong background in running production SaaS systems at scale.
- Proficiency in a programming or scripting language such as Python or Go.
- Hands-on expertise with cloud platforms such as AWS, GCP, or Azure.
- Deep understanding of networking fundamentals such as TCP/IP and DNS.
- Experience with monitoring and alerting tools such as Prometheus and Grafana.
- Familiarity with advanced observability techniques.
- Proven incident management experience with high-severity incidents.
- Strong troubleshooting skills across the full stack.
- Excellent communication and collaboration skills.