
Sr. Site Reliability Engineer
Backblaze External Website · 3 months ago
Base Salary
$150k - $200k/yr
Responsibilities
- Own and drive the availability, durability, and performance of critical services across all production environments.
- Lead and champion complex projects from problem discovery through complete, cross-functional resolution.
- Define, establish, and enforce service health standards, including SLIs, SLOs, and error budget policies.
- Lead critical incident response and post-incident reviews to drive long-term service improvements.
- Mentor others and act as a subject matter expert in ITIL/OSS processes.
- Design and architect scalable automation solutions to eliminate toil and improve operational efficiency.
- Drive the strategic direction of monitoring, logging, and alerting frameworks.
- Build, maintain, and secure advanced CI/CD pipelines and infrastructure as code solutions.
- Write production-grade code to develop new reliability tools and enhance existing systems.
- Act as a principal partner to engineering, product, and operations teams for resilient system design.
- Lead capacity planning and disaster recovery strategy across critical infrastructure components.
- Manage vendor relationships to troubleshoot systemic issues and ensure SLA adherence.
- Drive the creation of high-quality documentation and cultivate a reliability-first engineering culture.
- Own the creation and maintenance of operational playbooks and system documentation.
- Proactively identify systemic issues and implement long-term improvements.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field.
- 8+ years of progressive experience in site reliability, systems engineering, or operations.
- Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
- Expert-level Linux systems administration and advanced troubleshooting skills.
- Experience leading security-minded operations, with a focus on system-wide patching and vulnerability identification.
- Deep mastery of service reliability concepts and incident response methodologies.
- Advanced proficiency in at least one modern scripting/programming language, preferably Python or Go.
- Proven experience with container orchestration and microservices concepts.
- Expert-level experience with HashiCorp products in a production environment.
Benefits
- Healthcare for family, including dental and vision.
- Competitive compensation and 401(k).
- RSU grants for full-time employees.
- Employee stock purchase plan (ESPP).
- Flexible vacation policy.
- Maternity and paternity leave.
- MacBook Pro for work with a stipend for workstation personalization.
- Childcare bonus.
- Fertility treatment and support.
- Learning and development program.
- Commuter benefits.
- Culture that supports a healthy work-life balance.