Backblaze External Website

Sr. Site Reliability Engineer

3 months ago
Remote, United States
Senior
H1B Sponsor

Base Salary

$150k - $200k/yr

Responsibilities

  • Own and drive the availability, durability, and performance of critical services across all production environments.
  • Lead and champion complex projects from problem discovery through complete, cross-functional resolution.
  • Define, establish, and enforce service health standards, including SLIs, SLOs, and error budget policies.
  • Lead critical incident response and post-incident reviews to drive long-term service improvements.
  • Mentor others and act as a subject matter expert in ITIL/OSS processes.
  • Design and architect scalable automation solutions to eliminate toil and improve operational efficiency.
  • Drive the strategic direction of monitoring, logging, and alerting frameworks.
  • Build, maintain, and secure advanced CI/CD pipelines and infrastructure as code solutions.
  • Write production-grade code to develop new reliability tools and enhance existing systems.
  • Act as a principal partner to engineering, product, and operations teams for resilient system design.
  • Lead capacity planning and disaster recovery strategy across critical infrastructure components.
  • Manage vendor relationships to troubleshoot systemic issues and ensure SLA adherence.
  • Drive the creation of high-quality documentation and cultivate a reliability-first engineering culture.
  • Own the creation and maintenance of operational playbooks and system documentation.
  • Proactively identify systemic issues and implement long-term improvements.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field.
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Experience leading security-minded operations, with a focus on system-wide patching and vulnerability identification.
  • Deep mastery of service reliability concepts and incident response methodologies.
  • Advanced proficiency in at least one modern scripting/programming language, preferably Python or Go.
  • Proven experience with container orchestration and microservices concepts.
  • Expert-level experience with HashiCorp products in a production environment.

Benefits

  • Healthcare for family, including dental and vision.
  • Competitive compensation and 401(k).
  • RSU grants for full-time employees.
  • Employee stock purchase plan (ESPP).
  • Flexible vacation policy.
  • Maternity and paternity leave.
  • MacBook Pro for work with a stipend for workstation personalization.
  • Childcare bonus.
  • Fertility treatment and support.
  • Learning and development program.
  • Commuter benefits.
  • Culture that supports a healthy work-life balance.
