Sr. Site Reliability Engineer

3 months ago

Remote, United States

Senior

H-1B Sponsor

Base Salary

$150k - $200k/yr

Responsibilities

Own and drive the availability, durability, and performance of critical services across all production environments.
Lead and champion complex projects from problem discovery through complete, cross-functional resolution.
Define, establish, and enforce service health standards, including SLIs, SLOs, and error budget policies.
Lead critical incident response and post-incident reviews to drive long-term service improvements.
Mentor others and act as a subject matter expert in ITIL/OSS processes.
Design and architect scalable automation solutions to eliminate toil and improve operational efficiency.
Drive the strategic direction of monitoring, logging, and alerting frameworks.
Build, maintain, and secure advanced CI/CD pipelines and infrastructure as code solutions.
Write production-grade code to develop new reliability tools and enhance existing systems.
Act as a principal partner to engineering, product, and operations teams for resilient system design.
Lead capacity planning and disaster recovery strategy across critical infrastructure components.
Manage vendor relationships to troubleshoot systemic issues and ensure SLA adherence.
Drive the creation of high-quality documentation and cultivate a reliability-first engineering culture.
Own the creation and maintenance of operational playbooks and system documentation.
Proactively identify systemic issues and implement long-term improvements.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field.
8+ years of progressive experience in site reliability, systems engineering, or operations.
Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
Expert-level Linux systems administration and advanced troubleshooting skills.
Experience leading security-minded operations, with a focus on system-wide patching and vulnerability identification.
Deep mastery of service reliability concepts and incident response methodologies.
Advanced proficiency in at least one modern scripting/programming language, preferably Python or Go.
Proven experience with container orchestration and microservices concepts.
Expert-level experience with HashiCorp products in a production environment.