Staff Site Reliability Engineer

2 months ago

Remote, Worldwide

Staff+

H1B Sponsor

Responsibilities

Establish and evolve SRE best practices across the organization.
Define and drive observability strategy for system health and performance.
Design and implement software-driven solutions within the infrastructure domain.
Act as a technical leader and force multiplier across core cloud infrastructure.
Take ownership of large, ambiguous initiatives from concept to delivery.
Combine knowledge of software development, infrastructure, and security to improve platform resilience.
Proactively identify systemic risks and recommend platform upgrades.
Partner with engineering teams to improve developer workflows and operational maturity.
Provide technical mentorship and high-quality design and code reviews.
Lead by example in documentation and knowledge sharing.
Participate in and help mature incident response and post-incident learning.

Requirements

Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience.
7+ years of experience in site reliability engineering, infrastructure engineering, or platform engineering roles.
Expert-level troubleshooting across the entire stack from application to kernel to network.
Strong command-line proficiency and deep expertise in Linux systems.
Advanced understanding of networking concepts including load balancing and service-to-service communication.
Experience working across multiple programming languages such as Python, Go, and Bash.
Strong track record of automating operational work to reduce toil.
Deep experience with cloud platforms, preferably AWS.
Strong expertise in Kubernetes and container orchestration.
Experience designing and maintaining company-wide Infrastructure as Code codebases.

Ansible, AWS, Azure, Bash, Go, Google Cloud Platform, Helm, Kubernetes, Linux, Python, React, Terraform