Senior/Staff Site Reliability Engineer

about 2 months ago

H1B Sponsor

Base Salary

$175k - $230k/yr

Responsibilities

Design and evolve system architectures for high availability and scalability.
Lead complex incident response efforts and coordinate across engineering teams.
Define and implement organization-wide observability practices.
Establish and maintain reliability standards, including SLIs and SLOs.
Drive automation and infrastructure improvements to reduce operational toil.
Partner with engineering teams on system design and architecture reviews.
Evolve Sage’s cloud infrastructure to support scalable systems.
Operate and improve critical data infrastructure for high availability.
Lead capacity planning and auto-scaling efforts.
Build internal tooling and platforms to enhance the developer experience.

Qualifications

7-12+ years of experience in software engineering or site reliability engineering.
Experience with edge or device-based systems and managing connectivity.
Strong networking fundamentals and experience debugging distributed systems.
Experience operating and scaling production databases like PostgreSQL or MySQL.
Deep expertise in cloud infrastructure, particularly AWS or Google Cloud.
Strong experience in designing highly available systems.
Expertise in containerization and orchestration, especially with Kubernetes.
Advanced observability and monitoring skills using tools like Datadog.
Strong programming ability in languages like Go, Python, or Java.
Deep knowledge of infrastructure-as-code practices and tools like Terraform.
Ability to influence engineering teams and guide best practices.
Strong incident management and production debugging skills.