11 days ago
Base Salary
$175k - $230k/yr
Responsibilities
- Design and evolve system architectures for high availability and scalability.
- Lead complex incident response efforts and coordinate across engineering teams.
- Define and implement organization-wide observability practices.
- Establish and maintain reliability standards, including SLIs and SLOs.
- Drive automation and infrastructure improvements to reduce operational toil.
- Partner with engineering teams on system design and architecture reviews.
- Evolve Sage’s cloud infrastructure to support scalable systems.
- Operate and improve critical data infrastructure for high availability.
- Lead capacity planning and auto-scaling efforts.
- Build internal tooling and platforms to enhance the developer experience.
Requirements
- 7-12+ years of experience in software engineering or site reliability engineering.
- Experience with edge or device-based systems and managing connectivity.
- Strong networking fundamentals and experience debugging distributed systems.
- Experience operating and scaling production databases like PostgreSQL or MySQL.
- Deep expertise in cloud infrastructure, particularly AWS or Google Cloud.
- Strong experience in designing highly available systems.
- Expertise in containerization and orchestration, especially with Kubernetes.
- Advanced observability and monitoring skills using tools like Datadog.
- Strong programming ability in languages like Go, Python, or Java.
- Deep knowledge of infrastructure-as-code practices and tools like Terraform.
- Ability to influence engineering teams and guide best practices.
- Strong incident management and production debugging skills.
Benefits
- Competitive base compensation along with stock options.
- Fully paid health and dental insurance coverage.
- Take-as-you-need time-off policy, plus 7 paid holidays.
- Company-wide winter break during the holidays.
- Office lunch and a fully stocked snack bar.
- Up to 2 remote days per week.
