6 months ago
Base Salary
$230k - $310k/yr
Responsibilities
- Own the reliability, availability, and performance of Gamma's production systems across AWS.
- Build observability infrastructure including metrics, logging, tracing, and alerting.
- Design and implement automation to reduce operational toil and improve deployment safety.
- Lead incident response and conduct blameless post-mortems to drive systemic improvements.
- Collaborate with engineering teams on architecture reviews and reliability best practices.
- Manage and optimize compute, networking, databases, and managed services.
Requirements
- 5+ years in site reliability engineering, DevOps, or systems engineering with AWS expertise.
- Strong programming skills in Python, Go, or TypeScript/Node.js.
- Experience with infrastructure-as-code tools like Terraform or CloudFormation.
- Proven track record of improving system reliability through automation and monitoring.
- Deep understanding of networking, distributed systems, and containerization technologies.
- Sharp incident management instincts and debugging skills for complex production failures.
- Experience scaling SaaS products or familiarity with Kafka and chaos engineering is a plus.
- AWS certifications or experience with security frameworks like SOC 2 is a plus.
