5 months ago
Base Salary
$200k - $260k/yr
Responsibilities
- Design, implement, and manage monitoring and infrastructure resources across 50+ global regions.
- Lead incident management processes, including postmortems and root cause analyses.
- Automate operational tasks and workflows to maintain high reliability.
- Collaborate across teams to enhance reliability, security, and compliance.
- Optimize infrastructure costs through strategic capacity planning.
Requirements
- 5+ years of experience in Site Reliability Engineering or similar roles.
- Expertise in infrastructure-as-code (IaC) tools such as Pulumi, Terraform, or CloudFormation.
- Familiarity with observability tools and incident response practices.
- Proficiency with cloud infrastructure platforms such as Azure, GCP, or AWS.
- Strong programming skills in Python, Bash, Go, or similar languages.
- Solid understanding of CI/CD, Kubernetes, containerization, and cloud security principles.
Benefits
- In-person work model with relocation assistance for new employees.
