5 months ago
Base Salary
$238k - $290k/yr
Responsibilities
- Design, implement, and manage monitoring and infrastructure resources across 50+ global regions.
- Lead incident management processes, including postmortems and root cause analyses.
- Automate operational tasks and workflows to maintain high reliability.
- Establish best practices for security, compliance, and reliability.
- Optimize infrastructure costs through strategic capacity planning.
- Provide technical mentorship and leadership to promote best practices.
Requirements
- 10+ years of experience in Site Reliability Engineering or similar roles.
- Expertise in infrastructure as code tools like Pulumi, Terraform, or CloudFormation.
- Deep familiarity with observability tools and incident response practices.
- Proficiency with cloud infrastructure platforms such as Azure, GCP, or AWS.
- Strong programming skills in Python, Bash, Go, or similar languages.
- Proven track record of diagnosing complex system problems.
Benefits
- In-person work model with relocation assistance for new employees.
