about 3 hours ago
Base Salary
$218k - $257k/yr
Responsibilities
- Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services.
- Build automation and tooling to streamline operational IT workflows.
- Partner with the Infrastructure team to extend CI/CD frameworks.
- Strengthen observability and documentation standards across IT engineering.
- Develop full-stack applications that power internal AI products and infrastructure.
Requirements
- 8+ years of experience automating and supporting cloud infrastructure (AWS).
- Proven experience deploying and managing containerized workloads using Docker and Kubernetes.
- Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go).
- Track record of leading incident response in environments with strict SLAs.
- Utilizes generative AI responsibly to drive measurable improvements.