
Lead DevOps Engineer
TELUS Digital (posted about 4 hours ago)
Responsibilities
- Define platform reliability strategy and establish SLOs/SLIs for AI services.
- Design scalable and secure cloud architecture on GCP for distributed AI services.
- Build observability metrics and alerting for LLM-powered features.
- Implement resilience engineering practices for AI inference paths.
- Automate infrastructure management using Terraform and other tools.
- Enforce production readiness standards across teams launching new AI products.
- Mentor engineers and drive architecture reviews to enhance engineering culture.
Requirements
- Significant experience in infrastructure engineering combining DevOps and SRE disciplines.
- Deep expertise in GCP, with relevant cloud certifications preferred.
- Production experience with SRE fundamentals including SLO/SLI design.
- Strong background in distributed systems and resilience patterns.
- Expertise in infrastructure-as-code (Terraform) and container orchestration (Kubernetes).
- Hands-on experience with modern observability stacks and AI-specific tooling.
- Proficiency in Python, JavaScript, and Bash for infrastructure tooling.