Responsibilities
- Design and operate observability layers for AI platforms.
- Build automated findings-to-fix loops for AI and cloud platforms.
- Implement reliability and hardening controls for internal AI systems.
- Codify detections, policies, and operational checks as code.
- Review platform and AI-application changes for reliability.
- Own AI-platform-specific operational readiness.
- Continuously improve production readiness through automation.
Requirements
- 5+ years in SRE, production engineering, platform operations, or security automation.
- Hands-on scripting and coding experience, especially in Python.
- Experience building observability and alerting systems in AWS or similar environments.
- Ability to reduce operational toil through automation.
- Comfort with incident handling and evidence-driven postmortems.
- Interest in AI systems and MCP-style integration risks is a plus.