3 days ago
Toronto, Canada
Staff+
Responsibilities
- Build and operate multi-model serving across modalities on shared infrastructure.
- Own the model lifecycle: download, deploy, serve, monitor, update, and swap.
- Drive inference optimization for latency, throughput, and cost.
- Architect and build the Agentic AI Platform for autonomous agents.
- Design multi-agent coordination systems for complex workflows.
- Build robust tool-use infrastructure for safe agent interactions.
- Implement workflow automation for multi-step business and engineering tasks.
- Develop evaluation and observability frameworks for agent behavior.
- Define technical direction and architecture for agentic systems.
- Mentor engineers and contribute to best practices for agent system design.
Requirements
- 8+ years of software engineering experience, with 3+ years in AI systems or LLM applications.
- Strong understanding of LLM-based agent architectures.
- Experience building highly reliable distributed systems.
- Proficiency in Python and experience with modern LLM APIs or open-source models.
- Experience with model serving technologies.
- Understanding of distributed systems and cloud platforms.
- Strong understanding of security risks in agentic systems.
- Demonstrated experience leading complex technical initiatives.
- Strong written and verbal communication skills.
Tech Stack
Apache Airflow, AWS, Google Cloud Platform, Python
Categories
AI & ML, Backend, Data Engineering, DevOps, Testing