Responsibilities
- Build and run the LLM control plane/gateway with smart routing and cost tracking.
- Ship a unified API and SDKs with normalized schemas and full observability.
- Enforce safety and privacy by default through content filtering and PII redaction.
- Enable multi-model, multi-vendor use of LLMs with automated canarying.
- Own the agent runtime including tool registry and permissions.
- Design orchestration patterns and manage agent state.
- Provide platform components for classical ML training and scoring pipelines.
- Monitor model and data drift, and retrain models as needed.
- Add human-in-the-loop review before agents interact with dealer systems.
- Evolve the domain graph and build reliable data ingestion pipelines.
- Serve real-time context to agents with access controls.
- Power retrieval with hybrid search and smart caching.
- Run continuous evaluations of platform quality and safety.
- Define SLOs for latency and uptime, and enable autoscaling.
- Maintain a model/agent registry and support compliance.
- Provide templates and documentation for product teams.
Requirements
- 5+ years of experience building large-scale data/ML or platform systems.
- Strong software engineering fundamentals including API design and distributed systems.
- Production experience with Python and one of Java, Scala, or Go.
- Experience with MLOps at scale including pipelines and CI/CD for models.
- Familiarity with cloud and containers, preferably AWS.
- Practical ML knowledge including feature engineering and model evaluation.
- Experience building or operating an LLM gateway/control plane.
- Knowledge of agentic systems and orchestration frameworks.
- Experience with knowledge graphs and hybrid retrieval patterns.