Atlanta, GA, USA
Staff+
Responsibilities
- Design, build, and operate the core evaluation infrastructure for LLMs and agentic systems.
- Translate fuzzy goals into concrete, measurable signals for agent performance.
- Solve complex evaluation problems involving multi-step agents and evolving task definitions.
- Use evaluation results to guide architectural decisions and model selection.
- Participate in design reviews and set technical standards for evaluation rigor.
Requirements
- 7+ years of software engineering experience, including 2+ years in AI/ML systems.
- Deep experience with backend systems in Python, including data pipelines.
- Hands-on experience evaluating LLM-based systems.
- Strong intuition for metrics, experimentation, and failure analysis.
- Excellent communication skills for collaboration with diverse stakeholders.
- A high-ownership mindset toward system integrity and decision-making.
