Atlanta, GA, USA
Staff+
Responsibilities
- Design, build, and operate the core evaluation infrastructure for LLMs and agentic systems.
- Translate fuzzy goals into concrete, measurable signals for agent performance.
- Solve complex evaluation problems involving multi-step agents and evolving task definitions.
- Use evaluation results to guide architectural decisions and model selection.
- Participate in design reviews and set technical standards for evaluation rigor.
Requirements
- 7+ years of software engineering experience, including 2+ years in AI/ML systems.
- Deep experience with backend systems in Python, including data pipelines.
- Hands-on experience evaluating LLM-based systems.
- Strong intuition for metrics, experimentation, and failure analysis.
- Excellent communication skills for collaboration with diverse stakeholders.
- A high-ownership mindset toward system integrity and decision-making.
