19 days ago
San Francisco, CA, USA
Mid Level / Senior
Base Salary
$200k - $300k/yr
Responsibilities
- Design and curate evaluation datasets that reliably cover assistant behavior.
- Build and maintain large-scale evaluation pipelines measuring assistant quality.
- Develop LLM-powered judges to score metrics like correctness and response quality.
- Evaluate new models and product changes to provide quality signals before launch.
- Create observability infrastructure for inspecting AI agent behavior.
- Use evaluation results and customer feedback to drive improvements in assistant behavior.
- Collaborate with engineers to integrate evaluations into the product shipping process.
Requirements
- 2+ years of software engineering experience with strong coding skills.
- Strong backend fundamentals in Go and Python; comfortable with distributed data pipelines.
- Experience with LLM evaluation, reinforcement learning, or natural language processing.
- Analytically rigorous with a focus on predicting real user experience.
- Ability to thrive in a customer-focused, cross-functional team environment.
- A strong commitment to quality in both systems and product.
Benefits
- Comprehensive benefits package including medical, vision, and dental coverage.
- Generous time-off policy and 401(k) plan contributions.
- Home office improvement stipend and annual education and wellness stipends.
- Vibrant company culture with regular events and daily healthy lunches.
