San Francisco, CA, USA
Mid Level / Senior
Base Salary
$140k - $320k/yr
Responsibilities
- Design and implement task-specific evaluations to improve agent quality.
- Define tasks, curate datasets, and build evaluation harnesses for various agents.
- Develop reusable frameworks for running evaluations at scale.
- Investigate orchestration strategies for complex, multi-step tasks.
- Experiment with post-training techniques to enhance model performance.
- Run rigorous experiments and analyze results to inform model configurations.
- Collaborate with cross-functional teams to align evaluations with platform primitives.
Requirements
- M.S. or Ph.D. in a relevant field, or equivalent practical experience.
- Strong background in machine learning and large language models.
- 2–5 years of experience with LLM technology and evaluation strategies.
- Proficiency in writing production-quality code, especially in Python.
- Experience designing and running experiments in real-world settings.
- Self-motivated and comfortable in high-ambiguity environments.
- Strong communication skills to translate vague goals into testable setups.
Benefits
- Significant equity in an early-stage, venture-backed startup.
- Comprehensive health benefits including medical, dental, and vision.
- Flexible PTO to recharge and maintain work-life balance.
