Responsibilities
- Design and iterate on high-quality data mixtures for late-stage training.
- Drive targeted improvements in coding, mathematics, and reasoning.
- Develop and evaluate synthetic data pipelines for training signal generation.
- Research and optimize multi-stage learning rate schedules and compute allocation.
- Implement methods for extending effective context length.
- Build evaluations to distinguish real capability improvements from overfitting.
- Measure how mid-training interventions scale with compute and data.
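To make the multi-stage schedule work above concrete, here is a minimal illustrative sketch of one common pattern: linear warmup, then cosine decay, then a constant floor. All names and numbers (`lr_at_step`, the step counts, the peak and final rates) are hypothetical examples, not a prescription from this role.

```python
import math

def lr_at_step(step: int, warmup_steps: int, decay_steps: int,
               peak_lr: float, final_lr: float) -> float:
    """Three-stage schedule: linear warmup -> cosine decay -> constant floor."""
    if step < warmup_steps:
        # Stage 1: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + decay_steps:
        # Stage 2: cosine-anneal from peak_lr down to final_lr.
        progress = (step - warmup_steps) / decay_steps
        return final_lr + (peak_lr - final_lr) * 0.5 * (1 + math.cos(math.pi * progress))
    # Stage 3: hold at the final (floor) learning rate.
    return final_lr

# Example: 100 warmup steps, 900 decay steps, 3e-4 peak, 3e-5 floor.
print(lr_at_step(0, 100, 900, 3e-4, 3e-5))     # start of warmup
print(lr_at_step(100, 100, 900, 3e-4, 3e-5))   # peak, start of decay
print(lr_at_step(2000, 100, 900, 3e-4, 3e-5))  # past decay: floor
```

In practice a framework scheduler (e.g. chaining built-in schedulers in PyTorch) would replace this hand-rolled function; the sketch only shows the shape of the schedule being tuned.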
Requirements
- Deep familiarity with the LLM training pipeline from pre-training to post-training.
- Hands-on experience with continual pre-training and late-stage data mixing.
- Strong intuition for data quality and curation at scale.
- Experience developing or evaluating synthetic data pipelines.
- Proficiency in Python and deep learning frameworks such as PyTorch.
- Strong fundamentals in optimization, statistics, and ML theory.
- A track record of original contributions in the field of AI.
Benefits
- Work in a small, highly selective team where research and product development are closely integrated.
- Access to large compute resources with training jobs running on thousands of GPUs.
- An environment that rewards speed, autonomy, and technical depth.
