Responsibilities
- Design and iterate on high-quality data mixtures for late-stage training.
- Drive targeted improvements in coding, mathematics, and reasoning.
- Develop and evaluate synthetic data pipelines for training signal generation.
- Research and optimize multi-stage learning rate schedules and compute allocation.
- Implement methods for extending effective context length.
- Build evaluations to distinguish real capability improvements from overfitting.
- Measure how mid-training interventions scale with compute and data.
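To make the multi-stage schedule work above concrete, here is a minimal illustrative sketch of one common pattern: linear warmup, then cosine decay, then a constant floor. All names and numbers (`lr_at_step`, the step counts, the peak and final rates) are hypothetical examples, not a prescription from this role.

```python
import math

def lr_at_step(step: int, warmup_steps: int, decay_steps: int,
               peak_lr: float, final_lr: float) -> float:
    """Three-stage schedule: linear warmup -> cosine decay -> constant floor."""
    if step < warmup_steps:
        # Stage 1: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + decay_steps:
        # Stage 2: cosine-anneal from peak_lr down to final_lr.
        progress = (step - warmup_steps) / decay_steps
        return final_lr + (peak_lr - final_lr) * 0.5 * (1 + math.cos(math.pi * progress))
    # Stage 3: hold at the final (floor) learning rate.
    return final_lr

# Example: 100 warmup steps, 900 decay steps, 3e-4 peak, 3e-5 floor.
print(lr_at_step(0, 100, 900, 3e-4, 3e-5))     # start of warmup
print(lr_at_step(100, 100, 900, 3e-4, 3e-5))   # peak, start of decay
print(lr_at_step(2000, 100, 900, 3e-4, 3e-5))  # past decay: floor
```

In practice a framework scheduler (e.g. chaining built-in schedulers in PyTorch) would replace this hand-rolled function; the sketch only shows the shape of the schedule being tuned.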
Requirements
- Deep familiarity with the LLM training pipeline from pre-training to post-training.
- Hands-on experience with continual pre-training and late-stage data mixing.
- Strong intuition for data quality and curation at scale.
- Experience developing or evaluating synthetic data pipelines.
- Proficiency in Python and deep learning frameworks such as PyTorch.
- Strong fundamentals in optimization, statistics, and ML theory.
- A track record of original contributions in the field of AI.
Benefits
- Work in a small, highly selective team where research and product development are closely integrated.
- Access to large compute resources with training jobs running on thousands of GPUs.
- An environment that rewards speed, autonomy, and technical depth.
