Research Engineer, Pretraining Scaling
Anthropic
5 months ago
San Francisco, CA, USA
Mid Level / Senior
H-1B Sponsor
Base Salary
$315k - $560k/yr
Responsibilities
- Own critical aspects of the production pretraining pipeline, including model operations and performance optimization.
- Debug and resolve complex issues across the full stack, from hardware errors to training dynamics.
- Design and run experiments to improve training efficiency and enhance model performance.
- Respond to on-call incidents during model launches, diagnosing problems quickly.
- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure.
- Add new capabilities to the training codebase, such as long context support.
- Collaborate closely with teammates across teams and locations.
- Document systems, debugging approaches, and lessons learned.
Requirements
- Hands-on experience training large language models, or expertise with JAX, TPUs, PyTorch, or large-scale distributed systems.
- Enjoy both research and engineering work, ideally with a 50/50 split.
- Are excited about being on-call for production systems and solving problems under pressure.
- Thrive on impactful work whose focus may shift day-to-day based on production needs.
- Excel at debugging complex problems across multiple layers of the stack.
- Communicate clearly and collaborate effectively across time zones.
- Are passionate about refining your craft as a research engineer.
- Care about the societal impacts of AI and responsible scaling.
Benefits
- Competitive compensation and benefits.
- Optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours.
- Collaborative office space.
Tech Stack
PyTorch
Categories
AI & ML, Backend, Data Science