Staff Research Engineer, Pre-training Data

27 days ago

Remote, United States

Staff+

H-1B Sponsor

Base Salary

$230k - $322k/yr

Responsibilities

Architect and implement high-throughput, deterministic data sampling systems for distributed training clusters.
Design and execute dynamic curriculum learning strategies to adjust data distributions during training.
Engineer logic for serializing Reddit’s complex conversational trees into optimal training contexts.
Formulate and validate statistical hypotheses regarding data mixtures to minimize bias.
Design automated pipelines for PII redaction and quality deduplication.
Translate theoretical sampling insights into robust, low-latency production infrastructure.
Mentor senior engineers and researchers on system design and performance optimization.

Qualifications

8+ years of software engineering experience focused on machine learning infrastructure or LLM pre-training.
Expert proficiency in Python and distributed data processing frameworks.
Experience handling unstructured and semi-structured data at scale.
Strong mathematical foundation in probability, statistics, and sampling theory.
Deep understanding of pre-training dynamics and data quality impact on model performance.
Experience with graph data structures or serializing conversation trees is highly valued.

Apache Spark, Python, PyTorch, Rust