
Staff Research Engineer, Pre-training Data
27 days ago
Remote, United States
Staff+
H1B Sponsor
Base Salary
$230k - $322k/yr
Responsibilities
- Architect and implement high-throughput, deterministic data sampling systems for distributed training clusters.
- Design and execute dynamic curriculum learning strategies to adjust data distributions during training.
- Engineer logic for serializing Reddit’s complex conversational trees into optimal training contexts.
- Formulate and validate statistical hypotheses regarding data mixtures to minimize bias.
- Design automated pipelines for PII redaction and quality deduplication.
- Translate theoretical sampling insights into robust, low-latency production infrastructure.
- Mentor senior engineers and researchers on system design and performance optimization.
Requirements
- 8+ years of software engineering experience focused on machine learning infrastructure or LLM pre-training.
- Expert proficiency in Python and distributed data processing frameworks.
- Experience handling unstructured and semi-structured data at scale.
- Strong mathematical foundation in probability, statistics, and sampling theory.
- Deep understanding of pre-training dynamics and data quality impact on model performance.
- Experience with graph data structures or serializing conversation trees is highly valued.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs.
- 401k with Employer Match.
- Global Benefit programs that fit your lifestyle.
- Family Planning Support.
- Gender-Affirming Care.
- Mental Health & Coaching Benefits.
- Flexible Vacation & Paid Volunteer Time Off.
- Generous Paid Parental Leave.
Tech Stack
Apache Spark, Python, PyTorch, Rust
Categories
AI & ML, Data Engineering, Data Science