Reddit

Staff Research Engineer, Pre-training Data

27 days ago
Remote, United States
Staff+
H1B Sponsor

Base Salary

$230k - $322k/yr

Responsibilities

  • Architect and implement high-throughput, deterministic data sampling systems for distributed training clusters.
  • Design and execute dynamic curriculum learning strategies to adjust data distributions during training.
  • Engineer logic for serializing Reddit’s complex conversational trees into optimal training contexts.
  • Formulate and validate statistical hypotheses regarding data mixtures to minimize bias.
  • Design automated pipelines for PII redaction and quality deduplication.
  • Translate theoretical sampling insights into robust, low-latency production infrastructure.
  • Mentor senior engineers and researchers on system design and performance optimization.

Requirements

  • 8+ years of software engineering experience focused on machine learning infrastructure or LLM pre-training.
  • Expert proficiency in Python and distributed data processing frameworks.
  • Experience handling unstructured and semi-structured data at scale.
  • Strong mathematical foundation in probability, statistics, and sampling theory.
  • Deep understanding of pre-training dynamics and data quality impact on model performance.
  • Experience with graph data structures or serializing conversation trees is highly valued.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs.
  • 401k with Employer Match.
  • Global Benefit programs that fit your lifestyle.
  • Family Planning Support.
  • Gender-Affirming Care.
  • Mental Health & Coaching Benefits.
  • Flexible Vacation & Paid Volunteer Time Off.
  • Generous Paid Parental Leave.

Tech Stack

Apache Spark, Python, PyTorch, Rust

Categories

AI & ML, Data Engineering, Data Science