Toronto, Canada +3 more
Senior / Staff+
H1B Sponsor
Responsibilities
- Maintain large-scale pipelines for processing web corpora.
- Work on filtering and quality-scoring systems to identify high-value web documents.
- Analyze web data composition across domains, languages, and time periods.
- Develop and maintain high-performance deduplication pipelines.
- Collaborate with cross-functional teams to ensure data pipelines meet model demands.
Requirements
- Strong software engineering skills with proficiency in Python.
- Experience building data pipelines.
- Familiarity with data processing frameworks like Apache Spark or Pandas.
- Experience working with large-scale web datasets.
- Knowledge of data quality assessment techniques.
Benefits
- An open and inclusive culture and work environment.
- Weekly lunch stipend, in-office lunches, and snacks.
- Full health and dental benefits, including mental health support.
- 100% parental leave top-up for up to 6 months.
- Personal enrichment benefits for arts, culture, fitness, and workspace improvement.
- Remote-flexible work options with offices in major cities.
- 6 weeks of vacation (30 working days).
Categories
AI & ML, Data Engineering
