
Member of Technical Staff - Web Crawl Engineer
Reflection
about 4 hours ago
Responsibilities
- Build and operate web-scale crawling infrastructure for data collection across billions of URLs.
- Design and optimize URL discovery, prioritization, scheduling, and crawl orchestration systems.
- Develop distributed crawlers that respect site constraints while acquiring content efficiently.
- Build systems for content extraction, rendering, parsing, and normalization across diverse web formats.
- Improve crawl coverage, freshness, efficiency, and quality through measurement and experimentation.
- Design infrastructure for large-scale recrawling, change detection, and incremental updates.
- Analyze crawl performance and web coverage to identify gaps and opportunities for improvement.
- Build observability, monitoring, and reliability systems for large-scale crawl operations.
- Debug production issues and enhance the performance and resilience of crawling infrastructure.
Requirements
- Experience building large-scale web crawling or internet-scale data collection systems.
- Strong understanding of crawling architectures and distributed crawl coordination.
- Experience with large-scale distributed systems built on frameworks such as Ray or Spark.
- Familiarity with content extraction, HTML parsing, and modern web technologies.
- Experience running systems that process petabyte-scale datasets.
- Strong systems engineering skills, including reliability and performance optimization.
- Experience designing experiments to improve crawl quality and efficiency.
- Excellent communication skills and ability to reason about system tradeoffs.
Benefits
- Top-tier compensation with salary and equity structured to retain talent.
- Comprehensive medical, dental, vision, life, and disability insurance.
- Fully paid parental leave for all new parents and financial support for family planning.
- Paid time off, relocation support, and additional perks for work-life balance.
- Daily lunch and dinner provided, along with regular off-sites and team celebrations.
Categories
AI & ML, Data Engineering