Concurrent Web Crawler
Build a concurrent web crawler that discovers and processes links using worker pools and depth-limited traversal.
Recommended Prerequisites
Complete these exercises first for the best learning experience:
- Worker Pool
Background
Web crawlers discover and process web pages by:
- Starting from seed URLs
- Extracting links from pages
- Following links to discover new pages
- Avoiding duplicate visits
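Before any concurrency is involved, that loop can be written sequentially. The sketch below is one minimal version; fetchLinks is a hypothetical stand-in for real link extraction and is not part of the exercise:

```go
package main

import "fmt"

// fetchLinks is a hypothetical stand-in for real link extraction; it
// returns a canned set of outgoing links for a handful of pages.
func fetchLinks(u string) []string {
	return map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}[u]
}

// crawl is a sequential breadth-first traversal from a seed URL.
// The visited map is what prevents cycles and duplicate work.
func crawl(seed string) {
	visited := map[string]bool{}
	queue := []string{seed}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if visited[u] {
			continue
		}
		visited[u] = true
		fmt.Println("visited:", u)
		queue = append(queue, fetchLinks(u)...)
	}
}

func main() {
	crawl("https://example.com")
}
```

The visited map is what breaks cycles; the rest of the exercise is about doing this same work with multiple goroutines.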
Challenges:
- Deduplication: Track visited URLs to avoid cycles
- Concurrency: Use workers to crawl multiple pages simultaneously
- Depth Control: Limit how deep the crawler goes
- Thread Safety: Safely share visited URL set across workers
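Deduplication and thread safety typically meet in one small type: a mutex-protected set whose mark operation reports whether the URL was new. This is only a sketch, and the names visitedSet and markVisited are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet is a mutex-protected set of URLs shared by all workers.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]bool)}
}

// markVisited records the URL and reports whether this call was the
// first to see it, so exactly one worker processes each URL.
func (v *visitedSet) markVisited(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	v := newVisitedSet()
	fmt.Println(v.markVisited("https://example.com")) // true: first visit
	fmt.Println(v.markVisited("https://example.com")) // false: duplicate
}
```

Doing the check and the insert under a single lock matters: a separate "already visited?" query followed by an insert would leave a window in which two workers both decide to crawl the same URL.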
Your Task
Build a crawler with:
- Worker pool for concurrent crawling
- URL deduplication (visited tracking)
- Depth-limited traversal
- Thread-safe operations
- Simulated link extraction (no actual HTTP for playground)
Estimated time: ~60 min · Difficulty: hard
Build a worker pool-based web crawler with deduplication and depth limiting.
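One possible shape for a solution is sketched below, assuming simulated link extraction: a fixed pool of workers reads URL-plus-depth jobs from a channel, a mutex guards the shared visited map, and a sync.WaitGroup counts outstanding jobs so the crawl knows when it is finished. All names here are illustrative rather than a required API:

```go
package main

import (
	"fmt"
	"sync"
)

// job is a URL to crawl together with its distance from the seed.
type job struct {
	url   string
	depth int
}

// crawler holds the state shared by all workers.
type crawler struct {
	mu       sync.Mutex      // guards visited
	visited  map[string]bool // URLs already handed to a worker
	jobs     chan job        // shared work queue
	pending  sync.WaitGroup  // counts jobs enqueued but not yet finished
	maxDepth int
}

// fetchLinks simulates link extraction (no real HTTP in the playground).
func fetchLinks(u string) []string {
	return []string{u + "/a", u + "/b"}
}

// worker pulls jobs until the channel is closed.
func (c *crawler) worker() {
	for j := range c.jobs {
		c.process(j)
		c.pending.Done()
	}
}

func (c *crawler) process(j job) {
	if j.depth > c.maxDepth {
		return
	}
	c.mu.Lock()
	seen := c.visited[j.url]
	c.visited[j.url] = true
	c.mu.Unlock()
	if seen {
		return
	}
	fmt.Printf("depth %d: %s\n", j.depth, j.url)
	for _, link := range fetchLinks(j.url) {
		c.enqueue(job{url: link, depth: j.depth + 1})
	}
}

// enqueue registers a new job and sends it from a separate goroutine so a
// worker that discovered links never blocks on the unbuffered channel.
func (c *crawler) enqueue(j job) {
	c.pending.Add(1)
	go func() { c.jobs <- j }()
}

func main() {
	c := &crawler{
		visited:  make(map[string]bool),
		jobs:     make(chan job),
		maxDepth: 2,
	}
	for i := 0; i < 3; i++ { // fixed-size worker pool
		go c.worker()
	}
	c.enqueue(job{url: "https://example.com", depth: 0})
	c.pending.Wait() // every discovered job has been processed
	close(c.jobs)    // lets the workers exit
}
```

Sending newly discovered jobs from a separate goroutine keeps a worker from blocking on the same unbuffered channel it is draining, and waiting on the pending counter before closing the channel is what lets the pool shut down cleanly.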
Key Concepts
- Worker Pool Pattern: Fixed number of workers processing from shared queue
- BFS Traversal: Breadth-first search controlled by depth tracking
- URL Deduplication: Mutex-protected map prevents visiting same URL twice
- Concurrent Data Structures: Thread-safe visited set shared across workers
- Dynamic Work Generation: Workers add new jobs as they discover links
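Dynamic work generation is also what makes termination tricky, so an alternative structure is worth noting: crawl one depth level at a time, letting the links collected at the current level form the next one. The sketch below trades the fixed worker pool for one goroutine per URL in each level; crawlByLevel and fetchLinks are illustrative names, not part of the exercise:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchLinks simulates link extraction (no real HTTP in the playground).
func fetchLinks(u string) []string {
	return []string{u + "/x", u + "/y"}
}

// crawlByLevel crawls one depth level at a time: every new URL at the
// current depth is fetched concurrently, and the links collected become
// the next level. Stopping after maxDepth levels gives depth control.
func crawlByLevel(seed string, maxDepth int) {
	visited := map[string]bool{}
	level := []string{seed}

	for depth := 0; depth <= maxDepth && len(level) > 0; depth++ {
		var (
			wg   sync.WaitGroup
			mu   sync.Mutex // guards next, which all goroutines append to
			next []string
		)
		for _, u := range level {
			if visited[u] { // visited is never touched inside a goroutine
				continue
			}
			visited[u] = true
			wg.Add(1)
			go func(u string, d int) {
				defer wg.Done()
				fmt.Printf("depth %d: %s\n", d, u)
				links := fetchLinks(u)
				mu.Lock()
				next = append(next, links...)
				mu.Unlock()
			}(u, depth)
		}
		wg.Wait() // the whole level finishes before the next one starts
		level = next
	}
}

func main() {
	crawlByLevel("https://example.com", 2)
}
```

Because the visited map is only touched by the coordinating loop, never inside a goroutine, it needs no lock in this variant; the mutex only guards the shared next slice.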
Extensions
- Respect robots.txt
- Implement rate limiting per domain
- Add request timeout and retry logic
- Extract and store page content
- Build inverted index for search
- Add politeness delay between requests to same domain
- Implement URL normalization (handle trailing slashes, query params)
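For the URL normalization extension, Go's net/url package handles the parsing; what counts as "the same URL" is a policy choice. The sketch below shows one hypothetical policy (lowercase the host, drop the fragment and query string, trim a trailing slash); real crawlers often keep and sort meaningful query parameters instead of dropping them:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper that maps trivially different
// forms of a URL to one canonical string, so the visited set treats
// them as the same page.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)          // hosts are case-insensitive
	u.Fragment = ""                           // fragments never reach the server
	u.RawQuery = ""                           // simplistic: drop all query params
	u.Path = strings.TrimSuffix(u.Path, "/")  // treat /docs/ and /docs alike
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://Example.com/docs/",
		"https://example.com/docs?utm_source=feed",
		"https://example.com/docs#intro",
	} {
		n, err := normalizeURL(raw)
		fmt.Println(n, err) // all three normalize to the same URL
	}
}
```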