Concurrent Web Crawler
Build a concurrent web crawler that discovers and processes links using worker pools and depth-limited traversal.
Recommended Prerequisites
Complete these exercises first for the best learning experience:
- Worker Pool
Background
Web crawlers discover and process web pages by:
- Starting from seed URLs
- Extracting links from pages
- Following links to discover new pages
- Avoiding duplicate visits
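Before any concurrency is involved, that loop can be written sequentially. The sketch below is one minimal version; fetchLinks is a hypothetical stand-in for real link extraction and is not part of the exercise:

```go
package main

import "fmt"

// fetchLinks is a hypothetical stand-in for real link extraction; it
// returns a canned set of outgoing links for a handful of pages.
func fetchLinks(u string) []string {
	return map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
	}[u]
}

// crawl is a sequential breadth-first traversal from a seed URL.
// The visited map is what prevents cycles and duplicate work.
func crawl(seed string) {
	visited := map[string]bool{}
	queue := []string{seed}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if visited[u] {
			continue
		}
		visited[u] = true
		fmt.Println("visited:", u)
		queue = append(queue, fetchLinks(u)...)
	}
}

func main() {
	crawl("https://example.com")
}
```

The visited map is what breaks cycles; the rest of the exercise is about doing this same work with multiple goroutines.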
Challenges:
- Deduplication: Track visited URLs to avoid cycles
- Concurrency: Use workers to crawl multiple pages simultaneously
- Depth Control: Limit how deep the crawler goes
- Thread Safety: Safely share visited URL set across workers
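Deduplication and thread safety typically meet in one small type: a mutex-protected set whose mark operation reports whether the URL was new. This is only a sketch, and the names visitedSet and markVisited are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet is a mutex-protected set of URLs shared by all workers.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]bool)}
}

// markVisited records the URL and reports whether this call was the
// first to see it, so exactly one worker processes each URL.
func (v *visitedSet) markVisited(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

func main() {
	v := newVisitedSet()
	fmt.Println(v.markVisited("https://example.com")) // true: first visit
	fmt.Println(v.markVisited("https://example.com")) // false: duplicate
}
```

Doing the check and the insert under a single lock matters: a separate "already visited?" query followed by an insert would leave a window in which two workers both decide to crawl the same URL.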
Your Task
Build a crawler with:
- Worker pool for concurrent crawling
- URL deduplication (visited tracking)
- Depth-limited traversal
- Thread-safe operations
- Simulated link extraction (no actual HTTP for playground)
Estimated time: ~60 min · Difficulty: hard
Build a worker pool-based web crawler with deduplication and depth limiting.
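One possible shape for a solution is sketched below, assuming simulated link extraction: a fixed pool of workers reads URL-plus-depth jobs from a channel, a mutex guards the shared visited map, and a sync.WaitGroup counts outstanding jobs so the crawl knows when it is finished. All names here are illustrative rather than a required API:

```go
package main

import (
	"fmt"
	"sync"
)

// job is a URL to crawl together with its distance from the seed.
type job struct {
	url   string
	depth int
}

// crawler holds the state shared by all workers.
type crawler struct {
	mu       sync.Mutex      // guards visited
	visited  map[string]bool // URLs already handed to a worker
	jobs     chan job        // shared work queue
	pending  sync.WaitGroup  // counts jobs enqueued but not yet finished
	maxDepth int
}

// fetchLinks simulates link extraction (no real HTTP in the playground).
func fetchLinks(u string) []string {
	return []string{u + "/a", u + "/b"}
}

// worker pulls jobs until the channel is closed.
func (c *crawler) worker() {
	for j := range c.jobs {
		c.process(j)
		c.pending.Done()
	}
}

func (c *crawler) process(j job) {
	if j.depth > c.maxDepth {
		return
	}
	c.mu.Lock()
	seen := c.visited[j.url]
	c.visited[j.url] = true
	c.mu.Unlock()
	if seen {
		return
	}
	fmt.Printf("depth %d: %s\n", j.depth, j.url)
	for _, link := range fetchLinks(j.url) {
		c.enqueue(job{url: link, depth: j.depth + 1})
	}
}

// enqueue registers a new job and sends it from a separate goroutine so a
// worker that discovered links never blocks on the unbuffered channel.
func (c *crawler) enqueue(j job) {
	c.pending.Add(1)
	go func() { c.jobs <- j }()
}

func main() {
	c := &crawler{
		visited:  make(map[string]bool),
		jobs:     make(chan job),
		maxDepth: 2,
	}
	for i := 0; i < 3; i++ { // fixed-size worker pool
		go c.worker()
	}
	c.enqueue(job{url: "https://example.com", depth: 0})
	c.pending.Wait() // every discovered job has been processed
	close(c.jobs)    // lets the workers exit
}
```

Sending newly discovered jobs from a separate goroutine keeps a worker from blocking on the same unbuffered channel it is draining, and waiting on the pending counter before closing the channel is what lets the pool shut down cleanly.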
Key Concepts
- Worker Pool Pattern: Fixed number of workers processing from shared queue
- BFS Traversal: Breadth-first search controlled by depth tracking
- URL Deduplication: Mutex-protected map prevents visiting same URL twice
- Concurrent Data Structures: Thread-safe visited set shared across workers
- Dynamic Work Generation: Workers add new jobs as they discover links
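Dynamic work generation is also what makes termination tricky, so an alternative structure is worth noting: crawl one depth level at a time, letting the links collected at the current level form the next one. The sketch below trades the fixed worker pool for one goroutine per URL in each level; crawlByLevel and fetchLinks are illustrative names, not part of the exercise:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchLinks simulates link extraction (no real HTTP in the playground).
func fetchLinks(u string) []string {
	return []string{u + "/x", u + "/y"}
}

// crawlByLevel crawls one depth level at a time: every new URL at the
// current depth is fetched concurrently, and the links collected become
// the next level. Stopping after maxDepth levels gives depth control.
func crawlByLevel(seed string, maxDepth int) {
	visited := map[string]bool{}
	level := []string{seed}

	for depth := 0; depth <= maxDepth && len(level) > 0; depth++ {
		var (
			wg   sync.WaitGroup
			mu   sync.Mutex // guards next, which all goroutines append to
			next []string
		)
		for _, u := range level {
			if visited[u] { // visited is never touched inside a goroutine
				continue
			}
			visited[u] = true
			wg.Add(1)
			go func(u string, d int) {
				defer wg.Done()
				fmt.Printf("depth %d: %s\n", d, u)
				links := fetchLinks(u)
				mu.Lock()
				next = append(next, links...)
				mu.Unlock()
			}(u, depth)
		}
		wg.Wait() // the whole level finishes before the next one starts
		level = next
	}
}

func main() {
	crawlByLevel("https://example.com", 2)
}
```

Because the visited map is only touched by the coordinating loop, never inside a goroutine, it needs no lock in this variant; the mutex only guards the shared next slice.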
Extensions
- Respect robots.txt
- Implement rate limiting per domain
- Add request timeout and retry logic
- Extract and store page content
- Build inverted index for search
- Add politeness delay between requests to same domain
- Implement URL normalization (handle trailing slashes, query params)
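For the URL normalization extension, Go's net/url package handles the parsing; what counts as "the same URL" is a policy choice. The sketch below shows one hypothetical policy (lowercase the host, drop the fragment and query string, trim a trailing slash); real crawlers often keep and sort meaningful query parameters instead of dropping them:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL is a hypothetical helper that maps trivially different
// forms of a URL to one canonical string, so the visited set treats
// them as the same page.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)          // hosts are case-insensitive
	u.Fragment = ""                           // fragments never reach the server
	u.RawQuery = ""                           // simplistic: drop all query params
	u.Path = strings.TrimSuffix(u.Path, "/")  // treat /docs/ and /docs alike
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://Example.com/docs/",
		"https://example.com/docs?utm_source=feed",
		"https://example.com/docs#intro",
	} {
		n, err := normalizeURL(raw)
		fmt.Println(n, err) // all three normalize to the same URL
	}
}
```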