📌 Question
"Design a scalable web crawler that can continuously crawl and index millions of web pages."
This interview question evaluates your understanding of distributed processing, queuing, URL deduplication, failure recovery, and respect for web protocol standards such as robots.txt. It’s frequently asked at companies like Google, Meta, and ByteDance.
✅ Solution
1. Clarify Requirements
Functional Requirements:
- Start with a set of seed URLs
- Fetch HTML content and extract links
- Continuously discover and crawl new pages
- Avoid crawling the same page multiple times
- Respect robots.txt and polite crawling policies
- Store crawled content and metadata
Non-Functional Requirements:
- High throughput (millions of pages/day)
- Fault-tolerant and resumable
- Scalable across multiple machines
- Configurable crawl depth, domain scope, and rate limits
2. Key Components of the Crawler
- URL Frontier Queue: Holds URLs to be crawled
- Fetcher Workers: Retrieve pages over HTTP
- Parser: Extracts links, metadata, and content from HTML
- Duplicate Detection Module: Filters URLs or content already crawled
- Robots.txt Manager: Checks rules before crawling a domain
- Storage System: Persists page content and metadata
- Scheduler: Respects rate limits and domain politeness
- Monitoring & Logs: Tracks errors, queue size, throughput
3. System Architecture Overview
- Seed URLs are placed in the frontier queue
- Fetcher workers pull from the queue and download web pages
- Pages are parsed for links and content
- Valid links are normalized, deduplicated, and added back to the queue
- Parsed content is stored for indexing and further use
- Scheduler manages priority and rate limiting per domain
This decouples crawling, parsing, and scheduling for better scalability.
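A minimal, single-process sketch of this loop can make the flow concrete. The in-memory deque and set below are stand-ins for the distributed frontier and dedup store, and `fetch`, `parse`, and `store` are injected placeholders for the components described above:

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(seed_urls, fetch, parse, store, max_pages=1000):
    """Core crawl loop: fetch -> parse -> dedupe -> enqueue.
    fetch/parse/store are injected so each stage can scale independently."""
    frontier = deque(seed_urls)   # stand-in for the distributed frontier queue
    seen = set(seed_urls)         # stand-in for the dedup store
    crawled = 0

    while frontier and crawled < max_pages:
        url = frontier.popleft()
        html = fetch(url)                  # fetcher worker
        if html is None:                   # fetch failed; scheduler would retry later
            continue
        links, content = parse(url, html)  # parser / link extractor
        store(url, content)                # storage system
        for link in links:
            link, _ = urldefrag(urljoin(url, link))  # resolve relative links, drop fragments
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        crawled += 1
```

In the real system each of these calls crosses a process or network boundary, which is exactly why the stages are kept decoupled.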
4. URL Frontier Design
- Implement as a distributed queue with per-domain subqueues
- Prioritize breadth-first crawling with depth counters
- Normalize URLs to avoid redundant entries (e.g., trailing slashes, www. prefixes)
- Use a Bloom filter or Redis set to track seen URLs
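As an illustration, URL normalization plus a Redis-backed seen-set might look like the sketch below (assuming the redis-py client; a Bloom filter is a drop-in alternative that uses less memory at the cost of occasional false positives):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different forms dedupe to one entry."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]            # treat www.example.com same as example.com
    path = path.rstrip("/") or "/"     # drop trailing slash, keep bare root "/"
    return urlunsplit((scheme.lower(), netloc, path, query, ""))  # drop fragment

# Seen-URL check backed by a Redis set (assumes the `redis` client package).
import redis

r = redis.Redis(host="localhost", port=6379)

def is_new_url(url: str) -> bool:
    """SADD returns 1 only if the member was not already in the set."""
    return r.sadd("crawler:seen_urls", normalize_url(url)) == 1
```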
5. Fetcher Workers
- Respect robots.txt rules using a parser
- Rate limit requests per domain to avoid getting blocked
- Retry failed requests with exponential backoff
- Set timeouts and user-agent headers
Workers should be stateless and horizontally scalable.
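A hedged sketch of one fetcher worker, assuming the requests library and the standard-library robots.txt parser (a production crawler would cache robots.txt per domain instead of re-fetching it for every URL):

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit, urlunsplit

import requests  # assumed HTTP client; any equivalent works

USER_AGENT = "MyCrawler/1.0 (+https://example.com/bot)"  # hypothetical bot identity

def allowed_by_robots(url: str) -> bool:
    """Check the host's robots.txt before fetching."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        return True  # policy choice: treat an unreachable robots.txt as allow
    return rp.can_fetch(USER_AGENT, url)

def fetch(url: str, max_retries: int = 3, timeout: int = 10):
    """Fetch a page with a timeout, a user-agent header, and exponential backoff."""
    if not allowed_by_robots(url):
        return None
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=timeout,
                                headers={"User-Agent": USER_AGENT})
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass                      # network error; fall through to backoff
        time.sleep(2 ** attempt)      # 1s, 2s, 4s between retries
    return None
```

Because the worker keeps no state between URLs, any number of copies can run behind the frontier queue.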
6. Parser and Link Extractor
- Use an HTML parser to extract:
  - Internal/external links
  - Titles, metadata, and headings
  - Canonical URLs and nofollow hints
- Normalize and validate URLs before re-inserting them into the queue
- Filter out non-HTML content and duplicate URLs
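One possible parser, assuming BeautifulSoup (beautifulsoup4); the function name and returned fields are illustrative:

```python
from urllib.parse import urljoin, urldefrag

from bs4 import BeautifulSoup  # assumed HTML parser

def parse_page(base_url: str, html: str) -> dict:
    """Extract links and basic metadata from an HTML page."""
    soup = BeautifulSoup(html, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""
    canonical = soup.find("link", rel="canonical")
    canonical_url = canonical["href"] if canonical and canonical.has_attr("href") else base_url

    links = []
    for a in soup.find_all("a", href=True):
        if "nofollow" in (a.get("rel") or []):   # honor nofollow hints
            continue
        absolute, _frag = urldefrag(urljoin(base_url, a["href"]))
        if absolute.startswith(("http://", "https://")):  # skip mailto:, javascript:, etc.
            links.append(absolute)

    return {
        "title": title,
        "canonical_url": canonical_url,
        "links": links,
        "text": soup.get_text(" ", strip=True),
    }
```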
7. Storage and Indexing
- Store crawled content in a document store (e.g., MongoDB, Elasticsearch)
- Use separate tables/collections for raw HTML, parsed text, and metadata
- Add timestamps, crawl depth, and source URL for tracking
- Compress old documents or move to cold storage
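A rough storage sketch, assuming MongoDB via pymongo; the database, collection names, and document fields are hypothetical but follow the split between raw HTML, parsed text, and metadata described above:

```python
import gzip
from datetime import datetime, timezone

from pymongo import MongoClient  # assumed document store client

client = MongoClient("mongodb://localhost:27017")
db = client["crawler"]  # hypothetical database name

def store_page(url: str, raw_html: str, parsed: dict, depth: int, source_url: str) -> None:
    """Persist raw HTML and parsed content separately, with crawl metadata."""
    now = datetime.now(timezone.utc)
    db.raw_html.insert_one({
        "url": url,
        "html_gz": gzip.compress(raw_html.encode("utf-8")),  # compress raw payloads
        "fetched_at": now,
    })
    db.pages.insert_one({
        "url": url,
        "title": parsed.get("title"),
        "text": parsed.get("text"),
        "canonical_url": parsed.get("canonical_url"),
        "crawl_depth": depth,
        "source_url": source_url,   # page on which this link was discovered
        "fetched_at": now,
    })
```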
8. Failure Handling and Resumability
- Periodically checkpoint queue states
- Store error logs for retriable failures
- Allow replay of failed jobs or domains
- Monitor fetch success/failure rates
Design should be resilient to restarts, crashes, and flaky domains.
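On a single node, checkpointing can be as simple as an atomic JSON snapshot of the frontier and seen set, as in the sketch below; in a distributed deployment, a durable broker (e.g., Kafka) or Redis would hold this state instead:

```python
import json
import os
import tempfile

def checkpoint(frontier, seen, path="crawler_checkpoint.json") -> None:
    """Atomically snapshot the frontier and seen URLs so a restart can resume."""
    state = {"frontier": list(frontier), "seen": list(seen)}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint

def restore(path="crawler_checkpoint.json"):
    """Load the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return [], set()
    with open(path) as f:
        state = json.load(f)
    return state["frontier"], set(state["seen"])
```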
9. Politeness and Respect
- Parse and enforce robots.txt rules
- Limit concurrent requests per domain/IP
- Use randomized delays and throttling
- Implement IP rotation if accessing many domains
This ensures compliance with website policies and avoids bans.
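One way to enforce per-domain politeness is a small rate limiter with jittered delays, sketched below; a full scheduler would also handle priorities and per-domain concurrency caps:

```python
import random
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    """Enforce a minimum (jittered) delay between requests to the same domain."""

    def __init__(self, min_delay: float = 1.0, jitter: float = 0.5):
        self.min_delay = min_delay
        self.jitter = jitter
        self.next_allowed = {}  # domain -> earliest time the next request may start

    def wait(self, url: str) -> None:
        domain = urlsplit(url).netloc.lower()
        now = time.monotonic()
        ready_at = self.next_allowed.get(domain, 0.0)
        if now < ready_at:
            time.sleep(ready_at - now)           # block until the domain has cooled down
        delay = self.min_delay + random.uniform(0, self.jitter)
        self.next_allowed[domain] = time.monotonic() + delay

limiter = DomainRateLimiter(min_delay=1.0)
limiter.wait("https://example.com/page")  # call before every fetch to that domain
```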
10. Trade-Offs and Considerations
| Concern | Trade-Off Options |
|---|---|
| Crawl Speed | More workers = faster, but higher risk of bans |
| URL Deduplication | Bloom filter (low memory) vs. persistent hash storage |
| Storage Format | Raw HTML vs. structured text vs. extracted metadata |
| Scope Limiting | Global crawl vs. single-domain crawl |
| Real-Time Updates | Recrawl strategy: freshness vs. efficiency |
11. Monitoring & Metrics
- Track:
  - Queue size and crawl rate
  - Failure rate and response codes
  - Top domains by crawl volume
  - Time since last fetch (for freshness)
Expose logs and metrics for alerting and debugging.
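A minimal in-process collector for these metrics might look like the following; a real deployment would export the counters to a monitoring system rather than keep them in memory:

```python
import time
from collections import Counter
from typing import Optional

class CrawlMetrics:
    """In-process counters that a metrics exporter could scrape or log."""

    def __init__(self):
        self.status_codes = Counter()      # responses by HTTP status
        self.pages_per_domain = Counter()  # top domains by crawl volume
        self.failures = 0
        self.last_fetch_at = None

    def record_fetch(self, domain: str, status_code: Optional[int]) -> None:
        self.last_fetch_at = time.time()
        if status_code is None:
            self.failures += 1
            return
        self.status_codes[status_code] += 1
        self.pages_per_domain[domain] += 1

    def snapshot(self) -> dict:
        """Summary suitable for periodic logging or an alerting hook."""
        return {
            "failures": self.failures,
            "status_codes": dict(self.status_codes),
            "top_domains": self.pages_per_domain.most_common(5),
            "seconds_since_last_fetch": (
                time.time() - self.last_fetch_at if self.last_fetch_at else None
            ),
        }
```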
12. What Interviewers Look For
- Understanding of distributed queuing and crawl scheduling
- Respect for real-world constraints like robots.txt
- Ability to handle fault tolerance and deduplication
- Awareness of performance and politeness trade-offs
- System modularity and horizontal scalability
✅ Summary
A well-designed web crawler requires thoughtful design across multiple axes: networking, scheduling, data storage, deduplication, and respect for external constraints. Key elements include:
- Distributed URL frontier
- Stateless fetcher workers
- HTML parsing and content extraction
- Deduplication and crawl scheduling
- Resilience and respectful rate-limiting