Designing Common Systems: Design a Web Crawler

📌 Question

"Design a scalable web crawler that can continuously crawl and index millions of web pages."

This interview question evaluates your understanding of distributed processing, queuing, URL deduplication, failure recovery, and adherence to web protocol standards such as robots.txt. It’s frequently asked at companies like Google, Meta, and ByteDance.


✅ Solution

1. Clarify Requirements

Functional Requirements:

  • Start with a set of seed URLs
  • Fetch HTML content and extract links
  • Continuously discover and crawl new pages
  • Avoid crawling the same page multiple times
  • Respect robots.txt and polite crawling policies
  • Store crawled content and metadata

Non-Functional Requirements:

  • High throughput (millions of pages/day)
  • Fault-tolerant and resumable
  • Scalable across multiple machines
  • Configurable crawl depth, domain scope, and rate limits

2. Key Components of the Crawler

  • URL Frontier Queue: Holds URLs to be crawled
  • Fetcher Workers: Retrieve pages over HTTP
  • Parser: Extracts links, metadata, and content from HTML
  • Duplicate Detection Module: Filters URLs or content already crawled
  • Robots.txt Manager: Checks rules before crawling a domain
  • Storage System: Persists page content and metadata
  • Scheduler: Respects rate limits and domain politeness
  • Monitoring & Logs: Tracks errors, queue size, throughput

3. System Architecture Overview

  1. Seed URLs are placed in the frontier queue
  2. Fetcher workers pull from the queue and download web pages
  3. Pages are parsed for links and content
  4. Valid links are normalized, deduplicated, and added back to the queue
  5. Parsed content is stored for indexing and further use
  6. Scheduler manages priority and rate limiting per domain

Decoupling fetching, parsing, and scheduling lets each stage scale independently. The loop below sketches this flow in a single process.
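
As a concrete illustration, here is a minimal single-process sketch of that loop in Python. The `fetch_page`, `extract_links`, and `store` callables are placeholders for the fetcher, parser, and storage components described above; in a real system the frontier and seen-set would be distributed services rather than in-memory structures.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, store, max_pages=1000):
    """Single-process sketch of the fetch -> parse -> enqueue loop."""
    frontier = deque(seed_urls)   # stands in for the distributed URL frontier
    seen = set(seed_urls)         # stands in for the duplicate-detection module
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)    # fetcher worker (robots.txt, retries, timeouts)
        if html is None:
            continue
        store(url, html)          # storage system
        for link in extract_links(url, html):   # parser / link extractor
            if link not in seen:  # dedup before re-inserting into the frontier
                seen.add(link)
                frontier.append(link)
        crawled += 1
```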


4. URL Frontier Design

  • Implement as a distributed queue with per-domain subqueues
  • Prioritize breadth-first crawling with depth counters
  • Normalize URLs to avoid redundant entries (e.g., trailing slashes, "www." prefixes, URL fragments)
  • Use a Bloom filter or Redis set to track already-seen URLs (normalization and deduplication are sketched below)
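
A rough sketch of both steps, assuming a shared Redis set for exact deduplication (the key name and normalization rules are illustrative; a Bloom filter would use far less memory in exchange for a small false-positive rate):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different forms collapse to one entry."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                       # treat www. and bare host as the same site
    path = parts.path.rstrip("/") or "/"      # drop trailing slash, keep the root path
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))  # drop fragment

def is_new(redis_client, url: str) -> bool:
    """Record a URL in a shared Redis set; True only the first time it is seen."""
    return redis_client.sadd("crawler:seen_urls", normalize_url(url)) == 1
```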

5. Fetcher Workers

  • Respect robots.txt rules using a parser
  • Rate limit requests per domain to avoid getting blocked
  • Retry failed requests with exponential backoff
  • Set timeouts and user-agent headers

Workers should be stateless and horizontally scalable; a minimal polite fetch routine is sketched below.
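
This sketch assumes the `requests` library and the standard-library robots.txt parser. The user-agent string, retry counts, and the allow-on-unreadable-robots policy are illustrative choices, not requirements:

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

import requests

USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot)"   # placeholder identity

_robots_cache: dict[str, robotparser.RobotFileParser] = {}     # host -> parsed rules

def allowed_by_robots(url: str) -> bool:
    host = urlsplit(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            return True        # policy choice: treat an unreadable robots.txt as "allow"
        _robots_cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def fetch(url: str, retries: int = 3, timeout: int = 10):
    """Polite fetch: robots.txt check, timeout, UA header, exponential backoff."""
    if not allowed_by_robots(url):
        return None
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout, headers={"User-Agent": USER_AGENT})
            if resp.status_code == 200 and "text/html" in resp.headers.get("Content-Type", ""):
                return resp.text
            if resp.status_code in (429, 503):     # transient: back off and retry
                time.sleep(2 ** attempt)
                continue
            return None                            # permanent failure (404, 403, ...)
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None
```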


6. Parser and Link Extractor

  • Use HTML parsers to extract:
    • Internal/external links
    • Titles, metadata, headings
    • Canonical URLs and nofollow hints
  • Normalize and validate URLs before re-inserting into the queue
  • Filter out non-HTML content and already-seen URLs (a standard-library extractor is sketched below)
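
One way to do this with only the standard library is sketched below; BeautifulSoup or lxml are common production alternatives. The nofollow handling and scheme filtering mirror the bullets above:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from <a href> tags, skipping rel="nofollow"."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href or "nofollow" in (attrs.get("rel") or ""):
            return
        absolute = urljoin(self.base_url, href)          # resolve relative links
        if urlsplit(absolute).scheme in ("http", "https"):
            self.links.append(absolute)

def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```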

7. Storage and Indexing

  • Store crawled content in a document store (e.g., MongoDB, Elasticsearch)
  • Use separate tables/collections for raw HTML, parsed text, and metadata (an example record layout follows this list)
  • Add timestamps, crawl depth, and source URL for tracking
  • Compress old documents or move to cold storage
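
The field names below are one illustrative record layout, not a fixed schema; the commented pymongo call assumes a `pages` collection:

```python
import gzip
import hashlib
from datetime import datetime, timezone

def build_page_record(url: str, html: str, text: str, metadata: dict, depth: int) -> dict:
    """One possible document layout for a crawled page."""
    return {
        "_id": hashlib.sha256(url.encode()).hexdigest(),     # stable key per URL
        "url": url,
        "crawl_depth": depth,
        "fetched_at": datetime.now(timezone.utc),
        "raw_html_gz": gzip.compress(html.encode("utf-8")),  # compressed raw HTML
        "text": text,                                        # parsed visible text
        "metadata": metadata,                                # title, headings, canonical URL, ...
    }

# With pymongo, an upsert keeps one record per URL (collection name is assumed):
#   db.pages.replace_one({"_id": record["_id"]}, record, upsert=True)
```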

8. Failure Handling and Resumability

  • Periodically checkpoint queue states
  • Store error logs for retriable failures
  • Allow replay of failed jobs or domains
  • Monitor fetch success/failure rates

The design should be resilient to restarts, crashes, and flaky domains; a simple checkpointing loop is sketched below.
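
This sketch assumes the frontier exposes a `snapshot()` method and that a local JSON file is an acceptable stand-in; a production crawler would checkpoint the distributed queue and seen-set instead:

```python
import json
import os
import time

def checkpoint_loop(frontier, path="frontier_checkpoint.json", interval=60):
    """Periodically persist pending URLs so a crashed crawler can resume, not restart."""
    while True:
        snapshot = {
            "pending_urls": frontier.snapshot(),   # assumed method: URLs not yet fetched
            "saved_at": time.time(),
        }
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)                      # atomic swap: no half-written checkpoints
        time.sleep(interval)
```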


9. Politeness and Respect

  • Parse and enforce robots.txt rules
  • Limit concurrent requests per domain/IP
  • Use randomized delays and throttling
  • Implement IP rotation when crawling at large scale to avoid per-IP throttling

This keeps the crawler compliant with website policies and reduces the risk of being blocked. A per-domain rate limiter like the one sketched below enforces the delay.
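
The limiter assumes a fixed minimum delay per host; a production crawler would also honor any Crawl-delay directive found in robots.txt:

```python
import threading
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    """Enforces a minimum delay between requests to the same host."""
    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.next_slot: dict[str, float] = {}   # host -> earliest allowed fetch time
        self.lock = threading.Lock()

    def wait(self, url: str) -> None:
        host = urlsplit(url).netloc
        with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot.get(host, now))   # claim the next free slot
            self.next_slot[host] = slot + self.min_delay
        time.sleep(max(0.0, slot - now))                      # sleep until our slot arrives
```

Workers call `limiter.wait(url)` just before fetching, so concurrent requests to the same host are spaced out even when many workers share the limiter.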


10. Trade-Offs and Considerations

  • Crawl Speed: More workers = faster, but higher risk of bans
  • URL Deduplication: Bloom filter (low memory) vs. persistent hash storage
  • Storage Format: Raw HTML vs. structured text vs. extracted metadata
  • Scope Limiting: Global crawl vs. single-domain crawl
  • Real-Time Updates: Recrawl strategy trading freshness against efficiency

11. Monitoring & Metrics

  • Track:
    • Queue size and crawl rate
    • Failure rate and response codes
    • Top domains by crawl volume
    • Time since last fetch (for freshness)

Expose logs and metrics for alerting and debugging; a minimal metrics endpoint is sketched below.
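
One common approach, assuming the `prometheus_client` library (the metric names are illustrative):

```python
from prometheus_client import Counter, Gauge, start_http_server

PAGES_FETCHED = Counter("crawler_pages_fetched_total", "Pages fetched", ["status_code"])
FETCH_FAILURES = Counter("crawler_fetch_failures_total", "Fetches that exhausted retries")
FRONTIER_SIZE = Gauge("crawler_frontier_size", "URLs currently waiting in the frontier")

def record_fetch(status_code: int) -> None:
    PAGES_FETCHED.labels(status_code=str(status_code)).inc()

start_http_server(9100)   # exposes /metrics for scraping, dashboards, and alerts
```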


12. What Interviewers Look For

  • Understanding of distributed queuing and crawl scheduling
  • Respect for real-world constraints like robots.txt
  • Ability to handle fault tolerance and deduplication
  • Awareness of performance and politeness trade-offs
  • System modularity and horizontal scalability

✅ Summary

Building a scalable web crawler requires thoughtful design across multiple axes: networking, scheduling, data storage, deduplication, and respect for external constraints. Key elements include:

  • Distributed URL frontier
  • Stateless fetcher workers
  • HTML parsing and content extraction
  • Deduplication and crawl scheduling
  • Resilience and respectful rate-limiting