System Design distributed web crawler to crawl Billions of web pages | web crawler system design
Tech Dummies Narendra L
46 min, 1 sec
The video provides an in-depth explanation of designing a web crawler system, discussing its components, functions, storage solutions, and optimization strategies.
Summary
- The speaker introduces the concept of a web crawler, its uses, components, and the system design to efficiently crawl and store web pages.
- Key features of the crawler design include politeness, DNS queries, distributed crawling, priority crawling, and duplicate detection.
- The scale of the web pages to be crawled is discussed, estimating around fifty billion pages, and strategies for storing such large amounts of data are considered.
- The speaker explains how to ensure politeness and priority in crawling, how to detect updates and duplicates, and potential storage solutions for the crawled data.
- Details on how to implement a URL frontier to manage the crawling queue, prioritization, and handling of updates and duplicates are provided.
Chapter 1
![The speaker introduces web crawlers, their synonyms, and basic functions.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODM2LCJwdXIiOiJibG9iX2lkIn19--13af76205139a2a5f8195bdcbabb25c4f32fab10/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_20.jpg)
The speaker introduces web crawlers, their synonyms, and basic functions.
- Web crawlers are also known as spiders, bots, or simply crawlers.
- A web crawler is a framework, tool, or software used to collect web pages from the Internet for easy access and indexing.
- The crawler saves the collected web pages and recursively finds URLs within those pages to continue the process.
Chapter 2
![Different types of web crawlers and their applications are discussed.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODM4LCJwdXIiOiJibG9iX2lkIn19--863780a73abe18d96e6978a5315e983f809458b2/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_103.jpg)
Different types of web crawlers and their applications are discussed.
- There are various web crawlers with different applications, and individuals can build custom crawlers for specific purposes.
- Common use cases include search engines, copyright violation detection, keyword-based content finding, web malware detection, and web analytics.
- These functionalities cater to different industry needs such as providing search results, detecting unauthorized content use, monitoring market trends, and collecting data for machine learning models.
Chapter 3
![The speaker outlines the essential features to support in the web crawler.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODQwLCJwdXIiOiJibG9iX2lkIn19--edc152a274cca124eed718affc130cdd16fcf6bf/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_201.jpg)
The speaker outlines the essential features to support in the web crawler.
- Features include politeness (avoiding server overload), DNS queries, distributed crawling for scalability, priority crawling, and duplicate detection.
- Politeness and crawl prioritization are crucial for respecting website traffic and fetching relevant content efficiently (a minimal politeness check is sketched after this list).
- Duplicate detection is vital for avoiding unnecessary crawling of content already stored, saving resources and time.
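Politeness in practice has two facets: rate-limiting requests per host (which the URL frontier in Chapter 6 implements) and honoring a site's robots.txt rules. The summary does not mention robots.txt explicitly, so the following standard-library check is an illustrative assumption rather than part of the video's design:

```python
# A minimal robots.txt politeness check using Python's standard library.
from urllib import robotparser
from urllib.parse import urlparse

def can_crawl(url: str, user_agent: str = "MyCrawler") -> bool:
    """Return True if robots.txt for the URL's host permits fetching it."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()          # fetch and parse robots.txt
    except OSError:
        return True        # robots.txt unreachable: assume allowed
    return rp.can_fetch(user_agent, url)

print(can_crawl("https://example.com/some/page"))
```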
Chapter 4
![Estimations for scaling the crawler and storage requirements are provided.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODQyLCJwdXIiOiJibG9iX2lkIn19--f5eadb6c01c0d9d65e6b21468d5fc0f914ddeefc/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_286.jpg)
Estimations for scaling the crawler and storage requirements are provided.
- An estimate of 900 million registered websites is used; assuming 60% are functional gives about 540 million, rounded down to roughly 500 million websites to crawl.
- On average, each website might contain about 100 pages, resulting in 50 billion pages to crawl and store.
- The average size of a web page and the storage capacity needed for the crawled corpus are then calculated, considering that only the page content is downloaded, not the embedded media; a back-of-the-envelope version of the calculation follows this list.
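The arithmetic, following the numbers above; the ~100 KB average HTML-only page size is an assumed figure for illustration, not a number stated in this summary:

```python
# Back-of-the-envelope storage estimate for the crawl.
registered_sites = 900_000_000
functional = int(registered_sites * 0.60)        # ~540 million
sites_to_crawl = 500_000_000                     # rounded down for estimation
pages_per_site = 100
total_pages = sites_to_crawl * pages_per_site    # 50 billion pages

avg_page_bytes = 100 * 1024                      # assumed ~100 KB per page
total_bytes = total_pages * avg_page_bytes
print(f"pages: {total_pages:,}")
print(f"storage: {total_bytes / 1024**5:.1f} PiB")  # roughly 4.5 PiB (~5.1 PB)
```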
Chapter 5
![A detailed system design diagram for the crawler is explained.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODQ0LCJwdXIiOiJibG9iX2lkIn19--6f8f493d72de60bc0ab57e36454dffc8ef2668b7/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_447.jpg)
A detailed system design diagram for the crawler is explained.
- The diagram includes seed URLs to initiate the crawl, a URL frontier queue for managing crawl order, and fetchers and renderers for retrieving and processing web content.
- The system uses distributed threads or processes to fetch and render web pages concurrently, and scales by adding more machines (a minimal fetch-loop sketch follows this list).
- The URL frontier ensures that politeness and priority features are maintained, while the DNS resolver optimizes the domain name resolution process.
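A minimal sketch of the fetcher stage under these assumptions; the `frontier` object with `get_url`/`add_url` methods and the `store` callable are hypothetical names, and a real fetcher would also handle redirects, retries, and JavaScript rendering:

```python
# Fetch workers pull ready URLs from the frontier, persist page bodies,
# and feed newly discovered links back in. The design scales out by
# running many such workers per machine and many machines in parallel.
import re
import threading
import urllib.request

LINK_RE = re.compile(rb'href="(https?://[^"#]+)"')

def fetch_worker(frontier, store):
    while True:
        url = frontier.get_url()           # blocks until a URL is ready
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
        except OSError:
            continue                       # skip unreachable pages
        store(url, body)                   # persist page content
        for link in LINK_RE.findall(body): # recursively discover URLs
            frontier.add_url(link.decode("ascii", "ignore"))

def start_fetchers(frontier, store, n_threads=8):
    for _ in range(n_threads):
        threading.Thread(target=fetch_worker, args=(frontier, store),
                         daemon=True).start()
```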
Chapter 6
![The speaker elaborates on the URL frontier and the implementation of politeness policy.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODQ2LCJwdXIiOiJibG9iX2lkIn19--88e349cd63e7e851fed1182144628d068bfabed8/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_919.jpg)
The speaker elaborates on the URL frontier and the implementation of politeness policy.
- The URL frontier consists of front (priority) queues and back (politeness) queues, with a prioritizer routing incoming URLs and a back-queue selector deciding what to fetch next.
- Each back queue corresponds to a specific host to ensure only one connection to a host at a time, maintaining politeness by not overwhelming servers.
- The system uses a heap to track which URL to crawl next, based on priority and the required delay between requests to the same host (a simplified sketch follows this list).
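A simplified, single-threaded, in-memory sketch of that frontier: one back queue per host plus a min-heap keyed by the earliest time each host may be contacted again. A production frontier would add multiple priority-based front queues, thread safety, and on-disk state; the one-second politeness delay is an assumed value:

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_DELAY = 1.0   # assumed seconds between requests to one host

class Frontier:
    def __init__(self):
        self.back_queues = defaultdict(deque)  # host -> pending URLs
        self.heap = []                          # (next_allowed_time, host)

    def add_url(self, url: str):
        host = urlparse(url).netloc
        if host not in self.back_queues:        # first URL for this host
            heapq.heappush(self.heap, (time.monotonic(), host))
        self.back_queues[host].append(url)

    def get_url(self) -> str:
        # Pop the host whose politeness delay elapses soonest.
        # (Assumes at least one URL is pending.)
        next_time, host = heapq.heappop(self.heap)
        wait = next_time - time.monotonic()
        if wait > 0:
            time.sleep(wait)                    # enforce politeness
        url = self.back_queues[host].popleft()
        if self.back_queues[host]:              # host still has work:
            heapq.heappush(self.heap,           # reschedule after delay
                           (time.monotonic() + POLITENESS_DELAY, host))
        else:
            del self.back_queues[host]
        return url
```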
Chapter 7
![Methods for detecting updates and handling duplicates are discussed.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODQ4LCJwdXIiOiJibG9iX2lkIn19--2c118b963533dad5d59e1c726914f2bc3a51b200/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_1824.jpg)
Methods for detecting updates and handling duplicates are discussed.
- HEAD requests check whether a page has been updated by comparing its Last-Modified time, without downloading the entire content (see the sketch after this list).
- Duplicate detection is crucial to save resources and is done with hashing and signature calculations, including algorithms such as SimHash for near-duplicate detection.
- SimHash can identify near-duplicate documents even when a certain percentage of the content differs (a minimal sketch follows this list).
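A minimal sketch of the HEAD-request update check, using only the standard library; the stored Last-Modified value is assumed to come from the crawler's metadata store:

```python
# Only headers are transferred, so the page body is re-downloaded
# only when Last-Modified differs from what we have stored.
import urllib.request

def page_changed(url, stored_last_modified):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        last_modified = resp.headers.get("Last-Modified")
    if last_modified is None or stored_last_modified is None:
        return True            # nothing to compare: assume changed
    return last_modified != stored_last_modified
```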
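And a minimal 64-bit SimHash sketch; whitespace tokenization and MD5-based token hashing are simplifying assumptions, as production systems use more careful feature extraction:

```python
# Documents whose fingerprints differ in only a few bits (small
# Hamming distance) are likely near-duplicates.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(hamming(simhash(doc1), simhash(doc2)))  # small => near-duplicate
```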
Chapter 8
![Various storage solutions for the crawled data are suggested.](https://www.videogist.co/rails/active_storage/representations/redirect/eyJfcmFpbHMiOnsiZGF0YSI6MTM1ODUwLCJwdXIiOiJibG9iX2lkIn19--c0e244ba98f4b1776eb1dab4605396018b862bd8/eyJfcmFpbHMiOnsiZGF0YSI6eyJmb3JtYXQiOiJqcGciLCJyZXNpemVfdG9fbGltaXQiOls3MjAsbnVsbF19LCJwdXIiOiJ2YXJpYXRpb24ifX0=--c9426325207613fdd890ee7713353fad711030c7/7943_2535.jpg)
Various storage solutions for the crawled data are suggested.
- Storage options such as Amazon S3, MinIO, Bigtable, GFS, and HDFS are considered for storing crawled pages, with HDFS Federation noted as a way to scale the HDFS namespace.
- The choice of storage solution depends on the size of the data, with considerations for efficient retrieval and optimal resource usage.
- Sharding and consistent hashing are recommended for spreading large amounts of data across distributed file systems (a minimal hash-ring sketch follows this list).
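A minimal consistent-hashing sketch for sharding pages across storage nodes; the node names are hypothetical, and virtual nodes are used to smooth the key distribution:

```python
# Each node owns regions of a hash ring; adding or removing a node
# only remaps the keys in its regions, not the whole keyspace.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                     # sorted (hash, node) points
        for node in nodes:
            for v in range(vnodes):        # virtual nodes smooth the load
                self.ring.append((self._hash(f"{node}#{v}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash owns it.
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["store-1", "store-2", "store-3"])
print(ring.node_for("https://example.com/page.html"))
```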