System Design — Designing a Web Crawler

Shashank Singh
3 min read · Nov 1, 2023


Being able to design a scalable system that stays reliable under increased load without compromising availability or data consistency is important not only for passing an interview, but also for the career growth you look forward to.

There are no shortcuts, and you can’t master it overnight, but you can start right now. I will try my best to add my two cents by sharing articles like this one, with high-level design diagrams for various systems we use in our daily lives.

Today it’s about “Designing a Web Crawler.”

What exactly is a web crawler?

“A web crawler is a software program that browses the World Wide Web in a methodical and automated manner to collect documents by recursively fetching links from a set of starting pages. Search engines use web crawlers to provide up-to-date data.”

Now that we know what a web crawler is, the next thing to decide is whether this crawler should fetch only HTML pages, or whether we want to store every kind of media we come across. For the sake of keeping it short, let’s build one for HTML only, but it should be easy to enhance it and add support for other media types as well.
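To make the “HTML only” decision concrete, here is a minimal sketch of how the fetcher might filter by content type, assuming Python with the requests library; the function name and the accepted-types set are my own illustration, not something prescribed by the design:

```python
import requests

HTML_TYPES = {"text/html", "application/xhtml+xml"}

def fetch_if_html(url):
    """Download a page only if the server reports an HTML content type."""
    # A HEAD request lets us inspect the headers without pulling the whole body.
    head = requests.head(url, allow_redirects=True, timeout=10)
    content_type = head.headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type not in HTML_TYPES:
        return None   # images, PDFs, video, etc. are out of scope for now
    return requests.get(url, timeout=10).text
```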

Similarly, we need to decide what kind of protocols it is going to support. For now, HTTP, but it should be ……… 😊

And most important of all is dealing with Robots Exclusion (robots.txt). If you don’t know what it is, I suggest you go do some digging and find out 😊
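As a small taste of what honouring robots.txt involves, here is a sketch using Python’s built-in urllib.robotparser; the user-agent string is a made-up example:

```python
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def allowed_by_robots(url, user_agent="MyCrawlerBot"):
    """Check the site's robots.txt before fetching a URL."""
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()                         # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)

# Example: skip a URL entirely if the site disallows it for our bot.
# if not allowed_by_robots("https://example.com/some/page"): continue
```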

So, at a high level, if we break the process of crawling into steps, it looks something like this (a minimal code sketch follows the list):

  1. Pick a URL from the list.
  2. Fetch the IP address of the host & establish a connection to download the corresponding document.
  3. While parsing the contents of this document, look for new URLs.
  4. Add the new URLs to the URLs list.
  5. Store contents, or index the document, process it as needed.
  6. Go back to step 1.
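Here is a single-threaded toy version of those six steps, assuming Python with requests and BeautifulSoup; a real crawler would spread this work across many distributed workers:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # assumed HTML parser; any parser would do

def crawl(seed_urls, max_pages=100):
    """A minimal, single-threaded version of the six-step loop above."""
    frontier = deque(seed_urls)        # the URL list (step 1 reads from here)
    seen = set(seed_urls)              # URLs already queued, to avoid re-adding them
    documents = {}                     # stands in for the real data store (step 5)
    while frontier and len(documents) < max_pages:
        url = frontier.popleft()                            # 1. pick a URL
        try:
            html = requests.get(url, timeout=10).text       # 2. resolve host & download
        except requests.RequestException:
            continue                                        # unreachable host: move on
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):        # 3. parse & look for new URLs
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)                       # 4. add new URLs to the list
        documents[url] = html                               # 5. store / index the document
        # 6. the while-loop takes us back to step 1
    return documents
```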

There are a lot of other important things to understand & learn around designing a crawler, for instance choosing between BFS vs. DFS for traversing the link graph.
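The difference mostly comes down to how you take URLs off the frontier, as this tiny sketch shows (breadth-first with a FIFO queue vs. depth-first with a LIFO stack); the URLs are placeholders:

```python
from collections import deque

frontier = deque(["https://example.com/a", "https://example.com/b"])

# BFS: treat the frontier as a FIFO queue — crawl level by level across sites.
bfs_next = frontier.popleft()   # "https://example.com/a"

# DFS: treat the same frontier as a LIFO stack — dive deep into one branch first.
dfs_next = frontier.pop()       # "https://example.com/b"
```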

Now, assuming there are billions of pages to be crawled, here are the most important components we need in our service:

  • URL Queue: to store the list of URLs to download.
  • HTML Fetcher: to retrieve a web page from the server.
  • Link Extractor: to pull new links out of HTML documents.
  • Duplicate Resolver: to avoid processing the same data twice (a sketch follows this list).
  • Data Store: to store contents, metadata, etc.
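As an example, here is one simple way the Duplicate Resolver could work, hashing page contents and remembering the hashes; at billions of pages the in-memory set used here would be replaced by something like a Bloom filter or a distributed key-value store:

```python
import hashlib

class DuplicateResolver:
    """Skip documents whose content has already been stored."""

    def __init__(self):
        self._seen_hashes = set()

    def is_duplicate(self, html):
        # Compare a fixed-size digest of the page body instead of the full text.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._seen_hashes:
            return True
        self._seen_hashes.add(digest)
        return False
```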

Now that you have a basic understanding of what’s needed, it will be easier to understand the diagram below.

High Level Diagram of major components of a Web Crawler

If you want to go deeper into these components, you can find multiple resources out there; in case you don’t find what you need, reach out to me and I will be happy to guide you.

Happy learning.

#systemdesign #learnorburn #beconsistent #webcrawler #techleadership #architecture
