Web crawling is notoriously slow: a single HTTP request can take up to 2 seconds to complete. When you are crawling billions of web pages, brute force takes billions of seconds. I am assuming that the pages you decide to crawl at any given time are carefully chosen, so that the caching, browsing, and indexing schedule is already fully optimized.
Steps to increase efficiency (speed) by a factor of 80,000
- Use the cloud: split your crawling across 8,000 servers. Speed improvement: 8,000.
- On each server, run 20 copies of your crawler in parallel (call it sub-parallelization at the server level). Based on our experience, and assuming each server is used exclusively for this crawling project, you can expect this to boost speed not by a factor of 20, but perhaps by as much as 5. Speed improvement so far: 8,000 x 5 = 40,000. (See the second sketch after this list for a per-server worker pool.)
- Change the timeout threshold (associated with each HTTP request) from 2 seconds to 0.5 seconds. This could improve speed by a factor of 3 (not all web pages take 2 seconds to download), but you will then have to revisit many more pages that failed because of the short 0.5-second threshold. Because of this, the gain is not a factor of 3, but a factor of 2. Try different values for this threshold until you find one that is optimal. Speed improvement so far: 8,000 x 5 x 2 = 80,000. (See the first sketch after this list.)
- In addition to changing the timeout threshold, you can change the max size threshold: if a page is larger than 24 KB, download the first 24 KB and skip the rest. While this boosts speed, the drawback is information loss.
- You should also keep a blacklist of websites or web pages that you don't want to crawl because they are consistently slow to load or cause other speed problems (multiple redirects, etc.).
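To make the timeout and max-size ideas concrete, here is a minimal sketch in Python, assuming the `requests` library. The constant names and values (`TIMEOUT_SECONDS`, `MAX_BYTES`) are illustrative and should be tuned as described above; note that the `timeout` argument in `requests` bounds the connect and per-read waits, not the total elapsed time.

```python
# Minimal sketch: fetch a page with an aggressive timeout and a size cap.
import requests

TIMEOUT_SECONDS = 0.5   # aggressive timeout; failed URLs are revisited later
MAX_BYTES = 24 * 1024   # keep at most the first 24 KB of each page

def fetch_truncated(url):
    """Return up to MAX_BYTES of the page body, or None on timeout/error."""
    try:
        with requests.get(url, timeout=TIMEOUT_SECONDS, stream=True) as resp:
            resp.raise_for_status()
            chunks, total = [], 0
            for chunk in resp.iter_content(chunk_size=4096):
                chunks.append(chunk)
                total += len(chunk)
                if total >= MAX_BYTES:
                    break            # skip the rest of the page
            return b"".join(chunks)[:MAX_BYTES]
    except requests.RequestException:
        return None                  # caller re-queues the URL for a later pass
```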
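And here is a sketch of the per-server sub-parallelization combined with the blacklist, reusing `fetch_truncated()` from the previous sketch. The worker count, the `blacklist.txt` file (one hostname per line), and the use of threads to stand in for the 20 independent crawler copies are all assumptions for illustration.

```python
# Minimal sketch: run many fetches in parallel on one server, skipping blacklisted hosts.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

WORKERS = 20  # copies of the crawler running in parallel on this server (assumed)

def load_blacklist(path="blacklist.txt"):
    """Read a set of hostnames that are consistently slow or problematic."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def crawl(urls, blacklist):
    """Fetch all non-blacklisted URLs concurrently; returns {url: body_or_None}."""
    urls = [u for u in urls if urlparse(u).hostname not in blacklist]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        return dict(zip(urls, pool.map(fetch_truncated, urls)))
```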