One of IDEGO’s clients recently asked us to perform a very interesting data extraction. While we cannot disclose exactly what information they were looking for, we can certainly describe the process and our approach using a similar example. We will create a script that finds <object> and <embed> tags on web pages and outputs the page URL together with the contents of those tags’ src or data attributes, grouped by domain. In short, we’ll see what people embed on their websites.
When trying to solve our client’s problem, we first settled on a traditional approach. We used Scrapy – a Python framework for web crawling – to write a scraper capable of outputting the requested data based on a list of domains supplied by the client. It did its job rather well but obviously had its limitations. First and foremost, it had to be provided with a domain list, which meant there was no possibility of randomly discovering a website abundant with what our client was looking for. Secondly, because the traffic generated by our bot was noticeably different from – and far more suspicious than – that of a normal user, we had to hide behind a proxy, which made scraping the web even lengthier than it would otherwise have been. We had to come up with something better.
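For illustration, here is a minimal sketch of the kind of Scrapy spider we started with – not the production scraper. The spider name, the example domain and the link-following logic are stand-ins for the client-supplied domain list and whatever crawl rules the real scraper used; it assumes a reasonably recent Scrapy version.

```python
# A minimal sketch, not the production scraper: finds <object>/<embed> tags
# and records their src/data attributes, one item per embedded resource.
import scrapy


class EmbedSpider(scrapy.Spider):
    name = "embeds"
    # Placeholders: in the real scraper these came from the client's domain list.
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Record the src/data attribute of every <object> and <embed> tag on the page.
        for tag in response.css("object, embed"):
            src = tag.attrib.get("src") or tag.attrib.get("data")
            if src:
                yield {
                    "domain": response.url.split("/")[2],
                    "url": response.url,
                    "src_data": src,
                }
        # Keep crawling within the allowed domains.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

A spider like this can be run with Scrapy’s standard `scrapy runspider` command and an output file flag, which is roughly how our domain-list-driven scraper produced its results.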
Fortunately, we stumbled upon a blog post written by one of Yelp’s employees back in 2014. Yelp is an American website gathering people’s reviews of various businesses, but it also serves as a kind of Yellow Pages – it makes it easy to find phone numbers, addresses and website URLs of various companies. The article – Analyzing the Web for the price of a sandwich – describes the process of gathering the missing URLs based on the phone number data already available in Yelp’s database. The algorithm was simple: take a snapshot of the entire Internet, search for phone numbers on each of the websites from the snapshot, check if they are present in Yelp’s database and, if so, save the URL.
[Image source: http://www.allegrofun.pl/show/1273]
We all remember the feeling of amusement mixed with embarrassment when we first heard a “Chuck Norris Fact” back around 2006. It was then that the concept of the great Chuck making a backup copy of the entire Internet on a single floppy disk emerged. And while it was a “joke” back then, little did we know that 10 years later we would see an initiative making it somewhat real.
Common Crawl is a non-profit foundation focused on creating snapshots of the Internet. It started small and grew over the years, but the basic idea remains the same. Each month, bots are employed to scrape all the possible websites on the Internet (well, not all, but a whole lot of them), HTML after HTML, saving the results to Amazon’s S3 cloud storage in compressed form. The raw data is available in the form of WARC files, but further processing is also applied, resulting in two other file formats – WET for the plain text extracted from the HTML and WAT for metadata. As of March 2018, the Crawl contains 3.4 billion web pages – that’s 270 TB uncompressed! To make this data more manageable, it is divided into compressed chunks of about 1 GB each, and Common Crawl hosts index files on S3 that hold the paths to these chunks.
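As a rough illustration of how those path files are used, the sketch below downloads the list of WARC chunk paths for a single monthly snapshot. The crawl ID (CC-MAIN-2018-05 should correspond to the January 2018 crawl) and the public HTTPS mirror of the commoncrawl bucket are our assumptions; the same files can also be fetched directly from the S3 bucket.

```python
# A small sketch: list the ~1 GB WARC chunks of one monthly snapshot.
import gzip
import urllib.request

CRAWL_ID = "CC-MAIN-2018-05"  # assumed ID for the January 2018 snapshot
paths_url = f"https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz"

with urllib.request.urlopen(paths_url) as resp:
    warc_paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

print(f"{len(warc_paths)} WARC chunks in {CRAWL_ID}")
print(warc_paths[0])  # a path like crawl-data/CC-MAIN-2018-05/segments/.../warc/...warc.gz
```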
The only sensible way of processing this much data is to deploy a MapReduce job on an Amazon EMR (Elastic MapReduce) cluster running on EC2 instances. This approach was also used and described by the Yelp employee in the article mentioned above.
The idea behind MapReduce is pretty simple: it divides the task into two steps. The first step – “Map” – is all about distributing the work among multiple workers, giving each of them a list of relatively easy tasks; in our case this atomic task is finding some sort of content on a single web page from the Common Crawl data. The output of this step is a (key, value) pair (e.g. a domain and a phone number, or a domain and a (URL, embedded content) pair). The next step – “Reduce” – gathers the Map outputs sharing the same key and unifies them, giving the final answer. Because Amazon provides the whole infrastructure through EMR, we didn’t need to worry about much else, especially since we had decided to use cc-mrjob as the basis for our script.
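As a toy illustration of that (key, value) flow, here is the same idea run on a handful of made-up records instead of Common Crawl data (the record structure below is invented for the example):

```python
# Map/Reduce in miniature: map each "page" to a (domain, value) pair,
# then reduce by gathering all values that share the same domain.
from collections import defaultdict

pages = [
    {"domain": "example.pl", "url": "http://example.pl/a", "src_data": "movie.swf"},
    {"domain": "example.pl", "url": "http://example.pl/b", "src_data": "doc.pdf"},
    {"domain": "other.com",  "url": "http://other.com/",   "src_data": "player.swf"},
]

# Map: each worker turns one page into a (key, value) pair.
mapped = [(p["domain"], {"url": p["url"], "src_data": p["src_data"]}) for p in pages]

# Reduce: pairs sharing a key are gathered and unified into one answer per key.
reduced = defaultdict(list)
for domain, value in mapped:
    reduced[domain].append(value)

for domain, values in reduced.items():
    print(domain, values)
```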
Cc-mrjob makes it easier to use Yelp’s mrjob with Common Crawl data by supplying a class that handles connecting to S3 and properly divides the input. It also provides ways of testing MapReduce jobs locally on smaller batches of data – a functionality that turned out to be rather useful during the development of our script.
The script itself really is very simple – it takes the WARC files as input because we are looking for content within HTML tags (which the plain-text WET files no longer contain). The map step yields (domain string, dictionary with ‘url’ and ‘src_data’ fields) pairs, and the reduce step simply unifies these outputs into (domain string, list of dictionaries) pairs. MRJob also lets us define custom arguments, which we used to filter the data: we can, for example, limit our search to domains from Poland with the “--domain_suffix .pl” argument, or only output embedded PDF files with the “--file_extension .pdf” option.
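A rough sketch of such a job is shown below. This is not the original script: it assumes the warcio-based version of cc-mrjob, whose CCJob base class feeds parsed WARC records to a process_record() hook (older versions expose records through the warc library with a slightly different interface), and the regex is a simplified stand-in for the ones we actually used. The option names follow the description above.

```python
# A minimal sketch of the job, assuming cc-mrjob's CCJob base class and
# warcio-style record objects; mrjob >= 0.6 option API (configure_args).
import re
from urllib.parse import urlparse

from mrcc import CCJob  # cc-mrjob's Common Crawl base job

# Grab the src/data attribute value of <object> and <embed> tags (simplified).
EMBED_RE = re.compile(
    rb'<(?:object|embed)\b[^>]*?\s(?:src|data)\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE)


class EmbedJob(CCJob):

    def configure_args(self):
        super(EmbedJob, self).configure_args()
        self.add_passthru_arg('--domain_suffix', default='',
                              help='only keep domains ending with this suffix, e.g. .pl')
        self.add_passthru_arg('--file_extension', default='',
                              help='only keep embedded files with this extension, e.g. .pdf')

    def process_record(self, record):
        # Only HTTP responses contain the HTML we want to search.
        if record.rec_type != 'response':
            return
        url = record.rec_headers.get_header('WARC-Target-URI')
        domain = urlparse(url).netloc
        if self.options.domain_suffix and not domain.endswith(self.options.domain_suffix):
            return

        html = record.content_stream().read()
        for match in EMBED_RE.finditer(html):
            src = match.group(1).decode('utf-8', errors='replace')
            if self.options.file_extension and not src.endswith(self.options.file_extension):
                continue
            yield domain, {'url': url, 'src_data': src}

    def combiner(self, domain, values):
        # Pass pairs through unchanged; grouping happens in the reducer.
        for value in values:
            yield domain, value

    def reducer(self, domain, values):
        # Unify everything found under one domain into a single list.
        yield domain, list(values)


if __name__ == '__main__':
    EmbedJob.run()
```

During development, mrjob’s built-in runners make it easy to test a job like this against a single downloaded WARC chunk with `-r local` before launching the full run on the cluster with `-r emr`.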
The only major problem was the time it took the Amazon instances to process all this data. Having read that it took the Yelp employee only an hour, we expected a similar runtime. Instead, during our test runs, the job appeared to be making no progress at all. Only after some fiddling and testing on even smaller batches of data were we able to establish that the job did make progress, albeit much more slowly than we had expected.
That is when we did the maths. The Common Crawl snapshot used by the Yelp employee was from December 2014, and because he had only been searching for phone numbers, he was able to make use of the WET files containing extracted plain text. The particular WET archive that he had used was 3.69 TB compressed, while the WARC from the same snapshot was 32 TB. The snapshot that we used – from January 2018 – was a lot bigger. The WET archive is 9.29 TB compressed and WARC is 74.33 TB. This information lets us establish two facts: WARC files are approximately 8 times bigger than WET files and our snapshot was about 2.5 times bigger than the one used in the article. A simple calculation gives us a clear answer – our task was approximately 20 times bigger than the one described in the article.
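For reference, the same ratios can be reproduced in a few lines (the sizes are the compressed figures quoted above, in TB):

```python
# Back-of-the-envelope sizing of our task versus the one from the Yelp article.
wet_2014, warc_2014 = 3.69, 32.0
wet_2018, warc_2018 = 9.29, 74.33

print(round(warc_2014 / wet_2014, 1))  # 8.7  -> WARC roughly 8x the size of WET
print(round(wet_2018 / wet_2014, 1))   # 2.5  -> the 2018 snapshot is ~2.5x bigger
print(round(warc_2018 / wet_2014, 1))  # 20.1 -> our task was ~20x theirs
```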
The Yelp employee clearly states in his article that he was able to process all the data in an hour using 20 c3.8xlarge instances (32 vCPUs and 60 GB of RAM each). In our final, successful run we used 40 comparable instances – double their count – and the task took approximately 15 hours to complete. With a roughly 20 times bigger task and twice the instances, perfect scaling would have predicted about 10 hours, so clearly the scaling wasn’t perfect; the most obvious reason could be that our regexes had to be executed on larger amounts of data – full HTML instead of extracted plain text – and therefore took longer than theirs. Either way, the MapReduce job produced 20 GB of uncompressed output for our client, satisfying their needs and letting them perform further analyses on a batch of data much smaller than a snapshot of the entire Internet.
The only downside of the job taking 15 hours is the pricing. An on-demand instance of this type costs $1.591 per hour and EMR adds another $0.270 per hour, which times 40 instances gives us $74.44 per hour – over 15 hours, that comes to approximately $1,100 had we used standard on-demand instances.
We were smart enough to use Spot instances instead – they take advantage of the fact that Amazon’s on-demand capacity is rarely 100% utilised, so this otherwise wasted capacity is rented out at a much more sensible price to users willing to take it. The downside is that these instances are not persistent and can be reclaimed at any moment when the capacity is needed again, but for a job like ours they’re perfect.
Spot prices fluctuate with demand, but to show the difference we will do the calculation for the same type of instance using the numbers at the time of writing. The Spot version of that instance currently costs $0.3085 per hour, plus the same $0.270 for EMR, which times 40 gives us $23.14 per hour – 3.2 times less than using on-demand instances! It’s not as cheap as a sandwich, but it’s definitely not too much to ask for the ability to process basically the entire Internet.
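For completeness, here is the same cost arithmetic spelled out, using the prices quoted above:

```python
# Cost comparison for 40 instances over a 15-hour run.
instances, hours = 40, 15

on_demand_hourly = (1.591 + 0.270) * instances   # instance price + EMR surcharge
spot_hourly      = (0.3085 + 0.270) * instances

print(round(on_demand_hourly, 2))                # 74.44  $/hour
print(round(on_demand_hourly * hours))           # ~1117  -> roughly $1,100 for the job
print(round(spot_hourly, 2))                     # 23.14  $/hour
print(round(on_demand_hourly / spot_hourly, 1))  # 3.2x cheaper with Spot
```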
For a more in-depth look at the Amazon EMR, we can recommend this report: https://calhoun.nps.edu/bitstream/handle/10945/52962/17Mar_Chang_Tao-hsiang.pdf?sequence=1