How We Analyzed the Whole Web to See What People Embed on Their Websites. For 300 Euros.
Author: Idego Group

Idego Group undertook an ambitious project to analyze embedded content across the internet. A client requested data extraction on what people embed on websites, specifically seeking object and embed tags along with their src or data attributes, organized by domain.
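The core extraction task can be sketched with nothing but the standard library: walk the HTML of each page, collect the `src`/`data` attributes of `object` and `embed` tags, and bucket them by the page's domain. This is an illustrative sketch, not Idego's actual code; the `embeds_by_domain` helper and the sample input are invented for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse

class EmbedExtractor(HTMLParser):
    """Collects src/data attribute values of <object> and <embed> tags."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in ("object", "embed"):
            for name, value in attrs:
                if name in ("src", "data") and value:
                    self.found.append(value)

def embeds_by_domain(pages):
    """pages: iterable of (page_url, html_text) -> {domain: [embedded urls]}."""
    result = defaultdict(list)
    for url, html in pages:
        parser = EmbedExtractor()
        parser.feed(html)
        result[urlparse(url).netloc].extend(parser.found)
    return dict(result)

# Hypothetical sample input for illustration.
sample = [("http://example.com/page",
           '<embed src="movie.swf"><object data="game.swf"></object>')]
print(embeds_by_domain(sample))  # {'example.com': ['movie.swf', 'game.swf']}
```

At web scale this per-page logic becomes the body of a mapper; the grouping by domain is exactly what the reduce step consolidates.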
The team initially used Scrapy, a Python web crawling framework, but recognized its limitations: it required a predefined list of domains, and its traffic looked suspicious enough to crawled sites that proxies became necessary. Both factors significantly slowed the process.
The breakthrough came from a Yelp engineering blog post describing how to analyze massive web data economically. This inspired Idego to leverage Common Crawl, a non-profit initiative creating internet snapshots. As of March 2018, the Crawl contains 3.4 billion web pages—that is 270 TB uncompressed. The data comes in WARC, WET, and WAT formats, stored on Amazon's S3.
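The raw WARC files that Common Crawl publishes are conceptually simple: each record is a block of `\r\n`-separated headers (record type, target URI, content length) followed by a blank line and the payload. The toy parser below is only meant to show that layout; real processing should use a dedicated library such as warcio, and the sample record here is fabricated.

```python
def parse_warc_record(raw: bytes):
    """Split one WARC record into (version, headers dict, payload bytes).

    Minimal illustration of the WARC layout; not a full parser.
    """
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                     # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    return version, headers, payload

# Fabricated record for illustration.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 23\r\n"
          b"\r\n"
          b"<html>embed here</html>")

version, headers, payload = parse_warc_record(record)
print(headers["WARC-Target-URI"])  # http://example.com/
```

WET files strip records down to extracted plain text and WAT files to metadata; for hunting raw `object`/`embed` tags, the full-HTML WARC records are the relevant format.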
Processing such volume required MapReduce on Amazon EMR (Elastic MapReduce). This distributed computing approach splits tasks into Map (parallel work distribution) and Reduce (consolidating results) phases. The team used cc-mrjob to streamline Common Crawl integration.
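The two phases can be shown with a plain-Python simulation. This is a conceptual sketch of the MapReduce model rather than cc-mrjob's actual API (in mrjob proper, a job subclasses `MRJob` and yields key-value pairs from `mapper` and `reducer` methods); the page data and helper names here are invented.

```python
from collections import defaultdict
from itertools import chain
from urllib.parse import urlparse

def mapper(page):
    """Map phase: turn one page into (domain, embedded_url) pairs."""
    url, embed_srcs = page
    domain = urlparse(url).netloc
    return [(domain, src) for src in embed_srcs]

def reduce_pairs(pairs):
    """Shuffle + reduce phase: group pairs by key and consolidate values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Hypothetical input: (page URL, list of embed sources found on that page).
pages = [
    ("http://a.com/1", ["x.swf"]),
    ("http://a.com/2", ["y.swf"]),
    ("http://b.com/1", ["z.pdf"]),
]

# On EMR the mapper calls run in parallel across many machines;
# here they run sequentially for illustration.
pairs = chain.from_iterable(map(mapper, pages))
print(reduce_pairs(pairs))  # {'a.com': ['x.swf', 'y.swf'], 'b.com': ['z.pdf']}
```

The value of the model is that the mapper never needs to see more than one page at a time, so the 270 TB of input can be sharded freely across the cluster.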
The January 2018 snapshot proved significantly larger than the 2014 version Yelp analyzed. While Yelp completed similar work in one hour using 20 instances, Idego required 40 comparable instances for approximately 15 hours. On-demand pricing would have cost roughly $1,100, but Spot instances (spare capacity that AWS auctions at a discount) brought the bill down to approximately $300, making massive-scale data analysis affordable.
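The savings are easy to verify as back-of-envelope arithmetic from the figures in the article. The per-instance-hour rates below are derived from those totals, not quoted prices.

```python
instances = 40
hours = 15
instance_hours = instances * hours        # 600 instance-hours of compute

on_demand_total = 1100                    # approx. USD, per the article
spot_total = 300                          # approx. USD, per the article

on_demand_rate = on_demand_total / instance_hours   # ~1.83 USD/instance-hour
spot_rate = spot_total / instance_hours             # ~0.50 USD/instance-hour
spot_fraction = spot_total / on_demand_total        # Spot ran at ~27% of on-demand

print(instance_hours, round(on_demand_rate, 2), round(spot_rate, 2))
```

At roughly a quarter of the on-demand price, the trade-off is that Spot capacity can be reclaimed by AWS at short notice, which MapReduce tolerates well since failed map tasks are simply re-run.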