Common Crawl Makes New Web Data Available, Launches Coding Contest

Looking to do research based on data gathered from across the web? That’s one of the purposes of Common Crawl, and the group has just released new data and launched a contest to encourage use of that data.

The 2012 data, which contains 3.8 billion web documents, yields stats such as 63% of top-level domains being .com and a total of 61 million domains overall.

Common Crawl is also running its first-ever Common Crawl Code Contest, which challenges developers to do something innovative with the data in the areas of job trends or social impact analysis. Three winners will each get $1,000 in cash, an O’Reilly Data Science Starter Kit, one year of GitHub’s Small Plan and more. Submissions are accepted through August 29.

FYI, I’m on the advisory board for the non-profit group. There’s no compensation for that involvement. I and others just offer free advice to the group.

You can learn more about Common Crawl on its FAQ page, the Get Started page and in the video below:

https://www.youtube.com/watch?v=ozX4GvUWDm4

Common Crawl’s data from 2011 was recently used by Zyxt Labs to show how far Facebook has spread across the open web. See our Marketing Land article for more on that.


About the author

Danny Sullivan
Contributor
Danny Sullivan was a journalist and analyst who covered the digital and search marketing space from 1996 through 2017. He was also a cofounder of Third Door Media, which publishes Search Engine Land and MarTech, and produces the SMX: Search Marketing Expo and MarTech events. He retired from journalism and Third Door Media in June 2017. You can learn more about him on his personal site and blog. He can also be found on Facebook and Twitter.
