Looking to do research based on data gathered from across the web? That’s one of the purposes of Common Crawl, and the group has just released new data, along with a contest to encourage use of that data.
The 2012 data, which contains 3.8 billion web documents, yields stats such as 63% of domains being in the .com top-level domain and 61 million domains overall.
Common Crawl is also running its first-ever Common Crawl Code Contest, which challenges developers to do something innovative with the data in the areas of job trends or social impact analysis. Three winners will each get $1,000 in cash, an O’Reilly Data Science Starter Kit, one year of GitHub’s Small Plan and more. Submissions are accepted through August 29.
FYI, I’m on the advisory board for the non-profit group. There’s no compensation for that involvement. I and others just offer free advice to the group.
You can learn more about Common Crawl on its FAQ page, the Get Started page and in the video below:
Common Crawl’s data from 2011 was recently used by Zyxt Labs to show how much Facebook has spread across the open web. See our Marketing Land article for more on that: