Common Crawl Makes New Web Data Available, Launches Coding Contest

Looking to do research based on data gathered from across the web? That’s one of the purposes of Common Crawl, and the group has just released new data and launched a contest to encourage use of that data.

The 2012 data, which contains 3.8 billion web documents, yields statistics such as 63% of domains falling under the .com top-level domain and 61 million domains overall.

Common Crawl is also running its first-ever Common Crawl Code Contest, challenging developers to do something innovative with the data in the areas of job trends or social impact analysis. Three winners will each get $1,000 in cash, an O’Reilly Data Science Starter Kit, one year of GitHub’s Small Plan and more. Submissions are accepted through August 29.

FYI, I’m on the advisory board for the non-profit group. There’s no compensation for that involvement. I and others just offer free advice to the group.

You can learn more about Common Crawl on its FAQ page, on its Get Started page and in its video on YouTube.

Common Crawl’s data from 2011 was recently used by Zyxt Labs to show how much Facebook has spread across the open web. See our Marketing Land article for more on that.



About The Author: Danny Sullivan is a Founding Editor of Search Engine Land. He’s a widely cited authority on search engines and search marketing issues who has covered the space since 1996. Danny also serves as Chief Content Officer for Third Door Media, which publishes Search Engine Land and produces the SMX: Search Marketing Expo conference series. He has a personal blog called Daggle (and keeps his disclosures page there). He can be found on Facebook and Google+, and he microblogs on Twitter as @dannysullivan.


  • Gary

    It’d be nice if they had a full text search UI built on top of it, just to be able to test what’s included in it.
