Search the 15.7 million websites in Google’s C4 dataset

Now you can find out whether data from your website helped train large language models that help power Google Bard.

Chat with SearchBot

Was your website or content used to help train AI systems as part of Google’s C4 dataset? A new search tool from the Washington Post lets you find out.

Why we care. The dataset includes the types of websites and content creators that generative AI could potentially negatively impact or even wipe out, such as news and media publishers, blogs and marketing.

Search. The new search tool can be found in the Post’s article Inside the secret list of websites that make AI like ChatGPT sound smart. It created the list “based on how many ‘tokens’ appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase,” the story explained.

For example, Search Engine Land was used.

C4 Searchengineland

As were Marketing Land (a brand that no longer exists, but did in 2019) and Marketing Land Events, which hosted our SMX and MarTech conference sites.

C4 Marketingland

And Search Engine Land’s parent company site, Third Door Media.

C4 Thirddoormedia

Also, Barry Schwartz’s Search Engine Roundtable was used.

C4 Seroundtable

Only part of the data. As a reminder, the C4 (which stands for Colossal Clean Crawled Corpus) is only part of the data used by Google Bard and other large language models. It also uses Wikipedia, Reddit and other sources.

Speaking of Reddit. Reddit wants to get paid when any companies want to use its data to train AI models, the New York Times reported. Reddit has updated its API terms and will now charge some companies (e.g., Google, OpenAI) for access. Said Reddit CEO and co-founder Steve Huffman:

  • “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free. Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with. It’s a good time for us to tighten things up.”

Ironically, Reddit, itself, didn’t even create any of that value. Its users did.


About the author

Danny Goodwin
Staff
Danny Goodwin has been Managing Editor of Search Engine Land & Search Marketing Expo - SMX since 2022. He joined Search Engine Land in 2022 as Senior Editor. In addition to reporting on the latest search marketing news, he manages Search Engine Land’s SME (Subject Matter Expert) program. He also helps program U.S. SMX events.

Goodwin has been editing and writing about the latest developments and trends in search and digital marketing since 2007. He previously was Executive Editor of Search Engine Journal (from 2017 to 2022), managing editor of Momentology (from 2014-2016) and editor of Search Engine Watch (from 2007 to 2014). He has spoken at many major search conferences and virtual events, and has been sourced for his expertise by a wide range of publications and podcasts.

Get the must-read newsletter for search marketers.