Publishers push Common Crawl to stop collecting content for AI training
Could AI lose a key source of training data? Major publishers want Common Crawl to stop collecting and sharing their content.
Digital Content Next (DCN) sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop scraping and distributing protected publisher content.
The U.S. trade group, which represents major digital publishers (e.g., the AP, the New York Times, NBCUniversal, Bloomberg, NPR, and Fox), also asked Common Crawl to remove DCN members’ content from its datasets, including paywalled and subscriber-only news articles.
Publishers question opt-outs. DCN’s lawyers raised concerns about whether Common Crawl honored publisher opt-out requests and removed older content when asked.
- The letter said Common Crawl had, in some cases, told publishers it was complying, only to later say technical costs and delays prevented full removal. DCN’s lawyers said they were reviewing whether those statements were inaccurate or misleading.
- Common Crawl publishes a registry of sites that have opted out of scraping. The list includes many large news publishers.
DCN alleges infringement. The letter argued that copyright law is not an opt-out system. DCN said Common Crawl “flagrantly infringed” publisher copyrights by creating and distributing datasets containing protected content without permission or compensation.
- The group also said Common Crawl made that content available to companies developing AI tools and large language models.
- DCN CEO Jason Kint said the legal notice challenges the idea that online content can be collected, stored, and reused simply because it is accessible.
Common Crawl pushes back. Executive Director Rich Skrenta denied that CCBot bypasses paywalls to scrape websites. He also denied misleading publishers after The Atlantic reported in November that some content remained available even though its publishers had requested removal.
- “When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset,” Skrenta said.
Why we care. This fight could shape how much publisher content AI search engines can use without permission. If courts or settlements impose stricter consent requirements, AI responses may rely more on licensed sources and less on the open web.
AI training stakes. Since 2008, Common Crawl has scraped billions of webpages to build a free public archive. Its datasets have been widely used to train AI models. The New York Times’ 2023 copyright lawsuit against OpenAI cited Common Crawl as making up 60% of GPT-3’s training data, Press Gazette reported.
- A 2024 Mozilla Foundation paper said that, in its current form, generative AI likely would not have been possible without Common Crawl.
- Common Crawl has been working on open standards for AI crawling preferences, Skrenta said this week. DCN’s letter asks for a harder line: stop scraping protected publisher content and remove member content already in the datasets.
Search Engine Land is owned by Semrush. We remain committed to providing high-quality coverage of marketing topics. Unless otherwise noted, this page’s content was written by either an employee or a paid contractor of Semrush Inc.