Common Crawl maintains archives containing millions of articles from major news organizations that readers typically must pay to access, enabling AI developers…
In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites.
Is this how AI companies are getting access to paywalled journalism? A new report accuses Common Crawl of doing AI's dirty work, which the organization denies.
The Company Quietly Funneling Paywalled Articles to AI Developers. The Atlantic / Alex Reisner / Nov 5, 2025
“A search for nytimes.com in any crawl from 2013 through 2022 shows a ‘no captures’ result, when in fact there are articles from nytimes.com in most of these crawls.”
For more than a decade, the nonprofit Common Crawl has been scraping billions of webpages to build a massive archive of the internet, notes The Atlantic, making it freely available for research.
In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, …
We've created a free guide that walks you through 6 simple ways to test whether your paywalled content is being reconstructed by AI tools like ChatGPT, Claude, or Perplexity.