News sites are blocking access to Internet Archive's Wayback Machine

Marketplace Tech

Marketplace

Technology, News

🗓️ 21 April 2026

⏱️ 7 minutes

Summary

The Wayback Machine is a project of the Internet Archive. It sends out web crawlers to take snapshots of the internet, creating a digital library of web pages. But now, some news publications are blocking its crawlers over concerns that AI companies will access the Wayback Machine’s publicly available archive and then train their AI models with the content.

Marketplace’s Stephanie Hughes talked about this with Andrew Deck at Harvard's Nieman Lab.

Transcript

0:00.0	Preserving the history of the internet is getting harder.
0:05.0	From American Public Media, this is Marketplace Tech. I'm Stephanie Hughes.
0:09.0	The Wayback Machine is a project of the Internet Archive. It sends out web crawlers to take
0:22.7	snapshots of the internet, basically creating a digital library of web pages. But now some
0:28.0	news publications are blocking its crawlers. They're worried that AI companies will access the
0:33.5	Wayback Machine's publicly available archive and then train their AI models with the content.
0:39.0	I talked about this with Andrew Deck at Harvard's Nieman Lab.
0:42.8	We've seen time and time again, you know, how valuable news content is to generative AI companies.
0:48.0	Publisher archives house enormous amounts of informational, kind of edited text that has been used to train a lot of the large
0:56.3	language models on the market that most people use and also the commercial products that
1:00.9	are built on top of them. So there's potential revenue loss for news publishers if this were
1:07.7	to be happening. Is it actually happening? How real is this fear?
1:11.9	Yeah, I think it's important to say that in our conversations with news publishers,
1:16.0	a lot of them were taking this action preemptively out of a fear of proxy scraping
1:22.5	rather than direct evidence that it has happened to them already.
1:26.3	None of the publishers were able to point
1:28.3	to a particular AI company or other kinds of direct evidence that their content had already
1:35.6	been scraped by the Wayback Machine. But I do think it's fair to say that news publishers feel
1:42.0	burned by the earlier development of large language models,
1:46.0	when we do know large swaths of their content were used by Silicon Valley companies to train models without upfront compensation and permission. You know, that tension over the value of journalism has played out over the last several years. It's led to
2:01.7	lawsuits. It's led to licensing deals. Yeah, I mean, with news sites, information is their currency.
2:08.4	And as you mentioned, some have struck deals to sell their content to AI companies.
	...
