The Vergecast

How to train your data


Vox Media Podcast Network

News, Tech News, Technology

4.3 (4.3K Ratings)

🗓️ 25 June 2026

⏱️ 27 minutes


Summary

Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quantities. Alex Reisner, a staff writer at The Atlantic who has been investigating training data, explains how AI companies get all this data, why they'd really prefer you not know what's in it, and whether training data could ever be a fair trade.

Further reading:
Apple raises prices on Macs, iPads, and more by hundreds of dollars | The Verge
Disney agrees to pay $50 million to YouTube TV and DirecTV subscribers | The Verge
Two handlebars are better than one, right? | The Verge
At Least 15 Million YouTube Videos Have Been Snatched by AI Companies
The Hypocrisy at the Heart of the AI Industry
The Millions of Songs Mashed Into AI-Generated Music
Common Crawl Is Doing the AI Industry’s Dirty Work

Subscribe to The Verge for unlimited access to theverge.com, subscriber-exclusive newsletters, and our ad-free podcast feed.

We love hearing from you! Email your questions and thoughts to vergecast@theverge.com or call us at 866-VERGE11.

Learn more about your ad choices. Visit podcastchoices.com/adchoices

Transcript


0:00.0

Hello and welcome to The Vergecast, the flagship podcast of music that sounds eerily but not exactly

0:07.4

like other music. I'm your friend David Pierce, and today on the show we're talking about

0:11.6

training data. It's the raw materials of everything that is AI, and we probably don't

0:18.0

understand it or talk about it enough. I'm talking to Alex Reisner,

0:21.3

who's a staff writer at The Atlantic, and over the last couple of years, he has spent a lot of

0:25.5

time investigating training data and how all of these books and all of these articles and all

0:31.5

of these YouTube videos and all of these songs are compiled into these gigantic data sets

0:35.5

that AI companies then use to form the basis of their models.

0:40.1

I think the way that these models are created and the sources of these data has a lot to do with the way that we feel about AI,

0:46.6

particularly generative AI as a creative expression.

0:50.3

And understanding how this data works, where it comes from, and how it gets used, I think is really important.

0:55.6

So we're going to have Alex on, we're going to dig into it. I'm very excited about it.

0:59.0

But first, here's everything else happening on The Verge today.

1:01.9

This is 90 Seconds on The Verge for Thursday, June 25th, 2026.

1:06.1

Apple just raised the prices of a huge number of its most important products.

1:10.8

iPads and MacBooks in

1:12.0

particular are now up anywhere from $100 to $500, and even like the Apple TV is way up. It's now

1:18.7

$200, roughly the price of 900 Amazon Fire Sticks. We knew this was coming, but this is still a big

1:24.3

moment. Apple is as well-managed a supply chain company as anybody

1:28.9

and exists at really high margins.

1:31.3

If it can't keep prices down in this era of AI-driven shortages of memory and storage,

1:36.7

nobody can.

...

