Marketplace Tech

For data-hungry tech companies, YouTube is a gold mine

Marketplace

Technology, News

4.5 (1.3K Ratings)

🗓️ 30 July 2024

⏱️ 11 minutes

Summary

Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.

Transcript

0:00.0

How YouTube became a gold mine for AI chatbots.

0:05.0

From American Public Media, this is Marketplace Tech.

0:08.0

I'm Lily Jamali. Companies competing in the chatbot wars have consistently turned to something known in the industry as the Pile

0:25.2

to train their large language models.

0:27.7

The Pile is a massive trove of open-source data, about 800 gigs' worth, that's made up of several smaller datasets: Wikipedia, the European Parliament's website, even a collection of scandalous emails from employees at the now

0:44.9

defunct energy company Enron.

0:48.6

Investigative reporter Annie Gilbertson of Proof News recently took a deep dive into the Pile and discovered something else.

0:55.8

A dataset called YouTube Subtitles.

0:58.8

Text from videos are now being used by Silicon Valley heavyweights like Apple,

1:03.7

Nvidia, and Anthropic.

1:05.8

Gilbertson says it's happening without the consent of creators.

1:10.2

What we found was more than 170,000 YouTube videos had been swiped to train these artificial intelligence

1:18.0

models. That includes channels labeled as education, Khan Academy, MIT, Harvard, and also news publishers, Wall Street Journal,

1:26.6

NPR, BBC, entertainers, and then of course some of YouTube's biggest stars, Marques Brownlee,

1:32.0

MrBeast,

1:33.1

PewDiePie, for example.

1:34.4

Mm-hmm.

1:35.3

So Annie, how did you and your team go about figuring out

1:39.2

whether companies were using YouTube

1:41.5

and how they were using it?

1:42.8

Yeah, so tech companies are just very secretive about the type of

1:47.2

training data that they're using.

...

Disclaimer: The podcast and artwork embedded on this page are from Marketplace, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Marketplace and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.
