What happens when you train your AI on AI-generated data?

On Point | Podcast

WBUR

Talk Show, Daily News, News, NPR, On Point, Daily

4.2 • 3.5K Ratings

🗓️ 19 May 2025

⏱️ 51 minutes

Summary

AI companies say they are running out of high-quality data to train their models on. But they might have a solution: data generated by artificial intelligence systems themselves. The pros and cons of synthetic data.

Transcript

0:00.0	Support for this podcast comes from the Boston Pops. This season, America's Orchestra takes you on a masterful journey, from a cinematic celebration with John Williams' playlist to a rousing gospel night with Tamela and David Mann. Don't miss the adventure. Now through June 7th, tickets are on sale now at BostonPops.org.
0:20.5	Support for this podcast comes from Is Business Broken, a podcast from BU
0:25.2	Questrom School of Business.
0:27.0	What is short-termism?
0:28.7	Is it a buzzword or something that really impacts businesses and the economy?
0:33.7	Stick around until the end of this podcast for a preview of a recent episode.
0:39.6	WBUR Podcast, Boston.
0:47.7	This is On Point. I'm Meghna Chakrabarti.
0:51.1	So the general understanding of how artificial intelligence models get trained is that they
0:56.8	scoop up vast amounts of data from the real world and learn how to create responses that match
1:03.8	that real world data.
1:05.4	Here's an example.
1:06.6	Large language models, or LLMs, are tools that Siri or Alexa use to answer your questions.
1:12.9	In development, those LLMs read billions of text samples from across the Internet: books, websites, etc.
1:20.8	The model looks for patterns on how words work together or really how humans use those words.
1:26.8	And as it trains, it tries to guess
1:29.0	maybe what word comes next in a sentence. And if it guesses wrong, it fixes the mistake,
1:34.3	it learns from that mistake. And then it repeats that process, billions and billions and
1:40.3	billions of times, each iteration getting better and better at guessing the right
1:46.2	word. That's essentially how the LLM learns to understand and write like a human. So what happens
1:54.4	when AI models run out of real-world data to train on?
2:03.9	Well, several research papers published in recent years suggest that developers will, in fact, run out of real-world data in a matter of years.
	...

Disclaimer: The podcast and artwork embedded on this page are from WBUR, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of WBUR and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.