Generative AI Models Are Sucking Up Data from All Over the Internet, Yours Included

Science Talk

Scientific American

Science

🗓️ 23 October 2023

⏱️ 12 minutes

Summary

In the rush to build and train ever larger AI models, developers have swept up much of the searchable Internet, quite possibly including some of your own public data, and potentially some of your private data as well.

Transcript

0:00.0	Understanding the human body is a team effort. That's where the Yakult group comes in.
0:05.8	Researchers at Yakult have been delving into the secrets of probiotics for 90 years.
0:11.0	Yakult also partners with Nature Portfolio to advance gut microbiome science through the Global Grants for Gut Health, an investigator-led research program.
0:19.6	To learn more about Yakult, visit yakult.co.jp.
0:22.7	That's Y-A-K-U-L-T dot C-O dot J-P. When it comes to a guide for your gut, count on Yakult.
0:32.6	To train a large artificial intelligence model, you need lots of text and images created by actual
0:39.0	humans. As the AI boom continues, it's becoming clearer that some of this data is coming
0:44.3	from copyrighted sources. Now writers and artists are filing a spate of lawsuits to challenge
0:50.0	how AI developers are using their work. But it's not just published authors and visual artists that should care about how generative AI is being trained.
0:58.2	If you're listening to this podcast, you might want to take notice, too.
1:02.0	I'm Lauren Leffer, the Technology Reporting Fellow at Scientific American.
1:05.8	And I'm Sophie Bushwick, tech editor at Scientific American.
1:08.5	You're listening to Tech, Quickly, the digital data-diving version of Scientific
1:13.5	American's Science, Quickly podcast.
1:23.4	So, Lauren, people often say that generative AI is trained on the whole internet, but it seems like there's not a lot of clarity on what that means.
1:33.3	When this came up in the office, lots of our colleagues had questions.
1:36.6	Totally. People are asking about their individual social media profiles, password-protected content, old blogs, all sorts of stuff.
1:43.7	It's hard to wrap your head around
1:44.8	what online data means when, as Emily M. Bender, a computational linguist at the University of Washington,
1:50.4	told me, quote, "there's no one place where you can download the internet." So let's dig into it.
1:56.0	How are these AI companies getting their data? Well, it's done through automated programs
2:00.7	called web crawlers and web scrapers.
	...

Disclaimer: The podcast and artwork embedded on this page are from Scientific American, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Scientific American and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.