Forbes Daily Briefing

The Library Of Congress Is A Training Data Playground For AI Companies

Forbes

Careers, Business, News, Entrepreneurship

4.6 (12 Ratings)

🗓️ 28 September 2024

⏱️ 5 minutes

Summary

With archives hosting about 180 million works, the world’s largest library is drawing interest from AI startups looking to train their large language models on content that won’t get them sued.

Transcript

0:00.0

Here is your Forbes Daily Briefing Bonus Story of the Week.

0:04.0

Today on Forbes, The Library of Congress is a training data playground for AI companies.

0:11.0

Black-and-white portraits of Rosa Parks, letters penned [inaudible] known to be one of the last handwritten Bibles in Europe.

0:23.0

These are among the 180 million items,

0:26.0

including books, manuscripts, maps, and audio recordings,

0:30.0

housed within the Library of Congress.

0:33.2

Every year, hundreds of thousands of visitors walk through the library's high-ceilinged,

0:36.9

pillared halls, passing beneath Renaissance-style domes embellished with murals and mosaics.

0:43.0

But of late, the more than 200-year-old library has attracted a new type of patron:

0:48.0

AI companies that are eager to access the library's digital archives,

0:53.3

and the 185 petabytes of data stored within it,

0:57.2

to develop and train their most advanced AI models.

1:01.2

For reference, one petabyte is equal to 1,000 terabytes or 1 million gigabytes.
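
The conversion mentioned here can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, matching the figures quoted in the episode; the function names are illustrative and not from the source.

```python
# Decimal (SI) unit factors, as quoted in the episode:
# 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes.
TB_PER_PB = 1_000
GB_PER_TB = 1_000


def petabytes_to_terabytes(petabytes: int) -> int:
    """Convert petabytes to terabytes (hypothetical helper)."""
    return petabytes * TB_PER_PB


def petabytes_to_gigabytes(petabytes: int) -> int:
    """Convert petabytes to gigabytes (hypothetical helper)."""
    return petabytes * TB_PER_PB * GB_PER_TB


if __name__ == "__main__":
    archive_pb = 185  # size of the library's digital archive per the episode
    print(f"{archive_pb} PB = {petabytes_to_terabytes(archive_pb):,} TB")
    print(f"{archive_pb} PB = {petabytes_to_gigabytes(archive_pb):,} GB")
```

By that arithmetic, the 185 petabytes cited in the episode come to 185,000 terabytes, or 185 million gigabytes.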

1:07.0

Judith Conklin, chief information officer at the Library of Congress, told Forbes, quote,

1:13.0

we know that we have a large amount of digital material

1:16.0

that large language model companies are very interested in.

1:19.0

It's extraordinarily popular.

1:22.0

The upsurge in interest in the library's data is also reflected in the numbers.

1:26.9

The Congress.gov website, which is managed by the Library of Congress and hosts data about

1:31.7

bills, statutes, and laws, gets anywhere between 20 million

1:35.8

to 40 million monthly hits on its API, an interface that allows programmers to download

1:41.1

the library's data in a machine-readable format.
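
To give a sense of what one such API hit might look like, here is a rough sketch that only builds a request URL and makes no network call. It assumes the publicly documented Congress.gov API v3 layout (a `/bill` list endpoint with `api_key`, `format`, `limit`, and `offset` query parameters); the helper name and the placeholder key are hypothetical, and the episode itself does not describe any specific endpoint.

```python
from urllib.parse import urlencode

# Assumed base URL for the Congress.gov API (version 3).
BASE_URL = "https://api.congress.gov/v3"


def build_bill_list_url(api_key: str, limit: int = 20, offset: int = 0) -> str:
    """Build a URL that would request a page of bills as JSON.

    Hypothetical helper: it assembles the query string only and does not
    contact the server, so it can run without a real API key.
    """
    query = urlencode(
        {
            "api_key": api_key,  # placeholder; a real key must be requested
            "format": "json",    # ask for a machine-readable response
            "limit": limit,      # number of records per page
            "offset": offset,    # starting record for pagination
        }
    )
    return f"{BASE_URL}/bill?{query}"


if __name__ == "__main__":
    print(build_bill_list_url("DEMO_KEY", limit=5))
```

A program that pages through results by increasing `offset` would generate many such requests, which is one way a monthly hit count in the tens of millions could accumulate.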

...

Disclaimer: The podcast and artwork embedded on this page are from Forbes, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Forbes and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.
