Text to Video: The Next Leap in AI Generation

The a16z Show

a16z

Software Eating The World, Science, Technology, Innovation, Culture, Disruption, Business, Entrepreneurship

🗓️ 20 December 2023

⏱️ 33 minutes

Summary

General Partner Anjney Midha explores the cutting-edge world of text-to-video AI with AI researchers Andreas Blattmann and Robin Rombach. Released in November, Stable Video Diffusion is their latest open-source generative video model, overcoming challenges in size and dynamic representation. In this episode, Robin and Andreas share why translating text to video is complex, the key role of datasets, current applications, and the future of video editing.

Transcript

0:00.0	When I first sampled this model I was actually shocked that it works so well.
0:05.0	The pure improvements in performance and text understanding of these models.
0:09.0	Is it possible to derive something like a physical law from such a model? I think a really important
0:15.0	part of this is the fact that these models have been accessible to everyone.
0:19.9	It learns a representation of the world.
0:29.0	Today, many people are familiar with text-to-text and text-to-image AI models. Think ChatGPT or Midjourney.
0:31.0	But what about text-to-video? Well, several companies are working to make that a reality,
0:36.9	but for many reasons, it's a lot harder. For one, their size. Just think, you'll often find text files in the kilobytes, images, maybe a few
0:46.0	megabytes, but it's not uncommon for high-quality video to be in the gigabytes.
0:50.3	Plus, video requires a much more dynamic representation of the world that incorporates the physics of movement, 3D objects, and more.
0:58.0	Imagine the hand challenge in text-to-image, but in this case, it's hands squared.
1:02.0	But this is not stopping the researchers behind Stable Video Diffusion,
1:07.0	which as of November 21st was released
1:09.0	as a state-of-the-art open-source generative video model.
1:13.7	And today you'll get to hear directly from two of the technical researchers behind that model,
1:18.8	Andreas Blattmann and Robin Rombach.
1:21.8	Robin, by the way, is also the co-inventor of the Stable Diffusion image models.
1:28.9	So in today's episode, together with a16z General Partner Anjney Midha, you'll get to
1:33.7	hear firsthand what makes text-to-video so much harder.
1:37.0	The challenges like selecting the right data sets that enable realistic representations
1:41.2	of the world, applications where this technology is actually already
	...

Disclaimer: The podcast and artwork embedded on this page are from a16z, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of a16z and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.