Inferact: Building the Infrastructure That Runs Modern AI

The a16z Show

a16z

Software Eating The World, Business, Technology, Disruption, Culture, Innovation, Science, Entrepreneurship

🗓️ 22 January 2026

⏱️ 44 minutes

Summary

Inferact is a new AI infrastructure company founded by the creators and core maintainers of vLLM. Its mission is to build a universal, open-source inference layer that makes large AI models faster, cheaper, and more reliable to run across any hardware, model architecture, or deployment environment. This episode breaks down how modern AI models are actually run in production, why “inference” has quietly become one of the hardest problems in AI infrastructure, and how the open-source project vLLM emerged to solve it. The conversation also covers why the vLLM team started Inferact and their vision for a universal inference layer that can run any model, on any chip, efficiently.

Transcript

0:00.0	Our goal is to make vLLM the world's inference engine, really push the capabilities on the open source front,
0:07.8	and then build the universal inference layer.
0:10.4	That means we'll have the runtime to power any new model on new hardware for new application,
0:18.3	be able to tailor that to extreme efficiency and support all the AI workloads going forward.
0:23.0	I fundamentally believe that open source,
0:25.7	especially how vLLM itself is structured,
0:29.1	is critical to the AI infrastructure in the world.
0:32.9	And what we want to do with Inferact
0:35.7	is to support, maintain, steward, and push forward the open source ecosystem.
0:41.6	It is only when vLLM wins,
0:44.5	vLLM becomes a standard, and vLLM helps everybody achieve what they need to do,
0:48.9	that our company, in a sense, has the right meaning and is able to support everybody around it.
1:01.1	What if the hardest problem in artificial intelligence isn't training smarter models,
1:05.2	but simply keeping them running? For most of the history of computing, once the system was
1:09.8	built, the hard part was over.
1:11.8	You wrote the program, pressed run, and the machine behaved predictably.
1:15.8	Even early machine learning followed that pattern.
1:18.4	Inputs were standardized, workloads were regular, the computer did its job and stopped.
1:23.2	Large language models quietly broke that assumption.
1:26.4	Every request is different.
1:28.3	Prompts can be a sentence or an entire archive.
1:30.3	Outputs can end instantly or stretch on indefinitely.
	...

Disclaimer: The podcast and artwork embedded on this page are from a16z, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of a16z and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.