AI Filters Will Always Have Holes

The Quanta Podcast

Quanta Magazine

Life Sciences, Science, Physics

4.7 • 638 Ratings

🗓️ 6 January 2026

⏱️ 26 minutes

Summary

Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. Just as quickly as these “jailbreaks” appear, AI companies patch them by simply filtering out forbidden prompts before they ever reach the model itself.

Recently, cryptographers have shown how the defensive filters put around powerful language models can be subverted by well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system — a filter that protects a powerful language model inside it — creates gaps in the defenses that can always be exploited. In this episode, Quanta executive editor Michael Moyer tells Samir Patel about the findings and implications of this new work.

Audio coda courtesy of Banana Breakdown.

Transcript

0:00.0	Since large language models first became widely available a few years ago,
0:07.0	people have wanted to test them.
0:09.0	This is a natural human inclination when we're faced with a new technology.
0:13.0	We want to see what it can do. We want to push its limits.
0:16.0	We want to try to break it.
0:18.0	And in the case of AI, this includes finding situations or problems or scripts that
0:23.0	confuse the model or lead to wrong answers, and getting around the systems that are supposed to
0:30.2	keep them from providing dangerous or offensive information. We refer to that last idea as alignment.
0:40.5	It's the extent to which LLMs behave in accordance with human values, whatever that means to you. So as you'd expect, the model providers,
0:46.5	OpenAI and Google and Anthropic, have set up all sorts of filters, rules for the kind of
0:51.9	questions that models won't answer or limits on what they'll tell you.
0:56.4	And this brings us back to the idea of testing the technology. Can we get around the filters?
1:02.5	A bunch of strategies have been used for this and the systems have been updated to stop them.
1:08.0	But now it's starting to look more and more like filters, the most obvious and
1:11.8	visible guardrails on what people can get chatbots to say, have some serious hard limits on how
1:18.4	effective they'll ultimately be in this cat-and-mouse game.
1:32.7	Welcome to the Quanta Podcast, where we explore the frontiers of fundamental science and math.
1:35.7	I'm Samir Patel, editor-in-chief of Quanta Magazine.
1:42.1	Getting past the filters on AI chatbots is often referred to as jailbreaking.
1:46.1	This is a term that came, well, from escaping from prison,
1:52.1	but in more modern usage, it's about getting around the official limits on a technology,
1:55.5	like unlocking your phone so you can use it with any carrier.
	...

Disclaimer: The podcast and artwork embedded on this page are from Quanta Magazine, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Quanta Magazine and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.