Big Technology Podcast

How An AI Model Learned To Be Bad — With Evan Hubinger And Monte MacDiarmid

Alex Kantrowitz

Technology, Religion & Spirituality, Business News, Business, Religion, Science, Philosophy, Society & Culture, Entrepreneurship, Management, Marketing, Politics, News Commentary, Government, Investing, Tech News, Social Sciences, News

4.6 • 395 Ratings

🗓️ 3 December 2025

⏱️ 67 minutes

Summary

Evan Hubinger is Anthropic’s alignment stress-testing lead. Monte MacDiarmid is a researcher in misalignment science at Anthropic. The two join Big Technology to discuss their new research on reward hacking and emergent misalignment in large language models. Tune in to hear how cheating on coding tests can spiral into models faking alignment, blackmailing fictional CEOs, sabotaging safety tools, and even developing apparent “self-preservation” drives. We also cover Anthropic’s mitigation strategies like inoculation prompting, whether today’s failures are a preview of something far worse, how much to trust labs to police themselves, and what it really means to talk about an AI’s “psychology.” Hit play for a clear-eyed, concrete, and unnervingly fun tour through the frontier of AI safety.

---

Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice.

Want a discount for Big Technology on Substack + Discord? Here’s 25% off for the first year: https://www.bigtechnology.com/subscribe?coupon=0843016b

Questions? Feedback? Write to: [email protected]

Learn more about your ad choices. Visit megaphone.fm/adchoices

Transcript

Click on a timestamp to play from that location

0:00.0

Artificial intelligence models are trying to trick humans, and that may be a sign of much worse and potentially evil behavior.

0:07.0

We'll cover what's happening with two Anthropic researchers right after this.

0:12.0

The truth is AI security is identity security. An AI agent isn't just a piece of code, it's a first-class citizen in your digital ecosystem and it needs to be treated like one.

0:21.6

That's why Okta is taking the lead to secure these AI agents. The key to unlocking this new layer of protection: an identity security fabric.

0:29.6

Organizations need a unified, comprehensive approach that protects every identity, human or machine, with consistent policies and oversight.

0:36.6

Don't wait for a security incident to realize your AI agents are a massive blind spot.

0:41.6

Learn how Okta's identity security fabric can help you secure the next generation of identities,

0:47.0

including your AI agents.

0:48.7

Visit Okta.com.

0:50.2

That's OKTA.com.

0:52.6

Capital One's tech team isn't just talking about multi-agentic AI.

0:57.1

They already deployed one.

0:58.9

It's called chat concierge and it's simplifying car shopping.

1:03.2

Using self-reflection and layered reasoning with live API checks,

1:07.6

it doesn't just help buyers find a car they love. It helps schedule a test drive,

1:13.1

get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed.

1:20.3

That's how they stack. That's technology at Capital One. Welcome to Big Technology Podcast,

1:25.7

a show for cool-headed and nuanced conversation about the tech world and beyond.

1:31.2

Today we're going to cover one of my favorite topics in the tech world and one that's a little bit scary.

1:35.9

We're going to talk about how AI models are trying to fool their evaluators to trick humans and how that might be a sign of much worse and potentially evil behavior.

1:44.9

We're joined today by two Anthropic researchers.

1:47.9

Evan Hubinger is the alignment stress-testing lead at Anthropic, and Monte MacDiarmid is a researcher

...


Disclaimer: The podcast and artwork embedded on this page are from Alex Kantrowitz, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Alex Kantrowitz and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.

Copyright © Tapesearch 2025.