The Challenge of AI Model Evaluations with Ankur Goyal

Software Engineering Daily

News, Tech News, Technology

🗓️ 10 June 2025

⏱️ 45 minutes

Summary

Evaluations are critical for assessing the quality, performance, and effectiveness of software during development. Common evaluation methods, such as code reviews and automated testing, can help identify bugs, ensure compliance with requirements, and measure software reliability. However, evaluating LLMs presents unique challenges due to their complexity, versatility, and potential for unpredictable behavior. Ankur Goyal is the CEO and founder of Braintrust Data, which provides an end-to-end platform for AI application development.

Transcript

0:00.0	Evaluations are critical for assessing the quality, performance, and effectiveness of software during development.
0:07.0	Common evaluation methods include code reviews and automated testing, and can help identify bugs, ensure compliance with requirements, and measure software reliability.
0:18.0	However, evaluating LLMs presents unique challenges due to their complexity,
0:23.8	versatility, and potential for unpredictable behavior. Ankur Goyal is the CEO and founder of
0:30.5	Braintrust Data, which provides an end-to-end platform for AI application development and has a focus on
0:37.0	making LLM development robust and iterative.
0:40.6	Ankur previously founded Impira, which was acquired by Figma, and he later ran the AI team at Figma.
0:47.7	Ankur joins the show to talk about Braintrust and the unique challenges of developing evaluations
0:53.7	in a non-deterministic context.
0:56.9	This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work
1:01.9	and where to find him. Ankur, welcome to the show.
1:16.9	Thank you so much for having me.
1:18.1	Yeah, absolutely.
1:18.8	Thanks for being here.
1:20.0	So let's talk about Braintrust.
1:22.0	How did you guys get started?
1:23.8	What was the original inspiration?
1:25.7	Yeah, so prior to Braintrust, I used to lead the
1:28.1	AI team at Figma. And before that, I started a company called Impira, which Figma acquired. And at
1:34.3	Impira, we were in like the stone ages of AI, pre-ChatGPT, and built a product that helped you
1:40.9	extract data from documents. And basically, every time we would change something, like whether it was changing our models
1:46.6	or when we started using language models, like the prompts or even the code that fed stuff
	...

Disclaimer: The podcast and artwork embedded on this page are from Software Engineering Daily, and are the property of its owner and not affiliated with or endorsed by Tapesearch.

Generated transcripts are the property of Software Engineering Daily and are distributed freely under the Fair Use doctrine. Transcripts generated by Tapesearch are not guaranteed to be accurate.