🎙️ Pod Shots - Bitesized Podcast Summaries

🤖 The Definitive Guide To AI Evals

Hey there,

This is a one-off special – sorry for the extra email this week!

As you may know, we do Pod Shots around here, and I wanted to get this one out to you by email as a test, to see what you think and whether I should start sending them out separately each week.

This one's great: it's the recent Lenny's Podcast episode about AI evals, and since it's such a hot topic I figured I'd send it straight to you and get your input.

I'm keen to see how this lands, so please click below to let me know whether I should keep doing these each week!

Do you want to receive the Pod Shots separately each week?

…or just keep them in the normal once-a-week email?

TL;DR

Ever wondered why your AI system works great in demos but falls apart in production? This Pod Shot deep dive tackles the art and science of AI evaluations – the secret sauce that separates reliable AI systems from expensive experiments.

Lenny sat down with two evaluation experts who've been in the trenches: Hamel Husain (machine learning engineer, formerly at GitHub) and Shreya Shankar (Berkeley PhD, ex-Google). Here's what you'll discover:

• 🎯 The evaluation hierarchy – why unit tests beat vibes every time
• 🔄 LLM-as-judge techniques that actually work (and when they don't)
• 📊 Real-world frameworks from companies shipping AI at scale
• ⚡ Quick wins you can implement this week to improve your AI reliability

Whether you're building your first AI feature or scaling to millions of users, this one's packed with actionable insights you won't find anywhere else.

Let's dive in! 🚀 

🤖 The Definitive Guide To AI Evals

🎯 Why AI Evals Are the Most Important New Skill for Product Builders

Two years ago, nobody had heard the term "evals." Today, it's the hottest skill in AI product development. The CPOs of both Anthropic and OpenAI say it's becoming the most crucial capability for product builders. When was the last time an entirely new skill emerged that product teams had to master to stay competitive?

This isn't just another tech buzzword. Companies building and selling evals to AI labs are among the fastest-growing in the world. The reason? To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in when developing AI applications.

But according to Shreya and Hamel, most people are doing evals completely wrong. They're skipping the most important step, automating too early, and building tests that don't actually improve their products. The result? They get burned, lose trust in the process, and declare themselves "anti-evals."

This recent episode of Lenny's Podcast is the (self-proclaimed) definitive guide to doing evals right, from the two people who've taught over 2,000 PMs and engineers across 500 companies, including massive teams at OpenAI, Anthropic, and every other major AI lab.

Lenny's Podcast | Hamel Husain & Shreya Shankar

🎥 Watch the full episode here
📆 Published: December 2024
🕒 Estimated Reading Time: 5 mins. Time saved: 103+ mins! 🔥

🔍 What the Hell Are Evals? (And Why Everyone Gets Them Wrong)

At their core, evals are simply a way to systematically measure and improve an AI application. Think of them as data analytics for your LLM application: a systematic way of looking at your application's data and creating metrics so you can iterate and improve with confidence.

Before evals, you were left guessing. You'd fix a prompt and hope you weren't breaking anything else. You'd rely on "vibe checks," which are fine initially but become unmanageable fast. As your application grows, you feel lost without a feedback signal to iterate against.

The biggest misconception? "We live in the age of AI. Can't the AI just eval itself?" This doesn't work. The second biggest mistake? Jumping straight to writing tests without understanding what's actually going wrong in your product.

Here's what most people miss: evals aren't just unit tests for AI. They're a spectrum of ways to measure application quality, from basic code checks to sophisticated monitoring systems that run on production data in real-time.

Key Takeaways:

  • Evals = systematic measurement and improvement of AI applications

  • Most failures happen because teams skip the crucial first step: error analysis

  • AI cannot effectively evaluate itself without human guidance and context

  • Evals work best when grounded in real user data, not hypothetical scenarios

📊 The Secret Weapon: Error Analysis (Why Looking at Data Changes Everything)

The most powerful technique in AI product development isn't building fancy tests—it's simply looking at your data. This process, called error analysis, has been used in machine learning for decades, but almost nobody applies it to AI products.

Here's how it works: Take 100 real conversations between users and your AI. Go through them one by one. Write down the first thing you see that's wrong in each interaction. Don't overthink it—just capture what's broken and move on.

This sounds manual and unscalable, but it's the opposite. Everyone who does this immediately gets addicted because you learn so much so quickly. You discover problems you never knew existed, patterns you couldn't have imagined, and failure modes that would never occur to you in a brainstorming session.

The key insight from recent research: "People's opinions of good and bad change as they review more outputs. They think of failure modes only after seeing 10 outputs they would never have dreamed of in the first place." You literally cannot design effective tests without first understanding what's actually going wrong.
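
If you want to try this yourself, here's a minimal sketch of that review loop in Python. It's not from the episode, and the file names and fields are placeholders for however your tooling exports traces:

```python
# Minimal error-analysis loop: review ~100 traces and jot one "open code" note per trace.
# Assumes traces were exported as JSONL with "id" and "messages" fields (placeholder names).
import json
import random

def load_traces(path="traces_export.jsonl", sample_size=100):
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    return random.sample(traces, min(sample_size, len(traces)))

def annotate(traces, out_path="open_codes.jsonl"):
    with open(out_path, "w") as out:
        for i, trace in enumerate(traces, 1):
            print(f"\n--- Trace {i}/{len(traces)} ---")
            for turn in trace.get("messages", []):
                print(f"{turn['role']}: {turn['content']}")
            # Write down the first thing that looks wrong, then move on.
            note = input("First problem you see (leave blank if it looks fine): ").strip()
            out.write(json.dumps({"trace_id": trace.get("id"), "note": note}) + "\n")

if __name__ == "__main__":
    annotate(load_traces())
```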

Key Takeaways:

  • Error analysis is the highest ROI activity in AI product development

  • 100 traces is the magic number—enough to find patterns, not so many it's overwhelming

  • You'll discover failure modes you could never have imagined upfront

  • This process takes 3-4 days initially, then 30 minutes per week to maintain

🤖 The Benevolent Dictator: Why Committees Kill Evals

One of the biggest traps teams fall into is trying to make error analysis a committee process. Everyone wants to be involved, everyone wants their opinion heard, and suddenly you're spending more time in meetings than actually improving your product.

The solution? Appoint a "benevolent dictator"—one person whose taste you trust to do the error analysis. This should be the domain expert: the person who understands the business context and can spot when something doesn't make sense.

For a real estate AI, it's the person who understands apartment leasing. For a legal AI, it's someone with legal expertise. Often, it's the product manager. The key is picking someone with domain knowledge and giving them the authority to make judgment calls quickly.

This isn't about fairness—it's about making progress. You can't make this process so expensive that you can't do it. The benevolent dictator approach keeps things moving and ensures you actually ship improvements instead of getting stuck in analysis paralysis.

Key Takeaways:

  • Committees slow down error analysis and reduce effectiveness

  • Pick one domain expert to lead the process

  • Speed and progress matter more than consensus

  • The goal is actionable improvement, not perfect agreement

🎯 From Chaos to Clarity: How to Turn Messy Notes into Actionable Insights

After reviewing 100 traces and writing notes about what's wrong, you'll have a mess of observations. This is where AI becomes incredibly useful—not for the initial analysis, but for organising and synthesising your findings.

The process uses techniques from social science called "open coding" and "axial coding." Your messy notes are "open codes"—raw observations about what's broken. "Axial codes" are the categories that emerge when you group similar problems together.

Here's where it gets powerful: you can use an LLM to categorise your notes automatically. Upload your observations to Claude or ChatGPT and ask it to create "axial codes" from your "open codes." The AI is excellent at finding patterns and grouping similar issues.

Once categorised, you can count the frequency of each problem type. Suddenly, you go from chaos to clarity. Instead of feeling overwhelmed by hundreds of issues, you have a ranked list of your biggest problems. Maybe 17 instances of "conversational flow issues" and 12 cases of "human handoff problems."
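
As a rough illustration (not from the episode), here's what that categorise-and-count step can look like with the OpenAI Python SDK; the model name and file format are assumptions you'd swap for your own setup:

```python
# Sketch: turn open codes into axial codes with an LLM, then count them.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY; the model name is illustrative.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()

def assign_axial_codes(notes):
    prompt = (
        "Below are raw notes ('open codes') about failures in an AI assistant.\n"
        "Group them into a small set of category labels ('axial codes') and return a\n"
        "JSON object mapping each note's index to its label.\n\n"
        + "\n".join(f"{i}: {note}" for i, note in enumerate(notes))
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

notes = [json.loads(line)["note"] for line in open("open_codes.jsonl")]
notes = [n for n in notes if n]  # drop traces where nothing was wrong
labels = assign_axial_codes(notes)
print(Counter(labels.values()).most_common())  # e.g. [("conversational flow", 17), ...]
```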

Key Takeaways:

  • Use AI for synthesis and organisation, not initial analysis

  • Open coding captures raw observations; axial coding creates actionable categories

  • Counting is the most powerful analytical technique—simple but effective

  • This process transforms overwhelming data into clear priorities

⚖️ LLM as Judge: Building Tests That Actually Work

Once you understand your biggest problems, you can build automated tests to catch them. This is where "LLM as Judge" comes in—using an AI to evaluate whether your AI is working correctly.

The key insight: you're not asking the judge to solve your original problem. You're asking it to do one very specific thing—detect one particular failure mode. The scope is narrow, the output is binary (pass/fail), and the task is much simpler than your original AI application.

For example, if you discover your real estate AI isn't properly handing off complex requests to humans, you build a judge that specifically looks for handoff failures. You write detailed criteria about when handoffs should happen, give the judge clear examples, and ask for a simple true/false answer.

But here's the crucial part most people skip: you must validate your judge against human judgment before deploying it. Run it on data you've already labeled and measure agreement. If it disagrees with humans too often, iterate on the prompt until alignment improves.
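
Here's a hedged sketch of what such a judge and its validation step might look like in Python; the handoff criteria, model, and file names are placeholders rather than anything prescribed in the episode:

```python
# Sketch of one narrow, binary LLM judge plus validation against human labels.
# The handoff criteria, model name, and file format are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You check exactly ONE thing: did the assistant fail to hand off to a
human when the user's request required it (e.g. pricing exceptions, legal questions,
complaints)? Answer with a single word: PASS if the handoff was handled correctly or
was not needed, FAIL if a needed handoff was missed."""

def judge(conversation):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": conversation},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

# Never deploy the judge before measuring agreement with your own labels.
labeled = [json.loads(line) for line in open("labeled_handoffs.jsonl")]  # {"conversation": ..., "human_pass": true/false}
agreement = sum(judge(x["conversation"]) == x["human_pass"] for x in labeled) / len(labeled)
print(f"Judge vs. human agreement: {agreement:.0%}")
```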

Key Takeaways:

  • LLM judges work best on narrow, specific failure modes

  • Always use binary (pass/fail) scoring, never rating scales

  • Validate judge performance against human labels before deployment

  • One judge per failure mode—don't try to catch everything with one test

🚀 The Flywheel: From One-Time Analysis to Continuous Improvement

The magic happens when you turn this process into a flywheel. Your error analysis becomes the foundation for automated monitoring that runs continuously on production data.

You can integrate your LLM judges into unit tests, so every code change gets validated against your known failure modes. You can run them on samples of production traffic daily or weekly to catch new problems as they emerge. You can build dashboards that show exactly how your AI is performing on the metrics that matter.
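
For example, a judge like the one sketched above can sit inside an ordinary pytest suite, so regressions on known failure modes fail the build; the module and file names below are hypothetical:

```python
# Sketch: wire the judge into a pytest suite so every prompt or code change is
# re-checked against known failure cases. Module and file names are hypothetical.
import json
import pytest

from handoff_judge import judge        # the binary judge sketched earlier
from my_app import run_assistant       # placeholder entry point for your AI feature

REGRESSION_CASES = [json.loads(line) for line in open("handoff_regression_cases.jsonl")]

@pytest.mark.parametrize("case", REGRESSION_CASES, ids=lambda c: c["id"])
def test_handoff_not_missed(case):
    transcript = run_assistant(case["user_messages"])
    assert judge(transcript), f"Handoff failure reappeared on case {case['id']}"
```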

This creates a competitive moat. While competitors are flying blind, relying on vibe checks and hoping for the best, you have precise, real-time visibility into your AI's performance. You catch problems before users complain. You can iterate with confidence, knowing exactly what you're fixing and whether it's working.

The companies doing this well don't talk about it publicly—it's their secret weapon. But the results speak for themselves: better user experiences, faster iteration cycles, and AI products that actually work reliably in production.

Key Takeaways:

  • Turn one-time analysis into continuous monitoring systems

  • Integrate judges into both development workflows and production monitoring

  • This creates a sustainable competitive advantage

  • Most successful AI companies use this approach but don't publicise it

💡 The Great Evals Debate: Why Everyone's Fighting About the Wrong Thing

The AI community is currently having a heated debate about evals, with strong opinions on both sides. Some say they're essential; others claim they're overrated. The truth? Most people are arguing about different things entirely.

The confusion stems from narrow definitions. Some think evals are just unit tests. Others think they're only the data analysis part. Some focus on generic benchmarks like MMLU scores, while others emphasise product-specific metrics.

The "anti-evals" camp often consists of people who tried LLM judges badly—using rating scales instead of binary decisions, skipping human validation, or automating too early. They got burned and now warn others away from the entire concept.

Meanwhile, teams like the one behind Anthropic's Claude Code claim they "don't do evals, just vibes." But they're standing on the shoulders of extensive evaluation work done on their foundation models, and they're almost certainly doing systematic error analysis internally; they just don't call it "evals."

Key Takeaways:

  • The debate often involves people talking past each other with different definitions

  • Many "anti-eval" positions come from bad experiences with poorly implemented systems

  • Successful companies do evaluation work but may not use the term "evals"

  • The real question isn't whether to evaluate, but how to do it effectively

🎯 Getting Started: Your First Steps Into the World of Evals

Ready to start improving your AI product systematically? Here's your action plan:

Week 1: Error Analysis

  • Export 100 recent conversations between users and your AI

  • Go through them one by one, writing notes about the first problem you see

  • Don't overthink it—just capture what's wrong and move on

  • Stop when you feel like you're not learning anything new

Week 2: Synthesis

  • Use an LLM to categorise your notes into failure modes

  • Count the frequency of each problem type

  • Identify your top 3-5 most common or critical issues

  • Decide which ones are worth building automated tests for

Week 3: Build Your First Judge

  • Pick one specific failure mode to target

  • Write a detailed prompt describing when this failure occurs

  • Test it on data you've already labeled manually

  • Iterate until it agrees with human judgment most of the time

Week 4: Automate and Monitor

  • Integrate your judge into your development workflow

  • Set up monitoring to run it on production samples (see the sketch after this list)

  • Build a simple dashboard to track your key metrics

  • Use the insights to actually fix problems in your product
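
As a rough sketch of that monitoring step (with placeholder trace sources and storage you'd swap for your own stack), a scheduled daily job might look like this:

```python
# Sketch: a daily job that samples production traces, runs the judge, and appends a
# pass-rate metric for a simple dashboard. Trace source and storage are placeholders.
import datetime
import json
import random

from handoff_judge import judge   # the binary judge sketched earlier

def sample_production_traces(path="prod_traces_today.jsonl", n=50):
    traces = [json.loads(line) for line in open(path)]
    return random.sample(traces, min(n, len(traces)))

def run_daily_check():
    traces = sample_production_traces()
    passes = sum(judge(t["conversation"]) for t in traces)
    rate = passes / len(traces)
    record = {
        "date": str(datetime.date.today()),
        "metric": "handoff_pass_rate",
        "value": rate,
        "sample_size": len(traces),
    }
    with open("eval_metrics.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    print(f"{record['date']}: handoff pass rate {rate:.0%} on {len(traces)} sampled traces")

if __name__ == "__main__":
    run_daily_check()
```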

Key Takeaways:

  • Start with manual error analysis—don't skip this step

  • Use AI for organisation and automation, not initial discovery

  • Focus on one failure mode at a time

  • The goal is product improvement, not perfect test coverage

This isn't just about catching bugs—it's about building AI products that actually work reliably for real users. In a world where most AI applications feel unpredictable and frustrating, systematic evaluation is your path to building something people can actually depend on.

The companies mastering this approach today will be the ones still standing when the AI hype cycle ends and users demand products that actually work.

That’s a wrap.

As always, the journey doesn't end here!

Please share and let us know whether you liked this separate Pod Shot, or whether you’d rather stick with your usual programming….. 🚀👋

Alastair 🍽️.
