Welcome to this Web Special, where we’re going deep on Evals.
When the Chief Product Officers of both OpenAI and Anthropic independently confirm that evals will be the most critical skill for product teams going forward, we should take notice. At Anthropic, eval creation has even become a core part of their interview process, testing candidates on their ability to transform "crappy evals into good ones.”
Mastering evaluations is the single most valuable skill a product manager can develop in this new AI landscape. This comprehensive guide explores what evals are, why they matter, and how to implement them effectively in your AI products.
🔍 What Evals Are and Why They Matter
Evals are structured frameworks for measuring how well your AI systems perform across multiple dimensions. Unlike traditional software with deterministic behaviour, AI systems—particularly generative ones—produce varied outputs for the same input.
Traditional software systems are predictable—the same input produces the same output every time. You can write unit tests to confirm your payroll system correctly calculates taxes or your e-commerce platform properly applies discounts.
AI systems operate differently. As Aman Khan, Director of Product at Arize AI, explains: "Evals are the way that you measure how good or bad the product is. Before, in software 1.0 or 2.0, you could view like 'are we calling the right API?' Now the entire system is a lot less deterministic." [Source: Mastering Evals with Aman Khan]
Think of your AI product as a box with inputs (what users want) and outputs (your product's solution). Evals help you understand if the "black box" in the middle is actually doing what you intend across various scenarios.
The Four Components of Evaluation
Based on research from LangChain and other sources, any comprehensive evaluation system consists of four key components [Source: LangChain Evaluation Series]:
Dataset: The test cases or examples you'll use to evaluate your system
Evaluator: The judge that will assess performance (human, another AI, or automated tests)
Task: The specific capability being measured (coding, summarisation, etc.)
Results interpretation: How you'll analyse and understand the evaluation outputs
These components give you a structured way to think about evaluation, whether you're using public benchmarks or creating custom tests for your specific application.
🛠️ The Five Core Skills Every AIPM Needs
To excel as an AI product manager, focus on developing these five foundational skills:
1. Master the Fundamentals
Understand AI and ML algorithms at a high level, focusing on what different approaches are best suited for. Not every problem requires generative AI—sometimes a classification model, regression model, or rule-based system will be more appropriate.
It's like showing up to a hardware store and saying 'I need a tool'—you can't just use a hammer for everything. If you're using LLMs for everything, you're probably going to have a bad time.
2. Customer Obsession
AI makes it more critical than ever to deeply understand user needs. Your job is to translate fuzzy human requirements into something an AI system can solve effectively.
3. Curiosity to Learn and Prototype
Prototyping is fast becoming to be a pretty core skill. The ability to quickly build functional prototypes gives you superpowers as a PM.
Modern AI tools like Replit, Windsuf, Bolt or V0 allow you to build working prototypes during meetings, changing how product decisions get made. Teams can pivot from lengthy discussions about resource allocation to iterating on real solutions in real-time.
4. Learn from Great AI Experiences
Use existing AI tools to develop intuition about what works and what doesn't. Pay attention to the experiences that feel magical versus those that frustrate you.
5. Master Evaluation and Observability
You can't just ship AI—you need to measure its performance and understand its impact. This is where evals become critical, allowing you to assess performance and iterate with confidence.
📋 The Four-Step Framework for Building an Eval System
Here's a practical framework for implementing an effective eval system [Source: Must-Learn AI Skill for PMs: AI Evals]:
1. Create "Goldens" (Perfect Examples)
Start by defining what perfect inputs and outputs look like for your system. These "goldens" serve as your north star examples.
For a customer service bot, goldens might include:
Input: "I need to return an item that arrived damaged."
Output: "I'm sorry to hear that. I'd be happy to help process your return. Could you please provide your order number so I can locate your purchase?"
For an internal knowledge bot, goldens might include questions about company policies with ideal responses that reflect the correct information and appropriate tone.
As a PM, you're uniquely positioned to help define these perfect examples because you understand both user needs and business requirements. These goldens should cover common use cases, edge cases, failure modes, and potentially problematic inputs.
2. Generate Synthetic Test Data
Once you have your golden examples, use LLMs to generate more synthetic test data that expands your coverage. This approach:
Augments your relatively small set of goldens with hundreds or thousands of test cases
Helps identify scenarios you might not have considered
Protects user privacy by not requiring real user data
Creates a more robust training set for auto-raters
For example, if one of your goldens involves processing a refund, you might generate dozens of variations of refund requests with different product types, reasons, time frames, and emotional states.
3. Grade Outputs to Define Your Ground Truth
Next, have humans grade the outputs across your key evaluation dimensions. This establishes your ground truth for what constitutes good or bad performance.
For a customer service bot, you might grade on:
Accuracy (Did it correctly apply company policies?)
Helpfulness (Did it move toward resolving the issue?)
Tone (Was it appropriately empathetic without being overly casual?)
Safety (Did it avoid potentially harmful responses?)
Your grading scale might be binary (pass/fail) for some dimensions and more nuanced (1-5 rating) for others. The important thing is to define clear rubrics for what each score means.
4. Build Auto-Raters
Human evaluation is expensive and doesn't scale. Using your human-graded data, you can train "auto-raters"—AI systems that evaluate your main AI system automatically.
An auto-rater prompt might start with: "You are an agent whose purpose is to evaluate outputs of a customer support bot. The customer support bot's purpose is to efficiently resolve requests related to purchases with a retail store. I will show you examples of good and bad responses, then ask you to evaluate new responses."
The goal is for your auto-rater to match human judgments with high accuracy (e.g., 95% agreement with human raters). Once calibrated, auto-raters allow you to test new versions of your system rapidly and evaluate thousands of outputs that would be impractical for humans to review.
💼 Real-World Example: Evaluating an Internal Knowledge Bot
Let's apply these principles to a real-world example of an internal knowledge bot for a company [Source: Must-Learn AI Skill for PMs: AI Evals]:
1. Strategic Context
First, establish the context for your evaluation strategy:
Goals:
Fast, convenient, accurate access to company information
Up-to-date context (especially important for RAG systems)
Trustworthy and secure information access
Personalised and contextual responses based on user roles
User Personas:
New employees
Customer support representatives
Sales and business development teams
Why Evals Matter Here:
Large response variability in outputs
Risk of knowledge drift as company information changes
Need for risk mitigation and security auditing
ROI justification
Establishing baselines for targeted improvements
2. System Components
Break down the knowledge bot architecture to understand what needs evaluation:
Retrieval Component:
Data preprocessing and chunking
Vector embedding creation
Query processing
Similarity search (e.g., cosine similarity)
Re-ranking
Generation Component:
System prompts
Response generation
Formatting and presentation
3. Key Metrics Dashboard
Organise metrics into three categories:
Business Metrics:
Task completion rate
ROI calculation (cost savings ÷ cost of running)
RAG Metrics:
Hallucination rate (% of incorrect or unsupported statements)
Faithfulness (relevance of response to context)
Precision and recall
System Performance Metrics:
Latency (response time)
Throughput (concurrent query capacity)
Cost per query
With the North Star metric being ROI, which balances the business value against implementation costs.
🚀 Driving AI Adoption in Enterprise: A PM's Guide
As a product manager in an enterprise setting, you may face significant hurdles when introducing AI initiatives. Organisations often struggle with inertia, risk aversion, and uncertainty about where to begin. Here's how you can lead the charge effectively:
Start With a High-Value, Low-Risk Proof of Concept
Begin with a narrowly defined business problem where:
The current process is manual, time-consuming, or error-prone
Data is readily available and reasonably structured
Success criteria can be clearly defined and measured
Failure wouldn't disrupt critical business operations
"The key is making your first AI project important enough to matter, but contained enough to succeed," explains Aman Khan. "Look for the intersection of business value and technical feasibility."
For example, rather than attempting to automate an entire customer service operation, start with automating responses to the top 5 most common customer queries, which might represent 30% of total volume.
Build an Evaluation Framework Before You Start
One of the most powerful ways to drive AI adoption is to create evaluation frameworks that:
Establish clear baseline metrics for the current process
Define specific success criteria for the AI implementation
Provide transparent, ongoing performance monitoring
Demonstrate objective improvement over time
By establishing these frameworks early, you:
Create accountability for AI performance
Build trust in the technology
Generate evidence to support expanded investment
Develop a shared understanding of what "good" looks like
Frame AI as Augmentation, Not Replacement
Resistance often stems from fears about job displacement. Address this directly by:
Emphasising how AI handles routine tasks so people can focus on higher-value work
Involving potential users in the design process
Highlighting examples where AI and humans achieve better results together than either could alone
Creating opportunities for employees to develop new skills that complement AI capabilities
Create a Staged Adoption Roadmap
Present AI adoption as a journey rather than a single initiative:
Discovery Phase (1-2 months)
Identify high-potential use cases through workshops and interviews
Assess data readiness for each potential application
Conduct competitor analysis to identify adoption patterns
Pilot Phase (2-3 months)
Implement 1-2 narrowly scoped proof-of-concept projects
Establish rigorous evaluation frameworks
Document lessons learned and iterate rapidly
Expansion Phase (3-6 months)
Scale successful pilots to broader deployment
Begin exploring additional use cases
Develop internal capabilities through training and hiring
Integration Phase (6-12 months)
Embed AI capabilities into core business processes
Develop governance frameworks for ongoing AI development
Create centers of excellence to support broader adoption
Address Data Readiness As a Foundation
Many AI initiatives fail not because of technology limitations but because of data issues. Before launching an AI project:
Conduct a data readiness assessment for your specific use case
Identify data quality, completeness, and accessibility issues
Begin addressing data pipeline challenges even before AI implementation
Consider starting with a data improvement initiative that will enable future AI applications
Leverage Competitive Intelligence to Create Urgency
Nothing drives organisational change like the fear of being left behind. Research how competitors are using AI and share these insights strategically:
Highlight case studies from your industry where AI has created competitive advantage
Quantify the potential cost of inaction (e.g., market share loss, missed efficiency gains)
Bring in external speakers from companies that have successfully implemented similar AI initiatives
Create a "state of the industry" report showing AI adoption trends in your sector
Build Internal AI Literacy
Resistance often stems from a lack of understanding. Drive adoption by:
Creating tiered AI education programs for different stakeholder groups
Hosting lunch-and-learns featuring successful AI implementations
Developing a simple framework to help colleagues distinguish between AI hype and realistic applications
Using demos and prototypes to make abstract concepts tangible
Develop a Robust Evaluation Strategy
As highlighted throughout this article, evaluation is critical not just for measuring success, but for driving adoption:
Show how evals provide safety guardrails that reduce organisational risk
Use evaluation results to celebrate successes and build momentum
Leverage objective metrics to overcome subjective resistance
Create dashboards that make AI performance transparent and accessible
"The companies that succeed with AI adoption don't just focus on the technology—they build evaluation frameworks that allow them to learn, adapt, and build confidence over time," notes Aman Khan. "Great evals are as much about change management as they are about technical assessment."
By following these strategies, you can help your organisation move from AI skepticism to successful adoption, positioning yourself as a strategic leader in the process.
🧩 LLM as Judge: A Powerful Approach to Evals
One particularly effective evaluation technique is using an LLM as a judge to evaluate the outputs of your system. As explained by Arize AI [Source: LLM as a Judge]:
"When we think about LLM as a judge, it's not the same thing as the first task. It's a very different task, and the second task—the LLM as a judge task—is cognitively not as hard."
What this means is that while generating a high-quality response might be challenging, evaluating whether a response is good or bad is often easier. This makes LLMs surprisingly effective judges.
Best Practices for LLM as Judge:
Use categorical labels, not continuous scores
Research shows LLMs struggle with consistently assigning numeric scores. Instead, use categorical outputs like "correct/incorrect" or "relevant/not relevant."Provide clear evaluation criteria
Explicitly tell the judge LLM what dimensions to evaluate (e.g., accuracy, helpfulness, tone).Include examples in your prompt
Show examples of good and bad responses to calibrate the judge.Use binary decisions where possible
Breaking complex evaluations into a series of binary decisions often works better than trying to make multi-faceted judgments.
🚀 Practical Tips for Better Evals
Here are some actionable tips to improve your eval practices:
Start With Real Data
Begin with actual examples of inputs and expected outputs. Look at:
Customer service logs
User questions
Common scenarios your system will encounter
Break Down Your System Into Components
Don't try to evaluate everything at once. For example, in a RAG system, separately evaluate:
Retrieval accuracy (Is it finding the right documents?)
Reasoning (Is it drawing correct conclusions?)
Response generation (Is it communicating effectively?)
Be Specific About Output Format
When creating auto-raters, be explicit about response formats: "Only answer with 'Correct' or 'Incorrect' at the end of your response."
This makes results easier to parse and analyse.
Use Examples in Your Prompts
Include examples of good and bad outputs in your evaluation prompts: "Here's an example of a correct answer. Here's an example of an incorrect answer."
This dramatically improves auto-rater performance.
Continually Refine Your Data Pipeline
For systems like knowledge bots that rely on company information, ensure your data pipeline keeps content up-to-date. Consider selective reprocessing of chunks when documents change rather than rebuilding the entire vector database.
🔧 Tools to Accelerate Your Eval Process
Several tools can help streamline your eval workflows:
Open Source Frameworks
Phoenix: Arize AI's open-source framework for building, testing, and monitoring AI systems (mentioned in LLM as a Judge)
LangSmith: LangChain's platform for running evaluations, referenced in their Evaluation Series
RAGAS: A framework specifically for evaluating retrieval-augmented generation systems, providing metrics like context precision, faithfulness, and answer relevancy
OpenAI Evals: OpenAI's framework for evaluating LLM outputs across various tasks and capabilities
Visualisation & Dashboarding
Modern AI tools can help you design evaluation dashboards to track metrics over time. These visual interfaces make it easier to communicate system performance with stakeholders and identify areas for improvement.
🏄♂️ Getting Started: Riding the AI Wave
If you're feeling behind on AI skills, don't worry, you’re still pretty early as Khan mentioned:
"The wave is actually pretty early still—there's a lot of room for different perspectives and approaches. Bringing your own unique approach to building with AI is really what the ecosystem needs right now."
To get started:
Try tools yourself: Use existing AI tools to build intuition
Start small: Begin with a limited component or feature
Join communities: Connect with others building AI products
Focus on user problems: Let real user needs drive your implementations
Learn by doing: The best way to learn evals is to start writing them
🔮 The Future of Product Management in the AI Era
AI is redefining what it means to be a product manager. The old approach of writing PRDs and handing them off is becoming obsolete in the world of AI, where rapid iteration is essential.
Imagine you could just show up to the meeting with a prototype instead of a PRD. That's the shift that PMs are going to have to undergo as they figure out how to communicate better in this new world.
In this new landscape, your ability to evaluate and improve AI systems will be what separates exceptional PMs from the rest. By mastering evals, you're not just learning a technical skill—you're developing a deeper understanding of your users and a more iterative approach to product development.
The AI era gives product managers unprecedented creative power. We can now prototype rapidly, iterate quickly, and build products that would have been impossible just a few years ago. Evals provide the framework to make these systems reliable, useful, and aligned with our intentions.
The future belongs to those who can articulate clear visions and then rigorously evaluate whether their AI systems are delivering on those visions.
The question is: Will you be one of them?