Product Tapas
Posts
🧠 Productivity Tapas: Mastering Evals

🧠 Productivity Tapas: Mastering Evals

The AI Skill That Will Define Your PM Career in 2025

Alastair Preacher
May 02, 2025

Welcome to this Web Special, where we’re going deep on Evals.

When the Chief Product Officers of both OpenAI and Anthropic independently confirm that evals will be the most critical skill for product teams going forward, we should take notice. At Anthropic, eval creation has even become a core part of their interview process, testing candidates on their ability to transform "crappy evals into good ones.”

Mastering evaluations is the single most valuable skill a product manager can develop in this new AI landscape. This comprehensive guide explores what evals are, why they matter, and how to implement them effectively in your AI products.

🔍 What Evals Are and Why They Matter

Evals are structured frameworks for measuring how well your AI systems perform across multiple dimensions. Unlike traditional software with deterministic behaviour, AI systems—particularly generative ones—produce varied outputs for the same input.

Traditional software systems are predictable—the same input produces the same output every time. You can write unit tests to confirm your payroll system correctly calculates taxes or your e-commerce platform properly applies discounts.

AI systems operate differently. As Aman Khan, Director of Product at Arize AI, explains: "Evals are the way that you measure how good or bad the product is. Before, in software 1.0 or 2.0, you could view like 'are we calling the right API?' Now the entire system is a lot less deterministic." [Source: Mastering Evals with Aman Khan]

Think of your AI product as a box with inputs (what users want) and outputs (your product's solution). Evals help you understand if the "black box" in the middle is actually doing what you intend across various scenarios.

The Four Components of Evaluation

Based on research from LangChain and other sources, any comprehensive evaluation system consists of four key components [Source: LangChain Evaluation Series]:

Dataset: The test cases or examples you'll use to evaluate your system
Evaluator: The judge that will assess performance (human, another AI, or automated tests)
Task: The specific capability being measured (coding, summarisation, etc.)
Results interpretation: How you'll analyse and understand the evaluation outputs

These components give you a structured way to think about evaluation, whether you're using public benchmarks or creating custom tests for your specific application.

🛠️ The Five Core Skills Every AIPM Needs

To excel as an AI product manager, focus on developing these five foundational skills:

Subscribe to keep reading

This content is free, but you must be subscribed to Product Tapas to continue reading.

Already a subscriber?Sign in.Not now

Reply

or to participate.