    You've built memory extraction and a complete agent, but how do you know they actually work? In production, you need systematic ways to measure quality, catch regressions, and compare different models.

    In this section, you'll build a real evaluation suite for your AI system using Evalite. You'll learn to make your code testable, write meaningful test cases, and use LLMs as judges to score outputs.

    What You'll Build

    A comprehensive evaluation suite that can:

    • Test memory extraction in isolation with structured datasets
    • Run end-to-end agent tests with multi-hop retrieval queries
    • Score outputs using semantic similarity and LLM judges
    • Compare model performance side-by-side with A/B testing

    What You'll Learn

    Making AI Code Testable

    Refactor your code to separate business logic from database operations. You'll extract an inner function that accepts a model parameter, making it easy to swap models and test without side effects.
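    Here is roughly what that refactor can look like. This is a minimal sketch: the function names, the Zod schema, and the use of the AI SDK's generateObject are assumptions for illustration, not the workshop's exact code.

```ts
import { generateObject, type LanguageModel } from "ai";
import { z } from "zod";

// Hypothetical schema for the extracted memories.
const memorySchema = z.object({
  memories: z.array(z.string()),
});

type ChatMessage = { role: "user" | "assistant"; content: string };

// Inner function: pure business logic with the model injected.
// No database access, so evals can call it directly and swap models freely.
export async function extractMemoriesFromMessages(opts: {
  model: LanguageModel;
  messages: ChatMessage[];
  existingMemories: string[];
}) {
  const { object } = await generateObject({
    model: opts.model,
    schema: memorySchema,
    system:
      "Extract long-term memories worth keeping from this conversation. " +
      `Existing memories: ${opts.existingMemories.join("; ") || "none"}.`,
    messages: opts.messages,
  });
  return object.memories;
}

// Outer function: the only layer that touches the database.
// Production code calls this; evals skip it entirely.
export async function extractAndStoreMemories(
  db: {
    getMemories: (userId: string) => Promise<string[]>;
    saveMemories: (userId: string, memories: string[]) => Promise<void>;
  },
  model: LanguageModel,
  userId: string,
  messages: ChatMessage[],
) {
  const existingMemories = await db.getMemories(userId);
  const memories = await extractMemoriesFromMessages({
    model,
    messages,
    existingMemories,
  });
  await db.saveMemories(userId, memories);
  return memories;
}
```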

    Building Evaluation Datasets

    Write test cases that cover real scenarios: extracting memories from empty databases, filtering temporary vs permanent information, and handling multi-turn conversations. You'll use fixture helpers to create message histories without manual boilerplate.
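    A sketch of what those fixture helpers and dataset entries might look like. The helper names, the dataset shape, and the expected values are hypothetical; the point is that each case pairs an input with an expected output.

```ts
// Hypothetical fixture helpers: build message histories without
// hand-writing { role, content } objects for every test case.
type ChatMessage = { role: "user" | "assistant"; content: string };

export const user = (content: string): ChatMessage => ({ role: "user", content });
export const assistant = (content: string): ChatMessage => ({
  role: "assistant",
  content,
});

// Dataset entries in the { input, expected } shape an eval's data() returns.
export const memoryExtractionDataset = [
  {
    input: {
      // Empty database: the extractor starts from nothing.
      existingMemories: [] as string[],
      messages: [
        user("I'm vegetarian, so please skip any meat recipes."),
        assistant("Got it, I'll only suggest vegetarian recipes."),
      ],
    },
    expected: "The user is vegetarian.",
  },
  {
    input: {
      existingMemories: ["The user is vegetarian."],
      // Temporary information: a good extractor should not store this
      // as a permanent memory.
      messages: [
        user("I'm in Lisbon this week for a conference."),
        assistant("Enjoy the trip! Want restaurant suggestions while you're there?"),
      ],
    },
    expected: "No new permanent memories.",
  },
];
```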

    LLM-as-Judge Scoring

    Move beyond simple assertions to semantic evaluation. You'll use answerSimilarity for fast embedding-based comparisons and answerCorrectness for factuality checks that combine embeddings with LLM judgment.
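    In an Evalite eval file, those scorers plug into the scorers array. A minimal sketch, assuming answerSimilarity and answerCorrectness are scorers defined elsewhere in your project, and that the extraction function and dataset from the earlier sketches exist; the provider and model name are placeholders.

```ts
// memory-extraction.eval.ts
import { evalite } from "evalite";
import { google } from "@ai-sdk/google"; // placeholder provider; any AI SDK model works

import { extractMemoriesFromMessages } from "./memory";
import { memoryExtractionDataset } from "./fixtures";
import { answerSimilarity, answerCorrectness } from "./scorers";

evalite("Memory extraction", {
  // Structured dataset: each case pairs an input with an expected output.
  data: async () => memoryExtractionDataset,
  // The task under test is the pure inner function, with the model injected,
  // so the eval run has no database side effects.
  task: async (input) => {
    const memories = await extractMemoriesFromMessages({
      model: google("gemini-2.0-flash"),
      messages: input.messages,
      existingMemories: input.existingMemories,
    });
    return memories.join("\n");
  },
  // Semantic scorers rather than exact-match assertions:
  // answerSimilarity compares embeddings, answerCorrectness adds an LLM judge.
  scorers: [answerSimilarity, answerCorrectness],
});
```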

    End-to-End Agent Evaluation

    Test your complete system with queries that require multiple retrieval steps. You'll customize Evalite's UI to display tool calls alongside outputs, making it easy to debug why an agent succeeded or failed.
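    A sketch of an end-to-end eval, assuming a hypothetical runAgent wrapper that returns the agent's final answer along with the tool calls it made. The query, expected answer, and scorer are illustrative, and how tool calls are surfaced next to outputs in Evalite's UI depends on how you configure its display.

```ts
// agent.eval.ts: end-to-end eval for the full agent.
import { evalite } from "evalite";

import { runAgent } from "./agent"; // hypothetical wrapper around the agent loop
import { answerCorrectness } from "./scorers";

evalite("Agent: multi-hop retrieval", {
  data: async () => [
    {
      // Answering this needs two retrieval steps: find which company the
      // user works for, then find where that company is headquartered.
      input: "Which city is the company I work for headquartered in?",
      expected: "The user's company, Acme Corp, is headquartered in Berlin.",
    },
  ],
  task: async (input) => {
    // runAgent is assumed to return the final answer plus the tool calls
    // the agent made along the way.
    const { answer, toolCalls } = await runAgent(input);

    // Log the tool calls for debugging; displaying them alongside the
    // output in Evalite's UI is a separate configuration step.
    for (const call of toolCalls) {
      console.log(`tool: ${call.toolName}`, call.args);
    }

    return answer;
  },
  scorers: [answerCorrectness],
});
```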

    Evals Project

    Matt Pocock