    You've built memory extraction and a complete agent, but how do you know they actually work? In production, you need systematic ways to measure quality, catch regressions, and compare different models.

    In this section, you'll build a real evaluation suite for your AI system using Evalite. You'll learn to make your code testable, write meaningful test cases, and use LLMs as judges to score outputs.

    What You'll Build

    A comprehensive evaluation suite that can:

    • Test memory extraction in isolation with structured datasets
    • Run end-to-end agent tests with multi-hop retrieval queries
    • Score outputs using semantic similarity and LLM judges
    • Compare model performance side-by-side with A/B testing

    What You'll Learn

    Making AI Code Testable

    Refactor your code to separate business logic from database operations. You'll extract an inner function that accepts a model parameter, making it easy to swap models and test without side effects.
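    Here is roughly what that refactor can look like. This is a minimal sketch: the function names, the Zod schema, and the use of the AI SDK's generateObject are assumptions for illustration, not the workshop's exact code.

```ts
import { generateObject, type LanguageModel } from "ai";
import { z } from "zod";

// Hypothetical schema for the extracted memories.
const memorySchema = z.object({
  memories: z.array(z.string()),
});

type ChatMessage = { role: "user" | "assistant"; content: string };

// Inner function: pure business logic with the model injected.
// No database access, so evals can call it directly and swap models freely.
export async function extractMemoriesFromMessages(opts: {
  model: LanguageModel;
  messages: ChatMessage[];
  existingMemories: string[];
}) {
  const { object } = await generateObject({
    model: opts.model,
    schema: memorySchema,
    system:
      "Extract long-term memories worth keeping from this conversation. " +
      `Existing memories: ${opts.existingMemories.join("; ") || "none"}.`,
    messages: opts.messages,
  });
  return object.memories;
}

// Outer function: the only layer that touches the database.
// Production code calls this; evals skip it entirely.
export async function extractAndStoreMemories(
  db: {
    getMemories: (userId: string) => Promise<string[]>;
    saveMemories: (userId: string, memories: string[]) => Promise<void>;
  },
  model: LanguageModel,
  userId: string,
  messages: ChatMessage[],
) {
  const existingMemories = await db.getMemories(userId);
  const memories = await extractMemoriesFromMessages({
    model,
    messages,
    existingMemories,
  });
  await db.saveMemories(userId, memories);
  return memories;
}
```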

    Building Evaluation Datasets

    Write test cases that cover real scenarios: extracting memories from empty databases, filtering temporary vs permanent information, and handling multi-turn conversations. You'll use fixture helpers to create message histories without manual boilerplate.
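    A sketch of what those fixture helpers and dataset entries might look like. The helper names, the dataset shape, and the expected values are hypothetical; the point is that each case pairs an input with an expected output.

```ts
// Hypothetical fixture helpers: build message histories without
// hand-writing { role, content } objects for every test case.
type ChatMessage = { role: "user" | "assistant"; content: string };

export const user = (content: string): ChatMessage => ({ role: "user", content });
export const assistant = (content: string): ChatMessage => ({
  role: "assistant",
  content,
});

// Dataset entries in the { input, expected } shape an eval's data() returns.
export const memoryExtractionDataset = [
  {
    input: {
      // Empty database: the extractor starts from nothing.
      existingMemories: [] as string[],
      messages: [
        user("I'm vegetarian, so please skip any meat recipes."),
        assistant("Got it, I'll only suggest vegetarian recipes."),
      ],
    },
    expected: "The user is vegetarian.",
  },
  {
    input: {
      existingMemories: ["The user is vegetarian."],
      // Temporary information: a good extractor should not store this
      // as a permanent memory.
      messages: [
        user("I'm in Lisbon this week for a conference."),
        assistant("Enjoy the trip! Want restaurant suggestions while you're there?"),
      ],
    },
    expected: "No new permanent memories.",
  },
];
```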

    LLM-as-Judge Scoring

    Move beyond simple assertions to semantic evaluation. You'll use answerSimilarity for fast embedding-based comparisons and answerCorrectness for factuality checks that combine embeddings with LLM judgment.
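    In an Evalite eval file, those scorers plug into the scorers array. A minimal sketch, assuming answerSimilarity and answerCorrectness are scorers defined elsewhere in your project, and that the extraction function and dataset from the earlier sketches exist; the provider and model name are placeholders.

```ts
// memory-extraction.eval.ts
import { evalite } from "evalite";
import { google } from "@ai-sdk/google"; // placeholder provider; any AI SDK model works

import { extractMemoriesFromMessages } from "./memory";
import { memoryExtractionDataset } from "./fixtures";
import { answerSimilarity, answerCorrectness } from "./scorers";

evalite("Memory extraction", {
  // Structured dataset: each case pairs an input with an expected output.
  data: async () => memoryExtractionDataset,
  // The task under test is the pure inner function, with the model injected,
  // so the eval run has no database side effects.
  task: async (input) => {
    const memories = await extractMemoriesFromMessages({
      model: google("gemini-2.0-flash"),
      messages: input.messages,
      existingMemories: input.existingMemories,
    });
    return memories.join("\n");
  },
  // Semantic scorers rather than exact-match assertions:
  // answerSimilarity compares embeddings, answerCorrectness adds an LLM judge.
  scorers: [answerSimilarity, answerCorrectness],
});
```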

    End-to-End Agent Evaluation

    Test your complete system with queries that require multiple retrieval steps. You'll customize Evalite's UI to display tool calls alongside outputs, making it easy to debug why an agent succeeded or failed.
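    A sketch of an end-to-end eval, assuming a hypothetical runAgent wrapper that returns the agent's final answer along with the tool calls it made. The query, expected answer, and scorer are illustrative, and how tool calls are surfaced next to outputs in Evalite's UI depends on how you configure its display.

```ts
// agent.eval.ts: end-to-end eval for the full agent.
import { evalite } from "evalite";

import { runAgent } from "./agent"; // hypothetical wrapper around the agent loop
import { answerCorrectness } from "./scorers";

evalite("Agent: multi-hop retrieval", {
  data: async () => [
    {
      // Answering this needs two retrieval steps: find which company the
      // user works for, then find where that company is headquartered.
      input: "Which city is the company I work for headquartered in?",
      expected: "The user's company, Acme Corp, is headquartered in Berlin.",
    },
  ],
  task: async (input) => {
    // runAgent is assumed to return the final answer plus the tool calls
    // the agent made along the way.
    const { answer, toolCalls } = await runAgent(input);

    // Log the tool calls for debugging; displaying them alongside the
    // output in Evalite's UI is a separate configuration step.
    for (const call of toolCalls) {
      console.log(`tool: ${call.toolName}`, call.args);
    }

    return answer;
  },
  scorers: [answerCorrectness],
});
```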

    Evals Project

    Matt Pocock