
    Day 4: Vibe-check your AI App Through Evals with Evalite

    Vibe-check your AI-driven application with evals using Evalite to build better AI apps.

    Matt Pocock

    You've built a DeepSearch agent, hooked it up to observability, and can see what it's doing.

    How do you know if it's actually getting better? How do you measure success and ensure your experiments are leading to real improvements, not just changes based on "vibes"?

    Day 4 is where we start vibe-checking our agent's performance and iterating towards an objectively better AI product.

    You'll move from subjective feelings to objective, data-driven evaluation. It's time to learn LLM evals – the AI engineer's equivalent of unit tests – designed to bring predictability to your probabilistic system.

    We'll first take a moment to discuss why evals matter, then install Evalite, an open-source tool I created, to set up and start running evals on our agent.

    Evalite is built on top of Vitest, making it a great option if you don't want to rely on a third-party cloud provider for your tests.

    You'll walk through initializing Evalite in your project, setting up the evals folder, and understanding its .eval.ts file structure.
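    To make that concrete, here's a minimal sketch of what one of those files might look like. The askDeepSearch import is a hypothetical wrapper around your agent, and the exact Evalite API surface may vary slightly by version:

    ```ts
    // evals/deepsearch.eval.ts
    import { evalite } from "evalite";
    import { Levenshtein } from "autoevals";
    // Hypothetical wrapper around your DeepSearch agent
    import { askDeepSearch } from "../src/deep-search";

    evalite("DeepSearch answers basic questions", {
      // The dataset: each entry has an input and (optionally) an expected output
      data: async () => [
        {
          input: "What year was TypeScript first released?",
          expected: "2012",
        },
      ],
      // The task: run your agent against each input and return its output
      task: async (input) => {
        return askDeepSearch(input);
      },
      // Scorers grade each output against the expected value
      scorers: [Levenshtein],
    });
    ```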

    Next, we'll choose success criteria that combine into a single score, taking into account factuality, relevance, source utilization, timeliness, and speed.

    You'll use those criteria to write your first scorer in Evalite and see if you can push your agent to a score of 100.
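    As a rough sketch, a hand-rolled scorer might look like the one below, using Evalite's createScorer helper with a hypothetical check for whether the answer cites its sources. An LLM-as-judge scorer for factuality would follow the same shape:

    ```ts
    // evals/scorers.ts
    import { createScorer } from "evalite";

    // Hypothetical scorer: rewards answers that link to at least one source URL.
    // Scorers return a number between 0 and 1, which Evalite reports as a percentage.
    export const citesSources = createScorer<string, string>({
      name: "Cites Sources",
      description: "Checks that the answer links to at least one source.",
      scorer: ({ output }) => {
        const linkCount = (output.match(/https?:\/\/\S+/g) ?? []).length;
        return linkCount > 0 ? 1 : 0;
      },
    });
    ```

    You'd then pass this scorer to the scorers array in your .eval.ts file alongside any off-the-shelf scorers.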

    By the end of Day 4, you'll have set up a foundational evaluation framework for your DeepSearch agent. You'll understand how to define what "good" looks like and have the tools to start measuring it, paving the way for more sophisticated evaluations and a truly data-driven approach to improving your AI application.