You've taken the first steps into the world of LLM evaluations, setting up Evalite and writing your initial deterministic scorer. Now, it's time to level up your testing game.
Day 5 dives deeper into creating more sophisticated evaluations and building the datasets that fuel them, all driven by the powerful concept of the "Data Flywheel."
In short, the Data Flywheel is the cycle in which user interactions with your application are fed into your evals, which in turn improve your product.
"Evals -> Better Product -> More Users -> More Data -> Better Evals"
The next step is to implement LLM-as-a-Judge, which uses a second model to assess the accuracy (and other success criteria) of your agent's answers.
However, your LLM judge won't be of much use without a solid dataset to work from. You'll need to build this dataset yourself, providing the "ground truth" answers against which the judge compares your agent's output.
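To make this concrete, here is a minimal sketch of what an LLM-as-a-Judge scorer could look like, assuming Evalite's `createScorer` helper together with the Vercel AI SDK's `generateObject` and a Zod schema. The judge model, prompt wording, and three-level verdict scale are illustrative choices, not a prescribed implementation.

```ts
import { createScorer } from "evalite";
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// A sketch of a factuality judge: it shows the question, the ground-truth
// answer, and the agent's answer to a judge model and maps the verdict to a score.
export const factuality = createScorer<string, string>({
  name: "Factuality",
  description: "Uses a judge model to compare the agent's answer to ground truth.",
  scorer: async ({ input, output, expected }) => {
    const { object } = await generateObject({
      // The judge model is an assumption; swap in whichever model you prefer.
      model: openai("gpt-4o-mini"),
      schema: z.object({
        verdict: z.enum(["correct", "partially-correct", "incorrect"]),
        reasoning: z.string(),
      }),
      prompt: [
        `Question: ${input}`,
        `Ground truth answer: ${expected ?? ""}`,
        `Agent answer: ${output}`,
        `Judge whether the agent's answer is factually consistent with the ground truth.`,
      ].join("\n"),
    });

    // Map the structured verdict onto a 0..1 score.
    const scores = { correct: 1, "partially-correct": 0.5, incorrect: 0 } as const;
    return scores[object.verdict];
  },
});
```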
Building a proper dataset is no joke; you'll spend the majority of your time doing this. Simple true/false questions are easy to come by, but the goal here is to make your agent fail. To really stress-test it, you'll need a set of "multi-hop" reasoning questions that force your agent to break a complex problem into steps before it can produce the right answer, as in the sketch below.
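Here is a hypothetical slice of such a dataset to illustrate the difference. The questions and expected answers are made up for this example; the shape simply mirrors the `input`/`expected` pairs that Evalite's `data` function returns.

```ts
// A hypothetical slice of a hand-built dataset (entries are illustrative).
// Single-hop questions are answerable with one lookup; multi-hop questions
// force the agent to chain several lookups before the judge can compare its
// answer to the ground truth.
export const devDataset = [
  {
    // Single-hop: one fact retrieval is enough.
    input: "Who wrote The Hobbit?",
    expected: "J.R.R. Tolkien",
  },
  {
    // Multi-hop: first identify the author, then find where he was born.
    input: "In which city was the author of The Hobbit born?",
    expected: "Bloemfontein, South Africa",
  },
  {
    // Multi-hop with arithmetic: find both publication years, then subtract.
    input:
      "How many years passed between the publication of The Hobbit and The Fellowship of the Ring?",
    expected: "17 years (1937 to 1954)",
  },
];
```

In your Evalite file you would then return this array from the `data` function and attach a judge scorer like the one sketched above to `scorers`.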
By the end of Day 5, you'll have significantly expanded your evaluation toolkit. You'll be able to implement LLM-as-a-Judge evals for complex criteria like factuality and have a solid methodology for building and iterating on evaluation datasets that push your DeepSearch agent to its limits, driving continuous improvement.