
    Your AI agent has twenty tools. When a user asks to "organize my schedule," which one does it pick? When they say "book a flight" without mentioning dates or destinations, does it guess and call the wrong API, or does it ask for clarification?

    The challenge with tool-calling agents is that you're trusting the LLM to make intelligent decisions about which tools to invoke and when. As your tool library grows, so does the potential for mistakes. Some models handle ten tools gracefully but fall apart at twenty. Some excel at explicit requests but struggle with ambiguity. The only way to know is to measure systematically.

    This workshop teaches you to build a rigorous evaluation harness for tool-calling agents using Evalite.

    You'll move from manual spot-checking to automated, quantifiable feedback that tells you exactly how well your agent is performing.

    You will:

• Set up an Evalite evaluation harness that extracts tool calls from streamText responses and inspects what the model decided to do (see the harness sketch after this list)
• Build deterministic scorers that automatically check whether your agent called the expected tool for each input (scorer sketch)
• Use evalite.each() to A/B test different language models (Gemini Flash, Gemini Flash Lite, GPT-4.1 mini) against the same test cases (model-comparison sketch)
• Create adversarial test cases that expose agent weaknesses: ambiguous requests, missing critical information, conversational inputs that need no action, and overlapping tool functionality (test-case sketch)
• Implement an askForClarification tool that triggers when requests are incomplete (clarification-tool sketch)
    • Iterate on tool descriptions and prompts until your agent reliably recognizes when clarification is needed across 20+ competing tools
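Here is roughly what the harness looks like. This is a minimal sketch, not the workshop's exact code: it assumes the Vercel AI SDK's v4-style tool() with parameters schemas, and a streamText result whose toolCalls promise resolves once the stream finishes. The tool names, system prompt, and model id are all illustrative. Evalite picks the file up when it is named something like tool-selection.eval.ts.

```ts
import { evalite } from "evalite";
import { streamText, tool, type LanguageModel } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// Illustrative tools. `execute` is omitted because these evals only
// care which tool the model chose, not what the tool would return.
export const tools = {
  bookFlight: tool({
    description: "Book a flight. Requires origin, destination, and a date.",
    parameters: z.object({
      origin: z.string(),
      destination: z.string(),
      date: z.string().describe("ISO 8601 date"),
    }),
  }),
  createCalendarEvent: tool({
    description: "Add an event to the user's calendar.",
    parameters: z.object({
      title: z.string(),
      start: z.string(),
      end: z.string(),
    }),
  }),
};

// Run the agent once and return the names of the tools it called.
export const runAgent = async (
  input: string,
  model: LanguageModel = google("gemini-2.0-flash"),
) => {
  const result = streamText({
    model,
    system: "You are a scheduling assistant. Use the provided tools.",
    prompt: input,
    tools,
  });

  // toolCalls resolves once the stream has finished.
  const toolCalls = await result.toolCalls;
  return toolCalls.map((call) => call.toolName);
};

evalite("Tool selection", {
  data: async () => [
    { input: "Book me a flight from London to Berlin on 2025-03-01", expected: "bookFlight" },
    { input: "Put lunch with Sam on my calendar for noon tomorrow", expected: "createCalendarEvent" },
  ],
  task: runAgent,
  scorers: [], // a deterministic scorer follows in the next sketch
});
```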
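Because the task returns the names of the tools the agent called, the scorer can be a pure function; no LLM judge is needed. A sketch assuming Evalite's createScorer helper, with generics ordered input, output, expected (check the signature against your Evalite version):

```ts
import { createScorer } from "evalite";

// 1 if the agent called exactly the expected tool (or, when no tool is
// expected, called nothing at all); 0 otherwise.
export const calledExpectedTool = createScorer<string, string[], string>({
  name: "Called expected tool",
  description: "Checks that the agent called exactly the expected tool.",
  scorer: ({ output, expected }) => {
    if (!expected) return output.length === 0 ? 1 : 0;
    return output.length === 1 && output[0] === expected ? 1 : 0;
  },
});
```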
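The workshop uses evalite.each() for the model comparison; since its exact signature is best taken from the Evalite docs, the same idea is sketched here as a plain loop that registers one eval per model, all sharing the same cases and scorer. The model ids and the ./agent, ./scorers, and ./cases module paths are illustrative stand-ins for the earlier sketches.

```ts
import { evalite } from "evalite";
import { google } from "@ai-sdk/google";
import { openai } from "@ai-sdk/openai";
// Hypothetical local modules holding the sketches from this page.
import { runAgent } from "./agent";
import { calledExpectedTool } from "./scorers";
import { adversarialCases } from "./cases"; // see the test-case sketch below

const candidates = [
  { name: "Gemini Flash", model: google("gemini-2.0-flash") },
  { name: "Gemini Flash Lite", model: google("gemini-2.0-flash-lite") },
  { name: "GPT-4.1 mini", model: openai("gpt-4.1-mini") },
];

// One eval per model, all against identical data and scorers, so the
// results are directly comparable.
for (const { name, model } of candidates) {
  evalite(`Tool selection: ${name}`, {
    data: adversarialCases,
    task: (input) => runAgent(input, model),
    scorers: [calledExpectedTool],
  });
}
```

Running evalite watch then shows the three evals side by side in the Evalite UI.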
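The adversarial cases are where weak agents get caught. A sketch of the four categories from the list above, with expected values that line up with the scorer sketch (createReminder is a hypothetical tool that overlaps with createCalendarEvent):

```ts
// Shared test cases: each input is paired with the single tool we
// expect the agent to call, or undefined when no tool should be called.
export const adversarialCases = async () => [
  // Explicit request: the easy baseline case.
  { input: "Book me a flight from London to Berlin on 2025-03-01", expected: "bookFlight" },
  // Ambiguous: "organize my schedule" could map to several tools.
  { input: "Can you organize my schedule?", expected: "askForClarification" },
  // Missing critical information: no date, origin, or destination.
  { input: "Book me a flight", expected: "askForClarification" },
  // Conversational: no action needed, so no tool should be called.
  { input: "Thanks, that was really helpful!", expected: undefined },
  // Overlapping functionality: a reminder, not a calendar event.
  { input: "Remind me about lunch with Sam tomorrow", expected: "createReminder" },
];
```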
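Finally, a sketch of the askForClarification tool. There is nothing clever in the implementation; the leverage is in the description, which is what you iterate on until the model reliably prefers asking over guessing:

```ts
import { tool } from "ai";
import { z } from "zod";

export const askForClarification = tool({
  // The description is the lever: it tells the model when asking
  // beats guessing among the other tools.
  description:
    "Ask the user a clarifying question. Use this whenever the request is " +
    "ambiguous or is missing information that another tool requires, such as " +
    "a flight booking with no date or destination. Never guess missing details.",
  parameters: z.object({
    question: z.string().describe("The clarifying question to ask the user"),
  }),
  // Echo the question back as the tool result so the agent can relay it.
  execute: async ({ question }) => ({ question }),
});
```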

By the end of this workshop, you'll have a repeatable process for evaluating tool-calling agents. You'll know which models perform best for your use case, which edge cases trip them up, and how to systematically improve tool selection through prompt engineering and scoring. No more guessing whether your agent is making the right calls: you'll have the data to prove it.

    Evals Skill Building

Matt Pocock