
    Evalite v1 Preview: Fast Evals, Built-in Scorers

    Matt Pocock

    If you've built evals before, you know the pain. Every time you need to check if your LLM's output is correct, you're writing another custom scorer. Is the SQL valid? Did it hallucinate? Is the JSON well-formed?

    Evalite v1 (still in beta) solves this with 10 production-ready scorers, plus a major architecture upgrade that makes getting started trivial.

    Check out the full docs at v1.evalite.dev.

    10 Built-In Scorers

    Evalite v1 ships with scorers for the most common eval scenarios. No more reinventing the wheel.

    String Scorers

    These are deterministic scorers for simple text validation:

    • exactMatch - checks if output exactly matches expected string
    • contains - checks if output contains a substring
    • levenshtein - fuzzy string matching using Levenshtein distance

    The levenshtein scorer is particularly useful for SQL generation or code output where minor formatting differences shouldn't fail the eval:

    scorers: [
      {
        scorer: ({ output }) =>
          levenshtein({
            actual: output,
            expected: "SELECT * FROM users WHERE id = 1",
          }),
      },
    ];
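
    exactMatch and contains slot in the same way. Here's a sketch, assuming they take the same actual/expected arguments as levenshtein (the exact signatures may differ):

    scorers: [
      {
        // Assumed shape: pass only when the output is exactly the expected string
        scorer: ({ output }) =>
          exactMatch({
            actual: output,
            expected: "SELECT * FROM users WHERE id = 1",
          }),
      },
    ];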

    RAG Scorers

    These use LLM-as-a-judge to evaluate RAG pipelines:

    • faithfulness - detects hallucinations by checking if output is grounded in context
    • answerSimilarity - compares semantic similarity between output and expected answer
    • answerCorrectness - evaluates factual correctness against ground truth
    • answerRelevancy - checks if output actually answers the question
    • contextRecall - measures if all relevant context was retrieved

    Example using faithfulness to catch hallucinations:

    scorers: [
      {
        scorer: ({ output, input }) =>
          faithfulness({
            question: input.question,
            answer: output,
            groundTruth: input.context, // Retrieved context
            model: yourModel,
          }),
      },
    ];

    Advanced Scorers

    For specialized use cases:

    • toolCallAccuracy - evaluates if agents called the right tools with correct arguments
    • noiseSensitivity - tests prompt robustness by adding noise and checking consistency

    Tool call accuracy is essential for agent evals:

    scorers: [
      {
        scorer: ({ output }) =>
          toolCallAccuracy({
            actualCalls: output.toolCalls,
            expectedCalls: [{ toolName: "search", input: { query: "..." } }],
          }),
      },
    ];

    Mix and Match Scorers

    The real power comes from combining scorers. A comprehensive RAG eval might use:

    scorers: [
      {
        scorer: (opts) => faithfulness({ ...opts, model: yourModel }),
      },
      {
        scorer: (opts) => answerRelevancy({ ...opts, model: yourModel }),
      },
      {
        scorer: (opts) => contextRecall({ ...opts, model: yourModel }),
      },
    ];

    Each scorer returns a 0-1 score. Evalite aggregates them to give you an overall eval score.
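
    And when none of the built-ins fit, you can drop a hand-rolled scorer into the same array. Here's a sketch, assuming an inline scorer can return a plain 0-1 number directly (which is what the aggregation implies):

    scorers: [
      {
        scorer: (opts) => faithfulness({ ...opts, model: yourModel }),
      },
      {
        // Custom scorer: 1 if the output is valid JSON, 0 otherwise
        scorer: ({ output }) => {
          try {
            JSON.parse(output);
            return 1;
          } catch {
            return 0;
          }
        },
      },
    ];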

    In-Memory by Default

    The biggest architectural change: Evalite v1 uses in-memory storage by default.

    Previously, you needed to set up SQLite, which added friction for new users. Now you can run npx evalite and start evaluating immediately.
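
    A minimal eval file looks something like this - a sketch, where the *.eval.ts file naming and the expected field passed to scorers are assumptions:

    // capitals.eval.ts - a minimal sketch
    import { evalite } from "evalite";

    evalite("Capital Cities", {
      // Each row pairs an input with the answer you expect
      data: [{ input: "What is the capital of France?", expected: "Paris" }],
      // Replace this stub with a real LLM call
      task: async (input) => {
        return "Paris";
      },
      scorers: [
        {
          // Inline scorer: 1 if the expected answer appears in the output
          scorer: ({ output, expected }) => (output.includes(expected) ? 1 : 0),
        },
      ],
    });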

    Want persistence? Switch to SQLite in your config:

    // evalite.config.ts
    import { defineConfig } from "evalite/config";

    export default defineConfig({
      storage: {
        type: "sqlite",
        path: "./evalite.db",
      },
    });

    But for most development workflows, in-memory is simpler and removes a setup step.

    Deep Vercel AI SDK Integration

    Evalite v1 is built around the Vercel AI SDK. Wrap any AI SDK model with wrapAISDKModel() to get automatic tracing and caching.

    Cache Everything

    Not just scorers - cache your entire eval pipeline. Wrap models used in your task function, in scorers, anywhere:

    import { wrapAISDKModel } from "evalite/ai-sdk";
    import { openai } from "@ai-sdk/openai";

    const model = wrapAISDKModel(openai("gpt-4"));

    evalite("RAG Eval", {
      data: [...],
      task: async (input) => {
        // Cached automatically
        const result = await generateText({
          model,
          prompt: input.question,
        });
        return result.text;
      },
      scorers: [
        {
          // Also cached automatically
          scorer: (opts) => faithfulness({ ...opts, model }),
        },
      ],
    });

    This transforms watch mode. Change scorer logic, tweak thresholds, refactor eval structure - the expensive LLM calls stay cached. Only run what changed.

    evalite watch

    The wrapper works across all AI SDK methods: generateText(), streamText(), generateObject(), and streamObject().
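
    For instance, a task that extracts structured data keeps the same wrapping. This sketch assumes a Zod schema and the standard generateObject call from the AI SDK:

    import { generateObject } from "ai";
    import { z } from "zod";
    import { openai } from "@ai-sdk/openai";
    import { wrapAISDKModel } from "evalite/ai-sdk";

    const model = wrapAISDKModel(openai("gpt-4"));

    // Illustrative task: structured extraction through the wrapped model
    const task = async (input: string) => {
      const { object } = await generateObject({
        model,
        schema: z.object({
          city: z.string(),
          population: z.number(),
        }),
        prompt: `Extract the city and population from: ${input}`,
      });
      return object;
    };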

    Zero overhead in production - wrapAISDKModel() is a no-op when called outside Evalite's context. Your production code runs exactly as before.
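
    In practice that means you can export one wrapped model and import it from both your app and your evals. A minimal sketch (the file name and layout are illustrative):

    // lib/model.ts
    import { wrapAISDKModel } from "evalite/ai-sdk";
    import { openai } from "@ai-sdk/openai";

    // Outside an Evalite run the wrapper is a pass-through, so this is
    // safe to import from production code and from eval files alike
    export const model = wrapAISDKModel(openai("gpt-4"));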

    DX Improvements

    The biggest DX improvement is auto .env support. Environment variables load automatically.
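
    So if your provider key lives in a .env file at the project root, your eval files can use it without any dotenv wiring. A sketch, assuming the standard @ai-sdk/openai behaviour of reading OPENAI_API_KEY from the environment:

    // .env (project root):
    //   OPENAI_API_KEY=sk-...

    import { openai } from "@ai-sdk/openai";

    // No dotenv import needed - the environment is loaded before the file runs,
    // and the provider reads process.env.OPENAI_API_KEY on its own
    const model = openai("gpt-4");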

    The Evalite UI got several upgrades:

    • Dark mode - Theme switcher for light/dark preferences
    • Table rendering - Objects and arrays render as markdown tables instead of JSON trees
    • Rerun button - Re-run evals in watch mode without restarting
    • AI SDK message UI - Pass AI SDK messages directly and get custom UI rendering

    Getting Started

    Evalite v1 is still in active development (beta). There's no formal migration guide yet as features are still evolving.

    To try it:

    pnpm install evalite@beta

    Full documentation at v1.evalite.dev.

    Feedback Welcome

    Since v1 is still beta, your feedback shapes the final release. Found a bug? Want a scorer we're missing?

    Join the discussion on Discord or open an issue on GitHub.
