Section 1 — The Model

    Prefix cache

    Matt Pocock

    The provider-side store that lets consecutive model provider requests skip re-processing a shared prefix. When the start of a request matches the start of a recent one — same system prompt, same history up to some point — the provider reuses its prior work and bills those tokens as cache tokens at a much lower rate.

    Anything that changes the prefix (reordering files, rewriting the system prompt mid-session, injecting a timestamp near the top) invalidates the cache from that point on, and the rest of the request bills at full input token rate.
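The billing mechanics can be sketched as follows. This is a hypothetical model, not any provider's actual API or pricing: the rates, token lists, and function names are all illustrative. The key behavior it shows is that caching stops at the first changed token.

```python
# Hypothetical sketch of prefix-cache billing. Rates are illustrative,
# not any real provider's pricing.

CACHED_RATE = 0.1  # rate for tokens served from the prefix cache
FULL_RATE = 1.0    # rate for tokens processed fresh

def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared token prefix between two requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def billed_units(prev_tokens: list[str], tokens: list[str]) -> float:
    """Bill the matched prefix at the cache rate, everything after at full rate."""
    cached = common_prefix_len(prev_tokens, tokens)
    return cached * CACHED_RATE + (len(tokens) - cached) * FULL_RATE

prev    = ["<system>", "You", "are", "helpful", "<user>", "Hi"]
same    = ["<system>", "You", "are", "helpful", "<user>", "Bye"]
stamped = ["<system>", "12:01pm", "You", "are", "helpful", "<user>", "Hi"]

# Only the final token changed: five tokens bill at the cache rate.
print(billed_units(prev, same))
# A timestamp near the top breaks the match at token two: almost
# everything bills at the full rate.
print(billed_units(prev, stamped))
```

Note that the timestamped request is barely longer than the unchanged one, yet costs several times more in this model, which is exactly the bill spike described in the usage example below.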

    Usage:

    "Why did the bill spike halfway through the session?"

    "Harness started injecting the current time into the system prompt every turn. Prefix cache breaks at the first changed token, so every request after that billed at full rate."
