AIHero

    The Model

    Cache tokens

    Input tokens the provider has cached from a previous request via its prefix cache, billed at a much lower rate.

    Matt Pocock
    Matt Pocock

    Input tokens the provider has cached from a previous model provider request so it doesn't have to re-process them. When consecutive requests share a prefix, the provider reuses the work via its prefix cache and bills the cached portion at a much lower rate. The lever that makes long sessions affordable — without it, every turn re-pays for the whole history.

    The reason this matters is how sessions are billed. The model is stateless, so every request resends the entire conversation — system prompt, every message, every tool result — as input tokens. By turn fifty, each request carries fifty turns of history, and you'd pay full rate on all of it, every time. The cache changes the maths: tokens the provider has already processed in an identical prefix are billed as cache tokens, often at a tenth of the input rate or less. On a long session, most of what you send is cache tokens, and the bill stays sane.

    An example shows when tokens are cached and when they're not. Each letter stands for a block of conversation content; each request sends the conversation so far:

    Request sendsCachedBilled at full rateWhy
    ABnothingABFirst request — nothing to match against
    ABCABCAB is an exact prefix of the previous request
    ABCDABCDPrefix still intact
    AXCDAXCDAn edit changed B to X; the match fails there

    The cache is fragile in a specific way: it matches exact prefixes. If anything changes earlier in the conversation — the harness reorders content, a timestamp updates, a file's representation shifts — the cache misses from that point onward and everything after it is billed at full input rate. Caches also expire after a few minutes of inactivity, so a session resumed after a long pause re-pays its history once. When a session's cost jumps without an obvious cause, compare cache tokens to input tokens in the usage report — a broken cache shows up there first.

    Usage:

    "Cost on long sessions is brutal — eight bucks for a refactor."

    "Check the cache tokens. If the harness is reordering the system prompt or files between turns, the prefix breaks and you re-pay full input rate every request."

    Want more than vocabulary?

    Join AI Hero for practical skills, thinking on AI engineering, and resources that keep you ahead of the curve.

    I respect your privacy. Unsubscribe at any time.

    Share