AIHero

    Next-token prediction

    Matt Pocock

    What the model actually does. Given a context, it samples one next token, appends it, and runs again. Every output — a sentence, a tool call, a thousand-line file — is built one token at a time. The model has no other mode of operation.

    Each step works the same way: the tokens in the context window are run through the parameters, which produce a probability for every token in the vocabulary — this one is very likely next, that one less so. One token is sampled from those probabilities, appended, and the loop runs again with the slightly longer context. That sampling step is why the same prompt produces different output on different runs: non-determinism is built into the mechanism, not a bug layered on top.
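The loop described above can be sketched in a few lines. This is a toy illustration, not how any real model is implemented: `VOCAB`, `toy_logits`, and the five-token vocabulary are made up for the example, and `toy_logits` stands in for the billions of parameters that would score each token in a real model. What the sketch does show faithfully is the shape of the loop: score every token, turn scores into probabilities, sample one, append, repeat.

```python
import math
import random

# Hypothetical five-token vocabulary; real vocabularies hold many thousands of tokens.
VOCAB = ["the", "cat", "sat", "down", "<end>"]


def toy_logits(context):
    """Stand-in for the model's parameters: one score per vocabulary token.

    A real model computes these scores from the whole context. This toy
    only favours the token that follows the last one in VOCAB order.
    """
    if not context or context[-1] not in VOCAB:
        favoured = 0
    else:
        favoured = min(VOCAB.index(context[-1]) + 1, len(VOCAB) - 1)
    return [2.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(context, rng, max_tokens=10):
    """Sample one token, append it, and run again with the longer context."""
    context = list(context)
    for _ in range(max_tokens):
        probs = softmax(toy_logits(context))
        # The sampling step: a weighted random draw, not a lookup of the top token.
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<end>":
            break
        context.append(token)
    return context


if __name__ == "__main__":
    # An unseeded generator can give a different continuation on each run.
    print(generate(["the"], random.Random()))
    print(generate(["the"], random.Random()))
```

Passing a seeded generator such as `random.Random(0)` makes the toy repeatable, which shows that the variation comes entirely from the sampling draw rather than from the scoring step.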

    Holding onto this mechanism explains behaviour that otherwise looks strange. The model never checks whether a token is true before emitting it — only whether it's likely — which is the root of hallucination. It commits to each token as it goes, so a confident-sounding opening sentence can steer the rest of the answer wrong. And because output tokens are produced strictly one at a time, generation speed puts a floor on how fast any agent can work.
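The speed floor is simple arithmetic, and a rough sketch makes it concrete. The function names and every number below are illustrative assumptions, not measurements of any real model. The calculation ignores prompt processing, network latency, and tool execution time, so real agents will be slower than this lower bound.

```python
def min_generation_seconds(output_tokens, tokens_per_second):
    """Lower bound on the time to emit output_tokens one at a time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def min_agent_seconds(output_tokens_per_step, tokens_per_second):
    """Lower bound for an agent run: every step's output is generated in sequence."""
    return sum(
        min_generation_seconds(n, tokens_per_second)
        for n in output_tokens_per_step
    )


if __name__ == "__main__":
    # Hypothetical run: three steps emitting 200, 50, and 750 output tokens
    # at an assumed 50 tokens per second.
    print(min_agent_seconds([200, 50, 750], tokens_per_second=50))
```

Under those assumed numbers the run cannot finish in under 20 seconds, however fast the surrounding harness is.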

    Usage:

    "How does the agent 'decide' to call a tool?"

    "It doesn't — it's next-token prediction all the way down. The tool call is just a structured string the harness parses out of the output stream."
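To make "a structured string the harness parses" concrete, here is a minimal sketch of the harness side. The `<tool_call>` tag format, the `read_file` tool, and all function names are assumptions invented for this example; real providers each define their own tool-call format, and many parse it for you before your code sees the output. The point is only that the model emits text and ordinary code does the rest.

```python
import json
import re

# Hypothetical wire format: the model was prompted to wrap tool calls in these tags.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(output_text):
    """Return the first tool call found in the model's output, or None."""
    match = TOOL_CALL_PATTERN.search(output_text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        # The model produced likely-looking tokens that are not valid JSON.
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


# Hypothetical tool registry; a real harness would do actual work here.
TOOLS = {"read_file": lambda path: f"(contents of {path})"}


def run_harness_step(output_text):
    """Parse the output and, if it contains a tool call, execute that tool."""
    call = extract_tool_call(output_text)
    if call is None:
        return None  # Plain text output: nothing to execute.
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"unknown tool: {call['name']}"
    return tool(**call.get("arguments", {}))


if __name__ == "__main__":
    output = (
        "Let me look at that file.\n"
        '<tool_call>{"name": "read_file", "arguments": {"path": "a.txt"}}</tool_call>'
    )
    print(run_harness_step(output))
```

Nothing in this code asks the model to decide anything. The "decision" to call a tool is just the model having sampled tokens that happen to match the pattern the harness looks for.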
