AIHero

    Inference

    Matt Pocock
    Running a trained model to generate output: what happens on every model provider request. Parameters stay fixed; the model just does next-token prediction over the context it's given. A single request is cheap relative to training, but inference is billed per token and is the dominant cost of using a model.

    A model's life splits into two phases:

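    | Phase     | When it happens                  | What it does                                                    | Parameters    |
    |-----------|----------------------------------|-----------------------------------------------------------------|---------------|
    | Training  | Once, before release             | Produces the parameters from a training corpus                  | Being written |
    | Inference | Every time anyone uses the model | Runs the frozen parameters over your context to generate tokens | Read-only     |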
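
    To make the two phases concrete, here is a minimal sketch using a toy bigram "model". It is not how a real LLM is trained or served; the `train` and `infer` functions, the corpus, and the greedy "most frequent next word" rule are all assumptions chosen for illustration. The point is only that training writes the parameters once, and inference reads them without changing them.

```python
from collections import Counter, defaultdict
from types import MappingProxyType


def train(corpus: str) -> MappingProxyType:
    """Training phase (toy): runs once and writes the parameters."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1  # parameters are being written here
    # Freeze into a read-only mapping: token -> most frequent next token.
    return MappingProxyType(
        {tok: nxt.most_common(1)[0][0] for tok, nxt in counts.items()}
    )


def infer(params, context: list[str], max_new_tokens: int = 3) -> list[str]:
    """Inference phase (toy): next-token prediction over a non-empty context.

    Only reads `params`; nothing here can write back to them.
    """
    output = list(context)
    for _ in range(max_new_tokens):
        next_token = params.get(output[-1])
        if next_token is None:  # no known continuation for this token
            break
        output.append(next_token)
    return output


# Hypothetical training corpus, used once.
params = train("the cat sat on the mat and the cat sat on the sofa")

# Every later use is inference over whatever context the caller supplies.
generated = infer(params, ["the"])
print(generated)
```

    Trying to assign into `params` after training raises a `TypeError`, which mirrors the "Read-only" column: at inference time there is no path for a request to modify the parameters.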

    Nothing you do at inference time writes back to the parameters, which is why a correction you make today doesn't stick tomorrow. The model that makes the same mistake next session, after you carefully explained the fix, hasn't ignored you; it's incapable of learning from the exchange. The model is stateless: continuity has to come from outside it, from the context window or a memory system.
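
    The statelessness described above can be sketched with a toy stand-in for a provider request. Everything here is hypothetical: `call_model`, the `PARAMS` mapping, and the `correction: ... -> ...` message format are invented for illustration and do not correspond to any real API. The sketch assumes the output depends only on the frozen parameters plus the context passed into that one call.

```python
from types import MappingProxyType

# Frozen parameters for a toy "model" that holds one wrong answer.
PARAMS = MappingProxyType({"capital of australia?": "Sydney"})


def call_model(context: list[str]) -> str:
    """Hypothetical stand-in for one provider request.

    The answer depends only on PARAMS and the context passed in;
    nothing is stored between calls.
    """
    question = context[-1]
    prefix = f"correction: {question} -> "
    # A correction only has an effect if it is present in this request's context.
    for line in reversed(context[:-1]):
        if line.startswith(prefix):
            return line[len(prefix):]
    return PARAMS.get(question, "unknown")


question = "capital of australia?"
correction = f"correction: {question} -> Canberra"

first_answer = call_model([question])              # wrong answer from the parameters
same_session = call_model([correction, question])  # correction is in the context window
next_session = call_model([question])              # fresh context: the correction is gone

# A memory system is just something outside the model that re-injects context.
memory = [correction]
with_memory = call_model(memory + [question])
```

    Here `next_session` repeats the original mistake because the correction never reached the parameters, while `with_memory` gets it right only because the stored correction was placed back into the context.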

    This mechanism also explains how you're billed. Every request runs the model over the full context, so cost scales with input tokens and output tokens, and an agent making dozens of tool calls pays for inference on each round trip. This is why context size is a cost question as well as a quality one.
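
    As a rough illustration of why round trips add up, the sketch below estimates the cost of an agent run in which every request resends the full, growing context. The function name, the token counts, and the per-million-token prices are all made-up assumptions, and the model ignores any provider-specific pricing details; check your provider's pricing page for real figures.

```python
def agent_run_cost(
    initial_context_tokens: int,
    round_trips: int,
    output_tokens_per_call: int,
    tool_result_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> tuple[int, int, float]:
    """Estimate total input tokens, output tokens, and cost for one agent run.

    Assumes each round trip is a separate request over the full context so far,
    and that the model's output plus the tool result are appended afterwards.
    """
    total_input = 0
    total_output = 0
    context = initial_context_tokens
    for _ in range(round_trips):
        total_input += context                 # full context is processed on every request
        total_output += output_tokens_per_call
        context += output_tokens_per_call + tool_result_tokens  # context grows each turn
    cost = (
        total_input * input_price_per_million
        + total_output * output_price_per_million
    ) / 1_000_000
    return total_input, total_output, cost


# Hypothetical numbers: 2,000-token starting context, 10 tool-calling round trips,
# 200 output tokens and an 800-token tool result per trip, illustrative prices.
total_in, total_out, cost = agent_run_cost(
    initial_context_tokens=2_000,
    round_trips=10,
    output_tokens_per_call=200,
    tool_result_tokens=800,
    input_price_per_million=3.0,
    output_price_per_million=15.0,
)
print(total_in, total_out, round(cost, 4))
```

    With these made-up numbers the ten requests process 65,000 input tokens in total, even though the context starts at only 2,000 tokens. Trimming what goes into the context reduces that figure on every subsequent round trip.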

    Usage:

    "Why does the bill scale with usage instead of being a flat license?"

    "You're paying for inference — every model provider request runs the model on the provider's hardware. Training already happened, but inference costs accrue per request, and a single turn can expand into many requests when tools are called."
