
    What Are Tokens?

    Matt Pocock

    Tokens are the fundamental building blocks that help Large Language Models (LLMs) process text. Understanding them is essential, especially since you're billed based on token usage.

    What Are Tokens?

    Tokens are simply numbers - each one stands for a chunk of text, and those numbers are what the LLM actually works with. The process of converting text into tokens is called encoding.

    The tokenization process works in two parts:

    1. The tokenizer splits your text into chunks it recognizes
    2. Each chunk is mapped to a number - its token ID

    [Diagram: Encoding]
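
    If you want to try this yourself, here's a minimal sketch using the js-tiktoken npm package (a JavaScript port of OpenAI's tiktoken) with the cl100k_base encoding:

        import { getEncoding } from "js-tiktoken";

        // Load the BPE encoding used by GPT-3.5 / GPT-4-era models.
        const enc = getEncoding("cl100k_base");

        // Encoding: text in, numbers out.
        const tokens = enc.encode("the cat sat on the mat");

        console.log(tokens); // an array of numbers - one per chunk the tokenizer recognised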

    Decoding is the reverse process:

    1. Numbers are converted back into text tokens
    2. The tokens are joined together to form the output

    [Diagram: Decoding]
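
    The same object handles decoding - here's the round trip, again sketched with js-tiktoken:

        import { getEncoding } from "js-tiktoken";

        const enc = getEncoding("cl100k_base");

        // Encode to numbers, then decode straight back to text.
        const tokens = enc.encode("the cat sat on the mat");
        const text = enc.decode(tokens);

        console.log(text); // "the cat sat on the mat"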

    The LLM Process Flow

    The complete LLM process looks like this:

    1. Tokenizer encodes your input text into tokens
    2. LLM processes your tokens
    3. LLM produces output tokens
    4. Output tokens are decoded back into readable text

    [Diagram: LLM Process Flow]

    To clarify, input tokens include:

    • Your conversation history with the LLM
    • System prompts
    • Tool definitions

    Output tokens are what the LLM sends back as a response.

    You're billed for both input and output tokens, usually at different rates - output tokens typically cost more per token than input tokens. One way to save money is to design your prompts so the model generates fewer output tokens.
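
    Here's a rough sketch of what that billing means in practice. The per-token prices below are made-up placeholders (check your provider's pricing page), and real chat APIs add a few formatting tokens per message, so treat this as an estimate:

        import { getEncoding } from "js-tiktoken";

        const enc = getEncoding("cl100k_base");

        // Hypothetical prices, purely for illustration.
        const INPUT_DOLLARS_PER_MILLION_TOKENS = 2.5;
        const OUTPUT_DOLLARS_PER_MILLION_TOKENS = 10; // output tokens usually cost more

        const input = "You are a helpful assistant. Summarise: the cat sat on the mat.";
        const output = "A cat sat on a mat.";

        const inputTokens = enc.encode(input).length;
        const outputTokens = enc.encode(output).length;

        const estimatedCost =
          (inputTokens / 1_000_000) * INPUT_DOLLARS_PER_MILLION_TOKENS +
          (outputTokens / 1_000_000) * OUTPUT_DOLLARS_PER_MILLION_TOKENS;

        console.log({ inputTokens, outputTokens, estimatedCost });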

    How Tokens Are Created

    The tokenization process starts with a large corpus of text - similar to what's used to train the LLM itself. Let's imagine a tiny corpus consisting of just one sentence: "the cat sat on the mat."

    [Diagram: Tokenization]

    First, all individual characters are extracted:

    t h e [space] c a t [space] s a t [space] o n [space] t h e [space] m a t

    Each of these characters becomes its own token in the vocabulary.

    Next, common groupings of characters are identified:

    • "TH" appears in "the" (twice)
    • "HE" appears in "the" (twice)
    • "AT" appears in "cat", "sat", and "mat"

    Each of these groupings also gets assigned its own token.

    Then, groups of groups are identified - like "th" + "e" combining into "the" (the whole word "the"), which gets its own token.
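
    This merge-the-most-common-pair idea is the heart of byte-pair encoding (BPE), which is what most LLM tokenizers use. Here's a toy sketch of a single merge step over our tiny corpus - not how production tokenizers are implemented, just the idea:

        // A toy byte-pair-encoding-style merge step.
        const corpus = "the cat sat on the mat";

        // Start from individual characters.
        let tokens: string[] = corpus.split("");

        // Count how often each adjacent pair of tokens appears.
        function countPairs(seq: string[]): Map<string, number> {
          const counts = new Map<string, number>();
          for (let i = 0; i < seq.length - 1; i++) {
            const pair = seq[i] + seq[i + 1];
            counts.set(pair, (counts.get(pair) ?? 0) + 1);
          }
          return counts;
        }

        // Merge every occurrence of the most frequent pair into a single token.
        function mergeMostFrequentPair(seq: string[]): string[] {
          const sorted = [...countPairs(seq).entries()].sort((a, b) => b[1] - a[1]);
          const bestPair = sorted[0][0];

          const merged: string[] = [];
          for (let i = 0; i < seq.length; i++) {
            if (i < seq.length - 1 && seq[i] + seq[i + 1] === bestPair) {
              merged.push(bestPair);
              i++; // skip the second half of the merged pair
            } else {
              merged.push(seq[i]);
            }
          }
          return merged;
        }

        tokens = mergeMostFrequentPair(tokens); // merges "a" + "t" into "at" for this corpus
        tokens = mergeMostFrequentPair(tokens); // then another common pair, and so on
        console.log(tokens);

    Real tokenizers work on bytes rather than characters and repeat that merge step until they reach a target vocabulary size, but the principle is the same.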

    Vocabulary Size Matters

    The goal is to create a large vocabulary of tokens because larger vocabularies can split words into fewer tokens, making processing more efficient.

    [Diagram: Vocabulary Size]

    For example, a vocabulary size of 1,000 tokens might split "understanding" into 5 tokens. A vocabulary size of 50,000 tokens might split it into 3 tokens, and a vocabulary size of 200,000 tokens might split it into 2 tokens.

    And since you're billed per token, fewer tokens per word also means cheaper requests.
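
    You can see this effect by running the same word through encodings with different vocabulary sizes. A sketch with js-tiktoken - this assumes your installed version ships the o200k_base encoding, and the counts you get won't necessarily match the illustrative numbers above:

        import { getEncoding } from "js-tiktoken";

        const word = "understanding";

        // r50k_base   - roughly 50k-token vocabulary (older GPT-3 models)
        // cl100k_base - roughly 100k-token vocabulary (GPT-3.5 / GPT-4)
        // o200k_base  - roughly 200k-token vocabulary (GPT-4o family)
        for (const name of ["r50k_base", "cl100k_base", "o200k_base"] as const) {
          const enc = getEncoding(name);
          console.log(name, enc.encode(word).length, "tokens");
        }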

    Handling Unusual Words

    The tokenizer struggles with uncommon words. For example, "O Frabjous Day" from Lewis Carroll's poem "Jabberwocky" gets split into many tokens, because "Frabjous" is a made-up word that doesn't appear frequently in the training corpus.

    [Diagram: Unusual Words]

    The tokenizer turns the phrase into 7 tokens - more than we'd expect for only 15 characters of text.
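
    To reproduce this yourself (a sketch with js-tiktoken; the exact count depends on which encoding you pick):

        import { getEncoding } from "js-tiktoken";

        const enc = getEncoding("cl100k_base");

        // Common words compress well; the made-up "Frabjous" has to be split into pieces.
        console.log(enc.encode("the cat sat on the mat").length);
        console.log(enc.encode("O Frabjous Day").length);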

    Final Thoughts

    I hope that helps demystify tokens a bit. I found the tiktokenizer playground really useful for understanding this stuff.

    Let me know if you have any questions - and what else you'd like me to cover next.

    Matt
