In this lesson, we'll explore how different data formats affect token efficiency when working with LLMs. Understanding token usage is crucial for optimizing your prompts and context windows.
To demonstrate, we'll represent the same data in three formats and compare how many tokens each one uses. Here's our starting point:
```ts
const DATA = [
  {
    url: 'https://aihero.dev',
    title: 'AI Hero',
  },
  {
    url: 'https://totaltypescript.com',
    title: 'Total TypeScript',
  },
  {
    url: 'https://mattpocock.com',
    title: 'Matt Pocock',
  },
  {
    url: 'https://twitter.com/mattpocockuk',
    title: 'Twitter',
  },
];
```
We have an array of links, each with a URL and a title. We might pass these to an LLM as sources for citations, for example.
We're creating three different representations of the same data:
```ts
const asXML = DATA.map(
  (item) =>
    `<item url="${item.url}" title="${item.title}"></item>`,
).join('\n');

const asJSON = JSON.stringify(DATA, null, 2);

const asMarkdown = DATA.map(
  (item) => `- [${item.title}](${item.url})`,
).join('\n');
```
When we run this code, we log the token count for each format:
```ts
// `tokenize` comes from the lesson's setup code (not shown here)
// and returns the tokens for a given string.
console.log('Markdown tokens:', tokenize(asMarkdown).length);
console.log(asMarkdown);
console.log('--------------------------------');
console.log('XML tokens:', tokenize(asXML).length);
console.log(asXML);
console.log('--------------------------------');
console.log('JSON tokens:', tokenize(asJSON).length);
console.log(asJSON);
```
The results show some interesting differences:
| Format | Token count |
|---|---|
| Markdown | 53 |
| XML | 77 |
| JSON | 103 |
It's important not to draw overly general conclusions from this specific example. It's not always true that Markdown uses fewer tokens than XML, or that XML uses fewer tokens than JSON.
However, thinking about these representations in terms of token count is extremely valuable for optimization.
A significant aspect of context engineering (which we'll cover later) involves getting retrieved data into your LLM efficiently.
Generally speaking, the fewer tokens you spend on getting that data into your context window, the better you're doing.
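To make this kind of comparison repeatable, a small helper can rank any set of representations by token count. The following is a minimal sketch, not the lesson's code: `rankByTokens` and `roughTokenize` are hypothetical names, and `roughTokenize` is a crude stand-in that splits on words and punctuation rather than a real model tokenizer, so its counts will differ from the table above. In practice you would pass in the lesson's `tokenize` function instead.

```typescript
type Tokenize = (text: string) => unknown[];

// Crude stand-in tokenizer (an assumption for illustration only):
// splits text into runs of word characters and single punctuation marks.
const roughTokenize: Tokenize = (text) =>
  text.match(/\w+|[^\s\w]/g) ?? [];

// Returns the formats sorted from fewest to most tokens.
function rankByTokens(
  formats: Record<string, string>,
  tokenize: Tokenize,
): { name: string; tokens: number }[] {
  return Object.entries(formats)
    .map(([name, text]) => ({ name, tokens: tokenize(text).length }))
    .sort((a, b) => a.tokens - b.tokens);
}

const link = { url: 'https://aihero.dev', title: 'AI Hero' };

const ranking = rankByTokens(
  {
    markdown: `- [${link.title}](${link.url})`,
    xml: `<item url="${link.url}" title="${link.title}"></item>`,
    json: JSON.stringify([link], null, 2),
  },
  roughTokenize,
);

console.log(ranking);
```

Keeping the tokenizer as a parameter means the same helper works with whichever tokenizer matches the model you are targeting.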
I recommend experimenting with these representations, for example by changing the JSON formatting (removing the `null` and `2` arguments to have it on a single line).

The goal is to understand how data representation affects token counts and how different formats can be more or less token-efficient.
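As a starting point for the JSON experiment, the sketch below compares the indented and single-line output of `JSON.stringify` for one entry from `DATA`. It measures characters rather than tokens, because the tokenizer setup is not shown here. Character count is only a rough proxy, so run both strings through `tokenize` to see the real difference. The `sample`, `pretty`, and `compact` names are illustrative.

```typescript
const sample = [{ url: 'https://aihero.dev', title: 'AI Hero' }];

// Indented with two spaces, as in the lesson code.
const pretty = JSON.stringify(sample, null, 2);

// Without the `null, 2` arguments, everything lands on a single line.
const compact = JSON.stringify(sample);

console.log('Pretty characters:', pretty.length);
console.log(pretty);
console.log('--------------------------------');
console.log('Compact characters:', compact.length);
console.log(compact);
```

Both strings parse back to the same data; only the whitespace differs.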
Good luck, and I'll see you in the next one.
- Run the existing code to observe the token counts for each format
  - Use `pnpm run dev` to execute the code
- Modify the markdown representation to include titles
- Make the XML representation more verbose
- Experiment with the JSON formatting
  - Remove the formatting arguments (`null, 2`) to have JSON on a single line
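For the XML step, one way to make the representation more verbose is to move each attribute into its own child element. This is only one possible variation: the `asVerboseXML` name, the `LINKS` subset, and the `<items>` wrapper are assumptions for illustration, not part of the lesson code.

```typescript
const LINKS = [
  { url: 'https://aihero.dev', title: 'AI Hero' },
  { url: 'https://totaltypescript.com', title: 'Total TypeScript' },
];

// Attribute-based form, as in the lesson.
const asXML = LINKS.map(
  (item) => `<item url="${item.url}" title="${item.title}"></item>`,
).join('\n');

// More verbose form: each field becomes a child element.
// Note: values are not XML-escaped here, which is fine for this sample data.
const asVerboseXML = [
  '<items>',
  ...LINKS.map((item) =>
    [
      '  <item>',
      `    <url>${item.url}</url>`,
      `    <title>${item.title}</title>`,
      '  </item>',
    ].join('\n'),
  ),
  '</items>',
].join('\n');

console.log(asXML);
console.log('--------------------------------');
console.log(asVerboseXML);
```

Passing both strings to `tokenize` shows how much the extra tags and indentation cost.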