Tokenisation

Tokenisation is how LLMs break text into tokens, the small units they process. Token limits affect how much content fits in a context window, what processing costs, and whether long documents can be handled at all.

Technical Foundations

Tokenisation is the process by which large language models break text into tokens — small units that the model can process individually. A token is typically 4 characters; a word is usually 1–2 tokens. LLMs have fixed context windows, so tokenisation directly affects how much content can be processed, how much it costs, and whether extraction is possible on long documents.

What is a Token?

A token isn't a word. "Hello" might be 1 token; "tokenisation" might be 2 tokens (token + isation); numbers and punctuation often get their own tokens. The exact tokenisation depends on the model: OpenAI's models use a different tokeniser than Anthropic's or Google's. For content creators, the key point is that tokens measure how "expensive" your content is to an AI system, in both processing cost and context-window space. A 100-word paragraph might consume 150 tokens, influencing both cost and how much additional material fits in a request. Complex or specialised terms often split into several tokens, whereas common words usually stay as a single token, which makes heavy jargon less efficient for AI processing.
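
As a rough illustration, the open-source tiktoken library (which implements OpenAI's tokenisers) shows how ordinary strings split into tokens. The counts depend on the encoding chosen and would differ under Anthropic's or Google's tokenisers; the sample strings are arbitrary.

```python
# Rough token counting with OpenAI's open-source tiktoken library.
# Counts are specific to the chosen encoding; other vendors' tokenisers differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

for text in ["Hello", "tokenisation", "A 100-word paragraph costs more than you think."]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```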

Why Tokenisation Matters for Content

AI systems have context windows: GPT-4 Turbo accepts roughly 128K tokens, Claude Opus roughly 200K. Within that limit, an AI system must fit the system prompt, the user query, the content it's processing, and room for its response. If your 50,000-word guide tokenises to roughly 75,000 tokens against a 128K window, it consumes well over half the budget, leaving less room for everything else and raising the risk that your content is only partially processed. For Lens extraction testing, content that tokenises efficiently can be fully processed; content that's verbose and tokenises inefficiently might be truncated. Efficient tokenisation ensures the AI sees the full picture rather than a cut-off excerpt.
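
A minimal sketch of that budgeting, assuming illustrative values for the window size, prompt overhead, and response reserve (none of these numbers come from any specific model's documentation):

```python
# Hypothetical budget check: does a document plus prompt overhead fit in a
# model's context window? The window size and overheads here are assumptions.
import tiktoken

CONTEXT_WINDOW = 128_000   # e.g. a ~128K-token model
PROMPT_OVERHEAD = 2_000    # assumed system prompt + user query
RESPONSE_RESERVE = 4_000   # tokens reserved for the model's answer

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(document: str) -> bool:
    """Return True if the document leaves room for the prompt and response."""
    doc_tokens = len(enc.encode(document))
    return doc_tokens + PROMPT_OVERHEAD + RESPONSE_RESERVE <= CONTEXT_WINDOW

print(fits_in_window("word " * 50_000))  # a 50,000-word placeholder document
```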

How Writing Structure Affects Tokenisation Efficiency

Clear, concise writing tokenises more efficiently than verbose writing. A 500-word page with unnecessary jargon might tokenise to 200 tokens per 100 words; the same information written clearly might tokenise to 120 tokens per 100 words. This matters because: (a) AI systems can process more of your content within a context window, (b) cost-per-request is lower, (c) extraction quality improves because the AI can focus on more content. Poor structure also affects tokenisation — unclear paragraph breaks, run-on sentences, and scattered organisation tokenise less efficiently.
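
One way to sanity-check this on your own copy is to measure tokens per 100 words. The passages below are invented examples, and the exact ratios will vary by tokeniser.

```python
# Compare tokenisation efficiency (tokens per 100 words) for two passages.
# Both sample sentences are made up purely for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tokens_per_100_words(text: str) -> float:
    """Token count normalised to a 100-word baseline."""
    words = len(text.split())
    return len(enc.encode(text)) / words * 100

verbose = ("It is of paramount importance to operationalise cross-functional "
           "synergies in order to facilitate the maximisation of outcomes.")
concise = "Work across teams to get better results."

print(f"verbose: {tokens_per_100_words(verbose):.0f} tokens per 100 words")
print(f"concise: {tokens_per_100_words(concise):.0f} tokens per 100 words")
```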

Tokenisation and Context Windows

Every AI model has a context window — the maximum number of tokens it can accept in a single request. If your content exceeds the context window, the AI either truncates it (losing information) or fails. This is different from traditional search, where longer content isn't penalised. For AI extraction and ranking, content that fits comfortably in a context window extracts better. Very long content (50K+ words) might need to be chunked manually before submission to an AI system. Understanding these limits helps prevent data loss during analysis.
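
A minimal chunking sketch, assuming a hypothetical per-chunk budget of 8,000 tokens and splitting purely on token count; a production approach would more likely split on headings or paragraph boundaries.

```python
# Split a long document into pieces that each fit under a token budget.
# The chunk size is an assumed example value, not a recommendation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, max_tokens: int = 8_000) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens tokens each."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

chunks = chunk_by_tokens("word " * 50_000)  # a 50,000-word placeholder document
print(len(chunks), "chunks")
```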

Tokenisation Across Different Models

Different AI models use different tokenisers, so OpenAI's tokeniser counts the same text differently than Anthropic's or Google's. The same content might be "expensive" (high token count) for one AI system and "cheap" for another. You can't realistically optimise for one specific tokeniser, but you can write in a way that tokenises well under all of them: clear, concise prose with good structure. That consistency also reduces variance in how your content is interpreted during extraction tasks.
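
You can see this variation even between two of OpenAI's own encodings via tiktoken; Anthropic's and Google's tokenisers are separate tools and are not covered by this sketch.

```python
# The same text counts differently under different tokenisers. This compares
# two OpenAI encodings available in tiktoken; other vendors differ again.
import tiktoken

text = "Efficient tokenisation keeps extraction costs predictable."

for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```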

Related Terms

Lens measures extraction quality on your actual content. Clear writing tokenises efficiently, fits more easily in AI context windows, and extracts more reliably.