Large Language Model (LLM)

LLMs are AI systems trained on vast text data to predict and generate language. Understand how they power AI search and what their constraints mean for content extraction.

A large language model (LLM) is an AI system trained on vast quantities of text data to predict and generate human language. LLMs form the core of AI search tools, content readers, and AI agents. Their understanding of language and ability to reason about information directly determine whether AI search tools can effectively extract content from your site.

What is a Large Language Model?

LLMs learn patterns in language by being trained on enormous text datasets, then optimised to predict the next word in a sequence (next-token prediction). This simple task, repeated over billions of examples, builds emergent capabilities: reasoning, summarisation, code understanding, and question-answering. Key concepts include tokenisation (how text is broken into units), embeddings (numerical representations of meaning), and parameters (which determine model size and complexity). The training data defines the model's knowledge base. Foundation models are the broader category encompassing LLMs, but the practical distinction for SEOs is that LLMs specialise in language tasks rather than other modalities such as image or audio processing.
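
To make next-token prediction concrete, here is a minimal, illustrative sketch in Python. The toy bigram counter and naive whitespace tokenisation are both stand-ins: real LLMs use subword tokenisers and neural networks with billions of parameters, but the shape of the task is the same.

```python
from collections import Counter, defaultdict

corpus = "the model predicts the next word and the next word follows the pattern"

# Tokenisation: naive whitespace splitting here; production models use
# subword schemes such as byte-pair encoding.
tokens = corpus.split()

# A toy bigram "model": count which token follows which in the training text.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequently observed next token."""
    candidates = following.get(token)
    if not candidates:
        return "<unknown>"  # never seen in training: no prediction possible
    return candidates.most_common(1)[0][0]

print(predict_next("the"))    # -> next
print(predict_next("model"))  # -> predicts
```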

Why LLMs Matter for AI Search

When an AI search tool reads your page, it uses an LLM to understand the content and decide what's relevant. The LLM's training data, size (parameter count), reasoning ability, and context window all affect whether it can parse complex or technical content accurately, understand domain-specific language, maintain context across multiple pages, and distinguish useful information from filler. This is where theory becomes practical: different LLMs extract differently from the same page. In Wayfinder's AI navigation research, LLMs demonstrated semantic understanding that machine learning alone cannot provide, handling ambiguous labels, inferring meaning from context, and reasoning about multi-step paths. Pure machine-learning approaches struggle because they're context-blind and cannot reason about navigation goals expressed in natural language.

Key Limitations That Affect Content Extraction

Several constraints directly affect whether an AI search tool will successfully extract or cite your content. Context windows cap the amount of text the model can process at once, so long documents may get truncated. Training data cutoffs give the model's knowledge a "freshness" limit: content published after the cutoff won't be in its pre-trained knowledge. Hallucination risk means LLMs can generate plausible-sounding but false information when uncertain. Reasoning boundaries exist too: some tasks are harder than others, and complex multi-step extraction can fail. These limitations matter when diagnosing outcomes: a failed navigation may stem from a content gap, an extraction error, or a reasoning failure, and each requires a different fix.
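
As a concrete illustration of the context-window constraint, here is a minimal sketch that splits a long document into chunks that fit an assumed token budget. The budget figure and the one-word-per-token approximation are both assumptions for the example; real subword tokenisers usually produce more tokens per word, so leave headroom.

```python
CONTEXT_BUDGET = 512  # illustrative budget; real context windows vary widely

def chunk_text(text: str, budget: int = CONTEXT_BUDGET) -> list[str]:
    """Split text into pieces of at most `budget` approximate tokens.

    Approximation: one whitespace-separated word counts as one token.
    """
    words = text.split()
    return [
        " ".join(words[start:start + budget])
        for start in range(0, len(words), budget)
    ]

document = "word " * 1200  # stand-in for a long page
chunks = chunk_text(document)
print(len(chunks), "chunks")            # -> 3 chunks
print(len(chunks[0].split()), "words")  # -> 512 words in the first chunk
```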

Related Terms

  • Hallucination (AI) — When LLMs generate false but confident information, a core limitation for search accuracy
  • Grounding (AI) — Methods to anchor LLM output in verified source content, reducing hallucination risk
  • Retrieval-Augmented Generation — Architecture that combines LLMs with external data retrieval for more accurate answers
  • AI Agent — Systems where LLMs serve as the reasoning engine for autonomous tasks
  • Semantic Search — Understanding meaning beyond keywords, powered by LLM-based embeddings
  • Vector Embedding — Numerical representations of text meaning used by LLMs to understand similarity (see the sketch after this list)
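
To show how vector embeddings support semantic search, here is a toy sketch of cosine similarity, the standard measure of how close two embeddings are. The three-dimensional vectors are invented for illustration; real embedding models output vectors with hundreds or thousands of dimensions.

```python
import math

# Invented three-dimensional "embeddings" for illustration only.
embeddings = {
    "large language model": [0.9, 0.1, 0.3],
    "LLM":                  [0.8, 0.2, 0.3],
    "garden furniture":     [0.1, 0.9, 0.2],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = embeddings["large language model"]
for term, vector in embeddings.items():
    print(f"{term}: {cosine_similarity(query, vector):.2f}")
# "LLM" scores near 1.0 while "garden furniture" scores far lower, which is
# how semantic search matches meaning rather than exact keywords.
```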

Compass reveals how specific AI search tools, each powered by different LLMs, actually extract content from your site. See which models understand your content best.