We benchmarked nine chunking strategies across 36,450 LLM evaluations. The content that wins AI retrieval looks nothing like SEO-optimised copy. Zinsser described it in 1976.
William Zinsser published On Writing Well in 1976. It predates the internet. It predates SEO. It predates every content strategy framework your agency has ever pitched you. And when we ran 36,450 evaluations across nine different ways of splitting a web page for AI retrieval, it turned out he was right.
"All your clear and pleasing sentences will fall apart if you don't keep remembering that writing is linear and sequential, that logic is the glue that holds it together, that tension must be maintained from one sentence to the next and from one paragraph to the next and from one section to the next."
— William Zinsser, On Writing Well (1976)
This is not a post about writing style. It's about what happens when an AI tries to find the relevant passage on your page — and why the content that surfaces best looks a lot like what Zinsser was describing fifty years ago. And not at all like what your SEO strategy has been optimising for.
We benchmarked nine chunking strategies against 135 live web pages across five content types and three query-specificity levels. 36,450 LLM evaluations. The full study is here.
The headline: no single chunking strategy wins on every metric. But two lead, and they lead on different things. A sliding-window strategy wins every measure of deep retrieval: finding relevant content across the top-10 results, holding quality as queries get harder, performing consistently across every content type we tested. A semantic-heading strategy — chunks bounded by document headings — wins top-1 precision: finding the single most relevant passage at position one.
Which one matters more depends on what the AI is doing with the result. That's the fork that everything else follows from.
AI systems don't read your page. They read a fragment of your page.
Before an AI can retrieve anything from your content, it has to split it into retrievable units: chunks. A chunk might be a fixed block of 512 tokens, a sliding window that overlaps with the chunk before it, or a section bounded by your document headings. The choice of chunking strategy is made by the engineers building the retrieval system, not by you. LangChain, LlamaIndex, and every other retrieval framework ship with multiple chunking implementations and ask developers to pick one.1 Most pick a reasonable-sounding default and move on.
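To make the mechanics concrete, here is a minimal sketch of two of those chunkers. This is illustrative only: token counts are approximated by whitespace-separated words, where a real pipeline would use the model's tokenizer, and the heading pattern assumes markdown-style `#` headings.

```python
import re

def fixed_chunks(text, size=512):
    """Split text into fixed-size blocks of `size` tokens.

    Tokens are approximated here by whitespace words; a production
    chunker would count tokenizer tokens instead.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def heading_chunks(markdown):
    """Split a markdown document into sections bounded by headings.

    Uses a zero-width lookahead so each heading stays attached to
    the section it introduces.
    """
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nOverview text.\n## Details\nDeep dive text."
sections = heading_chunks(doc)  # one chunk per heading-bounded section
```

The two functions produce very different retrievable units from the same page, which is the whole point: the author writes one document, but the retrieval system decides what the "unit of content" actually is.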
Our study compared nine of those strategies on the same pages with the same queries. On identical content, the best strategy scored nearly double the worst on retrieval quality. None of that variation shows up anywhere in your analytics.
Most of the content advice you'll encounter in the AEO/GEO space glosses over this entirely. A prominent voice in the industry describes what it presents as the correct approach: "Content chunking means structuring your content as a series of stand-alone sections or blocks, each devoted to answering one question or presenting a single idea. Instead of one long narrative, you have multiple mini-articles within the article." We won't name them here.2 The advice isn't wrong, exactly — clear structure helps across most configurations. But it assumes all AI systems are using the same chunking approach, which our study shows they aren't. Writing for one chunking model isn't sufficient when you can't know which one will be applied to your page. The content that performs best is the content that's robust across all of them.
The useful split for anyone writing web content is between two retrieval modes.
The first is citation: the AI pulls the single most relevant passage to support a specific claim. Perplexity's citations, AI Overview sources, any output where a model surfaces one specific chunk — these are top-1 retrieval problems. Get rank 1 right and you win.
The second is synthesis: the AI builds a broader answer from several passages, combining context from multiple retrieved chunks into a coherent response. This is how most AI answer generation actually works: retrieve top-3, top-5, or top-10 chunks and use them collectively as context for the reader model.3
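The synthesis pattern can be sketched in a few lines. Nothing below is any particular framework's API: the scoring function is a stand-in for whatever embedding similarity (plus optional reranking) the real system uses.

```python
def retrieve_top_k(query, chunks, score, k=5):
    """Rank chunks by a relevance score and keep the top k."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]

def build_context(query, chunks, score, k=5):
    """Concatenate the top-k chunks into one context block.

    A reader model then answers from this accumulated context,
    not from any single chunk in isolation.
    """
    top = retrieve_top_k(query, chunks, score, k)
    return "\n\n---\n\n".join(top)
```

The fork between the two modes falls out of this sketch directly: citation mode only cares about `ranked[0]`, while synthesis mode cares about the quality of everything that lands in `build_context`.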
These two modes reward different structures. And the divergence between them is steeper than you'd expect.
Sliding-window retrieval uses a 512-token window that advances in 128-token steps. Adjacent windows overlap by roughly 75%.4 When a relevant passage appears at rank 1, the rank-2 chunk is almost certainly the adjacent window — shifted 128 tokens along, carrying most of the same relevant content.
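With the study's parameters (512-token window, 128-token step, both taken from the paper), window generation looks roughly like this; the integers stand in for real tokenizer output.

```python
def sliding_windows(tokens, window=512, step=128):
    """Yield overlapping windows over a token sequence.

    Each window advances `step` tokens and retains
    window - step tokens from its predecessor.
    """
    for start in range(0, max(len(tokens) - window, 0) + step, step):
        yield tokens[start:start + window]

tokens = list(range(1024))
wins = list(sliding_windows(tokens))
# adjacent windows share 512 - 128 = 384 tokens: 75% overlap
overlap = len(set(wins[0]) & set(wins[1])) / 512
```

That 75% overlap is why the rank-2 chunk so often mirrors the rank-1 chunk: they are mostly the same text, shifted 128 tokens.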
Fraction of retrieved chunks scoring 2 ('directly answers the query') by rank, for sliding_window_512_step_128 and semantic_heading. Both strategies tie at rank 1. Semantic_heading falls sharply from rank 2; sliding_window decays gradually through rank 10. Data: Wayfinder Research, April 2026.
At rank 1, the two strategies are virtually tied: 52% of sliding-window's top-1 chunks directly answered the query; 48% for semantic-heading. From rank 2 onwards, they diverge sharply. Sliding-window at rank 2: 39%. Semantic-heading: 18%. By rank 5, sliding-window is at 18%; semantic-heading is at 4%.
The mechanism is straightforward. Semantic-heading chunks are independent. If the answer is in section three, section four is about something else. It doesn't carry the argument forward — it starts a new one. The adjacent chunk scores near zero because it covers different ground.
Sliding-window rewards content where adjacent passages are thematically related. Where the argument in one paragraph flows into the argument in the next. Where the logic Zinsser was describing — tension maintained from one section to the next — is present in the writing.
There's a slightly glib way of explaining why. AI models were trained, in large part, on books, articles, and long-form documents where writers assumed their readers would encounter one paragraph after another and built their sentences so they connected.5 It's not exactly what's happening under the hood, but it's not wrong either: the retrieval pattern that works best looks a lot like the reading pattern the models were built on. Zinsser figured out in 1976 what kind of writing those models would end up rewarding. He just didn't know it yet.
The SEO structural playbook treated each heading-bounded section as a standalone unit, and it did so for understandable reasons. Keyword in the H2. Answer in the first sentence under it. Move on. This made sense for a crawler that indexed discrete documents and a featured snippet system that pulled one contained passage.
It was optimising for Google's model of reading — which was never really about reading at all. It was about indexing. That approach produces exactly what we found: strong precision at rank 1, because you've written a coherent, bounded section on a defined topic. And a cliff at rank 2, because the next section is about something entirely different.
We're not saying structure is bad. Clear headings still matter and they help both retrieval modes. But the thing SEO content sacrifices in pursuit of keyword-unit coherence is the narrative thread between sections — Zinsser's logic that holds it together. And that thread turns out to be precisely what deep retrieval rewards. The content strategy that wins across chunking configurations is the one that makes each section earn the next. Not the one that makes each section stand alone.
It's worth noting that some of the most heavily cited content formats on the web already work this way — and probably not by accident. Wikipedia's structure reveals depth progressively: each section builds on the one before it, and by the time you reach the bottom of a well-written article you've been taken on a logical journey from overview to detail. Reddit threads work similarly: every reply inherits context from the post above, and each contributes something new to a single bounded conversation. We didn't specifically test for this in our benchmark — though Reddit threads did appear in our forum corpus — and we're speculating about the mechanism here. But both formats are consistently among the most-cited sources in AI answers, and it's at least worth asking whether their structural coherence has something to do with it.
The frustrating but honest answer is that there's no single correct structure, because there's no single chunking strategy. Different AI systems chunk differently. Perplexity doesn't publish its retrieval architecture. Neither does ChatGPT. Our findings are real, but they're findings about specific retrieval configurations — not a claim about how every AI system reads every page.
What we can say with confidence: content with clear internal logic — where one section hands off to the next, where the argument accumulates rather than restarts — performs better across more retrieval configurations than content that doesn't. It wins in synthesis mode. It doesn't lose in citation mode. It's more robust to the variation you can't see or control.
The practical implication is not a new content framework. It's closer to: write well. Specifically, write with Zinsser's narrative thread in mind — the idea that each section should be causally or logically connected to what came before, not just topically adjacent to it. A reader (or a retrieval system) should be able to follow the logic from one section to the next without starting from scratch.
That's not radical advice. It's what every decent writing teacher has told every student for the last hundred years. We've just put 36,450 data points behind it.
The gap between the best and worst chunking strategy on the same pages was nearly double on our relevance scale: 0.74 versus 0.38. That's a substantial difference in how well the right content gets retrieved, all from the choice of how to split the document.
None of that variation is visible in any current AEO measurement approach. Tracking whether your brand was cited in an AI answer doesn't tell you whether the right passage was retrieved. It tells you your page was somewhere in the system. Presence and retrieval quality are not the same thing, and on our data, they can diverge substantially.
The scale of this problem is already visible if you know where to look. A recent independent analysis found that even at a 91% accuracy rate, Google's AI Overviews produce millions of wrong answers every hour.6 But the more telling number isn't the error rate — it's the ungrounding rate: 56% of answers deemed "correct" cited sources that didn't actually support what the model said. That's not hallucination in the traditional sense. That's a retrieval problem. The model found something adjacent to the right passage, generated a confident answer from it, and linked to a source that doesn't back it up. We're speculating that chunking strategy is part of the explanation — we didn't test Google's pipeline (because we can't) — but retrieval failure upstream of generation is exactly what that pattern looks like to us.
The question that most AEO tracking answers is: did AI mention us? The question that actually matters is: when AI tried to find the answer on our page, did it find the right passage?
Those are different questions. The industry is measuring the first one. Our benchmark suggests the second one has more variance, more consequence, and currently no standard way of tracking it.
That's the problem worth solving. We haven't solved it yet, but we're working on it.
The full study — nine strategies, 135 pages, 36,450 evaluations, complete methodology and significance tests — is in our research paper: The Chunking Question: Nine Strategies, 36,450 Judgements, Two Winners →
LangChain's text splitter documentation lists at least eight chunking strategies as first-class options: character splitting, recursive character splitting, HTML splitting, markdown splitting, code splitting, token splitting, semantic chunking, and agentic splitting. The choice is left to the developer implementing the pipeline. ↩
We will, however, name them here: Profound. The quote is from their 2025 GEO guide: https://www.tryprofound.com/resources/articles/generative-engine-optimization-geo-guide-2025 ↩
The standard production RAG pattern is to retrieve top-k chunks (commonly top-3, top-5, or top-10), optionally rerank, and pass the results collectively to the reader model as context. Passing only the single top-1 chunk is atypical; most answer generation relies on the accumulated context of several retrieved passages. ↩
Overlap calculation: window = 512 tokens, step = 128 tokens. Each new window advances 128 tokens while retaining 512 − 128 = 384 tokens from the previous window. 384 ÷ 512 ≈ 75% overlap. The specific parameters were fixed in the study design before any analysis was run — not tuned against the results. ↩
This is a simplification of how large language model training works. Models are trained on vast corpora that include books, academic papers, Wikipedia, and long-form web content — much of which is structured according to the principles of coherent, linear writing. The claim is not that models were trained to prefer Zinsser-style prose; it's that the retrieval and embedding patterns that emerge from training on such corpora happen to reward the same structural properties Zinsser was describing. ↩
Independent analysis covered by Search Engine Land and others, April 2026. The 56% ungrounding rate refers to answers scored as "correct" under the study's rubric that were nonetheless not fully supported by the cited source. https://searchengineland.com/google-ai-overviews-accuracy-wrong-answers-analysis-473837 ↩