The Chunking Question: Nine Strategies, 36,450 Judgements, Two Winners
A benchmark of nine chunking strategies across 135 pages and 405 queries reveals that the best top-1 strategy and the best deep-retrieval strategy are different.
Overview
Chunking — how you split a source document into retrievable units — is a core configuration choice in any retrieval-augmented generation pipeline. It sets the unit of retrieval and therefore the unit of context available to the reader model. Despite its centrality, we could not find a published benchmark that compared a broad set of chunking strategies across multiple content types and query-specificity levels in one place. We built one.
Key Findings
- Two winners, different jobs: sliding_window_512_step_128 wins every chunk-level metric except P@1, MRR@10, and nDCG@10; semantic_heading wins those three.
- Top-1 is not the whole story: the two strategies tie at rank 1 (~0.48 fraction of directly-answering chunks), but from rank 2 onwards semantic_heading collapses to roughly half sliding_window's score-2 density. sliding_window's 75% overlap spreads relevant content across adjacent ranks; semantic_heading's heading-bounded chunks are independent, so rank 2 is typically a different section that scores 0 or 1.
- Page-level tells a different story: hierarchical chunking trails sliding_window by 0.12 mean-judge at chunk level, yet ties it at 0.84 mean-max-page-score. Its chunk-to-page delta (+0.23) is the largest of any strategy.
- Specificity cliff: every strategy degrades monotonically head → medium → specific. Sliding_window degrades least in absolute terms (0.84 → 0.63, drop of 0.21); mid-tier strategies drop twice as far.
- A mid-tier band: five strategies (fixed_256, fixed_512, fixed_512_overlap_50, fixed_512_overlap_128, recursive_character_512) cluster in a mean-judge band of 0.42–0.48 with overlapping 95% confidence intervals.
- fixed_1024 finishes last on every metric reported, with a 95% CI that does not overlap the next-worst strategy.
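The overlap mechanics behind the sliding_window result can be illustrated with a minimal sketch (the function name and token handling here are illustrative, not the benchmark's implementation): a 512-token window advancing 128 tokens per step leaves 512 − 128 = 384 shared tokens between adjacent chunks, i.e. 75% overlap, which is why a passage that answers a query tends to surface in several adjacent ranks.

```python
def sliding_window_chunks(tokens, window=512, step=128):
    """Split a token list into overlapping fixed-size chunks.

    With window=512 and step=128, consecutive chunks share
    512 - 128 = 384 tokens (75% overlap).
    """
    chunks = [tokens[i:i + window]
              for i in range(0, max(len(tokens) - window, 0) + 1, step)]
    # If the stride did not land exactly on the document end, add a
    # final window anchored to the end so no trailing tokens are lost.
    if len(tokens) > window and (len(tokens) - window) % step != 0:
        chunks.append(tokens[-window:])
    return chunks

doc = list(range(1024))                      # stand-in for 1024 tokens
chunks = sliding_window_chunks(doc)
overlap = len(set(chunks[0]) & set(chunks[1]))
print(len(chunks), overlap)                  # 5 chunks, 384 shared tokens
```

A heading-bounded chunker, by contrast, emits disjoint chunks, so adjacent retrieved ranks come from different sections rather than re-presenting the same passage.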
Implications
The right chunker depends on your pipeline. If you pass only the top-1 retrieved chunk to the reader model, semantic_heading and sliding_window are interchangeable on this benchmark. If you pass top-k chunks with k ≥ 2 — which is standard — sliding_window delivers materially more directly-answering content. If you evaluate retrieval at page level (is a relevant page anywhere in the top-10?), hierarchical's heading-hierarchy structure competes with sliding_window at a fraction of the chunk-token budget. No single strategy wins every job; the decision should be made against the metric the downstream pipeline actually optimises.
Methodology
Nine chunking strategies ran against 135 live web pages across five content types (blog, documentation, forum, knowledge_base, landing) and three query-specificity levels (head, medium, specific). For each (content_type × specificity) cell we generated 27 queries, retrieved the top-10 chunks per strategy using MiniLM-L6-v2 embeddings, and had an LLM judge (Qwen 3.5 122B via Ollama, temperature 0) grade each retrieved chunk on a 0 (not relevant) / 1 (partially relevant) / 2 (directly answers) rubric. Total: 405 queries × 9 strategies × 10 ranks = 36,450 judgements. A 50-pair stratified spot-check (10 per content type, 20 score-0 + 15 score-1 + 15 score-2 pairs, pre-registered target ≥ 45/50) recorded 50/50 agreement between the judge and a human reviewer under an informed-review protocol. Paired bootstrap 95% confidence intervals throughout; seeds and prompts specified for reproducibility.
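The rank metrics cited above can be computed directly from the per-rank 0/1/2 judgements. A minimal sketch follows (function names are ours, not the benchmark's code, and it assumes score 2 counts as "relevant" for the binary metrics P@1 and MRR, a common but not universal convention; nDCG uses the standard 2^g − 1 graded gain):

```python
import math

def p_at_1(grades):
    """grades: judge scores (0/1/2) for ranks 1..10 of one query."""
    return 1.0 if grades[0] == 2 else 0.0

def mrr_at_10(grades):
    """Reciprocal rank of the first directly-answering (score-2) chunk."""
    for rank, g in enumerate(grades[:10], start=1):
        if g == 2:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(grades):
    """Graded-relevance nDCG with 2^g - 1 gain and log2 discount."""
    dcg = sum((2 ** g - 1) / math.log2(r + 1)
              for r, g in enumerate(grades[:10], start=1))
    ideal = sorted(grades[:10], reverse=True)
    idcg = sum((2 ** g - 1) / math.log2(r + 1)
               for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

grades = [2, 1, 0, 2, 0, 0, 1, 0, 0, 0]   # one query's top-10 judgements
print(p_at_1(grades), mrr_at_10(grades), round(ndcg_at_10(grades), 3))
```

Per-strategy scores are then means over the 405 queries, which is also the level at which the paired bootstrap resamples when building the 95% confidence intervals.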
Get the Full Research Paper
Detailed methodology, complete data tables, and all visualisations.
Download PDF