The Embedder Question: Seven Models, 28,350 Judgements, Two Frontiers
An empirical comparison of open-weights embedding models for RAG retrieval across 135 live web pages and 405 stratified queries.
Overview
This study evaluates seven open-weights embedding models on a retrieval task designed to mirror production RAG workloads: 135 live web pages, 405 stratified queries, and an LLM relevance judge. The chunker was held constant at sliding_window_512_step_128 (the deep-retrieval winner from Wayfinder Study 01). Only the embedder was varied.
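The chunker name encodes its two parameters: 512-token windows advanced 128 tokens at a time, so consecutive chunks overlap by 384 tokens. Below is a minimal sketch of that scheme, assuming the page has already been tokenised; the function name, token-level granularity, and tail handling are illustrative assumptions, not taken from the study's code.

```python
def sliding_window_chunks(tokens, size=512, step=128):
    """Overlapping windows of `size` tokens, advancing `step` tokens per chunk."""
    if len(tokens) <= size:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens) - size + 1, step):
        chunks.append(tokens[start:start + size])
    # One way to cover trailing tokens the last full window missed (an assumption,
    # not necessarily how the study handles document tails).
    if start + size < len(tokens):
        chunks.append(tokens[-size:])
    return chunks
```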
Key Findings
- Embedder choice dominates chunking choice: The 0.32 top-to-bottom spread in mean judge score is larger than the spread between any two chunkers in Study 01. With the chunker fixed, the embedder is the biggest lever for retrieval quality.
- Two Pareto frontiers, different answers: The storage frontier (minimise bytes per chunk) contains three models; the throughput frontier (maximise chunks per second) contains five. Two models appear only on the throughput frontier. "Best" depends on the binding constraint (see the frontier sketch after this list).
- Identical scores, different behaviour: Two mid-tier models scored identically to ten decimal places on every aggregate metric, yet their ranked top-10s matched on zero of the 405 queries, with a mean Jaccard overlap of just 0.43 (see the overlap sketch after this list). The identical mean is the sum of differences that cancel, not evidence of equivalence.
- A statistical near-tie with a throughput twist: Two top-tier models are indistinguishable on mean quality, with overlapping 95% CIs, but one embeds 6.5× faster, an advantage single-number rankings miss.
- More layers, worse results: The 12-layer MiniLM model scored significantly below the 6-layer variant from the same family — countering the intuition that more parameters improve live-web retrieval.
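On the two frontiers: each is a standard Pareto set over one cost axis and the quality axis. A model sits on the storage frontier if no other model is both cheaper per chunk and at least as good, and on the throughput frontier if no other model is both faster and at least as good. A minimal sketch; the field names and the mean_judge quality metric are assumptions, not the study's schema.

```python
def pareto_frontier(models, cost, quality):
    """Models for which no other model is at least as good on both axes
    (lower cost, higher quality) and strictly better on at least one."""
    frontier = []
    for m in models:
        dominated = any(
            cost(o) <= cost(m) and quality(o) >= quality(m)
            and (cost(o) < cost(m) or quality(o) > quality(m))
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return frontier

# Usage with hypothetical fields (not the study's actual column names):
#   storage_frontier    = pareto_frontier(models, cost=lambda m: m["bytes_per_chunk"],
#                                         quality=lambda m: m["mean_judge"])
#   throughput_frontier = pareto_frontier(models, cost=lambda m: -m["chunks_per_sec"],
#                                         quality=lambda m: m["mean_judge"])
```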
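On the identical-scores finding: the overlap statistic compares, query by query, the set of chunks each model returns in its top 10. A minimal sketch, assuming each model's results are stored as a dict from query id to its ordered top-10 chunk ids over the same queries; that data structure is an assumption, not the study's code.

```python
def top10_overlap(results_a, results_b):
    """Mean per-query Jaccard overlap of two models' top-10 chunk sets,
    plus the count of queries where the ranked lists match exactly."""
    jaccards, exact_matches = [], 0
    for q in results_a:
        a, b = results_a[q][:10], results_b[q][:10]
        jaccards.append(len(set(a) & set(b)) / len(set(a) | set(b)))
        exact_matches += (a == b)
    return sum(jaccards) / len(jaccards), exact_matches
```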
Implications
The embedder is the most replaceable component of a RAG pipeline and, on this evidence, the most consequential. The Study 01 baseline ranks second-to-last here; earlier chunker scores should be read as lower bounds. Cost-quality trade-offs are two-dimensional — storage and throughput give different answers — and identical aggregate scores can hide fundamentally different retrieval behaviour. Aggregate metrics alone are insufficient for model selection.
Methodology
The benchmark replicates Study 01's retrieval-and-judge pipeline with one variable changed: the embedder. The corpus (135 live web pages), queries (405, stratified by content type and specificity), chunker (sliding_window_512_step_128), and judge (Qwen 3.5 122B, temperature 0, 0/1/2 rubric) are identical to Study 01. Seven open-weights models were tested under identical conditions with top-10 cosine-similarity retrieval. 95% confidence intervals come from 1,000-iteration paired bootstraps; throughput was measured on a single NVIDIA GB10 at batch size 32.
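A minimal sketch of the retrieval step, assuming dense vectors scored by brute force; indexing and batching details are not specified here.

```python
import numpy as np

def top10_cosine(query_vec, chunk_vecs):
    """Return indices of the 10 chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                   # cosine similarity per chunk
    return np.argsort(-sims)[:10]  # highest-similarity chunk indices first
```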
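And a sketch of the paired bootstrap behind the confidence intervals: resample the per-query score differences with replacement, recompute the mean each time, and take the 2.5th and 97.5th percentiles. The percentile method and per-query pairing are assumptions about the exact variant used.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, iters=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-query score difference between two embedders.
    scores_a, scores_b: judge scores aligned on the same queries."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = []
    for _ in range(iters):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * iters)]
    hi = boot_means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```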
Get the Full Research Paper
Detailed methodology, complete data tables, and all visualisations.
Download PDF