
The Invisible Pipeline: Why AI Visibility Tracking Is Measuring the Wrong Thing

Marketers track whether AI mentions their brand. But the retrieval pipeline that decides which passage gets surfaced is opaque, vendor-controlled, and varies massively across configurations. We benchmarked seven embedding models across 28,350 LLM evaluations. Two models scored identically to ten decimal places — and retrieved completely different content. The mean is hiding the only thing that matters.

April 27, 2026 · Shaun Myandee · 12 min read
Tags: AEO · RAG · embedding · retrieval · research

Anthropic released Claude Opus 4.7 in April 2026. The benchmarks were extraordinary — top of the Chatbot Arena leaderboard, 99th percentile on coding evaluations, numbers that placed it among the most capable frontier models ever released. The practitioner reaction was unanimous and pointed the opposite way: it's shit.[1] Writers, developers, and heavy AI users reported regressions in reasoning, creativity, and real-world task performance. The benchmark said one thing. Reality said another.

This is not a post about Claude Opus 4.7. It's about the same pattern in the part of the stack that every AI search and answer system depends on — the part that decides whether your content gets found, which passage gets surfaced, and whether the citation the model generates actually matches what your page says.

That part is invisible. And the industry is measuring the wrong thing entirely.

"

"When a measure becomes a target, it ceases to be a good measure."

— Charles Goodhart, paraphrased (1975)

"

What marketers can see vs what AI systems actually do

If you work in content, SEO, or AEO, you optimise what you can see. Headings. Keywords. Structure. Meta descriptions. Schema markup. You can audit these. You can test them. You can see when Google changes how it renders your page.

But AI retrieval doesn't work on any of that directly. Before an AI can cite your page, it has to:

  1. Crawl your page and extract text
  2. Chunk that text into retrievable units — a 512-token window, a heading-bounded section, or something else entirely
  3. Embed each chunk into a high-dimensional vector space — a geometric mapping that turns text into coordinates
  4. Encode the user's query into the same space
  5. Retrieve the closest matches and pass them to a reader model
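To make the pipeline concrete, here is a minimal sketch of steps 2 through 5 in Python. It assumes a naive fixed-size word-window chunker and the open all-MiniLM-L6-v2 checkpoint via the sentence-transformers library; a real vendor pipeline could differ at every one of these steps, which is exactly the point.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, window: int = 120) -> list[str]:
    # Step 2: split extracted page text into retrievable units. A vendor
    # might instead use token windows, heading boundaries, or semantic
    # segmentation -- each produces different units.
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

page_text = "...extracted page text from step 1 (crawl) goes here..."
chunks = chunk(page_text)

model = SentenceTransformer("all-MiniLM-L6-v2")              # the geometry
doc_vecs = model.encode(chunks, normalize_embeddings=True)   # step 3: embed chunks
q_vec = model.encode(["how do embedders affect retrieval?"],
                     normalize_embeddings=True)[0]           # step 4: embed query

scores = doc_vecs @ q_vec            # cosine similarity on unit vectors
top_k = np.argsort(-scores)[:10]     # step 5: retrieve closest matches
for i in top_k:
    print(f"{scores[i]:.3f}  {chunks[i][:80]}")
```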

Every one of those steps is configurable. Every one varies across vendors. And at least three of them — chunking, embedding, and retrieval strategy — have been shown to produce massive variance in what actually gets found.

We know this because we measured it. In our previous study, nine chunking strategies on the same pages produced a nearly 2× spread in retrieval quality. In our new study, seven embedding models on the same pages produced a 37% relative spread.[2] The components that determine whether your content surfaces are not standardised. They're not published. They're not even stable within a single vendor over time.

And the tools the industry uses to track "AI visibility" ignore all of this.

The embedding proof

We benchmarked seven open-weights embedding models across 135 live web pages, 405 stratified queries, and 28,350 LLM relevance judgements. The full study, The Embedder Question, is linked at the end of this post. But two findings matter for the argument this post is making.

Finding one: identical scores, different geometries.

BGE-small-en-v1.5 and nomic-embed-text-v1.5 both scored 0.7091358025 on our primary metric. Identical to ten decimal places. If you were reading a leaderboard, you would reasonably conclude these models are equivalent.

They're not. On the underlying retrieval:

  • Zero queries produced identical ranked top-10 lists
  • 8 of 405 produced the same unordered set
  • Mean Jaccard overlap between their top-10 returns: 0.43
  • On queries where they disagreed, BGE-small won 113 times and nomic won 111

[Figure: Retrieval patterns for a representative query. BGE-small and nomic scored identically on aggregate but retrieved observably different chunks, with a Jaccard overlap of 0.43. The identical mean is the sum of differences that approximately cancel, not evidence of equivalent behaviour. Data: Wayfinder Research, April 2026.]
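For readers who want to reproduce the overlap arithmetic, here is a minimal sketch. The run_a/run_b mappings and chunk IDs are toy stand-ins for the study's 405 queries, not our actual data.

```python
# Toy stand-ins: each run maps a query to one model's ranked top-k chunk IDs.
run_a = {"q1": ["c1", "c2", "c3"], "q2": ["c4", "c5", "c6"]}
run_b = {"q1": ["c1", "c3", "c7"], "q2": ["c6", "c5", "c4"]}
queries = list(run_a)

def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

identical_rankings = sum(run_a[q] == run_b[q] for q in queries)        # ranked lists match
identical_sets = sum(set(run_a[q]) == set(run_b[q]) for q in queries)  # unordered sets match
mean_overlap = sum(jaccard(run_a[q], run_b[q]) for q in queries) / len(queries)
print(identical_rankings, identical_sets, round(mean_overlap, 2))
```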

The leaderboard calls them equal. The geometry underneath calls them different. The mean is not just lossy — it's actively misleading about what the model does.

This matters because "relevance" is not a property of text waiting to be discovered. It's a property of a specific geometric mapping. Each model learns a different one. Two models can agree on the average quality of their retrievals while disagreeing on every single query about which passage is the right one.

When you swap embedding models, you don't just tune retrieval. You change the definition of "relevant."
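Here is a sketch of what "a different geometry" means in practice, using two open checkpoints from the study's family. The Hugging Face model IDs are real; the chunks are toy examples, and on real pages rankings like these can and do disagree.

```python
from sentence_transformers import SentenceTransformer

chunks = [
    "Embedding models map text into high-dimensional vector spaces.",
    "Vector databases store chunk embeddings for nearest-neighbour search.",
    "Chunking strategy determines the size of each retrievable unit.",
]
query = "how does retrieval decide which passage is relevant?"

for name in ("sentence-transformers/all-MiniLM-L6-v2", "BAAI/bge-small-en-v1.5"):
    model = SentenceTransformer(name)   # each checkpoint learns its own geometry
    doc_vecs = model.encode(chunks, normalize_embeddings=True)
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    order = (doc_vecs @ q_vec).argsort()[::-1]  # BGE also recommends a query
    print(name, [int(i) for i in order])        # instruction prefix: one more knob

# Same chunks, same query, two definitions of "closest".
```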

Finding two: two frontiers, no single answer.

Even if you accept the scalar score as meaningful, "best" depends on your binding constraint. Our study found two Pareto frontiers:

Storage frontier (minimise bytes per chunk): BGE-small-en-v1.5 → mxbai-embed-large-v1 → pplx-embed-context-v1-4b.

Throughput frontier (maximise speed): all-MiniLM-L6-v2 → BGE-small-en-v1.5 → BGE-large-en-v1.5 → mxbai-embed-large-v1 → pplx-embed-context-v1-4b.

[Figure: Pareto frontiers of quality vs storage (left) and quality vs throughput (right). Two models appear on the throughput frontier only; 'best' depends entirely on the binding constraint, and on the model's specific similarity geometry. Data: Wayfinder Research, April 2026.]

Two extra models appear on throughput but not storage. BGE-large achieves the same quality as mxbai at 6.5× the speed. If your constraint is re-indexing on a deploy deadline, BGE-large is the right answer — one the leaderboard would miss entirely.

There is no universal best model. There is only best-for-this-configuration. And you don't know what configuration the AI system reading your page is using.
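The frontier logic itself is simple enough to sketch. The (quality, cost) numbers below are illustrative placeholders, not the study's results; the point is that the same quality column admits different frontiers under different cost axes.

```python
def pareto_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    # models: (name, quality, cost); lower cost is better, higher quality
    # is better. Walk models from cheapest to dearest, keeping each one
    # that beats the best quality seen so far.
    frontier, best = [], float("-inf")
    for name, quality, _cost in sorted(models, key=lambda m: m[2]):
        if quality > best:
            frontier.append(name)
            best = quality
    return frontier

# Illustrative numbers only: same quality scores, two different cost axes.
by_storage = pareto_frontier([("A", 0.66, 1), ("B", 0.70, 2), ("C", 0.73, 8)])
by_speed   = pareto_frontier([("A", 0.66, 2), ("B", 0.70, 6), ("C", 0.73, 5)])
print(by_storage)  # ['A', 'B', 'C']
print(by_speed)    # ['A', 'C'] -- B adds cost without quality on this axis
```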

The vendor opacity problem

This is where it gets uncomfortable.

You don't know what embedding model Perplexity uses. You don't know what chunker ChatGPT uses. You don't know Google's retrieval architecture. These are not published. They can change without notice. And our data shows that the choice among open-weights models — models anyone can download and test — creates a 37% quality spread from a one-line configuration change.

If the hosted systems retrieving your content are running older, different, or tuned embedders, the variance in what gets found is enormous. And you have no way to know.

The current AEO measurement stack treats this as a solved problem. It tracks whether your brand was cited in an AI answer. It scores your "AI visibility." It reports presence. But presence is not retrieval quality. Being somewhere in the system is not the same as the right passage being found.

And the gap between those two things is not just theoretical. It's exactly what our data measures.

The measurement problem

The question most AEO tools answer is: did AI mention us?

The question that actually matters is: when AI tried to find the answer on our page, did it find the right passage — and would a different embedding model or chunking strategy have found a different one?

Those are different questions. And the second one has more variance, more consequence, and currently no standard way of tracking it.
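The difference is visible in code. A sketch with hypothetical names: presence is roughly what most visibility tools compute, while passage_hit is the question this post argues actually matters, and it requires both access to the retrieval layer and a gold label.

```python
def presence(ai_answer: str, brand: str) -> bool:
    # What most AEO tools report: did the answer mention us at all?
    return brand.lower() in ai_answer.lower()

def passage_hit(retrieved_ids: list[str], gold_passage_id: str, k: int = 10) -> bool:
    # What actually matters: did retrieval surface the passage that answers
    # the query? Swap the embedder or chunker and this can flip while
    # presence stays unchanged.
    return gold_passage_id in retrieved_ids[:k]
```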

The scale of this problem is already visible. A recent independent analysis found that even at a 91% accuracy rate, Google's AI Overviews produce millions of wrong answers every hour.[3] But the more telling number is the ungrounding rate: 56% of answers scored as "correct" cited sources that didn't actually support what the model said.

That's not hallucination. That's retrieval failure upstream of generation. The model found something adjacent to the right passage, generated a confident answer from it, and linked to a source that doesn't back it up. We're speculating that embedding and chunking choices are part of the explanation — we can't test Google's pipeline directly — but retrieval variance is exactly what that pattern looks like to us.
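Here is a sketch of how an ungrounding rate like that 56% could be measured in your own pipeline. The answers records and the nli_entails judge are hypothetical stand-ins; we have no visibility into how the cited study implemented its rubric.

```python
def ungrounding_rate(answers, nli_entails) -> float:
    # answers: records with .scored_correct, .claim, .cited_passage (hypothetical).
    # nli_entails: any entailment judge -- an NLI model or an LLM grader.
    correct = [a for a in answers if a.scored_correct]
    ungrounded = [a for a in correct
                  if not nli_entails(premise=a.cited_passage, hypothesis=a.claim)]
    # "Correct but ungrounded": the claim happens to be right, while the
    # cited source does not actually support it -- a retrieval failure
    # upstream of generation, not a hallucination.
    return len(ungrounded) / len(correct)
```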

The measurement tools tracking AI visibility today answer the first question and ignore the second. They report citations without knowing whether the retrieved passage actually supported the claim. They score presence without measuring passage-level accuracy. And they do this across a pipeline where the key variables — chunking strategy, embedding model, retrieval architecture — are invisible, vendor-controlled, and provably high-variance.

What to actually do with this (honestly)

The honest answer is not a checklist. It's a reframing.

If you're building content for AI retrieval: The pipeline is multiplicative, not additive. Bad chunking × bad embedding × bad retrieval = exponentially worse outcomes. The best defence is content that's robust across configurations — clear internal logic, linear argumentation, sections that hand off to each other rather than restarting. Our previous study showed this structure performs well across chunking strategies.[2] This study shows it's even more important than we thought, because the embedder that defines the geometry is at least as variable as the chunker that splits the text.
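To see why "multiplicative" bites, a toy calculation with made-up per-stage rates:

```python
# Illustrative numbers only: if each stage independently preserves the right
# passage 80% of the time, the pipeline as a whole finds it barely half the time.
stages = {"chunking": 0.80, "embedding": 0.80, "retrieval": 0.80}
end_to_end = 1.0
for rate in stages.values():
    end_to_end *= rate
print(f"{end_to_end:.3f}")  # 0.512
```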

If you're evaluating AEO tools: Ask what they actually measure. "Did AI mention my brand?" is a presence metric. "Which passage did it retrieve?" is a quality metric. The first one is easy to track and widely available. The second one is hard, requires access to the retrieval layer, and currently has no standard. But it's the one that determines whether the citation is accurate.

If you're responsible for AI strategy in your organisation: The assumption that chunking, embedding, and retrieval are "solved infrastructure" is dangerous. They're not solved. They're high-variance, vendor-opaque, and changing. The current measurement stack gives you visibility into a proxy (citations) while hiding the actual variance (retrieval quality). That's not a criticism of any specific tool — it's a structural gap in how the industry measures AI visibility.

The problem worth solving is passage-level retrieval measurement across configurations. Tracking not just whether your page was cited, but whether the right passage was found — and how that changes when the chunking, embedding, or retrieval strategy changes. We've done this in research studies twice now. We haven't productised it yet. But we're working on it.
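What that measurement could look like, as a sketch. Every name here (chunkers, embedders, index, search, the gold labels) is a hypothetical stand-in for a real harness like the ones used in the two studies.

```python
def hit_rate(chunker, embedder, pages, queries, gold, k=10) -> float:
    # Passage-level measurement for one configuration: chunk the corpus,
    # build an index with this embedder, and check whether the gold passage
    # for each query lands in the top k.
    chunks = [c for page in pages for c in chunker(page)]
    index = embedder.index(chunks)        # hypothetical index builder
    return sum(gold[q] in index.search(q, k) for q in queries) / len(queries)

# The spread across this grid -- not any single cell -- is the measurement
# the current AEO stack is missing.
results = {(c.name, e.name): hit_rate(c, e, pages, queries, gold)
           for c in chunkers for e in embedders}
spread = max(results.values()) - min(results.values())
```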

The full study — seven models, 28,350 judgements, the complete BGE-small/nomic coincidence analysis, and full methodology — is in our research paper: The Embedder Question: Seven Models, 28,350 Judgements, Two Frontiers →

Footnotes

  1. The Chatbot Arena leaderboard showed Claude Opus 4.7 at or near the top of several categories in April 2026. The practitioner consensus on social media, developer forums, and independent evaluations was broadly negative, citing regressions in reasoning, creativity, and long-context coherence. The gap between benchmark performance and real-world utility is documented in numerous public discussions; we cite it here as an example of a general problem rather than a specific critique of Anthropic's model.

  2. Wayfinder Study 01 benchmarked nine chunking strategies across 36,450 LLM evaluations on 135 live web pages. The full study is available at [/research/chunking-question].

  3. Independent analysis covered by Search Engine Land and others, April 2026. The 56% ungrounding rate refers to answers scored as "correct" under the study's rubric that were nonetheless not fully supported by the cited source. https://searchengineland.com/google-ai-overviews-accuracy-wrong-answers-analysis-473837
