AEO Metrics: Why Most Industry Tools Mislead and What Actually Matters
Honest breakdown of AI search metrics—why visibility scores are noise, which metrics work, and a real measurement framework.
The AEO industry is selling you noise. Prompt visibility scores, AI "rankings", citation counts — these metrics promise precision but deliver nothing but illusion. They measure what's easiest to quantify, not what actually matters. This is the same mistake performance marketers made when they first adopted Google Ads: treating a brand channel like a direct response channel, then wondering why it didn't perform.
AI search works fundamentally differently to traditional search. The question isn't "how many clicks did AI drive?" — it's "how many consideration sets did we enter?" Yet most teams are still measuring position and frequency as if large language models returned deterministic results. They don't.
This guide separates signal from noise. We'll show you which metrics are meaningless, which tell you something real, and then lay out a practical framework for measuring AEO that doesn't require you to lie to your stakeholders about what's actually being measured. The bottom line: most AI visibility metrics are performative nonsense. Here's what actually tells you whether optimisation is working.
Why Vendor Metrics Fail
The measurement problem starts with the technology itself. LLMs are non-deterministic — meaning the same prompt rarely produces identical output. This isn't a bug; it's the core architecture.
Problem 1: LLM Non-Determinism
Research from SparkToro and Gumshoe has demonstrated that running the same question 100 times won't give you the same brand list twice. The probability of getting the same list in the same order is approximately 1 in 1,000. Even getting the same list in any order happens less than 1% of the time.
This happens because of temperature randomness, mixture-of-experts architectures routing differently, and floating-point arithmetic that's non-reproducible. Context-dependent outputs compound this: your brand might appear when asked about "best electric toothbrush for sensitive gums" but not when asked about "best electric toothbrush under £50".
What does this mean for "ranking"? Tools claiming "you rank #3 for this prompt" are selling fiction. Run the same prompt ten times and you'll get positions like 2, 5, 1, 7, 4, 3, 6, 2, 1, 5. The average position is meaningless noise.
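To see just how little an "average position" from a handful of runs tells you, here's a minimal simulation. It simply assumes position behaves like a uniform random draw from 1 to 10, which is the point being made above, and shows how much the batch averages swing:

```python
import random
import statistics

# Treat "position" as a uniform random draw from 1-10 (the claim above) and
# watch how much the "average position" swings between batches of 10 runs.
for batch in range(1, 6):
    positions = [random.randint(1, 10) for _ in range(10)]
    print(f"batch {batch}: positions={positions}  avg={statistics.mean(positions):.1f}")
```

Every batch reports a different "rank", and none of them means anything.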
Problem 2: Measurement Volume is Inadequate
To get statistically valid data on whether you appear in AI responses, you need to run each prompt 60-100 times and measure appearance frequency, not position. That's 6,000-10,000 API calls per month if you're testing 100 prompts.
Most vendor tools run 100-200 prompts once each. They're showing you a single data point and calling it a metric. It's not measurement; it's a snapshot of randomness.
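For a sense of why 60-100 runs is the floor, here's a small sketch of the underlying statistics. It uses the plain normal approximation for a binomial proportion (an assumed convention; the Wilson interval is a common alternative) to show how the confidence interval around an appearance rate narrows as the run count grows:

```python
import math

def appearance_ci(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval (normal approximation) for an appearance rate."""
    p = hits / runs
    half_width = z * math.sqrt(p * (1 - p) / runs)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Same underlying ~47% appearance rate, different run counts per prompt.
for runs in (10, 30, 60, 100):
    hits = round(0.47 * runs)
    lo, hi = appearance_ci(hits, runs)
    print(f"n={runs:3d}: observed {hits / runs:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
```

At a handful of runs the interval is so wide the number is useless; it only becomes trend-worthy once you're sampling at serious volume.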
Problem 3: Citation Frequency ≠ Value
Even if you reliably appear in 47% of responses, what does that mean? Frequency tells you visibility. It doesn't tell you if visibility drives business value. It doesn't account for who's asking questions, whether the mention is positive or negative, or if it actually influenced any decision. It also doesn't account for time lag — someone researching in AI today might purchase in three months, or not at all.
Problem 4: "AI Visibility Score" is Non-Standard
Different vendors use different methodologies. Some measure brand mentions only. Some measure links. Some measure position (which is invalid). Some measure "alignment" with queries in ways they won't define. A score of 73 from one vendor doesn't mean the same thing as a 73 from another. Each is a black box with no external validation.
The point isn't to pretend you can measure the unmeasurable; it's to focus on what genuinely can be measured.
Metrics That Mislead
Not all metrics are created equal. Some tell you nothing. Some actively mislead. Here are the ones you should stop tracking immediately.
Avoid: Prompt Position/Ranking
"You rank #3 for X prompt."
This is equivalent to running =RANDBETWEEN(1,10) in Excel and calling it performance. The SparkToro/Gumshoe research proved position is noise, not signal. Tools selling this are selling nonsense. Skip it. Don't pay for it. Don't track it in your dashboard. It's a vanity metric with a fancy name.
Avoid: Share of Voice / AI Visibility Index
"You own 12% of AI visibility in your category."
There's no standard definition of "visibility" across vendors. Some include position, some don't. Some weight by query volume, some don't. Some measure only direct mentions, some include implied attribution. Your 12% isn't comparable to anyone else's 12%. It implies precision that doesn't exist.
Avoid: Citation Count Without Context
"You appeared in 150 AI responses this month."
A count without context is meaningless. Are 150 throwaway mentions worth more than 10 valuable ones? What was said: positive, negative, or neutral? Who saw it, and did they act on it? Citation frequency has value only if you also measure quality of mention. A single authoritative mention in a high-intent response beats 50 low-quality ones.
Avoid: Single-Touch Attribution
"This AI mention generated £5,000 in revenue."
Single-touch attribution is structurally impossible for AI search. Research happens days or weeks before purchase. Multiple touchpoints occur before conversion. Last-click attribution is arbitrary. Anyone claiming single-touch attribution for AI visibility is lying, either through ignorance or deliberate obfuscation.
Avoid: "AI Traffic"
"ChatGPT sent you £2,000 in revenue last month."
AI referral traffic is approximately 1% of total web traffic. For most businesses, that's too small to matter. Google Analytics also can't separate AI-influenced visits from the rest of your traffic; it only sees the small share of clicks referred directly from an AI tool. More importantly, the real impact is much larger than clicks: brand influence happens in people's heads, not in the click log.
Track AI traffic if you want, but understand it's not the story. It's the tail end of a much longer chain.
Metrics That Actually Work
Some metrics do tell you something real. They're harder to collect and require more discipline, but they actually measure value. Use this five-level framework.
Level 1: Technical Accessibility (Measurable, yes/no)
Can AI agents access your content? This is binary and deterministic.
How to measure:
- Run a Compass audit (automated)
- Manual testing: Can Claude/ChatGPT find your key pages?
- Screaming Frog: Is content in HTML or JS-loaded?
Metrics:
- "95% of key pages are accessible to AI agents" (yes/no + percentage)
- "0 robots.txt blocks on AI crawlers" (pass/fail)
- "100% of critical content in HTML" (pass/fail)
Why it works:
- Binary or percentage (measurable)
- Deterministic (same result each test)
- Actionable (you can fix failures)
This is your foundation. If AI can't access your content, nothing else matters.
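A minimal sketch of an automated Level 1 check follows; the domain, page list, and marker phrases are hypothetical placeholders to swap for your own. It covers two of the audit items above: whether AI crawlers are blocked in robots.txt, and whether a phrase you know should be on a page actually appears in the raw, unrendered HTML.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical placeholders: use your own domain, key pages and marker phrases.
SITE = "https://www.example.com"
KEY_PAGES = ["/", "/pricing", "/guides/electric-toothbrushes"]
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
MARKER_PHRASES = {"/pricing": "per month"}  # a phrase that should be in the raw HTML

# 1. robots.txt: are any key pages blocked for AI crawlers?
robots = RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()
for agent in AI_CRAWLERS:
    blocked = [p for p in KEY_PAGES if not robots.can_fetch(agent, f"{SITE}{p}")]
    print(f"{agent}: {'PASS' if not blocked else 'FAIL, blocked: ' + ', '.join(blocked)}")

# 2. JS-loading check: if the marker phrase is missing from the raw HTML,
#    the content probably only exists after JavaScript runs.
for page, phrase in MARKER_PHRASES.items():
    html = urllib.request.urlopen(f"{SITE}{page}").read().decode("utf-8", "ignore")
    found = phrase.lower() in html.lower()
    print(f"{page}: marker phrase {'found' if found else 'MISSING'} in raw HTML")
```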
Level 2: Content Quality (Partially measurable)
Does your content actually answer the questions people ask AI?
How to measure:
- Manual audits: Ask Claude to summarise your pages, evaluate quality
- Extraction testing: Can AI correctly extract key facts?
- Content accuracy: Is information current and correct?
Metrics:
- "89% of key facts correctly extracted by AI" (manual count)
- "0 significant inaccuracies in extracted data" (manual review)
- "Content updated within last 30 days for evergreen content" (audit trail)
Why it works:
- Measurable via manual review (tedious but valid)
- Connects to real value (accurate information)
- Actionable (improve content, increase extraction rate)
Quality beats quantity. One accurate, comprehensive answer is worth ten shallow mentions.
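Part of the extraction test can be automated. The sketch below uses the OpenAI Python SDK as a stand-in (an assumption; any model API works the same way), and the page URL, key facts, and model name are hypothetical placeholders. The substring scoring is deliberately crude, so a human still needs to review the misses.

```python
import urllib.request
from openai import OpenAI  # assumes the openai Python SDK (>=1.0) is installed

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical placeholders: the page under test and the facts it should convey.
PAGE_URL = "https://www.example.com/products/sonic-toothbrush"
KEY_FACTS = ["recyclable brush head", "40-day battery life", "£75"]

page_text = urllib.request.urlopen(PAGE_URL).read().decode("utf-8", "ignore")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: use whichever model you actually test against
    messages=[{
        "role": "user",
        "content": "Summarise the key product facts in this page:\n\n" + page_text[:20000],
    }],
)
summary = response.choices[0].message.content.lower()

# Crude scoring: substring match against the facts the page should convey.
# Phrasing differences cause false negatives, so review the misses by hand.
recovered = [fact for fact in KEY_FACTS if fact.lower() in summary]
print(f"Extraction rate: {len(recovered)}/{len(KEY_FACTS)} key facts recovered")
```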
Level 3: Brand Mention Frequency (Measurable with caveats)
How often does AI mention your brand?
How to measure (correctly):
- Run 60-100 identical prompts per brand
- Measure: "Brand appears in X% of responses"
- Don't measure position (noise)
- Example: "SURI appears in 47% of ChatGPT responses to 'best electric toothbrush'"
Metric:
- "Brand appears in 45% of key prompt responses (±5%, n=60 runs per prompt)"
This is trend data, not an absolute figure. Report it with its variance: "45% ±5%" is meaningful; the difference between 45% and 46% is noise.
Requires: Dedicated tool (SparkToro, Gumshoe) or in-house statistical rigour.
Why it works:
- Statistically valid with adequate sample
- Shows trend over time
- Not position (which is noise)
- Requires discipline to measure correctly
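Here's what measuring this correctly can look like in practice, again sketched with the OpenAI SDK as a stand-in (an assumption; the consumer ChatGPT product layers retrieval and system prompts on top of the raw model, so treat API results as an approximation of it). The brand and prompt come from the example above; the model name is a placeholder.

```python
from openai import OpenAI  # assumes the openai Python SDK (>=1.0) is installed

client = OpenAI()  # expects OPENAI_API_KEY in the environment

BRAND = "SURI"
PROMPT = "What is the best electric toothbrush?"
RUNS = 60  # 60-100 runs per prompt, per the guidance above

hits = 0
for _ in range(RUNS):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: swap in the model you care about
        messages=[{"role": "user", "content": PROMPT}],
    )
    if BRAND.lower() in response.choices[0].message.content.lower():
        hits += 1

print(f"{BRAND} appeared in {hits / RUNS:.0%} of {RUNS} responses")
# Pair this with the confidence-interval sketch under Problem 2, and read the
# trend over months rather than comparing individual runs.
```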
Level 4: Consideration Set / Brand Recall (Measurable via surveys)
Do people who see AI mentions of your brand remember it?
How to measure:
- Brand tracking studies (YouGov, Tracksuit, Latana)
- Pre/post surveys: "When you think of [category], which brands come to mind?"
- Measure: Brand awareness, consideration, preference lift
Metrics:
- "Aided brand awareness: 35% (up 5pp vs. baseline)"
- "Consideration set inclusion: 22% (up 3pp vs. baseline)"
- "Preference shift: +2pp vs. competitors"
Why it works:
- Direct measurement of mental availability
- How TV is measured (proven framework)
- Accounts for time lag (weekly/monthly tracking)
- Shows business impact of brand visibility
Costs: £5-20k per study, quarterly or monthly.
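When a survey wave shows a lift, it's worth checking that the movement is larger than survey noise. A minimal sketch of that check, using a standard two-proportion z-test with invented wave sizes and awareness figures:

```python
import math

def lift_z_test(hits_pre: int, n_pre: int, hits_post: int, n_post: int) -> tuple[float, float]:
    """Two-proportion z-test: is an awareness lift bigger than survey noise?"""
    p_pre, p_post = hits_pre / n_pre, hits_post / n_post
    p_pool = (hits_pre + hits_post) / (n_pre + n_post)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_pre + 1 / n_post))
    return p_post - p_pre, (p_post - p_pre) / se

# Invented example: two waves of 1,000 respondents, aided awareness 30% -> 35%.
lift, z = lift_z_test(300, 1000, 350, 1000)
verdict = "significant" if abs(z) > 1.96 else "within noise"
print(f"Lift: {lift:+.1%}, z = {z:.2f} ({verdict} at the 95% level)")
```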
Level 5: Purchase Attribution (Unmeasurable directly)
Does AI visibility drive sales?
How to measure:
- Marketing Mix Modelling (MMM)
- Requires: Scale, time series data, statistical expertise
- Estimates contribution (not perfect attribution)
- Example: "AI visibility contributes approximately £40-60k annually to revenue"
Why it's hard:
- Structural impossibility (multi-touch, time lag)
- Requires 12+ months of data
- Requires statistical sophistication
- Expensive (£20-50k+ for a proper study)
Why it works (when done right):
- Statistically rigorous
- Accounts for lag and multi-touch
- Shows business impact
- Only true way to measure revenue impact
This is where most teams want to be, and where most should admit they can't get yet.
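For teams that do get there, the sketch below is a toy illustration of the regression idea underneath MMM. It is not a production model and every number in it is invented; real MMMs add adstock, saturation curves, seasonality, and many more channels, which is where the cost and expertise come in.

```python
import numpy as np

# Toy illustration of the regression idea behind Marketing Mix Modelling.
# Every number is invented; a real MMM adds adstock, saturation, seasonality
# and many more channels, and needs 12+ months of clean weekly data.
rng = np.random.default_rng(7)
weeks = 52
ai_rate = rng.uniform(0.1, 0.7, weeks)    # weekly brand appearance rate in AI responses
paid = rng.uniform(5_000, 20_000, weeks)  # weekly paid media spend (£)
revenue = 25_000 + 2_500 * ai_rate + 1.2 * paid + rng.normal(0, 800, weeks)

# Ordinary least squares: revenue ~ intercept + AI appearance rate + paid spend
X = np.column_stack([np.ones(weeks), ai_rate, paid])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)

estimated_annual = coef[1] * ai_rate.mean() * weeks
print(f"Estimated AI-visibility contribution: ~£{estimated_annual:,.0f}/year "
      "(toy data; a real MMM reports this with a confidence interval)")
```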
The Measurement Framework for Your Team
What should you actually measure? Start simple, then scale.
Quick wins (do first, no cost)
- Level 1: Run Compass audit. Pass/fail. Done.
- Level 2: Manual content audit. How much can AI extract correctly?
Medium investment (1-3 months)
- Level 3: Set up brand mention tracking (if you have budget for SparkToro/Gumshoe)
- Level 4: Set baseline brand tracking survey (YouGov, Latana)
Long-term (3+ months)
- Quarterly: Re-run Level 1 and 2 (maintain foundation)
- Quarterly/monthly: Brand tracking updates (measure trend)
- Annually: Brand lift study (formal measurement)
- Long-term: MMM study (if you have scale and budget)
Recommended dashboard
| Level | Metric | Frequency | Owner |
|---|---|---|---|
| 1 | AI accessibility score | Monthly | Technical SEO |
| 2 | Content extraction rate | Quarterly | Content |
| 3 | Brand mention frequency | Monthly | Marketing |
| 4 | Brand awareness (aided) | Quarterly | Brand |
| 5 | Estimated revenue contribution | Annually | Finance/Analytics |
Stakeholder communication
- Level 1-2: "We've confirmed AI can access our content and extract it correctly."
- Level 3: "Our brand appears in ~40% of relevant AI responses."
- Level 4: "Brand awareness is up 3pp since last quarter."
- Level 5: "We estimate AI contributes £50-70k annually to revenue (±20%)."
Note the uncertainty: Level 5 is an estimate, not a measurement. Be honest about that. Performance marketers won't accept this, but reality doesn't conform to org charts. If teams insist that AI search must work like Google search because it has "search" in the name, they're going to keep getting frustrated. The data doesn't care what they accept.
What Not to Report
Some metrics shouldn't appear in your dashboard, no matter how impressive they look.
Don't report prompt position. It's noise.
Don't report single-touch attribution. It's invalid.
Don't report AI traffic as your main metric. It's less than 1% of total traffic and doesn't account for brand influence.
Don't report "AI visibility score" from unvalidated tools. Their methodology is unclear.
Don't claim certainty where there's uncertainty. Estimates should include confidence intervals.
What to say instead:
- "We measure AI accessibility (can AI find content?) and content quality (can AI extract accurately?)."
- "We track brand mention frequency in AI responses using statistical sampling."
- "We estimate AI's revenue contribution via MMM at approximately £50k (±20%, 95% CI)."
- "We don't track position or click-through because these metrics are unreliable for AI."
Be honest with your stakeholders. Most teams are measuring the wrong things. The choice isn't between measuring and not measuring; it's about choosing the right measurements.
Ready to measure AI accessibility? Compass gives you Level 1 metrics — can AI access and extract your content? Start with diagnostics.