Wayfinder AI
© 2026 Wayfinder AI. All rights reserved.

Methodology & training data

A deep dive into how Compass works, for technically curious users

12 min read · Going Deeper

Overview

Compass uses a hybrid ML+LLM architecture to simulate how AI agents navigate websites. This guide explains the technical details for those who want to understand what's happening under the hood.


Why a hybrid approach?

The problem with pure LLM navigation

You could give an LLM a webpage and ask "which link should I click to find pricing?"

The issues:

  • LLMs can be unpredictable in navigation choices
  • Context windows fill up quickly with large pages
  • No structured way to evaluate hundreds of links consistently

The problem with pure ML navigation

You could train a classifier to score links by relevance.

The issues:

  • No semantic understanding — can't reason about ambiguous labels
  • Context-blind — doesn't understand the goal in natural language
  • Brittle — struggles with novel site structures

Compass combines both

ML handles efficient filtering and structured feature extraction. LLM handles intelligent decision-making and semantic reasoning.

This architecture performs better than running AI directly in a chat interface. The ML layer provides structured candidate data and goal alignment that raw browser-based agents lack. It is also significantly faster at scale: sub-second ML inference feeds a small set of structured candidates to Anthropic's fastest model, making bulk processing practical.
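The division of labour can be sketched in a few lines of Python. The stub functions and the threshold value here are illustrative, not Compass internals:

```python
# Sketch of the hybrid pipeline: a cheap ML scorer prunes the link set,
# then an LLM reasons over only the survivors. ml_score and llm_choose
# are placeholders for the stages described in this guide.
def hybrid_select(links, goal, ml_score, llm_choose, threshold=0.2):
    # Stage 1: fast, structured ML filtering over every link.
    candidates = [l for l in links if ml_score(l, goal) >= threshold]
    if not candidates:
        return None  # nothing worth sending to the LLM
    # Stage 2: semantic reasoning over the small shortlist only.
    return llm_choose(candidates, goal)
```

The key property is that the expensive step (the LLM call) only ever sees a handful of pre-vetted candidates, never the raw page.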


The navigation pipeline

When Compass audits a page, it follows this pipeline:

1. Page extraction

Compass fetches the page using a dual-layer approach: standard HTTP requests plus a Playwright-based headless browser for JavaScript-rendered content. This captures both static HTML and dynamically loaded elements.

For each page, we extract all navigable elements:

  • Links (<a> tags)
  • Buttons with click handlers
  • Form submissions
  • Interactive elements

For each element, we capture:

  • Text content (anchor text, button label)
  • URL or action target
  • Position on page (header, nav, main, footer, sidebar)
  • Surrounding context (parent headings, nearby text)
  • Visual attributes (size, prominence)
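A greatly simplified stand-in for this extraction step can be built on Python's standard-library HTML parser. The real extractor also runs Playwright for JS-rendered content and captures far more attributes; this sketch only records text, target, and page region:

```python
from html.parser import HTMLParser

# Minimal extraction sketch (assumed structure, not Compass's real
# extractor): walk the HTML and record each <a> tag with its text,
# href, and the landmark region (nav/main/footer/...) enclosing it.
class LinkExtractor(HTMLParser):
    REGIONS = {"header", "nav", "main", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.region_stack = []   # currently open landmark elements
        self.links = []          # extracted link records
        self._current = None     # link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.REGIONS:
            self.region_stack.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href", "")
            region = self.region_stack[-1] if self.region_stack else "body"
            self._current = {"href": href, "region": region, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in self.REGIONS and self.region_stack and self.region_stack[-1] == tag:
            self.region_stack.pop()
        elif tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

parser = LinkExtractor()
parser.feed('<nav><a href="/pricing">Pricing</a></nav><main><a href="/blog">Blog</a></main>')
# parser.links now holds one record per link, with its region attached
```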

2. Feature engineering

Each link becomes a feature vector for the ML model:

Text features:

  • Anchor text (tokenised, embedded)
  • Semantic similarity to task goal
  • Keyword matches
  • Text length and formatting

Positional features:

  • DOM depth
  • Page region (header/nav/main/footer)
  • Relative position among siblings
  • Distance from page top

Contextual features:

  • Parent heading text
  • Surrounding paragraph content
  • Section membership
  • URL path components

Structural features:

  • Is it in main navigation?
  • Is it a dropdown child?
  • Number of sibling links
  • Link density in region
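To make the four feature groups concrete, here is a toy version of the vectorisation step. The feature names and the link-record shape are hypothetical; the real model uses roughly 150 features:

```python
# Illustrative feature sketch: turn one extracted link record into a
# flat feature dict covering text, positional, and contextual signals.
def link_features(link, goal_keywords):
    text = link["text"].lower()
    return {
        # text features
        "keyword_match": int(any(k in text for k in goal_keywords)),
        "text_length": len(text),
        # positional features
        "dom_depth": link["dom_depth"],
        "in_nav": int(link["region"] == "nav"),
        "in_footer": int(link["region"] == "footer"),
        # contextual features (URL path components)
        "url_has_keyword": int(any(k in link["href"].lower() for k in goal_keywords)),
    }

features = link_features(
    {"text": "Pricing", "href": "/pricing", "region": "nav", "dom_depth": 3},
    goal_keywords=["pricing", "plans"],
)
# features["keyword_match"] == 1, features["in_nav"] == 1
```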

3. ML pre-filtering

An XGBoost model scores each link from 0 to 1 based on likelihood of leading toward the task goal.

Why XGBoost:

  • Fast inference (milliseconds, not seconds)
  • Handles mixed feature types well
  • Good with tabular/structured data
  • Interpretable via SHAP values

The model outputs:

  • Relevance score (0-1)
  • Feature importance for this prediction
  • Confidence interval

Links scoring below a threshold are filtered out. Typically this reduces thousands of links to 5-15 candidates.
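The filtering step itself is simple once scores exist. The threshold and cap below are illustrative, not Compass's tuned settings:

```python
# Pre-filter sketch: keep links whose model score clears a threshold,
# capped to a small, best-first candidate shortlist for the LLM.
def shortlist(scored_links, threshold=0.2, max_candidates=15):
    kept = [l for l in scored_links if l["score"] >= threshold]
    kept.sort(key=lambda l: l["score"], reverse=True)
    return kept[:max_candidates]

links = [
    {"href": "/pricing", "score": 0.87},
    {"href": "/blog", "score": 0.05},
    {"href": "/contact", "score": 0.31},
]
shortlist(links)  # → pricing and contact survive; blog is filtered out
```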

4. LLM decision-making

The shortlist goes to Claude for final selection.

Prompt includes:

  • The task goal in natural language
  • Navigation history so far
  • Candidate links with context
  • Current page summary

Claude returns:

  • Selected link
  • Reasoning for the choice
  • Confidence assessment
  • Alternative considerations

This gives us semantic understanding that ML alone can't provide — handling ambiguous labels, inferring meaning from context, reasoning about multi-step paths.
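The prompt Compass actually sends to Claude is not published, but its ingredients are the four items listed above. A hypothetical assembly step might look like:

```python
# Rough sketch of the decision prompt (field names and layout are
# assumptions, not Compass's real prompt).
def build_decision_prompt(goal, history, candidates, page_summary):
    lines = [
        f"Task goal: {goal}",
        f"Pages visited so far: {' -> '.join(history) or '(start)'}",
        f"Current page: {page_summary}",
        "Candidate links:",
    ]
    for i, c in enumerate(candidates, 1):
        lines.append(f"{i}. \"{c['text']}\" -> {c['href']} (ML score {c['score']:.2f})")
    lines.append('Reply as JSON: {"selected": <number>, "reasoning": ..., "confidence": ...}')
    return "\n".join(lines)

prompt = build_decision_prompt(
    goal="find the pricing page",
    history=["/"],
    candidates=[{"text": "Pricing", "href": "/pricing", "score": 0.87}],
    page_summary="Marketing homepage for a SaaS product",
)
```

Passing the ML score through to the LLM lets the structured and semantic signals be weighed together rather than in isolation.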

5. Action and iteration

Compass clicks the selected link, waits for the page to load, and repeats the pipeline.

Termination conditions:

  • Task goal achieved (success)
  • Maximum depth reached (depth failure)
  • No viable links found (navigation failure)
  • Page blocked (blocked failure)
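The loop with its four termination outcomes can be expressed as control flow. The stage callables here are placeholders for pipeline steps 1-4 above:

```python
# Act-and-iterate sketch: steps is a dict of callables standing in for
# the pipeline stages (fetch = extraction, done = goal check,
# shortlist = ML pre-filter, pick = LLM decision).
def navigate(start_url, goal, steps, max_depth=5):
    url, visited = start_url, []
    for _ in range(max_depth):
        page = steps["fetch"](url)
        visited.append(url)
        if page is None:
            return {"outcome": "blocked", "path": visited}
        if steps["done"](page, goal):
            return {"outcome": "success", "path": visited}
        candidates = steps["shortlist"](page, goal)
        if not candidates:
            return {"outcome": "navigation_failure", "path": visited}
        url = steps["pick"](goal, visited, candidates)
    return {"outcome": "depth_failure", "path": visited}
```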

6. Result classification

After navigation completes, ML models classify the outcome:

Success classification: Did the final page contain the requested information? Uses text matching, semantic similarity, and content verification.

Failure classification: If unsuccessful, what type of failure? Blocked, navigation, depth, or content gap — each has different implications for fixes.


Training data

The corpus

Our models are trained on real navigation data:

  • 3,348 navigation traces — complete paths from start to goal
  • 269 websites — across industries and site types
  • 494,197 links evaluated — with ground truth labels

Data collection

Navigation traces were collected by:

  1. Defining tasks (e.g., "find pricing page")
  2. Having humans navigate to complete them
  3. Recording every page, every link seen, every choice made
  4. Labelling outcomes (success, failure, failure type)

Ground truth

For each link in training data, we know:

  • Was it clicked? (positive/negative example)
  • Did clicking it lead toward the goal?
  • How many steps to success from here?

This lets us train on actual navigation behaviour, not synthetic labels.
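Deriving those labels from a recorded trace is mechanical. The trace format below is assumed for illustration: links actually clicked on a successful path are positives, everything else seen on those pages is a negative:

```python
# Label-derivation sketch for one successful trace: each step records
# the page, the link the human clicked, and every link they saw.
def label_trace(trace):
    """Returns (link, label, steps_to_goal) triples; steps_to_goal is
    only defined for positive examples."""
    labels = []
    total = len(trace)
    for depth, step in enumerate(trace):
        remaining = total - depth  # steps from here to success
        for link in step["links_seen"]:
            positive = link == step["clicked"]
            labels.append((link, int(positive), remaining if positive else None))
    return labels

trace = [{"url": "/", "clicked": "/pricing", "links_seen": ["/pricing", "/blog"]}]
label_trace(trace)  # → [('/pricing', 1, 1), ('/blog', 0, None)]
```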

Data splits

  • Training: 70% of traces
  • Validation: 15% of traces
  • Test: 15% of traces (held out, never seen during training)

Splits are stratified by website to prevent data leakage — if a site is in test, none of its traces are in training.
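A site-level split of this kind can be sketched as follows (the trace format and split mechanics are illustrative; the point is that the split is assigned per website, never per trace):

```python
import random

# Site-stratified split sketch: every trace from a given website lands
# in exactly one split, so no site leaks from training into test.
def split_by_site(traces, seed=0, frac=(0.70, 0.15, 0.15)):
    sites = sorted({t["site"] for t in traces})
    random.Random(seed).shuffle(sites)
    n_train = int(len(sites) * frac[0])
    n_val = int(len(sites) * frac[1])
    groups = {
        "train": set(sites[:n_train]),
        "val": set(sites[n_train:n_train + n_val]),
        "test": set(sites[n_train + n_val:]),
    }
    return {
        name: [t for t in traces if t["site"] in members]
        for name, members in groups.items()
    }
```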


Model performance

Link relevance model (XGBoost)

Architecture:

  • XGBoost classifier
  • ~150 features per link
  • 500 trees, max depth 6
  • Trained with binary cross-entropy

Performance (test set):

  • Precision@5: 0.84 (84% of top-5 links are relevant)
  • Recall@10: 0.91 (91% of relevant links in top 10)
  • AUC-ROC: 0.89
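Precision@k and Recall@k follow the standard ranking-metric definitions, which can be stated in a few lines:

```python
# Standard ranked-retrieval metrics: ranked is the model-ordered list
# of link ids, relevant is the ground-truth set of correct links.
def precision_at_k(ranked, relevant, k):
    return sum(1 for x in ranked[:k] if x in relevant) / k

def recall_at_k(ranked, relevant, k):
    return sum(1 for x in ranked[:k] if x in relevant) / len(relevant)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
precision_at_k(ranked, relevant, 2)  # → 0.5 ("a" hit, "b" miss)
recall_at_k(ranked, relevant, 4)     # → 2/3 ("a" and "c" of 3 found)
```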

Failure classification model

Architecture:

  • Multi-class XGBoost
  • Features from final page + navigation history
  • 4 classes: blocked, navigation, depth, content_gap

Performance (test set):

  • Accuracy: 0.82
  • F1 (macro): 0.79

Success verification model

Architecture:

  • Binary XGBoost
  • Text similarity + content features
  • Determines if task goal was achieved

Performance (test set):

  • Precision: 0.91
  • Recall: 0.87
  • F1: 0.89
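The reported F1 is consistent with the precision and recall figures, since F1 is their harmonic mean:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

round(f1(0.91, 0.87), 2)  # → 0.89, matching the figure above
```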

SHAP explainability

Every prediction includes SHAP (SHapley Additive exPlanations) values showing which features influenced the score.

Example: For a link scored 0.87:

  • Anchor text "Pricing" → +0.32
  • Located in main navigation → +0.18
  • URL contains "/pricing" → +0.15
  • Page region: header → +0.12
  • Low DOM depth → +0.10

This transparency helps you understand why Compass made specific choices — and what you might change to improve scores.
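Note that SHAP contributions are additive: the listed values reconstruct the score. (For real XGBoost classifiers the additivity holds in log-odds space before the sigmoid; for clarity this sketch takes a zero baseline and works in score space, and the feature names are paraphrases of the example above.)

```python
# SHAP additivity check for the example link scored 0.87: baseline
# plus per-feature contributions equals the model output.
contributions = {
    "anchor_text_pricing": 0.32,
    "in_main_navigation": 0.18,
    "url_contains_pricing": 0.15,
    "region_header": 0.12,
    "low_dom_depth": 0.10,
}
score = sum(contributions.values())  # zero baseline assumed for clarity
round(score, 2)  # → 0.87
```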


Limitations

What we can't fully test

JavaScript-heavy SPAs: Some single-page applications load content dynamically in ways that aren't captured even by our Playwright-based headless browser. We're working on improved JS rendering support.

Authenticated content: Compass can't log in to test content behind authentication (yet).

Real-time content: Content that changes frequently may show different results between audits.

Model limitations

Novel site structures: Sites very different from our training data may see less accurate predictions. The model learns patterns — truly novel patterns take time to incorporate.

Ambiguous tasks: Some tasks have genuinely ambiguous success criteria. "Find support" could mean contact form, help docs, or chat widget. We're working on improved success detection using a mixture of ML and LLM inference.

Language: Currently optimised for English. Other languages work but with reduced accuracy.


How AI actually accesses content

Real-world AI tools are more constrained than you might expect:

  • Basic fetch tools can read a page but can't follow internal links
  • Browser automation exists but is slow, expensive, and hits context limits
  • In practice, most AI interactions are: search → land on page → done

Compass uses browser automation to test full navigation paths. This means our scores represent an optimistic scenario — what would happen if AI could navigate freely.

The implication: If your site fails in Compass, it will definitely fail with real AI tools. If it passes, you're in good shape for the best-case scenario. But remember: where you land from search matters enormously. The first page is often the only page AI will see.


Next steps

  • How AI agents navigate
  • Understanding your results
  • Task categories reference