Deep dive into how Compass works for technically curious users
Compass uses a hybrid ML+LLM architecture to simulate how AI agents navigate websites. This guide explains the technical details for those who want to understand what's happening under the hood.
You could give an LLM a webpage and ask "which link should I click to find pricing?"
The issues:
You could train a classifier to score links by relevance.
The issues:
ML handles efficient filtering and structured feature extraction. LLM handles intelligent decision-making and semantic reasoning.
This architecture performs better than running an LLM directly in a chat interface. The ML layer provides structured data and task alignment that raw browser-based agents don't have. It's also significantly faster at scale: sub-second ML inference feeds a short list of structured candidates to Anthropic's fastest model, which makes bulk processing practical.
When Compass audits a page, it follows this pipeline:
Compass fetches the page using a dual-layer approach: standard HTTP requests plus a Playwright-based headless browser for JavaScript-rendered content. This captures both static HTML and dynamically loaded elements.
For each page, we extract all navigable elements:
<a> tags
For each element, we capture:
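As a rough illustration of the extraction step, the sketch below pulls `<a>` elements out of an HTML string using only the Python standard library. It is not Compass's actual extractor: the `Link` fields (`href`, `text`, `position`) and the function name `extract_links` are assumptions chosen for the example, and the real set of captured attributes may differ.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class Link:
    href: str      # absolute URL, resolved against the page URL
    text: str      # visible anchor text, whitespace-normalised
    position: int  # 0-based order of appearance among extracted links


class LinkExtractor(HTMLParser):
    """Collects <a href="..."> elements in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._chunks).split())
            self.links.append(Link(self._href, text, len(self.links)))
            self._href = None


def extract_links(html, base_url):
    """Return every <a> element that has an href, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Anchors without an `href` are skipped here because they cannot be followed; a fuller extractor would also need to handle buttons and other clickable elements.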
Each link becomes a feature vector for the ML model:
Text features:
Positional features:
Contextual features:
Structural features:
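To make the idea of a per-link feature vector concrete, here is a minimal sketch with one or two invented features per category. Every feature name below (`text_length`, `task_token_overlap`, `relative_position`, `in_nav`, `url_depth`, `same_domain`) is a hypothetical stand-in, not the feature set Compass actually uses.

```python
import re
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass
class LinkFeatures:
    # All fields are illustrative assumptions, one or two per category.
    text_length: int           # text: characters in the anchor text
    task_token_overlap: float  # text: share of task words found in the anchor text
    relative_position: float   # positional: 0.0 = first link, 1.0 = last link
    in_nav: bool               # contextual: link sits inside a <nav> region
    url_depth: int             # structural: number of path segments in the target
    same_domain: bool          # structural: target host equals the page host


def _tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def build_features(text, href, index, total_links, in_nav, page_url, task):
    task_tokens = _tokens(task)
    overlap = len(task_tokens & _tokens(text)) / len(task_tokens) if task_tokens else 0.0
    target = urlparse(href)
    segments = [p for p in target.path.split("/") if p]
    return LinkFeatures(
        text_length=len(text),
        task_token_overlap=overlap,
        relative_position=index / max(total_links - 1, 1),
        in_nav=in_nav,
        url_depth=len(segments),
        same_domain=target.netloc == urlparse(page_url).netloc,
    )


def to_vector(features):
    """Flatten to a list of floats in field order, ready for a tabular model."""
    return [float(v) for v in asdict(features).values()]
```

A flat numeric vector like this is the kind of input a gradient-boosted tree model expects, which is why booleans are cast to 0.0 or 1.0.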
An XGBoost model scores each link from 0 to 1 based on likelihood of leading toward the task goal.
Why XGBoost:
The model outputs:
Links scoring below a threshold are filtered out. Typically this reduces thousands of links to 5-15 candidates.
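The filtering step can be sketched in a few lines. The threshold value of 0.5 and the cap of 15 candidates are assumptions for illustration; the text above does not state the actual threshold Compass uses.

```python
def shortlist(scored_links, threshold=0.5, max_candidates=15):
    """Keep links scoring at or above the threshold, best first.

    scored_links: iterable of (link, score) pairs with score in [0, 1].
    threshold and max_candidates are illustrative defaults, not Compass's values.
    """
    kept = [(link, score) for link, score in scored_links if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_candidates]
```

For example, `shortlist([("a", 0.9), ("b", 0.2), ("c", 0.6)])` keeps `a` and `c` and drops `b`.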
The shortlist goes to Claude for final selection.
Prompt includes:
Claude returns:
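The hand-off to the LLM can be pictured as building a prompt from the shortlist and then validating the reply. The sketch below deliberately makes no API call. The prompt wording, the JSON reply shape (`choice`, `reason`), and the function names are all hypothetical; they are not Compass's actual prompt or response format.

```python
import json


def build_prompt(task, page_url, candidates):
    """candidates: list of dicts with 'text', 'href' and 'score' keys (assumed shape)."""
    lines = [
        f"Task: {task}",
        f"Current page: {page_url}",
        "Candidate links (index, ML score, text, URL):",
    ]
    for i, c in enumerate(candidates):
        lines.append(f"{i}. [{c['score']:.2f}] {c['text']} -> {c['href']}")
    lines.append('Reply with JSON only: {"choice": <index>, "reason": "<short explanation>"}')
    return "\n".join(lines)


def parse_selection(reply, n_candidates):
    """Validate the model's reply and return (chosen index, reason)."""
    data = json.loads(reply)
    choice = data["choice"]
    valid = isinstance(choice, int) and not isinstance(choice, bool)
    if not valid or not 0 <= choice < n_candidates:
        raise ValueError(f"choice out of range: {choice!r}")
    return choice, str(data.get("reason", ""))
```

Validating the index before acting on it matters because the model's reply is free text; an out-of-range or malformed choice should fail loudly rather than trigger a click on the wrong link.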
This gives us semantic understanding that ML alone can't provide — handling ambiguous labels, inferring meaning from context, reasoning about multi-step paths.
Compass clicks the selected link, waits for the page to load, and repeats the pipeline.
Termination conditions:
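The click-and-repeat loop can be sketched as follows. The stopping rules shown (goal reached, no candidate link, revisiting a page, step budget exhausted) and their labels are assumptions for illustration, as are the `choose_next` and `is_goal` callbacks, which stand in for the scoring-plus-selection pipeline and the success check.

```python
def navigate(start_url, choose_next, is_goal, max_steps=5):
    """Follow one link per step until a stopping rule fires.

    choose_next(url) -> next URL or None   (stand-in for the ML + LLM pipeline)
    is_goal(url)     -> bool               (stand-in for success detection)
    Returns (path, outcome), where outcome is a hypothetical label.
    """
    path = [start_url]
    visited = {start_url}
    url = start_url
    for _ in range(max_steps):
        if is_goal(url):
            return path, "success"
        next_url = choose_next(url)
        if next_url is None:
            return path, "no_candidates"
        if next_url in visited:
            return path, "loop_detected"
        visited.add(next_url)
        path.append(next_url)
        url = next_url
    # Step budget used up; the last page may still satisfy the task.
    return path, ("success" if is_goal(url) else "max_depth")
```

Returning the full path alongside the outcome keeps the trace available for the later classification step.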
After navigation completes, ML models classify the outcome:
Success classification: Did the final page contain the requested information? Uses text matching, semantic similarity, and content verification.
Failure classification: If unsuccessful, what type of failure? Blocked, navigation, depth, or content gap — each has different implications for fixes.
Our models are trained on real navigation data:
Navigation traces were collected by:
For each link in training data, we know:
This lets us train on actual navigation behaviour, not synthetic labels.
Splits are grouped by website to prevent data leakage: if a site is in the test set, none of its traces are in the training set.
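A site-level split of this kind can be sketched as below: whole websites are assigned to either train or test, so no site contributes traces to both. The trace shape (a dict with a `start_url` key), the 20% test fraction, and the function name are assumptions made for the example.

```python
import random
from urllib.parse import urlparse


def split_by_site(traces, test_fraction=0.2, seed=0):
    """Assign whole websites to train or test so no site appears in both."""
    def site_of(trace):
        return urlparse(trace["start_url"]).netloc

    sites = sorted({site_of(t) for t in traces})
    random.Random(seed).shuffle(sites)          # deterministic for a given seed
    n_test = max(1, round(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])

    train = [t for t in traces if site_of(t) not in test_sites]
    test = [t for t in traces if site_of(t) in test_sites]
    return train, test
```

Splitting individual traces at random would instead let the model see pages from a test site during training, which inflates test scores.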
Architecture:
Performance (test set):
Architecture:
Performance (test set):
Architecture:
Performance (test set):
Every prediction includes SHAP (SHapley Additive exPlanations) values showing which features influenced the score.
Example: For a link scored 0.87:
This transparency helps you understand why Compass made specific choices — and what you might change to improve scores.
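SHAP values are additive: a baseline value plus the per-feature contributions reproduces the model's raw output. For tree classifiers with a logistic objective, the contributions are typically expressed in log-odds, so a sigmoid converts the sum to a 0 to 1 score. The sketch below shows only that arithmetic; the feature names and any numbers passed in are made up for illustration and are not Compass outputs.

```python
import math


def explain(base_value, contributions):
    """Combine SHAP-style contributions (assumed to be in log-odds) into a score.

    base_value:    the model's average raw output (log-odds)
    contributions: dict of feature name -> signed contribution (log-odds)
    Returns (score in [0, 1], features ranked by absolute influence).
    """
    margin = base_value + sum(contributions.values())
    score = 1.0 / (1.0 + math.exp(-margin))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


# Hypothetical usage with invented numbers:
score, ranked = explain(-1.0, {"task_token_overlap": 2.0, "url_depth": -0.5, "in_nav": 0.1})
```

Ranking by absolute value surfaces the features that moved the score most, whether they pushed it up or down.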
JavaScript-heavy SPAs: Some single-page applications load content dynamically in ways that aren't captured even by our Playwright-based headless browser. We're working on improved JS rendering support.
Authenticated content: Compass can't log in to test content behind authentication (yet).
Real-time content: Content that changes frequently may show different results between audits.
Novel site structures: Sites very different from our training data may see less accurate predictions. The model learns patterns — truly novel patterns take time to incorporate.
Ambiguous tasks: Some tasks have genuinely ambiguous success criteria. "Find support" could mean contact form, help docs, or chat widget. We're working on improved success detection using a mixture of ML and LLM inference.
Language: Currently optimised for English. Other languages work but with reduced accuracy.
Real-world AI tools are more constrained than you might expect:
Compass uses browser automation to test full navigation paths. This means our scores represent an optimistic scenario — what would happen if AI could navigate freely.
The implication: if your site fails in Compass, it is very likely to fail with real AI tools, because Compass tests a more permissive scenario than those tools operate in. If it passes, you're in good shape for the best-case scenario. But remember: where you land from search matters enormously. The first page is often the only page AI will see.