Wayfinder AI
© 2026 Wayfinder AI. All rights reserved.

Methodology & training data

A deep dive into how Compass works, for technically curious users

12 min read · Going Deeper

Overview

Compass uses a hybrid ML+LLM architecture to simulate how AI agents navigate websites. This guide explains the technical details for those who want to understand what's happening under the hood.


Why a hybrid approach?

The problem with pure LLM navigation

You could give an LLM a webpage and ask "which link should I click to find pricing?"

The issues:

  • LLMs can be unpredictable in navigation choices
  • Context windows fill up quickly with large pages
  • No structured way to evaluate hundreds of links consistently

The problem with pure ML navigation

You could train a classifier to score links by relevance.

The issues:

  • No semantic understanding — can't reason about ambiguous labels
  • Context-blind — doesn't understand the goal in natural language
  • Brittle — struggles with novel site structures

Compass combines both

ML handles efficient filtering and structured feature extraction. LLM handles intelligent decision-making and semantic reasoning.

This architecture performs better than running AI directly in a chat interface. The ML layer provides structured candidate data and goal alignment that raw browser-based agents lack. It is also significantly faster at scale: sub-second ML inference feeds a small set of structured candidates to Anthropic's fastest model, making bulk processing practical.
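The division of labour can be sketched in a few lines of Python. The stub functions and the threshold value here are illustrative, not Compass internals:

```python
# Sketch of the hybrid pipeline: a cheap ML scorer prunes the link set,
# then an LLM reasons over only the survivors. ml_score and llm_choose
# are placeholders for the stages described in this guide.
def hybrid_select(links, goal, ml_score, llm_choose, threshold=0.2):
    # Stage 1: fast, structured ML filtering over every link.
    candidates = [l for l in links if ml_score(l, goal) >= threshold]
    if not candidates:
        return None  # nothing worth sending to the LLM
    # Stage 2: semantic reasoning over the small shortlist only.
    return llm_choose(candidates, goal)
```

The key property is that the expensive step (the LLM call) only ever sees a handful of pre-vetted candidates, never the raw page.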


The navigation pipeline

When Compass audits a page, it follows this pipeline:

1. Page extraction

Compass fetches the page using a dual-layer approach: standard HTTP requests plus a Playwright-based headless browser for JavaScript-rendered content. This captures both static HTML and dynamically loaded elements.

For each page, we extract all navigable elements:

  • Links (<a> tags)
  • Buttons with click handlers
  • Form submissions
  • Interactive elements

For each element, we capture:

  • Text content (anchor text, button label)
  • URL or action target
  • Position on page (header, nav, main, footer, sidebar)
  • Surrounding context (parent headings, nearby text)
  • Visual attributes (size, prominence)
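A greatly simplified stand-in for this extraction step can be built on Python's standard-library HTML parser. The real extractor also runs Playwright for JS-rendered content and captures far more attributes; this sketch only records text, target, and page region:

```python
from html.parser import HTMLParser

# Minimal extraction sketch (assumed structure, not Compass's real
# extractor): walk the HTML and record each <a> tag with its text,
# href, and the landmark region (nav/main/footer/...) enclosing it.
class LinkExtractor(HTMLParser):
    REGIONS = {"header", "nav", "main", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.region_stack = []   # currently open landmark elements
        self.links = []          # extracted link records
        self._current = None     # link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.REGIONS:
            self.region_stack.append(tag)
        elif tag == "a":
            href = dict(attrs).get("href", "")
            region = self.region_stack[-1] if self.region_stack else "body"
            self._current = {"href": href, "region": region, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in self.REGIONS and self.region_stack and self.region_stack[-1] == tag:
            self.region_stack.pop()
        elif tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

parser = LinkExtractor()
parser.feed('<nav><a href="/pricing">Pricing</a></nav><main><a href="/blog">Blog</a></main>')
# parser.links now holds one record per link, with its region attached
```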

2. Feature engineering

Each link becomes a feature vector for the ML model:

Text features:

  • Anchor text (tokenised, embedded)
  • Semantic similarity to task goal
  • Keyword matches
  • Text length and formatting

Positional features:

  • DOM depth
  • Page region (header/nav/main/footer)
  • Relative position among siblings
  • Distance from page top

Contextual features:

  • Parent heading text
  • Surrounding paragraph content
  • Section membership
  • URL path components

Structural features:

  • Is it in main navigation?
  • Is it a dropdown child?
  • Number of sibling links
  • Link density in region
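To make the four feature groups concrete, here is a toy version of the vectorisation step. The feature names and the link-record shape are hypothetical; the real model uses roughly 150 features:

```python
# Illustrative feature sketch: turn one extracted link record into a
# flat feature dict covering text, positional, and contextual signals.
def link_features(link, goal_keywords):
    text = link["text"].lower()
    return {
        # text features
        "keyword_match": int(any(k in text for k in goal_keywords)),
        "text_length": len(text),
        # positional features
        "dom_depth": link["dom_depth"],
        "in_nav": int(link["region"] == "nav"),
        "in_footer": int(link["region"] == "footer"),
        # contextual features (URL path components)
        "url_has_keyword": int(any(k in link["href"].lower() for k in goal_keywords)),
    }

features = link_features(
    {"text": "Pricing", "href": "/pricing", "region": "nav", "dom_depth": 3},
    goal_keywords=["pricing", "plans"],
)
# features["keyword_match"] == 1, features["in_nav"] == 1
```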

3. ML pre-filtering

An XGBoost model scores each link from 0 to 1 based on likelihood of leading toward the task goal.

Why XGBoost:

  • Fast inference (milliseconds, not seconds)
  • Handles mixed feature types well
  • Good with tabular/structured data
  • Interpretable via SHAP values

The model outputs:

  • Relevance score (0-1)
  • Feature importance for this prediction
  • Confidence interval

Links scoring below a threshold are filtered out. Typically this reduces thousands of links to 5-15 candidates.
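The filtering step itself is simple once scores exist. The threshold and cap below are illustrative, not Compass's tuned settings:

```python
# Pre-filter sketch: keep links whose model score clears a threshold,
# capped to a small, best-first candidate shortlist for the LLM.
def shortlist(scored_links, threshold=0.2, max_candidates=15):
    kept = [l for l in scored_links if l["score"] >= threshold]
    kept.sort(key=lambda l: l["score"], reverse=True)
    return kept[:max_candidates]

links = [
    {"href": "/pricing", "score": 0.87},
    {"href": "/blog", "score": 0.05},
    {"href": "/contact", "score": 0.31},
]
shortlist(links)  # → pricing and contact survive; blog is filtered out
```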

4. LLM decision-making

The shortlist goes to Claude for final selection.

Prompt includes:

  • The task goal in natural language
  • Navigation history so far
  • Candidate links with context
  • Current page summary

Claude returns:

  • Selected link
  • Reasoning for the choice
  • Confidence assessment
  • Alternative considerations

This gives us semantic understanding that ML alone can't provide — handling ambiguous labels, inferring meaning from context, reasoning about multi-step paths.
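The prompt Compass actually sends to Claude is not published, but its ingredients are the four items listed above. A hypothetical assembly step might look like:

```python
# Rough sketch of the decision prompt (field names and layout are
# assumptions, not Compass's real prompt).
def build_decision_prompt(goal, history, candidates, page_summary):
    lines = [
        f"Task goal: {goal}",
        f"Pages visited so far: {' -> '.join(history) or '(start)'}",
        f"Current page: {page_summary}",
        "Candidate links:",
    ]
    for i, c in enumerate(candidates, 1):
        lines.append(f"{i}. \"{c['text']}\" -> {c['href']} (ML score {c['score']:.2f})")
    lines.append('Reply as JSON: {"selected": <number>, "reasoning": ..., "confidence": ...}')
    return "\n".join(lines)

prompt = build_decision_prompt(
    goal="find the pricing page",
    history=["/"],
    candidates=[{"text": "Pricing", "href": "/pricing", "score": 0.87}],
    page_summary="Marketing homepage for a SaaS product",
)
```

Passing the ML score through to the LLM lets the structured and semantic signals be weighed together rather than in isolation.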

5. Action and iteration

Compass clicks the selected link, waits for the page to load, and repeats the pipeline.

Termination conditions:

  • Task goal achieved (success)
  • Maximum depth reached (depth failure)
  • No viable links found (navigation failure)
  • Page blocked (blocked failure)
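The loop with its four termination outcomes can be expressed as control flow. The stage callables here are placeholders for pipeline steps 1-4 above:

```python
# Act-and-iterate sketch: steps is a dict of callables standing in for
# the pipeline stages (fetch = extraction, done = goal check,
# shortlist = ML pre-filter, pick = LLM decision).
def navigate(start_url, goal, steps, max_depth=5):
    url, visited = start_url, []
    for _ in range(max_depth):
        page = steps["fetch"](url)
        visited.append(url)
        if page is None:
            return {"outcome": "blocked", "path": visited}
        if steps["done"](page, goal):
            return {"outcome": "success", "path": visited}
        candidates = steps["shortlist"](page, goal)
        if not candidates:
            return {"outcome": "navigation_failure", "path": visited}
        url = steps["pick"](goal, visited, candidates)
    return {"outcome": "depth_failure", "path": visited}
```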

6. Result classification

After navigation completes, ML models classify the outcome:

Success classification: Did the final page contain the requested information? Uses text matching, semantic similarity, and content verification.

Failure classification: If unsuccessful, what type of failure? Blocked, navigation, depth, or content gap — each has different implications for fixes.


Training data

The corpus

Our models are trained on real navigation data:

  • 3,348 navigation traces — complete paths from start to goal
  • 269 websites — across industries and site types
  • 494,197 links evaluated — with ground truth labels

Data collection

Navigation traces were collected by:

  1. Defining tasks (e.g., "find pricing page")
  2. Having humans navigate to complete them
  3. Recording every page, every link seen, every choice made
  4. Labelling outcomes (success, failure, failure type)

Ground truth

For each link in training data, we know:

  • Was it clicked? (positive/negative example)
  • Did clicking it lead toward the goal?
  • How many steps to success from here?

This lets us train on actual navigation behaviour, not synthetic labels.
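Deriving those labels from a recorded trace is mechanical. The trace format below is assumed for illustration: links actually clicked on a successful path are positives, everything else seen on those pages is a negative:

```python
# Label-derivation sketch for one successful trace: each step records
# the page, the link the human clicked, and every link they saw.
def label_trace(trace):
    """Returns (link, label, steps_to_goal) triples; steps_to_goal is
    only defined for positive examples."""
    labels = []
    total = len(trace)
    for depth, step in enumerate(trace):
        remaining = total - depth  # steps from here to success
        for link in step["links_seen"]:
            positive = link == step["clicked"]
            labels.append((link, int(positive), remaining if positive else None))
    return labels

trace = [{"url": "/", "clicked": "/pricing", "links_seen": ["/pricing", "/blog"]}]
label_trace(trace)  # → [('/pricing', 1, 1), ('/blog', 0, None)]
```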

Data splits

  • Training: 70% of traces
  • Validation: 15% of traces
  • Test: 15% of traces (held out, never seen during training)

Splits are stratified by website to prevent data leakage — if a site is in test, none of its traces are in training.
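A site-level split of this kind can be sketched as follows (the trace format and split mechanics are illustrative; the point is that the split is assigned per website, never per trace):

```python
import random

# Site-stratified split sketch: every trace from a given website lands
# in exactly one split, so no site leaks from training into test.
def split_by_site(traces, seed=0, frac=(0.70, 0.15, 0.15)):
    sites = sorted({t["site"] for t in traces})
    random.Random(seed).shuffle(sites)
    n_train = int(len(sites) * frac[0])
    n_val = int(len(sites) * frac[1])
    groups = {
        "train": set(sites[:n_train]),
        "val": set(sites[n_train:n_train + n_val]),
        "test": set(sites[n_train + n_val:]),
    }
    return {
        name: [t for t in traces if t["site"] in members]
        for name, members in groups.items()
    }
```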


Model performance

Link relevance model (XGBoost)

Architecture:

  • XGBoost classifier
  • ~150 features per link
  • 500 trees, max depth 6
  • Trained with binary cross-entropy

Performance (test set):

  • Precision@5: 0.84 (84% of top-5 links are relevant)
  • Recall@10: 0.91 (91% of relevant links in top 10)
  • AUC-ROC: 0.89
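Precision@k and Recall@k follow the standard ranking-metric definitions, which can be stated in a few lines:

```python
# Standard ranked-retrieval metrics: ranked is the model-ordered list
# of link ids, relevant is the ground-truth set of correct links.
def precision_at_k(ranked, relevant, k):
    return sum(1 for x in ranked[:k] if x in relevant) / k

def recall_at_k(ranked, relevant, k):
    return sum(1 for x in ranked[:k] if x in relevant) / len(relevant)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
precision_at_k(ranked, relevant, 2)  # → 0.5 ("a" hit, "b" miss)
recall_at_k(ranked, relevant, 4)     # → 2/3 ("a" and "c" of 3 found)
```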

Failure classification model

Architecture:

  • Multi-class XGBoost
  • Features from final page + navigation history
  • 4 classes: blocked, navigation, depth, content_gap

Performance (test set):

  • Accuracy: 0.82
  • F1 (macro): 0.79

Success verification model

Architecture:

  • Binary XGBoost
  • Text similarity + content features
  • Determines if task goal was achieved

Performance (test set):

  • Precision: 0.91
  • Recall: 0.87
  • F1: 0.89
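The reported F1 is consistent with the precision and recall figures, since F1 is their harmonic mean:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

round(f1(0.91, 0.87), 2)  # → 0.89, matching the figure above
```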

SHAP explainability

Every prediction includes SHAP (SHapley Additive exPlanations) values showing which features influenced the score.

Example: For a link scored 0.87:

  • Anchor text "Pricing" → +0.32
  • Located in main navigation → +0.18
  • URL contains "/pricing" → +0.15
  • Page region: header → +0.12
  • Low DOM depth → +0.10

This transparency helps you understand why Compass made specific choices — and what you might change to improve scores.
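Note that SHAP contributions are additive: the listed values reconstruct the score. (For real XGBoost classifiers the additivity holds in log-odds space before the sigmoid; for clarity this sketch takes a zero baseline and works in score space, and the feature names are paraphrases of the example above.)

```python
# SHAP additivity check for the example link scored 0.87: baseline
# plus per-feature contributions equals the model output.
contributions = {
    "anchor_text_pricing": 0.32,
    "in_main_navigation": 0.18,
    "url_contains_pricing": 0.15,
    "region_header": 0.12,
    "low_dom_depth": 0.10,
}
score = sum(contributions.values())  # zero baseline assumed for clarity
round(score, 2)  # → 0.87
```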


Limitations

What we can't fully test

JavaScript-heavy SPAs: Some single-page applications load content dynamically in ways that aren't captured even by our Playwright-based headless browser. We're working on improved JS rendering support.

Authenticated content: Compass can't log in to test content behind authentication (yet).

Real-time content: Content that changes frequently may show different results between audits.

Model limitations

Novel site structures: Sites very different from our training data may see less accurate predictions. The model learns patterns — truly novel patterns take time to incorporate.

Ambiguous tasks: Some tasks have genuinely ambiguous success criteria. "Find support" could mean contact form, help docs, or chat widget. We're working on improved success detection using a mixture of ML and LLM inference.

Language: Currently optimised for English. Other languages work but with reduced accuracy.


How AI actually accesses content

Real-world AI tools are more constrained than you might expect:

  • Basic fetch tools can read a page but can't follow internal links
  • Browser automation exists but is slow, expensive, and hits context limits
  • In practice, most AI interactions are: search → land on page → done

Compass uses browser automation to test full navigation paths. This means our scores represent an optimistic scenario — what would happen if AI could navigate freely.

The implication: If your site fails in Compass, it will definitely fail with real AI tools. If it passes, you're in good shape for the best-case scenario. But remember: where you land from search matters enormously. The first page is often the only page AI will see.


Next steps

  • How AI agents navigate
  • Understanding your results
  • Task categories reference