🔍 exampleblog.com · 1,204 URLs analysed · Mar 7, 2026 · Live
⚠️ Conflicts Detected — 37 (↑ 12% vs last crawl)
🔍 Borderline Cases — 19 (↓ 3% vs last crawl)
URLs Processed — 1,204 (via Kafka stream · 40s)
📈 Est. Recovery — +22k clicks/mo post-consolidation
Top Conflicts — 37 active

URL Pair · Similarity · Type · Action
/blog/how-to-write-meta-descriptions ↔ /guides/meta-description-best-practices · 0.94 · Semantic Dupe · 301 /blog → /guides
/blog/internal-linking-strategy ↔ /seo-tips/internal-links-for-seo · 0.91 · Intent Conflict · Merge, redirect weaker
/blog/what-is-keyword-cannibalization ↔ /learn/keyword-cannibalisation-explained · 0.89 · Semantic Dupe · Ironic. Fix immediately.
/blog/page-speed-seo-impact ↔ /technical-seo/core-web-vitals-guide · 0.82 · Partial Overlap · Differentiate scope
/blog/seo-audit-checklist ↔ /resources/technical-seo-audit-template · 0.79 · Partial Overlap · Review intent carefully
Pipeline — 8 stages
1. Sitemap Ingestion via Kafka
sitemap.xml published to a Kafka topic. Exactly-once delivery. Fault-tolerant consumer group.
Apache Kafka · exactly-once
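Stage 1, stripped of the ceremony, is pulling `<loc>` entries out of sitemap.xml. A minimal sketch of that ingestion step — the sitemap content and topic name here are illustrative, and the Kafka hand-off is left as a comment since it needs a running broker:

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://exampleblog.com/blog/internal-linking-strategy</loc></url>
  <url><loc>https://exampleblog.com/seo-tips/internal-links-for-seo</loc></url>
</urlset>"""

def parse_sitemap(xml_text):
    """Return every <loc> URL in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

urls = parse_sitemap(SITEMAP)
# each URL would then be produced to a Kafka topic, e.g. with kafka-python:
#   producer.send("sitemap-urls", url.encode())
```

A `requests.get(sitemap_url).text` in front of this is the whole "fault-tolerant consumer group" for most sites.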
2. Custom BPE Tokenisation
SEO-domain BPE model trained on 4.2M documents. Handles hyphenated slugs.
BPE · 4.2M corpus
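For the curious: the merge loop at the heart of BPE fits in a page. This is a from-scratch toy, not the stage's actual 4.2M-document model — in practice you would reach for a tokenizer library — but it shows what "handles hyphenated slugs" amounts to once you split on the hyphens yourself:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges: start from characters, repeatedly fuse the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word fully merged
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# hyphenated slugs stay intact if you split on the hyphens first
slug_tokens = "how-to-write-meta-descriptions meta-description-best-practices".replace("-", " ").split()
merges = train_bpe(slug_tokens, 8)
```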
3. Cross-Encoder Embedding
Fine-tuned multilingual cross-encoder. Joint attention captures interaction effects bi-encoders miss.
cross-encoder · multilingual
4. Poincaré Hyperbolic Projection
Embeddings projected to hyperbolic space. Preserves hierarchical structure Euclidean geometry distorts.
Poincaré ball · Riemannian SGD
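The distance function stage 4 is built on does have a clean closed form, whatever one thinks of using it for blog posts. A sketch of the Poincaré-ball metric (the example points are made up):

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit ball (||u||, ||v|| < 1)."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * diff / denom)

origin = np.zeros(2)
near = np.array([0.5, 0.0])  # roomy interior: broad hub pages
far = np.array([0.9, 0.0])   # crowded boundary: leaf pages
# distance grows without bound near the boundary, which is what
# lets the ball embed trees with low distortion
```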
5. Leiden Community Detection
Graph partitioned with Leiden, which guarantees well-connected communities where Louvain can produce badly-connected ones.
Leiden · igraph
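Leiden proper lives in the `leidenalg`/`igraph` stack; since the stage itself frames it as a refinement of Louvain, here is the same idea sketched with networkx's built-in Louvain on a toy page graph (graph and node labels invented for illustration):

```python
import networkx as nx

# two dense clusters of internally-linked pages joined by one stray link
G = nx.Graph()
G.add_edges_from((a, b) for a in range(4) for b in range(a + 1, 4))      # clique: pages 0-3
G.add_edges_from((a, b) for a in range(4, 8) for b in range(a + 1, 8))   # clique: pages 4-7
G.add_edge(3, 4)  # the bridge a naive method might over-weight

communities = nx.community.louvain_communities(G, seed=42)
# pages in the same community are candidate cannibalization clusters
```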
6. Ensemble Conflict Scoring
Three signals fused via XGBoost meta-learner trained on 14k labelled pairs.
XGBoost · 14k labels
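"Fusing three signals with a meta-learner" is a short script. Sketch below with scikit-learn's gradient boosting standing in for XGBoost, and synthetic signals standing in for the 14k labelled pairs — every number here is invented:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 400
# three per-pair signals, as in stages 3-5
X = np.column_stack([
    rng.uniform(0.0, 1.0, n),   # cross-encoder similarity
    rng.uniform(0.0, 5.0, n),   # hyperbolic distance
    rng.integers(0, 2, n),      # same community flag
])
# toy ground truth: a conflict iff similar, close, and co-clustered
y = ((X[:, 0] > 0.6) & (X[:, 1] < 2.5) & (X[:, 2] == 1)).astype(int)

meta = GradientBoostingClassifier(random_state=0).fit(X, y)
conflict_prob = meta.predict_proba(X)[:, 1]  # ranked score per URL pair
```

With a labelling rule this crisp, a single threshold on the first column would do about as well — which is rather the point of this page.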
7. Counterfactual Impact Estimation
Causal model estimates post-consolidation traffic delta. STL decomposition for seasonal adjustment.
causal inference · STL
8. Recommendation Synthesis
Results ranked. Redirect map generated. Report exported.
pandas · csv
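Stage 8 really is just pandas and a CSV, as the tags admit. A sketch, using two pairs from the conflicts table above:

```python
import pandas as pd

redirects = pd.DataFrame(
    [
        ("/blog/how-to-write-meta-descriptions", "/guides/meta-description-best-practices", 0.94),
        ("/blog/internal-linking-strategy", "/seo-tips/internal-links-for-seo", 0.91),
    ],
    columns=["from_url", "to_url", "similarity"],
)
# rank by similarity and export the redirect map
csv_text = redirects.sort_values("similarity", ascending=False).to_csv(index=False)
```

Pass a path to `to_csv` instead of capturing the string and the "report" is done.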
💡 What this actually does
# stages 1–7, condensed
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(your_urls)  # your_urls: list of slugs/titles to compare
scores = cosine_similarity(embeddings)
# flag anything above 0.85. done. go for a walk.
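And "flag anything above 0.85", spelled out — here with a hand-made `scores` matrix and invented URLs so it runs without downloading a model:

```python
import numpy as np

# stand-in for the cosine_similarity matrix above
scores = np.array([
    [1.00, 0.94, 0.30],
    [0.94, 1.00, 0.41],
    [0.30, 0.41, 1.00],
])
urls = ["/blog/a", "/guides/a", "/blog/unrelated"]

# upper triangle only: each pair once, diagonal skipped
i_idx, j_idx = np.triu_indices(len(urls), k=1)
flagged = [(urls[i], urls[j], scores[i, j])
           for i, j in zip(i_idx, j_idx) if scores[i, j] > 0.85]
```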