GEO Methodology — How This Audit Works
A transparent breakdown of the 14 signals GEO Auditor measures and why each one matters for AI search visibility.
What We Actually Measure
GEO Auditor analyses your public page HTML against 14 signals grouped into five pillars: Authority, Technical, Content, Data, and Trust. Every signal is derived from what an AI crawler can read in your page source — no black-box scoring, no proprietary guesswork.
The tool simulates how a Retrieval-Augmented Generation (RAG) pipeline chunks your content: it splits the DOM at semantic boundaries (<section>, <article>, heading tags), then evaluates each chunk for information density, entity presence, and citation quality.
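The chunking step described above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the tool's actual implementation: the class name SemanticChunker, the helper chunk_html, and the exact set of boundary tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumed boundary tags, taken from the description above.
BOUNDARY_TAGS = {"section", "article", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # non-visible content is ignored


class SemanticChunker(HTMLParser):
    """Collects visible text and starts a new chunk at each boundary tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # finished chunks, whitespace-normalised
        self._buffer = []     # text pieces of the chunk being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty chunks only.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.chunks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BOUNDARY_TAGS:
            self._flush()  # a boundary opens: close the previous chunk

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in ("section", "article"):
            self._flush()  # a container closes: close its last chunk

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text is left at end of input


def chunk_html(html):
    """Return the list of text chunks for an HTML string."""
    parser = SemanticChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

In this sketch a heading starts a new chunk and stays attached to the text that follows it, so each chunk carries its own label. Text outside any semantic container simply accumulates until the next boundary.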
The academic foundation for this field is the 2023 paper “GEO: Generative Engine Optimization” (Aggarwal et al.), by researchers at Princeton University and collaborating institutions, which reported that adding citations, statistics, and quotations measurably increased a source's visibility in generative search outputs.
The 14 Signals
- Semantic HTML structure
- Checks for proper use of <article>, <section>, and the <h1>–<h6> heading hierarchy. AI retrieval pipelines typically chunk pages at these boundaries, so a flat DOM of divs is harder to chunk accurately.
- JSON-LD presence and validity
- Detects <script type="application/ld+json"> blocks and validates the @type, required fields, and @id cross-links against the Schema.org specification.
- FAQPage schema
- Question-answer pairs in structured data are among the most directly extractable formats for AI answer surfaces. Both their presence and the number of questions are measured.
- External citation density
- Counts outbound links to recognised high-authority domains. Perplexity, in particular, appears to favour sources that themselves link to verified references.
- Author entity binding
- Checks whether an Article or BlogPosting schema has an author property with an @id that resolves to a declared Person or Organization entity in the same page graph.
- Organisation schema
- Verifies an Organization type with name, url, and at least one sameAs reference exists in the page, either directly or via the global layout.
- Meta description quality
- Evaluates length (target: 120–155 characters), absence of keyword stuffing, and the presence of a clear value proposition.
- Title tag structure
- Checks length (under 70 characters), brand name presence, and primary keyword placement near the start of the title.
- Canonical URL declaration
- Confirms a <link rel="canonical"> tag is present and self-referential — not pointing to a different URL that could cause index consolidation issues.
- AI crawler access
- Checks whether GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are explicitly allowed in the robots.txt file.
- llms.txt presence
- Checks for a machine-readable /llms.txt file at the domain root and validates its basic structure (citation preferences, permitted use, last-updated date).
- Heading hierarchy quality
- Detects skipped heading levels (e.g., H2 → H4), multiple H1 tags on one page, and a missing H1 — all of which degrade AI content chunking.
- Image alt text coverage
- The percentage of <img> elements with a non-empty alt attribute. AI systems that process page images rely on alt text for context.
- Sitemap declaration
- Confirms a Sitemap: directive in robots.txt and that the declared URL returns a valid XML sitemap with at least the current page included.
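As an illustration of how one signal from the list above might be checked, the heading hierarchy rules (skipped levels, multiple H1 tags, missing H1) can be sketched as follows. The names HeadingCollector and audit_headings and the shape of the returned report are assumptions for this example, not the tool's real interface.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Records the level (1-6) of every heading in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.levels.append(int(tag[1]))


def audit_headings(html):
    """Return a small report on the heading structure of an HTML string."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    levels = collector.levels
    h1_count = levels.count(1)
    # A skip is any step down the hierarchy of more than one level,
    # e.g. an H2 followed directly by an H4.
    skipped = [
        (prev, cur)
        for prev, cur in zip(levels, levels[1:])
        if cur > prev + 1
    ]
    return {
        "missing_h1": h1_count == 0,
        "multiple_h1": h1_count > 1,
        "skipped_levels": skipped,
    }
```

Moving back up the hierarchy (for example H4 to H2) is not treated as a problem here, since closing a subsection is normal document structure.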
How the Score Is Calculated
Each of the 14 signals is scored on a binary (present / absent) or graded (0–10) scale depending on the signal type. The five pillar scores are weighted averages of their constituent signals, then combined into the overall GEO score (0–100).
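The aggregation described above can be sketched as follows. The pillar weights, the per-signal weights, and the assignment of signals to pillars are all hypothetical values chosen for illustration, since the actual figures are not stated here.

```python
# Hypothetical pillar weights (not the tool's real values).
PILLAR_WEIGHTS = {
    "Authority": 0.25,
    "Technical": 0.25,
    "Content": 0.20,
    "Data": 0.20,
    "Trust": 0.10,
}


def normalise(value, graded):
    """Map a raw signal to the range 0..1.

    Graded signals arrive on a 0-10 scale; binary signals as True/False.
    """
    return value / 10 if graded else (1.0 if value else 0.0)


def pillar_score(signals):
    """Weighted average of (normalised_score, weight) pairs, in 0..1."""
    total_weight = sum(weight for _, weight in signals)
    return sum(score * weight for score, weight in signals) / total_weight


def geo_score(pillars):
    """Combine pillar scores into an overall 0-100 score.

    `pillars` maps a pillar name to its list of (normalised_score, weight)
    pairs. Dividing by the weights actually used keeps the result on a
    0-100 scale even if a pillar is left out.
    """
    used_weight = sum(PILLAR_WEIGHTS[name] for name in pillars)
    weighted = sum(
        PILLAR_WEIGHTS[name] * pillar_score(signals)
        for name, signals in pillars.items()
    )
    return 100 * weighted / used_weight
```

For example, a pillar with one binary signal that is present and one graded signal scoring 5 out of 10, both weighted equally, has a pillar score of 0.75 in this sketch.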
The score reflects only what is publicly readable in the page HTML at the time of the audit. It does not reflect unpublished content, server-side redirects, pages behind authentication, or signals that require JavaScript execution to render.
Each score is specific to the URL audited: a high score on your homepage does not mean your blog posts or product pages score equally. Run the audit on each important page separately.
Further Reading
- Aggarwal et al. (2023) — “GEO: Generative Engine Optimization” · The foundational academic paper defining the GEO field, published by researchers at Princeton University.
- Google Search Central — Introduction to Structured Data · Google's official documentation on how structured data affects indexing and rich results.
- W3C — JSON-LD 1.1 Specification · The complete technical specification for the JSON-LD format used in Schema.org markup.
- Schema.org — Full Type Reference · The complete catalogue of structured data types, properties, and their expected values.
- MDN Web Docs — HTML Elements Reference · Authoritative documentation for semantic HTML elements used in GEO content structuring.
- Schema Markup Validator · Test your JSON-LD implementation for syntax errors and missing required properties.
Frequently Asked Questions
What does the technical SEO checker analyse?
The technical SEO checker audits 8 GEO scoring dimensions: relevance, authority, source attribution, topic depth, structure, clarity, citations, and direct answers. It also detects existing schema markup types, heading hierarchy, semantic HTML elements, external link quality, and estimated Flesch-Kincaid readability grade.
How does the SEO checker evaluate schema markup?
The checker parses all application/ld+json script tags on the target page, extracts @type values using a recursive entity traversal, and checks for the presence of high-impact types including FAQPage, Article, Organization, Person, and SoftwareApplication. Entity binding (co-occurrence of multiple schema types) is rewarded as a Knowledge Graph signal.
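A minimal sketch of that extraction, using only Python's standard library, might look like this. The names JsonLdCollector, collect_types, and schema_types are hypothetical, and silently skipping invalid JSON is a simplification made for the example.

```python
import json
from html.parser import HTMLParser

# High-impact types named in the answer above.
HIGH_IMPACT_TYPES = {"FAQPage", "Article", "Organization", "Person", "SoftwareApplication"}


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def collect_types(node, found):
    """Recursively walk dicts and lists, recording every @type value."""
    if isinstance(node, dict):
        types = node.get("@type")
        if isinstance(types, str):
            found.add(types)
        elif isinstance(types, list):
            found.update(t for t in types if isinstance(t, str))
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)


def schema_types(html):
    """Return the set of @type values declared anywhere in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    found = set()
    for raw in collector.blocks:
        try:
            collect_types(json.loads(raw), found)
        except json.JSONDecodeError:
            continue  # invalid JSON-LD block: skipped in this sketch
    return found
```

Intersecting the result with HIGH_IMPACT_TYPES gives the high-impact types present on the page. The size of that intersection is one simple proxy for the co-occurrence that entity binding rewards.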
What is an Information Gain Ratio in the context of SEO?
Information Gain Ratio measures how much unique, structured information your page adds per word. It is calculated as the number of named entity proxies (citations, external links, heading nodes, list items) divided by the total word count. A higher ratio signals to LLMs that your page is dense with extractable knowledge rather than filler prose.
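As a worked example of that ratio, the sketch below treats every proxy as one unit and sums them without weighting. The equal weighting and the function name information_gain_ratio are assumptions, since the answer above does not say how the proxies are combined.

```python
def information_gain_ratio(citations, external_links, headings, list_items, word_count):
    """Entity proxies per word; returns 0.0 for an empty page.

    Assumption: all four proxy counts contribute equally.
    """
    if word_count <= 0:
        return 0.0
    proxies = citations + external_links + headings + list_items
    return proxies / word_count


# A page of 800 words with 3 citations, 5 external links,
# 6 headings and 10 list items has 24 proxies in total.
ratio = information_gain_ratio(3, 5, 6, 10, 800)
```

Here the ratio comes to 24 / 800 = 0.03, or three extractable items per hundred words.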
Does the SEO checker work on JavaScript-rendered pages?
The checker fetches raw HTML server responses. Pages that require JavaScript execution to render their content (client-side-only SPAs without SSR) will receive incomplete results. This mirrors a limitation of most AI crawlers, including OAI-SearchBot and PerplexityBot, which also do not execute JavaScript.
How does the checker score authority signals?
Authority scoring evaluates outbound links to trusted TLDs (.edu, .gov, .org) and high-authority domains (Wikipedia, Nature, IEEE, BBC, Reuters). It also checks for Organization or Person JSON-LD schemas with structured identity data. Pages with verified authorship entities and authoritative citation networks score significantly higher.
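The outbound-link part of that check could be approximated as below. The hostnames chosen for the named sources (for example bbc.com and bbc.co.uk for the BBC) and the function names are assumptions for illustration; the checker's real domain list is not given here.

```python
from urllib.parse import urlparse

TRUSTED_TLDS = (".edu", ".gov", ".org")
# Assumed hostnames for the high-authority sources named above.
TRUSTED_DOMAINS = {
    "wikipedia.org",
    "nature.com",
    "ieee.org",
    "bbc.com",
    "bbc.co.uk",
    "reuters.com",
}


def is_authoritative(url):
    """True if the URL's host has a trusted TLD or is a trusted domain."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(TRUSTED_TLDS):
        return True
    # Match the domain itself or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def count_authoritative_links(urls, page_host):
    """Count outbound (off-site) links that point at trusted hosts."""
    count = 0
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Relative links have no host; same-host links are not outbound.
        if host and host != page_host.lower() and is_authoritative(url):
            count += 1
    return count
```

Matching on the TLD alone is deliberately coarse: in this sketch any .org host counts as trusted, which is why the page's own host is excluded before counting.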
What is Hybrid BM25 Context Density?
Hybrid BM25 Context Density is a proprietary metric that approximates how prominently your primary keywords appear in the opening 200 words of your page. It is modelled on the BM25 ranking function commonly used in retrieval pipelines, such as those behind Perplexity and ChatGPT Browse mode, where early keyword presence can strongly influence retrieval priority.