Most advice on optimizing for generative search engines treats citation as a content quality problem. Write more authoritatively. Add statistics. Improve fluency. The implicit model is that engines cite better pages, and better pages are more polished versions of existing ones. A paper from Virginia Tech challenges that model directly — not by proposing a new optimization heuristic, but by measuring why pages fail to get cited in the first place.

The researchers built a taxonomy of citation failure from 949 contrastive pairs: cases where, for the same query, one retrieved page was cited and another was not. Because the cited and uncited pages were matched on topical relevance, the difference had to lie in how the generative engine processed and used them. By analyzing what separated the two groups across those pairs, the paper produces the first systematic account of citation failure modes: not patterns in what successful pages look like, but diagnoses of why specific pages fail.

The number that anchors the study: 43% of topically relevant webpages receive zero citations under baseline generative engine conditions. Relevance is necessary. It is not sufficient.


The Citation Gap Is Real and Measurable

The commercial stakes of citation failure are higher than they might appear. Only 1% of users who encounter AI summaries click cited sources, compared to 15% who click traditional results when no AI summary appears — a figure drawn from an analysis of 68,879 Google searches. The click-through rate collapsed. But generative search visitors convert 23 times better than traditional organic visitors, which means the small fraction who do arrive through citations are disproportionately valuable.


The implication is that citation has become the primary route to traffic in generative search, and a page that is not cited gets almost none. Without a citation there is no referral pathway, and there is no second-place position that still delivers partial traffic. That binary outcome, cited or not, is what makes the 43% baseline exclusion rate commercially significant rather than merely academically interesting.

The researchers constructed their benchmark, MIMIQ, from 204 webpages drawn from the ClueWeb22 index, each paired with 60 queries spanning different intents, personas, and phrasings — 12,240 queries in total. The taxonomy was built separately, from 949 contrastive pairs sourced from GEO-Bench across 10 domains. The two datasets serve different purposes: MIMIQ for measuring optimization performance, the contrastive pairs for understanding why failures occur.


Why Pages Fail to Get Cited: A Four-Category Taxonomy

The taxonomy maps failures across the generative engine pipeline — fetching, parsing, and generation — and the distribution is uneven enough to matter for how practitioners prioritize.

62.2% Semantic Alignment — page framing doesn't match engine query interpretation
27.1% Content Quality — insufficient depth, clarity, or specificity
10.1% Technical Integrity — parsing or formatting failures
0.6% Systemic Exclusion — domain authority bias, unresolvable by content edits

Semantic Alignment (62.2% of failures). The largest category by a significant margin, and the one current GEO advice addresses least directly. The page reaches the model but gets judged insufficiently relevant at generation time. The failure isn't retrieval — it's that the engine doesn't consider the page a good answer to what the query actually asks. This breaks into sub-types: intent divergence (the page is informational, the query is transactional), contextual gap (the query asks about a specific entity the page doesn't explicitly address), outdated framing, and localization mismatch. A page can fail semantic alignment for one query phrasing and pass for another expressing the same underlying topic.

Content Quality (27.1% of failures). The page aligns with the query but resists synthesis. It may be too shallow to support a citation, fragmented into disconnected snippets, padded with filler that dilutes its key facts, or written as dense prose where structure would aid extraction. Generative engines don't quote; they synthesize. Content that doesn't yield clean extraction doesn't get cited even when it's on-topic.

Technical Integrity (10.1% of failures). The engine can't properly ingest the page. Typical causes are JavaScript-rendered content, access blocking, and content overwhelmed by navigation boilerplate. These are upstream failures: optimization applied to a page with a technical integrity problem is wasted before it starts.

Systemic Exclusion (0.6% of failures). The page faces a structural disadvantage the content can't address. Either a higher-authority source covers identical facts — a dominant platform, an institutional reference — or the relevant content falls outside the effective context window. The researchers document cases where a university course page competes against Coursera or edX and loses regardless of content quality. For these pages, the question isn't how to optimize but whether there's an angle or specificity the dominant source doesn't serve.
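The four categories and their shares can be held in a small data structure for prioritization. The following is a minimal sketch, not anything from the paper: the class name, the `page_side_fixable` flag, and the helper functions are illustrative assumptions, and the flag reflects this article's reading that only Systemic Exclusion is out of reach of page-side work.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureCategory:
    name: str
    share_pct: float         # share of diagnosed failures reported in the taxonomy
    page_side_fixable: bool  # assumption: False only for Systemic Exclusion


TAXONOMY: List[FailureCategory] = [
    FailureCategory("Semantic Alignment", 62.2, True),
    FailureCategory("Content Quality", 27.1, True),
    FailureCategory("Technical Integrity", 10.1, True),
    FailureCategory("Systemic Exclusion", 0.6, False),
]


def ranked(taxonomy: List[FailureCategory]) -> List[str]:
    """Category names ordered from most to least common failure."""
    ordered = sorted(taxonomy, key=lambda c: c.share_pct, reverse=True)
    return [c.name for c in ordered]


def fixable_share(taxonomy: List[FailureCategory]) -> float:
    """Total share of failures that page-side work could in principle address."""
    total = sum(c.share_pct for c in taxonomy if c.page_side_fixable)
    return round(total, 1)


if __name__ == "__main__":
    print(ranked(TAXONOMY))
    print(fixable_share(TAXONOMY))
```

Ranking by share is only one way to prioritize. Technical Integrity is the smallest of the fixable categories, but because it is an upstream failure it is usually worth ruling out before anything else.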


AgentGEO vs. Baselines: Efficiency and Accuracy

AgentGEO runs a diagnose-then-repair loop. It compares the uncited target page against a cited competitor, selects a repair tool from a specialized library matched to the diagnosed failure type, and iterates with query-specific memory — modifying only the content chunks that matter. Batch aggregation of query-specific edits prevents overfitting to individual query phrasings.
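The shape of that loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not AgentGEO's implementation: the word-overlap diagnosis, the single `contextual_gap` tool, the bracketed placeholder edit, and the `min_support` aggregation rule are all hypothetical stand-ins, and the sketch makes one pass rather than iterating against a live engine.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

Edit = Tuple[int, str]  # (chunk index, text to append); hypothetical edit format


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(page: List[str], competitor: str, query: str) -> List[str]:
    """Query terms the cited competitor covers but the target page lacks."""
    page_words = words(" ".join(page))
    return sorted(t for t in words(query)
                  if t in words(competitor) and t not in page_words)


def diagnose(page: List[str], competitor: str, query: str) -> str:
    """Toy diagnosis standing in for an LLM-based comparison."""
    return "contextual_gap" if missing_terms(page, competitor, query) else "no_failure"


def repair_contextual_gap(page: List[str], competitor: str, query: str) -> Edit:
    """Propose an edit to the single chunk that overlaps the query most."""
    scores = [len(words(chunk) & words(query)) for chunk in page]
    idx = scores.index(max(scores))
    gap = " ".join(missing_terms(page, competitor, query))
    return idx, f"[address: {gap}]"  # placeholder for a real rewrite


REPAIR_TOOLS = {"contextual_gap": repair_contextual_gap}  # hypothetical tool library


def optimize(page: List[str], competitor: str, queries: List[str],
             min_support: int = 2) -> Tuple[List[str], Dict[str, str]]:
    memory: Dict[str, str] = {}   # query-specific memory of diagnoses
    proposed: Counter = Counter()
    for query in queries:
        label = diagnose(page, competitor, query)
        memory[query] = label
        tool = REPAIR_TOOLS.get(label)
        if tool is not None:
            proposed[tool(page, competitor, query)] += 1
    # Batch aggregation: keep only edits that several queries agree on.
    repaired = list(page)
    for (idx, text), count in proposed.items():
        if count >= min_support:
            repaired[idx] = f"{repaired[idx]} {text}"
    return repaired, memory
```

With a two-chunk page and three queries, an edit that two phrasings agree on is applied to one chunk, while an edit proposed by a single phrasing is dropped. That is the overfitting guard the aggregation step is meant to provide.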

40%: relative citation rate improvement by AgentGEO over baselines
5%: average share of page content modified to achieve that improvement

The performance gap against baselines is large. Under in-context generation, AgentGEO achieves a citation rate of 79.52%, compared to 68.80% for AutoGEO, the best-performing baseline. Under attribute-first-then-generate, AgentGEO reaches 70.00%, outperforming AutoGEO by 4.03 percentage points. The efficiency gap is larger still: AgentGEO modifies an average of 5% of page content to achieve those results. Baseline methods average 25% content modification — and in several topic categories, that modification actively reduces citation rates below the unoptimized baseline.
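Percentage points and relative percentages are easy to confuse when reading these results, so the head-to-head figures against AutoGEO are worked through below. The function names are illustrative. AutoGEO's attribute-first rate is not quoted directly above, so the sketch derives it from the stated 4.03-point gap.

```python
def point_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two citation rates, in percentage points."""
    return rate_a - rate_b


def relative_gain(rate_a: float, rate_b: float) -> float:
    """Improvement of rate_a over rate_b as a percentage of rate_b."""
    return (rate_a - rate_b) / rate_b * 100


# In-context generation: both rates are quoted in the text.
agentgeo_icg, autogeo_icg = 79.52, 68.80

# Attribute-first-then-generate: AutoGEO's rate is implied by the 4.03-point gap.
agentgeo_afg = 70.00
autogeo_afg = agentgeo_afg - 4.03

if __name__ == "__main__":
    print(f"in-context gap: {point_gap(agentgeo_icg, autogeo_icg):.2f} points")
    print(f"in-context relative gain: {relative_gain(agentgeo_icg, autogeo_icg):.2f}%")
    print(f"attribute-first implied AutoGEO rate: {autogeo_afg:.2f}%")
    print(f"attribute-first relative gain: {relative_gain(agentgeo_afg, autogeo_afg):.2f}%")
    print(f"content modified, baseline vs AgentGEO: {25 / 5:.0f}x")
```

These figures compare AgentGEO with AutoGEO only. Any relative figure depends on which baseline it is measured against.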

The faithfulness numbers make the tradeoff concrete. AgentGEO's Jaccard similarity — a measure of how much original content is preserved — is 82.40% under in-context generation. AutoGEO's is 17.97%. The baseline isn't just rewriting more; it's replacing most of the original page. When that original page was already performing well, the rewrite removes what was working. Health content, already cited at roughly 80% under baseline conditions, showed negative or minimal improvement after generic optimization — rewrites inadvertently removed domain-specific terminology the engine used as relevance signals.
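Jaccard similarity is the size of the overlap between two sets divided by the size of their union. The sketch below computes it over lowercase word sets, which is an assumption: the tokenization the paper uses is not described here, so the absolute numbers will not match the reported ones. The example sentences are invented for illustration.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word set (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(original: str, rewritten: str) -> float:
    """|A intersect B| / |A union B| over the two texts' word sets."""
    a, b = tokens(original), tokens(rewritten)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "aspirin reduces fever and pain"
    light_edit = "aspirin reduces fever and mild pain"
    heavy_rewrite = "this medicine helps you feel better"
    print(round(jaccard(original, light_edit), 3))
    print(round(jaccard(original, heavy_rewrite), 3))
```

The light edit keeps five of six distinct words and scores about 0.83. The heavy rewrite shares no words with the original and scores 0, which is the kind of replacement that can strip out domain-specific terminology.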

The transferability result is worth noting separately. When AgentGEO is optimized using one citation paradigm (attribute-first) but evaluated under another (in-context), it improves citation rate by 14.31 percentage points — from 69.98% to 84.29%. The repairs generalize across evaluation conditions, which matters because production generative engines don't expose their citation mechanisms.


What AgentGEO Still Can't Fix

The paper is direct about the ceiling. In the 50-webpage experiment, the overall citation rate rose from 57.0% to 83.7% — a substantial improvement. But 163 queries remained uncited after optimization. The method does not universally solve citation failure.

Health content is the clearest documented case where optimization backfired. The original citation rate for health-related pages was already high, leaving limited room for improvement, and some optimization steps removed the domain-specific language that made those pages citable in the first place. Generic repair rules derived from aggregate patterns don't account for content that works precisely because it's specific.

Systemic exclusion — the 0.6% of failures driven by authority bias — is unresolvable through content-side edits by definition. The engine favors certain domains structurally. No rewrite changes that. The paper acknowledges this ceiling but doesn't address it; the researchers note it as a limitation of content-based optimization generally, not just of AgentGEO.

All experiments use a simulated generative engine rather than commercial systems. The simulation runs on two models, GPT (gpt-4.1-mini) and Claude (claude-haiku-4-5-20251001), each given explicit citation instructions. Both AgentGEO and AutoGEO perform better on GPT than on Claude, suggesting engine-specific variation that a controlled simulation can't fully capture. Real-world transfer to Google AI Overviews or Perplexity remains unvalidated.


Before the Next Rewrite

The paper's practical implication isn't a new optimization checklist. It's that optimization applied without diagnosis is working from the wrong starting point — and that working from the wrong starting point can make things worse.

Check technical access first: strip the page's HTML and read what remains. If the actual content is hard to find in that output, it's hard for a parser to find too. Then map the page against real query variants to find where intent alignment breaks — different intents, different phrasings, different user types. Then test whether the best content on the page supports clean extraction: can an LLM pull a clean, attributable answer from the key paragraphs alone, or does assembling the key facts require reading the whole page?
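The first of those checks can be approximated with the Python standard library. This is a rough sketch under assumptions: it does not execute JavaScript, and a real engine's parser may behave differently. The `visible_text` and `content_offset` helpers are illustrative names, and the sample page is invented.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """What remains of the page once markup, scripts, and styles are stripped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def content_offset(html: str, key_phrase: str) -> Optional[float]:
    """Fraction of stripped text before key_phrase, or None if it never appears."""
    text = visible_text(html)
    position = text.find(key_phrase)
    return None if position < 0 else position / len(text)


if __name__ == "__main__":
    sample = ("<html><head><style>p{}</style>"
              "<script>var x='Dosage table';</script></head>"
              "<body><nav>Home About</nav><p>Dosage table for adults</p></body></html>")
    print(visible_text(sample))
    print(content_offset(sample, "Dosage table"))
```

A result of None for a phrase that is visible in the browser suggests the content is injected by JavaScript. A high offset suggests the content sits behind a long run of navigation boilerplate.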

Then — before committing to content work — check who is actually getting cited for the target queries. If the competing source is a dominant platform covering identical facts, the gap is a positioning problem, not a content problem. Knowing that early is more useful than another round of generic improvements that may remove what was working.

Key Takeaway

Most citation failures stem from semantic misalignment between page framing and engine query interpretation — not from content quality or technical issues — and generic rewriting rules often make this worse.

Practitioners optimizing for generative engine citation should diagnose failure mode first: broad content rewrites are likely to hurt high-performing pages, and authority-disadvantaged pages need distribution-level interventions that content edits cannot provide.

Source

Tian, Zhihua; Chen, Yuhan; Tang, Yao; Liu, Jian; Jia, Ruoxi (2026). Diagnosing and Repairing Citation Failures in Generative Engine Optimization. arXiv:2603.09296.