<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Cihangir Bozdogan</title>
    <description>The latest articles on Forem by Cihangir Bozdogan (@cihangir_bozdogan_76b8c99).</description>
    <link>https://forem.com/cihangir_bozdogan_76b8c99</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1821244%2F76a39809-150a-4587-8aa1-683743112e31.jpg</url>
      <title>Forem: Cihangir Bozdogan</title>
      <link>https://forem.com/cihangir_bozdogan_76b8c99</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/cihangir_bozdogan_76b8c99"/>
    <language>en</language>
    <item>
      <title>Vector Retrieval Quietly Replaced Keyword Match, and the SEO Stack Did Not Notice</title>
      <dc:creator>Cihangir Bozdogan</dc:creator>
      <pubDate>Mon, 04 May 2026 11:44:41 +0000</pubDate>
      <link>https://forem.com/cihangir_bozdogan_76b8c99/vector-retrieval-quietly-replaced-keyword-match-and-the-seo-stack-did-not-notice-16o8</link>
      <guid>https://forem.com/cihangir_bozdogan_76b8c99/vector-retrieval-quietly-replaced-keyword-match-and-the-seo-stack-did-not-notice-16o8</guid>
      <description>&lt;p&gt;&lt;em&gt;How dense embedding retrieval replaced BM25 in modern AI search, what the mechanism actually does, and why exact-match SEO tactics quietly stopped working.&lt;/em&gt;&lt;br&gt;
There is a page I audited last year that ranks well, gets cited, gets quoted, and gets used as a source by AI assistants, all for a phrase nobody types. The literal string appears nowhere in the document. The document is about the topic, plainly and accurately, in clear prose. The query is a paraphrase. Twenty years of SEO heuristics would predict this page does not match. The retrieval stack thinks it matches better than half the pages that do contain the literal phrase. The inverse also happens: a page that uses a query's exact terms three times in the title and twice in the H1 is not getting cited at all, because the embedding model thinks the page is about something different from what the user asked. Same query class, two outcomes, and the difference is mechanical. The retrieval stack changed underneath, and most of the SEO heuristics the industry still teaches are heuristics about a stack that is now the second-stage filter, not the first.&lt;/p&gt;

&lt;p&gt;I built my mental model the slow way. I read the BEIR benchmark paper end to end, then DPR, then ColBERT, then HNSW, and then sat with a public embedding model and a corpus of my own, running similarity computations against synonym pairs, paraphrase pairs, and adversarial pairs until the behaviour stopped surprising me. After that I started watching what happened to AI citations when pages were rewritten in different ways: exact-match tightened, paraphrases added, exact-match stripped while the semantics were preserved. The pattern that fell out is not subtle, and it overturns several pieces of SEO advice that are still being repeated as if they were neutral facts.&lt;/p&gt;

&lt;p&gt;This post is the field report. The shift from sparse to dense first-stage retrieval, what an embedding model actually represents about a page and a query, why approximate nearest neighbour search is the workhorse of the recall step, why dense-only retrieval fails in specific predictable ways and why hybrid retrieval is the production answer, and what all of that means for content design. It is technical because the mechanism is technical. The shortcuts the SEO industry has been selling are shortcuts to the wrong stack.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Two Decades of BM25
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9reodfxwt3ib8h2db508.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9reodfxwt3ib8h2db508.png" alt=" " width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
For roughly twenty years, the dominant first-stage retrieval algorithm on the open web and inside almost every search engine, on-site search, and Lucene/Elasticsearch deployment was BM25, formalised by Robertson and Zaragoza in their 2009 retrospective "The Probabilistic Relevance Framework: BM25 and Beyond." BM25 is a sparse, lexical, term-frequency-based scorer. It builds an inverted index of terms to documents. At query time it scores documents by how often the query terms appear, weighted by inverse document frequency, with saturation and length normalisation parameters bolted on. The mathematics is closed-form, the index is small, and the recall is reasonable for queries whose terms overlap exactly with the document.&lt;/p&gt;
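
&lt;p&gt;The scoring function is compact enough to show. A minimal textbook sketch for intuition, not the Lucene implementation; &lt;code&gt;k1&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are the conventional defaults:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (as a token list) against a query with textbook BM25."""
    N = len(corpus)                                   # number of documents
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)      # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        # term-frequency saturation (k1) and document-length normalisation (b)
        score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;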

&lt;p&gt;BM25 has properties the SEO industry built an entire grammar around. It rewards keyword presence. It is sensitive to keyword frequency up to a saturation point. It penalises long documents to prevent stuffing. It cannot match a paraphrase. It cannot infer that "vehicle" and "automobile" are the same concept. It cannot tell that "how to fix a slow website" and "improving page load performance" are about the same question. The keyword-research industry, on-page-optimisation playbooks, exact-match domain folklore, the H1-must-contain-the-target-keyword reflex: all of that grammar is downstream of how BM25 scores documents. When the retrieval stack scores on lexical overlap, the rational thing for authors is to engineer lexical overlap. So they did, for two decades.&lt;/p&gt;

&lt;p&gt;The thing that changed, quietly enough that most SEO commentary missed it, is that BM25 stopped being the only thing, and on a growing share of the queries that matter for AI search it stopped being the dominant thing at the recall step.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Dense Retrieval Era
&lt;/h2&gt;

&lt;p&gt;Dense retrieval was not a single moment. It was a slow accumulation of papers that each made the dense approach better, cheaper, or more general. The two reference points worth knowing by name are DPR (Karpukhin et al., 2020) and ColBERT (Khattab and Zaharia, 2020). DPR demonstrated that a dual-encoder, where query and passage are each encoded independently into a dense vector and scored by inner product, could outperform BM25 on open-domain question answering by a substantial margin. ColBERT pushed the thinking further by keeping per-token embeddings and computing a late-interaction score, improving fine-grained matching while remaining tractable.&lt;/p&gt;
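
&lt;p&gt;The dual-encoder mechanism fits in a few lines. A sketch with an off-the-shelf bi-encoder; the model name and the passages are placeholders, not DPR itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative bi-encoder choice

passages = [
    "Next.js apps deploy fastest on platforms with built-in edge caching.",
    "BM25 scores documents by weighted term frequency.",
]
query = "quickest way to ship a Next.js application"

# Encode query and passages independently, then score by inner product.
p_vecs = model.encode(passages, normalize_embeddings=True)
q_vec = model.encode(query, normalize_embeddings=True)
scores = p_vecs @ q_vec              # cosine similarity, since the vectors are normalised

print(scores.argsort()[::-1])        # passage indices ranked for this query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;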

&lt;p&gt;The third reference point that brought rigour to the comparison is the BEIR benchmark (Thakur et al., 2021). BEIR took eighteen heterogeneous IR datasets, ran the major sparse and dense retrievers across all of them in zero-shot mode, and published the comparison. The headline result was less tidy than the dense-retrieval marketing wanted: dense models trained on one domain did not always transfer to another, and BM25 remained surprisingly hard to beat on certain tasks. The honest reading of BEIR is that neither sparse nor dense is a universal winner alone, and hybrid systems combining both tend to dominate.&lt;/p&gt;

&lt;p&gt;That honest reading is the one production search systems implement. It is also the one most SEO advice ignores.&lt;/p&gt;
&lt;h2&gt;
  
  
  What an Embedding Model Sees in a Page
&lt;/h2&gt;

&lt;p&gt;The mechanism is worth tracing. An embedding model takes a sequence of tokens (your page's text, broken into sub-word tokens by the model's tokeniser) and runs them through a stack of transformer layers. Each token attends to every other token (or a windowed subset). The output is a sequence of contextualised token embeddings: each token carries information about the words that surround it. The model pools that sequence into a single vector: often the embedding of a &lt;code&gt;[CLS]&lt;/code&gt; token, sometimes mean-pooling, sometimes a learned head. The result is a fixed-size vector, typically between 384 and 3072 dimensions depending on the model.&lt;/p&gt;

&lt;p&gt;What that vector represents is meaning, not surface text. Two paragraphs saying the same thing in different words produce vectors close in the embedding space. A paragraph about "the impact of caching on web performance" and a paragraph about "how stale responses speed up rendering" sit near each other even though they share almost no tokens. This is what dense retrieval does that BM25 never could. It is also why content that is "well-written about the topic" can outrank content that is "engineered for the keyword": the model is not counting tokens, it is comparing meaning.&lt;/p&gt;
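
&lt;p&gt;The claim is cheap to check on any public embedding model. A sketch with made-up sentences and an illustrative model choice; the exact numbers vary by model, the ordering usually does not:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

a = "Caching substantially improves web performance."
b = "Serving stale responses makes pages render faster."   # paraphrase, almost no shared tokens
c = "Caching wine improves flavour at cellar temperature."  # shared tokens, different meaning

va, vb, vc = model.encode([a, b, c])
print(util.cos_sim(va, vb))  # typically the higher similarity: same meaning
print(util.cos_sim(va, vc))  # typically lower, despite the lexical overlap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;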

&lt;p&gt;The flip side is that the embedding model is a learned model, not a dictionary. It has a training distribution. Concepts well-represented in training are mapped cleanly. Concepts that were absent or rare are mapped sloppily. Specific identifiers (SKUs, model numbers, error codes, brand names that look like generic words) frequently sit in regions of the embedding space with very low resolution. That is one of the dense-retrieval failure modes, and we will come back to it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What an Embedding Model Sees in a Query
&lt;/h2&gt;

&lt;p&gt;The query goes through the same model. The user's words (or the query the LLM has rewritten on the user's behalf) get tokenised, embedded, and pooled into a vector in the same space as the documents. The retrieval step is a nearest-neighbour search: which document vectors are closest to the query vector by cosine similarity or inner product?&lt;/p&gt;

&lt;p&gt;The query embedding does several things a BM25 query cannot. It handles paraphrase: "fastest way to deploy a Next.js app" lands near documents about "Next.js deployment latency," even though "fastest" is missing from one and "latency" from the other. It handles synonym disjunction softly: a query about "vehicles" partially matches documents about "cars" without a configured dictionary. It handles intent inference up to a point: a question lands closer to documents that answer it than to documents that ask similar questions, because the model has learned the difference from training data.&lt;/p&gt;

&lt;p&gt;What it does not do is handle exact identifiers well. A query for the SKU &lt;code&gt;BTX-449-G2&lt;/code&gt; returns high similarity only if the model tokenised it the same way for document and query, and embeddings of rare tokens are noisy. A query for the precise string &lt;code&gt;error E_INVALID_REDIRECT&lt;/code&gt; may end up near generic documents about redirect errors and miss the document that contains the exact string verbatim, because the model treats the rare code as low-information. That is why hybrid retrieval exists.&lt;/p&gt;

&lt;p&gt;Before we get there, there is a piece between the user's input and the embedding step that most operators forget about.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Invisible Query Rewrite
&lt;/h2&gt;

&lt;p&gt;When a model produces a search-grounded answer, the query that hits the retrieval stack is rarely the user's literal text. The model rewrites the question into one or more search queries: sometimes expanding into sub-queries, sometimes paraphrasing, sometimes filling in implicit context from the conversation. ChatGPT search, Perplexity, Gemini grounded mode, Claude with the web search tool, and Bing Chat all do some form of query rewriting before retrieval. The stack downstream sees the rewritten query, not the user's words.&lt;/p&gt;

&lt;p&gt;This matters for content design. Optimising for the literal user query is a fool's errand, because you do not see the literal query; you see the query the model decided to send, already normalised and paraphrased. What you can optimise for is the cluster of paraphrases the model is likely to produce around a given intent. This is why writing "the same answer phrased multiple ways within one page" tends to win over "the same keyword repeated multiple times within one page": the paraphrased pages match more of the rewrite distribution, which is what actually hits the index.&lt;/p&gt;
&lt;h2&gt;
  
  
  Approximate Nearest Neighbour at Scale
&lt;/h2&gt;

&lt;p&gt;In principle the recall step is just nearest-neighbour search. In practice, exact nearest-neighbour search over hundreds of millions of vectors is infeasible at AI-search latencies. The production answer is approximate nearest neighbour, or ANN, and the dominant open-source algorithm is HNSW (Hierarchical Navigable Small World graphs), described by Malkov and Yashunin in 2016.&lt;/p&gt;

&lt;p&gt;HNSW is a graph-based index. The intuition is worth holding clearly because it explains why ANN is "good enough" for the first stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HNSW conceptual structure (top layer is sparse, bottom layer is full)

  Layer 2 (sparse, long edges):    o ----------- o ----------- o
                                    \           /             /
  Layer 1 (denser, medium edges):   o --- o --- o --- o --- o
                                     \   /     \    \   /
  Layer 0 (full, short local edges): o-o-o-o-o-o-o-o-o-o-o-o
                                                ^
                                          query enters at top,
                                          greedy descent narrows
                                          neighbourhood at each layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A query enters at the top layer, which has few nodes connected by long edges. The algorithm greedily walks toward the query's nearest neighbour, drops down to the next layer using the current best node as the entry point, and repeats. By the time the search reaches the bottom layer (which contains every vector), the candidate region is already narrowed to a small neighbourhood, and the bottom-layer search only explores a few hundred nodes instead of the full corpus. The result is sub-linear search time with high recall, configurable through parameters that trade off recall against latency.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Faiss&lt;/code&gt;, the open-source library from Meta, implements HNSW alongside several other ANN structures including IVF (inverted file with coarse quantisation) and product quantisation. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Vespa: every production vector database is a variation on these ideas. HNSW dominates the discussion because it has consistently strong recall on high-dimensional vectors with reasonable memory overhead.&lt;/p&gt;
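
&lt;p&gt;A minimal Faiss sketch of the index just described. The dimension, the graph parameters, and the random vectors are placeholders; &lt;code&gt;efSearch&lt;/code&gt; is the recall-versus-latency knob the previous paragraph refers to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import faiss
import numpy as np

d = 384                                                     # embedding dimension (model-dependent)
doc_vecs = np.random.rand(100_000, d).astype("float32")     # placeholder document vectors
query_vecs = np.random.rand(5, d).astype("float32")         # placeholder query vectors

index = faiss.IndexHNSWFlat(d, 32)    # 32 = M, edges kept per node
index.hnsw.efConstruction = 200       # build-time search breadth
index.hnsw.efSearch = 64              # query-time breadth: higher means better recall, more latency
index.add(doc_vecs)

distances, ids = index.search(query_vecs, 10)   # approximate top-10 neighbours per query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;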

&lt;p&gt;The catch (and it is the catch that hybrid retrieval was invented to address) is that ANN is &lt;em&gt;approximate&lt;/em&gt;. The recall step returns the top-k by approximate similarity, not the true top-k. For most queries the top few results are stable. For queries with rare terms or out-of-distribution embeddings, the approximate index can miss the document that lexical search would have found trivially. Combined with the embedding model's own weaknesses on rare and exact terms, the dense-only path has predictable failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Dense Alone Loses
&lt;/h2&gt;

&lt;p&gt;There is a class of queries where pure dense retrieval is reliably worse than BM25.&lt;/p&gt;

&lt;p&gt;Queries with &lt;strong&gt;exact identifiers&lt;/strong&gt; (product SKUs, model numbers, error codes, version strings, ISBNs, regulatory references) are dense retrieval's worst case. The embedding model has typically not seen &lt;code&gt;BTX-449-G2&lt;/code&gt; enough during training to give it a meaningful position in vector space. BM25 treats it as a token and finds the document instantly.&lt;/p&gt;

&lt;p&gt;Queries with &lt;strong&gt;brand names that overlap common words&lt;/strong&gt; (&lt;code&gt;Apple&lt;/code&gt;, &lt;code&gt;Square&lt;/code&gt;, &lt;code&gt;Notion&lt;/code&gt;, &lt;code&gt;Linear&lt;/code&gt;, &lt;code&gt;Vector&lt;/code&gt;) are a related case. The embedding model maps "Apple" closer to "fruit," "company," and "computer" by some learned blend. The query "Apple support phone number" sits in a region where consumer-electronics documents and grocery-aisle documents coexist. BM25 does not care about meaning and scores by literal token overlap.&lt;/p&gt;

&lt;p&gt;Queries about &lt;strong&gt;domains under-represented in training&lt;/strong&gt; (niche legal corpora, regional regulatory texts, deeply specialised technical fields) also tend to favour BM25, because the embedding model's resolution in those regions of the space is poor.&lt;/p&gt;

&lt;p&gt;Queries with &lt;strong&gt;negation and quantifiers&lt;/strong&gt; ("papers that do not use BERT," "websites without a privacy policy") are hard for embedding models, which struggle to invert meaning. BM25 with explicit operators handles these better than naive dense retrieval, although in practice the LLM usually rewrites the query into something the dense retriever can handle.&lt;/p&gt;

&lt;p&gt;This is the empirical content of the BEIR result. Across eighteen datasets, no single retriever wins everywhere, and the cases where dense loses are not random; they cluster around the failure modes above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Retrieval Is the Production Answer
&lt;/h2&gt;

&lt;p&gt;Production AI search systems do not pick sparse or dense. They run both, fuse the results, and let the rerank stage clean it up.&lt;/p&gt;

&lt;p&gt;The two common fusion approaches are Reciprocal Rank Fusion (a simple, training-free recipe that sums, for each document, the reciprocal of its rank in each list) and learned combiners (models trained to score documents using both BM25 and dense scores as features). Vespa, Weaviate, Elasticsearch's hybrid search, Qdrant's BM25 + dense pipelines, and OpenSearch's neural-sparse hybrid all implement variations of these patterns. The rerank step that follows (a heavier cross-encoder that re-scores the top candidates) is its own conversation, and I am keeping it deliberately brief here. The point for retrieval is that the rerank cleans up the noise the recall step admitted, and the recall step is now hybrid rather than purely lexical.&lt;/p&gt;
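
&lt;p&gt;Reciprocal Rank Fusion is small enough to show in full. A sketch; the &lt;code&gt;k=60&lt;/code&gt; constant is the value commonly used in the literature, and the two ranked lists are invented:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc7", "doc2", "doc9"]     # lexical leg (illustrative)
dense_top = ["doc2", "doc5", "doc7"]    # dense leg (illustrative)
print(rrf_fuse([bm25_top, dense_top]))  # doc2 and doc7 rise: they survive both filters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;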

&lt;p&gt;Here is the comparison that matters, framed as the characteristics of each path:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Sparse (BM25)&lt;/th&gt;
&lt;th&gt;Dense (embedding-based)&lt;/th&gt;
&lt;th&gt;Hybrid (sparse + dense)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Matches exact terms&lt;/td&gt;
&lt;td&gt;Yes, by construction&lt;/td&gt;
&lt;td&gt;Weakly, via tokenisation&lt;/td&gt;
&lt;td&gt;Yes (sparse rescues this)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Matches paraphrases&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (dense provides this)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles synonyms&lt;/td&gt;
&lt;td&gt;Only with explicit dictionary&lt;/td&gt;
&lt;td&gt;Yes, learned&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles rare identifiers&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Weakly&lt;/td&gt;
&lt;td&gt;Yes (sparse rescues this)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles negation&lt;/td&gt;
&lt;td&gt;Yes, with operators&lt;/td&gt;
&lt;td&gt;Poorly&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Robust to OOD vocabulary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Poorly&lt;/td&gt;
&lt;td&gt;Yes (sparse rescues this)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall vs latency at scale&lt;/td&gt;
&lt;td&gt;Inverted index, sub-linear&lt;/td&gt;
&lt;td&gt;ANN graph, sub-linear&lt;/td&gt;
&lt;td&gt;Run both, fuse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index size&lt;/td&gt;
&lt;td&gt;Small (token postings)&lt;/td&gt;
&lt;td&gt;Large (vector per chunk)&lt;/td&gt;
&lt;td&gt;Sum of both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold-start on new content&lt;/td&gt;
&lt;td&gt;Immediate (just index tokens)&lt;/td&gt;
&lt;td&gt;Requires embedding compute&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That table is the operational summary of two decades of BM25 plus six years of dense-retrieval-at-scale. It explains why the production answer is hybrid and why neither extreme of the SEO debate ("keywords are dead" or "keywords are all that matter") is correct. They are both signals. The retrieval stack uses both. Content that wins in AI search is content that survives both filters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Still Matters from the Lexical Era
&lt;/h2&gt;

&lt;p&gt;The dense retriever does not erase the lexical signal; it adds a second signal next to it. Everything BM25 ever rewarded still partially matters, but the marginal return on stuffing the same term thirty times has collapsed. What survives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact entity names.&lt;/strong&gt; Brand names, product names, person names, location names: these are what hybrid retrieval rescues from dense-only failure. If your brand is &lt;code&gt;Acme Software&lt;/code&gt;, that exact string needs to appear once on the page in plain text where the indexer can find it, somewhere unambiguous, with the surrounding paraphrases the embedding model can latch onto.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact identifiers.&lt;/strong&gt; SKUs, error codes, version strings, model numbers. Same story. Once on the page in the canonical form is what you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured data.&lt;/strong&gt; Schema.org JSON-LD remains load-bearing because it gives the indexing pipeline a clean entity graph that does not depend on parsing prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brand spellings and variations.&lt;/strong&gt; If users search for both &lt;code&gt;e-mail&lt;/code&gt; and &lt;code&gt;email&lt;/code&gt; or &lt;code&gt;Wi-Fi&lt;/code&gt; and &lt;code&gt;WiFi&lt;/code&gt;, both forms benefit from being present somewhere on the site. Embedding models are mostly, though not perfectly, robust here, and the BM25 leg is exact-only.&lt;/p&gt;

&lt;p&gt;What is no longer worth doing (and was probably never worth doing as much as the SEO playbooks insisted) is keyword density manipulation, exact-phrase repetition, and synonym dictionaries pasted into footers. The marginal return from these tactics in a hybrid stack is approximately zero, and in some cases negative because the embedding pooling step degrades under repetition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing Content for Both Filters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw22juvw1u343i9lxehya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw22juvw1u343i9lxehya.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;br&gt;
The practical content rule is short and unromantic: &lt;strong&gt;write the answer once in the canonical phrasing, then write the paraphrases around it, then make sure the structure is parseable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The mechanism for each clause is real. Canonical phrasing gives BM25 the exact-match signal it needs. Paraphrases widen the embedding space the page covers, so the page lands close to a wider distribution of query rewrites. Parseable structure (short paragraphs, one thought per chunk, headings that match the prose, schema where appropriate) feeds the chunker and the structured-data layer downstream.&lt;/p&gt;

&lt;p&gt;The thing the SEO industry got wrong, and is still getting wrong, is the assumption that you must choose between exact-match and semantic richness. The hybrid stack does not force a choice. It rewards both, scored by different paths and fused. Pages that try to win on exact-match alone fail the dense filter on paraphrases. Pages that try to win on semantic richness alone fail the sparse filter on exact identifiers and brand names. Pages that do both (which is what good prose has always been) match more of the query distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Verify You Are Winning at the Embedding Layer
&lt;/h2&gt;

&lt;p&gt;This is the part of the post where I tell you to stop guessing and start measuring, because the measurement is cheap and the alternative is folklore.&lt;/p&gt;

&lt;p&gt;Pick a public embedding model: &lt;code&gt;text-embedding-3-small&lt;/code&gt; from OpenAI, &lt;code&gt;voyage-3&lt;/code&gt; from Voyage, or a BGE model from BAAI (free). Pick a corpus of your own pages. Embed each page. Take a list of queries you believe should match those pages (literal phrasings, paraphrases, adversarial cases), embed those, and compute cosine similarity between every query and every page.&lt;/p&gt;

&lt;p&gt;What you are looking for is not absolute numbers: embedding similarities are model-specific and not directly comparable across models. You are looking for ranks and gaps. For a query that should match page A, does page A come first? If it is buried under three tangentially related pages, your content is failing the dense filter, and the failure is diagnosable. Often the fix is a missing paraphrase, a buried answer the pooling step is averaging away, or a structure where the topic shifts halfway through and the pooled vector lands between two centroids.&lt;/p&gt;
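
&lt;p&gt;A sketch of that audit loop, using a free BGE model as the embedding model; the pages, the queries, and the expected winners are placeholders to show the shape of the output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any public embedding model works here

pages = {   # page_id mapped to page text (use your real corpus)
    "pricing": "Pro plan pricing: $49 per seat per month, billed annually.",
    "deploy-guide": "Deploy to production with a single git push; rollbacks are instant.",
}
queries = {  # query mapped to the page you expect it to retrieve
    "how much does the pro plan cost": "pricing",
    "fastest way to ship to production": "deploy-guide",
}

page_ids = list(pages)
P = model.encode([pages[p] for p in page_ids], normalize_embeddings=True)

for q, expected in queries.items():
    qv = model.encode(q, normalize_embeddings=True)
    sims = P @ qv
    ranked = [page_ids[i] for i in np.argsort(-sims)]
    rank = ranked.index(expected) + 1
    print(q, "| expected", expected, "| rank", rank, "of", len(ranked))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;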

&lt;p&gt;Run the same exercise with BM25; most search libraries (Elasticsearch, OpenSearch, Vespa, Tantivy, Whoosh) implement it in a few lines. Compare. The cases where the same query ranks the page differently between the two paths are the cases where hybrid retrieval will cover or expose your content. That comparison is the thing the SEO industry pretends not to need to do, because doing it makes the folklore harder to sell.&lt;/p&gt;
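
&lt;p&gt;The lexical leg of the same audit, sketched with the &lt;code&gt;rank_bm25&lt;/code&gt; package and a naive whitespace tokeniser, which is enough for the comparison; the corpus is the same placeholder one as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from rank_bm25 import BM25Okapi

corpus = {
    "pricing": "Pro plan pricing: $49 per seat per month, billed annually.",
    "deploy-guide": "Deploy to production with a single git push; rollbacks are instant.",
}
page_ids = list(corpus)
bm25 = BM25Okapi([corpus[p].lower().split() for p in page_ids])

query = "fastest way to ship to production"
scores = bm25.get_scores(query.lower().split())
for page, score in sorted(zip(page_ids, scores), key=lambda x: -x[1]):
    print(page, round(score, 3))
# Compare this ordering with the dense ordering above: the disagreements are
# exactly where hybrid retrieval will either cover or expose the page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;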

&lt;p&gt;I run this against my own content periodically, against competitors' content, and against the queries I expect AI assistants to rewrite mine into. It is the cheapest piece of due diligence in modern content engineering and consistently produces actionable findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Google
&lt;/h2&gt;

&lt;p&gt;Whenever the dense-retrieval story comes up, someone asks "did Google switch to vectors?" The honest answer is that Google's retrieval stack is hybrid, partially private, and has been neural-augmented since well before the LLM era: RankBrain (2015) and BERT integration (2019) are the named layers, but those are not the entire stack, and the company has not published a definitive "we switched from BM25 to vectors on date X" statement because the truth is more complicated. The on-the-record position is hybrid: lexical features plus learned ranking plus several layers of neural processing in concert. AI Overview and Gemini's grounded mode add their own retrieval and synthesis on top. Treating Google's stack as either "still BM25 underneath" or "all vectors now" mis-frames it. It is layered, hybrid, mostly private. The operational stance: assume both filters are present, design content that survives both, do not bet against either signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Synthesis
&lt;/h2&gt;

&lt;p&gt;The recall step in modern AI search is dense, not lexical, but the production stack is hybrid, and that is the framing the SEO industry has not absorbed. Embedding models match meaning. BM25 matches tokens. Both fire. The pages cited by AI assistants are the pages that survive both filters, not the pages that game one.&lt;/p&gt;

&lt;p&gt;The single sentence: &lt;em&gt;retrieval is no longer keyword match; it is hybrid recall where the dense signal handles paraphrase and intent and the sparse signal rescues exact identifiers, and content design that ignores either filter loses on the queries the other one would have caught.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you only have time to internalise three things, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The first stage is hybrid, not lexical.&lt;/strong&gt; Dense retrieval handles paraphrase, intent, and synonyms. Sparse retrieval handles exact identifiers, brand names, and rare terms. Both fire on every query in production stacks. Content that engineers for one and ignores the other loses on the queries the other one would have caught.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The user's literal query is not the query that hits the retrieval stack.&lt;/strong&gt; LLM rewrites paraphrase, expand, and normalise the query before retrieval. Optimising for the literal user phrasing is optimising for a string the index never sees. Optimising for the cluster of paraphrases around an intent is what moves the needle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure your content with a public embedding model.&lt;/strong&gt; It costs almost nothing. Compute similarity between your pages and the queries you expect. Cases where a topically correct page ranks low in cosine similarity are cases where your content is failing the dense filter, and the failure is usually diagnosable. The SEO industry mostly does not do this, which is why so much advice is still keyword-stack folklore.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The page that ranks for the phrase nobody types is not magic. It is a page whose embedding sits close to the embedding of the query the user actually asked, in a space the model learned from a corpus closed before either of you wrote anything. The page that wins the exact-match phrase but does not get cited is the inverse: the lexical filter passed it, the dense filter dropped it, and the rerank step never saw it. Both outcomes are mechanical, both are addressable, and the content design that addresses both is what wins the hybrid retrieval stack, which is the stack that decides what AI assistants see.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The retrieval-stack synthesis here is my own reading of the primary literature: Robertson and Zaragoza's &lt;a href="https://dl.acm.org/doi/abs/10.1561/1500000019" rel="noopener noreferrer"&gt;The Probabilistic Relevance Framework: BM25 and Beyond&lt;/a&gt; (Foundations and Trends in Information Retrieval, 2009), Karpukhin et al., &lt;a href="https://arxiv.org/abs/2004.04906" rel="noopener noreferrer"&gt;Dense Passage Retrieval for Open-Domain Question Answering&lt;/a&gt; (arXiv:2004.04906, 2020), Khattab and Zaharia, &lt;a href="https://arxiv.org/abs/2004.12832" rel="noopener noreferrer"&gt;ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT&lt;/a&gt; (arXiv:2004.12832, 2020), Thakur et al., &lt;a href="https://arxiv.org/abs/2104.08663" rel="noopener noreferrer"&gt;BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models&lt;/a&gt; (arXiv:2104.08663, 2021), Malkov and Yashunin, &lt;a href="https://arxiv.org/abs/1603.09320" rel="noopener noreferrer"&gt;Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs&lt;/a&gt; (arXiv:1603.09320, 2016), and the &lt;a href="https://github.com/facebookresearch/faiss" rel="noopener noreferrer"&gt;Faiss library source and documentation&lt;/a&gt;, combined with observable behaviour from running my own embedding-similarity computations against my own corpus and watching what happened to AI citations after content rewrites. Where I have written "in my testing" or "the pattern I observe," that is exactly what I mean. The directional claims about exact-match SEO no longer paying are mechanistic (embedding similarity is computable on any public model and the audit is reproducible), but I am not making quantitative promises, and the magnitude of any individual rewrite varies by domain, model, and query distribution. Provider behaviour is moving; verify against current docs and current model behaviour before shipping a strategy.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Published by Cihangir Bozdogan&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>seo</category>
      <category>llm</category>
    </item>
    <item>
      <title>Nine Search Backends, Nine Different Webs. Why AI Citations Diverge for the Same Query.</title>
      <dc:creator>Cihangir Bozdogan</dc:creator>
      <pubDate>Mon, 04 May 2026 11:33:42 +0000</pubDate>
      <link>https://forem.com/cihangir_bozdogan_76b8c99/nine-search-backends-nine-different-webs-why-ai-citations-diverge-for-the-same-query-5deg</link>
      <guid>https://forem.com/cihangir_bozdogan_76b8c99/nine-search-backends-nine-different-webs-why-ai-citations-diverge-for-the-same-query-5deg</guid>
      <description>&lt;p&gt;Run the same brand-query through ChatGPT, Gemini, Perplexity, Claude, and Grok. Read the citations. The cited URLs will not be the same, the brands featured will not be the same, and in roughly a third of cases one tool will cite your brand confidently while another does not mention it at all. The temptation is to reach for an algorithmic explanation different rerankers, different summarisation styles, different prompt scaffolds. The actual explanation is upstream of all of that. Different tools sit on top of different search backends, and the backends do not see the same web.&lt;/p&gt;

&lt;p&gt;I worked this out by running the same fifty brand-queries across nine AI tools for six months and logging every citation URL, every search-tool invocation, every backend signature I could pull out of the response trace. The divergence was not noise. It was structural. A page indexed by Bing but missing from Google's index simply does not show up in Gemini, no matter how well it is written. A site that Brave's crawler reaches but Tavily's reranker buries cannot win on Tavily-backed agents. The "AI search" abstraction collapses into a backend-coverage problem the moment you try to optimise systematically.&lt;/p&gt;

&lt;p&gt;This post is the field report. The publicly known map of search backends behind the major AI tools. What "different index" actually means at the crawl-and-rank layer. The fusion-layer wildcards Tavily, Exa, and similar APIs that sit between agents and the open web. The provider-internal indexes nobody talks about. Why citation patterns drift over months. The practical monitoring strategy for an operator who actually wants to see the gap rather than guess at it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backend Map: What Powers What
&lt;/h2&gt;

&lt;p&gt;The first useful exercise is drawing the map honestly, including the parts that are partly private. Some relationships are documented, some inferred from observable signatures, and some have shifted over time. I will mark each accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bing&lt;/strong&gt; powers Microsoft's own grounding stack: Copilot in all its forms and the Grounding with Bing Search tool exposed through Azure AI Foundry. Microsoft documents the Grounding with Bing Search service as the canonical way for an Azure-hosted agent to ground responses on real-time public web data. The legacy Bing Search API was deprecated in mid-2025 in favour of the grounding-specific service. &lt;strong&gt;DuckDuckGo&lt;/strong&gt; has long sourced traditional links and images "largely from Bing" while layering its own crawler (DuckDuckBot) and specialised sources on top. The DuckDuckGo help page on results sources says exactly that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatGPT search&lt;/strong&gt; is the trickiest cell on the map and the one where I want to be most careful. The Microsoft–OpenAI partnership originally placed Bing behind ChatGPT's web-search behaviour, and many secondary sources still describe the relationship that way. Then OpenAI launched ChatGPT search in October 2024 and explicitly positioned it as a competitor to Bing. OpenAI's own announcement describes the feature as "powered by real-time web search and partnerships with news and data providers", a deliberately broad framing. The Microsoft–OpenAI partnership was renegotiated through 2025 to give OpenAI more flexibility to use multiple cloud and search providers. The honest answer for "what backend powers ChatGPT search today" is a stack that includes Bing, includes direct news-publisher integrations, and increasingly includes OpenAI's own crawl. Treating it as pure Bing is wrong now. Treating it as pure first-party is also wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini and Google AI Overview&lt;/strong&gt; sit on Google Search. This one is documented unambiguously. Google's grounding documentation says "Grounding with Google Search connects the Gemini model to real-time web content" and exposes the result trace through &lt;code&gt;groundingChunks&lt;/code&gt;, &lt;code&gt;groundingSupports&lt;/code&gt;, &lt;code&gt;webSearchQueries&lt;/code&gt;, and &lt;code&gt;searchEntryPoint&lt;/code&gt;. Google's web index is the same index that powers conventional Google Search, with the same crawl and the same ranking signals. AI Overview is grounded on a subset of the top-ranked search results for the query, with the LLM synthesis on top.&lt;/p&gt;
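
&lt;p&gt;For orientation, the abridged shape of that trace as a Python dict; the field names are the ones named above, the values are invented, and the real response carries more fields than shown:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Abridged groundingMetadata from a grounded Gemini response (illustrative values only)
grounding_metadata = {
    "webSearchQueries": ["acme software pricing 2026"],   # the rewritten queries actually issued
    "searchEntryPoint": {"renderedContent": "(rendered Search-suggestions widget markup)"},
    "groundingChunks": [
        {"web": {"uri": "https://example.com/pricing", "title": "Pricing - Acme Software"}},
    ],
    "groundingSupports": [
        {
            "segment": {"startIndex": 0, "endIndex": 64},  # span of the answer text
            "groundingChunkIndices": [0],                  # chunks supporting that span
        },
    ],
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;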

&lt;p&gt;&lt;strong&gt;Claude's web search tool&lt;/strong&gt; uses Brave Search as the third-party search provider. This is documented in Google's own Vertex AI documentation for Anthropic partner models, which lists Brave Search as the "third-party search service that Anthropic Web Search feature can call." Brave's API page lists Mistral AI, Cohere, Together.ai, and Snowflake among its users and frames itself as "the leading search tool for applications that use Claude MCP." Claude's grounded answers bottom out in Brave's index, which is independent of Bing and Google.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brave Search&lt;/strong&gt; runs an independent index. Brave's API page is direct about this: "The only search API with its own Web index at scale. Truly independent, lightning-fast, and built to power AI apps." They reinforce the point: "the Brave Search API is not a scraper that simply uses bots to query Google or Bing and repackage their results. Instead, it's our own independent index of the Web packaged with our own ranking models." The published index size is "over 30 billion pages." This matters because a Claude grounding session has a different starting set of candidate URLs than a Gemini grounding session, before any ranking or reranking even happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perplexity&lt;/strong&gt; is the hybrid everyone notices and few describe accurately. Perplexity uses an internal crawler, an internal index (the answer engine running on top of it is branded "Sonar"), plus third-party search APIs. Public reporting and Perplexity's own help-centre material have at various times mentioned both Google-backed and Bing-backed paths. The exact mix has shifted over the product's life. Operators tracking Perplexity citations should not assume any single backend is the source of truth for what Perplexity sees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grok&lt;/strong&gt; searches X plus the open web through xAI's Web Search and X Search tools. The xAI documentation describes a web search tool that "enables Grok to search the web in real-time and browse web pages" and an X-platform search tool with keyword, semantic, user, and thread retrieval. The web component's underlying provider has not been publicly disclosed in the same way Anthropic's choice of Brave is disclosed. What is clear is that Grok's index is biased toward X-platform content in a way no other tool's is: a brand with strong X presence shows up disproportionately on Grok, and disproportionately not on tools that lack an X integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kagi&lt;/strong&gt; runs a metasearch architecture: two in-house indexes, Teclis (web) and TinyGem (news), combined with "anonymised API calls to all major search result providers worldwide" plus specialised vertical sources. Kagi is small-scale relative to Bing or Google but maintains a distinctive in-house crawl focused on non-commercial, "small web" content. Kagi is not a backend for any major LLM, but its index character is genuinely different from the dominant ones. &lt;strong&gt;You.com&lt;/strong&gt; runs its own real-time index plus vertical indexes for news, healthcare, legal, and similar, independent of Bing and Google and smaller in scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tavily and Exa&lt;/strong&gt; are different in kind. They are not "search engines" in the Bing/Google sense. They are search APIs designed to be the retrieval layer for AI agents. Tavily describes itself as offering "real-time search, extraction, research, and web crawling through a single, secure API," with a "production-grade retrieval stack." Exa describes its product as an "industry-leading web index built for agents." Both decline to publicly name their upstream sources in the way Brave does, and both are widely understood by builders to combine custom crawling with dense retrieval and their own reranking on top. They are themselves a backend choice for any agent that wires them in.&lt;/p&gt;

&lt;p&gt;The honest summary of the map: nine major surfaces, four or five distinct primary backends, and at least three more API-layer products that act as their own backends when an agent is built on top of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Different Index" Actually Means
&lt;/h2&gt;

&lt;p&gt;It is easy to gloss "different backends, different indexes" as if the indexes were near-equivalent. They are not. Indexes differ on more axes than most operators count, and each axis is a place where two backends diverge for the same query in ways that show up directly in citation behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crawl frequency and freshness.&lt;/strong&gt; Common Crawl, which seeds many training corpora, adds 3–5 billion new pages per month according to its published methodology. Commercial backends crawl much more aggressively than that for high-value sites and less aggressively for the long tail. A new product page on a high-authority domain will hit Bing's index in hours and Google's in similar time. The same page on a small-authority domain might sit uncrawled by either for weeks. Brave's crawl prioritises differently again, and Tavily's and Exa's targeted crawls are shaped by which queries their customers run. A "freshness gap" between two backends is rarely a bug; it is a budget allocation made deliberately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Geo and language coverage.&lt;/strong&gt; Google's index has significantly broader non-English coverage than Bing's. Brave is English-and-Latin-script biased. Tavily and Exa are English-dominant by default. A query in German or Japanese retrieves different breadth across these backends before any ranking layer touches the results. A citation gap in Gemini might disappear when you query in the brand's primary market language, while the same gap in Claude (Brave-backed) might persist because Brave's coverage thins out in non-English.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep-link coverage and JavaScript-only content.&lt;/strong&gt; Backends differ sharply on how deep they crawl and how they handle JavaScript-rendered content. Bing and Google have invested heavily in headless rendering. Brave's public statements about its crawler are more conservative. Tavily and Exa's behaviour around JS-rendered content depends on their per-customer crawl budget. A brand that ships a JavaScript-only site will see different coverage curves across backends, a fact that compounds with the well-known issue of inference-time fetchers being even less rendering-capable than crawl-time fetchers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robots.txt and bot identifiers.&lt;/strong&gt; All major backends respect robots.txt, but the exact directives they honour differ at the edges. Each backend fetches under its own bot identifier (&lt;code&gt;GoogleBot&lt;/code&gt;, &lt;code&gt;BingBot&lt;/code&gt;, &lt;code&gt;BraveBot&lt;/code&gt;, &lt;code&gt;OAI-SearchBot&lt;/code&gt;, &lt;code&gt;ClaudeBot&lt;/code&gt;, etc.), and a robots policy that allows one and blocks another produces a hard coverage gap. Operators who tighten their robots.txt against AI training bots without thinking carefully about the search-time fetchers occasionally cut themselves off from grounding entirely.&lt;/p&gt;
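
&lt;p&gt;A quick way to see whether a robots policy opens that kind of gap is to evaluate the same URL against several user-agent tokens with the standard library's parser. A sketch; the bot names are commonly published ones and the URLs are placeholders, so check each provider's current documentation for the exact tokens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.robotparser import RobotFileParser

bots = ["Googlebot", "Bingbot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]

rp = RobotFileParser()
rp.set_url("https://yourbrand.example/robots.txt")   # placeholder domain
rp.read()

url = "https://yourbrand.example/docs/pricing"
for bot in bots:
    verdict = "allowed" if rp.can_fetch(bot, url) else "blocked"
    print(bot, verdict)   # one "blocked" line here is a hard coverage gap for that backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;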

&lt;p&gt;&lt;strong&gt;Content-type coverage.&lt;/strong&gt; PDFs, video transcripts, podcast transcripts, and code repositories are covered very unevenly. Google handles PDFs more thoroughly than most. Code-heavy queries land differently on Brave because of how Brave indexes code-host content. Video-transcript surfacing depends on whether the backend has direct ingestion or transcript-extraction at crawl time. A brand whose primary content is a podcast or a video series will have a wildly different visibility profile across backends than one whose content is HTML articles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snippet length and chunk granularity.&lt;/strong&gt; Once a backend indexes a page, the chunk it stores is what determines whether a passage can become a citation. Brave publishes that its API returns "up to five snippets" per result. Google's grounded responses surface a different chunk shape. Tavily and Exa, being embedding-based, serve dense vectors over whatever chunk size their pipeline uses. If your page's information is densely packed in a single section that exceeds the backend's chunk granularity, that information may never enter a citation context window even when the page itself is in the index.&lt;/p&gt;

&lt;p&gt;The compound effect is that two backends covering "the open web" can diverge on something close to half of mid-frequency queries. The divergence is structural, not random.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fusion-Layer Wildcards
&lt;/h2&gt;

&lt;p&gt;Tavily and Exa deserve their own section because they are increasingly the retrieval layer that AI agents actually depend on, and they break the simple "what crawl do they use" frame.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How Tavily works:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie8wacavbt035j8czrd7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie8wacavbt035j8czrd7.jpeg" alt=" " width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How Exa works:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4905xtjrnfmak5nhm6cn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4905xtjrnfmak5nhm6cn.jpeg" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A traditional search engine like Bing or Google does crawl, build an inverted index, rank with hundreds of signals, and return ranked URLs. Tavily and Exa are different. They crawl too, but they layer on dense retrieval, custom ranking, and explicit reranking optimised for LLM consumption. A page that ranks well on Bing can rank poorly on Tavily for the same query, because Tavily's ranker disagrees with Bing's. Not "is wrong"; disagrees. The two systems optimise different objectives.&lt;/p&gt;

&lt;p&gt;This matters operationally because more and more AI agents (particularly in the developer-tools space and in custom MCP-server stacks) wire in Tavily or Exa rather than going through a public-web search backend. An operator who only monitors citation behaviour on ChatGPT, Gemini, and Claude is missing an entire layer of agent stacks that route their grounding through these API products. For most consumer-facing brands the agent-stack layer is small today. For B2B brands selling to developers, it is not small.&lt;/p&gt;

&lt;p&gt;The asymmetry in observability makes this harder to track. Tavily and Exa do not publish their crawl coverage or their reranking objectives in the way Google and Bing do. The way to see whether your brand is reachable through their backends is to query the API directly. There is no shortcut. A practical rule I have settled on: if a brand sells primarily to developers, Tavily and Exa coverage matters as much as Bing or Google coverage, even though no consumer ever uses Tavily directly. The retrieval layer for the agents the buyers are building is what determines whether the brand shows up in agent-driven workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First-Party Backends Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;There is a layer below the named backends that is increasingly load-bearing and partly private. I want to be careful here because the public documentation is thin. What I can say with confidence is what the observable signatures suggest, framed as observation rather than asserted fact.&lt;/p&gt;

&lt;p&gt;OpenAI's ChatGPT search is grounded on sources that do not look like pure Bing results. The October 2024 launch announcement mentioned "partnerships with news and data providers" alongside search backends. Looking at citation patterns from ChatGPT search across the last six months, I see consistent appearance of certain news domains in patterns that suggest direct ingestion rather than open-web ranking. That is consistent with OpenAI building its own crawl on top of licensed data feeds. I cannot prove it from public documentation alone, but the observable pattern is real, and it is the pattern an operator should expect from a company that has explicitly positioned itself as a Bing competitor.&lt;/p&gt;

&lt;p&gt;Perplexity's own index has grown over time. The product launched as a layer on top of third-party search and has progressively built its own crawl, embedding pipeline, and ranker. The hybrid mix Perplexity ships today is genuinely different from what it shipped two years ago. Tracking Perplexity citations over time, not over a fixed snapshot, is the only reliable approach. Google's grounding pipeline is, by contrast, the most stable and most documented, anchored to Google Search, which itself is the most stable index in the market.&lt;/p&gt;

&lt;p&gt;The general rule: if you are tracking citation behaviour and your data is more than three months old, treat it as suspect. Backend mixes drift, internal crawls expand, third-party partnerships open and close.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Citation Patterns Drift
&lt;/h2&gt;

&lt;p&gt;Once the backend map is in place, drift becomes legible. There are four causes for a brand to be cited on one tool and missing on another, each with a different fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index gap.&lt;/strong&gt; The page is not in the relevant backend's index at all. This is the most common cause and the easiest to verify. Pull the URL into the backend's site search (&lt;code&gt;site:yourbrand.example&lt;/code&gt; on Google, on Bing, on Brave) and see whether it returns. The fix is at the crawl layer: sitemaps, internal linking, robots.txt audit, JS-rendering audit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranking gap.&lt;/strong&gt; The page is indexed but does not rank in the top-N for the relevant queries. Different backends have different top-N cutoffs for what enters the LLM's context window; a typical grounding session pulls between three and ten URLs into the synthesis. A page ranked at twenty is invisible. The fix is the standard SEO playbook for the specific backend, with the caveat that different backends weight the signals differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language and geo gap.&lt;/strong&gt; The page is indexed and ranks in one geo or language but not another. Most common when a brand publishes primarily in English but operates in multiple markets. The fix is genuinely localised content, with &lt;code&gt;hreflang&lt;/code&gt; and locale-specific URLs, not translated boilerplate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Freshness gap.&lt;/strong&gt; The page changed significantly since the last crawl, and the backend's snapshot is stale. AI grounding sessions read the snapshot, not the live page. The fix is a sane sitemap with &lt;code&gt;lastmod&lt;/code&gt; and a hosting setup that does not time out crawler requests.&lt;/p&gt;

&lt;p&gt;A specific and frustrating drift case: a brand's pages get crawled and indexed by every backend except one. The exception backend often has a specific technical incompatibility: a &lt;code&gt;robots.txt&lt;/code&gt; line that named the wrong bot, a CDN rule that returns 403 for that bot's user-agent, an SSR setup that fails for one rendering pipeline. The way to find these is bot-by-bot log analysis. I have lost count of the number of "we are invisible on Claude / on Perplexity / on Gemini" cases that turned out to be a single line in a CDN config.&lt;/p&gt;
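
&lt;p&gt;The bot-by-bot log pass is mundane enough to sketch. This assumes a combined-format access log with the user-agent in the final quoted field and a handful of commonly published bot tokens; adjust the parsing and the token list to whatever your CDN actually emits:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
from collections import Counter

BOT_TOKENS = ["Googlebot", "bingbot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]
status_by_bot = {bot: Counter() for bot in BOT_TOKENS}

# status code, bytes, "referer", "user-agent" at the end of a combined-format line
line_re = re.compile(r'" (\d{3}) \S+ ".*?" "(.*?)"\s*$')

with open("access.log") as log:
    for line in log:
        match = line_re.search(line)
        if not match:
            continue
        status, user_agent = match.groups()
        for bot in BOT_TOKENS:
            if bot.lower() in user_agent.lower():
                status_by_bot[bot][status] += 1

for bot, counts in status_by_bot.items():
    print(bot, dict(counts))   # a bot that only ever sees 403s is your coverage gap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;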

&lt;h2&gt;
  
  
  Practical Monitoring: Which Two Backends First
&lt;/h2&gt;

&lt;p&gt;Nine surfaces is too many to monitor day-to-day for most operators. The question becomes which two or three to start with.&lt;/p&gt;

&lt;p&gt;The framework I have settled on uses traffic profile. For a B2C brand whose customers come primarily through Google search today, the first two surfaces to monitor are Google AI Overview and Gemini, because the Google index is the source of truth for both. The second tier is ChatGPT search and Perplexity, which capture an increasing share of grounded buyer-side queries. The third tier is Claude (Brave-backed) and Grok (X-and-web).&lt;/p&gt;

&lt;p&gt;For a B2B brand selling to developers, the priority shifts. Claude and ChatGPT search rank highest because developers disproportionately use them. Perplexity matters because of its researcher persona. Tavily- and Exa-backed agent stacks matter because the buyer is building those agents. Google AI Overview drops in priority.&lt;/p&gt;

&lt;p&gt;For a brand whose content is primarily video or audio, the priority shifts again. Google has the strongest video content integration and that bias propagates to Gemini and AI Overview. Other backends are uneven. Monitor whichever backend is documented to handle your content type best.&lt;/p&gt;

&lt;p&gt;The mistake operators make most often is testing in only one tool. A single ChatGPT query that cites the brand is taken as "we are visible to AI." A single Gemini query that does not cite the brand is taken as "Gemini is broken for us." Both interpretations are wrong because they do not separate the backend layer from the synthesis layer. The right test is the same query across at least four tools, with the citation set logged for each, on a rotation that catches drift.&lt;/p&gt;

&lt;p&gt;What to log when running this systematically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The exact prompt, with timestamp.&lt;/li&gt;
&lt;li&gt;The tool used and the model version.&lt;/li&gt;
&lt;li&gt;Whether grounding fired (the response includes a tool-call indicator).&lt;/li&gt;
&lt;li&gt;The full citation set returned, with URLs and the cited domains.&lt;/li&gt;
&lt;li&gt;Whether the brand appears in the citation set, and if so, in which position.&lt;/li&gt;
&lt;li&gt;The synthesis-layer mention of the brand (separate from citation), since some tools cite without naming and some name without citing.&lt;/li&gt;
&lt;li&gt;The locale/language the query was run in.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A spreadsheet of those columns over fifty queries across nine tools, repeated monthly, gives you the picture. Anything less and you are guessing.&lt;/p&gt;
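
&lt;p&gt;If a spreadsheet feels too loose, the same columns fit in a small script. A sketch; the field names and the CSV layout are mine, not a standard, and the observation itself still has to be collected by hand or through each tool's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class CitationObservation:
    prompt: str
    tool: str                      # e.g. "chatgpt-search", "gemini", "claude", "perplexity"
    model_version: str
    grounding_fired: bool          # did the response include a search tool call
    cited_urls: list               # full citation set, in order
    brand_cited: bool
    brand_citation_rank: int       # 0 when the brand is not cited
    brand_named_in_answer: bool    # synthesis-layer mention, separate from citation
    locale: str
    observed_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def append_row(path, observation):
    """Append one observation to a CSV, writing the header on first use."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(observation)))
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(asdict(observation))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;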

&lt;h2&gt;
  
  
  Backend Variation Matrix
&lt;/h2&gt;

&lt;p&gt;The matrix below is the practical artefact I use in audits. It is deliberately not a ranking: there is no "best" backend, just different coverage profiles. Where a cell is uncertain or has shifted recently, I have marked it as observation rather than fact.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend / Layer&lt;/th&gt;
&lt;th&gt;Primary AI tools it powers&lt;/th&gt;
&lt;th&gt;Coverage character&lt;/th&gt;
&lt;th&gt;Independence&lt;/th&gt;
&lt;th&gt;Public docs on internals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Google Search&lt;/td&gt;
&lt;td&gt;Gemini grounding, AI Overview&lt;/td&gt;
&lt;td&gt;Largest open-web index, strong multilingual, strong PDF, strong video integration, slow drift&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Strong (&lt;code&gt;groundingChunks&lt;/code&gt;, &lt;code&gt;webSearchQueries&lt;/code&gt;, &lt;code&gt;groundingSupports&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bing (via Grounding with Bing Search)&lt;/td&gt;
&lt;td&gt;Microsoft Copilot, DuckDuckGo (traditional links), historically ChatGPT search&lt;/td&gt;
&lt;td&gt;Large open-web index, strong English, weaker non-English vs Google, fast for high-authority freshness&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Documented Azure AI Foundry tool surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brave Search API&lt;/td&gt;
&lt;td&gt;Claude web search tool, Mistral / Cohere / Together / Snowflake apps per Brave's own claim&lt;/td&gt;
&lt;td&gt;~30bn pages, English-leaning, independent crawl, AI-friendly snippet sizing&lt;/td&gt;
&lt;td&gt;Independent (explicitly not a Bing/Google reseller)&lt;/td&gt;
&lt;td&gt;Strong (API page declares scale and method)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perplexity (Sonar + hybrid)&lt;/td&gt;
&lt;td&gt;Perplexity.ai answers and Search API&lt;/td&gt;
&lt;td&gt;Hundreds of billions of pages claimed; mix of own crawl plus external APIs that has shifted over time&lt;/td&gt;
&lt;td&gt;Hybrid; mix is not stable&lt;/td&gt;
&lt;td&gt;Partial; mix not fully disclosed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI's emerging stack&lt;/td&gt;
&lt;td&gt;ChatGPT search&lt;/td&gt;
&lt;td&gt;Hybrid of partner data feeds plus open web; positioned competitively against Bing&lt;/td&gt;
&lt;td&gt;Hybrid; trending toward independence&lt;/td&gt;
&lt;td&gt;Limited; framed as "real-time web search and partnerships"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xAI Search (web + X)&lt;/td&gt;
&lt;td&gt;Grok&lt;/td&gt;
&lt;td&gt;X-platform-biased, plus open web; X integration unique among major tools&lt;/td&gt;
&lt;td&gt;Hybrid; web provider not publicly disclosed&lt;/td&gt;
&lt;td&gt;Partial; web sources not enumerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You.com index&lt;/td&gt;
&lt;td&gt;You.com search and API&lt;/td&gt;
&lt;td&gt;Own real-time index plus vertical indexes; smaller scale than Bing/Google&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Documented per vertical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kagi (Teclis + TinyGem + metasearch)&lt;/td&gt;
&lt;td&gt;Kagi search (consumer; not a major LLM backend)&lt;/td&gt;
&lt;td&gt;"Small web" focus; metasearch architecture; small relative to dominant indexes&lt;/td&gt;
&lt;td&gt;Hybrid; explicitly metasearch&lt;/td&gt;
&lt;td&gt;Strong (named indexes and API call structure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tavily&lt;/td&gt;
&lt;td&gt;AI agents using Tavily as their retrieval layer&lt;/td&gt;
&lt;td&gt;Custom crawl plus dense retrieval plus reranker, optimised for LLM consumption&lt;/td&gt;
&lt;td&gt;Independent at the retrieval layer&lt;/td&gt;
&lt;td&gt;Limited; mechanism not publicly enumerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exa&lt;/td&gt;
&lt;td&gt;AI agents using Exa as their retrieval layer&lt;/td&gt;
&lt;td&gt;Embedding-based retrieval with custom ranking; positioned as "industry-leading web index built for agents"&lt;/td&gt;
&lt;td&gt;Independent at the retrieval layer&lt;/td&gt;
&lt;td&gt;Limited; mechanism not publicly enumerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Common Crawl (training-time baseline)&lt;/td&gt;
&lt;td&gt;Indirectly underlies many model training corpora&lt;/td&gt;
&lt;td&gt;300bn+ pages, 3–5bn new pages monthly; quarterly published snapshots&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Strong; methodology published&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The matrix is the artefact you carry into a monitoring strategy. It tells you which backends an observed gap implicates, which backends to instrument when you need to verify the gap, and which surface to test against when you ship a fix. The "AI search" abstraction collapses into this matrix the moment you try to do real work on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Synthesis
&lt;/h2&gt;

&lt;p&gt;The single sentence: &lt;em&gt;AI search is a thin synthesis layer over a small set of search backends, and a brand's visibility across AI tools is determined first by which backends index it and rank it, not by anything the model itself does at synthesis time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you only have time to internalise three things from this post, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;There is no "AI search" channel; there are four or five distinct backend channels with different coverage curves.&lt;/strong&gt; Google's index, Bing's index, Brave's index, Perplexity's hybrid index, and the agent-layer retrieval products like Tavily and Exa each see a different web. Treating "AI" as one channel is debugging at the wrong layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick two backends to monitor first based on traffic profile, then expand.&lt;/strong&gt; B2C through Google traffic: Gemini and AI Overview first. B2B to developers: Claude and ChatGPT search first. Test the same fifty brand-queries across all monitored surfaces monthly, and log the citation set for each. The divergence shows up immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When a citation gap appears on one backend and not another, the cause is almost always upstream of the model.&lt;/strong&gt; Index gap, ranking gap, language/geo gap, or freshness gap: each has a different fix and a different team. Engineering work on the crawl layer pays out across every AI tool that shares that backend. Synthesis-layer "AI optimisation" without backend coverage is sand-castle work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The operator who internalises this stops asking "are we visible on AI" and starts asking "which backends index us, which rank us, and which is the cheapest to fix first." That framing change is the win. The engineering follows.&lt;/p&gt;

&lt;p&gt;The backend market is going to keep shifting. OpenAI is building its own crawl. Anthropic's stack is expanding beyond Brave. Perplexity's mix changes annually. Tavily and Exa are growing into the retrieval layer for the agentic web. The brands cited consistently across AI tools in three years will be the ones who treated this as a coverage problem on a moving target.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The backend taxonomy in this post is my synthesis of public mechanism documentation: &lt;a href="https://ai.google.dev/gemini-api/docs/grounding" rel="noopener noreferrer"&gt;Google's grounding-with-search documentation&lt;/a&gt;, &lt;a href="https://brave.com/search/api/" rel="noopener noreferrer"&gt;Brave's Search API documentation&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/foundry/agents/how-to/tools/bing-tools" rel="noopener noreferrer"&gt;Microsoft's Grounding with Bing Search service docs on Azure AI Foundry&lt;/a&gt;, &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude/web-search" rel="noopener noreferrer"&gt;Google's Vertex AI documentation for Anthropic partner models&lt;/a&gt; (which is where Brave is named as the third-party search provider behind Claude's web search tool), &lt;a href="https://duckduckgo.com/duckduckgo-help-pages/results/sources" rel="noopener noreferrer"&gt;DuckDuckGo's results-sources help page&lt;/a&gt;, &lt;a href="https://help.kagi.com/kagi/search-details/search-sources.html" rel="noopener noreferrer"&gt;Kagi's search-sources documentation&lt;/a&gt;, &lt;a href="https://docs.x.ai/developers/tools/web-search" rel="noopener noreferrer"&gt;xAI's web search tool documentation&lt;/a&gt;, &lt;a href="https://openai.com/index/introducing-chatgpt-search/" rel="noopener noreferrer"&gt;OpenAI's announcement of ChatGPT search&lt;/a&gt;, and &lt;a href="https://commoncrawl.org" rel="noopener noreferrer"&gt;Common Crawl's published methodology&lt;/a&gt;, combined with observable behaviour across my own monthly cross-tool citation audits over the past six months. Where I have written "in my testing" or "the pattern I observe," that is exactly what I mean. The provider-internal index claims for OpenAI and Perplexity are framed as observation because public documentation does not enumerate the backend mix. Provider behaviour is moving; backend mixes drift on a quarterly cadence; verify against current docs before shipping a strategy.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Published by Cihangir Bozdogan&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>AI Search Crawlers Are Curl From 1998, Not Chrome. Your SPA Is Invisible and Here Is the Mechanism.</title>
      <dc:creator>Cihangir Bozdogan</dc:creator>
      <pubDate>Sun, 03 May 2026 20:47:39 +0000</pubDate>
      <link>https://forem.com/cihangir_bozdogan_76b8c99/ai-search-crawlers-are-curl-from-1998-not-chrome-your-spa-is-invisible-and-here-is-the-mechanism-2pgg</link>
      <guid>https://forem.com/cihangir_bozdogan_76b8c99/ai-search-crawlers-are-curl-from-1998-not-chrome-your-spa-is-invisible-and-here-is-the-mechanism-2pgg</guid>
      <description>&lt;p&gt;There is a meeting that happens at almost every web team I have audited where someone says "we render fine, our React app renders fine for crawlers." Most of the time this is wrong. The kind of crawler the team has in mind is Googlebot, which has shipped a JavaScript-rendering second-pass since around 2019. The kind of crawler that actually decides whether a site appears in ChatGPT, Claude, Perplexity, or any of the AI search products is not Googlebot. It is GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot and these things behave like &lt;code&gt;curl&lt;/code&gt; from 1998. They issue an HTTP GET, parse the HTML they get back, and that is the entire interaction. JavaScript is not executed. Hydration does not happen. The single-page-app shell with &lt;code&gt;&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; is what gets indexed.&lt;/p&gt;

&lt;p&gt;I worked through this the slow way. Set up six identical-looking sites: a Next.js SSR build, a Next.js SSG build, a Vite SPA with no SSR, a Remix SSR build, a static HTML page, and a hybrid where above-the-fold is server-rendered and below-the-fold lazy-loads. Pointed each AI search platform at each. Watched what they could and could not see. The result is not subtle: the SPA-with-no-SSR is functionally invisible to half the AI ecosystem, and the lazy-loaded content is invisible to all of it.&lt;/p&gt;

&lt;p&gt;This post is the field report and the mechanism. The two-mode spectrum AI crawlers occupy. What each of the six relevant crawlers actually does on the wire. The hydration cliff that decides what the model sees. The five failure modes I now flag in technical audits. And the patterns that work: the SSR-or-SSG sweet spot that costs almost nothing to ship and changes whether the AI ecosystem can see you at all.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Command page rendering flows&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmsgtidtq77mamaxz919.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmsgtidtq77mamaxz919.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Static site generation&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxey9tc02fqujnmaqnaty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxey9tc02fqujnmaqnaty.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Two-Mode Spectrum
&lt;/h2&gt;

&lt;p&gt;Web crawlers exist on a spectrum from "raw HTTP fetcher" to "full headless browser with JavaScript execution." The endpoints of the spectrum are not philosophical. They cost different amounts of money to operate and they produce different views of the page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw HTTP fetcher.&lt;/strong&gt; A program that issues an HTTP GET, receives the response, parses HTML, follows links via the parsed &lt;code&gt;href&lt;/code&gt; attributes. No JavaScript is executed. No CSS is laid out. No images are decoded. The cost per page is a few milliseconds plus the network round-trip. The throughput is whatever the operator's bandwidth and target rate-limiting allows. This is &lt;code&gt;curl&lt;/code&gt; plus an HTML parser. Most AI crawlers live near this end of the spectrum.&lt;/p&gt;
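
&lt;p&gt;To make the "curl plus an HTML parser" point concrete, here is roughly all the machinery a raw fetcher needs, as a sketch; the user-agent string is illustrative, not any operator's real token:&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextAndLinks(HTMLParser):
    """Collect visible text and href attributes. No JS, no CSS, no rendering."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def fetch(url):
    req = Request(url, headers={"User-Agent": "ExampleBot/1.0"})   # illustrative UA
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    parser = TextAndLinks()
    parser.feed(html)
    return parser   # parser.text is everything an HTML-only crawler can "see"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;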

&lt;p&gt;&lt;strong&gt;Headless browser.&lt;/strong&gt; A program that does everything a real browser does, headlessly. JavaScript executes, network requests fan out, the DOM gets mutated, eventually a render tree settles. The cost per page is hundreds of milliseconds to multiple seconds, depending on how heavy the page is. The throughput is one to two orders of magnitude lower than the raw fetcher. Memory cost is much higher. Googlebot's render queue, Bingbot's modern incarnation, and a handful of search-tool products live at this end.&lt;/p&gt;

&lt;p&gt;The gap between the two modes is where the AI ecosystem currently sits: not because nobody knows how to run a headless browser, but because the cost-benefit at training-corpus or live-fetch scale is decisively against it. OpenAI is not going to run a headless browser to fetch a site on every ChatGPT search. The latency is too high. The cost is too high. The volume is too high.&lt;/p&gt;

&lt;p&gt;The implication is direct. If the part of your page that says what your business does is rendered by JavaScript that runs after the document loads, the AI crawler that fetches your page does not see it. There is no second-pass render queue at OpenAI, Anthropic, or Perplexity. There is one pass. Whatever is in the HTML at first-paint is everything the model gets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Six Crawlers, Tested
&lt;/h2&gt;

&lt;p&gt;Six crawlers do almost all the AI-relevant work. Their User-Agents, their robots.txt declarations, and their JS-execution behaviour are public information for the most part. Where I have observed behaviour beyond what the docs say, I have flagged it.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPTBot (OpenAI training crawler)
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot&lt;/code&gt;. OpenAI publishes its IP ranges. GPTBot's job is to harvest content for training data, not to fetch live for a specific user query. It respects robots.txt and respects &lt;code&gt;User-agent: GPTBot&lt;/code&gt; &lt;code&gt;Disallow: /&lt;/code&gt; if you set it.&lt;/p&gt;

&lt;p&gt;The interaction shape is HTML-only. No JavaScript execution. The content GPTBot acquires is whatever the server returns to a plain GET: first-paint HTML, server-rendered or static, plus any inline JSON-LD. Anything rendered after document-ready is invisible. Anything fetched from a downstream API by client-side JavaScript is invisible. The crawler acts like an HTTP fetcher, not a browser.&lt;/p&gt;

&lt;p&gt;The implication for training inclusion is mechanical: if GPTBot cannot see your content in HTML, your content does not enter the training corpus through this path. There are other paths (Common Crawl, licensed datasets) but for the part you control directly, this is the gate.&lt;/p&gt;

&lt;h3&gt;
  
  
  OAI-SearchBot (OpenAI live-fetch crawler)
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot&lt;/code&gt;. Same operator as GPTBot, different role. OAI-SearchBot is the fetcher used at ChatGPT-search inference time: the user asks a question, the model decides to search, candidate URLs come back from the Bing-backed retrieval, and OAI-SearchBot fetches a handful of them in parallel.&lt;/p&gt;

&lt;p&gt;This crawler operates under a much tighter latency budget than GPTBot. The user is waiting on an answer. The fetcher cannot afford to render. JavaScript is not executed. Robots.txt is honoured.&lt;/p&gt;

&lt;p&gt;There is a subtlety here that catches operators. ChatGPT search candidates come from Bing's index. Bingbot does execute JavaScript. So a JavaScript-only site can be in Bing's index, and therefore in the candidate set ChatGPT search ranks against, but when OAI-SearchBot tries to live-fetch that page to get content for the answer, it gets the empty shell. The candidate ranks. The content does not appear in the cited answer. The site is in the SERP but invisible at synthesis time.&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatGPT-User (the "browse" UA)
&lt;/h3&gt;

&lt;p&gt;User-Agent contains &lt;code&gt;ChatGPT-User&lt;/code&gt;. This is the fetcher used when a user explicitly asks ChatGPT to browse a specific URL ("can you summarise this page for me?"). It is allowed to do slightly more than OAI-SearchBot in some configurations (limited rendering), but I have not seen it execute arbitrary JS reliably. Treat it the same as OAI-SearchBot for planning purposes: HTML-only is the safe assumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  ClaudeBot (Anthropic crawler)
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)&lt;/code&gt;. Used for harvesting training data. HTML-only. Respects robots.txt. Behaviour matches GPTBot more than it matches anyone else: modest crawl rate, conservative on server load, predictable.&lt;/p&gt;

&lt;p&gt;There is a separate UA for Claude's web search tool when it is configured (Anthropic's docs are still maturing on this), but the same pattern applies: at inference time, when a model is fetching live, JavaScript is not in the budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  PerplexityBot (Perplexity crawler)
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot&lt;/code&gt;. Perplexity operates a hybrid retrieval: they pull from Bing and from their own crawler. PerplexityBot is the latter. HTML-only. Their robots.txt compliance has been a source of friction in the press; the documented behaviour is that they respect it, and the controversies have been around whether they always have.&lt;/p&gt;

&lt;p&gt;Behavioural note from observed traffic: PerplexityBot is more aggressive than GPTBot or ClaudeBot on revisit frequency. Pages that update often see PerplexityBot more frequently. Pages that are stable see all three at similar cadences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bingbot
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm&lt;/code&gt;. Microsoft's primary search crawler. Respects robots.txt. &lt;strong&gt;Executes JavaScript&lt;/strong&gt; via a Chromium-based renderer. This puts Bingbot at the headless-browser end of the spectrum, alongside Googlebot.&lt;/p&gt;

&lt;p&gt;Bingbot matters in the AI conversation more than the Microsoft branding suggests, because ChatGPT search and Copilot both depend on Bing's index. If your site is JavaScript-only and Bingbot can render it, you can appear in ChatGPT search candidates, but as noted under OAI-SearchBot, the live-fetch step at inference time still cannot render. So Bingbot indexes you, ChatGPT ranks you, OAI-SearchBot tries to fetch you for the cited content, and gets nothing useful. The candidate ranking and the citation content are decoupled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Googlebot
&lt;/h3&gt;

&lt;p&gt;User-Agent: &lt;code&gt;Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Googlebot/2.1; +http://www.google.com/bot.html&lt;/code&gt;. Google's two-pass crawler: first pass is an HTML fetch, second pass is a render-queue pass with a Chromium-based renderer. Respects robots.txt. &lt;strong&gt;Executes JavaScript&lt;/strong&gt;, but on a delay: the render queue can lag the initial fetch by minutes to days.&lt;/p&gt;

&lt;p&gt;Googlebot is important for AI in the same indirect way Bingbot is. Gemini and Google AI Overview depend on Google's index. The render queue means JavaScript-rendered content does eventually get indexed, but the Gemini and AI Overview live fetcher at inference time has the same constraint as OAI-SearchBot: no JavaScript at synthesis. The same decoupling fires.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Summary Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crawler&lt;/th&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;JS execution&lt;/th&gt;
&lt;th&gt;Robots.txt&lt;/th&gt;
&lt;th&gt;Latency profile&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPTBot&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Training data&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Patient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OAI-SearchBot&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Live fetch&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Tight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT-User&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;User-triggered browse&lt;/td&gt;
&lt;td&gt;No (effectively)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Tight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClaudeBot&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Training / fetch&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Patient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PerplexityBot&lt;/td&gt;
&lt;td&gt;Perplexity&lt;/td&gt;
&lt;td&gt;Hybrid index&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (with caveats)&lt;/td&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bingbot&lt;/td&gt;
&lt;td&gt;Microsoft&lt;/td&gt;
&lt;td&gt;Search index&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Googlebot&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Search index&lt;/td&gt;
&lt;td&gt;Yes (second pass)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Patient first pass, delayed render&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is sharp: every crawler that fetches live for an AI synthesis call is HTML-only. Every crawler that builds a search index that AI products rank against may render JavaScript, but the JS-rendered content is only useful for ranking, not for the cited content the model actually emits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hydration Cliff
&lt;/h2&gt;

&lt;p&gt;If the rest of this post had a single image, it would be the hydration cliff. The picture is: a React app renders in three stages. Stage one is the HTML shell delivered by the server. Stage two is the JavaScript bundle being loaded and parsed. Stage three is the React tree mounting, fetching data, and rendering. To a user, the three stages compress to "the page loads." To an HTML-only crawler, only stage one exists.&lt;/p&gt;

&lt;p&gt;To make this concrete, here is what &lt;code&gt;curl&lt;/code&gt; sees against a stock Vite + React SPA with the production build:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ curl -s &lt;a href="https://example.com/products/widget-pro" rel="noopener noreferrer"&gt;https://example.com/products/widget-pro&lt;/a&gt; | head -30&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"icon"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"image/svg+xml"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/vite.svg"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"viewport"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"width=device-width, initial-scale=1.0"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Widget Pro&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"module"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/assets/index-a1b2c3d4.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is what GPTBot, OAI-SearchBot, ClaudeBot, and PerplexityBot all see. The product name, the description, the price, the availability, the JSON-LD, the reviews: none of it is in the response. None of it enters the AI ecosystem through this fetch.&lt;/p&gt;

&lt;p&gt;The same page, server-rendered (Next.js with &lt;code&gt;getServerSideProps&lt;/code&gt; or App Router with appropriate caching):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;$ curl -s &lt;a href="https://example.com/products/widget-pro" rel="noopener noreferrer"&gt;https://example.com/products/widget-pro&lt;/a&gt; | head -50&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Widget Pro Industrial-grade widget rated for 50,000 cycles&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"Widget Pro is rated for 50,000 cycles, ships from Berlin..."&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"application/ld+json"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@context&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://schema.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Widget Pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;description&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Industrial-grade widget...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;offers&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Offer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;299.00&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;priceCurrency&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EUR&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;main&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Widget Pro&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;Industrial-grade widget rated for 50,000 cycles. Ships from Berlin warehouse.&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;p&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;€299.00 In stock&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
      ...
    &lt;span class="nt"&gt;&amp;lt;/main&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same React app, same components, same product. The difference between "invisible" and "fully indexable" is whether the rendering happens on the server before the response is sent or on the client after the response is sent. To the AI ecosystem, that is the difference between not existing and existing.&lt;/p&gt;

&lt;p&gt;The trap is that the JavaScript-only version does the right thing for users. It loads in a few hundred milliseconds, it is interactive, it is fast on subsequent navigations because of client-side routing. The user experience is fine. The crawler experience is empty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Failure Modes
&lt;/h2&gt;

&lt;p&gt;These are the patterns I now flag immediately in any audit, ordered by how often they show up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 1: Pure CSR with no SSR fallback.&lt;/strong&gt; The site is a Vite, Create React App, or Angular CLI build with a near-empty &lt;code&gt;index.html&lt;/code&gt; shell. Every page on the site is rendered client-side from the same shell. Title and description are set client-side via JavaScript. AI crawlers see the same empty shell on every URL. The fix is to migrate to a framework with SSR or SSG capability (Next.js, Remix, SvelteKit, Astro, Nuxt) or to add a pre-render step (&lt;code&gt;react-snap&lt;/code&gt; or similar) that emits static HTML for known routes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 2: Hydration boundary on critical content.&lt;/strong&gt; The site has SSR, but the critical content (product description, article body, business hours) is inside a &lt;code&gt;&amp;lt;Suspense&amp;gt;&lt;/code&gt; boundary or a &lt;code&gt;&amp;lt;ClientOnly&amp;gt;&lt;/code&gt; wrapper that defers rendering until hydration. AI crawlers see a loading spinner or an empty container where the content should be. The fix is to move the critical content out of the deferred boundary. Defer the comments, the related-products carousel, the live-availability widget, not the product name or the article text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 3: Lazy-loaded above-the-fold content.&lt;/strong&gt; The site uses &lt;code&gt;loading="lazy"&lt;/code&gt; for content that should be visible immediately, including text content rendered conditionally based on viewport intersection. AI crawlers do not run an Intersection Observer. They do not scroll. Anything gated on scroll position never appears. The fix is &lt;code&gt;loading="lazy"&lt;/code&gt; for images that genuinely live below the fold; everything else stays eager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode 4: CDN edge cache serving stale JSON-LD.&lt;/strong&gt; The site has perfectly valid SSR with rich JSON-LD, but the CDN edge cache has the version from the last deploy that had a bug: the JSON-LD references the wrong product, the price is wrong, the availability is stale. AI crawlers ingest the stale data and the model emits answers with the stale data weeks after deploy. The fix is purposeful cache invalidation on JSON-LD-affecting deploys, ideally with surrogate-key invalidation tied to the entities the page renders.&lt;/p&gt;
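
&lt;p&gt;A sketch of the check that catches this one: pull the JSON-LD out of the HTML as served through the CDN and diff the fields that go stale. The URL and expected values are placeholders; in practice the expectations come from the product database:&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import re
from urllib.request import urlopen

# Simplified: assumes the type attribute is written exactly like this.
JSONLD = re.compile(r'&amp;lt;script type="application/ld\+json"&amp;gt;(.*?)&amp;lt;/script&amp;gt;', re.S)

def served_product(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    for block in JSONLD.findall(html):
        data = json.loads(block)
        if data.get("@type") == "Product":
            return data
    return {}

expected = {"price": "299.00", "priceCurrency": "EUR"}      # placeholder source of truth
offers = served_product("https://example.com/products/widget-pro").get("offers", {})

for key, want in expected.items():
    got = offers.get(key)
    if got != want:
        print(f"stale {key}: edge serves {got!r}, source of truth says {want!r}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;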

&lt;p&gt;&lt;strong&gt;Failure Mode 5: robots.txt that selectively blocks AI crawlers.&lt;/strong&gt; Someone in the past read a Hacker News thread about AI training and added &lt;code&gt;User-agent: GPTBot&lt;/code&gt; plus &lt;code&gt;User-agent: ClaudeBot&lt;/code&gt; plus &lt;code&gt;User-agent: PerplexityBot&lt;/code&gt; plus &lt;code&gt;Disallow: /&lt;/code&gt; to robots.txt. This was perhaps a defensible position when the question was "do I want my content used for training." It is not a defensible position when the question is "do I want my content cited in AI answers." The training crawlers and the live-fetch crawlers are often the same agent or sibling agents from the same operator. Blocking them blocks the citation path. The fix is to decide what you actually want: if it is "no training but yes citation," you need to allow the live-fetch user-agents and disallow the training UA; if it is "everything blocked," own that and stop expecting AI visibility.&lt;/p&gt;
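
&lt;p&gt;The fastest way to see what the file actually says to each crawler is to ask the stdlib parser rather than reading the rules by eye. A sketch; the bot tokens are the documented ones, the domain is a placeholder:&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.robotparser import RobotFileParser

BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
        "PerplexityBot", "Googlebot", "bingbot"]

rp = RobotFileParser("https://example.com/robots.txt")   # placeholder domain
rp.read()

# A "no training but yes citation" policy should show GPTBot and ClaudeBot blocked
# while OAI-SearchBot and ChatGPT-User stay allowed. Anything else is usually an accident.
for bot in BOTS:
    allowed = rp.can_fetch(bot, "https://example.com/products/widget-pro")
    print(f"{bot:16} {'allowed' if allowed else 'blocked'}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;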

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;Three patterns survive contact with all six crawlers and all five failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server-side rendering.&lt;/strong&gt; The simplest and most durable pattern. Every page returns first-paint HTML with all critical content present. Hydration happens on top of populated content, not in place of it. Frameworks that ship this out of the box: Next.js (App Router or Pages with &lt;code&gt;getServerSideProps&lt;/code&gt;), Remix, Nuxt, SvelteKit, Astro. The performance cost is a server round-trip per page, mitigated by edge rendering and caching. The visibility benefit is binary: with SSR, AI crawlers see your content; without it, they do not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static-site generation.&lt;/strong&gt; A subset of SSR where the rendering happens at build time and the output is plain HTML files. Even simpler than runtime SSR. Works for content that does not change per request most marketing pages, most blog content, most product detail pages with infrequently updated availability. Frameworks: Next.js (&lt;code&gt;generateStaticParams&lt;/code&gt;, &lt;code&gt;getStaticProps&lt;/code&gt;), Astro, Hugo, Gatsby, 11ty. AI-crawler-perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid with a clear boundary.&lt;/strong&gt; SSR or SSG for the content that needs to be visible, client-side for the interactive widgets that do not. Article body server-rendered; comments client-rendered. Product page server-rendered; reviews client-rendered. The key is that the content determining what your business is, or what an article is about, must be in first-paint HTML. The interactive layer can come on top.&lt;/p&gt;

&lt;p&gt;The pattern that does not work, and this is where I see teams burn the most time, is the half-SSR build where some routes are server-rendered and some are not. AI crawlers do not infer your routing convention. They fetch URLs they discover in sitemaps, internal links, and external mentions. If a fraction of those URLs return rich HTML and the rest return shells, the visibility becomes lottery-distributed. Either commit to "every public URL is server-rendered or static" or commit to "we are invisible to AI crawlers and fine with that." There is no middle ground that produces predictable AI visibility.&lt;/p&gt;
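
&lt;p&gt;The audit that catches the half-SSR build is mechanical: walk the sitemap, fetch each URL the way the HTML-only crawlers do, and flag any page whose first-paint HTML is a near-empty shell. A sketch, assuming a single-level &lt;code&gt;sitemap.xml&lt;/code&gt; and a crude word-count threshold (a known marker string per template is more reliable):&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
from urllib.request import Request, urlopen

def get(url):
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (ssr-audit)"})
    return urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

sitemap = get("https://example.com/sitemap.xml")               # placeholder domain
urls = re.findall(r"&amp;lt;loc&amp;gt;(.*?)&amp;lt;/loc&amp;gt;", sitemap)

for url in urls:
    html = get(url)
    body = re.sub(r"&amp;lt;script.*?&amp;lt;/script&amp;gt;", "", html, flags=re.S)   # drop script payloads
    text = re.sub(r"&amp;lt;[^&amp;gt;]+&amp;gt;", " ", body)                         # strip tags, keep text
    words = len(text.split())
    flag = "SHELL?" if words &amp;lt; 150 else "ok"                            # threshold: tune per site
    print(f"{flag:7}{words:6} words  {url}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;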

&lt;h2&gt;
  
  
  The Synthesis
&lt;/h2&gt;

&lt;p&gt;The mental model engineers carry of "modern web crawler" is twenty years out of date in two directions at once. It overestimates the capability of the AI crawlers (assuming they render like Chrome) and underestimates the capability of Bingbot and Googlebot (which actually do). The result is decisions that optimise for the wrong set of crawlers.&lt;/p&gt;

&lt;p&gt;The single sentence I tell anyone shipping a public site that wants to be cited by AI: &lt;em&gt;every AI crawler that fetches live at inference time is HTML-only, and every piece of content that depends on JavaScript to appear is invisible to the citation pipeline regardless of what Bingbot or Googlebot can do.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you only have time to internalise three things from this post, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Live-fetch AI crawlers do not run JavaScript.&lt;/strong&gt; GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, ChatGPT-User. The page they see is the first-paint HTML, nothing else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bingbot and Googlebot are the exception, and they only help with ranking, not with citation content.&lt;/strong&gt; They render JS, your site appears in the candidate set, but the live fetcher that grabs content for the answer cannot render. The decoupling is invisible until you measure for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSR or SSG is not optional for AI visibility.&lt;/strong&gt; It is the gate. Pure-CSR sites are functionally invisible to half the AI ecosystem and partially invisible to the other half.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else (the failure modes, the framework picks, the cache invalidation discipline, the robots.txt nuances) is implementation detail layered over those three. The mechanism is unglamorous: HTML in, HTML out, and a vanishingly small fraction of the AI ecosystem is willing to spend a Chromium-instance worth of compute to recover what your client-side JavaScript would have produced.&lt;/p&gt;

&lt;p&gt;The agentic web is being built on top of HTTP/1.1 and HTML parsers. It looks more like the web of 1998 than the web of 2018. If you treat it that way and ship server-rendered or static HTML, your site is visible. If you treat it like 2018 and rely on the browser, your site is invisible, and the marketing claim that "AI search is the new SEO" lands somewhere uncomfortable for your traffic.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The User-Agent strings in the per-crawler section are taken from each operator's published documentation as of writing: &lt;a href="https://platform.openai.com/docs/bots" rel="noopener noreferrer"&gt;OpenAI's bot documentation&lt;/a&gt;, &lt;a href="https://support.anthropic.com/en/articles/8896518" rel="noopener noreferrer"&gt;Anthropic's ClaudeBot documentation&lt;/a&gt;, &lt;a href="https://docs.perplexity.ai/guides/bots" rel="noopener noreferrer"&gt;Perplexity's bot documentation&lt;/a&gt;, Microsoft's &lt;a href="https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0" rel="noopener noreferrer"&gt;Bingbot reference&lt;/a&gt;, and Google's &lt;a href="https://developers.google.com/search/docs/crawling-indexing/googlebot" rel="noopener noreferrer"&gt;Googlebot reference&lt;/a&gt;. JavaScript-execution behaviour is from official documentation where available and observed traffic where not. Individual UAs are versioned and may have shifted by the time you read this; verify against the operator's current docs before shipping a robots.txt change. The "no JavaScript at inference-time fetchers" finding is consistent across every public statement and every observed log line I have seen, but I cannot rule out that a specific operator runs a small headless-browser fleet for a small fraction of fetches; if such a fleet exists, it does not change the production-planning conclusion.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Published by Cihangir Bozdogan&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>javascript</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Reverse-Engineered ChatGPT's Retrieval Stack. The Bottleneck Isn't What You Think.</title>
      <dc:creator>Cihangir Bozdogan</dc:creator>
      <pubDate>Wed, 29 Apr 2026 20:16:29 +0000</pubDate>
      <link>https://forem.com/cihangir_bozdogan_76b8c99/i-reverse-engineered-chatgpts-retrieval-stack-the-bottleneck-isnt-what-you-think-2mk3</link>
      <guid>https://forem.com/cihangir_bozdogan_76b8c99/i-reverse-engineered-chatgpts-retrieval-stack-the-bottleneck-isnt-what-you-think-2mk3</guid>
      <description>&lt;p&gt;ChatGPT cites its sources. You see the neat little &lt;code&gt;[1]&lt;/code&gt;, &lt;code&gt;[2]&lt;/code&gt; markers, and the implicit message is: &lt;em&gt;the model went out, looked at the web, brought back evidence, and is showing you receipts.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That story is half right. The other half is what every team building a RAG system gets wrong.&lt;/p&gt;

&lt;p&gt;There is no single retrieval system inside ChatGPT. There are at least two: a parametric one frozen in the weights and a live one that fires only sometimes, plus a tool layer deciding which to invoke, plus a generation step that has to reconcile them when they disagree. Almost none of it is published in detail. Some is confirmed by OpenAI and Microsoft. Some is inferred from leaked system-prompt fragments and citation studies. A lot is just observable behavior if you poke it with enough queries.&lt;/p&gt;

&lt;p&gt;I spent a week tracing the pipeline. What follows is an engineer's reading of how it actually works the two channels, the eight-step pipeline, the tool layer, and the one finding that should change how you build your own retrieval system.&lt;/p&gt;

&lt;p&gt;Spoiler for the impatient: &lt;strong&gt;the bottleneck is not the LLM, and it is not the embedding model. It is the rerank step.&lt;/strong&gt; I'll get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Channels, One Voice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyv4ta4ywov3hcp2m7a0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyv4ta4ywov3hcp2m7a0.jpeg" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Every ChatGPT response is the output of a model with access to two completely different sources of information. The model does not always tell you which one produced the sentence you're reading.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;training corpus&lt;/strong&gt; is frozen at the knowledge cutoff. It's parametric — what the model "knows" lives in weights, not as a list of URLs it can point at. That corpus is enormous and heterogeneous: a slice of &lt;a href="https://commoncrawl.org" rel="noopener noreferrer"&gt;Common Crawl&lt;/a&gt;, licensed publisher content, public code, and, since 2024, Reddit, via the formal &lt;a href="https://openai.com/index/openai-and-reddit-partnership/" rel="noopener noreferrer"&gt;OpenAI/Reddit data partnership&lt;/a&gt;. Anything that comes from this channel has no source URL attached. The model can recite a fact; it cannot tell you where in training it saw the fact.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;live retrieval channel&lt;/strong&gt; is different. When the browser tool fires, the model issues real search queries, fetches real pages, and the URLs travel with the content into the context window. This is the channel that produces the bracketed citations.&lt;/p&gt;

&lt;p&gt;Here's the part that should bother you more than it does: &lt;strong&gt;the model does not consistently disclose which channel produced any given answer.&lt;/strong&gt; Ask "what's the latest version of X?" and you might get a freshly retrieved answer with citations, or you might get a confident, plausible answer pulled from training-time memory of an older release, with no citations and no signal that retrieval was skipped. Same formatting. Same tone. Only one is right.&lt;/p&gt;

&lt;p&gt;We come back to this. It's the most engineering-relevant idea in the whole stack, and the one ChatGPT itself handles worst.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline, End to End
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0awmzgs9nczdzsmym0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0awmzgs9nczdzsmym0.jpeg" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;br&gt;
Reverse-engineered from observed behavior, OpenAI/Microsoft attestations, and citation studies, the live-retrieval pipeline runs roughly eight steps. Some implementations probably collapse steps. Some parallelize. The logical sequence is consistent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query rewriting and decomposition.&lt;/strong&gt; Your prompt is rarely a good search query. The model rewrites it, sometimes into parallel queries that decompose a multi-part question. "Compare X and Y on Z" becomes two or three independent retrieval calls, fused later. This happens inside the model itself and is cheap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search API call.&lt;/strong&gt; Confirmed: the primary backend is the &lt;strong&gt;Bing Web Search API&lt;/strong&gt;, a consequence of the OpenAI/Microsoft commercial relationship. Anything missing from Bing's index simply cannot be cited via this channel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Result fetching.&lt;/strong&gt; From the ranked URL list, the system fetches a small number of pages. Small is the operative word: a handful, not dozens. The fetch is parallelized, so wall-clock cost is set by the slowest tail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page parsing.&lt;/strong&gt; Each fetched page is converted from HTML to clean text. This is where the &lt;strong&gt;render gap&lt;/strong&gt; bites. JS-heavy SPAs, late-binding hydration, content rendered after &lt;code&gt;DOMContentLoaded&lt;/code&gt;: none of it is reliably visible to a server-side fetcher. Paywalled and robots-blocked pages disappear here too. OpenAI's crawler &lt;code&gt;OAI-SearchBot&lt;/code&gt; is the publicly confirmed user agent; sites that block it block themselves out of citation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunking.&lt;/strong&gt; Long pages are split into smaller passages. Standard RAG concerns apply: chunk size, overlap, semantic boundaries. Bad chunking destroys grounding even when the right page got fetched. The relevant passage gets cut down the middle, and neither half scores well alone (a minimal splitter sketch follows the pipeline walkthrough).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Re-ranking and selection.&lt;/strong&gt; From the chunks, a smaller set is selected for the final context. This is the stage that decides what becomes citation-worthy, and it is almost certainly handled by a model either the main LLM in a separate scoring pass or a smaller dedicated ranker. The exact architecture is undisclosed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context assembly.&lt;/strong&gt; Selected chunks are injected into the prompt alongside their source URLs. The &lt;code&gt;[1]&lt;/code&gt;, &lt;code&gt;[2]&lt;/code&gt; markers are downstream of this — chunks are paired with URLs so the generation step can attribute correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Generation with citation tagging.&lt;/strong&gt; The model produces the final answer in a single forward pass, emitting citation markers tied to the assembled chunks. Mapping a generated span back to the chunk that justified it is non-trivial: the model has to do an implicit alignment between what it's saying and what it was given. When that alignment is wrong, you get the well-known failure mode: a citation that doesn't actually support its claim.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A compact way to see the whole sequence: &lt;code&gt;query → rewrite → search → fetch → parse → chunk → rerank → assemble → generate&lt;/code&gt;. Every step has a budget and a failure mode. Every step throws away information the next step could have used. That it works at all is a quiet engineering accomplishment.&lt;/p&gt;
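
&lt;p&gt;Step 5 is the one most often underestimated, so here is the splitter failure in miniature. A fixed-size-with-overlap chunker is a minimal sketch of the idea (production splitters respect sentence and heading boundaries); nothing here claims to describe OpenAI's chunker:&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def chunk(text, size=800, overlap=120):
    """Fixed-size character chunks. The overlap exists so that a passage which
    straddles a boundary still appears whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

passage = "A dense retriever maps queries and passages into one vector space. " * 30
pieces = chunk(passage, size=300, overlap=60)

# With overlap=0, a sentence cut at a boundary scores poorly in both halves at the
# rerank step; the overlap guarantees at least one chunk contains it intact.
print(len(pieces), [len(p) for p in pieces[:3]])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;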

&lt;h2&gt;
  
  
  The Tool Layer (and Why You Should Read Leaked Prompts Sideways)
&lt;/h2&gt;

&lt;p&gt;Above the pipeline sits the tool surface the model actually calls. The model itself doesn't make HTTP requests. It emits structured tool calls; a runtime executes them and returns results.&lt;/p&gt;

&lt;p&gt;Two surfaces dominate: a &lt;code&gt;browser&lt;/code&gt; tool with sub-actions like opening a URL, fetching a page, following a link; and a &lt;code&gt;web.run&lt;/code&gt; family that issues searches and returns ranked candidates. The model decides when to call each, with what arguments, how many times. From outside, it looks like a small set of structured function calls (open, search, fetch, read) with the LLM as the decision-maker.&lt;/p&gt;

&lt;p&gt;Leaked system-prompt material shows consistent themes. &lt;em&gt;Open multiple results in parallel. Cite all sources used. Prefer recent sources for time-sensitive queries. Handle disagreements between sources explicitly.&lt;/em&gt; I'm paraphrasing deliberately: the leak provenance is messy, and any specific snapshot's wording may not reflect current production.&lt;/p&gt;

&lt;p&gt;The best public archive is &lt;a href="https://github.com/jujumilk3/leaked-system-prompts" rel="noopener noreferrer"&gt;&lt;code&gt;jujumilk3/leaked-system-prompts&lt;/code&gt;&lt;/a&gt;, which collects historical snapshots from many vendors. Treat it as primary-source material: useful for the &lt;em&gt;shape&lt;/em&gt; of instructions in production prompts, not as a reliable transcript of any current system. OpenAI does not publish the full browser-tool prompt. Any individual snippet may be inaccurate, partial, or out of date.&lt;/p&gt;

&lt;p&gt;The hygiene rule when reasoning from these leaks: infer &lt;strong&gt;patterns&lt;/strong&gt;, not wording. Categories, ordering, hierarchy are stable across snapshots. Exact phrasing isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bing and the Google-Shaped Mystery
&lt;/h2&gt;

&lt;p&gt;The choice of Bing as primary backend is a confirmed mechanism, and the reason is not technical excellence. It's commercial. OpenAI and Microsoft have a deep, well-publicized relationship, and Bing's Web Search API is the natural surface to plug into.&lt;/p&gt;

&lt;p&gt;The trade-off is index coverage. Bing is competitive on mainstream content. On long-tail, niche, or freshly published content, it still trails Google in many domains. A page that's hours old may not be in the index ChatGPT can query. Inventing a specific lag-hour figure would be irresponsible; the directional claim is what matters: Bing-only retrieval has a freshness ceiling.&lt;/p&gt;

&lt;p&gt;This is where the most interesting public test comes in. SEO consultant Aleyda Solís published a brand-new page, submitted it to both engines, queried ChatGPT before Bing had indexed it, and ChatGPT returned a snippet matching Google's cached version. The page was findable through ChatGPT before Bing knew it existed. &lt;a href="https://www.searchenginejournal.com/chatgpt-appears-to-use-google-search-as-a-fallback/552089/" rel="noopener noreferrer"&gt;Search Engine Journal's coverage&lt;/a&gt; is the canonical write-up.&lt;/p&gt;

&lt;p&gt;I want to be honest about what this proves and what it doesn't: &lt;strong&gt;there is no public confirmation of a direct Google-fallback inside the OpenAI pipeline.&lt;/strong&gt; Some Google-shaped results may have alternate explanations: third-party aggregators that themselves query Google, plug-ins or browsing modes that bypass the default Bing path, transient behaviors that have since changed. Observed behavior suggests fallback retrieval exists. The precise mechanism is not on the public record.&lt;/p&gt;

&lt;p&gt;The largest quantitative study is &lt;a href="https://www.seerinteractive.com/insights/87-percent-of-searchgpt-citations-match-bings-top-results" rel="noopener noreferrer"&gt;Seer Interactive's analysis of 500+ SearchGPT citations&lt;/a&gt;: roughly 87% of cited URLs were in Bing's top 20, around 56% were also in Google's results at a median rank of 17, and approximately 92% of agent retrieval went through the Bing API directly. Observational, not mechanistic, but consistent with a Bing-primary system that has some non-Bing surface area for the long tail.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Cliff
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwvs4h4ik9euuhqnvak.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwvs4h4ik9euuhqnvak.jpeg" alt=" " width="800" height="621"&gt;&lt;/a&gt;&lt;br&gt;
Watching the network panel during a retrieval-on response, total time from prompt submission to first streamed token typically lands in the &lt;strong&gt;4–10-second range&lt;/strong&gt;. Where do those seconds go?&lt;/p&gt;

&lt;p&gt;Without inventing precise milliseconds: query rewriting takes hundreds of ms (small generation step inside the main model). The search API call adds a few hundred (round-trip plus Bing's own ranking). Page fetches happen in parallel, but wall-clock is gated by the slowest tail: one slow origin server drags the whole budget. Parsing and chunking are CPU-bound and fast. Re-rank is another model call. Generation begins streaming once context is assembled.&lt;/p&gt;
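
&lt;p&gt;The slowest-tail point is worth seeing in code, because it is the whole reason the fetch budget stays small. A toy sketch with made-up per-origin latencies; nothing here measures the real system:&lt;/p&gt;

&lt;div class="highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import random

async def fetch(url):
    latency = random.uniform(0.2, 3.0)        # made-up per-origin latency, in seconds
    await asyncio.sleep(latency)              # stands in for the real network round-trip
    return url, latency

async def fetch_batch(urls):
    # Parallel fetches: wall-clock cost is max(latencies), not sum(latencies),
    # so one slow origin drags the whole budget.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    slowest = max(lat for _, lat in results)
    print(f"{len(urls)} fetches, wall-clock roughly {slowest:.1f}s (the slowest tail)")
    return results

asyncio.run(fetch_batch([f"https://origin-{i}.example" for i in range(6)]))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;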

&lt;p&gt;The structural implication is the part that matters: &lt;strong&gt;fetch budget is small.&lt;/strong&gt; ChatGPT cannot fetch fifty pages. It fetches a handful. The Seer numbers are consistent with this most cited URLs come from a tight slice of Bing's top results, not from deep crawling.&lt;/p&gt;

&lt;p&gt;If you're optimizing for citation, increasing the count of pages an AI agent could &lt;em&gt;theoretically&lt;/em&gt; see buys you, at best, a linear gain. The model is rate-limited by latency, not by index breadth. Your leverage point is not "make more pages indexable." It's "be in the small set of pages that survive the rerank step."&lt;/p&gt;

&lt;p&gt;That's the first hint at the contrarian thesis. Hold on to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Citation Behavior: Dedup, Diversity, Disagreement
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3evjqvexkmdenrlt5wrr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3evjqvexkmdenrlt5wrr.jpeg" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;br&gt;
The set of citations a ChatGPT answer surfaces is not a top-N list from the search ranker. It's the output of a selection process that visibly cares about more than relevance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dejanmarketing.com/gpt-search/" rel="noopener noreferrer"&gt;DejanMarketing's GPT-search analysis&lt;/a&gt; found, across a wide sample, that ChatGPT typically selects &lt;strong&gt;3–10 sources per response&lt;/strong&gt;. Not 1, not 50. That range is consistent across query types and visible in the rendered answer. The bound is almost certainly latency-driven on the upper end and grounding-quality-driven on the lower end.&lt;/p&gt;

&lt;p&gt;Within that set, &lt;strong&gt;same-domain dedup is visible&lt;/strong&gt;. A single domain rarely appears five times in one answer's citations, even when the search ranker would happily return five pages from the same site. Observed behavior suggests an explicit diversity pressure, possibly prompt-level, possibly ranker-level, pushing the system toward distinct sources rather than concentrating on one well-ranked publisher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conflict handling&lt;/strong&gt; is the more interesting case. When sources disagree, the answer language hedges ("some sources report... while others suggest...") and the model usually surfaces both citations. This is consistent with a system that prefers honest conflict-surfacing over arbitrary tie-breaking. The hedge isn't a marketing feature. It's what cross-encoder rerankers naturally produce when several chunks score similarly with contradictory content.&lt;/p&gt;

&lt;p&gt;The pattern that falls out: a small number of high-confidence citations beats a large number of shaky ones. Cross-encoder rerankers reward agreement among independently retrieved chunks, a stronger signal than the absolute relevance score of any single chunk.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Confidence-Calibration Problem
&lt;/h2&gt;

&lt;p&gt;This is the engineering-relevant center of the whole system, and the part most retrieval discussions miss.&lt;/p&gt;

&lt;p&gt;The two-channel distinction from the opening is not a clean separation at inference time. Both channels feed into a single generation pass, and the model has to decide, implicitly and with no externally visible toggle, which to trust for any given assertion. When the channels agree, this is invisible. &lt;strong&gt;When they disagree, it is the source of nearly every quietly-wrong answer the system produces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The freshness disclosure problem is the simplest version. Ask "what's the latest version of X?" right after a release. If the browser tool fires and the search index has the new release: correct answer, release page cited. If the browser tool doesn't fire (the model judged retrieval unnecessary, a rate limit was hit, or the user is on a path that doesn't invoke browsing), the model answers from training-time memory of the older release. Identical formatting. Only one is right. The user has no signal to tell them apart.&lt;/p&gt;

&lt;p&gt;The deeper version is more subtle, and worth being explicit about. Training corpora include the model's own historical outputs. Sufficiently popular AI-generated text on the web at scrape time ends up in the next training set. So a model can be confidently wrong because &lt;em&gt;a previous model was confidently wrong&lt;/em&gt; and the wrong answer survived into training. Re-ranking has to override parametric belief in those cases. Sometimes it does. Sometimes it doesn't, particularly when the wrong belief is well attested across many low-quality sources and the correct passage shows up in only one reranked chunk.&lt;/p&gt;

&lt;p&gt;For an engineer building a retrieval system from scratch, the implication is concrete: &lt;strong&gt;make the override explicit.&lt;/strong&gt; ChatGPT does this implicitly, and not always well. In your own RAG pipeline, decide deliberately when retrieved evidence overrides parametric belief, and surface that decision rather than letting the model arbitrate silently. A simple rule, enforced at the prompt or rerank layer, is more honest than the alternative: &lt;em&gt;if retrieved evidence contradicts parametric memory, retrieved wins, and the system says so&lt;/em&gt;. Even when the contradicting evidence is itself wrong, the failure mode becomes inspectable rather than invisible. That is a much better place to be.&lt;/p&gt;
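&lt;p&gt;A minimal sketch of what that looks like at the prompt layer. The rule wording and the helper below are mine, illustrative only, and assume the evidence arrives as (url, text) chunks:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: one way to make the evidence-over-memory rule explicit
# at the prompt layer instead of letting the model arbitrate silently.

OVERRIDE_RULE = (
    "If the retrieved passages below contradict what you believe from training, "
    "the retrieved passages win. Say explicitly that retrieved sources disagree "
    "with your prior knowledge, and answer from the passages."
)

def build_grounded_prompt(question, chunks):
    """Assemble a generation prompt that carries the override rule and the evidence."""
    evidence = "\n\n".join(
        f"[{i + 1}] ({url}) {text}" for i, (url, text) in enumerate(chunks)
    )
    return f"{OVERRIDE_RULE}\n\nRetrieved passages:\n{evidence}\n\nQuestion: {question}"

print(build_grounded_prompt(
    "What is the latest version of X?",
    [("https://example.com/release-notes", "Version 4.2 was released this week.")],
))
&lt;/code&gt;&lt;/pre&gt;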

&lt;h2&gt;
  
  
  Four Things to Take to Your Own Pipeline
&lt;/h2&gt;

&lt;p&gt;Grounded in the mechanism, not the marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The bottleneck is not the LLM. It is the rerank step.&lt;/strong&gt; This is the contrarian thesis the post opened with, and it's the conclusion that survives the rest. If your RAG system produces bad citations, the bottleneck is almost always downstream of the embedding model. A bi-encoder retriever and a cosine-similarity index will surface plausible-but-wrong chunks faster than you can debug them. Cross-encoder reranking is the single highest-leverage stage. Spend your engineering budget on rerank quality and on chunking that respects semantic boundaries, not on swapping in a slightly larger embedding model and hoping.&lt;/p&gt;
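&lt;p&gt;A minimal rerank sketch, assuming the &lt;code&gt;sentence-transformers&lt;/code&gt; CrossEncoder API and a public MS MARCO checkpoint; the query and chunks are placeholders, not a benchmark:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal cross-encoder rerank sketch. The model name is a public
# sentence-transformers checkpoint; query and chunks are placeholders.
from sentence_transformers import CrossEncoder

query = "when did dense retrieval replace keyword match in production search"
candidate_chunks = [
    "BM25 scores documents by term frequency and inverse document frequency.",
    "Most production AI search stacks now use dense embeddings for first-stage recall.",
    "Our bakery uses a dense sourdough starter for first-stage proofing.",
]

# The cross-encoder reads query and chunk together instead of comparing
# two independently produced vectors, which is where the quality jump comes from.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])

for score, chunk in sorted(zip(scores, candidate_chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The third chunk is the kind of plausible-but-wrong passage this point is warning about; a model that reads query and chunk jointly is far better positioned to score it down than a bi-encoder similarity lookup.&lt;/p&gt;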

&lt;p&gt;&lt;strong&gt;2. There's a latency cliff on fetch count.&lt;/strong&gt; ChatGPT fetches a handful of pages, not dozens, and the same constraint applies to anything you build with comparable user-facing latency targets. Past roughly five-to-ten fetched pages, latency dominates the marginal grounding gain. Each extra page mostly slows the system without meaningfully improving the answer. Decide your fetch ceiling deliberately. Design for parallelism so a single slow tail doesn't blow the budget. Accept that you can't scale your way around the rerank quality problem by fetching more.&lt;/p&gt;
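&lt;p&gt;A sketch of a deliberate fetch stage, assuming &lt;code&gt;httpx&lt;/code&gt; for the HTTP calls; the ceiling and timeout numbers are placeholders to tune against your own latency target, not known ChatGPT values:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative fetch stage: a hard page ceiling plus a per-fetch timeout so one
# slow origin cannot blow the whole latency budget. Both limits are assumptions.
import asyncio
import httpx

FETCH_CEILING = 8          # deliberate cap on pages per answer
PER_FETCH_TIMEOUT = 3.0    # seconds; a slow origin gets dropped, not awaited

async def fetch_one(client, url):
    try:
        resp = await client.get(url, timeout=PER_FETCH_TIMEOUT, follow_redirects=True)
        return url, resp.text
    except httpx.HTTPError:
        return url, None   # drop the slow or broken tail instead of waiting on it

async def fetch_candidates(urls):
    capped = urls[:FETCH_CEILING]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch_one(client, u) for u in capped))
    return [(url, text) for url, text in results if text is not None]

pages = asyncio.run(fetch_candidates([
    "https://example.com/fast-page",
    "https://example.org/slow-page",
]))
print(f"{len(pages)} pages made it into the context budget")
&lt;/code&gt;&lt;/pre&gt;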

&lt;p&gt;&lt;strong&gt;3. Citation tagging is harder than it looks.&lt;/strong&gt; Mapping a generated span back to the chunk that justified it is a separate concern from retrieval, with its own failure modes. You can have perfect retrieval and still emit citations that don't support their attached claims. In practice this is either a separately-trained alignment component, an extra reasoning pass over the generated answer, or a constrained-decoding setup that forces citation tags to track the active context chunk. Pick one. Don't assume the LLM will do it for free: the visible failure mode of "wrong citation on a true claim" is exactly what happens when you assume it will.&lt;/p&gt;
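&lt;p&gt;Even the crudest version of that check is better than nothing. The sketch below is a lexical-overlap baseline, not a trained alignment component, and the 0.5 threshold is an arbitrary assumption:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Crude baseline for citation checking: does the chunk a claim cites actually
# share enough vocabulary with the claim? A real system would use a trained
# alignment component or an NLI pass; the 0.5 threshold is an arbitrary assumption.
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def citation_supported(claim, cited_chunk, min_overlap=0.5):
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return False
    overlap = len(claim_tokens.intersection(tokens(cited_chunk))) / len(claim_tokens)
    return overlap &gt;= min_overlap

claim = "Version 4.2 shipped this week."
good_chunk = "The changelog confirms that version 4.2 shipped this week with two fixes."
bad_chunk = "Our pricing page lists three plans."
print(citation_supported(claim, good_chunk))   # True
print(citation_supported(claim, bad_chunk))    # False
&lt;/code&gt;&lt;/pre&gt;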

&lt;p&gt;&lt;strong&gt;4. Source diversity is a feature, not a nice-to-have.&lt;/strong&gt; If your pipeline doesn't explicitly enforce same-domain dedup or topic-cluster diversity at the rerank stage, hard-code it. Allowing one domain to dominate the cited set is the fastest way to make a RAG system look like a thinly-wrapped paraphrase of one publisher. Diversity pressure is cheap to implement (a small penalty in the rerank score, a per-domain cap on selected chunks), and it's the difference between a citation list that reads like research and one that reads like a single-source rewrite.&lt;/p&gt;
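&lt;p&gt;A sketch of that selection step, with both the score penalty and the hard cap; every constant is an assumption, not a measured ChatGPT parameter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative diversity pressure at the selection step: a small per-domain score
# penalty plus a hard per-domain cap. All constants are assumptions.
from collections import Counter
from urllib.parse import urlparse

DOMAIN_PENALTY = 0.15     # score deduction per chunk already taken from a domain
PER_DOMAIN_CAP = 2        # hard ceiling per domain
MAX_SOURCES = 6           # overall citation budget

def select_diverse(scored_chunks):
    """scored_chunks: list of (rerank_score, url, text) tuples."""
    taken = Counter()
    selected = []
    remaining = list(scored_chunks)
    while remaining and len(selected) != MAX_SOURCES:
        # Re-rank what's left, penalizing domains we've already cited.
        remaining.sort(
            key=lambda item: item[0] - DOMAIN_PENALTY * taken[urlparse(item[1]).netloc],
            reverse=True,
        )
        score, url, text = remaining.pop(0)
        domain = urlparse(url).netloc
        if taken[domain] == PER_DOMAIN_CAP:
            continue                               # cap reached: drop the chunk entirely
        selected.append((url, text))
        taken[domain] += 1
    return selected

picked = select_diverse([
    (0.92, "https://bigpublisher.com/a", "..."),
    (0.91, "https://bigpublisher.com/b", "..."),
    (0.90, "https://bigpublisher.com/c", "..."),
    (0.74, "https://smallblog.net/post", "..."),
])
print([url for url, _ in picked])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On the toy input, the big publisher still gets its two best pages, but its third is displaced by a lower-scoring independent source, which is the behavior the citation lists visibly show.&lt;/p&gt;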

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;My read after a week of poking at this: ChatGPT's retrieval stack is not magic. It's a query rewrite, a search call, a small fetch budget, a re-rank, a context assembly, and a prompt with citation instructions, all wrapped in a tool layer the model decides when to invoke.&lt;/p&gt;

&lt;p&gt;The interesting part isn't the architecture. It's the choices the system makes. What gets fetched. What gets selected. What gets attributed. When retrieval fires and when it doesn't. How the two channels of knowledge get reconciled when they disagree.&lt;/p&gt;

&lt;p&gt;Every retrieval system built from now on makes the same set of choices. Most make them worse. The work isn't in copying the architecture. It's in making each of those choices deliberately and being honest with the user about which channel produced the answer.&lt;/p&gt;

&lt;p&gt;That last part, especially. ChatGPT doesn't do it. Yours can.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Published by Cihangir Bozdogan&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>web</category>
      <category>chatgpt</category>
    </item>
  </channel>
</rss>
