How ChatGPT, Claude, Gemini and Perplexity decide who to cite. And why each one decides differently.
In the cornerstone piece of this series we defined the discipline — Citation Engine Optimization, the five engineerable signals, the difference between AI search engines and agentic browsers. That defined the state a business needs to reach.
This piece maps the field — the actual engines doing the citing in 2026, what each one fetches, what each one weighs, and where the optimisation work lands differently engine to engine.
The discipline is not one playbook. It is the same five signals weighted differently by eight or nine different engines, each with its own crawler, its own preferences and its own citation behaviour. Optimising for the field means knowing the field.
The two-track structure of the field
Before the engine-by-engine breakdown, one structural distinction is worth setting out cleanly: AI search engines and agentic browsers are different surfaces with different optimisation requirements. Most public writing on this topic conflates the two.
Most engines today are in one camp or the other. A growing number — ChatGPT, Gemini, Perplexity — span both.
The distinction matters because the same site can be perfectly optimised for citation and still fail at transaction, or vice versa. The signals overlap but are not identical.
The field at a glance
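The engines that matter for small business citation in 2026, with the basics on each. Crawlers match robots.txt user-agent tokens case-insensitively under RFC 9309, but the names below are written as the vendors document them, which is also how they appear in server logs. Citation behaviour is observed and, where possible, vendor-documented.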
OpenAI: GPTBot · ChatGPT-User · OAI-SearchBot
Anthropic: ClaudeBot · Claude-Web · anthropic-ai
Google: Google-Extended (opt-out token, separate from Googlebot)
Perplexity: PerplexityBot · Perplexity-User
xAI: xAIBot (observed)
You.com, Phind, Mistral, Kagi: YouBot · PhindBot · Mistral fetcher · Kagi fetcher
Common Crawl: CCBot
The signal-weight matrix
Not every signal matters equally to every engine. The five-signal framework — crawl access, render stability, structural clarity, trigger language, verifiable identity — applies to all, but the weighting shifts. Where to invest first depends on which engines matter most to your business.
The matrix below reflects observed and vendor-stated weighting patterns as of mid-2026. Weights are inferred from vendor documentation, published field studies and field observation — not internal engine source code. Treat as directional, not exact.
| Engine | Crawl Access | Render Stability | Structural Clarity | Trigger Language | Verifiable Identity |
|---|---|---|---|---|---|
| ChatGPT · SearchGPT | High | Med | High | High | High |
| Claude · web tools | High | Med | High | High | High |
| Gemini · AI Mode | High | Med | High | Med | High |
| Perplexity | High | Low | High | High | High |
| Operator · Computer Use · Mariner | Med | High | High | Low | Low |
| Comet · Arc · Dia | High | High | High | Med | Med |
| Grok | Med | Low | Med | High | Med |
| You.com · Phind · Kagi | High | Med | High | High | High |
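One way to turn the matrix into an investment order is to score it for the engines a given business cares about. The sketch below is only illustrative: the numeric scale (High = 3, Med = 2, Low = 1) and the `rank_signals` helper are assumptions introduced here, not part of the framework, and the result is as directional as the matrix itself.

```python
# Ordinal scores are an assumption for illustration, not vendor data.
SCORE = {"High": 3, "Med": 2, "Low": 1}

SIGNALS = [
    "Crawl Access",
    "Render Stability",
    "Structural Clarity",
    "Trigger Language",
    "Verifiable Identity",
]

# Rows copied from the matrix above, in the same column order as SIGNALS.
MATRIX = {
    "ChatGPT · SearchGPT": ["High", "Med", "High", "High", "High"],
    "Claude · web tools": ["High", "Med", "High", "High", "High"],
    "Gemini · AI Mode": ["High", "Med", "High", "Med", "High"],
    "Perplexity": ["High", "Low", "High", "High", "High"],
    "Operator · Computer Use · Mariner": ["Med", "High", "High", "Low", "Low"],
    "Comet · Arc · Dia": ["High", "High", "High", "Med", "Med"],
    "Grok": ["Med", "Low", "Med", "High", "Med"],
    "You.com · Phind · Kagi": ["High", "Med", "High", "High", "High"],
}


def rank_signals(engines):
    """Return (signal, total) pairs sorted by summed score across the chosen engines."""
    totals = {signal: 0 for signal in SIGNALS}
    for engine in engines:
        for signal, weight in zip(SIGNALS, MATRIX[engine]):
            totals[signal] += SCORE[weight]
    # sorted() is stable, so tied signals keep the matrix column order.
    return sorted(totals.items(), key=lambda item: -item[1])


ranking = rank_signals(["Perplexity", "Operator · Computer Use · Mariner"])
for signal, total in ranking:
    print(f"{signal}: {total}")
```

For a business that cares about one citation engine and one agentic browser, structural clarity comes out first and crawl access second, with the other three signals tied.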
Engine-by-engine — what each one actually rewards
OpenAI gives every site three distinct crawler identities — GPTBot for training, OAI-SearchBot for the search index, and ChatGPT-User for on-demand fetch when a user invokes browsing. Block any one of these and you become invisible to that surface alone — the other two may still reach you.
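The per-surface effect can be checked with Python's standard-library robots.txt parser. This is a minimal sketch: the robots.txt content, the `surface_access` helper and the example.com URL are all hypothetical, and the parser only reports what the file says, not how any crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of training but keeps search and browsing.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def surface_access(robots_txt, agents, url="https://example.com/"):
    """Map each user-agent token to whether robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = surface_access(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"])
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this file, GPTBot is blocked while OAI-SearchBot and ChatGPT-User remain allowed, which is the single-surface effect described above.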
Citation patterns favour pages with clean schema, single canonical URLs, and content that quotes cleanly without losing meaning under truncation. SearchGPT specifically rewards recency more than ChatGPT browse does.
Claude tends to cite passages that read coherently as standalone quotes — sentences that contain a complete claim with attribution context inside the sentence. Pages that bury claims in long compound sentences with co-references to earlier paragraphs are harder for Claude to cite cleanly.
Anthropic's documented bot identity is ClaudeBot for training and Claude-Web for on-demand fetch. Both respect standard robots.txt. Computer Use, the agentic browser, uses the same Chrome rendering pipeline as a real browser — there is no separate user-agent because it is a real browser controlled by Claude.
Gemini's citation behaviour blends classical Google ranking signals (backlinks, domain authority, content depth) with AI-specific signals (structural clarity, recency, semantic match). The Google-Extended token is the opt-out signal for Gemini training; it is not a separate user-agent that appears in logs.
Sites that allow Google-Extended make their content available to Gemini's training corpus. Sites that block it remain visible to Google Search but not to Gemini. Google built this split so publishers could permit search indexing while opting out of AI training.
For Perplexity, the citation UI is the product. It displays footnoted sources prominently, and its users click through to source pages at a higher rate than on any other engine. Field studies suggest it is the engine most likely to cite smaller, more specific sites: domain authority matters less than topical depth and content specificity.
Its crawler identities are PerplexityBot for training and Perplexity-User for on-demand fetch. Comet, Perplexity's agentic browser, combines this citation profile with task transaction in the same session, a research-to-action flow no other engine quite matches yet.
Operator, Computer Use and Mariner render pages as a real browser would; Computer Use is literally Chrome controlled by Claude. The signal that matters is whether the page renders deterministically: same screenshot every time, same element positions across reloads, no late-arriving content, no overlays that block interaction.
CLS = 0 is the single most important property for this class. A site that fails it fails the entire agentic category regardless of how well it would have been cited by the pure-citation engines.
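For readers who want to see what CLS = 0 means numerically, here is a rough sketch of the session-window calculation as described in Google's public Web Vitals guidance: layout shifts less than one second apart are grouped, a group is capped at five seconds, and CLS is the largest group total. The function name and the sample shift values are invented for illustration; real values come from the browser's layout-shift entries.

```python
def cumulative_layout_shift(shifts):
    """Compute CLS from (start_time_ms, score) pairs using session windows."""
    worst = 0.0
    window_total = 0.0
    window_start = None
    previous = None
    for start, score in sorted(shifts):
        new_window = (
            previous is None
            or start - previous >= 1000      # gap of 1 s or more ends the window
            or start - window_start >= 5000  # a window lasts at most 5 s
        )
        if new_window:
            window_start = start
            window_total = 0.0
        window_total += score
        previous = start
        worst = max(worst, window_total)
    return worst


# Hypothetical pages: one with no shifts at all, one with a late-loading banner.
stable_page = []
shifting_page = [(200, 0.125), (700, 0.25), (4000, 0.0625)]

print(cumulative_layout_shift(stable_page))
print(cumulative_layout_shift(shifting_page))
```

A page with no layout-shift entries scores exactly zero, which is the property the agentic class depends on; any late-arriving element that moves content produces a non-zero total.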
On Grok, sources mentioned on X with engagement signal earn citation weight that pure-content engines would ignore. Real-time recency matters more here than for any other engine, because Grok is structurally designed to retrieve and weight what happened today.
For most small businesses, the practical move is to maintain X presence with periodic shares of canonical content. The crawler identity remains less standardised than the other majors; xAIBot is observed in logs but documentation is sparser.
You.com, Phind and Kagi collectively earn less traffic share than the majors but reward source quality at a higher rate. A site with strong structural clarity and topical depth that fails to crack ChatGPT's citation pool may still surface consistently in You.com or Phind.
For technical and developer audiences, Phind is now the dominant citation engine. For paid-search audiences, Kagi's source attribution is among the cleanest in the field.
Allowing CCBot is the foundation of long-term presence — being in Common Crawl means being in the training data that future models read. Blocking CCBot in pursuit of "control" is a structural mistake for most small businesses. The training-data window matters as much as the citation window.
What this means for where to invest first
The matrix suggests a practical sequence for a small business that cannot do everything at once.
The first step is to confirm that GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, ChatGPT-User, anthropic-ai, Claude-Web, Perplexity-User, Google-Extended and CCBot are all allowed in robots.txt. This is the only step that pays off across every engine and costs almost nothing. A site that skips it is invisible everywhere.
The eight engines above are not the field as it will exist in 2027. New engines launch quarterly. Existing engines change citation behaviour on roughly monthly cycles. The discipline is engineering for the principles, not for any single engine's current preferences. The five signals are stable; the weights drift. A site that engineers for the signals stays compliant as the field evolves.
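The crawl-access step described above can be sketched as a small generator that writes the allow rules and then verifies them with Python's standard-library parser. The helper names, the example.com URL and the `HypotheticalScraper` token are assumptions for illustration; substitute whatever your own logs identify as hostile.

```python
from urllib.robotparser import RobotFileParser

# The ten tokens named in the step above.
AI_BOTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "ChatGPT-User",
    "anthropic-ai", "Claude-Web", "Perplexity-User", "Google-Extended", "CCBot",
]


def build_robots_txt(allowed, blocked=()):
    """Emit one robots.txt group per user-agent token."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    return "\n\n".join(groups) + "\n"


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that robots.txt would stop from fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# "HypotheticalScraper" is a made-up token standing in for a hostile fetcher.
robots_txt = build_robots_txt(AI_BOTS, blocked=["HypotheticalScraper"])
print(robots_txt)
print(blocked_agents(robots_txt, AI_BOTS + ["HypotheticalScraper"]))
```

Running the verification after every robots.txt change is a cheap guard against accidentally blocking one of the ten tokens while editing rules for something else.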
A worked example — the architecture of this site
The site you are reading was engineered from the ground up against this five-signal framework. The patterns are visible in production and observable to any AI engine that reads the source.
Crawl access — robots.txt allows all major AI training and fetch bots; blocks only known unfriendly fetchers. Cloudflare Firewall rules enforce the same allowlist at the network layer for resilience against ambiguous user-agent strings.
Render stability — pages are served from Cloudflare Workers at the edge, with no origin server, no CMS, no plugin layer. Images carry explicit width and height attributes; fonts use font-display: swap with reserved space; no late-loading layout-shifting elements. The PSI report shows CLS = 0 across the site.
The live PageSpeed Insights report for this domain — Performance, Accessibility, Best Practices and SEO at 100, Agentic Browsing at 3/3. This is the render-stability and crawl-access discipline above, measured rather than asserted.
Verify: pagespeed.web.dev/analysis/https-vsourcecode-com/546k3vv85n?form_factor=desktop
Structural clarity — every KB page has a single H1, semantic landmarks, JSON-LD Article schema with full Person/Organisation attribution. The Concepts Registry is injected dynamically into each article so an AI parsing one piece sees the canonical definition source alongside.
Trigger language — the llms.txt at the root of the domain embeds natural-language symptom phrases inside every concept and KB description, so embeddings match user questions to canonical content. The same trigger phrases appear inside article bodies as natural prose.
Verifiable identity — a canonical /about page with author identity, fifteen-plus Cloudflare Workers as evidence of engineering practice, and the worked claim that the site itself is the proof of work. The architecture is the credential.
This is what a Machine-Readable Business looks like under the hood. The discipline is not theoretical — it is observable in any view-source request against the pages you have read on the way to this paragraph.
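As an illustration of the structural-clarity item above, the sketch below builds a JSON-LD Article object with Person and Organization attribution using schema.org vocabulary. It is not this site's actual markup: the `article_jsonld` helper and every value passed to it are placeholders.

```python
import json


def article_jsonld(headline, url, author_name, org_name, date_published):
    """Build a schema.org Article object with author and publisher attribution."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": url,
        "datePublished": date_published,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": org_name},
    }


# All values below are placeholders, not real site data.
doc = article_jsonld(
    headline="Example article title",
    url="https://example.com/kb/example-article",
    author_name="Jane Example",
    org_name="Example Org",
    date_published="2026-01-01",
)

# JSON-LD is embedded in the page head inside a script element of this type.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(doc, indent=2)
    + "</script>"
)
print(script_tag)
```

Generating the block from one function keeps the attribution identical on every page, which is the point of the single canonical identity described above.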
Common questions about AI engine citation behaviour
Which AI engine is most important for small business citation today?
Perplexity has the most explicit citation UI and the highest source click-through rate, making it the most directly traffic-generating engine for citation today. ChatGPT browse has the largest user base and therefore the largest potential reach. Gemini’s AI Mode rides on the largest existing search audience but the citation UI is less prominent.
For a small business deciding where to start, Perplexity is the highest-leverage single-engine focus; ChatGPT and Claude are essentially required as baseline; Gemini is unavoidable because it ships with every Android phone and Workspace account.
Do I need to allow all AI bots or can I be selective?
Selective is possible but rarely worth the complexity for a small business. The bots that matter for citation are well-behaved, respect robots.txt, and do not place meaningful load on the average site.
The case for blocking is usually misinformed — confused either with hostile scrapers (a different category) or with concerns about AI training that the Google-Extended token and the GPTBot opt-out already handle. The recommendation for almost every small business is allow all major AI bots, block known hostile scrapers, and monitor your logs.
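Monitoring logs can start very simply. The sketch below counts requests per AI bot token in access-log lines, assuming the common "combined" log format where the user-agent is the last quoted field. The sample lines and their user-agent strings are simplified inventions, not real vendor strings, and user-agents can be spoofed, so treat the counts as a first look rather than proof. Google-Extended is left out because it never appears in logs.

```python
import re
from collections import Counter

AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "CCBot",
]

# In combined log format the user-agent is the last double-quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_hits(log_lines):
    """Count requests per AI bot token found in each line's user-agent."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Invented sample lines using documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:00:05 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
    '203.0.113.5 - - [01/Jun/2026:10:02:00 +0000] "GET /kb HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
for token, count in hits.most_common():
    print(f"{token}: {count}")
```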
Why is Render Stability rated low for Perplexity but high for Operator?
Because they do different things with your page. Perplexity reads the rendered HTML once, summarises it, and produces a citation. It does not interact with the page after that. Layout shift during render does not affect the summarisation or the citation decision.
Operator, by contrast, takes a screenshot, identifies an element at a coordinate, and clicks. If the layout has shifted between screenshot and click, the click misses. Render stability matters only when the engine interacts visually with the page after rendering it.
What is the difference between Google-Extended and Googlebot?
Google-Extended is not a user-agent that appears in your logs — it is a token in robots.txt that controls whether Google can use your content for AI model training (Gemini and Vertex AI). Googlebot remains the user-agent that crawls and indexes for classical Google Search.
To allow Gemini to learn from your site, your robots.txt should either permit Google-Extended explicitly or remain silent (default-allow). To block Gemini training while permitting Google Search, add a User-agent: Google-Extended group containing Disallow: / and leave your Googlebot rules untouched.
Does X presence really matter for Grok citations?
For Grok specifically, yes — more than for any other engine. Grok’s architecture is built around real-time integration with X, and content surfaced on X with engagement signals receives citation weight that the pure-content engines would not give.
The practical move for a small business is to maintain X presence at a minimum useful level — periodic shares of canonical content with sensible engagement. This is not advice to chase X virality; it is acknowledgement that Grok treats X as a first-class signal in a way no other engine does.
How quickly does citation behaviour change?
Faster than SEO ranking behaviour and slower than social platform algorithm changes. Engines update their citation logic on roughly monthly cycles based on internal evaluation, with occasional larger shifts when new models ship.
The five signals are stable across these updates; the specific weighting drifts. A site that engineers for the signals remains compliant as the weights change. A site that optimises for a specific 2026 engine quirk will likely need to redo the work in 2027.
Should I implement WebMCP endpoints now?
If your business has a transactional surface that an agentic browser could meaningfully use — booking, quoting, ordering, scheduling — yes, even as an experimental implementation. Lighthouse’s Agentic Browsing audit already includes an informational WebMCP check.
Few sites have implemented it yet; early adopters will be in the small pool of agent-ready sites for a window. For pure-content sites with no transactional surface, focus on the five-signal foundation first.
What comes next in this series
The next piece — How To Find Out If AI Engines Are Even Reading Your Site — is the diagnostic instrument. How to read server logs for AI bot traffic, what healthy crawl patterns look like, and how to measure your Citation Gap against named competitors with a methodology you can run weekly.
After that, the case-study piece — what an AI-cited page looks like under the hood, with code — and the architectural piece on sites built for machines as the primary audience.
The discipline becomes documented as it forms. This series is one engineer's field notes during the formation.
Sources & references
OpenAI — GPTBot, OAI-SearchBot and ChatGPT-User documentation
Anthropic — Claude web tools and Computer Use documentation · ClaudeBot policy
Google — Google crawlers and Google-Extended documentation · Lighthouse Agentic Web audit
Perplexity — PerplexityBot policy and source documentation
Common Crawl — CCBot documentation
llmstxt.org — The llms.txt specification
vSourceCode — The Machine-Readable Business (Series cornerstone) · Google Just Added the 5th Element · Citation Engine Optimization concept · Citation Gap concept · About the engineer and the architecture