The Citation Engine · 02 · Field Guide

How ChatGPT, Claude, Gemini and Perplexity decide who to cite. And why each one decides differently.

In the cornerstone piece of this series we defined the discipline: Citation Engine Optimization, the five engineerable signals, and the difference between AI search engines and agentic browsers. That piece defined the state a business needs to reach.

This piece maps the field — the actual engines doing the citing in 2026, what each one fetches, what each one weighs, and where the optimisation work lands differently engine to engine.

The discipline is not one playbook. It is the same five signals weighted differently by eight or nine different engines, each with its own crawler, its own preferences and its own citation behaviour. Optimising for the field means knowing the field.


The two-track structure of the field

Before the engine-by-engine breakdown, one structural distinction is worth stating cleanly: AI search engines and agentic browsers are different surfaces with different optimisation requirements. Most public writing on this topic conflates the two.

Track 01 · Citation Surface
AI Search Engines
Fetch sources, parse them, summarise, produce a cited answer. The citation is a referral. The user follows the link or does not. Closer to the Google referral model — discovery, citation, click. Signal weights favour parseability and trust.
USER ASKS → ENGINE FETCHES → ENGINE CITES → USER FOLLOWS LINK
Track 02 · Transaction Surface
Agentic Browsers
Visit the page on the user's behalf. Interact with it. Click, fill forms, complete the task. Result is a transaction, not a referral. The user sees an outcome, not a destination. Signal weights favour render stability and structural determinism.
USER ASKS → AGENT VISITS → AGENT CLICKS → TASK COMPLETED

Most engines today are in one camp or the other. A growing number, including ChatGPT, Claude, Gemini and Perplexity (through Comet), span both.

The distinction matters because the same site can be perfectly optimised for citation and still fail at transaction, or vice versa. The signals overlap but are not identical.


The field at a glance

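The engines that matter for small business citation in 2026, with the basics on each. Bot names are given in their documented capitalisation. Under RFC 9309, robots.txt user-agent matching is case-insensitive, but log filters and firewall rules are often case-sensitive, so copy the names exactly. Citation behaviour is observed and, where possible, vendor-documented.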

Engine 01
ChatGPT (browse) + SearchGPT
OpenAI
Training crawler: GPTBot
On-demand fetch: ChatGPT-User
Search index: OAI-SearchBot
Cites with: Inline numbered links to sources; opens cards with title, snippet, domain.
Weighs heavily: Recency, source credibility, structural clarity, page-level authority.
Cites AND transacts
Operator
OpenAI · Agentic browser
Mechanism: Renders pages in a managed Chrome, takes screenshots, clicks elements, fills forms on the user's behalf.
Failure mode: Layout shift, late-loading modals, dynamic field reorder. CLS = 0 is functionally required for reliable task completion.
Use case: Booking, purchasing, account setup, form filling.
Transacts
Claude (web tools)
Anthropic
Training crawler: ClaudeBot
On-demand fetch: Claude-Web · anthropic-ai
Cites with: Numbered citations linked to source URLs at the relevant claim, with quote-faithful attribution rules.
Weighs heavily: Source quality, structural clarity, content depth, exact-quote eligibility (low-distortion text).
Cites AND transacts
Computer Use
Anthropic · Agentic browser
Mechanism: Screenshot-based visual interaction. Identifies UI elements by position and label, clicks, types, scrolls.
Failure mode: Identical to Operator. Visual instability breaks task chains. Render-stable pages are functionally privileged.
Use case: End-to-end task automation across multi-step web workflows.
Transacts
Gemini + AI Mode
Google
Training crawler: Google-Extended (opt-out token, separate from Googlebot)
On-demand: Uses Googlebot infrastructure; respects standard robots.txt
Cites with: AI Mode answer with source carousel; citation through AI Overviews on classical search.
Weighs heavily: Classical Google ranking signals plus structural clarity. Citation often follows top organic results but not always.
Cites AND transacts
Project Mariner
Google · Agentic browser
Mechanism: Chrome extension and managed-browser modes. Browses on user's behalf, completes tasks.
Failure mode: Same agentic class as Operator and Computer Use. CLS-stable pages are required for deterministic completion.
Use case: Shopping, research workflows, multi-tab task completion.
Transacts
Perplexity
Perplexity
Training crawler: PerplexityBot
On-demand fetch: Perplexity-User
Cites with: Numbered footnote-style citations with prominent source cards; the most explicit citation UI of any engine.
Weighs heavily: Source diversity, recency, structural clarity, presence of citable claims. Often cites smaller sites with specific authoritative depth.
Citation-first
Comet
Perplexity · Agentic browser
Mechanism: Perplexity's full browser product. Combines real-time search citation with agentic browsing for task completion.
Failure mode: Standard agentic-browser failure surface plus citation-pool ineligibility on render-broken pages.
Use case: Research → comparison → transaction within a single agent session.
Cites AND transacts
Grok
xAI
Training crawler: xAIBot (observed)
Cites with: Inline source attribution, heavy X (Twitter) integration. Less standardised than other engines.
Weighs heavily: Real-time signal weight is unusual. Grok favours fresh, social-platform-corroborated sources. X presence helps.
Citation-first
Arc Search · Dia
Browser Company
Mechanism: Browser-native AI search and agentic features. "Browse for me" mode generates summaries from fetched sources.
Cites with: Synthesised answer page with source list; agentic mode acts within the browser.
Weighs heavily: Page-speed, structural clarity, mobile-readiness. Browser-native means real Chrome rendering pipeline.
Cites AND transacts
You.com · Phind · Le Chat · Kagi
Emerging engines
Crawlers: YouBot · PhindBot · Mistral fetcher · Kagi fetcher
Cites with: Varies. You.com and Phind are explicitly citation-heavy; Le Chat (Mistral) emerging; Kagi is paid-search with strong source attribution.
Weighs heavily: Smaller engines often weight structural and authority signals more heavily because they have less proprietary ranking data to fall back on.
Citation-first
Common Crawl
CCBot · Training data substrate
Crawler: CCBot
Role: Not an engine itself. Provides the public training-data substrate that many engines build their training corpora from. Being CCBot-readable matters for long-term model knowledge of your site.
Implication: Blocking CCBot blocks future models from learning your site exists. Allowing it is the foundation of long-term AI presence.
Substrate · indirect
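
The bot names in the cards above are what you would look for in server logs. A minimal Python sketch of that lookup, assuming simple case-insensitive substring matching; the function names and the sample user-agent strings are made up for illustration, and the token list should be re-checked against each vendor's current documentation.

```python
from collections import Counter

# Tokens and roles as listed in the cards above; re-check each one against
# the vendor's current documentation before relying on it.
AI_BOT_TOKENS = {
    "GPTBot": "OpenAI training crawler",
    "OAI-SearchBot": "OpenAI search index",
    "ChatGPT-User": "OpenAI on-demand fetch",
    "ClaudeBot": "Anthropic training crawler",
    "Claude-Web": "Anthropic on-demand fetch",
    "PerplexityBot": "Perplexity crawler",
    "Perplexity-User": "Perplexity on-demand fetch",
    "CCBot": "Common Crawl",
}


def classify_user_agent(user_agent):
    """Return the first known AI bot token found in a user-agent string, else None."""
    lowered = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def count_ai_hits(user_agents):
    """Count hits per AI bot token, ignoring ordinary browser traffic."""
    hits = (classify_user_agent(ua) for ua in user_agents)
    return Counter(token for token in hits if token is not None)


# Simplified, made-up user-agent strings for illustration only.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
]
counts = count_ai_hits(sample)
```

User-agent strings can be spoofed, so a count like this shows claimed identity only. Treat it as a first pass rather than proof of who fetched the page.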

The signal-weight matrix

Not every signal matters equally to every engine. The five-signal framework — crawl access, render stability, structural clarity, trigger language, verifiable identity — applies to all, but the weighting shifts. Where to invest first depends on which engines matter most to your business.

The matrix below reflects observed and vendor-stated weighting patterns as of mid-2026. Weights are inferred from vendor documentation, published field studies and field observation — not internal engine source code. Treat as directional, not exact.

Engine                             Crawl Access  Render Stability  Structural Clarity  Trigger Language  Verifiable Identity
ChatGPT · SearchGPT                High          Med               High                High              High
Claude · web tools                 High          Med               High                High              High
Gemini · AI Mode                   High          Med               High                Med               High
Perplexity                         High          Low               High                High              High
Operator · Computer Use · Mariner  Med           High              High                Low               Low
Comet · Arc · Dia                  High          High              High                Med               Med
Grok                               Med           Low               Med                 High              Med
You.com · Phind · Kagi             High          Med               High                High              High

S3 · Structural Clarity: universal high-weight signal
Every engine, citation and agentic alike, rewards it. If you can only invest in one signal, this is the one that returns across the entire field.

S2 · Render Stability: splits the field cleanly
Agentic browsers weight it at the top. Pure-citation engines weight it low: they read HTML once and never interact again after that.
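
One way to turn the matrix into a starting order is to score each signal across the engines that matter to your business. The sketch below is illustrative only: the High/Med/Low to 3/2/1 mapping, the importance numbers and the function name rank_signals are all assumptions, not vendor figures.

```python
# Weights copied from the matrix above; the 3/2/1 scale is an assumption.
LEVEL = {"High": 3, "Med": 2, "Low": 1}
SIGNALS = ["crawl_access", "render_stability", "structural_clarity",
           "trigger_language", "verifiable_identity"]

MATRIX = {
    "ChatGPT · SearchGPT": ["High", "Med", "High", "High", "High"],
    "Claude · web tools": ["High", "Med", "High", "High", "High"],
    "Gemini · AI Mode": ["High", "Med", "High", "Med", "High"],
    "Perplexity": ["High", "Low", "High", "High", "High"],
    "Operator · Computer Use · Mariner": ["Med", "High", "High", "Low", "Low"],
    "Comet · Arc · Dia": ["High", "High", "High", "Med", "Med"],
    "Grok": ["Med", "Low", "Med", "High", "Med"],
    "You.com · Phind · Kagi": ["High", "Med", "High", "High", "High"],
}


def rank_signals(engine_importance):
    """Rank signals by summed weight over the engines you care about.

    engine_importance maps a matrix row name to a relative importance
    (hypothetical numbers chosen by the site owner).
    """
    totals = dict.fromkeys(SIGNALS, 0.0)
    for engine, importance in engine_importance.items():
        for signal, level in zip(SIGNALS, MATRIX[engine]):
            totals[signal] += importance * LEVEL[level]
    # Highest score first; ties keep the matrix column order.
    return sorted(totals.items(), key=lambda item: -item[1])


# A lead-generation site that cares mostly about Perplexity, then ChatGPT.
citation_business = rank_signals({"Perplexity": 2, "ChatGPT · SearchGPT": 1})
# A booking site that cares only about the agentic class.
booking_business = rank_signals({"Operator · Computer Use · Mariner": 1})
```

Under these assumptions, render stability ranks last for the citation-led site and first for the booking site. That is the split the matrix describes.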

Engine-by-engine — what each one actually rewards

ChatGPT and SearchGPT (OpenAI)
Three crawlers. One site. All need to be open.

OpenAI gives every site three distinct crawler identities — GPTBot for training, OAI-SearchBot for the search index, and ChatGPT-User for on-demand fetch when a user invokes browsing. Block any one of these and you become invisible to that surface alone — the other two may still reach you.

Citation patterns favour pages with clean schema, single canonical URLs, and content that quotes cleanly without losing meaning under truncation. SearchGPT specifically rewards recency more than ChatGPT browse does.

Claude (Anthropic)
Rewards depth and quote-faithfulness.

Claude tends to cite passages that read coherently as standalone quotes — sentences that contain a complete claim with attribution context inside the sentence. Pages that bury claims in long compound sentences with co-references to earlier paragraphs are harder for Claude to cite cleanly.

Anthropic's bot identities as used here are ClaudeBot for training and Claude-Web for on-demand fetch, and both respect standard robots.txt. Anthropic's crawler documentation has also listed Claude-User and Claude-SearchBot, so check the current token list before writing rules. Computer Use, the agentic browser, uses the same Chrome rendering pipeline as a real browser; there is no separate user-agent because it is a real browser controlled by Claude.

Gemini and AI Mode (Google)
The most opaque of the major engines.

Citation behaviour blends classical Google ranking signals (backlinks, domain authority, content depth) with AI-specific signals (structural clarity, recency, semantic match). The Google-Extended token is the opt-out signal for Gemini training — it is not a separate user-agent that appears in logs.

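Sites that allow Google-Extended make their content available to Gemini's training corpus. Sites that block it remain visible to Google Search but opt out of Gemini training. Google built this split so publishers could permit search indexing while opting out of AI training. The token does not work in reverse: it cannot add a site to training if Googlebot is blocked from crawling it. AI Overviews and AI Mode are Search features, so they follow Googlebot rules rather than Google-Extended.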

Perplexity
The most explicit citation UI in the field. The highest click-through rate.

The citation UI is the product. Perplexity displays footnoted sources prominently, and its users click through to source pages at a higher rate than on any other engine. Field studies suggest it is the engine most likely to cite smaller, more specific sites: domain authority matters less than topical depth and content specificity.

PerplexityBot for training, Perplexity-User for on-demand fetch. Comet, Perplexity's agentic browser, combines this citation profile with task transaction in the same session — a research-to-action flow no other engine quite matches yet.

Operator, Computer Use, Project Mariner, Comet, Arc and Dia
The agentic class. The bot user-agent question is largely irrelevant for these.

They render pages as a real browser would — Computer Use is literally Chrome controlled by Claude. The signal that matters is whether the page renders deterministically: same screenshot every time, same element positions across reloads, no late-arriving content, no overlays that block interaction.

CLS = 0 is the single most important property for this class. A site that fails it fails the entire agentic category regardless of how well it would have been cited by the pure-citation engines.
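
For readers who want to see what a CLS figure is made of, here is a minimal sketch of the aggregation step. It assumes the session-window definition published for Core Web Vitals: shifts less than 1 second apart are grouped, a window is capped at 5 seconds, and the worst window is reported. The function name and the sample entries are hypothetical.

```python
def cumulative_layout_shift(shifts, max_gap=1.0, max_window=5.0):
    """Return the largest session-window sum of layout-shift scores.

    shifts: iterable of (timestamp_seconds, score, had_recent_input).
    """
    worst = 0.0
    window_sum = 0.0
    window_start = None
    previous_time = None
    for time, score, had_recent_input in sorted(shifts):
        if had_recent_input:
            continue  # shifts that follow user input are excluded
        starts_new_window = (
            window_start is None
            or time - previous_time >= max_gap
            or time - window_start >= max_window
        )
        if starts_new_window:
            window_start = time
            window_sum = 0.0
        window_sum += score
        previous_time = time
        worst = max(worst, window_sum)
    return worst


# Hypothetical entries: two early shifts in one window, one later shift,
# and one shift caused by user input that is ignored.
shifty_page = [
    (0.2, 0.05, False),
    (0.6, 0.10, False),
    (3.0, 0.02, False),
    (3.1, 0.50, True),
]
stable_page = []

shifty_score = cumulative_layout_shift(shifty_page)
stable_score = cumulative_layout_shift(stable_page)
```

A page with no unexpected shifts scores 0, which is the target this section argues for.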

Grok (xAI)
Behaves differently because of its native integration with X.

Sources mentioned on X with engagement signal earn citation weight that pure-content engines would ignore. Real-time recency matters more here than for any other engine — Grok is structurally designed to retrieve and weight what happened today.

For most small businesses, the practical move is to maintain X presence with periodic shares of canonical content. The crawler identity remains less standardised than the other majors; xAIBot is observed in logs but documentation is sparser.

You.com, Phind, Le Chat (Mistral), Kagi
Smaller footprint. Higher source quality bar.

Collectively, these engines earn less traffic share than the majors but reward source quality at a higher rate. A site with strong structural clarity and topical depth that fails to crack ChatGPT's citation pool may still surface consistently in You.com or Phind.

For technical and developer audiences, Phind is now the dominant citation engine. For paid-search audiences, Kagi's source attribution is among the cleanest in the field.

Common Crawl (CCBot)
Not an engine. The substrate everything else is built on.

Allowing CCBot is the foundation of long-term presence — being in Common Crawl means being in the training data that future models read. Blocking CCBot in pursuit of "control" is a structural mistake for most small businesses. The training-data window matters as much as the citation window.

Engine landscape · where each one sits
Citation only: Perplexity, Grok, You.com · Phind · Kagi, Common Crawl
Cites and transacts: ChatGPT + SearchGPT, Claude (web tools), Gemini + AI Mode, Comet · Arc Search · Dia
Transaction only: Operator, Computer Use, Project Mariner

What this means for where to invest first

The matrix suggests a practical sequence for a small business that cannot do everything at once.

1. Fix crawl access universally
Audit your robots.txt to confirm that GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-Web, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended and CCBot are all allowed. This is the only step that returns across every engine, and it costs almost nothing. A site that blocks these crawlers, even by accident, is invisible to the engines behind them.
2. Structural clarity
Universal high-weight signal: a single H1, semantic landmarks (main, nav, article, footer), clean schema markup matched to content type, and llms.txt at the root. This is the work that compounds across all engines simultaneously, including ones that do not exist yet but may dominate in 2027.
3. Choose between trigger language and render stability based on what matters most
If the business value is in being cited (referral traffic, lead generation, top-of-funnel awareness), trigger language is the higher-return investment. If the value is in being transacted on (e-commerce, bookings, completed forms), render stability and CLS = 0 are higher-return. The sequence depends on cash-flow priority.
4. Verifiable identity
A canonical /about page with full Person schema, sameAs links to public profiles, and a credentials line an engine can corroborate from the page's own structure. This signal compounds as your content is read more: every additional citation reinforces the next engine's willingness to cite. A site without a verifiable identity hits a ceiling.
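
The robots.txt audit in step 1 can be sketched with Python's standard urllib.robotparser. This is a local check under stated assumptions: the sample robots.txt is hypothetical, audit_robots is an illustrative helper name, and each vendor's own parser may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Tokens as listed in step 1 above.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "CCBot",
]


def audit_robots(robots_txt, path="/"):
    """Return the tokens that may NOT fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in AI_TOKENS if not parser.can_fetch(token, path)]


# Hypothetical robots.txt that blocks one training crawler by accident.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = audit_robots(SAMPLE)
```

An empty result means every listed token may fetch the path. Here the result names GPTBot, while OAI-SearchBot and ChatGPT-User stay open, which illustrates how one surface can be blocked while the others still reach the site.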

The eight engines above are not the field as it will exist in 2027. New engines launch quarterly. Existing engines change citation behaviour on roughly monthly cycles. The discipline is engineering for the principles, not for any single engine's current preferences. The five signals are stable; the weights drift. A site that engineers for the signals stays compliant as the field evolves.


A worked example — the architecture of this site

The site you are reading was engineered from the ground up against this five-signal framework. The patterns are visible in production and observable to any AI engine that reads the source.

Crawl access — robots.txt allows all major AI training and fetch bots; blocks only known unfriendly fetchers. Cloudflare Firewall rules enforce the same allowlist at the network layer for resilience against ambiguous user-agent strings.

Render stability — pages are served from Cloudflare Workers at the edge, with no origin server, no CMS, no plugin layer. Images carry explicit width and height attributes; fonts use font-display: swap with reserved space; no late-loading layout-shifting elements. The PSI report shows CLS = 0 across the site.

Proof of work · PSI · Desktop · all 100 · Agentic 3/3
[Image: Google PageSpeed Insights report for vsourcecode.com on desktop: Performance 100, Accessibility 100, Best Practices 100, SEO 100, and Agentic Browsing 3/3, all green.]

The live PageSpeed Insights report for this domain — Performance, Accessibility, Best Practices and SEO at 100, Agentic Browsing at 3/3. This is the render-stability and crawl-access discipline above, measured rather than asserted.

Verify: pagespeed.web.dev/analysis/https-vsourcecode-com/546k3vv85n?form_factor=desktop

Structural clarity — every KB page has a single H1, semantic landmarks, JSON-LD Article schema with full Person/Organisation attribution. The Concepts Registry is injected dynamically into each article so an AI parsing one piece sees the canonical definition source alongside.
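
As an illustration of the Article-plus-Person pattern described above, here is a minimal sketch that builds the JSON-LD in Python. It is not this site's actual markup: every name, URL and value below is a placeholder, and the helper names are hypothetical. The property names (headline, mainEntityOfPage, author, publisher, sameAs) are standard schema.org vocabulary.

```python
import json


def article_jsonld(headline, url, author_name, author_url, same_as, org_name):
    """Build a schema.org Article with Person and Organization attribution."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": url,
        "author": {
            "@type": "Person",
            "name": author_name,
            "url": author_url,
            "sameAs": same_as,
        },
        "publisher": {"@type": "Organization", "name": org_name},
    }


def as_script_tag(data):
    """Serialise the object into a JSON-LD script element."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Placeholder values for illustration only.
example = article_jsonld(
    headline="How AI engines decide who to cite",
    url="https://example.com/kb/field-guide",
    author_name="Jane Example",
    author_url="https://example.com/about",
    same_as=["https://www.linkedin.com/in/jane-example"],
    org_name="Example Ltd",
)
tag = as_script_tag(example)
```

Pointing the author url at the canonical /about page ties the article's attribution to the verifiable-identity signal.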

Trigger language — the llms.txt at the root of the domain embeds natural-language symptom phrases inside every concept and KB description, so embeddings match user questions to canonical content. The same trigger phrases appear inside article bodies as natural prose.

Verifiable identity — a canonical /about page with author identity, fifteen-plus Cloudflare Workers as evidence of engineering practice, and the worked claim that the site itself is the proof of work. The architecture is the credential.

This is what a Machine-Readable Business looks like under the hood. The discipline is not theoretical — it is observable in any view-source request against the pages you have read on the way to this paragraph.


Common questions about AI engine citation behaviour

Which AI engine is most important for small business citation today?

Perplexity has the most explicit citation UI and the highest source click-through rate, making it the most directly traffic-generating engine for citation today. ChatGPT browse has the largest user base and therefore the largest potential reach. Gemini’s AI Mode rides on the largest existing search audience but the citation UI is less prominent.

For a small business deciding where to start, Perplexity is the highest-leverage single-engine focus; ChatGPT and Claude are essentially required as baseline; Gemini is unavoidable because it ships with every Android phone and Workspace account.

Do I need to allow all AI bots or can I be selective?

Selective is possible but rarely worth the complexity for a small business. The bots that matter for citation are well-behaved, respect robots.txt, and do not place meaningful load on the average site.

The case for blocking is usually misinformed — confused either with hostile scrapers (a different category) or with concerns about AI training that the Google-Extended token and the GPTBot opt-out already handle. The recommendation for almost every small business is allow all major AI bots, block known hostile scrapers, and monitor your logs.

Why is Render Stability rated low for Perplexity but high for Operator?

Because they do different things with your page. Perplexity reads the rendered HTML once, summarises it, and produces a citation. It does not interact with the page after that. Layout shift during render does not affect the summarisation or the citation decision.

Operator, by contrast, takes a screenshot, identifies an element at a coordinate, and clicks. If the layout has shifted between screenshot and click, the click misses. Render stability matters only when the engine interacts visually with the page after rendering it.
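
The screenshot-then-click failure can be shown with simple geometry. This is a sketch of the idea only, not how any agent is implemented: the Box type, the click_lands helper, the button coordinates and the 80-pixel banner are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    width: float
    height: float

    def center(self):
        return (self.left + self.width / 2, self.top + self.height / 2)

    def contains(self, x, y):
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)

    def shifted(self, dy):
        """Return the same box pushed down by dy pixels."""
        return Box(self.left, self.top + dy, self.width, self.height)


def click_lands(box_at_screenshot, box_at_click):
    """True if a click aimed at the screenshot position still hits the element."""
    x, y = box_at_screenshot.center()
    return box_at_click.contains(x, y)


# Hypothetical "Book now" button as seen in the agent's screenshot.
button = Box(left=100, top=400, width=160, height=40)

stable = click_lands(button, button)
# A late-loading banner pushes the button down 80 px before the click.
after_banner = click_lands(button, button.shifted(dy=80))
```

On the stable page the click lands. After the shift, the same coordinate hits whatever now occupies that space.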

What is the difference between Google-Extended and Googlebot?

Google-Extended is not a user-agent that appears in your logs — it is a token in robots.txt that controls whether Google can use your content for AI model training (Gemini and Vertex AI). Googlebot remains the user-agent that crawls and indexes for classical Google Search.

To allow Gemini to learn from your site, your robots.txt should either permit Google-Extended explicitly or remain silent (default-allow). To block Gemini training while permitting Google Search, add a User-agent: Google-Extended group containing Disallow: / and leave your Googlebot rules unchanged.
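
A quick local check of that configuration, again using Python's standard urllib.robotparser. The robots.txt content and the /services path are hypothetical, and the result reflects Python's parser rather than Google's own.

```python
from urllib.robotparser import RobotFileParser

# Opts out of AI training via the Google-Extended token; Search is untouched.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot falls through to the wildcard group and stays allowed.
search_allowed = parser.can_fetch("Googlebot", "/services")
# Google-Extended matches its own group and is disallowed.
training_allowed = parser.can_fetch("Google-Extended", "/services")
```

With this file, search_allowed is True and training_allowed is False.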

Does X presence really matter for Grok citations?

For Grok specifically, yes — more than for any other engine. Grok’s architecture is built around real-time integration with X, and content surfaced on X with engagement signals receives citation weight that the pure-content engines would not give.

The practical move for a small business is to maintain X presence at a minimum useful level — periodic shares of canonical content with sensible engagement. This is not advice to chase X virality; it is acknowledgement that Grok treats X as a first-class signal in a way no other engine does.

How quickly does citation behaviour change?

Faster than SEO ranking behaviour and slower than social platform algorithm changes. Engines update their citation logic on roughly monthly cycles based on internal evaluation, with occasional larger shifts when new models ship.

The five signals are stable across these updates; the specific weighting drifts. A site that engineers for the signals remains compliant as the weights change. A site that optimises for a specific 2026 engine quirk will likely need to redo the work in 2027.

Should I implement WebMCP endpoints now?

If your business has a transactional surface that an agentic browser could meaningfully use — booking, quoting, ordering, scheduling — yes, even as an experimental implementation. Lighthouse’s Agentic Browsing audit already includes an informational WebMCP check.

Few sites have implemented it yet; early adopters will be in the small pool of agent-ready sites for a window. For pure-content sites with no transactional surface, focus on the five-signal foundation first.


What comes next in this series

The next piece — How To Find Out If AI Engines Are Even Reading Your Site — is the diagnostic instrument. How to read server logs for AI bot traffic, what healthy crawl patterns look like, and how to measure your Citation Gap against named competitors with a methodology you can run weekly.

After that, the case-study piece — what an AI-cited page looks like under the hood, with code — and the architectural piece on sites built for machines as the primary audience.

The discipline is being documented as it forms. This series is one engineer's field notes during the formation.