joshuaopolko.com received 4,288 AI fetches from 13 AI bots in the 14 days ending June 27, 2026; 20 distinct bots were tracked in total, including 7 traditional search crawlers. ChatGPT live-user traffic alone accounted for 2,211 of those AI fetches, more than Googlebot and Bingbot combined. The /claude-fable-5-issues-fixes/ page received 893 AI fetches in two weeks, driven almost entirely by real ChatGPT users asking about the model in real time. Four training crawlers each sent more than 340 fetches to this domain. A further 84,975 scanner requests were filtered out, and 1,003 spoofed bot requests were excluded from the counts.
What these numbers mean
This report covers first-party data from Apache server logs for joshuaopolko.com, processed by the GEO Observatory. Cloudflare sits in front of the site and passes real visitor IPs via the CF-Connecting-IP header, which Apache logs directly. All bot classification uses a combination of User-Agent matching, rDNS verification, and ASN lookup. Bots that fail rDNS verification are classified as spoofed and excluded from the counts below.
The critical distinction in this data is between live-user fetches and training fetches. A live-user fetch means a real person asked an AI tool something, and the AI fetched this site in real time to answer. A training fetch means a crawler was building a dataset. The two have very different implications for citability: live-user traffic means your content is being pulled into answers right now; training traffic means future model versions may incorporate your content.
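As a rough illustration of how the live-user versus training split can be derived from a User-Agent string, here is a minimal sketch. The token list is an illustrative subset chosen for this example, not the Observatory's curated bot list, and the function name `classify_user_agent` is hypothetical.

```python
from typing import Optional

# Illustrative subset of User-Agent tokens (an assumption for this sketch,
# NOT the Observatory's curated list). The first matching token wins.
UA_RULES = [
    ("ChatGPT-User", "live-user"),
    ("Claude-User", "live-user"),
    ("Perplexity-User", "live-user"),
    ("OAI-SearchBot", "ai-search"),
    ("PerplexityBot", "ai-search"),
    ("GPTBot", "training"),
    ("ClaudeBot", "training"),
    ("Bytespider", "training"),
    ("CCBot", "training"),
    ("Googlebot", "search"),
    ("bingbot", "search"),
]


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the traffic category for a User-Agent, or None if unknown."""
    ua = user_agent.lower()
    for token, category in UA_RULES:
        if token.lower() in ua:
            return category
    return None
```

A User-Agent match alone only says what a request claims to be; it still needs the network-level verification described in this report before it is counted.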
Bot-by-bot breakdown
| Bot | Type | Fetches | Verified |
|---|---|---|---|
| ChatGPT user-fetch | live-user | 2,211 | 1,063 / 1,075 |
| Googlebot | search | 1,018 | 158 / 221 |
| Bingbot | search | 934 | 576 / 585 |
| Petal (Huawei) | search | 624 | 0 / 0 |
| ByteDance Bytespider | training | 365 | 0 / 0 |
| Anthropic ClaudeBot | training | 353 | 0 / 180 |
| Amazonbot | training | 352 | 0 / 0 |
| Meta AI | training | 343 | 0 / 0 |
| Perplexity | ai-search | 203 | 76 / 88 |
| Yandex | search | 184 | 89 / 103 |
| Applebot | search | 177 | 84 / 95 |
| ChatGPT search | ai-search | 157 | 68 / 73 |
| OpenAI GPTBot | training | 112 | 38 / 51 |
| DuckDuckGo | search | 84 | 0 / 27 |
| Claude user-fetch | live-user | 83 | 0 / 32 |
| Seznam | search | 60 | 0 / 0 |
| You.com | training | 45 | 0 / 0 |
| Common Crawl | training | 43 | 0 / 0 |
| Perplexity user-fetch | live-user | 16 | 11 / 11 |
| Claude search | ai-search | 5 | 0 / 0 |
The Verified column shows confirmed / sampled. A "0 / 0" reading means this bot was not sampled for rDNS in this window, not that it failed. Requests that fail rDNS on a sampled check are classified as spoofed and excluded from this table (1,003 spoofed requests total this period).
ChatGPT dominates live-user AI traffic
ChatGPT user-fetch sent 2,211 fetches in the 14-day window, more than Googlebot (1,018) and Bingbot (934) combined. This is ChatGPT's real-time retrieval: a user types a question, ChatGPT decides the answer needs fresh web content, and it fetches from this site. These are not training crawls. Each one means a page was retrieved for an ongoing conversation, where it can be cited directly.
Claude user-fetch added 83 live-user fetches, and Perplexity user-fetch added 16. The live-user AI category total was 2,310 fetches, versus 365 for AI-search crawls (bots pre-crawling for AI search indexes) and 1,613 for AI training crawls.
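The category totals above can be reproduced from the bot-by-bot table. The sketch below copies the table's rows by hand and sums them by type; the helper name `totals_by_type` is hypothetical.

```python
from collections import defaultdict

# (bot, type, fetches) copied from the bot-by-bot table.
ROWS = [
    ("ChatGPT user-fetch", "live-user", 2211),
    ("Googlebot", "search", 1018),
    ("Bingbot", "search", 934),
    ("Petal (Huawei)", "search", 624),
    ("ByteDance Bytespider", "training", 365),
    ("Anthropic ClaudeBot", "training", 353),
    ("Amazonbot", "training", 352),
    ("Meta AI", "training", 343),
    ("Perplexity", "ai-search", 203),
    ("Yandex", "search", 184),
    ("Applebot", "search", 177),
    ("ChatGPT search", "ai-search", 157),
    ("OpenAI GPTBot", "training", 112),
    ("DuckDuckGo", "search", 84),
    ("Claude user-fetch", "live-user", 83),
    ("Seznam", "search", 60),
    ("You.com", "training", 45),
    ("Common Crawl", "training", 43),
    ("Perplexity user-fetch", "live-user", 16),
    ("Claude search", "ai-search", 5),
]


def totals_by_type(rows):
    """Sum fetches per bot type."""
    totals = defaultdict(int)
    for _bot, bot_type, fetches in rows:
        totals[bot_type] += fetches
    return dict(totals)


totals = totals_by_type(ROWS)
# "AI fetches" here means every category except traditional search.
ai_total = sum(count for kind, count in totals.items() if kind != "search")
print(totals)
print("AI fetches:", ai_total)
```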
Four training crawlers are actively indexing this domain
ByteDance Bytespider (365 fetches), Anthropic ClaudeBot (353), Amazonbot (352), and Meta AI (343) each sent between 340 and 370 fetches. OpenAI GPTBot added 112 more. These five bots collectively account for 1,525 of the 1,613 training fetches. This is the pool of content that may feed future versions of Doubao, Claude, Amazon's LLMs, Llama, and GPT, respectively.
You.com (45) and Common Crawl (43) round out the training category. Common Crawl data is used by many smaller LLM projects and research models. Being present in Common Crawl is a secondary citability signal with wide reach.
Top pages by AI traffic
| Page | AI fetches | Live-user | Total |
|---|---|---|---|
| /claude-fable-5-issues-fixes/ | 893 | 859 | 945 |
| /agent-zero/ | 328 | 308 | 368 |
| / (homepage) | 310 | 241 | 504 |
| /dify-self-hosted-guide/ | 302 | 285 | 320 |
| /searxng-self-hosted-guide/ | 183 | 164 | 203 |
| /claude-code-specification-workflow-mcp/ | 160 | 142 | 193 |
| /ollama/ | 77 | 70 | 121 |
| /crewai-setup-production-guide/ | 71 | 54 | 86 |
| /kidsevents/ | 57 | 25 | 128 |
| /medical-training/ | 41 | 35 | 58 |
| /hometurf/ | 27 | 7 | 54 |
| /claude-seo/ | 26 | 1 | 51 |
| /geo-observatory/ | 26 | 0 | 44 |
| /perplexica-self-hosted-guide/ | 25 | 6 | 40 |
| /geo-field-manual/ | 22 | 5 | 59 |
| /driftlights/ | 21 | 0 | 37 |
| /psychedelic-vr-visual-effects-meta-quest/ | 21 | 3 | 30 |
| /n8n-self-hosted-guide/ | 20 | 4 | 40 |
| /geo-ai-citation/ | 20 | 2 | 43 |
What drives the top pages
The /claude-fable-5-issues-fixes/ page is a case study in timely content. It covers known issues and fixes for a specific new model, a topic where real users query ChatGPT and get referred to specific sources. 834 of its 893 AI fetches came from ChatGPT user-fetch. The page did not exist a month earlier; this is what the GEO literature calls "freshness-sensitive" citation in action.
/agent-zero/ and /dify-self-hosted-guide/ represent a different pattern: high-intent install guides where someone asking "how do I set up Agent Zero" or "how do I install Dify" gets pointed here. These pages have durable citation appeal because the questions are stable, not time-sensitive.
/geo-field-manual/ and /geo-ai-citation/ show the opposite mix: only 5 of 22 and 2 of 20 AI fetches, respectively, were live-user. The rest came from crawlers, which suggests these pages are being incorporated into model training data rather than cited in real time. This is the long-game payoff: future model versions may reflect these pages' framing of GEO concepts.
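One way to make these page patterns comparable is a live-user share: live-user fetches divided by AI fetches. The sketch below uses four rows copied from the top-pages table; the metric and the name `live_user_share` are assumptions for illustration, not a figure the Observatory reports.

```python
# (page, ai_fetches, live_user_fetches) copied from the top-pages table.
PAGES = [
    ("/claude-fable-5-issues-fixes/", 893, 859),
    ("/agent-zero/", 328, 308),
    ("/geo-field-manual/", 22, 5),
    ("/geo-ai-citation/", 20, 2),
]


def live_user_share(ai_fetches: int, live_user: int) -> float:
    """Fraction of a page's AI fetches that came from live-user bots."""
    if ai_fetches == 0:
        return 0.0
    return live_user / ai_fetches


for page, ai, live in PAGES:
    print(f"{page:32s} {live_user_share(ai, live):6.1%}")
```

A high share points to real-time retrieval; a low share points to crawler-driven traffic.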
Spoofed bots and scanner noise
84,975 scanner and noise requests were filtered from these counts entirely. On top of that, 1,003 requests claimed to be known AI bots (using their User-Agent strings) but failed rDNS verification, meaning the IP did not resolve to the expected network. These are logged as spoofed and excluded. The verification rate among bots with a valid rDNS sample was 85.1% (2,163 verified out of 2,541 sampled).
Spoofed bots are common. Any request claiming to be Googlebot from a Hetzner VPS, or Bingbot from a random residential IP, fails immediately. Because User-Agent strings are trivial to forge, a network-level check such as rDNS is what reliably separates real crawlers from noise.
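The rDNS check described above is commonly implemented as forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The following is a minimal sketch of that general technique, not the Observatory's actual code. The function name `verify_crawler` and its injectable resolver parameters are hypothetical, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """A-record lookup: hostname to IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if the IP passes forward-confirmed reverse DNS."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False
    # The hostname must belong to the network the bot claims to come from.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        # A PTR record can be set to anything by the IP's owner, so confirm
        # that the hostname resolves back to the same address.
        return ip in forward(hostname)
    except OSError:
        return False
```

For a request claiming to be Googlebot, the call would look like `verify_crawler(ip, (".googlebot.com", ".google.com"))`; for Bingbot the suffix is `.search.msn.com`. The resolvers are parameters so the logic can be exercised without network access.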
Methodology
Data source: Apache logs in the combined_cf format. Cloudflare passes real visitor IPs via CF-Connecting-IP; Apache logs this directly. The GEO Observatory pipeline runs daily at 06:40 UTC, processing the previous 14 days of logs. Bot classification uses User-Agent string matching against a curated bot list, then rDNS verification for bots where a known hostname pattern exists. The Verified column in the bot table shows confirmed / sampled; bots with "0 / 0" were not sampled in this window. ASN lookup is done via ip-api.com. All data is from server logs, not from any third-party analytics platform.
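To show what the first pipeline step might look like, here is a minimal sketch of parsing one log line. The exact layout of combined_cf is not given in this report, so the sketch assumes it is Apache's standard combined format with the CF-Connecting-IP value in the client-IP position. The regex, the function name `parse_line`, and the sample line are all made up for illustration.

```python
import re
from typing import Dict, Optional

# Assumed layout: standard Apache "combined" fields, with the first field
# holding the real visitor IP taken from CF-Connecting-IP.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


# Made-up sample line, using a documentation-range IP address.
sample = (
    '203.0.113.7 - - [27/Jun/2026:06:40:00 +0000] '
    '"GET /agent-zero/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"'
)
fields = parse_line(sample)
print(fields["ip"], fields["path"], fields["user_agent"])
```

The `ip` and `user_agent` fields are the inputs to the classification and verification steps described above.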
See the live version of this data at GEO Observatory (updates daily). Source code and methodology details in Site as AI Infrastructure. Questions about this report: see GEO Answers.