I asked ChatGPT about the recent copyright news. It rehashed my latest column and misconstrued the facts. But why was it on my site at all?
https://www.plagiarismtoday.com/2025/07/23/chatgpt-ignores-robots-txt-rehashes-my-column/

Here's #Cloudflare's #robots-txt file:
# Cloudflare Managed Robots.txt to block AI related bots.
User-agent: AI2Bot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: amazon-kendra
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: bigsur.ai
Disallow: /
User-agent: Brightbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: DigitalOceanGenAICrawler
Disallow: /
User-agent: DuckAssistBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: FriendlyCrawler
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: iaskspider/2.0
Disallow: /
User-agent: ICC-Crawler
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: Kangaroo Bot
Disallow: /
User-agent: LinerBot
Disallow: /
User-agent: MachineLearningForPeaceBot
Disallow: /
User-agent: Meltwater
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: meta-externalfetcher
Disallow: /
User-agent: Nicecrawler
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: QualifiedBot
Disallow: /
User-agent: Scoop.it
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: SemrushBot-OCOB
Disallow: /
User-agent: Sidetrade indexer bot
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: VelenPublicWebCrawler
Disallow: /
User-agent: Webzio-Extended
Disallow: /
User-agent: YouBot
Disallow: /
#Business #Findings
Most blocked SEO bots · Insights from ~140 million websites https://ilo.im/16439x
_____
#SEO #Bots #Crawlers #Content #Website #Blog #RobotsTxt #Development #WebDev #Backend
#Business #Explorations
What would happen if I blocked big search? · Pros and cons of blocking major search engines https://ilo.im/163yb3
_____
#SearchEngine #SEO #AI #Website #Blog #RobotsTxt #Development #WebDev #Frontend #Backend
#Development #Findings
Most blocked AI bots · "Block rates have increased significantly over the past year." https://ilo.im/16425n
_____
#AI #Bots #Crawlers #Content #Website #Blog #RobotsTxt #WebDev #Backend
#Business #Guidelines
The Internet Archive opt-out itch · Ways to deal with your public internet history https://ilo.im/163ssx
_____
#InternetArchive #Internet #History #Consent #Trust #Transparency #Content #Blog #Website #RobotsTxt
#fediVerse #AI #dataMining #robotsTXT #fediAdmin
This looks to me much more like we should bury Trojan horses right in the bellies of the beasts. My server rules and profiles state that all data is CC BY-NC-SA.
If they use and train on that data, they should definitely land in serious legal and financial trouble.
What is worrying is their self-centred, megalomaniacal ego trip, not realising that they are the remaining world power, armed to the teeth with weapons of all kinds and holding all the private data of the world's population.
That said, bearing in mind that you, apparently being in charge of several #mastodon instances in the #fediVerse, are not able to fix their #robotsTxt while wasting time talking about other countries' internal affairs, is kind of embarrassing.
sry
Extending the meta tags would be fairly straightforward. In addition to the existing "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW", we could introduce "MODELTRAINING" and "NOMODELTRAINING".
Of course, just because there is an RfC does not mean that anyone will follow it. But it would be a start, and something to push for. I would love to hear your opinion.
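As a rough sketch of that proposal (the "nomodeltraining" value is hypothetical and not part of any current standard):

```
<meta name="robots" content="index, follow, nomodeltraining">
```

A crawler honouring the extension would be free to index the page but would have to exclude it from training data.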
3/3
This is not an acceptable situation and therefore I propose to extend the robots.txt standard and the corresponding HTML meta tags.
For robots.txt, I see two ways to approach this:
The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§".
The second option would be a directive like "Crawl-delay" that indicates how to use the data. For example, "Model-training: disallow".
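Both proposals sketched in robots.txt syntax (both directives are hypothetical, not part of the current standard or RFC 9309):

```
# Option 1: a meta-user-agent matched by every model-training crawler
User-agent: §MODELTRAINING§
Disallow: /

# Option 2: a per-group directive, analogous to Crawl-delay
User-agent: *
Model-training: disallow
```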
2/3
I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I see only two options, both of which are bad in different ways:
I can either disallow all known AI bots, while still being guaranteed to miss some bots.
Or I can disallow all bots and explicitly allow known search engines. This way I will overblock and favour the big search engines over yet-unknown challengers.
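The second option would look something like this (the allow-listed crawler names are just examples):

```
# Allow-list known search engines, block everything else.
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
```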
1/3
@Codeberg Thanks for the inspiration. After checking it, I'll use the output of the following command in the robots.txt of my doc:
❯ bash -c '(
  # Extract the final rule block from Codeberg'"'"'s robots.txt:
  # reverse the file, keep everything from the last "Disallow: /"
  # back to the nearest blank line, then restore the original order.
  curl -fsS --tlsv1.3 https://codeberg.org/robots.txt | \
  tac | \
  grep -A999 "^Disallow: /$" | \
  grep -m1 -B999 "^[[:space:]]*$" | \
  tac
  # Append the community-maintained ai.robots.txt block list.
  curl -fsS --tlsv1.3 https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt
) | sort -ru'
User-agent: omgilibot
User-agent: omgili
User-agent: meta-externalagent
User-agent: img2dataset
User-agent: facebookexternalhit
User-agent: cohere-ai
User-agent: anthropic-ai
User-agent: YouBot
User-agent: Webzio-Extended
User-agent: VelenPublicWebCrawler
User-agent: Timpibot
User-agent: Scrapy
User-agent: PetalBot
User-agent: PerplexityBot
User-agent: Omgilibot
User-agent: Omgili
User-agent: OAI-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Meta-ExternalAgent
User-agent: ImagesiftBot
User-agent: ICC-Crawler
User-agent: GoogleOther-Video
User-agent: GoogleOther-Image
User-agent: GoogleOther
User-agent: Google-Extended
User-agent: GPTBot
User-agent: FriendlyCrawler
User-agent: FacebookBot
User-agent: Diffbot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: Applebot
User-agent: Amazonbot
User-agent: Ai2Bot-Dolma
User-agent: AI2Bot
Disallow: /
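The final `sort -ru` merge step can be illustrated offline (the file names below are arbitrary stand-ins for the two curl outputs): reverse sorting keeps the `User-agent:` lines grouped above the single `Disallow: /` line, and `-u` drops the duplicates.

```shell
# Two overlapping block lists, standing in for the two fetched files.
printf 'User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n' > list_a.txt
printf 'User-agent: CCBot\nDisallow: /\n' > list_b.txt

# sort -ru: reverse lexicographic order, unique lines only.
# "User-agent: …" sorts above "Disallow: /", so the result is one
# combined User-agent block followed by a single Disallow rule.
cat list_a.txt list_b.txt | sort -ru
```

Note that this produces one merged group; per RFC 9309, consecutive `User-agent` lines all share the rules that follow them.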
A(I)ll gone mad
Over on the blog of the Uberspace operators there is a very interesting article about what the (by now apparently completely unhinged) bots of the AI companies are causing, with no regard for the damage:
(…) In summary, from our observation around 30%-50% of all requests to small sites are now generated by bots. For large sites this figure even fluctuates between 20% and 75%. In our view, and with robots.txt being ignored, we have now reached a point at which this bot behaviour is no longer acceptable and harms our operation.
blog.uberspace.de
During my irregular excursions into the server logs of my own sites, but also of my customers' sites, it is just the same: bot accesses have increased disproportionately, and it is sometimes really brutal with what frequency, and from how many rotating IPs, these things hammer the site. >:-(
#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest
#Robotstxt #CrawlerBacklash Trickle-down effects: "people start blocking all crawlers, and some crawlers are very important, for search indexing, internet archiving, some are used for academic research, and so the bad behaviours of all these #AIcompanies, and the backlash to it, is kind of fundamentally changing how the Internet works, how it is remembered and indexed..."
https://pca.st/yto6v3il?t=11m34s
I was about to ask, myself, where that bloke was who says "Hah! No." to such questions. (-:
That said, if the argument was (and is) that 35 years was an egregious and unjust suggestion by prosecutors, it is *still* surely egregious and unjust in *other* cases if one contends that they are alike.
*That* said, I wonder how that "Hah! No." bloke answers the question: Is ignoring robots.txt illegal?
(-:
New scraper just dropped (well, an old scraper was renamed):
Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:
User-Agent: meta-externalagent
Disallow: /
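A quick way to check whether a robots.txt already contains the new rule (the file below is a local stand-in for a downloaded copy, e.g. fetched via curl):

```shell
# Local copy of a robots.txt to check against.
cat > robots-copy.txt <<'EOF'
User-Agent: meta-externalagent
Disallow: /
EOF

# grep -qi: quiet, case-insensitive match on the user-agent line
# (robots.txt field names and agent tokens are case-insensitive).
if grep -qi '^User-Agent: meta-externalagent' robots-copy.txt; then
  echo "meta-externalagent is blocked"
else
  echo "meta-externalagent rule missing"
fi
```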
Official references:
FacebookBot no longer mentions GenAI.
Meta-ExternalAgent mentions "AI".
#Business #Reports
Google is the only search engine that works on Reddit now · The exclusivity stems from a multi-million dollar AI deal https://ilo.im/15zlg6
_____
#Reddit #Google #SearchEngine #SEO #AI #Content #UGC #RobotsTxt #Development #WebDev
Interestingly, a few of the top websites actively invite AI crawlers to crawl them.
https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/