#robotstxt
Continued thread

Here's #Cloudflare's #robots-txt file:

# Cloudflare Managed Robots.txt to block AI related bots.

User-agent: AI2Bot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: amazon-kendra
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: bigsur.ai
Disallow: /

User-agent: Brightbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: DigitalOceanGenAICrawler
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: FriendlyCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: iaskspider/2.0
Disallow: /

User-agent: ICC-Crawler
Disallow: /

User-agent: img2dataset
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-agent: LinerBot
Disallow: /

User-agent: MachineLearningForPeaceBot
Disallow: /

User-agent: Meltwater
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /

User-agent: Nicecrawler
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: QualifiedBot
Disallow: /

User-agent: Scoop.it
Disallow: /

User-agent: Seekr
Disallow: /

User-agent: SemrushBot-OCOB
Disallow: /

User-agent: Sidetrade indexer bot
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: YouBot
Disallow: /
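
If the managed rule is enabled for a zone, it can be confirmed from the outside by fetching the served robots.txt and looking for the marker comment at the top of the list above (example.com is a placeholder):

curl -fsS https://example.com/robots.txt | grep -i 'Cloudflare Managed'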
Replied in thread

@ErikUden

What is worrying is their self-centered, megalomaniacal ego trip, not realizing that they are the remaining world power, armed to the teeth with weapons of every kind and holding the private data of the world's population.

That said, the fact that you, who are apparently in charge of several #mastodon instances in the #fediVerse, cannot fix their #robotsTxt while spending time talking about other countries' internal affairs is kind of embarrassing.
sry

Continued thread

Extending the meta tags would be fairly straightforward. In addition to the existing "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW", we could introduce "MODELTRAINING" and "NOMODELTRAINING".
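
As a rough illustration of what that could look like in a page header (the "nomodeltraining" value is purely hypothetical and not part of any current standard):

<!-- hypothetical: "nomodeltraining" is a proposed value, not an existing one -->
<meta name="robots" content="index, follow, nomodeltraining">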

Of course, just because there is an RFC does not mean that anyone will follow it. But it would be a start, and something to push for. I would love to hear your opinion.

3/3

Continued thread

This is not an acceptable situation and therefore I propose to extend the robots.txt standard and the corresponding HTML meta tags.

For robots.txt, I see two ways to approach this:

The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§".

The second option would be a directive like "Crawl-delay" that indicates how to use the data. For example, "Model-training: disallow".
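
To make the two options concrete, here is a sketch of what such a robots.txt could look like (both directives are hypothetical; nothing parses them today):

# Option 1: a reserved meta-user-agent matched by every AI training bot
User-agent: §MODELTRAINING§
Disallow: /

# Option 2: a usage directive in the style of Crawl-delay
User-agent: *
Model-training: disallow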

2/3

I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I only see two options, both of which are bad in different ways:

I can either disallow all known AI bots, while still being guaranteed to miss some.

Or I can disallow all bots and explicitly allow known search engines. This way I will overblock and favour the big search engines over yet-unknown challengers.
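
For reference, the second option looks roughly like this (the allowed bots here are only examples, which is exactly the problem: every omission favours the incumbents):

# Deny everything by default, then re-allow known search engine crawlers
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /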

1/3

Replied in thread

@Codeberg Thanks for the inspiration. After checking it, I'll use the output of the following command in the robots.txt of my doc:

❯ bash -c '(
# Carve out the final rule block of the Codeberg robots.txt: reverse the
# file, keep the span from the last "Disallow: /" back to the nearest
# blank line, then restore the original order.
curl -fsS --tlsv1.3 https://codeberg.org/robots.txt | \
tac | \
grep -A999 "^Disallow: /$" | \
grep -m1 -B999 "^[[:space:]]*$" | \
tac
# Append the community-maintained ai.robots.txt blocklist.
curl -fsS --tlsv1.3 https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt
# sort -ru merges and de-duplicates both lists in one pass.
) | sort -ru'
User-agent: omgilibot
User-agent: omgili
User-agent: meta-externalagent
User-agent: img2dataset
User-agent: facebookexternalhit
User-agent: cohere-ai
User-agent: anthropic-ai
User-agent: YouBot
User-agent: Webzio-Extended
User-agent: VelenPublicWebCrawler
User-agent: Timpibot
User-agent: Scrapy
User-agent: PetalBot
User-agent: PerplexityBot
User-agent: Omgilibot
User-agent: Omgili
User-agent: OAI-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Meta-ExternalAgent
User-agent: ImagesiftBot
User-agent: ICC-Crawler
User-agent: GoogleOther-Video
User-agent: GoogleOther-Image
User-agent: GoogleOther
User-agent: Google-Extended
User-agent: GPTBot
User-agent: FriendlyCrawler
User-agent: FacebookBot
User-agent: Diffbot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: Applebot
User-agent: Amazonbot
User-agent: Ai2Bot-Dolma
User-agent: AI2Bot
Disallow: /
#ai #seo #robotstxt

A(I)ll nuts

Over on the blog of the Uberspace operators there is a very interesting article on what the AI companies' bots (which by now seem to have gone completely off the rails) are setting off, with no regard for the damage they cause:

(…) In summary, based on our observations, around 30%-50% of all requests to small sites are now generated by bots. For large sites this figure even fluctuates between 20% and 75%. In our eyes, and given that robots.txt is being ignored, a point has now been reached at which this bot behaviour is no longer acceptable and harms our operations.

blog.uberspace.de

During my irregular excursions into the server logs of my own sites, but also of my clients' sites, it looks exactly the same: bot traffic has increased disproportionately, and at times it is really brutal with what frequency and from how many rotating IPs these things hammer the site. >:-(
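
For anyone who wants to reproduce this kind of check on their own logs, a rough tally of requests per user agent can be pulled from a combined-format access log like this (the log path is an assumption; adjust to your setup):

# Top user agents by request count (nginx/Apache combined log format)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20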

#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest

https://webrocker.de/?p=29216

blog.uberspace.de: Bad Robots

@ianb @Meyerweb

I was about to ask, myself, where that bloke was who says "Hah! No." to such questions. (-:

That said, if the argument was (and is) that 35 years was an egregious and unjust suggestion by prosecutors, it is *still* surely egregious and unjust in *other* cases if one contends that they are alike.

*That* said, I wonder how that "Hah! No." bloke answers the question: Is ignoring robots.txt illegal?

(-:

law.stackexchange.com/q/77755/

Law Stack Exchange: Does the robots exclusion standard have any legal weight?

New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /

Official references:

Meta for Developers: About FacebookBot
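
To see whether the renamed crawler is already hitting a server you run, the old and new user-agent strings can be counted directly in the access log (the log path is an assumption; adjust to your setup):

# Requests from the old vs. the renamed Meta crawler
grep -c 'FacebookBot' /var/log/nginx/access.log
grep -c 'meta-externalagent' /var/log/nginx/access.log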

Just throwing out a thought before I do some research on this, but I think robots.txt needs an update.

Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used, e.g.:

- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false

Etc. And I'd like to apply this to my social media profile and any other web presence, not just my personal website.
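
A purely hypothetical sketch of what such usage flags might look like in a robots.txt-style file, using the keys from the list above (none of these directives exist in the current standard):

User-agent: *
Allow: /
Usage-monetizable: false
Usage-fediverse: true
Usage-nonfediverse: false
Usage-ai: false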