#robotstxt
Continued thread

Here's #Cloudflare's #robots-txt file:

# Cloudflare Managed Robots.txt to block AI related bots.

User-agent: AI2Bot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: amazon-kendra
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: AwarioRssBot
Disallow: /

User-agent: AwarioSmartBot
Disallow: /

User-agent: bigsur.ai
Disallow: /

User-agent: Brightbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: DigitalOceanGenAICrawler
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: FriendlyCrawler
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: iaskspider/2.0
Disallow: /

User-agent: ICC-Crawler
Disallow: /

User-agent: img2dataset
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-agent: LinerBot
Disallow: /

User-agent: MachineLearningForPeaceBot
Disallow: /

User-agent: Meltwater
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /

User-agent: Nicecrawler
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: QualifiedBot
Disallow: /

User-agent: Scoop.it
Disallow: /

User-agent: Seekr
Disallow: /

User-agent: SemrushBot-OCOB
Disallow: /

User-agent: Sidetrade indexer bot
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: Webzio-Extended
Disallow: /

User-agent: YouBot
Disallow: /
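
If the managed rule is enabled for a zone, it can be confirmed from the outside by fetching the served robots.txt and looking for the marker comment at the top of the list above (example.com is a placeholder):

curl -fsS https://example.com/robots.txt | grep -i 'Cloudflare Managed'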
Replied in thread

@ErikUden

What is worrying is their self-centered, megalomaniacal ego trip, not realizing that they are the remaining world power, armed to the teeth with weapons of every kind and holding the private data of the world's population.

That said, the fact that you, who are apparently in charge of several #mastodon instances in the #fediVerse, cannot fix their #robotsTxt while spending time talking about other countries' internal affairs is kind of embarrassing.
sry

Continued thread

Extending the meta tags would be fairly straightforward. In addition to the existing "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW", we could introduce "MODELTRAINING" and "NOMODELTRAINING".
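
As a rough illustration of what that could look like in a page header (the "nomodeltraining" value is purely hypothetical and not part of any current standard):

<!-- hypothetical: "nomodeltraining" is a proposed value, not an existing one -->
<meta name="robots" content="index, follow, nomodeltraining">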

Of course, just because there is an RFC does not mean that anyone will follow it. But it would be a start, and something to push for. I would love to hear your opinion.

3/3

Continued thread

This is not an acceptable situation and therefore I propose to extend the robots.txt standard and the corresponding HTML meta tags.

For robots.txt, I see two ways to approach this:

The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§".

The second option would be a directive like "Crawl-delay" that indicates how to use the data. For example, "Model-training: disallow".
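
To make the two options concrete, here is a sketch of what such a robots.txt could look like (both directives are hypothetical; nothing parses them today):

# Option 1: a reserved meta-user-agent matched by every AI training bot
User-agent: §MODELTRAINING§
Disallow: /

# Option 2: a usage directive in the style of Crawl-delay
User-agent: *
Model-training: disallow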

2/3

I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I only see two options, both of which are bad in different ways:

I can either disallow all known AI bots, while still being guaranteed to miss some.

Or I can disallow all bots and explicitly allow known search engines. This way I will overblock and favour the big search engines over yet-unknown challengers.
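
For reference, the second option looks roughly like this (the allowed bots here are only examples, which is exactly the problem: every omission favours the incumbents):

# Deny everything by default, then re-allow known search engine crawlers
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /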

1/3

Replied in thread

@Codeberg Thanks for the inspiration. After checking it, I'll use the output of the following command in the robots.txt of my doc:

❯ bash -c '(
# Carve out the final rule block of the Codeberg robots.txt: reverse the
# file, keep the span from the last "Disallow: /" back to the nearest
# blank line, then restore the original order.
curl -fsS --tlsv1.3 https://codeberg.org/robots.txt | \
tac | \
grep -A999 "^Disallow: /$" | \
grep -m1 -B999 "^[[:space:]]*$" | \
tac
# Append the community-maintained ai.robots.txt blocklist.
curl -fsS --tlsv1.3 https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt
# sort -ru merges and de-duplicates both lists in one pass.
) | sort -ru'
User-agent: omgilibot
User-agent: omgili
User-agent: meta-externalagent
User-agent: img2dataset
User-agent: facebookexternalhit
User-agent: cohere-ai
User-agent: anthropic-ai
User-agent: YouBot
User-agent: Webzio-Extended
User-agent: VelenPublicWebCrawler
User-agent: Timpibot
User-agent: Scrapy
User-agent: PetalBot
User-agent: PerplexityBot
User-agent: Omgilibot
User-agent: Omgili
User-agent: OAI-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Meta-ExternalAgent
User-agent: ImagesiftBot
User-agent: ICC-Crawler
User-agent: GoogleOther-Video
User-agent: GoogleOther-Image
User-agent: GoogleOther
User-agent: Google-Extended
User-agent: GPTBot
User-agent: FriendlyCrawler
User-agent: FacebookBot
User-agent: Diffbot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: Applebot
User-agent: Amazonbot
User-agent: Ai2Bot-Dolma
User-agent: AI2Bot
Disallow: /
#ai #seo #robotstxt

A(I)ll nuts

Over on the blog of the Uberspace operators there is a very interesting article on what the AI companies' bots (which by now seem to have gone completely off the rails) are setting off, with no regard for the damage they cause:

(…) In summary, based on our observations, around 30%-50% of all requests to small sites are now generated by bots. For large sites this figure even fluctuates between 20% and 75%. In our eyes, and given that robots.txt is being ignored, a point has now been reached at which this bot behaviour is no longer acceptable and harms our operations.

blog.uberspace.de

During my irregular excursions into the server logs of my own sites, but also of my clients' sites, it looks exactly the same: bot traffic has increased disproportionately, and at times it is really brutal with what frequency and from how many rotating IPs these things hammer the site. >:-(
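
For anyone who wants to reproduce this kind of check on their own logs, a rough tally of requests per user agent can be pulled from a combined-format access log like this (the log path is an assumption; adjust to your setup):

# Top user agents by request count (nginx/Apache combined log format)
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20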

#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest

https://webrocker.de/?p=29216

blog.uberspace.de: Bad Robots

@ianb @Meyerweb

I was about to ask, myself, where that bloke was who says "Hah! No." to such questions. (-:

That said, if the argument was (and is) that 35 years was an egregious and unjust suggestion by prosecutors, it is *still* surely egregious and unjust in *other* cases if one contends that they are alike.

*That* said, I wonder how that "Hah! No." bloke answers the question: Is ignoring robots.txt illegal?

(-:

law.stackexchange.com/q/77755/

Law Stack Exchange: Does the robots exclusion standard have any legal weight?

New scraper just dropped (well, an old scraper was renamed):

Facebook/Meta updated its robots.txt entry for opting out of GenAI data scraping. If you blocked FacebookBot before, you should block meta-externalagent now:

User-Agent: meta-externalagent
Disallow: /

Official references:

Meta for Developers: About FacebookBot
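
To see whether the renamed crawler is already hitting a server you run, the old and new user-agent strings can be counted directly in the access log (the log path is an assumption; adjust to your setup):

# Requests from the old vs. the renamed Meta crawler
grep -c 'FacebookBot' /var/log/nginx/access.log
grep -c 'meta-externalagent' /var/log/nginx/access.log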

Just throwing out a thought before I do some research on this, but I think robots.txt needs an update.

Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used, e.g.:

- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false

Etc. And I'd like to apply this to my social media profile and any other web presence, not just my personal website.
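
A purely hypothetical sketch of what such usage flags might look like in a robots.txt-style file, using the keys from the list above (none of these directives exist in the current standard):

User-agent: *
Allow: /
Usage-monetizable: false
Usage-fediverse: true
Usage-nonfediverse: false
Usage-ai: false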