**Rod2ik** @rod2ik@mastodon.social · 2 d

#Darkdump - The #OSINT ( #Open #Source #Intelligence ) tool that digs through the #dark #web for you

Basically, you type a keyword, and the tool will #scrape .onion sites to extract emails, metadata, keywords, images, and links to social networks

https://korben.info/darkdump-outil-osint-fouille-dark-web.html

**@francks** @francks@mstdn.fr · Jul 7

Our small team vs millions of bots

https://www.fsf.org/blogs/sysadmin/our-small-team-vs-millions-of-bots

www.fsf.org · Our small team vs millions of bots · Free Software Foundation · Working together for free software

#fsf #freesoftware #ddos

**Kuketz-Blog** @kuketzblog@social.tchncs.de · Jul 1 *

»Cloudflare Introduces Default Blocking of A.I. Data Scrapers«

Nice, but it will hardly work. The reason: advanced scrapers use browser emulation and rotating IPs to pass themselves off as real users and evade technical detection. Since this is only a server-side measure with no legal force, such actors can ignore the blocks easily and without consequences.

https://www.nytimes.com/2025/07/01/technology/cloudflare-ai-data.html

#cloudflare #ai #ki #scraper

/kuk

The New York Times · Jul 1 · Cloudflare Introduces Blocking of A.I. Scrapers By Default · By Natallie Rocha

**Torf und Schnee** @torf@c.im · May 31

The most disgusting feature of this relatively new #AI #scraper plague is that they are about to defile everything we like in the *good* internet.

Images with relevant #AltText? Perfect training materials for text-to-image generative models.

Static webpages? No #Anubis - no problem to scrape.

#Anubis uses proof-of-work ( #PoW ), which implies either #JavaScript or manual instructions. No, it is a good solution... Best of the worst (as if there were any good ones...)

In the last few days I learned that (1) #Tor has a #PoW mechanism and (2) Anubis seems to somehow whitelist the #lynx browser, allowing no-JS Lynx users in (a big favour for #accessibility and #smolweb ). Good (let's hope all these will persist).

**Paolo Amoroso** @amoroso@fosstodon.org · Apr 22 *

Update: I reported the bot. Thanks.

A Mastodon bot account at mastodon.cloud scans the fediverse, scrapes selected web pages shared there, rewrites them with AI, posts them to its own site, and shares the rewritten AI slop on Mastodon as tech news. The bot scraped a post of mine (including the attached image) within minutes of my federated blog publishing it.

Is it worth flagging the bot and reporting it to its instance? Are the mods likely to take action?

#mastodon #moderation #ai

**César Pose** @cesarpose@infosec.exchange · Apr 3

#cats #catsofmastodon #cat #scraper

**PaulaToThePeople** @PaulaToThePeople@climatejustice.social · Jan 17 *

FediBlock newsmast

Replied in a thread

**Kevin Karhan** @kkarhan@infosec.space · Dec 27, 2024

@khobochka guess why I maintain a #Scraper #blocklist?

In fact I know multiple people and organizations that decide to basically redirect #ValueRemoving #Scrapers like #GPTbot, #ByteSpider (which literally #DDoS'd #MattKC because #ClownFlare are a criminally incompetent #RogueISP!) to #Hetzner's 10GB Speedtest file which can be found at http://hil-speed.hetzner.com/10GB.bin as an extra middlefinger!

GitHub · lists.d/scrapers.ipv4.block.list.tsv at main · greyhat-academy/lists.d · List of useful things. Contribute to greyhat-academy/lists.d development by creating an account on GitHub.

#Cloudflare #hetznered #ByteDance
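
The redirect trick described in this post could look roughly like the WSGI middleware below. It is a sketch under assumptions: the post's blocklist is IP-based, whereas this example matches on the User-Agent header for brevity, and the `BLOCKED` pattern, `redirect_scrapers` and the `hello` demo app are hypothetical names. Only the target URL comes from the post itself.

```python
import re

# Hypothetical pattern covering the two crawlers named in the post.
BLOCKED = re.compile(r"gptbot|bytespider", re.IGNORECASE)

# Speedtest file URL quoted in the post.
DECOY_URL = "http://hil-speed.hetzner.com/10GB.bin"


def redirect_scrapers(app):
    """Wrap a WSGI app: matching user agents get a 302 to the decoy
    file, everyone else reaches the wrapped app unchanged."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED.search(user_agent):
            start_response(
                "302 Found",
                [("Location", DECOY_URL), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)
    return middleware


def hello(environ, start_response):
    # Hypothetical stand-in for the real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = redirect_scrapers(hello)
```

A crawler that spoofs a browser User-Agent passes straight through this check, which is one reason the maintained lists rely on IP ranges instead.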

**vanta rainbow black** @vantablack@cyberpunk.lol · Sep 11, 2024

found another scraper indexer thingy

https://mastogizmos.com

mastogizmos.com · MastoGizmos - Mastodon Tools and Searches

#scraper #indexer #fediblock

**Webrocker** @blog@webrocker.de · Aug 21, 2024 *

A(I)ll nuts

Over on the blog of the Uberspace operators there is a very interesting article about what the (by now apparently completely unhinged) bots of the AI companies are recklessly causing:

(…) In summary, according to our observations, roughly 30%-50% of all requests for small sites are now generated by bots. For large sites this figure even fluctuates between 20% and 75%. In our view, and given that robots.txt is being ignored, a point has now been reached where this behavior by bots is no longer acceptable and harms our operations.
blog.uberspace.de

On my irregular excursions into the server logs of my own sites, and also those of my clients, it is exactly the same: bot traffic has increased disproportionately, and it is sometimes really brutal how frequently, and from how many rotating IPs, these things hammer the site. >:-(

#Bots #DigitaleSelbstVerteidigung #robotsTxt #Scraper #WildWest

https://webrocker.de/?p=29216

blog.uberspace.de · Bad Robots

**Seirdy** @Seirdy@pleroma.envs.net · Aug 6, 2024 *

Another new LLM scraper just dropped: AI2 Bot.

First-party documentation does not list any way to opt-out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:

Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)

My server logs contained the following string:

Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

That appears to be for Ai2’s Dolma product.

159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.

I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma dataset; unlike Common Crawl, this seems tailored specifically for training LLMs, with few other uses.

allenai.org · Crawling notice | Ai2 · Explanation and technical details of Ai2's web crawler.

#Scraper
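
Why the post recommends matching on `ai2bot` rather than on the documented string can be shown with a small sketch. The two User-Agent strings are the ones quoted in the post; the helper name `is_ai2bot` and the choice of a case-insensitive substring match are assumptions made for illustration, not a rule from Ai2's documentation.

```python
import re

# Case-insensitive substring match, so both "AI2Bot" and "Ai2Bot-Dolma" hit.
AI2BOT = re.compile(r"ai2bot", re.IGNORECASE)

# User-Agent strings quoted in the post.
DOCUMENTED = "Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)"
OBSERVED = "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"


def is_ai2bot(user_agent: str) -> bool:
    """Return True if the User-Agent looks like any AI2Bot variant."""
    return bool(AI2BOT.search(user_agent))


if __name__ == "__main__":
    for ua in (DOCUMENTED, OBSERVED):
        print(is_ai2bot(ua), ua)
```

An exact, case-sensitive comparison against the documented string would have missed the variant that actually appeared in the server logs.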

**Metin Seven** @metin@graphics.social · Jul 25, 2024

And another AI scraping case (also see my previous post)…

AI video startup Runway reportedly trained on ‘thousands’ of YouTube videos without permission

https://www.engadget.com/ai-video-startup-runway-reportedly-trained-on-thousands-of-youtube-videos-without-permission-182314160.html

Engadget · Jul 25, 2024 · AI video startup Runway reportedly trained on ‘thousands’ of YouTube videos without permission · By Will Shanklin

#AI #GenAI #ArtificialIntelligence

**Metin Seven** @metin@graphics.social · Jul 25, 2024

Noo… Really?!

Anthropic’s crawler is ignoring websites’ anti-AI scraping policies…

https://www.theverge.com/2024/7/25/24205943/anthropic-ai-web-crawler-claudebot-ifixit-scraping-training-data

The Verge · Jul 25, 2024 · Anthropic’s crawler is ignoring websites’ anti-AI scraping policies · By Jess Weatherbed

#AI #GenAI #ArtificialIntelligence

**Austin Huang** @austin@mstdn.party · Jul 24, 2024 *

With regards to the utoots.com #scraper:
1. It currently depends on a Mastodon instance flashist[.]video; it is recommended to block the instance. flashist.(me|health) and previously flashist.(org|vip|live) are also operated by the same person. Ban evasion is to be expected.
2. I wrote a GitHub issue about it, archived at https://archive.ph/8ynKh. However he has chosen to cover up his GitHub profile instead.

Update: https://cyberpunk.lol/@vantablack/112849043193285926 (tldr: it's gone)

#FediBlock #MastoAdmin #FediAdmin

**vanta rainbow black** @vantablack@cyberpunk.lol · Jul 21, 2024

okay yeah https://utoots.com is DEFINITELY a scraper

i've updated the original post, making a reply too since edits don't always federate cleanly

#scraper #indexer #fediblock

**vanta rainbow black** @vantablack@cyberpunk.lol · Jul 21, 2024 *

just found another scraper indexer thingy

https://utoots.com

#scraper #indexer #fediblock

**Roni Laukkarinen** @rolle@mementomori.social · Jul 10, 2024

Ping @s0, I guess it’s time to reset the counter. https://www.reddit.com/r/Mastodon/s/TKQ5lMdIoY

https://cathode.church/fedi-scraper-counter.html

www.reddit.com · Reddit - Dive into anything

#Fediverse #Search #Scraper

**Oct** @octplane@mastodon.xyz · May 2, 2024

Now I have a #scraper that pulls Sud Ouest as a PDF for me every day, with the help of #BNF's passe lecture, some Python, some Rust and some temporal...

**AGR Risk Intelligence** @agr@pleroma.envs.net · Apr 25, 2024

How to (try to) block #IA scrapers with a tailored `robots.txt`:

You may want to use:
[Dark Visitor](https://darkvisitors.com/)

And:

[ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)

Adding to the list the `robots.txt` by #VLC:

[VLC robots.txt](https://www.videolan.org/robots.txt)

Source: **Bloquer les gaveurs d'IA** by @lord (in French)

https://lord.re/fast-posts/76-bloquer-les-gaveurs-dia/

#ChatGPT #IA #Robots #Scraper

/home/lord · Bloquer les gaveurs d'IA · You have a nice little website with your own handwritten content. It's your blog, your space for reflection, your creative zone, your very own space shared with the world, your offspring… That's really great, but now in 2024 it means you are feeding the AIs. ChatGPT, Claude and Mistral are AI models that need to be fed. For an AI to seem capable and natural, it has to be made to ingest as much text as possible.

#scraper
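
As a rough illustration of the tailored `robots.txt` approach from this post, the sketch below builds a file that disallows a few AI crawlers and checks it with Python's standard `urllib.robotparser`. The three agent names are a short, illustrative sample rather than the maintained lists linked above, and `build_robots_txt` is a hypothetical helper. As other posts in this feed point out, this only affects crawlers that choose to honor `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative sample only; the projects linked in the post maintain
# far longer and regularly updated lists.
AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(agents):
    """Emit one 'User-agent / Disallow: /' group per crawler name."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(groups)


robots_txt = build_robots_txt(AI_AGENTS)

# Parse the generated text locally instead of fetching it over HTTP.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

if __name__ == "__main__":
    print(robots_txt)
    # example.org is a placeholder domain.
    print(parser.can_fetch("GPTBot", "https://example.org/post"))
    print(parser.can_fetch("Mozilla/5.0", "https://example.org/post"))
```

Agents that are not listed stay allowed because the generated file contains no `User-agent: *` group.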

**AGR Risk Intelligence** @agr@pleroma.envs.net · Apr 25, 2024

How do you block AI scrapers with a finely tuned `robots.txt`? Ask @lord!

**Bloquer les gaveurs d'IA**

https://lord.re/fast-posts/76-bloquer-les-gaveurs-dia/

Use:
[Dark Visitor](https://darkvisitors.com/)

[ai.robots.txt](https://github.com/ai-robots-txt/ai.robots.txt)

I am also adding the `robots.txt` of #VLC to the list: [VLC robots.txt](https://www.videolan.org/robots.txt)

#ChatGPT #IA #Robots #Scraper

#scraper
