mastouille.fr @admin

**Harald Sack** @lysander07@sigmoid.social · Jun 3 *

LLMs are starving for knowledge graphs. Raphael Troncy was pointing out that many LLM company crawlers are constantly visiting their KGs. Some crawlers even perform explicit SPARQL queries on the KGs.

#knowledgegraphs #eswc2025 #semweb

**Dobody** @dobody@mastodon.design · Feb 25

Marginalia Explore & Search - An explorer of random web pages (many of which get a Geocities-style facelift), and an experimental #libre search engine with a custom #webcrawler. Unexpected results guaranteed!

https://explore.marginalia.nu/
https://search.marginalia.nu/

6/7

explore.marginalia.nu · Website Explorer - http://strwbrry.neocities.org/

**IT News** @itnewsbot@schleuss.online · Jan 24

Trap Naughty Web Crawlers in Digestive Juices with Nepenthes - In the olden days of the WWW you could just put a robots.txt file in the root of y... - https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/ #largelanguagemodel #internethacks #webcrawler

Hackaday · Jan 24 · Trap Naughty Web Crawlers In Digestive Juices With Nepenthes · In the olden days of the WWW you could just put a robots.txt file in the root of your website and crawling bots from search engines and kin would (generally) respect the rules in it. These days, ho…

**Marcel SIneM(S)US** @simsus@social.tchncs.de · Dec 29, 2024

AI models are robbing the media: the NZZ's missing fact check - Das Netz ist politisch https://dnip.ch/2024/12/05/die-ki-modelle-beklauen-die-medien-fehlender-faktencheck-der-nzz/

#Journalismus #journalism #ArtificialIntelligence #Urheberrecht #copyright #Datenschutz #privacy #WebCrawler
@adfichter @marcel

Das Netz ist politisch · Dec 5, 2024 · AI models are robbing the media: the NZZ's missing fact check - Das Netz ist politisch · "Now it is not only Petra Gössi who is astonished; the two NZZ journalists are stunned as well. They have just witnessed a data theft that their professional

**Josef 'Jeff' Sipek** @jeffpc@mastodon.radio · Aug 10, 2024

Holy crap! #Huawei is *very* aggressively #crawling my #webserver. As in, 6+ requests/sec for many hours coming from quite a few IPs. Here are the subnets I blocked which kills most of the traffic so far:

49.0.200.0/21
94.74.80.0/20
101.44.160.0/20
111.119.192.0/20
114.119.172.0/22
114.119.176.0/20
119.8.160.0/19
119.13.96.0/20
124.243.128.0/18
159.138.96.0/20
166.108.192.0/20
166.108.224.0/20
190.92.192.0/19

The user agent string is the typical "every browser in existence". #webcrawler
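
A minimal sketch of testing whether a client address falls inside the subnets listed above, using Python's standard `ipaddress` module. The function name `is_blocked` and the sample addresses are hypothetical and only for illustration; the post does not say which tool (firewall, web server config, or something else) was actually used for the block.

```python
import ipaddress

# Subnets copied verbatim from the post above.
BLOCKED_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "49.0.200.0/21",
        "94.74.80.0/20",
        "101.44.160.0/20",
        "111.119.192.0/20",
        "114.119.172.0/22",
        "114.119.176.0/20",
        "119.8.160.0/19",
        "119.13.96.0/20",
        "124.243.128.0/18",
        "159.138.96.0/20",
        "166.108.192.0/20",
        "166.108.224.0/20",
        "190.92.192.0/19",
    )
]


def is_blocked(addr: str) -> bool:
    """Return True if addr lies inside any of the blocked subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_SUBNETS)


if __name__ == "__main__":
    # Hypothetical sample addresses: one inside a listed range, one outside.
    for sample in ("114.119.180.7", "192.0.2.1"):
        print(sample, is_blocked(sample))
```

A linear scan over thirteen networks is fine for a log-analysis script; a real firewall would do this matching in the kernel rather than in application code.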

**DJM (freelance for hire)** @cybeardjm@masto.ai · May 18, 2024 *

Updated my post "Blocking AI bots (OpenAI ChatGPT et al)" to add "FriendlyCrawler", a not-so-friendly bot hosted on AWS.

To learn more:
https://www.didiermary.fr/bloquer-ai-bots-chatgpt-openai/

Didier J. MARY (blog) · Jul 6, 2024 · AI bots (OpenAI ChatGPT et al) - how to block them - Didier J. MARY (blog) · Blocking AI bots (OpenAI ChatGPT et al - updated) - For those who want to protect the content of their website or blog from AI bots

#AI #AIBots #WebCrawler

**Eric the Cerise has moved** @ErictheCerise@kolektiva.social · May 8, 2024

Addendum ... did y'all (meaning the Fediverse) know that we have #WebCrawler #Honeypots?

That is to say, honeypots that are explicitly designed to trap ill-mannered web crawlers that ignore the robots.txt file specifications?

I am only just beginning to learn about the options and possibilities, but seriously ... everyone interested in keeping the AI #LLM content-devouring bots out of your website(s), check these out...

No doubt, sufficiently clever developers can code their own, but again, in the interests of not re-inventing the wheel...

https://github.com/paralax/awesome-honeypots

GitHub · GitHub - paralax/awesome-honeypots: an awesome list of honeypot resources · an awesome list of honeypot resources. Contribute to paralax/awesome-honeypots development by creating an account on GitHub.
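
A rough sketch of the general idea behind a crawler honeypot, assuming a trap path that robots.txt lists under `Disallow`: a well-behaved crawler never requests it, so any client that does can be flagged. The class `CrawlerTrap`, the path `/trap/`, and the sample addresses are all hypothetical; this is not how any particular project in the linked list works.

```python
# Hypothetical path, assumed to be listed under "Disallow" in robots.txt
# and linked only from markup that human visitors never see.
TRAP_PATH = "/trap/"


class CrawlerTrap:
    """Flags clients that request a path robots.txt told them to avoid."""

    def __init__(self, trap_prefix: str = TRAP_PATH) -> None:
        self.trap_prefix = trap_prefix
        self.flagged: set[str] = set()

    def observe(self, client_ip: str, path: str) -> bool:
        """Record one request; return True if the client is now flagged."""
        if path.startswith(self.trap_prefix):
            self.flagged.add(client_ip)
        return client_ip in self.flagged


if __name__ == "__main__":
    trap = CrawlerTrap()
    # Sample requests using documentation-range IP addresses.
    requests = [
        ("198.51.100.7", "/index.html"),
        ("198.51.100.7", "/trap/page1"),
        ("198.51.100.7", "/index.html"),
        ("203.0.113.9", "/about"),
    ]
    for ip, path in requests:
        print(ip, path, "flagged" if trap.observe(ip, path) else "ok")
```

Once a client is flagged, every later request from it is reported as flagged too, which is where a real deployment would refuse, throttle, or feed junk to the crawler.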

**gtbarry** @gtbarry@mastodon.social · Feb 4, 2024

Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them

over 88 percent of top-ranked news outlets in the US now block web crawlers used by artificial intelligence

One sector of the news business is a glaring outlier, though: Right-wing media lags far behind their liberal counterparts when it comes to bot-blocking.

#news #media #bias #rightwing #leftwing #artificialintelligence #AI #GenAI #LLM #internet #webcrawler

https://www.wired.com/story/most-news-sites-block-ai-bots-right-wing-media-welcomes-them/

WIRED · Jan 24, 2024 · Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them · By Kate Knibbs

**Vedran Mandić** @vekzdran@hachyderm.io · Jan 27, 2024

I am so happy with the first web application of my own that I have developed: Tris, a simple and free web crawler!

You can try it for free online: https://tris.fly.dev, limited to 3 parallel crawls and 100 links, with a path depth of 3.

Next thing I will add will be a text input to set a target domain hhh, now I am making it hard!

tris.fly.dev · Tris - A simple and free web crawler · Tris - A simple and free web crawler. Tris recursively crawls a website's domain HTML pages and collects its links, built by Vedran Mandić.

#node #nodejs #web

**fromjason.xyz** @fromjason@mastodon.social · Jan 6, 2024

I considered updating my robots.txt file today. It made me feel a certain kinda way, like I was in a Richard Matheson book.

Anyway, I signed up for buymeacoffee as a little experiment. I posted some thoughts on there about robots.txt files.

The post is free and open, and doesn't require login or anything like that. Feel free to check it out and let me know your thoughts. #robotstxt #webcrawler #ChatGPT #openai

https://www.buymeacoffee.com/fromjason/we-legend-some-thoughts-robots-txt

Buy Me a Coffee · We are legend? Some thoughts on robots.txt - Jason Velazquez (fromjason.xyz) · Post by Jason Velazquez (fromjason.xyz)

Replied in a thread

**this.ven** @thisven@digitalcourage.social · Oct 31, 2023

@konstantin @jarwski Thanks for sharing. I worked similar thoughts into my song "Layer 8" and used the #PrivacyCaptcha generator from @digitalcourage for the lyrics, to make the job a little harder for #Webcrawler.

https://this.ven.uber.space/music/inconvenient-ep/05-layer-8/

this.ven.uber.space · this.ven - Layer 8

**Chris is.** @offby1@wandering.shop · Sep 1, 2023

If you run a website and you think that adding anything to robots.txt "blocks" #ChatGPT or any other #WebCrawler from accessing your site, you may want to think that process through a bit more.

#RobotsDotTxt is *asking nicely* for the robot not to look places. If you don't trust the ethics of the scanning party, I'm not sure why you would think that asking them nicely not to mine your free content for their profit would even slow them down.
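
To make the "asking nicely" point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the helper name `polite_fetch_allowed`, and the example.com URL are assumptions for illustration. The thing to notice is where the check runs: inside the crawler, which is free to skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks GPTBot to stay away from everything.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_fetch_allowed(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler asks itself before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post"
    print("GPTBot:", polite_fetch_allowed("GPTBot", url))
    print("SomeOtherBot:", polite_fetch_allowed("SomeOtherBot", url))
    # Nothing on the server side enforces either answer: a crawler that
    # never calls can_fetch() simply requests the page anyway.
```

Actually keeping a crawler out needs server-side enforcement, such as the subnet blocks or traps mentioned elsewhere in this feed.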

**Redhotcyber** @redhotcyber@mastodon.bida.im · Aug 27, 2023

OpenAI releases the GPTBot web crawler. It will improve the model's capabilities and will not infringe copyright

#OpenAI has launched the #webcrawler #GPTBot to improve its #artificial #intelligence (#AI) models.

#redhotcyber #online #it #web #ai #hacking #privacy #cybersecurity #cybercrime #intelligence #intelligenzaartificiale #informationsecurity #ethicalhacking #dataprotection #cybersecurityawareness #cybersecuritytraining #cybersecuritynews #infosecurity

https://www.redhotcyber.com/post/openai-rilascia-il-web-crowler-gptbot-migliorera-la-capacita-del-modello-e-non-violera-il-diritto-dautore/

Red Hot Cyber · OpenAI releases the GPTBot web crawler. It will improve the model's capabilities and will not infringe copyright · OpenAI's new crawler is starting its scans. All this data will ensure that the model keeps improving.

**Paul Belcher** @pauljbelcher@eupolicy.social · Nov 14, 2022

#Mastodon: Well, I have only been here for three days, but it feels very much like 1994 all over again: That first hesitant email, those first adventurous clicks on pre-#Google internet search engines. Remember grappling with these?

#WebCrawler
#Lycos
#AltaVista
#Excite
#Dogpile
#AskJeeves
#JumpStation

Let's be patient, progress may be happening before our eyes, again!
