LLMs are starving for knowledge graphs. Raphael Troncy pointed out that crawlers from many LLM companies are constantly visiting their KGs. Some crawlers even perform explicit SPARQL queries on the KGs.
Marginalia Explore & Search - An explorer of random web pages (many of which are getting a Geocities-style facelift), and an experimental #libre search engine with a custom #webcrawler. Unexpected results guaranteed!
https://explore.marginalia.nu/
https://search.marginalia.nu/
6/7
Trap Naughty Web Crawlers in Digestive Juices with Nepenthes - In the olden days of the WWW you could just put a robots.txt file in the root of y... - https://hackaday.com/2025/01/23/trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/ #largelanguagemodel #internethacks #webcrawler
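Nepenthes is its own project; purely to illustrate the tarpit idea from the article, here is a minimal Python/Flask sketch (routes and names are made up, not Nepenthes code): every generated page links to more generated pages and is served slowly, so a crawler that ignores robots.txt sinks its time into junk.
```python
# Hypothetical sketch of the crawler-tarpit idea (not Nepenthes itself):
# every page links to more generated pages and is served slowly; the /maze/
# prefix should be disallowed in robots.txt so only rude bots fall in.
import random
import time

from flask import Flask

app = Flask(__name__)


@app.route("/maze/<token>")
def maze(token: str):
    time.sleep(2)  # drip-feed responses to waste the crawler's time
    links = "".join(
        f'<a href="/maze/{random.getrandbits(32):x}">more</a> '
        for _ in range(10)
    )
    return f"<html><body><p>nothing to see here ({token})</p>{links}</body></html>"


if __name__ == "__main__":
    app.run()
```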
The AI models are stealing from the media - the missing fact check by the NZZ - Das Netz ist politisch https://dnip.ch/2024/12/05/die-ki-modelle-beklauen-die-medien-fehlender-faktencheck-der-nzz/
#Journalismus #journalism #ArtificialIntelligence #Urheberrecht #copyright #Datenschutz #privacy #WebCrawler
@adfichter @marcel
Holy crap! #Huawei is *very* aggressively #crawling my #webserver. As in, 6+ requests/sec for many hours coming from quite a few IPs. Here are the subnets I blocked which kills most of the traffic so far:
49.0.200.0/21
94.74.80.0/20
101.44.160.0/20
111.119.192.0/20
114.119.172.0/22
114.119.176.0/20
119.8.160.0/19
119.13.96.0/20
124.243.128.0/18
159.138.96.0/20
166.108.192.0/20
166.108.224.0/20
190.92.192.0/19
The user agent string is the typical "every browser in existence". #webcrawler
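If you would rather do this kind of blocking in application code than at the firewall, a minimal Python sketch using only the standard library's ipaddress module might look like this, reusing the subnet list from the post above (how you wire it into your server is up to you):
```python
# Minimal sketch: check whether a client IP falls in one of the blocked
# subnets from the post above, using only the standard library.
from ipaddress import ip_address, ip_network

BLOCKED_SUBNETS = [ip_network(cidr) for cidr in (
    "49.0.200.0/21", "94.74.80.0/20", "101.44.160.0/20",
    "111.119.192.0/20", "114.119.172.0/22", "114.119.176.0/20",
    "119.8.160.0/19", "119.13.96.0/20", "124.243.128.0/18",
    "159.138.96.0/20", "166.108.192.0/20", "166.108.224.0/20",
    "190.92.192.0/19",
)]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address is inside any blocked subnet."""
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_SUBNETS)


if __name__ == "__main__":
    print(is_blocked("114.119.180.7"))   # True: inside 114.119.176.0/20
    print(is_blocked("203.0.113.5"))     # False: documentation range
```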
Updated my post "Bloquer les AI bots (OpenAI ChatGPT et al)" with the addition of "FriendlyCrawler", a not-so-friendly bot hosted on AWS.
To learn more:
https://www.didiermary.fr/bloquer-ai-bots-chatgpt-openai/
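robots.txt entries like the ones in that post are only advisory, so a common complement is a server-side user-agent check. A rough WSGI sketch, assuming you block on UA substrings (the bot names below are just examples drawn from the posts here; a determined bot can of course spoof its user agent):
```python
# Rough sketch: refuse requests whose User-Agent matches known AI-bot
# names. Substrings are examples only; bots can lie about their UA.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "FriendlyCrawler", "CCBot")


class BlockAIBots:
    """Tiny WSGI middleware returning 403 for matching user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(name in ua for name in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling not permitted.\n"]
        return self.app(environ, start_response)
```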
Addendum ... did y'all (meaning the Fediverse) know that we have #WebCrawler #Honeypots ?
That is to say, honeypots that are explicitly designed to trap ill-mannered web crawlers that ignore the robots.txt specifications?
I am only just beginning to learn about the options and possibilities, but seriously ... everyone interested in keeping the AI #LLM content-devouring bots out of your website(s), check these out...
No doubt, sufficiently clever developers can code their own, but again, in the interests of not re-inventing the wheel...
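One common low-tech honeypot of this kind: list a decoy path in robots.txt that no human would ever visit, then ban any client that requests it anyway. A minimal sketch, assuming Flask and a made-up /secret-trap/ path (not any specific honeypot project):
```python
# Sketch of a robots.txt honeypot: /secret-trap/ is disallowed in
# robots.txt, so only crawlers that ignore robots.txt ever request it.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips: set[str] = set()


@app.before_request
def refuse_banned_clients():
    if request.remote_addr in banned_ips:
        abort(403)


@app.route("/robots.txt")
def robots():
    body = "User-agent: *\nDisallow: /secret-trap/\n"
    return body, 200, {"Content-Type": "text/plain"}


@app.route("/secret-trap/")
def trap():
    # Anything landing here ignored robots.txt; remember it and refuse it.
    banned_ips.add(request.remote_addr)
    abort(403)
```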
Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them
over 88 percent of top-ranked news outlets in the US now block web crawlers used by artificial intelligence companies
One sector of the news business is a glaring outlier, though: Right-wing media lags far behind their liberal counterparts when it comes to bot-blocking.
#news #media #bias #rightwing #leftwing #artificialintelligence #AI #GenAI #LLM #internet #webcrawler
https://www.wired.com/story/most-news-sites-block-ai-bots-right-wing-media-welcomes-them/
I am so happy with the first web application I have developed on my own: Tris, a simple and free web crawler!
You can try it for free online: https://tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.
The next thing I will add is a text input to set the target domain, hhh, now I am making it hard!
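For anyone curious what "parallel crawls with a depth limit" boils down to, here is a toy breadth-first crawler in Python with a thread pool, a depth cap, and a link budget; this is just a sketch of the concept, not Tris's actual implementation, which I have not seen:
```python
# Toy breadth-first crawler with a depth limit, a link budget, and a small
# thread pool. Illustration only; link extraction via regex is deliberately crude.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin
from urllib.request import urlopen
import re

LINK_RE = re.compile(r'href="(http[^"]+)"')


def fetch_links(url: str) -> list[str]:
    """Fetch a page and return the absolute URLs it links to."""
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return []
    return [urljoin(url, href) for href in LINK_RE.findall(html)]


def crawl(start: str, max_depth: int = 3, workers: int = 3, max_links: int = 100):
    seen = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            next_frontier = []
            for links in pool.map(fetch_links, frontier):
                for url in links:
                    if url not in seen and len(seen) < max_links:
                        seen.add(url)
                        next_frontier.append(url)
            frontier = next_frontier
    return seen


if __name__ == "__main__":
    for url in sorted(crawl("https://example.com/")):
        print(url)
```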
I considered updating my robots.txt file today. It made me feel a certain kinda way, like I was in a Richard Matheson book.
Anyway, I signed up for buymeacoffee as a little experiment. I posted some thoughts on there about robots.txt files.
The post is free and open, and doesn't require login or anything like that. Feel free to check it out and let me know your thoughts. #robotstxt #webcrawler #ChatGPT #openai
https://www.buymeacoffee.com/fromjason/we-legend-some-thoughts-robots-txt
@konstantin @jarwski Thanks for sharing. I worked similar thoughts into my song "Layer 8" and used the #PrivacyCaptcha generator from @digitalcourage for the lyrics, to make things a little harder for #Webcrawler.
https://this.ven.uber.space/music/inconvenient-ep/05-layer-8/
If you run a website and you think that adding anything to robots.txt "blocks" #ChatGPT or any other #WebCrawler from accessing your site, you may want to think that process through a bit more.
#RobotsDotTxt is *asking nicely* for the robot not to look in certain places. If you don't trust the ethics of the scanning party, I'm not sure why you would think that asking them nicely not to mine your free content for their profit would even slow them down.
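To make the "asking nicely" point concrete: robots.txt is a voluntary convention. A polite crawler checks it itself before fetching, for example with Python's standard urllib.robotparser; nothing in the protocol stops a crawler that simply skips the check.
```python
# What a *polite* crawler does: consult robots.txt before fetching.
# Nothing enforces this; an impolite crawler just skips the check.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    print("robots.txt allows this fetch")
else:
    print("robots.txt asks us not to fetch this (compliance is voluntary)")
```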
OpenAI releases the GPTBot web crawler. It will improve the model's capabilities and will not infringe copyright
#OpenAI has launched the #webcrawler #GPTBot to improve its #artificial #intelligence (#AI) models.
#redhotcyber #online #it #web #ai #hacking #privacy #cybersecurity #cybercrime #intelligence #intelligenzaartificiale #informationsecurity #ethicalhacking #dataprotection #cybersecurityawareness #cybersecuritytraining #cybersecuritynews #infosecurity
#Mastodon: Well, I have only been here for three days, but it feels very much like 1994 all over again: That first hesitant email, those first adventurous clicks on pre-#Google internet search engines. Remember grappling with these?
#WebCrawler
#Lycos
#AltaVista
#Excite
#Dogpile
#AskJeeves
#JumpStation
Let's be patient, progress may be happening before our eyes, again!