mastouille.fr is one of the many independent Mastodon servers you can use to participate in the fediverse.
Mastouille is a sustainable, open Mastodon instance hosted in France.


#WebCrawler


Holy crap! #Huawei is *very* aggressively #crawling my #webserver. As in, 6+ requests/sec for many hours, coming from quite a few IPs. Here are the subnets I blocked, which has killed most of the traffic so far:

49.0.200.0/21
94.74.80.0/20
101.44.160.0/20
111.119.192.0/20
114.119.172.0/22
114.119.176.0/20
119.8.160.0/19
119.13.96.0/20
124.243.128.0/18
159.138.96.0/20
166.108.192.0/20
166.108.224.0/20
190.92.192.0/19

The user agent string is the typical "every browser in existence". #webcrawler
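
For anyone who wants to check their own access logs against that list, here is a minimal sketch using Python's standard `ipaddress` module. The subnets are copied from the post; the `is_blocked` helper is a made-up name for illustration and is not how the original poster applied the block.

```python
import ipaddress

# Subnets copied verbatim from the post above.
BLOCKED_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "49.0.200.0/21",
        "94.74.80.0/20",
        "101.44.160.0/20",
        "111.119.192.0/20",
        "114.119.172.0/22",
        "114.119.176.0/20",
        "119.8.160.0/19",
        "119.13.96.0/20",
        "124.243.128.0/18",
        "159.138.96.0/20",
        "166.108.192.0/20",
        "166.108.224.0/20",
        "190.92.192.0/19",
    )
]


def is_blocked(ip: str) -> bool:
    """Return True if `ip` falls inside any of the listed subnets.

    Hypothetical helper, for illustration only.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_SUBNETS)
```

A call such as `is_blocked("114.119.173.5")` tests one address from a log line. An actual block would normally live in the firewall or web server configuration rather than in application code.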

Addendum ... did y'all (meaning the Fediverse) know that we have #WebCrawler #Honeypots ?

That is to say, honeypots that are explicitly designed to trap ill-mannered web crawlers that ignore the robots.txt file specifications?

I am only just beginning to learn about the options and possibilities, but seriously ... everyone interested in keeping the AI #LLM content-devouring bots out of your website(s), check these out...

No doubt, sufficiently clever developers can code their own, but again, in the interests of not re-inventing the wheel...

github.com/paralax/awesome-hon

GitHub · paralax/awesome-honeypots: an awesome list of honeypot resources. Contribute to paralax/awesome-honeypots development by creating an account on GitHub.
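
To make the crawler honeypot idea concrete, here is a rough sketch of the basic trick: robots.txt disallows a path, and any client that requests it anyway gets flagged and refused from then on. Everything here (the `/trap/` path, the `Honeypot` class, the in-memory set) is a hypothetical illustration and is not taken from any project on that list.

```python
# Hypothetical trap path; well-behaved crawlers are told to stay away from it.
TRAP_PREFIX = "/trap/"

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


class Honeypot:
    """Flag clients that request a path robots.txt told them to avoid."""

    def __init__(self, trap_prefix: str = TRAP_PREFIX) -> None:
        self.trap_prefix = trap_prefix
        self.flagged: set[str] = set()  # client IPs caught in the trap

    def handle(self, client_ip: str, path: str) -> int:
        """Return an HTTP-style status code for this request."""
        if client_ip in self.flagged:
            return 403  # already caught earlier: refuse everything
        if path.startswith(self.trap_prefix):
            self.flagged.add(client_ip)  # ignored robots.txt: flag it
            return 403
        return 200
```

A crawler that respects robots.txt never touches `/trap/` and keeps getting 200s. One that ignores it gets a 403 on the trap request and on every request after that.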

Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them

Over 88 percent of top-ranked news outlets in the US now block web crawlers used by artificial intelligence.

One sector of the news business is a glaring outlier, though: Right-wing media lags far behind their liberal counterparts when it comes to bot-blocking.

#news #media #bias #rightwing #leftwing #artificialintelligence #AI #GenAI #LLM #internet #webcrawler

wired.com/story/most-news-site

WIRED · Most Top News Sites Block AI Bots. Right-Wing Media Welcomes Them · By Kate Knibbs

I am so happy with the first web application of my own 🎉 that I have developed: Tris, a simple and free web crawler 🕸️ 🕷️ !

You can try it for free online: tris.fly.dev, limited to 3 parallel crawls and 100 links, with a path depth of 3.

Next thing I will add will be a text input to set a target domain hhh, now I am making it hard! 🙈

tris.fly.dev · Tris - A simple and free web crawler. Tris recursively crawls a website's domain HTML pages and collects its links, built by Vedran Mandić.
#node #nodejs #web
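
For readers wondering what a link cap and a depth cap look like in a crawler, here is a minimal breadth-first sketch. It is not Tris's code (Tris is a Node app and its source is not shown here). The `crawl` function and the `fetch_links` callback are hypothetical, and the defaults only mirror the limits mentioned in the post.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch_links, max_depth=3, max_links=100):
    """Collect same-domain links breadth-first, within depth and link caps.

    `fetch_links(url)` is a caller-supplied function returning the links
    found on a page (an assumption made to keep the sketch network-free).
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth cap
        for link in fetch_links(url):
            if len(seen) >= max_links:
                return seen  # link cap reached
            if urlparse(link).netloc != domain or link in seen:
                continue  # skip other domains and already-seen pages
            seen.add(link)
            queue.append((link, depth + 1))
    return seen
```

Passing the page fetcher in as a function keeps the traversal logic separate from HTTP and HTML parsing, so it can be tried against a plain dictionary of fake pages.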

I considered updating my robots.txt file today. It made me feel a certain kinda way, like I was in a Richard Matheson book.

Anyway, I signed up for buymeacoffee as a little experiment. I posted some thoughts on there about robots.txt files.

The post is free and open, and doesn't require login or anything like that. Feel free to check it out and let me know your thoughts. #robotstxt #webcrawler #ChatGPT #openai

buymeacoffee.com/fromjason/we-

Buy Me a Coffee · We are legend? Some thoughts on robots.txt · Post by Jason Velazquez (fromjason.xyz)

If you run a website and you think that adding anything to robots.txt "blocks" #ChatGPT or any other #WebCrawler from accessing your site, you may want to think that process through a bit more.

#RobotsDotTxt is *asking nicely* for the robot not to look places. If you don't trust the ethics of the scanning party, I'm not sure why you would think that asking them nicely not to mine your free content for their profit would even slow them down.
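
The "asking nicely" point shows up in how a polite crawler actually uses robots.txt: the check runs on the crawler's side, voluntarily. Here is a small sketch with Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL and the helper name are made up for illustration; `GPTBot` is used only as an example user agent.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt asking one specific crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What a *cooperating* crawler would ask itself before fetching.

    Nothing on the server enforces this answer; a crawler that never
    runs this check can request the page anyway.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

The function only reports what the file requests. Refusing the request for real takes something on the server side, such as an IP or user-agent block.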