Mastouille

Steven P. Sanderson II, MPH: Today I posted on web scraping in Python. I'm using the 2nd edition of Automate the Boring Stuff as I go along. Post: <a href="https://www.spsanderson.com/steveondata/posts/2025-09-03/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://www.spsanderson.com/steveondata/posts/2025-09-03/</a> <a href="https://mstdn.social/tags/Python" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Python</a> <a href="https://mstdn.social/tags/Blog" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Blog</a> <a href="https://mstdn.social/tags/Technology" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Technology</a> <a href="https://mstdn.social/tags/Coding" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Coding</a> <a href="https://mstdn.social/tags/Webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Webscraping</a>

knoppix🚨 Cloudflare just launched AI Crawl Control (GA) — giving content creators real power over AI bot access.📡 Now you can: – Block specific AI crawlers – Send customized HTTP 402 "Payment Required" responses – Open the door to licensing deals, not just scraping🤝 Creators deserve more than silence or theft.<a href="https://noc.social/@cloudflare" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@cloudflare</a> 🔗 <a href="https://blog.cloudflare.com/introducing-ai-crawl-control/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://blog.cloudflare.com/introducing-ai-crawl-control/</a><a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://mastodon.social/tags/ArtificialIntelligence" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#ArtificialIntelligence</a> <a href="https://mastodon.social/tags/OpenAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenAI</a> <a href="https://mastodon.social/tags/ChatGPT" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#ChatGPT</a> <a href="https://mastodon.social/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://mastodon.social/tags/Cloudflare" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Cloudflare</a> <a href="https://mastodon.social/tags/DigitalRights" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DigitalRights</a> <a href="https://mastodon.social/tags/Copyright" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Copyright</a> <a href="https://mastodon.social/tags/TechNews" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#TechNews</a> <a href="https://mastodon.social/tags/ContentCreators" class="mention hashtag" rel="nofollow 
noopener noreferrer" target="_blank">#ContentCreators</a> <a href="https://mastodon.social/tags/Journalism" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Journalism</a> <a href="https://mastodon.social/tags/Tech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Tech</a> <a href="https://mastodon.social/tags/Bot" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Bot</a> <a href="https://mastodon.social/tags/Bots" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Bots</a>

Miguel Afonso CaetanoFrom the comments: "First it was third-party apps and the API, now it's the Wayback Machine. This is more about control and being able to disappear anything they want than AI scraping."<a href="https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit</a><a href="https://tldr.nettime.org/tags/Reddit" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Reddit</a> <a href="https://tldr.nettime.org/tags/SocialMedia" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#SocialMedia</a> <a href="https://tldr.nettime.org/tags/InternetArchive" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#InternetArchive</a> <a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GenerativeAI</a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a>

Lenin alevski 🕵️💻New Open-Source Tool Spotlight 🚨🚨🚨Scrapling is redefining Python web scraping. Adaptive, stealthy, and fast, it can bypass anti-bot measures while auto-tracking changes in website structure. A standout: 4.5x faster than AutoScraper for text-based extractions. <a href="https://infosec.exchange/tags/Python" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Python</a> <a href="https://infosec.exchange/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a>🔗 Project link on <a href="https://infosec.exchange/tags/GitHub" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GitHub</a> 👉 <a href="https://github.com/D4Vinci/Scrapling" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://github.com/D4Vinci/Scrapling</a><a href="https://infosec.exchange/tags/Infosec" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Infosec</a> <a href="https://infosec.exchange/tags/Cybersecurity" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Cybersecurity</a> <a href="https://infosec.exchange/tags/Software" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Software</a> <a href="https://infosec.exchange/tags/Technology" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Technology</a> <a href="https://infosec.exchange/tags/News" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#News</a> <a href="https://infosec.exchange/tags/CTF" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CTF</a> <a href="https://infosec.exchange/tags/Cybersecuritycareer" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Cybersecuritycareer</a> <a href="https://infosec.exchange/tags/hacking" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#hacking</a> <a 
href="https://infosec.exchange/tags/redteam" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#redteam</a> <a href="https://infosec.exchange/tags/blueteam" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#blueteam</a> <a href="https://infosec.exchange/tags/purpleteam" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#purpleteam</a> <a href="https://infosec.exchange/tags/tips" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#tips</a> <a href="https://infosec.exchange/tags/opensource" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#opensource</a> <a href="https://infosec.exchange/tags/cloudsecurity" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#cloudsecurity</a>— ✨ 🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

Nicolas MOUARTQ: Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?A: It's reasonable to infer that Hitler would not support a regulation like <a href="https://mastodon.social/tags/GDPR" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GDPR</a> which emphasizes individual rights such as <a href="https://mastodon.social/tags/privacy" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#privacy</a> protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes. <a href="https://mastodon.social/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a> <a href="https://mastodon.social/tags/technology" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#technology</a> <a href="https://mastodon.social/tags/EU" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#EU</a> <a href="https://mastodon.social/tags/history" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#history</a> <a href="https://mastodon.social/tags/historyrepeating" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#historyrepeating</a> <a href="https://mastodon.social/tags/transparency" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#transparency</a> <a href="https://mastodon.social/tags/regulation" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#regulation</a> <a href="https://mastodon.social/tags/humanrights" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#humanrights</a>

Miguel Afonso Caetano"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures. Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase. “Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. 
As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"<a href="https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://www.404media.co/ai-scraping-bots-are-breaking-open-libraries-archives-and-museums/</a><a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GenerativeAI</a> <a href="https://tldr.nettime.org/tags/CulturalHeritage" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CulturalHeritage</a> <a href="https://tldr.nettime.org/tags/AIBots" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AIBots</a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://tldr.nettime.org/tags/CyberSecurity" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CyberSecurity</a> <a href="https://tldr.nettime.org/tags/DDoS" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DDoS</a>

Harald KlinkeAre AI bots overwhelming digital collections? A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure. Read the full analysis: <a href="https://www.glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://www.glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/</a> <a href="https://det.social/tags/DigitalHeritage" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DigitalHeritage</a> <a href="https://det.social/tags/GLAM" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GLAM</a> <a href="https://det.social/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://det.social/tags/OpenAccess" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenAccess</a> <a href="https://det.social/tags/CulturalData" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CulturalData</a> <a href="https://det.social/tags/MuseTech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#MuseTech</a> <a href="https://det.social/tags/DigitalHumanities" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DigitalHumanities</a> <a href="https://det.social/tags/GLAMlab" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GLAMlab</a>

Symfony🔴 Live now at <a href="https://mastodon.social/tags/SymfonyOnline" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#SymfonyOnline</a> June 2025! @Suparnpatra is unlocking the secrets of “Efficient Web Scraping with Symfony & PHP” 🕸️⚙️ If you love clean code and clever data extraction, this one’s for you! <a href="https://mastodon.social/tags/Symfony" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Symfony</a> <a href="https://mastodon.social/tags/PHP" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#PHP</a> <a href="https://mastodon.social/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a>

Carlo ZottmannNeed to grab specific info from a webpage regularly? 🤔 Browser Actions can help! Create a Shortcut to: Open URL ➡️ Wait for data element ➡️ Run JavaScript to extract text ➡️ Pass it back to Shortcuts!If you need help with that, just follow the Forum link on the site!<a href="https://actions.work/browser-actions?ref=mastodon-b10" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://actions.work/browser-actions?ref=mastodon-b10</a><a href="https://norden.social/tags/macOS" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#macOS</a> <a href="https://norden.social/tags/Shortcuts" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Shortcuts</a> <a href="https://norden.social/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://norden.social/tags/DataExtraction" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DataExtraction</a> <a href="https://norden.social/tags/BrowserAutomation" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#BrowserAutomation</a>

Rachel RawlingsI'm having trouble figuring out what kind of botnet has been hammering our web servers over the past week. Requests come in from tens of thousands of addresses, just once or twice each (and not getting blocked by fail2ban), with different browser strings (Chrome versions ranging from 24.0.1292.0 - 108.0.5163.147) and ridiculous cobbled-together paths like /about-us/1-2-3-to-the-zoo/the-tiny-seed/10-little-rubber-ducks/1-2-3-to-the-zoo/the-tiny-seed/the-nonsense-show/slowly-slowly-slowly-said-the-sloth/the-boastful-fisherman/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/brown-bear-brown-bear-what-do-you-see/pancakes-pancakes/pancakes-pancakes/the-tiny-seed/pancakes-pancakes/pancakes-pancakes/slowly-slowly-slowly-said-the-sloth/the-tiny-seed(I just put together a bunch of Eric Carle titles as an example. The actual paths are pasted together from valid paths on our server but in invalid order, with as many as 32 subdirectories.)Has anyone else been seeing this and do you have an idea what's behind it?<a href="https://infosec.exchange/tags/botnet" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#botnet</a> <a href="https://infosec.exchange/tags/ddos" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#ddos</a> <a href="https://infosec.exchange/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a> <a href="https://infosec.exchange/tags/infosec" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#infosec</a>
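
One rough way to triage traffic like the pattern described above is to flag request paths that are unusually deep or that repeat the same segment. The sketch below is only a heuristic under stated assumptions: the function names `path_stats` and `looks_stitched` and both thresholds are hypothetical, and legitimate sites can have deep or repeating paths, so the limits would need tuning against real logs.

```python
from collections import Counter


def path_stats(path: str) -> tuple[int, int]:
    """Return (depth, highest repeat count of any single segment) for a URL path."""
    segments = [s for s in path.split("/") if s]
    depth = len(segments)
    max_repeat = max(Counter(segments).values(), default=0)
    return depth, max_repeat


def looks_stitched(path: str, max_depth: int = 8, max_repeats: int = 1) -> bool:
    """Flag paths that are deeper than max_depth or reuse a segment.

    Both thresholds are hypothetical defaults, not values from the post.
    """
    depth, repeat = path_stats(path)
    return depth > max_depth or repeat > max_repeats


if __name__ == "__main__":
    sample = "/about-us/the-tiny-seed/pancakes-pancakes/the-tiny-seed"
    print(path_stats(sample), looks_stitched(sample))
    print(path_stats("/about-us/team"), looks_stitched("/about-us/team"))
```

A check like this could run over access logs to count how many distinct client addresses request flagged paths, which is more useful here than per-address rate limits because each address only appears once or twice.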

IT NewsCloudflare’s AI Labyrinth Wants Bad Bots To Get Endlessly Lost - Cloudflare has gotten more active in its efforts to identify and block unauthorize... - <a href="https://hackaday.com/2025/03/24/cloudflares-ai-labyrinth-wants-bad-bots-to-get-endlessly-lost/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://hackaday.com/2025/03/24/cloudflares-ai-labyrinth-wants-bad-bots-to-get-endlessly-lost/</a> <a href="https://schleuss.online/tags/artificialintelligence" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#artificialintelligence</a> <a href="https://schleuss.online/tags/softwarehacks" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#softwarehacks</a> <a href="https://schleuss.online/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a> <a href="https://schleuss.online/tags/security" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#security</a>

Programming Historian: Learn about the data acquisition technique known as <a href="https://hcommons.social/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> and use R to extract the textual data published on a web page, thanks to this lesson by <a href="https://mastodon.social/@rivaquiroga" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@rivaquiroga</a> <a href="https://doi.org/10.46430/phes0061" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://doi.org/10.46430/phes0061</a>

ResearchBuzz: FirehoseReuters: News Corp sued by Brave Software, a Google search engine rival. “News Corp has been sued by Google search engine rival Brave Software, which seeks to forestall a lawsuit by Rupert Murdoch’s company for when readers are directed to copyrighted articles from the Wall Street Journal and New York Post.”<a href="https://rbfirehose.com/2025/03/15/reuters-news-corp-sued-by-brave-software-a-google-search-engine-rival/" class="" rel="nofollow noopener noreferrer" target="_blank">https://rbfirehose.com/2025/03/15/reuters-news-corp-sued-by-brave-software-a-google-search-engine-rival/</a>

Nick Byrd, Ph.D.My <a href="https://nerdculture.de/tags/Ethics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Ethics</a> of <a href="https://nerdculture.de/tags/Business" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Business</a> and <a href="https://nerdculture.de/tags/Technology" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Technology</a> course begins with this contrast between <a href="https://nerdculture.de/tags/AaronSwartz" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AaronSwartz</a>’s subscription-mediated article downloading and <a href="https://nerdculture.de/tags/OpenAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenAI</a>’s larger scale, unpaid <a href="https://nerdculture.de/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a> (<a href="https://byrdnick.com/teaching#ethicsofbusinessandtechnology" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://byrdnick.com/teaching#ethicsofbusinessandtechnology</a>).Even though some students have heard of <a href="https://nerdculture.de/tags/Reddit" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Reddit</a>, few (if any) have heard of Aaron. As soon as they watch the trailer for “The Internet’s Own Boy” (<a href="https://youtu.be/2M0GQww1GoY?feature=shared" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://youtu.be/2M0GQww1GoY?feature=shared</a>), they immediately become interested, do the more technical reading, return to class excited for discussion, and write plenty on their team-based worksheets. They clearly find the contrast between Aaron and corporateAI provocative! 
If ever there was a way to make young people care about <a href="https://nerdculture.de/tags/tech" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#tech</a> <a href="https://nerdculture.de/tags/crime" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#crime</a>, Aaron's case may be it.Thanks for alerting us about the latest case of mass copyright infringement from likes of <a href="https://nerdculture.de/tags/Meta" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Meta</a>, <a href="https://mastodon.social/@rcx" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@rcx</a>! <a href="https://nerdculture.de/tags/law" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#law</a> <a href="https://nerdculture.de/tags/FBI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#FBI</a> <a href="https://nerdculture.de/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://nerdculture.de/tags/copyright" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#copyright</a>

mardub 🇫🇷🇸🇪: Dear <a href="https://piaille.fr/tags/dhpeople" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#dhpeople</a>, I am helping a researcher in <a href="https://piaille.fr/tags/philosophy" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#philosophy</a> of <a href="https://piaille.fr/tags/aesthetics" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#aesthetics</a> download hundreds of thousands of <a href="https://piaille.fr/tags/art" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#art</a> critical reviews for research purposes. Many of those reviews are in the online database <a href="https://piaille.fr/tags/proquest" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#proquest</a>, which my university pays for so that we researchers have access. However, before diving into the headache of <a href="https://piaille.fr/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a>, I am wondering whether any of you has dealt with this database. What did you end up doing? Writing to them to ask? <a href="https://piaille.fr/tags/Scraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Scraping</a>? How? Any feedback? 
<a href="https://piaille.fr/tags/digitalhumanities" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#digitalhumanities</a> <a href="https://piaille.fr/tags/dh" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#dh</a> <a href="https://piaille.fr/tags/Histodon" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Histodon</a> <a href="https://piaille.fr/tags/fedihum" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#fedihum</a> <a href="https://piaille.fr/tags/histodons" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#histodons</a> <a href="https://piaille.fr/tags/humanites_numeriques" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#humanites_numeriques</a>

Nicolas Fränkel 🇺🇦🇬🇪My first steps with <a href="https://mastodon.top/tags/Playwright" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Playwright</a>, or how to smartly scrape data when the provider doesn't offer an <a href="https://mastodon.top/tags/API" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#API</a>. <a href="https://mastodon.top/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#webscraping</a><a href="https://blog.frankel.ch/first-steps-playwright/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://blog.frankel.ch/first-steps-playwright/</a>

Loki the CatWhen your AI is basically a digital vacuum cleaner with 600 power cords 🔌 OpenAI's bot accidentally DDoS'd a small company while trying to download their entire 65,000-product database. Turns out robots.txt is more of a "pretty please" than a restraining order! 🤖 <a href="https://social.jorijn.com/tags/ai" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://social.jorijn.com/tags/webscraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a><a href="https://tech.slashdot.org/story/25/01/11/0449242/openais-bot-crushes-seven-person-companys-website-like-a-ddos-attack" rel="nofollow noopener noreferrer" target="_blank">https://tech.slashdot.org/story/25/01/11/0449242/openais-bot-crushes-seven-person-companys-website-like-a-ddos-attack</a>
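
The "pretty please" point can be made concrete: robots.txt is only honored when the crawler itself chooses to check it. Below is a minimal sketch of that check using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for illustration; a real crawler would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow user_agent to fetch url.

    Nothing enforces the answer: a crawler has to call this and obey it.
    """
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("GPTBot", "https://example.com/products/1"),
        ("ExampleBot", "https://example.com/products/1"),
        ("ExampleBot", "https://example.com/private/report"),
    ]:
        print(agent, url, may_fetch(rules, agent, url))
```

Because the check lives entirely on the crawler side, a site that needs actual enforcement has to add server-side measures such as rate limiting or blocking.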

Miguel Afonso Caetano"On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. “We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.” OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. “OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site. “Their crawlers were crushing our site,” he said “It was basically a DDoS attack.”Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models. 
It sells the 3D object files, as well as photos — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics."<a href="https://techcrunch.com/2025/01/10/how-openais-bot-crushed-this-seven-person-companys-web-site-like-a-ddos-attack/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://techcrunch.com/2025/01/10/how-openais-bot-crushed-this-seven-person-companys-web-site-like-a-ddos-attack/</a><a href="https://tldr.nettime.org/tags/CyberSecurity" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#CyberSecurity</a> <a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GenerativeAI</a> <a href="https://tldr.nettime.org/tags/OpenAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenAI</a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://tldr.nettime.org/tags/DDoS" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#DDoS</a> <a href="https://tldr.nettime.org/tags/AITraining" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AITraining</a>

Miguel Afonso Caetano"Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post—shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts, and following hundreds of replies from people outraged that their posts were scraped without their permission—van Strein took it down and apologized."<a href="https://www.404media.co/bluesky-posts-machine-learning-ai-datasets-hugging-face/?ref=daily-stories-newsletter" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://www.404media.co/bluesky-posts-machine-learning-ai-datasets-hugging-face/?ref=daily-stories-newsletter</a><a href="https://tldr.nettime.org/tags/SocialMedia" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#SocialMedia</a> <a href="https://tldr.nettime.org/tags/Bluesky" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#Bluesky</a> <a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#AI</a> <a href="https://tldr.nettime.org/tags/ML" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#ML</a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#GenerativeAI</a> <a href="https://tldr.nettime.org/tags/AITraining" class="mention hashtag" rel="nofollow noopener noreferrer" 
target="_blank">#AITraining</a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a>

Téotime Pacreau: Why is web scraping important from a democratic point of view? <a href="https://11d.im/write.apreslanu.it/tk/1704290400/" rel="nofollow noopener noreferrer" translate="no" target="_blank">https://11d.im/write.apreslanu.it/tk/1704290400/</a> by <a href="https://social.apreslanu.it/@tk" class="u-url mention" rel="nofollow noopener noreferrer" target="_blank">@tk</a> <a href="https://mastodon.design/tags/WebScraping" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#WebScraping</a> <a href="https://mastodon.design/tags/OpenData" class="mention hashtag" rel="nofollow noopener noreferrer" target="_blank">#OpenData</a>

#webscraping