Another new LLM scraper just dropped: AI2 Bot.
First-party documentation does not list any way to opt out except filtering the user-agent on your server/firewall. The docs list the following User-Agent to filter:
Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)
My server logs contained the following string:
Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)
That appears to be for Ai2’s Dolma product.
159 hits came from 174.174.51.252, a Comcast-owned IP in Oregon.
I recommend adding ai2bot to your server’s user-agent matching rules if you don’t want to be in the Dolma dataset; unlike Common Crawl, this seems tailored specifically for training LLMs with few other users.
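
Since the documented string (AI2Bot) and the one in my logs (Ai2Bot-Dolma) differ in casing and suffix, a case-insensitive substring match is probably the safest rule. Here is a minimal Python sketch of that check; the function and variable names are my own for illustration, and you would still need to wire it into whatever your server or firewall actually uses to reject requests.

```python
import re

# Case-insensitive substring match, so it covers both the documented
# "AI2Bot" and the "Ai2Bot-Dolma" variant seen in the logs.
AI2BOT_PATTERN = re.compile(r"ai2bot", re.IGNORECASE)


def is_ai2bot(user_agent: str) -> bool:
    """Return True if the User-Agent string mentions ai2bot in any casing."""
    return bool(AI2BOT_PATTERN.search(user_agent or ""))


# The two User-Agent strings quoted above.
documented = "Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler)"
observed = "Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)"

for ua in (documented, observed):
    print(is_ai2bot(ua), ua)
```

Matching on the bare substring rather than the full string means the rule should keep working if the crawler adds another suffix later, at the cost of also catching any unrelated agent that happens to contain "ai2bot".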