It seems the dislike for machine learning runs deep. In a blog post, Cloudflare has announced that blocking machine learning scrapers is so popular that they decided to add a feature to the Cloudflare dashboard that blocks all machine learning scrapers with a single click.
We hear clearly that customers don’t want AI bots visiting their websites, and especially those that do so dishonestly. To help, we’ve added a brand new one-click to block all AI bots. It’s available for all customers, including those on the free tier. To enable it, simply navigate to the Security > Bots section of the Cloudflare dashboard, and click the toggle labeled AI Scrapers and Crawlers.
↫ Cloudflare blog
According to Cloudflare, 85% of their customers block machine learning scrapers from taking content from their websites, and that number definitely does not surprise me. People clearly understand that multibillion dollar megacorporations freely scraping every piece of content on the web for their own further obscene enrichment while giving nothing back – in fact, while charging us for it – is inherently wrong, and as such, they choose to block them from doing so.
Of course, it makes sense for Cloudflare to try and combat junk traffic, so this is one of those cases where the corporate interests of Cloudflare actually line up with the personal interests of its customers, so making blocking machine learning scrapers as easy as possible benefits both parties. I think OSNews, too, makes use of Cloudflare, so I’m definitely going to ask OSNews’ owner to hit that button.
Cloudflare further details that a lot of people are blocking crawlers run by companies like Amazon, Google, and OpenAI, but completely miss far more active crawlers like those run by the Chinese company ByteDance, probably because those companies don’t dominate the “AI” news cycle. Then there’s the massive number of machine learning crawlers that just straight-up lie about their intentions, trying to hide the fact they’re machine learning bots.
We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.
↫ Cloudflare blog
I find this particularly funny because what’s happening here is machine learning models being used to block… machine learning models. Give it a few more years down the trajectory we’re currently on, and the internet will just be bots reading content posted by other bots.
I work for a public research library that runs two open repositories that I’m a system administrator on.
We’ve had to block a number of crawlers that are very probably AI harvesters with little respect for the fact that we’re an organisation with limited resources. ByteDance we actually noticed pretty early on, but most recently we had a wave of traffic from a Tencent ASN as well. It was absolutely ridiculous, coming in from hundreds of thousands of IPs with a different user agent for each request, making it as hard to block as possible. We had to resort to blocking most of their networks (thank the stars for GeoIP databases making this kind of traffic easier to identify).
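Since the crawler rotated user agents per request, the only workable handle was the network itself. A minimal sketch of that kind of network-level blocking using Python’s standard `ipaddress` module; the CIDR ranges below are documentation placeholders, not the actual networks involved, and in practice they would come from a GeoIP/ASN database lookup:

```python
from ipaddress import ip_address, ip_network

# Hypothetical ranges standing in for a crawler operator's networks.
# Real ranges would come from a GeoIP/ASN database for the ASN in question.
BLOCKED_NETWORKS = [
    ip_network("203.0.113.0/24"),   # documentation range, placeholder
    ip_network("198.51.100.0/24"),  # documentation range, placeholder
]

def is_blocked(client_ip: str) -> bool:
    """True if the client address falls inside any blocked network."""
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

The point is to key the block on the source network rather than the user agent string, since the latter changed on every request.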
We run open repositories so we don’t *like* blocking things generally, but they’re not giving us much of a choice. AI harvesters are blatantly violating a social contract when they’re collecting and using data without even being capable of citing and attributing sources, and together with the extra load the traffic causes, it’s been enough of an issue that the question of how best to deal with it came up as a talk at the most recent Open Repositories conference. In short, open repositories all over the world are having issues with this.
We’re not using Cloudflare ourselves at my workplace, but I’m happy to see them do a directed effort.
And indeed you’re being charged! At the moment, Swedish tax kronor will sometimes go toward me having to spend more time on this shit.
Reading this article I was really curious about false positives and false negatives. Nobody, not even Cloudflare, can get this 100% right, in part because some bots aren’t merely faking user agent strings but are using actual browser engines, the same ones that humans use. As long as a bot doesn’t exhibit unnatural characteristics, there is no heuristic that will definitively identify all bots without failing some humans as well.
I’ve noticed that some Cloudflare websites use captcha loading screens that claim to automatically “identify humans”, but the truth is that captchas no longer work in the age of AI.
Yeah, false positives are always going to be an issue with any automated blocking. At the moment we don’t do automated blocking where I work, and there are no ML models in the loop. We use monitoring thresholds to warn of issues, and log analysis tools with visualisations that have in pretty much every case been able to give us a good idea of what’s happening.
And I feel it’s important that we can at least explain why someone or something was blocked if we get complaints. But I also have to acknowledge that the issues we’re having could sometimes reach a scale where we have to engage in *some* form of automation. And that’s not really our fault. It’s just another way in which this AI-obsession is doing its best to ruin the internet. So at some point I might just have to respond with “yeah, sorry, we have an automated blocking system. We can whitelist you, but we are forced to run this system to ensure the service is available to anyone at all with the budget we’re given.”
With regards to captchas, we don’t use them currently, but from what I’ve heard it’s not actually as hard to trip up bots as many might assume in “the age of AI”. They’re not really as smart as people often assume, and thinking about it for a moment, if you’re running a crawler consuming many millions of pages every minute or second, your crawler has to be selective about what to process with a backing AI model, because that processing is comparatively *very* expensive to employ in that loop.
Book Squirrel,
I agree that it’s not your fault, however I do feel that AI is somewhat of a scapegoat. Scraping webpages for AI does not inherently use more resources than scraping for search engines, the Internet Archive, or other bots like Getty’s bot that scrapes your website for no other purpose than to send out cease-and-desist letters. They are using resources; have you blocked them? Most websites don’t. They did not block the AI bots used to train current AI models. By and large it did not cause any noticeable disruption to websites.
It’s quite clear this grassroots movement to block AI bots today is about people objecting to the way their web pages are being used. And I think it’s fine to want to block bots over non-technical reasons like that. However, note that the tech giants are in a position to use the same bots for both search engines and AI training. A website literally has no idea how the data is being used, especially if the bot’s author isn’t forthcoming about it. Would you allow MS/Google bots if they announced they would be using the scraped data for AI? It’s a serious question, as most websites would lose nearly 100% of their traffic if they stopped allowing their data to be scraped for indexing.
Well, I wasn’t talking about AGI that can solve any captcha that can be conceived of in the future. The point I was trying to make (“that captchas no longer work in the age of AI”) was that a new AI bot would be trained quickly, rendering any new captcha obsolete. Even if a hypothetical service continually changed its captchas every day, it would not only increase operating costs but also infuriate normal users who are just trying to visit a website.
You might think so, but it’s become clear that various AI-scrapers tend to be more aggressive in behaviour and more dishonest in how they identify themselves, and so long as the AI-obsession continues, more will keep cropping up. They need more data than indexing bots, and the companies behind them are predictably more disrespectful, dishonest and inconsiderate. Some of their CEOs have been pretty clear and explicit about how much they don’t care about responsible behaviour if it’s in service of their goals.
We can’t know for sure what the purpose of any specific bot is, but it’s not like we have no idea. We can often surmise things from their behaviour. An indexer is generally not that interested in hitting the search page, but goes instead as directly and cleanly as possible for the records. This is generally fine because it doesn’t cause that much load. But an AI harvester might actually be interested in spamming the search page with many thousands of requests to generate more query->response data for a model, which generates a much higher load. This is in fact what appears to be happening in some of these cases. And it’s just so incredibly stupid.
We don’t actually currently have any policy of blocking bots we suspect of doing AI harvesting. But we do have to handle bots that are a disturbance, and it just so happens that we’ve had significantly more of these kinds of disturbances ever since the tech industry started blowing incredible volumes of air into this latest bubble, and it’s not hard to make the connection.
And so, while we don’t use Cloudflare, we may still be looking into more preemptive strategies in the future.
Book Squirrel,
I’d like for us to raise the bar on evidence: what is the evidence for what you are saying? Unfortunately the Cloudflare article does not compare AI to other bot activity, which would prove useful here. However they do say “When looking at the number of requests made to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers.”
Providing data from websites that I host…
I accept this is anecdotal, but barring convincing evidence to the contrary, citing AI bot activity as a problem seems more like posturing than a real issue. The numbers I’m seeing suggest their traffic is minuscule even for the fourth most popular AI bot. Realistically, if it weren’t for the magnifying glass on AI right now, websites would be unlikely to even notice this traffic (they didn’t notice it for years). People are extremely open to criticizing AI for any and all reasons, but their technical impact really seems quite slight.
What is the justification for this claim though? Why does an “AI harvester” need to spam pages any more than a web indexer? Logically I feel the exact opposite is true. AI training does not actually need every last page with regular updates. If anything their biggest problem is weeding out the garbage, and the reality is most web pages hold little of value for AI. Conversely, people expect all the pages to be indexed, and moreover updates to those pages too! This is a gargantuan effort and only the very largest companies have the resources to pull it off. Even if AI startups wanted to do this, they realistically couldn’t.
I think it’s fair to ask for the evidence of it. The data I’m seeing does not corroborate that.
And that is your right, but I still feel the public motivation for doing this is ideological rather than technical. You haven’t directly answered the question I posed, but I really wonder what people intend to do if/when the tech companies that are spidering the web unify their toolchains across search and AI. It seems like a logical thing to do.
Yes, it’s anecdotal, and so is what I’m telling you. Our anecdotal evidence isn’t necessarily in conflict though. You’re right that load isn’t going to be an issue for most websites. In part, as you say, because “most web pages hold little of value for AI”. But when you’re running a decently large, curated open repository with its own search functionality, the interest is obviously higher. And the type of requests starts mattering more than the number.
– Request for a specific record in a repository: Light, mostly static, can be easily cached.
– Search query: Comparatively much heavier; results change frequently with the insertion of new records, so they can’t be effectively cached. We employ a pretty fast search index, but it has its limits, and we can’t just scale up on a dime when someone decides to start spamming it.
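The asymmetry between those two request types can be sketched roughly as follows. Everything here (the TTL, the limits, the in-memory structure) is illustrative, not our actual setup: cache the mostly-static record pages aggressively, and rate-limit the uncacheable search endpoint per client.

```python
import time

CACHE_TTL = 3600          # record pages change rarely, cache for an hour
SEARCH_LIMIT = 10         # max search requests allowed...
SEARCH_WINDOW = 60.0      # ...per rolling 60-second window

record_cache = {}         # record_id -> (expiry_timestamp, rendered_page)
search_hits = {}          # client_ip -> list of recent request timestamps

def get_record(record_id, render, now=None):
    """Serve a record page from cache when fresh, re-render otherwise."""
    now = time.time() if now is None else now
    expiry, page = record_cache.get(record_id, (0.0, None))
    if now < expiry:
        return page                      # cheap: cache hit
    page = render(record_id)             # expensive path, taken rarely
    record_cache[record_id] = (now + CACHE_TTL, page)
    return page

def allow_search(client_ip, now=None):
    """Allow at most SEARCH_LIMIT searches per client per window."""
    now = time.time() if now is None else now
    hits = [t for t in search_hits.get(client_ip, []) if now - t < SEARCH_WINDOW]
    if len(hits) >= SEARCH_LIMIT:
        search_hits[client_ip] = hits
        return False                     # over budget: reject the search
    hits.append(now)
    search_hits[client_ip] = hits
    return True
```

A crawler hammering record URLs mostly hits the cache; one spamming the search endpoint burns through its budget almost immediately, which is exactly the behavioural difference described above.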
You ask why an AI harvester might be interested in spamming requests to a search page, and the answer, I think, is relatively simple (so maybe I just failed to articulate it properly?). The kind of data they need is different from indexers. Indexers have an actual search engine with its own search and ranking algorithms that they can just put records and keywords into, so they tend to go straight for the records. LLMs are not based on a transparent search algorithm, so if you want them to be able to answer questions (or rather *pretend* to be able to answer questions), they need to be trained on query->response data. How might you generate more if you’ve already exhausted most sources of that kind of data? Well, you could decide to spam search queries to a bunch of open repositories.
And no, I don’t know for absolute certain if that’s what is happening exactly, and it might well be stupid as a training strategy, but we have ended up having to block, amongst others, Bytespider, ClaudeBot and GPTBot (or bots that look like them? Who knows, right?). Google have their own search indexes, so I’m guessing if they’ve had such an idea with Gemini, they’re at least training it on their own indexes.
And at least those specific bots all identify themselves so the source of the problem is clear. I’ve no idea if they’ve started running stealthier bots to get around blocks though. I wouldn’t put it past them given the sorts of views some of the CEOs have expressed. And it seems that at least some operators have realised that they need to be dishonest in their user agent strings and use a larger part of their IP-space if they don’t want to be blocked preemptively.
Obviously, Google already does this.
So if one wants to be indexed by Google but not have data used for training, one can only hope they actually respect opt-out tags and such. None of that has really been properly standardised yet, and I have my doubts about how widely it’ll be respected. It’s still worth engaging in, though. It would actually be nice if it were possible to repair some of the trust that’s been broken in this.
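For what it’s worth, the main opt-out mechanism that does exist today is robots.txt. A sketch of a policy that permits Google’s search indexer while opting out of its AI training (via the Google-Extended token) and blocking the self-identified AI crawlers named elsewhere in this thread. All of this only works if the crawler chooses to honour it, which, as discussed above, is exactly the part in doubt:

```
# Allow normal search indexing
User-agent: Googlebot
Allow: /

# Opt out of Google's AI training (the Google-Extended token)
User-agent: Google-Extended
Disallow: /

# Block self-identified AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Note that this is purely advisory; a crawler that lies about its user agent never matches these rules in the first place.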
(This is where I want to tell you: If you have concerns about the level of AI-skepticism that has been caused by all this, and the possible consequences of it, maybe focus those concerns on the companies and industries that decided to start breaking whatever little trust was left to begin with on the internet.)
Funny thing is, we did in fact end up having to block Google’s indexer for a short while a couple of years ago (nothing to do with AI at the time of course). We couldn’t contact them directly about an issue because they’re too aloof as a company, but as it turned out we’re at least important enough for them to notice when they start getting 403s, so they contacted us about it instead.
So some services do in fact have enough clout on their own to garner a response from even a large tech company. And we should never forget that a company like Google is dependent on the internet to function. Of course, coordinating across the internet is like herding a million cats, but it does happen from time to time, given enough discontent.
On the other hand, what some (if not all) of these companies imagine is that they’re going to essentially replace the internet somehow. And Google’s search index has been going to shit for a while anyway, partly of their own doing. So I don’t know how much they care anymore, especially since Sundar Pichai took over. And it seems to me that for a discoverable internet in the future, we may need a different paradigm than a centralised index. Generative models are a part of the problem here, not the solution. The thing they’re most effective at is filling up the internet with even more bullshit that needs to be filtered somehow.
Anyway, I think that’s it for me this time around. I’m going to check out and enjoy my summer vacation now.
Have a good summer!
Book Squirrel,
Of course I agree that admins need to respond to any bot causing a problem. There just doesn’t seem to be evidence of a wide-scale technical problem with bots being used for AI training specifically versus bots being used for other purposes.
I’d still like to see some evidence. I get the animosity towards AI companies, but blaming them for bad bot practices without evidence is unfair and to me it feels like there’s some scapegoating.
I accept your point. But we should agree that it’s not unique to AI bots.
I have this gripe as well. VIPs get an inside track bypassing public support options, but in terms of the normal public support channels, Google gets an F grade. Obviously it doesn’t seem to make a difference to them, but if they didn’t have a monopoly I suspect such bad support would have killed the company.
I’ve heard many people claim this. On the one hand I’m not sure if there’s a way to show it to be objectively true. Google results becoming worse might be a reflection of the internet itself becoming worse (like blogs taking over professional journalism and the rise of paywalls). On the other hand it seems likely that Google’s algorithms are optimized for Google’s profits rather than user satisfaction. In any case, I agree with the need for decentralized solutions that don’t monopolize control over the internet.
I’ll respect your opinion, but it’s a huge productivity multiplier and I honestly don’t think it’s going away.
Nice chatting. You have a good summer vacation!
> “the internet will just be bots reading content posted by other bots.”
…who is going to tell him?