We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.
The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling.
↫ The Cloudflare Blog
Never forget they destroyed Aaron Swartz’s life – literally – for downloading a few JSTOR articles.
I wish more tech discussion sites would just lock down to where comments and content were visible for registered users only. Leave it up to the sites to determine how “registered users” are vetted, but you have to be verified by a human to access the content. One of my car forums does this. Post a picture of your car and you’re in. Other places, I’m hesitant to participate because of all the AI scraping.
There is a future where the internet is just a bunch of AI chatbots talking past one another and real people are just doomscrolling on TikTok because at least there, you can still (mostly) tell if it’s a person or not.
The problem with this approach is that you have to give every random website a piece of your identity, even if it’s just an e-mail address, and let’s be real, most people would give their Facebook or Google ID to log in faster.
The question no one asks is: how would you prevent AIs/agents from being able to access or take over such “real human verified” accounts/sessions? You’d get endless marketplaces selling “verified human” accounts out of India or Africa for $1 per account, a million accounts at a time; pennies compared to the dollar cost of inference. It would be faster to buy a million accounts on Telegram than to take a photo of your own car.
In the ideal world, bots would respect websites. In the real world, not so much. It is a nightmare and it’s hard to do anything about it, as countermeasures are becoming less effective. Modern AI is better at captchas than humans. Web driver APIs defeat bot detection. It may go against Cloudflare’s business model to admit it, but even their fancy heuristics can’t do much to stop bots once they start using the same browser engines. They can rate limit requests per hour, but this can lead to false positives, especially with public hotspots and CGNAT, where IP addresses end up being reused by many users.
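To make the CGNAT problem concrete, here’s a toy sketch of naive per-IP rate limiting (my own illustration with made-up thresholds, not how Cloudflare actually implements anything). Everyone behind a shared address draws from the same budget, so heavy use by strangers gets innocent users blocked:

```python
import time
from collections import defaultdict, deque

# Toy illustration only: a naive sliding-window limiter keyed by client IP.
WINDOW_SECONDS = 3600   # "per hour"
MAX_REQUESTS = 300      # made-up threshold for the example

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    """Return True if this request is still under the per-IP budget."""
    now = time.time()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # forget requests older than the window
    if len(window) >= MAX_REQUESTS:
        return False                # over budget: block (or challenge)
    window.append(now)
    return True

# The false-positive problem: behind CGNAT or a public hotspot, hundreds of
# unrelated users show up as one IP and share a single budget.
for _ in range(400):
    allow_request("100.64.0.1")     # 400 different people, one shared address
print(allow_request("100.64.0.1"))  # False: user #401 is wrongly blocked
```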
This won’t stop Cloudflare from trying to protect sites from bots, but unfortunately, as a regular user, I have been seeing more interruptions caused by Cloudflare themselves. Because of the way I quickly open many tabs, Cloudflare regularly interrupts my browsing. I am able to mitigate this somewhat by opening links more in series than in parallel, but 1) it’s regressing my user experience on the web as a human, and 2) bots can also mimic this to pass as humans.
Last week, for the first time, I discovered that I could not access a Cloudflare-protected site on my phone. No matter what I tried, I could not get the page to open. It might have been my adblocker – I don’t know. The same URL opened from my desktop, which also uses adblocking. Next time I experience this I’ll run more tests to identify the exact cause. Using my phone as a hotspot could help narrow Cloudflare’s false positive down to either the browser itself or the shared IP it was communicating over.
@Alfman, reality rarely if ever meets ideology.
I predict more Anubis-style proof-of-work “cook the planet to drive up the cost of running a bot datacenter” solutions in the future, given nobody has disproved the original thesis statement of Hashcash yet.
(That the cost to large-scale actors grows so much faster than the cost to individuals that it can be useful as a factor in a protective measure.)
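For anyone who hasn’t seen it spelled out, here’s a toy sketch of the Hashcash idea (just an illustration, not Anubis’s actual code): the client brute-forces a nonce whose hash clears a difficulty target, while the server verifies the answer with a single hash.

```python
import hashlib
from itertools import count

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce (expected cost ~2**difficulty_bits hashes)."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, regardless of difficulty."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve(b"example-challenge", 20)          # ~a million attempts on average
print(verify(b"example-challenge", nonce, 20))   # True, verified instantly
```

That asymmetry is the whole point: a human pays the cost once per challenge, while a crawler hammering millions of pages pays it millions of times over.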
Given headlines like Microsoft paying to reopen the Three Mile Island nuclear power station, I think such measures will exist mostly to protect smaller sites from being DDoSed by scraper bots, while the actual solution will have to wait for the blitzscaling to crash as anticipated by headlines from Pivot to AI like:
AI coders think they’re 20% faster — but they’re actually 19% slower (Results of a study from an A.I. research charity)
Firing people for AI is not going so well (A collection of links in support of the argument that companies that bought into the hype are quietly doing things like re-hiring people or contracting a programmer to rewrite the boss’s vibe-coding project into something actually maintainable.)
Only 3%* of US AI users are willing to pay a penny for it (Results of a survey run by a venture capital firm with their “…because we’re not marketing it correctly” stripped out.)
Hell, I try to avoid playing around with locally hosted Stable Diffusion because I can confirm that Generative AI runs on gambling addiction (“just one more prompt, bro!”). Aside from that, using Perplexity maybe once every month or two as a “Google and DuckDuckGo couldn’t turn up anything relevant with the synonyms I could think to Google-fu up” fallback (chosen because ChatGPT used to require that I create an account) is all I ever do with it. (Not that it’s especially helpful, given that, half the time, Google and DDG are returning no useful results because there are no useful results, and then Perplexity will hallucinate what exists at its cited links.)
As a programmer, I share the view of Why Generative AI Coding Tools and Agents Do Not Work For Me. (If I’m going to have to audit and maintain it, I might as well write it, to get the practice in to improve my ability to do that. Why would I want to “train an intern” when the intern has anterograde amnesia?)
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
Yeah, Proof of Work algorithms work by increasing the cost of web requests. It is reliable and effective; however, proof of work is a burden on everyone. This means higher energy costs, deteriorating battery life, higher carbon emissions, etc. We also need to be clear that it is not an access barrier for bots, but rather a cost barrier. If bots are willing to pay the costs associated with solving the challenge, then they still get through just like a normal user. I wonder just how much POW we may all end up having to tolerate on the future internet. It would be nice to have an article breaking down the individual and collective electricity costs associated with such POW challenges.
I’ve noticed that Cloudflare’s browser challenge algorithm runs quite a lot faster on my desktop computer than on my phone, which makes sense, but it’s worth pointing out. Bots may be running high-end hardware, while the difficulty of POW challenges is constrained by the poor experience they create for users of low-end mobile devices.
Obviously large-scale actors are dealing with tons of requests, but proportionally it’s likely that individuals will end up paying far more per request. Consider the adjacent POW industry of cryptocurrencies. Individual miners have been totally displaced by huge crypto farms with economies of scale. Solving the same challenges at home typically incurs such high overhead that the hardware and electricity bills can exceed the returns.
I’ve never been a big fan of POW challenges on account of the sheer wastefulness of it all. In principle, we could find better solutions that don’t involve wasting resources the way POW is designed to do. For example, instead of proving you wasted CPU cycles to access a website, you might prove you made a donation to charity. I’ll call it “Proof of Charity” (POC). This could be calibrated to be of similar cost to POW, but instead of only having wasted electricity to show for it, we’d actually have a tangible public benefit for the same cost. POC wouldn’t stop the bots any more than POW does, but the difference is that the money that would have been wasted by everyone paying for electricity would instead be going to charity.
Of course, the challenge would be to establish some kind of credit system in exchange for verified donations. Bureaucracy and even legal battles could be a stumbling block. But if POC credits for donations could be implemented somehow, then it would create a really novel opportunity to redo POW algorithms without all the waste. With POW, wasted electricity became a huge side effect of demand; with POC, the side effect would be charity. How cool is that!
https://news.climate.columbia.edu/2022/05/04/cryptocurrency-energy/
Going by the cheapest commercial rate I could find here ($0.0707/kWh)…
https://quickelectricity.com/cost-of-electricity-by-state/
…the energy used by Bitcoin alone (150 TWh) would translate to $10.605B annually (and perhaps more, depending on actual electricity rates). Switching to a POC currency instead of POW would have a double benefit: it would stop the waste and carbon emissions, and it would also help fund charities.
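For what it’s worth, the arithmetic behind that figure (a rough estimate; real rates vary widely by region and contract):

```python
# Rough estimate only: real electricity rates vary by region and contract.
annual_energy_kwh = 150e9          # 150 TWh expressed in kWh
rate_usd_per_kwh = 0.0707          # cheapest commercial rate cited above
print(annual_energy_kwh * rate_usd_per_kwh)  # 1.0605e10 -> about $10.6 billion/year
```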
I really got off topic here, but I am curious what people think of POC?
Alfman,
This is about economics. Money is the basic factor here.
Say you have the most optimal captcha, and absolutely only a human can pass it, no bot (just for the sake of discussion).
What stops an organization from using Amazon Mechanical Turk and paying people 10 cents for each captcha they solve? They can even use some “remote desktop” setup to get a different IP address for each event.
(The same reason spam is still a thing: if 1 in 1,000,000 people buys your $100 snake oil, it might be worth it.)
As long as the economic benefit is worth more than the 10 cents paid for solving that captcha (say, being able to download 50 pages until it locks again), they will do it.
On the other hand, these very real costs add up for actual humans. Making captchas harder means you need to spend mental energy, or, in your case, be completely locked out of your destination.
Since my long comment with citations is waiting in moderation, I’ll just say two words: “blitzscaling” and “crash”.
Surveys show that people aren’t willing to pay for what’s currently coasting on VC money to “build market share”, we’re seeing mentions of companies quietly re-hiring people after discovering generative A.I. couldn’t deliver on what they were sold, and a study found that programmers believed coding assist made them 20% faster, but objective measurement of apples-to-apples situations showed them to be 19% slower. (Probably because, as I’ve personally observed, generative A.I.’s low effort and high potential reward hooks the same psychological processes as gambling addiction.)
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
Some of the studies on AI exhibit author bias (to be fair, in both directions). Context matters a lot. I’ve heard a lot of blanket statements about AI made without context; they need to be taken with a big grain of salt. If you are expecting the LLM to do the whole job, yeah, it’s not there yet. I too have seen the imperfections, but as a starting point an LLM can still get you much further ahead than starting from scratch.
I’ve been playing around with generative AI for music and art, and I gotta say what it can do in under a minute is quite a bit better than what I could do myself in hours. I’m not a professional artist or musician, but if I were one working on Fiverr or one of the sites like it, I’d be extremely worried that AI is not only good enough, but that buyers may not be able to tell the difference between a genuine performance and a “prompt engineer” using AI. That’s not a hypothetical future, but today!
In terms of coding, LLMs have a few weaknesses, especially their inability to iteratively debug their own output. But I don’t think that coders are going to escape AI’s reach long term. It’s going to keep getting better.
Alfman,
Yes, exploration has been one of my successful use cases as well. It speeds up the ramp up process significantly. Better yet, I was able to ask questions without being shy. It is a machine after all that (usually) does not judge.
ssokolow,
I’m not sure why you are on the doom and gloom train.
Yes, there was too much expansion. And yes, many companies did stupid things like thinking a (current) AI agent can replace a human coder.
But that loses the larger context, where AI/ML has already been integrated into everyday workflows and is here to stay. Many people don’t even realize it.
For example, that “autocorrect” model on your mobile keyboard? It is most likely based on BERT (Bidirectional Encoder Representations from Transformers), the first true LLM from Google (which is also fully open source).
Your phone camera? It’s “automated Photoshop on steroids”: it will run not just one but several models to get the best capture from that tiny sensor.
Even very basic things like power management are augmented with AI.
Someone jumping too far ahead should not distract from the actual progress (which is slower and more deliberate).
I’m not “on the doom and gloom train”, I’m saying we’re in DotCom Bubble 2.0 right now.
Also, thank you for reminding me to look up how to turn off autocorrect on the hand-me-down iPhone 8 I use as a WiFi-only camera, e-book, and note-taking device. I was getting tired of it never doing useful things and mangling my acronyms, tech jargon, and fanfic jargon.
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
I think we may be in for a recession as well. However, suggesting it’s the end of the road for AI companies would be like suggesting the dotcom bubble was the end of the road for internet companies – it didn’t happen. Sure, the dotcom crash ended up consolidating power into fewer hands, but the remaining players faced less competition and actually became much stronger.
So, while I don’t object to your dotcom reference, I don’t necessarily come to the same conclusion as you about the collapse of AI. Of course there will be losers dropping out of the market; a market consolidation is in the cards. However, that does not mean AI is going away. It may be unintuitive, but the dotcom era might be a good indicator of how AI companies will come out of a recession.
I do respect your choices; individual choice is so important, and it’s worth bringing out the pitchforks when companies violate it. However, I also concede that many of the choices I make don’t make a dent in the market.
I think we have the same perspective and I wasn’t clear enough.
I don’t mean the whole A.I. thing is going to crash… but I do think there will be consolidation as all the gold-rushers who aren’t the few winners crash out.
…and, commensurate with that, I think we’re currently in the middle of an attempt by companies to stake out a “We’re already doing it too deeply. It’s impractical for you to retroactively outlaw our behaviour.” push, one that might have worked before the will for things like the GDPR and DMA started to develop.
The question is whether we can get to the point of something like CASL starting to claw back the ability to have small, independent websites without them being DDoSed by bots.
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
Ah, ok. I assumed wrongly.
I see it as a moral problem that bots are disrespecting the wishes of website owners. Maybe there should be laws to protect websites from bots, but I think that kurkosdr’s (unpopular) point is technically right in that laws don’t currently enforce anything like robots.txt. Obviously I am in agreement that denial of service is harmful. Some OSNews posters have reported it happening to them and I sympathize with the victims. However, I think there is a bit of dissonance when it comes to using DoS to make a case against AI, because in principle the footprint of AI bots is identical to that of search engine bots, and a sysadmin has no objective way to differentiate the use cases based on their server’s resource usage. The bot’s bandwidth has no connection to the way the data is being used. This is why I don’t accept the technical arguments against AI bots in general; it’s purely a rights issue.
Does someone have a right to gatekeep the bots? In principle, yes; for example, they can go to Cloudflare and turn on bot protections. In practice, we can only rely on heuristics if the bots won’t identify themselves, and those heuristics inevitably inconvenience users and incur false positives and false negatives. I understand people’s rationale for wanting to keep AI bots away, but it’s just going to keep getting harder to differentiate bots from humans. Conceivably we might end up with real-time IP blacklists (email admins will be familiar with these), but those take work, and the whack-a-mole game can punish innocent users too. I don’t see any easy answers.
We’ve only spoken about negative aspects (I realize you aren’t the target demographic for AI 🙂 )…but it seems conceivable to me that in the future many more users will be embracing AI agents to perform actions on their behalf. AI will be a stand in for a person. This will really blur the line in terms of who/what you are trying to block.
The guy who runs one of the sites I’m on has complained at length because, before LLMs, search engine bots generally obeyed the crawl-rate directives in robots.txt and, if they didn’t, were hard-coded to something like one request every two or three seconds. Now he’s flooded with crawlers that go hard in the opposite direction and throw a ton of bots at you simultaneously, each making maybe three requests before switching IP addresses to evade bans. It’s completely different behaviour.
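For reference, the kind of crawl-rate directive I mean looks something like this (illustrative values; Crawl-delay is a non-standard directive honored purely voluntarily, and Googlebot ignores it in favour of its own rate controls):

```
# Example robots.txt (illustrative values)
User-agent: *
# Ask crawlers to wait about 2 seconds between requests.
# Non-standard, and compliance is entirely voluntary.
Crawl-delay: 2
Disallow: /private/
```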
Personally, I don’t earn anything from anything I make public (It feels wrong to expect people to give me something scarce (money) in exchange for something non-scarce (copies of a bit pattern) rather than the scarce time that produces it), so I don’t have any stake in trying to keep A.I. bots from my stuff on that front. It’s purely about “They shouldn’t be able to leverage their wealth to get away with ‘tragedy of the commons’-ing hosting that others pay for.”
Personally, I enjoy goofing around with a self-hosted copy of Stable Diffusion and I’m only not the target demographic for other kinds of LLMs because:
1. If it’s anything but another search engine to try after DuckDuckGo and Google and the like, then I want it self-hosted with the same payment terms (or lack thereof) as the rest of my open-source stack. (If I wouldn’t run VSCode (as opposed to VSCodium) because the terms are unacceptable, why would I become reliant on an LLM?)
2. For coding assist, the same psychology that makes me prefer Rust over languages with less compile-time correctness also means I don’t want to babysit an LLM while they have such crippling context limits. (Like I said, mentoring an intern with anterograde amnesia.)
I talk a big game about “you need to actually practice to develop a skill” when it comes to actually thinking through the solution and implementing it yourself, but I recognize that I’d be tempted if LLMs hit my psychology less like pre-TypeScript JavaScript with NPM at its worst and more like Rust.
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
I regularly observe bots identifying as AI bots, but they’re actually using less bandwidth than Google. I fully accept that outlier cases may arise and that my not observing them doesn’t mean they don’t happen, but from where I stand that does not seem to be the norm. For most website owners it’s become a rights issue.
Maybe you are the exception to the rule, but generally speaking, if it had absolutely nothing to do with rights and were purely about hosting bandwidth, then people would be up in arms against all bots, which we are not seeing. There were plenty of bots using up bandwidth even before LLMs came along – tons of them. Yet now we’re seeing LLM bots, and only LLM bots, being vilified specifically, even when they use less bandwidth than Googlebot. Clearly this is about more than just bandwidth: it’s a criticism of how the data is being used.
I think we’re in agreement on self hosting. Are you a proponent of self-hosted LLMs? To me, independence from centralized control has always been extremely important. It grieves me that consumers have lost so much ground and that we’re becoming so shackled even on our own machines.
Quite a change of topic. Everyone has the prerogative to decide when and whether to incorporate AI/LLMs into their workflows. I don’t object to anyone’s choice, whatever that choice is. That said, AI may turn us all into modern-day Luddites, and it didn’t turn out well for them.
We’re always going to see holdouts, determined to avoid LLMs at all costs. This is quite expected today because the technology isn’t there yet, but as the technology keeps evolving and gets better, there is going to be more and more pressure to use LLMs. The forever holdouts will still be there, and many of them will be grandfathered into stable jobs. However, newer junior software engineers are really going to face the pressure. Even if babysitting the AI becomes part of the job, it’s hard to ignore the sheer speed benefits that AI brings to the table. We’re not yet at the point where AI/LLMs are a requirement for software work, but it’s likely coming.
Anubis’s default ruleset only shows a challenge when the User-Agent string contains “Mozilla” since that’s a pretty good heuristic for “trying to masquerade as a browser”, given that every major web browser still contains it to masquerade as Netscape Navigator.
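In pseudo-code terms, the heuristic boils down to something like this (a simplified sketch of the idea, not Anubis’s actual source):

```python
def should_challenge(user_agent: str) -> bool:
    """Simplified sketch: anything claiming to be a browser ("Mozilla/...")
    gets the proof-of-work interstitial; honestly-labeled clients fall
    through to whatever explicit allow/deny rules the site configures."""
    return "Mozilla" in user_agent

print(should_challenge("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"))  # True
print(should_challenge("curl/8.5.0"))  # False: never claimed to be a browser
```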
True… but I’m thinking of examples I’ve read about where they installed Anubis specifically because the bad-faith scraper bots were eating up too many resources, such as kernel.org and Duke University’s digital archives.
People who care more about rights issues are more likely to install something like Nepenthes or iocaine, which will spend MORE resources to generate a tarpit of poison data for crawlers.
It’s both… it’s just that LLM training injected a gigantic cohort of big-tech-budget-level scraper farms all at once as they rushed to compete for who could hallucinate the least, and that tipped the scale for people who do only care about the bandwidth from “This is just the cost of doing business” to “OK, I’m sorry we have to pull a Cloudflare and shut out people without JavaScript enabled, but this is unsustainable”.
Yes, within practicality.
For something like Perplexity, which I use as a “search engine of last resort”, there isn’t much value in it, given how infrequently I make queries and how much more time and download bandwidth I imagine I’d spend maintaining “a local cache of the entire search index”.
Aside from not wanting to become dependent on SaaS, I don’t use things like GitHub Copilot mostly because I prefer to practice my skills. If I need boilerplate reduction, I’ve got coc-snippets in my Vim and the basic built-in new-project generators for Django, uv, npm, Cargo, etc. (eg. I still feel that the existence of Yeoman is indicative of misdesign in the JavaScript space.)
I also lean wary on license compliance, to the point where, for my retro-programming hobby, I tracked down and eBay’d a whole bunch of different obsolete IDEs, toolchains, packaging tools, etc. for DOS, Windows 3.1x, Windows 9x, and classic Mac OS. (The retail copy of InstallShield Express 2 was sealed New Old Stock, and I managed to contact Corel about buying a license for the modern WinZip Self-Extractor and having them downgrade it to match the WinZip Self-Extractor 2.x for Win9x on their end-of-life’d downloads page.)
…but, if all that gets sorted, I could see myself self-hosting a sufficiently good code-assist model as another form of reference material to complement all the local copies of various API docs that I have in Zeal and all the local dictionaries, thesauruses, etc. that I have in GoldenDict-NG.
(I’m already planning to take the various CHMs, PDFs, EPUBs, HLPs, local HTML, etc. from things like my Apple Developer CDs, my MSDN Library CDs, the online help on my Borland Delphi CDs, Humble Book Bundles, the digital copy of the book included on the companion disc to my copy of Programming Windows, Fifth Edition, etc. and process them into a local whole-library search index for my retro-hobby programming… bit of a shame it seems nobody made an open-source parser for the “next-gen CHM” format that’s used in the Windows XP-era MSDN Library discs. Maybe I can decompile them using Microsoft’s tools on one of my Windows XP machines.)
We’ll see where I end up. Right now, I see it as being, at best, akin to the treadmill/churn of the full-stack SPA JavaScript ecosystem that I want no part of.
sukru,
I think most people will acknowledge that captchas are no longer viable with AI. But yeah, even before AI, outsourcing to human captcha solvers was a workaround.
I became so fed up with captchas that I seriously looked into buying credits on captcha-solving services just to stop wasting my time with these inane access barriers. The captcha-busting services existed and were honestly quite affordable. My opinion about bots hadn’t changed, but the captchas were becoming so aggressive that, by the time I was getting captchas I could no longer pass, I’d say the captcha industry itself had become the bad guy.
I still see them on occasion, but thankfully I get fewer captchas today in general. The bot problem never went away, but I think website owners realized that captchas had become more harmful to humans than to bots.
*nod* To avoid hypocrisy, for my own sites, I have a homegrown “spam pre-filter” (which I haven’t yet needed to pair with an actual spam filter) that works by not being a turnkey solution but, instead, a mandatory linter for stuff I wouldn’t want from humans either.
(eg. HTML or BBcode in a field that will be interpreted as plain text, too low a percentage of non-URL text, too low a percentage of characters coming from the latin1 subset I know how to read, e-mail addresses in the message body, URLs pointing to known URL shorteners, e-mail addresses or domain names in the subject line, etc.)
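Something in that spirit might look like this (a rough, from-memory sketch with made-up thresholds and names, not the real thing):

```python
import re

# Hypothetical illustration of a "mandatory linter" pre-filter;
# thresholds, patterns, and helper names are invented for the example.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MARKUP_RE = re.compile(r"<[a-z]+[^>]*>|\[/?[a-z]+\]", re.IGNORECASE)
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}

def lint_message(subject: str, body: str) -> list[str]:
    problems = []
    if MARKUP_RE.search(body):
        problems.append("HTML/BBcode in a plain-text field")
    urls = URL_RE.findall(body)
    non_url_text = URL_RE.sub("", body)
    if body and len(non_url_text.strip()) / len(body) < 0.5:
        problems.append("too little non-URL text")
    latin1_ok = sum(1 for ch in body if ord(ch) < 256)
    if body and latin1_ok / len(body) < 0.8:
        problems.append("too many characters outside the latin1 subset")
    if EMAIL_RE.search(body):
        problems.append("e-mail address in the message body")
    if any(host in url for url in urls for host in SHORTENERS):
        problems.append("URL shortener link")
    if EMAIL_RE.search(subject) or re.search(r"\b[\w-]+\.(com|net|org)\b", subject):
        problems.append("e-mail address or domain name in the subject line")
    return problems  # a non-empty list means "reject, with reasons"

print(lint_message("Buy now at example.com", "Visit https://bit.ly/xyz <b>now</b>"))
```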
“ignoring — or sometimes failing to even fetch — robots.txt files” xDDDDDDDDDD Security doors made from crepe paper don’t work? Who would have guessed?
robots.txt was always a “play nice or I’ll reach for the firewall” measure, which generally worked quite well until the tech bros came on the scene.
…hell, when you look at things like Bad Behaviour and Anubis, one of the core details of how they operate is that a lot of their checks only kick in if the requesting User-Agent string contains that “Mozilla” prefix all human-driven web browsers accreted, indicating that it’s either a web browser or something trying to hide as one.
> Never forget they destroyed Aaron Swartz’s life – literally – for downloading a few JSTOR articles.
Wow. Linking Swartz’s death in 2013 to Perplexity AI, a company founded three years ago, seems just a little far-fetched.
Did I miss something?
It’s another two-tier justice system argument. If you’re a giant company, you can get away with doing on a large scale what individuals get slammed for doing on a small scale.
Hell, the reason Facebook’s in trouble is because their mass-torrenting of over 81TB of pirated books to train their LLM was SO over the top that it couldn’t be brushed under the rug anymore.
Hmm, I see.
So “they” refers to the leftist equivalent of the “deep state”.
Indeed, that was not obvious to me, thanks for the clarification.
Nico57,
Given the context, it would make sense that “they” refers to the United States Attorney’s Office. Although, given my take on how Thom thinks and speaks about the world, I read his use of “they” as more generic in nature, referring to our social priorities in general. I think it might have been more deeply compelling to say “Never forget we destroyed Aaron Swartz’s life” to imply a more collective guilt, but I am over-analyzing the use of a single word.
I’m not sure I’d make that comparison.
As far as I’m aware, there’s generally no dispute that he committed suicide and generally no dispute about what drove him to it. As Wikipedia puts it:
Where’s the conspiracy? That people dispute the possibility that human beings can be emotionally fragile?
The comparison is that Facebook had to torrent 81TB of books to get “caught” and are still arguing things in court, while an individual who used a guest account issued to him to batch-download some academic journal articles, which in a just world would be open access anyway, was treated as if he were Julius and Ethel Rosenberg.
For the fifth time or so, Thom: the robots.txt file has no basis in law, there are no legal consequences for violating it, and the content of the website is publicly accessible, which is as far as copyright law is concerned. Similarly, having your scraper bot identify itself in a particular way is also voluntary, with no legal consequences if it doesn’t.
I am saying this because I am generally against any further expansion of copyright.
People who equate law to morality tend to fall prey to extremism.
Just a heads up.