We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.
The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling.↫ The CloudFlare Blog
Never forget they destroyed Aaron Swartz’s life – literally – for downloading a few JSTOR articles.
I wish more tech discussion sites would just lock down to where comments and content were visible for registered users only. Leave it up to the sites to determine how “registered users” are vetted, but you have to be verified by a human to access the content. One of my car forums does this. Post a picture of your car and you’re in. Other places, I’m hesitant to participate because of all the AI scraping.
There is a future where the internet is just a bunch of AI chatbots talking past one another and real people are just doomscrolling on TikTok because at least there, you can still (mostly) tell if it’s a person or not.
The problem with this approach is that you have to give every random website a piece of your identity, even if it’s just an e-mail address, and let’s be real, most people would give their Facebook or Google ID to login faster.
In the ideal world, bots should respect websites. In the real world, not so much. It is a nightmare and it’s hard to do anything about it as countermeasures are becoming less effective. Modern AI is better at captchas than humans. Web driver APIs defeat bot detection. It may go against cloudflare’s business model to admit it, but even their fancy heuristics can’t do much to stop bots when they start using the same browser engines. They can rate limit requests per hour, but this can lead to false positives, especially when it comes to public hotspots and CGNAT where IP addresses end up being reused by many users.
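To make the CGNAT false-positive concern concrete, here is a minimal sketch of a per-IP sliding-window rate limiter. The class name, the limit of 3 requests, and the example address are all hypothetical, and this is not a claim about how cloudflare's rules work.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Hypothetical sliding-window limiter keyed only on source IP."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Four different people behind one CGNAT address look like one busy client.
limiter = PerIpRateLimiter(max_requests=3, window_seconds=3600)
shared_ip = "203.0.113.7"  # documentation-range address, purely illustrative
results = [limiter.allow(shared_ip, now=t) for t in range(4)]
```

The limiter cannot tell whether the fourth request came from a bot or from an unrelated human sharing the address, which is exactly the false positive described above.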
This won’t stop cloudflare from trying to protect sites from bots, but unfortunately as a regular user I have been seeing more interruptions caused by cloudflare themselves. Due to the way I quickly open many tabs, cloudflare regularly interrupts my browsing. I am able to mitigate this somewhat by opening links more in series rather than in parallel, but 1) it’s regressing my user experience on the web as a human, and 2) bots can also mimic this to pass as humans.
Last week, for the first time I discovered that I could not access a cloudflare site on my phone. No matter what I tried I could not get the page to open. It might have been my adblocker – I don’t know. The same URL opened from my desktop, which also uses adblocking. Next time I experience this I’ll conduct more tests to identify the exact reason. Setting my phone to hotspot could help track down cloudflare’s false positive to the browser itself or maybe the shared IP it was communicating over.
@Alfman, reality rarely if ever meets ideology.
I predict more Anubis-style proof-of-work “cook the planet to drive up the cost of running a bot datacenter” solutions in the future, given nobody has disproved the original thesis statement of Hashcash yet.
(That the cost to large-scale actors grows so much faster than the cost to individuals that it can be useful as a factor in a protective measure.)
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
Yeah, Proof Of Work algorithms work by increasing the cost of web requests. It is reliable and effective, however proof of work is a burden on everyone. This means higher energy costs, deteriorating battery life, higher carbon emissions, etc. We also need to be clear it is not an access barrier for bots, but rather a cost barrier. If bots are willing to pay the costs associated with solving the challenge, then they still get through just like a normal user. I wonder just how much POW we may all end up having to tolerate on the future internet. It would be nice to have an article breaking down individual and collective electricity costs associated with such POW challenges.
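For readers who haven't seen one, a rough sketch of a Hashcash-style challenge follows. It uses SHA-256 and a made-up challenge string and difficulty; it is only an illustration of the idea, not the scheme that Anubis or cloudflare actually use.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce (about 2**difficulty hashes on average)."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


# Hypothetical challenge string and difficulty, chosen to finish quickly.
nonce = solve("example-challenge", difficulty=12)
```

The asymmetry is the whole point: solving costs many hashes, checking costs one. Nothing here identifies a bot; anyone who pays the hashing cost passes.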
I’ve noticed that Cloudflare’s browser challenge algorithm runs quite a lot faster on my desktop computer than on my phone, which makes sense but it’s important to point it out. Bots may be running high end hardware, meanwhile the difficulty of POW challenges is somewhat constrained by the poor experience for users of low end mobile devices.
Obviously large scale actors are dealing with tons of requests, but proportionally it’s likely that individuals will end up paying far more per request. Consider the adjacent POW industry of crypto-currencies. Individual miners have been totally displaced by huge crypto farms with economies of scale. Solving the same challenges at home typically incurs such high overhead that the hardware and electricity bills can exceed the returns.
I’ve never been a big fan of POW challenges on account of the sheer wastefulness of it all. In principle we could find better solutions that don’t involve wasting resources as POW is designed to do. For example, instead of proving you wasted CPU cycles to access a website, you might prove you made a donation to charity. I’ll call it “Proof of Charity” POC. This could be calibrated to be of similar cost as POW, but instead of only having wasted electricity to show for it, we’d actually have a tangible public benefit out of it for the same cost. POC wouldn’t stop the bots any more than POW does, but the difference is that the money that would have been wasted by everyone paying for electricity would instead be going to charity.
Of course the challenge would be to establish some kind of credit system in exchange for verified donations. Bureaucracy and even legal battles could be a stumbling block. But if POC credits for donations could be implemented somehow, then it would create a really novel opportunity to redo POW algorithms without all the waste. With POW, wasting electricity became a huge side effect of demand, however with POC, the side effect would be charity. How cool is that!
https://news.climate.columbia.edu/2022/05/04/cryptocurrency-energy/
Going by the cheapest commercial rate I could find here ($0.0707/kWh)…
https://quickelectricity.com/cost-of-electricity-by-state/
…the energy used by bitcoin alone (150 TWh) would translate to $10.605B annually (and perhaps more depending on actual electricity rates). Switching to POC currency instead of POW would have a double benefit: it would stop the waste and carbon emissions, but it would also help fund charities.
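The arithmetic behind that figure, as a quick sketch. Both inputs are the numbers quoted above; the variable names are just for illustration.

```python
BITCOIN_TWH_PER_YEAR = 150     # annual energy figure quoted above
RATE_USD_PER_KWH = 0.0707      # cheapest commercial rate quoted above
KWH_PER_TWH = 1_000_000_000    # 1 TWh = 10**9 kWh

annual_kwh = BITCOIN_TWH_PER_YEAR * KWH_PER_TWH
annual_cost_usd = annual_kwh * RATE_USD_PER_KWH

print(f"${annual_cost_usd / 1e9:.3f}B per year")
```

This reproduces the $10.605B number; a higher real-world rate scales the result linearly.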
I really got off topic here, but I am curious what people think of POC?
Alfman,
This is about economy. Basic money is the factor here.
Say you have the optimal captcha, and absolutely only a human can pass it, but no bot (just for the sake of discussion).
What stops an organization from using Amazon Mechanical Turk and giving people 10 cents for each captcha they solve? They can even do some “remote desktop” to get many IP addresses for each event.
(The same reason SPAM is still a thing, if 1 in 1,000,000 people buy your $100 snake oil, it might be worth it)
As long as the economic benefits would be more than 10 cents for solving that captcha (say being able to download 50 pages until it locks again), they will do that.
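The break-even point in that example can be sketched in a few lines. The 10 cents and 50 pages come from the comment above; the per-page values are made-up inputs, since nobody outside the scraper knows what a page is worth to them.

```python
def scraping_pays_off(cost_per_solve_usd: float,
                      pages_per_solve: int,
                      value_per_page_usd: float) -> bool:
    """True when the pages unlocked by one solved captcha are worth more than the solve."""
    return pages_per_solve * value_per_page_usd > cost_per_solve_usd


# 10 cents per solve, 50 pages per solve, as in the comment above.
break_even_value_per_page = 0.10 / 50

worth_it = scraping_pays_off(0.10, 50, value_per_page_usd=0.01)       # hypothetical value
not_worth_it = scraping_pays_off(0.10, 50, value_per_page_usd=0.001)  # hypothetical value
```

Under these numbers a page only has to be worth more than a fifth of a cent for paying humans to solve captchas to make sense.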
On the other hand these very real costs add up for actual humans. Making captchas harder means you need to use mental energy, or in your case be completely locked out of your own destination.
Since my long comment with citations is waiting in moderation, I’ll just say two words: “blitzscaling” and “crash”.
Surveys show that people aren’t willing to pay for what’s currently coasting on VC money to “build market share”, we’re seeing mentions of companies quietly re-hiring people after discovering generative A.I. couldn’t deliver on what they were sold, and a study found that programmers believed coding assist made them 20% faster, but objective measurement of apples-to-apples situations showed them to be 19% slower. (Probably because, as I’ve personally observed, generative A.I.’s low effort and high potential reward hooks the same psychological processes as gambling addiction.)
ssokolow (Hey, OSNews, U2F/WebAuthn is broken on Firefox!),
Some of the studies on AI exhibit author bias (to be fair in both directions). Context matters a lot. I’ve heard a lot of blanket statements about AI without context, and they need to be taken with a big grain of salt. If you are expecting the LLM to do the whole job, yeah it’s not there yet. I too have seen the imperfections, but as a starting point an LLM can still get you started much further ahead than starting from scratch.
I’ve been playing around with generative AI for music and art, and I gotta say what it can do in under a minute is quite a bit better than what I could do myself in hours. I’m not a professional artist or musician, but if I were one working on “Fiverr” or one of the sites like it, I’d be extremely worried that AI is not only good enough, but that buyers may not be able to tell the difference between a genuine performance and a “prompt engineer” using AI. Not a hypothetical future, but today!
In terms of coding, LLMs have a few weaknesses, especially in regards to not being able to iteratively debug their own output. But I don’t think that coders are going to escape AI’s reach long term. It’s going to keep getting better.
Alfman,
Yes, exploration has been one of my successful use cases as well. It speeds up the ramp up process significantly. Better yet, I was able to ask questions without being shy. It is a machine after all that (usually) does not judge.
ssokolow,
I’m not sure why you are on the doom and gloom train.
Yes, there was too much expansion. And yes, many companies did stupid things like thinking a (current) AI agent can replace a human coder.
But that loses the larger context, where AI/ML has already been integrated in everyday workflows and it is here to stay. Many people don’t even realize it.
For example that “auto correct” model on your mobile keyboard? That is most likely based on BERT (Bidirectional encoder representations from transformers), the first true LLM from Google (which is also fully open source).
Your phone camera? “Automated photoshop on steroids”, it will run not only one, but several models to get the best capture from that tiny sensor.
Even very basic things like power management are augmented with AI.
Someone jumping too much ahead should not distract from the actual progress (which is slower and deliberate).
“ignoring — or sometimes failing to even fetch — robots.txt files” xDDDDDDDDDD Security doors made from crepe paper don’t work? Who would have guessed?
robots.txt was always a “Play nice or I’ll reach for the firewall” measure, which generally worked quite well until tech bros came on the scene.
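A small sketch of why robots.txt is only an honor system, using Python's standard urllib.robotparser. The robots.txt contents, the example.com URLs, and the browser-style user agent string are all invented for illustration; PerplexityBot is used here simply as the name of a declared crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Ask the parser what a crawler that chooses to comply would be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


declared = is_allowed("PerplexityBot", "https://example.com/article")
disguised = is_allowed("Mozilla/5.0 (X11; Linux x86_64)", "https://example.com/article")
```

The file only tells a cooperative client what the site wants. A crawler that swaps its user agent for a browser-like string, or never fetches the file at all, is not stopped by anything in it.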