The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
↫ David Pierce for The Verge
Another thing “AI” does not respect.
This is a very nuanced topic.
I’ve seen a lot of websites that only allow Google’s robots while blocking others. This kind of favoritism is unfortunate. On the other hand, I’ve witnessed firsthand how Bingbot ignored robots.txt for a few years. We opened a ticket with Microsoft, and at the time that was just the way it was programmed to operate; I think they didn’t want to subject themselves to robot discrimination. Many other bots wouldn’t respect it either, like the Getty Images bot.
This is another important point: robots.txt doesn’t convey usage. Allowing your website to be crawled for indexing may result in its content being used to train AI models as well.
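For illustration, a robots.txt along these lines is roughly what the favoritism described above looks like in practice: Googlebot gets free rein while OpenAI’s GPTBot and every other crawler is asked to stay out. Whether any given crawler actually honors it is, of course, the whole problem.

```
# Allow Google's crawler everywhere (an empty Disallow permits everything)
User-agent: Googlebot
Disallow:

# Ask OpenAI's training crawler to stay away entirely
User-agent: GPTBot
Disallow: /

# Every other crawler is asked to stay out as well
User-agent: *
Disallow: /
```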
Should they be ignoring robots.txt? I see both sides of the argument, and honestly I sympathize with some robots more than others. Bots like Getty’s only exist to serve copyright takedown notices, which is not that useful to society. Other bots enable image searching, which can’t work if those bots are blocked.
https://comicbook.com/irl/news/doot-doot-an-internet-mystery-is-solved/
The Internet Archive is helping to preserve our history and keep it from disappearing on a massive scale. We’re already able to witness that disappearance happening.
Introducing the bot eliminator, for people against AI: GPTBot receives a large amount of valid words in random order.
gagol2,
I couldn’t find a link; would you provide one?
It doesn’t seem like spamming random words would be a good long-term solution, because randomly generated words stand out statistically. Also, as a human, I feel it would get annoying to see random garbage injected into websites.
IMHO the best way to trick AI would be to use AI to generate well-written garbage, but then you inevitably end up tricking humans (and search engines) too. This seems like a fundamental problem.
It is only an idea, but if you are really interested I could slap something together in nodejs for the giggles.
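Something like this minimal sketch, perhaps, using only Node’s built-in http module; the word list, the port, and the user-agent check are just placeholders for illustration:

```js
// Serve word salad to GPTBot, the real page to everyone else.
const http = require("http");

const WORDS = ["archive", "protocol", "crawler", "banana", "lattice", "verbose", "quorum", "saffron"];

// Valid words, random order, no meaning.
function wordSalad(count = 500) {
  const out = [];
  for (let i = 0; i < count; i++) {
    out.push(WORDS[Math.floor(Math.random() * WORDS.length)]);
  }
  return out.join(" ");
}

http.createServer((req, res) => {
  const ua = req.headers["user-agent"] || "";
  if (/GPTBot/i.test(ua)) {
    // AI crawler: feed it valid words in random order instead of the real page.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(`<html><body><p>${wordSalad()}</p></body></html>`);
  } else {
    // Everyone else gets the real content.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end("<html><body><p>The actual page goes here.</p></body></html>");
  }
}).listen(8080);
```

Run it with node and curl it with a GPTBot user agent to see the salad; any other user agent gets the real page.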
gagol2,
Oh, I thought you were referring to something that was already being done.
Generating random words is algorithmically trivial; a better algorithm would follow a realistic word distribution. But my main concern was more that this pollutes webpages designed for humans. Anything done to hide this garbage from humans would naturally make it detectable by bots.
I did some personal research in random number generation; making credible unintelligible phrases at high throughput sounds like a nice theoretical summit to climb. I smell a lot of arrays of pointers sized to powers of two in my future.
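If that’s the direction, here is a tiny sketch of the power-of-two trick being hinted at, assuming a cheap xorshift32 generator: when the word table’s length is a power of two, a bitmask replaces the modulo, which keeps the hot loop division-free. The word list is a placeholder.

```js
// Power-of-two table plus bitmask indexing for high-throughput word picking.
const WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]; // 8 = 2^3
const MASK = WORDS.length - 1; // only valid because the length is a power of two

let state = 0x9e3779b9 >>> 0; // any nonzero 32-bit seed
function xorshift32() {
  state ^= state << 13; state >>>= 0;
  state ^= state >>> 17;
  state ^= state << 5;  state >>>= 0;
  return state;
}

function randomPhrase(length = 12) {
  const out = [];
  for (let i = 0; i < length; i++) out.push(WORDS[xorshift32() & MASK]);
  return out.join(" ") + ".";
}

console.log(randomPhrase());
```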
gagol2,
That reminds me, I built a random phrase generator several years ago. Input pages from OSNews and voilà…
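For anyone curious what that kind of generator can look like, here is a rough first-order Markov chain sketch; it is not necessarily how that one was built, and the sample string is just a stand-in for whatever pages you feed it:

```js
// Build a word -> followers table from some source text, then walk it randomly.
const SOURCE = "the robots txt file governs a give and take and the web as a whole may not be able to keep up with the state of the art";

function buildChain(text) {
  const words = text.toLowerCase().split(/\s+/);
  const chain = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    if (!chain.has(words[i])) chain.set(words[i], []);
    chain.get(words[i]).push(words[i + 1]);
  }
  return chain;
}

function generate(chain, length = 15) {
  const keys = [...chain.keys()];
  let word = keys[Math.floor(Math.random() * keys.length)];
  const out = [word];
  for (let i = 1; i < length; i++) {
    const followers = chain.get(word);
    if (!followers) break; // the source's final word has no recorded successor
    word = followers[Math.floor(Math.random() * followers.length)];
    out.push(word);
  }
  return out.join(" ");
}

console.log(generate(buildChain(SOURCE)));
```

Because each word is picked from the words that actually followed it in the source, the output inherits a far more realistic distribution than uniform word salad, which is the concern raised a few comments up.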
Why should it? Let me ask one more time, but louder for the people in the back: Why should it?
Why should a crappy txt file placed on your server give you extra legal rights on top of the ones prescribed by copyright? It’s good that robots.txt exists as a voluntary standard, but no organization or individual is bound by it. Even the Internet Archive decided to start ignoring robots.txt some years ago, because most robots.txt files are tailored towards search engines (not archival systems), and it is under no obligation to respect them.