The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
↫ David Pierce for The Verge
Another thing “AI” does not respect.
This is a very nuanced topic.
I’ve seen a lot of websites that only allow Google’s robots while blocking others. This kind of favoritism is unfortunate. On the other hand, I’ve witnessed firsthand how Bingbot ignored robots.txt for a few years. We opened a ticket with Microsoft, and at the time that was just the way it was programmed to operate; I think they didn’t want to subject themselves to robot discrimination. Many other bots wouldn’t respect it either, like the Getty Images bot.
This is another important point: robots.txt doesn’t convey usage. Allowing your website to be crawled for indexing may result in its content being used to train AI models as well.
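For illustration, a robots.txt along these lines is roughly what the favoritism described above looks like in practice: Googlebot gets free rein while OpenAI’s GPTBot and every other crawler is asked to stay out. Whether any given crawler actually honors it is, of course, the whole problem.

```
# Allow Google's crawler everywhere (an empty Disallow permits everything)
User-agent: Googlebot
Disallow:

# Ask OpenAI's training crawler to stay away entirely
User-agent: GPTBot
Disallow: /

# Every other crawler is asked to stay out as well
User-agent: *
Disallow: /
```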
Should they be ignoring robots.txt? I see both sides of the argument, and honestly I sympathize with some robots more than others. Bots like Getty’s only exist to serve copyright takedown notices, which is not that useful to society. Other bots enable image searching, which can’t work if those bots are blocked.
https://comicbook.com/irl/news/doot-doot-an-internet-mystery-is-solved/
The Internet Archive is helping to preserve our history and keep it from disappearing on a massive scale. We’re already able to witness that disappearance happening.
Introducing the bot eliminator, for people against AI: GPTBot receives a large amount of valid words in random order.
gagol2,
I couldn’t find a link; would you provide one?
It doesn’t seem like spamming random words would be a good long-term solution, because randomly generated words stand out statistically. Also, as a human, I feel it would get annoying to see random garbage injected into websites.
IMHO the best way to trick AI would be to use AI to generate well-written garbage, but then you inevitably end up tricking humans (and search engines) too. This seems like a fundamental problem.
It is only an idea, but if you are really interested I could slap something together in nodejs for the giggles.
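Something like this minimal sketch, perhaps, using only Node’s built-in http module; the word list, the port, and the user-agent check are just placeholders for illustration:

```js
// Serve word salad to GPTBot, the real page to everyone else.
const http = require("http");

const WORDS = ["archive", "protocol", "crawler", "banana", "lattice", "verbose", "quorum", "saffron"];

// Valid words, random order, no meaning.
function wordSalad(count = 500) {
  const out = [];
  for (let i = 0; i < count; i++) {
    out.push(WORDS[Math.floor(Math.random() * WORDS.length)]);
  }
  return out.join(" ");
}

http.createServer((req, res) => {
  const ua = req.headers["user-agent"] || "";
  if (/GPTBot/i.test(ua)) {
    // AI crawler: feed it valid words in random order instead of the real page.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(`<html><body><p>${wordSalad()}</p></body></html>`);
  } else {
    // Everyone else gets the real content.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end("<html><body><p>The actual page goes here.</p></body></html>");
  }
}).listen(8080);
```

Run it with node and curl it with a GPTBot user agent to see the salad; any other user agent gets the real page.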
gagol2,
Oh, I thought you were referring to something that was already being done.
Generating random words is algorithmically trivial; a better algorithm would follow a realistic word distribution. But my main concern was more that this pollutes webpages designed for humans. Anything done to hide this garbage from humans would naturally make it detectable by bots.
I did some personal research in random number generation; making credible unintelligible phrases at high throughput sounds like a nice theoretical summit to climb. I smell a lot of arrays of pointers sized to powers of two in my future.
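If that’s the direction, here is a tiny sketch of the power-of-two trick being hinted at, assuming a cheap xorshift32 generator: when the word table’s length is a power of two, a bitmask replaces the modulo, which keeps the hot loop division-free. The word list is a placeholder.

```js
// Power-of-two table plus bitmask indexing for high-throughput word picking.
const WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]; // 8 = 2^3
const MASK = WORDS.length - 1; // only valid because the length is a power of two

let state = 0x9e3779b9 >>> 0; // any nonzero 32-bit seed
function xorshift32() {
  state ^= state << 13; state >>>= 0;
  state ^= state >>> 17;
  state ^= state << 5;  state >>>= 0;
  return state;
}

function randomPhrase(length = 12) {
  const out = [];
  for (let i = 0; i < length; i++) out.push(WORDS[xorshift32() & MASK]);
  return out.join(" ") + ".";
}

console.log(randomPhrase());
```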
gagol2,
That reminds me, I built a random phrase generator several years ago. Input pages from OSNews and voilà…
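For anyone curious what that kind of generator can look like, here is a rough first-order Markov chain sketch; it is not necessarily how that one was built, and the sample string is just a stand-in for whatever pages you feed it:

```js
// Build a word -> followers table from some source text, then walk it randomly.
const SOURCE = "the robots txt file governs a give and take and the web as a whole may not be able to keep up with the state of the art";

function buildChain(text) {
  const words = text.toLowerCase().split(/\s+/);
  const chain = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    if (!chain.has(words[i])) chain.set(words[i], []);
    chain.get(words[i]).push(words[i + 1]);
  }
  return chain;
}

function generate(chain, length = 15) {
  const keys = [...chain.keys()];
  let word = keys[Math.floor(Math.random() * keys.length)];
  const out = [word];
  for (let i = 1; i < length; i++) {
    const followers = chain.get(word);
    if (!followers) break; // the source's final word has no recorded successor
    word = followers[Math.floor(Math.random() * followers.length)];
    out.push(word);
  }
  return out.join(" ");
}

console.log(generate(buildChain(SOURCE)));
```

Because each word is picked from the words that actually followed it in the source, the output inherits a far more realistic distribution than uniform word salad, which is the concern raised a few comments up.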
Why should it? Let me ask one more time, but louder for the people in the back: Why should it?
Why should a crappy txt file placed on your server give you extra legal rights on top of the ones prescribed by copyright? It’s good that robots.txt exists as a voluntary standard, but no organization or individual is bound by it. Even the Internet Archive decided to start ignoring robots.txt some years ago, because most robots.txt files are tailored towards search engines (not archival systems), and it is under no obligation to respect them.