An extensive study by the European Broadcasting Union and the BBC highlights just how deeply inaccurate and untrustworthy “AI” news results really are.
↫ BBC’s study press release
- 45% of all AI answers had at least one significant issue.
- 31% of responses showed serious sourcing problems – missing, misleading, or incorrect attributions.
- 20% contained major accuracy issues, including hallucinated details and outdated information.
- Gemini performed worst with significant issues in 76% of responses, more than double the other assistants, largely due to its poor sourcing performance.
- A comparison between the BBC’s results from earlier this year and this study shows some improvements, but still high levels of errors.
“AI” sucks even at its most basic function. It’s incredible how much money is being pumped into this scam, and how many people are wholeheartedly defending these bullshit generators as if their lives depended on it. If these tools can’t even summarise a text – something you learn in early primary school as a basic skill – how on earth are they supposed to perform more complex tasks like coding, making medical assessments, or distinguishing between a chips bag and a gun?
Maybe we deserve it.

This focuses on the wrong questions. AI is not a scam. It wouldn’t be so scary if it was.
I was reading Donald Knuth’s take on ChatGPT yesterday. Very interesting, and it linked to a very valid critique of AI:
https://www-cs-faculty.stanford.edu/~knuth/chatGPT20.txt
cheemosabe,
While it is refreshing to see Knuth, you should also look at the date of those “conversations”:
The language model landscape has changed significantly over that two-year period. Not only are the models now much more refined, they also integrate with real-life data sources, like Google Search or Wikipedia, for truthful results.
(This is called “retrieval-augmented generation”.)
For example, the third question:
After a lengthy exploration of different approaches, Gemini (Google’s counterpart to ChatGPT) concludes:
This means it can tap not only into a larger internal knowledge base, but can also look up what Mathematica is doing.
And these will only get better.
Basically, you now have several additional tools that make LLMs much more powerful:
1 – Larger context sizes, and hence the ability to “dump information” into the prompt
2 – Tool calling (“get_temperature(), list_files()”)
3 – A combination of these, “RAG” (Retrieval-Augmented Generation), where results of web searches, Wikipedia articles or other truthful sources are retrieved using standard information-retrieval algorithms and fed into that context (see the sketch after this list)
4 – Better evals like MMLU
5 – “Thinking”: the ability to ingest information slowly, and to go back and fix responses before producing the final output
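As a rough illustration of item 3, here is a minimal RAG sketch in Python; `search_wikipedia` and `ask_llm` are hypothetical placeholders for a real retrieval backend and a real model API, not any particular library:

```python
# Minimal RAG sketch: retrieve supporting text first, then let the model
# answer from that text instead of from its internal memory alone.
# `search_wikipedia` and `ask_llm` are hypothetical placeholders.

def search_wikipedia(query: str, top_k: int = 3) -> list[str]:
    """Stand-in for any standard information-retrieval backend."""
    raise NotImplementedError("plug in your own search/retrieval API")

def ask_llm(prompt: str) -> str:
    """Stand-in for whatever chat-completion API you use."""
    raise NotImplementedError("plug in your own model API")

def answer_with_rag(question: str) -> str:
    snippets = search_wikipedia(question)
    context = "\n\n".join(snippets)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "Say so if the sources are insufficient.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```

The point is that the model is asked to ground its answer in retrieved text rather than in recall, which is exactly why current assistants can cite sources at all.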
Over time this only gets better. Yes, even the answers in Knuth’s early conversations with ChatGPT were quite good. But today you’d get more truthful answers.
Yes, they’ve advanced, and personally I really doubt they’ll reach a plateau. I do leave that possibility open, if only because, of the many far more intelligent people than me that I follow, a couple do think that LLMs are a dead-end (John Carmack, Rich Sutton). I highly doubt it though. I don’t see how someone can so easily dismiss the possibility of combining LLMs with something else like RL or GANs to greatly improve them. Even if they do plateau, there are so many interesting things that have been learned from LLMs and they seem such a rich source of further inquiry. Figuring out how ChatGPT generated that sonnet-haiku at Knuth’s request would be so interesting to me (using LLMs for model interpretability is one valuable output). Even the word2vec paper from 2013 is so interesting to think about, representing words as vectors and obtaining sister as brother-man+woman.
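For anyone who wants to poke at that word2vec example themselves, here is a small sketch using gensim’s downloadable pretrained vectors (this assumes the `gensim` package; the vectors are a large download on first use, and the exact neighbours depend on which vectors you load):

```python
# Reproduce the classic word2vec analogy: brother - man + woman ~ sister.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pretrained word vectors

# Nearest neighbours of the vector (brother - man + woman); with these
# vectors, "sister" typically comes out on top.
print(vectors.most_similar(positive=["brother", "woman"],
                           negative=["man"], topn=3))
```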
cheemosabe,
I think that “plateau” is already beyond human level.
On many “language” tasks, the “large language models” have already surpassed average humans, and sometimes human experts.
They still lack some nuance, can hallucinate, and miss details. But that is where augmentation comes in. And they can easily be fed requests for self-correction. Prompting it with “Are you sure the author meant X and was not making a joke?” could easily fix common mistakes.
Basically, the tasks where LLMs are already superior to, or at the level of, human experts are:
1 – Translation (in blind tests).
https://www.getblend.com/blog/which-llm-is-best-for-translation/
2 – Summarization (~on par with freelance writers)
https://www.researchgate.net/publication/377949691_Benchmarking_Large_Language_Models_for_News_Summarization
3 – Understanding (comprehension, like high school tests)
https://odsc.medium.com/20-llm-benchmarks-that-still-matter-379157c2770d
Overall they are at a level that makes them extremely useful when paired with (a) other machine tools, or (b) humans.
They are essentially the next generation of human-machine interface. They can, for example, provide a first draft of a translation, and then a paid expert can go over and edit it. Or, the other way around, they can be used to verify translations and find areas for improvement.
The future is already here, and very exciting.
(Yes, embeddings are extremely useful. We had built models for embedding-based recommendations for Google Chrome. Today they are used in RAG with vector databases.)
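A toy sketch of what “embedding-based” means in practice: rank documents by cosine similarity between embedding vectors. The `embed` function is a hypothetical stand-in for whatever sentence-embedding model (or vector database) you actually use:

```python
# Toy embedding retrieval: score documents by cosine similarity between
# their embeddings and the query embedding. `embed` is a placeholder.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model of your choice")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query: str, documents: list[str]) -> list[str]:
    q = embed(query)
    scored = sorted(((cosine(q, embed(d)), d) for d in documents),
                    reverse=True)
    return [d for _, d in scored]
```

A vector database does essentially this, just with an index so you don’t have to scan every document.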
How come they do so terribly poorly in the BBC study? It’s pretty much the same task at hand as all of the above.
dsmogor,
You can actually check some of the sample BBC queries and their analysis:
https://www.bbc.co.uk/aboutthebbc/documents/bbc-research-into-ai-assistants.pdf
There are two main reasons I see here:
1 – Wrong tool, wrong job
2 – Analysis methods
What do I mean?
1 – They ask for summaries of recent news events or obscure historic ones. However, these require the “full” versions of the models. When I tried some of the queries in that report today, I got much better answers with the “pro” versions (and we can see why the light versions would struggle).
2 – They had queries on highly contested subjects like the Middle East, and analyzed the output from a biased point of view. These should have been avoided, or at least approached more objectively, knowing the sensitivity of the issue.
I would add a third one:
They have a vested interest in AI models failing. Traditional publishers will have difficulty competing against AI agents. And this shows in how the “study” itself was conducted. They are obviously trying to influence government policy, or at least public perception.
“A committee of horses found that the automobile fails on 45% of rural roads without pavement”
sukru,
They didn’t show all their data, so in the back of my mind I am wondering if they cherry picked the data they wanted. I hope they didn’t do that.
I do think a robust LLM test would be really interesting though. In scientific tests and medical trials we use the scientific method and blind testing to eliminate sources of bias. Don’t limit the responses to AI services, but include human-expert-generated content too, perhaps even responses from the original authors. Don’t tell the judge who or what generated the text. Don’t even tell the judge that any responses were generated by a human, because it should not matter to a judge’s objective assessment and we don’t want to leak side-channel information.
It’s fine if the human responses rank better, but let the data show that with an experiment designed to eliminate or account for the judge’s own bias! Not sure that Thom would be interested, but it seems like something that could be tested on OSNews.
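A minimal sketch of that blinded setup, with purely illustrative data and a `judge` placeholder standing in for the human rating step:

```python
# Blind the responses (strip provenance, shuffle), collect scores, and
# only un-blind afterwards. The data below is illustrative only.
import random

responses = [
    {"source": "human_expert", "text": "...response A..."},
    {"source": "assistant_1",  "text": "...response B..."},
    {"source": "assistant_2",  "text": "...response C..."},
]

def judge(text: str) -> int:
    """Stand-in for the human judging step (e.g. a 1-5 rating)."""
    raise NotImplementedError

random.shuffle(responses)
blinded = [r["text"] for r in responses]    # all the judge ever sees
key     = [r["source"] for r in responses]  # kept hidden until the end

scores  = [judge(text) for text in blinded]
results = dict(zip(key, scores))            # un-blind after scoring
```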
I prefer things like Edan Meyer’s take… that “‘scaling’ isn’t scalable” as the title card puts it. (In essence, training improved LLMs is so horrendously resource-inefficient that it’s a dead end on that alone.)
As soon as the VC money dries up (and there are signs it may happen soon), we’re going to see a gigantic crash as they can no longer afford the training costs.
ssokolow,
It might be true that training increasingly larger models becomes an exponentially harder problem. But at the same time, we probably don’t even need to do that.
Basically we are doing three major expansions:
1 – Adding new features to existing models, like tool calling (no need to teach them everything if they can just tap into Google Search and Wikipedia)
2 – Distilling models: using larger models as “teachers” to build much smaller student models with incredible performance for their size
3 – Improving on-device inference performance. With recent advancements in mlx, llama.cpp, vllm, and others, we can now achieve 10x the performance of just two years ago with the same exact models, on the same exact hardware (and of course the hardware is also being optimized)
You can now basically have GPT-4-comparable performance on your local $4,000 machine, including an extended feature set (a minimal local-inference sketch follows below).
(If you say $4,000 is too much: two years ago you’d have had at least one more zero on that number. Soon, we might cut one.)
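As a rough sketch of item 3, this is what local inference looks like with the llama-cpp-python bindings; the GGUF file name is a placeholder, and whether the result is really “GPT-4 comparable” depends entirely on which model and quantization you pick:

```python
# Run a quantized open-weights model locally via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./some-open-model.Q4_K_M.gguf",  # placeholder model file
    n_ctx=8192,        # context window size
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this article: ..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```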
He also talks about how LLMs can only interpolate, not extrapolate, which is another reason to not get too excited about them, and how a human doesn’t need to be trained on the entire Internet to solve problems more effectively.
(I can’t remember if he used a phrase equivalent to what I call a “world model”, but they also don’t have a world model, which is why things like Stable Diffusion and Midjourney struggle with abominations and with keeping background details consistent as they pass behind foreground objects.)
ssokolow,
Once again they are looking at the wrong place, and setting incorrect expectations. (Early on many people did that, so this can be excused)
LLMs are literally “large language models”, which means they excel at language-related tasks (which include deductive reasoning). They are essentially the next generation of human-machine interface.
They can be paired with many tools, including ILP (inductive logic programming) systems, which can be built on languages like Prolog.
(Like they tap into math backends for advanced math tasks, or image models for OCR, and so on).
LLMs are an interface. They take human input and convert it into machine commands, and they process machine output into human-understandable results.
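A minimal sketch of that interface idea, reusing the hypothetical `get_temperature()` example from earlier; `ask_llm` is a placeholder, and this is not tied to any particular vendor’s tool-calling API:

```python
# The model emits a machine-readable command, the host executes it, and
# the raw result goes back to the model to be phrased for the human.
import json

def get_temperature(city: str) -> str:
    return f"14 C in {city}"  # stand-in for a real weather backend

TOOLS = {"get_temperature": get_temperature}

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API")

def run(user_input: str) -> str:
    # 1. Human -> machine: the model picks a command as JSON.
    command = json.loads(ask_llm(
        'Reply only with JSON {"tool": ..., "args": {...}} for: ' + user_input
    ))
    # 2. The host executes the command.
    result = TOOLS[command["tool"]](**command["args"])
    # 3. Machine -> human: the model phrases the raw result.
    return ask_llm(f"Explain this result to the user: {result}")
```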
As for Stable Diffusion…
Although it employs transformers, it is ultimately a “de-noising algorithm”.
It tries to answer the question: “Can I make this image less noisy, while also matching the general concept of the user’s query?”
How do they work?
There are many good explanations. A random one that came up in a Google search: https://poloclub.github.io/diffusion-explainer/
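For a feel of what “de-noising” means, here is a deliberately over-simplified toy loop; `predict_noise` stands in for the trained network, and the schedule is a caricature of what real samplers do:

```python
# Conceptual toy of the reverse-diffusion loop: start from pure noise and
# repeatedly remove the noise the network predicts, guided by the prompt.
import numpy as np

def predict_noise(image: np.ndarray, step: int, prompt: str) -> np.ndarray:
    raise NotImplementedError("this is where the trained network goes")

def generate(prompt: str, steps: int = 50, size: int = 64) -> np.ndarray:
    image = np.random.randn(size, size, 3)     # start from pure noise
    for step in reversed(range(steps)):
        noise = predict_noise(image, step, prompt)
        image = image - noise / steps          # peel off a bit of noise
    return image
```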
The funny thing is people still believing the news is anything other than brainwashing and propaganda to begin with. AI is just another manipulation tool on top of it all.
jbauer,
Yeah, the White House today misrepresents content most of the time 🙁
Everyone has natural biases, which isn’t a surprise, but the amount of lying makes 1984 look pale by comparison.
Oh, the same BBC that made a “documentary” like this, with no AI to blame for it:
https://www.bbc.co.uk/news/articles/c629j5m2n01o
Let’s get rid of the scare quotes and call it FI.
I’d be gobsmacked if AI is only misleading people 45% of the time.
I might not know much, but I do know a couple of specific technologies and sciences very, very well, at least enough to suffer imposter syndrome. When responding to queries, AI basically reports on those technologies and sciences with a 100% failure rate, simply because AI almost certainly includes pseudo-science and fiction and uses it to build definitive conclusions. This is the ultimate fail: shizen in equals shizen out.
I read a study a year or so back that showed researchers could use carefully crafted questions to change the answers an AI delivers. No fake data needed; just ask the right questions and you can get the AI drawing the wrong conclusions.
cpcf,
“100% failure rate” seems like a big stretch to me. Do you have a reference?
In my testing I’ve found that LLMs do answer well the majority of the time, at least in a normal conversation without contextual manipulation. I have found that I can sometimes deliberately compel the LLM to lie by asking it to take on a persona that would lie and/or by instructing it to comply with a falsehood. The LLM will respond as requested and go on trying to defend the lie. While this level of LLM compliance could be considered a fault, I do feel it’s very important to separate these cases of intentional prompt manipulation versus the LLM’s own unprompted lies. This is why it’s extremely important to provide the entire discussion context, including the initialization instructions (if any were supplied).
In general I’m not a fan of debating a conclusion when the underlying data hasn’t been provided. Can you share the exact questions and prompts being used so that we can all independently try them?
Half of news articles misrepresent things or just want to push the journalist’s worthless opinions… then half of the summaries made by AI are incorrect… in the future we won’t be able to trust any news any more.
Some look at this from a glass-half-full perspective: mostly correct is a great achievement. But that’s not what computers and calculators are meant to be. The problem here isn’t the accuracy of the total content; it’s not like a score on a test where 90% is a great score. If my AI driver takes me on a journey across the state and ends it by driving into a tree, it’s a fail no matter how much of the trip went well. We do not measure reality the same way Elon Musk measures a launch.
AI and other models can’t be serving up random fantasies mixed in with genuine content. If an AI gives you an explainer of how observations of Mercury’s orbit confirmed relativity, it can’t then close off with “Mercury is made of cheese” and still be correct.
A dictionary that gets the meaning 100% correct but misspells the word is next to worthless!
We aren’t assessing pre-schoolers.
cpcf,
Perhaps a simpler LLM that doesn’t contain any knowledge but simply interfaces between you and encyclopedic sources would be more up your alley? I think an LLM works well as a librarian. I would also like an LLM mode that always cites specific sources for claims. This seems technically doable and would address most criticisms about hallucinations. However I suspect that while an AI company that quoted sources verbatim would no longer be criticized over accuracy, they would suddenly become inundated by take down requests. By generalizing information as they do, it’s no longer a copyrighted expression. So in an ironic way, copyright discourages the use of information that hasn’t been reinterpreted by the LLM.
Alfman,
They already do that. The recent models do give citations to their results.
And when I have a concern, I look those sources up. Sometimes they really do not support the summary. And then one can always ask “are you sure source ABC supports this? Go read it again. And double check all your work”
sukru,
I ran some queries on ChatGPT and I see it’s gotten much better about including links. Things seem to be improving. I am most interested in the offline models personally. Alas, the downloadable models are a bit older and I run them with reduced precision because they’re so big. I don’t run the newest hardware, but it makes good use of the 64GB I over-provisioned my computers with 🙂
Anyway, I think that designing an LLM to work as a reference librarian rather than a database of knowledge could actually prove useful for those who don’t want the LLM to generate information but still want to benefit from its powerful lookups.
Alfman,
Agreed, we have spent (wasted?) a lot of time and money on teaching increasingly larger corpora to language models.
However, giving them access to actual information has much better returns.
“Librarian”… a good metaphor.
I hand-held AI and got it to decode some old 1990s EA 3D model formats. It worked. It was able to decode them and write a 3D model viewer for Windows/Mac/Linux. I then told Codex to add VR support using OpenVR (SteamVR’s library), and it worked. AI’s improvement rate is getting crazy. It’s already able to uplift a project from libSDL1.2 to libSDL2. In 6-12 months, assuming the progress doesn’t stall, we should be able to throw most open-source games at ChatGPT Codex and have it generate new features and maintain codebases just by issuing instructions.
Whether or not AI is a scam, we’ll soon be facing the same problem: millions upon millions of workers laid off, with no income, no healthcare, no future, and nothing to lose. The neoliberal solution is “ignore them”, and the fascist solution is “put most of them in concentration camps and create jobs for the remainder in the camp infrastructure”. The first is obviously unsustainable, the second… well, you’d better hope it’s unsustainable.
rainbowsocks,
That’s just it, jobs are going to continue getting automated regardless of any of our opinions about it. I know people don’t want to hear it, but realistically I don’t think we’re able to stop corporations from replacing expensive employees with cheaper AI. These changes are a preview of things to come.
https://www.reuters.com/business/world-at-work/amazon-targets-many-30000-corporate-job-cuts-sources-say-2025-10-27/
We need to prepare society for this! Yet all I see happening is social safety nets actively being removed at the same time, which is insane and leaves us in very bad shape for the oncoming storm. People have been projecting the collapse of the AI industry, but while there will be an economic collapse at some point, I don’t think it’s regular workers who come out on top. I predict the remaining AI companies stand to benefit from consolidation while jobs continue to be displaced.
Exactly. And not just that, the same will happen even if replacing workers with chatbots is largely not cost effective. Right now the AI bubble is propping up the entire US market. If (IMO when) that bubble collapses, it will be worse than 2008. Massive layoffs will be inevitable regardless of how good automation has gotten. We’re either fucked one way, fucked another way, or fucked both ways.
As to your second point. Again, I think it’s important to point out that the pro-AI people are consistently the *same* people destroying the safety nets, in an absolutely deliberate fashion. Why this is depends on who you ask. Some of my friends think it’s just idiocy; they’re rich and out of touch, are opposed to safety nets on principle, and don’t even consider the ramifications of a huge increase in the number of desperate people. I take a more conspiracist angle, which I think is well evidenced by the ideologies of people like Marc Andreessen and Peter Thiel: they believe most people are Untermenschen who don’t deserve to live, and see mass elimination of jobs, social safety nets, medical care, etc. as a way to hasten the deaths of the undeserving. You can see a similar theme at play in the response to COVID, where “human life should be protected” quickly gave way to “the elderly and disabled should die so we can make more money” even among supposed liberals.
rainbowsocks,
But unfortunately it is cost effective.
Look at Amazon. They replaced the first tier of customer service with AI. Yes, it is worse than a trained representative you connect to in Tennessee; however, for many years it was not a local person in this country but, increasingly, a random person in a third-world country who received minimal education beyond reading a script.
There were many cases in Amazon Prime forums where customers complained that CS representatives were outright lying. Yes, to get their “tickets” closed, they would promise refunds or other “solutions” without actually doing them.
If your competition is a lowest tier “human” service, AI all of a sudden becomes an improvement.
(Don’t get me wrong, I wish they would “on shore” customer service. Companies want to get rid of this, but that is their most important public face)
Fox and NPR should get an AI. Their accuracy will increase 11-fold…