An extensive study by the European Broadcasting Union and the BBC highlights just how deeply inaccurate and untrustworthy “AI” news results really are.
↫ BBC’s study press release
- 45% of all AI answers had at least one significant issue.
- 31% of responses showed serious sourcing problems – missing, misleading, or incorrect attributions.
- 20% contained major accuracy issues, including hallucinated details and outdated information.
- Gemini performed worst with significant issues in 76% of responses, more than double the other assistants, largely due to its poor sourcing performance.
- Comparison between the BBC’s results earlier this year and this study show some improvements but still high levels of errors.
“AI” sucks even at its most basic function. It’s incredible how much money is being pumped into this scam, and how many people are wholeheartedly defending these bullshit generators as if their lives depended on it. If these tools can’t even summarise a text – something you learn in early primary school as a basic skill – how on earth are they supposed to perform more complex tasks like coding, making medical assessments, or distinguishing between a chips bag and a gun?
Maybe we deserve it.


This focuses on the wrong questions. AI is not a scam. It wouldn’t be so scary if it was.
I was reading Donald Knuth’s take on ChatGPT yesterday. Very interesting, and it linked to a very valid critique of AI:
https://www-cs-faculty.stanford.edu/~knuth/chatGPT20.txt
cheemosabe,
While it is refreshing to see Knuth, you should also look at the date of those “conversations”:
The language model landscape has changed significantly over those two years. Not only are the models now much more refined, they also integrate with real-world data sources, like Google Search or Wikipedia, for truthful results.
(This is called “retrieval augmented generation”)
For example, the third question:
After a lengthy exploration of different approaches, Gemini (Google’s counterpart to ChatGPT) concludes:
This means it can not only tap into a larger internal knowledge base, it can also look up what Mathematica is doing.
And these will only get better.
Basically, you now have several additional tools that make LLMs much more powerful:
1 – Larger context sizes, and hence the ability to “dump information”
2 – Tool calling (“get_temperature(), list_files()”)
3 – A combination of these, “RAG” (Retrieval Augmented Generation), where results of web searches, Wikipedia articles or other truthful sources are retrieved using standard Information Retrieval algorithms and fed into that context
4 – Better evals like MMLU
5 – “Thinking”: being able to ingest information slowly, and the ability to go back and fix responses before producing final output
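To make items 2 and 3 a bit more concrete, here is a rough Python sketch, not how any particular assistant actually does it. The tiny DOCUMENTS list, the stubbed get_temperature() and the helper names (dispatch, retrieve, build_prompt) are all made up for illustration, and TF-IDF with cosine similarity is just one example of a "standard Information Retrieval algorithm". No real model is called; the code only shows what gets handed to one.

```python
import json
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for web search results or Wikipedia articles.
DOCUMENTS = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Mathematica is a computer algebra system developed by Wolfram Research.",
    "Retrieval augmented generation feeds retrieved documents into the model context.",
]

# --- Tool calling: the model emits a structured request, the host program runs it ---

def get_temperature(city):
    # Stub with a hard-coded value; a real tool would query a weather service.
    return {"city": city, "celsius": 21.0}

TOOLS = {"get_temperature": get_temperature}

def dispatch(tool_call_json):
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# --- RAG: retrieve relevant text, then place it into the model's context ---

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the corpus get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, docs, k=1):
    """Assemble the text that would be sent to the model (no model is called here)."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(dispatch('{"name": "get_temperature", "arguments": {"city": "Paris"}}'))
    print(build_prompt("Who develops Mathematica?", DOCUMENTS))
```

The point of the design is that the model never has to "know" the answer: the host program fetches the source text or runs the tool, and the model only has to read what it was given.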
Over time this only gets better. Yes, even the answers in Knuth’s early conversations with ChatGPT were quite good. But today you’d get more truthful answers.
Yes, they’ve advanced, and personally I really doubt they’ll reach a plateau. I do leave that possibility open, if only because, of the many far more intelligent people than me that I follow, a couple do think that LLMs are a dead-end (John Carmack, Rich Sutton). I highly doubt it though. I don’t see how someone can so easily dismiss the possibility of combining LLMs with something else like RL or GANs to greatly improve them. Even if they do plateau, there are so many interesting things that have been learned from LLMs and they seem such a rich source of further inquiry. Figuring out how ChatGPT generated that sonnet-haiku at Knuth’s request would be so interesting to me (using LLMs for model interpretability is one valuable output). Even the word2vec paper from 2013 is so interesting to think about, representing words as vectors and obtaining sister as brother-man+woman.
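For anyone who hasn't seen the brother-man+woman trick, here is a toy Python sketch of the arithmetic. The 2-D vectors are hand-made for illustration (one axis roughly "gender", one roughly "sibling-ness"); they are not real word2vec embeddings, which are learned from text and have hundreds of dimensions. The analogy() helper is likewise just an illustrative name.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
# Axis 0 is roughly "gender", axis 1 roughly "sibling-ness".
VECTORS = {
    "man": (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "brother": (1.0, 1.0),
    "sister": (-1.0, 1.0),
    "uncle": (1.0, 0.8),
    "aunt": (-1.0, 0.8),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors=VECTORS):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the three input words, as is commonly done for analogy queries.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

if __name__ == "__main__":
    print(analogy("brother", "man", "woman"))  # sister
    print(analogy("uncle", "man", "woman"))    # aunt
```

With these toy numbers, brother - man + woman lands exactly on the "sister" vector. With real embeddings the result is only approximately there, which is why the lookup is a nearest-neighbour search rather than an exact match.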
cheemosabe,
I think that “plateau” is already beyond human level.
On many “language” tasks, the “large language models” have already surpassed average humans, and sometimes human experts.
They still lack some nuance, can hallucinate, and miss details. But that is where augmentation comes in. And they can easily be prompted to self-correct. Asking “Are you sure the author meant X, and was not making a joke?” could easily fix common mistakes.
Basically, the tasks where LLMs are already superior to humans, or at expert level, are:
1 – Translation (in blind tests).
https://www.getblend.com/blog/which-llm-is-best-for-translation/
2 – Summarization (~on par with freelance writers)
https://www.researchgate.net/publication/377949691_Benchmarking_Large_Language_Models_for_News_Summarization
3 – Understanding (comprehension, like high school tests)
https://odsc.medium.com/20-llm-benchmarks-that-still-matter-379157c2770d
Overall they are at a level that makes them extremely useful when paired with (a) other machine tools, (b) humans.
They are essentially the next generation of human-machine interface. For example, they can provide a first draft of a translation, and then a paid expert can go over it and edit it. Or the other way around: they can be used to verify translations and find areas for improvement.
The future is already here, and very exciting.
(Yes, embeddings are extremely useful. We had built models for embedding based recommendations for Google Chrome. Today they are used in RAG with vector databases)
The funny thing is that people still believe the news is anything other than brainwashing and propaganda to begin with. AI is just another manipulation tool on top of it all.
jbauer,
Yeah, the White House today misrepresents content most of the time 🙁
Everyone has natural biases, which isn’t a surprise, but the amount of lying makes 1984 look pale by comparison.