2024-01-08
The developer OpenAI has said it would be impossible to create tools like its groundbreaking chatbot ChatGPT without access to copyrighted material, as pressure grows on artificial intelligence firms over the content used to train their products.
Chatbots such as ChatGPT and image generators like Stable Diffusion are “trained” on a vast trove of data taken from the internet, with much of it covered by copyright – a legal protection against someone’s work being used without permission.
↫ Dan Milmo for the Guardian
I can’t become a billionaire without robbing banks, so robbing banks should be legal.
Like many other things OpenAI says, this should be taken with a grain of salt.
If your aim is to build a “language” model, there is definitely sufficient information available in the public domain. Yes, it won’t “speak” very modern English, but it will have the basic functionality.
If your aim is to be able to answer questions, it is possible to “purchase” datasets, or integrate open sources, like Wikipedia (CC BY-SA licensed).
If you really need the most up-to-date information on everything, you can include a “web search” module, which ChatGPT and Google’s Bard already do. Basically you’d do a web query as an agent of the user, download those pages, and summarize them, again at the user’s request. This will obviously be slower than “memorizing” that information, but not much different than a “very advanced” screen reader.
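To make the idea concrete, here is a rough Python sketch of that query, download, summarize loop. It is only an illustration under stated assumptions, not how ChatGPT or Bard actually implement it: the `search` and `fetch` callables are hypothetical stand-ins for a real search API and HTTP client, and the “summary” is a naive keyword-overlap pick of sentences where a real system would use a language model.

```python
from typing import Callable, List


def answer_with_web_search(
    question: str,
    search: Callable[[str], List[str]],  # hypothetical: question -> list of URLs
    fetch: Callable[[str], str],         # hypothetical: URL -> page text
    max_pages: int = 3,
    max_sentences: int = 2,
) -> str:
    """Query the web on the user's behalf and return a short, sourced summary."""
    # Step 1: run the web query as an agent of the user.
    urls = search(question)[:max_pages]

    # Keywords used by the naive summarizer (assumption: words longer than 3 chars).
    keywords = {w.lower().strip("?.,!") for w in question.split() if len(w) > 3}

    picked = []
    for url in urls:
        # Step 2: download the page.
        text = fetch(url)
        # Step 3: "summarize" by keeping sentences that share keywords with the question.
        for sentence in text.split(". "):
            words = {w.lower().strip("?.,!") for w in sentence.split()}
            overlap = len(keywords & words)
            if overlap:
                picked.append((overlap, sentence.strip().rstrip("."), url))

    # Highest keyword overlap first; each line keeps a pointer to its source.
    picked.sort(key=lambda item: item[0], reverse=True)
    return "\n".join(f"{s}. (source: {u})" for _, s, u in picked[:max_sentences])


if __name__ == "__main__":
    # Stubbed pages so the sketch runs without network access.
    pages = {
        "https://example.org/a": "Copyright protects original works. Cats are nice.",
        "https://example.org/b": "Public domain works have no copyright protection.",
    }
    print(answer_with_web_search(
        "What does copyright protect?",
        search=lambda q: list(pages),
        fetch=lambda u: pages[u],
    ))
```

Nothing is stored between calls: every answer is assembled from freshly downloaded pages and points back to its sources, which is the difference from “memorizing” the material during training.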
In any case, OpenAI seems to be building public support in order to get preferential treatment from the government. Don’t get me wrong, they offer a very valuable scientific output, and a useful tool (which I pay the monthly fee for). But it does not mean they act like any other business.
Spicy take: human artists and writers also learn by digesting copyrighted works.
Hexadecima,
You hit the nail on the head. Every single one of us with academic training takes away knowledge from copyrighted works, be it textbooks, news articles, etc. Even though it’s nearly all copyrighted, we have the right to remember it, talk about it, and even profit from it, without the author’s permission. Like it or not, copyright law allows this. Storing this knowledge in our brains has never been vilified before. I believe the real criticism of using copyrighted works to train artificial neural nets (as opposed to biological ones) isn’t that regurgitating knowledge violates copyright law, but that artificial NNs are becoming more scalable and effective.
* I understand that there are cases of actual copyright infringement, where the NN reproduces a work verbatim. That obviously needs to be fixed. But even once verbatim reproduction gets fixed to fully respect copyright law, people are still going to get upset over the AI; that’s just the truth.
I suspect some would favor amending copyright law with more blatantly discriminatory terms…
I can imagine how these discriminatory terms could end up causing their own new problems once we inevitably start treating human ailments like dementia and Parkinson’s with machine implants.
sukru,
That’s a great point. If we limit training material to the public domain and things that are out of copyright, it fundamentally changes the AI’s level of expertise. The public domain offers a great representation of human knowledge from over half a century ago, but it opens up humongous knowledge gaps.
Take the field of medicine, for example: AI has a great opportunity to improve patient services.
I fully understand the need for quality controls, but everyone including Thom should agree that depriving the AI of modern texts and making it dependent on antiquated ones isn’t a productive path…there needs to be a better solution.