Adobe Creative Cloud users opened their apps yesterday to find that they were forced to agree to new terms, which included some frightening-sounding language. It seemed to suggest Adobe was claiming rights over their work.
Worse, there was no way to continue using the apps, to request support to clarify the terms, or even uninstall the apps, without agreeing to the terms.
↫ Ben Lovejoy at 9To5Mac
Of course users were going to revolt. Even without the scary-sounding language, locking people out of their applications unless they agree to new terms is a terrible dark pattern, and something a lot of enterprise customers certainly aren’t going to be particularly happy about. I’ve never worked an office job, so how does stuff like this normally go? I’m assuming employees aren’t allowed to just accept new licensing terms from Adobe or whatever on their office computers?
In response to the backlash, Adobe came out and said in a statement that it does not intend to claim ownership over anyone’s work, and that it’s not going to train its ML models on customers’ work either. The company states that to train its Firefly ML model, it only uses content it has properly licensed for it, as well as public domain content. Assuming Adobe is telling the truth, it seems the company at least understands the concept of consent, which is good news, and a breath of fresh air compared to crooks like OpenAI or GitHub. Content used for training ML models should be properly licensed for it, and consent should be properly obtained from rightsholders, and taking Adobe at their word, it seems that’s exactly what they’re doing.
Regardless, the backlash illustrates once again just how – rightfully – wary people are of machine learning, and of how their works might be illegally appropriated to train such models.
I was once presented with a modal dialog asking me to accept updated terms and conditions for Jira, seemingly on behalf of the entire company. I approached our compliance team about the matter. They told me: “Just click I agree”.
I bet your company already has a modified Terms of Service contract that supersedes Atlassian Jira’s blanket ToS for the masses. These are usually hammered out by the compliance team when new software is introduced or proposed.
You are confusing published works with private/confidential works (which is what some works created with Adobe’s software are). Not everything made with Adobe’s software is eventually published, you know.
To put it in simple terms, Adobe doesn’t need to put anything in the ToS to scrape published works from the internet (or borrow published works from the local library) to train ML models, they could already do that before. Adobe’s new ToS is an issue of privacy.
According to which law? Any transient copies of published works made for the purpose of training ML models fall under fair use, and you are not seriously claiming that the weights of a neural net are a copy of any work, right?
Seriously Thom, I understand you have a vendetta against AI for taking your job and all, but what you are proposing is a massive extension of copyright that would make 2014 Thom Holwerda really hate 2024 Thom Holwerda.
Except, you’re entirely ignoring the fact that tools like Copilot and ChatGPT have consistently been found to reproduce licensed works verbatim, without the accompanying licensing terms or in violation of existing copyright. Try writing a scientific paper or newspaper article with copy-and-pasted sections from other works without proper attribution or citations. That’s called plagiarism at the least, and copyright infringement at worst.
Again, why isn’t Copilot being trained on proprietary Microsoft code? You know just as well as I do it’s because Microsoft doesn’t want Copilot to spit out their proprietary code verbatim.
Yes, I agree.
The issue isn’t licensing per se, but how current “artificial intelligence” works. Currently, it’s some kind of sophisticated statistical system built on lessons learnt from machine learning, nothing else.
These systems are starting to become multimodal. They must still evolve their logical reasoning a lot and develop a sense of creativity.
Current AI must be carefully audited and surveyed, because it’s a plagiarist.
timofonic,
I’d very much encourage those criticizing AI for using statistical models to reflect on the statistical properties of our own human brains. After all, we serve as the model for so many of our artificial NN accomplishments.
Even by our own human standards, LLMs are proving themselves to be creative. They can and do create works that have legitimately never been produced before.
Sometimes we do find some examples where the LLM outputs a verbatim copy, and I wouldn’t say we should ignore this, but we should be fair in recognizing that this is the exception and not the rule. I think it’s wrong to criticize the entire LLM model over exceptional cases. In terms of fixing it I think we’ll be able to work out the kinks as the training process gets refined.
The reason I don’t like LLMs is that they tend to be chaotic, and it’s impossible to understand their internal behavior, much like with the human brain. This is good if you want them to produce art and do other harmless “creative” fun, and it can be useful for producing drafts for work, but when I see people proposing that we have LLMs drive cars, I am genuinely scared.
I prefer the “inference” kind of AI (as seen in Prolog and Rules-engine systems) where there are strict rules in place, but something like that would require making our cities less chaotic, so that’s out of the question. So, we make our computers chaotic instead (LLMs).
kurkosdr,
Yeah, I understand that. Yet you’ve probably been on a bus driven by a bus driver whose brain you didn’t fully understand either 🙂
(I appreciate your point though).
I was really impressed by Prolog’s capabilities when I was studying it years ago… Having a language that can output possibilities and permutations that fit some criteria, without actually writing the code to generate those outputs directly, is wild. I liked that idea a lot. Although I’ve never used anything like it for real project work, I could see something like that being very useful. Combining it with AI in some way could be very interesting.
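For anyone curious what that “describe the constraints, let the engine find the answers” style looks like, here is a minimal sketch; it’s plain Python standing in for Prolog’s backtracking search, and the seating puzzle and names are invented purely for illustration.

# Generate-and-test in the Prolog spirit: state the constraints,
# let the search enumerate every arrangement that satisfies them.
from itertools import permutations

# Toy puzzle: seat four people so that Alice is not next to Bob,
# and Carol sits somewhere to the left of Dave.
people = ["Alice", "Bob", "Carol", "Dave"]

def valid(seating):
    a, b = seating.index("Alice"), seating.index("Bob")
    c, d = seating.index("Carol"), seating.index("Dave")
    return abs(a - b) > 1 and c < d

for solution in (s for s in permutations(people) if valid(s)):
    print(solution)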
Thing is, the human driver is legally liable, has a self-preservation instinct, and also has a conscience that allows them to understand the importance of being legally liable. LLMs aren’t humans and can’t be put on trial (which means they aren’t liable), and they don’t have a self-preservation instinct or a conscience.
This is what I don’t like about the idea of LLMs driving cars: they are pitched as several times safer than human drivers, but when legal liability is brought to the table, suddenly LLMs can make gross mistakes that no human with a self-preservation instinct and a conscience would.
If you do that, you have essentially created something very close to AGI (combining logical thinking with impulsive thinking). But nobody has a clue how to combine those two types of AI (inference and LLM).
BTW if you are interested in more reading, the following is a good starting point:
https://en.wikipedia.org/wiki/Hubert_Dreyfus%27s_views_on_artificial_intelligence
This was from the time AI research went all-in in the other direction, where everything was about symbolic logic. Now everything is sub-symbolic logic (LLMs).
Symbolic logic, despite being unable to cope with chaos as well as LLMs do, was at least based on well-defined rules and was debuggable.
kurkosdr,
We generally expect people to behave as you’ve described, but that’s still based on assumptions…
https://www.bbc.com/news/world-32610497
Even though an LLM might not reach 100% accuracy, I think that in terms of statistical averages one can still come out ahead of humans regardless of lack of conscience. You’ve brought up a fascinating topic and if you want to continue it maybe start a new thread. WordPress is bad at deep discussion threads.
Thanks for the link 🙂 I have to go now but I’ll try reading it later.
I believe I’ve mentioned something like this before: artificial neural networks are inspired by elements of how brain tissue neural networks function, but they don’t “model” their function to the extent that you appear to assume.
They may be similar in some ways but are also fundamentally different.
Admittedly I’m not sure how much it matters to most of the discussions here except to say that you can’t really just assume that artificial neural nets are or will become capable of doing the same things in the same ways as brain tissue neural nets do.
Book Squirrel,
I thought you were going to point out that LLMs don’t work exactly the way brains do, which would be easier for me to agree with. However, since you haven’t qualified the assertion and made a broad generalization about artificial NNs in general, well, I have to disagree with it. There’s nothing intrinsic to tissue neural nets that cannot be modeled using an artificial neural net. For example, many of the artificial DNNs we design today don’t use feedback loops, but in principle there’s no reason it can’t be done.
https://news.mit.edu/2019/improved-deep-neural-network-vision-systems-just-provide-feedback-loops-0429
I’m not sure if maybe you misspoke, but I’m pushing back on the basis of what you are literally saying here. Obviously we don’t have the technology to scan a full brain in order to simulate it, but in principle how do you justify the assertion that an artificial neural net will not become capable of doing the same things as tissue neural nets?
Of course, if you were just using “artificial NN” as a stand-in for “LLM”, then what you are saying makes more sense, but that’s a bit of a different discussion.
The mathematical model we use for artificial neurons does not model real neurons in detail. Two reasons for this:
– Some aspects of real neurons are still not well understood (iirc).
– Some aspects of real neurons would be prohibitively expensive to try to model in a computer.
And both of those points go double when we start zooming out to actual networks.
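For concreteness, this is roughly the whole “mathematical model” in question: a weighted sum of inputs squashed by a simple nonlinearity. The numbers below are made up; real neurons involve spike timing, neurotransmitters, dendritic computation and more, none of which appears here.

# A single artificial neuron: weighted sum of inputs plus bias,
# passed through a sigmoid to give a "firing rate" between 0 and 1.
import math

def artificial_neuron(inputs, weights, bias):
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

print(artificial_neuron([0.5, -1.2, 3.0], [0.8, 0.1, -0.4], bias=0.2))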
You can compare this in some ways to how aeroplanes aren’t birds. Planes can be said to be a technology inspired by birds: they have the same body plan of two wings, a tail, and some landing gear. But a plane is also very much not a bird; it is specialised and simplified in its construction, which is what makes us able to build one from scratch and scale it up to move people around.
Running artificial NNs on digital hardware obviously requires them to be a significantly simplified version of what happens in biology. But that also allows us to scale them up to be somewhat useful for certain things.
(But also, much like planes, artificial NNs cause emission of enough greenhouse gases that we’re going to need to limit our usage of them on a global scale.)
Book Squirrel,
I believe this technology is coming into reach, but ironically artificial neurons that do what we train them to do are probably going to be more useful than human neurons that act human. To the corporations paying for this stuff, human employees (and their human neural brains) are merely the means to an end.
Thom Holwerda,
When NNs start to reproduce their training data, it’s caused by a phenomenon called neural network overfitting.
https://medium.com/analytics-vidhya/the-perfect-fit-for-a-dnn-596954c9ea39
https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
Reproducing the training data to pass training is easy; creating generalized knowledge is hard. But we know for a fact that LLMs are capable of generalizing, and that is typically the goal. AI trainers aren’t helpless to fix the training parameters responsible for it; the trouble is that these models are so large it takes a while to iteratively “debug” them. I do think most of the issues are being worked on and will be solved if they aren’t already. I’d genuinely like to see an article cover the evolution of LLMs over time to see whether and how the AI models are improving. Teething problems are inevitable, but as AI matures I don’t think it’s a fundamental problem that can’t be fixed through NN training fitness functions that penalize copies.
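As an illustration of the kind of knob trainers can turn (a toy sketch, not anyone’s production code): adding an L2 penalty, a.k.a. weight decay, to the training loss pushes the model toward smaller, more general weights instead of weights that memorize the training data exactly, which is the classic regularization trick from the links above.

# Toy example: linear regression trained by gradient descent with an
# L2 (weight decay) term added to the loss to discourage overfitting.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                              # toy inputs
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)    # toy targets

w = np.zeros(10)
lr, lam = 0.01, 0.1          # learning rate and L2 regularization strength

for _ in range(1000):
    err = X @ w - y
    grad = X.T @ err / len(y) + lam * w   # data gradient + weight decay
    w -= lr * grad

print("learned weights:", np.round(w, 2))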
It’s not an issue of having to “debug” them, the models have so many parameters, and they’re trained on so much data that “debugging” anything properly becomes practically impossible.
As per the article you linked to, one of the best strategies to avoid overfitting is to increase the size of the training data. But they can’t do that, because they already ingested the entire internet into the training corpus. And to the extent that you can do sensible data augmentation on textual data, they’re probably already doing that too.
It is also suggested that you reduce the complexity of the network, but I would argue that with the billion-parameter datacenter models the complexity is the whole point. It’s what gives you the best results. If you reduce the complexity to force the network to generalise more, you will also get worse results in general (see GPT-4 turbo).
Another issue is that there is no specific task that you can test the large datacenter models against to decide if they’re overfitting, because they’re not being trained for any specific task. They’re being billed as general problem solvers, so you can just use whatever prompts you like, and so *of course* overfitting will happen, and can pretty much always be induced when the number of parameters is this high.
You can specialise models of course by reducing both complexity and training corpus, but then in the case of say, specialised coding models, they will get much worse at responding to and stringing together natural language, which tends to be the primary thing that makes them appear useful and impressive.
Book Squirrel,
Are you suggesting that it’s impossible to debug LLMs to produce more favorable outputs? If so, that’s not really true, because in practice we can do this by changing the training and fitness functions. However, if that’s not what you mean, then please clarify.
Reducing the complexity of the network is one way to reduce overfitting, but I don’t think it’s necessarily the only way. One solution I’ve heard of is to introduce a “temperature” variable that encourages generalization early in the training and then slowly fits more detail, so that the details don’t overpower the DNN’s general fitness optimization.
https://www.baeldung.com/cs/simulated-annealing
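To make the “temperature” idea concrete, here is a tiny self-contained simulated annealing sketch on a toy objective. It only illustrates the schedule of exploring broadly at high temperature and refining as things cool down; it is not how any particular LLM is actually trained.

# Simulated annealing on a toy function with several local minima.
import math
import random

def cost(x):
    return x * x + 10 * math.sin(3 * x)   # global minimum near x ~ -0.5

x = random.uniform(-10, 10)
temperature = 10.0

while temperature > 1e-3:
    candidate = x + random.uniform(-1, 1)
    delta = cost(candidate) - cost(x)
    # Always accept improvements; accept worse moves with a probability
    # that shrinks as the temperature drops.
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
    temperature *= 0.995                   # cooling schedule

print(f"found x = {x:.3f}, cost = {cost(x):.3f}")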
I agree that LLMs like ChatGPT ought to become more specialized. It’s neat that a know-it-all LLM works, but I think breaking them down into more specialized domains, with another DNN to route requests, makes a lot of sense for future AI development.
My own belief is that all of the problems we are seeing today are fundamentally solvable, but it does take some time (and money) given the scale of these DNNs.
I suppose you can in some sense, if you’re willing to spend the time, energy and water required to retrain the model each “debugging” cycle.
But to “fix” specific broken outputs in an LLM, most resort to RLHF (reinforcement learning from human feedback). But that has an effect more akin to applying duct tape or a sledgehammer: it doesn’t actually fix any fundamental problems with the model, and it can degrade the network.
And how do we even measure a “favorable output”?
Again, quoting myself above, “there is no specific task that you can test the large datacenter models against to decide if they’re overfitting, because they’re not being trained for any specific task.”
Even what we might term as specialised coding models are too general for that. The “fitting” we’re talking about here refers typically to some simple and easily testable problem domain we’re trying to solve. But if you don’t have that, and you just allow people to send whatever prompts they like to the model, you can always come up with a problem (prompt) that it will overfit on.
So if it is a generative model with enough resolution/parameters, and it is trained on copyrighted materials, that means you can make it regurgitate copyrighted materials.
Book Squirrel,
Yes, we can debug by tweaking the data and fitness functions used to create the LLM in the first place. I never said it was easy, but it is possible. We don’t just throw up our hands and give up.
We absolutely can train an LLM with respect to some specific criterion that needs to be improved through the use of fitness functions. And copyright compliance is no exception. It should improve over time.
No, it doesn’t HAVE to regurgitate copyrighted materials if you penalize that during training.
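Purely as a hypothetical sketch of what “penalize that during training” could mean (no vendor is known to do exactly this): score each generated sample by the fraction of its n-grams that also appear in the training corpus, and fold that score into the loss as a copy penalty.

# Hypothetical "copy penalty": how much of the generated text's n-grams
# overlap with the training corpus.
def ngrams(tokens, n=8):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copy_penalty(generated_tokens, corpus_ngrams, n=8):
    gen = ngrams(generated_tokens, n)
    if not gen:
        return 0.0
    return len(gen & corpus_ngrams) / len(gen)   # fraction of copied n-grams

# Usage idea: total_loss = language_model_loss + weight * copy_penalty(...)
corpus_ngrams = ngrams("the quick brown fox jumps over the lazy dog".split())
sample = "the quick brown fox jumps over the lazy dog again".split()
print(copy_penalty(sample, corpus_ngrams))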
No, we absolutely should give this up in a world where we’re combating climate change and environmental disasters (including, you know, droughts).
Spending the amount of energy and water required to keep training and retraining these AI models in large datacenters is an atrocity in our current situation.
Even if that can be made to work reliably somehow, having it “improve over time” is just not good enough for people harmed by it now. If you have to spend months or years fixing a “bug” as serious as this in a service, then the service should be shut down until it works reliably (which might be never). This goes for GDPR issues pointed out by noyb and Schrems as well.
But in any case, even if it would be possible within x number of retraining cycles, that is quite frankly x too many for a climate that desperately needs to start cooling off like 20 years ago.
Book Squirrel,
That’s the interesting part though, while training is inefficient, a trained NN is trivial to scale and fairly efficient for the amount of output it can produce. If you were to run the numbers it’s quite conceivable that the carbon and energy footprint of AI is actually much less than the carbon and energy footprint of employees who would do the same tasks. So by your logic you could hypothetically make the case that AI should be replacing humans…yikes. While I’m not trying to sell this idea, I do think the point is important. Just saying that LLMs are inefficient isn’t that meaningful unless we ask “compared to what”.
You are entitled to that opinion, and I don’t necessarily disagree. There is a significant risk of stepping on people’s livelihoods and I think many people are in for a reckoning, but I also don’t think we can put the cork back in the bottle. Either we find a way to distribute the benefits of automation to all levels of society, or this automation will end up exacerbating our social inequalities.
No?
I didn’t compare them exactly because the comparison is entirely pointless. Even if we become able to shunt a more significant amount of work to generative models, any amount of work they do is going to cause extra emissions compared to if a human does it, because I’m working from the perfectly reasonable assumption that we’re not going to want to murder humans that get replaced in their work. We seem to agree on this, and I simply take that as a given when I argue with someone, because otherwise I’m going to have to accuse that person of being a psychopath.
Yes, there are elements of this that we agree on 100%.
You often seem to assume that people who are critical of these AI models are not also advocating for finding ways to distribute the benefits of automation. But I am. Automation displacing people is not a new problem by any stretch. I’ve mentioned being opposed to capitalist systems before: I’m a socialist who engages in politics and agitates for change in how we set up our social and economic systems. The benefits of automation should result in people having to work less, not in companies and rich people getting richer. It seems like we agree on this, and I spend a fair amount of my spare time agitating for such changes in various ways.
But I can work on multiple levels. I can *also* note as a computer scientist, system administrator and generally tech-literate person that these “AI”-models are not as useful as they appear to be when you see the hype, that most of their actual usefulness comes from essentially functioning as copyright obfuscators, and that the environmental cost of training and retraining them is absolutely insane in a world where we’re combating climate changes and environmental disasters.
And it is absolutely worth fighting them on those levels as well. Because by my experience it’s easier to strangle the relevant companies using existing regulation like copyright and GDPR and referring to existing climate policy agreements than it is to get politicians to instate the new kind of socialist policies that would be required for fairer distribution. In the short term, you usually have to resort to the tools you actually have.
But I’m not defeatist about policies for fairer distribution. I have been on the side of fighting for that for more than a decade *anyway*, and I still am, even if it’s incredibly frustrating.
But also, the urgent environmental and climate issues with expanding datacenters at the current rate could still exist, even under a fairer social and economic system. So that particular fight has to happen regardless. We need to stop the current ridiculously counterproductive expansion of datacenters somehow. It doesn’t do us any good to just lay down and say that we can’t “put the cork back in the bottle”, whatever that means. You could say the same about “distributing the benefits”. Capitalism escaped from the bottle a long time ago, and it is much, much harder to curb as a whole. But we will need to do that if we want to properly “distribute the benefits” of any progress in automation, and I’m right there with you on that.
If and when that happens, and if it happens on a publicly-facing service, then it’s copyright infringement (because it “makes available” copyrighted content).
But using copyrighted content to train LLMs is not itself illegal (it is only if the LLM is trained in such a way that it can spit out copyrighted content with the right prompt).
It’s a lot like how a torrent client is not illegal, but the way you use it might be.
PS: Yes, “debugging” an LLM to not spit out copyrighted content is hard… the law doesn’t care. Copyright is infringed the moment you make an unlicensed copy available; the law doesn’t care how.
To be clear, I think copyright holders should “patrol” public-facing LLMs for copyright infringement and pursue such cases of LLMs spitting out copyrighted content with the right prompt (and secure nice paychecks for every infraction).
kurkosdr,
I agree. That’s a simple solution that respects the traditional rights of copyright holders while not expanding the copyright system with new AI prohibitions.
Incidentally we’d probably need more training for AI to handle cases where verbatim copies are actually appropriate and desirable. For example, I’m thinking of things like quoting the constitution and other public domain sources.
While we’re down this rabbit hole, I’d argue that in principle fair use rights should apply to AI-generated works too. However, in practice these are so inconsistently applied even in human cases that it might be safest to assume fair use rights don’t exist at all, unfortunately 🙁
https://en.wikipedia.org/wiki/Lenz_v._Universal_Music_Corp.
kurkosdr,
Yes. People are making a fuss about LLMs violating copyrights simply for being trained on public works, but I feel this view significantly expands copyright in ways that are not supported by precedent. We’ve always been allowed to learn from public works, to generalize their information, and even to create our own new works, all without ever obtaining any permission from the author.
In cases where the “derived work” is so significantly transformed that it’s not even recognizable, IMHO there shouldn’t be a case for copyright infringement. I don’t say this because I’m being dismissive of creator rights or insist on “AI rights” or anything like that. But the changes required to declare AI infringing end up expanding the scope of copyrights so broadly that now work X could be found to infringe NOT because of noticeable similarities between works, but because it was generated by AI trained on public works without permission. This would be a vast expansion of copyrights.
Also how is that going to work in court when you have a contested work? It used to be that copyright cases would look for evidence of infringement in the form of similarities to earlier works, which was the whole point. Now we’re saying that similarities don’t matter if it was generated by an AI without permission. But if we do away with comparability, we’re also throwing away the evidentiary link to infringement. Where does that leave us? Literally everyone could be accused of “AI copyright infringement” and the courts would be left to settle cases based on what the parties are saying without evidence. If you are accused of using AI, you might have little proof that you didn’t use it. And conversely you could be “guilty” of using AI with there being no proof that you did. This opens the door to a copyright witch hunt where evidence is completely sidelined. IMHO this won’t end well.
Exactly. I’m all for keeping copyrights simple and consistent between humans and AI, and we can do that by using the same rules without having to come up with any new copyright restrictions for AI. If an LLM is found reproducing works this way, then it infringes, otherwise it doesn’t infringe. This not only keeps things simple but it also does away with the need to reinterpret copyrights for AI.
Anyone who thinks copyright law on training AI is settled (in any direction) is seriously deluding themselves. We have to wait for company owners to write the laws on the backs of large checks, and hand those checks to politicians to be passed in various legislatures (principally in the US, where that type of activity is not just legal, it’s encouraged). Once that happens, there will still be decades of lawsuits, where the side who spends the most money on the case wins! It might be easy to predict the outcome, but it’s not settled.
I doubt copyright law will change, considering the current copyright law is dictated to all WIPO members by WIPO. There is no worldwide accord to change copyright laws around the world (especially in the decade of tensions we live in), so the existing law will be interpreted for any new cases.
And what if the new terms were “we own your future work and we will use it to train AI”? What then? Would people stop using Adobe Creative Cloud?
Basically, I think legal would say no. Then users will be cut off from software that they need to do their jobs and escalate to management, who will argue it out with legal and win. The ToS will then be approved, along with a committee looking for a replacement, which will take six months, reach the conclusion that any change off of Adobe would be more expensive, interrupt workflows, and reduce productivity, and finally be dropped.
If half of the time that is spent these days finding AI-related news (to criticize)¹ were instead used to fix and improve some website features², to cover interesting things technically and with less personal bias (like until some years ago), and/or even to develop more merchandise³, it would be easier to convince and attract more readers who could contribute donations and subscriptions to maintain the site and remove ads in the future. I understand that it would be more complicated to wait for the site to publish more 100% original content (such as long articles)…
But unfortunately OSNews has become very boring lately. And the way this is happening is just sad.
¹ while I agree that there are very bad implications around AI in relation to copyright and privacy, it has been very annoying to see a blog I have followed since 2002 become a place where half the posts seem to just be bitter rants about AI.
² which several readers have been pointing out for some time
³ perhaps the old OSNews logo on t-shirts, mugs, stickers, mousepads, etc.