Microsoft’s GitHub Copilot is massive copyright infringement

Thom Holwerda 2023-04-21 Legal 26 Comments

Before you read this article – note that Codeium offers a competitor to GitHub Copilot. This means they have something to sell, and something to gain by making Copilot look bad. That being said – their findings are things we already kind of knew, and further illustrate that Copilot is quite possibly one of the largest, if not the largest, GPL violations in history.

To prove that GitHub Copilot trains on non permissive licenses, we just disable any post-generation filters and see what GPL code we can generate with minimal context.
We can very quickly generate the GPL license for a popular GPL-protected repo, such as ffmpeg, from a couple lines of a header comment.

Codeium claims it does not use GPL code for its training data, but the fact it uses code licensed more permissively still raises questions. While the BSD and MIT-like licenses are more permissive and lack copyleft, they still require that the license terms and a copyright notice be included whenever the covered code is used. I’m not entirely sure that using just permissively licensed code as training data is any better, since unless you’re adding the licensing terms and copyright notice to every autocompleted piece of code, you’re still violating the license.

If Microsoft or whoever else wants to train a coding “AI” or whatever, they should either use code they own the copyright to, get explicit permission from the rightsholders for “AI” training use (difficult for code from larger projects), or properly comply with the terms of the licenses by automatically adding the terms and copyright notices during autocomplete and/or properly applying copyleft to the newly generated code. Anything else is a massive copyright violation and a direct assault on open source.

Let me put it this way – the code to various versions of Windows has leaked numerous times. What if we train an “AI” on that leaked code and let everyone use it? Do you honestly think Microsoft would not sue you into the stone age?

26 Comments

2023-04-21 5:18 pm

Alfman
I actually think we’re reaching a point with AI where copyright concepts are becoming fundamentally intractable. This has always been the case to an extent. Human programmers often train themselves with copyrighted works too, and we would then use the knowledge gleaned from such sources to create “our own code”. AI is essentially doing the same thing, it’s only that the AI training is much more explicit whereas we automatically give humans a pass because nobody was recording the information transfer.

Enforcement of copyright through algorithms that are so transformative as to rewrite the code for new contexts is ultimately unenforceable. The copyright regime, willingly or not, is going to have to adapt to the new realities created by AI even if by force. Courts may try to block this, but it will be futile as the AIs will inevitably end up underground where they will continue to exist but their origins will be literally untraceable. Whether we think it’s fair or not is irrelevant. Whether courts embrace it or shun it is irrelevant. There’s no mistaking this is the future. As AI becomes more efficient both the companies and their developers will increasingly use its code generation advantages. For better or worse, those who refuse will become redundant and it’s only a matter of time.

2023-04-21 9:57 pm

oiaohm
There are big problems here.
1) The law does not class AI as an entity in its own right.
2) Clean-room requirements for reverse engineering exist in many countries because a human who is too close to a work can copy its function so closely that the result is not classed as an independent work, and is therefore copyright infringement.

Yes, in a clean room one group sees the copyrighted work you wish to copy and documents it carefully, following strict rules not to give away internal details where possible. Then another group writes the code to replace it.

“AI is essentially doing the same thing,”
That is the problem: it is the same as a single human with no clean room. Yes, many single humans making alternative implementations have been done for copyright infringement because they unintentionally/subconsciously copied sections of code from the work they were attempting to re-implement.

The problem is that the AI models I have seen for this are single-room models, and there are no rules defining how to make a clean-room separation to prevent the work the AI generates from being tainted by error…
That is the problem: how do you define the clean-room line for AI?

“Literally untraceable” is also what a lot of people think who solo re-implement while looking at code they should not. Over time it turned out to be traceable. Be it AI or a human re-implementing things, the wrong process equals getting caught sooner or later.

Alfman, the human brain is really lousy data storage for most people. AI Copilot have the same lousy. There is a clear problem with what Copilot demoed: Copilot managed to dump out the ffmpeg copyright notice word for word. I will tell you something critical: take even the most long-term ffmpeg developer, give them a blank screen and tell them to type that copyright header word for word, and they cannot. Yes, humans with 100 percent perfect recall can be done for copyright infringement and have been in the past. The classic example is Mozart.
https://www.todayifoundout.com/index.php/2017/01/time-mozart-pirated-music-1700s/
So humans are not above being done for copyright infringement because they remember something and then recall it word for word/note for note (Mozart’s case). Copilot has been demoed doing something that, if a human does it, gets them in trouble. Now something less in-your-face could happen as well.

At this point training coding AI assistants on restrictive license works is most likely highly risky and will most likely be the cause of court cases in the future. A lot of the image generation people are finding they can get basically the same level of results using items that are not under a restrictive license.

2023-04-22 1:59 am

Alfman
oiaohm,

There are big problems here.
1) The law does not class AI as an entity in its own right.

It doesn’t have to, since it’s going to be unknowable whether AI was used to create any given software. I understand that some legislators may have a problem with it, but regardless of what they declare the law to be, it’s not going to be enforceable unless they expect people to out themselves.

2) Clean-room requirements for reverse engineering exist in many countries because a human who is too close to a work can copy its function so closely that the result is not classed as an independent work, and is therefore copyright infringement.

I doubt clean room reverse engineering will make a difference for those who view AI output as derivative. Even if an AI were trained to reverse engineer software binaries just as a human would (instead of looking at source code), I believe we’d still be having a philosophical debate over the AI’s output being derivative of someone else’s work.

Be it AI or a human re-implementing things, the wrong process equals getting caught sooner or later.

I think this is naive. If you give many developers the same logical programming task, the solutions aren’t all going to be unique and many will end up using the same algorithms and logical refactors of each other even if they don’t cheat or look at each other’s work. Obviously we’d look at things like variable names and comments that are more “creative” to detect possible copying, but those are trivial for an AI to change. Unless the copying is egregious, it won’t be possible to rule out false positives and false negatives due to natural collisions (see birthday paradox).
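To put a rough number on the birthday-paradox point, here is a minimal Go sketch. It assumes, purely for illustration, that each developer independently picks one of a fixed number of equally likely canonical solutions. The function name `collisionProbability` and the figure of 1,000 distinct solutions are made up for the example. With the classic inputs (23 people, 365 days) the same formula gives just over 50%.

```go
package main

import "fmt"

// collisionProbability returns the chance that at least two of n developers
// independently arrive at the same solution, assuming each one picks
// uniformly at random from `distinct` equally likely solutions.
// Both the uniformity and the value of `distinct` are illustrative assumptions.
func collisionProbability(n, distinct int) float64 {
	if n > distinct {
		return 1.0 // pigeonhole: more developers than available solutions
	}
	noCollision := 1.0
	for i := 0; i < n; i++ {
		noCollision *= float64(distinct-i) / float64(distinct)
	}
	return 1 - noCollision
}

func main() {
	// Hypothetical: 1,000 plausible distinct solutions to one task.
	for _, n := range []int{10, 40, 100} {
		fmt.Printf("%3d developers: %.3f\n", n, collisionProbability(n, 1000))
	}
}
```

Real solution spaces are neither uniform nor easy to count, so this only shows the shape of the effect: the collision odds climb much faster than the number of developers.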

Alfman, the human brain is really lousy data storage for most people. AI Copilot have the same lousy. There is a clear problem with what Copilot demoed: Copilot managed to dump out the ffmpeg copyright notice word for word.

I was more focused on where AI coding is headed, not necessarily where it’s at today. I’m not convinced that the flaws of “ms copilot” impose a fundamental limitation on what AI will become.

2023-04-22 1:16 pm

Lennie
“If you give many developers the same logical programming task, the solutions aren’t all going to be unique and many will end up using the same algorithms and logical refactors of each other even if they don’t cheat or look at each other’s work.”

But using the same algorithms and logical refactors is allowed, copying text isn’t (which is what the AI currently ends up doing). And at the moment the AI isn’t ‘aware’, in a lot of cases, that this is the case.

Actual clean room works:

https://www.youtube.com/watch?v=rYPbdyM_-Dw

Not sure if musicians are happy with it though, because this gives studios a lot of control.

A good discussion on the legal side:

https://www.youtube.com/watch?v=fS8pAPN9Er0

2023-04-22 2:38 pm

Alfman
Lennie,

But using the same algorithms and logical refactors is allowed, copying text isn’t.

This is why I alluded to creative metadata beyond the algorithm, including comments, variable names, function names, spacing and so on. The choice of names and formatting is generally incidental to an algorithm. However once you canonicalize the algorithms, you end up creating a 1:1 representation between a specific algorithm and its source form. So the difference between copying an algorithm and copying its canonical source form becomes moot.
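As a rough illustration of what canonicalizing could mean in practice, the Go sketch below reduces a snippet to a token stream, drops comments and layout, and replaces user-chosen identifiers with positional placeholders. The normalization rules and the name `canonicalize` are assumptions made for this example, not an established standard. It is name-based rather than scope-aware, so it is only a toy.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
	"go/types"
	"strings"
)

// canonicalize reduces Go source to a layout-independent token stream in
// which comments are dropped and user-chosen identifiers become positional
// placeholders (id0, id1, ...). This is one hypothetical normalization.
func canonicalize(src string) string {
	fset := token.NewFileSet()
	file := fset.AddFile("snippet.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // mode 0: the scanner skips comments

	names := map[string]string{}
	var out []string
	prev := token.ILLEGAL
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON {
			continue // mostly auto-inserted at line ends, so layout-dependent
		}
		text := tok.String()
		switch {
		case tok == token.IDENT:
			text = lit
			// Keep predeclared names (int, len, ...), the blank identifier,
			// and selectors such as the Get in http.Get.
			keep := lit == "_" || prev == token.PERIOD || types.Universe.Lookup(lit) != nil
			if !keep {
				if _, seen := names[lit]; !seen {
					names[lit] = fmt.Sprintf("id%d", len(names))
				}
				text = names[lit]
			}
		case tok.IsLiteral():
			text = lit
		}
		out = append(out, text)
		prev = tok
	}
	return strings.Join(out, " ")
}

const original = `
// sum adds up the values.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}`

const disguised = `
func addAll(xs []int) int {

	acc := 0 // running total
	for _, x := range xs { acc += x }
	return acc
}`

func main() {
	fmt.Println(canonicalize(original))
	fmt.Println("same canonical form:", canonicalize(original) == canonicalize(disguised))
}
```

The two snippets differ in names, comments and spacing yet collapse to the same string, which is the 1:1 relationship between an algorithm and its canonical form described above.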

(which is what the AI currently ends up doing). And at the moment the AI isn’t ‘aware’, in a lot of cases, that this is the case.

Well, ChatGPT as it stands today works with text, it wasn’t even designed for software development, source code just happened to be part of its training data. Obviously though it’s only a matter of time before AI is trained on canonical source representations and intermediate byte code to focus more on the software logic and less on incidental textual metadata.

Not sure if musicians are happy with it though, because this gives studios a lot of control.

A good discussion on the legal side:

I appreciate the legal discussion, though I still maintain that the legal environment is going to have to cope with AI and its underlying sources in many works being unknowable (unless the AI’s trainer comes forward with that information). A lot of legal copyright experts are going to end up on shaky ground because they’ll lack proof. I predict this will become a rather fundamental challenge for copyright going forward.

For the sake of argument, courts/legislators might decree that AI output is “derivative” if it was trained using copyright works, but this presents a major philosophical double standard. It’s a near statistical certainty that typical human developers are trained on copyrighted works as well, be it university textbooks, research papers, GPL/MIT code, etc. If a court rules that AI work infringes when there’s a record of how the AI sausage was made, then this is very inconsistent with ruling that human developers who are trained using the exact same sources are allowed to get away with it. I sense that there’s a growing bias against AI simply for not being human.
2023-04-24 3:02 am

Lennie
You know, looking at the landscape/big picture…

I recently took a large Github Gist and wanted to use it as part of the code at my company, mostly like a library, I had already changed it to fit my need. Then I looked again: Gists don’t have a copyright notice. This won’t do. A few weeks later I probably found something better for working with this programming language I’m less experienced with. Libraries/frameworks, etc.

All I know is that I can’t use AI as a productivity tool for code, because of copyright issues which I think will and should arise. But maybe a competitor will.

There is someone who made a whole AI program/framework with no coding experience, just by asking ChatGPT to generate things and telling it what does not work.

There are people who made similar things which make ChatGPT generate the code, the test cases and do the debugging to eventually get to something that fits the goal.

If someone makes a Copilot program which separates the code by FOSS license and adds full attribution I guess it could be used. I guess GPL code will not be one of them?

So honestly, I’ll adopt it when it’s available. I don’t know where this will lead us…
2023-04-24 3:10 am

sukru
Lennie,

What these tools do is identify patterns and make them usable for your needs.

The issues arise when patterns are unique to a small population. Like a single project in GitHub (hence completion of license notices for a specific project).

It should not be too hard to say, the following pattern is not copyrightable.

resp, err := http.Get("https://osnews.com")
if err != nil {
	return err
}
defer resp.Body.Close()

Just writing http.Get( could initiate this completion (including the backwards direction), and there would be no issue finding hundreds, if not thousands, of previous works to support that this is “in the open”.

However, these models, probably while trying to reach the market, do not place enough measures to prevent leaking private details.

Say, you used a simple “firewall”: each code pattern should have at least 50 uses.

Then, a simple repo with 50 files will easily leak through.

Update it to say: at least 50 different repos

Then a popular code with 50 forks will still leak through.

Update it again, and it leaks in a different way.
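The escalating rules above can be sketched in Go. Everything here is hypothetical: the `Occurrence` type, its fields, and the helper names are invented for illustration, and the third rule (collapsing forks onto their upstream) is just one possible next patch, assuming fork relationships are known and only one level deep.

```go
package main

import "fmt"

const minSupport = 50 // threshold taken from the example above

// Occurrence records one place a code pattern was seen. All field names are
// hypothetical; ForkOf is empty for a repository that is not a fork.
type Occurrence struct {
	Repo   string
	ForkOf string
	File   string
}

// countUses is the first rule: every occurrence counts.
func countUses(occ []Occurrence) int { return len(occ) }

// countRepos is the second rule: count distinct repositories.
func countRepos(occ []Occurrence) int {
	seen := map[string]bool{}
	for _, o := range occ {
		seen[o.Repo] = true
	}
	return len(seen)
}

// countOrigins is a possible third rule: forks collapse onto their upstream.
func countOrigins(occ []Occurrence) int {
	seen := map[string]bool{}
	for _, o := range occ {
		origin := o.Repo
		if o.ForkOf != "" {
			origin = o.ForkOf
		}
		seen[origin] = true
	}
	return len(seen)
}

// manyFiles simulates one repository containing the pattern in n files.
func manyFiles(repo string, n int) []Occurrence {
	var occ []Occurrence
	for i := 0; i < n; i++ {
		occ = append(occ, Occurrence{Repo: repo, File: fmt.Sprintf("file%d.go", i)})
	}
	return occ
}

// manyForks simulates n forks of one upstream, each holding the pattern once.
func manyForks(upstream string, n int) []Occurrence {
	var occ []Occurrence
	for i := 0; i < n; i++ {
		occ = append(occ, Occurrence{
			Repo:   fmt.Sprintf("user%d/fork", i),
			ForkOf: upstream,
			File:   "main.go",
		})
	}
	return occ
}

func main() {
	oneRepo := manyFiles("solo/project", 50)
	forked := manyForks("popular/project", 50)

	fmt.Println("rule 1 allows single repo:", countUses(oneRepo) >= minSupport)  // leaks
	fmt.Println("rule 2 allows single repo:", countRepos(oneRepo) >= minSupport) // blocked
	fmt.Println("rule 2 allows 50 forks:", countRepos(forked) >= minSupport)     // leaks
	fmt.Println("rule 3 allows 50 forks:", countOrigins(forked) >= minSupport)   // blocked
}
```

Even the third rule would miss copies that carry no fork metadata, such as vendored or copy-pasted code, which is the "leaks in a different way" problem.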

That is why privacy and ethics are important in machine learning design.

(Again, this general idea is immensely useful, however implementations need more work).
2023-04-24 3:35 am

sukru
Alfman,

I think there is even better ground for comments.

There used to be applications for comment generation in IDEs. For example https://submain.com/ghostdoc/ for C# projects in Visual Studio. It works by following basic patterns. If you have something like:

void printStats(Stats stats);

It could then generate (making this up):

// Prints given statistics.
//
// stats: Statistics object of Stats type.
void printStats(Stats stats);

And this was before AI.
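A minimal Go sketch of this kind of rule-based generation follows. It is not GhostDoc's algorithm; the `Param` type, `splitCamel`, `docComment`, and the wording of the generated lines are all assumptions for illustration. It simply splits a camelCase name at uppercase letters and treats the first word as a verb.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Param describes one function parameter (hypothetical helper type).
type Param struct {
	Name string
	Type string
}

// splitCamel breaks a camelCase identifier at each uppercase letter,
// e.g. "printStats" becomes ["print", "Stats"]. Acronyms are not handled.
func splitCamel(name string) []string {
	var words []string
	start := 0
	for i, r := range name {
		if i > 0 && unicode.IsUpper(r) {
			words = append(words, name[start:i])
			start = i
		}
	}
	return append(words, name[start:])
}

// docComment builds a comment from the name alone: the first word is
// treated as a verb and the remaining words as its object.
func docComment(funcName string, params []Param) string {
	if funcName == "" {
		return ""
	}
	words := splitCamel(funcName)
	verb := strings.ToUpper(words[0][:1]) + words[0][1:] + "s"
	object := strings.ToLower(strings.Join(words[1:], " "))

	var b strings.Builder
	if object == "" {
		fmt.Fprintf(&b, "// %s.\n", verb)
	} else {
		fmt.Fprintf(&b, "// %s given %s.\n", verb, object)
	}
	if len(params) > 0 {
		b.WriteString("//\n")
	}
	for _, p := range params {
		fmt.Fprintf(&b, "// %s: value of type %s.\n", p.Name, p.Type)
	}
	return b.String()
}

func main() {
	fmt.Print(docComment("printStats", []Param{{Name: "stats", Type: "Stats"}}))
}
```

Nothing here consults anyone else's source code; the comment is derived entirely from the signature being documented.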

Add in AI, and things get even better.

If you also have the ability to recognize common patterns like error handling, retry mechanisms, design patterns, or even specific algorithms, like dynamic programming, then the AI assistant can easily generate useful documentation, without breaking any copyrights.
2023-04-24 5:32 am

Alfman
sukru,

And this was before AI.

Add in AI, and things get even better.

If you also have the ability to recognize common patterns like error handling, retry mechanisms, design patterns, or even specific algorithms, like dynamic programming, then the AI assistant can easily generate useful documentation, without breaking any copyrights.

I do agree with this and indeed most of your points about AI. However I think people are still going to find it divisive when those training the AI incorporate 3rd party sources into the training process. The thing is that humans do this as well, we learn from external sources constantly without regards to whether licenses allow us to do so and in my view this input is absolutely essential to learning what’s expected of a developer/artist/etc regardless of whether the intelligence is artificial or biological. I also think it’s transformational and for these reasons this training should be permitted. It is completely unfair and hypocritical to judge AI by different standards than humans, but I think we’re already seeing people adopting a view that AI should not be allowed to learn from anything without explicit permission to do so.
2023-04-24 6:19 am

oiaohm
Alfman, the problem with Copilot is that it is being caught making one-to-one copies.

“The thing is that humans do this as well, we learn from external sources constantly without regards to whether licenses allow us to do so and in my view this input is absolutely essential to learning what’s expected of a developer/artist/etc regardless of whether the intelligence is artificial or biological. ”

Most of the time this ends up being legally fine because most human brains forget the finer details and don’t make one-to-one copies.
https://www.dundaslawyers.com.au/innocent-infringement-of-copyright/
Alfman, there is a special section of law for where a human copies without knowing they have copied. It’s still illegal but the punishment is restricted.

So if you have studied some source code that is under a restricted license like the GPL, and you unintentionally put a one-to-one copy of the code into another project that is not GPL because it was in your subconscious, this could get you charged with what is called innocent infringement of copyright. Please note this means you are required to take action to correct your human mistake.

The problem with Copilot is that it is absolutely making 1 to 1 copies and it was trained on GPL and other restricted license works.

Training on restricted license works, as a human does, carries legal risks. Yes, some programming courses recommend keeping a journal/log of what works you have trained on and worked on. This can be important if you end up in a copyright case, to prove it was an innocent mistake so that you only have to take action to correct it instead of facing huge fines and major legal issues.

A lot of company contracts that say you cannot work on something in the same field for X number of years are partly protection for the person and the company from a copyright infringement fight. 5-7 years is normally classed as long enough for a normal person’s memory to degrade so that they cannot make a 1 to 1 copy and have to go back to basic principles that are not copyright protected.
2023-04-24 8:41 am

Alfman
oiaohm,

Alfman, the problem with Copilot is that it is being caught making one-to-one copies.
…
The problem with Copilot is that it is absolutely making 1 to 1 copies and it was trained on GPL and other restricted license works.

You’re focusing on copilot, which is fine, but my response is the same as before: I’m talking about where AI is headed. Early teething problems with copilot will be solved…what then? That’s the real dilemma.

Most of the time this ends up being legally fine because most human brains forget the finer details and don’t make one-to-one copies.

Alfman, there is a special section of law for where a human copies without knowing they have copied. It’s still illegal but the punishment is restricted.

So then you believe AI should be allowed to train on copyrighted works if it doesn’t produce an exact copy? I believe content should be given the same treatment without regards to if it were generated by AI or human, but this is obviously something some others will disagree with.

So if you have studied some source code that is under a restricted license like the GPL, and you unintentionally put a one-to-one copy of the code into another project that is not GPL because it was in your subconscious, this could get you charged with what is called innocent infringement of copyright. Please note this means you are required to take action to correct your human mistake.

This is completely tangential to the topic of AI; the problem of incidental infringement is a real problem when multiple developers coincidentally use the same algorithm to solve the same problem. The odds of an incidental collision quickly increase as the number of “unique” developer samples increases. Should a work be considered derivative if it has the same canonical source form but with superficial cosmetic changes, like spacing/comments/names/etc?
If you answer yes, then it’s trivial to introduce superficial cosmetic changes to defeat copyright. But if you answer no, then it becomes far more likely that two developers independently using the same algorithm will unintentionally (or worse, intentionally) step on each other’s copyrights.

Training on restricted license works, as a human does, carries legal risks. Yes, some programming courses recommend keeping a journal/log of what works you have trained on and worked on. This can be important if you end up in a copyright case, to prove it was an innocent mistake so that you only have to take action to correct it instead of facing huge fines and major legal issues.

Most software developers have never had an issue on account of most software not being open source and open to public scrutiny. Again, it’s nothing specific to AI, but open source projects that violate licenses may be more legally vulnerable on account of their code being published. Whereas proprietary developers that violate licenses have a lot more cover on account of the slim chance their proprietary source code will ever be audited. I’m sure some proprietary companies do audit their own code, but I think it’s the exception and not the norm.

A lot of company contracts that say you cannot work on something in the same field for X number of years are partly protection for the person and the company from a copyright infringement fight. 5-7 years is normally classed as long enough for a normal person’s memory to degrade so that they cannot make a 1 to 1 copy and have to go back to basic principles that are not copyright protected.

Actually it’s far more common to see restrictions through NDAs by the company you are leaving rather than the one you are coming into. In all my career, I’ve only seen outgoing contract restrictions and not incoming ones.
2023-04-24 8:49 am

Alfman
Should a work be considered derivative if it has the same canonical source form but with superficial cosmetic changes, like spacing/comments/names/etc? If you answer yes, then…

It’s too late to edit, but I got the yes & no reversed in my post above.
2023-04-24 10:31 am

oiaohm
Alfman
“So then you believe AI should be allowed to train on copyrighted works if it doesn’t produce an exact copy? I believe content should be given the same treatment without regards to if it were generated by AI or human, but this is obviously something some others will disagree with.”
Under the current legal framework, being sure the AI has not produced something that breaches the current laws for what a human is allowed to do will be tricky as hell. You have to remember that every year humans get into trouble, resulting in lots of out-of-court agreements, because they looked at a copyrighted work, copied by human error and got caught.

“the problem of incidental infringement is a real problem when multiple developers coincidentally use the same algorithm to solve the same problem. The odds of an incidental collision quickly increase as the number of “unique” developer samples increases. Should a work be considered derivative if it has the same canonical source form but with superficial cosmetic changes, like spacing/comments/names/etc?
If you answer yes, then it’s trivial to introduce superficial cosmetic changes to defeat copyright. But if you answer no, then it becomes far more likely that two developers independently using the same algorithm will unintentionally (or worse, intentionally) step on each other’s copyrights.”

Answering this turns into total legal hell, with more precedents on determination than you can shake a stick at. The base documents for this process run to only 12 thousand A4 pages. The independent creation defense is the core, where you have to prove no direct exposure to the original work, and that is only the tip of a huge iceberg legally. Remember that the majority of those 12000 pages are rules humans have had to play by for over a hundred years.

A few things are a problem here.
1) Data set size: take the AI data sets behind Stable Diffusion, Copilot…. If you had to sit down and actually read such a data set, you would not have enough lifetime to do it; some would need thousands of lifetimes and some would need billions of lifetimes to read/view the data set.
2) Everything in the AI data set is something it has been exposed to, if applying the same legal rules as for humans.
3) For everything with copyright protection that an AI or a person has been exposed to, the person/AI can face a claim of copyright infringement if they produce something close enough to the item they were exposed to.
4) The legal definition of close enough to be classed as infringement has changed from case to case.
5) Provability of exposure. With AI it is simpler to prove exposure than with a human, because in many cases all you need to do is ask whether X images/files… were in the AI training data set. With a human it is not simple to check the data they have had contact with over their lifetime.

Yes these 5 points are only the starting point into a very deep rabbit hole.

From my point of view the simplest way is to just have AI data sets filled with public domain and other lightly licensed works. This has the advantage of less risk of ending up in court.

Alfman, the problem is that AI datasets are so huge that if you were checking everything the AI generated against what is in the dataset that trained it, you would be waiting days to years before the AI could give you a result.

Next thing to consider: yes, we have AI trained to generate code. There can also be AI trained to look for copyright infringement. AI is very much a double-edged sword.

I don’t think it’s exactly a good idea to train AI on “copyrighted works” with restrictive licenses to generate items, due to how simple it will be to miss a 1 to 1 copy event and end up screwed in court. Even non-restrictive copyrighted works today could be a problem; remember these works were made by humans, so they could be tainted by what those humans believed they could get away with. Yes, what humans have believed they could get away with might not be so safe once we have better AI systems for detecting infringement.

AI with datasets based on public domain works, as in the ones that have fallen out of copyright, will be safe from copyright law. Now we have the Disney problem. Take Steamboat Willie: once it has fallen out of copyright, let’s say an AI that had it in its data set generated a 1 to 1 copy. It’s out of copyright, so it’s fine, right? No, it’s not: Disney trademarked it, so used the wrong way you are now in court for trademark infringement.

It’s bad enough having to deal with trademark law for AI generated stuff without having to deal with copyright law on top.

Alfman, we have intellectual property laws that are a headache of our own invention. Humans end up done in by these laws, and a human over a lifetime is only exposed to a small set of data compared to a large AI solution.

Items like Copilot and others like it, trained off of protected works with our current means of auditing human and AI output, seem like people taking a stupid risk by doing something because they can, without having asked if they should.

Maybe at some point in the future we will have AI good enough at detecting infringement that errors by AI generating stuff based off protected works will be picked up, but those don’t exist yet.

Yes, it’s really simple to forget that technology is double-sided at minimum. AI to generate work is one side; AI to detect infringement is another.

Think of YouTube: you upload a video with a little bit of a protected song that was playing in the background and the automatic anti-copyright system goes off. Early versions of copyright infringement detection AIs are already out there and will also keep on getting better.
2023-04-24 11:18 am

Alfman
oiaohm,

Under the current legal framework, being sure the AI has not produced something that breaches the current laws for what a human is allowed to do will be tricky as hell. You have to remember that every year humans get into trouble, resulting in lots of out-of-court agreements, because they looked at a copyrighted work, copied by human error and got caught.

It’s tricky due to a multitude of reasons. It’s not even consistent. Whether or not someone is infringing another’s copyright can be subjective between various courts and some have directly contradicted others, as happened in Google v. Oracle.

Answering this turns into total legal hell, with more precedents on determination than you can shake a stick at. The base documents for this process run to only 12 thousand A4 pages. The independent creation defense is the core, where you have to prove no direct exposure to the original work, and that is only the tip of a huge iceberg legally. Remember that the majority of those 12000 pages are rules humans have had to play by for over a hundred years.

What those rules say becomes irrelevant if companies/devs start taking these AI models trained with copyrighted works “underground” (ie undisclosed). Lawyers and courts are going to be at a complete loss regardless of their intentions. They won’t even know if copyright infringement occurred, much less who the victims could be. They’ll lack proof unless the AI trainers out themselves.

5) Provability of exposure. With AI it is simpler to prove exposure than with a human, because in many cases all you need to do is ask whether X images/files… were in the AI training data set. With a human it is not simple to check the data they have had contact with over their lifetime.

Yes, this was my point earlier about judicial bias against AI simply because they know how the sausage is made while ignoring humans doing the same thing.

From my point of view the simplest way is to just have AI data sets filled with public domain and other lightly licensed works. This has the advantage of less risk of ending up in court.

Most of the content in the world is not public domain, realistically it would cripple AI (for better or worse) if it were only allowed to be trained from sources that explicitly allow AI training. This debate notwithstanding though, legislators will not be able to contain it even if they wanted to. Development on these will proceed underground and across international boundaries beyond their reach.

Alfman, we have intellectual property laws that are a headache of our own invention. Humans end up done in by these laws, and a human over a lifetime is only exposed to a small set of data compared to a large AI solution.

I know. It’s why I said the copyright regime would be forced to adapt to the new realities created by AI whether they want to or not.

Think of YouTube: you upload a video with a little bit of a protected song that was playing in the background and the automatic anti-copyright system goes off. Early versions of copyright infringement detection AIs are already out there and will also keep on getting better.

A bit of a tangent, but yes the youtube audio copyright scan tool does a good job of finding matches. Yet has been notoriously criticized for faulty judgement over fair use rights, punishing videos that were properly licensed, retroactively stealing 100% of the creator’s revenue even if only a tiny portion of a copyrighted work is detected, etc. Youtube has gotten so bad with copyright strikes that many content creators have been forced to abandon their fair use rights and avoid parody and citing snippets that are explicitly allowed by copyright law. Youtube is in the wrong, but as with many things google does they rely 100% on automation and they’re too cheap to hire people to handle the automation’s errors and mistakes. While their automation has no legal authority, it effectively became the rules that the community have to live by. Oh well. It seems that false positives and false negatives are both going to continue to be a problem.
2023-04-24 12:45 pm

oiaohm
“What those rules say becomes irrelevant if companies/devs start taking these AI models trained with copyrighted works “underground” (ie undisclosed). Lawyers and courts are going to be at a complete loss regardless of their intentions. They won’t even know if copyright infringement occurred, much less who the victims could be. They’ll lack proof unless the AI trainers out themselves.”
Or the AI happens to have a huge glaring flaw in what it generates as Copilot does. The horrible part here is that a lot of copyright law is run on the “guilty until the claim of innocence” basis of the DMCA.

Sorry for changing order
“A bit of a tangent, but yes, the YouTube audio copyright scan tool does a good job of finding matches. Yet it has been notoriously criticized for faulty judgement over fair use rights, punishing videos that were properly licensed, retroactively stealing 100% of the creator’s revenue even if only a tiny portion of a copyrighted work is detected, etc. YouTube has gotten so bad with copyright strikes that many content creators have been forced to abandon their fair use rights and avoid parody and citing snippets that are explicitly allowed by copyright law. YouTube is in the wrong, ”
Stop you there. The problem here is that YouTube is following exactly how the DMCA says it should function.

“we have intellectual property laws that is a headache of our own invention.”
I wrote this because it is kind of critical. Different copyright systems inside the USA mean that copyright enforcement is not the same thing even inside the USA.

“Most of the content in the world is not public domain, realistically it would cripple AI”
There is a question here.
A lot of copyrighted works are derivative works of public domain works by humans.

An AI based only on what is public domain, then generating from there, could answer the question of how creative we humans really are. The answer may not be what we like.

AI does threaten to open up many Pandora’s boxes. AI could start answering how much creative invention is free will and how much was just always going to be the outcome.

There have already been cases of AI trained on only public domain data (over 70 years old) generating output 99% the same as AI trained on current-day copyright-protected works, because the last 70 years were basically predictable outcomes in those areas.

Yes, cases of a smaller starting dataset of public domain works being 99% equal to datasets 1000x bigger have happened.

There are reasons to take the public domain route: the results there could make feeding in more modern works pointless in particular areas.

Think about it: in the last 70 years, how many new music genres and painting genres have we had? Then, how much in the genres we had over 70 years ago is really that different now? In a lot of ways, in a lot of places, humans are stuck in the past.
2023-04-24 5:57 pm

Alfman verbose=1
oiaohm,

“Or the AI happens to have a huge glaring flaw in what it generates as Copilot does.”

Once you have an AI trained on canonical code that only reflects logical algorithms without the textual metadata, then it’s going to be impossible for AI to leak the metadata part. The metadata (ie superfluous source code elements that don’t affect an algorithm) could be recreated independently and fluidly to provide the benefits that sukru mentioned earlier. This has cool applications and could even be used to create comments for your own code.
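To make the "canonical code without textual metadata" idea a bit more concrete, here is a minimal sketch, assuming Python source and the standard `ast` module. The helper name `canonicalize` and the sample snippets are hypothetical, and nothing here reflects how Copilot or any other product actually preprocesses its training data. Re-emitting code from its syntax tree drops comments (including license headers) and the original formatting, so two sources that differ only in those respects collapse to the same text:

```python
import ast


def canonicalize(source: str) -> str:
    """Parse the source and re-emit it from the syntax tree alone.

    Comments and original whitespace are not stored in the AST,
    so they do not survive the round trip.
    """
    tree = ast.parse(source)
    return ast.unparse(tree)  # requires Python 3.9 or newer


# Hypothetical sample with a license-style header and an inline comment.
with_metadata = """
# Copyright (C) Some Author, released under an example license
def add(x, y):
    return x + y   # sum the inputs
"""

# The same algorithm, written with different spacing and no comments.
bare = """
def add(x,y):
        return x+y
"""

print(canonicalize(with_metadata) == canonicalize(bare))
print(canonicalize(with_metadata))
```

Note that this only removes comments and layout. Identifier names, docstrings and string literals still pass through, so a real canonical form would need more normalization than this sketch does.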

“Stop you there. The problem here is that YouTube is following exactly how the DMCA says it should function.”

This is completely wrong, and the algorithm’s disregard for fair use rights has been a notorious problem for YouTube content creators. They issue copyright strikes for merely finding matching audio, without regard to one’s fair use rights, proper licensing, legal parody, etc. And then YouTube’s punishment doesn’t remotely fit the crime. A creator may spend several hours or days creating their own content and Google will steal 100% of their revenue for using a small snippet in a fair use context. This has “solved” the copyright problem, but only in that content creators have stopped exercising their fair use rights and many won’t insert a clip for editorial/review purposes. For example, I was watching a review of the Super Mario movie where the reviewer said he really liked the inclusion of a specific Nintendo song from an old game; however, he didn’t include a fair use clip of it for fear of automated copyright strikes. So I had no idea what music he was even referring to. Google’s automated system is lazy and issues strikes after merely identifying a source, but they’re clearly enforcing their own rules without regard to what people are allowed to do by law.

“The horrible part here is that a lot of copyright law is run on the ‘guilty until the claim of innocence’ basis of the DMCA.”

Again, that’s a google thing, it’s not the law.

“A lot of copyrighted works are derivative works of public domain works by humans.”

We should point out that relatively little code is being explicitly put out into the public domain. The vast majority is licensed, even FOSS.

“An AI based only on what is public domain, then generating from there, could answer the question of how creative we humans really are. The answer may not be what we like.”

Obviously since there’s no controversy over using public domain work, I have no issue with someone training AI this way, but unlike with art and music, there’s going to be much less content to work with when it comes to software because copyrights still apply on nearly all of it. Thanks to our “mickey mouse” copyright laws, it will be many decades before today’s modern source code falls into the public domain naturally when most of us will be dead.

Even the knowledge repositories that are extremely popular for human learning, including wikipedia, use non-public domain licenses.
https://creativecommons.org/licenses/by-sa/3.0/
And it’s incompatible with other licenses including the GPL. Practically no humans using these resources to increase their learning and solve problems give a damn about this, but if you want the AI to be technically compliant, then knowledge becomes legally isolated and compartmentalized to comply with license requirements. And furthermore, even with an AI that properly respects all the licenses of its sources, it still doesn’t really prove the sources used weren’t themselves infringing. While it’s interesting to ponder all these details, ultimately I don’t think any of it is relevant; people are going to develop and use AI regardless of legal technicalities.

“AI does threaten to open up many Pandora’s boxes.”

Yeah no kidding, haha.

“There have already been cases of AI trained on only public domain data (over 70 years old) generating output 99% the same as AI trained on current-day copyright-protected works, because the last 70 years were basically predictable outcomes in those areas.

“Yes, cases of a smaller starting dataset of public domain works being 99% equal to datasets 1000x bigger have happened.

“There are reasons to take the public domain route: the results there could make feeding in more modern works pointless in particular areas.”

Claims as bold as these need a citation. I believe a breadth of inputs is critical to producing relevant, knowledgeable, and well-rounded AI. Depriving the AI of input that humans have access to will limit an AI’s overall problem-solving abilities. For example, if we were to artificially restrict AI training to Latin input sources, and then use it to run tasks and solve problems in English, I expect the AI would be significantly handicapped next to an identically trained AI whose training was allowed to include English.
2023-04-24 10:38 pm

oiaohm
Alfman
“Again, that’s a google thing, it’s not the law.”
https://patentassociate.com/2019/08/23/copyright-dmca-vs-fair-use/
No, it’s not totally a Google/YouTube thing; it’s the core law of the DMCA.

Yes, the DMCA has no punishments for making 100% false/baseless claims of guilt, but has punishments for making false claims of innocence. Like it or not, everything YouTube does to creators when it comes to copyright strikes is 100 percent in line with the DMCA. Under the DMCA, fair use does not exist. Yes, YouTube has basically implemented the algorithm in the DMCA law.

The DMCA protects the host/harbor while putting the party creating the content (shipping goods through the port) on the sharp end of pointed sticks.

https://www.creativebloq.com/news/stable-diffusion-ai-art
It’s a double-sided thing. As you make an AI that cannot be done for copyright infringement while using protected works, you end up undermining the quality of the input sources.

AI trained on public domain works does not have this downgrading requirement to remain legal.

Alfman, it’s the old saying of computing: “garbage in, garbage out.” The risk of trying to make an AI that is copyright safe while using copyrighted works is that all you might do is create garbage in, garbage out.
2023-04-25 2:58 am

Alfman verbose=1
oiaohm,

“No, it’s not totally a Google/YouTube thing; it’s the core law of the DMCA.

“Yes, the DMCA has no punishments for making 100% false/baseless claims of guilt, but has punishments for making false claims of innocence. Like it or not, everything YouTube does to creators when it comes to copyright strikes is 100 percent in line with the DMCA. Under the DMCA, fair use does not exist. Yes, YouTube has basically implemented the algorithm in the DMCA law.”

Your link contains a lot of FUD, probably because it’s posted by a lawyer who benefits from scare tactics. And while this doesn’t make him wrong, some of his own legal sources contradict his own claims on the topic.

https://www.law.cornell.edu/uscode/text/17/107

Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

Google copyright strikes go beyond the law, which explicitly respects fair use rights as I’ve been saying. However I understand it’s much easier for google to ignore them and I agree with you that there are no real consequences for companies that simply ignore the existence of fair use rights.

“AI trained on public domain works does not have this downgrading requirement to remain legal.”

Ok, but once again there’s nothing to debate with regards to public domain so I don’t know why you keep bringing it up. The entirety of the controversy circles around AI using copyrighted sources.

“Alfman, it’s the old saying of computing: ‘garbage in, garbage out.’ The risk of trying to make an AI that is copyright safe while using copyrighted works is that all you might do is create garbage in, garbage out.”

We can all recognize the legal quagmire it presents. I’m sure we’ll be debating the AI copyright issues for a long time. But ultimately I don’t think it’s that relevant, because AI is coming regardless of what the courts and lawyers want to say about it. They’re going to be forced to adapt to the new reality that AI creates. If the courts rule that training AI without a license violates copyright, which they may do, we won’t stop using it; it will just move offshore and underground. It could become a modern equivalent to 1920s Prohibition, with the law getting trampled.
2023-04-25 10:54 am

oiaohm
Alfman, it’s not FUD. You countered with the copyright law that you have to use to claim damages, not the bits that make up the DMCA Safe Harbors that Google/YouTube/internet providers… have to follow.

https://www.copyright.gov/512/
DMCA Safe Harbors. Read over the process flows.
1) Notice — Rightsholder sends notice to online service provider regarding infringing material that appears on the online service provider’s system.
2) Remove Access to Material — Online service provider must act expeditiously to remove or disable access to the infringing material.
3) Notify User — Online service provider must then promptly notify the user that originally uploaded the material that it has been removed.
4) Counter-notice — User may submit a counter-notice requesting the reinstatement of the material, if the user believes the removal was due to a mistake or misidentification.
5) Restore Access or Initiate Court Action — Online service provider must restore access to the material after no less than 10 and no more than 14 business days, unless the original notice sender informs the service provider that it has filed a court action against the user.

Google’s copyright strike process is exactly the DMCA Safe Harbor. The DMCA Safe Harbor has no requirement to check for fair use before taking down the content. The DMCA Safe Harbor gives no allowance for a service provider to ignore claims, or to require more proof before acting, from parties who have repeatedly made false claims.

Also note step 5: when someone has made a false claim via a DMCA notice/copyright strike, Google/YouTube is required to keep that content removed for at least 10 business days and put it back before 14 business days are up.

Main copyright law and DMCA are not in alignment.

Main copyright law is what you have to use for damages. The DMCA is what you get to use if you just want to disrupt the heck out of someone and ruin their business with false claims.

Alfman, the majority of the YouTube/Google copyright strike problem is that they have implemented the DMCA Safe Harbor as it is written in US law. The DMCA Safe Harbor needs some alterations:

1) An allowance for fair use assessments before takedowns.
2) An allowance for the provider to say "no more strikes for you, because you have made too many false claims; now you have to spend the money, go to court and get injunctions", or some other anti-troll system. Of course, the provider in these cases would still be required to have records of the false claims or be highly punished themselves.

2023-04-24 10:51 am

cb88
Inference and training are separate so yes … it is two separate clean rooms.

There are models that can do both at the same time, but most don’t. It’s two clean rooms by definition.

The AI program running the inference… never saw the original code; it only saw the resultant AI model. That’s probably provable also.

2023-04-24 11:38 am

sukru
There are “student-teacher” models.

One could be trained on the wide web.
The second is trained by the first model, without access to the training data.

https://ai.plainenglish.io/knowledge-distillation-aka-teacher-student-model-4f16f701ac79

There are many ways to achieve this, but even now people are, at the simplest level, using ChatGPT API calls to train their domain-specific models.
https://beebom.com/how-train-ai-chatbot-custom-knowledge-base-chatgpt-api/

So, yes, this is entirely possible.
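As a rough illustration of the student-teacher idea, here is a minimal sketch in plain Python of the kind of loss a student model could minimize when it only sees the teacher's outputs and never the teacher's training data. The function names, the temperature value and the example logits are all assumptions made for illustration; real systems use a deep learning framework and often combine this with other loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened outputs to the student's.

    The student is only compared against what the teacher predicts.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher scores for three possible next tokens.
teacher = [4.0, 1.0, 0.5]

print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # student agrees with the teacher
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # student disagrees with the teacher
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge, which is also why a student with enough capacity could end up copying the teacher closely.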

2023-04-24 12:21 pm

oiaohm
The student-teacher model has issues. We can see this from human clean room work: software clean rooms have had to develop lots of rules so that the data about how a function works does not in fact contain the function code. Setting up a clean room to replace something, only to create a 1-to-1 duplicate, is a total failure of course.

I am not saying the student-teacher model is a bad idea, but it is just a redo of the clean and dirty room model of replacing software. Early on in clean room software development there were some major goof-ups.
2023-04-24 12:26 pm

sukru
oiaohm,

Yes, of course you can duplicate the teacher model 1:1 in the student.

But generally, you would have far fewer “free parameters” in the student one. For example, in the context of training your own GPT from ChatGPT, it would be many orders of magnitude smaller.

This will increase generalization, and drop (most?, all?, many?) of copyright issues.
2023-04-24 10:45 pm

oiaohm
“This will increase generalization, and drop (most?, all?, many?) of copyright issues.”

I said look at historic clean room reverse engineering issues. So it may not solve the copyright issues without due care. AI systems normally have more perfect memory than a human.

Increased generalization has another issue: the risk of garbage in, garbage out increases. The problem is that the process of solving the copyright issue while using copyrighted inputs could end up generalizing the AI model to the point that the model is useless.

Also remember that under the DMCA, parties can make false claims against you and you have to be able to prove you are clean. So the legal side is not helpful either.
2023-04-25 1:00 am

sukru
“Increased generalization has another issue: the risk of garbage in, garbage out increases. The problem is that the process of solving the copyright issue while using copyrighted inputs could end up generalizing the AI model to the point that the model is useless.”

Pruning (reducing free parameters) is actually a major way to reduce the impact of noisy training data and overfitting in machine learning: https://arxiv.org/pdf/1611.06211.pdf

There is active research in model size optimization, which is especially useful for deploying to lower powered devices, like mobile phones, and reducing operational costs for scaling up.

This is an established thing in the ML domain.
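For readers unfamiliar with pruning, here is a minimal sketch of magnitude pruning, one common way to reduce free parameters. It is an illustrative assumption and not necessarily the method used in the paper linked above; the function name `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights of a tiny layer.
weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05]
print(magnitude_prune(weights, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The small weights, which contribute least to the output and are the most likely to be fitting noise, are removed, while the large ones that carry most of the learned behaviour stay in place.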

2023-04-23 5:08 pm

dsmogor
I’d say the irony is that the FUD weapon can this time be turned against MS by hitting its clients in much the same way MS used it (through its SCO proxies) against Linux companies back in the day: how can you be sure your system does not contain copyrighted code subject to future litigation?