Before you read this article – note that Codeium offers a competitor to GitHub Copilot. This means they have something to sell, and something to gain by making Copilot look bad. That being said – their findings are things we already kind of knew, and further illustrate that Copilot is quite possibly one of the largest, if not the largest, GPL violations in history.
To prove that GitHub Copilot trains on non-permissively licensed code, we just disable any post-generation filters and see what GPL code we can generate with minimal context.
We can very quickly generate the GPL license for a popular GPL-protected repo, such as ffmpeg, from a couple of lines of a header comment.
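The probe is easy to sketch: feed the model the opening lines of a well-known GPL header comment and check whether the completion regurgitates the license boilerplate verbatim. A minimal illustration – `complete()` would be the model's autocomplete endpoint, and the marker phrases are just distinctive strings from the standard GNU boilerplate:

```python
# Minimal sketch of the probe described above. A real `complete()`
# function (the model's autocomplete endpoint) is assumed, not shown.

# The first lines of a typical FFmpeg source-file header comment,
# used as the prompt with minimal extra context.
PROMPT = """/*
 * This file is part of FFmpeg.
 *
 * FFmpeg is free software; you can redistribute it and/or
"""

# Distinctive phrases from the GNU license boilerplate that follows
# this prompt in the real headers.
LICENSE_MARKERS = [
    "GNU Lesser General Public License",
    "WITHOUT ANY WARRANTY",
    "Free Software Foundation",
]

def reproduces_license(completion: str) -> bool:
    """True if the completion contains verbatim license boilerplate."""
    return any(marker in completion for marker in LICENSE_MARKERS)

# A completion that continues the standard header text gets flagged:
sample = (
    " * modify it under the terms of the GNU Lesser General Public\n"
    " * License as published by the Free Software Foundation;\n"
)
print(reproduces_license(sample))  # True
```

Substring matching on a few marker phrases is crude, but for license boilerplate it's enough – the whole point of the experiment is that the model reproduces the text word for word.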
Codeium claims it does not use GPL code for its training data, but the fact that it uses code licensed more permissively still raises questions. While the BSD and MIT-like licenses are more permissive and lack copyleft, they still require that the license terms and a copyright notice be included whenever the covered code is used. I’m not entirely sure that using only permissively licensed code as training data is any better – unless you’re adding the license terms and copyright notice with every autocompleted piece of code, you’re still violating the license.
If Microsoft or whoever else wants to train a coding “AI” or whatever, they should either use code they own the copyright to, get explicit permission from the rightsholders for “AI” training use (difficult for code from larger projects), or properly comply with the terms of the licenses – automatically adding the terms and copyright notices during autocomplete and/or properly applying copyleft to the newly generated code. Anything else is a massive copyright violation and a direct assault on open source.
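As a rough illustration of the attribution-on-autocomplete idea – the attribution data itself is the hard, hypothetical part, since a real tool would need to track which training sources a suggestion derives from – attaching the notice could be as simple as prefixing every suggestion:

```python
# Sketch of the "attach the notice during autocomplete" idea above.
# The project/license/copyright values are hypothetical placeholders;
# tracing a suggestion back to its actual source is the unsolved part.

def with_notice(snippet: str, project: str, license_name: str,
                copyright_line: str) -> str:
    """Prefix a generated snippet with the attribution its license demands."""
    notice = (
        f"/* Derived from {project}, licensed under {license_name}.\n"
        f" * {copyright_line}\n"
        f" * See the {license_name} license text for the full terms. */\n"
    )
    return notice + snippet

print(with_notice(
    "int clamp(int x) { return x < 0 ? 0 : x; }",
    "example-project",
    "BSD-3-Clause",
    "Copyright (c) 2020 The Example Authors",
))
```

Mechanically this is trivial; the reason nobody ships it is that the model can’t tell you which repository a given completion was memorized from, which is rather the whole problem.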
Let me put it this way – the code to various versions of Windows has leaked numerous times. What if we train an “AI” on that leaked code and let everyone use it? Do you honestly think Microsoft would not sue you into the stone age?
I actually think we’re reaching a point with AI where copyright concepts are becoming fundamentally intractable. This has always been the case to an extent. Human programmers often train themselves with copyrighted works too, and we then use the knowledge gleaned from such sources to create “our own code”. AI is essentially doing the same thing; it’s only that the AI training is much more explicit, whereas we automatically give humans a pass because nobody was recording the information transfer.
Enforcing copyright against algorithms so transformative that they rewrite code for new contexts is ultimately impossible. The copyright regime, willingly or not, is going to have to adapt to the new realities created by AI, even if by force. Courts may try to block this, but it will be futile – the AIs will inevitably end up underground, where they will continue to exist but their origins will be literally untraceable. Whether we think it’s fair or not is irrelevant. Whether courts embrace it or shun it is irrelevant. There’s no mistaking this is the future. As AI becomes more efficient, both the companies and their developers will increasingly use its code generation advantages. For better or worse, those who refuse will become redundant – it’s only a matter of time.