Before you read this article – note that Codeium offers a competitor to GitHub Copilot. This means they have something to sell, and something to gain by making Copilot look bad. That being said – their findings are things we already kind of knew, and further illustrate that Copilot is quite possibly one of the largest, if not the largest, GPL violations in history.
To prove that GitHub Copilot trains on non permissive licenses, we just disable any post-generation filters and see what GPL code we can generate with minimal context.
We can very quickly generate the GPL license for a popular GPL-protected repo, such as ffmpeg, from a couple lines of a header comment.
Codeium claims it does not use GPL code for its training data, but the fact it uses code licensed more permissively still raises questions. While the BSD and MIT-like licenses are more permissive and lack copyleft, they still require the inclusion of the terms of the license and a copyright notice to be included whenever the covered code is used. I’m not entirely sure if using just permissively licensed code as training data is any better, since unless you’re adding the licensing terms and copyright notice with every autocompleted piece of code, you’re still violating the license.
If Microsoft or whoever else wants to train a coding “AI” or whatever, they should either be using code they own the copyright to, get explicit permission from the rightsholders for “AI” training use (difficult for code from larger projects), or properly comply with the terms of the licenses and automatically add the terms and copyright notices during autocomplete and/or properly apply copyleft to the newly generated code. Anything else is a massive copyright violation and a direct assault on open source.
Let me put it this way – the code to various versions of Windows has leaked numerous times. What if we train an “AI” on that leaked code and let everyone use it? Do you honestly think Microsoft would not sue you into the stone age?