Today, we are launching a technical preview of GitHub Copilot, a new AI pair programmer that helps you write better code. GitHub Copilot draws context from the code you’re working on, suggesting whole lines or entire functions. It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on the internet. As you type, it adapts to the way you write code—to help you complete your work faster.
Sounds like a cool and useful feature, but it does raise some interesting questions about the code it generates. Sure, generated code might be entirely new, but what about cases where the code it “generates” is simply taken from the existing projects the AI was trained on? The AI was trained on open source code available on GitHub, including a lot of code licensed under, for instance, the GPL. GitHub says in the Copilot FAQ:
GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions.
That 0.1% may not sound like a lot, but that’s misleading. Put another way, roughly 1 out of every 1,000 suggestions Copilot makes may contain code copied verbatim from something someone has written and selected a license for, and that license must, of course, be respected. On top of that, it’s hard to argue that code generated from a body of existing open source code doesn’t constitute a derivative work, and is thus covered by the same copyright that open source licenses are based on.
I am not a lawyer, so I’m not going to argue Copilot is definitively a massive GPL violation, but as a layman, on the face of it, it definitely feels like a tool that’s going to strip a lot of code of its license, without the consent of the code’s authors.