GitHub Copilot and Copyright

Posted on Jul 5, 2021

It’s all over the news, so I won’t get into stuff everyone seems to know already: GitHub launched Copilot, an AI-powered service that is able to suggest code, even entire functions, by reading your comments or the code you’re writing… And the FOSS community cried war: “If they used GPL code, then they’re violating GPL!”

Disclaimer… and why I know a bit about copyright: I work as Foreign Rights Manager at Next Door Publishers. My job is to negotiate licensing deals with publishing houses, literary agents, and sometimes independent authors so that my company is legally authorized to publish translations of foreign works… or, vice versa, to sell licenses to publishers or agents abroad that are interested in translating works published by us for which we still hold the rights. So, I can confidently say that I have some practical experience with copyright, although in the editorial world. I am not a lawyer, though. In fact, I occasionally check in with my company’s lawyer to clear up subtle issues that sometimes arise. This is NOT legal advice, just a more informed opinion.

First, copyright is… hard. Every jurisdiction has its own quirks. Also, every license agreement between two parties is… very different… because copyright is one of those areas where parties have a huge amount of legal room to agree on almost whatever they like, in whatever manner they like. Sometimes deals get unblocked by proposing very ad hoc terms which sound ridiculous out of context but… if you knew the whole story, you’d see the point behind them: Agreeing to pay the copyright holder a percentage of the money accrued from poetry recitals that include excerpts from their work? That can make a difference!

On this matter, well, first things first… I don’t see any relevant differences across the jurisdictions I am familiar with (US, UK, Spain, and general EU law on copyright).

So, the main point people are arguing about is that Copilot was probably trained with GPL code from GitHub repos. Therefore, code that uses suggestions derived from GPL code would be deemed “modifications” under GPL Section 0… But! Bear in mind that the GPL defines this as follows:

To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy.

This is one of those things where I consciously err on the side of caution. Yes, in theory, copying a couple of lines of code might be allowed by your jurisdiction if it’s “unoriginal” code. Copyright protects “original” work, not whatever thing you write and publish. The sentence right before this one is probably not under any copyright… But if you took the whole paragraph and used it on your blog or work of art, chances are high that you’re infringing on my copyright[1] if you’re not quoting me or covered by fair use.

So, when Copilot suggests just a couple of boilerplate lines or some calls that are almost logical consequences of how the APIs you’re using are defined,[2] well… legally it’d depend.

But what about whole functions? A function more often than not is a conceptual unit, so you could easily argue that functions are subject to copyright… and I’d probably agree even if it’s a small function… It’s not about how much, but about what you’re copying. There you’d argue that the GPL should be enforced.

Enforced by whom? Against whom?

The team behind Copilot just fed lots of code to a machine… which has the ability to suggest code that is derivative of that original code or might even be an exact copy of it. But Copilot isn’t publishing that code… it’s sending you code on demand when you type stuff into Visual Studio Code with the extension turned on. The fact that unpublished derivative works are not subject to the conditions of the GPL is… a rather weird move made by the GPL itself. The GPL does this in a very convoluted way, with the whole “conveying” definition and how “propagating without conveying” is free from the GPL’s conditions (Section 2).

To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

To “convey” a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.

Copilot is propagating some code onto someone else’s computer, “into” a private copy… Well, private at first. It’s no different than you grabbing some GPL code on your own and pasting it into some private project of yours. And the GPL would even allow you to “propagate” that code within your company, group of friends, local user group, or organization and have all of your colleagues use it… as long as your company doesn’t decide to make that code available to clients or the public.

Copyright law, by default, usually deems derivative works as such regardless of publishing status. Imagine my publishing house had a translation (= derivative work) ready for sale, someone cracked into our systems, and leaked the unpublished version of it on the Internet. They could inflict lots of harm on all parties. That person would’ve infringed on our license and the copyright of the author of the original work. By stating that derivative works fall under copyright from the point of their creation and not publication, exactly like original works do, you get two desired effects:

  1. That you need a license to create a derivative work, otherwise you’re infringing the copyright of the author of the original work.
  2. That your derivative work, if properly cleared, enjoys all protections as if it was an original work.

The GPL doesn’t do this: it leaves any private modifications unprotected even if you’re planning to publish them later, because it exempts them from all conditions imposed by the license. Unpublished code that uses GPL code is in a very gray legal area: it should be getting copyright protection from the law itself (i.e., proprietary), but there’s no transfer of license either, as Section 10 only applies if the work is “conveyed…”, so there’s no proper clearance going on? In my opinion the GPL is full of these gray areas because it’s written out of mistrust of copyright law (on which it depends), unlike the Apache License[3] or even the simple MIT/BSD/etc. ones… but that’s a topic for another day.

In our case, all of this means that Copilot is just acting as a smart search engine that is giving you some code for you to use… The violation would only occur if you publish that code and the blame, of course, would go to Microsoft and GitHub… No, just kidding, the blame would all be yours.

“But they tricked me into it!,” you could argue. “I never knew it was GPL code! I should’ve been given a notice!”

That I agree with… but it’s not legally required… because all of this is happening in private between you and Mr. Copilot, so, again, no “conveying…” Therefore, no requirement to make license and copyright notices explicit (Sections 4 and 5).

If GitHub wanted to act in favor of their users, they should provide notice of what the license terms are for the suggested code. Of course, this raises the hard question of what happens if the AI suggests some newish code based on GPL code it was trained with… but which might fall into “not really a derivative work anymore” territory… Deciding those cases is hard even for judges, so I’m not sure how you’d train an AI to settle whether something is or is not a derivative work…

Yet again, erring on the side of caution seems to be the best option here. If I was in charge of this, which I should… I’d probably shut GitHub down entirely, though… If I was in charge, I’d favor including a notice like “This code may be a derivative work of code under License,” with a link to the original so the user can make a decision. Obvious derivative works would explicitly state “This code is very probably a derivative work…”
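
To make the notice idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Suggestion` fields, the `license_notice` function, and the 0.9 similarity threshold are made up for illustration and say nothing about how Copilot actually works. Whether something really is a derivative work is a legal question that no similarity score settles; the sketch only shows how one of the two wordings could be picked once a source and its license are known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """A suggested snippet plus hypothetical provenance metadata."""
    code: str
    source_license: Optional[str] = None  # assumed field, e.g. "GPL-3.0-or-later"
    source_url: Optional[str] = None      # assumed field: link to the original code
    similarity: float = 0.0               # assumed score: 0.0 unrelated, 1.0 verbatim


def license_notice(s: Suggestion, obvious_threshold: float = 0.9) -> Optional[str]:
    """Return the notice to show with a suggestion, or None if no source is known."""
    if s.source_license is None:
        return None
    if s.similarity >= obvious_threshold:
        # Near-verbatim match: use the stronger wording.
        text = f"This code is very probably a derivative work of code under {s.source_license}"
    else:
        # Looser match: err on the side of caution and still tell the user.
        text = f"This code may be a derivative work of code under {s.source_license}"
    if s.source_url:
        text += f" ({s.source_url})"
    return text + "."
```

With something like this, a suggestion carrying `source_license="MIT"` and a low similarity score would still get the cautious wording, which is the point: the user gets to decide, whatever the license.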

Notice how I generalize to all licenses above… The public debate has been centered around the GPL, but the Apache license, the MIT/X11/Expat-type license I use, the ISC license, and even the BSD-type licenses, all of them have some sort of requirement to comply with even if they’re not copyleft. Failing to reproduce copyright and license notices for BSD-licensed code makes the derivative work as illegal as not providing the source code for GPL code you’ve taken… or even starting to translate a book without having the license agreement signed by both parties.

And how is it possible that a FOSS enthusiast and quite public person became a Foreign Rights Manager? Well, that story is for another time! For more peaceful times, I guess…

  1. I hold all rights for this blog… see footer! There’s a reason for this and I’m not sharing it with you people today!

  2. Think of setting up some data structures provided by an API to some values that are just directly known from the specs.

  3. The Apache License is a charm. OK, this is personal opinion, but the way it’s written is way closer to a real license agreement than the GPL is. Have a look at how it separates “Derivative Works” from “redistribution,” so the default law provisions apply. And by the way, the part where it deals with patches sent upstream, which defaults to copyleft (I bet you’ve never been told this!) if no other agreement is in force… is perfect… A fork is not the same as a contribution! (But the GPL doesn’t make the distinction!)