First things first: I’m not a lawyer. I don’t even play one on TV. Nothing in this post should be taken as legal advice. You should contact a real legal professional before making any decisions with regard to your legal rights, including copyrights. If swelling lasts more than 4 hours seek medical attention. Capital at risk. Etc.
Second things second: What is AI? You know what, let’s not get into it. What we are concerning ourselves with here is the specific technology of Large Language Models (LLMs), which take in huge libraries of human-generated text in order to build a predictive model that can simulate something a person might say or write. The more text these models can take in, the more realistic they seem to a human observer. So companies producing LLMs have a strong incentive to use as much text as possible in their training sets.
The open access movement is built on the idea that knowledge should be shared. That we create a richer culture and more accurate and expansive science when we share our ideas with each other freely. That’s the whole point of this Open Access Week theme, ‘Community over Commercialisation’. We share to enrich our communities of practice, not some tech oligarch.
Many believe this sharing ethos applies to training LLMs as much as anything else. Some, however, believe LLM companies are exploiting this openness to create proprietary technologies that don’t benefit society as a whole, or even the knowledge communities they extract material from, instead making a tiny number of white guys in Silicon Valley insanely wealthy while they drain the last drops of the Colorado River Basin and somehow make climate change even worse. Some people feel that way. You know, I’m staying neutral. I’m presenting all viewpoints.
If you do want to share your work with anyone who would like to read it, but don’t want to have it used to train LLMs, can you have the one without the other? Can work licensed with Creative Commons licenses be used for training LLMs without specific permission from the author?
Maybe.
I’m sorry, that’s the only honest answer, for reasons we’ll get into. However, it’s also the only honest answer about work that isn’t licensed with Creative Commons. It is the legal position of most AI companies and some legal experts that using works to train LLMs falls under copyright exceptions such as ‘Fair Use’ or ‘Fair Dealing’, in which case it wouldn’t matter if a work is Creative Commons licensed or all rights reserved. If training an LLM falls under a copyright exception, then copyrighted, public domain, and Creative Commons licensed material are all equally up for grabs.
The EU’s recent AI Act seems to contradict that, saying that using a text in training an LLM constitutes ‘copying’ under the law. This may be why companies such as Microsoft are suddenly going to publishers and paying for licences to use their work to train LLMs. However, and I apologise for bringing this up, we aren’t in the EU, so what does that mean for UK authors? Also unclear.
For copyright, jurisdiction attaches where the copy is made, but when you’re talking about digital products, jurisdiction gets complicated and could be claimed anywhere a digital copy is accessed, even if that copy only ever exists in RAM. Many LLM companies will choose to comply with the EU AI Act rather than take the financial risk of skirting it, and there’s a high likelihood of other countries, especially ones that do heavy trading with the EU such as the UK, harmonising with its legislation. So, while there is nothing in the UK now that specifies whether using a text to train an LLM is fair dealing or not, that may change soon.
As an author, your best course of action is probably to behave as if training LLMs is a copyright-protected activity. After all, if it isn’t, then there’s nothing you can do anyway. So, if training is a copyright-protected activity and you use a Creative Commons licence, are you giving permission for an LLM to train on your work? It depends on the licence, on the entity making the LLM, and on the purpose the LLM is made for. Let’s look at the different licence terms:
BY (Attribution): using this requires a user to cite your work in any further use they make of it. It’s hard to see how an LLM could do this. Any one sentence generated by ChatGPT, for instance, is the result of training on hundreds of thousands of works. However, if it’s not copying a specific phrase from any one work, that may not be enough use to require attribution. All this will need to be decided by courts or legislators, but the BY license doesn’t seem to provide strong protection against LLM use, in my reading.
NC (Non-commercial): this requires any use of the work be for a non-commercial purpose. This would seem to prevent use by most LLMs, but OpenAI was founded as a non-profit (they aren’t anymore), and some stage of development of many LLMs may be done by university labs or other non-profit organisations. An NC license would certainly prevent the likes of Microsoft or Google using your work. Probably.
ND (no-derivatives): This condition prevents reuse unless the work is reproduced in whole. This would seem to prevent use in LLM training. It’s hard to see how they could meet that condition. However, it also prevents many types of reuse that you may want to allow, such as another artist sampling a few bars from one of your compositions, or another scientist using your protocol without copying out the whole thing in their paper. It’s worth thinking carefully about what you would like to share before applying this one.
SA (share-alike): This requires any use of the work also be released with the same open licence. It is designed to enforce a protected ‘digital commons’. An actual legal scholar wrote an analysis of LLMs’ ability to use SA licensed work, so read that if you are curious. It is the least used licence in academic publishing, so the protection offered may not be relevant to you.
Of course, you can combine and layer these conditions, so a CC BY-NC licence provides more and different protection than CC BY alone.
Because all of this is ambiguous (CC licences were not written with AI in mind), the Creative Commons foundation is considering creating what they’re calling ‘preference signals’, which will address training in LLMs. Unlike the licence conditions (BY, NC, ND, SA), these preference signals are not enforceable (under current copyright law), but rather indicate to companies, and their web crawlers, an author’s preference. These would potentially include signals such as ‘Don’t train’, ‘Train, but disclose it was trained on my data’, or ‘Train, only if your model uses renewable sources of energy’.
I believe creating preference signals like these, and tech companies promising to abide by them, would be a positive step to enhancing trust and confidence in the open access movement. Scholars shouldn’t have to choose between being open and participating in systems they might feel are unethical. Conversely, those who want to allow LLMs to use their data, perhaps hoping to improve these systems, should be empowered to do so. Creative Commons licences are supposed to be about author choice and agency, and you shouldn’t have to feel you’re giving up anything to use them.
In the present legal landscape, it doesn’t seem like Creative Commons licenses provide much less protection against your work being used to train an LLM than standard copyright terms. What is really needed is for legislators worldwide to follow the example of the EU and clarify these legal issues with AI, so we aren’t all having to operate in the present morass.
Further Reading
https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/
https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/
https://www.technollama.co.uk/creative-commons-and-ai-training
https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/
https://sr.ithaka.org/our-work/generative-ai-licensing-agreement-tracker/ (a tracker of academic presses that are licensing content to LLMs)
Varina Jones-Reid
Open Research Team
Image (minus text) created by Lincoln Rogala, via Wikimedia, licensed CC BY-SA; meme version licensed under the same conditions