Is AI Training Infringement?

Emerging Copyright Governance Frameworks Across the US, China, and Europe

by Joseph B. Karaganis

Joseph Karaganis is a rising third-year student at Columbia University and a Laidlaw scholar. His research focuses on comparative AI policy as a basis for informing emerging US conversations on the topic. He visited the AI, Media, and Democracy Lab in June 2024 to learn from researchers, policy advocates, and civil society leaders working in the field, and to engage with the recent wave of EU tech regulation (especially the EU AI Act). He became particularly interested in the intellectual property and transparency questions surrounding AI and how they might impact news media groups; this thread of research culminated in two posts for our blog.

It didn’t take long after the release of ChatGPT for media organizations and Big Tech companies to lock horns over the use of copyrighted materials in the training of AI models. The industry’s current training paradigm relies on mass scraping of data from the internet and other large digital repositories, which is then fed to neural networks. US publishers and professional associations have demanded that legislators characterize this process as infringement when it copies their proprietary content without permission.

In some cases, these complaints have boiled over into litigation. In December of 2023, the New York Times sued OpenAI—ChatGPT’s developer—for copyright infringement. OpenAI has defended its practices by pointing to the “fair use” doctrine of American copyright law, which protects the reproduction of copyrighted material under certain conditions; factors include whether the use is sufficiently “transformative” and whether it undermines the market position of the original work. It has also pointed to earlier court decisions that provided “fair use” exceptions to digital products built on data-scraping, such as Google Books. In response, the Times has challenged the “transformative” character of AI training by finding examples of outputs that regurgitate copyrighted content word-for-word and by arguing that ChatGPT erodes the value of its subscription model.

To date, there has been very little adjudication of these issues in the US. Several other media groups launched similar infringement suits in the wake of the New York Times case, but these are also in their early stages. Other major publishers (notably the AP and News Corp, owner of the New York Post and Wall Street Journal) have taken a different approach: establishing licensing agreements with OpenAI in order to secure compensation without the burden of a protracted lawsuit. What prevails for now is a legal ambiguity that casts a shadow over—but does not appear to be significantly impeding—the development of new models.

In China, regulators have yet to address these copyright questions directly; decisions by the country’s high-tech internet courts have contributed to a lively judicial debate on adjacent issues without reaching consensus on the legal status of the training data itself. Only the EU, with its “opt-out” provision established in the Directive on Copyright in the Digital Single Market, has a comprehensive and unified standard for the protection of copyrighted material during generative AI training—although questions still remain surrounding the law’s implementation and the scope of protection.

My goal with this post is to briefly assess the American public policy response to this situation before turning to the alternative frameworks emerging in China and Europe, with an eye toward reflecting on the US approach from a comparative perspective.

Luke Conroy and Anne Fehres & AI4Media / Better Images of AI / Models Built From Fossils / CC-BY 4.0

The US: Kicking the Can

The US debate surrounding AI and copyright is still in its infancy and has not yet settled on a definite attitude toward the training process. Still, several early regulatory attempts have made clear that the issue is on the minds of (federal) policymakers. An important component of US governance efforts is a set of standardized ‘best practices’ guidelines aimed at the growing private sector. While none of these impose binding obligations on developers, the most important ones have at least referenced the possibility of copyright infringement during model training.

The executive branch’s first comprehensive collection of AI ‘best practices’ is the NIST AI Risk Management Framework (RMF)—released in January 2023. It suggests that training data could be “subject to copyright” and that companies are obliged to follow “applicable intellectual property rights laws” during model development, preferably by “maintaining the provenance of training data.” One provision, MAP 4.1, advises developers to “[map] AI technology and [the] legal risks of its components,” including copyright infringement. The April 2024 NIST RMF Generative AI Profile goes further, noting that IP questions are particularly crucial for foundation AI models and that while “GAI [generative AI] systems may infringe on copyrighted or trademarked content,” particular applications of copyright law are still being “debated in legal fora.” As a prescriptive measure, it calls on companies to develop more robust documentation and accountability practices that mitigate the risk of IP infringement.
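
The framework stops short of prescribing any format for such documentation, but a minimal sketch can make “maintaining the provenance of training data” concrete: a per-document log entry recording where each training item came from and under what terms. The field names below are hypothetical illustrations of the idea, not requirements drawn from the NIST AI RMF.

```python
# Hypothetical provenance record for one training document; the field
# names are illustrative, not drawn from the NIST AI RMF itself.
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass
class ProvenanceRecord:
    source_url: str        # where the document was retrieved
    retrieved_on: date     # when it entered the training corpus
    license: str           # asserted license, or "unknown"
    rights_reserved: bool  # did the publisher signal an opt-out?


record = ProvenanceRecord(
    source_url="https://example.com/article",
    retrieved_on=date(2024, 6, 1),
    license="unknown",
    rights_reserved=False,
)

# Serialize for an audit trail, one JSON line per document.
print(json.dumps(asdict(record), default=str))
```

A registry of records like these would also be the natural substrate for the kind of public disclosure database contemplated by the congressional proposal discussed below.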

By deferring judgment to the (so far unresolved) debates being held in “legal fora,” the guidelines do not commit the executive branch to any particular interpretation of existing copyright doctrine. This posture reinforces an ambiguity that gives AI developers a green light to continue operating on the expectation of fair use protection, since that interpretation of copyright law has yet to be rejected by the courts.

The government’s hesitancy to stake out ground in the copyright debate is also apparent in binding federal actions, the most important of which is President Biden’s October 2023 Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. Section 5.2 of the order suggests that initial rulemaking will be allocated to the United States Copyright Office of the Library of Congress (USCO), which is expected to publish a report on the copyright issues raised by AI training before the end of the fiscal year. Within 180 days of those recommendations being published, the order calls for consultation with the Copyright Office’s director on “potential executive actions relating to copyright and AI.” The USCO’s authority covers the interpretation of existing copyright laws, not the creation of new ones: its recommendations may inform the federal government’s stance on the scope of “fair use” protections, but not much more. Whether or not US governance efforts continue to approach training via the “fair use” angle will turn on the President’s decisions once the report is released.

There has been some forward movement in Congress. The most notable ‘pro-creator’ proposal has come from Senator Adam Schiff of California, whose Generative AI Copyright Disclosure Act would require developers to register all copyrighted content used during the training process within 30 days of a system’s release. The USCO would then maintain a public database of these notices so that rights holders could see whether or not their work has been used to train an AI model. This would in principle lay the groundwork for media organizations or other content creators to initiate licensing negotiations (or infringement suits) with AI developers who use their work without permission. But the proposal rests on the assumption that the use of copyrighted material in AI training constitutes infringement, an issue that will ultimately still hinge on the clarification offered by the USCO, on subsequent executive branch decisions, and on any precedents created by the current wave of lawsuits (assuming any make it to trial).

China: Cut from the Same Cloth

US policymakers have been largely noncommittal on the legal status of training inputs, and government officials have taken a similar approach in China. The CCP has not yet clarified its position on whether or not the use of copyrighted material in AI training constitutes infringement under Chinese copyright doctrine. However, some special-jurisdiction courts have begun to rule on related issues surrounding generative AI, intellectual property, and creative rights.

Intellectual property concerns first appeared in China’s national AI regulations with the July 2023 Interim Measures for the Management of Generative Artificial Intelligence Services, which cover a wide range of AI use-cases. Article 7 of the law addresses the “pre-training” and “optimization training” of AI models and puts forward two IP-related conditions. First, developers must “use data and foundational models that have lawful sources.” Second, they must not infringe on the intellectual property rights of others, where such rights are involved. This commitment to existing IP approaches raises the same questions as the American governance platform: does the current, widely adopted AI training paradigm constitute infringement? And what IP rights are involved, exactly? Although this is as far as CCP policymakers have been willing to go (for now), Chinese courts have used this language to establish precedents on some important, closely related topics that could impact AI developers.

In November 2023, the Beijing Internet Court ruled that AI-generated images can be protected by Chinese copyright law, as long as the level of human involvement meets the bar for “intellectual achievement” and “originality.” The court determined, however, that these assessments would be made on a case-by-case basis. (By contrast, the USCO has ruled that AI-generated works cannot be protected under copyright unless the contributions of AI are de minimis.) A February 2024 ruling by the Guangzhou Internet Court went in a different direction, establishing that AI providers can be held liable for infringement if their system’s output violates a rights holder’s exclusive right to reproduce and adapt a copyrighted work (in this case, by creating images that were substantially similar to a registered superhero character). The speed at which these two cases were decided, far in advance of the analogous legal disputes raised in the US, reflects the unique administrative infrastructure of the Chinese Internet Court system: adjudication occurs entirely online, on a much faster timeline than in American civil courts.

Neither case directly implicates the issue of copyright infringement in training data (both focus exclusively on the output of AI models), although the second seems to lay the groundwork for future legal action. If outputs generated by a learning process that relies on copyrighted material can infringe on IP rights by reproducing that material, then addressing the initial use of the copyrighted material may be a reasonable next step. At the very least, the second case implies a judicial stance willing to defend creative rights at the expense of untrammeled AI innovation. Still, the Guangzhou court was clearly hesitant about going too far in this direction: it limited financial damages to a negligible amount and angled the remedy toward the creation of stronger governance mechanisms, so as not to “put excessive obligations on the service providers.”

The questions that remain regarding training data may be resolved by a comprehensive national Chinese AI law, which experts expect to see in the coming years. Two draft proposals from scholars have already surfaced, offering some insight into the current attitudes of legal academics. A 2023 draft by the Chinese Academy of Social Sciences reaffirms that AI developers must “respect” intellectual property rights and proposes that the state “innovatively [explore] IP systems that are adapted to the development of AI.” While still noncommittal, the language does suggest a solution that transcends existing copyright frameworks by recourse to innovation rather than clarification.

A March 2024 draft by scholars at the East China University of Political Science and Law puts forward a bolder solution. Article 24 sketches out a “reasonable use” criterion for AI training. As long as the use of copyrighted content “is different from the original purpose or function of the data and does not affect the normal use of the data or unreasonably harm the legitimate rights and interests of the data’s owner,” developers may “forgo payment of remuneration to the data’s owner without the data owner’s permission” (although they must still have the data “marked in a conspicuous manner”). This proposal would give AI companies significant legal cover to continue with current development paradigms, while inevitably setting in motion debates over the standards for “unreasonable harm.” The criterion would hold AI training to a standard that roughly follows the American “fair use” exception to infringement, although the requirement that developers disclose their use of copyrighted content would satisfy some of the transparency conditions currently being demanded by media organizations and other rights holders.

The EU: A Path Forward?

While neither the US nor China has established firm national policies that resolve the AI training question, the EU has already put in place a distinctive regulatory approach that tries to preserve creative rights without undermining innovation.

The EU strategy revolves around a central mechanism: an “opt-out” system for creators, first put forward in Article 4.3 of the EU’s 2019 Directive on Copyright in the Digital Single Market (DCDSM). Article 4.3 provides that online text and data mining is lawful, but that copyright holders may “expressly reserve” the right to have their content omitted from these processes, with exceptions for scientific research applications. While the rules were developed before the release of newer generative AI models like ChatGPT, the “opt-out” provision has already been integrated into the EU’s AI governance framework: the EU AI Act requires that developers of general-purpose AI models produce summaries of the content used in model training and establish policies to comply with Article 4.3, including its “opt-out.” Instead of stretching existing copyright laws to fit the AI context, the DCDSM and AI Act amend the scope of copyright protections to reflect new technological conditions.
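
What a machine-readable reservation might look like in practice is still unsettled, but a rough sketch can make the mechanism concrete. The Python snippet below imagines a crawler that chooses to honor two emerging conventions before mining a page: a robots.txt rule aimed at a known AI crawler, and the HTTP header proposed by the W3C community draft TDM Reservation Protocol (TDMRep). The “GPTBot” user agent and the example URLs are illustrative assumptions, and neither convention is mandated by the DCDSM or the AI Act.

```python
# Sketch only: checks two emerging (non-mandated) opt-out conventions.
import urllib.request
from urllib import robotparser


def may_mine(page_url: str, robots_url: str, user_agent: str = "GPTBot") -> bool:
    """Return True if neither opt-out signal blocks mining page_url."""
    # 1. robots.txt: the de facto channel several AI crawlers already honor.
    robots = robotparser.RobotFileParser()
    robots.set_url(robots_url)
    robots.read()
    if not robots.can_fetch(user_agent, page_url):
        return False

    # 2. TDM Reservation Protocol (W3C community draft): a per-page HTTP
    #    header where "1" means the rights holder reserves TDM rights.
    request = urllib.request.Request(
        page_url, method="HEAD", headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        if response.headers.get("tdm-reservation") == "1":
            return False

    return True


if __name__ == "__main__":
    # Hypothetical publisher domain, used purely for illustration.
    print(may_mine("https://example.com/article",
                   "https://example.com/robots.txt"))
```

TDMRep also defines site-wide locations for the same signal (such as a JSON file under /.well-known/), which is precisely the kind of detail that remains unstandardized, as the next paragraph notes.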

While this solution appears to be flexible enough for both parties, time will tell whether “opt-out” can deliver sufficient protection to creators and manageable expectations for developers. For example, while the DCDSM text references the use of “machine-readable means” for an “opt-out” statement, some have pointed out that no widely recognized standards for such “means” exist. As the AI Act is phased in over the next several years, the EU will need to ensure that creatives are in a position to conveniently disclose their data-sharing preferences; it will also need to address the issues of retroactivity (what about AI models that have already scraped copyrighted content from the internet without compensating creators?) and globalization (what about systems trained on non-European data?). Whether or not copyright holders have the capacity to push back on AI text-mining will also depend on the implementation of the data transparency standards established by the AI Act; the EU’s AI Office plans to finalize and release a template for data disclosure in early 2025. But even if its practical success requires a stronger effort from regulators and industry norm-setters, the AI Act’s uptake of the “opt-out” provision is a commendable attempt to update copyright protections in response to AI’s watershed moment.

Both the US and China have hesitated to proactively govern on AI copyright issues; instead, the two countries have simply reasserted the authority of existing frameworks, whose implications for the AI training process have not yet been clearly articulated. In the US, this clarification has been deferred to the USCO and forthcoming legal debates; Chinese regulators also have the advantage of observing court decisions before finalizing their own copyright approach. The EU’s decision to act early and establish a clear and comprehensive (if imperfect) governance framework for AI training may provide clarity to rights holders and developers during this period of rapid innovation and regulatory overhaul. “Opt-out” offers a policy solution to a problem that the US and China are currently approaching through the interpretation of antiquated doctrine, a lengthy process that may end up giving AI companies enough time to devour the entire internet regardless of where protections fall.