WIPO: How to Train Your A.I.

The World Intellectual Property Organization (WIPO) last week published the comments it received as part of its public consultation on the impact and implications of artificial intelligence technology on intellectual property policy and practice.

It received over 200 submissions from individuals, industry groups, companies, legal and technology experts and member states, so it’s a lot to go through. But from an initial sampling of comments from primarily copyright-focused groups, it’s pretty clear where the early policy battle lines will be drawn: Does, or should, the use of copyrighted works in A.I. training data sets require a license, and do policymakers need to create a legal framework for the use of copyrighted works in training data?

Those are not the only points of dispute. The comments contain many varied and conflicting views on several other issues raised by WIPO in its call for comments, including whether copyright can or should be attributed to works “autonomously generated” by A.I., what type of human involvement would be sufficiently creative to deserve copyright protection, and whether some new, sui generis form of protection is needed to accommodate works generated by A.I. (I recommend Creative Commons’ provocative, iconoclastic take.)

But those questions are in some sense abstract and theoretical, going to the purposes of copyright and the intrinsic value of human creativity. And in any case, they’re addressed to the hypothetical output of A.I. systems.

The issue of using copyrighted works in training data is about inputs. And it’s not at all hypothetical or abstract; it’s happening right now, with existing works. And it obtains irrespective of what happens to the outputs.

The order of battle is also well-practiced and familiar.

Organizations representing copyright owners generally concur that the use of copyrighted works in training data does and should require prior authorization by the copyright owner, and that no new exceptions or exemptions to current laws need to be set out by policymakers to enable the development of A.I.

The Association of American Publishers offers a typical example of the genre:

“Though the question cannot be answered in the abstract, it nonetheless remains AAP’s view that wholesale, un-permissioned reproduction of copyrighted works in which data subsists, even for the purpose of machine learning, is likely to be infringing. Where data embodied in copyrighted works is to be used for machine learning purposes, the scope and terms of such use can best be set out in a licensing agreement between the parties. Licensing remains the most flexible tool through which AI training can be promoted, while also recognizing and protecting the copyrights of rights holders.

“It is worth noting that, notwithstanding the lack of a specific exception for AI training in many jurisdictions, commercial and non-commercial entities are already engaging in AI training activities. Usage of data embodied in copyrighted works for machine learning purposes is already ably facilitated through licensing agreements or contracts between the data user and the owner(s) of the copyrighted works in which data may subsist. The fact that these arrangements already exist show that many current national copyright law frameworks are not a hindrance to AI development and enrichment, thereby negating any perceived need for creating new exceptions and limitations purportedly to satisfy the purpose of AI training.”

Groups representing technology developers, as well as those purporting to speak on behalf of the public and other copyright users, argue that compiling copies of copyrighted works into A.I. training databases is, or ought to be, permitted under existing fair use/fair dealing exceptions and that no prior authorization should be required.

The Computer & Communications Industry Association offers a lengthy exegesis:

“The use of the data subsisting in copyright works without authorization for machine learning should not constitute an infringement of copyright. In the United States, the existing statutory framework and related case law concerning the fair use right, 17 U.S.C. § 107, clearly permit the ingestion of large amounts of copyrightable material for the purpose of an AI algorithm or process learning its function. In jurisdictions without a fair use provision, an explicit exception may be necessary. Because of the importance of the lawfulness of ingestion to this inquiry and AI more generally, we will explain in detail how fair use permits this activity.

“AI algorithms and other processes often require the ingestion of large amounts of material. Assembling that material may entail converting it into a more usable format, e.g., translating image files into machine-readable files. In addition, backup copies of the materials will be necessary to protect against loss of data in the event of system failure. Temporary reproductions of portions of the material in a computer’s random access memory are a normal part of the process of training and AI algorithm. All these copies are not viewable or consumable by the outside world. Because these non-expressive copies are not consumable by the public, they do not function as market substitutes for copies of the ingested works…

“Treating the unauthorized use of data subsisting in copyrighted work for machine learning as an infringement would have a significant adverse impact on the development of AI and the free flow of data to improve innovation in AI. Companies and research institutions alike would become hesitant to ingest copyrighted works for machine learning because of the potential exposure to infringement liability. Such a result would be counterproductive to copyright law’s ultimate goal of promoting the creation and dissemination of works.”

Many of the questions regarding the output of A.I. systems, raised by WIPO and in a parallel inquiry by the U.S. Patent & Trademark Office, are profound and require the careful weighing of several overlapping factors, including legal questions, economic issues, the intrinsic as well as market value of human ingenuity and creativity, and the advancement of technology.

But questions regarding A.I. inputs, such as the use of pre-existing works of authorship or invention, are concrete, and of growing commercial urgency. The answers to those questions, moreover, will inevitably shape and be reflected in the answers to the more abstract questions about outputs.

So, maybe tackle first things first?