Is copyright a barrier to AI?

This is not the first time the issue of copyright has fueled debate over the fair distribution of the economic gains of digital companies. But the complaint filed last December by the New York Times against OpenAI, the American start-up behind ChatGPT, has rekindled this perennial controversy.

The newspaper denies the leading start-up in artificial intelligence (AI) tools the right to make commercial use of its content and is demanding several billion dollars from it. In its response, OpenAI maintains that its use of the content remains “fair”, since it does not, in principle, give direct access to the works, even though the newspaper tries to show otherwise, with examples to support its claims.

Are these borrowings marginal or structural? Should the authors of this content have to give their permission, or even be compensated, whenever their works appear in the training databases of large language models? These are the fundamental questions raised by the New York Times.

Developing AI tools such as ChatGPT or Midjourney requires gigantic amounts of data: text, images, videos and so on. This content constitutes the raw material of AI.

In the United States, the doctrine of “fair use”, the reasonable use of copyrighted content, provides exceptions established by case law and assessed case by case by American courts.

This principle has allowed Google to offer a search service within its library of digitized books, Google Books, but not to resell those books in digital format without the explicit agreement of the rights holders.

On this model, a judge could therefore decide that the content used to train AI systems falls under a permitted use that may nevertheless give rise to financial compensation for publishers. Indeed, it is chiefly the amount of that compensation that is currently under discussion between generative AI firms and publishers.

Scrape first, negotiate later

To deploy their tools, companies have not burdened themselves with scruples: they have absorbed all the content at their disposal, as the Common Crawl database shows. It references more than 250 billion web pages, used in particular by the AI models of Google and Facebook.
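As an illustration, Common Crawl's index is publicly queryable, so anyone can check whether a site's pages appear in a given crawl snapshot. Below is a minimal Python sketch against the documented CDX index API at index.commoncrawl.org; the crawl ID and domain are examples chosen for illustration, not a claim about what any particular model was actually trained on.

```python
import json
import urllib.parse
import urllib.request

CRAWL_ID = "CC-MAIN-2023-50"  # one example snapshot among many
DOMAIN = "nytimes.com"        # illustrative lookup target

params = urllib.parse.urlencode({
    "url": f"{DOMAIN}/*",  # wildcard: any page under the domain
    "output": "json",      # one JSON record per line
    "limit": "5",          # a handful of records is enough here
})
url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{params}"

# Each returned line describes one captured page (URL, capture time, etc.).
with urllib.request.urlopen(url) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        record = json.loads(line)
        print(record["timestamp"], record["url"])
```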

In this immense web directory, among the sites most heavily absorbed by AI we find Wikipedia, of course, but also major news sites (the Guardian, Forbes… and the New York Times), as well as pirated-book sites offering works ranging from J.K. Rowling to Hannah Arendt…

Shielded by “fair use”, the extractive companies of generative AI do not find copyright a particularly risky terrain. The short history of copyright's disruption by digital technology shows that the law has adapted to let technological innovation advance rather than to stop it.

The same was true of the European General Data Protection Regulation (GDPR), which framed the circulation of personal data more than it locked it down, introducing numerous exceptions to facilitate its collection.

Can publishers withdraw their content from this computational indexing? OpenAI has put such a system in place. The practical reality of withdrawal remains to be seen: refusing indexing currently applies more to newly published content than to the back catalogue, that is, to articles already ingested into the machines' training data. The opacity of these training “corpora” remains the sticking point, since it prevents rights holders from verifying whether their works feed AI.
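Concretely, this opt-out works through a site's robots.txt file, which OpenAI's documented crawler, GPTBot, is stated to respect. Here is a minimal sketch, using only Python's standard library and a hypothetical site, of how a rights holder might check whether that crawler is allowed:

```python
import urllib.robotparser

# The site is hypothetical; "GPTBot" is the user agent OpenAI documents
# for its web crawler.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's live robots.txt

# A publisher opting out would serve rules such as:
#   User-agent: GPTBot
#   Disallow: /
allowed = robots.can_fetch("GPTBot", "https://www.example.com/2024/some-article")
print("GPTBot may crawl this page:", allowed)
```

Note that such rules only govern future crawls, which echoes the back-catalogue problem described above.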

Not to mention another problem that will be harder to assess: is some content in these corpora weighted differently according to its quality? Data from news sites may well have been given greater weight in the training of AI models because of its higher quality. To train a language model like ChatGPT, information from a site like the New York Times is worth more than information from an amateur site whose sources are unverified.
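To make the idea concrete, here is a purely illustrative Python sketch of quality-weighted sampling, in which higher-scored documents are drawn more often when assembling training data. The scores and document labels are invented for the example; nothing public confirms how, or whether, model builders weight sources this way.

```python
import random

# Hypothetical (document, quality score) pairs.
corpus = [
    ("verified newspaper article", 0.9),
    ("well-sourced encyclopedia entry", 0.7),
    ("amateur post with unverified sources", 0.1),
]
docs, weights = zip(*corpus)

# Draw a small training "batch": higher-scored documents dominate it.
batch = random.choices(docs, weights=weights, k=10)
print(batch)
```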

The major players in generative AI have evidently understood that they need fresh, high-quality content to improve their models and produce appropriate responses. They are starting to pull out their checkbooks to strike agreements with news publishers, as with the Axel Springer group. This is clearly the main consequence of the New York Times complaint.

Under pressure from publishers and from competition, OpenAI has changed its doctrine and opened negotiations, offering to pay annual usage licenses to certain news sites, an approach more opportunistic than strategic, aimed above all at countering other players in the sector that are about to sign similar agreements. That is already the case for Apple, which has entered negotiations with several press groups in order to develop its own AI system.

Copyright is not the only problem on the horizon

But it is not only copyright in training data that is likely to pose a problem. There is also the emerging question of AI-generated content competing with human productions. Consider style transfer, which is a matter not of copyright but of counterfeiting, or even of parasitism, and which could go as far as identity theft, a potentially even thornier issue.

The problem is all the thornier because the productions of generative AI are likely, tomorrow, to be deployed in the same markets as the original works. The royalties that generative AI players pay could then prove inadequate if those players come to offer information services that compete with the press. As the specialist Frédéric Filloux explains: “Who would subscribe for $15 a month to the New York Times if they could find the same thing, done better, from AI providers?”
