AI versus Copyright

The issue of LLMs using other people’s copyright works as training data is really hotting up.

The biggest case to date was OpenAI being sued by the New York Times for wholesale copying of NYT articles (200 years’ worth).

However, the French competition authority has just fined Google €250 million for, amongst other things, allowing Bard (now Gemini) to scrape – without permission – the content of a number of French publishers.

The big difference between these two cases is that, in the EU, since the 2019 Directive on Copyright and Related Rights, rights holders (e.g. newspaper and magazine owners) can prevent web-scraping by reserving their rights, so that their prior agreement is required. (The UK left the EU before the deadline for implementing the Directive and has not adopted it.)

In the US, there is no such legislation and the argument on whether the rights holder’s permission is or is not required turns on fair use, which in turn means going to court to get a resolution.

However, even if OpenAI were to win the argument on fair use, it looks as though that won’t be enough to allow it to market its LLM in the EU. The EU AI Act expressly provides that providers of general purpose AI models must “put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies, the reservations of rights” set out in the 2019 Directive on Copyright and Related Rights.

Does that mean that it’s OK if the LLM doesn’t web-scrape EU copyright works, and only scrapes US copyright works? No, because one of the primary reasons for the provision is to create a level playing field.

As one of the AI Act’s recitals makes clear: “Any provider placing a general purpose AI model on the EU market should comply with this obligation [i.e. right holder consent], regardless of the jurisdiction in which the copyright-relevant acts underpinning the training of these general purpose AI models take place. This is necessary to ensure a level playing field among providers of general purpose AI models where no provider should be able to gain a competitive advantage in the EU market by applying lower copyright standards than those provided in the Union.”
