Making sense of data - part four.

If you are training an LLM, you will need a lot of data. And the web is full of publicly available material that, for most LLMs, is ideal training data.

But just because material is publicly available on the web doesn’t mean you can do what you want with it.

If the material you want to use is available on someone else's platform then, depending on how you got access to the platform, it's likely that you will be subject to the platform's terms and conditions. For example, both Meta and Google (YouTube) will have made sure that their terms and conditions prevent third parties from scraping their data. (Having your own platform, as Meta and Google do, gives you a substantial advantage in the AI race.)

If the material you want to scrape is protected by copyright, then the question arises whether the scraping and subsequent use is a breach of copyright. This is the core of the New York Times v OpenAI litigation, with OpenAI arguing that its use was fair use. But fair use is a creature of US copyright law and does not exist (at least in that form) in other countries. Some jurisdictions, such as the UK and the EU, have already put in place laws that prevent web scraping for commercial use without the permission of the copyright owner, even if the material is publicly available. In fact, Google was just fined €250m by the French competition authority for (amongst other things) letting Bard/Gemini loose on French materials without having obtained the appropriate permissions.

The EU AI Act takes it a step further: if your LLM was trained (outside the EU) on data which did not respect the copyright holders' rights, then you can't put that LLM on the market in the EU.

And then there's the processing of personal data. Clearview AI is a company that scraped photos from the internet (i.e. from around the world, including France and the UK) so as to build up a database which could be searched using facial recognition: its main customers were law enforcement agencies looking for bad guys. Both the French CNIL and the UK's ICO made it clear that "the 'publicly accessible' nature of data does not affect the qualification of personal data and that there is no general authorisation to re-use and further process publicly available personal data".

However, it looks like there's going to be a shift on this. Clearview AI scraped photos and then held on to them. Moreover, its data processing was specific to the individual: the photos identified as Joe Bloggs were labelled differently from the photos of Jane Smith. LLMs make a much more transient use of data and are (usually) indifferent to the individual to whom the data relates. A few weeks ago the European Data Protection Board (a collective of all the EU data protection regulators) published a paper indicating that, provided you put in place the appropriate safeguards, you could justify web scraping of personal data by an LLM as a legitimate interest.