Protecting Your Digital Content: Copyright Challenges in the Age of AI and Web Scraping

The Curious Codex

100% Human Generated
2024-07-30 Published, 2024-08-29 Updated
2613 Words, 14 Minute Read

The Author

Richard (Senior Partner), GEN UK Blog

Richard has been with the firm since 1992 and was one of the founding partners.

What is 'Publicly Available'?

Your website is, in most cases, copyrighted: it's YOUR hard work, and by publishing it you hold copyright in that 'work product' in most countries. You permit visitors to access your copyrighted work on the implied basis that they cannot reproduce it without permission, except under fair dealing (covered later). The same is true for any media: websites, songs, books, videos and so on.

Google

I think it's fair to say that allowing Google to 'index' your website is a good thing, as it allows people to find it, but of course Google ARE storing a copy of your website's content in their index, and technically that content is copyrighted. Google will say that it's out of necessity, and indeed if you don't want Google to index your site you can stop it with a meta tag or robots.txt.

The Wayback Machine

The Wayback Machine is an archive of web pages captured regularly over time. Again, it's reproducing content without permission, and again you can very easily exclude your site from the Wayback Machine with a header or robots.txt.

The Issue


The issue with both of these is implied authorisation rather than specific authorisation. In a perfect world, Google, the Wayback Machine and many others would, and should, ask permission to do what they do, with a meta tag or robots.txt providing positive consent, rather than just doing it unless we specifically say not to.

This has been debated since the dawn of the internet, and there are valid arguments on both sides, because Google aren't 'selling' your website content (yet); instead they are providing a service TO YOU, allowing people to find your site. The Wayback Machine also provides a service, allowing people to look at websites as they were in the past: a genuine archive of information.

The problem here is that if we allow these companies an 'implied' right, it opens the gates to everyone else. Any company can now, with impunity, scrape the content of your hard work and do whatever they want with it, and if you challenge them, they'll simply fall back on the 'publicly available' implied right, or fair dealing.

This is now becoming a significant issue because AI companies are taking your content and using it to train their models, which they then monetise, profiting from your work product. This is, in my opinion, firmly over the line.

UK Copyright Law

In addition to seeking to protect copyright works, the current legislation (the Copyright, Designs and Patents Act 1988, as amended) includes an allowance for "fair dealing", which permits the use of copyrighted works without permission from the copyright owner under certain conditions, specifically:

Non-commercial research and private study

This applies only to non-commercial purposes and excludes commercial research. It does not cover broadcasts, sound recordings, or films and has limited application to software.

Criticism or review

The work must have been made available to the public. The use must be genuinely for critique or review purposes, which can include the social or moral implications of the work. If you ran a news site, you could quote sections of this page on your site provided you attribute it correctly, but you couldn't reproduce or republish it wholesale.

Reporting current events

This applies to reporting on events that are current and of national or international importance. The exception does not cover photographs, and generally wouldn't cover this sort of article, since it's a stretch to say that it's of national or international importance (only I think that).

Quotation

Short extracts can be used for purposes such as academic articles or history books. The use must be fair and acknowledge the source. It is clear that this is for non-commercial purposes.

Caricature, parody, or pastiche

This allows for the use of copyrighted material to create new works for humour or ridicule, introduced in 2014 for some unknown reason.

Factors Determining Fairness

When determining whether a use qualifies as fair dealing, several factors are considered:

Purpose and character of the use: Whether the use is for a purpose allowed under fair dealing (e.g., criticism, review, reporting current events).

Amount and substantiality: The quantity and importance of the material used. Only as much as is necessary should be used.

Effect on the market: Whether the use competes with the copyright owner’s exploitation of the work. If it does, fair dealing is less likely to be upheld.

Acknowledgement: Whether the original author is sufficiently acknowledged.

Practical Application

Fair dealing is a defence that comes into play when a copyright infringement claim is made. The defendant must prove that their use falls within one of the fair dealing categories and that it is fair. Overall, fair dealing in the UK is more restrictive and specific compared to the US fair use doctrine, focusing on a limited set of purposes and requiring a case-by-case assessment of fairness.

Scope and Jurisdiction

One of the largest issues around copyright infringement is jurisdiction. Does the law applicable to the infringer or the infringed apply, and to what extent?

Principle of National Treatment

The principle of national treatment, established by international conventions such as the Berne Convention, means that a copyright owner is treated as if they are a national of the territory where the infringement occurs. This principle allows the copyright owner to claim the protection of copyright laws in the country where the infringement took place.

However, it's not as clear-cut as simply taking action in the country where the infringement occurred; there's far more to it. Take targeting, for example: if the infringing act targets or influences another country's people, or causes activity relating to the infringed content outside the country of infringement, then courts within that country may assert jurisdiction. For example, if a British blog about British law were plagiarised by a website in Eritrea, but that website was primarily being served to a British audience, then a British court may assert jurisdiction; however, any ruling of that court would of course be unenforceable in Eritrea. A British court could, if the case succeeds, prevent that website from being accessible in the UK (mostly).

The Copyright Tribunal


In the UK we have the Copyright Tribunal, which sounds like a great place to start, but you'll soon find that this aptly named tribunal exists only to resolve commercial licensing disputes; it will not deal with copyright infringement, plagiarism or piracy. Well done.

Foreign litigation

I think that for most people, litigating overseas is prohibitively expensive, notwithstanding the significant variances in copyright law worldwide. Eritrea, Turkmenistan, San Marino, Palau, Kosovo and Somalia, for example, have no practical copyright framework, so any infringement in these countries is, by default, unenforceable. Even if you spent the significant monies required to pursue and succeed against a foreign national or company, the chances of recovering any award, let alone the legal costs, are slim in many parts of the world.

AI Training

What if an AI company uses your website content to train its model? The content is no longer in its original form; instead it is converted to a vector representation and incorporated into a model of vectors that no longer bears the original format, and it could be argued that the original content cannot be reproduced in its original form by the language model. We know that AI companies are scraping website content wholesale, including social media sites, and even using transcriptions of videos for text training, and media for Stable Diffusion/DALL-E training. If you send a query to a language model, the result it produces is constructed from the learned data, not the content verbatim, so how can copyright infringement be established? It currently cannot.
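
To illustrate the point (a toy Python example only, nothing like a real training pipeline): once text is tokenised and mapped to numeric vectors, what the model stores is numbers, not prose.

import hashlib

# Toy stand-in for a learned embedding: deterministically map a token to a
# small vector of floats. Real models learn these vectors during training;
# the point is simply that the stored artefact is numbers, not your words.
def toy_embedding(token, dims=8):
    digest = hashlib.sha256(token.encode()).digest()
    return [b / 255.0 for b in digest[:dims]]

sentence = "your website is copyrighted"
vectors = [toy_embedding(t) for t in sentence.split()]
print(vectors[0])  # eight floats; the original word is not recoverable from them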

Social Media Giants

Nothing you post on social media can be claimed as your copyrighted work. When using these platforms you effectively give up any rights to copyright, and at the same time you permit the platform limited rights to its use. Facebook, for example, scans every post and message by every member to analyse interests for ad campaigns, and is now using content from groups to train its AI models. This is the way it is. Google can and does scan your Gmail emails, your websites and any other publicly available information to populate its search service, including image searches, and is now using that same data to train its AI models. Microsoft, Adobe, Stability and Anthropic all scrape data from websites with impunity to train their AI models and produce monetised content. You may think that GDPR should protect you, but alas no: GDPR within the EU only protects an individual's personal data, like email, phone and address; it doesn't protect any content produced by that individual or any company.

Sham Companies

Possibly an unfair term, but there are many companies out there offering a wide selection of 'copyright protection' services, yet in reality they bring nothing new to the party. Companies who promise to protect your web content do so by automated means, searching for YOUR content on other sites and then sending out notices, which are then mostly ignored. This is something you can easily do yourself, which will cost you less and be more effective, since these companies generate massive amounts of spam and are regularly filtered out. We use images from FlatIcon and Icons8 that we paid for, yet at least once a month one of these companies emails the postmaster to tell us that we're using them illegally. We're not. This sort of spam/scam just devalues the odd legitimate claim that might otherwise have been read.

The American DMCA.com service offers subscribers 'protection' in exchange for their $150/year, and yet when you read the small print it just provides a 'badge', some image watermarking, and a DIY take-down kit, which is a template you can send to someone who has copied your content. It provides no other protection, even when the infringement is by a USA company.

In the UK, you'll instead find a plethora of solicitors lining up to take your money in exchange for doing something about your copyright issue, but as we already know, that amounts to little more than a 'letter' unless you pay a sizeable retainer.

Quick Recap

We've established that whilst copyright law would seem to side with the producer of the work, in the real world it does not. Some companies will take action when content is shared on their platforms: Google, for example, does have a system to handle copyright claims, although it is fundamentally broken. Facebook, when pressed, can be encouraged to remove copyrighted content, yet the process is deliberately long and arduous. All we can really do as content creators is attempt to protect our work at source, using the tools available to us. We'll discuss some of those below.

Policy

You should have a strong website content use policy, and it should be on every page. It should define what can and cannot be done with the content. You may think this is pointless given that it will be ignored by every automated scraper and data miner without exception, but having one gives you potential leverage should you wish to support any manual claim of infringement. You can see GEN's content policy by clicking the link at the bottom of EVERY PAGE. Take a look to get an idea of what's needed.

Watermarks

For images, videos and other media, consider embedding watermarks to help you clearly identify content that has been reproduced unlawfully, but be aware that modern AI tools are able to strip out backgrounds and watermarks fairly easily. There are more advanced approaches, such as digital watermarking, which encodes the mark into the media itself in a way that is harder to remove, but again, AI tools allow images to be 'recreated' from a source image, which muddies the waters greatly.
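
As a minimal sketch of a simple visible watermark (assuming Python with the Pillow imaging library installed; the file names are illustrative):

from PIL import Image, ImageDraw

def add_watermark(src, dst, text="(c) Example Site"):
    # Draw semi-transparent text on a transparent overlay, then composite it
    img = Image.open(src).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.text((10, img.height - 24), text, fill=(255, 255, 255, 128))
    Image.alpha_composite(img, overlay).convert("RGB").save(dst)

add_watermark("photo.jpg", "photo_watermarked.jpg")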

Automation

We can use robots.txt, a special file that, when placed in the root directory of your website, tells 'some' spiders/scrapers to ignore certain files and folders, or all files and folders. Let's look at the example below:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Here we are telling GPTBot (OpenAI's web crawler), ChatGPT-User (used when ChatGPT browses the web on a user's behalf), Google-Extended (used when Google gathers data to train Gemini) and CCBot (Common Crawl's open source web crawler) that you DO NOT allow them to access any files on your website. You are trusting that these automated software solutions will take account of this, and whilst they should, we know many don't.

You can attempt to block them with .htaccess in Apache (see our blog post 20240518 for more information and examples).
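
A minimal sketch (assuming Apache 2.4 with mod_rewrite enabled; extend the user-agent list as needed):

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Return 403 Forbidden to known AI crawlers matched by User-Agent
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|Google-Extended|CCBot) [NC]
  RewriteRule .* - [F,L]
</IfModule>

Unlike robots.txt, this doesn't rely on the crawler's cooperation, but user-agent strings are trivially spoofed, so it only stops crawlers that identify themselves honestly.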

You can use metadata in the web page header, for example:

<meta name="robots" content="index, follow, noimageai, noimageindex">

This tells search engines to index your pages and follow links on them, but not to use images for AI training or indexing. Again, this is not respected by all, in fact only by a few, so to that extent it has little effect.

A GitHub proposal (which isn't a standard and probably never will become one) suggests that we should have a meta tag specifically for AI content scrapers, for example (illustrative only, since the exact naming varies between proposals):

<meta name="ai" content="notrain, nosearch">

There is much debate around this, but the proposals all follow the same basic idea: that a meta tag should be able to exclude a page from AI use, whether for search, training or both. Likewise, there are proposals for other mechanisms, like extensions to robots.txt and even a .well-known/ai-consent.txt file, but none of these are standards.

Conclusion

For 'human' reproduction of copyright material, you can track the infringer down, prepare a case and issue a claim, but here are the problems with that:

  1. You will need to prove more likely than not, that the reproduction is a direct copy
  2. You will need to prove that it's not covered by any of the exceptions above
  3. You will need to prove that you have suffered financially in order to quantify the claim
  4. You will need to prove that the infringer is in scope for a claim, based on their location

Ultimately, it's going to be a great deal of work, and the rewards will likely be small, if any. You may succeed in some claims, but you will fail in others, and overall, unless your content is highly valued, the cost will far outweigh the reward.

At this time, July 2024, there is no effective way to indicate that your website content is not free to use without limit. Having a policy is a good plan, and using the various mechanisms to attempt to limit use can be somewhat effective, but you will always be fighting a losing battle as more and more AI companies spring up, ignore the tags and circumvent your blocks.

GEN's website uses an advanced detection system that records access to the site by IP address, and then analyses that audit log to detect IP addresses that are scraping data or attempting exploits. Assuming those IPs are not known to be Google, we add them to the blacklist, and thereafter any request to the site renders a 403 error. Additionally, repeated requests to our website from those blacklisted IPs add incremental delays in serving the content, slowing them down and eventually producing timeouts. This is the advantage of being a software house with a team of experienced web developers; such a solution is closely integrated with the content management system and is not something we can just retrofit into any site, or I assure you we would offer it.
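
The broad idea looks something like this (a hypothetical Python sketch only, not our actual implementation; the threshold and delays are illustrative):

import time
from collections import defaultdict

REQUESTS = defaultdict(int)    # requests seen per IP
BLACKLIST = defaultdict(int)   # blacklisted IP -> repeat requests since listing
SCRAPE_THRESHOLD = 500         # illustrative: requests before we treat an IP as a scraper

def handle_request(ip):
    """Return an HTTP status code for a request from this IP."""
    if ip in BLACKLIST:
        BLACKLIST[ip] += 1
        # Incremental delay: each repeat request waits a little longer,
        # capped at 30s, eventually pushing aggressive scrapers into timeouts.
        time.sleep(min(BLACKLIST[ip] * 0.5, 30))
        return 403
    REQUESTS[ip] += 1
    if REQUESTS[ip] > SCRAPE_THRESHOLD:  # a real system would also check rate and known-good bots
        BLACKLIST[ip] = 0
        return 403
    return 200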

I genuinely believe that standards will emerge in the coming months and years, but there will always be companies who ignore them. That's the problem with the Internet: it's global, and there is no one jurisdiction to govern its use.

What do you think? Comments welcome



Comments (3)

Simon S · 2024-08-12 09:29 UTC
I think copyright when it comes to the web is no longer a thing that can be controlled. There is so much content theft, and not just by AI but by everyone really, that it's impossible to track. Perhaps if someone managed to embed some sort of tracking into content then it would be easier, but not right now.

Ricky R · 2024-08-10 13:13 UTC
Interesting, but I think you forgot about YouTube and their attempt at copyright protection, or lack thereof. I think the number of fake copyright claims far outweighs any real ones, and the real ones are probably weak at best. If there was ever a way to screw copyright up, YouTube managed it.

alex b · 2024-08-06 14:30 UTC
I think anything you put online loses any copyright because anyone can copy it from anywhere, and few have the money to fight that. AI is just another nail in the coffin of copyright.


--- This content is not legal or financial advice and is solely the opinions of the author ---

