Change and Constancy: Challenges of the AI Industry to the Traditional Copyright Protection System

I Introduction

The rapid advancement of artificial intelligence (AI) technologies, particularly the development and deployment of large language models (LLMs), is having a profound impact on traditional systems of copyright protection. As generative AI systems increasingly contribute to content creation, the challenges of applying traditional copyright frameworks to these new technologies become more pronounced. This article explores the legal complexities surrounding AI-generated content, focusing on copyright issues in LLM development, the extent of copyright protection for AI-generated content (AIGC), and practical recommendations for AI developers and content creators navigating this evolving landscape.

II Copyright Issues Related to LLM Development

Large language models are the backbone of much of today's generative AI. Developing them involves a series of stages, each of which raises legal challenges, particularly in relation to copyright. LLM development can be broken into four critical stages, each presenting its own copyright puzzle. First comes data collection, which concerns how the raw materials are gathered. Second is data storage, which concerns how and where this massive amount of information is held. Third is data training, where the AI actually learns from that data. Finally, there is content generation, the stage at which the AI starts producing new content.

Data Collection

This is where AI systems scoop up the raw ingredients they need to learn. The challenge comes from how this data is obtained.

In China, for instance, the Copyright Law prohibits circumventing technological measures to access copyrighted material. If data is collected by bypassing such digital locks, the collection may constitute copyright infringement. Furthermore, in practice, crawling data from another party's website in violation of the crawler protocol (robots.txt) may also violate the Anti-Unfair Competition Law.
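
As a rough illustration of what honoring the crawler protocol can look like in a collection pipeline, the sketch below uses Python's standard `urllib.robotparser` to check a URL against robots.txt rules before fetching it. The robots.txt content, the bot names, and the example.com URLs are all hypothetical, and the rules are inlined so the sketch runs offline. Passing this check says nothing about copyright or unfair-competition liability; it only shows the technical step.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleTrainingBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A generic crawler may read public pages but not the /private/ area.
generic_public = may_crawl(parser, "GenericBot", "https://example.com/articles/1")
generic_private = may_crawl(parser, "GenericBot", "https://example.com/private/data")

# The site has excluded the (hypothetical) training crawler entirely.
training_bot = may_crawl(parser, "ExampleTrainingBot", "https://example.com/articles/1")
```

A collection job built this way would skip any URL for which `may_crawl` returns `False` and could log the decision for later reference.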

Data Storage

When it comes to storing data, we need to differentiate between temporary and permanent scenarios, as their copyright implications can be quite distinct. 

In the field of copyright law, temporary copying, such as a computer's cache files stored in memory, is generally not considered to constitute infringement. This is because it does not result in the permanent fixation of copyrighted works; rather, it is similar to seeing the works reflected in a mirror—visible but not materially or permanently reproduced. 

In contrast, permanent storage refers to situations where data is stored permanently in a tangible copy, usually in an independent database. This often infringes on the right of reproduction of the copyright holder, unless prior authorization has been granted or a specific fair use exemption applies.

Data Training

Data training is the core stage of AI development: this is where the AI learns from, analyzes, and processes massive amounts of data. Whether such use qualifies as fair use is a balancing act worldwide, and different countries take markedly different approaches.

In China, within the framework of the Berne Convention, a three-step test applies to fair use: the use must be allowed only in special situations, must not affect the normal use of the work, and must not unreasonably harm the rights holder's legitimate interests. Although AI training isn't explicitly listed in China's Copyright Law as a type of fair use, a Chinese court has recognized AI training as fair use in practice, depending on the specific circumstances.

In the United States, fair use is assessed under four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the market for the work. An American court has also recently recognized AI training as fair use.

In the European Union, the 2019 Digital Single Market Directive (DSM) introduced a widely discussed "opt-out" mechanism. It allows AI developers to use data for text and data mining without authorization, unless the copyright owner has expressly reserved that right. That is, if the copyright owner makes no clear statement, this is treated as "implied consent". The intent was to balance fostering AI innovation with protecting creators' interests. However, there is significant ongoing debate about what constitutes a valid reservation. Is a human-readable note enough, or must the statement be machine-readable? Considerable controversy and uncertainty remain in practice.
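
To make the "machine-readable" question concrete, here is a minimal sketch of how a crawler might look for a reservation signal in HTTP response headers. It assumes the `tdm-reservation` header convention proposed by the W3C TDM Reservation Protocol (TDMRep) community group, where a value of `1` signals that rights are reserved. That convention is only one candidate, and whether it (or any other signal) satisfies the Directive is exactly the open question described above. The function name and sample headers are hypothetical.

```python
from typing import Mapping


def has_machine_readable_reservation(headers: Mapping[str, str]) -> bool:
    """Return True if the response headers carry a TDMRep-style reservation.

    Assumption: the site uses the `tdm-reservation: 1` header convention.
    A False result means only that this one signal was not found; the owner
    may still have reserved rights some other way (e.g. in terms of use).
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"


# Hypothetical response headers from two different sites.
reserved = has_machine_readable_reservation({"TDM-Reservation": "1"})
unreserved = has_machine_readable_reservation({"Content-Type": "text/html"})
```

A cautious pipeline would exclude any source for which a reservation is detected and treat a missing signal as "unknown" rather than as consent.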

In short, the key takeaway across all these jurisdictions, despite their different approaches, is that fair use is determined case by case, without complete certainty.

Content Generation

Content generation is when the AI creates new content based on previous stages. But what happens if the content it generates is substantially similar to an existing copyrighted work?

In such cases, courts often apply the "access plus substantial similarity" test. If the copyrighted work was published first, access is presumed. Therefore, if AI-generated content is substantially similar to that work, it is likely to constitute infringement. In addition, rights holders may challenge not only the final output but also the use of their works for training without permission.

To help illustrate this, consider the Hangzhou Ultraman Case. In this case, the plaintiff sued an AI platform for generating images substantially similar to the iconic Ultraman character. 

The court distinguished between the input stage (training data) and the output stage (generated content). The output stage constituted infringement because of the substantial similarity. For the input stage, however, the judge held that data training could be regarded as fair use, because the training process is not aimed at using the original expression of the work, does not affect the normal use of the work, and does not unreasonably damage the interests of the rights holder.

A similar case in the U.S. is the authors' suit against Anthropic (Bartz v. Anthropic). The plaintiffs sued the AI company for using their books to train large language models. The case involved three accused acts: using the books for training, digitizing lawfully purchased books for storage, and storing pirated books.

The court issued a mixed ruling: the first two acts, training and digitizing lawfully acquired books, were considered fair use. But the third act, storing pirated books, was found to be infringement.

The key takeaway is that AI developers must ensure their training data is sourced legally, especially to avoid storing pirated content.

III Is AI-Generated Content Protected by Copyright?

Is AI-generated content protected by copyright? This isn't as simple as it sounds—it goes to the very heart of what copyright is for.

The bedrock principle of copyright law is human authorship and originality. One of the legislative purposes of copyright law is to encourage humans to create and disseminate works. Since AI is not a civil subject and cannot be incentivized by copyright law, it generally cannot be recognized as an author or copyright holder. This resembles the well-known U.S. case in which a monkey took a selfie but could not hold the copyright.

For AI-generated content to be considered a copyrightable work, there must be a human contribution. If the content is generated entirely by AI without any human involvement, or with only minimal human creative contribution, such as a simple prompt like "draw a cute cat," it cannot be protected by copyright. By contrast, if a human plays a direct and major role in creation, with AI providing only limited assistance, the result may constitute a copyrightable work. The most controversial issue is what level of human intellectual input qualifies for protection under copyright law. Jurisdictions differ on this threshold.

In cases like the Thaler case, where the image was autonomously generated by AI without any human intervention, the court ruled that the image was not protected by copyright. The reasoning was simple: there was no human involvement in the creative process.

However, where humans provide complex prompts and parameters to the AI model, some jurisdictions, such as China, have ruled that the resulting AI-generated content can be copyrighted. In the Chunfeng Case, the court held that detailed prompts and ongoing adjustments by the human user could inject enough human originality for copyright protection.

But similar facts may lead to different results in the United States. In the Théâtre D'opéra Spatial case, an artist used Midjourney to create an image. Although the artist entered at least 624 prompts and used editing tools, the Copyright Office ruled that the work was not copyrightable because the final image still largely reflected the AI's data and algorithm rather than human creativity.

This comparison raises the question: what level of human input is truly significant enough to count as authorship? 

In the Rose Enigma case, the artist used AI to assist in creating an image, but the core of the work started from her own hand-drawn sketch. The U.S. Copyright Office granted copyright protection for the human-created aspects of the work but specifically excluded the AI-generated elements. This case makes clear that AI-assisted human works can still qualify for copyright, while purely AI-generated content is excluded.

Another case involved content completely created through an AI model. In the A Single Piece of American Cheese case (2025, USA), the author was granted copyright for the way he selected, coordinated, and arranged the AI-generated elements. This case shows that even when AI is used in the creative process, the human choices in how those elements are put together may also be protected by copyright, though the purely AI-generated content itself should still be excluded.

In conclusion, human originality remains the most important factor for AI-generated content to receive copyright protection. However, different jurisdictions may take different views on what level of human input is required.

IV Practical Suggestions for AI Developers and Content Creators

Given these complexities and differing rulings, what does this all mean for developers and creators trying to move forward? Here are some practical recommendations:

For AI developers: ensure all training data is legally sourced; avoid using pirated data, especially for permanent storage; avoid circumventing technological protection measures; respect rights holders' explicit reservations of rights; verify output content isn't substantially similar to protected works.

For AI content creators: use complex prompts and parameters (rather than simple inputs) to demonstrate creative contribution; apply local or personalized adjustment tools to enhance human contribution; combine AIGC with original human-created content; maintain detailed records of the creation process to support later proof and assertion of rights.
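
The last recommendation, keeping detailed records of the creation process, can be supported by simple tooling. The sketch below is one possible approach, not an established standard: each creation step (prompt plus output) is stored as a record that includes a hash of the previous record, so later edits to the log become detectable. All field names and sample values are hypothetical, and whether such a log carries evidentiary weight is a legal question the code does not settle.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """Hash a record body in a stable (sorted-key) JSON serialization."""
    serialized = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()


def make_record(step: int, prompt: str, output: bytes, previous_hash: str = "") -> dict:
    """Create one log entry describing a single creation step."""
    body = {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output_sha256": hashlib.sha256(output).hexdigest(),
        "previous_hash": previous_hash,  # links this entry to the one before it
    }
    body["record_hash"] = _digest(body)
    return body


def verify_chain(records: list) -> bool:
    """Check that no entry was altered and that the links are unbroken."""
    previous = ""
    for record in records:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if record["previous_hash"] != previous or _digest(body) != record["record_hash"]:
            return False
        previous = record["record_hash"]
    return True


# Hypothetical two-step session: an initial prompt and a later refinement.
first = make_record(1, "initial prompt text", b"draft-image-bytes")
second = make_record(2, "refined prompt text", b"revised-image-bytes", first["record_hash"])
creation_log = [first, second]
log_is_intact = verify_chain(creation_log)
```

Storing the log alongside the intermediate outputs lets a creator show, step by step, which prompts and adjustments led to the final work.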

V Conclusion

AI is transforming the creative landscape. As AI technology evolves, we will need greater clarity on these copyright issues. However, one principle remains certain: human involvement continues to be central to the legal protection of creative works.

By respecting copyright laws, using data responsibly, and ensuring that AI tools complement (rather than replace) human creativity, we can navigate this rapidly evolving landscape.


*This article is adapted from the author's presentation at the 2025 AIPPI China Youth IP Dialogue.