Technology · 15 min read

Machine Learning Training Data: Patent and Copyright Intersections

Expert analysis of AI training data patents and copyright. Covers landmark cases, licensing strategies, and compliance for machine learning datasets.

WeAreMonsters · 2026-02-03


The intersection of AI training data patent protection and copyright law has emerged as one of the most complex intellectual property landscapes in modern technology. As we analyse the evolving legal framework surrounding machine learning datasets, two strands stand out: training data patents increasingly cover data selection, preparation, and augmentation methods, while copyright disputes centre on the use of copyrighted works in training sets. Together, these strands create a multifaceted IP environment for AI development that demands careful navigation.

This complex legal terrain affects every stakeholder in the AI ecosystem—from researchers developing foundation models to content creators protecting their intellectual property rights. In our experience, understanding both patent strategies for training data methodologies and copyright implications for content usage has become essential for AI companies, publishers, artists, and legal practitioners operating in this rapidly evolving space.

Training Data Patents: Protecting Methodology and Innovation

Recent Patent Activity in Data Processing Methods

The United States Patent and Trademark Office (USPTO) has documented "unprecedented growth" in AI patent applications, with the agency releasing updated data through its Artificial Intelligence Patent Dataset (AIPD 2023) covering 15.4 million U.S. patent documents from 1976–2023 [37][38]. The dataset classifies 13.2 million granted patents and pre-grant publications across eight AI component technologies, demonstrating the expanding scope of AI-related intellectual property protection [38]. We see two particularly notable applications that demonstrate the scope of current patent activity in training data methodologies [1][2].

Salesforce Inc.'s "Automated Data Extraction Pipeline for LLM Training" (US20250060944A1), published 20 February 2025, describes comprehensive methods for preparing code segments and other data specifically for large language model training [2]. This patent application, filed in August 2023, covers the entire pipeline from data extraction through processing, highlighting how companies are seeking to protect their data preparation workflows.

Similarly, a recent patent application (20250117666), "Data Generation and Retraining Techniques for Fine-Tuning of Embedding Models for Efficient Data Retrieval," was published on 10 April 2025 [1]. The application covers methods for generating training samples using large language models to create datasets in which information chunks are marked as relevant or irrelevant to specific queries, then training embedding models on these samples for retrieval and generative AI systems [1].

Synthetic Data Generation Patents

The synthetic data generation sector has witnessed particularly aggressive patent filing activity. Robert Bosch GmbH received patent 12,242,957 on 4 March 2025, covering devices and methods for generating synthetic data in generative networks using gradient-based approaches [3]. This patent represents the growing recognition that synthetic data generation methodologies constitute valuable intellectual property worth protecting.

Qualcomm's recent patent on "Generative data augmentation with task loss guided fine-tuning" (publication 20250157207, May 2025) demonstrates sophisticated approaches to synthetic dataset creation [4]. The method generates synthetic datasets with generative models and tunes them based on feedback from task networks, incorporating iterative refinement capabilities that optimise training data quality [4].

Major technology companies have filed numerous synthetic data patents beyond IBM's portfolio. Microsoft holds a granted patent (US11508360B2) covering synthetic data generation for training natural language understanding models [39], while Mastercard Asia Pacific filed advanced synthetic data training systems patents (US20240046012A1) [40]. Adobe Inc. published a 2025 patent application (US20250078200A1) covering generative neural networks that interactively create digital images based on natural language feedback [41]. OpenAI maintains an active patent (US12061880B2) on systems for generating code using language models trained on computer code [42].

IBM has been particularly active in this space, with filings including "Synthetic data generation" (US20240104168A1) [5][6]. These filings focus on using generative models like GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders) for creating synthetic training data, with recent approaches emphasising task-loss guided optimisation where synthetic data generation is iteratively refined based on feedback from downstream AI models [4].

Data Curation and Quality Filtering Patents

Patents protecting data curation methodologies represent another significant category of training data intellectual property. These inventions often focus on automated methods for assessing and improving training data quality, addressing the critical challenge of maintaining dataset integrity at scale.

Meta-learning approaches like DataRater use meta-gradients to automatically learn which data proves valuable for training, moving beyond manual tuning and hand-crafted heuristics to assign preference weights to individual data points based on their estimated value [7]. This approach optimises for improved training efficiency on held-out validation data and proves particularly effective for filtering low-quality content while reducing computational requirements [7].

Data Filtering Networks (DFNs) represent another patentable approach—neural networks specifically designed to filter large uncurated datasets [8]. Research has demonstrated that filtering quality is distinct from downstream task performance: models trained on small, high-quality datasets may show lower general accuracy yet induce superior filtered training sets compared with models that score higher on general benchmarks [8].
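The idea behind learned quality filtering can be illustrated with a deliberately simplified sketch: score each document on a few surface features and keep only those above a threshold. The features, weights, and threshold below are hand-set for illustration only; systems like DFNs learn these signals from data rather than hard-coding them.

```python
import re

def quality_features(text: str) -> dict:
    """Extract simple surface features a quality filter might score (illustrative)."""
    words = text.split()
    lines = text.splitlines()
    return {
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "symbol_ratio": len(re.findall(r"[^\w\s]", text)) / max(len(text), 1),
        "dup_line_ratio": 1 - len(set(lines)) / max(len(lines), 1),
    }

def quality_score(text: str) -> float:
    """Combine features into a scalar score; weights are hand-set for illustration."""
    f = quality_features(text)
    score = 1.0
    score -= 2.0 * f["symbol_ratio"]           # penalise symbol-heavy boilerplate
    score -= 1.5 * f["dup_line_ratio"]         # penalise repeated lines
    score -= 0.1 * abs(f["avg_word_len"] - 5)  # prefer typical word lengths
    return score

def filter_corpus(docs, threshold=0.6):
    """Keep only documents scoring at or above the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "The court held that training on lawfully acquired data may qualify as fair use.",
    "$$$ >>> ### $$$ >>> ### $$$ >>> ###",
]
kept = filter_corpus(docs)  # only the first, coherent document survives
```

In a learned filter, the scoring function would be a trained classifier rather than these heuristics, but the pipeline shape (score, threshold, keep) is the same.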

Copyright Issues: The Battleground of Content Rights

Copyright Office Guidance and Fair Use Framework

The U.S. Copyright Office released its comprehensive "Copyright and Artificial Intelligence Part 3: Generative AI Training" report on 9 May 2025, fundamentally reshaping how we understand copyright protections in AI training contexts [43][9][10]. The Office concluded definitively that multiple activities in developing and deploying generative AI systems implicate copyright owners' exclusive rights, including data collection, curation, training, and output generation [43].

The report establishes that using copyrighted works to train AI models may constitute infringement of reproduction and derivative work rights, noting that AI model weights can contain copies of training data when models "memorise" that data [44]. The Office also rejected any blanket fair use defence covering all instances of AI development [11].

Rather than providing binary answers, the Copyright Office determined that fair use protection depends on case-by-case evaluation of the four statutory factors under Section 107 of the Copyright Act [9][12]. The report emphasises that training on diverse datasets can be transformative, but transformativeness is "context-dependent," with uses for research or analysis more likely to qualify as fair use than commercial uses that compete with original works [43][44].

The Office stressed that transformativeness analysis considers whether a model generates outputs that "share the purpose of appealing to a particular audience" with the original works [44]. The report rejects arguments that AI training is inherently transformative because it's not expressive or analogous to human learning [44]. This nuanced approach recognises both the innovative potential of AI development and the legitimate rights of content creators.

Web Scraping Legal Landscape

Web scraping for AI training exists within an increasingly complex legal framework that has evolved dramatically in recent years [13]. While scraping publicly available data remains generally legal, the legality depends critically on methodology, content selection, and intended use [14]. Key considerations include respecting robots.txt files, terms of service compliance, and copyright law adherence, alongside ensuring scraping intent aligns with commercial use policies [14].
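Respecting robots.txt is straightforward to automate. The sketch below uses Python's standard-library parser against a hypothetical robots.txt policy; in practice you would fetch the live file from the target site (the domain, crawler names, and directives here are invented for illustration).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; real code would fetch https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: ai-training-bot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A general-purpose crawler may fetch public pages but not /private/
allowed_public = rp.can_fetch("generic-crawler", "https://example.com/articles/1")   # True
allowed_private = rp.can_fetch("generic-crawler", "https://example.com/private/x")   # False

# Under this policy, the AI-training crawler is disallowed entirely
allowed_ai = rp.can_fetch("ai-training-bot", "https://example.com/articles/1")       # False
```

Checking `can_fetch` before every request costs almost nothing, and it creates an auditable record that directives were honoured, which matters given the liability theories discussed above.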

The UK Information Commissioner's Office (ICO) determined that "legitimate interests" represents the only realistic lawful basis for using web-scraped personal data to train generative AI models [15]. However, developers must satisfy a comprehensive three-part test, including necessity requirements and balancing considerations. The ICO emphasises that web scraping for AI training constitutes high-risk, invisible processing, requiring developers to demonstrate that alternative data collection methods prove unsuitable [15].

The role of robots.txt has evolved from a voluntary technical guideline to a potential source of legal liability [16]. Recent legal scholarship argues that violating robots.txt can create civil liability under contract and tort law in common law jurisdictions, including potential claims for trespass to chattels and negligence [16]. Recent court cases demonstrate the unsettled nature of this legal framework: federal courts dismissed a class-action lawsuit against Google for allegedly scraping data to train AI models in June 2024, but allowed plaintiffs to amend their complaint [66]. Similarly, in November 2024, a court dismissed copyright infringement claims in Jobiak v. Botmakers over the scraping of an AI job database, but granted leave to amend [67][68]. These cases suggest courts are still developing jurisprudence on robots.txt violations and AI scraping liability.

Artist and Creator Rights Concerns

Multiple lawsuits against major AI companies from publishers, artists, and creators are actively reshaping data collection practices [13]. The New York Times v. OpenAI lawsuit represents a particularly significant case, with Judge Sidney Stein of the Southern District of New York rejecting OpenAI and Microsoft's motion to dismiss in March 2025, allowing the lawsuit to proceed [51][52]. The court allowed direct infringement claims, contributory copyright infringement claims, and trademark dilution claims to move forward [53].

The court has not yet ruled on whether OpenAI and Microsoft can claim protection under the fair use doctrine [51][52]. OpenAI argues its training on publicly available data is "grounded in fair use," but judges are only beginning to consider this central legal question across multiple AI copyright cases [51][52]. The litigation landscape demonstrates the tension between AI innovation and creator rights protection.

Content creators have raised legitimate concerns about unauthorised use of their works in training datasets, arguing that such use diminishes the market value of their creations and violates their exclusive rights under copyright law. In our experience, these concerns have prompted both legal action and calls for industry-wide licensing standards that provide fair compensation to creators while enabling continued AI development.

Current Disputes: Landmark Cases Shaping the Legal Landscape

Getty Images v. Stability AI: UK Court Ruling

The High Court of England and Wales delivered judgment on 4 November 2025 in Getty Images v. Stability AI Limited ([2025] EWHC 2863 (Ch)), with Mrs Justice Joanna Smith presiding [45][46]. Getty Images sued Stability AI, alleging the company scraped millions of Getty images without consent to train its Stable Diffusion AI model, with claims including copyright infringement, database right infringement, trade mark infringement, and passing off [17][18].

The court largely rejected Getty Images' infringement claims, with limited exceptions [47]. Most significantly, it rejected Getty's secondary copyright infringement claim, finding that Stable Diffusion models do not contain or store reproductions of the works used in training: there are no "copies in the model" [47][48].

However, the court found that Stable Diffusion's inclusion of Getty Images' trade marks in AI-generated outputs constituted trade mark infringement, with responsibility lying with the model provider rather than users [46][47]. The court established as fact that Getty Images' copyright-protected works were used to train Stable Diffusion, wherever that training occurred [46].

The judgment held that an "article" under UK copyright law can be intangible, a position that appears to depart from EU copyright principles, though the judgment did not address this apparent conflict with retained EU law [47][48]. This landmark ruling provides significant precedent for how courts may evaluate AI training data cases globally.

Anthropic Settlement: Precedent-Setting Resolution

Anthropic reached a landmark $1.5 billion (approximately £1.2 billion) settlement in September 2025 to resolve copyright infringement litigation brought by authors over AI training data [20][21]. The settlement covers approximately 500,000 books, providing authors roughly $3,000 (£2,400) per book and marking a significant turning point in AI copyright disputes [20][21].

On 23 June 2025, Judge William Alsup of the U.S. District Court for the Northern District of California issued his "Order on Fair Use" in Bartz et al. v. Anthropic PBC (No. C 24-05417 WHA) [49][50]. The ruling established crucial precedent by determining that Anthropic made "fair use" of books by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson to train its Claude large language model, finding the use transformative and consistent with copyright's purpose [50].

However, Judge Alsup simultaneously ruled that Anthropic's copying and storage of more than 7 million pirated books in a "central library" infringed the authors' copyrights and was not fair use [49][50]. The court rejected Anthropic's justification for downloading books from pirate sites, noting the company had "many places from which" it could have legally purchased books but preferred to steal them to avoid "legal/practice/business slog" [49].

This distinction between legitimate training and illegal data acquisition proved critical, as potential statutory damages could reach $150,000 (£120,000) per infringed work—theoretically exposing Anthropic to hundreds of billions in liability [22]. A trial was scheduled for December to determine damages for the nine books involved (four novels by Bartz, two non-fiction works by Graeber, and three by Johnson), with Anthropic generating over $1 billion in annual revenue from Claude [49][50]. The settlement validates class action approaches to AI copyright claims and suggests courts will scrutinise data acquisition methods regardless of whether the training use qualifies as fair use [23].

Visual Artists' Ongoing Litigation

Visual artists Sarah Andersen, Kelly McKernan, Karla Ortiz, and seven others continue pursuing copyright infringement claims against Stability AI, Midjourney, DeviantArt, and Runway AI [24][25]. The lawsuit, filed in January 2023, alleges these companies unlawfully used billions of images scraped from the internet—including the artists' copyrighted work—to train their systems, particularly Stability's Stable Diffusion model [26]. We find this case particularly instructive for understanding how courts evaluate the relationship between training data acquisition and downstream model outputs.

U.S. District Judge William Orrick indicated in May 2024 he was inclined to allow the copyright lawsuit to proceed, and by August 2024 ruled that artists could continue pursuing claims that the companies illegally stored their works and violated copyright law [24][25]. The judge stated: "The plausible inferences at this juncture are that Stable Diffusion by operation by end users creates copyright infringement and was created to facilitate that infringement by design" [25].

While the judge dismissed other claims including unjust enrichment and breach of contract, the lawsuit's core claim—that using artists' work to train AI systems infringes copyright—has survived dismissal motions [25]. The central legal question of whether AI companies can claim "fair use" of copyrighted data remains unresolved, with the case ongoing in U.S. District Court for the Northern District of California [24].

Technical Considerations: Balancing Innovation and Legal Compliance

Data Quality and Curation Methodologies

Recent 2024–2025 research has revolutionised data quality filtering for large language model pretraining. Classifier-based Quality Filtering (CQF)—the standard method used in GPT-3, Llama, and PaLM—has been challenged by new findings showing that while CQF improves downstream task performance, it doesn't necessarily enhance language modelling on high-quality datasets because it implicitly filters the reference high-quality set itself [54].

Advanced filtering methods now include LLM-based line-level filtering through FinerWeb-10BT, which uses GPT-4o mini to label individual lines across nine quality categories, achieving higher accuracy on HellaSwag benchmarks with up to 25% less data [55]. Multi-dimensional approaches like FIRE integrate multiple quality raters into unified frameworks, requiring less than 37.5% of training data to reach target performance [56], while Meta-rater evaluates data across four dimensions (professionalism, readability, reasoning, cleanliness) and doubles convergence speed for 1.3B parameter models [57]. We typically see organisations adopting these advanced filtering approaches when building enterprise-grade AI systems.

Meta-learning approaches like DataRater, noted above, use meta-gradients to estimate the value of individual training points, achieving significant compute-efficiency improvements across different dataset scales [58]. Quality assessment frameworks like QuRating train language models to discern specific data qualities, including writing style, required expertise, facts and trivia, and educational value [27]. This method uses pairwise judgments to train QuRater models that assign scalar quality ratings across corpora, enabling selective sampling that balances quality with diversity while improving perplexity and in-context learning performance compared to uniform sampling baselines [27].
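The pairwise-judgment idea behind QuRating can be sketched with a deliberately small stand-in: instead of training a rater model, derive scalar ratings directly from win rates over pairwise comparisons. The document names and judgments below are hypothetical, and win rates are a crude substitute for the learned ratings the actual method produces.

```python
from collections import defaultdict

def ratings_from_pairwise(judgments):
    """Turn pairwise 'winner beats loser' judgments into scalar quality ratings
    via win rates — a toy stand-in for a trained rater model."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in judgments:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {doc: wins[doc] / games[doc] for doc in games}

# Hypothetical judgments: which of two documents is more "educational"
judgments = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
ratings = ratings_from_pairwise(judgments)  # "a" outranks "b", which outranks "c"
```

Once every document carries a scalar rating, selective sampling becomes a weighted draw over the corpus, which is how quality can be traded off against diversity.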

Neural network-based filtering approaches have demonstrated that filtering quality differs fundamentally from downstream task performance [8]. Data Filtering Networks (DFNs) have been used to construct state-of-the-art image-text datasets, with DFN-5B enabling CLIP models that achieve 84.4% zero-shot ImageNet accuracy [8]. Practical frameworks like NeMo Curator provide comprehensive tools for quality assessment using heuristics and ML classifiers [28].

Synthetic Data as Legal Alternative

Synthetic data generation has emerged as a crucial strategy for avoiding copyright complications while maintaining training effectiveness. Recent patent activity demonstrates significant innovation in this space, with companies developing sophisticated approaches to synthetic dataset creation that reduce reliance on potentially infringing content.

Qualcomm's task-loss guided approach generates synthetic datasets with generative models and refines them based on feedback from task networks, incorporating iterative refinement capabilities that optimise training data quality [4]. This methodology addresses both legal concerns and technical requirements by creating original content specifically designed for training purposes.

IBM's advanced synthetic data training and generation methods focus on using generative models like GANs and VAEs for creating synthetic training data [5][6]. These approaches enable companies to develop high-quality training datasets while avoiding the legal complexities associated with copyrighted content usage.
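At a high level, task-loss guided synthetic data generation is a feedback loop: generate candidate data, score it with the downstream task, refine the generator, repeat. The toy sketch below stands in a one-parameter Gaussian "generator" and a squared-error "task loss"; it illustrates only the loop structure, not any company's patented method, and all numbers are invented.

```python
import random

def generate(mu, rng, n=500):
    """Toy generator: synthetic samples drawn from N(mu, 1)."""
    return [rng.gauss(mu, 1.0) for _ in range(n)]

def task_loss(samples, target=3.0):
    """Stand-in for downstream task feedback: squared error of the sample mean
    against the region the task actually needs data from."""
    mean = sum(samples) / len(samples)
    return (mean - target) ** 2

def refine(mu, steps=50, lr=0.3, eps=0.5, seed=42):
    """Iteratively adjust the generator parameter using finite-difference
    estimates of the task-loss gradient (the 'feedback from task networks')."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = (task_loss(generate(mu + eps, rng)) -
                task_loss(generate(mu - eps, rng))) / (2 * eps)
        mu -= lr * grad
    return mu

mu_final = refine(mu=0.0)  # the generator drifts toward the task's preferred region
```

The real systems replace the finite-difference step with backpropagation through a task network, but the shape of the optimisation (generate, evaluate, refine) is the same.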

Federated Learning Privacy Solutions

Federated learning presents another technical approach that addresses both privacy and copyright concerns by keeping raw training data on user devices while sharing only processed model updates with central servers [29]. Google has deployed production ML models using federated learning with formal differential privacy guarantees, including training a Spanish-language Gboard next-word-prediction model using the DP-FTRL algorithm with formal privacy guarantee (ρ = 0.81 zero-concentrated differential privacy) [59][60].

Google also deployed the Smart Text Selection feature using federated learning combined with secure aggregation and distributed differential privacy, reducing memorisation by over two-fold [60]. The National Institute of Standards and Technology (NIST) published updated "Guidelines for Evaluating Differential Privacy Guarantees" in March 2025, providing practitioners guidance on implementing differentially private solutions and identifying common privacy pitfalls [61].

Multiple patents address privacy-preserving approaches in federated learning, representing valuable intellectual property for companies developing these systems. IBM's privacy-preserving federated learning patent (US12160504B2) focuses on encryption and secure aggregation methods, using encryption keys and aggregation vectors to protect participant data during training processes [30]. Microsoft's federated learning system (US20240211633A1) addresses personal information protection in training data, with applications covering how training data content portions are processed while maintaining privacy [31].

Google researchers documented practical secure aggregation for federated learning, designing communication-efficient protocols that enable multiple parties to compute aggregate values without revealing individual training data [29]. Their protocol tolerates up to one-third of users failing during the protocol while achieving reasonable communication efficiency for high-dimensional data, though challenges remain, including verifying server-side differential privacy guarantees and handling large multi-modal models that exceed traditional federated learning frameworks [62].
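The cancellation trick at the heart of secure aggregation is easy to demonstrate: each pair of clients agrees on a random mask that one adds and the other subtracts, so individual submissions look random to the server while the aggregate remains exact. This is a single-round toy with scalar updates and no dropout handling, not the full failure-tolerant protocol described above; the update values are invented.

```python
import random

def masked_updates(updates, seed=0):
    """For each client pair (i, j), draw a shared random mask; client i adds it
    and client j subtracts it. Individually masked updates look random, but
    the pairwise masks cancel exactly in the server's sum."""
    n = len(updates)
    rng = random.Random(seed)
    masked = list(updates)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.uniform(-100, 100)
            masked[i] += mask
            masked[j] -= mask
    return masked

client_updates = [0.5, -1.2, 2.0]          # hypothetical local model deltas
masked = masked_updates(client_updates)    # what the server actually receives

# The server never sees any raw update, yet the aggregate is exact
# (up to floating-point rounding).
aggregate = sum(masked)
```

Production protocols derive the pairwise masks from key agreement rather than a shared seed, and add secret-sharing so the sum survives client dropout, which is where most of the engineering complexity lives.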

Strategic Implications: Navigating the Evolving IP Landscape

AI Company Licensing Strategies

AI companies have significantly accelerated licensing agreements with publishers and content creators to obtain high-quality training data while avoiding copyright disputes. OpenAI has been the most active in securing publisher content licensing deals, with agreements including the Financial Times (April 2024), News Corp (May 2024, reportedly worth over $250 million over five years covering Wall Street Journal and MarketWatch), and Time Magazine (June 2024) [63][64][65]. OpenAI's News Corp deal includes guarantees that content won't appear in ChatGPT immediately after publication [65].

Major 2025 deals demonstrate the growing recognition that licensing represents a crucial strategy for accessing training data legally [32][33][34]. Amazon and The New York Times signed a multi-year deal allowing Amazon to use NYT articles, NYT Cooking, and The Athletic content for AI products like Alexa, including real-time content summaries and model training [33]. Meta entered licensing agreements with seven publishers including CNN, Fox News, People Inc., and USA Today Co. to incorporate content into Meta's Llama LLM, with content from People, Better Homes and Gardens, Allrecipes, and Food and Wine being integrated alongside USA Today's archive and 200+ local publications [34].

These licensing arrangements typically include payment for content access plus additional value exchanges such as privileged access to AI tools and developer support, with publishers using licensed AI tools to create new products [32][63]. Common elements include ChatGPT summaries linking back to source publications and strategic partnerships providing publishers revenue from content distribution—historically excluded from internet giants' profits [63]. Financial terms often remain undisclosed, with deal structures varying between lump-sum payments and pay-per-use models [34].

Content Creator Strategies

The Anthropic settlement has reset the balance of power between AI companies and content creators, providing significant leverage for creators in future negotiations [23]. Content creators are developing increasingly sophisticated strategies for protecting their intellectual property rights while potentially benefiting from AI development.

Artists, authors, and publishers are pursuing both litigation and licensing approaches, recognising that successful legal challenges can establish valuable precedents while licensing agreements provide direct compensation. The class action approach validated by the Anthropic settlement suggests that collective action by content creators can effectively address AI companies' use of copyrighted materials.

Some creators are also exploring technological solutions, including blockchain-based rights management systems and digital watermarking technologies that help identify unauthorised use of their content in training datasets.

Patent Portfolio Development

Companies developing AI training systems are building comprehensive patent portfolios covering data processing methodologies, synthetic data generation techniques, and quality filtering approaches. These portfolios serve multiple strategic purposes: protecting proprietary methodologies, creating licensing opportunities, and establishing defensive positions against potential patent infringement claims.

The USPTO's January 2025 Artificial Intelligence Strategy emphasises the agency's role in promoting responsible AI innovation while addressing AI-related patent examination [35]. This guidance encourages companies to develop patent strategies that align with both innovation goals and responsible AI development principles. We advise clients to consider patent portfolio development as part of their broader IP strategy when investing significantly in AI training infrastructure.

Future Outlook: Regulatory Developments and Industry Standards

EU AI Act Training Data Requirements

The European Union's AI Act Article 53 establishes comprehensive obligations for providers of general-purpose AI models, including specific transparency requirements regarding training data [69][70]. Providers must "draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office" [70].

According to Recital 107, training data summaries should be "generally comprehensive in scope instead of technically detailed" and must include main data collections or sets used in training (such as large private or public databases or data archives), narrative explanations about other data sources, and identification of text and data protected by copyright law [36][70]. The requirements aim to facilitate parties with legitimate interests—including copyright holders—to exercise their rights under EU law [36].

While providers must disclose training data information, they can protect trade secrets and confidential business information through appropriate confidentiality measures [36]. The AI Office is expected to provide a template for the summary that is "simple, effective, and allow[s] the provider to provide the required summary in narrative form" [36]. Providers must also prepare technical documentation (Annex XI) for regulators and maintain detailed information for downstream providers integrating the model (Annex XII) [70]. For organisations operating in both the UK and EU, we recommend developing documentation processes that satisfy both regulatory frameworks simultaneously.
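Pending the AI Office's final template, providers can still prototype their documentation pipeline. The structure below is a hypothetical training data summary sketched in JSON; every field name, source name, and value is illustrative rather than mandated by the Act, and a real summary would follow the official template once published.

```python
import json

# Hypothetical Article 53 training data summary; field names are illustrative,
# not taken from any official template.
summary = {
    "model_name": "example-model-v1",
    "main_data_collections": [
        {"name": "Example Web Corpus", "type": "public web crawl",
         "period": "2020-2024"},
        {"name": "Licensed News Archive", "type": "licensed",
         "licensor": "Example Publisher"},
    ],
    "other_sources_narrative": (
        "Additional data included publicly available code repositories "
        "and synthetic data generated in-house."
    ),
    "copyright_protected_content": {
        "identified": True,
        "opt_out_mechanisms_honoured": ["robots.txt"],
    },
}

serialised = json.dumps(summary, indent=2)   # publishable, machine-readable form
roundtrip = json.loads(serialised)           # sanity-check the document parses
```

Keeping the summary machine-readable from day one makes it easier to regenerate per training run and to reconcile it against internal provenance records.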

Industry Licensing Standards

The acceleration of licensing agreements between AI companies and content creators suggests the emergence of industry-wide standards for training data acquisition. These standards will likely address fair compensation mechanisms, usage restrictions, and attribution requirements that balance AI innovation with creator rights protection.

AI companies are pursuing licensing aggressively because they have "exhausted all easily accessible data" and face copyright lawsuits over unpaid content use [33]. This dynamic creates market pressure for standardised licensing frameworks that provide certainty for both AI developers and content creators.

Technological Solutions

Technical innovations are emerging to address training data challenges while respecting intellectual property rights. Synthetic data generation continues advancing, with more sophisticated methods for creating training datasets that maintain effectiveness while avoiding copyright complications.

Privacy-preserving techniques like federated learning and differential privacy offer approaches that protect sensitive information while enabling model training. These technologies may become standard requirements as regulatory frameworks evolve and privacy concerns intensify.

Automated content identification and filtering systems are also advancing, providing tools for AI companies to identify and exclude copyrighted materials from training datasets. These systems may become essential compliance tools as legal requirements for training data transparency increase.
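A simple version of such content identification hashes overlapping word n-grams ("shingles") of known protected works and checks candidate training text against that index. The scheme below is a generic illustration with invented sample text, not any vendor's system; production filters add normalisation, near-duplicate matching, and far larger indexes.

```python
import hashlib

def shingle_fingerprints(text, n=8):
    """Hash overlapping word n-grams of a text — a basic fingerprinting scheme
    for spotting known passages in candidate training data (illustrative)."""
    words = text.lower().split()
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def overlaps_known_work(candidate, known_fingerprints, threshold=0.3):
    """Flag a candidate if enough of its shingles appear in the known-work index."""
    fps = shingle_fingerprints(candidate)
    return len(fps & known_fingerprints) / max(len(fps), 1) >= threshold

# Build an index from a (pretend) protected passage
protected = "it was the best of times it was the worst of times " * 3
index = shingle_fingerprints(protected)

copy_flagged = overlaps_known_work(protected, index)                            # True
novel_flagged = overlaps_known_work("completely unrelated text about patents",
                                    index)                                      # False
```

Flagged documents can then be excluded from the training set or routed to a licensing review, giving the transparency obligations discussed above a concrete enforcement point.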

What You Should Do: Practical Steps

Whether you are developing AI systems or protecting creative content, the current legal landscape demands proactive engagement with training data issues.

For AI Developers

| Priority | Action | Why It Matters |
| --- | --- | --- |
| Immediate | Audit training data sources | Identify potential copyright exposure before litigation risk materialises |
| Short-term | Implement data provenance tracking | Essential for EU AI Act compliance and legal discovery |
| Ongoing | Evaluate synthetic data alternatives | Reduces reliance on potentially infringing content |
| Strategic | Consider licensing partnerships | Major players are securing content rights; delays increase costs |

For Content Creators

| Priority | Action | Why It Matters |
| --- | --- | --- |
| Immediate | Document your published works | Creates evidence trail for potential claims |
| Short-term | Review platform terms of service | Understand what rights you may have already licensed |
| Ongoing | Monitor AI outputs for your content | Several services now offer AI copyright detection |
| Strategic | Engage with collective licensing discussions | Industry standards are forming now; participation matters |

What NOT to Do

  • Do not assume fair use protects all training – Courts are evaluating this case-by-case, and the Copyright Office explicitly rejected blanket fair use defences
  • Do not ignore robots.txt compliance – Violating these directives may create legal liability beyond mere technical violations
  • Do not acquire training data from pirate sources – The Anthropic ruling explicitly rejected this approach, even when training itself qualifies as fair use
  • Do not conflate UK and US copyright frameworks – Fair use is a US doctrine; the UK has narrower fair dealing exceptions that may not cover commercial AI training

Conclusion

The intersection of AI training data patents and copyright law represents one of the most dynamic and consequential areas of modern intellectual property law. As we have examined throughout this analysis, patent protection for training data methodologies continues expanding through sophisticated synthetic data generation techniques, automated curation systems, and privacy-preserving approaches that offer innovative solutions to complex legal and technical challenges.

Copyright issues surrounding training data usage have reached critical mass, with landmark settlements like Anthropic's $1.5 billion (£1.2 billion) agreement and ongoing litigation establishing new precedents for how courts evaluate fair use claims in AI contexts. The distinction between legitimate training on lawfully acquired data and infringement through unauthorised acquisition has emerged as a crucial legal principle that will guide future development practices.

The strategic implications for stakeholders are profound: AI companies must develop comprehensive approaches combining patent portfolio development, licensing agreements, and technical solutions that respect creator rights while enabling continued innovation. Content creators have gained significant leverage through successful litigation and class action approaches, creating opportunities for fair compensation while contributing to AI advancement.

Looking forward, regulatory developments like the EU AI Act's transparency requirements and the U.S. Copyright Office's nuanced fair use framework will shape industry practices globally. The emergence of standardised licensing models, combined with advancing synthetic data generation and privacy-preserving technologies, suggests a future where AI training data challenges can be addressed through collaborative industry solutions rather than adversarial litigation.

We anticipate continued evolution in this space as courts provide additional guidance on fair use boundaries, regulatory authorities refine compliance requirements, and industry stakeholders develop mutually beneficial frameworks for balancing innovation with intellectual property protection. Success in this environment requires understanding both the technical and legal dimensions of training data challenges while remaining adaptable to rapidly evolving best practices and regulatory requirements.

Important: This article provides general information about AI training data and intellectual property law. It does not constitute legal advice. The legal landscape in this area is evolving rapidly, and specific circumstances may require different approaches. We recommend consulting qualified legal counsel for matters involving copyright infringement claims, patent strategy, or regulatory compliance.


Sources

[1] Data Generation and Retraining Techniques for Fine-Tuning of Embedding Models. USPTO Patent Application 20250117666. https://patents.justia.com/patent/20250117666

[2] Automated data extraction pipeline for large language model training. US Patent Application 20250060944A1. https://patents.google.com/patent/US20250060944A1/en

[3] Device and method for the generation of synthetic data in generative networks. US Patent 12,242,957. https://patents.justia.com/patent/12242957

[4] Generative data augmentation with task loss guided fine-tuning. Qualcomm Patent Publication 20250157207. https://patent.nweon.com/40483

[5] Systems and methods for advanced synthetic data training and generation. US Patent Application 20240046012A1. https://patents.google.com/patent/US20240046012A1/en

[6] Synthetic data generation. US Patent Application 20240104168A1. https://patents.google.com/patent/US20240104168A1

[7] DataRater: Meta-Learned Dataset Curation. arXiv:2505.17895v1. https://arxiv.org/html/2505.17895v1

[8] Data Filtering Networks. OpenReview. https://openreview.net/pdf?id=KAk6ngZ09F

[9] U.S. Copyright Office Issues Guidance on Generative AI Training. Jones Day Insights, May 2025. https://www.jonesday.com/en/insights/2025/05/us-copyright-office-issues-guidance-on-generative-ai-training

[10] Copyright Office Releases Pre-Publication Report On Copyrighted Works In Generative AI Training. Mondaq, 2025. https://www.mondaq.com/unitedstates/copyright/1627700

[11] The Copyright Office Weighs in on Use of Copyrighted Material in Generative AI Training. Munck Wilson Mandala, 2025. https://www.munckwilson.com/news-insights/copyright-office-generative-ai-training-guidance-2025/

[12] Copyright Office Issues Key Guidance on Fair Use in Generative AI Training. Wiley Rein LLP, 2025. https://www.wiley.law/alert-Copyright-Office-Issues-Key-Guidance-on-Fair-Use-in-Generative-AI-Training

[13] Web Scraping for AI Training: Legal Issues, Best Practices, and What You Need to Know. Dataprixa, 2025. https://dataprixa.com/web-scraping-for-ai-training-legal-issues/

[14] Is Web Scraping Legal? Laws, Ethics, and Best Practices. AiMultiple, 2025. https://research.aimultiple.com/is-web-scraping-legal/

[15] The lawful basis for web scraping to train generative AI models. UK Information Commissioner's Office. https://ico.org.uk/about-the-ico/what-we-do/our-work-on-artificial-intelligence/response-to-the-consultation-series-on-generative-ai/the-lawful-basis-for-web-scraping-to-train-generative-ai-models/

[16] The liabilities of robots.txt. arXiv:2503.06035. https://arxiv.org/pdf/2503.06035

[17] Getty Images -v- Stability AI. UK High Court Judgment, November 2025. https://www.judiciary.uk/judgments/getty-images-v-stability-ai/

[18] Getty Images lawsuit says Stability AI misused photos to train AI. Reuters, February 6, 2023. https://www.reuters.com/legal/getty-images-lawsuit-says-stability-ai-misused-photos-train-ai-2023-02-06/

[19] Getty Images v Stability AI: The UK court's first word on use of copyright works in AI model development. Paul, Weiss Analysis. https://www.paulweiss.com/media/mvzhvtmh/getty_images_v_stability_ai_the_uk_courts_first_word_on_use_of_copyright_works_in_ai_model_development.pdf

[20] Anthropic settles with authors in first-of-its-kind AI copyright infringement lawsuit. NPR, September 5, 2025. https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai

[21] Anthropic to pay authors $1.5B to settle lawsuit over pirated chatbot training material. NPR, September 5, 2025. https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material

[22] Bartz v. Anthropic: Settlement reached after landmark summary judgment and class certification. Inside Tech Law, September 2025. https://www.insidetechlaw.com/blog/2025/09/bartz-v-anthropic-settlement-reached-after-landmark-summary-judgment-and-class-certification

[23] Anthropic Settlement Resets Balance of Power for Content Creators. Bloomberg Law, 2025. https://news.bloomberglaw.com/business-and-practice/anthropic-settlement-resets-balance-of-power-for-content-creators

[24] Stability AI, Midjourney should face artists' copyright case, judge says. Reuters, May 8, 2024. https://www.reuters.com/legal/litigation/stability-ai-midjourney-should-face-artists-copyright-case-judge-says-2024-05-08/

[25] AI companies lose bid to dismiss parts of visual artists' copyright case. Reuters, August 13, 2024. https://www.reuters.com/legal/litigation/ai-companies-lose-bid-dismiss-parts-visual-artists-copyright-case-2024-08-13/

[26] Judge pares down artists' AI copyright lawsuit against Midjourney, Stability AI. Reuters, October 30, 2023. https://www.reuters.com/legal/litigation/judge-pares-down-artists-ai-copyright-lawsuit-against-midjourney-stability-ai-2023-10-30/

[27] QuRating: Training language models to discern data quality. arXiv:2402.09739. https://arxiv.org/abs/2402.09739

[28] Quality Assessment & Filtering. NVIDIA NeMo Framework Documentation. https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html

[29] Practical Secure Aggregation for Privacy-Preserving Machine Learning. Google Research Paper. https://research.google.com/pubs/archive/45808.pdf

[30] Privacy-preserving federated learning. US Patent 12160504B2. https://patents.google.com/patent/US12160504B2/en

[31] System and Method for Federated Learning. US Patent Application 20240211633A1. https://patents.google.com/patent/US20240211633A1/en

[32] AI Content Licensing Deals With Publishers: Complete Updated Index. Variety, 2025. https://variety.com/vip/breaking-down-ai-content-licensing-all-the-publisher-deals-training-ai-models-1236093395

[33] New York Times partners with Amazon for first AI licensing deal. Reuters, May 29, 2025. https://www.reuters.com/business/retail-consumer/new-york-times-amazon-sign-ai-licensing-deal-2025-05-29/

[34] Meta enters AI licensing fray, striking deals with People Inc., USA Today Co. and more. Digiday, 2025. https://digiday.com/media/meta-enters-ai-licensing-fray-striking-deals-with-people-inc-usa-today-co-and-more/

[35] USPTO Artificial Intelligence Strategy. USPTO, January 2025. https://www.uspto.gov/sites/default/files/documents/uspto-ai-strategy.pdf

[36] Recital 107 Transparency obligations for providers of general-purpose AI models concerning training data. EU AI Act. https://ai-act-law.eu/recital/107/

[37] USPTO sees unprecedented growth in AI patent apps. Law360, 2025. https://www.law360.com/articles/1323341/uspto-sees-unprecedented-growth-in-ai-patent-apps

[38] Artificial Intelligence Patent Dataset. USPTO Office of the Chief Economist, January 8, 2025. https://www.uspto.gov/ip-policy/economic-research/research-datasets/artificial-intelligence-patent-dataset

[39] Synthetic data generation for training of natural language understanding models. US Patent 11508360B2. https://patents.google.com/patent/US11508360B2/en

[40] Systems and methods for advanced synthetic data training and generation. US Patent Application 20240046012A1. https://patents.google.com/patent/US20240046012A1/en

[41] Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback. US Patent Application 20250078200A1. https://patents.google.com/patent/US20250078200A1/en

[42] Systems and methods for generating code using language models trained on computer code. US Patent 12061880B2. https://patents.google.com/patent/US20240020096A1/en

[43] Copyright and Artificial Intelligence Part 3: Generative AI Training Report (Pre-Publication Version). U.S. Copyright Office, May 9, 2025. https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf

[44] Copyright Office Weighs In on AI Training and Fair Use. Skadden, May 2025. https://www.skadden.com/insights/publications/2025/05/copyright-office-report

[45] Getty Images -v- Stability AI. UK High Court Judgment 2025 EWHC 2863 (Ch), November 4, 2025. https://www.judiciary.uk/judgments/getty-images-v-stability-ai/

[46] Getty Images issues statement on ruling in Stability AI UK litigation. Getty Images, November 2025. https://newsroom.gettyimages.com/en/getty-images/getty-images-issues-statement-on-ruling-in-stability-ai-uk-litigation

[47] Getty Images v Stability AI: English High Court Rejects Secondary Copyright Claim. Latham & Watkins, November 2025. https://www.latham.london/2025/11/getty-images-v-stability-ai-english-high-court-rejects-secondary-copyright-claim/

[48] Getty Images v Stability AI: English High Court Rejects Secondary Copyright Claim. JD Supra, November 2025. https://www.jdsupra.com/legalnews/getty-images-v-stability-ai-english-4699976/

[49] Judge William Alsup Order on Fair Use. Bartz v. Anthropic, June 23, 2025. https://www.softic.or.jp/application/files/9917/5091/1904/Judge-Alsup-order-on-fair-use-and-infringement-Jun-23-2025.pdf

[50] Anthropic wins key US ruling on AI training in authors' copyright lawsuit. Reuters, June 24, 2025. https://www.reuters.com/legal/litigation/anthropic-wins-key-ruling-ai-authors-copyright-lawsuit-2025-06-24/

[51] Judge explains order for New York Times in OpenAI copyright case. Reuters, April 4, 2025. https://www.reuters.com/legal/litigation/judge-explains-order-new-york-times-openai-copyright-case-2025-04-04/

[52] Judge allows 'New York Times' copyright case against OpenAI to go forward. NPR, March 26, 2025. https://www.npr.org/2025/03/26/nx-s1-5288157/new-york-times-openai-copyright-case-goes-forward

[53] The New York Times Company v. Microsoft Corporation et al. Document 514 (S.D.N.Y. 2025). https://law.justia.com/cases/federal/district-courts/new-york/nysdce/1:2023cv11195/612697/514/

[54] The data-quality illusion: Rethinking Classifier-based quality filtering for LLM Pretraining. arXiv:2510.00866v1, 2025. https://arxiv.org/html/2510.00866v1

[55] FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering. Astrophysics Data System, 2025. https://ui.adsabs.harvard.edu/abs/2025arXiv250107314H/abstract

[56] FIRE: Improving Training of General-Purpose AI. arXiv:2502.00761, 2025. https://arxiv.org/abs/2502.00761

[57] Meta-rater: Multi-dimensional data quality evaluation for large language model training. arXiv:2504.14194, 2025. https://arxiv.org/abs/2504.14194

[58] DataRater: A Meta-Learning Approach to Data Quality Assessment. OpenReview. https://openreview.net/pdf/1f1b76cc4d32555d17590a69a790b866d5c6245d.pdf

[59] Federated Learning with Formal Differential Privacy Guarantees. Google Research Blog. https://research.google/blog/federated-learning-with-formal-differential-privacy-guarantees/

[60] Distributed differential privacy for federated learning. Google Research Blog. https://research.google/blog/distributed-differential-privacy-for-federated-learning/

[61] Guidelines for Evaluating Differential Privacy Guarantees. NIST, March 2025. https://www.nist.gov/publications/guidelines-evaluating-differential-privacy-guarantees

[62] Advanced Differential Privacy in Machine Learning: Challenges and Opportunities. arXiv:2410.08892, 2025. https://arxiv.org/pdf/2410.08892

[63] AI content licensing deals: Where OpenAI, Microsoft, Google, and others see opportunity. CB Insights Research, 2025. https://www.cbinsights.com/research/ai-content-licensing-deals/

[64] OpenAI to use FT content for training AI models in latest media tie-up. Reuters, April 29, 2024. https://www.reuters.com/technology/financial-times-openai-sign-content-licensing-partnership-2024-04-29/

[65] Sam Altman's OpenAI signs content agreement with News Corp. Reuters, May 22, 2024. https://www.reuters.com/technology/sam-altmans-openai-signs-content-agreement-with-news-corp-2024-05-22/

[66] Google Dodges Lawsuit Over Web Scraping for AI Models, for Now. Bloomberg Law. https://news.bloomberglaw.com/litigation/google-dodges-lawsuit-over-web-scraping-for-ai-models-for-now

[67] Court Dismisses AI Scraping Claim, But Grants Leave To Amend. Mondaq, November 2024. https://www.mondaq.com/unitedstates/copyright/1540260/court-dismisses-ai-scraping-claim-grants-leave-amend

[68] Jobiak Case: Court Dismisses Motion Over AI Database Copyright. National Law Review. https://natlawreview.com/article/court-dismisses-ai-scraping-claim-grants-leave-amend

[69] AI Act Explorer: Article 53. EU AI Act Service Desk. https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-53

[70] Article 53: Obligations for Providers of General-Purpose AI Models. EU Artificial Intelligence Act. https://artificialintelligenceact.eu/article/53/
