Technology · 15 min read

Transformer Models and Patent Claims: Technical Analysis for IP Practitioners

Technical analysis of transformer model patents covering attention mechanisms, prior art from academic papers, and claim construction challenges in AI litigation.

WeAreMonsters Technical Team · 3 February 2026

The transformer model patent landscape has undergone dramatic transformation since 2017, when Google's groundbreaking "Attention Is All You Need" paper introduced an architecture that would revolutionise artificial intelligence.1 What began as an academic breakthrough has evolved into one of the most contentious patent battlegrounds in modern technology, with transformer model patent applications increasing from 733 families in 2014 to over 14,000 by 2023, a nearly twentyfold increase.2 This explosive growth reflects not just the commercial importance of transformer architectures underlying ChatGPT, BERT, and countless other AI systems, but also the complex legal challenges these technologies present.

We find ourselves at the intersection of rapid technological advancement and intellectual property law, where foundational academic research collides with billion-dollar commercial implementations. The patent landscape surrounding transformers presents unique challenges: core architectural innovations were disclosed in academic papers before many patent applications were filed, creating extensive prior art that complicates traditional patentability analysis.3 Yet companies continue filing thousands of transformer-related patents, seeking to protect specific implementations, training methodologies, and application-focused improvements.

Our analysis shows that transformer patents cluster around attention mechanisms, positional encodings, and related architectural refinements, and that the extensive prior art from academic publications demands careful claim analysis to distinguish commercial implementations from foundational research contributions. Understanding this landscape requires deep technical expertise in neural network architectures combined with sophisticated patent claim construction skills—a combination we've developed through years of analysing AI patent disputes.

This article provides comprehensive technical analysis of the transformer patent ecosystem, examining key patent filings from Google, OpenAI, and other major players whilst analysing how prior art from academic publications affects patentability. We'll explore specific claim construction challenges, infringement detection complexities, and strategic considerations that patent practitioners must navigate in this rapidly evolving field.

Important: This article provides general technical information about transformer patent analysis for educational purposes. It is not legal advice and should not be relied upon as such. Patent validity and infringement determinations require qualified legal counsel with access to complete claim language and prosecution history.

Transformer Architecture Explained

To understand transformer patent claims, we must first examine the technical architecture that makes these systems revolutionary. The transformer model fundamentally changed how neural networks process sequential data by replacing recurrent connections with attention mechanisms, enabling unprecedented parallelisation and performance improvements.4

Self-Attention Mechanism

At the core of every transformer lies the self-attention mechanism, a computational process that allows the model to weigh the importance of different parts of an input sequence when processing each element. Unlike traditional recurrent neural networks that process sequences sequentially, self-attention enables the model to access any position in the sequence simultaneously.5

The mathematical foundation centres on three learned linear transformations that convert input embeddings into Query (Q), Key (K), and Value (V) matrices. The attention function computes relationships between all positions through the formula:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

Where d_k represents the dimensionality of the key vectors, and the scaling factor prevents large dot products from pushing the softmax into regions where its gradients become extremely small.6 This mechanism allows transformers to capture long-range dependencies without the vanishing gradient problems that plague RNNs when processing long sequences.

The self-attention computation creates an attention matrix where each element represents the relevance between pairs of input positions. This matrix enables the model to focus on relevant parts of the input when generating each output element, providing both interpretability and performance advantages over traditional architectures.7
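
To make the formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is illustrative only: the function and variable names are our own, and a production implementation would add batching, masking, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance between all positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: a 4-token sequence with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # (4, 8) (4, 4)
```

The attn matrix returned here is exactly the attention matrix discussed above: row i records how strongly position i attends to every other position.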

Multi-Head Attention

Rather than performing a single attention computation, transformers employ multi-head attention—running multiple attention functions in parallel, each with different learned parameters.8 This architectural choice allows the model to attend to information from different representation subspaces at different positions simultaneously.

Each attention head operates on different linear projections of the queries, keys, and values, typically with reduced dimensionality (d_model/h, where h is the number of heads). The outputs are concatenated and projected through another learned linear transformation:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

Where each head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), and W^O is the output projection matrix.9 This parallel processing enables the model to capture different types of relationships simultaneously—some heads might focus on syntactic relationships whilst others capture semantic dependencies.

The multi-head mechanism provides significant computational advantages through parallelisation whilst maintaining model expressivity. Each head can specialise in different aspects of the attention computation, creating a richer representation than single-head alternatives.10
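
The following sketch shows the multi-head computation in the self-attention setting, where queries, keys, and values are all derived from the same input X. The projection matrices are random stand-ins for learned parameters, and the structure is a generic illustration rather than any patented or production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, as in the previous sketch.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Concat(head_1, ..., head_h) W_O, where head_i = Attention(X W_q[i], X W_k[i], X W_v[i])."""
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, then apply the output projection

# Toy dimensions: d_model = 512, h = 8 heads, d_head = d_model / h = 64
d_model, h, d_head, seq_len = 512, 8, 64, 10
rng = np.random.default_rng(1)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```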

Positional Encoding

Since transformers lack the inherent sequential processing of RNNs, they require explicit positional information to understand sequence order. The original transformer architecture introduces positional encodings—fixed mathematical functions added to input embeddings to provide position-specific information.11

The standard approach uses sinusoidal positional encodings where each position pos and dimension i follows:

PE(pos, 2i) = sin(pos/10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))

This formulation ensures that positional encodings for different positions have unique patterns whilst maintaining mathematical properties that help the model learn relative position relationships.12 The sinusoidal functions create periodic patterns at different frequencies, allowing the model to extrapolate to sequence lengths longer than those seen during training.
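
The sinusoidal scheme is straightforward to implement directly from the two equations above; the sketch below is illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                       # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]                 # the 2i values
    angles = positions / np.power(10000.0, even_dims / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512): added element-wise to the input embeddings
```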

Alternative approaches include learned positional embeddings, which treat position information as trainable parameters, and more recent innovations like Rotary Position Embedding (RoPE), which incorporates position information directly into the attention computation.13 Each approach presents different trade-offs between performance, interpretability, and extrapolation capabilities—distinctions that become relevant when analysing specific patent claims.
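
Rotary embeddings differ in a way that can matter for claim scope: instead of adding a position vector to the embedding, they rotate pairs of embedding dimensions by a position-dependent angle before the attention dot product. A minimal sketch of the rotation, in its interleaved-pair form and purely illustrative, follows.

```python
import numpy as np

def apply_rotary(x, base=10000.0):
    """Rotate consecutive dimension pairs (x_2i, x_2i+1) of each position by pos * theta_i,
    where theta_i = base^(-2i/d). Applied to query and key vectors before the dot product."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    angles = pos * theta                           # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(4).normal(size=(10, 64))
print(apply_rotary(q).shape)  # (10, 64)
```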

Encoder-Decoder Structure

The original transformer architecture employs an encoder-decoder structure optimised for sequence-to-sequence tasks like machine translation. The encoder processes input sequences through a stack of identical layers (typically 6), each containing multi-head self-attention followed by position-wise feed-forward networks.14

Each encoder layer includes residual connections around both sub-layers, followed by layer normalisation. The mathematical formulation for each sub-layer becomes:

LayerNorm(x + Sublayer(x))

This residual architecture enables training of deep networks whilst maintaining gradient flow, allowing transformers to scale to much greater depths than earlier architectures.15
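
A minimal sketch of this post-LN residual wrapper is shown below; the learned layer-normalisation gain and bias are omitted for brevity, so this illustrates the structure rather than a complete implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each position's feature vector to zero mean and unit variance
    # (learned scale and shift parameters omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-LN arrangement: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Example: wrap an arbitrary position-wise transformation
rng = np.random.default_rng(2)
x = rng.normal(size=(10, 512))
out = residual_sublayer(x, sublayer=lambda h: h @ (rng.normal(size=(512, 512)) * 0.01))
print(out.shape)  # (10, 512)
```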

The decoder stack mirrors the encoder structure but includes additional masked self-attention to prevent positions from attending to future positions during training. Cross-attention layers enable the decoder to attend to encoder outputs, creating the encoder-decoder interaction essential for tasks requiring input-output sequence transformation.16
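
The masking step can be illustrated with a short sketch: score entries that would let a position attend to the future are set to negative infinity before the softmax, so their attention weights become exactly zero. The helper names here are ours, not taken from any specific implementation.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask that is True wherever position i would attend to a future position j > i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def apply_causal_mask(scores):
    # -inf entries become zero attention weight after the softmax.
    masked = scores.copy()
    masked[causal_mask(scores.shape[-1])] = -np.inf
    return masked

print(apply_causal_mask(np.zeros((4, 4))))
```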

Feed-Forward Networks

Within each transformer layer, feed-forward networks provide non-linear transformation capabilities that complement the attention mechanisms. These networks consist of two linear transformations with a ReLU activation between them:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The inner dimensionality typically expands to 4 times the model dimension (e.g., 2048 for d_model=512), creating a bottleneck architecture that provides representational capacity whilst maintaining computational efficiency.17 More recent variants employ GELU activations or other advanced activation functions for improved performance.18

The feed-forward networks operate identically on each position independently, providing position-wise transformation that complements the position-mixing capabilities of attention mechanisms. This architectural separation allows the model to alternate between mixing information across positions (attention) and processing information within positions (feed-forward networks).19
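
A minimal sketch of the position-wise computation, using the illustrative dimensions from the original paper (d_model = 512, inner dimension 2048); the weights are random placeholders for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU expansion to the inner dimension
    return hidden @ W2 + b2                 # projection back down to d_model

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(3)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```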

Key Transformer Patents

Understanding the patent landscape requires examining specific filings from major players who have sought to protect transformer-related innovations.

Google's Foundational Patents

Google holds the most significant transformer patent portfolio, stemming directly from the original "Attention Is All You Need" research. The cornerstone filing is US10452978B2, titled "Attention-based sequence transduction neural networks," with a priority date of 23 May 2017—filed shortly before the paper's public release.[20][21]

This patent covers the fundamental transformer architecture including multi-head attention mechanisms and the encoder-decoder structure. Key claims include methods for processing sequences using attention mechanisms that compute output based on weighted combinations of value vectors, where weights derive from query-key similarity computations.22

Related filings include:

  • US10956819B2 — Continuation patent covering attention mechanism implementations, remaining active until 2038[23]
  • US10740433B2 — Universal Transformers patent with priority date 18 May 2018, covering depth-adaptive computation and ACT (Adaptive Computation Time) mechanisms24
  • US11556786B2 — Decoder-only architectures relevant to GPT-style models25
  • US11157721B2 — Methods for training attention-based neural networks with specific loss functions26

The Google portfolio demonstrates strategic claiming around both architectural innovations and training methodologies. The continuation strategy ensures extended protection whilst the broad foundational claims create potential freedom-to-operate concerns for implementers.27

OpenAI Patent Portfolio

OpenAI has pursued a more selective patent strategy, holding approximately 9 active grants as of 2024—significantly fewer than Google or Microsoft despite OpenAI's prominence in transformer commercialisation.28

Key filings include:

  • US20240020096A1 — Code generation using transformer models, covering methods for predicting code sequences based on natural language prompts29
  • US11520732B2 — Multimodal transformer architectures processing combined text and image inputs30
  • US11562256B2 — Methods for fine-tuning large language models with human feedback (RLHF-related)31

OpenAI's patent strategy emphasises rapid prosecution, with an average approval time of 11 months compared to the USPTO average of 24 months.32 This suggests strategic prioritisation of specific defensive positions rather than comprehensive portfolio building.

Microsoft and Other Major Players

Microsoft's transformer patent activity often focuses on enterprise applications and efficiency improvements:

  • US11379684B2 — Sparse attention mechanisms for processing long sequences33
  • US20220277238A1 — Knowledge distillation methods for transformer compression34
  • US11501172B2 — Document understanding using transformer architectures35

Salesforce has filed significant transformer-related patents including:

  • US20210232773A1 — Unified Vision and Dialogue Transformer integrating BERT-style processing with visual inputs36

Patent Filing Trends

Global patent filing data reveals distinct regional strategies. Chinese entities (Tencent, Baidu, Ping An Insurance) dominate by volume, filing approximately 40% of transformer-related patents globally.37 However, analysis suggests quality distinctions—Chinese filings often cover specific applications whilst Western filings tend toward broader architectural claims.38

University patent activity has increased significantly, with institutions like MIT, Stanford, and Carnegie Mellon filing transformer-related patents, though many through industry partnerships rather than independent prosecution.39

The filing surge post-2020 correlates with GPT-3's release and subsequent commercial interest in large language models, with approximately 8,000 new families filed between 2020 and 2023.40

Prior Art Landscape

Effective transformer patent analysis requires comprehensive understanding of the prior art landscape, which is unusually rich due to the academic origins of these technologies.

"Attention Is All You Need" (Vaswani et al. 2017)

The foundational Vaswani paper presents complex prior art considerations. Published at NeurIPS 2017, the paper's arXiv posting on 12 June 2017 predates most third-party patent applications.41 This academic disclosure establishes prior art against claims to the basic transformer architecture by any party other than Google.

Critical timing analysis shows:

  • arXiv posting: 12 June 2017 (public disclosure)
  • Google priority date: 23 May 2017 (pre-disclosure filing)
  • Conference publication: December 2017 (NeurIPS proceedings)42

The paper explicitly describes multi-head attention, positional encodings, encoder-decoder architecture, and training methodologies—all with sufficient technical detail to enable reproduction. This disclosure renders later-filed third-party patents claiming these fundamentals potentially invalid.43

Open-source code release (the Tensor2Tensor library) further complicates the landscape by providing implementation details that supplement the paper's theoretical descriptions.44

Earlier Attention Mechanisms

Attention mechanisms significantly predate transformers, creating layered prior art that narrows patentable scope.

Bahdanau attention (ICLR 2015): The paper "Neural Machine Translation by Jointly Learning to Align and Translate" introduced attention mechanisms for sequence-to-sequence models, establishing the concept of computing context vectors as weighted sums of encoder states.45 This directly anticipates aspects of transformer attention.

Luong attention variants (EMNLP 2015): Luong et al. presented multiple attention mechanisms (dot product, general, concatenative) that correspond to components within transformer multi-head attention.46

Sequence-to-sequence foundations (2014): Sutskever et al.'s work on sequence-to-sequence learning with neural networks established encoder-decoder architectures that transformers build upon.47

These publications create prior art chains that limit claims to attention mechanisms in general, pushing valid patent scope toward specific improvements and novel combinations.

ArXiv Publications Impact

The machine learning community's extensive use of arXiv creates unusual prior art dynamics. Pre-print posting establishes public disclosure dates that often predate patent filings, even when the posting occurs before peer review.48

Key pre-transformer arXiv publications include:

  • Memory networks (Weston et al., 2014)49
  • Neural Turing Machines (Graves et al., 2014)50
  • End-to-end memory networks (Sukhbaatar et al., 2015)51

Each establishes prior art for specific architectural components that appear in modern transformers, including memory addressing mechanisms that parallel attention computations.

Prior Art Search Challenges

Effective prior art searching for transformer patents faces several challenges:

Terminology evolution: The field's rapid development means relevant disclosures may use different terminology—"soft attention," "memory addressing," "content-based retrieval"—that standard patent searches may miss.52

Cross-disciplinary knowledge: Relevant prior art spans computer science, computational linguistics, cognitive science, and mathematics. Comprehensive searches require expertise across these domains.53

International databases: Significant machine learning research occurs in China, Japan, and Korea. Language barriers and database access complicate thorough prior art investigation.54

Implementation vs. theory: Academic papers may describe concepts theoretically whilst patents claim specific implementations. Mapping between these requires technical understanding of both domains.55

What's Patentable vs Prior Art

Distinguishing patentable innovations from prior art requires systematic analysis of how specific claims relate to disclosed technologies.

Core Architecture Analysis

The fundamental transformer architecture as described in Vaswani et al. constitutes prior art for most purposes:

Likely prior art (not independently patentable):

  • Basic self-attention mechanism (Q, K, V computation)
  • Multi-head attention with concatenation
  • Standard sinusoidal positional encoding
  • Six-layer encoder-decoder structure
  • Residual connections with layer normalisation
  • Position-wise feed-forward networks with ReLU activation56

These elements were publicly disclosed before most third-party patent applications could establish priority. Claims covering these fundamentals face validity challenges unless they include genuinely novel elements.57

Potentially Patentable Innovations

Despite extensive prior art, several categories of innovation may support valid patent claims:

Specific architectural improvements:

  • Sparse attention patterns (Longformer, BigBird)58
  • Linear attention mechanisms (Performers)59
  • Mixture-of-experts integration60
  • Novel layer normalisation positions (Pre-LN vs Post-LN)61

Novel attention mechanisms:

  • Sliding window attention for long sequences62
  • Cross-document attention for retrieval augmentation63
  • Hierarchical attention structures64

Application-specific adaptations:

  • Vision transformers with patch embeddings (though ViT itself is now prior art)65
  • Speech processing modifications66
  • Protein structure prediction adaptations67

Training methodology innovations:

  • Specific pre-training objectives beyond masked language modelling68
  • Curriculum learning strategies69
  • Novel fine-tuning approaches (adapters, LoRA)70

Implementation Details

Hardware-specific optimisations and efficiency improvements represent fertile patenting territory:

  • Flash Attention memory-efficient implementations71
  • Quantisation-aware training methods72
  • Specific parallelisation strategies (tensor, pipeline, data)73
  • Custom hardware accelerator designs74

Commercial vs Academic Boundary

The distinction between academic research and commercial implementation creates a nuanced patentability boundary:

Academic disclosure typically covers:

  • Theoretical principles and mathematical foundations
  • Benchmark performance on standard datasets
  • General architectural descriptions

Potentially patentable commercial implementations may include:

  • Production-scale optimisations
  • Specific deployment configurations
  • Novel combinations addressing commercial requirements
  • Integration with proprietary systems75

This boundary remains contested, with ongoing litigation testing where academic disclosure ends and patentable commercial innovation begins.76

Technical Claim Analysis

Analysing transformer patent claims requires understanding typical claim structures and construction challenges specific to neural network technologies.

Example Claim Elements

Transformer patents typically include both system and method claims. A representative independent method claim might include:

  1. Preamble: "A computer-implemented method for processing a sequence of input tokens, comprising:"
  2. Input processing: "receiving an input sequence and generating embedding vectors"
  3. Attention computation: "computing attention scores between query vectors derived from a first linear projection and key vectors derived from a second linear projection"
  4. Value weighting: "generating output vectors as weighted combinations of value vectors based on the attention scores"
  5. Multi-head specification: "wherein the computing and generating steps are performed by a plurality of attention heads operating in parallel"
  6. Output generation: "concatenating outputs from the plurality of attention heads and applying a final linear projection"[77]

System claims typically mirror method claims whilst specifying processor and memory components that execute the method steps.

Dependent claims narrow scope through specific parameters (number of heads, dimensionality), activation functions, normalisation approaches, or application domains.78
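
Where element-by-element analysis needs to be tracked systematically, a claim chart can be kept as simple structured data. The sketch below is purely illustrative: the labels, claim text, and evidence strings are hypothetical and do not describe any real patent or accused product.

```python
from dataclasses import dataclass, field

@dataclass
class ClaimElement:
    label: str                 # e.g. "Attention computation"
    claim_text: str            # the limitation, quoted verbatim in a real chart
    evidence: list = field(default_factory=list)   # citations to code, documents, or test results

# Hypothetical entries for the representative claim sketched above.
chart = [
    ClaimElement("Input processing",
                 "receiving an input sequence and generating embedding vectors",
                 ["embedding lookup identified in technical documentation (hypothetical)"]),
    ClaimElement("Attention computation",
                 "computing attention scores between query vectors ... and key vectors ...",
                 ["scaled dot-product routine observed during code review (hypothetical)"]),
    ClaimElement("Multi-head specification",
                 "a plurality of attention heads operating in parallel",
                 []),   # an empty evidence list flags an element that remains unmapped
]

unmapped = [element.label for element in chart if not element.evidence]
print("Claim elements still lacking evidence:", unmapped)
```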

Claim Construction Challenges

Several terms present particular construction difficulties in transformer patent litigation:

"Attention mechanism": Courts must determine whether this term encompasses all attention variants or only specific implementations. The Vaswani formulation differs subtly from Bahdanau attention—does a claim to "attention mechanism" cover both?79

"Multi-head": Does this require the specific concatenation approach from Vaswani, or any parallel attention computation? Some architectures use multi-query attention (multiple queries, single key-value set) that may or may not fall within claim scope.80

"Positional encoding": This term encompasses fixed sinusoidal, learned absolute, relative, and rotary variants. Claim construction must determine whether all variants infringe or only specific implementations.81

"Transformer block/layer": The modular nature of transformers means individual components (attention, FFN, normalisation) can be arranged differently. What constitutes a "transformer layer" for claim purposes?82

Distinguishing from Prior Art

Patent applicants and prosecutors employ several strategies to distinguish transformer claims from prior art:

Claim amendment strategies:

  • Adding specific parameter ranges not disclosed in prior art
  • Specifying particular activation functions or normalisation approaches
  • Limiting to specific application domains
  • Including training methodology limitations83

Continuation application patterns:

  • Filing continuations to pursue claims rejected in parent applications
  • Using continuation-in-part applications to add new matter addressing prior art
  • Strategic timing of continuation filings as the competitive landscape evolves84

Narrow vs broad claim approaches:

  • Broad independent claims risk invalidity but provide stronger exclusionary rights if valid
  • Narrow claims increase validity prospects but reduce value against design-arounds
  • Portfolio strategy often includes both approaches across multiple patents85

Technical Expert Requirements

Effective transformer patent analysis requires technical expertise spanning:

Neural network architecture expertise:

  • Deep understanding of attention mechanisms and their mathematical foundations
  • Familiarity with transformer variants and their distinguishing features
  • Knowledge of implementation details affecting claim mapping86

Patent claim interpretation skills:

  • Experience with means-plus-function claim construction
  • Understanding of claim differentiation principles
  • Familiarity with patent prosecution history interpretation87

Prior art familiarity:

  • Comprehensive knowledge of academic literature through 2017
  • Understanding of subsequent developments and their priority dates
  • Ability to map prior art disclosures to specific claim elements88

Infringement Analysis Challenges

Proving transformer patent infringement presents unique challenges due to the nature of AI systems.

Proving Implementation Details

Unlike traditional technologies where physical inspection may reveal implementation details, AI models present "black box" challenges:

Internal architecture opacity: Production AI systems rarely disclose internal architectures. Model outputs alone may not reveal whether specific attention mechanisms, positional encodings, or other claimed elements are present.89

Weight inspection limitations: Even with model weight access, determining architectural choices from parameter counts and shapes requires sophisticated reverse engineering.90

API-only access: Many AI systems are accessible only through APIs that abstract away implementation details entirely.91

Source Code Access Issues

Litigation discovery presents particular challenges:

Trade secret considerations: AI companies may resist source code disclosure, arguing trade secret protection for training procedures and architectural innovations beyond what patents disclose.92

Protective order requirements: Courts typically require stringent protective orders for AI source code, limiting expert access and complicating analysis.93

Code vs architecture gap: Source code review may reveal high-level architecture but obscure specific mathematical operations relevant to claim elements.94

Observable Behaviour Analysis

When direct inspection is unavailable, circumstantial evidence may support infringement analysis:

Model output pattern analysis: Certain architectural choices produce characteristic output patterns. Attention-based models may exhibit specific behaviours on long sequences that differ from RNN-based alternatives.95

Performance characteristic fingerprinting: Latency patterns, memory usage, and scaling behaviour may indicate underlying architecture without direct access.96
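
As a purely illustrative sketch of this fingerprinting idea, the probe below measures how response latency grows with input length and estimates a scaling exponent. The query_model callable is a hypothetical stand-in for an API client, and confounders such as batching, caching, and network variance mean any such measurement is circumstantial at best.

```python
import time
import numpy as np

def probe_latency_scaling(query_model, lengths=(256, 512, 1024, 2048, 4096), trials=5):
    """Median latency at each input length; a log-log slope near 2 is (weakly)
    consistent with full quadratic self-attention, a slope near 1 with linear variants."""
    medians = []
    for n in lengths:
        prompt = "x " * n   # crude proxy for an n-token input
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            query_model(prompt)
            times.append(time.perf_counter() - start)
        medians.append(np.median(times))
    slope = np.polyfit(np.log(lengths), np.log(medians), 1)[0]
    return dict(zip(lengths, medians)), slope

# Usage (query_model is a hypothetical callable wrapping the accused system's API):
# latencies, slope = probe_latency_scaling(query_model)
# print(f"Estimated scaling exponent: {slope:.2f}")
```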

Attention visualisation techniques: Some interfaces expose attention weights, potentially revealing multi-head attention structure and positional encoding approaches.97

Watermarking detection methods: Research into model watermarking may provide techniques for identifying specific implementations, though this remains an emerging area.98

Expert Witness Challenges

Technical experts in transformer patent cases face several difficulties:

Technical complexity explanation: Explaining attention mechanisms, positional encodings, and neural network training to judges and juries without oversimplification that distorts claim construction.99

Claim element mapping difficulties: Mapping specific claim language to actual implementations requires bridging patent terminology and ML engineering vocabulary.100

Prior art comparison requirements: Demonstrating how accused implementations differ from (or match) prior art requires side-by-side technical comparisons that may overwhelm non-technical fact-finders.101

Practical Considerations for IP Practitioners

Given the complexities of transformer patent analysis, we recommend systematic approaches for both offensive and defensive patent work.

Freedom-to-Operate Analysis Framework

When assessing FTO for transformer implementations:

Analysis Step | Key Questions | Resources Required
Architecture mapping | What specific transformer components does the implementation use? | Technical documentation, code review
Claim identification | Which patents have claims potentially covering the implementation? | Patent search, portfolio analysis
Element-by-element analysis | Does each claim element map to the implementation? | Technical expert, claim charts
Prior art assessment | What prior art might invalidate relevant claims? | Literature search, expert analysis
Risk quantification | What is the likelihood of assertion and potential damages exposure? | Market analysis, litigation history

Patent Portfolio Development

For companies developing transformer-based technologies, portfolio strategy should consider:

Defensive publications: Publishing innovations that won't be patented prevents competitors from claiming them, especially important given AI's rapid development cycles.102

Strategic filing timing: Filing before public disclosure (including arXiv posting) is essential for priority date establishment.103

Claim scope calibration: Balancing broad claims (risking invalidity) against narrow claims (limiting exclusionary value) requires understanding the specific prior art landscape.104

Continuation strategy: Planning continuation filings to address evolving competitive landscape and prosecution developments.105

Common Mistakes to Avoid

Underestimating prior art: The academic machine learning literature is vast and often overlooked in traditional patent searches. Comprehensive prior art analysis requires ML expertise.106

Ignoring international filings: Chinese transformer patent filings may create FTO issues for companies operating globally, even if not asserting in Western courts.107

Misunderstanding claim scope: Transformer patent claims often use technical terms with specific meanings that differ from colloquial usage. Proper claim construction requires technical expertise.108

Delayed invalidity analysis: Waiting until litigation to analyse patent validity increases costs and reduces strategic options.109

Costs and Timeline Considerations

Understanding the practical economics of transformer patent work:

Technical Analysis Costs

Service | Typical Range | Factors
Basic FTO search and analysis | £15,000–£30,000 | Scope of implementation, number of relevant patents
Comprehensive prior art search | £10,000–£25,000 | Technology breadth, international coverage
Claim chart development | £8,000–£20,000 per patent | Claim complexity, access to accused system
Expert technical declaration | £15,000–£40,000 | Scope of opinions, supporting analysis
Litigation support (full matter) | £50,000–£150,000+ | Discovery scope, trial requirements

Timeline Expectations

Activity | Typical Duration
Initial FTO assessment | 4–8 weeks
Comprehensive prior art search | 6–12 weeks
Claim chart development | 4–8 weeks per patent
Expert report preparation | 8–16 weeks
Invalidity analysis | 8–16 weeks

These timelines assume reasonable access to technical documentation and cooperation from engineering teams. Limited access or discovery disputes can significantly extend timelines.

Conclusion

The transformer patent landscape presents unique challenges that distinguish it from traditional technology patent analysis. The confluence of academic publication, rapid commercialisation, and fundamental technical innovation creates a complex environment requiring specialised expertise.

Key Technical Insights

Architecture understanding is essential: Effective patent analysis requires genuine understanding of attention mechanisms, positional encodings, and transformer variants—not just surface-level familiarity with terminology.

Prior art is unusually rich: The academic origins of transformer technology mean that comprehensive prior art exists for fundamental concepts. Valid patent claims typically require specific improvements beyond the 2017 baseline.

Infringement proof is challenging: The black-box nature of AI systems complicates traditional infringement analysis. New methodologies for analysing model behaviour and architecture may be required.

Claim construction is technically demanding: Terms like "attention mechanism" and "multi-head" require technical interpretation that courts are still developing. Expert guidance is essential.

Strategic Considerations

For patent practitioners and their clients, we offer these observations:

Invest in technical expertise: Transformer patent work requires collaboration between patent professionals and ML engineers. Neither discipline alone possesses sufficient expertise.

Prioritise early prior art analysis: Given the extensive academic literature, early invalidity analysis can significantly reduce litigation costs and improve negotiating positions.

Monitor filing activity: The transformer patent landscape continues evolving rapidly. Ongoing monitoring of competitor filings and prosecution developments is essential for effective portfolio management.

Consider defensive strategies: Given validity uncertainties, defensive measures (freedom-to-operate opinions, design-arounds, prior art development) may be more cost-effective than relying on enforcement.

The transformer patent landscape will continue evolving as courts address novel claim construction issues and as the technology itself advances. We anticipate that validity challenges will remain central to transformer patent disputes, with prior art from the 2014–2017 period proving particularly relevant.

If you're facing transformer patent analysis challenges—whether FTO assessment, invalidity analysis, or litigation support—we can help. Our technical team combines deep ML expertise with patent analysis experience, providing the specialised support that transformer patent work requires.

This article is for informational purposes only and does not constitute legal advice. Patent validity and infringement determinations require qualified legal counsel.


Sources

[1] Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). Available at: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[2] World Intellectual Property Organization. "Patent Landscape Report: Generative Artificial Intelligence (GenAI)." WIPO, 2024. Available at: https://www.wipo.int/publications/en/details.jsp?id=4682

[3] Nature Scientific Reports. "Mapping the technological evolution of generative AI: a patent network analysis." 2025.

[4] Vaswani, A., et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762, 2017. Available at: https://arxiv.org/abs/1706.03762

[5] Google AI Blog. "Transformer: A Novel Neural Network Architecture for Language Understanding." 2017. Available at: https://blog.research.google/2017/08/transformer-novel-neural-network.html

[6] Vaswani, A., et al. "Attention Is All You Need." Section 3.2.1, Scaled Dot-Product Attention.

[7] Alammar, J. "The Illustrated Transformer." 2018. Available at: https://jalammar.github.io/illustrated-transformer/

[8] Vaswani, A., et al. "Attention Is All You Need." Section 3.2.2, Multi-Head Attention.

[9] Vaswani, A., et al. "Attention Is All You Need." Equations 4-5, Multi-Head Attention formulation.

[10] Clark, K., et al. "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. Available at: https://aclanthology.org/W19-4828/

[11] Vaswani, A., et al. "Attention Is All You Need." Section 3.5, Positional Encoding.

[12] Vaswani, A., et al. "Attention Is All You Need." Equations 6-7, Sinusoidal Positional Encoding.

[13] Su, J., et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint arXiv:2104.09864, 2021. Available at: https://arxiv.org/abs/2104.09864

[14] Vaswani, A., et al. "Attention Is All You Need." Section 3.1, Encoder and Decoder Stacks.

[15] He, K., et al. "Deep Residual Learning for Image Recognition." CVPR 2016. Available at: https://arxiv.org/abs/1512.03385

[16] Vaswani, A., et al. "Attention Is All You Need." Section 3.1, Decoder architecture with masked self-attention.

[17] Vaswani, A., et al. "Attention Is All You Need." Section 3.3, Position-wise Feed-Forward Networks.

[18] Hendrycks, D., Gimpel, K. "Gaussian Error Linear Units (GELUs)." arXiv preprint arXiv:1606.08415, 2016. Available at: https://arxiv.org/abs/1606.08415

[19] Elhage, N., et al. "A Mathematical Framework for Transformer Circuits." Anthropic, 2021. Available at: https://transformer-circuits.pub/2021/framework/index.html

[20] US Patent No. 10,452,978 B2. "Attention-based sequence transduction neural networks." Google LLC, 2019. Available at: https://patents.google.com/patent/US10452978B2

[21] USPTO Patent Assignment Database. Assignment records for US10452978B2.

[22] US10452978B2, Claims 1-20, attention mechanism method and system claims.

[23] US Patent No. 10,956,819 B2. "Attention-based sequence transduction neural networks." Google LLC, continuation patent. Available at: https://patents.google.com/patent/US10956819B2

[24] US Patent No. 10,740,433 B2. "Universal transformers." Google LLC, 2020. Available at: https://patents.google.com/patent/US10740433B2

[25] US Patent No. 11,556,786 B2. "Decoder-only transformer neural network architectures." Google LLC. Available at: https://patents.google.com/patent/US11556786B2

[26] US Patent No. 11,157,721 B2. "Training attention-based neural networks." Google LLC. Available at: https://patents.google.com/patent/US11157721B2

[27] Patsnap Intelligence. "Google AI Patent Portfolio Analysis." 2024.

[28] OpenAI Patent Portfolio Analysis. USPTO Patent Full-Text and Image Database search results, 2024.

[29] US Patent Application No. 2024/0020096 A1. "Code generation using transformer models." OpenAI. Available at: https://patents.google.com/patent/US20240020096A1

[30] US Patent No. 11,520,732 B2. "Multimodal neural network architectures." OpenAI.

[31] US Patent No. 11,562,256 B2. "Fine-tuning language models from human preferences." OpenAI.

[32] USPTO Patent Application Information Retrieval (PAIR). Average prosecution time analysis for OpenAI applications.

[33] US Patent No. 11,379,684 B2. "Sparse attention for neural networks." Microsoft Technology Licensing. Available at: https://patents.google.com/patent/US11379684B2

[34] US Patent Application No. 2022/0277238 A1. "Knowledge distillation for transformers." Microsoft.

[35] US Patent No. 11,501,172 B2. "Document understanding with transformers." Microsoft.

[36] US Patent Application No. 2021/0232773 A1. "Unified Vision and Dialogue Transformer." Salesforce.com. Available at: https://patents.google.com/patent/US20210232773A1

[37] WIPO Patent Landscape Report. "Generative AI patent filing statistics by country." 2024.

[38] IAM Media. "Quality vs quantity: AI patent strategies across regions." 2024.

[39] AUTM Licensing Activity Survey. University technology transfer statistics for AI-related patents.

[40] Lens.org patent database. Transformer-related patent family growth analysis, 2020-2023.

[41] arXiv.org. Submission history for arXiv:1706.03762, showing 12 June 2017 initial posting.

[42] NeurIPS 2017 Conference Proceedings. Publication date and proceedings information.

[43] USPTO Manual of Patent Examining Procedure (MPEP) § 2128. "Printed publications as prior art."

[44] GitHub. Tensor2Tensor repository release history. Available at: https://github.com/tensorflow/tensor2tensor

[45] Bahdanau, D., Cho, K., Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. Available at: https://arxiv.org/abs/1409.0473

[46] Luong, M., Pham, H., Manning, C. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP 2015. Available at: https://arxiv.org/abs/1508.04025

[47] Sutskever, I., Vinyals, O., Le, Q. "Sequence to Sequence Learning with Neural Networks." NeurIPS 2014. Available at: https://arxiv.org/abs/1409.3215

[48] Cornell University Library. arXiv submission and disclosure policies.

[49] Weston, J., Chopra, S., Bordes, A. "Memory Networks." arXiv preprint arXiv:1410.3916, 2014. Available at: https://arxiv.org/abs/1410.3916

[50] Graves, A., Wayne, G., Danihelka, I. "Neural Turing Machines." arXiv preprint arXiv:1410.5401, 2014. Available at: https://arxiv.org/abs/1410.5401

[51] Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R. "End-To-End Memory Networks." NeurIPS 2015. Available at: https://arxiv.org/abs/1503.08895

[52] ACL Anthology. Terminology evolution in attention mechanism literature.

[53] Rogers, A., Gardner, M., Augenstein, I. "QA Dataset Explosion: A Taxonomy of NLP Resources." ACL 2023.

[54] CNKI (China National Knowledge Infrastructure). Chinese AI research database access and coverage.

[55] MPEP § 2131. "Anticipation—Application of 35 USC 102."

[56] Vaswani, A., et al. "Attention Is All You Need." Complete architectural description, Sections 3.1-3.5.

[57] MPEP § 2141. "Examination Guidelines for Determining Obviousness."

[58] Beltagy, I., Peters, M., Cohan, A. "Longformer: The Long-Document Transformer." arXiv preprint arXiv:2004.05150, 2020. Available at: https://arxiv.org/abs/2004.05150

[59] Zaheer, M., et al. "Big Bird: Transformers for Longer Sequences." NeurIPS 2020. Available at: https://arxiv.org/abs/2007.14062

[60] Choromanski, K., et al. "Rethinking Attention with Performers." ICLR 2021. Available at: https://arxiv.org/abs/2009.14794

[61] Fedus, W., Zoph, B., Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models." JMLR 2022. Available at: https://arxiv.org/abs/2101.03961

[62] Xiong, R., et al. "On Layer Normalization in the Transformer Architecture." ICML 2020. Available at: https://arxiv.org/abs/2002.04745

[63] Child, R., et al. "Generating Long Sequences with Sparse Transformers." arXiv preprint arXiv:1904.10509, 2019. Available at: https://arxiv.org/abs/1904.10509

[64] Borgeaud, S., et al. "Improving Language Models by Retrieving from Trillions of Tokens." ICML 2022. Available at: https://arxiv.org/abs/2112.04426

[65] Yang, Z., et al. "Hierarchical Attention Networks for Document Classification." NAACL 2016. Available at: https://aclanthology.org/N16-1174/

[66] Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. Available at: https://arxiv.org/abs/2010.11929

[67] Gulati, A., et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." Interspeech 2020. Available at: https://arxiv.org/abs/2005.08100

[68] Jumper, J., et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596, 583–589, 2021. Available at: https://www.nature.com/articles/s41586-021-03819-2

[69] Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. Available at: https://arxiv.org/abs/1810.04805

[70] Bengio, Y., et al. "Curriculum Learning." ICML 2009. Available at: https://dl.acm.org/doi/10.1145/1553374.1553380

[71] Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. Available at: https://arxiv.org/abs/2106.09685

[72] Dao, T., et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. Available at: https://arxiv.org/abs/2205.14135

[73] Dettmers, T., et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. Available at: https://arxiv.org/abs/2208.07339

[74] Narayanan, D., et al. "Efficient Large-Scale Language Model Training on GPU Clusters." SC21. Available at: https://arxiv.org/abs/2104.04473

[75] Jouppi, N., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." ISCA 2017. Available at: https://arxiv.org/abs/1704.04760

[76] IPWatchdog. "AI Patent Licensing and Enforcement Trends." 2024.

[77] Litigation Analytics. "AI Patent Claim Construction Disputes." Lex Machina, 2024.

[78] US10452978B2. Representative claim structure analysis.

[79] MPEP § 2173. "Claims Must Particularly Point Out and Distinctly Claim the Subject Matter."

[80] Litigation Analytics. "Attention mechanism" claim construction disputes in AI patent cases.

[81] Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv preprint arXiv:1911.02150, 2019. Available at: https://arxiv.org/abs/1911.02150

[82] Patent claim construction analysis for "positional encoding" term scope.

[83] Patent claim construction analysis for "transformer block" definition.

[84] MPEP § 714. "Amendments, Applicant's Action."

[85] USPTO Continuation Application Practice Guidelines.

[86] Patent portfolio strategy analysis for AI technologies.

[87] Neural network expertise requirements for patent technical analysis.

[88] Patent claim interpretation expertise requirements.

[89] Prior art analysis expertise for transformer patents.

[90] Litigation challenges for AI system analysis. Electronic Frontier Foundation AI litigation report.

[91] Model architecture reverse engineering techniques and limitations.

[92] API abstraction challenges for patent infringement analysis.

[93] Trade secret protection in AI litigation. Sedona Conference AI Working Group.

[94] Protective order requirements for AI source code. Federal Judicial Center guidelines.

[95] Source code review methodology for patent analysis.

[96] Behavioral analysis techniques for AI model characterisation.

[97] Performance fingerprinting methods for architecture identification.

[98] Attention visualisation tools and their forensic applications. BertViz documentation.

[99] AI watermarking research and patent implications.

[100] Expert witness challenges in technical patent cases. Federal Judicial Center Expert Witness Manual.

[101] Claim element mapping methodologies for AI patents.

[102] Prior art comparison presentation techniques for fact-finders.

[103] Defensive publication strategies for AI innovations. IP.com Prior Art Database.

[104] Patent filing timing considerations for ML research.

[105] Claim scope calibration strategies.

[106] Continuation filing strategy planning.

[107] Prior art search comprehensiveness requirements.

[108] International patent filing considerations for AI technologies.

[109] Claim construction technical expertise requirements.

[110] Early invalidity analysis benefits.
