Author manuscript; available in PMC: 2025 Aug 26.
Published in final edited form as: Proc Mach Learn Res. 2023 Jul;23:40373–40389.

CODEIPPROMPT: Intellectual Property Infringement Assessment of Code Language Models

Zhiyuan Yu 1, Yuhao Wu 1, Ning Zhang 1, Chenguang Wang 1, Yevgeniy Vorobeychik 1, Chaowei Xiao 2,3
PMCID: PMC12377501  NIHMSID: NIHMS2100216  PMID: 40860285

Abstract

Recent advances in large language models (LMs) have facilitated their ability to synthesize programming code. However, they have also raised concerns about intellectual property (IP) rights violations. Despite its significance, this issue remains relatively unexplored. In this paper, we aim to bridge the gap by presenting CODEIPPROMPT, a platform for automatic evaluation of the extent to which code language models may reproduce licensed programs. It comprises two key components: prompts constructed from a licensed code database to elicit LMs to generate IP-violating code, and a measurement tool to evaluate the extent of IP violation by code LMs. We conducted an extensive evaluation of existing open-source code LMs and commercial products, and revealed the prevalence of IP violations across all of these models. We further identified the root cause: a substantial proportion of the training corpus is subject to restrictive licenses, resulting from both intentional inclusion and inconsistent licensing practices in the real world. To address this issue, we also explored potential mitigation strategies, including fine-tuning and dynamic token filtering. Our study provides a testbed for evaluating the IP-violation risks of existing code generation platforms and stresses the need for better mitigation strategies.

1. Introduction

The recent advancements in large language models such as GPT-4 have brought about revolutionary changes in the field of artificial intelligence and natural language processing. These models have demonstrated the ability to generate content that closely resembles human-created materials, leading to the emergence of a new form of content known as Artificial Intelligence Generated Content (AIGC). An important application of AIGC is code generation, which has already been commercialized (e.g., Copilot (GitHub, 2022)) and used by many developers and researchers to improve the efficiency of programming in the real world (Spencer, 2022). Recently, Microsoft has made its code LM API available for enterprise services in production (Microsoft, 2023).

However, the use of AI-generated code also raises legal and ethical concerns. A key issue is the potential violation of IP rights, as shown in Figure 1. Because code generative models are trained on open-source repositories, users have found that they can produce programs that are similar or even identical to existing ones without compliance with the associated licenses (Davis, 2022; Karpinski, 2021). Recently, Microsoft, GitHub, and OpenAI were sued in a class-action lawsuit alleging copyright infringement for allowing Copilot to reproduce licensed code without following license terms (Butterick, 2022). This has raised widespread concern that users of such models may inadvertently violate the IP rights of original works without being aware or notified.

Figure 1.

Code language models can reproduce existing code used to train the model. Since much of this code is licensed under restrictive terms, reproducing it can cause IP infringement without proper measures to ensure compliance. The example shown was generated by Copilot, with the full example and original code presented in the appendix.

In this work, we introduce CODEIPPROMPT, the first automated testing framework to evaluate the extent to which code language models generate IP-violating code. It is enabled by two key functionalities: extracting function signature and accompanying comments from licensed code to construct prompts, and measuring the extent of IP violation with code plagiarism similarity scores.

To conduct a comprehensive evaluation, we collected over 4M real-world licensed repositories to evaluate 10 models across 5 programming languages, including both state-of-the-art open-source models (CodeRL (Le et al., 2022), CodeGen (Nijkamp et al., 2022), CodeParrot (Tunstall et al., 2022)) and commercial products (Codex (Chen et al., 2021), Copilot, ChatGPT, and GPT-4). The results showed that the issue is prevalent across all of these models, as most are capable of generating code strongly resembling portions of licensed software within 50 prompted code generations. Investigating the root cause, we found that their training datasets explicitly contain a significant amount of source code under restrictive licenses, and some of this copyrighted code is included in all of the studied datasets. Additionally, we identified the implicit inclusion of restrictively licensed code in permissive and public-domain repositories due to inconsistent licensing practices; these issues in the code supply chain make the removal of restrictive data even more challenging in the real world. To mitigate this issue, we explored both data-based and decoding-based strategies, but found that they were either ineffective or resulted in a significant drop in performance. Lastly, we present insights on potential mitigation directions and recommend measures to take.

We aim to shed light on the landscape of IP protection in generated code by language models. Our findings highlight the challenges in mitigating this prevalent issue and the pressing need to reconsider the data used for training. We release our code and datasets to encourage further progress in improving IP protection and compliance.

2. Intellectual Property of Source Code

Characterizing Code IP via Licenses.

Identifying IP violations in source code is crucial, yet it poses significant challenges due to the diverse licenses that govern code IP and the complex terms associated with them. In software development, licenses are widely accepted as legal instruments for regulating and managing IP, and they also serve as a crucial foundation for courts to make informed decisions in intellectual property litigation (Murray, 2020). These licenses define the terms under which the code can be used, modified, and distributed, and failure to comply with these terms can result in IP infringement. However, licenses exist in vast numbers and provide diverse levels of IP protection, each with intricate policies and conditions that can be difficult to understand and operationalize.

To better understand the landscape of IP protection in relation to source code, we conducted a survey of over 200 open-source licenses approved by the Open Source Initiative (OSI). These licenses can be broadly categorized based on their level of permissiveness, as shown in Table 1. For instance, permissive licenses such as the MIT License and the BSD License impose minimal restrictions on the use and modification of the source code, requiring only that proper attribution be included in the distribution. In contrast, copyleft licenses are more restrictive, requiring that any modified versions of the code be distributed under the same license, even if the code is incorporated into proprietary software. Additional details can be found in Appendix A.

Table 1.

The mainstream open-source licenses can be categorized based on the permissiveness required by license terms.

| Category | Licenses* | Permissiveness |
| Public Domain | Unlicense, CC, WTFPL | No legal restrictions |
| Permissive | MIT, ISC, Apache, BSD, BSL, Artistic, Zlib, AFL | Include a copy of the license and attribution to the authors |
| Weak Copyleft | MPL, EPL, LGPL | Release part of the code under the same license |
| Strong Copyleft | GPL, AGPL | Release the entire program under the same license |

* Some may represent a series of licenses with different versions.

Scope of CODEIPPROMPT.

In this work, we consider IP infringement as the violation of license terms. Specifically, permissive and copyleft licenses impose certain obligations that must be adhered to, such as requiring attribution or requiring derivative works to be distributed under the same license. However, existing code LMs have not taken measures to ensure compliance, so reproducing code under these licenses carries the risk of IP violation. In contrast, public-domain licenses do not impose any legal restrictions, and such code is generally free to use and distribute. Therefore, we consider code LMs' reproduction of permissively licensed and copyleft-licensed code as IP infringement, while reproduction of public-domain licensed code is not.

It is noteworthy that this study does not involve repositories without explicit licensing statements. From a regulatory standpoint, such programs retain the copyright and restrict third parties from utilizing, modifying, or distributing the software without explicit consent from the original creators (GitHub, 2023). However, the lack of a licensing statement inadvertently introduces legal ambiguities and complexities, for which reason they are excluded from the scope of this study.

3. Related Work

The use of LMs in programming has motivated numerous studies on performance benchmarking and evaluation, including code-writing capabilities such as code syntax understanding (Shen et al., 2022) and the functional correctness of generated code (Chen et al., 2021; Hendrycks et al., 2021; Austin et al., 2021). Additionally, some research has focused on evaluating the security level of generated code (Pearce et al., 2022; Perry et al., 2022; Sandoval et al., 2022) and the potential for privacy breaches through LM outputs (Pan et al., 2020). Though closely related, IP violations of LM outputs remain under-explored in these works.

From the perspective of intellectual property, existing work primarily focuses on protecting the IP of generative models themselves (Zhang et al., 2018; Xue et al., 2021; He et al., 2022), as the training of such models can be resource-intensive (Dale, 2021). Besides, recent findings have highlighted instances where LM output is copied verbatim from English training text, a characteristic that can be leveraged for malicious membership inference attacks (Carlini et al., 2021; Lee et al., 2022). However, the potential for LMs to generate copyrighted programs and the related issues of IP violation have received little attention so far. In this study, we aim to address this gap by examining such characteristics of LMs, and also exploring potential mitigation strategies to inspire future research.

4. Design of CODEIPPROMPT

The core design of CODEIPPROMPT includes a three-step process to enable automatic evaluation, as depicted in Figure 2. To create a comprehensive dataset for evaluation, we compiled a collection of licensed code repositories from GitHub, totaling 4,075,553 repositories across 34 different licenses. From the sampled licensed code, we extracted function signatures and accompanying comments to serve as prompts, and the resulting generated code is subsequently compared to the original program to calculate similarity scores. More details of the framework are described below.

Figure 2.

CODEIPPROMPT constructs prompts from a database of collected source code, and the generated programs are analyzed against the source files for similarity scores.

4.1. Construct Prompts from Licensed Code

Real-world Data as Foundation of Prompts.

To effectively measure the real-world risks of IP infringement, the prompts should possess several characteristics. First, they should be natural in code context, with appropriate function names and comments that summarize the functionality of the defined functions. Second, they should be representative of real-world licensed code. Third, they should be comprehensive, covering diverse licenses and programming languages. To achieve this, we employed a data collection process involving three key steps. First, the GitHub REST API was used to gather repository information across licenses and programming languages in parallel. Second, a parser was implemented to examine the collected metadata, download target repositories with scheduling to handle the API rate limit, and categorize the entries in a database based on the metadata. Lastly, the resources were opportunistically compressed and uploaded to cloud storage, followed by post-processing to filter and extract target programs.
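The categorization step of this pipeline can be sketched as follows. The metadata fields (`full_name`, `license.spdx_id`, `language`) follow the GitHub REST API repository object, but the pipeline itself is a simplified, hypothetical sketch rather than the authors' implementation.

```python
from collections import defaultdict

def categorize(repo_metadata):
    """Group repository metadata entries by (license, language).

    `repo_metadata` is a list of dicts whose fields mirror the GitHub
    REST API repository object; repositories without a detected license
    fall into the NOASSERTION bucket.
    """
    index = defaultdict(list)
    for entry in repo_metadata:
        lic = (entry.get("license") or {}).get("spdx_id", "NOASSERTION")
        lang = entry.get("language") or "Unknown"
        index[(lic, lang)].append(entry["full_name"])
    return index

# Hypothetical metadata entries for illustration.
repos = [
    {"full_name": "alice/utils", "license": {"spdx_id": "MIT"}, "language": "Python"},
    {"full_name": "bob/kernel", "license": {"spdx_id": "GPL-3.0"}, "language": "C"},
    {"full_name": "carol/lib", "license": None, "language": "Python"},
]
index = categorize(repos)
```

Each bucket can then be sampled independently to balance the prompt set across licenses and languages.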

Construct Prompts from Data.

Most existing models are trained on both natural language and programming code so that they understand prompts in terms of program functionality (described in comments) and syntax (in function signatures). Following this convention, we construct prompts comprising these two components, which generalizes to a variety of programming languages and code LMs. An example prompt consisting of a function signature is shown in Figure 1. As we focus on functions or classes as units of evaluation, each prompt is limited to a single function signature and its accompanying comment, since the inclusion of unrelated content may inadvertently affect evaluation results. To handle the diverse syntactic structures of various programming languages and avoid disruption from irrelevant components (e.g., variable names or comments containing keywords), we compiled regular expressions to identify elements such as comments, functions, and classes. With individual lines of code as units, these regular expressions were used to match and extract target code snippets while preserving the original code context. We sampled source files representing copyleft and permissively licensed code and derived prompts of varying lengths across five programming languages (i.e., Python, C, C++, C#, Java).
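The extraction idea can be illustrated with a deliberately simplified pattern for Python; the paper compiles comparable expressions per language, and this is not the authors' exact pattern (it ignores docstrings, decorators, and nested parentheses, for example).

```python
import re

# Simplified, illustrative pattern: an optional leading comment block
# followed by a function signature. Not the authors' exact expression.
PY_PROMPT = re.compile(
    r"(?P<comment>(?:^[ \t]*#.*\n)*)"                        # leading comments
    r"(?P<signature>^[ \t]*def[ \t]+\w+\([^)]*\)[^\n:]*:)",  # def signature
    re.MULTILINE,
)

def extract_prompts(source: str):
    """Return (comment, signature) pairs found in `source`."""
    return [
        (m.group("comment").strip(), m.group("signature").strip())
        for m in PY_PROMPT.finditer(source)
    ]

code = '''
# Compute the nth Fibonacci number iteratively.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
'''
prompts = extract_prompts(code)
```

The comment and signature are then concatenated to form the prompt, while the function body is held out as the reference for similarity scoring.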

Context-aware Preprocessing.

The prompts constructed from real-world sources inadvertently include sensitive data such as access tokens. To manage such information while preserving semantic meaning, we employed a recognizer built on regular expressions and an NLP module. Specifically, we built on Presidio (Microsoft, 2022) and spaCy (Vasiliev, 2020) to customize context-aware anonymization logic, which transformed identifiable information into generic placeholders. To further improve the quality of the prompt set, we also implemented a filter to remove empty prompts and those containing special characters. As a result, we obtained over 179K prompts, with the statistics summarized in Table 2.
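As a minimal, stdlib-only stand-in for the Presidio/spaCy pipeline, the placeholder substitution can be sketched with a few illustrative regex patterns (the actual recognizer covers far more entity types and uses NLP context):

```python
import re

# Illustrative patterns only; the paper's recognizer is context-aware
# and built on Presidio and spaCy rather than bare regexes.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\bghp_[A-Za-z0-9]{36}\b"), "<TOKEN>"),  # GitHub PAT shape
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
]

def anonymize(text: str) -> str:
    """Replace matches of each sensitive-data pattern with a placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "# Contact dev@example.com, server at 10.0.0.1"
clean = anonymize(sample)
```

Generic placeholders keep the prompt's structure intact, so the anonymization does not change what the model is asked to generate.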

Table 2.

Data statistics of the constructed prompts using CODEIPPROMPT. The number of tokens is measured by tokenizers from the multi-lingual CodeGen model and is presented as mean (± standard deviation).

| | Permissive | Weak Copyleft | Strong Copyleft | Total |
| # Prompts | 52.0K | 77.1K | 50.0K | 179.1K |
| # Tokens (avg.) | 13.2 (±7.3) | 13.2 (±7.5) | 13.3 (±6.2) | 13.2 (±7.1) |

| | C | C++ | C# | Python | Java |
| # Prompts | 6.2K | 6.2K | 37.6K | 30.1K | 99.0K |
| # Tokens (avg.) | 18.2 (±9.5) | 18.3 (±11.5) | 14.6 (±7.9) | 11.6 (±6.0) | 12.6 (±6.1) |

4.2. Benchmarking IP Violation with Similarity Score

Under copyright law, a key requirement for proving copyright infringement is demonstrating that the allegedly infringing work is substantially similar to the copyrighted material (U.S.C, 2022). This case-by-case assessment is typically conducted by jurors and judges based on empirical judgment, as there is no universal standard providing a definitive percentage or number of lines of code for infringement (USCourts, 2022; Balganesh et al., 2014). For instance, in a well-known software infringement case between Google and Oracle, nine lines of matched code were deemed substantially similar by the court (Court, 2014). Therefore, instead of counting exact lines of matched code, we approximate this approach by quantifying similarity with scores and setting a threshold based on human empirical analysis.

Calculating Similarity Scores with Adapted Plagiarism Detection.

The key technique for identifying IP violations in generated code involves comparing its similarity to original sources. To generalize our framework to the rapidly growing number of licensed programs, we choose not to train supervised DNN-based similarity-scoring models on existing code. Instead, we adapt and incorporate JPlag (Prechelt et al., 2002) and Dolos (Maertens et al., 2022), two of the most widely recognized code plagiarism detection tools, which have been used as expert witness evidence in lawsuits. Both tools take a set of programs as input and compare them pairwise, producing a similarity score between 0 and 1 for each pair. The main difference is how they calculate similarity: JPlag uses lexical analysis and string tiling to compare programs, while Dolos converts them into abstract syntax trees (ASTs) and calculates similarity based on the coverage of unique AST fingerprints. The scores generated by these two tools therefore cover different types of code plagiarism, and we take the maximum of the two as the result.
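As a self-contained illustration of the fingerprint-based idea and the max-combination (not JPlag's or Dolos's actual algorithms), a k-gram Jaccard similarity can be sketched as:

```python
def kgram_fingerprints(code: str, k: int = 5) -> set:
    """Hash every character k-gram of the whitespace-normalized code.
    A toy stand-in for the fingerprinting used by tools like Dolos."""
    stream = " ".join(code.split())
    return {hash(stream[i:i + k]) for i in range(max(len(stream) - k + 1, 1))}

def fingerprint_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity over fingerprint sets, in [0, 1]."""
    fa, fb = kgram_fingerprints(a, k), kgram_fingerprints(b, k)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

def combined_score(a: str, b: str, scorers) -> float:
    """Take the maximum over individual detectors, as the framework does
    with the JPlag and Dolos scores."""
    return max(s(a, b) for s in scorers)

original = "for i in range(10): total += i"
generated = "for i in range(10): total += i"
score = combined_score(original, generated, [fingerprint_similarity])
```

Taking the maximum makes the detector sensitive to whichever notion of similarity (lexical or structural) a given copy preserves best.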

However, the calculation is based on file-wise comparisons, which could be misleading when language models reproduce only a portion of the source code. In such cases, the score could be extremely low due to file length, even if the produced code is a direct copy of part of the original file. To address this, we adapt the comparison process by first matching and extracting the most similar code snippets in the source programs, over which the similarity score is calculated. The prompts are subtracted from the generated code before scores are calculated. This enables a more accurate assessment even when only a portion of the code has been copied. In this study, we consider a similarity score > 0.5 as potential plagiarism; this threshold was selected based on the human studies described below. While not a definitive measure, it serves as a quantitative indicator within the framework.
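The snippet-level adaptation can be sketched with a line-window scan that scores the generated code against its best-matching region of the source file rather than the whole file; the matching procedure here (difflib ratio over equal-sized windows) is an illustrative simplification of the paper's approach.

```python
from difflib import SequenceMatcher

def best_window_score(generated: str, source: str) -> float:
    """Score `generated` against the most similar same-sized window of
    `source` lines, so a verbatim copy of one function is not diluted
    by the length of the surrounding file."""
    gen_lines = generated.strip().splitlines()
    src_lines = source.strip().splitlines()
    w = len(gen_lines)
    if w == 0 or w > len(src_lines):
        return SequenceMatcher(None, generated, source).ratio()
    best = 0.0
    gen_text = "\n".join(gen_lines)
    for i in range(len(src_lines) - w + 1):
        window = "\n".join(src_lines[i:i + w])
        best = max(best, SequenceMatcher(None, gen_text, window).ratio())
    return best

source = (
    "import os\n\n"
    "def helper():\n    return 1\n\n"
    "def target(x):\n    return x * 2\n"
)
generated = "def target(x):\n    return x * 2"
score = best_window_score(generated, source)
```

Here the whole-file ratio would be well below 1 even though the generation is an exact copy of one function; the windowed score recovers the match.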

Validating Similarity Scores and Selecting Threshold.

To validate the use of similarity scores and come up with an appropriate threshold, we conducted IRB-approved human studies involving nine participants with over five years of programming experience. Each participant independently assessed 200 pairs of programs (with scores ranging from 0 to 1 in 0.05 increments and 10 pairs at each step) for the presence of plagiarism. To validate the effectiveness of similarity scores, a Pearson correlation test was performed between the number of participants who considered the programs substantially similar and the corresponding similarity scores. We found a Pearson correlation coefficient ρ = 0.84, indicating a strong alignment between similarity scores and human perception of plagiarism extent. The threshold of 0.5 was selected via this process. We compared human labels against those derived from similarity scores using various thresholds, and found that the threshold at approximately 0.5 achieves a good balance of precision and recall, with both values higher than 0.9 as shown in Figure 3.
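The threshold sweep can be sketched as follows; the score–label pairs below are hypothetical stand-ins for the study's 200 annotated pairs.

```python
def precision_recall(pairs, threshold):
    """Compute precision and recall of the score-based plagiarism label
    against human judgments.

    `pairs` holds (similarity_score, human_label) tuples, where the
    label is True when annotators judged the pair substantially similar.
    """
    tp = sum(1 for s, y in pairs if s > threshold and y)
    fp = sum(1 for s, y in pairs if s > threshold and not y)
    fn = sum(1 for s, y in pairs if s <= threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical annotated pairs, not the study's data.
pairs = [(0.9, True), (0.7, True), (0.6, True),
         (0.55, False), (0.4, False), (0.2, False)]
p, r = precision_recall(pairs, 0.5)
```

Sweeping the threshold over such pairs yields the precision/recall curves from which the 0.5 operating point was chosen.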

Figure 3.

Recall and precision for different thresholds.

Benchmark Metrics for IP Infringement.

Each code language model was employed to produce ten code generations for each sampled prompt. The maximum score was then used as the similarity score for the respective prompt. We then performed bootstrapping by sampling n = 50 code generations 1K times. Two metrics were used to characterize the models (Wang et al., 2022): (1) the Expected Maximum (EM) similarity, calculated as the mean of the maximum scores over the 1K bootstrapped samples; and (2) the Empirical Probability (EP), measured as the probability of generating code with a score > 0.5 at least once in a sample. In this context, the EM score measures the worst-case scenario in which generated code is highly similar to existing code, while the EP reflects the frequency at which the model generates potentially IP-violating code.

5. Evaluating Code Language Models

Using CODEIPPROMPT, we evaluated 10 code generation language models, comprising 6 open-source models and 4 commercial products. They cover a wide range of architectures, including GPT-4, GPT-3.5 (i.e., ChatGPT), GPT-3 (i.e., Copilot and Codex), GPT-2 (i.e., CodeParrot), encoder-decoder models (i.e., CodeRL), and autoregressive transformers (i.e., CodeGen). The models under these frameworks are GPT-4, ChatGPT, Copilot, code-davinci-002, CodeParrot-110M and CodeParrot-1.5B, CodeT5-large and CodeT5-large-ntp-py, and CodeGen-350M and CodeGen-2.7B. More details can be found in Appendix B. In this study, we followed the original settings of the code language models and employed nucleus sampling (Holtzman et al., 2020) with top-p where p = 0.95. The experiments were carried out with the Hugging Face Transformers library. Examples of plagiarized code generated by these models are provided in the appendix.

IP-violating Generations with Prompts in the Wild.

Using our framework, we began by evaluating the real-world risks of generating IP-violating code with the above code LMs. The evaluation followed Section 4, and the results are presented in Table 3. We observed that most of the models can generate potentially IP-violating code within 50 generations with relatively high probability. Interestingly, the commercial products achieved relatively lower EM and EP among the models. We speculate that this is because they were trained on a significantly larger corpus: as our method conducts a pair-wise similarity comparison, it becomes less probable for the generated code to match the exact source code from which the prompt is derived. Scanning all existing licensed programs, on the other hand, is computationally prohibitive and therefore not adopted in this study; this issue is further discussed in Section 9. The CodeT5-large model exhibits the lowest similarity scores. Manual qualitative analysis revealed that this was because many of its generated code snippets were not meaningful. In contrast, the CodeT5-large-ntp-py model, which was fine-tuned on Python programs with additional data, demonstrated significantly higher EM and EP, at levels similar to the other models.

Table 3.

Evaluation results with prompts sourced from GitHub. EM is shown as mean (± standard deviation).

| | GPT-4 | ChatGPT | Copilot | Codex | CodeT5-large | CodeT5-ntp-py | CodeGen-350M | CodeGen-2.7B | CodeParrot-110M | CodeParrot-1.5B |
| Permissive EM | 0.60 (±0.14) | 0.56 (±0.17) | 0.56 (±0.22) | 0.68 (±0.13) | 0.08 (±0.19) | 0.94 (±0.11) | 0.94 (±0.08) | 0.75 (±0.26) | 0.98 (±0.04) | 0.98 (±0.05) |
| Permissive EP | 0.50 | 0.45 | 0.48 | 0.71 | 0.06 | 0.99 | 0.99 | 0.74 | 0.99 | 0.99 |
| Weak Copyleft EM | 0.70 (±0.18) | 0.77 (±0.16) | 0.66 (±0.20) | 0.60 (±0.16) | 0.14 (±0.22) | 0.96 (±0.08) | 0.92 (±0.09) | 0.72 (±0.15) | 0.58 (±0.25) | 0.99 (±0.01) |
| Weak Copyleft EP | 0.84 | 0.90 | 0.57 | 0.64 | 0.07 | 0.99 | 0.99 | 0.93 | 0.56 | 0.99 |
| Strong Copyleft EM | 0.56 (±0.15) | 0.61 (±0.19) | 0.68 (±0.14) | 0.71 (±0.15) | 0.10 (±0.20) | 0.92 (±0.17) | 0.80 (±0.26) | 0.98 (±0.04) | 0.99 (±0.02) | 0.99 (±0.01) |
| Strong Copyleft EP | 0.61 | 0.64 | 0.75 | 0.78 | 0.05 | 0.94 | 0.81 | 0.99 | 0.99 | 0.99 |
| All EM | 0.64 (±0.20) | 0.67 (±0.18) | 0.62 (±0.25) | 0.64 (±0.20) | 0.11 (±0.21) | 0.92 (±0.12) | 0.93 (±0.09) | 0.76 (±0.27) | 0.98 (±0.04) | 0.99 (±0.01) |
| All EP | 0.61 | 0.65 | 0.64 | 0.75 | 0.04 | 0.99 | 0.99 | 0.76 | 0.99 | 0.99 |

ChatGPT and GPT-4.

Our evaluation extends to GPT-4 and ChatGPT. Although these models are optimized for dialogue rather than code generation, they have exhibited superior coding capabilities (Kashefi & Mukerji, 2023). Overall, the EM and EP measurements for these models are similar to those of the other two commercial products, Copilot and Codex. This suggests that a substantial amount of (licensed) code was involved in the training phase. This observation aligns with expectations, since the initial training data encompasses approximately eight million web pages, many of which can contain licensed code. This is particularly true for educational websites, blogs, and Wikipedia pages that contain existing programs. The issue is further intensified by the mechanism whereby users' inputs are additionally incorporated for model training (OpenAI, 2023). Given the increasing number of users utilizing ChatGPT (and GPT-4) for coding tasks, it becomes imperative to take measures to address potential IP infringement risks.

Impacts of Number of Generations.

To measure the impact of the number of generations n, we varied n from 1 to 1000 and used Copilot and Codex for evaluation. The results are presented in Figure 4. We observed that both models can generate highly similar code within 100 generations with probability p > 0.9. Both EM and EP increase as n grows, because more trials are likely to induce both highly similar code (EM) and more instances with a score > 0.5 (EP). More importantly, the behaviors of the models become very similar when n is extremely small or large, since in such cases it is either very difficult or very easy to encounter instances of plagiarizing existing code. Therefore, we kept n = 50 in the following study to better distinguish model behaviors. This parameter is also configurable to enable customized usage.

Figure 4.

Results of Codex and Copilot in terms of EM and EP, with the variance of EM shown as shading.

Impacts of Programming Languages.

Some code language models are optimized for a specific programming language such as Python, which raises the question of whether language is a factor in our study. We therefore analyzed the multi-lingual open-source models and two commercial products, examining the impact of programming language on the models' reproducibility of code (i.e., the extent to which they reproduce licensed code). The results are presented in Figure 5. While we did not observe significant differences across languages, we found that the reproducibility of Python code was generally the highest in terms of both expected maximum similarity and empirical probability. This could be attributed to the fact that both commercial products, Codex and Copilot, claim to be most capable in Python, and the open-source models studied were trained on datasets dominated by Python programs. The measurements are comparable for C and C++, potentially because their syntax is highly similar. Across these languages, the two CodeGen models of different sizes also exhibit similar reproducibility, since their architecture and training data are the same.

Figure 5.

Expected maximum (top) and empirical probability (bottom) across five programming languages.

Impacts of Prompt Length.

Intuitively, one might expect longer prompts to increase the likelihood of the model reproducing the original programs. To investigate this potential relationship, we conducted both parametric and non-parametric correlation tests with the null hypothesis that the true correlation between similarity score and prompt length is zero. The Pearson test produced a correlation coefficient of ρ = 0.0378 with a p-value of p = 0.1472, while the Spearman and Kendall tests yielded ρ = −0.0034 (p = 0.8949) and ρ = −0.0023 (p = 0.8980), respectively. None of these analyses provided statistically significant evidence (at the p < 0.1 level) to reject the null hypothesis. Therefore, we conclude that there is no significant evidence of a correlation between similarity scores (indicating the extent of code reproduction) and prompt lengths.
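The parametric test can be reproduced with a stdlib-only Pearson implementation; the lengths and scores below are hypothetical placeholders for the study's data, which the paper found to be essentially uncorrelated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, the parametric statistic used in
    the prompt-length analysis."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical prompt lengths (tokens) vs. similarity scores, chosen to
# show no real trend, mirroring the paper's finding.
lengths = [8, 12, 15, 20, 25, 30, 40]
scores = [0.4, 0.7, 0.3, 0.6, 0.5, 0.2, 0.55]
rho = pearson(lengths, scores)
```

The rank-based Spearman statistic can be obtained by applying the same function to the ranks of each variable.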

Prompts from Training Datasets.

While the above studies indicate the risk of generating IP-violating code, the results were affected by imbalance in the training data. Specifically, some prompts may be derived from source code included in a model's training corpus, and the reproducibility for such prompts is likely to be higher. To further investigate this intrinsic characteristic, i.e., the extent to which a code LM reproduces the code it was trained on, we constructed additional prompts from the individual training datasets of each open-source model. Note that n = 5 was adopted in this evaluation due to the high reproducibility of training data; the results are summarized in Table 5. Each model generally achieved higher similarity scores and probabilities when the prompts were derived from its own training dataset. Surprisingly, prompts derived from other datasets also produced relatively high scores, even though those datasets were not used for training. Further investigation revealed that these datasets share a significant amount of code, which is further described in Section 6. To enable a more accurate assessment, we preprocessed the data to exclude shared source code, with the results shown on the right side of Table 5. Another interesting finding is that model scale alone did not significantly impact the ability to reproduce code. For example, the two models under each of the CodeGen and CodeParrot frameworks differ in scale by ∼10x, but produced similar results.

Table 5.

Evaluation results of models with respect to prompts derived from individual training datasets. Left: prompts derived from direct sampling of the training datasets. Right: prompts derived from filtered datasets where the overlapping corpus was excluded (NO stands for non-overlap). Each cell shows EM as mean (± standard deviation), followed by EP.

| Model \ Dataset | CodeRL | CodeGen | CodeParrot | CodeRL-NO | CodeGen-NO | CodeParrot-NO |
| CodeT5-large | 0.28 (±0.26) / 0.18 | 0.19 (±0.26) / 0.07 | 0.32 (±0.26) / 0.22 | 0.31 (±0.14) / 0.19 | 0.22 (±0.16) / 0.13 | 0.26 (±0.19) / 0.17 |
| CodeT5-large-ntp-py | 0.92 (±0.13) / 0.98 | 0.74 (±0.21) / 0.84 | 0.64 (±0.11) / 0.94 | 0.91 (±0.15) / 0.98 | 0.38 (±0.21) / 0.24 | 0.40 (±0.06) / 0.14 |
| CodeGen-350M | 0.59 (±0.23) / 0.74 | 0.76 (±0.16) / 0.95 | 0.65 (±0.25) / 0.68 | 0.33 (±0.06) / 0.34 | 0.78 (±0.14) / 0.94 | 0.32 (±0.05) / 0.28 |
| CodeGen-2.7B | 0.54 (±0.12) / 0.80 | 0.78 (±0.15) / 0.96 | 0.66 (±0.24) / 0.66 | 0.28 (±0.04) / 0.33 | 0.75 (±0.11) / 0.98 | 0.36 (±0.08) / 0.26 |
| CodeParrot-110M | 0.50 (±0.20) / 0.20 | 0.55 (±0.17) / 0.63 | 0.66 (±0.17) / 0.76 | 0.31 (±0.04) / 0.22 | 0.23 (±0.06) / 0.30 | 0.71 (±0.23) / 0.80 |
| CodeParrot-1.5B | 0.58 (±0.17) / 0.65 | 0.60 (±0.23) / 0.68 | 0.65 (±0.17) / 0.73 | 0.34 (±0.07) / 0.27 | 0.27 (±0.09) / 0.36 | 0.70 (±0.20) / 0.73 |

6. Licensed Code in Training Datasets

To demystify the root causes, we analyzed the existing datasets used to train the language models. We evaluated four large-scale datasets for training state-of-the-art code generation models: The Pile, CodeParrot-Clean, CodeSearchNet, and GitHub Code (GCPY), which are widely used and were adopted to train the CodeGen, CodeParrot, and CodeRL models, among others. We passed the repository names to the GitHub API to obtain license information. More detailed information is presented in Appendix D. We found that these datasets explicitly or implicitly include a significant amount of restrictively licensed code.

6.1. Licensed Code in Training Datasets

License Distribution.

The proportions of each license category in the studied datasets are shown in Table 6. The results reveal that all of these datasets contain source code with restrictive licenses, which is used to train code language models. The majority of the code in these datasets is under permissive licenses; however, a notable portion is copyleft licensed. Additionally, code with public-domain licenses typically has the lowest proportion, likely because most high-quality repositories are not licensed under this category.

Table 6.

License composition for individual code corpus datasets.

| Dataset | Total | Public Domain | Permissive | Weak Copyleft | Strong Copyleft |
| GCPY | 3063k | 2.18% | 72.77% | 3.18% | 21.88% |
| CodeParrot-Clean | 437k | 1.67% | 64.98% | 3.41% | 29.94% |
| CodeSearchNet | 114k | 0.50% | 91.88% | 1.26% | 6.37% |
| The Pile | 191k | 40.52% | 50.82% | 1.58% | 7.08% |

Overlapped Restrictive Code Among Datasets.

Our analysis also revealed a significant amount of licensed code overlapping multiple datasets, with much of it copyleft licensed. To further investigate, we grouped the overlapping code by its number of occurrences across the four datasets and analyzed the license composition of each group. The results are shown in Figure 6. We identified a total of 499,837 code repositories shared by two datasets, 28,655 shared by three datasets, and 2,804 included in all four datasets (i.e., an occurrence of four). For code included in two or more datasets, the majority is permissively licensed, but a significant portion is copyleft licensed. For instance, 13.3% of the code shared by all four datasets is copyleft licensed.

Figure 6.

Percentage of each category of licensed code in the overlapping corpus among the four datasets; the occurrence indicates the number of datasets that include the code.

We also analyzed the overlapping code for each individual dataset. For each dataset, we identified the code repositories that were also included in other datasets and analyzed their licenses. The results are summarized in Table 8. For instance, 33.43% of the data that CodeParrot-Clean shares with other datasets is copyleft licensed, as is 28.75% of the data GCPY shares. These findings indicate that a significant amount of restrictive code is shared among these datasets. It is therefore important to carefully consider the license composition of the training data to ensure license compliance and protect intellectual property rights.
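The grouping behind Figure 6 can be sketched as a simple multiset count over the datasets' repository lists; the `datasets` mapping (name to set of repository full names) is an assumed interface for illustration:

```python
from collections import Counter

def overlap_groups(datasets):
    """Group repositories by how many of the given datasets contain them.

    `datasets` maps dataset name -> set of repository full names. Returns
    {occurrence: set of shared repos} for occurrences of two or more,
    mirroring the grouping used to produce Figure 6.
    """
    counts = Counter()
    for repos in datasets.values():
        counts.update(repos)
    groups = {}
    for repo, n in counts.items():
        if n >= 2:
            groups.setdefault(n, set()).add(repo)
    return groups
```

The per-group license composition then follows by bucketing each group's repositories by license category.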

Table 8.

The total number of overlapped code and their license distributions for each dataset (OL represents overlap).

Dataset            # OL    OL%      Public Domain   Permissive   Weak Copyleft   Strong Copyleft

GCPY               526k    17.17%   1.49%           69.75%       3.18%           25.57%
CodeParrot-Clean   424k    97.03%   1.68%           64.89%       3.41%           30.02%
CodeSearchNet       79k    69.30%   0.14%           92.43%       2.16%            5.27%
The Pile            65k    34.03%   1.34%           84.66%       2.64%           11.35%

Implicit Inclusion of Restrictive Code.

Another real-world problem that further complicates the issue is the implicit inclusion of copylefted code in less restrictively licensed (e.g., public-domain and permissive) repositories. This has been a longstanding issue in the code supply chain, as previously documented in the literature (Wolter et al., 2022). To investigate this problem, we first compiled the deduplicated set of repositories included across the four datasets, totaling 3,240,755 repositories. We then sampled 345,222 repositories and used the GitHub API to check whether they were forked from more restrictively licensed repositories. We identified 353 violations, constituting roughly 0.1% of the examined repositories. For instance, 268 repositories were forked from strong copyleft licensed code but were not themselves licensed under strong copyleft. Similarly, 71 repositories were forked from weak copyleft licensed code but were licensed under neither strong nor weak copyleft. A further 14 repositories were forked from permissively licensed code but were released under public-domain licenses, which impose no restrictions.
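The fork-consistency check above amounts to flagging any fork whose license category is strictly less restrictive than its parent's. A minimal sketch, where the restrictiveness ranking and category names are our assumptions:

```python
# Assumed restrictiveness ranking, least to most restrictive.
RANK = {"public_domain": 0, "permissive": 1, "weak_copyleft": 2, "strong_copyleft": 3}

def is_license_downgrade(fork_category, parent_category):
    """True when a fork carries a strictly less restrictive license than its parent."""
    return RANK[fork_category] < RANK[parent_category]

def count_downgrades(fork_pairs):
    """Count (fork_license, parent_license) category pairs that relax the parent's terms."""
    return sum(is_license_downgrade(fork, parent) for fork, parent in fork_pairs)
```
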

What is probably more concerning is that there may be a significant number of repositories that include portions of code from restrictive sources without providing appropriate licenses or attribution (Wolter et al., 2022). As such cases are not included in the study, the boundaries between permissible and restricted code can be even more blurred than our results suggest.

7. Evaluating Mitigation Methods

Although this issue has received little prior study, potential mitigation strategies do exist. We investigated the effectiveness of two approaches: a data-based technique, which fine-tunes the model with public-domain licensed code; and a decoding-based technique, for which we developed a dynamic token filtering strategy to prevent similar generations. We used the four models under the CodeRL and CodeParrot frameworks as the basis for performance comparisons.

Fine-tune Models.

In this approach, we investigate whether further tuning the models on public-domain data reduces their recall of restrictively licensed code. Specifically, we used samples from previously collected public-domain data, totaling over 17K code snippets. The hyperparameters for tuning each model are detailed in Table 9 in the Appendix. We then used the same strategies as in Section 5 to assess performance.

Dynamic Token Filtering.

Recent research on controllable text generation has shown promise in preventing the generation of toxic natural language (Ghosh et al., 2017; Dathathri et al., 2020). Inspired by this concept, we developed a dynamic filtering mechanism in which we decode only k tokens at a time and evaluate their similarity using the CODEIPPROMPT framework. If the score exceeds 0.5, we roll back one token and choose from the remaining candidates. We also added a termination condition for when no candidates remain, to guarantee a final score below 0.5. To balance computational cost, we set k = 10 in the experiments.
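The filtering loop can be sketched as follows; `next_tokens` (the model's ranked token proposals given a prefix) and `score` (the CODEIPPROMPT similarity of a decoded span) are assumed stand-ins, not the actual implementation:

```python
def filtered_decode(next_tokens, score, k=10, threshold=0.5, max_tokens=100):
    """Dynamic token filtering: decode k tokens at a time, roll back on high similarity.

    `next_tokens(prefix)` returns candidate tokens ranked by model probability;
    `score(tokens)` is the similarity of the decoded span. Both are hypothetical
    interfaces standing in for a real LM and the measurement tool.
    """
    out = []
    banned = {}  # prefix (tuple of tokens) -> tokens disallowed after that prefix
    while len(out) < max_tokens:
        # Greedily extend by up to k tokens before the similarity check.
        for _ in range(k):
            prefix = tuple(out)
            options = [t for t in next_tokens(prefix) if t not in banned.get(prefix, set())]
            if not options:
                return out  # termination condition: no admissible continuation remains
            out.append(options[0])
            if len(out) >= max_tokens:
                break
        if score(out) > threshold:
            # Roll back one token and disallow it at this position.
            bad = out.pop()
            banned.setdefault(tuple(out), set()).add(bad)
    return out
```

With k = 1 the check runs after every token; larger k trades detection granularity for fewer similarity evaluations.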

Besides EM and EP for IP-related evaluation, we additionally measured code synthesis capability on the APPS (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021) benchmarks using pass@k metrics. The results, summarized in Table 7, show that both strategies reduced code reproducibility, though to different extents. Data-based fine-tuning was less effective at preventing reproduction, as most EM and EP scores did not drop into the desired range. The one apparent exception was CodeT5-large, whose reproducibility actually increased significantly; this is because fine-tuning improved its synthesis capability, leading it to behave similarly to CodeT5-large-ntp-py under the same framework. Even so, tuning did reduce reproducibility relative to CodeT5-large-ntp-py, which has comparable synthesis capability but higher reproducibility. Dynamic filtering, on the other hand, suppressed similarity scores below the 0.5 threshold, but it also reduced the models' overall ability to generate coherent code, as reflected in the APPS metric. Through manual inspection, we found that blocked tokens led the models to generate suboptimal or even incorrect solutions. As a result, neither approach fully solves the problem, and both may come at the cost of reduced synthesis performance. Potential improvements are discussed further in Section 8.

Table 7.

Performance of the mitigation strategies compared to baseline models under the CodeParrot and CodeRL frameworks. The code synthesis capability is evaluated using the HumanEval (CodeParrot) and APPS (CodeRL) benchmarks.

CodeParrot-110M
  Baseline:        EM 0.98±0.04   EP 0.99   Pass@1 3.41%      Pass@10 5.39%      Pass@100 7.02%
  Fine-Tune:       EM 0.71±0.13   EP 0.70   Pass@1 3.56%      Pass@10 5.17%      Pass@100 7.44%
  Dynamic Filter:  EM 0.46±0.08   EP 0.00   Pass@1 1.17%      Pass@10 1.69%      Pass@100 3.28%

CodeParrot-1.5B
  Baseline:        EM 0.99±0.01   EP 0.99   Pass@1 3.84%      Pass@10 6.77%      Pass@100 10.02%
  Fine-Tune:       EM 0.67±0.18   EP 0.69   Pass@1 3.14%      Pass@10 5.84%      Pass@100 8.28%
  Dynamic Filter:  EM 0.40±0.10   EP 0.00   Pass@1 1.46%      Pass@10 2.11%      Pass@100 2.78%

CodeT5-large
  Baseline:        EM 0.11±0.21   EP 0.04   Pass@1 0.00       Pass@10 0.00       Pass@100 0.00
  Fine-Tune:       EM 0.71±0.15   EP 0.68   Pass@1 1.70e-03   Pass@10 7.78e-03   Pass@100 0.02
  Dynamic Filter:  EM 0.10±0.20   EP 0.00   Pass@1 0.00       Pass@10 0.00       Pass@100 0.00

CodeT5-large-ntp-py
  Baseline:        EM 0.92±0.12   EP 0.99   Pass@1 1.40e-05   Pass@10 1.30e-04   Pass@100 8.00e-04
  Fine-Tune:       EM 0.82±0.10   EP 0.87   Pass@1 2.48e-03   Pass@10 8.64e-03   Pass@100 0.02
  Dynamic Filter:  EM 0.47±0.06   EP 0.00   Pass@1 0.00       Pass@10 2.83e-05   Pass@100 1.42e-05
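The pass@k figures in Table 7 follow the standard unbiased estimator of Chen et al. (2021): with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: any k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```
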

8. Discussions

Mitigation by Removing Restrictive Data.

One potential mitigation strategy is to reduce the percentage of restrictive code within the training dataset. While the inclusion of copyrighted content in training machine learning models remains a controversial topic, with legal cases (Coyer, 2022) having been brought forth, this approach seems necessary to consider. However, our analysis suggests that simply excluding restrictively licensed repositories from the training dataset may not be fully effective, as a non-negligible amount of restrictive code is implicitly included in “permissive” repositories. More careful and granular processing of training data is therefore imperative.

Mitigation with Controllable Code Generation.

In this study, we implemented a dynamic blocklist based on similarity scores as feedback to disallow the generation of restrictive code. However, we observed a significant drop in performance as the current models rely, to some extent, on recalling training data to provide solutions. Potential improvements include increasing efficiency by learning hidden representations from restrictive code, or by improving the balance between performance and IP compliance. In the context of code generation, the unique challenge for controllable generation is that IP violations cannot be assessed at the granularity of individual tokens, but rather require spans of sufficient length. Therefore, further research is needed for adaptive code generation control.

Mitigation by Enabling More Intelligent Model.

Another important direction in addressing the issue is to develop more “intelligent” models. In our evaluation that covers diverse language model architectures, we found that in some cases, even state-of-the-art models or commercial products seem to simply copy previously learned information. Therefore, exploring improved model architectures from the perspective of IP protection could be a valuable future work direction. CODEIPPROMPT can serve as a benchmark to evaluate and compare different model architectures.

Attributing to Origins of Generated Code.

One significant challenge in addressing the issue is the difficulty in identifying the source of the code. This lack of information makes it difficult to determine the applicable licensing terms and conditions for the generated code. To tackle the problem, it is essential to develop mechanisms for tracing the generated code back to its training data to determine whether there are any potential matches. However, the traditional method of iteratively comparing the generated code with the potentially extensive training data can be computationally expensive. Therefore, it is worth considering more efficient solutions as future research directions.
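One candidate direction, offered here as a sketch rather than part of CODEIPPROMPT, is to pre-index the training corpus with hashed n-gram fingerprints (in the spirit of MOSS-style winnowing), so attribution becomes a set lookup instead of a full pairwise scan. The whitespace tokenization and fingerprint scheme below are simplifying assumptions:

```python
def ngram_fingerprints(code, n=5):
    """Hash each n-gram of whitespace tokens; a crude stand-in for a real
    lexer plus a winnowing-style fingerprint selection."""
    tokens = code.split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def candidate_sources(generated, index, n=5):
    """Return repositories whose fingerprint sets overlap the generated code.

    `index` maps repository name -> precomputed fingerprint set, built once
    over the training corpus instead of re-scanning it for every query.
    """
    fps = ngram_fingerprints(generated, n)
    return sorted(repo for repo, repo_fps in index.items() if fps & repo_fps)
```

Candidates surfaced by the index could then be verified with the more expensive similarity measurement.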

CODEIPPROMPT Beyond Benchmarking IP Violation.

The impact of CODEIPPROMPT could extend beyond benchmarking the extent of IP rights violation by code LMs. One important application is to evaluate the effectiveness of mitigation methods. For example, it can be used to assess the extent to which new model architectures rely on direct recall of training code, and to verify the accuracy of attribution based on similarity scores. Additionally, CODEIPPROMPT can be employed to detect implicit inclusion of restrictive code within more permissively licensed programs. As a result, CODEIPPROMPT can serve as an instrumental tool in addressing the issue and ensuring improved IP protection.

9. Limitations

There are several limitations to our study. First, our constructed prompts cover only a portion of existing licensed code, with a focus on five commonly used programming languages (i.e., Python, C, C++, C#, and Java). Second, we did not compare the generated code against every licensed program file, so our similarity scores serve as a lower bound: the generated code might be even more similar to other existing licensed programs. Future work could expand upon our analysis by evaluating a larger range of code generation models and programming languages.

10. Conclusion

We present CODEIPPROMPT, a generalizable platform for evaluating the extent to which language models can reproduce learned code, which can result in potential intellectual property infringement. Using the framework, we analyzed multiple state-of-the-art models and commercial products, and investigated potential strategies for addressing the issue. To better understand the root cause, we examined the existing large training datasets and discovered that many of them contain a significant number of copyrighted programs, either explicitly or implicitly. Finally, we provide insights on mitigation strategies and shed light on future directions.

11. Ethical Considerations

The authors of this work hereby affirm that they are aware of and adhere to the NeurIPS Ethics Guidelines. Code language models were utilized in this work; their risks and potential harms are discussed in Brown et al. (2020). All generated code, particularly any that may violate original licenses, was discarded at the conclusion of the research and was not used outside of the study. We do not anticipate that the use of our framework will result in harmful outputs.

Supplementary Material

Table 4.

Confusion matrix for precision and recall calculation.

                     Score > Threshold       Score < Threshold

Human Similar        True Positive (TP)      False Negative (FN)
Human Not-Similar    False Positive (FP)     True Negative (TN)

Acknowledgments

This work was partially supported by the NSF (CNS-1916926, CNS-2238635), ARO (W911NF2010141), DHS (No. 17STQAC00001–06-00), and Intel.

Footnotes

2

Additional prompt sets are available on the project website.

3

For Copilot, the top ten suggestions were saved individually.

4

Some of the included code repositories have been deprecated or migrated and were therefore excluded from the analysis.

5

To avoid confusion, we use the original name of the released model. The two models are built upon CodeT5 as the backbone and are therefore named with CodeT5 in the official release.

References

  1. Austin J, Odena A, Nye M, Bosma M, Michalewski H, Dohan D, Jiang E, Cai C, Terry M, Le Q, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  2. Balganesh S, Manta ID, and Wilkinson-Ryan T Judging similarity. Iowa L. Rev, 100:267, 2014.
  3. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020.
  4. Butterick M We’ve filed a lawsuit challenging github copilot, an ai product that relies on unprecedented open-source software piracy. https://githubcopilotlitigation.com/, November 2022.
  5. Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.
  6. Chen M, Tworek J, Jun H, Yuan Q, Pinto H. P. d. O., Kaplan J, Edwards H, Burda Y, Joseph N, Brockman G, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  7. Court S Oracle america, inc. v. google inc. https://scholar.google.com/scholar_case?case=15197092051369647665, 2014.
  8. Coyer C Lawyers expect more litigation, and clarity, around machine learning’s copyright issues. https://www.law.com/legaltechnews/2022/08/19/lawyers-expect-more-litigation-and-clarity-around-machine-learnings-copyright-issues, August 2022.
  9. Dale R Gpt-3: What’s it good for? Natural Language Engineering, 27(1):113–118, 2021.
  10. Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P, Yosinski J, and Liu R Plug and play language models: A simple approach to controlled text generation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=H1edEyBKDS.
  11. Davis T Copilot emits large chunks of my copyrighted code. https://twitter.com/docsparse/status/1581461734665367554?lang=en, October 2022.
  12. Gao L, Biderman S, Black S, Golding L, Hoppe T, Foster C, Phang J, He H, Thite A, Nabeshima N, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  13. Ghosh S, Chollet M, Laksana E, Morency L, and Scherer S Affect-lm: A neural language model for customizable affective text generation. In Barzilay R and Kan M (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 634–642. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1059.
  14. GitHub. Copilot: Your ai pair programmer. https://github.com/features/copilot, Jun 2022.
  15. GitHub. Choose an open source license - no license. https://choosealicense.com/no-permission/, May 2023.
  16. He X, Xu Q, Lyu L, Wu F, and Wang C Protecting intellectual property of language generation apis with lexical watermark. AAAI, 2022.
  17. Hendrycks D, Basart S, Kadavath S, Mazeika M, Arora A, Guo E, Burns C, Puranik S, He H, Song D, and Steinhardt J Measuring coding challenge competence with APPS. In Vanschoren J and Yeung S (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html.
  18. Holtzman A, Buys J, Du L, Forbes M, and Choi Y The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
  19. Husain H, Wu H-H, Gazit T, Allamanis M, and Brockschmidt M Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
  20. Karpinski S Copilot autocompletes the fast inverse square root implementation from quake iii. https://twitter.com/stefankarpinski/status/1410971061181681674?lang=en, July 2021.
  21. Kashefi A and Mukerji T Chatgpt for programming numerical methods. Journal of Machine Learning for Modeling and Computing, 2023.
  22. Le H, Wang Y, Gotmare AD, Savarese S, and Hoi SC Coderl: Mastering code generation through pretrained models and deep reinforcement learning. arXiv preprint arXiv:2207.01780, 2022.
  23. Lee K, Ippolito D, Nystrom A, Zhang C, Eck D, Callison-Burch C, and Carlini N Deduplicating training data makes language models better. In Muresan S, Nakov P, and Villavicencio A (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022, pp. 8424–8445. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.577.
  24. Maertens R, Van Petegem C, Strijbol N, Baeyens T, Jacobs AC, Dawyndt P, and Mesuere B Dolos: Language-agnostic plagiarism detection in source code. Journal of Computer Assisted Learning, 2022.
  25. Microsoft. Presidio: Data protection and de-identification sdk. https://microsoft.github.io/presidio/, Dec 2022.
  26. Microsoft. General availability of azure openai service expands access to large, advanced ai models with added enterprise benefits. https://azure.microsoft.com/en-us/blog/general-availability-of-azure-openai-service-expands-access-to-large-advanced-ai-models-with-added-enterprise-benefits/, Jan 2023.
  27. Murray A The $100 million court case for open source license compliance. https://githubcopilotlitigation.com/, June 2020.
  28. Nijkamp E, Pang B, Hayashi H, Tu L, Wang H, Zhou Y, Savarese S, and Xiong C Codegen: An open large language model for code with multi-turn program synthesis. ArXiv preprint, abs/2203.13474, 2022.
  29. OpenAI. What is chatgpt? https://help.openai.com/en/articles/6783457-what-is-chatgpt, May 2023.
  30. Pan X, Zhang M, Ji S, and Yang M Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pp. 1314–1331. IEEE, 2020.
  31. Pearce H, Ahmad B, Tan B, Dolan-Gavitt B, and Karri R Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 754–768. IEEE, 2022.
  32. Perry N, Srivastava M, Kumar D, and Boneh D Do users write more insecure code with ai assistants? arXiv preprint arXiv:2211.03622, 2022.
  33. Prechelt L, Malpohl G, and Philippsen M JPlag: Finding plagiarisms among a set of programs. Citeseer, 2002.
  34. Sandoval G, Pearce H, Nys T, Karri R, Dolan-Gavitt B, and Garg S Security implications of large language model code assistants: A user study. arXiv preprint arXiv:2208.09727, 2022.
  35. Shen D, Chen X, Wang C, Sen K, and Song D Benchmarking language models for code syntax understanding. arXiv preprint arXiv:2210.14473, 2022.
  36. Spencer M Github copilot makes developers 1.55x more productive. https://www.linkedin.com/pulse/github-copilot-makes-developers-155x-more-productive-michael-spencer, Sept 2022.
  37. Tunstall L, von Werra L, and Wolf T Natural language processing with transformers. O’Reilly Media, Inc., 2022.
  38. U.S.C. Copyright law of the united states (title 17). https://www.copyright.gov/title17/, Oct 2022.
  39. USCourts. Copying—access and substantial similarity. https://www.ce9.uscourts.gov/jury-instructions/node/274, Dec 2022.
  40. Vasiliev Y Natural Language Processing with Python and SpaCy: A Practical Introduction. No Starch Press, 2020.
  41. Wang B, Ping W, Xiao C, Xu P, Patwary M, Shoeybi M, Li B, Anandkumar A, and Catanzaro B Exploring the limits of domain-adaptive training for detoxifying large-scale language models. Advances in Neural Information Processing Systems, 2022.
  42. Wang Y, Wang W, Joty S, and Hoi SC Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859, 2021.
  43. Wolter T, Barcomb A, Riehle D, and Harutyunyan N Open source license inconsistencies on github. ACM Transactions on Software Engineering and Methodology, 2022.
  44. Xue M, Zhang Y, Wang J, and Liu W Intellectual property protection for deep learning models: Taxonomy, methods, attacks, and evaluations. IEEE Transactions on Artificial Intelligence, 3(6):908–923, 2021.
  45. Zhang J, Gu Z, Jang J, Wu H, Stoecklin MP, Huang H, and Molloy I Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172, 2018.
