Data in Brief
. 2026 Jan 29;65:112527. doi: 10.1016/j.dib.2026.112527

A dataset for human-written and AI-generated code source classification

Ghizlane Boukili 1,, Said EL Garouani 1, Jamal Riffi 1
PMCID: PMC12907682  PMID: 41704515

Abstract

The rapid rise of AI code-generation tools has created significant challenges for computer science educators in verifying the authenticity of students' code. While generic AI-detection tools exist, they often struggle to accurately identify AI-generated code because of the distinctive patterns and structures of programming languages. Research in this area requires data to develop effective systems for detecting AI-generated code. To fill this gap, we introduce a specialized dataset designed to support the creation of domain-specific detection tools. The dataset comprises 10,000 annotated code samples, consisting of 5000 human-written and 5000 AI-generated samples, across Python, Java, C, and C++. The human-written samples were collected from a public repository, while the AI-generated samples were produced using ChatGPT's API with varied prompts. Each sample is labeled with its origin, human or AI, enabling robust training of machine learning and deep learning models for code-source discrimination. The dataset and experiment code are publicly available to support further research in AI-generated code detection.

Keywords: ChatGPT, Programming languages, Machine learning, Detection, Prompt


Specifications table

Subject Computer Sciences
Specific subject area Machine learning, deep learning, LLMs.
Type of data Table, CSV file.
Data collection A total of 10,000 code samples (5000 human-written and 5000 ChatGPT-generated) were collected from a public repository and the ChatGPT API, covering four programming languages: Java, Python, C, and C++. All samples are compiled into a single CSV file named HumanVsAI_CodeDataset.csv, containing five columns: Problem_id, Sample_Code, Generated, Language, and Source.
Data source location Institution: LISAC Laboratory, Faculty of Sciences Dhar El Mahraz, Sidi Mohammed Ben Abdellah University
City/Town/Region: Fez
Country: Morocco
Data accessibility Repository name: Mendeley Data
Data identification number: 10.17632/kjh95n54f8.2
Direct URL to data: https://data.mendeley.com/datasets/kjh95n54f8/1
Related research article None.

1. Value of the Data

  • This dataset is a diverse, single-label code corpus suitable for training machine learning or deep learning models to build robust classifiers that distinguish AI-generated from human-written code.

  • NLP techniques for programming language processing can leverage this dataset to detect AI-generated code patterns.

  • The dataset maintains perfect class balance, containing equal human-written and AI-generated code samples.

  • Researchers can use this dataset to evaluate and benchmark code-analysis models.

  • It supports stylometric analysis of coding style differences between human authors and AI.

  • The collection supports research in code authenticity and plagiarism detection.

2. Background

As large language models (LLMs) are increasingly used to generate code, distinguishing between AI-written and human-written programs has become an important research challenge. Several datasets already exist for related tasks in software engineering and machine learning, such as CodeSearchNet [1] for semantic code search, CodeNet [2] for program classification and translation, and ManySStuBs4J [3] for bug detection. More recent benchmarks like HumanEval [4] and MBPP (Mostly Basic Python Problems) [5] are designed for program synthesis, but they mainly measure how well models solve coding tasks rather than comparing AI and human code. Previous studies also note that researchers focused on AI provenance are limited by the lack of datasets that include both sources. To address this gap, we created a dataset that combines human-written code with AI-generated code samples across diverse programming languages. The dataset provides labeled examples of both types of code, making it useful for developing and testing methods in AI code detection and feature analysis.

3. Data Description

3.1. Data format and statistics

This dataset is distributed as a compressed ZIP archive named ‘Code_Dataset’. The primary file, HumanVsAI_CodeDataset.csv, shown in Fig. 1, contains 10,000 code samples, with 5000 written by humans and 5000 generated by AI. The dataset consists of a balanced collection of code samples from both ChatGPT and human programmers. As indicated in Table 1 and Fig. 2, the samples are spread across four programming languages: Java, Python, C, and C++, covering problems ranging from the most challenging to the easiest. This distribution emphasizes differences in code between the two sources.

Fig. 1.


CSV dataset file (HumanVsAI_CodeDataset.csv).

Table 1.

Dataset statistics.

Statistic AI Human
Number of samples 5000 5000
Number of Java samples 1340 1605
Number of Python samples 1471 1207
Number of C samples 968 769
Number of C++ samples 1221 1419

Fig. 2.


AI vs human code by language.

The archive also contains supplementary materials. These include a file giving a generalized overview of the prompts used with ChatGPT; three notebooks (.ipynb) covering the machine learning experiments, the deep learning experiments, and a dataset description; and two README files that provide detailed documentation and the requirements for reproducing the experiments. Together, these resources make the dataset reproducible and directly applicable for research on distinguishing AI-generated code from human-written code.

The human-written code was sourced from the large-scale CodeNet [2] dataset, which contains approximately 14 million code samples. This dataset is based on submissions from two online judge websites: AIZU Online and AtCoder. These platforms allow programmers to test their skills by solving programming problems presented in courses or contests. Users submit their solutions for evaluation by an automated review system, which then returns the results. For the AI-generated code, samples were produced using the AI tool ChatGPT.

The dataset is organized into a table with the following columns:

  1. Problem_id: A task-level identifier created for this dataset, generated using the CodeT5+ model [6], to group code samples solving the same programming problem. It is distinct from the original Project CodeNet identifier and is used to control task-level grouping and prevent data leakage during evaluation.

  2. Sample_Code: The source code snippet, representing the content of the programming task and its solution.

  3. Generated: Indicates whether the code was written by a human or generated by ChatGPT, identifying the authorship of each sample.

  4. Language: Specifies the programming language of the code, which helps categorize the data and enables language-specific analyses.

  5. Source: Indicates the origin of the code snippet (e.g., CodeNet or a ChatGPT model version).

This structure allows for easy filtering, analysis, and processing of code samples across different languages and sources. Table 2 presents four representative rows of our dataset, illustrating its organization in tabular form.

Table 2.

Representative examples of the dataset.

problem_id Sample_Code Generated Language Source
Prob1996 /** Built using CHelper plug-in; solution at top */ public class Main {public static void main(String[] args){Scanner in=new Scanner(System.in); PrintWriter out=new PrintWriter(System.out); BTollGates solver=new BTollGates(); solver.solve(1,in,out); out.close(); …} Human Java CodeNet
Prob4923 /* author:nishi5451 created:11.08.2020 */ #include <bits/stdc++.h> using namespace std; #define rep(i,n) for(int i = 0;i < n;i++) typedef long ll; int main(){string s; cin>>s; int ans=0; for(auto c:s){if(c=='1') ans++;} cout<<ans<<endl; …} Human C++ CodeNet
Prob2126 #include <iostream>#include <cmath>#include <functional> double f(double x){return x*x*x-x-2;} double bisectionMethod(auto f,double a,double b){double c; for(int i = 0;i < 100;i++){c=(a + b)/2;if(std::fabs(f(c))<1e-5)return c;if(f(a)*f(c)<0)b = c;else a = c;} return c;} int main(){double root=bisectionMethod(f,1,2); std::cout<<"Root:"<<root<<std::endl; …} AI C++ ChatGPT-4
Prob3685 public class SommeNombres {public static void main(String[] args){int somme=0; for(int i = 1;i <= 10;i++){somme+=i;} System.out.println("The sum of numbers from 1 to 10 is: "+somme); …} AI Java ChatGPT-3.5

Note: The Sample_Code column is truncated for readability. The full dataset is available in the online repository.
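As a quick illustration of the five-column layout, the following pandas sketch builds a tiny in-memory frame with the same schema and applies the kind of filtering and grouping the structure supports. The rows are illustrative stand-ins, not actual dataset entries; in practice one would load the real file with pd.read_csv.

```python
import pandas as pd

# Tiny in-memory frame mirroring the five-column schema of
# HumanVsAI_CodeDataset.csv; the rows are illustrative stand-ins.
df = pd.DataFrame({
    "Problem_id":  ["Prob1996", "Prob4923", "Prob2126", "Prob3685"],
    "Sample_Code": ["public class Main { /* ... */ }",
                    "#include <bits/stdc++.h> /* ... */",
                    "#include <iostream> /* ... */",
                    "public class SommeNombres { /* ... */ }"],
    "Generated":   ["Human", "Human", "AI", "AI"],
    "Language":    ["Java", "C++", "C++", "Java"],
    "Source":      ["CodeNet", "CodeNet", "ChatGPT-4", "ChatGPT-3.5"],
})

# Language-specific filtering: all AI-generated Java samples.
ai_java = df[(df["Generated"] == "AI") & (df["Language"] == "Java")]
print(ai_java["Problem_id"].tolist())  # ['Prob3685']

# Per-class, per-language counts (the layout of Table 1).
print(df.groupby(["Generated", "Language"]).size())
```

On the actual dataset, the same operations apply after `df = pd.read_csv("HumanVsAI_CodeDataset.csv")`.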

3.2. Classification results

To evaluate our proposed dataset, we examined whether human-written and AI-generated code samples could be distinguished based on extracted features. We used two classifiers: Extreme Gradient Boosting (XGBoost) [7] on hand-crafted features and a Long Short-Term Memory (LSTM) network [8] on raw code sequences. Both achieved high performance across all standard metrics, as reported in Table 3. This confirms the dataset's coherence and its suitability for evaluating both manual and learned representations. Details regarding the dataset-oriented interpretation of these results are provided in Section 4.2, focusing on how the extracted features contribute to class separability and what these outcomes reveal about the quality and structural properties of the dataset.

Table 3.

The classification performance.

Baseline classification Accuracy % Precision % Recall % F1 score % AUC %
XGBoost 94.55 95 95 94.54 98.63
LSTM 98.37 97.59 99.22 98.40 99.42

4. Experimental Design, Materials, and Methods

4.1. Experimental design and materials

The data were collected over three months and divided into two primary steps:

4.1.1. Data collection

  • Human code samples: The curation of the 5000 human-authored samples followed a structured, multi-stage pipeline that exploits the hierarchical metadata of the Project CodeNet repository [2]. The process began at the dataset level by consulting the main metadata file, problem_list.csv, to identify candidate tasks from the total pool of 4053 problem IDs. To ensure the final collection reflected a realistic and varied landscape of programming, we further refined this selection by reviewing the individual HTML description files for each problem, maintaining a balance between elementary exercises and sophisticated algorithms. This selection process yielded a foundation of 937 unique tasks for the human-authored subset. Once the tasks were identified, problem-level metadata was used to perform high-precision filtering based on the selection criteria presented in Table 4. Every code sample was selected by its unique submission ID and required a status of "Accepted" (AC), ensuring the code was functionally correct and passed all prescribed tests. To capture diverse human coding patterns, we selected up to two unique users per language for each task. Additionally, the size of the source code in bytes was analyzed to intentionally include a mix of short-form logic and long-form implementations for every task. Using the 937 tasks as the basis for distribution, we obtained 1605 Java samples, 1419 C++ samples, 1207 Python samples, and 769 C samples. While the goal was to capture two unique users per task, the final dataset mixes tasks that contributed two unique authors with tasks that contributed only one. This variable number of authors per task was an intentional choice to preserve the authenticity and integrity of the data.
Because some programming languages are much more widely used than others in the source repository of the CodeNet dataset, not every task contained multiple accepted solutions for every language. By selecting only what was naturally available, the dataset remains a realistic reflection of human coding practices in real-world programming environments. In the final step, all selected code samples were gathered into a single consolidated Excel file. The human-written samples used in this dataset were collected in 2024 from the metadata archive (Project_CodeNet_metadata.tar.gz), which predates IBM's consolidation of metadata and source files into the unified release (Project_CodeNet.tar.gz, 7.8 GB).

  • AI code samples: The AI-generated code samples were produced using ChatGPT-3.5 and ChatGPT-4 during a documented generation period spanning October to December 2024. To build this subset, we followed a systematic strategy that ensures both functional quality and variety. First, to generate solutions for the same tasks used in the human subset, we extracted detailed requirements directly from the original HTML problem descriptions in the Project CodeNet repository. However, we observed that the powerful generative capabilities of ChatGPT often produced code that was nearly identical to existing human submissions. To maintain logical diversity, we implemented a process of prompt reformulation, making minor adjustments to the phrasing of the tasks to encourage the AI to generate distinct logical structures while strictly preserving the original problem intent. Our methodology also accounts for the different ways human and AI code are collected. For the human-authored data, we used a "one-to-many" relationship, where a single programming task is linked to multiple solutions from different authors. In contrast, the AI-generated data followed a "one-to-one" approach, where each unique prompt typically yielded a single, specialized response. In this framework, each task was used to prompt the AI in four programming languages (C, C++, Java, and Python), producing one distinct code sample per language for every task. To avoid an unbalanced dataset and better reflect real-world AI usage, the generated subset was enriched with additional mixed tasks drawn from both competitive programming (CodeNet) and non-competitive domains. These non-competitive tasks were chosen to cover a wide range of programming domains and problem types, across both basic and advanced tasks, ensuring that the resulting dataset is both diverse and complex for AI code classification research.
The non-competitive domains include games and simulations, object-oriented programming (OOP) and classes, security, machine learning and deep learning, economics and financial problems, statistical and data analysis, and library and file handling. This expansion ensured the collection reached the target of 5000 samples while providing a broad and diverse spectrum of solved tasks; an equal number of AI-generated samples per language was drawn from tasks shared with the human dataset to control for task-related confounding. The prompts listed in the "Coding_prompts.csv" file within the compressed ZIP provide an overview of task intent rather than the full instructions used during generation, particularly for the non-competitive tasks. In practice, more detailed, task-specific prompts were employed, accompanying each prompt with a brief explanation and a structured outline of the reasoning steps preceding the final solution, reflecting common instructional usage. Simple prompts were also used where appropriate, as the ChatGPT model is capable of producing solutions even when explicit reasoning or step-by-step justification is not requested. As a result, the generated code covers a spectrum of complexity and abstraction, consistent with the variety found in practical AI-generated code.
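The prompt-construction strategy described above can be sketched as follows. The build_prompt helper and its wording are hypothetical illustrations, not the exact prompts used (the published task overviews are in Coding_prompts.csv); the commented-out API call shows roughly how a generation request might be issued with the OpenAI Python client.

```python
# Hypothetical sketch of per-task prompt assembly; build_prompt and its
# phrasing are illustrative, not the exact prompts used in this dataset.
def build_prompt(task_description: str, language: str, with_steps: bool = False) -> str:
    prompt = (f"Solve the following problem in {language}.\n"
              f"Problem: {task_description}\n")
    if with_steps:
        # Reflects the "structured outline of reasoning steps" variant.
        prompt += "Briefly outline your reasoning steps, then give the final solution.\n"
    return prompt

p = build_prompt("Count the '1' characters in a binary string.", "C++", with_steps=True)
print(p)

# Sketch of the generation call (requires an API key and the openai package):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": p}],
# )
```

Prompt reformulation, as described in the text, would amount to generating several phrasings of the same task_description while keeping the problem intent fixed.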

Table 4.

Human-authored subset selection criteria.

Criterion Metadata field Objective
Correctness status Only submissions marked as "Accepted" (AC) were selected to guarantee the code passed all tests.
Task logic diversity problem_id 937 unique tasks were selected to ensure a broad landscape of programming logic.
Language language Selection was filtered across C, C++, Java, and Python for diverse representations.
Authorship variety user_id Selected up to two unique users per language to capture different human coding patterns.
Stylistic diversity code_size Used the byte count of the source file to select a mix of short- and long-form code.
Data integrity submission_id Each sample is uniquely identified and extracted from its specific directory path.
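The selection criteria in Table 4 can be approximated with a short pandas sketch. The toy metadata frame below uses the field names cited in the table; the exact CodeNet metadata layout and the original selection script may differ.

```python
import pandas as pd

# Toy metadata frame using the field names from Table 4; rows are invented.
meta = pd.DataFrame({
    "submission_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "problem_id":    ["p00001"] * 6,
    "user_id":       ["u1", "u1", "u2", "u3", "u4", "u5"],
    "language":      ["Java", "Java", "Java", "Java", "Python", "Python"],
    "status":        ["Accepted", "Wrong Answer", "Accepted", "Accepted",
                      "Accepted", "Accepted"],
    "code_size":     [512, 480, 2048, 300, 150, 900],
})

# 1) Correctness: keep only functionally correct submissions.
accepted = meta[meta["status"] == "Accepted"]

# 2) Authorship variety: one submission per user, then at most two
#    unique users per (problem, language) pair, as the text describes.
per_user = accepted.drop_duplicates(subset=["problem_id", "language", "user_id"])
selected = per_user.groupby(["problem_id", "language"]).head(2)

print(selected["submission_id"].tolist())  # ['s1', 's3', 's5', 's6']
```

A real pipeline would additionally balance on code_size to mix short- and long-form implementations, as the stylistic-diversity criterion requires.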

4.1.2. Data labeling and organization

After merging the AI-generated and human-written code into a unified Excel spreadsheet, a verification step was applied to detect duplicate rows, missing values, and structural inconsistencies prior to label assignment. Once this verification was complete, we proceeded to the labeling phase, assigning the label "Human" to human-written code and "AI" to ChatGPT-generated code. Each code sample was also annotated with its programming language and source. All samples were first arranged in an Excel file and later converted to CSV format using a Python script. Fig. 3 describes the data construction process.
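A minimal sketch of the merging, verification, and labeling steps described above, assuming simplified column contents; the toy rows and in-memory output are illustrative stand-ins for the actual Excel-to-CSV script.

```python
import io
import pandas as pd

# Toy versions of the two subsets; real rows carry full code samples.
human = pd.DataFrame({"Sample_Code": ["int main(){return 0;}", None],
                      "Language": ["C", "C"], "Source": ["CodeNet", "CodeNet"]})
ai = pd.DataFrame({"Sample_Code": ["print('hi')", "print('hi')"],
                   "Language": ["Python", "Python"],
                   "Source": ["ChatGPT-4", "ChatGPT-4"]})

# Labeling phase: annotate authorship before merging.
human["Generated"] = "Human"
ai["Generated"] = "AI"

# Verification: drop rows with missing code and duplicate rows.
merged = pd.concat([human, ai], ignore_index=True)
cleaned = merged.dropna(subset=["Sample_Code"]).drop_duplicates()

# Export to CSV (in practice: 'HumanVsAI_CodeDataset.csv').
buf = io.StringIO()
cleaned.to_csv(buf, index=False)
print(len(cleaned))  # 2 rows survive cleaning in this toy example
```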

Fig. 3.


Flowchart of the multi-step process.

4.2. Machine and deep learning experiments

In this subsection, we present the training setup adopted for the experiments and describe the two approaches, interpreting their results in relation to our dataset.

  • XGBoost Experiment: We implemented an XGBoost classifier using four categories of handcrafted features (lexical, structural, semantic, and behavioral) to distinguish AI-generated from human-written code. The strong performance observed (test accuracy of 94.55 %, AUC of 0.9863, and F1-score of 0.9454, as reported in Table 3), and in particular the consistently high GroupKFold cross-validation results, serves primarily as quality control for the dataset. These metrics demonstrate that the dataset contains highly separable classes of human and AI-generated code. The effectiveness of the diverse features in achieving such clear separation indicates that these code characteristics provide robust and distinguishable patterns: lexical features point to inherent stylistic variations; structural features reveal distinct organizational and logical constructs; semantic features likely highlight characteristic variable naming or intent expressions; and behavioral features capture typical human coding habits. The consistently high accuracy across folds, coupled with a low standard deviation, shows that these patterns are present and distinct across different problem_ids, validating the dataset's quality and the absence of significant class overlap, and providing a rich foundation for differentiating between code origins.

  • LSTM Experiment: To further assess the quality of our dataset, we evaluated an LSTM network that captures sequential and contextual patterns from code samples processed with CodeT5. The training process, optimized with early stopping and problem_id stratification to prevent data leakage, yielded a test accuracy of 98.37 % and an F1-score of 0.9840. These results are framed strictly as quality-control evidence, demonstrating that the dataset possesses exceptional class separability and a clear signal. The fact that a standard model achieves such high performance highlights the strong intrinsic distinction of the extracted representations and the robust structural properties of the data itself. Ultimately, this outcome confirms that the dataset contains highly informative patterns for distinguishing human from AI code, making it an excellent resource for developing reliable detection systems.

The objective of this combination of classical and deep learning experiments is to demonstrate that the dataset is balanced, representative, and suitable for both approaches.
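The task-level leakage control used in both experiments (grouping samples by problem_id so that no task appears in both train and test splits) can be demonstrated with scikit-learn's GroupKFold; the features and labels below are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-ins: 12 samples, 4 tasks, 3 samples per task.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))                      # placeholder feature vectors
y = np.array([0, 1] * 6)                          # 0 = Human, 1 = AI
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)   # problem_id per sample

# Each fold holds out whole tasks, so samples sharing a problem_id
# never appear on both sides of the split.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
print("no problem_id leaks across folds")
```

In the actual experiments the same grouping would be driven by the Problem_id column of HumanVsAI_CodeDataset.csv.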

Limitations

A limitation of this dataset is that the AI-generated code samples are produced solely by ChatGPT, which may reduce generalizability to outputs from other models. Likewise, the human-written code is collected from a single source, limiting the diversity of coding styles and practices. In addition, the dataset covers only four programming languages, which constrains its applicability to multilingual code-detection tasks, and the AI-generated code was produced without advanced prompting strategies, which may limit the diversity of prompting conditions represented in the dataset.

Ethics Statement

  • All code samples are included with appropriate attribution to their sources.

  • This dataset includes code samples written by human programmers from publicly available repositories, as well as code generated by large language models (LLMs).

  • No personal or sensitive information is included in this data; users of this dataset are encouraged to employ it responsibly and ethically.

Credit Author Statement

Ghizlane Boukili: Conceptualization, Methodology, Software, Data curation, Writing – original draft; Jamal Riffi: Supervision, Writing – review & editing, Project administration; Said El Garouani: Writing – review & editing, Project administration.

Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References

  • 1. Husain H., Wu H.-H., Gazit T., Allamanis M., Brockschmidt M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. 2020.
  • 2. Puri R., Kung D.S., Janssen G., Zhang W., Domeniconi G., Zolotov V., Dolby J., Chen J., Choudhury M., Decker L., Thost V., Buratti L., Pujar S., Ramji S., Finkler U., Malaika S., Reiss F. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. Proc. 35th Conf. Neural Inf. Process. Syst. (NeurIPS), Datasets and Benchmarks Track, Vol. 1, 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/a5bfc9e07964f8dddeb95fc584cd965d-Paper-round2.pdf
  • 3. Karampatsis R.-M., Sutton C. How often do single-statement bugs occur? The ManySStuBs4J dataset. Proc. 17th Int. Conf. Min. Softw. Repos. (MSR), Seoul, Republic of Korea, 2020, pp. 573–577. ACM.
  • 4. Chen M., Tworek J., Jun H., Yuan Q., de Oliveira Pinto H.P., Kaplan J., Edwards H., Burda Y., Joseph N., Brockman G., Ray A., Puri R., Krueger G., Petrov M., Khlaaf H., Sastry G., Mishkin P., Chan B., Gray S., Ryder N., Pavlov M., Power A., Kaiser L., Bavarian M., Winter C., Tillet P., Such F.P., Cummings D., Plappert M., Chantzis F., Barnes E., Herbert-Voss A., Guss W.H., Nichol A., Paino A., Tezak N., Tang J., Babuschkin I., Balaji S., Jain S., Saunders W., Hesse C., Carr A.N., Leike J., Schulman J., Hilton J., Nakano S., Hyeongwon C. Evaluating Large Language Models Trained on Code. 2021.
  • 5. Austin J., Odena A., Nye M., Bosma M., Michalewski H., Dohan D., Jiang E., Cai C., Terry M., Le Q., Sutton C. Program Synthesis with Large Language Models. 2021. doi:10.48550/arXiv.2108.07732.
  • 6. Wang Y., Wang W., Joty S., Hoi S.C.H. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. Proc. EMNLP 2021, Punta Cana, Dominican Republic, 2021, pp. 8696–8708.
  • 7. Chen T., Guestrin C. XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2016, pp. 785–794.
  • 8. Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780. doi:10.1162/neco.1997.9.8.1735.
