PLoS One. 2021 Nov 30;16(11):e0260395. doi: 10.1371/journal.pone.0260395

Detecting fabrication in large-scale molecular omics data

Michael S Bradshaw 1,*, Samuel H Payne 2
Editor: Frederique Lisacek
PMCID: PMC8631639  PMID: 34847169

Abstract

Fraud is a pervasive problem and can occur as fabrication, falsification, plagiarism, or theft. The scientific community is not exempt from this universal problem, and several studies have recently been caught manipulating or fabricating data. Current measures to prevent and deter scientific misconduct come in the form of the peer-review process and on-site clinical trial auditors. As recent advances in high-throughput omics technologies have moved biology into the realm of big data, fraud detection methods must be updated to catch sophisticated computational fraud. In the financial sector, machine learning and digit frequencies are successfully used to detect fraud. Drawing from these sources, we develop methods of fabrication detection in biomedical research and show that machine learning can be used to detect fraud in large-scale omics experiments. Using gene copy-number data as input, machine learning models correctly predicted fraud with 58–100% accuracy. With digit frequencies as input features, the models detected fraud with 82–100% accuracy. All of the data and analysis scripts used in this project are available at https://github.com/MSBradshaw/FakeData.

Introduction

Fraud is a pervasive problem and can occur as fabrication, falsification, plagiarism, or theft. Examples of fraud are found in virtually every field, such as education, commerce, and technology. With the rise of electronic crimes, specific criminal justice and regulatory bodies have been formed to detect sophisticated fraud, creating an arms race between methods to deceive and methods to detect deception. The scientific community is not exempt from the universal problem of fraud, and several studies have recently been caught manipulating or fabricating data [1, 2] or are suspected of it [3]. More than two million scientific articles are published yearly and ~2% of authors admit to data fabrication [4]. When these same authors were asked if they personally knew of colleagues who had fabricated, falsified, or modified data, positive response rates rose to 14–19% [4, 5]. Some domains or locales have somewhat higher rates of data fabrication; in a recent survey of researchers at Chinese hospitals, 7.37% of researchers admitted to fabricating data [6]. Overall, these rates of data fabrication potentially mean tens to hundreds of thousands of articles are published each year with manipulated data.

Data in the biological sciences is particularly vulnerable to fraud given its size—which makes it easier to hide data manipulation—and researcher’s dependence on freely available public data. Recent advances in high-throughput omics technologies have moved biology into the realm of big-data. Many diseases are now characterized in populations, with thousands of individuals characterized for cancer [7], diabetes [8], bone strength [9], and health care services for the general populace [10]. Large-scale characterization studies are also done for cell lines and drug responses [11, 12]. With the rise of importance of these large datasets, it becomes imperative that they remain free of errors both unintentional and intentional [13].

Current methods for ensuring the validity of research are largely limited to the peer-review process, which, as of late, has proven to be insufficient at spotting blatant duplication of images [14], let alone subtleties hidden in large-scale data. Data for clinical trials can be subject to reviews and central monitoring [15, 16]. However, the decision regarding oversight methodology and frequency is not driven by empirical data, but rather is determined by clinics’ usual practice [17]. The emerging data deluge challenges the effectiveness of traditional auditing practices to detect fraud, and several studies have suggested addressing the issue with improved centralized and independent statistical monitoring [5, 6, 16, 18]. However, these recommendations are given chiefly to help ensure the safety and efficacy of the study, not data integrity.

In 1937, physicist Frank Benford observed in a compilation of 20,000 numbers that the first digit did not follow a uniform distribution as one might anticipate [19]. Instead, Benford observed that the digit 1 occurred about 30% of the time, the digit 2 about 18%, the digit 3 about 13%, and that the pattern continued decaying, ending with the digit 9 occurring less than 5% of the time. Why this numerical pattern exists can be explained by looking at the relative change from lower versus higher first-digit numbers. For example, moving a value from 1,000 to 2,000 is a 100% increase, while changing from 8,000 to 9,000 is an increase of only 12.5%. This pattern holds true in most large collections of numbers, including scientific data, where the upper and lower limits are not tightly bound. Comparing a distribution of first digits to a Benford distribution can be used to identify deviations from the expected frequency, often because of fraud. Recently, Benford’s law has been used to identify fraud in financial records of international trade [20] and money laundering [21]. It has also been used on a smaller scale to reaffirm suspicions of fraud in clinical trials [3]. It should be noted that Benford’s law, despite being called a law, is not always followed and does have some limitations. If the upper and lower limits of a dataset are tightly bound (the dataset cannot span orders of magnitude), a Benford-like digit distribution may not be able to form.
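For reference, the expected Benford frequency of a leading digit d is log10(1 + 1/d). The following minimal Python sketch (not part of the original analysis) tabulates these expected frequencies:

```python
import math

# Expected first-digit frequencies under Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
for digit, p in benford.items():
    print(f"{digit}: {p:.1%}")
# Prints roughly: 1: 30.1%, 2: 17.6%, 3: 12.5%, ..., 9: 4.6%
```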

The distinction between fraud and honest error is important to make; fraud is the intent to cheat [5]. This is the definition used throughout this paper. An honest error might be forgetting to include a few samples, while intentionally excluding samples would be fraud. Incorrectly copying and pasting values from one table to another is an honest error, but intentionally changing the values is fraud. In these examples the results may be the same but the intent behind them differs wildly. In efforts to maintain data integrity, identifying the intent of the misconduct may be impossible and is also a secondary consideration after suspect data has been identified.

Data fabrication is “making up data or results and recording or reporting them” [5]. This type of data manipulation, when not documented for bona fide applications such as simulation or imputation of missing values, is free from the above ambiguity relating to the author’s intent. Making up data “such that the research is not accurately represented in the research record” [5] is always wrong. We explore methods of data fabrication and detection in molecular omics data using supervised machine learning and Benford-like digit frequencies. We do not attempt to explain why someone may choose to fabricate their data, as other studies have done [6, 22]; our only goal is to evaluate the utility of digit frequencies to differentiate real from fake data. The data used in this study comes from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) cohort for endometrial carcinoma, which contains copy number alteration (CNA) measurements from 100 tumor samples [23, 24]. We created 50 additional fake samples for these datasets. Three different methods of varying sophistication are used for fabrication: random number generation, resampling with replacement, and imputation. We show that machine learning and digit frequencies can be used to detect fraud with near-perfect accuracy.

Methods

Real data

The real data used in this publication originated from the genomic analysis of uterine endometrial cancer. As part of the Clinical Proteomic Tumor Analysis Consortium (CPTAC), 100 tumor samples underwent whole genome and whole exome sequencing and subsequent copy number analysis. We used the results of the copy number analysis as is; they are stored in our GitHub repository at https://github.com/MSBradshaw/FakeData.

Fake data

Fake data used in this study was generated using three different methods. In each method, we created 50 fake samples, which were combined with the 100 real samples to form a mixed dataset. The first method to generate fake data was random number generation. For every gene locus, we first find the maximum and minimum values observed in the original data. A new sample is then fabricated by randomly picking a value within this gene-specific range. The second method to generate fake data was sampling with replacement. For this, we create lists of all observed values across the cohort for each gene. A fake sample is created by randomly sampling from these lists with replacement. The third method to generate fake data is imputation, performed using the R package missForest [25], which we repurposed for data fabrication. A fake sample was generated by first creating a copy of a real sample. Then we iteratively nullified 10% of the data in each sample and imputed these NAs with missForest until every value had been imputed and the fake sample no longer shared any data originally copied from the real sample (S1 Fig).
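As an illustration, a minimal Python sketch of the first two fabrication strategies is given below; the published implementation was written in R, so this is only a sketch. It assumes the real data is loaded as a pandas DataFrame `real` with samples as rows and genes as columns, and the function names are hypothetical. The imputation-based method relied on the R package missForest and is not reproduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def fabricate_random(real: pd.DataFrame, n_fake: int) -> pd.DataFrame:
    """Random method: draw each gene's value uniformly from that gene's observed range."""
    lo, hi = real.min(axis=0).values, real.max(axis=0).values
    fake = rng.uniform(lo, hi, size=(n_fake, real.shape[1]))
    return pd.DataFrame(fake, columns=real.columns)

def fabricate_resample(real: pd.DataFrame, n_fake: int) -> pd.DataFrame:
    """Resampling method: sample each gene's values with replacement from the cohort."""
    fake = {g: rng.choice(real[g].values, size=n_fake, replace=True) for g in real.columns}
    return pd.DataFrame(fake)
```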

Machine learning training

With a mixed dataset containing 100 real samples and 50 fake samples, we proceeded to create and evaluate machine learning models which predict whether a sample is real or fabricated (S2 Fig). The 100 real and 50 fake samples were each randomly split in half, one portion added to a training set and the other held out for testing. Given that simulations on biological data like this have never, to our knowledge, been done, we did not have any expectation as to which type of model would perform best at this task. Thus, we tried a wide variety of models, all implementing fundamentally different algorithms. Sticking to models included in scikit-learn [26] with a common interface increased code reusability and allowed for quick and consistent comparison. Using Python’s scikit-learn library, we evaluated five machine learning models:

  1. Gradient boosting (GBC) [27]: an ensemble method based on the creation of many weak decision trees (shallow trees, sometimes containing only 2 leaf nodes).

  2. Naïve Bayes (NB) [28]: a probabilistic classifier based on Bayes’ theorem.

  3. Random Forest (RF) [29]: an ensemble method of many decision trees; it differs from GBC in that the decision trees are not weak but full trees working on slightly different subsets of the training features.

  4. K-Nearest Neighbor (KNN) [30]: does not perform any learning per se, but classifies based on proximity to labeled training data.

  5. Support Vector Machine (SVM) [31]: a statistical learning method that operates by maximizing the size of the gap between classification categories.

Training validation was done using 10-fold cross validation. We note explicitly that the training routine never had access to the testing data. After all training was complete, the held-out test set was fed to each model for prediction and scoring. We used simple accuracy and F1 scores as evaluation metrics. For each sample in the test set, the ML models predicted whether it was real or fabricated. Model accuracy was calculated as the number of correct predictions divided by the total number of predictions. To assess the number of false positives and false negatives, we also computed the F1 score [32]. The entire process of fake data generation and ML training/testing was repeated 50 times. Different random seeds were used when generating each set of fake data; thus, the fake samples in all 50 iterations are distinct from each other. Grid search parameter optimization was performed to select the hyperparameter set for each model. The parameter search spaces used, and all of the data and analysis scripts used in this project, are available at https://github.com/MSBradshaw/FakeData. A full list of the final parameters used for each model-dataset pair can be found in “S1 File”.
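A minimal scikit-learn sketch of this training and evaluation routine is shown below. `X` and `y` are the assumed feature matrix and real/fabricated labels, and the hyperparameter grids are illustrative placeholders rather than the published search spaces (those are in the GitHub repository).

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative grids only; the actual search spaces are in the project repository.
models = {
    "GBC": (GradientBoostingClassifier(), {"n_estimators": [50, 100]}),
    "NB":  (GaussianNB(), {}),
    "RF":  (RandomForestClassifier(), {"n_estimators": [100, 500]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10]}),
}

def train_and_score(X, y, seed=0):
    """Split half/half, grid-search each model with 10-fold CV, score on held-out data."""
    # Labels assumed binary: 1 = fabricated, 0 = real.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    scores = {}
    for name, (est, grid) in models.items():
        search = GridSearchCV(est, grid, cv=10)   # 10-fold cross validation on training set
        search.fit(X_tr, y_tr)
        pred = search.predict(X_te)
        scores[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
    return scores
```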

We compared two types of input to the machine learning models. For the first, we used the gene copy-number data from CPTAC (17,156 training features/genes in total), with the added fabricated samples, as the training and test data. For the second, rather than directly using the copy-number values, we used the proportional frequency of the digits 0–9 in the first and second positions after the decimal place (digit frequencies). This results in 20 training features in total: the frequency of each digit 0–9 in the first position after the decimal place, and the frequency of each digit 0–9 in the second position.

Benford-like digit frequencies

Benford’s law, or the first-digit law, has been instrumental in catching fraud in various financial settings [20, 21] and in small-scale clinical trials [3]. The distribution of digit frequencies in a set of numbers conforming to Benford’s law has a long right tail; the lower the digit, the greater its frequency of occurrence. The CNA data used here follows a similar pattern (S3 Fig). The method presented here is designed with the potential to generalize and be applied to multiple sets of data of varying types and configurations (i.e., different measured variables and different numbers of variables). Once trained, machine learning models are restricted to data that conform to the model input specifications (e.g., the same number of input features). Converting all measured variables to digit frequencies circumvents this problem. Digit frequencies are calculated as the number of occurrences of a single digit (0–9) divided by the total number of features. In the method described in this paper, a sample’s features are all converted to digit frequencies of the first and second digit after the decimal. Thus, for each sample, the features are converted from 17,156 copy number alterations to 20 digit frequencies. Using this approach, whether a sample has 100 or 17,156 features, it can still be trained on and classified by the same model (though its effectiveness will still depend on the existence of digit-frequency patterns).
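A minimal sketch of this feature transformation is shown below, assuming the copy-number table is a pandas DataFrame with samples as rows and genes as columns; the function and feature names are illustrative, not taken from the published code.

```python
import pandas as pd

def digit_frequencies(sample: pd.Series) -> pd.Series:
    """Convert one sample's quantitative values into 20 proportional digit-frequency features."""
    first, second = [0] * 10, [0] * 10
    vals = sample.dropna()
    for v in vals:
        decimals = f"{abs(v):.6f}".split(".")[1]   # digits after the decimal point
        first[int(decimals[0])] += 1               # first digit after the decimal
        second[int(decimals[1])] += 1              # second digit after the decimal
    n = len(vals)
    feats = {f"pos1_digit{d}": first[d] / n for d in range(10)}
    feats.update({f"pos2_digit{d}": second[d] / n for d in range(10)})
    return pd.Series(feats)

# Example usage (rows = samples, columns = genes):
# digit_features = cna_table.apply(digit_frequencies, axis=1)
```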

Computing environment

Data fabrication was performed using the R programming language version 3.6.1. For general computing, data manipulation, and file input/output we used several packages from the tidyverse [33]: readr, tibble, and dplyr. Most figures were generated using ggplot2 in R, with grid and gridExtra filling some gaps in plotting needs. Data fabrication via imputation was performed using the missForest package [25].

The machine learning aspect of this study was performed in Python 3.8.5. All models and evaluation methods came from the package scikit-learn (sklearn) version 0.23.2 [26]. Pandas version 1.1.3 was used for all reading and writing of files [34]. The complete list of parameters used for each model and dataset pair can be found in the supplemental material online, “S1 File”.

Results

Our goal is to explore the ability of machine learning methods to identify fabricated data hidden within large datasets. Our results do not focus on the motivations to fabricate data, nor do they explore in depth the infinite methodological ways to do so. Our study focuses on whether machine learning can be trained to correctly identify fabricated data. Our general workflow is to take real data and mix in fabricated data. When training, the machine learning model is given access to the label (i.e. real or fabricated); the model is tested or evaluated by predicting the label of data which was held back from training (see Methods).

Fake data

The real data used in this study comes from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) cohort for endometrial carcinoma, specifically the copy number alteration (CNA) data. The form of this real data is a large table of floating point values. Rows represent individual tumor samples and columns represent genes; values in the cells are thus the copy number quantification for a single gene in an individual tumor sample. This real data was paired with fabricated data and used as an input to machine learning classification models (see Methods). Three different methods of data fabrication were used in this study: random number generation, resampling with replacement, and imputation (S1 Fig). The three methods represent three potential ways that an unscrupulous scientist might fabricate data. Each method has benefits and disadvantages, with imputation being both the most sophisticated and the most computationally intense and complex. As seen in Fig 1, the randomly generated data clusters far from the real data. Both the resampled and imputed data cluster tightly with the real data in a PCA plot, with the imputed data also generating a few reasonable outlier samples.

Fig 1. Principal component analysis of real and fake samples.


Copy number data for the real and fabricated samples are shown. The fabricated data created via random number generation is clearly distinct from all other data. Fabricated data created via resampling or imputation appears to cluster very closely with the real data.

To look further into the fabricated data, we plotted the distribution of the first two digits after the decimal place in the real and fake data (S3 Fig). While none of the fake datasets quite match the real data’s spread of digit distributions in terms of variation, data created via imputation matches the real data most closely in terms of mean digit frequencies. We also examined whether fake data preserved correlative relationships present in the original data (S4 Fig). This is exemplified by two pairs of genes. PLEKHN1 and HES4 are adjacent genes found on chromosome 1p36, separated by ~30,000 bp. Because they are so closely located on the chromosome, it is expected that most copy number events, like large-scale duplications and deletions, would include both genes. As expected, their CNA data has a Spearman correlation coefficient of 1.0 in the original data, a perfect correlation. The second pair of genes, DFFB and OR4F5, are also on chromosome 1, but are separated by 3.8 Mbp. As somewhat closely located genes, we would expect a modest correlation between CNA measurements, but not as high as for the adjacent gene pair. Consistent with this expectation, their CNA data has a Spearman correlation coefficient of 0.27. Depending on the method of fabrication, fake data for these two gene pairs may or may not preserve these correlative relationships. When we look at the random and resampled data for these two genes, all correlation is lost (S4C–S4F Fig). Imputation, however, produces data that closely matches the original correlations: PLEKHN1 and HES4, R2 = 0.97; DFFB and OR4F5, R2 = 0.32 (S4G and S4H Fig).

Machine learning with quantitative data

We tested five different machine learning methods to create a model capable of detecting fabricated data: Gradient Boosting (GBC), Naïve Bayes (NB), Random Forest (RF), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). Models were given as features the quantitative data table containing copy number data on 75 labeled samples, 50 real and 25 fake. In the copy number data, each sample had measurements for 17,156 genes, meaning that each sample had 17,156 features. After training, the model was asked to classify held-out testing data containing 75 samples, 50 real and 25 fake. The classification task considers each sample separately, meaning that the declaration of real or fake is made only from the data of a single sample. We evaluated the models on accuracy (Fig 2A, 2C and 2E), to quantify true positives and true negatives, and F1 scores (Fig 2B, 2D and 2F), to assess false positives and false negatives. To ensure that our results represent robust performance, model training and evaluation was performed 50 times; each time a completely new set of 25 fabricated samples was made (see Methods). Reported results represent the average accuracy of these 50 trials.

Fig 2. Classification accuracy using copy number data.


Fabricated data was mixed with real data and given to five machine learning models for classification. Data shown represents 50 trials for 50 different fabricated dataset mixes. Features in this dataset are the copy number values for each sample. Outliers are shown as red asterisks; these same outliers are also shown as normally colored points in the jittered-point overlay. A. Results for data fabricated with the random method, mean classification accuracy: RF 97% (+/- 2.5%), SVM 98% (+/- 1.5%), GBC 92% (+/- 4.2%), NB 88% (+/- 3.5%), KNN 72% (+/- 3.4%). B. Results for data fabricated with the random method, mean classification F1: RF 0.95 (+/- 0.03), SVM 0.98 (+/- 0.02), GBC 0.88 (+/- 0.07), NB 0.85 (+/- 0.04), KNN 0.25 (+/- 0.16). C. Results for data fabricated with the resampling method, mean classification accuracy: RF 70% (+/- 2.6%), SVM 67% (+/- 2.7%), GBC 74% (+/- 6%), NB 58% (+/- 15.2%), KNN 67% (+/- 0%). D. Results for data fabricated with the resampling method, mean classification F1: RF 0.21 (+/- 0.12), SVM 0.38 (+/- 0.09), GBC 0.53 (+/- 0.12), NB 0.19 (+/- 0.23), KNN 0 (+/- 0). E. Results for data fabricated with the imputation method, mean classification accuracy: RF 100% (+/- 0%), SVM 100% (+/- 0%), GBC 100% (+/- 0%), NB 66% (+/- 6.7%), KNN 100% (+/- 0%). F. Results for data fabricated with the imputation method, mean classification F1: RF 1 (+/- 0), SVM 1 (+/- 0), GBC 1 (+/- 0), NB 0.62 (+/- 0.05), KNN 1 (+/- 0).

The five models overall performed relatively well on the classification task for data fabricated with the random approach. The average accuracy scores over 50 trials were: RF 96%, SVM 98%, GBC 92%, NB 88%, and KNN 72% (Fig 2A). Mean classification accuracies were lower for data created with the resampling method, with most models losing anywhere from 5–31% accuracy (RF 70%, SVM 67%, GBC 74%, NB 58%, and KNN 67%) (Fig 2C). Since the resampling method uses data values from the real data, it is possible that fake samples very closely resemble real samples. Imputation classification accuracy results were quite high (RF 100%, SVM 100%, GBC 100%, NB 66%, KNN 100%). While RF, GBC and KNN all increased in accuracy compared to the resampled data, NB performed more or less at the expected baseline accuracy (Fig 2E).

Machine learning with digit frequencies

We were unsatisfied with the classification accuracy of the above models. One challenge for machine learning in our data is that the number of features (17,156) far exceeds the number of samples (75). In such high-dimensionality situations, feature reduction techniques, such as principal component analysis [35], can be used to reduce the number of features, increasing performance and/or decreasing training time. We therefore explored ways to reduce or transform the feature set, and also to make the feature set more general and broadly applicable. Intrigued by the success of digit-frequency methods in the identification of financial fraud [21], we evaluated whether this type of data representation could work for bioinformatics data as well. Therefore, all copy number data was transformed into 20 features, representing the digits 0–9 in the first and second place after the decimal of each gene copy number value. While Benford’s law describes the frequency of the first digit, genomics and proteomics data are frequently normalized or scaled, so the first digit may not be as characteristic. The shift to use the digits after the decimal point rather than the leading digit is necessary because of the constraint that Benford’s law works (best) for numbers spanning several orders of magnitude. Because of the normalization present in the CNV data, the true first digits are bounded; for this reason we use the first and second digits after the decimal place, the first unbounded digits in the dataset. This is a dataset-specific adjustment, and variations on it may need to be considered prior to its application to future datasets. For example, in a dataset composed mainly of numbers between 0 and 0.09, one may need to use the third and fourth decimal digits. Due to this adjustment, our method may be more accurately referred to as Benford’s law inspired or Benford-like. These digit-frequency features were tabulated for each sample to create a new data representation and fed into the exact same machine learning training and testing routine described above. Each of these 20 new features contains a decimal value ranging from 0.0 to 1.0, representing the proportional frequency with which that digit occurs. For example, one sample’s value in the feature column for the digit 1 may contain the value 0.3. This means that in this sample’s original data the digit 1 occurred in the first position after the decimal place 30% of the time.

In sharp contrast to the models built on the quantitative copy number data with random and resampled data, machine learning models which utilized the digit frequencies were highly accurate and showed less variation over the 50 trials (Fig 3). When examining the results for the data fabricated via imputation, the models achieved impressively high accuracy despite using drastically less information than those trained with the quantitative copy number values. Averaged over the 50 trials on the imputed data, the RF, SVM, and GBC models achieved 100% accuracy. The NB and KNN models were also highly successful, with mean classification accuracies of 98% and 96%, respectively.

Fig 3. Classification accuracy using digit frequency data.


Fabricated data was mixed with real data and given to five machine learning models for classification. Data shown represents 50 trials for 50 different fabricated dataset mixes. Features in this dataset are the digit frequencies for each sample. Red asterisks represent outliers in the boxplot; these same outliers are also shown as normally colored points in the jittered-point overlay. A. Results for data fabricated with the random method, mean classification accuracy: RF 100% (+/- 0%), SVM 100% (+/- 0%), GBC 100% (+/- 0%), NB 100% (+/- 0%), KNN 97% (+/- 1.3%). B. Results for data fabricated with the random method, mean classification F1: RF 1 (+/- 0), SVM 1 (+/- 0), GBC 1 (+/- 0), NB 1 (+/- 0), KNN 0.96 (+/- 0.02). C. Results for data fabricated with the resampling method, mean classification accuracy: RF 99% (+/- 0.8%), SVM 95% (+/- 2.3%), GBC 99% (+/- 1.7%), NB 96% (+/- 2.1%), KNN 85% (+/- 4.4%). D. Results for data fabricated with the resampling method, mean classification F1: RF 0.99 (+/- 0.01), SVM 0.94 (+/- 0.03), GBC 0.98 (+/- 0.02), NB 0.95 (+/- 0.03), KNN 0.82 (+/- 0.04). E. Results for data fabricated with the imputation method, mean classification accuracy: RF 100% (+/- 0%), SVM 100% (+/- 0.7%), GBC 100% (+/- 0%), NB 98% (+/- 0.7%), KNN 96% (+/- 1.5%). F. Results for data fabricated with the imputation method, mean classification F1: RF 1 (+/- 0), SVM 0.99 (+/- 0.01), GBC 1 (+/- 0), NB 0.97 (+/- 0.01), KNN 0.94 (+/- 0.02).

Machine learning with limited data

With 17,156 CNA gene measurements, the digit frequencies represent a well-sampled distribution. We realize that if one had an extremely limited dataset with CNA measurements for only 10 genes, the sampling of the frequencies for the 10 digits would be poor. To understand how much data is required for a good sampling of the digit frequencies, we used the imputed data and iteratively downsampled our measurements from 17,000 to 10 (17,000 was used instead of the full 17,156 since there would be no way to draw multiple unique subsets of 17,156 features from a set of 17,156 features). With the gene-features remaining in each downsample, the digit frequencies were re-calculated. Downsampling was performed uniformly at random without replacement. For each measurement size, 50 replicates were run, each with a different permutation of the downsampled features. Results from this experiment can be seen in Fig 4. The number of gene-features used to calculate digit frequencies does not appear to make a difference at n > 500. In the 100 gene-feature trial, both NB and KNN have a drastic drop in performance, while the RF and GBC models remained relatively unaffected down to approximately 40 features. Surprisingly, these top-performing models (GBC and RF) do not drop below 95% accuracy until they have fewer than 20 gene-features.
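A minimal sketch of this downsampling experiment is shown below, reusing the hypothetical `digit_frequencies` and `train_and_score` helpers sketched above; the feature-count grid shown is illustrative, not the exact set of sizes used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsampling_experiment(cna_table, labels,
                            sizes=(17000, 1000, 500, 100, 40, 20, 10), reps=50):
    """Recompute digit-frequency features on random gene subsets and re-run training."""
    results = {}
    for n_genes in sizes:
        scores = []
        for _ in range(reps):
            # Uniform random downsample of gene-features, without replacement.
            genes = rng.choice(cna_table.columns, size=n_genes, replace=False)
            feats = cna_table[genes].apply(digit_frequencies, axis=1)
            scores.append(train_and_score(feats, labels))
        results[n_genes] = scores
    return results
```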

Fig 4. Classification accuracy vs number of features.


The original 17,156 CNA measurements in the imputed dataset were randomly downsampled incrementally from 17,000 to 10 and converted to digit-frequency training and test features for the machine learning models. When 1,000+ measurements are used in the creation of the digit-frequency features, there appears to be little to no effect on mean accuracy. Once the number of features drops below 300, all models begin to lose accuracy rapidly. RF remained above 97.5% accuracy until fewer than 30 measurements were included.

One hesitation for using machine learning with smaller datasets (i.e. fewer gene-features per sample) is the perceived susceptibility to large variation in performance. As noted, these downsampling experiments were performed 50 times, and error bars representing the standard error are shown in Fig 4. We note that even for the smallest datasets, performance does not vary drastically between the 50 trials. In fact, the standard error for small datasets (e.g. 20 or 30 gene-features) is lower than when there were thousands. Thus, we believe that the digit-frequency based models will perform well on both large-scale omics data and smaller ‘targeted’ data acquisition paradigms like multiplexed PCR or MRM proteomics.

Discussion

We present here a proof of concept method for detecting fabrication in biomedical data. Just as has been previously shown in the financial sector, digit frequencies are a powerful data representation when used in combination with machine learning to predict the authenticity of data. Although the data used herein is copy number variation from a cancer cohort, we believe that the Benford-like digit frequency method can be generalized to any tabular numeric data. While multiple methods of fabrication were used, we acknowledge there are more subtle or sophisticated methods. We believe that fraud detection methods, like the models presented herein, could be refined and generalized for broad use in monitoring and oversight.

The model described here is trained to operate specifically on CNA data. However, using digit frequencies as the feature transformation creates the option to train a model on multiple data sources with different numbers of features. Here we used the copy number measurements for 17,156 genes, but since these measurements are transformed into 20 features representing digit frequencies, theoretically, various CNA datasets with any number of measurements could be used for training or testing. Just as Benford demonstrated that diverse, entirely unrelated datasets followed the same distribution of first digits, we are hopeful the same holds true for large biological datasets. However, further research would be needed to determine whether a model trained on digit frequencies of one type of omics data could be generalized and used on another. The generalizability to such situations would likely depend on the digit distributions of the other datasets. One way to circumvent this dataset-specific dependency may be to create statistical tests or use unsupervised clustering algorithms that operate within a single dataset. Moreover, future work on feature selection could potentially simplify the classification further and avoid machine learning.

A logical and exciting next step is to use this model on real published data and search for cases of fraud. There are several challenges standing in the way of doing this quickly and effectively. First is access to data. Not all journals require that data associated with a publication be made accessible, and some journals that do require data accessibility count a statement to the effect of “data will be made available upon request to the authors” as sufficient; we would argue this does not constitute accessible data. Second is the format of the data. Here we used tabular CNA data generated from a large sequencing experiment, but there are numerous complex steps separating the original fastq files from clean tabular CNA data. Third is reproducibility of the data. Unless a study provides the tabulated form of the data or has perfectly reproducible methods for processing the rawest data, it would be difficult to know whether the data being fed into the model is exactly the same as that used in the study’s analysis.

In order to test this method on real data, we attempted to find retracted papers known to have involved fraud. Retraction Watch (www.retractionwatch.com) maintains a large searchable database of retracted papers, which aided in this task. Unfortunately, once retracted, an article and its associated supplemental material are typically no longer available from the journal. We were able to locate some retracted papers in their original form through Sci-Hub, and within a few of these papers we found URLs that were still active and pointed to where the papers’ data was deposited. This, however, presented more challenges in the form of inconsistent formats, incomplete records (data provided for some but not all of the analyses), conversion from PDF format to tables, and an enormous amount of manual curation.

In order for methods like this to be used broadly for data monitoring, all data would need to be truly publicly available, in usable formats, and/or with readily reproducible methods. Even if mass testing and monitoring of data with methods like those presented here were possible at this time, it should not be used as the sole determinant of the trueness or falseness of a dataset; we have shown this method to be very accurate, but not perfect. The possibility of false positives and false negatives still exists.

A consideration in choosing to publish a method like this is the possibility that it could be used for its opposite purpose and aid those attempting to commit fraud by providing a means of evaluating the quality of their data fabrication. If we had built a ready-to-use, easy-to-install-and-run tool for this purpose, we would not publish it publicly. The methods we present here are a proof of concept, not a complete product. Despite being completely open source and transparent, we anticipate it would still require a great deal of time, effort, and talent to repurpose our code for something other than simply reproducing our results. We expect anyone with the required amount of time and talent could instead produce their own real data and research. To those in the future who build upon and further this type of work, we encourage you to also consider whether you should publish it or not.

There is an increasing call for improved oversight and review of scientific data [5, 6, 16, 18], and various regulatory bodies or funding agencies could enforce scientific integrity through the application of these or similar methods. For example, the government bodies charged with evaluating the efficacy of new medicine could employ such techniques to screen large datasets that are submitted as evidence for the approval of new drugs. For fundamental research, publishers could mandate the submission of all data to fraud monitoring. Although journals commonly use software tools to detect plagiarism in the written text, a generalized computational tool focused on data could make data fraud detection equally simple.

Supporting information

S1 Fig. Methods of data fabrication.

(A) The random method of data fabrication identifies the range of observation for a specific locus and then randomly chooses a number in that range. (B) The resampling method chooses values present in the original data. (C) The imputation method iteratively nullifies and then imputes data points from a real sample.

(TIF)

S2 Fig. Training and testing overview.

After creating 50 fake samples using any one of the three methods of fabrication, the 100 real samples and 50 fake samples were randomly split into a train and test set of equal size and proportions (50 real and 25 fake in each set). The training sets were then used to train various machine learning models using 10-fold cross validation. Next, trained models were used to make predictions on the testing data. Predictions were then scored with total accuracy.

(TIF)

S3 Fig. Distribution of first digits.

Distribution of normalized first-digit-after-the-decimal frequencies in 75 real copy-number samples (A) and 50 fake samples generated by the random (B), resampled (C) and imputed (D) methods of fabrication. The x-axis represents each digit in the first position after the decimal place. The y-axis represents the normalized frequency of the digit. Black lines represent the mean and diamonds represent outliers. Similar to a distribution of first digits conforming to Benford’s law, the CNA data also exhibits a long right tail.

(TIF)

S4 Fig. Data relationships in fabricated data.

The correlation between pairs of genes is evaluated to determine whether fabrication methods can replicate inter-gene patterns. Plots on the left-hand side (A, C, E, and G) display data from two correlated genes, PLEKHN1 and HES4, adjacent genes found on 1p36. Plots on the right-hand side (B, D, F, and H) display the genes DFFB and OR4F5, which have a modest Spearman correlation in the real data (0.27). The plots reveal that random and resampled data have little to no correlation between related genes. Imputation produces data with correlation values that are similar to the original data (0.97 and 0.35, respectively).

(TIF)

S1 File. Parameters for models.

Contains the hyperparameters used for all machine learning models depending on the type of data used.

(TXT)

Data Availability

All data used in this paper can be found in GitHub (https://github.com/MSBradshaw/FakeData).

Funding Statement

This work was supported by the National Cancer Institute (NCI) CPTAC award [U24 CA210972] awarded to SP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. https://proteomics.cancer.gov/programs/cptac.

References

  • 1. Burton F. The acquired immunodeficiency syndrome and mosquitoes. Med J Aust. 1989;151: 539–540.
  • 2. Kupferschmidt K. Tide of lies. Science. 2018;361: 636–641. doi: 10.1126/science.361.6403.636
  • 3. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331: 267–270. doi: 10.1136/bmj.331.7511.267
  • 4. Fanelli D. How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS One. 2009;4: e5738. doi: 10.1371/journal.pone.0005738
  • 5. George SL, Buyse M. Data fraud in clinical trials. Clin Investig. 2015;5: 161–173.
  • 6. Yu L, Miao M, Liu W, Zhang B, Zhang P. Scientific Misconduct and Associated Factors: A Survey of Researchers in Three Chinese Tertiary Hospitals. Account Res. 2020. doi: 10.1080/08989621.2020.1809386
  • 7. Blum A, Wang P, Zenklusen JC. SnapShot: TCGA-Analyzed Tumors. Cell. 2018;173: 530. doi: 10.1016/j.cell.2018.03.059
  • 8. TEDDY Study Group. The Environmental Determinants of Diabetes in the Young (TEDDY) study: study design. Pediatr Diabetes. 2007;8: 286–298. doi: 10.1111/j.1399-5448.2007.00269.x
  • 9. Orwoll E, Blank JB, Barrett-Connor E, Cauley J, Cummings S, Ensrud K, et al. Design and baseline characteristics of the osteoporotic fractures in men (MrOS) study—a large observational study of the determinants of fracture in older men. Contemp Clin Trials. 2005;26: 569–585. doi: 10.1016/j.cct.2005.05.006
  • 10. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562: 203–209. doi: 10.1038/s41586-018-0579-z
  • 11. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483: 603–607. doi: 10.1038/nature11003
  • 12. Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2017;171: 1437–1452.e17. doi: 10.1016/j.cell.2017.10.049
  • 13. Caswell J, Gans JD, Generous N, Hudson CM, Merkley E, Johnson C, et al. Defending Our Public Biological Databases as a Global Critical Infrastructure. Front Bioeng Biotechnol. 2019;7: 58. doi: 10.3389/fbioe.2019.00058
  • 14. Bik EM, Casadevall A, Fang FC. The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications. MBio. 2016;7. doi: 10.1128/mBio.00809-16
  • 15. Knepper D, Fenske C, Nadolny P, Bedding A, Gribkova E, Polzer J, et al. Detecting Data Quality Issues in Clinical Trials: Current Practices and Recommendations. Ther Innov Regul Sci. 2016;50: 15–21. doi: 10.1177/2168479015620248
  • 16. Baigent C, Harrell FE, Buyse M, Emberson JR, Altman DG. Ensuring trial validity by data quality assurance and diversification of monitoring methods. Clin Trials. 2008;5: 49–55. doi: 10.1177/1740774507087554
  • 17. Morrison BW, Cochran CJ, White JG, Harley J, Kleppinger CF, Liu A, et al. Monitoring the quality of conduct of clinical trials: a survey of current practices. Clin Trials. 2011;8: 342–349. doi: 10.1177/1740774511402703
  • 18. Calis KA, Archdeacon P, Bain R, DeMets D, Donohue M, Elzarrad MK, et al. Recommendations for data monitoring committees from the Clinical Trials Transformation Initiative. Clin Trials. 2017;14: 342–348. doi: 10.1177/1740774517707743
  • 19. Benford F. The Law of Anomalous Numbers. Proc Am Philos Soc. 1938;78: 551–572.
  • 20. Cerioli A, Barabesi L, Cerasa A, Menegatti M, Perrotta D. Newcomb-Benford law and the detection of frauds in international trade. Proc Natl Acad Sci U S A. 2019;116: 106–115. doi: 10.1073/pnas.1806617115
  • 21. Badal-Valero E, Alvarez-Jareño JA, Pavía JM. Combining Benford’s Law and machine learning to detect money laundering. An actual Spanish court case. Forensic Sci Int. 2018;282: 24–34. doi: 10.1016/j.forsciint.2017.11.008
  • 22. George SL. Research misconduct and data fraud in clinical trials: prevalence and causal factors. Int J Clin Oncol. 2016;21: 15–21. doi: 10.1007/s10147-015-0887-3
  • 23. Lindgren CM, Adams DW, Kimball B, Boekweg H, Tayler S, Pugh SL, et al. Simplified and Unified Access to Cancer Proteogenomic Data. J Proteome Res. 2021;20: 1902–1910. doi: 10.1021/acs.jproteome.0c00919
  • 24. Dou Y, Kawaler EA, Cui Zhou D, Gritsenko MA, Huang C, Blumenberg L, et al. Proteogenomic Characterization of Endometrial Carcinoma. Cell. 2020;180: 729–748.e26. doi: 10.1016/j.cell.2020.01.026
  • 25. Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28: 112–118. doi: 10.1093/bioinformatics/btr597
  • 26. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830. https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf
  • 27. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38: 367–378.
  • 28. Zhang H. The Optimality of Naive Bayes. Proceedings of the 17th International FLAIRS Conference; 2004. https://www.aaai.org/Library/FLAIRS/2004/flairs04-097.php
  • 29. Breiman L. Random Forests. Mach Learn. 2001;45: 5–32. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
  • 30. Bentley JL. Multidimensional binary search trees used for associative searching. Commun ACM. 1975;18: 509–517.
  • 31. Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Comput. 2000;12: 1207–1245. doi: 10.1162/089976600300015565
  • 32. Sasaki Y. The truth of the F-measure. 2007.
  • 33. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4: 1686.
  • 34. McKinney W. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference. SciPy; 2010.
  • 35. Caggiano A, Angelone R, Napolitano F, Nele L, Teti R. Dimensionality Reduction of Sensorial Features by Principal Component Analysis for ANN Machine Learning in Tool Condition Monitoring of CFRP Drilling. Procedia CIRP. 2018;78: 307–312.

Decision Letter 0

Frederique Lisacek

19 Feb 2021

PONE-D-20-32745

Detecting fabrication in large-scale molecular omics data

PLOS ONE

Dear Dr. Bradshaw,

Thank you for submitting your manuscript to PLOS ONE and apologies for the extended reviewing time. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The worth of your approach is unquestionable; yet several aspects of your manuscript lack depth and maturity, as established by both reviewers. For example, the applicability of the method and its limitations are not addressed.

Another example is the introduction of Benford's law, which raised many questions in the review process.

Please submit your revised manuscript by April 12, 2021. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Frederique Lisacek

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections does not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present an interesting approach to detect falsification in big datasets by means of Machine Learning and a well-known feature, Benford digit preferences.

- In the "Methods" section, authors should make it explicit (or much clearer) that they are comparing TWO approaches based on ML: the first one using the actual copy-number values (not really raw data) as inputs and the second using extracted features (digit frequency)as inputs (not doing it clearly may induce the readers to confusion when reading the corresponding results section).

- In the ML Training section, the authors mention 6 ML methods they have chosen to evaluate but they don't offer any insight on the reasons behind such a choice: Why did they select those 6 ML methods? Were they already known for performing well for this kind of data/problem? Were they expected to perform better than other methods? Were they selected to represent a diverse-enough palette of ML methods?

- In addition to this, the authors should provide a short description and some literature pointers about these methods as they may allow the reader to better understand what these methods do (if not how). E.g., saying that Random Forest is an ensemble decision-tree based method, or that KNN is based on proximity/similarity in the input space and does not actually perform any "learning"... and so on for GBD, NB, MLP, and SVM.

- In the Benford-like Digit Preferences section, authors should mention that this operation is a relatively simple “feature extraction” operation (I mean, it's not trivial, nor logic, but simple). So, it is this well-informed feature extraction which allows ML models to improve their predictive performance. In addition, they should assess how far the sole use of that feature reduces the prediction problem to a simple classification problem where ML is not really necessary. Would a much simpler method produce similar results?

- In the ML with quantitative data section, the authors mention that they "evaluated the model on simple accuracy". I don't think that using simple accuracy is very informative. In the context of fake detection it should be important to assess how many false negatives and false positives are detected. I would propose the use of F1 metrics as much more informative (thus adequate) than accuracy. In addition, it should be expected that “real-world” data would have a very different distribution of fake-real samples making accuracy even less adequate (or predictive).

- Authors say that SVM and MLP performed poorly. It is a bit surprising that these 2 methods had exhibited such a poor performance, but authors don't elaborate more on this: have they tried to investigate why? Could it be due to poor configuration efforts? I feel it was too "easy" to simply exclude them from further analysis.

- What's the meaning of the red asterisks present in Figures 2 and 3? It was not possible for me to figure it out. Such kind of unexplained information may be perturbing for the readers.

- In the ML with limited data section, authors mention they downsampled data, but they don't mention for which kind of fake-generation method was that done. Although it seems it was done for resampling. Stating this is very short and would facilitate the comprehension of the experiment performed.

- The Discussion section is far too short and doesn't explore the possible implications of the proposed work nor the potential limits and reaches of the method.

Among the potential issues that should be pertinent and I would expect to be discussed are the following:

1. As mentioned before, I would expect authors to report and analyse the figures concerning the False positive and False Negatives. Even at such high accuracy values, it would be good to know if the methods under evaluation would more easily miss fake data or produce false positives.

2. In the same sense it would be interesting to determine how the methods perform when the amount of falsified data is different (either higher or lower). What happens if the distribution of falsified data in the test set is (drastically) different than the one from the training set?

Finally, authors could also (optionally) consider discussing the risk that their method could be used as "predictor" in an "adversarial attack" approach allowing to create fake data which should be detected as valid by this detector.

Reviewer #2: ### General comments

The article presents an evaluation of different machine-learning approaches to detect fraud data, and evaluates their performances on artificial fraud data generated according to three different models: random number generation, resampling and data imputation.

The article addresses an important problem for life sciences, but in its current state the evaluation suffers from several weaknesses that should be handled before publication.

In particular:

1. The three models used to generate fake data are not justified in a convincing way. Is there any reason to believe that they correspond to actual frauds? If so, examples should be provided. If not, the relevance of the evaluation is questionable. Would it be possible to apply the method on actual fraud data that has been published, detected (and supposedly retracted)?

2. Normally, a comparative evaluation of supervised classification methods requires tuning the parameters of each of them, which was not done here. In R, you can fine-tune all the classical supervised classification methods (I guess similar methods exist for other languages like Python). I strongly recommend using them, identifying the optimal parameters for each method, and redoing the whole performance analysis. The comparison is worthless without this.

3. The main approach defended in the manuscript is to replace the actual measurements (real and fake data) by the first two decimal digits. The idea relies on Frank Benford's law, according to which the frequency distribution of leading digits from real-life sets of numerical data does not follow a uniform distribution, contrary to what might be expected.

This law is invoked like a magical trick in this context: the manuscript does not provide any explanation of the reasons for this law, and it does not indicate why it would apply to the CNV data analysed here. This should be clarified. For example, it is known that one situation in which Benford's law works is for long right-tailed distributions (which is for example the case of gene expression data). The article should at least provide a histogram of the distribution of the real values and discuss its adequacy to Benford's law.

Besides, if this is the main idea, the actual distribution of leading digits should be displayed on some figure, for the real and fake data.

4. There is no indication about the usability of the method in real life conditions.

How could the ML programs be trained for a real dataset? Would you recommend generating specific fake data for each one? What about the generalization power of the approach?

What would the method give if it were applied to a large collection of actual published data? Would some of these data sets be qualified as fraud?

In summary, I think that the paper addresses an important issue in data science (with applications to life sciences), but in its current state it is not convincing, because of methodological weaknesses in the evaluation of performances, and because there is no indication of the relevance of the models used to generate fake data. I however think these limitations could be addressed in a revised version of the manuscript.

### Specific comments

Line 31. "When asked if their colleagues had fabricated data, positive response rates rose to 14-19%"

This question is imprecise and thus the answer impossible to interpret. Does it mean that 14-19% of the researchers personally know colleagues who fabricated data, or that they are aware of published articles where data fabrication was demonstrated (and the articles thus retracted), or that they have a general awareness of the fact that data fabrication happens?

Line 57. "Frank Benford observed in a compilation of 20,000 numbers that the first digit did not follow a uniform distribution as one may anticipate"

It would be useful to explain the reason for this surprising behavior, especially since it is the basis of one of your fraud detection methods.

Line 65. Section "Methods"

The computing environment should be described, in particular the language and libraries used for the analysis. I guess all this could be found on the github repository, but we have no guarantee on the long-term sustainability of a github repository, so the minimal information should be provided in the Methods section, as recommended for scientific publications.

Line 81. "Three different methods of varying sophistication are used for fabrication: random number generation, resampling with replacement and imputation"

Is there any example of actual frauds (demonstrated) that use this kind of data number generation? If yes, citations should be provided. If not, it questions the practical relevance of the evaluation.

Line 86, section Real Data. This section should describe the dimensions of the real data set (number of features). The info comes below, but it is expected to be found here.

Has the real data been published? If so, could you provide the reference of the publication, the data repository and the accession number? Could you also provide the URL of the CPTAC portal mentioned in this section?

Line 103. "Then we iteratively nullified 10% of the data and imputed these NAs with missForrest until every value has been imputed"

What is the principle of this method? Do you impute the values based on the neighboring cells in the rows (samples), columns (features), both? This matters since the imputation should reflect the likely method used by people who generate fraud data. Moreover, the way the imputation is done is likely to affect the machine-learning performances.

L122, section "Machine learning training". It would be good to compute the performance

L96, "For every gene locus, we first find the maximum and minimum values observed in the original data. A new sample is then fabricated by randomly picking a value within this gene specific range"

and further L158. "the random data clusters far from the real data"

Do you mean you used a uniform distribution to generate random numbers? If so, it is not surprising that these fake samples cluster far away from the real data and other fake data. Why did you use such a model rather than some random number model closer to the data? For example, a multivariate normal model whose parameters (correlation matrix) have been estimated on the real data. This would be a much more relevant way to generate realistic random numbers.
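For concreteness, a minimal Python sketch of the multivariate-normal alternative the reviewer proposes here (this is not the paper's method; the function name and toy matrix are hypothetical, and with far more genes than samples the estimated covariance would be singular and need regularisation):

```python
import numpy as np

def mvn_fake_samples(real, n_fake, seed=0):
    """Reviewer's suggested alternative: fit a multivariate normal to the real
    matrix (per-gene means + gene-gene covariance) and draw fake profiles from it."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)              # genes as variables
    return rng.multivariate_normal(mean, cov, size=n_fake)

# Toy usage with a small stand-in matrix (samples x genes), not the CPTAC data
real = np.random.default_rng(1).normal(size=(75, 30))
fakes = mvn_fake_samples(real, n_fake=25)
```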

L182. The abbreviations are missing for several methods (NB, RF), whereas they are used in the text and figures.

L196. The theoretical baseline accuracy is 66% according to the training/testing class sizes. It would be worth checking empirically the untrained performances of the different ML methods, by computing the accuracy in an "untrained" mode, i.e. by randomly permuting the training and testing labels. In principle this should return accuracies of ~66%, but there are sometimes tricky issues, so it is worth testing it.

It would also be useful to plot the baseline + untrained performances on the accuracy box plots.

L182. The parameters used for each ML method should be provided (either here or in the Material and Methods section).

L190. The accuracy is not a sufficient parameter to evaluate the performances of a 2-group classifier aiming at detecting one particular case (declare as "positives" the fake samples). For each method, you should compute the sensitivity and false positive rate. The results of the different methods could be displayed on a classical Sn / FPR plot (in addition, if you tune some quantitative parameters you could draw a ROC curve).

L 194. "SVM and MLP performed poorly compared to other classification methods". I suspect this comes from the fact that you let all the methods run with their default parameters. In particular, SVM results vary hugely depending on the choice of the kernel, and the optimal kernel is a case-by-case affair, so you should absolutely test the performance of the different kernels (linear, radial, polynomial, sigmoid).

Actually, a comparative evaluation requires to tune the parameters of each ML method, which was not done here. In R, you can fine tune methods for all the classical supervised classification methods. I strongly recommend to use them and redo the whole performance analysis.

L225. "One challenge for machine learning in our data is that the number of features (~17,000) far exceeds the number of samples (75). We therefore explored ways to reduce or transform the feature set, and also to make the feature set more general and broadly applicable."

This is a very strange motivation for using a digit preference approach, which looks a bit like a magical trick in this context.

If the goal is to reduce the over-dimensionality of the feature space, a first and obvious option would have been to train the classifiers on the first components (this is a very classical approach). Another possibility would be to test any classical method for feature selection.

L225. Why are there 17,000 features in the original dataset? There are ~50,000 genes in the current annotations of the Human genome.

L232 "the decimal of each gene expression value".

Are we speaking of CNV or transcriptome ?

L245. "Converting all measured variables to digit frequencies circumvents this problem. For instance, if you had a data set of CNA and transcriptomic data a machine learning model could not train and test on both of these. "

I don't see any reason for this claim. If both CNV and expression data are real numbers (which is the case) they can perfectly well be combined in a feature matrix to feed the ML methods. The fact that their ranges and distributions would differ might pose a problem for some methods, but most if not all of the methods you used are equipped to handle variables with different data ranges. And in any case you did not check whether or not the data distributions would fit the assumptions underlying the different methods.

"The features in these datasets would differ in the number of features and what these features represent. "

This makes no sense. If you combine CNV and expression data for the same samples, the number of features (genes) should in principle be the same for the two datasets, and they should thus be balanced (and in any case, this would not even be a prerequisite for combining them). In addition, all the ML methods are classically used to analyse features representing different things (e.g. size, weight, fat content, protein content, ...); this is the essence of multivariate analysis.

L254. "over the 50 trails" Did you mean "trials" ?

L283. "Surprisingly, these top performing models (GBC and Random Forest) do not drop below 95% accuracy until they have less than 20 gene-features."

Why is this surprising? This simply reflects the fact that the fake data is simple to detect, which may come from the way they are generated.

What I find surprising here is that you can learn something from the distribution of the two first digits (i.e. 10 x 10 numbers) computed from only 20 features. This means that each pair of digits is expected to be found 0.2 times in the data. I am thus very skeptical about this result, and I suspect there is a trick somewhere. I would suggest you check how the distribution of digits evolves (separately for the real data and for each fake data set) as you reduce the dimension of the feature space, and see if there is not some bias.

L281. "In the 100 gene-feature trial, both Naive Bayes and KNN have a significant drop in performance"

This drop of performance in Figure 4 may be a visual artifact resulting from the arbitrary numbers of features chosen for your analysis: you increase the number of features by steps of 10 until 100, then you jump from 100 to 500, then to 1000, 2000, and then you increase by steps of 2000. If the goal is to display the impact of N over both the small and large ranges, you should rather use an XY plot with a logarithmic X axis. Also, it would be worth exploring the region between 100 and 500, since this is the place where you claim to observe a drop. I would recommend adding measurements of the performance with 200, 300, and 400 features, respectively.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Jacques van Helden (ORCID 0000-0002-8799-8584)

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Nov 30;16(11):e0260395. doi: 10.1371/journal.pone.0260395.r002

Author response to Decision Letter 0


30 Apr 2021

A copy of this information was also uploaded as a Word Doc with our responses highlighted for ease of understanding

Reviewer #1: The authors present an interesting approach to detect falsification in big datasets by means of Machine Learning and a well-known feature, Benford digit preferences.

1 - In the "Methods" section, authors should make it explicit (or much clearer) that they are comparing TWO approaches based on ML: the first one using the actual copy-number values (not really raw data) as inputs and the second using extracted features (digit frequency) as inputs (not doing so clearly may confuse readers when reading the corresponding results section).

Response: Added to end of machine learning methods

2 - In the MLTraining section, the authors mention 6 ML methods they have chosen to evaluate but they don't offer any insight on the reasons behind such a choice: Why did they select those 6 ML methods? Were they already known for performing well for this kind of data/problem? Were they expected to perform better than other methods? Were they selected to represent a diverse-enough palette of ML methods?

Response: An explanation has been added to the machine learning section explaining why these 6 were chosen

3 - In addition to this, the authors should provide a short description and some literature pointers about these methods as they may allow the reader to better understand what these methods do (if not how). E.g., saying that Random Forest is an ensemble decision-tree based method, or that KNN is based on proximity/similarity in the input space and does not actually perform any "learning"... and so on for GBD, NB, MLP, and SVM.

Response: I have added a one sentence description of each algorithm and a reference so readers can learn more if they choose.

4 - In the Benford-like Digit Preferences section, authors should mention that this operation is a relatively simple “feature extraction” operation (I mean, it's not trivial, nor obvious, but simple). So, it is this well-informed feature extraction which allows ML models to improve their predictive performance. In addition, they should assess how far the sole use of that feature reduces the prediction problem to a simple classification problem where ML is not really necessary. Would a much simpler method produce similar results?

Response: It is possible that other methods of data/feature extraction could produce a similar result. However, the exploration of alternative methods is outside of the scope of our manuscript.

5 - In the ML with quantitative data section, the authors mention that they "evaluated the model on simple accuracy". I don't think that using simple accuracy is very informative. In the context of fake detection it should be important to assess how many false negatives and false positives are detected. I would propose the use of F1 metrics as much more informative (thus adequate) than accuracy. In addition, it should be expected that “real-world” data would have a very different distribution of fake-real samples making accuracy even less adequate (or predictive).

Response: Figures 2 and 3 have been updated to include accuracy and F1 scores. As a side note, the F1 score plots we generated appear extremely similar to the accuracy ones, but if you look closely they are not. All the F1 scores in the digit-preference section are very close to 1.0. This was a good recommendation, thanks.
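As a minimal illustration of the added metric (toy labels only, not the paper's results), the F1 score with the fabricated class treated as positive can be reported alongside accuracy using scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 0 = real sample, 1 = fabricated sample (placeholder values)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

print("accuracy:", accuracy_score(y_true, y_pred))                      # 0.75
print("F1 (fake = positive):", f1_score(y_true, y_pred, pos_label=1))   # 0.75
```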

6 - Authors say that SVM and MLP performed poorly. It is a bit surprising that these 2 methods exhibited such poor performance, but authors don't elaborate more on this: have they tried to investigate why? Could it be due to poor configuration efforts? I feel it was too "easy" to simply exclude them from further analysis.

Response: In the initial study, hyperparameter optimization was not performed. We have added grid-search parameter optimization for each model and the “optimal” sets were used for all final test set results. All results have been updated to reflect these changes. SVM was also optimized, is included in all analyses, and performs comparably to all other models. The optimization of the MLP is a near-infinite search space, and since we have five other models that work extremely well we do not see the need to spend the time and resources optimizing the MLP; we have removed the MLP from the paper entirely.

7 - What's the meaning of the red asterisks present in Figures 2 and 3? It was not possible for me to figure it out. Such unexplained information may be perturbing for readers.

Response: The red asterisks represent outliers in the boxplots. An explanation of this has been added to the captions for Figures 2 and 3.

8 - In the ML with limited data section, authors mention they downsampled data, but they don't mention for which kind of fake-generation method that was done, although it seems it was done for resampling. Stating this would take very little space and would facilitate the comprehension of the experiment performed.

Response: It was done with the imputed data; a sentence has been added explaining this.

9 - The Discussion section is far too short and doesn't explore the possible implications of the proposed work nor the potential limits and reaches of the method.

Among the potential issues that should be pertinent and I would expect to be discussed are the following:

10- 1. As mentioned before, I would expect authors to report and analyse the figures concerning the False positive and False Negatives. Even at such high accuracy values, it would be good to know if the methods under evaluation would more easily miss fake data or produce false positives.

Response: This has been addressed, see previous comments

11 - 2. In the same sense it would be interesting to determine how the methods perform when the amount of falsified data is different (either higher or lower). What happens if the distribution of falsified data in the test set is (drastically) different than the one from the training set?

Response: This has been addressed in the discussion section

12 - Finally, authors could also (optionally) consider discussing the risk that their method could be used as a "predictor" in an "adversarial attack" approach, allowing one to create fake data that would be detected as valid by this detector.

Response: we have added this to the discussion.

Reviewer #2: ### General comments

13 - The article presents an evaluation of different machine-learning approaches to detect fraud data, and evaluates their performances on artificial fraud data generated according to three different models: random number generation, resampling and data imputation.

The article addresses an important problem for life sciences, but in its current state the evaluation suffers from several weaknesses that should be handled before publication.

In particular:

14 - 1. The three models used to generate fake data are not justified in a convincing way. Is there any reason to believe that they correspond to actual frauds? If so, examples should be provided. If not, the relevance of the evaluation is questionable. Would it be possible to apply the method to actual fraud data that has been published, detected (and supposedly retracted)?

Response: Finding an actual fraud dataset would be ideal. Prior to submitting this article I spent a good deal of time trying to find datasets from papers listed in RetractionWatch. Unfortunately, since these articles have been taken down already, finding links to where their data was deposited is difficult or impossible.

15 - 2. Normally, a comparative evaluation of supervised classification methods requires tuning the parameters of each of them, which was not done here. In R, you can fine-tune all the classical supervised classification methods (I guess similar tools exist for other languages like Python). I strongly recommend using them, identifying the optimal parameters for each method, and redoing the whole performance analysis. The comparison is worthless without this.

Response: We added parameter optimization for each model using GridSearch from sklearn in Python. All results have been rerun and reported with their optimized models, and we have added additional detail to the methods section reflecting this.
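A hedged sketch of what such tuning can look like with scikit-learn's GridSearchCV; the grid and the toy data below are illustrative only (the authors' final parameters are in their supplementary file):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the feature matrix and real/fake labels, not the CPTAC data
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 50))
y = rng.integers(0, 2, size=150)

# Illustrative grid; SVM kernel choice was one of the points raised in review
param_grid = {"kernel": ["linear", "rbf", "poly", "sigmoid"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```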

16 - 3. The main approach defended in the manuscript is to replace the actual measurements (real and fake data) by the first two decimal digits. The idea relies on Frank Benford's law, according to which the frequency distribution of leading digits from real-life sets of numerical data does not follow a uniform distribution, contrary to what might be expected.

This law is invoked like a magical trick in this context: the manuscript does not provide any explanation of the reasons for this law, and it does not indicate why it would apply to the CNV data analysed here. This should be clarified. For example, it is known that one situation in which Benford's law works is for long right-tailed distributions (which is for example the case of gene expression data). The article should at least provide a histogram of the distribution of the real values and discuss its adequacy to Benford's law.

Besides, if this is the main idea, the actual distribution of leading digits should be displayed on some figure, for the real and fake data.

Response: A supplemental figure of the distribution of digit frequencies in the real and fake data has been added, along with an explanation of how the CNA data is similar to a distribution following Benford’s law.
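To make the idea concrete, here is a small sketch of the kind of comparison such a figure can show: observed leading-digit frequencies against Benford's expectation P(d) = log10(1 + 1/d). The lognormal values below are a long right-tailed stand-in, not the CNA data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.5, size=20000)   # stand-in with a long right tail

lead = (values / 10 ** np.floor(np.log10(values))).astype(int)   # leading significant digit
observed = np.bincount(lead, minlength=10)[1:10] / len(values)
benford = np.log10(1 + 1 / np.arange(1, 10))

digits = np.arange(1, 10)
plt.bar(digits, observed, label="observed")
plt.plot(digits, benford, "ro-", label="Benford expectation")
plt.xlabel("leading digit")
plt.ylabel("frequency")
plt.legend()
plt.show()
```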

17 - 4. There is no indication about the usability of the method in real life conditions.

How could the ML programs be trained for a real dataset? Would you recommend generating specific fake data for each one? What about the generalization power of the approach?

Response: We have addressed this in new paragraphs added to the discussion section

What would the method give if it were applied to a large collection of actual published data? Would some of these data sets be qualified as fraud?

Response: This is now addressed in the discussion

In summary, I think that the paper addresses an important issue in data science (with applications to life sciences), but in its current state it is not convincing, because of methodological weaknesses in the evaluation of performances, and because there is no indication of the relevance of the models used to generate fake data. I however think these limitations could be addressed in a revised version of the manuscript.

### Specific comments

18 - Line 31. "When asked if their colleagues had fabricated data, positive response rates rose to 14-19%"

This question is imprecise and thus the answer impossible to interpret. Does it mean that 14-19% of the researchers personally know colleagues who fabricated data, or that they are aware of published articles where data fabrication was demonstrated (and the articles thus retracted), or that they have a general awareness of the fact that data fabrication happens?

Response: I have added clarifying detail; the question in the survey was geared towards personally knowing of a colleague that fabricated data.

19 - Line 57. "Frank Benford observed in a compilation of 20,000 numbers that the first digit did not follow a uniform distribution as one may anticipate"

It would be useful to explain the reason for this surprising behavior, especially since it is the basis of one of your fraud detection methods.

Response: an explanation has been added to the introduction

20 - Line 65. Section "Methods"

The computing environment should be described, in particular the language and libraries used for the analysis. I guess all this could be found on the github repository, but we have no guarantee on the long-term sustainability of a github repository, so the minimal information should be provided in the Methods section, as recommended for scientific publications.

Response: we have created a computing environment section of the paper and added a supplemental file with the parameters used for each ML model.

21 - Line 81. "Three different methods of varying sophistication are used for fabrication: random number generation, resampling with replacement and imputation"

Is there any example of actual frauds (demonstrated) that use this kind of data number generation? If yes, citations should be provided. If not, it questions the practical relevance of the evaluation.

Response: Finding an actual fraud dataset would be ideal. Prior to submitting this article we spent a good deal of time trying to find datasets from papers listed in RetractionWatch. Unfortunately, since these articles have been taken down already, finding links to where their data was deposited is difficult or impossible. We have added a discussion of this and the challenges to the manuscript.

22 - Line 86, section Real Data. This section should describe the dimensions of the real data set (number of features). The info comes below, but it is expected to be found here.

Has the real data been published? If so, could you provide the reference of the publication, the data repository and the accession number? Could you also provide the URL of the CPTAC portal mentioned in this section?

Response: Citations for CPTAC have been added. All data used in our analyses also exists in our github repo.

23 - Line 103. "Then we iteratively nullified 10% of the data and imputed these NAs with missForrest until every value has been imputed"

What is the principle of this method? Do you impute the values based on the neighboring cells in the rows (samples), columns (features), both? This matters since the imputation should reflect the likely method used by people who generate fraud data. Moreover, the way the imputation is done is likely to affect the machine-learning performances.

Response: The imputation is done based on neighboring samples (rows). The order of features (columns) is not necessarily meaningful. Additional information on how the imputation was done can be found in the paper describing the tool we use, the missForest R package. Link: https://doi.org/10.1093/bioinformatics/btr597
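For readers who want a feel for the procedure, a rough Python analogue of the described loop is sketched below. The paper itself uses the missForest R package; sklearn's IterativeImputer with a random-forest estimator is only a stand-in, and the toy matrix is not the real data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
data = rng.normal(size=(75, 30))        # toy stand-in matrix (samples x genes)
fake = data.copy()

# Visit every cell once, in ten rounds of ~10% each: blank the cells out,
# impute them from the remaining values, and keep the imputed values.
cells = np.argwhere(np.ones_like(fake, dtype=bool))
rng.shuffle(cells)
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10), max_iter=2)
for block in np.array_split(cells, 10):
    masked = fake.copy()
    masked[block[:, 0], block[:, 1]] = np.nan
    imputed = imputer.fit_transform(masked)
    fake[block[:, 0], block[:, 1]] = imputed[block[:, 0], block[:, 1]]
# 'fake' now consists entirely of imputed values
```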

23 - L122, section "Machine learning training". It would be good to compute the performance

L96, "For every gene locus, we first find the maximum and minimum values observed in the original data. A new sample is then fabricated by randomly picking a value within this gene specific range"

and further L158. "the random data clusters far from the real data"

Do you mean you used a uniform distribution to generate random numbers? If so, it is not surprising that these fake samples cluster far away from the real data and other fake data. Why did you use such a model rather than some random number model closer to the data? For example, a multivariate normal model whose parameters (correlation matrix) have been estimated on the real data. This would be a much more relevant way to generate realistic random numbers.

Response: Yes, this is a simple method and we intended it as such. Our goal with the three methods of fake data generation we selected was to create fake data with varying degrees of sophistication. Our random number generation was intended to be the easiest method to detect. The changes you propose here are similar to how our third and most sophisticated method, imputation, functions.
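A minimal sketch of the quoted random-number method (per-gene uniform draws between the observed minimum and maximum); the function name and toy matrix are ours:

```python
import numpy as np

def uniform_fake_samples(real, n_fake, seed=0):
    """For each gene, draw values uniformly between that gene's observed
    minimum and maximum in the real data (the simplest fabrication method described)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(real.min(axis=0), real.max(axis=0), size=(n_fake, real.shape[1]))

real = np.random.default_rng(1).normal(size=(75, 100))   # toy stand-in (samples x genes)
fakes = uniform_fake_samples(real, n_fake=25)             # 25 fabricated samples
```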

24 - L182. The abbreviations are missing for several methods (NB, RF), whereas they are used in the text and figures.

Response: abbreviations have been added where the terms are first used

25 - L196. The theoretical baseline accuracy is 66% according to the training/testing class sizes. It would be worth checking empirically the untrained performances of the different ML methods, by computing the accuracy in an "untrained" mode, i.e. by randomly permuting the training and testing labels. In principle this should return accuracies of ~66%, but there are sometimes tricky issues, so it is worth testing it.

It would also be useful to plot the baseline + untrained performances on the accuracy box plots.
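A minimal sketch of the untrained (permuted-label) check the reviewer describes, using a toy 2:1 class split so the expected baseline is roughly 66%; none of the names or numbers here come from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(225, 50))                 # toy features with no real signal
y = np.array([0] * 75 + [1] * 150)             # 1:2 real-to-fake split -> ~66% baseline

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
permuted = rng.permutation(y_tr)               # destroy any label-feature association
acc = RandomForestClassifier(random_state=0).fit(X_tr, permuted).score(X_te, y_te)
print(f"permuted-label accuracy: {acc:.2f}")   # expect roughly the class-proportion baseline
```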

26 - L182. The parameters used for each ML method should be provided (either here or in the Material and Methods section).

Response: The full list of parameters for all the models is long. Rather than include it in the text we have prepared a supplementary file (supplementary_file_1.txt) that will accompany the manuscript.

27 - L190. The accuracy is not a sufficient parameter to evaluate the performances of a 2-group classifier aiming at detecting one particular case (declare as "positives" the fake samples). For each method, you should compute the sensitivity and false positive rate. The results of the different methods could be displayed on a classical Sn / FPR plot (in addition, if you tune some quantitative parameters you could draw a ROC curve).

Response: Reviewers 1 and 2 both pointed out this weakness of our analysis but proposed slightly different figure additions to address it. Per the recommendation of reviewer 1, we have included plots of F1 scores adjacent to the accuracy plots. This addresses the same concern that accuracy alone does not measure or report false positives or false negatives. As a side note, the F1 score plots we generated appear extremely similar to the accuracy ones, but if you look closely they are not. All the F1 scores in the digit-preference section are very close to 1.0.

28 - L 194. "SVM and MLP performed poorly compared to other classification methods". I suspect this comes from the fact that you let all the methods run with their default parameters. In particular, SVM results vary hugely depending on the choice of the kernel, and the optimal kernel is a case-by-case affair, so you should absolutely test the performance of the different kernels (linear, radial, polynomial, sigmoid).

Actually, a comparative evaluation requires to tune the parameters of each ML method, which was not done here. In R, you can fine tune methods for all the classical supervised classification methods. I strongly recommend to use them and redo the whole performance analysis.

Response: We added parameter optimization and all performance evaluations have been re-run. Once optimized, SVM performed similarly to the other models. The optimization of the MLP is a near-infinite search space, and since we have five other models that work extremely well we do not see the need to spend the time and resources optimizing the MLP; we have removed it from the paper entirely.

29 - L225. "One challenge for machine learning in our data is that the number of features (~17,000) far exceeds the number of samples (75). We therefore explored ways to reduce or transform the feature set, and also to make the feature set more general and broadly applicable."

This is a very strange motivation for using a digit preference approach, which looks a bit like a magical trick in this context.

If the goal is to reduce the over-dimensionality of the feature space, a first and obvious option would have been to train the classifiers on the first components (this is a very classical approach). Another possibility would be to test any classical method for feature selection.

Response: Feature reduction techniques are not uncommon in ML; for example, principal component analysis is used to reduce high-dimensionality data. An explanation of this and an additional citation have been added.
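For comparison, a sketch of the reviewer's suggested alternative (training on the first principal components rather than digit frequencies); this is illustrative only and not the paper's pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 500))        # toy stand-in for a wide omics matrix
y = rng.integers(0, 2, size=150)       # toy real/fake labels

pca_clf = make_pipeline(PCA(n_components=20), RandomForestClassifier(n_estimators=100))
print(cross_val_score(pca_clf, X, y, cv=5, scoring="f1").mean())
```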

30 - L225. Why are there 17,000 features in the original dataset? There are ~50,000 genes in the current annotations of the Human genome.

Response: this is the data from the CPTAC dataset. See their publications for a further explanation [PMID: 33560848, PMID: 32059776]

31 - L232 "the decimal of each gene expression value".

Are we speaking of CNV or transcriptome ?

Response: Agreed, this statement was unclear / incorrect. Updated to “gene copy number value”

32 - L245. "Converting all measured variables to digit frequencies circumvents this problem. For instance, if you had a data set of CNA and transcriptomic data a machine learning model could not train and test on both of these. "

I don't see any reason for this claim. If both CNV and expression data are real numbers (which is the case) they can perfectly well be combined in a feature matrix to feed the ML methods. The fact that their ranges and distributions would differ might pose a problem for some methods, but most if not all of the methods you used are equipped to handle variables with different data ranges. And in any case you did not check whether or not the data distributions would fit the assumptions underlying the different methods.

"The features in these datasets would differ in the number of features and what these features represent. "

This makes no sense. If you combine CNV and expression data for the same samples, the number of features (genes) should in principle be the same for the two datasets, and they should thus be balanced (and in any case, this would not even be a prerequisite for combining them). In addition, all the ML methods are classically used to analyse features representing different things (e.g. size, weight, fat content, protein content, ...); this is the essence of multivariate analysis.

Response: If the number of genes (features) in two datasets is not the same (as would likely happen if they were from different experiments), you could not train on one and then the other, because the input data shapes would not match. For example, we have ~17,000 features in our CNA data; if we trained a model initially on this dataset, we could not then test or train on a dataset with 10,000 features; the models cannot handle this. However, if you transform and reduce the ~17,000 features into the 20 digit-preference features, and then do the same for the 10,000, you can train on one and test or train further with the same model. This is just a question of the shape of the input data.

There is already an explanation of this in our manuscript: “Thus for each sample the features are converted from 17,156 copy number alterations to 20 digit preferences. Using this approach, whether a sample has 100 or 17,156 features it can still be trained on and classified by the same model."
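A small sketch of the transformation being described (names and details are ours, not necessarily the repository's code): each sample's values are reduced to the frequencies of the digits 0-9 at the first and second decimal places, i.e. 20 features regardless of how many genes were measured.

```python
import numpy as np

def digit_preference_features(sample, n_places=2):
    """Frequencies of digits 0-9 at the first n_places decimal positions.
    Digits are taken by truncation (string slicing avoids float artifacts)."""
    counts = np.zeros((n_places, 10))
    for value in sample:
        # fixed-point text with two spare places, then truncate to n_places digits
        frac = f"{abs(float(value)):.{n_places + 2}f}".split(".")[1][:n_places]
        for pos, digit in enumerate(frac):
            counts[pos, int(digit)] += 1
    return (counts / len(sample)).ravel()      # shape (10 * n_places,), i.e. 20 features

sample = np.random.default_rng(0).normal(size=17156)   # toy stand-in for one CNA profile
features = digit_preference_features(sample)            # 20 values; each block of 10 sums to 1
```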

33 - L254. "over the 50 trails" Did you mean "trials" ?

Response: Yes, fixed.

34 - L283. "Surprisingly, these top performing models (GBC and Random Forest) do not drop below 95% accuracy until they have less than 20 gene-features."

Why is this surprising? This simply reflects the fact that the fake data is simple to detect, which may come from the way they are generated.

What I find surprising here is that you can learn something from the distribution of the two first digits (i.e. 10 x 10 numbers) computed from only 20 features. This means that each pair of digits is expected to be found 0.2 times in the data. I am thus very skeptical about this result, and I suspect there is a trick somewhere. I would suggest you check how the distribution of digits evolves (separately for the real data and for each fake data set) as you reduce the dimension of the feature space, and see if there is not some bias.

Response: We understand the reviewer's hesitancy at our result. We were also impressed by the performance. But we note that, as cited in our introduction, simple digit frequencies have been very successful at finding fraud in financial and other numeric data. Additionally, we note our complete transparency in the analysis and reporting. All of the data has been open since the first submission to bioRxiv almost 2 years ago. All figures are made with publicly available code, and can be manually inspected or verified.

35 - L281. "In the 100 gene-feature trial, both Naive Bayes and KNN have a significant drop in performance"

This drop of performance in Figure 4 may be a visual artifact resulting from the arbitrary numbers of features chosen for your analysis: you increase the number of features by steps of 10 until 100, then you jump from 100 to 500, then to 1000, 2000, and then you increase by steps of 2000. If the goal is to display the impact of N over both the small and large ranges, you should rather use an XY plot with a logarithmic X axis. Also, it would be worth exploring the region between 100 and 500, since this is the place where you claim to observe a drop. I would recommend adding measurements of the performance with 200, 300, and 400 features, respectively.

Response: We have increased the granularity of this figure to include all 100-feature steps from 100-1000 features and used a log-scale x axis.
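A sketch of the plotting change (a log-scaled x axis over the feature counts), with made-up accuracy values purely to illustrate the layout:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up numbers purely to illustrate the log-scaled layout, not real results
n_features = np.array([10, 20, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000, 17156])
accuracy = 1 - 0.4 * np.exp(-n_features / 150)          # arbitrary toy curve

plt.plot(n_features, accuracy, "o-")
plt.xscale("log")
plt.xlabel("number of gene features (log scale)")
plt.ylabel("accuracy")
plt.show()
```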


Attachment

Submitted filename: Rebuttal.docx

Decision Letter 1

Frederique Lisacek

17 Sep 2021

PONE-D-20-32745R1

Detecting fabrication in large-scale molecular omics data

PLOS ONE

Dear Dr. Bradshaw,

Thank you for submitting your manuscript to PLOS ONE and for your patience. This manuscript has been very well received, and the stakes are rather high if such work is not given the attention it deserves. The selection of fair and expert reviewers is the main reason for the delay. This expertise is rare.

After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The first round of reviews revealed issues that you have attended to and the new reviewer spotted a last issue regarding the processing of real datasets that you need to consider. This is rather minor in terms of effort on your part and will be major in terms of impact.

Please submit your revised manuscript by Nov 01, 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Frederique Lisacek

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 2nd review

----------

- Response 4: "the exploration of alternative methods is outside of the scope of our manuscript" --> I was expecting more of a comment about that possibility than a deep exploration. One could discuss such an issue based, for example, on the fact that k-NN (the least ML of the methods presented) also drastically (significantly :-) improves its predictive performance.

- A comment I didn't think of on the first review that could enrich the discussion concerns the potential of using a similar approach, but in an unsupervised (or semi-supervised) manner for detecting "anomalies" in datasets so as to flag potential falsifications (as done in fraud or cyberattack contexts) without having a training set of already known fake-data strategies. (For future work, not to be addressed now)

- Finally, I haven't addressed the responses to the comments from the 2nd reviewer as I find he or she would be best placed to judge their quality.

Minor comments

- Line 131: per-se instead of per-say

- Line 233: we tested FIVE different methods (not six)

- Line 247: "the remaining four models" --> Not clear which are the "remaining" models or even why is that word used here.

Reviewer #3: I love the idea behind this paper. Fraud is a very significant problem within scientific research, particularly for increasingly data-rich subject areas, and should be a concern for all of us in this community. Taking steps to develop tools to detect fraud is a key pillar of addressing this issue, alongside good Open Science practices that ensure transparency and replicability throughout the research chain (including in peer review!). Unfortunately, however, I have some significant concerns about this manuscript as it stands and I'm not convinced it's ready to be published. I outline those concerns below. First though, I would like to strongly encourage the authors to continue developing this manuscript, despite the significant review and publication delays since the first BioRxiv preprint. I have no doubt this will be a valuable piece of work in this important area.

Primary concerns:

1. The authors use three different mechanisms for generating fake data: random number generation, resampling and imputation. These are implemented with the aim of approximating the real data as accurately as possible. It is far from clear to me that any of these strategies reflect strategies scientists would actually use to fake data; however, they are as reasonable as any others given the lack of an evidence base on this. The way they are used, however, is where my concern lies.

What motivation would someone have to fake data that reflects the real data and doesn't generate any clear 'result'?!? What the authors have here is a model that detects simulated data with the same characteristics as the current data. I suppose it is possible that scientists might want to fake (simulate) extra samples with similar statistical characteristics as their real data so as to inflate the sample number in their experiments, but it seems far more likely to me that scientists would try to fake data to generate a result.

For this paper, the simplest fake result to add to the data would be a shift in the CNV value to higher or lower values for specific subsets of samples or (more likely) for specific genes within specific subsets of samples. This more realistic test would be simple to implement within the three methods used here. To summarise: in order to be convinced that these methods are useful, I want to see their performance on a real-world dataset with a CNV-treatment result in it, with different types of faked data (global up-/down- & specific gene up-/down-) added to either enhance/deplete the significance of the result, or to add new results to the data. I'd also like to see how models trained on the fake data with these signals in them perform; retraining of the models here would probably require a more nuanced investigation of the training in order to avoid training the models just to recognise the up-/down-regulation, rather than the other characteristics of the fake data.

2. For the model trained on the two decimal digits, the models are essentially being trained to detect data that doesn't obey Benford's law. The authors haven’t demonstrated that the machine learning models outperform the far simpler process of making the appropriate histogram, fitting a curve based on Benford's law to it, and seeing if you get a decent fit (with a KS test, for example). It's possible that the ML models outperform this simple test, but the authors need to do this comparison to motivate the use of the more complex and opaque ML algorithms.
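A sketch of the simple non-ML baseline the reviewer has in mind: build the leading-digit histogram and test its fit to Benford's expectation. A chi-square goodness-of-fit is used here as a close relative of the suggested KS-style check, since the digits are discrete; the lognormal and uniform values are stand-ins, not the paper's data:

```python
import numpy as np
from scipy import stats

BENFORD = np.log10(1 + 1 / np.arange(1, 10))       # expected first-digit probabilities

def benford_gof(values):
    """Chi-square goodness-of-fit of observed leading-digit counts vs Benford's law."""
    x = np.abs(np.asarray(values, dtype=float))
    x = x[x > 0]
    lead = (x / 10 ** np.floor(np.log10(x))).astype(int)   # leading significant digit
    observed = np.bincount(lead, minlength=10)[1:10]
    expected = BENFORD * observed.sum()
    return stats.chisquare(observed, expected)

real_like = np.random.default_rng(0).lognormal(0, 1.5, 5000)   # roughly Benford-like
uniform_fake = np.random.default_rng(1).uniform(0, 5, 5000)    # not Benford-like
print(benford_gof(real_like))       # large p-value expected (fit is plausible)
print(benford_gof(uniform_fake))    # tiny p-value expected (fit is rejected)
```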

3. Figures 2 & 3 contain boxplots suggesting that some of the models have zero variation in their performance across different data subsets. In some cases this is because the classifiers apparently have perfectly good/bad accuracy (which I am deeply suspicious of, and which seems too good to be true), but in some cases it's perfectly consistent accuracy performance (e.g. Fig 2 panel C). These results seem to be in disagreement with Figure 4, which suggests that the average accuracy performance never reaches 100% for any of the models. Something isn't right here. The authors need to carefully inspect their methods, reconcile these figures, and either convincingly justify the perfectly good/bad/consistent performance or (more likely) fix the bug that’s causing these results.

Detail comments:

1. Line 69/70. I think the readers would benefit from adding some clarity on the limitations of Benford's law here. In particular, it's only really valid for data that spans several orders of magnitude, and for data where the upper/lower limits are not tightly bounded.

2. Line 86. I disagree with the statement that "making up data is always wrong"; a bit more nuance is needed here. Firstly, simulating data has a long history of being informative in many areas of science. Secondly, there is a grey area here around imputed data, and particularly the imputation of missing data. It is, for example, commonplace to model a covariate in order to impute missing values in this data, and then to use this covariate data - including the imputed data - in a second model which leads to interpretation. From a certain perspective (my perspective, for example!) this could be seen as 'making up data' that has a direct impact on results/conclusions (depending on the scale of the missing data). This is certainly not widely considered wrong or inappropriate.

3. Line 111-113: does this mean that some of the samples are represented twice in the 150 sample dataset, albeit with 10% imputed data? How do we know that the ML models are learning to separate the fake samples from the real based on the imputed data signal, rather than needing both a duplicated sample and the imputed data signal. If you added samples that aren't in the original data, with a 10% imputation, would the performance of the models be as good?

4. Line 166: "Machine learning cannot…" I know what you're getting at, but this is not well worded. I think you want to say something like: "Trained ML models are restricted to data that conform to the model input specifications (i.e. the same number of input features, for example)."

5. Line 168: I think it would be worth noting here that the generalizability of this model comes with a cost - it will only work for data where Benford's law should be valid, which is certainly not all datasets.

6. Line 177: "tiddy-verse" should be "tidyverse"

7. Line 233: "six" this should be five - this needs checking throughout the paper since the number of models used has changed through the review process.

8. Line 247: "The remaining four models…". I think this region of text has been re-ordered quite a bit during the review process and it doesn't make much sense now, since we haven't had the results for the fifth model yet at this point. I think the authors need to give this section a careful read and make sure it flows sensibly.

9. Line 286: "[29149684]" I think this should be a reference??

10. Line 289: "While Benford's law…" The shift to use the decimal point digits rather than the leading digit is necessary because of the constraint that Benford's law works (best) for numbers spanning several orders of magnitude. This is not the case for the first digit in the CNV data, but this value is usually a non-zero value so the first and second digits necessarily span orders of magnitude. This is a dataset-specific approach though. In a dataset comprised mainly of numbers between 0 and 0.09, you would need to use the third and fourth decimal point digits. This would be worth illuminating here.
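To illustrate the reviewer's point, a tiny helper (name is ours) for pulling out the digit at an arbitrary decimal place, so data concentrated between 0 and 0.09 could use the third and fourth places instead:

```python
def kth_decimal_digit(x, k):
    """Digit at the k-th decimal place of x, taken by truncation (k=1 is the first decimal)."""
    return int(f"{abs(x):.{k + 2}f}".split(".")[1][k - 1])

print(kth_decimal_digit(0.0123, 3))   # -> 2
print(kth_decimal_digit(0.0123, 4))   # -> 3
```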

11. Line 301-307: "Machine learning typically…" repetition of previous text and discussion. I think this can be removed.

12. Line 339. I'm not very surprised at the reasonable performance with as few as 10 genes here. 10 genes x 75 samples = 750 datapoints. This is plenty to build a histogram to compare with the Benford's law curve (the equivalent of which is what the ML models are learning to do).

13. Line 353: I think "per data point" should be "per sample" here.

14. Figure 2: I can't really see the details of this figure well - the resolution is quite low in the PDF embedding. It would be useful to explain the components of the box plot (median, quartiles, indents, etc) for those not familiar with boxplots.

15. Supp. Fig. 3. This figure is really useful (I'd put it in the main paper) but it's a nightmare to read because it's very busy. I suggest that the authors split the figure into four facet panels with one dataset per panel.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Carlos Peña-Reyes

Reviewer #3: Yes: Dr Nicholas Schurch

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Nov 30;16(11):e0260395. doi: 10.1371/journal.pone.0260395.r004

Author response to Decision Letter 1


8 Nov 2021

Response to Reviews

Below is a detailed response to reviewer critiques.

- Bradshaw and Payne

Reviewer #1: 2nd review

----------

Reviewer #1 has asked for an extended discussion of two points. We have addressed each of these within the manuscript.

- Response 4: "the exploration of alternative methods is outside of the scope of our manuscript" --> I was expecting more of a comment about that possibility than a deep exploration. One could discuss such an issue based, for example, on the fact that k-NN (the least ML of the methods presented) also drastically (significantly :-) improves its predictive performance.

We have inserted some text that discusses this point.

- A comment I didn't think of on the first review that could enrich the discussion concerns the potential of using a similar approach, but in an unsupervised (or semi-supervised) manner for detecting "anomalies" in datasets so as to flag potential falsifications (as done in fraud or cyberattack contexts) without having a training set of already known fake-data strategies. (For future work, not to be addressed now)

We are grateful for this suggestion and have added a bit about the possibility of clustering, along with a couple of citations to work on it in other fields.

Minor comments

We have fixed all the minor typographical errors noted

Reviewer #3:

I love the idea behind this paper. Fraud is a very significant problem within scientific research, particularly for increasingly data-rich subject areas, and should be a concern for all of us in this community. Taking steps to develop tools to detect fraud is a key pillar of addressing this issue, alongside good Open Science practices that ensure transparency and replicability throughout the research chain (including in peer review!). Unfortunately, however, I have some significant concerns about this manuscript as it stands and I'm not convinced it's ready to be published. I outline those concerns below. First though, I would like to strongly encourage the authors to continue developing this manuscript, despite the significant review and publication delays since the first BioRxiv preprint. I have no doubt this will be a valuable piece of work in this important area.

Primary concerns:

1. The authors use three different mechanisms for generating fake data: random number generation, resampling, and imputation. These are implemented with the aim of approximating the real data as accurately as possible. It is far from clear to me that any of these strategies reflect the strategies scientists would actually use to fake data; however, they are as reasonable as any others given the lack of an evidence base on this. The way they are used, however, is where my concern lies.

What motivation would someone have to fake data that reflects the real data and doesn't generate any clear 'result'? What the authors have here is a model that detects simulated data with the same characteristics as the current data. I suppose it is possible that scientists might want to fake (simulate) extra samples with similar statistical characteristics to their real data so as to inflate the sample number in their experiments, but it seems far more likely to me that scientists would try to fake data to generate a result.

For this paper, the simplest fake result to add to the data would be a shift in the CNV values to higher or lower values for specific subsets of samples or (more likely) for specific genes within specific subsets of samples. This more realistic test would be simple to implement within the three methods used here. To summarise: in order to be convinced that these methods are useful, I want to see their performance on a real-world dataset with a CNV-treatment result in it, with different types of faked data (global up-/down- and specific-gene up-/down-) added either to enhance/deplete the significance of the result or to add new results to the data. I'd also like to see how models trained on fake data containing these signals perform; retraining the models here would probably require a more nuanced investigation in order to avoid training the models to recognise only the up-/down-regulation rather than the other characteristics of the fake data.

We appreciate the reviewer’s concern on this point and understand the fundamental question to be whether our methods of making up data are realistic. As a counterpoint, we offer two very high-profile papers that have recently been questioned. Both papers fabricated data in a manner similar to our methods, and in both cases the fraud was detected using digit frequencies. Therefore, we feel that our methods are realistic.

In the first example, extensively explained at https://datacolada.org/98, a PNAS paper was discovered to have fabricated half of its dataset. Moreover, and this is very important, the data could be shown to be fabricated based partly on a severe anomaly in the digit frequencies. The authors doubled the dataset by duplicating data points, introducing a strong signature that was discovered in the digit frequencies.

In the second, very recent example, extensively explained at http://steamtraen.blogspot.com/2021/07/Some-problems-with-the-data-from-a-Covid-study.html, a COVID study was shown to be fabricated and to contain duplicated patients. Again, one of the ways the fabrication was discovered and confirmed was through digit biases.

2. For the model trained on the two decimal digits, the models are essentially being trained to detect data that doesn't obey Benford's law. The authors haven't demonstrated that the machine learning models outperform the far simpler process of making the appropriate histogram, fitting a curve based on Benford's law to it, and seeing whether you get a decent fit (with a KS test, for example). It's possible that the ML models outperform this simple test, but the authors need to do this comparison to motivate the use of the more complex and opaque ML algorithms.

We do not claim to be identifying data that simply “doesn't obey Benford's law”. We refer to our method as a “Benford-like digit frequency”: it is inspired by Benford’s Law and borrows the idea of digit frequencies, but the digit distribution of the real data in Supplemental Fig 3, while reminiscent of Benford’s distribution, is not that distribution. Comparing our Benford-like digit frequencies to the true Benford distribution is therefore not a useful exercise. Additionally, some of our ML models are no more complex than a KS test, and we are not convinced that we need to prove one better than the other; we found a variety of methods that appear to work.
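For readers unfamiliar with what a “Benford-like digit frequency” feature set looks like in practice, the following is a minimal sketch (our illustration, not the authors' published code). It assumes a samples-by-genes table of copy-number values with no missing entries and computes, for each sample, the frequency of the first digit after the decimal point.

    import numpy as np
    import pandas as pd

    def digit_frequency_features(cnv: pd.DataFrame) -> pd.DataFrame:
        """Per-sample frequency of the first digit after the decimal point.

        cnv: samples x genes table of copy-number values (no missing values).
        Illustrative sketch only, not the authors' published implementation.
        """
        # First digit after the decimal point of |x|: floor(10 * |x|) mod 10.
        first_decimal = (np.floor(np.abs(cnv.to_numpy()) * 10) % 10).astype(int)

        # Count each digit 0-9 per sample (row), then normalize by gene count.
        counts = np.stack([(first_decimal == d).sum(axis=1) for d in range(10)], axis=1)
        freqs = counts / first_decimal.shape[1]

        return pd.DataFrame(freqs, index=cnv.index,
                            columns=[f"digit_{d}" for d in range(10)])

These ten per-sample frequencies can then be used as the input features for any classifier.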

3. Figures 2 & 3 contain boxplots suggesting that some of the models have zero variation in their performance across different data subsets. In some cases this is because the classifiers apparently have perfectly good/bad accuracy (which I am deeply suspicious of and which seems too good to be true), but in some cases it's perfectly consistent accuracy performance (e.g. Fig 2 panel C). These results seem to be in disagreement with Figure 4, which suggests that the average accuracy never reaches 100% for any of the models. Something isn't right here. The authors need to carefully inspect their methods to reconcile these figures, and either convincingly justify the perfectly good/bad/consistent performance or (more likely) fix the bug that's causing these results.

The largest x value used in Figure 4 was 17,000 (as opposed to the complete set of 17,156 features), which explains this difference in accuracy. Because the analysis in Figure 4 depends on repeatedly sampling a random subset of features, the true full set of features had to be excluded, as there is only one way to pick 17,156 features from 17,156 features. A clarifying detail has been added noting that 17,000 was the maximum number of features used, along with an explanation of why.
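To make the procedure described above concrete, here is a hypothetical sketch of an accuracy-versus-feature-count experiment; the random forest classifier, split sizes, and repeat count are placeholders we chose for illustration, not the published configuration.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def accuracy_vs_feature_count(X, y, feature_counts=(10, 100, 1000, 17000),
                                  repeats=10, seed=0):
        """Mean test accuracy per randomly sampled feature-subset size.

        X: numpy array of shape (n_samples, n_features); y: 0/1 labels.
        Sketch only: the classifier and split sizes are assumptions.
        """
        rng = np.random.default_rng(seed)
        results = {}
        for k in feature_counts:
            accuracies = []
            for _ in range(repeats):
                cols = rng.choice(X.shape[1], size=k, replace=False)  # random subset
                X_train, X_test, y_train, y_test = train_test_split(
                    X[:, cols], y, test_size=0.5, stratify=y)
                model = RandomForestClassifier().fit(X_train, y_train)
                accuracies.append(accuracy_score(y_test, model.predict(X_test)))
            results[k] = float(np.mean(accuracies))
        return results

Because each subset is drawn at random, the largest subset size must be strictly smaller than the total feature count for the repeated draws to differ.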

Detail comments:

1. Line 69/70. I think the readers would benefit from adding some clarity on the limitations of Benford's law here. In particular, it's only really valid for data that spans several orders of magnitude, and for data where the upper/lower limits are not tightly bounded.

A clarification has been added about the importance of spanning orders of magnitude and of the data not being tightly bounded.

2. Line 86. I disagree with the statement that "making up data is always wrong"; a bit more nuance is needed here. Firstly, simulating data has a long history of being informative in many areas of science. Secondly, there is a grey area here around imputed data, and particularly the imputation of missing data. It is, for example, commonplace to model a covariate in order to impute missing values in this data, and then to use this covariate data - including the imputed data - in a second model which leads to interpretation. From a certain perspective (my perspective, for example!) this could be seen as 'making up data' that has a direct impact on results/conclusions (depending on the scale of the missing data). This is certainly not widely considered wrong or inappropriate.

We are grateful to the reviewer for pointing out this difficult description in the text. We are certainly not suggesting that imputation is wrong, and the reviewer is correct that it is widely used. Here, however, we use algorithms that perform imputation for the purpose of inventing/fabricating datasets. As stated in the introduction, there is a nuance in how we categorize this, which depends on the author's intent. We have added new text to clarify our thinking.

3. Line 111-113: does this mean that some of the samples are represented twice in the 150-sample dataset, albeit with 10% imputed data? How do we know that the ML models are learning to separate the fake samples from the real ones based on the imputed-data signal, rather than needing both a duplicated sample and the imputed-data signal? If you added samples that aren't in the original data, with 10% imputation, would the performance of the models be as good?

We are again grateful to the reviewer for pointing out that this section needs to be clarified. In our method, the samples that are fabricated via imputation are composed of 100% fabricated, imputed data. The data was nullified and imputed in chunks of 10%, repeated 10 times, so that all of the data in the final fake sample was fabricated and unique to it. Doing it in chunks like this was a necessary adjustment to meet the expectations of missForest (and it sped up an extremely slow process). Additional detail has been added to this section.
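As an illustration of the chunked nullify-and-impute procedure described above, here is a minimal sketch. Note that the paper used the R package missForest for imputation; scikit-learn's IterativeImputer is used below purely as a stand-in, so this is an approximation of the idea rather than the published code.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def fabricate_by_imputation(real: np.ndarray, sample_idx: int, seed=0) -> np.ndarray:
        """Fabricate one sample by nullifying and re-imputing its values in ten
        ~10% chunks, so every value in the returned sample ends up imputed.

        real: samples x genes matrix. Sketch only; IterativeImputer stands in
        for missForest, which was used in the paper.
        """
        rng = np.random.default_rng(seed)
        n_genes = real.shape[1]
        chunks = np.array_split(rng.permutation(n_genes), 10)  # ten ~10% gene chunks
        fake = real[sample_idx].astype(float).copy()

        for chunk in chunks:
            data = real.astype(float).copy()
            data[sample_idx] = fake                 # keep values imputed in earlier passes
            data[sample_idx, chunk] = np.nan        # nullify the current 10% chunk
            imputed = IterativeImputer(random_state=0).fit_transform(data)
            fake[chunk] = imputed[sample_idx, chunk]

        return fake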

4. Line 166: "Machine learning cannot…" I know what you're getting at, but this is not well worded. I think you want to say something like: "Trained ML models are restricted to data that conform to the model input specifications (i.e., the same number of input features, for example)."

Thank you. We have adopted your explanation.

5. Line 168: I think it would be worth noting here that the generalizability of this model comes with a cost - it will only work for data where Benfords-law should be valid, which is certainly not all datasets.

I have added a clarifying statement about this: “though its effectiveness will still be dependent on the existence of digit-frequency patterns”.

6. Line 177: "tiddy-verse" should be "tidyverse"

Fixed

7. Line 233: "six" this should be five - this needs checking throughout the paper since the number of models used has changed through the review process.

Fixed; the other reviewer caught this too.

8. Line 247: "The remaining four models…". I think this region of text has been re-ordered quite a bit during the review process and it doesn't make much sense now, since we haven't had the results for the fifth model yet at this point. I think the authors need to give this section a careful read and make sure it flows sensibly.

Yes, some mistakes were made and not caught after our first round of revisions (number mismatch and confusing use of “remaining”). Changes have been made to this paragraph and are highlighted.

9. Line 286: "[29149684]" I think this should be a reference??

Yep, thanks

10. Line 289: "While Benford's law…" The shift to use the decimal-point digits rather than the leading digit is necessary because of the constraint that Benford's law works (best) for numbers spanning several orders of magnitude. This is not the case for the first digit in the CNV data, but this value is usually non-zero, so the first and second decimal digits necessarily span orders of magnitude. This is a dataset-specific approach, though: in a dataset comprised mainly of numbers between 0 and 0.09, you would need to use the third and fourth decimal-point digits. This would be worth illuminating here.

These details have been added.

11. Line 301-307: "Machine learning typically…" repetition of previous text and discussion. I think this can be removed.

Agreed and removed. Brevity is better.

12. Line 339: I'm not very surprised at the reasonable performance with as few as 10 genes here. 10 genes x 75 samples = 750 data points. This is plenty to build a histogram to compare with the Benford's Law curve (the equivalent of which is what the ML models are learning to do).

13. Line 353: I think "per data point" should be "per sample" here.

Changed

14. Figure 2: I can't really see the details of this figure well - the resolution is quite low in the PDF embedding. It would be useful to explain the components of the box plot (median, quartiles, indents, etc) for those not familiar with boxplots.

Sorry the resolution was low; our figures were all saved and submitted at fairly high resolution. I assume some resolution was lost in the PDF-embedding process but hope that is not the case in the actual publication. If the figures still look fuzzy, you can find all of the actual .png and .tiff files, properly labeled, in the Figures directory of the GitHub repo: https://github.com/MSBradshaw/FakeData/tree/master/Figures

15. Supp. Fig. 3. This figure is really useful (I'd put it in the main paper) but it's a nightmare to read because it's very busy. I suggest that the authors split the figure into four facet panels with one dataset per panel.

Supplemental Figure 3 has been split into a four-panel plot.

Attachment

Submitted filename: Response to Reviews.docx

Decision Letter 2

Frederique Lisacek

10 Nov 2021

Detecting fabrication in large-scale molecular omics data

PONE-D-20-32745R2

Dear Dr. Bradshaw,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Frederique Lisacek

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Frederique Lisacek

19 Nov 2021

PONE-D-20-32745R2

Detecting fabrication in large-scale molecular omics data

Dear Dr. Bradshaw:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Frederique Lisacek

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Methods of data fabrication.

    (A) The random method of data fabrication identifies the range of observation for a specific locus and then randomly chooses a number in that range. (B) The resampling method chooses values present in the original data. (C) The imputation method iteratively nullifies and then imputes data points from a real sample.

    (TIF)
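    To complement the S1 Fig caption above, here is a minimal sketch of the random and resampling fabrication methods, assuming a samples-by-genes copy-number matrix (our illustration, not the authors' published code):

        import numpy as np

        def fabricate_random(real: np.ndarray, seed=0) -> np.ndarray:
            """Random method: for each gene (column), draw a value uniformly
            from that gene's observed range. Illustrative sketch only."""
            rng = np.random.default_rng(seed)
            return rng.uniform(real.min(axis=0), real.max(axis=0))

        def fabricate_resample(real: np.ndarray, seed=0) -> np.ndarray:
            """Resampling method: for each gene, pick one of the values actually
            observed for that gene in the real data. Illustrative sketch only."""
            rng = np.random.default_rng(seed)
            rows = rng.integers(0, real.shape[0], size=real.shape[1])
            return real[rows, np.arange(real.shape[1])]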

    S2 Fig. Training and testing overview.

    After creating 50 fake samples using any one of the three methods of fabrication, the 100 real samples and 50 fake samples were randomly split into a train and test set of equal size and proportions (50 real and 25 fake in each set). The training sets were then used to train various machine learning models using 10-fold cross validation. Next, trained models were used to make predictions on the testing data. Predictions were then scored with total accuracy.

    (TIF)
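    As a rough illustration of the training/testing procedure summarized in the S2 Fig caption above, here is a hypothetical sketch; the logistic regression model, hyperparameter grid, placeholder feature matrix, and random seeds are our assumptions, not the authors' published configuration.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import GridSearchCV, train_test_split

        # Placeholder feature matrix: 100 real + 50 fake samples, 10 features each.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(150, 10))
        y = np.array([0] * 100 + [1] * 50)   # 0 = real, 1 = fabricated

        # Stratified 50/50 split -> 50 real + 25 fake in each of train and test.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=0)

        # Train with 10-fold cross-validation, then score on the held-out test set.
        model = GridSearchCV(LogisticRegression(max_iter=1000),
                             {"C": [0.1, 1.0, 10.0]}, cv=10)
        model.fit(X_train, y_train)
        print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))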

    S3 Fig. Distribution of first digits.

    Distribution of normalized first-digit-after-the-decimal frequencies in 75 real copy-number samples (A) and 50 fake samples generated by the random (B), resampled (C), and imputed (D) methods of fabrication. The x-axes represent each digit in the first position after the decimal place. The y-axes represent the normalized frequency of the digit. Black lines represent the mean and diamonds represent outliers. Similar to what is seen in a distribution of first digits conforming to Benford’s Law, the CNA data also exhibit a long right tail.

    (TIF)

    S4 Fig. Data relationships in fabricated data.

    The correlation between pairs of genes is evaluated to determine whether fabrication methods can replicate inter-gene patterns. Plots on the left-hand side (A, C, E, and G) display data from two correlated genes, PLEKHN1 and HES4, adjacent genes found on 1p36. Plots on the right-hand side (B, D, F, and H) display genes DFFB and OR4F5, which have marginal Spearman correlation in the real data (0.27). The plots reveal that random and resampled data have little to no correlation between related genes. Imputation produces data with correlation values that are similar to the original data (0.97 and 0.35, respectively).

    (TIF)
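    A small sketch of how the gene-pair correlation check described in the S4 Fig caption could be computed (illustrative only; the column names follow the genes named in the caption, and the table variables in the usage comment are hypothetical):

        import pandas as pd
        from scipy.stats import spearmanr

        def gene_pair_correlation(data: pd.DataFrame, gene_a: str, gene_b: str) -> float:
            """Spearman correlation between two gene columns of a samples x genes table."""
            rho, _ = spearmanr(data[gene_a], data[gene_b])
            return float(rho)

        # Compare, e.g., gene_pair_correlation(real_cnv, "PLEKHN1", "HES4") with the same
        # call on a fabricated table to check whether inter-gene structure is preserved.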

    S1 File. Parameters for models.

    Contains the hyperparameters used for all machine learning models depending on the type of data used.

    (TXT)

    Attachment

    Submitted filename: Rebuttal.docx

    Attachment

    Submitted filename: Response to Reviews.docx

    Data Availability Statement

    All data used in this paper can be found in GitHub (https://github.com/MSBradshaw/FakeData).


    Articles from PLoS ONE are provided here courtesy of PLOS
