Testing the inflationary effects of the detected PPCC DDs in the batch-balanced case for the DMD, leukemia, and ALL data sets
In each subplot, the title names the data set used and the subtitle states the sizes of the training and validation sets. Each subplot shows the distributions of model accuracies across different training-validation sets. The x-axis labels describe the characteristics of each training-validation set: “i Doppel” (where i = 0, 2, 4, 6, 8, 10 or 0, 2, 4, 6 or 0, 2, 2, 4, 5) refers to a training-validation set with i PPCC DD samples in the validation set, where a PPCC DD sample is a validation sample that is a PPCC DD with at least one sample in the training set. “i Pos Con” (where i = 10, 6, 5) refers to a training-validation set with i validation samples duplicated from the training set. “Neg Con” refers to the accuracies produced by 22 binomial distributions (n = 10, p = 0.5). The y-axis indicates the validation accuracies of all models (1 indicates that all validation samples were correctly classified). For each training-validation set, the performance of 22 models with different feature sets was evaluated: 20 models with random feature sets (gray), one model with the features of highest variance (pink), and one model with the features of lowest variance (green). The scatterplot shows the accuracy of each model, the violin plot shows the distribution of random model accuracies, and the cross bar marks the mean random model accuracy.
(A–C) High random model accuracies can be observed in (A) and (B), whereas in (C), random model accuracies remain close to 0.5 across all training-validation sets. In (A), a positive relationship between the number of PPCC DD samples and random model validation accuracy is evident, which suggests that most of the tested PPCC DDs are functional doppelgängers (FDs). In (B), a more gradual increasing trend is observed between “4 Doppel” and “6 Doppel,” which suggests that only the PPCC DDs added between the “4 Doppel” and “6 Doppel” training-validation sets are FDs.
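The “Neg Con” baseline described above can be illustrated with a minimal sketch. It assumes that each of the 22 accuracies is one draw from a binomial distribution (n = 10, p = 0.5) divided by n, so that it is on the same 0 to 1 scale as the validation accuracies; this reading, the function name `negative_control_accuracies`, and the `seed` parameter are illustrative assumptions, not the authors' implementation.

```python
import random


def negative_control_accuracies(n_models=22, n_validation=10, p=0.5, seed=0):
    """Simulate chance-level accuracies for a negative control.

    Hypothetical helper: each model's accuracy is the fraction of
    n_validation samples "classified correctly" when every prediction
    is correct with probability p (one binomial draw per model).
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_models):
        # One Binomial(n_validation, p) draw, built from Bernoulli trials.
        correct = sum(rng.random() < p for _ in range(n_validation))
        accuracies.append(correct / n_validation)
    return accuracies


if __name__ == "__main__":
    acc = negative_control_accuracies()
    print("mean negative-control accuracy:", sum(acc) / len(acc))
```

Under this assumption the simulated accuracies scatter around 0.5, which is the chance-level reference against which the “i Doppel” and “i Pos Con” distributions can be compared.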