Abstract
Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class (protein) labels using available measurements across N instances (peptides). More generally, we present a non-parametric solution to the problem of identifying instances that are deemed outliers relative to the subset of instances assigned to the same class. The primary assumption is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. We show that the overall type I error probability across all instances within a class can be controlled at some fixed value (say α). We also demonstrate conditions under which similar results on the type II error probability hold. The theoretical results are supplemented by an extensive numerical study illustrating the applicability and viability of our method. Even with up to 25% of instances initially mislabeled, our testing procedure maintains high specificity and greatly reduces the proportion of mislabeled instances. The applicability and effectiveness of our testing procedure are further illustrated by a detailed example on a proteomics data set from children with sickle cell disease, where five spike-in proteins acted as contrasting controls.
Keywords: Non-parametric, hypothesis testing, machine learning, classification, proteomics
1. Introduction
The research presented in this paper is motivated by an open problem in the quantification of proteins in a label-free shotgun proteomics work-flow.
In label-free shotgun proteomics, the experimental units of interest (proteins) are not measured directly but are represented by measurements on one to 500+ enzymatically cleaved pieces known as peptides. The amino acid sequences composing each peptide are not known a priori, but are inferred by algorithmic procedures acting on spectrum data from the mass spectrometer. By keeping ‘correctly’ labeled peptides (instances) and removing inaccurately labeled peptides from each protein (class), subsequent quantitative analyses, which assume that all measurements are equally representative of the protein, are made more accurate and powerful. Our proposed testing procedure is designed to accomplish this validation task. More specifically, we present a non-parametric solution to the problem of identifying instances that are outliers relative to the subset of instances assigned to the same class. This serves as a proxy for finding errors in the data set: instances for which the class label is recorded incorrectly, or for which the quantitative data are sufficiently noisy as to render them uninformative for the purposes of summarizing the class.
Within the field of proteomics, inconsistencies across peptide measurements (including those caused by misidentified peptides, inaccurate quantification, and peptide sequences originating from multiple proteins) can obfuscate patterns in protein abundance measurements across samples (or subjects), making it more difficult to correctly identify proteins that vary across the groups of interest. While a more stringent criterion on the false discovery rate (FDR) [2] of the original identification algorithm is likely to reduce the proportion of misidentified peptides, it will inevitably also remove correct matches due to the inherent trade-off between the FDR and the false non-discovery rate (FNR). While some proteomics work-flows have stressed identification accuracy over quantity (e.g. multiple reaction monitoring, [1]), shotgun proteomics, in particular, has traditionally focused on attempting to identify as many peptides as possible, with robust summarization methods used to address the prevailing inconsistencies across the peptides [10,14,17].
More recently, alternative methods that explicitly address identification errors and other inconsistencies between the patterns in abundance measurements of peptides from the same protein have also been proposed. Unfortunately, however, none of these algorithms has seen much use in practice. Protein Quantification by Peptide Quality Control (PQPQ) [5] is an ad-hoc method that finds a set of ‘representative’ peptides from each protein. Additional peptides from these proteins are only retained if they have similar characteristics to one of the ‘representative’ peptides. The Bayesian proteoform method proposed by Webb-Robertson et al. [18] is primarily focused on grouping peptides based on their pattern of positive and negative differences across comparison groups (e.g. disease/healthy or treatments). This focus on comparison groups can create heterogeneous peptide clusters that meet the single criterion required but are otherwise dissimilar. Lucas et al. [8] also use Bayesian methods (within the Gaussian framework), but focus on creating so-called ‘meta-proteins’, consisting of peptides with similar abundance patterns across multiple proteins. In contrast, our proposed filtering algorithm is entirely non-parametric, requiring no parametric assumptions on the distribution of abundances for each protein. It is specifically designed to identify a subset of validated peptides that can be used for any subsequent analysis, including tests of differential abundance, classification, and clustering.
Outside of proteomics, a similar problem has been studied within the field of classification. Classification models trained on data with labeling errors tend to be more complex and less accurate than models trained on data without labeling errors [11,12,19]. As described in Frénay and Verleysen [6], multiple algorithms have been developed to improve classification algorithms in the presence of labeling errors. Similar to the approaches utilized in proteomics, these have been grouped into three categories: robust methods, explicit modeling methods, and filtering methods. In particular, these filtering methods can also be used to identify (and remove) mislabeled peptides from proteins.
By filtering method, we refer specifically to an algorithm designed to determine whether each instance (e.g. peptide) should be retained in its original class (e.g. protein), or whether it should be removed from that class. This inherently assumes that every instance has an original class label; these algorithms do not provide an initial determination of the class label. In the case of proteomics, these original class labels are likely to originate from identification algorithms, or modifications thereof. For example, proteins with high sequence overlap can be combined into a single ‘protein group’ and analyzed together, or split into subsets depending on the memberships of the different peptides.
In many respects, both the proteomics and classification filtering algorithms function by separating out ‘good’ instances from ‘bad’ instances. However, there is an important distinction between the problem of interest within proteomics and that addressed by the classification algorithms. In the classification problem, the aim is an improved ability to classify new instances: the accuracy of the resulting training set is secondary to the overall improved performance of the resulting classification procedure. On the other hand, the overall accuracy of the filtered proteomics data set is of paramount interest in the proteomics problem, and there is no need to classify new peptides.
In this paper, we present our proposed non-parametric filtering/testing procedure and demonstrate its properties. As will be seen below, it is capable of validating the initial labeling of large proteins (40+ peptides) with high accuracy, even in the presence of heterogeneity across peptides within a protein or when up to 25% of peptides are mislabeled. A detailed comparison of our proposed procedure against several alternative algorithms from both the proteomics and classification literature can be found in Key [7]. This comparison includes the proteomics-focused methods of Forshed et al. [5] and Webb-Robertson et al. [18], as well as classification-focused algorithms implemented in the NoiseFiltersR package [9]. For the sake of space, we do not provide the details of this comparison here; it will be presented and discussed in detail in a subsequent paper.
The paper is organized as follows. Section 2 presents in a non-parametric setting the theoretical basis for the algorithm as well as the testing procedure. Section 3 uses a simulation study to demonstrate the effectiveness of the proposed algorithm. Section 4 illustrates the application of the algorithm to a real proteomics data set (sickle cell disease in children) where spiked proteins provided an experimental control on protein identification. Section 5 wraps up the paper with a discussion and concluding remarks.
2. Methodology
2.1. The basic setup
Consider a data set of N instances, N_1, …, N_K of which are presumed to belong to classes C_1, …, C_K, respectively, with N_k > 1 for k = 1, …, K. Let C_0 be a ‘mega-class’ consisting of the N_0 instances that are either unassigned to a specific class or are the sole representative of a class, so that N = N_0 + N_1 + ⋯ + N_K. For simplicity, we use the shorthand notation i ∈ C_k to indicate an instance i which belongs to class C_k, while i ∉ C_k indicates an instance which does not. Let x_i = (x_{i1}, …, x_{in}) be the (vector) of observations from instance i (i = 1, …, N) across n (independent) samples. The available data is thus an N × n matrix X of such quantitative observations (in the case of shotgun proteomics, this corresponds to the observed relative quantity of each peptide). The ‘distance’ between any two instances with observations x_i and x_j can thus be measured using any standard distance or quasi-distance function,
d_{ij} = d(x_i, x_j), for i, j = 1, …, N.   (1)
For instance, d(x_i, x_j) could be a measure of the dissimilarity between peptides over the n samples in the study. In this case, one popular quasi-distance function is defined by the correlation between x_i and x_j, r_{ij}, where

d(x_i, x_j) = 1 − r_{ij},  with  r_{ij} = (1/n) Σ_{t=1}^{n} (x_{it} − x̄_i)(x_{jt} − x̄_j)/(s_i s_j).   (2)

Here, x̄_i = (1/n) Σ_{t=1}^{n} x_{it} and s_i² = (1/n) Σ_{t=1}^{n} (x_{it} − x̄_i)², for each i = 1, …, N.
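As a concrete sketch of the quasi-distance in (1)–(2), the following computes a correlation quasi-distance matrix from an N × n data matrix. The function name and the specific convention d = 1 − r are ours, chosen for illustration:

```python
import numpy as np

def correlation_distance_matrix(X):
    """Pairwise correlation quasi-distance d(i, j) = 1 - corr(x_i, x_j).

    X is an (N, n) array: N instances (e.g. peptides) measured across
    n samples.  The 1 - r convention is one standard choice; other
    scalings of the correlation would work equally well here.
    """
    R = np.corrcoef(X)          # (N, N) matrix of pairwise correlations
    D = 1.0 - R                 # convert similarity to quasi-distance
    np.fill_diagonal(D, 0.0)    # d(i, i) = 0 by convention
    return D
```

Perfectly correlated instances then have distance 0 and perfectly anti-correlated instances distance 2, so smaller distances correspond to more similar abundance patterns.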
Let D = (d_{ij}) be the (symmetric) N × N matrix comprised of these between-instance observed distances. Without loss of generality, we assume that the entries of D are ordered such that the first N_1 entries belong to C_1, the next N_2 entries belong to C_2, etc. Accordingly, we partition D as

D = (D_{k,k′}), k, k′ = 0, 1, …, K,   (3)

where D_{k,k} is the N_k × N_k matrix of between-instance distances within class C_k, and the elements of D_{k,k′} represent the distances between the instances belonging to C_k and the instances belonging to C_{k′}. Note, in particular, that D_{k′,k} = D_{k,k′}′.
To begin with, consider at first class C_k and the N_k instances initially assigned to it. For a fixed i, i ∈ C_k, let x_i be a given instance in class C_k and let d_i = (d_{i1}, …, d_{iN}) be the ith row of D, where d_{ij} = d(x_i, x_j) and d_{ii} = 0, for i, j = 1, …, N. Clearly, (d_{ij})_{j ∈ C_k} is the ith row of D_{k,k}. For the class C_k, we consider the stochastic modeling of the elements of d_i. We assume that for each instance i ∈ C_k, the observed within-class distances from it to the other N_k − 1 instances in C_k are random variables distributed according to a class-specific distribution, G_k, so that

d_{ij} ∼ G_k, for all j ∈ C_k, j ≠ i   (4)

(since d_{ii} = 0). Here, the c.d.f.s G_1, …, G_K are defined, for each fixed i ∈ C_k and any t ≥ 0, as

G_k(t) = P(d_{ij} ≤ t), j ∈ C_k, j ≠ i.   (5)
Our notation in (5) stresses that we are assuming that the distribution of distances from each individual instance to the remaining instances in C_k is identical. Similarly, the distances between the given instance, i ∈ C_k, and the instances in class C_{k′}, k′ ≠ k, are also random variables distributed according to some distribution F_{k,k′}, so that

d_{ij} ∼ F_{k,k′}, for all j ∈ C_{k′}.   (6)

Here, F_{k,k′}, k′ ≠ k, are K − 1 distinct c.d.f.s defined, for each fixed i ∈ C_k and any t ≥ 0, as

F_{k,k′}(t) = P(d_{ij} ≤ t), j ∈ C_{k′}.   (7)
We assume throughout this work that G_k and F_{k,k′}, k′ ≠ k, are continuous distributions with p.d.f.s g_k and f_{k,k′}, respectively. If we further assume that all the instances from class C_k are equally representative of the true measurement across all n samples, we would expect that the distances between any two measurements from within class C_k are stochastically smaller than distances between measurements from within C_k and instances associated with class C_{k′}, where k′ ≠ k. Accordingly, we have the following assumption.
Assumption 2.1 Stochastic ordering —
For each k′ ≠ k,

G_k(t) ≥ F_{k,k′}(t), for all t ≥ 0,   (8)

or equivalently, the within-class distances in (4) are stochastically smaller than the between-class distances in (6).
In light of (7), the distribution of distances from the ith instance in C_k to any other random instance J, selected uniformly from among the N − N_k instances not in C_k, is thus the mixture,

F_k(t) = Σ_{k′ ≠ k} w_{k′} F_{k,k′}(t),   (9)

where we have taken w_{k′} = N_{k′}/(N − N_k). Further, if I is a randomly selected instance in class C_k, selected with probability 1/N_k, then it follows that

P(d_{IJ} ≤ t) = F_k(t).   (10)

Similarly, if I and J represent two distinct instances, both randomly selected from C_k with I ≠ J, then

P(d_{IJ} ≤ t) = G_k(t).   (11)

It follows by Assumption 2.1 that

G_k(t) ≥ F_k(t), for all t ≥ 0.   (12)
Now, for any t ≥ 0 define

ψ_k(t) = G_k(t) + F_k(t) − 1.   (13)

It can be easily verified that the equation ψ_k(t) = 0 has a unique solution, t_k, such that ψ_k(t_k) = 0 and

τ_k := G_k(t_k) = 1 − F_k(t_k).   (14)

As we will see below, the value of t_k serves as a cut-off point to differentiate between the distribution governing distances between instances in class C_k and the distribution of distances going from instances in C_k to all remaining instances. To allow our procedure to effectively discern between these two distributions, we have to restrict or limit the extent of the potential ‘overlap’ between them to less than 50% of their respective areas (see Figure 4 in Section 3.2.2 below for an illustration of this point). Accordingly, and in light of Assumption 2.1 and (12), we require that this cut-off point between these two distributions be strictly greater than the median of G_k, namely,
Figure 4.
A histogram of all distances within (red/left) and between instances in and those in (blue/right) for p = 0, , n = 100, and . The lines are generated from a normal distribution with mean and standard deviations matched to the data.
Assumption 2.2 Restricted Overlap —
We assume that τ_k > 1/2 in (14), so that G_k(t_k) > 1/2 > F_k(t_k).
2.2. The testing procedure
2.2.1. Constructing the test
Consider at first class C_k and the N_k instances initially assigned to it. Based on the available data, we are interested in constructing a testing procedure for determining whether or not a given instance that was assigned to class C_k should be retained or be removed from it (and potentially be reassigned to a different class). That is, for each selected instance i from the list of N_k instances labeled as belonging to C_k, we consider the statistical test of the hypothesis

H_0^i : i ∈ C_k   (15)

against

H_1^i : i ∉ C_k,   (16)
for each i ∈ C_k. The final result of these successive hypothesis tests is the set of all those instances in C_k for which H_0^i was rejected, thus providing the set of those instances in C_k which were deemed to have been mislabeled. As we will see below, the successive testing procedure we propose is constructed so as to control the maximal probability of a type I error, while minimizing the probability of a type II error.
Towards that end, define

T_i = Σ_{j ∈ C_k, j ≠ i} 1{d_{ij} ≤ t_k},   (17)

for each i ∈ C_k, where t_k is the cut-off defined by (14) and 1{A} is the indicator function of the set A. In light of the relation (12), T_i will serve as a test statistic for the above hypotheses. The distribution of T_i under both the null and alternative hypotheses can be explicitly described by Binomial random variables, as is presented in the following lemma (proof omitted).
Lemma 2.1
Let T_i be as defined in (17) above with i ∈ C_k; then

(a) if H_0^i holds, we have T_i ∼ Bin(N_k − 1, τ_k);

(b) if H_1^i holds, we have T_i ∼ Bin(N_k − 1, 1 − τ_k).
Accordingly, the statistical test we propose will reject the null hypothesis H_0^i in favor of H_1^i for small values of T_i, say if T_i ≤ C_α for some suitable critical value C_α (to be explicitly determined below) which should satisfy

P(T_i ≤ C_α | H_0^i) ≤ α,   (18)

for each i ∈ C_k and some fixed (and small) statistical error level α. The constant C_α is the (appropriately calculated) αth percentile of the Bin(N_k − 1, τ_k) distribution. That is, if B(·; m, τ) denotes the c.d.f. of a Bin(m, τ) distribution, then for given α and τ_k, the value C_α is determined so as

C_α = max{c : B(c; N_k − 1, τ_k) ≤ α}.   (19)
The final result of this repeated testing procedure is given by the set of all instances in C_k for which H_0^i was rejected,

R_k = {i ∈ C_k : T_i ≤ C_α},

providing the set of those instances in C_k for which the binomial threshold is achieved and which therefore have been deemed mislabeled. Similarly,

A_k = {i ∈ C_k : T_i > C_α}

provides the set of instances correctly identified in C_k. It remains only to determine the optimal value of C_α for the test.
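The critical value in (19) can be found by a direct search over the binomial c.d.f. The sketch below uses SciPy; the function name is ours, and the per-test level passed in would typically be the Bonferroni-adjusted level discussed in the next subsection:

```python
from scipy.stats import binom

def binomial_critical_value(N_k, tau, alpha_star):
    """Largest integer c with P(Bin(N_k - 1, tau) <= c) <= alpha_star.

    Under H0 the test statistic T_i (the count of within-class
    distances below the cut-off) is Bin(N_k - 1, tau), and we reject
    for small T_i.  Returns -1 when no rejection region exists at
    the requested level.
    """
    m = N_k - 1
    c = -1
    while c + 1 <= m and binom.cdf(c + 1, m, tau) <= alpha_star:
        c += 1
    return c
```

For example, with N_k = 41 peptides and τ = 0.8, `binomial_critical_value(41, 0.8, 0.05 / 41)` yields the largest threshold whose null rejection probability stays below the Bonferroni-adjusted level 0.05/41.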
2.2.2. Controlling the procedure's type I error
With H_0^i and H_1^i as given in (15) and (16), let

H_0 : ⋂_{i ∈ C_k} H_0^i  versus  H_1 : ⋃_{i ∈ C_k} H_1^i.   (20)

The hypothesis H_0 above states that all the instances in C_k are correctly identified, whereas H_1 is the hypothesis that at least one of the instances in C_k is misidentified. We denote by R the cardinality of the set R_k of rejected hypotheses,

R = #{i ∈ C_k : T_i ≤ C_α},

so that R is a random variable taking values over {0, 1, …, N_k}. Note trivially that P(R ≥ 1) = 1 − P(R = 0).
We consider the ‘global’ test which rejects H_0 in (20) if H_0^i is rejected for at least one i ∈ C_k, or equivalently, if R ≥ 1. The probability of a type I error associated with this ‘global’ test is therefore

P(R ≥ 1 | H_0) ≤ Σ_{i ∈ C_k} P(T_i ≤ C_α | H_0^i) ≤ N_k α,   (21)

using the Bonferroni inequality, since by (18)–(19), P(T_i ≤ C_α | H_0^i) ≤ α. The probability in (21) can thus be controlled by taking α* = α/N_k in place of α in (19) for some fixed α, to ensure that P(T_i ≤ C_{α*} | H_0^i) ≤ α/N_k and that P(R ≥ 1 | H_0) ≤ α.
Note that if T_i, i ∈ C_k, were independent or associated random variables [4], then under H_0, P(R = 0) ≥ ∏_{i ∈ C_k} P(T_i > C_{α*}) ≥ (1 − α/N_k)^{N_k}. In this case,

P(R ≥ 1 | H_0) ≤ 1 − (1 − α/N_k)^{N_k}.

It follows for sufficiently large N_k (since (1 − α/N_k)^{N_k} → e^{−α}) that

P(R ≥ 1 | H_0) ≤ 1 − (1 − α/N_k)^{N_k} ≈ 1 − e^{−α} ≤ α.   (22)
By Lemma 2.1(b), the distribution of the test statistic T_i under the alternative hypothesis, H_1^i, is readily available so that (in similarity to (18)) the probability of the type II error of the test of (15)–(16), β, can explicitly be calculated as

β = P(T_i > C_{α*} | H_1^i) = 1 − B(C_{α*}; N_k − 1, 1 − τ_k),   (23)

for each i ∈ C_k. Further, whenever the conditions for the normal approximation to the binomial probabilities hold (see Shader and Schmid [13]), β in (23) can easily be evaluated as

β ≈ 1 − Φ((C_{α*} + 0.5 − (N_k − 1)(1 − τ_k)) / √((N_k − 1)τ_k(1 − τ_k))),

where Φ denotes the standard normal c.d.f.
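Both the exact form of (23) and its normal approximation can be evaluated directly. In the sketch below the function and argument names are ours; the alternative-hypothesis success probability 1 − τ follows Lemma 2.1(b):

```python
from scipy.stats import binom, norm

def type2_error(N_k, tau, c_alpha, exact=True):
    """P(retain | instance mislabeled) for the test of (15)-(16).

    Under the alternative, T_i ~ Bin(N_k - 1, 1 - tau), and we fail
    to reject when T_i > c_alpha.  The normal approximation applies
    a continuity correction of 0.5.
    """
    m, q = N_k - 1, 1.0 - tau
    if exact:
        return 1.0 - binom.cdf(c_alpha, m, q)
    mu = m * q
    sd = (m * q * (1.0 - q)) ** 0.5
    return 1.0 - norm.cdf((c_alpha + 0.5 - mu) / sd)
```

Raising the critical value c_alpha shrinks this error, which is the trade-off against the type I error controlled in (18).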
Remark 2.1
Under certain circumstances, namely if C_{α*} ≥ (N_k − 1)/2, the ‘symmetry’ property of the Binomial distribution about τ_k and 1 − τ_k (with τ_k > 1/2, per Assumption 2.2) implies that β ≤ α* in (23). That is, whenever the critical test value C_{α*}, as determined by (19) with fixed α and α* = α/N_k, satisfies C_{α*} ≥ (N_k − 1)/2, our testing procedure has the added feature that the type I error rate (see (18)) is controlled by α*, which also serves as an upper bound on the type II error rate, β. Thus, in this circumstance, the probability of a type I error as well as that of a type II error are simultaneously controlled by α*. We summarize this observation in Lemma 2.2. We note, however, that (as can be seen from (19)) the critical test value C_{α*} is intricately dependent on N_k and τ_k. Since the values of α and N_k are both fixed by design, the only determining parameter in (19) is τ_k, and for C_{α*} ≥ (N_k − 1)/2 to hold, τ_k should be sufficiently far from 0.5. However, though available, there is no need to explicitly calculate such a threshold on τ_k in order to ascertain whether or not β ≤ α* in (23); merely verifying that C_{α*} ≥ (N_k − 1)/2 is sufficient.
Lemma 2.2
For a given i ∈ C_k, consider the test of H_0^i versus H_1^i in (15)–(16) with a critical test value, C_{α*}, as determined in (18)–(19). Then, if C_{α*} ≥ (N_k − 1)/2, we have β ≤ α*.
Proof.
Indeed, write m = N_k − 1. If C_{α*} ≥ m/2, then m − C_{α*} − 1 < C_{α*}, and hence, by the symmetry relation P(Bin(m, 1 − τ_k) > C_{α*}) = P(Bin(m, τ_k) ≤ m − C_{α*} − 1), we obtain β = P(Bin(m, τ_k) ≤ m − C_{α*} − 1) ≤ P(Bin(m, τ_k) ≤ C_{α*}) ≤ α*.
2.3. Estimation
We note that both G_k and F_k are generally unknown (as are t_k and ψ_k), but can easily be estimated non-parametrically from the available data by their respective empirical c.d.f.s, for sufficiently large N_k and N − N_k. For each given instance i ∈ C_k,

Ĝ_{k,i}(t) = (N_k − 1)^{−1} Σ_{j ∈ C_k, j ≠ i} 1{d_{ij} ≤ t}  and  F̂_{k,i}(t) = (N − N_k)^{−1} Σ_{j ∉ C_k} 1{d_{ij} ≤ t},   (24)

for each t ≥ 0. Clearly, Ĝ_{k,i} and F̂_{k,i} are empirical c.d.f.s for estimating, based on the ith instance, G_k and F_k, respectively. Accordingly, when combined,

Ĝ_k(t) = N_k^{−1} Σ_{i ∈ C_k} Ĝ_{k,i}(t)  and  F̂_k(t) = N_k^{−1} Σ_{i ∈ C_k} F̂_{k,i}(t),   (25)

are the estimators of G_k and F_k, respectively. Further, in similarity to (13), we set

ψ̂_k(t) = Ĝ_k(t) + F̂_k(t) − 1,   (26)

and we let t̂_k denote the ‘solution’ of ψ̂_k(t) = 0; that is,

t̂_k = inf{t ≥ 0 : ψ̂_k(t) ≥ 0}.   (27)

Clearly, the value of τ_k in (14) would be estimated by

τ̂_k = Ĝ_k(t̂_k).   (28)

Note that, in view of (24), Ĝ_{k,i}(t̂_k) = T_i/(N_k − 1), with T_i as in (17) evaluated at the cut-off t̂_k. With t̂_k as an estimate of t_k in (14), an equivalent estimate of τ_k is therefore

τ̂_k = N_k^{−1} Σ_{i ∈ C_k} T_i/(N_k − 1).   (29)

By Lemma 2.1(a), E(T_i) = (N_k − 1)τ_k, and hence τ̂_k in (29) is an unbiased estimator of τ_k.
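The estimation steps (24)–(28) can be sketched as follows, with empirical c.d.f.s evaluated on the pooled grid of observed distances. The function name and the tie-breaking at the crossing point are our choices, not necessarily the paper's exact rule:

```python
import numpy as np

def estimate_cutoff(within, between):
    """Estimate the cut-off t solving G(t) + F(t) = 1, and tau = G(t).

    within  : observed within-class distances (pooled over the class)
    between : observed distances from the class to all other instances
    Empirical c.d.f.s stand in for G and F, as in (24)-(25).
    """
    within, between = np.sort(within), np.sort(between)
    grid = np.unique(np.concatenate([within, between]))
    G = np.searchsorted(within, grid, side="right") / len(within)
    F = np.searchsorted(between, grid, side="right") / len(between)
    psi = G + F - 1.0                 # empirical version of (26)
    idx = int(np.argmax(psi >= 0.0))  # first grid point with psi >= 0, cf. (27)
    return grid[idx], G[idx]          # (t_hat, tau_hat)
```

When the two samples are well separated, the estimated τ is close to 1; the more they overlap, the closer τ falls toward 1/2, at which point Assumption 2.2 becomes binding.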
2.4. An illustration
To help visualize the procedure, consider a simulated data set consisting of two features grouped into three classes of 500 observations each. These are shown as blue closed circles (the 'blue' cluster), gray Xs (the 'gray' cluster), and red open circles (the 'red' cluster). Three such scenarios are depicted in Figure 1, where each point's coordinates have been generated from a bivariate normal distribution (so that n = 2). For the purpose of the illustration, we will focus on one specific point, labeled P in the figure, but the procedure is applied simultaneously to all points in each class, and can be iterated across all three classes. To validate whether or not the initial identification of P as belonging to the blue cluster is correct, we count how many other blue instances fall within t̂ units of P, utilizing the Euclidean distance metric. In the two-dimensional Euclidean space, this can be represented using a circle of radius t̂ around P, thus providing a clear visualization of the algorithm's mechanics.
Figure 1.
An illustration of how the proposed algorithm works in a simple (two-dimensional) case. The three scenarios vary by the amount of overlap between the different clusters, which impacts the values of t and τ, and the subsequent estimate of each.
The value of t̂ (obtained from (27)) is based on both the distribution of within-class distances, G, and between-class distances, F, as estimated by (25). It can be verified that in these three scenarios, Assumption 2.1 holds for the stochastic dominance of the between-class distances over the within-class distances. Consequently, the procedure produces a contextually self-adjusting bound (t̂) around P (or any other blue point) within which most of the other points assigned to the same cluster are likely to fall if the point is identified correctly.

Using an alternative bound (say t′ < t̂), the number of blue points falling within t′ units of P is no greater than the number found within t̂ units, and is likely smaller. However, the estimated probability of falling within t′ units of P is also smaller (or at least no greater) than τ̂. Thus, fewer blue points must be within t′ units in order to successfully validate the inclusion of P in the blue cluster. The choice of threshold thus determines whether P must be extremely close to a small number of blue points, or moderately close to a large number of blue points. As seen in Figure 1, the value of t̂ as calculated by this algorithm tends to fall in the latter category, although it is affected by the extent of the overlap between the clusters. The calculated values of C_{α*}, t̂, and τ̂ obtained from applying the testing procedure in (15) to all three classes, with a fixed α in (19), are shown in Table 1. We note that in all cases τ̂ > 1/2 (Assumption 2.2) and that, since no points are truly mislabeled here, any removal would be a false rejection (cf. Remark 3.2).
Table 1.
The outcome of running the algorithm on the three scenarios in Figure 1.
| Scenario | Class | C_{α*} | t̂ | τ̂ |
|---|---|---|---|---|
| A | Blue | 355 | 2.51 | 0.784 |
| A | Red | 375 | 2.63 | 0.819 |
| A | Black | 402 | 2.81 | 0.866 |
| B | Blue | 382 | 2.71 | 0.832 |
| B | Red | 383 | 2.69 | 0.833 |
| B | Black | 489 | 4.47 | 0.996 |
| C | Blue | 472 | 3.89 | 0.977 |
| C | Red | 472 | 3.91 | 0.977 |
| C | Black | 489 | 4.51 | 0.996 |
The calculated values of t̂, and hence C_{α*} and τ̂, are greatly affected by the amount of overlap between classes. In Figure 1(c), the clear separation between the three classes causes the respective estimates of τ to be nearly 1, as seen in Table 1, Scenario C. Shifting the red class closer to the blue class (i.e. Figure 1(b) and Table 1, Scenario B) decreases the values of t̂ and τ̂ for these two classes, as expected.
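The per-instance mechanics illustrated in this section can be sketched end-to-end. Here `validate_instance` is an illustrative name, and the p-value formulation used is equivalent to the critical-value rule in (18)–(19):

```python
import numpy as np
from scipy.stats import binom

def validate_instance(D, labels, i, t0, tau, alpha):
    """Test whether instance i belongs with its assigned class.

    D is the (N, N) distance matrix and labels the initial class
    labels.  T counts same-class neighbours within t0 of i; under H0,
    T ~ Bin(N_k - 1, tau), and we flag i as mislabeled when the
    lower-tail binomial p-value falls below alpha.
    """
    same = np.flatnonzero(labels == labels[i])
    same = same[same != i]                   # the N_k - 1 classmates of i
    T = int(np.sum(D[i, same] < t0))         # test statistic, cf. (17)
    p_value = binom.cdf(T, len(same), tau)   # small T gives a small p-value
    return T, p_value, p_value <= alpha
```

For a point like P surrounded by its own cluster, T is large and the null is retained; if P sits far from the other blue points, T drops and the binomial p-value becomes small, flagging P for removal.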
3. A simulation study
Our simulation study is designed to mimic the conditions of an LC-MS/MS shotgun proteomics study. In this light, we consider a set-up in which N_1 instances (peptides) belong to class/protein C_1, and the remaining instances model the peptides belonging to any other class/protein. The distance between instances is measured using the correlation distance, again mimicking a common way to measure similarity between peptides, although similar results can be obtained using other (quasi-)distance metrics (e.g. Euclidean distance).
3.1. The simulation setup
To establish some notation, suppose that W = (W_1, …, W_N)′ is a random vector having some joint distribution F_W. We assume, without loss of generality, that the values of W are standardized, so that E(W_j) = 0 and Var(W_j) = 1, for each j = 1, …, N. We denote by Σ the corresponding correlation (covariance) matrix of W. To simplify, we assume that K = 2, so that the N instances are presumed to belong to either class C_1 or C_2. Accordingly, we partition W and Σ as W = (W_{(1)}′, W_{(2)}′)′, with W_{(1)} of dimension N_1 and W_{(2)} of dimension N_2 = N − N_1, and

Σ = ( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ),

with Σ_{21} = Σ_{12}′.

As in Section 2, let X denote the N × n data matrix of the observed intensities. We denote by x^{(t)} the tth column of X, t = 1, …, n, and we assume that x^{(1)}, …, x^{(n)} are independent and identically distributed as F_W.
Using standard notation, we write I_m for the m × m identity matrix and J_{m×m′} for the m × m′ matrix of 1s. For the simulation studies we conducted, we took F_W to be the N-variate normal distribution, so that W ∼ N_N(0, Σ), where

Σ_{11} = (1 − ρ_1) I_{N_1} + ρ_1 J_{N_1×N_1},   (30)

Σ_{22} = (1 − ρ_2) I_{N_2} + ρ_2 J_{N_2×N_2},   (31)

Σ_{12} = ρ_{12} J_{N_1×N_2},   (32)

for appropriate correlation parameters ρ_1, ρ_2, and ρ_{12}.
To allow for misclassification of instances, we included, for a certain proportion p, some ‘observed’ instance intensities in C_1 that were actually simulated with ρ_1 replaced by a different value, ρ_m say, in (30) above. Thus, m is the number of misclassified instances among the N_1 instances that were initially labeled as belonging to C_1.
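Under the block-correlation structure in (30)–(32), a data matrix can be drawn as follows. This is a sketch: the function and parameter names are ours, and the specific block values used below are illustrative rather than the paper's exact scenarios:

```python
import numpy as np

def simulate_intensities(n, N1, N2, rho1, rho2, rho12, seed=0):
    """Draw an (N, n) intensity matrix from an N-variate normal whose
    correlation matrix has the block structure of (30)-(32): rho1
    within C_1, rho2 within C_2, and rho12 between the two blocks.
    """
    N = N1 + N2
    S = np.full((N, N), rho12)     # between-class block, cf. (32)
    S[:N1, :N1] = rho1             # within-C_1 block, cf. (30)
    S[N1:, N1:] = rho2             # within-C_2 block, cf. (31)
    np.fill_diagonal(S, 1.0)       # standardized variables
    rng = np.random.default_rng(seed)
    # each of the n samples is one independent N-variate draw
    return rng.multivariate_normal(np.zeros(N), S, size=n).T  # (N, n)
```

Mislabeling can then be mimicked by generating some of the rows labeled as C_1 with a different within-block correlation, as described above.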
Remark 3.1
In this simulation, ρ_m reflects (as a proxy) the common characteristics of two mislabeled instances. When ρ_m = ρ_1, distances between two mislabeled instances have the same distribution as those between two correctly labeled instances, as would be the case for binary classification when ρ_2 = ρ_1. When ρ_m = ρ_{12}, distances between two mislabeled instances have the same distribution as distances between a correctly labeled instance and a mislabeled instance. This would be the case if the probability that two mislabeled instances come from the same class is zero.
For each simulation run, we recorded the estimates t̂ and τ̂, and counted the number of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), as defined in Table 2. From these counts, we calculated the sensitivity, specificity, false discovery proportion (FDP), false non-discovery proportion (FNP), and percent reduction in FNP (ΔFNP) for each run, defined as follows:
- Sensitivity = TP/(TP + FN): proportion of correctly removed instances out of all mislabeled instances.
- Specificity = TN/(TN + FP): proportion of correctly retained instances out of all correctly labeled instances.
- FDP = FP/(FP + TP): proportion of incorrectly removed instances out of all those removed.
- FNP = FN/(FN + TN): proportion of incorrectly retained instances out of all those retained.
- ΔFNP = 100 × (p − FNP)/p: percent reduction in FNP relative to p.
Table 2.
For a single run, each instance has one of four possible outcomes.
| | Correctly labeled | Mislabeled | Total |
|---|---|---|---|
| Keep | TN | FN | N_1 − R |
| Remove | FP | TP | R |
| Total | N_1 − m | m | N_1 |

The rows give the result of the testing procedure (keep or remove) and the columns the underlying truth; each cell shows the notation for the total count of instances with that outcome in a single run.
Each statistic was averaged over all 1000 runs.
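The per-run summary statistics can be computed directly from the confusion counts of Table 2; a sketch (the conventions for empty denominators are ours):

```python
def evaluation_metrics(tp, tn, fp, fn, p=None):
    """Per-run metrics from the confusion counts of Table 2.

    When nothing is removed (or retained), FDP (or FNP) is taken to
    be 0, one common convention for empty denominators.  If the
    mislabeling proportion p is supplied, the percent reduction in
    FNP relative to p is also reported.
    """
    removed, kept = tp + fp, tn + fn
    out = {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "FDP": fp / removed if removed else 0.0,
        "FNP": fn / kept if kept else 0.0,
    }
    if p:  # percent reduction in FNP relative to p
        out["dFNP"] = 100.0 * (p - out["FNP"]) / p
    return out
```

For example, a run that removes all 8 truly mislabeled instances plus 2 correct ones, keeping the remaining 90, has sensitivity 1.0, FDP 0.2, and FNP 0.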
3.2. Simulation results
We conducted B = 1000 simulation runs of the test procedure with n = 10, 25, 50, 75, 100, 250, 500, 700; N_1 = 25, 50, 100, 500; and various settings of the correlation parameters ρ_1, ρ_2, ρ_{12}, and ρ_m. We also varied the value of p, the proportion of mislabeled instances, such that p = 0.0, 0.05, 0.1, 0.15, 0.2, 0.25. In particular, p = 0 means no mislabeling, i.e. that the initial labeling is perfect.
3.2.1. Results with no mislabeling (case p = 0)
Simulations where p = 0 (i.e. no mislabeling) were used to illustrate the theoretical assumptions of the testing procedure, as this presents a case in which the global null hypothesis in (20) holds. In this case, in particular, the FDP measures the proportion of incorrectly rejected hypotheses in (15) out of all rejected hypothesis tests, given that at least one hypothesis test was rejected.
Remark 3.2
For p = 0, every rejected hypothesis test in (15) is incorrectly rejected. Thus, the FDP is 1 if any rejected hypothesis tests are observed, and zero otherwise (by the definition of the FDP). Consequently, the average value of the FDP over all B runs provides an estimate of P(R ≥ 1 | H_0).
Figure 2 and Table 3 show the FDP for various n and N_1. As seen in the figures, the FDP converges to zero as n increases for all values of N_1, but the convergence slows as N_1 increases. When n is small, at least one instance was removed from almost all classes. We attribute this behavior to the inherent correlation structure of the data. This will be explored further in Section 3.2.2.
Figure 2.
The as a function of n and for (a) and (b) where p = 0.
Table 3.
Simulation results for p = 0 and N_1 = 25, 50, 100, 500.
| n | FDP (N_1 = 25) | Spec. (N_1 = 25) | FDP (N_1 = 50) | Spec. (N_1 = 50) | FDP (N_1 = 100) | Spec. (N_1 = 100) | FDP (N_1 = 500) | Spec. (N_1 = 500) |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.916 | 0.929 | 1.000 | 0.880 | 1.000 | 0.834 | 1.000 | 0.737 |
| 25 | 0.843 | 0.947 | 0.999 | 0.904 | 1.000 | 0.860 | 1.000 | 0.766 |
| 50 | 0.673 | 0.965 | 0.993 | 0.933 | 1.000 | 0.894 | 1.000 | 0.807 |
| 75 | 0.461 | 0.979 | 0.969 | 0.952 | 1.000 | 0.921 | 1.000 | 0.841 |
| 100 | 0.298 | 0.987 | 0.879 | 0.967 | 0.999 | 0.943 | 1.000 | 0.872 |
| 250 | 0.001 | 1.000 | 0.060 | 0.999 | 0.414 | 0.995 | 0.999 | 0.977 |
| 500 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.089 | 1.000 |
| 700 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 |
The sensitivity and FNP are excluded from the table because they are either undefined or constant when p = 0. As explained in Remark 3.2, the FDP provides an estimate of P(R ≥ 1 | H_0) when p = 0.
The specificity measures the ability to keep instances which are correctly labeled. From Figure 3 and Table 3, it can be seen that while almost all runs remove at least one instance (based on the FDP), most instances are retained. With n = 10 and N_1 = 25, an average of 23.2 out of 25 instances were retained in each run. The specificity decreased as N_1 increased, corresponding to the slower convergence of the FDP when N_1 is larger. Even in the worst case shown, n = 10 and N_1 = 500, an average of 368.5 out of 500 instances were retained each time.
Figure 3.
The specificity as a function of n, for (a) and (b) where p = m = 0.
For p = 0, the sensitivity is undefined and the FNP is universally zero. These statistics are relevant only when at least one mislabeled instance is present in the data set (i.e. p>0) and therefore are omitted from Table 3.
3.2.2. Behavior under artificially constructed independence
In light of Remark 3.2 and the likely impact of the correlation structure present in the data on the FDP, we designed a simulation study to explore this effect. In this study, the ‘distance’ matrices were artificially created in a manner which preserved dependence due to symmetry but removed all other dependencies across distances.
To simulate ‘distance’ matrices in this case, we began by randomly generating a distance matrix using the original test procedure with p = 0 and n = 100. A normal distribution was fit to the within-C_1 distances, and another to the C_1-to-C_2 distances, as shown in Figure 4. For N_1 = 25, 50, 100, 500, 1000, 2000, 5000, and 12,000, these fitted normal distributions were then used to generate B new ‘distance’ matrices directly, drawing each entry d_{ij}, i < j, independently from the normal distribution corresponding to its block and setting d_{ji} = d_{ij}.
Figure 5 shows the results of this procedure on five sets of B = 1000 runs. As N_1 increased, the FDP also increased, but never surpassed the theoretical limit of 1 − e^{−α}, consistent with the theoretical result given in (22).
Figure 5.

Direct simulation of the distance matrix using independent draws from the fitted normal distributions with p = 0, performed as five batches of B = 1000 runs each. The average across all five batches is shown as squares. The line at the top of the plot shows the theoretical limit 1 − e^{−α} of (22).
3.2.3. Results under some mislabeling, (case p>0)
To develop context for the results with some mislabeling (i.e. p>0), we first consider for a moment a trivial filtering procedure which retains all instances in C_1. Since all of the incorrect instances are retained, the FNP of this trivial procedure is p. Clearly, then, an FNP of p can be achieved without performing any filtering at all, simply by returning all instances in C_1. For a testing procedure to improve upon this nominal level, it must result in an FNP below p. With p>0, the FDP calculates the proportion of correctly labeled instances among those removed. However, in the context of proteomics, where the set of retained peptides is used in subsequent analyses, we found that the specificity was a more relevant metric. Consequently, the FNP and specificity are the primary statistics used to evaluate our proposed testing procedure in the presence of labeling errors (p>0), with ΔFNP providing a standardized method of evaluating the decrease in FNP in a manner independent of p.
Figure 6 provides the average FNP and ΔFNP over the B = 1000 simulation runs as a function of n for varying combinations of N_1 and p. In all cases, the procedure reduced the FNP relative to p (that is, FNP < p for all results). Small values of n produced the smallest reduction (highest FNP), but this converged at approximately n = 100 to a value dependent on p and N_1. Table 4 shows how the average FNP compares between a small sample size (n = 10) and large sample sizes (averaged across n = 250, 500, and 700) at each value of p and N_1. For p = 0.05, the FNP converged to 0 for all values of N_1. For higher values of p, the FNP remained higher at small n. For example, when p = 0.25 and N_1 = 500, the instances remaining in the class after filtering will still include 10.26% mislabeled instances for n = 10 and 2.85% mislabeled instances when n is large. These are both substantial decreases from the 25% mislabeled instances seen in the unfiltered data.
Figure 6.
The FNP as a function of n for p = 0.05, 0.10, 0.20, and 0.25.
Table 4.
The mean FNP and its percent decrease at n = 10 and at large n for each combination of p and the other simulation parameters.
| | | n = 10 | | n large | | n = 10 | | n large | |
|---|---|---|---|---|---|---|---|---|---|
| p | | FNP | % dec. | FNP | % dec. | FNP | % dec. | FNP | % dec. |
| 0.05 | 25 | 0.0169 | 66.2 | 0.0000 | 100.0 | 0.0168 | 66.4 | 0.0000 | 100.0 |
| | 50 | 0.0131 | 73.8 | 0.0000 | 100.0 | 0.0135 | 72.9 | 0.0000 | 100.0 |
| | 100 | 0.0150 | 69.9 | 0.0000 | 100.0 | 0.0137 | 72.6 | 0.0000 | 100.0 |
| | 500 | 0.0110 | 78.0 | 0.0000 | 100.0 | 0.0116 | 76.8 | 0.0000 | 100.0 |
| 0.10 | 25 | 0.0361 | 63.9 | 0.0007 | 99.3 | 0.0388 | 61.2 | 0.0000 | 100.0 |
| | 50 | 0.0394 | 60.6 | 0.0015 | 98.5 | 0.0429 | 57.1 | 0.0001 | 99.9 |
| | 100 | 0.0325 | 67.5 | 0.0009 | 99.1 | 0.0363 | 63.7 | 0.0001 | 99.9 |
| | 500 | 0.0255 | 74.5 | 0.0004 | 99.6 | 0.0298 | 70.2 | 0.0000 | 100.0 |
| 0.15 | 25 | 0.0585 | 61.0 | 0.0047 | 96.8 | 0.0635 | 57.7 | 0.0006 | 99.6 |
| | 50 | 0.0637 | 57.5 | 0.0068 | 95.5 | 0.0694 | 53.8 | 0.0011 | 99.3 |
| | 100 | 0.0585 | 61.0 | 0.0061 | 95.9 | 0.0659 | 56.1 | 0.0010 | 99.3 |
| | 500 | 0.0446 | 70.3 | 0.0032 | 97.8 | 0.0511 | 66.0 | 0.0003 | 99.8 |
| 0.20 | 25 | 0.1254 | 37.3 | 0.0468 | 76.6 | 0.1427 | 28.6 | 0.0249 | 87.6 |
| | 50 | 0.1042 | 47.9 | 0.0300 | 85.0 | 0.1229 | 38.6 | 0.0108 | 94.6 |
| | 100 | 0.0922 | 53.9 | 0.0214 | 89.3 | 0.1069 | 46.6 | 0.0057 | 97.2 |
| | 500 | 0.0707 | 64.6 | 0.0115 | 94.2 | 0.0852 | 57.4 | 0.0021 | 98.9 |
| 0.25 | 25 | 0.1612 | 35.5 | 0.0854 | 65.8 | 0.1891 | 24.4 | 0.0678 | 72.9 |
| | 50 | 0.1399 | 44.0 | 0.0574 | 77.1 | 0.1657 | 33.7 | 0.0304 | 87.8 |
| | 100 | 0.1303 | 47.9 | 0.0485 | 80.6 | 0.1584 | 36.6 | 0.0205 | 91.8 |
| | 500 | 0.1026 | 59.0 | 0.0285 | 88.6 | 0.1292 | 48.3 | 0.0078 | 96.9 |
Figure 7 provides the average specificity over the 1000 simulation runs as a function of n for varying combinations of the simulation parameters and p. The specificity always converged to 1 as n increased, with convergence by n = 250 in all cases; for larger values of p, convergence was faster (by n = 100). For small values of n, a higher specificity was observed at smaller parameter values, although even in the worst case the specificity exceeded 0.75.
Figure 7.
The specificity as a function of n for p = 0.05, 0.10, 0.20, and 0.25.
Table 5 gives the average estimate of the FDP, FNP, percent decrease in FNP, sensitivity, and specificity for n = 50 across all combinations of p and the remaining simulation parameters. As already noted above, the table shows that the FNP increases and the sensitivity decreases as p increases. The FDP gives the proportion of correctly labeled instances among all the removed instances. This measure increases as a function of the class size, corresponding to the decrease in the specificity of the procedure as more correctly labeled instances are filtered out. On the other hand, it decreases as a function of p due to the increased proportion of mislabeled instances available to be removed.
Table 5.
The FDP, FNP, percent decrease in FNP, sensitivity (Sens.), and specificity (Spec.) of the algorithm using simulated data with n = 50.
| p | | FDP | FNP | % dec. | Sens. | Spec. | FDP | FNP | % dec. | Sens. | Spec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 25 | 0.673 | 0.000 | | | 0.965 | 0.630 | 0.000 | | | 0.969 |
| | 50 | 0.993 | 0.000 | | | 0.933 | 0.980 | 0.000 | | | 0.936 |
| | 100 | 1.000 | 0.000 | | | 0.894 | 1.000 | 0.000 | | | 0.900 |
| | 500 | 1.000 | 0.000 | | | 0.807 | 1.000 | 0.000 | | | 0.813 |
| 0.05 | 25 | 0.131 | 0.000 | 99.242 | 0.991 | 0.989 | 0.138 | 0.001 | 98.912 | 0.987 | 0.988 |
| | 50 | 0.310 | 0.000 | 99.305 | 0.992 | 0.975 | 0.306 | 0.000 | 99.745 | 0.997 | 0.973 |
| | 100 | 0.349 | 0.000 | 99.384 | 0.994 | 0.965 | 0.372 | 0.000 | 99.422 | 0.995 | 0.959 |
| | 500 | 0.542 | 0.000 | 99.763 | 0.998 | 0.926 | 0.559 | 0.000 | 99.819 | 0.999 | 0.913 |
| 0.10 | 25 | 0.035 | 0.003 | 96.689 | 0.961 | 0.996 | 0.049 | 0.002 | 97.840 | 0.974 | 0.994 |
| | 50 | 0.046 | 0.003 | 96.583 | 0.969 | 0.994 | 0.080 | 0.002 | 97.726 | 0.980 | 0.989 |
| | 100 | 0.086 | 0.002 | 97.793 | 0.980 | 0.988 | 0.144 | 0.001 | 98.609 | 0.988 | 0.978 |
| | 500 | 0.197 | 0.001 | 99.018 | 0.992 | 0.968 | 0.268 | 0.001 | 99.431 | 0.995 | 0.949 |
| 0.15 | 25 | 0.013 | 0.010 | 93.127 | 0.921 | 0.998 | 0.028 | 0.008 | 94.708 | 0.940 | 0.996 |
| | 50 | 0.016 | 0.011 | 92.580 | 0.930 | 0.997 | 0.040 | 0.007 | 95.421 | 0.957 | 0.993 |
| | 100 | 0.027 | 0.009 | 93.669 | 0.946 | 0.995 | 0.072 | 0.006 | 95.928 | 0.966 | 0.985 |
| | 500 | 0.069 | 0.005 | 96.927 | 0.974 | 0.986 | 0.145 | 0.003 | 98.166 | 0.986 | 0.965 |
| 0.20 | 25 | 0.001 | 0.055 | 72.335 | 0.760 | 1.000 | 0.009 | 0.050 | 75.199 | 0.784 | 0.999 |
| | 50 | 0.006 | 0.037 | 81.397 | 0.844 | 0.999 | 0.027 | 0.030 | 84.936 | 0.874 | 0.994 |
| | 100 | 0.011 | 0.026 | 87.029 | 0.893 | 0.997 | 0.042 | 0.018 | 91.068 | 0.928 | 0.989 |
| | 500 | 0.026 | 0.014 | 93.150 | 0.945 | 0.993 | 0.094 | 0.009 | 95.687 | 0.967 | 0.971 |
| 0.25 | 25 | 0.000 | 0.094 | 62.489 | 0.668 | 1.000 | 0.017 | 0.105 | 58.032 | 0.619 | 0.998 |
| | 50 | 0.003 | 0.065 | 74.092 | 0.779 | 0.999 | 0.022 | 0.060 | 75.990 | 0.793 | 0.996 |
| | 100 | 0.005 | 0.053 | 78.976 | 0.833 | 0.999 | 0.032 | 0.046 | 81.705 | 0.856 | 0.991 |
| | 500 | 0.013 | 0.031 | 87.412 | 0.903 | 0.996 | 0.067 | 0.022 | 91.073 | 0.934 | 0.976 |
The sensitivity of the procedure, as discussed above, gives the proportion of mislabeled instances detected out of all mislabeled instances. Mirroring the FNP results, this measure is high when p is small and the class is large, and decreases for large p and small classes. For example, when p = 0.20 and the class contains 500 instances, the data consist of 100 mislabeled instances and 400 correctly labeled instances. On average, based on the reported sensitivity and specificity, the procedure removed 94.5% of the mislabeled instances and only 0.7% of the correctly labeled instances. The resulting filtered data set thus has an average of 5.5 mislabeled instances and 397.2 correctly labeled instances, for an average of 402.7 total instances. This reflects a decrease in the proportion of mislabeled instances of 93.15%: while 20% of the original data set was mislabeled, only 1.4% of the filtered data set remains mislabeled.
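The arithmetic in this example can be reproduced directly from the reported sensitivity and specificity (values taken from the example above):

```python
# Worked example: p = 0.20, class of 500 instances, with the reported
# sensitivity (0.945) and specificity (0.993) for n = 50.
n_bad, n_good = 100, 400
sens, spec = 0.945, 0.993

bad_left = n_bad * (1 - sens)   # mislabeled instances surviving the filter
good_left = n_good * spec       # correctly labeled instances retained
frac_bad = bad_left / (bad_left + good_left)
pct_decrease = 100 * (0.20 - frac_bad) / 0.20

print(round(bad_left, 1), round(good_left, 1))  # 5.5 397.2
print(round(100 * frac_bad, 1))                 # 1.4 (% mislabeled after filtering)
print(round(pct_decrease, 2))                   # 93.17 (close to the tabulated 93.15)
```

The small gap between 93.17 and the tabulated 93.15 arises because the table averages the decrease over simulation runs rather than computing it from the averaged sensitivity and specificity.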
4. Proteomics case study
To illustrate the applicability of our algorithm on a real data set, we applied it to a large observational study consisting of measured peptide intensities from 120 different serum samples collected from children with sickle cell disease. This study included additional runs (injections) to experimentally assess whether each peptide was quantified accurately and, for five specific proteins, whether the peptides were matched to the correct protein. A complete description of the methodology used to generate this verification data and to decide whether the protein label was correct for the peptides from these five proteins can be found in Key [7].
Briefly, the so-called ‘spike-in subset’ consists of eight aliquots from a separate study for which protein concentration measurements were available: five in which a single protein was spiked in at high intensity, plus three controls, as seen in Table 6. Correctly identified peptides from the spiked proteins should have an artificially high intensity in the corresponding aliquot, while misidentified peptides lack this intensity ‘spike’ and thus appear ‘unspiked’. Unfortunately, this is not a ‘gold standard’ for whether every peptide accurately reflects the abundance of the protein. Commercially produced proteins used in the spike-in subset likely have less sequence and/or post-translational modification variability relative to the human population, so some correctly identified peptide sequences may not be overly abundant in the spiked samples. Conversely, because the procedures used to identify and to assess the abundance of each peptide are performed using different data [15], some peptides can be identified correctly (with support from the spike-in data set) despite having a very low signal-to-noise ratio and thus poorly reflecting the protein's abundance. Consequently, a manual determination of peptide accuracy (incorporating both abundance and labeling) was made based on heat maps of each of the five spiked proteins. This created a pair of decision rules that could be used as an objective (albeit imperfect) set of benchmarks for evaluating the algorithm. In Key [7], they are further used to compare the performance of the proposed algorithm against other available algorithms.
Table 6.
The original quantity of the five spike-in proteins in 25 μl of each sample in the spike-in subset, as measured by nephelometry, as well as the quantity of protein added to generate the spiked sample.
| | Pre-depletion (μg/25 μl) | | | |
|---|---|---|---|---|
| Symbol | s1 | s2 | s3 | Spike quantity (μg/25 μl) |
| A1AG1 | 5.53 | 21.18 | 29.50 | 125 |
| APOA1 | 20.23 | 30.00 | 41.25 | 100 |
| APOB | 9.30 | 10.50 | 24.75 | 167 |
| CERU | 3.00 | 7.13 | 6.03 | 38 |
| HEMO | 9.48 | 20.85 | 22.38 | 68 |
Proteins were spiked into sample s1.
4.1. Results
Results from the five proteins for which spike data are available are shown in Table 7. All five proteins showed at least a 25% reduction in the estimated FNP regardless of the benchmark against which they were compared. In addition, the specificity was at least 0.84 across all five proteins, showing that the number of poorly quantified or misidentified peptides was decreased while the algorithm generally retained peptides that carried the spike signal.
Table 7.
The estimated FNP, sensitivity, specificity, and percent change in FNP of the algorithm using the spike and manual benchmarks as stand-ins for the true peptide status.
| | Spike | | | | Manual | | | |
|---|---|---|---|---|---|---|---|---|
| Protein | FNP | Sens. | Spec. | % change | FNP | Sens. | Spec. | % change |
| HEMO | 0.05 (0.06) | 0.33 | 0.91 | −26 | 0.20 (0.29) | 0.36 | 1.00 | −28 |
| A1AG1 | 0.11 (0.18) | 0.50 | 0.87 | −38 | 0.17 (0.30) | 0.53 | 0.95 | −42 |
| APOA1 | 0.07 (0.18) | 0.67 | 0.96 | −61 | 0.09 (0.20) | 0.60 | 0.96 | −53 |
| APOB | 0.09 (0.14) | 0.53 | 0.84 | −40 | 0.13 (0.25) | 0.59 | 0.91 | −48 |
| CERU | 0.09 (0.17) | 0.59 | 0.90 | −50 | 0.23 (0.32) | 0.42 | 0.92 | −28 |
For reference, the estimated FNP without any filtering is shown in parentheses.
Differences between the two benchmarks were primarily attributed to their different focus. The decision in the spike benchmark is completely independent of the quantitative data used by the filtering algorithm, but it cannot account for peptides with a low signal-to-noise ratio. Consequently, the spike benchmark tends to assign the ‘truth’ status of more peptides as ‘good’ relative to the manual method, resulting in a lower FNP prior to any filtering.
Figure 8 shows an example of the proteomic profiles from a single protein. Although all 187 peptides were marked as originating from ceruloplasmin (CERU), approximately 6–8 different abundance patterns can clearly be discerned visually from the heat map, not including the ‘noisy’ cluster at the top of the figure marked in pink in the dendrogram. Even so, the spike data generally supports the conclusion that most (if not all) of these peptides truly originated from a single protein source. Only three peptides lack the appropriate spike signal, and while a visual analysis of the heat map supports these being true misidentifications, their correlation with the observed data suggests that the cost of incorrectly retaining them is likely to be low. Among the ‘clean’ clusters, the algorithm removes one small cluster with a substantially distinct abundance profile relative to the more dominant patterns, plus four other peptides from the top (slightly noisier) cluster.
Figure 8.
Heat map showing the relative abundance data of ceruloplasmin (CERU) for each patient (column) and peptide (row). Peptides identified incorrectly according to the spike and manual benchmarks are marked as open circles and in grey, respectively. Using our algorithm, values of the test statistic above the critical value are predicted to be correct, while values below it are predicted to be incorrect. The spike produced an inconclusive result for one peptide, shown as a ×.
The top ‘noisy’ cluster can be further split visually into two groups at the dashed gray line. Above this line, the spike signal is present for only four peptides, and no abundance patterns are visible across the columns. The algorithm retains only a single one of these peptides. Below the dashed line, the spike-in data suggest that the peptides were generally identified correctly but that their abundances are noisy. Visually, the left side of the heat map appears redder than the right, similar to the behavior of some of the cleaner subsets. These peptides are generally retained.
5. Discussion
In this paper, we have presented a testing procedure for identifying incorrectly labeled instances when two or more classes are present. Our non-parametric approach to the problem of filtering incorrect labels requires very few assumptions, yields a very high specificity, and can be implemented easily and efficiently using standard statistical software. We demonstrated its applicability and effectiveness using real proteomic data observed in children with sickle cell disease, with spike-in proteins providing a contrasting control on protein labels. Further insight into the properties of our procedure was provided by a simulation study.
As demonstrated in the simulation study, our testing procedure generally has a high specificity and low FNP, especially when the number of measurements (n) on each instance is large. Decreasing the value of n yields a more conservative test (i.e. one which is less likely to reject the null hypothesis in (15) and remove instances), since each distance is measured less precisely. Even for extremely small values of n, it was still possible to reduce the FNP and maintain a specificity over 80% for all classes.
For a fixed n, classes with fewer instances had a higher FNP and specificity relative to larger classes. Such behavior is relatively unique compared to ‘classical’ classification algorithms: many of the existing classification procedures in the literature have difficulties when the number of instances varies substantially across classes [16], and this difficulty extends to procedures that use these classification approaches to detect mislabeled instances [7]. Our testing procedure avoids this problem and maintains the integrity of small classes by analyzing each class using a ‘one-vs-all’ strategy that is most conservative for small classes.
On the other hand, when the number of instances is extremely low, the non-parametric estimates become unreliable based on our simulation studies. In the most extreme cases, the available data are insufficient to ever reject the null hypothesis even if a reliable estimate of τ could be found. For example, in LC-MS/MS proteomics, extremely small proteins often consist entirely of inaccurate or mislabeled peptides and make up a substantial proportion of the reported proteins. This will be addressed in subsequent work using a complementary procedure in which instances are retained only if the null hypothesis is rejected.
The use of a Bonferroni-type procedure, aimed at protecting against removing correctly identified instances, is extremely conservative, prioritizing a high specificity at the cost of a higher FNP. Even in this conservative case, the FNP in our simulations was universally reduced across all simulated values of n, ρ, p, and class size. Less conservative FWER procedures, FDR-type procedures [2,3], or procedures seeking to explicitly control the FNP could also be considered to further reduce the FNP in the filtered data. The primary convenience of the Bonferroni procedure is its universal applicability, especially in light of the complexity of the dependency structure of the distances and, consequently, of the test statistics.
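The conservatism of the Bonferroni correction comes from dividing the overall level by the number of tests performed within the class. The following sketch illustrates the mechanics only; the instance-level p-values and the class size are hypothetical, not the paper's exact test statistic:

```python
# Sketch of a Bonferroni-style filter: each instance in a class receives a
# p-value from some instance-level test, and the instance is removed only if
# its p-value falls below alpha divided by the number of tests. The per-test
# threshold shrinks with class size, which drives the high specificity.

def bonferroni_filter(p_values, alpha=0.05):
    """Return indices of instances flagged for removal."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for a class of 10 instances.
pvals = [0.40, 0.22, 0.0009, 0.18, 0.75, 0.006, 0.31, 0.66, 0.09, 0.52]
print(bonferroni_filter(pvals))  # [2] -- only 0.0009 falls below 0.05/10 = 0.005
```

Note that 0.006 survives even though it is below the unadjusted level of 0.05, which is exactly the conservative behavior discussed above.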
Because the testing procedure estimates its parameters non-parametrically from the available data, these estimates are affected by the presence of mislabeled instances. The resulting estimates of τ and related quantities may therefore be biased when a large number of mislabeled instances are included in the class. One possible remedy is to iteratively remove a small number of instances and re-estimate τ until some stopping criterion is met. Developing such a sequential estimation procedure is left to future work.
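One possible shape for such an iterative scheme is sketched below. This is a hypothetical illustration only: `score` and `estimate_tau` are stand-ins for the paper's non-parametric estimates, and the stopping criterion is a simple threshold comparison:

```python
# Hypothetical iterative re-estimation loop: remove the single worst-scoring
# instance, re-estimate the threshold tau from the remaining data, and stop
# once no instance exceeds it.

def iterative_filter(instances, score, estimate_tau, max_iter=100):
    kept = list(instances)
    for _ in range(max_iter):
        tau = estimate_tau(kept)
        scores = [score(x, kept) for x in kept]
        worst = max(range(len(kept)), key=lambda i: scores[i])
        if scores[worst] <= tau:      # stopping criterion met
            break
        del kept[worst]               # remove one instance, then re-estimate
    return kept

# Toy example: scalar "instances", score = distance from the kept median,
# and a fixed threshold in place of a data-driven estimate of tau.
def score(x, kept):
    s = sorted(kept)
    return abs(x - s[len(s) // 2])

def estimate_tau(kept):
    return 3.0

print(iterative_filter([1.0, 1.2, 0.9, 1.1, 9.0], score, estimate_tau))
# [1.0, 1.2, 0.9, 1.1] -- the outlying 9.0 is removed in the first pass
```

Re-estimating after each removal lets the threshold recover from the bias that the mislabeled instances themselves induce.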
Although we have used Pearson's correlation as a distance measure in Sections 3 and 4, the procedure is generally applicable whenever the observed data for each instance can be effectively combined into a ‘measure of distance’. The use of different distance measures is possible, as illustrated in Section 2.4 with the Euclidean distance.
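As an illustration, both kinds of distance can be computed from the same pair of instance profiles. This sketch uses the common 1 − r correlation-distance convention, which may differ from the paper's exact transformation of Pearson's r:

```python
import math

# Two distance measures between instance profiles (each a list of n
# measurements across samples).

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson's r: small when the abundance *patterns* agree."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]                       # same pattern, double the scale
print(round(correlation_distance(x, y), 6))    # 0.0 -- identical pattern
print(round(euclidean(x, y), 3))               # 5.477 -- magnitudes differ
```

The contrast shows why the choice matters: correlation distance ignores overall abundance level and judges only the shape of the profile, while Euclidean distance penalizes differences in magnitude.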
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Anderson L. and Hunter C.L., Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins, Mol. Cell. Proteom. 5 (2006), pp. 573–588. doi: 10.1074/mcp.M500331-MCP200
- 2. Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. Roy. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300.
- 3. Benjamini Y. and Yekutieli D., The control of the false discovery rate in multiple testing under dependency, Ann. Stat. 29 (2001), pp. 1165–1188. doi: 10.1214/aos/1013699998
- 4. Esary J.D., Proschan F., and Walkup D.W., Association of random variables, with applications, Ann. Math. Stat. 38 (1967), pp. 1466–1474. doi: 10.1214/aoms/1177698701
- 5. Forshed J., Johansson H.J., Pernemalm M., Branca R.M.M., Sofi Sandberg A., and Lehtiö J., Enhanced information output from shotgun proteomics data by protein quantification and peptide quality control (PQPQ), Mol. Cell. Proteom. 10 (2011), pp. 1–9. doi: 10.1074/mcp.M111.010264
- 6. Frénay B. and Verleysen M., Classification in the presence of label noise: A survey, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), pp. 845–869. doi: 10.1109/TNNLS.2013.2292894
- 7. Key M.C., ClassCleaner: A quantitative method for validating peptide identification in LC-MS/MS workflows, Ph.D. thesis, Indiana University, Indianapolis, 2020.
- 8. Lucas J.E., Thompson J.W., Dubois L.G., McCarthy J., Tillmann H., Thompson A., Shire N., Hendrickson R., Dieguez F., Goldman P., Schwarz K., Patel K., McHutchison J., and Moseley M.A., Metaprotein expression modeling for label-free quantitative proteomics, BMC Bioinform. 13 (2012), p. 74. doi: 10.1186/1471-2105-13-74
- 9. Morales P., Luengo J., Garcia L.P.F., Lorena A.C., de Carvalho A.C.P.L.F., and Herrera F., NoiseFiltersR: Label noise filters for data preprocessing in classification, 2016. Available at https://cran.r-project.org/package=NoiseFiltersR
- 10. Polpitiya A.D., Jun Qian W., Jaitly N., Petyuk V.A., Adkins J.N., Camp D.G., Anderson G.A., and Smith R.D., DAnTE: A statistical tool for quantitative analysis of proteomics data, Bioinformatics 24 (2008), pp. 1556–1558. doi: 10.1093/bioinformatics/btn217
- 11. Quinlan J.R., Induction of decision trees, Mach. Learn. 1 (1986), pp. 81–106. doi: 10.1007/bf00116251
- 12. Sáez J.A., Galar M., Luengo J., and Herrera F., Analyzing the presence of noise in multi-class problems: Alleviating its influence with the one-vs-one decomposition, Knowl. Inf. Syst. 38 (2014), pp. 179–206. doi: 10.1007/s10115-012-0570-1
- 13. Schader M. and Schmid F., Two rules of thumb for the approximation of the binomial distribution by the normal distribution, Am. Stat. 43 (1989), pp. 23–24.
- 14. Silva J.C., Gorenstein M.V., Zhong Li G., Vissers J.P.C., and Geromanos S.J., Absolute quantification of proteins by LCMSE: A virtue of parallel MS acquisition, Mol. Cell. Proteom. 5 (2006), pp. 144–156. doi: 10.1074/mcp.M500230-MCP200
- 15. Steen H. and Mann M., The ABC's (and XYZ's) of peptide sequencing, Nat. Rev. Mol. Cell. Biol. 5 (2004), pp. 699–711. doi: 10.1038/nrm1468
- 16. Sun Y., C Wong A.K., and Kamel M.S., Classification of imbalanced data: A review, Int. J. Pattern Recognit. Artif. Intell. 23 (2009), pp. 687–719. doi: 10.1142/S0218001409007326
- 17. Suomi T., Corthals G.L., Nevalainen O.S., and Elo L.L., Using peptide-level proteomics data for detecting differentially expressed proteins, J. Proteome Res. 14 (2015), pp. 4564–4570. doi: 10.1021/acs.jproteome.5b00363
- 18. Webb-Robertson B.-J.M., Matzke M.M., Datta S., Payne S.H., Kang J., Bramer L.M., Nicora C.D., Shukla A.K., Metz T.O., Rodland K.D., Smith R.D., Tardiff M.F., McDermott J.E., Pounds J.G., and Waters K.M., Bayesian proteoform modeling improves protein quantification of global proteomic measurements, Mol. Cell. Proteom. 13 (2014), pp. 3639–3646. doi: 10.1074/mcp.M113.030932
- 19. Zhu X. and Wu X., Class noise vs. attribute noise: A quantitative study, Artif. Intell. Rev. 22 (2004), pp. 177–210. doi: 10.1007/s10462-004-0751-8