Journal of Applied Statistics. 2022 Feb 24;50(8):1725–1749. doi: 10.1080/02664763.2022.2038546

A statistical testing procedure for validating class labels

Melissa C. Key, Susanne Ragg, Benzion Boukai
PMCID: PMC10228321  PMID: 37260475

Abstract

Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class (protein) labels using available measurements across N instances (peptides). More generally, we present a non-parametric solution to the problem of identifying instances that are deemed as outliers relative to the subset of instances assigned to the same class. The primary assumption is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. We show that the overall type I error probability across all instances within a class can be controlled by some fixed value (say α). We also demonstrate conditions where similar results on type II error probability hold. The theoretical results are supplemented by an extensive numerical study illustrating the applicability and viability of our method. Even with up to 25% of instances initially mislabeled, our testing procedure maintains a high specificity and greatly reduces the proportion of mislabeled instances. The applicability and effectiveness of our testing procedure is further illustrated by a detailed example on a proteomics data set from children with sickle cell disease where five spike-in proteins acted as contrasting controls.

Keywords: Non-parametric, hypothesis testing, machine learning, classification, proteomics

1. Introduction

The research presented in this paper is motivated by an open problem in the quantification of proteins in a label-free shotgun proteomics work-flow.

In label-free shotgun proteomics, the experimental units of interest (proteins) are not measured directly but are represented by measurements on one to 500+ enzymatically cleaved pieces known as peptides. The amino acid sequences composing each peptide are not known a priori, but are inferred based on algorithmic procedures acting on spectrum data from the mass spectrometer. By keeping ‘correctly’ labeled peptides (instances) and removing inaccurately labeled peptides from each protein (class), subsequent quantitative analyses which assume that all measurements are equally representative of the protein are thus more accurate and powerful. Our proposed testing procedure is designed to accomplish this validation task. More specifically, we present a non-parametric solution to the problem of identifying instances that are outliers relative to the subset of instances assigned to the same class. This serves as a proxy for finding errors in the data set: instances for which the class label is recorded incorrectly, or where the quantitative data for a particular instance are sufficiently noisy as to render them uninformative for the purposes of summarizing the class.

Within the field of proteomics, inconsistencies across peptide measurements (including those caused by misidentified peptides, inaccurate quantification, and peptide sequences originating from multiple proteins) can obfuscate patterns in protein abundance measurements across samples (or subjects), making it more difficult to correctly identify proteins that vary across the groups of interest. While a more stringent criterion on the false discovery rate (FDR) [2] of the original identification algorithm is likely to reduce the proportion of misidentified peptides, it will inevitably also remove correct matches due to the inherent trade-off between the FDR and the false non-discovery rate (FNR). While some proteomics work-flows have stressed identification accuracy over quantity (e.g. multiple reaction monitoring, [1]), shotgun proteomics, in particular, has traditionally focused on attempting to identify as many peptides as possible, with robust summarization methods used to address the prevailing inconsistencies across the peptides [10,14,17].

More recently, alternative methods that explicitly address identification errors and other inconsistencies between the patterns in abundance measurements of peptides from the same protein have also been proposed. Unfortunately, however, none of these algorithms has seen much use in practice. Protein Quantification by Peptide Quality Control (PQPQ) [5] is an ad-hoc method that finds a set of ‘representative’ peptides from each protein. Additional peptides from these proteins are only retained if they have similar characteristics to one of the ‘representative’ peptides. The Bayesian proteoform method proposed by Webb-Robertson et al. [18] is primarily focused on grouping peptides based on their pattern of positive and negative differences across comparison groups (e.g. disease/healthy or treatments). This focus on comparison groups can create heterogeneous peptide clusters that meet the single criterion required, but are otherwise dissimilar. Lucas et al. [8] also use Bayesian methods (within the Gaussian framework), but focus on creating so-called ‘meta-proteins’, consisting of peptides with similar abundance patterns across multiple proteins. In contrast, our proposed filtering algorithm requires no parametric assumptions on the distribution of abundances for each protein as it is entirely non-parametric. It is specifically designed to identify a subset of validated peptides that can be used for any subsequent analysis, including tests of differential abundance, classification and clustering.

Outside of proteomics, a similar problem has been studied within the field of classification. Classification models trained on data with labeling errors tend to be more complex and less accurate than models trained on data without labeling errors [11,12,19]. As described in Frénay and Verleysen [6], multiple algorithms have been developed to improve classification algorithms in the presence of labeling errors. Similar to the approaches utilized in proteomics, these have been grouped into three categories: robust methods, explicit modeling methods, and filtering methods. In particular, these filtering methods can also be used to identify (and remove) mislabeled peptides from proteins.

By filtering method, we refer specifically to an algorithm designed to determine whether each instance (e.g. peptide) should be retained in its original class (e.g. protein), or whether it should be removed from that class. This inherently assumes that every instance has an original class label; these algorithms do not provide an initial determination of the class label. In the case of proteomics, these original class labels are likely to originate from identification algorithms, or modifications thereof. For example, proteins with high sequence overlap can be combined into a single ‘protein group’ and analyzed together, or split into subsets depending on the memberships of the different peptides.

In many respects, both the proteomics and classification filtering algorithms function by separating out ‘good’ instances from ‘bad’ instances. However, there is an important distinction between the problem of interest within proteomics and that addressed by the classification algorithms. In the classification problem, the aim is an improved ability to classify new instances: the accuracy of the resulting training set is secondary to the overall improved performance of the resulting classification procedure. On the other hand, the overall accuracy of the filtered proteomics data set is of paramount interest in the proteomics problem, and there is no need to classify new peptides.

In this paper, we present our proposed non-parametric filtering/testing procedure and demonstrate its properties. As will be seen below, it is capable of validating the initial labeling of large proteins (40+ peptides) with high accuracy, even in the presence of heterogeneity across peptides within a protein or when up to 25% of peptides are mislabeled. A detailed comparison of our proposed procedure against several alternative algorithms from both the proteomic and classification literature can be found in Key [7]. This comparison includes proteomics-focused methods from Forshed et al. [5] and Webb-Robertson et al. [18], and other classification-focused algorithms implemented in the NoiseFiltersR package [9]. For the sake of space, we do not provide the details of this comparison here, but it will be presented and discussed in detail in a subsequent paper.

The paper is organized as follows. Section 2 presents in a non-parametric setting the theoretical basis for the algorithm as well as the testing procedure. Section 3 uses a simulation study to demonstrate the effectiveness of the proposed algorithm. Section 4 illustrates the application of the algorithm to a real proteomics data set (sickle cell disease in children) where spiked proteins provided an experimental control on protein identification. Section 5 wraps up the paper with a discussion and concluding remarks.

2. Methodology

2.1. The basic setup

Consider a data set of N instances, of which $N_1, N_2, \ldots, N_{K-1}$ are presumed to belong to classes $C_1, C_2, \ldots, C_{K-1}$, respectively, with $N_k > 1$ for $k = 1, \ldots, K-1$. Let $C_K$ be a ‘mega-class’ consisting of the $N_K$ instances unassigned to a specific class or instances which are the sole representative of a class, so that $N_K = N - \sum_{k=1}^{K-1} N_k$. For simplicity, we use the shorthand notation $i \in C_k$ to indicate an instance i which belongs to class $C_k$, while $i \notin C_k$ indicates an instance which does not. Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ be the vector of observations from instance i ($i = 1, \ldots, N$) across n (independent) samples. The available data is thus $X = (x_1, x_2, \ldots, x_N)$, an $n \times N$ matrix of such quantitative observations (in the case of shotgun proteomics, this corresponds to the observed relative quantity of each peptide). The ‘distance’ between any two instances with observations $x_i$ and $x_j$ can thus be measured using any standard distance or quasi-distance function,

$d_{ij} = \mathrm{dist}(x_i, x_j) \ge 0.$ (1)

For instance, $d_{ij}$ could be a measure of the dissimilarity between peptides over the n samples in the study. In this case, one popular quasi-distance function is defined by the correlation between $x_i$ and $x_j$, $d_{ij} = 1 - r_{ij}$, where

$r_{ij} := \mathrm{cor}(x_i, x_j) = \frac{\sum_{\ell=1}^{n} (x_{i\ell} - \bar{x}_i)(x_{j\ell} - \bar{x}_j)}{\sqrt{\sum_{\ell=1}^{n} (x_{i\ell} - \bar{x}_i)^2} \sqrt{\sum_{\ell=1}^{n} (x_{j\ell} - \bar{x}_j)^2}}.$ (2)

Here, $\bar{x}_i = \sum_{\ell=1}^{n} x_{i\ell} / n$, for each $i = 1, \ldots, N$.
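As a concrete illustration, the correlation quasi-distance of (1)–(2) can be computed with a few lines of standard-library Python; the function name below is ours, not part of any proteomics toolkit.

```python
import math

def correlation_distance(x, y):
    """Quasi-distance d = 1 - r, where r is the Pearson correlation (Eq. (2))
    between two equal-length observation vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return 1.0 - num / den

# Perfectly correlated vectors have distance ~0; anti-correlated vectors, ~2,
# so the quasi-distance ranges over [0, 2].
d_same = correlation_distance([1, 2, 3], [2, 4, 6])   # ~0
d_opp = correlation_distance([1, 2, 3], [3, 2, 1])    # ~2
```

Note that $d_{ij} = 1 - r_{ij}$ takes values in $[0, 2]$ and is only a quasi-distance: it violates the triangle inequality, which the procedure below does not require.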

Let $D = \{d_{ij} : i, j = 1, \ldots, N\}$ be the $N \times N$ (symmetric) matrix comprised of these between-instance observed distances. Without loss of generality, we assume that the entries of D are ordered such that the first $N_1$ entries belong to $C_1$, the next $N_2$ entries belong to $C_2$, etc. Accordingly, we partition D as

$D = \begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1K} \\ D_{21} & D_{22} & \cdots & D_{2K} \\ \vdots & & \ddots & \vdots \\ D_{K1} & D_{K2} & \cdots & D_{KK} \end{bmatrix},$ (3)

where $D_{kk}$ is the $N_k \times N_k$ matrix of between-instance distances within class $C_k$ and the elements of $D_{k_1 k_2}$ represent the distances between the $N_{k_1}$ instances belonging to $C_{k_1}$ and the $N_{k_2}$ instances belonging to $C_{k_2}$. Note, in particular, that $D_{k_1 k_2} \equiv D_{k_2 k_1}^{\top}$.

To begin with, consider at first class $C_1$ and the $N_1$ instances initially assigned to it. For a fixed i, $i = 1, \ldots, N_1$, let $i \in C_1$ be a given instance in class $C_1$ and set $d_i := (d_i^{(1)}, d_i^{(2)}, \ldots, d_i^{(K)})$, the ith row of D, where $d_i^{(1)} := (d_{i1}, \ldots, d_{iN_1})$ and, for $k \ge 2$, $d_i^{(k)} := (d_{i(1 + \sum_{j=1}^{k-1} N_j)}, \ldots, d_{i(\sum_{j=1}^{k} N_j)})$. Clearly, $d_i^{(k)}$ is the ith row of $D_{1k}$ for $k = 1, \ldots, K$. For the class $C_1$, we consider the stochastic modeling of the elements of $D_{11}, D_{12}, \ldots, D_{1K}$. We assume that for each instance $i \in C_1$, the observed within-class distances from it to the other $N_1 - 1$ instances in $C_1$ are i.i.d. random variables according to a class-specific distribution, $G(\cdot)$, so that

$(d_{i1}, \ldots, d_{i(i-1)}, d_{i(i+1)}, \ldots, d_{iN_1}) \overset{\text{i.i.d.}}{\sim} G_i(\cdot),$ (4)

(since $d_{ii} \equiv 0$). Here, the c.d.f.s $G_i(\cdot)$ are defined for each fixed $i \in C_1$, and any $j \in C_1$, as

$G_i(t) \equiv G(t) := \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i) \quad \forall t \in \mathbb{R}.$ (5)

Our notation in (5) stresses that we are assuming that the distribution of distances from each individual instance to the remaining instances in $C_1$ is identical. Similarly, the distances between the given instance, $i \in C_1$, and the $N_k$ instances in class $C_k$, $k = 2, \ldots, K$, are also i.i.d. random variables according to some distribution $F^{(k)}(\cdot)$, so that

$(d_{ij}) \overset{\text{i.i.d.}}{\sim} F^{(k)}(\cdot) \quad \forall j \in C_k,\ (j = 1, \ldots, N_k).$ (6)

Here, $F^{(2)}, F^{(3)}, \ldots, F^{(K)}$ are K−1 distinct c.d.f.s defined for each $i \in C_1$, and any $j \in C_k$, as

$F^{(k)}(t) = \Pr(d_{ij} \le t \mid i \in C_1, j \in C_k) \quad \forall t \in \mathbb{R}.$ (7)

We assume throughout this work that $G(\cdot), F^{(2)}(\cdot), \ldots, F^{(K)}(\cdot)$ are continuous distributions with p.d.f.s $g(\cdot), f^{(2)}(\cdot), \ldots, f^{(K)}(\cdot)$, respectively. If we further assume that all the $N_1$ instances from class $C_1$ are equally representative of the true measurement across all n samples, we would expect the distances between any two measurements from within class $C_1$ to be stochastically smaller than distances between measurements from within $C_1$ and instances associated with class $C_k$, where $k \ne 1$. Accordingly, we have the following assumption.

Assumption 2.1 Stochastic ordering —

For each k, $k = 2, \ldots, K$,

$\Pr(d_{ij} \le t \mid i \in C_1, j \in C_k) \le \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i)$ (8)

or equivalently,

$F^{(k)}(t) \le G(t) \quad \forall t \in \mathbb{R}.$

In light of (7), the distribution of distances from the ith instance in $C_1$ to any other random instance J, selected uniformly from among the $N - N_1$ instances not in $C_1$, is thus the mixture,

$\bar{F}(t) := \Pr(d_{iJ} \le t \mid i \in C_1, J \notin C_1) = \frac{1}{N - N_1} \sum_{k=2}^{K} N_k F^{(k)}(t),$ (9)

where we have taken $\Pr(J \in C_k \mid J \notin C_1) := N_k / (N - N_1)$. Further, if I is a randomly selected instance in class $C_1$, selected with probability $\Pr(I = i \mid I \in C_1) = 1/N_1$, for $i = 1, 2, \ldots, N_1$, then it follows that

$\Pr(d_{IJ} \le t \mid I \in C_1, J \notin C_1) = \sum_{i=1}^{N_1} \frac{1}{N_1} \Pr(d_{iJ} \le t \mid i \in C_1, J \notin C_1) = \sum_{i=1}^{N_1} \frac{1}{N_1} \bar{F}(t) \equiv \bar{F}(t).$ (10)

Similarly, if I and J represent two distinct instances, both randomly selected from $C_1$ with $\Pr(I = i, J = j \mid I \in C_1, J \in C_1) = 1/[N_1 (N_1 - 1)]$, then

$\bar{G}(t) := \Pr(d_{IJ} \le t \mid I \in C_1, J \in C_1, J \ne I) = \sum_{i=1}^{N_1} \sum_{j=1, j \ne i}^{N_1} \frac{1}{N_1 (N_1 - 1)} \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i) = \sum_{i=1}^{N_1} \sum_{j=1, j \ne i}^{N_1} \frac{1}{N_1 (N_1 - 1)} G_i(t) \equiv G(t).$ (11)

It follows by Assumption 2.1 that

$\bar{F}(t) \le \bar{G}(t) \quad \forall t \in \mathbb{R}.$ (12)

Now, for any $t \in \mathbb{R}$ define

$\psi(t) := \bar{G}^{-1}(1 - \bar{F}(t)).$ (13)

It can be easily verified that the function $h(t) := \psi(t) - t$ has a unique root, $t^*$, such that $t^* = \psi(t^*)$ and

$\bar{G}(t^*) = 1 - \bar{F}(t^*) := \tau.$ (14)

As we will see below, the value of $t^*$ serves as a cut-off point to differentiate between the distribution governing distances between instances in class $C_1$ and the distribution of distances going from instances in $C_1$ to all remaining instances. To allow our procedure to effectively discern between these two distributions, we have to restrict the extent of the potential ‘overlap’ between them to less than 50% of their respective areas (see Figure 4 in Section 3.2.2 below for an illustration of this point). Accordingly, and in light of Assumption 2.1 and (12), we require that this cut-off point $t^*$ between these two distributions be strictly greater than the median of $\bar{G}$, namely,

Figure 4.

A histogram of all distances within $C_1$ (red/left) and between instances in $C_1$ and those in $C_2$ (blue/right) for p = 0, $N_1 = 500$, n = 100, and $\rho_{12} = \rho_2 = 0.2$. The lines are generated from a normal distribution with mean and standard deviation matched to the data.

Assumption 2.2 Restricted Overlap —

We assume that $t^* > \bar{G}^{-1}(1/2)$ in (14), so that $\tau = \bar{G}(t^*) > 0.5$.

2.2. The testing procedure

2.2.1. Constructing the test

Consider at first class C1 and the N1 instances initially assigned to it. Based on the available data, we are interested in constructing a testing procedure for determining whether or not a given instance that was assigned to class C1 should be retained or be removed from it (and potentially be reassigned to a different class). That is, for each selected instance from the list i{1,2,,N1} of instances labeled C1, we consider the statistical test of the hypothesis

$H_0^{(i)}: i \in C_1 \quad (\text{the initial label is correct})$ (15)

against

$H_1^{(i)}: i \notin C_1 \quad (\text{the initial label is incorrect}),$ (16)

for $i = 1, \ldots, N_1$. The final result of these successive $N_1$ hypothesis tests is the set of all those instances in $C_1$ for which $H_0^{(i)}: i \in C_1$ was rejected, thus providing the set of those instances in $C_1$ which were deemed to have been mislabeled. As we will see below, the successive testing procedure we propose is constructed so as to control the maximal probability of a type I error, while minimizing the probability of a type II error.

Towards that end, define

$Z_i \equiv \sum_{j \in C_1, j \ne i} I[d_{ij} \le t^*] = \sum_{j=1, j \ne i}^{N_1} I[d_{ij} \le t^*]$ (17)

for each $i \in \{1, 2, \ldots, N_1\}$, where $t^*$ is defined by (14) and $I[A]$ is the indicator function of the set A. In light of the relation (12), $Z_i$ will serve as a test statistic for the above hypotheses. The distribution of $Z_i$ under both the null and alternative hypotheses can be explicitly described via Binomial random variables, as is presented in the following lemma (proof omitted).

Lemma 2.1

Let $Z_i$ be as defined in (17) above with $i = 1, 2, \ldots, N_1$; then

  1. if $i \in C_1$, we have $Z_i \mid H_0^{(i)} \sim \mathrm{Bin}(N_1 - 1, \bar{G}(t^*)) \equiv \mathrm{Bin}(N_1 - 1, \tau)$;

  2. if $i \notin C_1$, we have $Z_i \mid H_1^{(i)} \sim \mathrm{Bin}(N_1 - 1, \bar{F}(t^*)) \equiv \mathrm{Bin}(N_1 - 1, 1 - \tau)$.

Accordingly, the statistical test we propose will reject the null hypothesis $H_0^{(i)}: i \in C_1$ in favor of $H_1^{(i)}: i \notin C_1$ for small values of $Z_i$, say if $Z_i \le a_\alpha$ for some suitable critical value $a_\alpha$ (to be explicitly determined below) which should satisfy,

$\hat{\alpha} := \Pr(Z_i \le a_\alpha \mid H_0^{(i)}) = \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha,$ (18)

for each $i = 1, \ldots, N_1$ and some fixed (and small) statistical error level $\alpha \in (0, 0.5)$. The constant $a_\alpha$ is the (appropriately calculated) $\alpha$th percentile of the $\mathrm{Bin}(N_1 - 1, \tau)$ distribution. That is, if $b(k, n, p) := \sum_{j=0}^{k} \binom{n}{j} p^j (1 - p)^{n - j}$, $k = 0, \ldots, n$, denotes the c.d.f. of a $\mathrm{Bin}(n, p)$ distribution, then for given $\alpha$ and $\tau$, the value $a_\alpha$ is determined so that

$a_\alpha = \operatorname*{arg\,max}_{k = 0, \ldots, N_1 - 1} \{b(k, N_1 - 1, \tau) \le \alpha\}.$ (19)
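The critical value in (19) is simply the largest k whose $\mathrm{Bin}(N_1 - 1, \tau)$ c.d.f. does not exceed $\alpha$. A minimal stdlib sketch (the function names are ours, not the authors' code):

```python
from math import comb

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def critical_value(n1, tau, alpha):
    """Largest k in {0, ..., N1 - 1} with b(k, N1 - 1, tau) <= alpha (Eq. (19)).
    Returns -1 if even k = 0 exceeds alpha (the test then never rejects)."""
    a = -1
    for k in range(n1):                       # k = 0, ..., N1 - 1
        if binom_cdf(k, n1 - 1, tau) <= alpha:
            a = k
        else:
            break                             # the c.d.f. is nondecreasing in k
    return a
```

For example, with $N_1 = 50$, $\tau = 0.8$, and $\alpha = \alpha_0 / N_1 = 0.05/50$, `critical_value(50, 0.8, 0.001)` returns the largest k whose $\mathrm{Bin}(49, 0.8)$ c.d.f. stays at or below 0.001.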

The final result of this repeated testing procedure is given by the set of all instances in C1 for which H0(i):iC1 was rejected,

$R_\alpha := \{i;\ i = 1, \ldots, N_1 : Z_i \le a_\alpha\},$

providing the set of those instances in $C_1$ for which the binomial threshold is achieved and which therefore have been deemed mislabeled. Similarly,

$A_\alpha := \{i;\ i = 1, \ldots, N_1 : Z_i > a_\alpha\} = \{1, \ldots, N_1\} \setminus R_\alpha,$

provides the set of instances correctly identified in C1. It remains only to determine the optimal value of aα for the test.

2.2.2. Controlling the procedure's type I error

With H0(i) and H1(i) as are given in (15) and (16), let

$H_0 = \bigcap_{i=1}^{N_1} H_0^{(i)} \quad \text{and} \quad \tilde{H}_1 = \bigcup_{i=1}^{N_1} H_1^{(i)}.$ (20)

The hypothesis $H_0$ above states that all the instances in $C_1$ are correctly identified, whereas $\tilde{H}_1$ is the hypothesis that at least one of the instances in $C_1$ is misidentified. We denote by $R = |R_\alpha|$ the cardinality of the set $R_\alpha$,

$R = \sum_{i=1}^{N_1} I[Z_i \le a_\alpha]$

so that R is a random variable taking values over $\{0, 1, \ldots, N_1\}$. Note trivially that $N_1 - R \equiv |A_\alpha|$.

We consider the ‘global’ test which rejects $H_0$ in (20) if for at least one i, $i = 1, \ldots, N_1$, $Z_i \le a_\alpha$, or equivalently, if $\{R > 0\}$. The probability of a type I error associated with this ‘global’ test is therefore

$\alpha^* := \Pr(R > 0 \mid H_0) = \Pr\left(\bigcup_{i=1}^{N_1} \{Z_i \le a_\alpha\} \mid H_0\right) \le \sum_{i=1}^{N_1} \Pr(Z_i \le a_\alpha \mid H_0^{(i)}) = N_1 \hat{\alpha} \le N_1 \alpha,$ (21)

using the Bonferroni inequality, since by (18)–(19), $\hat{\alpha} \le \alpha$. The value of $\alpha^*$ can be controlled by taking $\alpha = \alpha_0 / N_1$ in (19) for some $\alpha_0$, to ensure that $\alpha^* \le \alpha_0$ and that $\hat{\alpha} \le \alpha_0 / N_1$.

Note that if $\{Z_1, Z_2, \ldots, Z_{N_1}\}$ were independent or associated random variables [4], then under $H_0$, $I[Z_i \le a_\alpha] \sim \mathrm{Bin}(1, \hat{\alpha})$, $i = 1, 2, \ldots, N_1$, and $R \sim \mathrm{Bin}(N_1, \hat{\alpha})$. In this case,

$\Pr(R = 0 \mid H_0) = (1 - \hat{\alpha})^{N_1} \ge \left(1 - \frac{\alpha_0}{N_1}\right)^{N_1}.$

It follows for sufficiently large $N_1$ (as $N_1 \to \infty$) that

$\alpha^* = 1 - \Pr(R = 0 \mid H_0) \le 1 - e^{-\alpha_0} < \alpha_0.$ (22)

By Lemma 2.1 (b), the distribution of the test statistic $Z_i$ under the alternative hypothesis, $H_1^{(i)}$, is readily available, so that (similarly to (18)) the probability of the type II error of the test of (15)–(16) can be explicitly calculated as

$\hat{\beta} := \Pr(Z_i > a_\alpha \mid H_1^{(i)}) = \Pr(Z_i > a_\alpha \mid 1 - \tau) = 1 - b(a_\alpha, N_1 - 1, 1 - \tau),$ (23)

for each $i = 1, \ldots, N_1$. Further, whenever the conditions for the normal approximation to the binomial probabilities hold (i.e. $\min((N_1 - 1)\tau, (N_1 - 1)(1 - \tau)) > 5$; see Shader and Schmid [13]), $\hat{\beta}$ in (23) can easily be evaluated as

$\hat{\beta} \approx 1 - \Phi\left(\frac{a_\alpha + 0.5 - (N_1 - 1)(1 - \tau)}{\sqrt{(N_1 - 1)\tau(1 - \tau)}}\right),$

where $\Phi$ denotes the standard normal c.d.f.
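The exact type II error in (23) and its normal approximation can be compared directly; the sketch below (our naming, stdlib only, with $\Phi$ built from `math.erf`) implements both.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def beta_exact(a_alpha, n1, tau):
    """Exact type II error (Eq. (23)): Pr(Z_i > a_alpha), Z_i ~ Bin(N1-1, 1-tau)."""
    return 1.0 - binom_cdf(a_alpha, n1 - 1, 1.0 - tau)

def beta_normal(a_alpha, n1, tau):
    """Normal approximation to (23), with the 0.5 continuity correction."""
    mu = (n1 - 1) * (1.0 - tau)
    sd = sqrt((n1 - 1) * tau * (1.0 - tau))
    z = (a_alpha + 0.5 - mu) / sd
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # 1 - Phi(z)
```

For instance, with $N_1 = 100$ and $\tau = 0.8$ the condition $\min(99\tau, 99(1-\tau)) > 5$ holds, and for a critical value around $a_\alpha = 40$ both expressions give an essentially negligible $\hat{\beta}$.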

Remark 2.1

Under certain circumstances, namely if $a_\alpha \ge (N_1 - 1)/2$, the ‘symmetry’ property of the Binomial distribution about $\tau$ and $1 - \tau$ (with $\tau > 0.5$, per Assumption 2.2) implies that $\hat{\beta} \le \alpha$ in (23). That is, whenever the critical test value $a_\alpha$, as determined by (19) with fixed $\alpha$ and $N_1$, satisfies $a_\alpha \ge a^*$, where $a^* := [(N_1 - 1)/2] + 1$, then our testing procedure has the added feature that the type I error rate $\hat{\alpha}$ (see (18)) is controlled by $\alpha = \alpha_0 / N_1$, which also serves as an upper bound that controls the type II error rate $\hat{\beta}$. Thus, in this circumstance, the probabilities of the type I error and of the type II error are simultaneously controlled by $\alpha = \alpha_0 / N_1$. We summarize this observation in Lemma 2.2. We note, however, that (as can be seen from (19)) the critical test value $a_\alpha$ is intricately dependent on $\alpha$, $N_1$ and $\tau$. Since the values of $\alpha$ and $N_1$ are both fixed by design, the only determining parameter in (19) is $\tau$ (aside from $\tau > 0.5$) to ensure that $a_\alpha \ge (N_1 - 1)/2$ and hence, by Lemma 2.2, $\hat{\beta} \le \alpha$. It can be shown that for this to hold, $\tau$ should be sufficiently far from 0.5, say $\tau \in [\tau^*, 1)$ for some $\tau^* > 0.5$, as determined by

$\tau^* := \operatorname*{arg\,max}_{0.5 < p < 1} \{b(a^*, N_1 - 1, p) \ge \alpha\}.$

However, though available, there is no need to explicitly calculate the value of this threshold $\tau^*$ in order to ascertain whether or not $\hat{\beta} \le \alpha$ in (23); merely verifying that $a_\alpha \ge (N_1 - 1)/2$ is sufficient.

Lemma 2.2

For a given $\alpha < 0.5$, consider the test of $H_0^{(i)}$ versus $H_1^{(i)}$ in (15)–(16) with a critical test value $a_\alpha$ as determined in (18)–(19). Then, if $a_\alpha \ge (N_1 - 1)/2$, we have

$\hat{\beta} = \Pr(Z_i > a_\alpha \mid 1 - \tau) \le \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha.$
Proof.

Indeed, if $a_\alpha \ge (N_1 - 1)/2$, then $N_1 - 1 - a_\alpha \le (N_1 - 1)/2 \le a_\alpha$, and hence,

$\hat{\beta} = \Pr(Z_i > a_\alpha \mid 1 - \tau) = \Pr(Z_i < N_1 - 1 - a_\alpha \mid \tau) \le \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha.$
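The symmetry step of this proof is easy to check numerically: for $Z \sim \mathrm{Bin}(n, 1 - \tau)$, $\Pr(Z > a) = \Pr(W < n - a)$ with $W := n - Z \sim \mathrm{Bin}(n, \tau)$. A quick stdlib check (the specific values of n, $\tau$, and a below are ours, chosen only for illustration):

```python
from math import comb

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Illustrative values: n plays the role of N1 - 1.
n, tau, a = 49, 0.8, 30            # a >= (N1 - 1)/2 = 24.5, as the lemma requires

# Symmetry: Pr(Z > a | 1 - tau) equals Pr(W <= n - a - 1 | tau).
beta_hat = 1.0 - binom_cdf(a, n, 1.0 - tau)   # type II error, Eq. (23)
mirror = binom_cdf(n - a - 1, n, tau)         # the mirrored tail under tau
alpha_hat = binom_cdf(a, n, tau)              # type I error bound, Eq. (18)
```

Here `beta_hat` and `mirror` agree to floating-point precision, and `beta_hat <= alpha_hat`, as Lemma 2.2 asserts.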

2.3. Estimation

We note that both $\bar{F}$ and $\bar{G}$ are generally unknown (as are $t^*$ and $\psi$), but can easily be estimated non-parametrically from the available data by their respective empirical c.d.f.s, for sufficiently large $N_1$ and $N - N_1$. For each given instance $i \in C_1$,

$\hat{G}_i(t) := \frac{1}{N_1 - 1} \sum_{j \in C_1, j \ne i} I[d_{ij} \le t], \qquad \hat{F}_i(t) := \sum_{k=2}^{K} \frac{N_k}{N - N_1} \hat{F}_i^{(k)}(t),$ (24)

where, for each k=2,3,,K,

$\hat{F}_i^{(k)}(t) := \frac{1}{N_k} \sum_{j \in C_k} I[d_{ij} \le t].$

Clearly, $\hat{G}_i(t)$ and $\hat{F}_i(t)$ are empirical c.d.f.s for estimating, based on the ith instance, $\bar{G}(t)$ and $\bar{F}(t)$, respectively. Accordingly, when combined,

$\hat{\bar{F}}(t) = \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{F}_i(t), \qquad \hat{\bar{G}}(t) = \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{G}_i(t),$ (25)

are the estimators of $\bar{F}(t)$ and $\bar{G}(t)$, respectively. Further, in similarity to (13), we set

$\hat{\psi}(t) := \hat{\bar{G}}^{-1}(1 - \hat{\bar{F}}(t)),$ (26)

and we let $\hat{t}$ denote the ‘solution’ of $\hat{\psi}(t_c) = t_c$; that is,

$\hat{t} := \inf_t \{\hat{\psi}(t) \le t\}.$ (27)

Clearly, the value of τ in (14) would be estimated by

$\hat{\tau} = \hat{\bar{G}}(\hat{t}).$ (28)

Note that, in view of (24), $N_1 \hat{\tau} \equiv \sum_{i=1}^{N_1} \hat{\tau}_i$, with $\hat{\tau}_i \equiv \hat{G}_i(\hat{t})$ for $i = 1, 2, \ldots, N_1$. With $\hat{t}$ as an estimate of $t^*$ in (14), we have that $Z_i \equiv (N_1 - 1) \hat{G}_i(\hat{t})$ and $\hat{\tau}_i \equiv Z_i / (N_1 - 1)$, and therefore an equivalent expression for $\hat{\tau}$ is

$\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{N_1} \frac{Z_i}{N_1 - 1}.$ (29)

By Lemma 2.1 (a), $E[Z_i \mid H_0^{(i)}] = (N_1 - 1)\tau$ and hence $\hat{\tau}$ in (29) is an unbiased estimator of $\tau$: $E[\hat{\tau} \mid H_0] = \tau$.
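Operationally, (24)–(28) reduce to empirical c.d.f.s of the pooled within-class and between-class distances. The sketch below is our rendering, not the authors' code; the empirical-quantile convention used for $\hat{\bar{G}}^{-1}$ is one of several reasonable choices. It scans the observed distances for the smallest t with $\hat{\psi}(t) \le t$, per (27).

```python
import bisect

def estimate_cutoff(within, between):
    """Estimate (t_hat, tau_hat) per Eqs. (26)-(28): t_hat is the smallest
    observed distance t with psi_hat(t) = G_hat^{-1}(1 - F_hat(t)) <= t,
    and tau_hat = G_hat(t_hat).
    `within`:  pooled within-class distances (estimates G-bar),
    `between`: pooled between-class distances (estimates F-bar)."""
    w = sorted(within)
    b = sorted(between)

    def G(t):                       # empirical c.d.f. of within-class distances
        return bisect.bisect_right(w, t) / len(w)

    def F(t):                       # empirical c.d.f. of between-class distances
        return bisect.bisect_right(b, t) / len(b)

    def G_inv(q):                   # empirical quantile (one common convention)
        idx = max(0, min(len(w) - 1, int(q * len(w)) - 1))
        return w[idx]

    for t in sorted(w + b):         # candidate cut-offs: all observed distances
        if G_inv(1.0 - F(t)) <= t:  # psi_hat(t) <= t, cf. Eq. (27)
            return t, G(t)
    return w[-1], 1.0               # degenerate fallback: no crossing found
```

With well-separated classes (all within-distances below all between-distances), the crossing lands at the largest within-class distance and $\hat{\tau} = 1$, matching the nearly-1 estimates seen for the separated scenario in Table 1.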

2.4. An illustration

To help visualize the procedure, consider a simulated data set consisting of two features ($x = (x_1, x_2)$) grouped into three classes of 500 observations each. These are shown as blue closed circles (the 'blue' cluster), gray Xs (the 'gray' cluster), and red open circles (the 'red' cluster). Three such scenarios are depicted in Figure 1, where each point's coordinates have been generated using a bivariate normal distribution (so that n = 2 and $N_1 = N_2 = N_3 = 500$). For the purpose of the illustration, we will focus on one specific point, labeled P in the figure, but the procedure is applied simultaneously to all points in each class, and can be iterated across all three classes. To validate whether or not the initial identification of P as belonging to the blue cluster is correct, we count how many other blue instances fall within $\hat{t}$ units of P, utilizing the Euclidean distance metric. In the two-dimensional Euclidean space, this can be represented using a circle around P, thus providing a clear visualization of the algorithm's mechanics.

Figure 1.

An illustration of how the proposed algorithm works in a simple (two-dimensional) case. The three scenarios vary by the amount of overlap between the different clusters, which impacts the values of $t^*$ and $\tau$, and the subsequent estimate of each.

The value of $\hat{t}$ (obtained from (27)) is based on both the distribution of within-class distances, G, and between-class distances, F, as estimated by (25). It can be verified that in these three scenarios, Assumption 2.1 holds for the stochastic dominance of the between-class distances over the within-class distances. Consequently, the procedure produces a contextually self-adjusting bound ($\hat{t}$) around P (or any other blue point), within which most of the other points assigned to the same cluster are likely to fall if the point is identified correctly.

Using an alternative bound (say $t' < \hat{t}$), the number of blue points falling within $t'$ units of P is no greater than the number found within $\hat{t}$ units, and is likely smaller. However, the estimated probability of falling within $t'$ units of P is also smaller than (or at least no greater than) $\hat{\tau}$. Thus, fewer blue points must be within $t'$ units in order to successfully validate the inclusion of P into the blue cluster. The choice of threshold thus determines whether P must be extremely close to a small number of blue points, or moderately close to a large number of blue points. As seen in Figure 1, the value of $\hat{t}$ as calculated by this algorithm tends to fall in the latter category, although it is affected by the extent of the overlap between the clusters. The calculated values of $\hat{t}$, $\hat{\tau}$, and $a_\alpha$, obtained from applying the testing procedure in (15)–(16) to all three classes with $\alpha = 0.05/500$ in (19), are shown in Table 1. We note that in all cases $\hat{\tau} > 0.5$ (Assumption 2.2) and $a_\alpha > 250$ (Remark 2.1).

Table 1.

The outcome of running the algorithm on the three scenarios in Figure 1.

Scenario Class $a_\alpha$ $\hat{t}$ $\hat{\tau}$
A Blue 355 2.51 0.784
  Red 375 2.63 0.819
  Black 402 2.81 0.866
B Blue 382 2.71 0.832
  Red 383 2.69 0.833
  Black 489 4.47 0.996
C Blue 472 3.89 0.977
  Red 472 3.91 0.977
  Black 489 4.51 0.996

The calculated value of $\hat{t}$, and hence those of $\hat{\tau}$ and $a_\alpha$, are greatly affected by the amount of overlap between classes. In Figure 1(c), the clear separation between the three classes causes the respective estimates of $\tau$ to be nearly 1, as seen in Table 1, Scenario C. By shifting the red class closer to the blue class (i.e. Figure 1(b) and Table 1, Scenario B), the values of $\hat{t}$ and $\hat{\tau}$ both decrease, as expected, for these two classes.

3. A simulation study

Our simulation study is designed to mimic the conditions of a LC-MS/MS shotgun proteomics study. In this light, we consider a set-up in which N1 instances (peptides) belong to class/protein C1, and N2 instances model the peptides belonging to any other class/protein. The distance between instances is measured using correlation distance, again mimicking a common way to measure similarity between peptides, although similar results can be obtained using other (quasi-) distance metrics (e.g. Euclidean distance).

3.1. The simulation setup

To establish some notation, suppose that $y = (y_1, y_2, \ldots, y_N)^{\top}$ is an $N \times 1$ random vector having some joint distribution $H_N$. We assume, without loss of generality, that the values of y are standardized, so that $E(y_i) = 0$ and $V(y_i) = 1$ for each $i = 1, 2, \ldots, N$. We denote by D the corresponding correlation (covariance) matrix for y, $D = \mathrm{cor}(y, y)$. To simplify, we assume that K = 2, so that the N instances are presumed to belong to either class $C_1$ or $C_2$. Accordingly, we partition y and D as $y = [y_1^{\top}, y_2^{\top}]^{\top}$, with $y_1 = (y_{1,1}, y_{1,2}, \ldots, y_{1,N_1})$ and $y_2 = (y_{2,N_1+1}, y_{2,N_1+2}, \ldots, y_{2,N_1+N_2})$, $N_1 + N_2 = N$, and

$D = \begin{bmatrix} D_{1,1} & D_{1,2} \\ D_{2,1} & D_{2,2} \end{bmatrix},$

with $D_{k,\ell} = \mathrm{cor}(y_k, y_\ell)$, $k, \ell = 1, 2$.

As in Section 2, let X denote the $n \times N$ data matrix of the observed intensities. We denote by $x_j := (x_{j1}, x_{j2}, \ldots, x_{jN})$ the jth row of X, $j = 1, 2, \ldots, n$, and we assume that $x_1, x_2, \ldots, x_n$ are independent and identically distributed as $y \sim H_N$.

Using standard notation, we write $1_n = (1, 1, \ldots, 1)^{\top}$, $I_n$ for the $n \times n$ identity matrix, and $J_n = 1_n 1_n^{\top}$ for the $n \times n$ matrix of 1s. For the simulation studies we conducted, we took $H_N$ to be the N-variate normal distribution, so that

$x_j \sim N_N(0, D), \quad j = 1, 2, \ldots, n, \quad \text{i.i.d.},$

where

$D_{1,1} = (1 - \rho_1) I_{N_1} + \rho_1 J_{N_1},$ (30)
$D_{2,2} = (1 - \rho_2) I_{N_2} + \rho_2 J_{N_2},$ (31)
$D_{2,1} = D_{1,2}^{\top} = \rho_{12} 1_{N_2} 1_{N_1}^{\top},$ (32)

for $\rho = (\rho_1, \rho_{12}, \rho_2)$ with $0 \le \rho_{12} \le \rho_2 \le \rho_1 < 1$.

To allow for misclassification of instances, we included, for a certain proportion p, some $m := [pN_1]$ of the $N_1$ ‘observed’ instance intensities from $C_1$ that were actually simulated with $D_{1,1}$ replaced by $D^*_{1,1} = (1 - \rho_2) I_{N_1} + \rho_2 J_{N_1}$ in (30) above. Thus, m is the number of misclassified instances among the $N_1$ instances that were initially labeled as belonging to $C_1$.
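Under the stated block structure, data with the covariances of (30)–(32) can be drawn without forming D explicitly, via a one-factor-per-class construction (our sketch of the design, not the authors' code; the handling of the mislabeled block, which gets its own class factor so that it has correlation $\rho_2$ internally and $\rho_{12}$ with everything else, is our reading of the design):

```python
import math
import random

def simulate_data(n, n1, n2, rho1, rho12, rho2, p=0.0, seed=0):
    """Draw n i.i.d. samples of an N-vector (N = n1 + n2) with compound-symmetric
    blocks: within-C1 correlation rho1, within-C2 correlation rho2, cross-class
    correlation rho12. The last m = [p * n1] instances of C1 are 'mislabeled':
    they share their own factor (correlation rho2 among themselves) but only the
    global factor (rho12) with everything else. Requires rho1 >= rho2 >= rho12."""
    rng = random.Random(seed)
    m = int(p * n1)                       # number of mislabeled instances
    rows = []
    for _ in range(n):
        g = rng.gauss(0, 1)               # global factor -> rho12 across groups
        c1, c2, cm = (rng.gauss(0, 1) for _ in range(3))  # class/mislabel factors
        row = []
        for i in range(n1 + n2):
            if i < n1 - m:                # correctly labeled C1 instance
                rho, c = rho1, c1
            elif i < n1:                  # mislabeled instance carried in C1
                rho, c = rho2, cm
            else:                         # C2 instance
                rho, c = rho2, c2
            # Variance components sum to 1: rho12 + (rho - rho12) + (1 - rho).
            row.append(math.sqrt(rho12) * g
                       + math.sqrt(rho - rho12) * c
                       + math.sqrt(1.0 - rho) * rng.gauss(0, 1))
        rows.append(row)
    return rows                           # the n x N data matrix X
```

Each entry has mean 0 and variance 1, and sharing the factors g and c induces exactly the pairwise correlations of (30)–(32).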

Remark 3.1

In this simulation, $\rho_2$ reflects (as a proxy) the common characteristics of two mislabeled instances. When $\rho_2 = \rho_1$, distances between two mislabeled instances have the same distribution as distances between two correctly labeled instances, as would be the case for binary classification. When $\rho_2 = \rho_{12}$, distances between two mislabeled instances have the same distribution as distances between a correctly labeled instance and a mislabeled instance. This would be the case if the probability that two mislabeled instances come from the same class is zero.

For each simulation run, we recorded $\hat{\tau}$, $\hat{t}$, and counted the number of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), as defined in Table 2. From these data, we calculated the sensitivity, specificity, false discovery proportion (FDP), false non-discovery proportion (FNP), and percent reduction in FNP ($\%\Delta$) for each run, defined as follows:

$\text{Sensitivity} = \frac{TP}{TP + FN}$ — proportion of correctly removed instances out of all mislabeled instances.
$\text{Specificity} = \frac{TN}{TN + FP}$ — proportion of correctly retained instances out of all correctly labeled instances.
$\text{FDP} = \frac{FP}{\max(TP + FP, 1)}$ — proportion of incorrectly removed instances out of all those removed.
$\text{FNP} = \frac{FN}{\max(TN + FN, 1)}$ — proportion of incorrectly retained instances out of all those retained.
$\%\Delta = \left(1 - \frac{FNP}{p}\right) \times 100$ — percent reduction in FNP relative to p.

Table 2.

For a single run, each instance has one of four possible outcomes.

                       Truth
           Correctly labeled   Mislabeled    Total
Result  Keep       TN              FN       $N_1 - R$
        Remove     FP              TP       $R$
        Total   $N_1 - m$          $m$      $N_1$

The notation for the total count of instances with each of these outcomes in a single run.

Each statistic was averaged over all 1000 runs.
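Given the four counts of Table 2, the per-run statistics above are direct ratios; a small helper (our naming, not the authors' code) makes the definitions concrete:

```python
def run_metrics(tp, tn, fp, fn, p):
    """Per-run performance measures for the filtering test.
    tp/tn/fp/fn are the counts of Table 2; p is the mislabeling proportion."""
    nan = float('nan')
    sens = tp / (tp + fn) if (tp + fn) > 0 else nan   # undefined when m = 0
    spec = tn / (tn + fp)
    fdp = fp / max(tp + fp, 1)                        # 0 when nothing is removed
    fnp = fn / max(tn + fn, 1)
    pct = (1 - fnp / p) * 100 if p > 0 else nan       # %-reduction vs. p
    return sens, spec, fdp, fnp, pct
```

For example, with $N_1 = 100$, $m = 10$ and counts TP = 8, FN = 2, TN = 85, FP = 5, the sensitivity is 0.8, the specificity 85/90, the FDP 5/13, and the FNP 2/87.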

3.2. Simulation results

We conducted B = 1000 simulation runs of the test procedure with n = 10, 25, 50, 75, 100, 250, 500, 700; $N_1 = 25$, 50, 100, 500; $N_2 = 1000$; $\alpha_0 = 0.05$; and $\rho = (\rho_1 = 0.5, \rho_{12} = 0.1, \rho_2 = 0.5)$, (0.5, 0.1, 0.1), (0.5, 0.2, 0.5), and (0.5, 0.2, 0.2). We also varied the value of p, the proportion of mislabeled instances, such that p = 0.0, 0.05, 0.1, 0.2, 0.25. In particular, p = 0 means no mislabeling, i.e. that the initial labeling is perfect.

3.2.1. Results with no mislabeling (case p = 0)

Simulations where p = 0 (i.e. no mislabeling) were used to illustrate the theoretical assumptions of the testing procedure, as this presents a case in which the global null hypothesis in (20) holds. In this case, in particular, the FDP measures the proportion of incorrectly rejected hypotheses in (15) out of all rejected hypothesis tests, given that at least one hypothesis test was rejected.

Remark 3.2

For p = 0, every rejected hypothesis test in (15) is incorrectly rejected. Thus, the FDP is 1 if any rejected hypothesis tests are observed, and zero otherwise (by the definition of the FDP). Consequently, the average value of the FDP over all B runs provides an estimate of the global type I error probability $\alpha^*$.

Figure 2 and Table 3 show the FDP for various n, N1, and ρ12. As seen in the figures, the FDP converges to zero as n increases for all values of N1, but the convergence slows as N1 increases. When n is small, at least one instance was removed from almost all classes. We attribute this behavior to the inherent correlation structure of the data. This will be explored further in Section 3.2.2.

Figure 2.

The FDP as a function of n and $N_1$ for (a) $\rho_{12} = 0.1$ and (b) $\rho_{12} = 0.2$, where p = 0.

Table 3.

Simulation results for p = 0, ρ12=ρ2=0.2, and α0=0.05.

  N1=25 N1=50 N1=100 N1=500
n FDP Spec. FDP Spec. FDP Spec. FDP Spec.
10 0.916 0.929 1.000 0.880 1.000 0.834 1.000 0.737
25 0.843 0.947 0.999 0.904 1.000 0.860 1.000 0.766
50 0.673 0.965 0.993 0.933 1.000 0.894 1.000 0.807
75 0.461 0.979 0.969 0.952 1.000 0.921 1.000 0.841
100 0.298 0.987 0.879 0.967 0.999 0.943 1.000 0.872
250 0.001 1.000 0.060 0.999 0.414 0.995 0.999 0.977
500 0.000 1.000 0.000 1.000 0.000 1.000 0.089 1.000
700 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000

The sensitivity and FNP are excluded from the table because they are either undefined or constant when p = 0. As explained in Remark 3.2, the FDP provides an estimate of α when p = 0.

The specificity measures the ability to retain instances which are correctly labeled. From Figure 3 and Table 3, it can be seen that while almost every run removed at least one instance (as reflected by the FDP), most instances were retained. With n = 10, ρ12=ρ2=0.2, and N1=25, an average of 23.2 out of 25 instances were retained in each run. This retention rate decreased as N1 increased, mirroring the slower convergence of the FDP for larger N1; even so, for n = 10, ρ12=ρ2=0.2, and N1=500, an average of 368.5 of the 500 instances were retained.

Figure 3. The specificity as a function of n and N1 for (a) ρ12=0.1 and (b) ρ12=0.2 where p = m = 0.


3.2.2. Behavior under artificially constructed independence

In light of Remark 3.2 and the likely impact of the correlation structure present in the data on the FDP, we designed a simulation study to explore this effect. In this study, the ‘distance’ matrices were artificially created in a manner which preserved dependence due to symmetry but removed all other dependencies across distances.

To simulate ‘distance’ matrices in this case, we began by randomly generating a distance matrix using the original test procedure with p = 0, n = 100, N1=500, N2=1000, and ρ=(0.5,0.2,0.2). A normal distribution was fit to the within-C1 distances and to the C1-to-C2 distances, as shown in Figure 4. For N1=25, 50, 100, 500, 1000, 2000, 5000, and 12,000 and N2=1000, these fitted normal distributions were used to generate B new distance matrices by drawing each entry d_ij as follows:

d_ij = 0                           if i = j,
d_ij ~ N(μ = 0.523, σ = 0.0684)    if 1 ≤ i < j ≤ N1,
d_ij ~ N(μ = 0.771, σ = 0.0903)    if 1 ≤ i ≤ N1 < j ≤ N,
d_ij = d_ji                        if i > j.
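A minimal NumPy sketch of this construction (the function and parameter names are ours): the upper triangle is drawn independently from the two fitted normal distributions and mirrored, so that only the symmetry constraint d_ji = d_ij induces dependence.

```python
import numpy as np

def simulate_distance_matrix(N1, N2, rng,
                             mu_within=0.523, sd_within=0.0684,
                             mu_between=0.771, sd_between=0.0903):
    """Symmetric 'distance' matrix with zero diagonal, drawn entrywise."""
    N = N1 + N2
    D = np.zeros((N, N))
    # Within-C1 distances: upper triangle of the C1 block (1 <= i < j <= N1)
    iu1 = np.triu_indices(N1, k=1)
    D[iu1] = rng.normal(mu_within, sd_within, size=len(iu1[0]))
    # C1-to-C2 distances: 1 <= i <= N1 < j <= N
    D[:N1, N1:] = rng.normal(mu_between, sd_between, size=(N1, N2))
    # Distances within C2 are not used by the test and remain zero.
    return D + D.T  # enforce symmetry: d_ji = d_ij

D = simulate_distance_matrix(25, 1000, np.random.default_rng(0))
```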

Figure 5 shows the results of this procedure for five sets of B = 1000 runs. As N1 increased, the FDP also increased, but never surpassed the theoretical limit of 1 − e^(−0.05) ≈ 0.049, consistent with the theoretical result given in (22).

Figure 5. Direct simulation of the distance matrix using independent draws from the fitted normal distributions with p = 0, performed as five batches of B = 1000 runs each. The average across all five batches is shown as squares. The line at the top of the plot shows 1 − e^(−α0) for α0=0.05.

3.2.3. Results under some mislabeling, (case p>0)

To provide context for the results with some mislabeling (i.e. p>0), first consider a trivial filtering procedure which retains all N1 instances in C1. Since all pN1 incorrect instances are retained, the FNP of this trivial procedure is pN1/N1 = p; in other words, an FNP of p is achievable with no filtering at all, so a useful testing procedure must achieve an FNP below p. When p>0, the FDP gives the proportion of correctly labeled instances among those removed. However, in the context of proteomics, where the set of retained peptides is used in subsequent analyses, we found the specificity to be a more relevant metric. Consequently, the FNP and specificity are the primary statistics used to evaluate our proposed testing procedure in the presence of labeling errors (p>0), with %Δ providing a standardized measure of the decrease in FNP that does not depend on p.
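These evaluation metrics can be computed from two boolean vectors over the N1 instances of C1. The helper below is a sketch with our own naming; %Δ is taken to be the percent reduction in FNP relative to p, consistent with the tabulated values, and the FNP is computed over the retained instances.

```python
import numpy as np

def evaluation_metrics(mislabeled, removed):
    """mislabeled, removed: boolean arrays over the N1 instances of C1.

    FNP: proportion of mislabeled instances among those retained.
    FDP: proportion of correctly labeled instances among those removed.
    """
    mislabeled = np.asarray(mislabeled, bool)
    removed = np.asarray(removed, bool)
    kept = ~removed
    p = mislabeled.mean()
    fnp = (mislabeled & kept).sum() / max(kept.sum(), 1)
    fdp = (~mislabeled & removed).sum() / max(removed.sum(), 1)
    sens = (mislabeled & removed).sum() / max(mislabeled.sum(), 1)
    spec = (~mislabeled & kept).sum() / max((~mislabeled).sum(), 1)
    pct_delta = 100 * (p - fnp) / p if p > 0 else float("nan")
    return dict(FNP=fnp, FDP=fdp, sens=sens, spec=spec, pct_delta=pct_delta)

# Trivial procedure: remove nothing, so FNP equals p and %Delta = 0
m = evaluation_metrics([True] * 5 + [False] * 95, [False] * 100)
print(m["FNP"], m["pct_delta"])  # 0.05 0.0
```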

Figure 6 provides the average FNP and %Δ over the B = 1000 simulation runs as a function of n for ρ12=0.2 and varying combinations of N1, ρ2, and p. In all cases, the procedure reduced the FNP relative to p (that is, %Δ>0 for all results). Small values of n produced the smallest reduction (highest FNP), but the FNP converged at approximately n = 100 to a value dependent on p, N1, and ρ2. Table 4 shows how the average FNP compares between small sample sizes (n = 10) and large sample sizes (averaged across n = 250, 500, and 700) at each value of p, N1, and ρ2. For p = 0.05, the FNP converged to 0 for all values of N1 and ρ2. For higher values of p, decreasing N1 and decreasing ρ2 led to a higher FNP for n ≥ 250. For example, when p = 0.25 and N1=500, the instances remaining in the class after filtering still include 10.26% (%Δ=59.0) mislabeled instances for n = 10 and 2.85% (%Δ=88.6) mislabeled instances when n is large. Both are substantial decreases from the 25% of mislabeled instances in the unfiltered data.

Figure 6. The FNP as a function of n and N1 for p = 0.05, 0.10, 0.20, and 0.25 with ρ12=0.2.

Table 4.

The mean FNP and %Δ at n = 10 and n ≥ 250 for each combination of p, N1, and ρ2 at ρ12=0.2.

    ρ2=0.2 ρ2=0.5
    n = 10 n ≥ 250 n = 10 n ≥ 250
p N1 FNP %Δ FNP %Δ FNP %Δ FNP %Δ
0.05 25 0.0169 66.2 0.0000 100.0 0.0168 66.4 0.0000 100.0
  50 0.0131 73.8 0.0000 100.0 0.0135 72.9 0.0000 100.0
  100 0.0150 69.9 0.0000 100.0 0.0137 72.6 0.0000 100.0
  500 0.0110 78.0 0.0000 100.0 0.0116 76.8 0.0000 100.0
0.10 25 0.0361 63.9 0.0007 99.3 0.0388 61.2 0.0000 100.0
  50 0.0394 60.6 0.0015 98.5 0.0429 57.1 0.0001 99.9
  100 0.0325 67.5 0.0009 99.1 0.0363 63.7 0.0001 99.9
  500 0.0255 74.5 0.0004 99.6 0.0298 70.2 0.0000 100.0
0.15 25 0.0585 61.0 0.0047 96.8 0.0635 57.7 0.0006 99.6
  50 0.0637 57.5 0.0068 95.5 0.0694 53.8 0.0011 99.3
  100 0.0585 61.0 0.0061 95.9 0.0659 56.1 0.0010 99.3
  500 0.0446 70.3 0.0032 97.8 0.0511 66.0 0.0003 99.8
0.20 25 0.1254 37.3 0.0468 76.6 0.1427 28.6 0.0249 87.6
  50 0.1042 47.9 0.0300 85.0 0.1229 38.6 0.0108 94.6
  100 0.0922 53.9 0.0214 89.3 0.1069 46.6 0.0057 97.2
  500 0.0707 64.6 0.0115 94.2 0.0852 57.4 0.0021 98.9
0.25 25 0.1612 35.5 0.0854 65.8 0.1891 24.4 0.0678 72.9
  50 0.1399 44.0 0.0574 77.1 0.1657 33.7 0.0304 87.8
  100 0.1303 47.9 0.0485 80.6 0.1584 36.6 0.0205 91.8
  500 0.1026 59.0 0.0285 88.6 0.1292 48.3 0.0078 96.9

Figure 7 provides the average specificity over the 1000 simulation runs as a function of n for ρ12=0.2 and varying combinations of N1, ρ2, and p. The specificity always converged to 1 as n increased, with convergence by n = 250 in all cases; for larger values of p, convergence was faster (by n = 100). For small values of n, a higher specificity was observed when ρ2 and N1 were smaller, although even in the worst case the specificity exceeded 0.75.

Figure 7. The specificity as a function of n and N1 for p = 0.05, 0.10, 0.20, and 0.25 with ρ12=0.2.

Table 5 gives the average estimate of the FDP, FNP, %Δ, sensitivity, and specificity in the case where n = 50 and ρ12=0.2 across all combinations of p and N1. As already noted above, the table shows that the FNP increases with p and decreases as N1 increases, with the sensitivity moving in the opposite direction. The FDP gives the proportion of correctly labeled instances among all the removed instances. This measure increases as a function of N1, corresponding to the decrease in the specificity of the procedure: more correctly labeled instances are filtered out. On the other hand, it decreases as a function of p due to the increased proportion of mislabeled instances available to be removed.

Table 5.

The FDP, FNP, sensitivity (Sens.), and specificity (Spec.) of the algorithm using simulated data with n = 50 and ρ12=0.2.

    ρ2=0.2 ρ2=0.5
p N1 FDP FNP %Δ Sens. Spec. FDP FNP %Δ Sens. Spec.
0.00 25 0.673 0.000     0.965 0.630 0.000     0.969
  50 0.993 0.000     0.933 0.980 0.000     0.936
  100 1.000 0.000     0.894 1.000 0.000     0.900
  500 1.000 0.000     0.807 1.000 0.000     0.813
0.05 25 0.131 0.000 99.242 0.991 0.989 0.138 0.001 98.912 0.987 0.988
  50 0.310 0.000 99.305 0.992 0.975 0.306 0.000 99.745 0.997 0.973
  100 0.349 0.000 99.384 0.994 0.965 0.372 0.000 99.422 0.995 0.959
  500 0.542 0.000 99.763 0.998 0.926 0.559 0.000 99.819 0.999 0.913
0.10 25 0.035 0.003 96.689 0.961 0.996 0.049 0.002 97.840 0.974 0.994
  50 0.046 0.003 96.583 0.969 0.994 0.080 0.002 97.726 0.980 0.989
  100 0.086 0.002 97.793 0.980 0.988 0.144 0.001 98.609 0.988 0.978
  500 0.197 0.001 99.018 0.992 0.968 0.268 0.001 99.431 0.995 0.949
0.15 25 0.013 0.010 93.127 0.921 0.998 0.028 0.008 94.708 0.940 0.996
  50 0.016 0.011 92.580 0.930 0.997 0.040 0.007 95.421 0.957 0.993
  100 0.027 0.009 93.669 0.946 0.995 0.072 0.006 95.928 0.966 0.985
  500 0.069 0.005 96.927 0.974 0.986 0.145 0.003 98.166 0.986 0.965
0.20 25 0.001 0.055 72.335 0.760 1.000 0.009 0.050 75.199 0.784 0.999
  50 0.006 0.037 81.397 0.844 0.999 0.027 0.030 84.936 0.874 0.994
  100 0.011 0.026 87.029 0.893 0.997 0.042 0.018 91.068 0.928 0.989
  500 0.026 0.014 93.150 0.945 0.993 0.094 0.009 95.687 0.967 0.971
0.25 25 0.000 0.094 62.489 0.668 1.000 0.017 0.105 58.032 0.619 0.998
  50 0.003 0.065 74.092 0.779 0.999 0.022 0.060 75.990 0.793 0.996
  100 0.005 0.053 78.976 0.833 0.999 0.032 0.046 81.705 0.856 0.991
  500 0.013 0.031 87.412 0.903 0.996 0.067 0.022 91.073 0.934 0.976

The sensitivity of the procedure, as discussed above, gives the proportion of mislabeled instances that are detected out of all mislabeled instances. This measure increases with N1 and decreases with p, mirroring the FNP estimate. For example, when p = 0.20, N1=500, and ρ2=0.2, the data consist of 100 mislabeled instances and 400 correctly labeled instances. On average, the procedure removed 94.5% of mislabeled instances and only 0.7% of correctly labeled instances, based on the reported sensitivity and specificity. Thus, the resulting filtered data set has an average of 5.5 mislabeled instances and 397.2 correctly labeled instances, for an average of 402.7 total instances. This reflects a decrease in the proportion of mislabeled instances by 93.15%: while 20% of the original data set was mislabeled, only 1.4% of the filtered data set remains mislabeled.
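The arithmetic in this example can be verified directly from the reported sensitivity and specificity (the small discrepancy from the tabulated %Δ = 93.150 comes from rounding those inputs):

```python
N1, p = 500, 0.20
sens, spec = 0.945, 0.993           # reported sensitivity and specificity

mislabeled = p * N1                  # 100 mislabeled instances
correct = N1 - mislabeled            # 400 correctly labeled instances
mis_kept = mislabeled * (1 - sens)   # 5.5 mislabeled instances retained
cor_kept = correct * spec            # 397.2 correctly labeled instances retained

fnp = mis_kept / (mis_kept + cor_kept)   # proportion mislabeled after filtering
pct_delta = 100 * (p - fnp) / p          # percent reduction in FNP

print(round(100 * fnp, 1), round(pct_delta, 1))  # 1.4 93.2
```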

4. Proteomics case study

To illustrate the applicability of our algorithm on a real data set, we applied it to a large observational study consisting of measured peptide intensities from 120 different serum samples collected from children with sickle cell disease. This study included additional runs (injections) to experimentally assess whether each peptide was quantified accurately and, for five specific proteins, whether the peptides were matched to the correct protein. A complete description of the methodology used to generate this verification data and to determine whether the protein label was correct for the peptides from these five proteins can be found in Key [7].

Briefly, the so-called ‘spike-in subset’ consists of eight aliquots from a separate study in which protein concentration measurements were available: five in which a single protein was spiked at high intensity plus three controls, as seen in Table 6. Correctly identified peptides from the spiked proteins should have an artificially high intensity in the corresponding aliquot, while misidentified peptides lack this intensity ‘spike’ and thus appear ‘unspiked’. Unfortunately, this is not a ‘gold standard’ for whether or not every peptide accurately reflects the abundance of the protein. Commercially produced proteins used in the spike-in subset likely have less sequence and/or post-translational modification variability than the human population, so some correctly identified peptide sequences are not overly abundant in the spiked samples. Conversely, because the procedures used to identify a peptide and to assess its abundance are performed on different data [15], some peptides can be identified correctly (with support from the spike-in data set) yet have a very low signal-to-noise ratio and thus poorly reflect the protein's abundance. Consequently, a manual determination of peptide accuracy (incorporating both abundance and labeling) was made based on heat maps of each of the five spiked proteins. This created a pair of decision rules that could be used as an objective (albeit imperfect) set of benchmarks for the purpose of evaluating the algorithm. In Key [7], they are further used to compare the performance of the proposed algorithm against other available algorithms.

Table 6.

The original quantity of the five spike-in proteins in 25 μl of each sample in the spike-in subset, as measured by nephelometry, as well as the quantity of protein added to generate the spiked sample.

  Pre-depletion (μg/25 μl)  
Symbol s1 s2 s3 Spike quantity (μg/25 μl)
A1AG1 5.53 21.18 29.50 125
APOA1 20.23 30.00 41.25 100
APOB 9.30 10.50 24.75 167
CERU 3.00 7.13 6.03 38
HEMO 9.48 20.85 22.38 68

Proteins were spiked into sample s1.

4.1. Results

Results for the five proteins for which spike data are available are shown in Table 7. All five proteins showed at least a 25% reduction in the proportion of false positives, regardless of the benchmark against which they were compared. In addition, the specificity was at least 0.84 across all five proteins: the amount of poorly quantified or misidentified peptides was decreased while peptides carrying the spike signal were generally retained.

Table 7.

The estimated FNP and %Δ of the algorithm using the spike and manual benchmarks as a stand-in for the true peptide status.

  Spike Manual
Protein FNP Sens. Spec. %Δ FNP Sens. Spec. %Δ
HEMO 0.05 (0.06) 0.33 0.91 −26 0.20 (0.29) 0.36 1.00 −28
A1AG1 0.11 (0.18) 0.50 0.87 −38 0.17 (0.30) 0.53 0.95 −42
APOA1 0.07 (0.18) 0.67 0.96 −61 0.09 (0.20) 0.60 0.96 −53
APOB 0.09 (0.14) 0.53 0.84 −40 0.13 (0.25) 0.59 0.91 −48
CERU 0.09 (0.17) 0.59 0.90 −50 0.23 (0.32) 0.42 0.92 −28

For reference, the estimated FNP without any filtering is shown in parentheses.

Differences between the two benchmarks were primarily attributed to their different focus. The decision in the spike benchmark is completely independent of the quantitative data used by the filtering algorithm, but it cannot account for peptides with a low signal-to-noise ratio. Consequently, the spike benchmark tends to assign a ‘good’ truth status to more peptides than the manual method, resulting in a lower FNP prior to any filtering.

Figure 8 shows an example of the proteomics profiles for a single protein. Although all 187 peptides were marked as originating from ceruloplasmin (CERU), approximately 6–8 different abundance patterns can clearly be visually discerned in the heat map, not counting the ‘noisy’ cluster at the top of the figure marked in pink in the dendrogram. Despite this, the spike data generally support the conclusion that most (if not all) of these peptides truly originated from a single protein source. Only three peptides lack the appropriate spike signal, and while a visual analysis of the heat map supports these being true misidentifications, their correlation with the observed data suggests that the cost of incorrectly retaining them is likely to be low. Among the ‘clean’ clusters, the algorithm removes one small cluster whose abundance profile is substantially distinct from the more dominant patterns, along with four other peptides from the top (slightly noisier) cluster.

Figure 8. Heat map showing the relative abundance data of ceruloplasmin (CERU) for each patient (column) and peptide (row). Peptides identified incorrectly according to the spike and manual benchmarks are shown as open circles and in grey, respectively. Using our algorithm, values of Zi above the critical value (a_0.05 = 107) are predicted to be correct, while values below 107 are predicted to be incorrect. The spike produced an inconclusive result for one peptide, shown as a ×.

The top ‘noisy’ cluster can be further split visually into two groups at the dashed gray line. Above this line, the spike signal is present for only four peptides, and no abundance patterns are visible across the columns; the algorithm retains only a single one of these peptides. Below the dashed line, the spike-in data suggest that the peptides were (generally) identified correctly but that their abundances are noisy. Visually, the left side of the heat map appears redder than the right, similar to the behavior of some cleaner subsets. These peptides are generally retained.

5. Discussion

In this paper, we have presented a testing procedure for identifying incorrectly labeled instances when two or more classes are present. Our non-parametric approach to the problem of filtering incorrect labels requires very few assumptions, yields a very high specificity, and can be implemented easily and efficiently using standard statistical software. We demonstrated its applicability and effectiveness using real proteomic data observed in children with sickle cell disease, with spike-in proteins providing a contrasting control on protein labels. Further insights into the properties of our procedure were demonstrated via a simulation study.

As demonstrated in the simulation study, our testing procedure generally has a high specificity and low FNP, especially when the number of measurements (n) on each instance is large. Decreasing the value of n yields a more conservative test (i.e. one which is less likely to reject the null hypothesis in (15) and remove instances), since each distance is measured less precisely. Even for extremely small values of n, it was still possible to reduce the FNP and maintain a specificity over 80% for all classes.

For a fixed n, classes with fewer instances (i.e. small values of N1) had a higher FNP and higher specificity than larger classes. Such a property is relatively unique among classification algorithms: many of the existing classification procedures found in the literature have difficulties when the number of instances varies substantially across classes [16], and this difficulty extends to procedures that utilize these classification approaches to detect mislabeled instances [7]. Our testing procedure avoids this problem and maintains the integrity of small classes by analyzing each class using a ‘one-vs-all’ strategy that is most conservative for small values of N1 (say, 25 ≤ N1 ≤ 50).

On the other hand, when the number of instances is extremely low (2 ≤ N1 ≤ 25 based on our simulation studies), the accuracy of the non-parametric estimates, especially τ^, becomes unreliable. In the most extreme cases, the available data are insufficient to ever reject the null hypothesis even if a reliable estimate of τ could be found. For example, in LC-MS/MS proteomics, extremely small proteins (2 ≤ N1 ≤ 5) often consist entirely of inaccurate or mislabeled peptides and make up a substantial proportion of the reported proteins. This will be addressed in a subsequent work using a complementary procedure in which instances are retained only if the null hypothesis is rejected.

The use of a Bonferroni-type procedure, aimed at protecting against removing correctly identified instances, is extremely conservative, prioritizing a high specificity at the cost of a higher FNP. Even in this conservative case, the FNP in our simulations was universally reduced across all values of N1, n, ρ, and p. Less conservative FWER procedures, FDR-type procedures [2,3], or procedures seeking to explicitly control the FNP could also be considered to further reduce the FNP in the filtered data. The primary convenience of the Bonferroni procedure is its universal applicability, especially in light of the complexity of the dependency structure of the distances and, consequently, of the test statistics Zi.
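To illustrate the difference in stringency, the sketch below contrasts Bonferroni rejection with the Benjamini–Hochberg (BH) step-up procedure [2] on a toy set of p-values (the p-values are hypothetical, not produced by our procedure):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject any hypothesis whose p-value is below alpha / N."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where
    k is the largest index with p_(k) <= k * alpha / N."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * (np.arange(1, N + 1) / N)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(N, bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.028, 0.20, 0.74]
print(bonferroni_reject(pvals).sum(), bh_reject(pvals).sum())  # 2 4
```

As expected, the FWER-controlling Bonferroni rule rejects fewer hypotheses than the FDR-controlling BH rule on the same p-values.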

Because the testing procedure estimates G¯ and F¯ non-parametrically from the available data, these estimates are affected by the presence of mislabeled instances. The resulting estimates τ^ and t^ of τ and t may therefore be biased when a large number of mislabeled instances are included in the class. One possible remedy is to iteratively remove a small number of instances and re-estimate τ and t until some stopping criterion is met. Developing such a sequential estimation procedure is left to future work.

Although we have used Pearson's correlation as a measure of distance in Sections 3 and 4, the procedure is generally applicable whenever the observed data for each instance can be effectively combined into a ‘measure of distance’. The use of different distance measures was illustrated in Section 2.4 with the Euclidean distance measure.
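As an illustration of swapping distance measures, both a correlation-based distance and the Euclidean distance can be computed from an instances-by-measurements matrix. The 1 − r transform below is one common choice of correlation distance; the paper does not specify its exact transform, so this is an assumption.

```python
import numpy as np

def correlation_distance(X):
    """d_ij = 1 - Pearson correlation between rows i and j of X
    (rows are instances, columns are the n measurements)."""
    return 1.0 - np.corrcoef(X)

def euclidean_distance(X):
    """d_ij = Euclidean distance between rows i and j of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.random.default_rng(0).normal(size=(6, 50))  # 6 instances, n = 50
D1, D2 = correlation_distance(X), euclidean_distance(X)
```

Either matrix can then be fed to the testing procedure, since only the pairwise distances enter the test statistics.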

Note

1

Scenario A — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((4.65, 8.08)ᵀ, I2); Black: x ~ N((1.26, 4.89)ᵀ, I2).
Scenario B — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((4.65, 8.08)ᵀ, I2); Black: x ~ N((0.26, 0.89)ᵀ, I2).
Scenario C — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((7.65, 10.08)ᵀ, I2); Black: x ~ N((0.26, 0.89)ᵀ, I2).
Here I2 denotes the 2 × 2 identity covariance matrix.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Anderson L. and Hunter C.L., Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins, Mol. Cell. Proteom. 5 (2006), pp. 573–588. doi: 10.1074/mcp.M500331-MCP200 [DOI] [PubMed] [Google Scholar]
  • 2.Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. Roy. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300. [Google Scholar]
  • 3.Benjamini Y. and Yekutieli D., The control of the false discovery rate in multiple testing under dependency, Ann. Stat. 29 (2001), pp. 1165–1188. doi: 10.1214/aos/1013699998 [DOI] [Google Scholar]
  • 4.Esary J.D., Proschan F., and Walkup D.W., Association of random variables, with applications, Ann. Math. Stat. 38 (1967), pp. 1466–1474. doi: 10.1214/aoms/1177698701 [DOI] [Google Scholar]
  • 5.Forshed J., Johansson H.J., Pernemalm M., Branca R.M.M., Sofi Sandberg A., and Lehtiö J., Enhanced information output from shotgun proteomics data by protein quantification and peptide quality control (PQPQ), Mol. Cell. Proteom. 10 (2011), pp. 1–9. doi: 10.1074/mcp.M111.010264 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Frénay B. and Verleysen M., Classification in the presence of label noise: A survey, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), pp. 845–869. doi: 10.1109/TNNLS.2013.2292894 [DOI] [PubMed] [Google Scholar]
  • 7.Key M.C., ClassCleaner: A quantitative method for validating peptide identification in LC-MS/MS workflows, Ph.D. thesis, Indiana University, Indianapolis, 2020.
  • 8.Lucas J.E., Thompson J.W., Dubois L.G., McCarthy J., Tillmann H., Thompson A., Shire N., Hendrickson R., Dieguez F., Goldman P., Schwarz K., Patel K., McHutchison J., and Moseley M.A., Metaprotein expression modeling for label-free quantitative proteomics, BMC Bioinform. 13 (2012), p. 74. doi: 10.1186/1471-2105-13-74 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Morales P., Luengo J., Garcia L.P.F., Lorena A.C., de Carvalho A.C.P.L.F., and Herrera F., NoiseFiltersR: Label noise filters for data preprocessing in classification, 2016. Available at https://cran.r-project.org/package=NoiseFiltersR
  • 10.Polpitiya A.D., Jun Qian W., Jaitly N., Petyuk V.A., Adkins J.N., Camp D.G., Anderson G.A., and Smith R.D., DAnTE: A statistical tool for quantitative analysis of proteomics data, Bioinformatics 24 (2008), pp. 1556–1558. doi: 10.1093/bioinformatics/btn217 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Quinlan J.R., Induction of decision trees, Mach. Learn. 1 (1986), pp. 81–106. doi: 10.1007/bf00116251 [DOI] [Google Scholar]
  • 12.Sáez J.A., Galar M., Luengo J., and Herrera F., Analyzing the presence of noise in multi-class problems: Alleviating its influence with the one-vs-one decomposition, Knowl. Inf. Syst. 38 (2014), pp. 179–206. doi: 10.1007/s10115-012-0570-1 [DOI] [Google Scholar]
  • 13.Schader M. and Schmid F., Two rules of thumb for the approximation of the binomial distribution by the normal distribution, Am. Stat. 43 (1989), pp. 23–24. [Google Scholar]
  • 14.Silva J.C., Gorenstein M.V., Zhong Li G., Vissers J.P.C., and Geromanos S.J., Absolute quantification of proteins by LCMSE: A virtue of parallel MS acquisition, Mol. Cell. Proteom. 5 (2006), pp. 144–156. doi: 10.1074/mcp.M500230-MCP200 [DOI] [PubMed] [Google Scholar]
  • 15.Steen H. and Mann M., The ABC's (and XYZ's) of peptide sequencing, Nat. Rev. Mol. Cell. Biol. 5 (2004), pp. 699–711. doi: 10.1038/nrm1468 [DOI] [PubMed] [Google Scholar]
  • 16.Sun Y., C Wong A.K., and Kamel M.S., Classification of imbalanced data: A review, Int. J. Pattern Recognit. Artif. Intell. 23 (2009), pp. 687–719. doi: 10.1142/S0218001409007326 [DOI] [Google Scholar]
  • 17.Suomi T., Corthals G.L., Nevalainen O.S., and Elo L.L., Using peptide-level proteomics data for detecting differentially expressed proteins, J. Proteome Res. 14 (2015), pp. 4564–4570. doi: 10.1021/acs.jproteome.5b00363 [DOI] [PubMed] [Google Scholar]
  • 18.Webb-Robertson B.-J.M, Matzke M.M., Datta S., Payne S.H., Kang J., Bramer L.M., Nicora C.D., Shukla A.K., Metz T.O., Rodland K.D., Smith R.D., Tardiff M.F., McDermott J.E., Pounds J.G., and Waters K.M., Bayesian proteoform modeling improves protein quantification of global proteomic measurements, Mol. Cell. Proteom. 13 (2014), pp. 3639–3646. doi: 10.1074/mcp.M113.030932 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Zhu X. and Wu X., Class noise vs. attribute noise: A quantitative study, Artif. Intell. Rev. 22 (2004), pp. 177–210. doi: 10.1007/s10462-004-0751-8 [DOI] [Google Scholar]
