Journal of Applied Statistics. 2022 Feb 24;50(8):1725–1749. doi: 10.1080/02664763.2022.2038546

A statistical testing procedure for validating class labels

Melissa C. Key, Susanne Ragg, Benzion Boukai
PMCID: PMC10228321  PMID: 37260475

Abstract

Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class (protein) labels using available measurements across N instances (peptides). More generally, we present a non-parametric solution to the problem of identifying instances that are deemed as outliers relative to the subset of instances assigned to the same class. The primary assumption is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. We show that the overall type I error probability across all instances within a class can be controlled by some fixed value (say α). We also demonstrate conditions where similar results on type II error probability hold. The theoretical results are supplemented by an extensive numerical study illustrating the applicability and viability of our method. Even with up to 25% of instances initially mislabeled, our testing procedure maintains a high specificity and greatly reduces the proportion of mislabeled instances. The applicability and effectiveness of our testing procedure is further illustrated by a detailed example on a proteomics data set from children with sickle cell disease where five spike-in proteins acted as contrasting controls.

Keywords: Non-parametric, hypothesis testing, machine learning, classification, proteomics

1. Introduction

The research presented in this paper is motivated by an open problem in the quantification of proteins in a label-free shotgun proteomics work-flow.

In label-free shotgun proteomics, the experimental units of interest (proteins) are not measured directly but are represented by measurements on one to 500+ enzymatically cleaved pieces known as peptides. The amino acid sequences composing each peptide are not known a priori, but are inferred based on algorithmic procedures acting on spectrum data from the mass spectrometer. By keeping ‘correctly’ labeled peptides (instances) and removing inaccurately labeled peptides from each protein (class), subsequent quantitative analyses which assume that all measurements are equally representative of the protein are thus more accurate and powerful. Our proposed testing procedure is designed to accomplish this validation task. More specifically, we present a non-parametric solution to the problem of identifying instances that are outliers relative to the subset of instances assigned to the same class. This serves as a proxy for finding errors in the data set: instances for which the class label is recorded incorrectly, or where the quantitative data for a particular instance are sufficiently noisy as to render them uninformative for the purposes of summarizing the class.

Within the field of proteomics, inconsistencies across peptide measurements (including those caused by misidentified peptides, inaccurate quantification, and peptide sequences originating from multiple proteins) can obfuscate patterns in protein abundance measurements across samples (or subjects), making it more difficult to correctly identify proteins that vary across the groups of interest. While a more stringent criterion on the false discovery rate (FDR) [2] of the original identification algorithm is likely to reduce the proportion of misidentified peptides, it will inevitably also remove correct matches due to the inherent trade-off between the FDR and the false non-discovery rate (FNR). While some proteomics work-flows have stressed identification accuracy over quantity (e.g. multiple reaction monitoring, [1]), shotgun proteomics, in particular, has traditionally focused on attempting to identify as many peptides as possible, with robust summarization methods used to address the prevailing inconsistencies across the peptides [10,14,17].

More recently, alternative methods that explicitly address identification errors and other inconsistencies between the patterns in abundance measurements of peptides from the same protein have also been proposed. Unfortunately, however, none of these algorithms has seen much use in practice. Protein Quantification by Peptide Quality Control (PQPQ) [5] is an ad-hoc method that finds a set of ‘representative’ peptides from each protein. Additional peptides from these proteins are only retained if they have similar characteristics to one of the ‘representative’ peptides. The Bayesian proteoform method proposed by Webb-Robertson et al. [18] is primarily focused on grouping peptides based on their pattern of positive and negative differences across comparison groups (e.g. disease/healthy or treatments). This focus on comparison groups can create heterogeneous peptide clusters that meet the single criterion required, but are otherwise dissimilar. Lucas et al. [8] also use Bayesian methods (within the Gaussian framework), but focus on creating so-called ‘meta-proteins’, consisting of peptides with similar abundance patterns across multiple proteins. In contrast, our proposed filtering algorithm requires no parametric assumptions on the distribution of abundances for each protein as it is entirely non-parametric. It is specifically designed to identify a subset of validated peptides that can be used for any subsequent analysis, including tests of differential abundance, classification and clustering.

Outside of proteomics, a similar problem has been studied within the field of classification. Classification models trained on data with labeling errors tend to be more complex and less accurate than models trained on data without labeling errors [11,12,19]. As described in Frénay and Verleysen [6], multiple algorithms have been developed to improve classification algorithms in the presence of labeling errors. Similar to the approaches utilized in proteomics, these have been grouped into three categories: robust methods, explicit modeling methods, and filtering methods. In particular, these filtering methods can also be used to identify (and remove) mislabeled peptides from proteins.

By filtering method, we refer specifically to an algorithm designed to determine whether each instance (e.g. peptide) should be retained in its original class (e.g. protein), or whether it should be removed from that class. This inherently assumes that every instance has an original class label; these algorithms do not provide an initial determination of the class label. In the case of proteomics, these original class labels are likely to originate from identification algorithms, or modifications thereof. For example, proteins with high sequence overlap can be combined into a single ‘protein group’ and analyzed together, or split into subsets depending on the memberships of the different peptides.

In many respects, both the proteomics and classification filtering algorithms function by separating out ‘good’ instances from ‘bad’ instances. However, there is an important distinction between the problem of interest within proteomics and that addressed by the classification algorithms. In the classification problem, the aim is an improved ability to classify new instances: the accuracy of the resulting training set is secondary to the overall improved performance of the resulting classification procedure. On the other hand, the overall accuracy of the filtered proteomics data set is of paramount interest in the proteomics problem, and there is no need to classify new peptides.

In this paper, we present our proposed non-parametric filtering/testing procedure and demonstrate its properties. As will be seen below, it is capable of validating the initial labeling of large proteins (40+ peptides) with high accuracy, even in the presence of heterogeneity across peptides within a protein or when up to 25% of peptides are mislabeled. A detailed comparison of our proposed procedure against several alternative algorithms from both the proteomic and classification literature can be found in Key [7]. This comparison includes proteomics-focused methods from Forshed et al. [5] and Webb-Robertson et al. [18], and other classification-focused algorithms implemented in the NoiseFiltersR package [9]. For the sake of space, we do not provide the details of this comparison here, but it will be presented and discussed in detail in a subsequent paper.

The paper is organized as follows. Section 2 presents in a non-parametric setting the theoretical basis for the algorithm as well as the testing procedure. Section 3 uses a simulation study to demonstrate the effectiveness of the proposed algorithm. Section 4 illustrates the application of the algorithm to a real proteomics data set (sickle cell disease in children) where spiked proteins provided an experimental control on protein identification. Section 5 wraps up the paper with a discussion and concluding remarks.

2. Methodology

2.1. The basic setup

Consider a data set of N instances, of which $N_1, N_2, \ldots, N_{K-1}$ are presumed to belong to classes $C_1, C_2, \ldots, C_{K-1}$, respectively, with $N_k > 1$ for $k = 1, \ldots, K-1$. Let $C_K$ be a ‘mega-class’ consisting of the $N_K$ instances unassigned to a specific class or instances which are the sole representative of a class, so that $N_K = N - \sum_{k=1}^{K-1} N_k$. For simplicity, we use the shorthand notation $i \in C_k$ to indicate an instance i which belongs to class $C_k$, while $i \notin C_k$ indicates an instance which does not. Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ be the vector of observations from instance i ($i = 1, \ldots, N$) across n (independent) samples. The available data is thus $X = (x_1, x_2, \ldots, x_N)$, an $n \times N$ matrix of such quantitative observations (in the case of shotgun proteomics, this corresponds to the observed relative quantity of each peptide). The ‘distance’ between any two instances with observations $x_i$ and $x_j$ can thus be measured using any standard distance or quasi-distance function,

$d_{ij} = \mathrm{dist}(x_i, x_j) \ge 0.$ (1)

For instance, $d_{ij}$ could be a measure of the dissimilarity between peptides over the n samples in the study. In this case, one popular quasi-distance function is defined by the correlation between $x_i$ and $x_j$, $d_{ij} = 1 - r_{ij}$, where

$r_{ij} := \mathrm{cor}(x_i, x_j) = \frac{\sum_{\ell=1}^{n} (x_{i\ell} - \bar{x}_i)(x_{j\ell} - \bar{x}_j)}{\sqrt{\sum_{\ell=1}^{n} (x_{i\ell} - \bar{x}_i)^2} \sqrt{\sum_{\ell=1}^{n} (x_{j\ell} - \bar{x}_j)^2}}.$ (2)

Here, $\bar{x}_i = \sum_{\ell=1}^{n} x_{i\ell} / n$, for each $i = 1, \ldots, N$.
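As a concrete illustration, the correlation quasi-distance of (1)–(2) can be computed with a few lines of standard-library Python; the function name below is ours, not part of any proteomics toolkit.

```python
import math

def correlation_distance(x, y):
    """Quasi-distance d = 1 - r, where r is the Pearson correlation (Eq. (2))
    between two equal-length observation vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return 1.0 - num / den

# Perfectly correlated vectors have distance ~0; anti-correlated vectors, ~2,
# so the quasi-distance ranges over [0, 2].
d_same = correlation_distance([1, 2, 3], [2, 4, 6])   # ~0
d_opp = correlation_distance([1, 2, 3], [3, 2, 1])    # ~2
```

Note that $d_{ij} = 1 - r_{ij}$ takes values in $[0, 2]$ and is only a quasi-distance: it violates the triangle inequality, which the procedure below does not require.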

Let $D = \{d_{ij} : i, j = 1, \ldots, N\}$ be the $N \times N$ (symmetric) matrix comprised of these between-instance observed distances. Without loss of generality, we assume that the entries of D are ordered such that the first $N_1$ entries belong to $C_1$, the next $N_2$ entries belong to $C_2$, etc. Accordingly, we partition D as

$D = \begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1K} \\ D_{21} & D_{22} & \cdots & D_{2K} \\ \vdots & & \ddots & \vdots \\ D_{K1} & D_{K2} & \cdots & D_{KK} \end{bmatrix},$ (3)

where $D_{kk}$ is the $N_k \times N_k$ matrix of between-instance distances within class $C_k$ and the elements of $D_{k_1 k_2}$ represent the distances between the $N_{k_1}$ instances belonging to $C_{k_1}$ and the $N_{k_2}$ instances belonging to $C_{k_2}$. Note, in particular, that $D_{k_1 k_2} \equiv D_{k_2 k_1}^{\top}$.

To begin with, consider at first class $C_1$ and the $N_1$ instances initially assigned to it. For a fixed i, $i = 1, \ldots, N_1$, let $i \in C_1$ be a given instance in class $C_1$ and set $d_i := (d_i^{(1)}, d_i^{(2)}, \ldots, d_i^{(K)})$, the ith row of D, where $d_i^{(1)} := (d_{i1}, \ldots, d_{iN_1})$ and, for $k \ge 2$, $d_i^{(k)} := (d_{i(1 + \sum_{j=1}^{k-1} N_j)}, \ldots, d_{i(\sum_{j=1}^{k} N_j)})$. Clearly, $d_i^{(k)}$ is the ith row of $D_{1k}$ for $k = 1, \ldots, K$. For the class $C_1$, we consider the stochastic modeling of the elements of $D_{11}, D_{12}, \ldots, D_{1K}$. We assume that for each instance $i \in C_1$, the observed within-class distances from it to the other $N_1 - 1$ instances in $C_1$ are i.i.d. random variables according to a class-specific distribution, $G(\cdot)$, so that

$(d_{i1}, \ldots, d_{i(i-1)}, d_{i(i+1)}, \ldots, d_{iN_1}) \overset{\text{i.i.d.}}{\sim} G_i(\cdot),$ (4)

(since $d_{ii} \equiv 0$). Here, the c.d.f.s $G_i(\cdot)$ are defined for each fixed $i \in C_1$, and any $j \in C_1$, as

$G_i(t) \equiv G(t) := \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i) \quad \forall t \in \mathbb{R}.$ (5)

Our notation in (5) stresses that we are assuming that the distribution of distances from each individual instance to the remaining instances in $C_1$ is identical. Similarly, the distances between the given instance, $i \in C_1$, and the $N_k$ instances in class $C_k$, $k = 2, \ldots, K$, are also i.i.d. random variables according to some distribution $F^{(k)}(\cdot)$, so that

$(d_{ij}) \overset{\text{i.i.d.}}{\sim} F^{(k)}(\cdot) \quad \forall j \in C_k,\ (j = 1, \ldots, N_k).$ (6)

Here, $F^{(2)}, F^{(3)}, \ldots, F^{(K)}$ are K−1 distinct c.d.f.s defined for each $i \in C_1$, and any $j \in C_k$, as

$F^{(k)}(t) = \Pr(d_{ij} \le t \mid i \in C_1, j \in C_k) \quad \forall t \in \mathbb{R}.$ (7)

We assume throughout this work that $G(\cdot), F^{(2)}(\cdot), \ldots, F^{(K)}(\cdot)$ are continuous distributions with p.d.f.s $g(\cdot), f^{(2)}(\cdot), \ldots, f^{(K)}(\cdot)$, respectively. If we further assume that all the $N_1$ instances from class $C_1$ are equally representative of the true measurement across all n samples, we would expect the distances between any two measurements from within class $C_1$ to be stochastically smaller than distances between measurements from within $C_1$ and instances associated with class $C_k$, where $k \ne 1$. Accordingly, we have the following assumption.

Assumption 2.1 Stochastic ordering —

For each k, $k = 2, \ldots, K$,

$\Pr(d_{ij} \le t \mid i \in C_1, j \in C_k) \le \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i)$ (8)

or equivalently,

$F^{(k)}(t) \le G(t) \quad \forall t \in \mathbb{R}.$

In light of (7), the distribution of distances from the ith instance in $C_1$ to any other random instance J, selected uniformly from among the $N - N_1$ instances not in $C_1$, is thus the mixture,

$\bar{F}(t) := \Pr(d_{iJ} \le t \mid i \in C_1, J \notin C_1) = \frac{1}{N - N_1} \sum_{k=2}^{K} N_k F^{(k)}(t),$ (9)

where we have taken $\Pr(J \in C_k \mid J \notin C_1) := N_k / (N - N_1)$. Further, if I is a randomly selected instance in class $C_1$, selected with probability $\Pr(I = i \mid I \in C_1) = 1/N_1$, for $i = 1, 2, \ldots, N_1$, then it follows that

$\Pr(d_{IJ} \le t \mid I \in C_1, J \notin C_1) = \sum_{i=1}^{N_1} \frac{1}{N_1} \Pr(d_{iJ} \le t \mid i \in C_1, J \notin C_1) = \sum_{i=1}^{N_1} \frac{1}{N_1} \bar{F}(t) \equiv \bar{F}(t).$ (10)

Similarly, if I and J represent two distinct instances, both randomly selected from $C_1$ with $\Pr(I = i, J = j \mid I \in C_1, J \in C_1) = 1/[N_1 (N_1 - 1)]$, then

$\bar{G}(t) := \Pr(d_{IJ} \le t \mid I \in C_1, J \in C_1, J \ne I) = \sum_{i=1}^{N_1} \sum_{j=1, j \ne i}^{N_1} \frac{1}{N_1 (N_1 - 1)} \Pr(d_{ij} \le t \mid i \in C_1, j \in C_1, j \ne i) = \sum_{i=1}^{N_1} \sum_{j=1, j \ne i}^{N_1} \frac{1}{N_1 (N_1 - 1)} G_i(t) \equiv G(t).$ (11)

It follows by Assumption 2.1 that

$\bar{F}(t) \le \bar{G}(t) \quad \forall t \in \mathbb{R}.$ (12)

Now, for any $t \in \mathbb{R}$ define

$\psi(t) := \bar{G}^{-1}(1 - \bar{F}(t)).$ (13)

It can be easily verified that the function $h(t) := \psi(t) - t$ has a unique root, $t^*$, such that $t^* = \psi(t^*)$ and

$\bar{G}(t^*) = 1 - \bar{F}(t^*) := \tau.$ (14)

As we will see below, the value of $t^*$ serves as a cut-off point to differentiate between the distribution governing distances between instances in class $C_1$ and the distribution of distances going from instances in $C_1$ to all remaining instances. To allow our procedure to effectively discern between these two distributions, we have to restrict the extent of the potential ‘overlap’ between them to less than 50% of their respective areas (see Figure 4 in Section 3.2.2 below for an illustration of this point). Accordingly, and in light of Assumption 2.1 and (12), we require that this cut-off point $t^*$ between these two distributions be strictly greater than the median of $\bar{G}$, namely,

Figure 4.

A histogram of all distances within $C_1$ (red/left) and between instances in $C_1$ and those in $C_2$ (blue/right) for p = 0, $N_1 = 500$, n = 100, and $\rho_{12} = \rho_2 = 0.2$. The lines are generated from a normal distribution with mean and standard deviation matched to the data.

Assumption 2.2 Restricted Overlap —

We assume that $t^* > \bar{G}^{-1}(1/2)$ in (14), so that $\tau = \bar{G}(t^*) > 0.5$.

2.2. The testing procedure

2.2.1. Constructing the test

Consider at first class C1 and the N1 instances initially assigned to it. Based on the available data, we are interested in constructing a testing procedure for determining whether or not a given instance that was assigned to class C1 should be retained or be removed from it (and potentially be reassigned to a different class). That is, for each selected instance from the list i{1,2,,N1} of instances labeled C1, we consider the statistical test of the hypothesis

$H_0^{(i)}: i \in C_1 \quad (\text{the initial label is correct})$ (15)

against

$H_1^{(i)}: i \notin C_1 \quad (\text{the initial label is incorrect}),$ (16)

for $i = 1, \ldots, N_1$. The final result of these successive $N_1$ hypothesis tests is the set of all those instances in $C_1$ for which $H_0^{(i)}: i \in C_1$ was rejected, thus providing the set of those instances in $C_1$ which were deemed to have been mislabeled. As we will see below, the successive testing procedure we propose is constructed so as to control the maximal probability of a type I error, while minimizing the probability of a type II error.

Towards that end, define

$Z_i \equiv \sum_{j \in C_1, j \ne i} I[d_{ij} \le t^*] = \sum_{j=1, j \ne i}^{N_1} I[d_{ij} \le t^*]$ (17)

for each $i \in \{1, 2, \ldots, N_1\}$, where $t^*$ is defined by (14) and $I[A]$ is the indicator function of the set A. In light of the relation (12), $Z_i$ will serve as a test statistic for the above hypotheses. The distribution of $Z_i$ under both the null and alternative hypotheses can be explicitly described via Binomial random variables, as is presented in the following lemma (proof omitted).

Lemma 2.1

Let $Z_i$ be as defined in (17) above with $i = 1, 2, \ldots, N_1$; then

  1. if $i \in C_1$, we have $Z_i \mid H_0^{(i)} \sim \mathrm{Bin}(N_1 - 1, \bar{G}(t^*)) \equiv \mathrm{Bin}(N_1 - 1, \tau)$;

  2. if $i \notin C_1$, we have $Z_i \mid H_1^{(i)} \sim \mathrm{Bin}(N_1 - 1, \bar{F}(t^*)) \equiv \mathrm{Bin}(N_1 - 1, 1 - \tau)$.

Accordingly, the statistical test we propose will reject the null hypothesis $H_0^{(i)}: i \in C_1$ in favor of $H_1^{(i)}: i \notin C_1$ for small values of $Z_i$, say if $Z_i \le a_\alpha$ for some suitable critical value $a_\alpha$ (to be explicitly determined below) which should satisfy,

$\hat{\alpha} := \Pr(Z_i \le a_\alpha \mid H_0^{(i)}) = \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha,$ (18)

for each $i = 1, \ldots, N_1$ and some fixed (and small) statistical error level $\alpha \in (0, 0.5)$. The constant $a_\alpha$ is the (appropriately calculated) $\alpha$th percentile of the $\mathrm{Bin}(N_1 - 1, \tau)$ distribution. That is, if $b(k, n, p) := \sum_{j=0}^{k} \binom{n}{j} p^j (1 - p)^{n - j}$, $k = 0, \ldots, n$, denotes the c.d.f. of a $\mathrm{Bin}(n, p)$ distribution, then for given $\alpha$ and $\tau$, the value $a_\alpha$ is determined so that

$a_\alpha = \operatorname*{arg\,max}_{k = 0, \ldots, N_1 - 1} \{b(k, N_1 - 1, \tau) \le \alpha\}.$ (19)
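The critical value in (19) is simply the largest k whose $\mathrm{Bin}(N_1 - 1, \tau)$ c.d.f. does not exceed $\alpha$. A minimal stdlib sketch (the function names are ours, not the authors' code):

```python
from math import comb

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def critical_value(n1, tau, alpha):
    """Largest k in {0, ..., N1 - 1} with b(k, N1 - 1, tau) <= alpha (Eq. (19)).
    Returns -1 if even k = 0 exceeds alpha (the test then never rejects)."""
    a = -1
    for k in range(n1):                       # k = 0, ..., N1 - 1
        if binom_cdf(k, n1 - 1, tau) <= alpha:
            a = k
        else:
            break                             # the c.d.f. is nondecreasing in k
    return a
```

For example, with $N_1 = 50$, $\tau = 0.8$, and $\alpha = \alpha_0 / N_1 = 0.05/50$, `critical_value(50, 0.8, 0.001)` returns the largest k whose $\mathrm{Bin}(49, 0.8)$ c.d.f. stays at or below 0.001.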

The final result of this repeated testing procedure is given by the set of all instances in C1 for which H0(i):iC1 was rejected,

$R_\alpha := \{i;\ i = 1, \ldots, N_1 : Z_i \le a_\alpha\},$

providing the set of those instances in $C_1$ for which the binomial threshold is achieved and which therefore have been deemed mislabeled. Similarly,

$A_\alpha := \{i;\ i = 1, \ldots, N_1 : Z_i > a_\alpha\} = \{1, \ldots, N_1\} \setminus R_\alpha,$

provides the set of instances correctly identified in C1. It remains only to determine the optimal value of aα for the test.

2.2.2. Controlling the procedure's type I error

With H0(i) and H1(i) as are given in (15) and (16), let

$H_0 = \bigcap_{i=1}^{N_1} H_0^{(i)} \quad \text{and} \quad \tilde{H}_1 = \bigcup_{i=1}^{N_1} H_1^{(i)}.$ (20)

The hypothesis $H_0$ above states that all the instances in $C_1$ are correctly identified, whereas $\tilde{H}_1$ is the hypothesis that at least one of the instances in $C_1$ is misidentified. We denote by $R = |R_\alpha|$ the cardinality of the set $R_\alpha$,

$R = \sum_{i=1}^{N_1} I[Z_i \le a_\alpha]$

so that R is a random variable taking values over $\{0, 1, \ldots, N_1\}$. Note trivially that $N_1 - R \equiv |A_\alpha|$.

We consider the ‘global’ test which rejects $H_0$ in (20) if for at least one i, $i = 1, \ldots, N_1$, $Z_i \le a_\alpha$, or equivalently, if $\{R > 0\}$. The probability of a type I error associated with this ‘global’ test is therefore

$\alpha^* := \Pr(R > 0 \mid H_0) = \Pr\left(\bigcup_{i=1}^{N_1} \{Z_i \le a_\alpha\} \mid H_0\right) \le \sum_{i=1}^{N_1} \Pr(Z_i \le a_\alpha \mid H_0^{(i)}) = N_1 \hat{\alpha} \le N_1 \alpha,$ (21)

using the Bonferroni inequality, since by (18)–(19), $\hat{\alpha} \le \alpha$. The value of $\alpha^*$ can be controlled by taking $\alpha = \alpha_0 / N_1$ in (19) for some $\alpha_0$, to ensure that $\alpha^* \le \alpha_0$ and that $\hat{\alpha} \le \alpha_0 / N_1$.

Note that if $\{Z_1, Z_2, \ldots, Z_{N_1}\}$ were independent or associated random variables [4], then under $H_0$, $I[Z_i \le a_\alpha] \sim \mathrm{Bin}(1, \hat{\alpha})$, $i = 1, 2, \ldots, N_1$, and $R \sim \mathrm{Bin}(N_1, \hat{\alpha})$. In this case,

$\Pr(R = 0 \mid H_0) = (1 - \hat{\alpha})^{N_1} \ge \left(1 - \frac{\alpha_0}{N_1}\right)^{N_1}.$

It follows for sufficiently large $N_1$ (as $N_1 \to \infty$) that

$\alpha^* = 1 - \Pr(R = 0 \mid H_0) \le 1 - e^{-\alpha_0} < \alpha_0.$ (22)

By Lemma 2.1 (b), the distribution of the test statistic $Z_i$ under the alternative hypothesis, $H_1^{(i)}$, is readily available, so that (similarly to (18)) the probability of the type II error of the test of (15)–(16) can be explicitly calculated as

$\hat{\beta} := \Pr(Z_i > a_\alpha \mid H_1^{(i)}) = \Pr(Z_i > a_\alpha \mid 1 - \tau) = 1 - b(a_\alpha, N_1 - 1, 1 - \tau),$ (23)

for each $i = 1, \ldots, N_1$. Further, whenever the conditions for the normal approximation to the binomial probabilities hold (i.e. $\min((N_1 - 1)\tau, (N_1 - 1)(1 - \tau)) > 5$; see Shader and Schmid [13]), $\hat{\beta}$ in (23) can easily be evaluated as

$\hat{\beta} \approx 1 - \Phi\left(\frac{a_\alpha + 0.5 - (N_1 - 1)(1 - \tau)}{\sqrt{(N_1 - 1)\tau(1 - \tau)}}\right),$

where $\Phi$ denotes the standard normal c.d.f.
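The exact type II error in (23) and its normal approximation can be compared directly; the sketch below (our naming, stdlib only, with $\Phi$ built from `math.erf`) implements both.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def beta_exact(a_alpha, n1, tau):
    """Exact type II error (Eq. (23)): Pr(Z_i > a_alpha), Z_i ~ Bin(N1-1, 1-tau)."""
    return 1.0 - binom_cdf(a_alpha, n1 - 1, 1.0 - tau)

def beta_normal(a_alpha, n1, tau):
    """Normal approximation to (23), with the 0.5 continuity correction."""
    mu = (n1 - 1) * (1.0 - tau)
    sd = sqrt((n1 - 1) * tau * (1.0 - tau))
    z = (a_alpha + 0.5 - mu) / sd
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # 1 - Phi(z)
```

For instance, with $N_1 = 100$ and $\tau = 0.8$ the condition $\min(99\tau, 99(1-\tau)) > 5$ holds, and for a critical value around $a_\alpha = 40$ both expressions give an essentially negligible $\hat{\beta}$.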

Remark 2.1

Under certain circumstances, namely if $a_\alpha \ge (N_1 - 1)/2$, the ‘symmetry’ property of the Binomial distribution about $\tau$ and $1 - \tau$ (with $\tau > 0.5$, per Assumption 2.2) implies that $\hat{\beta} \le \alpha$ in (23). That is, whenever the critical test value $a_\alpha$, as determined by (19) with fixed $\alpha$ and $N_1$, satisfies $a_\alpha \ge a^*$, where $a^* := [(N_1 - 1)/2] + 1$, then our testing procedure has the added feature that the type I error rate $\hat{\alpha}$ (see (18)) is controlled by $\alpha = \alpha_0 / N_1$, which also serves as an upper bound that controls the type II error rate $\hat{\beta}$. Thus, in this circumstance, the probabilities of the type I error and of the type II error are simultaneously controlled by $\alpha = \alpha_0 / N_1$. We summarize this observation in Lemma 2.2. We note, however, that (as can be seen from (19)) the critical test value $a_\alpha$ is intricately dependent on $\alpha$, $N_1$ and $\tau$. Since the values of $\alpha$ and $N_1$ are both fixed by design, the only determining parameter in (19) is $\tau$ (aside from $\tau > 0.5$) to ensure that $a_\alpha \ge (N_1 - 1)/2$ and hence, by Lemma 2.2, $\hat{\beta} \le \alpha$. It can be shown that for this to hold, $\tau$ should be sufficiently far from 0.5, say $\tau \in [\tau^*, 1)$ for some $\tau^* > 0.5$, as determined by

$\tau^* := \operatorname*{arg\,max}_{0.5 < p < 1} \{b(a^*, N_1 - 1, p) \ge \alpha\}.$

However, though available, there is no need to explicitly calculate the value of this threshold $\tau^*$ in order to ascertain whether or not $\hat{\beta} \le \alpha$ in (23); merely verifying that $a_\alpha \ge (N_1 - 1)/2$ is sufficient.

Lemma 2.2

For a given $\alpha < 0.5$, consider the test of $H_0^{(i)}$ versus $H_1^{(i)}$ in (15)–(16) with a critical test value $a_\alpha$ as determined in (18)–(19). Then, if $a_\alpha \ge (N_1 - 1)/2$, we have

$\hat{\beta} = \Pr(Z_i > a_\alpha \mid 1 - \tau) \le \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha.$
Proof.

Indeed, if $a_\alpha \ge (N_1 - 1)/2$, then $N_1 - 1 - a_\alpha \le (N_1 - 1)/2 \le a_\alpha$, and hence,

$\hat{\beta} = \Pr(Z_i > a_\alpha \mid 1 - \tau) = \Pr(Z_i < N_1 - 1 - a_\alpha \mid \tau) \le \Pr(Z_i \le a_\alpha \mid \tau) \le \alpha.$
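The symmetry step of this proof is easy to check numerically: for $Z \sim \mathrm{Bin}(n, 1 - \tau)$, $\Pr(Z > a) = \Pr(W < n - a)$ with $W := n - Z \sim \mathrm{Bin}(n, \tau)$. A quick stdlib check (the specific values of n, $\tau$, and a below are ours, chosen only for illustration):

```python
from math import comb

def binom_cdf(k, n, p):
    """b(k, n, p): c.d.f. of a Bin(n, p) random variable at k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Illustrative values: n plays the role of N1 - 1.
n, tau, a = 49, 0.8, 30            # a >= (N1 - 1)/2 = 24.5, as the lemma requires

# Symmetry: Pr(Z > a | 1 - tau) equals Pr(W <= n - a - 1 | tau).
beta_hat = 1.0 - binom_cdf(a, n, 1.0 - tau)   # type II error, Eq. (23)
mirror = binom_cdf(n - a - 1, n, tau)         # the mirrored tail under tau
alpha_hat = binom_cdf(a, n, tau)              # type I error bound, Eq. (18)
```

Here `beta_hat` and `mirror` agree to floating-point precision, and `beta_hat <= alpha_hat`, as Lemma 2.2 asserts.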

2.3. Estimation

We note that both $\bar{F}$ and $\bar{G}$ are generally unknown (as are $t^*$ and $\psi$), but can easily be estimated non-parametrically from the available data by their respective empirical c.d.f.s, for sufficiently large $N_1$ and $N - N_1$. For each given instance $i \in C_1$,

$\hat{G}_i(t) := \frac{1}{N_1 - 1} \sum_{j \in C_1, j \ne i} I[d_{ij} \le t], \qquad \hat{F}_i(t) := \sum_{k=2}^{K} \frac{N_k}{N - N_1} \hat{F}_i^{(k)}(t),$ (24)

where, for each k=2,3,,K,

$\hat{F}_i^{(k)}(t) := \frac{1}{N_k} \sum_{j \in C_k} I[d_{ij} \le t].$

Clearly, $\hat{G}_i(t)$ and $\hat{F}_i(t)$ are empirical c.d.f.s for estimating, based on the ith instance, $\bar{G}(t)$ and $\bar{F}(t)$, respectively. Accordingly, when combined,

$\hat{\bar{F}}(t) = \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{F}_i(t), \qquad \hat{\bar{G}}(t) = \frac{1}{N_1} \sum_{i=1}^{N_1} \hat{G}_i(t),$ (25)

are the estimators of $\bar{F}(t)$ and $\bar{G}(t)$, respectively. Further, in similarity to (13), we set

$\hat{\psi}(t) := \hat{\bar{G}}^{-1}(1 - \hat{\bar{F}}(t)),$ (26)

and we let $\hat{t}$ denote the ‘solution’ of $\hat{\psi}(t_c) = t_c$; that is,

$\hat{t} := \inf_t \{\hat{\psi}(t) \le t\}.$ (27)

Clearly, the value of τ in (14) would be estimated by

$\hat{\tau} = \hat{\bar{G}}(\hat{t}).$ (28)

Note that, in view of (24), $N_1 \hat{\tau} \equiv \sum_{i=1}^{N_1} \hat{\tau}_i$, with $\hat{\tau}_i \equiv \hat{G}_i(\hat{t})$ for $i = 1, 2, \ldots, N_1$. With $\hat{t}$ as an estimate of $t^*$ in (14), we have that $Z_i \equiv (N_1 - 1) \hat{G}_i(\hat{t})$ and $\hat{\tau}_i \equiv Z_i / (N_1 - 1)$, and therefore an equivalent expression for $\hat{\tau}$ is

$\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{N_1} \frac{Z_i}{N_1 - 1}.$ (29)

By Lemma 2.1 (a), $E[Z_i \mid H_0^{(i)}] = (N_1 - 1)\tau$ and hence $\hat{\tau}$ in (29) is an unbiased estimator of $\tau$: $E[\hat{\tau} \mid H_0] = \tau$.
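Operationally, (24)–(28) reduce to empirical c.d.f.s of the pooled within-class and between-class distances. The sketch below is our rendering, not the authors' code; the empirical-quantile convention used for $\hat{\bar{G}}^{-1}$ is one of several reasonable choices. It scans the observed distances for the smallest t with $\hat{\psi}(t) \le t$, per (27).

```python
import bisect

def estimate_cutoff(within, between):
    """Estimate (t_hat, tau_hat) per Eqs. (26)-(28): t_hat is the smallest
    observed distance t with psi_hat(t) = G_hat^{-1}(1 - F_hat(t)) <= t,
    and tau_hat = G_hat(t_hat).
    `within`:  pooled within-class distances (estimates G-bar),
    `between`: pooled between-class distances (estimates F-bar)."""
    w = sorted(within)
    b = sorted(between)

    def G(t):                       # empirical c.d.f. of within-class distances
        return bisect.bisect_right(w, t) / len(w)

    def F(t):                       # empirical c.d.f. of between-class distances
        return bisect.bisect_right(b, t) / len(b)

    def G_inv(q):                   # empirical quantile (one common convention)
        idx = max(0, min(len(w) - 1, int(q * len(w)) - 1))
        return w[idx]

    for t in sorted(w + b):         # candidate cut-offs: all observed distances
        if G_inv(1.0 - F(t)) <= t:  # psi_hat(t) <= t, cf. Eq. (27)
            return t, G(t)
    return w[-1], 1.0               # degenerate fallback: no crossing found
```

With well-separated classes (all within-distances below all between-distances), the crossing lands at the largest within-class distance and $\hat{\tau} = 1$, matching the nearly-1 estimates seen for the separated scenario in Table 1.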

2.4. An illustration

To help visualize the procedure, consider a simulated data set consisting of two features ($x = (x_1, x_2)$) grouped into three classes of 500 observations each. These are shown as blue closed circles (the 'blue' cluster), gray Xs (the 'gray' cluster), and red open circles (the 'red' cluster). Three such scenarios are depicted in Figure 1, where each point's coordinates have been generated using a bivariate normal distribution (so that n = 2 and $N_1 = N_2 = N_3 = 500$). For the purpose of the illustration, we will focus on one specific point, labeled P in the figure, but the procedure is applied simultaneously to all points in each class, and can be iterated across all three classes. To validate whether or not the initial identification of P as belonging to the blue cluster is correct, we count how many other blue instances fall within $\hat{t}$ units of P, utilizing the Euclidean distance metric. In the two-dimensional Euclidean space, this can be represented using a circle around P, thus providing a clear visualization of the algorithm's mechanics.

Figure 1.

An illustration of how the proposed algorithm works in a simple (two-dimensional) case. The three scenarios vary by the amount of overlap between the different clusters, which impacts the values of $t^*$ and $\tau$, and the subsequent estimate of each.

The value of $\hat{t}$ (obtained from (27)) is based on both the distribution of within-class distances, G, and between-class distances, F, as estimated by (25). It can be verified that in these three scenarios, Assumption 2.1 holds for the stochastic dominance of the between-class distances over the within-class distances. Consequently, the procedure produces a contextually self-adjusting bound ($\hat{t}$) around P (or any other blue point), within which most of the other points assigned to the same cluster are likely to fall if the point is identified correctly.

Using an alternative bound (say $t' < \hat{t}$), the number of blue points falling within $t'$ units of P is no greater than the number found within $\hat{t}$ units, and is likely smaller. However, the estimated probability of falling within $t'$ units of P is also smaller than (or at least no greater than) $\hat{\tau}$. Thus, fewer blue points must be within $t'$ units in order to successfully validate the inclusion of P into the blue cluster. The choice of threshold thus determines whether P must be extremely close to a small number of blue points, or moderately close to a large number of blue points. As seen in Figure 1, the value of $\hat{t}$ as calculated by this algorithm tends to fall in the latter category, although it is affected by the extent of the overlap between the clusters. The calculated values of $\hat{t}$, $\hat{\tau}$, and $a_\alpha$, obtained from applying the testing procedure in (15)–(16) to all three classes with $\alpha = 0.05/500$ in (19), are shown in Table 1. We note that in all cases $\hat{\tau} > 0.5$ (Assumption 2.2) and $a_\alpha > 250$ (Remark 2.1).

Table 1.

The outcome of running the algorithm on the three scenarios in Figure 1.

Scenario Class $a_\alpha$ $\hat{t}$ $\hat{\tau}$
A Blue 355 2.51 0.784
  Red 375 2.63 0.819
  Black 402 2.81 0.866
B Blue 382 2.71 0.832
  Red 383 2.69 0.833
  Black 489 4.47 0.996
C Blue 472 3.89 0.977
  Red 472 3.91 0.977
  Black 489 4.51 0.996

The calculated value of $\hat{t}$, and hence those of $\hat{\tau}$ and $a_\alpha$, are greatly affected by the amount of overlap between classes. In Figure 1(c), the clear separation between the three classes causes the respective estimates of $\tau$ to be nearly 1, as seen in Table 1, Scenario C. By shifting the red class closer to the blue class (i.e. Figure 1(b) and Table 1, Scenario B), the values of $\hat{t}$ and $\hat{\tau}$ both decrease, as expected, for these two classes.

3. A simulation study

Our simulation study is designed to mimic the conditions of a LC-MS/MS shotgun proteomics study. In this light, we consider a set-up in which N1 instances (peptides) belong to class/protein C1, and N2 instances model the peptides belonging to any other class/protein. The distance between instances is measured using correlation distance, again mimicking a common way to measure similarity between peptides, although similar results can be obtained using other (quasi-) distance metrics (e.g. Euclidean distance).

3.1. The simulation setup

To establish some notation, suppose that $y = (y_1, y_2, \ldots, y_N)^{\top}$ is an $N \times 1$ random vector having some joint distribution $H_N$. We assume, without loss of generality, that the values of y are standardized, so that $E(y_i) = 0$ and $V(y_i) = 1$ for each $i = 1, 2, \ldots, N$. We denote by D the corresponding correlation (covariance) matrix for y, $D = \mathrm{cor}(y, y)$. To simplify, we assume that K = 2, so that the N instances are presumed to belong to either class $C_1$ or $C_2$. Accordingly, we partition y and D as $y = [y_1^{\top}, y_2^{\top}]^{\top}$, with $y_1 = (y_{1,1}, y_{1,2}, \ldots, y_{1,N_1})$ and $y_2 = (y_{2,N_1+1}, y_{2,N_1+2}, \ldots, y_{2,N_1+N_2})$, $N_1 + N_2 = N$, and

$D = \begin{bmatrix} D_{1,1} & D_{1,2} \\ D_{2,1} & D_{2,2} \end{bmatrix},$

with $D_{k,\ell} = \mathrm{cor}(y_k, y_\ell)$, $k, \ell = 1, 2$.

As in Section 2, let X denote the $n \times N$ data matrix of the observed intensities. We denote by $x_j := (x_{j1}, x_{j2}, \ldots, x_{jN})$ the jth row of X, $j = 1, 2, \ldots, n$, and we assume that $x_1, x_2, \ldots, x_n$ are independent and identically distributed as $y \sim H_N$.

Using standard notation, we write $1_n = (1, 1, \ldots, 1)^{\top}$, $I_n$ for the $n \times n$ identity matrix, and $J_n = 1_n 1_n^{\top}$ for the $n \times n$ matrix of 1s. For the simulation studies we conducted, we took $H_N$ to be the N-variate normal distribution, so that

$x_j \sim N_N(0, D), \quad j = 1, 2, \ldots, n, \quad \text{i.i.d.},$

where

$D_{1,1} = (1 - \rho_1) I_{N_1} + \rho_1 J_{N_1},$ (30)
$D_{2,2} = (1 - \rho_2) I_{N_2} + \rho_2 J_{N_2},$ (31)
$D_{2,1} = D_{1,2}^{\top} = \rho_{12} 1_{N_2} 1_{N_1}^{\top},$ (32)

for $\rho = (\rho_1, \rho_{12}, \rho_2)$ with $0 \le \rho_{12} \le \rho_2 \le \rho_1 < 1$.

To allow for misclassification of instances, we included, for a certain proportion p, some $m := [pN_1]$ of the $N_1$ ‘observed’ instance intensities from $C_1$ that were actually simulated with $D_{1,1}$ replaced by $D^*_{1,1} = (1 - \rho_2) I_{N_1} + \rho_2 J_{N_1}$ in (30) above. Thus, m is the number of misclassified instances among the $N_1$ instances that were initially labeled as belonging to $C_1$.
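Under the stated block structure, data with the covariances of (30)–(32) can be drawn without forming D explicitly, via a one-factor-per-class construction (our sketch of the design, not the authors' code; the handling of the mislabeled block, which gets its own class factor so that it has correlation $\rho_2$ internally and $\rho_{12}$ with everything else, is our reading of the design):

```python
import math
import random

def simulate_data(n, n1, n2, rho1, rho12, rho2, p=0.0, seed=0):
    """Draw n i.i.d. samples of an N-vector (N = n1 + n2) with compound-symmetric
    blocks: within-C1 correlation rho1, within-C2 correlation rho2, cross-class
    correlation rho12. The last m = [p * n1] instances of C1 are 'mislabeled':
    they share their own factor (correlation rho2 among themselves) but only the
    global factor (rho12) with everything else. Requires rho1 >= rho2 >= rho12."""
    rng = random.Random(seed)
    m = int(p * n1)                       # number of mislabeled instances
    rows = []
    for _ in range(n):
        g = rng.gauss(0, 1)               # global factor -> rho12 across groups
        c1, c2, cm = (rng.gauss(0, 1) for _ in range(3))  # class/mislabel factors
        row = []
        for i in range(n1 + n2):
            if i < n1 - m:                # correctly labeled C1 instance
                rho, c = rho1, c1
            elif i < n1:                  # mislabeled instance carried in C1
                rho, c = rho2, cm
            else:                         # C2 instance
                rho, c = rho2, c2
            # Variance components sum to 1: rho12 + (rho - rho12) + (1 - rho).
            row.append(math.sqrt(rho12) * g
                       + math.sqrt(rho - rho12) * c
                       + math.sqrt(1.0 - rho) * rng.gauss(0, 1))
        rows.append(row)
    return rows                           # the n x N data matrix X
```

Each entry has mean 0 and variance 1, and sharing the factors g and c induces exactly the pairwise correlations of (30)–(32).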

Remark 3.1

In this simulation, $\rho_2$ reflects (as a proxy) the common characteristics of two mislabeled instances. When $\rho_2 = \rho_1$, distances between two mislabeled instances have the same distribution as distances between two correctly labeled instances, as would be the case for binary classification. When $\rho_2 = \rho_{12}$, distances between two mislabeled instances have the same distribution as distances between a correctly labeled instance and a mislabeled instance. This would be the case if the probability that two mislabeled instances come from the same class is zero.

For each simulation run, we recorded $\hat{\tau}$, $\hat{t}$, and counted the number of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), as defined in Table 2. From these data, we calculated the sensitivity, specificity, false discovery proportion (FDP), false non-discovery proportion (FNP), and percent reduction in FNP ($\%\Delta$) for each run, defined as follows:

$\text{Sensitivity} = \frac{TP}{TP + FN}$ — proportion of correctly removed instances out of all mislabeled instances.
$\text{Specificity} = \frac{TN}{TN + FP}$ — proportion of correctly retained instances out of all correctly labeled instances.
$\text{FDP} = \frac{FP}{\max(TP + FP, 1)}$ — proportion of incorrectly removed instances out of all those removed.
$\text{FNP} = \frac{FN}{\max(TN + FN, 1)}$ — proportion of incorrectly retained instances out of all those retained.
$\%\Delta = \left(1 - \frac{FNP}{p}\right) \times 100$ — percent reduction in FNP relative to p.

Table 2.

For a single run, each instance has one of four possible outcomes.

                       Truth
           Correctly labeled   Mislabeled    Total
Result  Keep       TN              FN       $N_1 - R$
        Remove     FP              TP       $R$
        Total   $N_1 - m$          $m$      $N_1$

The notation for the total count of instances with each of these outcomes in a single run.

Each statistic was averaged over all 1000 runs.
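Given the four counts of Table 2, the per-run statistics above are direct ratios; a small helper (our naming, not the authors' code) makes the definitions concrete:

```python
def run_metrics(tp, tn, fp, fn, p):
    """Per-run performance measures for the filtering test.
    tp/tn/fp/fn are the counts of Table 2; p is the mislabeling proportion."""
    nan = float('nan')
    sens = tp / (tp + fn) if (tp + fn) > 0 else nan   # undefined when m = 0
    spec = tn / (tn + fp)
    fdp = fp / max(tp + fp, 1)                        # 0 when nothing is removed
    fnp = fn / max(tn + fn, 1)
    pct = (1 - fnp / p) * 100 if p > 0 else nan       # %-reduction vs. p
    return sens, spec, fdp, fnp, pct
```

For example, with $N_1 = 100$, $m = 10$ and counts TP = 8, FN = 2, TN = 85, FP = 5, the sensitivity is 0.8, the specificity 85/90, the FDP 5/13, and the FNP 2/87.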

3.2. Simulation results

We conducted B = 1000 simulation runs of the test procedure with n = 10, 25, 50, 75, 100, 250, 500, 700; $N_1 = 25$, 50, 100, 500; $N_2 = 1000$; $\alpha_0 = 0.05$; and $\rho = (\rho_1 = 0.5, \rho_{12} = 0.1, \rho_2 = 0.5)$, (0.5, 0.1, 0.1), (0.5, 0.2, 0.5), and (0.5, 0.2, 0.2). We also varied the value of p, the proportion of mislabeled instances, such that p = 0.0, 0.05, 0.1, 0.2, 0.25. In particular, p = 0 means no mislabeling, i.e. that the initial labeling is perfect.

3.2.1. Results with no mislabeling (case p = 0)

Simulations where p = 0 (i.e. no mislabeling) were used to illustrate the theoretical assumptions of the testing procedure, as this presents a case in which the global null hypothesis in (20) holds. In this case, in particular, the FDP measures the proportion of incorrectly rejected hypotheses in (15) out of all rejected hypothesis tests, given that at least one hypothesis test was rejected.

Remark 3.2

For p = 0, every rejected hypothesis test in (15) is incorrectly rejected. Thus, the FDP is 1 if any rejected hypothesis tests are observed, and zero otherwise (by the definition of the FDP). Consequently, the average value of the FDP over all B runs provides an estimate of the global type I error probability $\alpha^*$.

Figure 2 and Table 3 show the FDP for various n, N1, and ρ12. As seen in the figures, the FDP converges to zero as n increases for all values of N1, but the convergence slows as N1 increases. When n is small, at least one instance was removed from almost all classes. We attribute this behavior to the inherent correlation structure of the data. This will be explored further in Section 3.2.2.

Figure 2.

The FDP as a function of n and $N_1$ for (a) $\rho_{12} = 0.1$ and (b) $\rho_{12} = 0.2$, where p = 0.

Table 3.

Simulation results for p = 0, ρ12=ρ2=0.2, and α0=0.05.

  N1=25 N1=50 N1=100 N1=500
n FDP Spec. FDP Spec. FDP Spec. FDP Spec.
10 0.916 0.929 1.000 0.880 1.000 0.834 1.000 0.737
25 0.843 0.947 0.999 0.904 1.000 0.860 1.000 0.766
50 0.673 0.965 0.993 0.933 1.000 0.894 1.000 0.807
75 0.461 0.979 0.969 0.952 1.000 0.921 1.000 0.841
100 0.298 0.987 0.879 0.967 0.999 0.943 1.000 0.872
250 0.001 1.000 0.060 0.999 0.414 0.995 0.999 0.977
500 0.000 1.000 0.000 1.000 0.000 1.000 0.089 1.000
700 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000

The sensitivity and FNP are excluded from the table because they are either undefined or constant when p = 0. As explained in Remark 3.2, the FDP provides an estimate of α when p = 0.

The specificity measures the ability to retain instances which are correctly labeled. From Figure 3 and Table 3, it can be seen that while almost every run removed at least one instance (as reflected by the FDP), most instances were retained. With n = 10, ρ12=ρ2=0.2, and N1=25, an average of 23.2 out of 25 instances were retained in each run. This retention rate decreased as N1 increased, mirroring the slower convergence of the FDP for larger N1; even so, for n = 10, ρ12=ρ2=0.2, and N1=500, an average of 368.5 of the 500 instances were retained.

Figure 3. The specificity as a function of n and N1 for (a) ρ12=0.1 and (b) ρ12=0.2 where p = m = 0.


3.2.2. Behavior under artificially constructed independence

In light of Remark 3.2 and the likely impact of the correlation structure present in the data on the FDP, we designed a simulation study to explore this effect. In this study, the ‘distance’ matrices were artificially created in a manner which preserved dependence due to symmetry but removed all other dependencies across distances.

To simulate ‘distance’ matrices in this case, we began by randomly generating a distance matrix using the original test procedure with p = 0, n = 100, N1=500, N2=1000, and ρ=(0.5,0.2,0.2). A normal distribution was fit to the within-C1 distances and to the C1-to-C2 distances, as shown in Figure 4. For N1=25, 50, 100, 500, 1000, 2000, 5000, and 12,000 and N2=1000, these fitted normal distributions were used to generate B new distance matrices by drawing each entry d_ij as follows:

d_ij = 0                           if i = j,
d_ij ~ N(μ = 0.523, σ = 0.0684)    if 1 ≤ i < j ≤ N1,
d_ij ~ N(μ = 0.771, σ = 0.0903)    if 1 ≤ i ≤ N1 < j ≤ N,
d_ij = d_ji                        if i > j.
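A minimal NumPy sketch of this construction (the function and parameter names are ours): the upper triangle is drawn independently from the two fitted normal distributions and mirrored, so that only the symmetry constraint d_ji = d_ij induces dependence.

```python
import numpy as np

def simulate_distance_matrix(N1, N2, rng,
                             mu_within=0.523, sd_within=0.0684,
                             mu_between=0.771, sd_between=0.0903):
    """Symmetric 'distance' matrix with zero diagonal, drawn entrywise."""
    N = N1 + N2
    D = np.zeros((N, N))
    # Within-C1 distances: upper triangle of the C1 block (1 <= i < j <= N1)
    iu1 = np.triu_indices(N1, k=1)
    D[iu1] = rng.normal(mu_within, sd_within, size=len(iu1[0]))
    # C1-to-C2 distances: 1 <= i <= N1 < j <= N
    D[:N1, N1:] = rng.normal(mu_between, sd_between, size=(N1, N2))
    # Distances within C2 are not used by the test and remain zero.
    return D + D.T  # enforce symmetry: d_ji = d_ij

D = simulate_distance_matrix(25, 1000, np.random.default_rng(0))
```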

Figure 5 shows the results of this procedure for five sets of B = 1000 runs. As N1 increased, the FDP also increased, but never surpassed the theoretical limit of 1 − e^(−0.05) ≈ 0.049, consistent with the theoretical result given in (22).

Figure 5. Direct simulation of the distance matrix using independent draws from the fitted normal distributions with p = 0, performed as five batches of B = 1000 runs each. The average across all five batches is shown as squares. The line at the top of the plot shows 1 − e^(−α0) for α0=0.05.

3.2.3. Results under some mislabeling, (case p>0)

To provide context for the results with some mislabeling (i.e. p>0), first consider a trivial filtering procedure which retains all N1 instances in C1. Since all pN1 incorrect instances are retained, the FNP of this trivial procedure is pN1/N1 = p; in other words, an FNP of p is achievable with no filtering at all, so a useful testing procedure must achieve an FNP below p. When p>0, the FDP gives the proportion of correctly labeled instances among those removed. However, in the context of proteomics, where the set of retained peptides is used in subsequent analyses, we found the specificity to be a more relevant metric. Consequently, the FNP and specificity are the primary statistics used to evaluate our proposed testing procedure in the presence of labeling errors (p>0), with %Δ providing a standardized measure of the decrease in FNP that does not depend on p.
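These evaluation metrics can be computed from two boolean vectors over the N1 instances of C1. The helper below is a sketch with our own naming; %Δ is taken to be the percent reduction in FNP relative to p, consistent with the tabulated values, and the FNP is computed over the retained instances.

```python
import numpy as np

def evaluation_metrics(mislabeled, removed):
    """mislabeled, removed: boolean arrays over the N1 instances of C1.

    FNP: proportion of mislabeled instances among those retained.
    FDP: proportion of correctly labeled instances among those removed.
    """
    mislabeled = np.asarray(mislabeled, bool)
    removed = np.asarray(removed, bool)
    kept = ~removed
    p = mislabeled.mean()
    fnp = (mislabeled & kept).sum() / max(kept.sum(), 1)
    fdp = (~mislabeled & removed).sum() / max(removed.sum(), 1)
    sens = (mislabeled & removed).sum() / max(mislabeled.sum(), 1)
    spec = (~mislabeled & kept).sum() / max((~mislabeled).sum(), 1)
    pct_delta = 100 * (p - fnp) / p if p > 0 else float("nan")
    return dict(FNP=fnp, FDP=fdp, sens=sens, spec=spec, pct_delta=pct_delta)

# Trivial procedure: remove nothing, so FNP equals p and %Delta = 0
m = evaluation_metrics([True] * 5 + [False] * 95, [False] * 100)
print(m["FNP"], m["pct_delta"])  # 0.05 0.0
```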

Figure 6 provides the average FNP and %Δ over the B = 1000 simulation runs as a function of n for ρ12=0.2 and varying combinations of N1, ρ2, and p. In all cases, the procedure reduced the FNP relative to p (that is, %Δ>0 for all results). Small values of n produced the smallest reduction (highest FNP), but the FNP converged at approximately n = 100 to a value dependent on p, N1, and ρ2. Table 4 shows how the average FNP compares between small sample sizes (n = 10) and large sample sizes (averaged across n = 250, 500, and 700) at each value of p, N1, and ρ2. For p = 0.05, the FNP converged to 0 for all values of N1 and ρ2. For higher values of p, decreasing N1 and decreasing ρ2 led to a higher FNP for n ≥ 250. For example, when p = 0.25 and N1=500, the instances remaining in the class after filtering still include 10.26% (%Δ=59.0) mislabeled instances for n = 10 and 2.85% (%Δ=88.6) mislabeled instances when n is large. Both are substantial decreases from the 25% of mislabeled instances in the unfiltered data.

Figure 6. The FNP as a function of n and N1 for p = 0.05, 0.10, 0.20, and 0.25 with ρ12=0.2.

Table 4.

The mean FNP and %Δ at n = 10 and n ≥ 250 for each combination of p, N1, and ρ2 at ρ12=0.2.

    ρ2=0.2 ρ2=0.5
    n = 10 n ≥ 250 n = 10 n ≥ 250
p N1 FNP %Δ FNP %Δ FNP %Δ FNP %Δ
0.05 25 0.0169 66.2 0.0000 100.0 0.0168 66.4 0.0000 100.0
  50 0.0131 73.8 0.0000 100.0 0.0135 72.9 0.0000 100.0
  100 0.0150 69.9 0.0000 100.0 0.0137 72.6 0.0000 100.0
  500 0.0110 78.0 0.0000 100.0 0.0116 76.8 0.0000 100.0
0.10 25 0.0361 63.9 0.0007 99.3 0.0388 61.2 0.0000 100.0
  50 0.0394 60.6 0.0015 98.5 0.0429 57.1 0.0001 99.9
  100 0.0325 67.5 0.0009 99.1 0.0363 63.7 0.0001 99.9
  500 0.0255 74.5 0.0004 99.6 0.0298 70.2 0.0000 100.0
0.15 25 0.0585 61.0 0.0047 96.8 0.0635 57.7 0.0006 99.6
  50 0.0637 57.5 0.0068 95.5 0.0694 53.8 0.0011 99.3
  100 0.0585 61.0 0.0061 95.9 0.0659 56.1 0.0010 99.3
  500 0.0446 70.3 0.0032 97.8 0.0511 66.0 0.0003 99.8
0.20 25 0.1254 37.3 0.0468 76.6 0.1427 28.6 0.0249 87.6
  50 0.1042 47.9 0.0300 85.0 0.1229 38.6 0.0108 94.6
  100 0.0922 53.9 0.0214 89.3 0.1069 46.6 0.0057 97.2
  500 0.0707 64.6 0.0115 94.2 0.0852 57.4 0.0021 98.9
0.25 25 0.1612 35.5 0.0854 65.8 0.1891 24.4 0.0678 72.9
  50 0.1399 44.0 0.0574 77.1 0.1657 33.7 0.0304 87.8
  100 0.1303 47.9 0.0485 80.6 0.1584 36.6 0.0205 91.8
  500 0.1026 59.0 0.0285 88.6 0.1292 48.3 0.0078 96.9

Figure 7 provides the average specificity over the 1000 simulation runs as a function of n for ρ12=0.2 and varying combinations of N1, ρ2, and p. The specificity always converged to 1 as n increased, with convergence by n = 250 in all cases; for larger values of p, convergence was faster (by n = 100). For small values of n, a higher specificity was observed when ρ2 and N1 were smaller, although even in the worst case the specificity exceeded 0.75.

Figure 7. The specificity as a function of n and N1 for p = 0.05, 0.10, 0.20, and 0.25 with ρ12=0.2.

Table 5 gives the average estimate of the FDP, FNP, %Δ, sensitivity, and specificity in the case where n = 50 and ρ12=0.2 across all combinations of p and N1. As already noted above, the table shows that the FNP increases with p and decreases as N1 increases, with the sensitivity moving in the opposite direction. The FDP gives the proportion of correctly labeled instances among all the removed instances. This measure increases as a function of N1, corresponding to the decrease in the specificity of the procedure: more correctly labeled instances are filtered out. On the other hand, it decreases as a function of p due to the increased proportion of mislabeled instances available to be removed.

Table 5.

The FDP, FNP, sensitivity (Sens.), and specificity (Spec.) of the algorithm using simulated data with n = 50 and ρ12=0.2.

    ρ2=0.2 ρ2=0.5
p N1 FDP FNP %Δ Sens. Spec. FDP FNP %Δ Sens. Spec.
0.00 25 0.673 0.000     0.965 0.630 0.000     0.969
  50 0.993 0.000     0.933 0.980 0.000     0.936
  100 1.000 0.000     0.894 1.000 0.000     0.900
  500 1.000 0.000     0.807 1.000 0.000     0.813
0.05 25 0.131 0.000 99.242 0.991 0.989 0.138 0.001 98.912 0.987 0.988
  50 0.310 0.000 99.305 0.992 0.975 0.306 0.000 99.745 0.997 0.973
  100 0.349 0.000 99.384 0.994 0.965 0.372 0.000 99.422 0.995 0.959
  500 0.542 0.000 99.763 0.998 0.926 0.559 0.000 99.819 0.999 0.913
0.10 25 0.035 0.003 96.689 0.961 0.996 0.049 0.002 97.840 0.974 0.994
  50 0.046 0.003 96.583 0.969 0.994 0.080 0.002 97.726 0.980 0.989
  100 0.086 0.002 97.793 0.980 0.988 0.144 0.001 98.609 0.988 0.978
  500 0.197 0.001 99.018 0.992 0.968 0.268 0.001 99.431 0.995 0.949
0.15 25 0.013 0.010 93.127 0.921 0.998 0.028 0.008 94.708 0.940 0.996
  50 0.016 0.011 92.580 0.930 0.997 0.040 0.007 95.421 0.957 0.993
  100 0.027 0.009 93.669 0.946 0.995 0.072 0.006 95.928 0.966 0.985
  500 0.069 0.005 96.927 0.974 0.986 0.145 0.003 98.166 0.986 0.965
0.20 25 0.001 0.055 72.335 0.760 1.000 0.009 0.050 75.199 0.784 0.999
  50 0.006 0.037 81.397 0.844 0.999 0.027 0.030 84.936 0.874 0.994
  100 0.011 0.026 87.029 0.893 0.997 0.042 0.018 91.068 0.928 0.989
  500 0.026 0.014 93.150 0.945 0.993 0.094 0.009 95.687 0.967 0.971
0.25 25 0.000 0.094 62.489 0.668 1.000 0.017 0.105 58.032 0.619 0.998
  50 0.003 0.065 74.092 0.779 0.999 0.022 0.060 75.990 0.793 0.996
  100 0.005 0.053 78.976 0.833 0.999 0.032 0.046 81.705 0.856 0.991
  500 0.013 0.031 87.412 0.903 0.996 0.067 0.022 91.073 0.934 0.976

The sensitivity of the procedure, as discussed above, gives the proportion of mislabeled instances that are detected out of all mislabeled instances. This measure increases with N1 and decreases with p, mirroring the FNP estimate. For example, when p = 0.20, N1=500, and ρ2=0.2, the data consist of 100 mislabeled instances and 400 correctly labeled instances. On average, the procedure removed 94.5% of mislabeled instances and only 0.7% of correctly labeled instances, based on the reported sensitivity and specificity. Thus, the resulting filtered data set has an average of 5.5 mislabeled instances and 397.2 correctly labeled instances, for an average of 402.7 total instances. This reflects a decrease in the proportion of mislabeled instances by 93.15%: while 20% of the original data set was mislabeled, only 1.4% of the filtered data set remains mislabeled.
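The arithmetic in this example can be verified directly from the reported sensitivity and specificity (the small discrepancy from the tabulated %Δ = 93.150 comes from rounding those inputs):

```python
N1, p = 500, 0.20
sens, spec = 0.945, 0.993           # reported sensitivity and specificity

mislabeled = p * N1                  # 100 mislabeled instances
correct = N1 - mislabeled            # 400 correctly labeled instances
mis_kept = mislabeled * (1 - sens)   # 5.5 mislabeled instances retained
cor_kept = correct * spec            # 397.2 correctly labeled instances retained

fnp = mis_kept / (mis_kept + cor_kept)   # proportion mislabeled after filtering
pct_delta = 100 * (p - fnp) / p          # percent reduction in FNP

print(round(100 * fnp, 1), round(pct_delta, 1))  # 1.4 93.2
```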

4. Proteomics case study

To illustrate the applicability of our algorithm on a real data set, we applied it to a large observational study consisting of measured peptide intensities from 120 different serum samples collected from children with sickle cell disease. This study included additional runs (injections) to experimentally assess whether each peptide was quantified accurately and, for five specific proteins, whether the peptides were matched to the correct protein. A complete description of the methodology used to generate this verification data and to determine whether the protein label was correct for the peptides from these five proteins can be found in Key [7].

Briefly, the so-called ‘spike-in subset’ consists of eight aliquots from a separate study in which protein concentration measurements were available: five in which a single protein was spiked at high intensity plus three controls, as seen in Table 6. Correctly identified peptides from the spiked proteins should have an artificially high intensity in the corresponding aliquot, while misidentified peptides lack this intensity ‘spike’ and thus appear ‘unspiked’. Unfortunately, this is not a ‘gold standard’ for whether or not every peptide accurately reflects the abundance of the protein. Commercially produced proteins used in the spike-in subset likely have less sequence and/or post-translational modification variability than the human population, so some correctly identified peptide sequences are not overly abundant in the spiked samples. Conversely, because the procedures used to identify a peptide and to assess its abundance are performed on different data [15], some peptides can be identified correctly (with support from the spike-in data set) yet have a very low signal-to-noise ratio and thus poorly reflect the protein's abundance. Consequently, a manual determination of peptide accuracy (incorporating both abundance and labeling) was made based on heat maps of each of the five spiked proteins. This created a pair of decision rules that could be used as an objective (albeit imperfect) set of benchmarks for the purpose of evaluating the algorithm. In Key [7], they are further used to compare the performance of the proposed algorithm against other available algorithms.

Table 6.

The original quantity of the five spike-in proteins in 25 μl of each sample in the spike-in subset, as measured by nephelometry, as well as the quantity of protein added to generate the spiked sample.

  Pre-depletion (μg/25 μl)  
Symbol s1 s2 s3 Spike quantity (μg/25 μl)
A1AG1 5.53 21.18 29.50 125
APOA1 20.23 30.00 41.25 100
APOB 9.30 10.50 24.75 167
CERU 3.00 7.13 6.03 38
HEMO 9.48 20.85 22.38 68

Proteins were spiked into sample s1.

4.1. Results

Results for the five proteins for which spike data are available are shown in Table 7. All five proteins showed at least a 25% reduction in the proportion of false positives, regardless of the benchmark against which they were compared. In addition, the specificity was at least 0.84 across all five proteins: the amount of poorly quantified or misidentified peptides was decreased while peptides carrying the spike signal were generally retained.

Table 7.

The estimated FNP and %Δ of the algorithm using the spike and manual benchmarks as a stand-in for the true peptide status.

  Spike Manual
Protein FNP Sens. Spec. %Δ FNP Sens. Spec. %Δ
HEMO 0.05 (0.06) 0.33 0.91 −26 0.20 (0.29) 0.36 1.00 −28
A1AG1 0.11 (0.18) 0.50 0.87 −38 0.17 (0.30) 0.53 0.95 −42
APOA1 0.07 (0.18) 0.67 0.96 −61 0.09 (0.20) 0.60 0.96 −53
APOB 0.09 (0.14) 0.53 0.84 −40 0.13 (0.25) 0.59 0.91 −48
CERU 0.09 (0.17) 0.59 0.90 −50 0.23 (0.32) 0.42 0.92 −28

For reference, the estimated FNP without any filtering is shown in parentheses.

Differences between the two benchmarks were primarily attributed to their different focus. The decision in the spike benchmark is completely independent of the quantitative data used by the filtering algorithm, but it cannot account for peptides with a low signal-to-noise ratio. Consequently, the spike benchmark tends to assign a ‘good’ truth status to more peptides than the manual method, resulting in a lower FNP prior to any filtering.

Figure 8 shows an example of the proteomics profiles for a single protein. Although all 187 peptides were marked as originating from ceruloplasmin (CERU), approximately 6–8 different abundance patterns can clearly be visually discerned in the heat map, not counting the ‘noisy’ cluster at the top of the figure marked in pink in the dendrogram. Despite this, the spike data generally support the conclusion that most (if not all) of these peptides truly originated from a single protein source. Only three peptides lack the appropriate spike signal, and while a visual analysis of the heat map supports these being true misidentifications, their correlation with the observed data suggests that the cost of incorrectly retaining them is likely to be low. Among the ‘clean’ clusters, the algorithm removes one small cluster whose abundance profile is substantially distinct from the more dominant patterns, along with four other peptides from the top (slightly noisier) cluster.

Figure 8. Heat map showing the relative abundance data of ceruloplasmin (CERU) for each patient (column) and peptide (row). Peptides identified incorrectly according to the spike and manual benchmarks are shown as open circles and in grey, respectively. Using our algorithm, values of Zi above the critical value (a_0.05 = 107) are predicted to be correct, while values below 107 are predicted to be incorrect. The spike produced an inconclusive result for one peptide, shown as a ×.

The top ‘noisy’ cluster can be further split visually into two groups at the dashed gray line. Above this line, the spike signal is present for only four peptides, and no abundance patterns are visible across the columns; the algorithm retains only a single one of these peptides. Below the dashed line, the spike-in data suggest that the peptides were (generally) identified correctly but that their abundances are noisy. Visually, the left side of the heat map appears redder than the right, similar to the behavior of some cleaner subsets. These peptides are generally retained.

5. Discussion

In this paper, we have presented a testing procedure for identifying incorrectly labeled instances when two or more classes are present. Our non-parametric approach to the problem of filtering incorrect labels requires very few assumptions, yields a very high specificity, and can be implemented easily and efficiently using standard statistical software. We demonstrated its applicability and effectiveness using real proteomic data observed in children with sickle cell disease, with spike-in proteins providing a contrasting control on protein labels. Further insights into the properties of our procedure were demonstrated via a simulation study.

As demonstrated in the simulation study, our testing procedure generally has a high specificity and low FNP, especially when the number of measurements (n) on each instance is large. Decreasing the value of n yields a more conservative test (i.e. one which is less likely to reject the null hypothesis in (15) and remove instances), since each distance is measured less precisely. Even for extremely small values of n, it was still possible to reduce the FNP and maintain a specificity over 80% for all classes.

For a fixed n, classes with fewer instances (i.e. small values of N1) had a higher FNP and higher specificity than larger classes. Such a property is relatively unique among classification algorithms: many of the existing classification procedures found in the literature have difficulties when the number of instances varies substantially across classes [16], and this difficulty extends to procedures that utilize these classification approaches to detect mislabeled instances [7]. Our testing procedure avoids this problem and maintains the integrity of small classes by analyzing each class using a ‘one-vs-all’ strategy that is most conservative for small values of N1 (say, 25 ≤ N1 ≤ 50).

On the other hand, when the number of instances is extremely low (2 ≤ N1 ≤ 25 based on our simulation studies), the accuracy of the non-parametric estimates, especially τ^, becomes unreliable. In the most extreme cases, the available data are insufficient to ever reject the null hypothesis even if a reliable estimate of τ could be found. For example, in LC-MS/MS proteomics, extremely small proteins (2 ≤ N1 ≤ 5) often consist entirely of inaccurate or mislabeled peptides and make up a substantial proportion of the reported proteins. This will be addressed in a subsequent work using a complementary procedure in which instances are retained only if the null hypothesis is rejected.

The use of a Bonferroni-type procedure, aimed at protecting against removing correctly identified instances, is extremely conservative, prioritizing a high specificity at the cost of a higher FNP. Even in this conservative case, the FNP in our simulations was universally reduced across all values of N1, n, ρ, and p. Less conservative FWER procedures, FDR-type procedures [2,3], or procedures seeking to explicitly control the FNP could also be considered to further reduce the FNP in the filtered data. The primary convenience of the Bonferroni procedure is its universal applicability, especially in light of the complexity of the dependency structure of the distances and, consequently, of the test statistics Zi.
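To illustrate the difference in stringency, the sketch below contrasts Bonferroni rejection with the Benjamini–Hochberg (BH) step-up procedure [2] on a toy set of p-values (the p-values are hypothetical, not produced by our procedure):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject any hypothesis whose p-value is below alpha / N."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where
    k is the largest index with p_(k) <= k * alpha / N."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * (np.arange(1, N + 1) / N)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(N, bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.028, 0.20, 0.74]
print(bonferroni_reject(pvals).sum(), bh_reject(pvals).sum())  # 2 4
```

As expected, the FWER-controlling Bonferroni rule rejects fewer hypotheses than the FDR-controlling BH rule on the same p-values.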

Because the testing procedure estimates G¯ and F¯ non-parametrically from the available data, these estimates are affected by the presence of mislabeled instances. The resulting estimates τ^ and t^ of τ and t may therefore be biased when a large number of mislabeled instances are included in the class. One possible remedy is to iteratively remove a small number of instances and re-estimate τ and t until some stopping criterion is met. Developing such a sequential estimation procedure is left to future work.

Although we have used Pearson's correlation as a measure of distance in Sections 3 and 4, the procedure is generally applicable whenever the observed data for each instance can be effectively combined into a ‘measure of distance’. The use of different distance measures was illustrated in Section 2.4 with the Euclidean distance measure.
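As an illustration of swapping distance measures, both a correlation-based distance and the Euclidean distance can be computed from an instances-by-measurements matrix. The 1 − r transform below is one common choice of correlation distance; the paper does not specify its exact transform, so this is an assumption.

```python
import numpy as np

def correlation_distance(X):
    """d_ij = 1 - Pearson correlation between rows i and j of X
    (rows are instances, columns are the n measurements)."""
    return 1.0 - np.corrcoef(X)

def euclidean_distance(X):
    """d_ij = Euclidean distance between rows i and j of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.random.default_rng(0).normal(size=(6, 50))  # 6 instances, n = 50
D1, D2 = correlation_distance(X), euclidean_distance(X)
```

Either matrix can then be fed to the testing procedure, since only the pairwise distances enter the test statistics.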

Note

1

Scenario A — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((4.65, 8.08)ᵀ, I2); Black: x ~ N((1.26, 4.89)ᵀ, I2).
Scenario B — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((4.65, 8.08)ᵀ, I2); Black: x ~ N((0.26, 0.89)ᵀ, I2).
Scenario C — Blue: x ~ N((1.77, 8.43)ᵀ, I2); Red: x ~ N((7.65, 10.08)ᵀ, I2); Black: x ~ N((0.26, 0.89)ᵀ, I2).
Here I2 denotes the 2 × 2 identity covariance matrix.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Anderson L. and Hunter C.L., Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins, Mol. Cell. Proteom. 5 (2006), pp. 573–588. doi: 10.1074/mcp.M500331-MCP200 [DOI] [PubMed] [Google Scholar]
  • 2.Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. Roy. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300. [Google Scholar]
  • 3.Benjamini Y. and Yekutieli D., The control of the false discovery rate in multiple testing under dependency, Ann. Stat. 29 (2001), pp. 1165–1188. doi: 10.1214/aos/1013699998 [DOI] [Google Scholar]
  • 4.Esary J.D., Proschan F., and Walkup D.W., Association of random variables, with applications, Ann. Math. Stat. 38 (1967), pp. 1466–1474. doi: 10.1214/aoms/1177698701 [DOI] [Google Scholar]
  • 5.Forshed J., Johansson H.J., Pernemalm M., Branca R.M.M., Sofi Sandberg A., and Lehtiö J., Enhanced information output from shotgun proteomics data by protein quantification and peptide quality control (PQPQ), Mol. Cell. Proteom. 10 (2011), pp. 1–9. doi: 10.1074/mcp.M111.010264 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Frénay B. and Verleysen M., Classification in the presence of label noise: A survey, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), pp. 845–869. doi: 10.1109/TNNLS.2013.2292894 [DOI] [PubMed] [Google Scholar]
  • 7.Key M.C., ClassCleaner: A quantitative method for validating peptide identification in LC-MS/MS workflows, Ph.D. thesis, Indiana University, Indianapolis, 2020.
  • 8.Lucas J.E., Thompson J.W., Dubois L.G., McCarthy J., Tillmann H., Thompson A., Shire N., Hendrickson R., Dieguez F., Goldman P., Schwarz K., Patel K., McHutchison J., and Moseley M.A., Metaprotein expression modeling for label-free quantitative proteomics, BMC Bioinform. 13 (2012), p. 74. doi: 10.1186/1471-2105-13-74 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Morales P., Luengo J., Garcia L.P.F., Lorena A.C., de Carvalho A.C.P.L.F., and Herrera F., NoiseFiltersR: Label noise filters for data preprocessing in classification, 2016. Available at https://cran.r-project.org/package=NoiseFiltersR
  • 10.Polpitiya A.D., Jun Qian W., Jaitly N., Petyuk V.A., Adkins J.N., Camp D.G., Anderson G.A., and Smith R.D., DAnTE: A statistical tool for quantitative analysis of proteomics data, Bioinformatics 24 (2008), pp. 1556–1558. doi: 10.1093/bioinformatics/btn217 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Quinlan J.R., Induction of decision trees, Mach. Learn. 1 (1986), pp. 81–106. doi: 10.1007/bf00116251 [DOI] [Google Scholar]
  • 12.Sáez J.A., Galar M., Luengo J., and Herrera F., Analyzing the presence of noise in multi-class problems: Alleviating its influence with the one-vs-one decomposition, Knowl. Inf. Syst. 38 (2014), pp. 179–206. doi: 10.1007/s10115-012-0570-1 [DOI] [Google Scholar]
  • 13.Schader M. and Schmid F., Two rules of thumb for the approximation of the binomial distribution by the normal distribution, Am. Stat. 43 (1989), pp. 23–24. [Google Scholar]
  • 14.Silva J.C., Gorenstein M.V., Zhong Li G., Vissers J.P.C., and Geromanos S.J., Absolute quantification of proteins by LCMSE: A virtue of parallel MS acquisition, Mol. Cell. Proteom. 5 (2006), pp. 144–156. doi: 10.1074/mcp.M500230-MCP200 [DOI] [PubMed] [Google Scholar]
  • 15.Steen H. and Mann M., The ABC's (and XYZ's) of peptide sequencing, Nat. Rev. Mol. Cell. Biol. 5 (2004), pp. 699–711. doi: 10.1038/nrm1468 [DOI] [PubMed] [Google Scholar]
  • 16.Sun Y., C Wong A.K., and Kamel M.S., Classification of imbalanced data: A review, Int. J. Pattern Recognit. Artif. Intell. 23 (2009), pp. 687–719. doi: 10.1142/S0218001409007326 [DOI] [Google Scholar]
  • 17.Suomi T., Corthals G.L., Nevalainen O.S., and Elo L.L., Using peptide-level proteomics data for detecting differentially expressed proteins, J. Proteome Res. 14 (2015), pp. 4564–4570. doi: 10.1021/acs.jproteome.5b00363 [DOI] [PubMed] [Google Scholar]
  • 18.Webb-Robertson B.-J.M, Matzke M.M., Datta S., Payne S.H., Kang J., Bramer L.M., Nicora C.D., Shukla A.K., Metz T.O., Rodland K.D., Smith R.D., Tardiff M.F., McDermott J.E., Pounds J.G., and Waters K.M., Bayesian proteoform modeling improves protein quantification of global proteomic measurements, Mol. Cell. Proteom. 13 (2014), pp. 3639–3646. doi: 10.1074/mcp.M113.030932 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Zhu X. and Wu X., Class noise vs. attribute noise: A quantitative study, Artif. Intell. Rev. 22 (2004), pp. 177–210. doi: 10.1007/s10462-004-0751-8 [DOI] [Google Scholar]
