Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2013 Jan 7;8(1):e53112. doi: 10.1371/journal.pone.0053112

Automatic Peak Selection by a Benjamini-Hochberg-Based Algorithm

Ahmed Abbas 1, Xin-Bing Kong 2, Zhi Liu 3, Bing-Yi Jing 4,*, Xin Gao 1,*
Editor: Anna Tramontano5
PMCID: PMC3538655  PMID: 23308147

Abstract

A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming more and more common, although expert-knowledge remains the method of choice to determine how many peaks among thousands of candidate peaks should be taken into consideration to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into Inline graphic-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on the state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed number-based approach, our approach returns significantly more true peaks. For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without using the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and straightforwardly applied to some other prediction selection problems in bioinformatics. The source code, documentation and example data of the proposed method is available at http://sfb.kaust.edu.sa/pages/software.aspx.

Introduction

Many computational bioinformatics methods generate a large number of predictions for the correct solution to a problem among which are both true and false predictions. Such predictions are usually sorted according to certain confidence scores. For instance, ab initio protein structure prediction methods sample tens of thousands of three-dimensional models. The energy values are calculated for each model based on a given energy function, where lower values likely indicate better models. Another example is the protein function annotation problem in which the amino acid sequence or the domain architecture of a protein is given and the Gene Ontology (GO) terms selected from among some 30,000 are used to annotate the function.

In nuclear magnetic resonance (NMR)-based protein structure determination, thousands of peaks are routinely predicted from the input spectra in which there are usually tens to hundreds of true signals. The peaks are sorted according to either their intensities or estimated volumes. Both means of sorting, based on computational methods, have common properties. First, a large number of predictions are generated. Second, the predictions are scored by the scoring functions of the methods. However, the scoring functions are not powerful enough to distinguish true predictions from the false ones. Third, it is important to discover most of the true predictions while maintaining a reasonably low false positive rate. Therefore, it is crucial to know how many predictions should be selected in such scenarios.

Peak picking is one of the key problems in NMR protein structure determination process [3][5]. The problem is defined as follows: given any NMR spectrum or a set of spectra, select the true signals, i.e., peaks, while filtering the false ones. Typically, true peaks are assumed to have Gaussian-like shapes and high intensities so that they can be easily differentiated from false ones. However, there are two main factors that make the peak picking problem difficult. On the one hand, depending on the quality of the protein sample, the property of the target protein and local dynamics, there can be a number of weak peaks, i.e., peaks with low intensities or volumes. That is, if we sort the predicted peaks by volumes or intensities, there is no clear cutoff threshold to distinguish true peaks from false ones. These peaks are difficult to identify even by manual processes. This is why computational methods are useful. On the other hand, due to the various sources of noise in NMR spectra, such as water bands and artifacts, false peaks can have high intensities or volumes. The group of sorted peaks is therefore comprised of a mixture of true peaks and false ones, where most of the true peaks tend to be ranked higher with a few strong, false peaks also included. It is extremely difficult, if not impossible, to select only the true peaks and eliminate all the false ones. In NMR structure determination, a missing true peak may cause all the follow-up procedures to fail, whereas a false peak can still be eliminated later [6][9]. Therefore, an ideal method should identify almost all the true peaks while maintaining reasonably high precision.

The peak picking problem has been studied for more than two decades. A variety of computational methods have been proposed [1], [2], [10][19]. The existing methods can be classified into two categories according to the de-noising method. Included in the first category are hard threshold-based approaches. For instance, PICKY [2] assumes that the noise is white Gaussian and estimates the noise level in small regions that do not contain signals. The data points that have lower intensities than the estimated noise level are eliminated from the spectra. Singular value decomposition is applied to the connected components of the remainder of the spectra to yield one-dimensional lineshapes. The peaks are identified in each lineshape and sorted according to the intensity values. The higher the intensity is, the greater the confidence that it is a true peak. However, the hard threshold-based methods cannot detect weak peaks that are embedded in the noise. In the second category are soft threshold-based approaches, which do not eliminate any data point from the spectra. We recently proposed WaVPeak [1] to overcome the bottleneck in the hard threshold-based methods. WaVPeak applies the high-dimensional version of the Daubechies 3 wavelet [20] to smooth the given spectra. The shapes of true peaks become sharper and smoother. A brute-force method is used to identify all the local maxima in the smoothed spectra. In contrast to PICKY, the peaks are sorted according to their estimated volumes by WaVPeak. We have found that volume significantly outperforms intensity in distinguishing true peaks from false ones.

However, the existing peak picking methods are not able to determine automatically how many peaks among many to identify in order to include most of the true peaks. This number should be large enough to include as many true peaks as possible, and in the meanwhile small enough to achieve relatively high precision. In PICKY, the default number of peaks to return is Inline graphic, where Inline graphic is the length of the protein. In [1], WaVPeak is mainly compared with PICKY on the top Inline graphic peaks, where Inline graphic is the number of manually identified peaks, which is unknown for a new target protein. However, such fixed number-based approaches do not take the distribution of peaks into consideration. For instance, if there is a spectrum that is very noisy or has a large number of artifacts, there can be many strong but false peaks, which are identified along with the true ones. Many true peaks will not be selected if Inline graphic or Inline graphic is used. No matter how powerful the peak picking method is, it is crucial to cleverly determine the number of peaks to be selected. Otherwise, true peaks will be eliminated even if they have been identified by the methods.

In this paper, we propose a Benjamini-Hochberg (B-H)-based approach for the peak picking problem. We first cast the peak selection problem into a multiple testing problem [21]. Because there is no clear cutoff threshold for intensities or volumes, we calculate the Inline graphic-value for each peak. The number of peaks to be selected is then automatically determined by the B-H-based algorithm. We demonstrate that the proposed method significantly outperforms the fixed number-based method on selecting the true peaks from the predictions by the state-of-the-art peak picking methods, including WaVPeak and PICKY.

Methods

Our goal is to develop a method to help us to determine how many peaks to select among candidate peaks that number usually in the order of several hundreds. Each candidate peak can be considered as a null hypothesis, where each false peak is a true null hypothesis and each true peak is a false null hypothesis. Therefore, the goal is to simultaneously test all the hypotheses and to reject as many false null hypotheses as possible. This is a multiple testing problem, which has received much attention in the literature (see, e.g., [22]). One prominent solution to multiple testing problem was proposed by Benjamini and Hochberg [23]. We first describe how to cast our problem into that framework.

A Quick Review of Benjamini-Hochberg Method

We wish to test Inline graphic null hypotheses:

graphic file with name pone.0053112.e010.jpg

on the basis of a data set Inline graphic. We have some decision rule Inline graphic that rejects or accepts each of the above Inline graphic cases (e.g., decides if the Inline graphicth candidate peak is a true peak or a false peak). The data set Inline graphic consists of

graphic file with name pone.0053112.e016.jpg
graphic file with name pone.0053112.e017.jpg
graphic file with name pone.0053112.e018.jpg

where Inline graphic are a random sample from the Inline graphicth population (e.g. intensities or volumes in a neighborhood of the Inline graphicth candidate peak). We assume that our decision rule, Inline graphic, produces a Inline graphic-value, Inline graphic, for each case, Inline graphic (we will discuss several different ways of calculating such Inline graphic-values later). Therefore, Inline graphic has a uniform distribution if Inline graphic is correct,

graphic file with name pone.0053112.e029.jpg

Intuitively, if the Inline graphic-value, Inline graphic, is small enough, Inline graphic will be rejected. In fact, the usual Bonferroni procedure [24], [25] rejects Inline graphic whenever

graphic file with name pone.0053112.e034.jpg

where Inline graphic is the significance level. This is typically a very conservative procedure, particularly when Inline graphic is large, because it does not reject as many null hypotheses as it should. In other words, it tends to have a low discovery rate.

To improve the discovery rate, Benjamini and Hochberg (1995) proposed an algorithm based on ordered Inline graphic-values:

graphic file with name pone.0053112.e038.jpg

The Benjamini and Hochberg (B-H) algorithm uses the following rule: for a fixed value of Inline graphic, referred to as the control rate, let Inline graphic be the largest index for which

graphic file with name pone.0053112.e041.jpg

and reject Inline graphic, the null hypothesis corresponding to Inline graphic, if

graphic file with name pone.0053112.e044.jpg

accepting Inline graphic otherwise. Figure 1 illustrates how the B-H procedure works.

Figure 1. Illustration of the Benjamini-Hochberg procedure.

Figure 1

In this example, the number of hypotheses (Inline graphic) is 10 and the false discovery proportion (Inline graphic) is 0.2. The largest index of the hypotheses that is below the line is 6 (Inline graphic). Therefore, the first six hypotheses are rejected as the predicted peaks.

Benjamini and Hochberg proved the following result [23], which justified their procedure.

Theorem

For independent test statistics, the B-H algorithm controls the expected false discovery proportion (FDP) at Inline graphic:

graphic file with name pone.0053112.e050.jpg

where Inline graphic, Inline graphic is the number of cases rejected, Inline graphic is the number of those that are actually null, and Inline graphic is the number of true null hypotheses.

Clearly, the above FDP control attempts to keep the number of false discoveries under control, and in a sense to keep the precision above a certain level. A good procedure should have as high recall rates as possible with prescribed high precision (or low FDP).

Applying the B-H Procedure to the Peak Picking Problem

We will cast the NMR peak picking problem into the multiple testing framework. In WaVPeak (or PICKY), after data cleaning at the first stage by wavelet smoothing (or by hard thresholding), Inline graphic potential peaks are identified. We wish to test that, for each Inline graphic,

graphic file with name pone.0053112.e057.jpg

against

graphic file with name pone.0053112.e058.jpg

We can view each candidate peak and its surroundings as one population. We have a random sample of intensities, Inline graphic from the Inline graphicth population. The sample size Inline graphic depends on which method is adopted. For WaVPeak, we have Inline graphic if we use a rectangular neighborhood of length 1 in 2D spectra, such as 15N-HSQC; for PICKY, we have Inline graphic since we only use one intensity at each candidate peak.

We implement the B-H procedure below in two steps.

  • Step I: calculating Inline graphic-values.

For WaVPeak and PICKY, we use volume (Inline graphic) and intensity (Inline graphic) around the Inline graphicth candidate peak as the test statistics, respectively. Our decision rule is to reject Inline graphic if Inline graphic or Inline graphic is large, respectively. The corresponding Inline graphic-values are

graphic file with name pone.0053112.e072.jpg
graphic file with name pone.0053112.e073.jpg

where Inline graphic and Inline graphic are observed values of Inline graphic and Inline graphic.

  • Step II: applying the B-H procedure at Inline graphic.

Rank the Inline graphic-values Inline graphic obtained from Step I in ascending order, and denote the ordered Inline graphic-values as Inline graphic. We can then plot Inline graphic vs Inline graphic, and apply the B-H procedure.

Calculation of P-values

We now explain how to calculate p-values Inline graphic and Inline graphic in Step I above. We assume that the observations from different peaks are independent, and that true peaks and false peaks are from two different normal distributions. Then we can rewrite the above testing problem as

graphic file with name pone.0053112.e087.jpg

against

graphic file with name pone.0053112.e088.jpg

Typically, the mean intensity Inline graphic from false peaks is much smaller than the mean intensity Inline graphic from true peaks, usually written as Inline graphic. However, Inline graphic may not be zero, and can be estimated from weak intensities. For variances, we typically have Inline graphic.

The reason why Inline graphic is small (compared with Inline graphic) but not zero is due to how the candidate peaks are selected. In WaVPeak and PICKY, the volumes and intensities are calculated for a grid of points, respectively, those below certain thresholds are discarded, and the remaining ones are retained as candidate peaks. Therefore, the volumes and intensities for those candidate peaks should all have mean volumes and intensities above the thresholds.

To calculate Inline graphic and Inline graphic, we need to standardize by subtracting the mean, Inline graphic, and divide the standard deviation (s.d.), Inline graphic, under Inline graphic’s. Due to the different data structures of WaVPeak (volumes) and PICKY (intensities), they are considered separately below.

Calculation of Inline graphic

In WaVPeak, the test statistic Inline graphic is the approximate volume under the Inline graphicth candidate peak: Inline graphic, where Inline graphic and Inline graphic is some constant. Then, the Inline graphic-value is

graphic file with name pone.0053112.e108.jpg
graphic file with name pone.0053112.e109.jpg
graphic file with name pone.0053112.e110.jpg (1)

where Inline graphic is the standard normal distribution. The mean/median, Inline graphic, and variance, Inline graphic, of the false peaks are unknown, which can be estimated by the sample median and sample variance of the false peaks, respectively. To do this, we need to have a rough idea of where those false peaks are located. It has been observed that the number of true peaks of a protein, Inline graphic, is always less than Inline graphic, where Inline graphic is the length of that protein and Inline graphic is the expected number of peaks per residue for the corresponding spectrum. For instance, for Inline graphic, Inline graphic; for Inline graphic, Inline graphic. Almost all true peaks are ranked in the top Inline graphic candidate peaks by volume in WaVPeak, while the remaining Inline graphic candidate peaks are mostly false peaks, from which we can estimate Inline graphic and Inline graphic.

To be more specific, for Inline graphic, let Inline graphic and Inline graphic denote the sample mean and variance for the Inline graphicth candidate peak; and Inline graphic and Inline graphic the ordered sample means and variances, respectively. Then Inline graphic and Inline graphic can be estimated by the medians of the Inline graphic smallest Inline graphic and Inline graphic:

graphic file with name pone.0053112.e137.jpg (2)
graphic file with name pone.0053112.e138.jpg (3)

Calculation of Inline graphic

In PICKY, the test statistic is the intensity, Inline graphic, at the Inline graphicth single peak. Here, Inline graphic. Its Inline graphic-value can be calculated similarly to that in WaVPeak, giving

graphic file with name pone.0053112.e149.jpg (4)

Although we could use the same estimators of Inline graphic and Inline graphic as above, we will propose some different ones for PICKY due to its unique features. It has been demonstrated that the intensity of a single peak point is a much less reliable confidence score than the volume of the peak [1]. It is thus expected that the intensity curves are mixed up by fewer true peaks and more false peaks. Therefore, the median of Inline graphic may no longer be accurate because the median may very likely come from a true peak. On the other hand, replacing the median by the minimum in (2) and (3) should produce better estimators of Inline graphic and Inline graphic, respectively, which turns out to be true for less reliable confidence scores (data not shown). Based on these considerations, we propose to estimate Inline graphic and Inline graphic in PICKY respectively by

graphic file with name pone.0053112.e157.jpg
graphic file with name pone.0053112.e158.jpg

Results

We evaluated the performance of the proposed methods on the peaks predicted by WaVPeak and PICKY. The same dataset as the one used by both [1] and [2] was used as the benchmark dataset, the most comprehensive dataset available for the peak picking problem. The dataset covers a wide range of spectrum types, including 2D 15N-HSQC, and 3D HNCO, HNCA, HNCACB and CBCA(CO)NH, which were extracted from the spectrum sets of eight proteins (TM1112, YST0336, RP3384, ATC1776, CASKIN, HACS1, VRAR, and COILIN).

We first demonstrate how our method performed when a more reliable confidence score is available, i.e., the estimated volumes of the peaks predicted by WaVPeak. We then present the performance of the method when a less reliable confidence score is available, i.e., the single intensity values of the peaks provided by PICKY. We finally demonstrate how to combine the results of our method with both WaVPeak and PICKY, to further eliminate false positive peaks.

Selecting WaVPeak Peaks

The B-H algorithm is first compared with a fixed number-based method, i.e., Inline graphic, on selecting peaks predicted by WaVPeak. Inline graphic is set to Inline graphic. That is, the top Inline graphic peaks predicted by WaVPeak are considered. The results are presented in Table 1, about which we make the following observations.

Table 1. Comparison of the missing peak rate of the fixed number-based method (Inline graphic) and the Benjamini-Hochberg (B-H) algorithm with Inline graphic on the 32 spectra of the eight proteins in the benchmark dataset as picked by WaVPeak.

Spectra 15N-HSQC HNCO HNCA HNCACB CBCA(CO)NH
Protein Len tNp B-H (d) tNp B-H (d) tNp B-H (d) tNp B-H (d) tNp B-H (d)
RP3384 64 7 7 (0) 0 0 (0) 12 12 (0) 8 8 (0)
CASKIN 67 6 2 (67) 22 15 (32) 38 41 (−8) 10 10 (0)
VRAR 72 3 3 (0) 7 7 (0) 31 32 (−3) 18 18 (0)
HACS1 74 7 2 (71) 8 6 (25) 14 15 (−7) 8 6 (25)
TM1112 89 8 2 (75) 6 6 (0) 8 7 (13) 9 2 (78)
COILIN 98 3 0 (100) 16 17 (−6) 24 25 (−4) 28 20 (29)
ATC1776 101 7 5 (29) 8 7 (13) 19 17 (11) 25 24 (4)
YST0336 146 2 2 (0) 6 3 (50) 11 10 (9) 17 13 (24)
Average 5.4 2.9 (43) 9.6 7.9 (16) 12 11.2 (5) 23 24 (−2) 15.4 12.6 (20)
SDave 2.2 2.0 6.7 5.7 4.6 3.9 10.9 12.0 7.4 7.0
Preave 84 77 77 73 83 76 64 71 72 67

Column Inline graphic is the relative improvement of the missing peak rate of B-H over Inline graphic. All values except the last two rows are the missing peak rates. The “Inline graphic” row lists the standard deviations of the missing peak rates for the corresponding columns, demonstrating the robustness of different methods. The last row is the average precision value. All values are given in percentage.

  • The B-H algorithm significantly outperforms the Inline graphic-based method in terms of the average missing peak rates, i.e., the percentage of true peaks that are not selected. On six out of the 32 spectra, the B-H algorithm reduces the Inline graphic-based method on the missing peak rate by more than 50%. One exception is HNCACB, where the B-H algorithm is slightly worse than the Inline graphic-based selection in the missing peak rate (but better in precision); however, this can be easily rectified by increasing the FDR Inline graphic to Inline graphic, which is commonly adopted in practice. Overall, the B-H algorithm is much more sensitive and stable than the fixed number-based method. It is noticeable that the improvement in the sensitivity is at the cost of the reduced precision. This is expected because the B-H algorithm does not change the order of the sorted candidate peaks. Instead, it provides a good tradeoff that prefers higher sensitivity by selecting a cutting point in the list of the sorted peaks.

  • As expected, the fixed number-based method is not stable. It performs well on some spectra (e.g. RP3384), but poorly on the others (e.g., TM1112). This is further verified by its larger standard deviations. The reason is that such a method does not take the properties of the input spectra into consideration. For instance, for a very noisy spectrum with weak signals, there can be many false peaks sorted amongst the true ones (e.g., Figures 2(a) and 2(c)). Thus, by taking a fixed number of peaks, there is no way one can ensure that the true peaks are included.

  • Reduction on the missing peak rate of B-H over Inline graphic can reach as high as Inline graphic, indicated in column Inline graphic. These improvements mostly occur in the weak peaks, which are the most difficult to find. Since there are not many weak peaks to start with, improvements measured by relative missing peak rates (i.e., weak signals found/all weak signals) are very high, even though those measured by absolute missing peak rates may not always appear.

Figure 2. Original volume curves and the corresponding p-value curves.

Figure 2

(a) and (d): sorted volume curve (a) and the corresponding p-value curve (d) of peaks predicted by WaVPeak on the 2D 15N-HSQC spectrum of the protein ATC1776; (b) and (e): sorted volume curve (b) and the corresponding p-value curve (e) of peaks predicted by WaVPeak on the 3D HNCO spectrum of the protein VRAR; (c) and (f): sorted volume curve (c) and the corresponding p-value curve (f) of peaks predicted by WaVPeak on the 3D CBCA(CO)NH spectrum of the protein COILIN. In all figures, true peaks are shown in black and false ones are shown in cyan. In (d), (e) and (f), the decision boundaries of Inline graphic and B-H procedure are shown in black and magenta, respectively.

It is noticeable that all the missing peak rates in Table 1 are the results by comparing to the “expected” peak lists of the spectra. The “expected” peak lists were generated by NMR labs by combining information from large sets of spectra. It is thus likely that an expected peak does not exist in some spectra, especially the noisy ones, such as HNCACB and CBCA(CO)NH. In practice, higher recall rates (lower missing peak rates) than those reported here can be expected.

Figure 2 shows several representative examples of how different selection methods work. We make several remarks.

  • It can be difficult to set a cutoff point from the original volume curves in Figures 2(a)–2(c) to separate true peaks from false ones. The best thing the fixed number-based methods can do is to take a random guess. For example, the Inline graphic-based selection method overestimates the number of peaks to be selected for a less noisy spectrum as shown in Figure 2(e), but significantly underestimates the number of peaks to be selected for a noisier spectrum as shown in Figure 2(f).

  • The B-H algorithm works consistently well on the Inline graphic-value curves. As shown in Figure 2, after converting the volumes to Inline graphic-values, strong true peaks with high volumes are dragged down to the Inline graphic-axis, i.e., the Inline graphic-values are almost equal to zero. Most of the weak true peaks with low volumes are also dragged to the Inline graphic-axis, making it possible to identify them in the Inline graphic-value curves. For instance, two of the three weak peaks with low volumes in Figure 2(a) are dragged down to the Inline graphic-axis, and thus selected by the B-H algorithm. Note that the Inline graphic-value does not change the volume order of the peaks. Instead, it provides a much better curve so that the weak peaks can be possibly selected.

Selecting PICKY Peaks

We then evaluated the performance of the proposed method with a less reliable confidence score, i.e., the intensity value of PICKY. PICKY has a default noise level threshold [2], which sometimes causes insufficient numbers of predicted peaks. For fair comparison purposes, we lowered the noise level threshold of PICKY until it generated more than 1.5Inline graphic peaks.

Table 2 presents the performance of the proposed method on selecting peaks predicted by PICKY. Similar conclusions to those about WaVPeak can be made here. For instance, the B-H method consistently and significantly outperforms the fixed number-based method. There are seven spectra on which the B-H algorithm reduces the missing peak rate of the Inline graphic-based method by at least 50%. Six of these spectra have original recall rates that were already higher than 90%. There are two spectra, HNCO of COILIN and CBCA(CO)NH of RP3384, on which the absolute improvements are greater than 15% with highest being 26%. As shown in Figures 3(b) and 3(c), the original intensity curves for these two spectra are continuous and smooth. It is difficult to identify a cutoff point between true peaks and false ones on such curves. Many false peaks are sorted amongst the true ones. After converting the intensity values into p-values, most of the true peaks are dragged down to the Inline graphic-axis, i.e., they have very small p-values. The 5% slope is then able to select most of the true peaks. In the two cases, fewer than three true peaks are not selected and true peaks are almost the last ones selected by the B-H algorithm.

Table 2. Comparison of the missing peak rate of the fixed number-based method (Inline graphic) and the Benjamini-Hochberg (B–H) algorithm with Inline graphic on the 32 spectra of the eight proteins in the benchmark set picked by PICKY.

Spectra 15N-HSQC HNCO HNCA HNCACB CBCA(CO)NH
Protein tNp B-H (d) tNp B-H (d) tNp B-H (d) tNp B-H (d) tNp B-H (d)
RP3384 6 4 (33) 0 0 (0) 13 13 (0) 36 10 (72)
CASKIN 2 2 (0) 22 15 (32) 33 38 (−15) 10 9 (10)
VRAR 7 7 (0) 9 9 (0) 31 31 (0) 19 19 (0)
HACS1 5 2 (60) 6 4 (33) 18 17 (6) 9 6 (33)
TM1112 8 1 (88) 6 6 (0) 8 7 (13) 9 2 (78)
COILIN 4 0 (100) 19 3 (84) 24 24 (0) 33 20 (39)
ATC1776 5 4 (20) 8 5 (38) 19 18 (5) 28 26 (7)
YST0336 2 2 (0) 6 3 (50) 12 10 (17) 17 12 (29)
Average 4.9 2.7 (38) 10.0 5.6 (34) 12.5 11.7 (6) 22.8 23.4 (1) 20.1 13.0 (34)
SDave 2.0 2.0 7.2 4.6 4.6 4.4 9.1 10.8 10.3 7.5
Preave 85 69 77 67 83 70 65 69 68 60

Column Inline graphic is the relative improvement of the missing peak rate of B-H over Inline graphic. All values except the last two rows are the missing peak rates. The “Inline graphic” row lists the standard deviations of the missing peak rates for the corresponding columns, which demonstrates the robustness of different methods. The last row gives the average precision values. All values are given in percentage.

Figure 3. Original intensity curves and the corresponding p-value curves.

Figure 3

(a) and (d): sorted intensity curve (a) and the corresponding p-value curve (d) of peaks predicted by PICKY on the 2D 15N-HSQC spectrum of the protein TM1112; (b) and (e): sorted intensity curve (b) and the corresponding p-value curve (e) of peaks predicted by PICKY on the 3D HNCO spectrum of the protein COILIN; (c) and (f): sorted intensity curve (c) and the corresponding p-value curve (f) of peaks predicted by PICKY on the 3D CBCA(CO)NH spectrum of the protein RP3384. In these figures, true peaks are shown in black and false ones are shown in cyan. In (d), (e) and (f), the decision boundaries of Inline graphic and the B-H procedure are shown in black and magenta, respectively.

Eliminating False Peaks

The proposed B-H algorithm automatically determines how many peaks we should select from the candidate peak lists that are sorted according to the confidence scores of different methods. Therefore, the more true peaks it includes, the greater the possibility that it also includes false ones. This possibility is verified by the relatively low precision values in Table 1. The selected false peaks usually have larger volumes (or even much larger volumes) than the true ones. This can be caused by a variety of reasons, such as water bands, artifacts and side-chains. It is thus very difficult to eliminate them from a single spectrum. An effective way to eliminate false peaks is to use spectra that share same atoms to “cross-reference” the peaks [2].

The goal of such cross-referencing is to eliminate as many false peaks as possible, while maintaining as many true peaks as possible. Among the commonly used NMR spectra, 15N-HSQC is the most sensitive and reliable one. It is often used as the root spectrum by NMR spectroscopists. If 15N-HSQC is not available, HNCO is usually considered to be the root, especially in non-linear acquisition mode. If other types of spectra are used to cross-reference 15N-HSQC, the recall will be significantly decreased. Therefore, we used a consensus method to refine the peaks selected for 15N-HSQC. Both WaVPeak and PICKY were used to pick peaks for the 15N-HSQC spectra of the eight proteins. The two candidate peak lists were then selected by the proposed B-H algorithm. Only the peaks that appeared in both selected peak lists were kept as the consensus peak list for 15N-HSQC. As shown in Table 3, the consensus method retained all the true peaks while increasing the precision by 13% on average. The consensus peak list was then used to refine all the other peak lists of WaVPeak that were selected by the proposed B-H algorithm. The reason we used the peak lists of WaVPeak was that WaVPeak was shown to be more sensitive than PICKY on noisier spectra [1]. Table 3 shows that for all the spectra, most of the true peaks were maintained, and the precision values were significantly improved. F-score, which is the harmonic mean of precision and recall, suggests that the BH-based consensus method gives the best overall accuracy comparing to other methods, including PICKY, WaVPeak, B-H WaVPeak, and the consensus of PICKY and WaVPeak by simply considering the top Inline graphic peaks from each method. On average, the BH-based consensus method was able to identify more than 88% of the expected true peaks, whereas less than 17% of the selected peaks were false ones.

Table 3. Comparison of the performance of different peak picking methods.

Spectra 15N-HSQC HNCO HNCA HNCACB CBCA(CO)NH
Method Rec Pre Rec Pre Rec Pre Rec Pre Rec Pre F – score
PICKY 93 81 89 74 88 74 60 78 72 66 77
WaVPeak 96 80 91 76 88 74 76 64 85 71 80
B-H (WaVPeak) 97 70 92 73 89 76 76 71 87 67 79
Consensus (Inline graphic) 97 72 92 70 88 82 77 72 88 69 80
B-H (Consensus) 97 83 89 80 86 93 76 84 86 80 85

Rec stands for recall values and Pre stands for precision values. The recall and the precision values of PICKY and WaVPeak are taken from [1]. B-H (WaVPeak) is the WaVPeak peaks selected by the proposed B-H algorithm. Consensus (Inline graphic) is the consensus of WaVPeak and PICKY by simply considering the top Inline graphic peaks from each method. B-H (Consensus) is the consensus of WaVPeak and PICKY by considering the top peaks that are determined by the proposed B-H algorithm. All the values are given as percentage.

Note that the performance of PICKY and WaVPeak in Table 3 was taken from that reported in [1], in which the top Inline graphic peaks were selected for comparison, where Inline graphic, the number of true peaks that exist in the spectrum, was assumed to be known. The Inline graphic consensus method in Table 3 was done by considering the top Inline graphic peaks of both PICKY and WaVPeak, which is much larger than the number of peaks used in [1]. This explains the significant drop of precision for the consensus method with respect to PICKY and WaVPeak.

Figures 4(a)–(e) show the precision-recall curves of the six different peak picking methods on the five types of spectra. These six methods are PICKY, B-H PICKY, WaVPeak, B-H WaVPeak, Inline graphic consensus and B-H consensus. For the sake of clearance, only the important parts of the curves, i.e., when recall is at least 0.7, are drawn. It is clear that B-H consensus always outperforms the five other methods. That is, at the same recall value, B-H consensus always has less proportion of false positive peaks. The Inline graphic consensus is the second best method. This makes sense because the consensus methods, comparing to the other methods, combine information from different, relevant spectra. B-H WaVPeak and B-H PICKY consistently outperform WaVPeak and PICKY. Note that WaVPeak has been shown to be better than PICKY [1]. Thus, the improvement of B-H PICKY over WaVPeak is due to the use of our B-H algorithm. In practice, we suggest the users to use the B-H WaVPeak if high sensitivity is required or only one spectrum is available, and use the B-H consensus method if high tradeoff between precision and recall is needed or a set of relevant spectra is given.

Figure 4. Precision-recall curves for different peak picking methods and sensitivity analysis of B-H WaVPeak.

Figure 4

(a)–(e): precision-recall curves for different methods on 15N-HSQC, HNCO, HNCA, CBCA(CO)NH and NHCACB, respectively. The solid black curves are for B-H consensus method; the dashed black curves are for the 1.5Inline graphic consensus method; the solid cyan curves are for B-H WaVPeak; the dashed cyan curves are for the original WaVPeak; the solid magenta curves are for B-H PICKY; and the dashed magenta curves are for the original PICKY. The relative area under curve (AUC) values are in legends, which are the area under curve over the total area of recall at least 0.7. (f): sensitivity analysis for different number of peaks. The precision and recall values of B-H WaVPeak are shown when Inline graphic, Inline graphic, Inline graphic and Inline graphic top peaks are used to calculate the p-values.

We further studied the sensitivity of the B-H algorithm with respect to the parameter. In this paper, we have been using 1.5 as the parameter value in Inline graphic. As shown in Figure 4(f), when the parameter value is changed to 2, 2.5 or 3, there is no significant change on both precision and recall.

Discussion

A common issue in bioinformatics is that a large number of predictions are made by computational methods. These predictions contain both true predictions and false ones. In most problems, a fixed number of predictions is selected according to a certain confidence score. The confidence score, however, is not accurate enough to differentiate true predictions from false ones. Therefore, selecting a fixed number of predictions or thresholding by a fixed score usually sacrifices a lot of true predictions because it does not take the properties of the problem into consideration. We propose a general approach to partially resolve this issue. The original confidence score is first converted into Inline graphic-values, which have been demonstrated to have a much stronger distinguishing capability than the original confidence score. The Benjamini-Hochberg algorithm is then applied to select a self-adapted number of predictions according to the false discovery rate that we want to control. This approach provides a systematic way of selecting predictions of computational methods. We further demonstrate that the false predictions can be further eliminated by using consensus or cross-referencing approaches.

The proposed approach has a wide range of potential applications. For instance, in protein function annotation problems, the amino acid sequences or domain architectures of proteins are given, and the GO terms selected from among some 30,000 are used to annotate the function. Most of the existing methods estimate the probability for each GO term to annotate the given protein [26][29]. However, the number of GO terms that annotate a certain protein is unknown. Our approach can be directly applied to the protein function annotation problem such that the correct number of GO terms is selected.

Theoretically speaking, the sum of the false discovery rate and the precision should be one. However, the precision values of B-H WaVPeak and B-H PICKY are way below 0.95, as shown in Tables 1 and 2. This is due to the fact that the volume and the intensity used in the original WaVPeak and PICKY are not perfect measures to rank peaks. That is, although such measures contain information about peak properties, the information is far from complete or correct. As shown in Figures 2 and 3, many true peaks can have much lower volume or intensity than some false ones. In order to achieve the theoretical precision level, better measurements have to be used by the original peak picking methods. For instance, the symmetry of peak shapes can be considered as additional information to rank peaks [4].

We are currently incorporating the proposed method as a plug-in into the available NMR software, such as CCPN and NMRView [15]. The source code of the proposed method is available at http://sfb.kaust.edu.sa/pages/software.aspx.

Conclusion

We have proposed a sensitive and robust approach to select peaks from automatic peak picking methods. The original peak confidence scores are first converted into Inline graphic-values. The Benjamini-Hochberg algorithm is then applied to select the number of peaks. In this paper, we demonstrated that the proposed approach worked consistently well using state-of-the-art peak picking methods. Therefore, this can be a potentially general approach to select a good number of candidates from a large set of predictions.

Acknowledgments

We are grateful to Ming Li for making PICKY publicly available. We thank Virginia Unkefer for editorial work on the manuscript. The spectra for TM1112, YST0336, RP3384 and ATC1776 were generated by Cheryl Arrowsmith’s Lab at the University of Toronto. The spectra for COILIN, VRAR, HACS1 and CASKIN were provided by Logan Donaldson’s Lab at York University.

Funding Statement

This work was supported by Award No. GRP-CF-2011-19-P-Gao-Huang, a GMSV-OCRF award from King Abdullah University of Science and Technology, and Hong Kong Research Grants Council grants HKUST6019/10P and HKUST6019/12P. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Liu Z, Abbas A, Jing B, Gao X (2012) WaVPeak: picking NMR peaks through wavelet-based smoothing and volume-based filtering. Bioinformatics 28: 914–920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Alipanahi B, Gao X, Karakoc E, Donaldson L, Li M (2009) PICKY: a novel SVD-based NMR spectra peak picking method. Bioinformatics 25: i268–i275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Wüthrich K (1986) NMR of Proteins and Nucleic Acids. New York: John Wiley and Sons.
  • 4.Gao X (2009) Towards automating protein structure determination from NMR data. PhD dissertation, University of Waterloo.
  • 5. Gao X (2012) Mathematical approaches to the NMR peak-picking problem. Journal of Applied and Computational Mathematics 1: 1. [Google Scholar]
  • 6. Gronwald W, Kalbitzer H (2003) Automated structure determination of proteins by NMR spectroscopy. Progress in Nuclear Magnetic Resonance 44: 33–96. [Google Scholar]
  • 7. Altieri A, Byrd R (2004) Automation of NMR structure determination of proteins. Current Opinions in Structural Biology 14: 547–553. [DOI] [PubMed] [Google Scholar]
  • 8. Güntert T (2009) Automated structure determination from NMR spectra. European Biophysics Journal 38: 129–143. [DOI] [PubMed] [Google Scholar]
  • 9. Alipanahi B, Gao X, Karakoc E, Li SC, Balbach F, et al. (2011) Error tolerant NMR backbone resonance assignment and automated structure generation. Journal of Bionformatics and Computational Biology 9: 15–41. [DOI] [PubMed] [Google Scholar]
  • 10. Kleywegt G, Boelens R, Kaptein R (1990) A versatile approach toward the partially automatic recognition of cross peaks in 2D 1H NMR spectra. Journal of Magnetic Resonance 135: 288–297. [Google Scholar]
  • 11. Garret D, Powers R, Gronenborn A, Clore G (1991) A common sense approach to peak picking in two-, three-, and four-dimensional spectra using automatic computer analysis of contour diagrams. Journal of Magnetic Resonance 95: 214–220. [DOI] [PubMed] [Google Scholar]
  • 12. Corne S, Jognson A, Fisher J (1992) An artificial neural network for classifying cross peaks in two dimensional NMR spectra. Journal of Magnetic Resonance 100: 256–266. [Google Scholar]
  • 13. Carrara E, Pagliari F, Nicolini C (1993) Neural networks for the peak-picking of nuclear magnetic resonance spectra. Journal of Neural Networks 6: 1023–1032. [Google Scholar]
  • 14. Rouh A, Louis-Joseph A, Lallemand J (1994) Bayesian signal extraction from noisy FT NMR spectra. Journal of Biomolecular NMR 4: 505–518. [DOI] [PubMed] [Google Scholar]
  • 15. Johnson B, Blevins R (1994) NMR View: a computer program for the visualization and analysis of NMR data. Journal of Biomolecular NMR 4: 603–614. [DOI] [PubMed] [Google Scholar]
  • 16. Antz C, Neidig K, Kalbitzer H (1995) A general Bayesian method for an automated signal class recognition in 2D NMR spectra combined with a multivariate discriminant analysis. Journal of Biomolecular NMR 5: 287–296. [DOI] [PubMed] [Google Scholar]
  • 17. Koradi R, Billeter M, Engeli M, Güntert P, Wüthrich K (1998) Automated peak picking and peak integration in macromolecular NMR spectra using AUTOPSY. Journal of Magnetic Resonance 135: 288–297. [DOI] [PubMed] [Google Scholar]
  • 18. Orekhov V, Ilghiz V, Billeter M (2001) MUNIN: a new approach to multi-dimensional NMR spectra interpretation. Journal of Biomolecular NMR 20: 49–60. [DOI] [PubMed] [Google Scholar]
  • 19. Korzhneva D, Ibraghimov I, Billeter M, Orekhov V (2001) MUNIN: application of three-way decomposition to the analysis of heteronuclear NMR relaxation data. Journal of Biomolecular NMR 21: 263–268. [DOI] [PubMed] [Google Scholar]
  • 20.Daubechies I (1992) Ten lectures on wavelets. Philadelphia: SIAM.
  • 21.Miller R (1981) Simultaneous Statistical Inference. New York: Springer Verlag.
  • 22.Efron B (2010) Large-Scale Inference. New York: Cambridge University Press.
  • 23. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57: 289–300. [Google Scholar]
  • 24.Bonferroni C (1935) Il calcolo delle assicurazioni su gruppi di teste. Studi in Onore del Professore Salvatore Ortu Carboni : 13–60.
  • 25.Bonferroni C (1936) Teoria statistica delle classi e calcolo delle probabilit à. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze : 3–62.
  • 26. Coin L, Bateman A, Durbin R (2003) Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proceedings of the National Academy of Sciences 100: 4516–5420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Silvescu A, Andorf C, Dobbs D, Honavar V (2004) Inter-element dependency models for sequence classification.
  • 28. Forslund K, Sonnhammer E (2008) Predicting protein function from domain content. Bioinformatics 24: 1681–1687. [DOI] [PubMed] [Google Scholar]
  • 29. Messih MA, Chitale M, Bajic VB, Kihara D, Gao X (2012) Protein domain recurrence and order can enhance prediction of protein functions. Bioinformatics 28: i444–i450. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES