Abstract
We have implemented a method that identifies the genomic origins of sample proteins by scanning their peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome. Unlike previously reported techniques, this method requires no predefined ORF or protein annotations. Fixed-size windows along the genome sequence are scored by an equation accounting for the number of matching peptides, the number of missed enzymatic cleavages in each peptide, the number of in-frame stop codons within a window, the adjacency between peptides, and duplicate peptide matches. Statistical significance of matching regions is assessed by comparing their scores to scores from windows matching randomly generated mass data. Tests with samples from Saccharomyces cerevisiae mitochondria and Escherichia coli have demonstrated the ability to produce statistically significant identifications, agreeing with two commonly used programs, peptident and mascot, in 86% of samples analyzed. This genome fingerprint scanning method has the potential to aid in genome annotation, identify proteins for which annotation is incorrect or missing, and handle cases where sequencing errors have caused framing mistakes in the databases. It might also aid in the identification of proteins in which recoding events such as frameshifting or stop-codon read-through have occurred, elucidating alternative translation mechanisms. The prototype is implemented as a client/server pair, allowing the distribution, among a set of cluster nodes, of a single or multiple genomes for concurrent analysis.
Peptide mass fingerprinting is a principal protein identification technique that was introduced in 1993 by several groups (1–3). Fractions from the separation of a protein sample by means of 2D gel electrophoresis or multidimensional HPLC are enzymatically digested and analyzed by MS. The resulting peptide mass fingerprints are matched against a sequence database to identify the proteins present in the sample. Commonly used computer programs such as peptident (4), profound (5), mascot (6), and sherpa (7) match peptide fingerprints by comparing the masses in the fingerprint to those derived by in silico digestion of predicted or confirmed database protein sequences. Misidentified or unidentified ORFs can present a major challenge to the process, as can sequencing insertion/deletion errors (indels). Furthermore, current methods cannot readily detect proteins generated by various alternative processing mechanisms observed at each stage of protein production. Examples include transcriptional slippage (8), alternative splicing (9), internal initiation (10, 11), and recoding (12, 13), the latter of which includes nonstandard translational phenomena such as programmed frameshift and stop codon read-through.
Proteins produced by such mechanisms may be absent from a protein database, leaving no positive search target. A frameshift product might be incorrectly identified as the in-frame product (with lower confidence because of no matches past the frameshift site), or not at all, because there are too few peptides matching in the original frame. The normal and transframe proteins translated from a sequence prone to programmed frameshift will be identified as only a single product. Without new search methods, such “under-identifications” may end up hiding mechanisms of biological significance from researchers.
The present genome-search approach was conceived to aid in the detection of alternative products among proteins from mitochondria of Saccharomyces cerevisiae separated by HPLC and analyzed by electrospray ionization (ESI)-MS of both intact proteins and their tryptic-digest products (B.M., C. Nelson, A.A.S., A. J. Baucum, M.C.G., R. Chowdry, J. Simmons, N. Wills, J. Atkins, and R.G., unpublished work). The genome fingerprint scanning (GFS) application matches peptide mass fingerprint data to a genomic locus without reference to ORF, protein anchor, or other annotation. The entire putative proteome is translated from the full genome sequence and digested by using the rules for a particular protease. The program matches masses from the peptide fingerprint to those generated by the in silico digestion, then scans in windows across the genome to identify regions where a high density of hits indicates a putative genomic origin for one of the sample proteins. The process is summarized in Fig. 1.
The in silico digestion of the raw S. cerevisiae proteome (translated genome) in all six reading frames using rules for the enzyme trypsin generates 8.9 million hypothetical peptides. The profusion of in silico fragments generated by the simulated digestion of a whole genome results in a large background of spurious hits against which to discern genuine matches. For example, in 100 full-genome searches against 40 randomly generated masses at 0.05% mass tolerance, an average of 145,190 or 1.6% of the S. cerevisiae fragments match in each search, equivalent to an average of 3,630 fragments per input mass. If evenly distributed along forward and reverse strands, a hit would occur every 185 nt. In reality, the hits are unevenly distributed, forming clusters of varying density determined by factors such as the location of lysine and arginine-encoding codons across the genome. In any case, masses from the sample protein invariably match many regions of the genome in addition to their genomic origin. Conversely, because of competitive ionization of peptide species, only a portion of the peptides expected from a given coding region are observed by means of MS analysis.
The method we have developed appears able to discriminate real protein hits from the random background when using a loose match tolerance of 0.05% (500 ppm). Tests with data from liquid chromatography-separated yeast mitochondrial proteins analyzed with ESI-MS (Micromass Quattro II, Manchester, U.K.) and from matrix-assisted laser desorption ionization (MALDI)-MS analysis of 2D gel-separated Escherichia coli proteins produced statistically significant identifications (P < 0.05), the majority in agreement with mascot or peptident.
The GFS method has promise for identifying proteins involving nonstandard translation and also has potential to be used for genome annotation based directly on observed proteins (14, 15). Researchers have also recently developed methods for scanning raw genomic data by using tandem MS data (16, 17). Tandem MS involves an additional step where select peptides are introduced into a collisional unit and bombarded with heavy atoms. The resulting stepwise fragmentation pattern can be used to reconstruct the peptide sequence. Although approaches that incorporate tandem and higher-order MS are generally agreed to provide the most authoritative identifications, the analysis of spectra can be complicated by posttranslational modifications (18) or internal (nonterminal) dissociation of peptide bonds (19). In contrast, peptide mass fingerprinting has the benefit of simplicity and lower equipment cost. We believe that GFS and the tandem-MS genome-scanning methods could be used complementarily to produce a high-confidence, genome-based characterization of protein samples.
Computational Methods
In silico digestion of an entire genome sequence generates masses for all peptides that might be produced by its in vivo translation and subsequent proteolytic digestion. Translation and digestion are performed in all forward and reverse frames. The process proceeds from 5′ to 3′ on each sequence, keeping an in-process list of not yet terminated fragments and a final master list of terminated fragments. As each in-frame codon is encountered, the mass of its amino acid translation is added to each fragment of the in-process list. A new fragment is added to the in-process list after each cleavage or stop codon and at each start codon. Terminated fragments are transferred to the master list when cleavage sites and stop codons are encountered. Any fragment falling below a length threshold of three codons is discarded.
The enzyme trypsin cleaves after lysine and arginine, encoded by a total of eight codons. In the standard genetic code, the start codon is ATG (also coding internal methionine) and stop codons are TAA, TAG, and TGA. For mtDNA, start and stop codons are ATG, ATA and TAA, TAG, respectively. The system supports the use of alternative genetic codes by using separate translation dictionaries, selected by annotation in the FASTA header of a sequence, and defaulting to standard nuclear encoding.
Duplicate fragments (produced, e.g., when a start codon follows a cleavage codon) are prevented by checking a queue containing fragments recently transferred for any fragment with the same start codon and mass before transferring the new fragment to the final digest list. Incomplete proteolytic digestion of a protein sample results in peptides containing internal cleavage sites. The presence of such missed cleavages requires that the in silico digest peptides also contain them. The program uses the variable b to specify the maximal number of missed cleavage sites (breaks) allowed internally to a fragment. For efficiency, b is generally kept low, i.e., b ≤ 2.
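The digestion of a single reading frame can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the GFS implementation: it handles one forward frame only, uses the standard nuclear code, treats ambiguous codons like stops, and does not open extra fragments at start codons. The names `digest_frame`, `peptide_mass`, `max_breaks`, and `min_codons` are hypothetical; `max_breaks` plays the role of b.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard nuclear genetic code; "*" marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Monoisotopic mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def digest_frame(dna, max_breaks=2, min_codons=3):
    """Tryptic digest of one forward frame.

    Returns (start_nt, peptide, breaks) tuples, where breaks is the
    number of missed cleavage sites inside the fragment.
    """
    pieces = []      # fully cleaved pieces since the last stop codon
    fragments = []
    start, current = 0, ""

    def close_piece(next_start):
        nonlocal start, current
        if current:
            pieces.append((start, current))
            # Only the last max_breaks + 1 pieces can still be joined.
            del pieces[:-(max_breaks + 1)]
            for k in range(len(pieces)):
                joined = pieces[len(pieces) - 1 - k:]
                peptide = "".join(p for _, p in joined)
                if len(peptide) >= min_codons:
                    fragments.append((joined[0][0], peptide, k))
        start, current = next_start, ""

    for i in range(0, len(dna) - 2, 3):
        # Assumption: codons with ambiguous bases are treated like stops.
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "*")
        if aa == "*":
            close_piece(i + 3)
            pieces.clear()          # fragments never span a stop codon
        else:
            current += aa
            if aa in "KR":          # trypsin cleaves after Lys and Arg
                close_piece(i + 3)
    close_piece(len(dna))
    return fragments
```

Calling `digest_frame` once per frame on the sequence and on its reverse complement would cover all six frames; each returned peptide can be passed to `peptide_mass` to obtain the value that is matched against the fingerprint.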
In silico digestion of the nuclear plus mitochondrial genome of S. cerevisiae with b = 2 generates 8.9 million fragments and requires ≈200 MB of memory. The program stores the mass, start point, length, number of breaks, and reading frame of each fragment. The computational complexity of the digestion process is roughly linear, or O(n), in correspondence to genome size n. The matching process is O(nm), with m representing the size of the mass list. Further algorithmic details are provided in Supporting Text, which is published as supporting information on the PNAS web site, www.pnas.org.
Mass data from either MALDI-MS or ESI-MS is matched to the in silico digestion to within a tolerance level calculated as a percentage of the experimental mass (Δ%). Each matching fragment is mapped back to its position on the chromosome (Fig. 1). Two scans across each chromosome evaluate match criteria for windows of size w to identify those with a high score (scoring discussed below). For all experiments herein w = 500. The first scan scores windows in 100-nt increments, providing an internal histogram of score distributions used by the program to establish a cutoff level such that only the top n scoring clusters are examined on the subsequent scan. Currently, n is fixed at a value of 10. Regions with windows scoring above the threshold are examined in further detail by an extension scan used to determine the full extent of the hit-cluster region. The extension scan starts with each high-scoring window and proceeds backward in 50-nt steps, considering each window of size w until the score falls below a defined cutoff. This marks the start of the full region. A scan forward from the original window in 50-nt steps again considers windows of size w until the same cutoff is reached, marking the end of the extended region. The cutoff currently used is half of the score of the 10th highest-scoring window found on the initial scan.
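The matching and two-pass scan described above can be sketched as follows. This is an illustrative sketch only: the window score is replaced by a plain hit count as a stand-in for the scoring function, windows with no hits are dropped before the cutoff is computed, and all function names are hypothetical rather than taken from the GFS code.

```python
from bisect import bisect_left, bisect_right


def match_masses(observed, digest_masses_sorted, tol_percent=0.05):
    """Indices of digest masses within tol_percent of any observed mass."""
    hits = set()
    for m in observed:
        delta = m * tol_percent / 100.0
        lo = bisect_left(digest_masses_sorted, m - delta)
        hi = bisect_right(digest_masses_sorted, m + delta)
        hits.update(range(lo, hi))
    return sorted(hits)


def window_hits(hit_positions, start, w):
    """Stand-in score: number of hit start positions in [start, start + w)."""
    return bisect_left(hit_positions, start + w) - bisect_left(hit_positions, start)


def scan(hit_positions, chrom_len, w=500, coarse=100, fine=50, top_n=10):
    """Coarse scan in 100-nt steps, then extend the best windows in 50-nt steps."""
    hit_positions = sorted(hit_positions)
    scored = [
        (window_hits(hit_positions, s, w), s)
        for s in range(0, max(chrom_len - w, 0) + 1, coarse)
    ]
    # Keep the top_n windows; empty windows are ignored (an assumption).
    top = [t for t in sorted(scored, reverse=True)[:top_n] if t[0] > 0]
    if not top:
        return []
    cutoff = top[-1][0] / 2     # half the score of the lowest retained window
    regions = set()
    for _, s in top:
        left = s
        while left - fine >= 0 and window_hits(hit_positions, left - fine, w) >= cutoff:
            left -= fine
        right = s
        while (right + fine + w <= chrom_len
               and window_hits(hit_positions, right + fine, w) >= cutoff):
            right += fine
        regions.add((left, right + w))
    return sorted(regions)
```

For example, `scan([1000, 1050, 1100, 1150, 1200], 5000, top_n=1)` extends the best coarse window in both directions until the hit count drops below half of its score and reports a single region.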
Several scoring methods have been investigated. Simple measures include the total number of hits (matches) per window and the percentage of the DNA sequence contained within matching fragments. The former tends to favor regions containing multiple repeats of a peptide matching one of the input masses and fails to account for the lower frequency of peptides with missed cleavages. Sequence coverage scores are inflated by high-mass fragments. A scoring function was developed to address these issues. It considers the following aspects of each window of size w: (i) the number of hits in a single frame; (ii) the number of possible hits in the same frame (i.e., the total digested fragment count); (iii) the number of missed cleavages in each fragment; (iv) the number of in-frame stop codons encountered in the window before the current fragment; (v) duplicate mass matches; and (vi) abutment of fragments.
Multiplicative combination of these attributes is important for scoring features such as the number of preceding stop codons, missed cleavages, and duplicate fragments matched. For example, whereas a region with one in-frame stop may still be considered because it could be caused by a sequence error or stop codon read-through (13, 20), multiple in-frame stops are increasingly unlikely. If the probability of an in-frame stop codon is p, then the probability of s occurrences is p^s. The scoring equation does not attempt to model actual probabilities, because the data required to ascertain realistic probabilities or frequencies of these occurrences would be very difficult to obtain. Instead, the scoring function is intended to maximally discriminate randomly formed clusters from real hits, while allowing for occurrences such as sequence errors or missed cleavages.
Assignment of penalty values to each of the listed factors allows their multiplicative combination. The values are: c_b, penalty for missed cleavages, default 0.6; c_s, penalty for preceding in-frame stops, default 0.4; c_d, penalty for preceding duplicate-mass matches, default 0.6; and c_ā, penalty for N terminus not abutting a preceding fragment, default 0.9.
For a window containing t hypothetical fragments, h of which match experimentally measured peptide masses, we calculate a window score s by summing the penalty products for each fragment j = 1 … h:
[Eq. 1]
The functions b(f_j), s(f_j), d(f_j), and a(f_j) return counts for the number of breaks in a fragment (e.g., Lys and Arg codons), the number of stops preceding a fragment, the number of other preceding matches for this mass in the window, and whether or not (1 or 0) the amino terminus of the fragment abuts a preceding fragment, respectively. The term t normalizes the results for the total number of digest fragments in the window, but used by itself can skew results toward windows with a small number of possible fragments; h counterbalances this by giving weight to the number of hits in the window, and is reduced by d, the number of duplicate matches in the window. Scores are multiplied by a scaling factor of 100 to simplify the histogram analysis.
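A small Python sketch may help make the penalty combination concrete. Only the verbal description of Eq. 1 is available here, so the sketch assumes one plausible reading: each matched fragment contributes the product of its penalties raised to the corresponding counts, the contributions are summed, and the sum is scaled by 100 × (h − d)/t. That exact form, the `Hit` record, and the function name are assumptions for illustration, not the published equation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    breaks: int         # b(f_j): missed cleavage sites inside the fragment
    stops_before: int   # s(f_j): in-frame stops preceding it in the window
    dups_before: int    # d(f_j): earlier matches to the same mass in the window
    abuts: bool         # a(f_j): N terminus abuts a preceding fragment


def window_score(hits, t, c_b=0.6, c_s=0.4, c_d=0.6, c_a=0.9):
    """Assumed reading of the window score; t is the digest fragment count."""
    if t == 0 or not hits:
        return 0.0
    h = len(hits)
    d = sum(1 for f in hits if f.dups_before > 0)   # duplicate matches
    total = sum(
        c_b ** f.breaks
        * c_s ** f.stops_before
        * c_d ** f.dups_before
        * (1.0 if f.abuts else c_a)                 # penalty only when not abutting
        for f in hits
    )
    return 100.0 * (h - d) / t * total
```

Under this reading, a fragment with one missed cleavage that does not abut a neighbor contributes 0.6 × 0.9 = 0.54, while each preceding in-frame stop multiplies the contribution by a further 0.4, so windows that cross several stops lose score quickly.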
These scores are used to differentiate statistically significant match regions from the backdrop of random hits. Methods to assign probability estimators to fingerprint matches have been described by several groups. One approach is to derive a probabilistic function describing the likelihoods of alternative identifications, as was done in profound (5), or to assign a probability that a given protein match is by chance as was done in mascot (6). Another is to use randomized data to establish a baseline for assigning significance to matches with real data (21).
We use a method similar to the latter, by which we calculate the significance of a match region as a function of its window score and the total number of masses in the experimental spectrum. Each value in a randomly selected subset of a large set of experimental peptide masses is perturbed by a random modification representing the addition or subtraction of H, C, O, and N atoms. A large number of such mass lists is searched against the genome fragment database, establishing a histogram to represent the range of the null hypothesis (i.e., that any given result is caused by chance). The histogram is used to define Ps, the probability in a single genome scan that a randomly chosen set of masses, of the same size, would achieve a score equal to or above the score considered (derivation in Supporting Text). The scoring methods used by GFS and mascot cannot be directly equated because the GFS significance score describes the probability of a false positive in a complete genome scan, whereas mascot produces a score that describes the probability of a false positive for each protein considered.
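The randomization step can be sketched in Python as below. The number of atoms added or removed per mass (`max_atoms`) and the use of a simple tail fraction for Ps are assumptions made for illustration; the actual derivation is in the Supporting Text and is not reproduced here. Function names are hypothetical.

```python
import random

# Monoisotopic atomic masses in Da.
ATOM_MASS = {"H": 1.007825, "C": 12.0, "N": 14.003074, "O": 15.994915}


def random_mass_list(pool, size, rng, max_atoms=3):
    """Draw `size` masses from `pool` and perturb each one by adding or
    subtracting between 1 and max_atoms randomly chosen H, C, N, or O atoms."""
    masses = []
    for m in rng.sample(pool, size):
        for _ in range(rng.randint(1, max_atoms)):
            m += rng.choice((-1, 1)) * ATOM_MASS[rng.choice("HCNO")]
        masses.append(m)
    return masses


def empirical_p(score, null_best_scores):
    """Fraction of random-list scans whose best window score is >= score."""
    exceed = sum(1 for s in null_best_scores if s >= score)
    return exceed / len(null_best_scores)
```

In use, each random list would be searched against the genome digest, the best window score of each search collected into `null_best_scores`, and `empirical_p` evaluated for the score of a real match obtained with a mass list of the same size.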
The system consists of a client-server pair with a UNIX command-line interface. The server performs the in silico digestion and keeps the resulting database in memory, obviating the recomputation of the genome digest over multiple MS analyses. Each client receives a peptide mass list, connects to the server for processing, and outputs the results. The client generates a formatted HTML file displaying the clusters for the 10 highest-scoring windows, with matched fragments highlighted in different colors according to the reading frame in which they are found to facilitate visual identification of transframe events. The entire region's score and the highest score for any contained window of size w are reported. The significance is calculated as a function of the latter, maximal fixed window-size score.
Server parameters include the number of missed cleavages to be calculated, window size, a directory containing all FASTA-formatted genome sequences, and whether to use average or monoisotopic masses. Client parameters include the file containing a peptide mass list, the host name of the server, the TCP port number, Δ% mass tolerance, and parameters related to random trials. The programs currently run on the UNIX command line of Mac OS X, with potential deployment on Linux. We plan to make a graphical interface accessible to other researchers via the web. The current prototype code is available free of charge to nonprofit researchers.
Laboratory Methods
S. cerevisiae Mitochondria.
The S. cerevisiae strain used was BY 4743 Diploid, His 3Δ, Leu 2Δ, Ura 3Δ. Yeast cultures were grown in YPGE (1% yeast extract/2% bactopeptone/2% glycerol/2% ethanol) media at 30°C to an OD600 of <1.0. Mitochondria were isolated as described (22). Oxyliticase (Enzogenetics, Corvallis, OR) was used instead of or in addition to Zymolase (ICN) in some preparations. Mitochondria were lysed by sonication in 50 mM 3-cyclohexylamino-1-propane sulfonic acid (CAPS), pH 10.5. Polyethyleneimine was added to a final concentration of 0.1% to precipitate nucleic acids. After a 20-min incubation at 4°C, samples were centrifuged at 60,000 rpm for 2 h in a Beckman TL-100 centrifuge (TLA 100.3 rotor).
Cleared mitochondrial lysate was separated on a PerSeptive Biosystems (Framingham, MA) BioCad Sprint HPLC system with a 4.6 × 100 mm column packed with Poros 20 HQ (strong anion exchange) media (PerSeptive Biosystems). The running buffer was 50 mM CAPS, pH 10.5, and proteins were eluted from the column with a 0 to 1 M NaCl gradient over five column volumes. Collected fractions were further separated on the same system with Poros 20 R2 (reversed phase) media. The running buffer was 0.1% trifluoroacetic acid/15% acetonitrile, and proteins were eluted from the column with a 15–45% acetonitrile gradient. Fractions were lyophilized and then digested with modified sequencing grade trypsin (Promega) as per vendor instructions.
Molecular weights of proteins and peptides were determined by using positive-ion electrospray MS on a Quattro-II mass spectrometer (Micromass). ESI generates a series of multiply charged molecular ions from which mass assignments are derived for each protein or peptide. Molecular masses for peptides were determined by manually deisotoping and deconvolving the mass-to-charge (m/z) spectra.
E. coli.
Data from an analysis of selected proteins from two strains of E. coli, CSH 142 and CSH 156, were provided by workers at Kendrick Labs (Madison, WI), who performed 2D electrophoresis by using the methods described (23). A brief summary follows. Proteins were added as standards to the gel: myosin (220 kDa), phosphorylase A (94 kDa), catalase (60 kDa), actin (43 kDa), carbonic anhydrase (29 kDa), and lysozyme (14 kDa) (Sigma). Spots with large differences in expression level between the two strains were selected for analysis. The bands/spots were cut and digested by using 0.06 μg of modified trypsin (sequencing grade, Roche Molecular Biochemicals) in 13–15 μl of 0.025 M Tris, pH 8.5. The tubes were placed in a heating block at 32°C and left overnight. Peptides were extracted with 2 × 50 μl of 50% acetonitrile/2% trifluoroacetic acid and then the combined extracts were dried and resuspended in matrix solution, 4-hydroxy-α-cyanocinnamic acid in 50% acetonitrile/0.1% trifluoroacetic acid with two standards, angiotensin and bovine insulin. An aliquot of 0.7 μl was spotted onto the sample plate, completely dried, and washed twice with water. A PerSeptive Voyager DE-RP MALDI–time-of-flight mass spectrometer was used to analyze digest samples in the linear or reflector mode (Applied Biosystems). The National Center for Biotechnology Information and/or GenPept databases were searched by the service lab by using profound (http://prowl.rockefeller.edu/cgi-bin/ProFound), ms-fit (http://prospector.ucsf.edu), and peptident (http://us.expasy.org/tools/peptident.html).
Results
Data from ESI-MS analysis of S. cerevisiae mitochondrial proteins and MALDI-MS analysis of E. coli proteins were used to test the GFS system. We rejected the use of synthetic data for reasons similar to those cited by Perkins et al. (6), e.g., because the real experimental factors that play into the observed data are not well enough understood to be modeled as required for the generation of realistic synthetic data.
On the other hand, the absence of authoritative identifications for the analyzed proteins requires that we evaluate performance by comparing our results with those of other algorithms. Our performance assessment is based on comparisons with both peptident (4) and mascot (6), well recognized tools used for the data analysis in our yeast proteome project. In most cases both of the programs were used for comparison; however, there are a few for which only mascot was used, because later in the S. cerevisiae project that became the default analysis tool. For mascot analyses in our proteome project, proteins with a match score >70 were considered significant identifications.
As opposed to the single protein samples typically produced by 2D gel separation, our yeast mitochondrial samples contain multiple proteins, a result of the method of HPLC separation used. The regular presence of multiple proteins imposes certain constraints on the system. For example, it greatly complicates the consideration of mutually exclusive possibilities required to develop a Bayesian formula for probability assessment of identifications as was done in profound (5). An alternative is to base comparisons on the distribution of randomized data, as we have done here. The increased number of peptides in a multiprotein analyte further increases the background noise level, making matches more difficult to distinguish.
To establish the significance levels of scores we performed multiple repetitions of experiments by using randomized mass lists of varying lengths. For S. cerevisiae, 1,000 repeat trials were performed, with lists of length 20–100 in increments of five. An example histogram of scores produced from one such set of trials is shown in Fig. 3, which is published as supporting information on the PNAS web site. The process was repeated for E. coli with lists of 20–50 random masses, incremented by five. The program keeps a summary histogram plotting scores against the number of w = 500-nt windows achieving each score. A plot of the Ps < 0.001 (maximum scores) and Ps < 0.05 values versus mass-list size is in Fig. 2. The curves show a close to linear relationship between the number of masses input and confidence thresholds, with the Ps < 0.05 curve varying more smoothly because of the higher quantity of data available to establish it.
Twenty-two samples were chosen for the comparative performance analysis, 18 from yeast and four from E. coli. The yeast samples were selected at random from the larger pool of those available. The program was run with each mass list, and results were manually parsed, with the top two scoring clusters considered in each case. The score for each cluster region was compared with the null hypothesis to compute the P value by interpolating between the two nearest mass-list sizes (increments of five).
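The interpolation between the two nearest mass-list sizes can be sketched as follows. Linear interpolation and a simple tail-fraction P value are assumptions here, because the text does not state the exact scheme; `null_by_size` is a hypothetical dictionary mapping each simulated list size to the best scores from its random trials.

```python
def tail_fraction(score, null_scores):
    """Share of null scores that are greater than or equal to `score`."""
    return sum(1 for s in null_scores if s >= score) / len(null_scores)


def interpolated_p(score, n_masses, null_by_size, step=5):
    """Interpolate the P value linearly between the bracketing list sizes."""
    lower = (n_masses // step) * step
    p_lower = tail_fraction(score, null_by_size[lower])
    if lower == n_masses:           # size was simulated directly
        return p_lower
    upper = lower + step
    p_upper = tail_fraction(score, null_by_size[upper])
    weight = (n_masses - lower) / step
    return p_lower + weight * (p_upper - p_lower)
```

For a 41-mass list, the result lies one-fifth of the way from the P value at size 40 to the P value at size 45.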
The genome position of each identified cluster was compared with ORF annotations for yeast or E. coli. An ORF encompassing and in-frame with the cluster region was noted as a match. For the yeast searches, an in-house program was used to match positions to annotated ORFs in a mid-2001 download from the Saccharomyces Genome Database (http://genome-www.stanford.edu/Saccharomyces/). E. coli searches were performed manually against the National Center for Biotechnology Information database by using the summary data at www.ncbi.nlm.nih.gov/cgi-bin/Entrez/altik?gi=115&db=Genome (late 2001). The parameters for all reported experiments were w = 500, Δ% = 0.05, and b = 2, with monoisotopic peptide masses used for the yeast/ESI experiments and average peptide masses for the E. coli/MALDI experiments. Fig. 4, which is published as supporting information on the PNAS web site, is an illustration of typical output from the program.
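The ORF comparison rule (encompassing and in-frame) can be expressed as a short Python check. This is a hypothetical sketch, not the in-house program: it assumes 1-based inclusive coordinates, and it assumes that the cluster coordinates span whole codons of the matched fragments, so that the frame can be compared through the distance between 5′ ends.

```python
def orfs_matching_cluster(cluster_start, cluster_end, strand, orfs):
    """Return names of ORFs that encompass the cluster and share its frame.

    orfs is a list of (name, start, end, strand) tuples with start <= end;
    strand is "+" or "-".
    """
    matches = []
    for name, start, end, orf_strand in orfs:
        if orf_strand != strand:
            continue
        if not (start <= cluster_start and cluster_end <= end):
            continue                      # ORF must encompass the cluster
        # In-frame: the 5' ends differ by a multiple of three nucleotides.
        offset = cluster_start - start if strand == "+" else end - cluster_end
        if offset % 3 == 0:
            matches.append(name)
    return matches
```

An empty result corresponds to a cluster that lies outside every annotated ORF or in a different frame, which is the situation of interest when looking for missing annotation or frameshift products.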
Seventeen of the 22 samples had a top-scoring cluster region with significance Ps < 0.05. Of the five not within this threshold, four had top-scoring clusters falling in ORFs identified in the database as mitochondrially localized proteins; the remaining region did not correspond to an obvious ORF. The top-scoring GFS-identified ORF was also top scoring for one of the other programs in 16 cases, including one case where mascot and GFS agreed that no significant matches were present. When counting agreement between either first- or second-place significant protein matches, the GFS and mascot/peptident corroboration increases to 19 samples or 86%. Table 1 shows six representative results, including two disagreements and one where the second-place identification was corroborated. The disagreements provide insights into differences between the algorithms. Sample 2378 F9 with GFS score 121 (Ps ≈ 0.004) mapped to an ORF for the mitochondrial precursor of alcohol dehydrogenase (ADH3), whereas the other programs identified the sample protein as phosphoglycerate kinase (PGK). Given the high significance and mitochondrial localization of ADH3, it is likely that GFS correctly identified a different component of the sample than the other programs. GFS also corroborated the PGK identification, it being the second top-scoring region found.
Table 1.
Sample ID | Organism/method | No. of masses | GFS P value | GFS score, ×100 | Programs agree? | GFS chromosome location | peptident ID | mascot ID (score) | GFS ID | Notes |
---|---|---|---|---|---|---|---|---|---|---|
2378 E2 | Yeast/ESI | 41 | <0.001 | 112 | Yes | Chr XII 734881–736039 (rev) | ACO1 | ACO1 (90) | ACO1 | Aconitate hydratase |
WT-5 | E. coli/MALDI | 29 | <0.001 | 88 | Yes | 849492–850188 | ompX | NA | ompX | Outer membrane protein X |
2378 F9 | Yeast/ESI | 61 | 0.004 | 121 | No | Chr XIII 435350–436343 | PGK | PGK (125) | ADH3 | GFS second-scoring hit is PGK, S = 82 |
2681 E1 | Yeast/ESI | 60 | 0.028 | 102 | No | Chr XIII 529500–530700 | ABC1 | Nothing | POM152 | Parent is 151 kDa, which may be why mascot/peptident did not find it |
2681 D11 | Yeast/ESI | 60 | 0.208 | 84 | Yes | Chr XV 387289–387874 (rev) | Q12452 | Nothing | Nothing | — |
2838 C3 | Yeast/ESI | 47 | 0.317 | 63 | No | Chr XV 363100–364100 | Nothing | Nothing | ?PET127? | Although of low significance, PET127 is a large, putative mitochondrial protein |
Shown are the sample, organism, MS method, and number of masses collected for program input. The significance (P value), calculated by Ps, and top-ranked score found by GFS for the sample follow. The next columns indicate whether GFS and one of the programs agreed, the chromosomal location of the region identified by GFS (rev denotes the reverse strand), and the identification results produced by peptident, mascot, and GFS, followed by notes regarding the sample.
Another analyte was identified by GFS as PET127, a protein annotated in GenBank as a component of the mitochondrial translation system. For this sample neither mascot nor peptident found anything significant. It appears from the GFS program output that only a small, ≈10-kDa portion of the protein was present during tryptic digestion, which would account for the poor significance score. It would also explain the difficulty mascot had with this, because this is a small piece of a large 93-kDa protein predicted in the ORF database. In another case, GFS identified POM152 (Ps ≈ 0.03), whereas mascot found nothing significant and peptident had a weak match for ABC1. As with the previous case, the parent protein predicted by the database (Saccharomyces Genome Database) is large: 151 kDa. This finding highlights an advantage of position-based peptide matching. It is likely that a portion of analyzed proteins has undergone in vivo proteolysis, causing incomplete peptide coverage. Because the peptide coverage, when averaged over a large protein, will be low, such cases can confound searches that rely on ORF or protein annotation. With the genome-based positional scanning of GFS, parent protein size does not directly affect its performance unless a protein is much smaller in size than the window used for analysis.
Table 2 illustrates the detection of multiple proteins within a sample containing a large number of peptides (88 total). To calculate the significance for each match region we remove from the experimental mass list all masses matched to a higher-scoring region and not contained in the current region and calculate the significance corresponding to the score and this new number of masses. The rationale for this is its equivalency to removing the masses previously matched and rerunning the scan. GFS is unique among MS search algorithms for its ability to establish significance levels based only on the number of masses input. The use of a fixed window size for scanning simplifies the detection of multiple proteins and assessment of their significance in a single-pass analysis.
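The bookkeeping behind the reduced mass counts can be sketched in Python. The representation of each region as a set of matched mass indices, ordered by descending score, is an assumption made for the example, and the function name is hypothetical.

```python
def remaining_mass_counts(n_masses, regions):
    """Number of masses left in the list when each region is evaluated.

    regions: list of sets of matched mass indices, highest-scoring first.
    For each region, masses matched to higher-scoring regions are removed
    unless the current region also matches them.
    """
    counts = []
    claimed = set()                 # masses matched by higher-scoring regions
    for matched in regions:
        removed = claimed - matched
        counts.append(n_masses - len(removed))
        claimed |= matched
    return counts
```

The top-scoring region is always evaluated against the full list, and a lower-ranked region that shares a mass with a higher-ranked one keeps that mass in its count.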
Table 2.
Sample ID | GFS ORF ID | GFS match name | Location | No. of masses remaining in list | Peptides matched | GFS score | Significance | mascot ID rank (score) | peptident rank |
---|---|---|---|---|---|---|---|---|---|
2667 E10(1) | ILV5 | Ketol-acid reductoisomerase, mitochondrial | XII 838501–839047 (rev) | 88 | 12 | 142 | 0.032 | 2 (95) | 1 |
2667 E10(2) | SOD2 | Manganese-containing superoxide dismutase | VIII 122827–123730 (rev) | 77 | 14 | 138 | 0.009 | 3 (61) | — |
2667 E10(3) | YBR271W | Hypothetical ORF | II 745051–745987 | 67 | 13 | 105 | 0.064 | — | — |
2667 E10(4) | YPL245W | Hypothetical ORF | XVI 86398–86875 | 52 | 13 | 104 | 0.002 | — | — |
2667 E10(5) | ATP2 | F(1)F(0)-ATPase complex beta subunit, mitochondrial | X 647391–647970 | 37 | 10 | 103 | 0.002 | 3 (61) | 6 |
The mass spectrometer output listed 88 masses. To calculate significance values we removed from the experimental mass list all masses matched to higher-scoring regions and not contained in the region currently being considered, and calculated the significance corresponding to the score and this new number of masses by using the probability regression in Fig. 2. Shown is the identified ORF, along with the ranking of this identification compared to mascot and peptident ranking for the same identification.
Identifications for all four E. coli samples matched the standard results, and all of them had GFS scores with significance Ps < 0.001. The stronger E. coli results are likely caused by the samples each having only one protein and by higher mass accuracy from both the instrument and data analysis methods. The ESI process generates ions with multiple charge states, whereas MALDI typically produces singly charged ions. For the ESI data, we lacked appropriate software to perform deconvolution of multiply charged spectra into straight mass spectra; our manual deconvolution may have been less accurate. Given the lower quality of these data and the relaxed match criteria used (500 ppm), the performance of GFS on the ESI/yeast data is promising.
Discussion
The empirically determined working parameter sets sufficed for practical operation on these distinct data sets but should be optimized by further experimentation to match the properties of the experimental equipment used. For example, mass tolerance should be decreased for higher mass accuracy data, effectively reducing the number of random hits. Optimization of window size is potentially complex, depending on protein size, gene composition, and the local configuration of matched fragments. If spliced genes are analyzed, the window size may have to be much larger, with a second, smaller window scan to identify putative exons.
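To make the effect of mass tolerance concrete, the following minimal Python sketch compares the match window at 500 ppm and 50 ppm. The function names and the toy masses are hypothetical and chosen for illustration; this is not the matching code of the GFS prototype.

```python
def tolerance_da(mass, ppm):
    """Width of the one-sided match window, in Da, for a given ppm tolerance."""
    return mass * ppm / 1e6


def matches(observed, theoretical, ppm):
    """True if the observed mass lies within the ppm window of the theoretical mass."""
    return abs(observed - theoretical) <= tolerance_da(theoretical, ppm)


def count_hits(observed_masses, theoretical_masses, ppm):
    """Number of observed masses that match at least one theoretical mass."""
    return sum(
        any(matches(o, t, ppm) for t in theoretical_masses)
        for o in observed_masses
    )


# Hypothetical toy masses (Da).
theoretical = [1500.00, 2000.00]
observed = [1500.05, 1500.60, 2000.90, 2100.00]

hits_500 = count_hits(observed, theoretical, ppm=500)
hits_50 = count_hits(observed, theoretical, ppm=50)
print(tolerance_da(1500.0, 500), tolerance_da(1500.0, 50))
print(hits_500, hits_50)
```

At 1,500 Da the window shrinks from 0.75 Da to 0.075 Da, so masses that matched only because of the wide window no longer count as hits.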
We performed an experiment to estimate the extent of the random background for large mammalian genomes. We ran 1,000 queries with randomly generated 41- and 61-length mass lists against the in silico digest of human chromosome XIV, comprising 1/35th of the human genome. Because 1,000 queries × 1/35 ≈ 28, the background from this experiment should be roughly equivalent to scanning the entire genome 28 times. The maximum score for 41-mass lists was 107, and for 61-mass lists was 134, indicating that a real match scoring above those thresholds would have a significance value of less than ≈1/28 ≈ 0.04. Our yeast sample (2378 E2), which had a 41-mass input list, produced a score of 114, which for yeast corresponds to Ps < 0.001 and for a genome of human size and composition would still have significance <0.04. A 61-mass sample that had Ps ≈ 0.004 for yeast would have Ps ≈ 0.13 if matched against a human-sized genome. These data indicate that scanning a much larger genome is statistically feasible. Improved MS data, allowing reduction of the tolerance from the present 500 ppm to ≈50 ppm, should provide a proportional 10-fold reduction in the random background, improving the likelihood of success. However, the identification of proteins consisting of multiple exons remains a considerable challenge that has not yet been addressed.
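The arithmetic behind the ≈0.04 bound can be written out as a short Python sketch. The function names are hypothetical, and the calculation assumes, as the text does, that random queries against 1/35th of the genome scale linearly to whole-genome scans.

```python
def genome_equivalents(n_queries, genome_fraction):
    """Number of whole-genome scans represented by the random queries."""
    return n_queries * genome_fraction


def significance_bound(n_queries, genome_fraction):
    """Upper bound on the significance of a score above the random maximum:
    no random score exceeded it in this many genome-equivalent scans."""
    return 1.0 / genome_equivalents(n_queries, genome_fraction)


equivalents = genome_equivalents(1000, 1 / 35)
bound = significance_bound(1000, 1 / 35)
print(round(equivalents, 1), round(bound, 3))
```

With 1,000 queries against 1/35th of the genome, this gives about 28.6 genome equivalents and a bound of about 0.035, which the text rounds to 28 and 0.04.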
Although our implementation is a proof-of-principle prototype, we have obtained interesting results that contribute useful information to our research on yeast mitochondrial proteins. Applied to a genome less well annotated than that of yeast, this method could contribute to the identification of proteins for which the database representation is incomplete. Potential future enhancements include the automatic assignment of probability values from the regression of random trials, integration with ORF databases for automatic output of the ORF covering any high-scoring cluster, and a web-based interface. We are optimizing the code for high-throughput analysis on modern vector processors and plan to deploy the system to provide concurrent search of multiple mass lists and several genomes.
Acknowledgments
We thank Pavel Baranov, Hendrick Labs, and the Protein Chemistry Core Facility at Columbia University (New York) for providing the E. coli data and Chad Nelson for producing the ESI-MS data for S. cerevisiae. This work was supported by National Institutes of Health Genome Scholar Award HG00044 (to M.C.G.) and Department of Energy Grant DE-FG03-99ER62732 (to R.G.).
Abbreviations
- ESI: electrospray ionization
- GFS: genome fingerprint scanning
- MALDI: matrix-assisted laser desorption ionization
References
- 1. Mann M, Hojrup P, Roepstorff P. Biol Mass Spectrom. 1993;22:338–345. doi: 10.1002/bms.1200220605.
- 2. Henzel W J, Billeci T M, Stults J T, Wong S C, Grimley C, Watanabe C. Proc Natl Acad Sci USA. 1993;90:5011–5015. doi: 10.1073/pnas.90.11.5011.
- 3. James P, Quadroni M, Carafoli E, Gonnet G. Biochem Biophys Res Commun. 1993;195:58–64. doi: 10.1006/bbrc.1993.2009.
- 4. Wilkins M R, Gasteiger E, Bairoch A, Sanchez J C, Williams K L, Appel R D, Hochstrasser D F. Methods Mol Biol. 1999;112:531–552. doi: 10.1385/1-59259-584-7:531.
- 5. Zhang W, Chait B T. Anal Chem. 2000;72:2482–2489. doi: 10.1021/ac991363o.
- 6. Perkins D N, Pappin D J, Creasy D M, Cottrell J S. Electrophoresis. 1999;20:3551–3567. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
- 7. Taylor J A, Walsh K A, Johnson R S. Rapid Commun Mass Spectrom. 1996;10:679–687. doi: 10.1002/(SICI)1097-0231(199604)10:6<679::AID-RCM528>3.0.CO;2-Q.
- 8. Wagner L A, Weiss R B, Driscoll R, Dunn D S, Gesteland R F. Nucleic Acids Res. 1990;18:3529–3535. doi: 10.1093/nar/18.12.3529.
- 9. Black D L. Cell. 2000;103:367–370. doi: 10.1016/s0092-8674(00)00128-8.
- 10. Liu C C, Simonsen C C, Levinson A D. Nature. 1984;309:82–85. doi: 10.1038/309082a0.
- 11. Tesar M, Harmon S A, Summers D F, Ehrenfeld E. Virology. 1992;186:609–618. doi: 10.1016/0042-6822(92)90027-m.
- 12. Gesteland R F, Atkins J F. Annu Rev Biochem. 1996;65:741–768. doi: 10.1146/annurev.bi.65.070196.003521.
- 13. Tate W P, Mannering S A. Mol Microbiol. 1996;21:213–219. doi: 10.1046/j.1365-2958.1996.6391352.x.
- 14. Andersen J S, Mann M. FEBS Lett. 2000;480:25–31. doi: 10.1016/s0014-5793(00)01773-7.
- 15. Pandey A, Mann M. Nature. 2000;405:837–846. doi: 10.1038/35015709.
- 16. Kuster B, Mortensen P, Andersen J S, Mann M. Proteomics. 2001;1:641–650. doi: 10.1002/1615-9861(200104)1:5<641::AID-PROT641>3.0.CO;2-R.
- 17. Choudhary J S, Blackstock W P, Creasy D M, Cottrell J S. Proteomics. 2001;1:651–667. doi: 10.1002/1615-9861(200104)1:5<651::AID-PROT651>3.0.CO;2-N.
- 18. Pevzner P A, Mulyukov Z, Dancik V, Tang C L. Genome Res. 2001;11:290–299. doi: 10.1101/gr.154101.
- 19. Bafna V, Edwards N. Bioinformatics. 2001;17, Suppl. 1:S13–S21. doi: 10.1093/bioinformatics/17.suppl_1.s13.
- 20. Tate W P, Mansell J B, Mannering S A, Irvine J H, Major L L, Wilson D N. Biochemistry (Moscow) 1999;64:1342–1353.
- 21. Eriksson J, Chait B T, Fenyo D. Anal Chem. 2000;72:999–1005. doi: 10.1021/ac990792j.
- 22. Glick B, Pons L. Methods Enzymol. 1995;260:213–223. doi: 10.1016/0076-6879(95)60139-2.
- 23. O'Farrell P H. J Biol Chem. 1975;250:4007–4021.