Author manuscript; available in PMC: 2005 Dec 6.
Published in final edited form as: Mol Cell Proteomics. 2005 Apr 28;4(7):1002–1008. doi: 10.1074/mcp.M500064-MCP200

Precise and Parallel Characterization of Coding Polymorphisms, Alternative Splicing and Modifications in Human Proteins by Mass Spectrometry

Michael J Roth 1, Andrew J Forbes 1, Michael T Boyne II 1, Yong-Bin Kim 1, Dana E Robinson 1, Neil L Kelleher* 1
PMCID: PMC1307493  NIHMSID: NIHMS3417  PMID: 15863400

Summary

The human proteome is a highly complex extension of the genome wherein a single gene often produces distinct protein forms due to alternative splicing, RNA editing, polymorphisms, and posttranslational modifications (PTMs). Such biological variation, compounded by the high sequence identity within gene families, currently overwhelms the complete and routine characterization of mammalian proteins by mass spectrometry (MS). A new database of human proteins (and their possible variants) was created and searched using tandem mass spectrometric data from intact proteins. This first application of Top Down MS/MS to wild-type human proteins demonstrates both gene-specific identification and the unambiguous characterization of multi-faceted mass shifts (Δm’s). Such Δm values found from the precise identification of 45 protein forms from HeLa cells reveal 34 coding SNPs, two protein forms from alternative splicing, and 12 diverse modifications (not including simple N-terminal processing), including a previously unknown phosphorylation at 10% occupancy. Automated protein identification was achieved with a median probability score of 10⁻¹³ and often occurred simultaneously with dissection of diverse sources of protein variability as they occur in combination. Top Down MS therefore has a bright future for enabling non-mass spectrometrists to annotate precisely the gene products expressed from the human genome.

Keywords: Proteomics, Top Down MS, modifications, single nucleotide polymorphisms, alternative splicing

Abbreviations Used: PTM — posttranslational modification, MS — mass spectrometry, MS/MS — tandem mass spectrometry, Δm — mass discrepancy, MALDI — matrix-assisted laser desorption ionization, ESI — electrospray ionization, cSNP — nonsynonymous coding single nucleotide polymorphism, FTMS — Fourier transform mass spectrometry, ECD — electron capture dissociation, CAD — collisionally-activated dissociation, IRMPD — infrared multiphoton dissociation, BAF — barrier-to-autointegration factor, SWIFT — stored waveform inverse Fourier transform, THRASH — thorough high-resolution analysis of spectra by Horn

Introduction

Due to the presence of polymorphisms, alternative splicing, and posttranslational modifications (PTMs) the human proteome is highly complex, often encoding multiple protein forms for a given gene (1). This biological complexity poses a significant analytical and bioinformatic challenge to the detailed analysis of mammalian proteomes by mass spectrometry (MS) and is exacerbated by the presence of gene families sharing high sequence identity (2, 3). Protein modifications are often indicative of changes in cellular or tissue dynamics and therefore play central roles in regulation of the cell cycle or development of disease. Whether for new diagnostics or understanding molecular mechanisms in cell biology, protein identification using tryptic peptides has revolutionized the analysis of complex mixtures by mass spectrometry (1, 4).

High-throughput platforms based on matrix-assisted laser desorption ionization (MALDI) (5) and electrospray ionization (ESI) employ MS/MS engines capable of spectral acquisition at a rate of >10⁴/week (6, 7). Recent studies indicate significant inefficiencies associated with such large-scale “Bottom Up” analyses in mammalian systems, including imperfect enzymatic cleavage (8, 9) and some MS/MS spectra requiring manual interpretation/validation for identification. Despite these lingering difficulties, peptide analysis remains the best and most general method for large-scale protein identification today, although information on coding polymorphisms (cSNPs), alternative splicing (10), and PTMs is challenging to obtain (2).

Recent developments by Yates and Lubman use three proteases and “MudPIT” technology (11, 12) or isoelectric focusing, reversed-phase chromatography, and three mass spectrometers (13), respectively, to obtain mass information on ~70–99% of the primary protein structure. Combining intact protein measurement with near-exhaustive peptide analysis of five proteins from human cells allowed detection of N-terminal modifications and one alternatively spliced transcript (13). While cSNP analysis of abundant blood proteins is possible (14), a general informatic strategy has yet to systematically integrate DNA- and RNA-level data with the MS-based interrogation of the human proteome. This is accomplished here using a database of human proteins tailored for the “Top Down” MS approach by combinatorial consideration of protein variability during a search (i.e., “Shotgun Annotation”) (15). While nucleic acid-based approaches represent the highest throughput and best overall methods for capturing information about SNPs, proteomics-based approaches allow cSNP genotyping concurrent to modification and splice variant identification.

The direct fragmentation of intact protein ions using Fourier Transform (FT) MS now provides probability scores that are orders of magnitude better than searches based on tryptic peptides (16–18), a far more efficient and robust reconstruction process for the primary structure of the mature protein, and detection of more diverse mass discrepancies (Δm’s) than targeted analysis approaches (e.g., for phosphopeptides). Major limitations of Top Down MS are that proteins >50 kDa remain difficult to handle routinely, that low-occupancy and multivalent PTMs (such as glycosylations) are difficult to detect, and that only medium-scale projects (<200 proteins) from microorganisms have been achieved (19). The Top Down MS/MS approach using standard fragmentation methods or electron capture dissociation (ECD) has provided 100% coverage with localization of basic PTMs for proteins in Bacteria (17, 20), Archaea (16, 17), yeast (19, 21, 22) and a plant (23).

Here, we demonstrate unparalleled characterization of human (nuclear) proteins revealing 7 different types of modifications in regulation and maturation including a novel phosphoprotein. This was achieved by extending the database concept of “Shotgun Annotation” from a single human histone (15) to a proteomic scale and required the integration of diverse DNA, RNA, and protein level information. This work establishes the basis for routine application of Top Down MS to capture coding haplotypes within a gene and allele-specific splicing and modification patterns on a far greater number of human proteins.

Experimental Procedures

Cell culture and lysate fractionation

Human HeLa-S3 cells were grown to a density of 0.6 × 10⁶ cells/mL using Joklik’s modified MEM and supplemented with 5% newborn calf medium. Cells were harvested using centrifugation at 2500 × g and two washes in cold PBS. The nuclei were precipitated and isolated using detergent washes and the cytosol extracted (24). The isolated nuclei were then resuspended and, for a portion of the extract, the chromatin (including DNA binding proteins) was precipitated by adding 0.5 N NaCl and 5 mM MgCl2. The proteins in solution were then loaded onto a prep cell (Bio-Rad, Hercules, CA) with a 12% T gel using an acid-labile surfactant (21). Proteins from the prep cell fractions were precipitated, treated at pH 2 for 1 h, then separated using a Symmetry C4 RPLC column (Waters, Milford, MA). For ~25% of identified proteins including BAF (Figure 4), the PF 2D system (Beckman Coulter, Fullerton, CA) was used for separation of proteins by pI, then reversed-phase liquid chromatography (RPLC) as outlined in the PF 2D manual.

Figure 4.

Characterization of a previously unknown phosphoprotein using Top Down MS/MS. a) Intact MS of 10+ charge states of species at 10191.3 Da and 10271.4 Da (~10:1 ratio). b) ECD fragmentation results for the 10+ charge state of the species at 10271.4 Da illustrating localization of the 80 Da Δm to Thr2 or Ser3. Red markers indicate ECD ions that matched the unmodified sequence, blue markers match ions that harbor the +80 Da Δm (as well as N-terminal acetylation).

ESI/Q-FTMS

Fractionated protein mixtures were suspended in ESI solution (49.5% MeOH, 49.5% H2O and 1% formic acid) and spun at 14,000 rpm for 10 min. Sample solutions were then loaded into a 96 well plate and automatically introduced to the mass spectrometer using the NanoMate 100 (Advion BioSciences, Ithaca, NY). Approximately 10 μL of solution from each well were infused by automated nanospray into the heated metal capillary source. Typical samples enabled more than 40 min. of stable nanospray providing sufficient time to acquire high quality broadband MS, threshold MS/MS and ECD MS/MS scans for 2–3 intact proteins per sample. In cases of insufficient fragmentation for precise localization of PTMs, excess sample was used in a more targeted fashion and in some cases a greater number of scans were summed for collisionally-activated dissociation (CAD), infrared multiphoton dissociation (IRMPD) or ECD.

The instrument used in this study was a custom 8.5 Tesla Q-FTMS of the Marshall design (25). In the case of CAD external to the magnet bore, ions were selected using the quadrupole and fragmented using electrostatic acceleration (10–45 V) into an octopole pressurized to ~10 mTorr with nitrogen gas. In the case of IRMPD or ECD, a SWIFT window 7 m/z wide was used. The isolated charge state was then dissociated using IR laser radiation for 0.25 s–0.45 s (with a beam expander mounted in front of the laser, 40W, 75% power). After threshold dissociation, the quad-enhanced and SWIFT isolated species was dissociated using ECD. Electrons were introduced to the cell for 100–200 ms using a dispenser cathode 35 in. from the center of the magnet. The kinetic energy of the electrons was controlled by placing a 1–2 V bias potential on the filament of the dispenser cathode.

Automated data acquisition

A custom TCL automation script first acquired 5–10 broadband scans, followed by a quadrupole marching experiment. Upon completion, a modified THRASH algorithm (26) automatically determined Mr values, and the resulting peak list was used to select proteins for MS/MS analysis. The most abundant charge state of each protein was selectively accumulated using a notch-filtering quadrupole window 10 m/z wide, with 5–10 scans acquired automatically. For targeted proteins, 25 or 50 scans of axial-CAD or IRMPD were recorded to yield protein identifications. Automatically acquired ECD spectra were the sum of 100 scans.
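The relationship between a neutral Mr value and the m/z of a selected charge state can be illustrated with a short sketch. This is not the authors' TCL script; the function names are hypothetical, and the example uses the 13997.8 Da histone species and its 18+ charge state described later in the Results, with a 10 m/z window assumed to be centered on the reported 779 m/z.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def charge_state_mz(neutral_mass: float, charge: int) -> float:
    """Return the m/z of the [M + zH]z+ ion for a neutral mass in Da."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def in_isolation_window(mz: float, center: float, width: float = 10.0) -> bool:
    """Check whether an m/z value falls inside a window of the given full width."""
    return abs(mz - center) <= width / 2.0


# Neutral monoisotopic mass and charge taken from the text; the window
# center is an assumption made for illustration only.
mz_18 = charge_state_mz(13997.8, 18)
print(round(mz_18, 2))
print(in_isolation_window(mz_18, center=779.0))
```

Because the reported Mr is monoisotopic, the most abundant isotopic peak of the real ion lies slightly higher in m/z than the value computed here.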

Construction of the custom human database

A highly annotated database of human protein forms was created within ProSight Warehouse (27) using conflict sequences, splicing data, PTMs from UniProt (28), SNP information from dbSNP, and a variety of manually entered data, such as new PTMs found in the primary literature. UniProt databases were transformed from Swiss-Prot format by a custom database loader created using Perl scripts and BioPerl libraries. In order to populate the database with SNP information, dbSNP was queried for nonsynonymous, coding polymorphisms with an available corresponding protein accession number. The resultant information was populated to a local database. Using a portion of dbSNP running locally, protein sequence information and function/description were obtained. Using custom Perl scripts, the results were converted to the necessary ProSight Warehouse format. A database loader application then extracted the protein information and populated ProSight Warehouse with all possible protein forms based on combinations of known variations for each gene product (15). The current number of protein forms in the human database is 2,823,267 yielding a SQL database of 3.5 GB with 17,333 proteins containing 1–10 cSNPs for subsequent searching using ProSight Retriever (29).
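The combinatorial expansion behind such a database can be sketched as follows. This is a minimal illustration, not the ProSight Warehouse loader: the six-residue peptide, its two cSNP positions, and the function name are hypothetical, and only N-terminal acetylation is considered as a modification. Residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses (Da) for the residues used in this toy example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "I": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056   # Da, added once per intact chain
ACETYL = 42.01057  # Da, N-terminal acetylation


def enumerate_forms(sequence, snps, allow_acetyl=True):
    """List (sequence, acetylated, mass) for every combination of cSNP
    alleles and N-terminal acetylation states.

    snps maps a 1-based residue position to its alternative residue.
    """
    positions = sorted(snps)
    allele_choices = [(sequence[p - 1], snps[p]) for p in positions]
    acetyl_choices = (False, True) if allow_acetyl else (False,)
    forms = []
    for alleles in product(*allele_choices):
        residues = list(sequence)
        for p, residue in zip(positions, alleles):
            residues[p - 1] = residue
        variant = "".join(residues)
        base_mass = sum(RESIDUE_MASS[r] for r in variant) + WATER
        for acetylated in acetyl_choices:
            mass = base_mass + (ACETYL if acetylated else 0.0)
            forms.append((variant, acetylated, mass))
    return forms


# Hypothetical peptide with two cSNPs (Ile3Val and Glu4Lys).
forms = enumerate_forms("ASIEGK", {3: "V", 4: "K"})
for variant, acetylated, mass in forms:
    print(variant, "Ac" if acetylated else "--", round(mass, 4))
```

Two biallelic sites and one optional modification already give 2 × 2 × 2 = 8 forms, which suggests how the number of stored forms can grow far beyond the number of genes.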

Data analysis and database searching

Intact protein MS and MS/MS data were analyzed by THRASH (26) resulting in a protein list and fragment ion list which were uploaded onto the ProSight PTM (27) web server for database searching (https://prosightptm.scs.uiuc.edu). The criteria for database searching were generally a ±2000 Da Mr window and 5–20 ppm tolerance for fragment ions, with default search options selected as follows: Met on/off, acetyl on/off, and SNPs on. P-scores reported in this study are calculated as previously reported (16) and those <10⁻³ required no manual validation of the identification result. Unless noted otherwise, Mr and fragment ion mass values reported are for neutral, monoisotopic peaks (using external calibration) and protein identification numbers are UniProt primary accession numbers.
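The two search criteria above, an intact-mass window in Da and a fragment tolerance in ppm, can be expressed in a few lines. The sketch below is only an illustration of those criteria: it does not reproduce ProSight PTM or its P-score calculation, the function names are hypothetical, and the fragment masses are invented round numbers.

```python
def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_intact_window(observed_mr: float, candidate_mr: float,
                         window_da: float = 2000.0) -> bool:
    """Keep a candidate whose Mr lies within +/- window_da of the observed Mr."""
    return abs(observed_mr - candidate_mr) <= window_da


def count_matching_fragments(observed, theoretical, tol_ppm: float = 10.0) -> int:
    """Count theoretical fragment masses matched by any observed mass."""
    matched = 0
    for t in theoretical:
        if any(abs(ppm_error(o, t)) <= tol_ppm for o in observed):
            matched += 1
    return matched


# Invented values for illustration only.
theoretical = [1000.0, 2000.0, 3000.0]
observed = [1000.005, 2000.1, 2999.99]
n_matched = count_matching_fragments(observed, theoretical, tol_ppm=10.0)
print(n_matched)
```

A wide intact-mass window combined with a tight fragment tolerance lets a candidate be retrieved even when its intact mass is shifted by an unanticipated Δm, while the fragments still constrain the match.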

Results and Discussion

Genotyping by Top Down MS

With one SNP present every ~1 kb in the human genome and 50,973 nonsynonymous cSNPs currently known in dbSNP alone, well over half of human genes contain cSNPs and Top Down MS/MS should enable robust genotyping even in the presence of PTMs. Fractions generated from a previously reported two-dimensional (2D) separation of intact proteins (21) typically contain multiple proteins of varying abundance, as in the ESI/Q-FTMS spectrum of Figure 1a. Of the 7 components, proteins of 6657.71 Da and 11644.8 Da were selectively accumulated and fragmented by CAD and separately using ECD (spectra not shown). The CAD fragmentation data of Figure 1b identified the 6.7 kDa component as a mitochondrial proteolipid (Pscore 4 × 10⁻⁷) containing a known cSNP encoding an Ile9Val residue change (Δm = 14.02 Da). Only the Ile9 allele was observed with an intact mass error of 18 ppm. The 11.6 kDa component was identified from the Figure 1c MS/MS data to be calgizzarin S100C (Pscore 1 × 10⁻¹²). The calgizzarin gene contains a cSNP translating to a one Da variability (Glu36Lys), readily resolved for the Glu36 allele observed in the background of N-terminal methionine loss/acetylation (overall 0.6 ppm error). This illustrates the efficiency of intact protein MS/MS for genotyping cSNPs, a feat not often possible using digestion-based approaches. Determination of minihaplotypes in coding regions (i.e., the co-occurrence of multiple alleles in a coding sequence) should also be possible using endogenous material itself instead of in vitro produced/artificial peptides from PCR-products (30).
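The Δm values quoted for these cSNPs follow directly from standard monoisotopic residue masses, as the short sketch below shows. The helper names are hypothetical; the only inputs taken from the text are the two substitutions and the 11644.8 Da intact mass.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "Ile": 113.08406,
    "Val": 99.06841,
    "Glu": 129.04259,
    "Lys": 128.09496,
}


def substitution_delta(ref: str, alt: str) -> float:
    """Mass change (Da) when residue `ref` is replaced by residue `alt`."""
    return RESIDUE_MASS[alt] - RESIDUE_MASS[ref]


def delta_in_ppm(delta_da: float, protein_mass_da: float) -> float:
    """Express a mass shift as a fraction of the intact mass, in ppm."""
    return abs(delta_da) / protein_mass_da * 1e6


ile_val = substitution_delta("Ile", "Val")   # magnitude ~14.02 Da
glu_lys = substitution_delta("Glu", "Lys")   # magnitude ~0.95 Da
glu_lys_ppm = delta_in_ppm(glu_lys, 11644.8)

print(round(ile_val, 4), round(glu_lys, 4), round(glu_lys_ppm, 1))
```

On an 11.6 kDa protein the Glu/Lys difference corresponds to roughly 80 ppm of the intact mass, far larger than the 0.6 ppm error reported above, which is why the two alleles can be told apart at the intact level.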

Figure 1.

Complete characterization of multiple cSNP-containing proteins from one fraction. a) Partial ESI/Q-FT mass spectrum (10 scans) of an ALS-PAGE/RPLC sample from human cells. b) Tandem mass spectrum (50 scans) from collisional dissociation of a 6.7 kDa protein selectively accumulated and fragmented using the quadrupole-enhancement to FTMS. c) Tandem mass spectrum (50 scans, axial-CAD) from dissociation of the 11.6 kDa species at 905 m/z. d) and e) Graphical fragment maps generated upon database retrieval using the MS/MS spectra of proteins highlighted in Figure 1a (insets). Tall and short markers represent fragment ions produced from CAD (b/y-type) and ECD (c/z-type), respectively.

Gene-specific identification and genotyping of a modified protein

Two-dimensional fractionation of a nuclear protein extract from asynchronous HeLa cells yielded various fractions containing core histones. Processing of one such sample by automated MS/MS provided ECD data for a 13997.8 Da component of only 8% relative abundance (Figure 2a). These MS/MS data (Figure 2c) specifically identified histone H2A family member O from 29 distinct H2A forms (17 gene family members and their variants, Supplementary Figure 1a) with a Pscore of 10⁻¹⁸. A sequence alignment was performed on H2A.O with the five most homologous protein forms in the H2A family (>80% identity; Supplementary Figure 1b), revealing that four fragment ions (of the 19 automatically assigned) provided the specificity for precise and automatic identification of H2A.O vs. the next best match. The H2A.O gene also contains a cSNP at residue 124 leading to a His→Tyr change (Δm = 26.00 Da). Only the His124 form was observed, indicating that these cells are homozygous at this locus. The observed intact mass contained a Δm of 42.01±0.02 Da, localized to the first five N-terminal residues (Figure 2d). This Δm is most likely acetylation of the N-terminus, though this same modification at Lys5 is formally possible. Thus, an automated data flow can now differentiate between posttranslationally modified and cSNP-containing isoforms, even in highly conserved gene families.
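A measured Δm with a stated uncertainty can be compared against a table of known shifts to see which assignments remain possible. The sketch below is an illustration only: the shift table is a small hypothetical subset, the function name is invented, and the comparison with trimethylation is added here as an example rather than taken from the study. The shift masses themselves are standard monoisotopic values.

```python
# Monoisotopic mass shifts in Da for a few common modifications.
KNOWN_SHIFTS = {
    "acetylation": 42.01057,
    "trimethylation": 42.04695,
    "phosphorylation": 79.96633,
}


def assign_shift(observed_da: float, uncertainty_da: float):
    """Return the names of known shifts lying within the uncertainty."""
    return sorted(
        name
        for name, shift in KNOWN_SHIFTS.items()
        if abs(shift - observed_da) <= uncertainty_da
    )


# Observed value and uncertainty as reported for the H2A.O species.
h2a_candidates = assign_shift(42.01, 0.02)
print(h2a_candidates)
```

Under the stated ±0.02 Da uncertainty, acetylation falls inside the interval while the nominally isobaric trimethylation (about 0.036 Da heavier) does not, which is consistent with the assignment made in the text.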

Figure 2.

Intact and MS/MS fragmentation spectra providing high retrieval specificity obtained for a modified, cSNP containing member of a highly conserved gene family. a) Broadband ESI FTMS spectrum (10 scans) of an ALS-PAGE/RPLC fraction from human HeLa cells. b) Auto-SWIFT isolation spectrum (10 scans) of the 18+ charge state at 779 m/z. c) Partial auto-ECD MS/MS spectrum (100 scans) of the species of Figure 2b. d) The graphical fragment map generated upon database retrieval from ECD and CAD fragmentation data illustrating the position of the cSNP within the histone H2A gene.

Supplementary Figure 1.

An illustration of the biological complexity of MS analysis of members of highly conserved gene families. a) Homology tree generated using ClustalW sequence alignments illustrating the genetic distance between 17 H2A gene family members and their variants (29 total protein forms), not including cSNP containing forms. b) ClustalW sequence alignment of H2A.O and the 5 most closely related family members from Supplementary Figure 1a illustrating 5 distinct residue changes.

Identification and semi-quantitative analysis of alternative splice variants

In a separate sample, Q-FTMS/MS analysis automatically identified an 11977.9 Da protein as prothymosin alpha (ProTα, Figure 3d). ProTα is encoded by six family members with high sequence homology (31, 32). The family member observed contains 4 introns and from EST data is known to be alternatively spliced due to a rare GAGGAG motif that creates adjacent AG acceptor sites at the intron2/exon3 boundary (Figure 3e) (33). In most tissues, ~10% of this mRNA contains an extra GAG codon (encoding an extra Glu), whereas in the remaining ~90% of ProTα transcripts the more 5′ acceptor site is used, producing a form with one less residue (33). Upon examination of the broadband spectrum, both species were observed in a ~10:1 ratio of light vs. heavy protein (Figure 3a). The minor species was subsequently fragmented (Figure 3c) and the extra Glu residue precisely localized (Figure 3d, right).
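Two quick checks connect the numbers in this paragraph: the mass difference between the two species (11977.9 Da here and 12106.9 Da in the Figure 3 caption) should equal one Glu residue, and a ~10:1 light:heavy ratio implies the fraction of the heavy form. The sketch below uses only those reported values plus the standard Glu residue mass; the function name is hypothetical.

```python
GLU_RESIDUE_MASS = 129.04259  # Da, monoisotopic residue mass of Glu

light_mass = 11977.9  # Da, shorter splice form (from the text)
heavy_mass = 12106.9  # Da, longer splice form (Figure 3 caption)

mass_difference = heavy_mass - light_mass


def minor_fraction(major_to_minor_ratio: float) -> float:
    """Fraction of the minor species given a major:minor abundance ratio."""
    return 1.0 / (major_to_minor_ratio + 1.0)


heavy_fraction = minor_fraction(10.0)

print(round(mass_difference, 1), round(heavy_fraction, 3))
```

Within the 0.1 Da precision of the reported masses, the difference matches a single Glu residue, and a 10:1 ratio corresponds to roughly 9% of the heavy form, in line with the ~10% figure cited for the transcript.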

Figure 3.

Characterization and semi-quantitation of alternative splice variants using Top Down MS/MS. a) Partial broadband MS spectrum for alternatively spliced species of 11977.9 Da and 12106.9 Da. b) and c) ECD and IRMPD MS/MS spectra of SWIFT isolated species from Figure 3a. d) Fragmentation details from MS/MS spectra of Figure 3b and 3c. e) Alternative splicing diagram for the ProTα gene illustrating the adjacent splice acceptors due to the GAGGAG motif. The tall blue and short red markers on the fragment maps indicate ions formed by IRMPD and ECD, respectively.

The GAGGAG motif was recognized as a possible acceptor site only by NetGene2 (www.cbs.dtu.dk/services/NetGene2), one of five intron/exon prediction programs tested. Using BLAST to search human EST libraries (www.ncbi.nlm.nih.gov/dbEST), more than 1300 dbEST entries were attributed to ProTα, with only ~150 matching the longer form, consistent with an earlier finding that the ~9:1 ratio of short:long is not tissue specific (33). Also using BLAST, the GAGGAG motif at this locus was found only in primates. Neither rat nor mouse has the extra splice acceptor site; both have evolved only the long form of the protein, which is actually the less favorable form in humans.

Identification of a novel phosphoprotein

As a last illustration of new advantages provided by the Top Down approach, the 10191.1 Da barrier-to-autointegration factor (BAF) protein was identified in a nuclear extract and exhibited a +79.95±0.05 Da satellite peak at ~10% occupancy consistent with phosphorylation (Figure 4a). The data from automated MS/MS localized the phosphorylation to the 11 N-terminal residues. Manual MS/MS using electrons further confirmed a Met off/acetylated N-terminus and narrowed the region of phosphorylation to Thr2 or Ser3 (Figure 4b). This well-studied protein directly binds to chromatin, is thought to be involved in attachment of chromatin to the inner nuclear membrane (34), and is not known to be modified. No other forms of this protein have been observed in adjacent fractions and the pI change caused by phosphorylation is small enough to allow coelution of both forms in identical fractions during chromatofocusing and RPLC. With the 2-dimensional fractionation behavior of this modified protein now known, detection of this protein from nuclear extracts was reproduced twice more. This now allows targeted studies on this protein from synchronized HeLa cells in a straightforward manner. Such a platform for biochemical interrogation of targeted proteins after RNAi, chemical perturbation, or cell synchronization will be highly valuable for capturing a more detailed picture of functional regulation mechanisms involving PTM dynamics.

Summary of Findings and Outlook

Using a dual ion fragmentation approach to automatically analyze 2–3 small human proteins per fraction by Top Down MS/MS, 45 proteins were identified with a median probability score of 10⁻¹³ (Table 1). A main advantage of the Top Down strategy is that information on the entire primary structure of the mature protein is obtained, allowing reliable dissection and abundance measurements of highly related gene products from genetic or transcriptional variation and enzymatic modification. Due to the complementary nature of ion fragmentation using electrons and collisions with gas, precise localization of PTMs, polymorphisms, and amino acids at splice junctions is indeed possible. For the identified proteins, 45% were found in forms not present in UniProt’s Human Proteome Initiative and ~40% contained SNPs for which only single alleles were observed. Over 85% of the identifications required no manual validation of the database retrieval result; Δm localization sometimes improved upon inspection of the raw data. Characterization of closely related protein forms (e.g., different PTM isomers or SNP forms) sometimes required manual scrutiny of the output from ProSight PTM, with the correct form yielding the highest score in the retrieval list in ~90% of cases.

Table 1.

Partial list of human proteins identified and characterized using Top Down MS.

# Protein Accession # Fragmentation Method Protein Size (kDa) mass (Da)a Pscore Function Gene Family cSNPb Alt. Splice PTM Notes
1 P39019 CAD 15.9 0.3 1.E-03 40S ribosomal protein S19 X Glu44Gly
2 P56378 CAD 6.7 0.12 4.E-07 6.8 kDa mitochondrial proteolipid PRO1574 X Ile9Val
3 P18124 CAD 29.2 0.20 1.E-02 60S Ribosomal protein L7 X Glu47Lys
4 P23411 ECD 8.1 0.03 2.E-05 60S Ribosomal Protein L38 X Lys52Glu; Phe34Leu
5 O75964 OCAD 11.3 0.20 1.E-10 ATP synthase g chain, mitochondrial EC subunit G X X Lys47Arg; Ala56Val
6 P56381 ECD 5.6 0.12 8.E-26 ATP synthase epsilon chain, mitochondrial X Glu30Lys
7 O75531 ECD 10.2 80.05 2.E-10 Barrier-to-autointegration factor X Phosphorylated near N-term, N-term Ac
8 P61769 CAD 11.8 0.0 3.E-10 Beta-2-microglobulin precursor HDCMA22P X X Signal Peptide, 20 N-term residues cleaved, Lys61Arg, Asp73Asn, Glu56Lys, Phe42Leu, Pro34Ser, Arg32Cys, Pro92His
9 P06703 ECD 10.2 0.2 3.E-14 Calcyclin X Asn69Ser, His27Arg, Gly90Asp, Ile83Thr
10 P31949 ECD, CAD 11.6 0.1 1.E-12 Calgizzarin S100C protein MLN 70 X Glu36Lys
11 P02593 ECD 16.8 0.1 3.E-24 Calmodulin X Trimethylated on Lys115
12 P15954 CAD 5.4 0.02 1.E-07 Cytochrome c oxidase polypeptide VIIc, mitochondrial precursor EC 1.9.3.1 X Loss of N-term 16 residues to mature form
13 P63241 CAD 17 0.1 6.E-17 Eukaryotic translation initiation factor X Hypusine residue at position 50, N-term Ac
14 P49773 ECD 13.7 0.0 1.E-30 Histidine triad nucleotide-binding protein 1 (Adenosine 5′ monophosphoramidase) (Protein kinase C inhibitor 1) X Gly105Arg; Glu100
15 P10412 ECD 21.7 0.2 2.E-17 Histone H1.4 member H1b X X N-Terminal Ac
16 P20670 ECD 14.0 0.0 2.E-18 Histone H2A member O X X X Acetylations, His124Tyr
17 P02278 IRMPD 13.8 c 7.E-14 Histone H2B X X Multiple family members possible
18 P16106 CAD 15.5 c 1.E-02 Histone H3 X X Multiple methylations/acetylations
19 P02304 IRMPD 11.3 0.1 9.E-18 Histone H4 X Multiple acetylated forms, dimethylation
20 P14174 ECD, CAD 12.3 0.0 1.E-04 Macrophage migration inhibitory factor phenylpyruvate tautomerase glycosylation inhibitory factor X X Disulfide, Asn106Ser; Ile68Thr; Pro44Leu
21 O95167 CAD 9.2 0.14 7.E-07 NADH-ubiquinone oxidoreductase B9 subunit X X Consistent w/N-term Ac at A[2], Leu81Val
22 O00483 CAD 9.4 1.00 1.E-22 NADH-ubiquinone oxidoreductase MLRQ Subunit EC X Asp80Gly
23 P26447 ECD 11.9 0.1 2.E-28 Placental calcium-binding protein Calvasculin S100 calcium binding protein A4 MTS1 X Glu9Val
24 P06454 ECD 12.0 0.0 9.E-15 Prothymosin alpha X X X Alternatively spliced, N-term Ac
25 P06454 IRMPD 12.1 0.0 1.E-09 Prothymosin alpha X X X Alternatively spliced, N-term Ac
26 Q99584 IRMPD 11.4 0.1 7.E-06 S100 calcium-binding protein A13 X X Consistent w/N-term Ac at A[2], Glu14Asp
27 P25815 ECD 10.5 0.1 2.E-14 S-100P protein X Glu32Asp
28 P08578 ECD 10.7 0.5 5.E-07 Small nuclear ribonucleoprotein E snRNP-E Sm protein E Sm-E SmE X Gly5Cys
29 Q9UEA3 OCAD 6.5 0.1 2.E-21 Ubiquinol-cytochrome C reductase complex 6.4 kDa protein X Leu24Trp
30 P02248 ECD 8.4 0.00 Ubiquitin X Loss of C-term GG to mature form
31 Q15843 IRMPD 8.6 0.10 5.E-15 Ubiquitin-like protein NEDD8 X Loss of C-term GGLRQ to mature form
a Mass errors are from comparison of intact mass values; fragment ion mass errors are typically 1–20 ppm.

b All cSNPs were verified using dbSNP; the major allele was observed for cases where allelic frequencies were provided.

c Multiple isobaric forms observed with identical Pscore values.

The ability to automatically genotype cSNPs and characterize PTMs with gene-specific identifications is enabled by the new informatic strategy of “Shotgun Annotation”(15), the combinatorial consideration of diverse sources of Δm’s. This strategy represents a major shift in curation philosophy for protein databases (35), is well-suited for Top Down using FTMS, and recognizes that detailed information on SNPs, mutations (36), splice variants (37), and PTMs (38) will be increasingly known and even somewhat predictable (36). By embedding such variability tightly within a MS retrieval engine, the current study drastically improves identification metrics, enables known biological events to be characterized as they occur in combination, and allows unknown biology to be uncovered more efficiently. Shotgun Annotation actually increases the quality of most retrievals by allowing more absolute mass values of fragment ions observed in a Top Down MS/MS experiment to match those values generated from protein forms housed in a database. The examples highlighted here illustrate an overall process that can simply be called “proteotyping”. The term proteotyping is akin to genotyping at the DNA level but captures all the variability of proteins as they occur in populations and change over time. Fragmentation of intact proteins represents an emergent method for “reverse annotation” of the human genome and Top Down can now be embraced by organizations such as the Human Proteome Organization.

Acknowledgments

The authors thank Rich LeDuc for technical assistance and Hugh Robertson for valuable discussions. This work was supported by the National Science Foundation Career Award (CH 0134953), the National Institutes of Health (GM 067193), the Sloan Foundation, the Research Corporation (Cottrell Scholars Program) and the Henry and Lucille Packard Foundation. We also are grateful to John Hobbs and Jeff Chapman of Beckman Coulter and Tim Barder of Eprogen for assistance with the PF 2D system.

References

  • 1.Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422:198–207. doi: 10.1038/nature01511. [DOI] [PubMed] [Google Scholar]
  • 2.Yates JR. Mass spectral analysis in proteomics. Annu Rev Biophys Biomol Struct. 2004;33:297–316. doi: 10.1146/annurev.biophys.33.111502.082538. [DOI] [PubMed] [Google Scholar]
  • 3.Sam-Yellowe TY, Florens L, Johnson JR, Wang T, Drazba JA, Le Roch KG, Zhou Y, Batalov S, Carucci DJ, Winzeler EA, Yates JR., 3rd A Plasmodium gene family encoding Maurer’s cleft membrane proteins: structural properties and expression profiling. Genome Res. 2004;14:1052–1059. doi: 10.1101/gr.2126104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Rappsilber J, Ryder U, Lamond AI, Mann M. Large-scale proteomic analysis of the human spliceosome. Genome Res. 2002;12:1231–1245. doi: 10.1101/gr.473902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Hines WM, Parker K, Peltier J, Patterson DH, Vestal ML, Martin SA. Protein identification and protein characterization by high-performance time-of-flight mass spectrometry. J Protein Chem. 1998;17:525–526. [PubMed] [Google Scholar]
  • 6.Haynes PA, Yates JR., 3rd Proteome profiling-pitfalls and progress. Yeast. 2000;17:81–87. doi: 10.1002/1097-0061(20000630)17:2<81::AID-YEA22>3.0.CO;2-Z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Gygi SP, Rist B, Griffin TJ, Eng J, Aebersold R. Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. J Proteome Res. 2002;1:47–54. doi: 10.1021/pr015509n. [DOI] [PubMed] [Google Scholar]
  • 8.Thiede B, Lamer S, Mattow J, Siejak F, Dimmler C, Rudel T, Jungblut PR. Analysis of missed cleavage sites, tryptophan oxidation and N-terminal pyroglutamylation after in-gel tryptic digestion. Rapid Commun Mass Spectrom. 2000;14:496–502. doi: 10.1002/(SICI)1097-0231(20000331)14:6<496::AID-RCM899>3.0.CO;2-1. [DOI] [PubMed] [Google Scholar]
  • 9.Konig S, Zeller M, Peter-Katalinic J, Roth J, Sorg C, Vogl T. Use of nonspecific cleavage products for protein sequence analysis as shown on calcyclin isolated from human granulocytes. J Am Soc Mass Spectrom. 2001;12:1180–1185. doi: 10.1016/S1044-0305(01)00300-2. [DOI] [PubMed] [Google Scholar]
  • 10.Field HI, Fenyo D, Beavis RC. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics. 2002;2:36–47. [PubMed] [Google Scholar]
  • 11.MacCoss MJ, McDonald WH, Saraf A, Sadygov R, Clark JM, Tasto JJ, Gould KL, Wolters D, Washburn M, Weiss A, Clark JI, Yates JR., 3rd Shotgun identification of protein modifications from protein complexes and lens tissue. Proc Natl Acad Sci U S A. 2002;99:7900–7905. doi: 10.1073/pnas.122231399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Wu CC, MacCoss MJ, Mardones G, Finnigan C, Mogelsvang S, Yates JR, 3rd, Howell KE. Organellar Proteomics Reveals Golgi Arginine Dimethylation. Mol Biol Cell. 2004 doi: 10.1091/mbc.E04-02-0101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Zhu K, Kim J, Yoo C, Miller FR, Lubman DM. High sequence coverage of proteins isolated from liquid separations of breast cancer cells using capillary electrophoresis-time-of-flight MS and MALDI-TOF MS mapping. Anal Chem. 2003;75:6209–6217. doi: 10.1021/ac0346454. [DOI] [PubMed] [Google Scholar]
  • 14.Gatlin CL, Eng JK, Cross ST, Detter JC, Yates JR., 3rd Automated identification of amino acid sequence variations in proteins by HPLC/microspray tandem mass spectrometry. Anal Chem. 2000;72:757–763. doi: 10.1021/ac991025n. [DOI] [PubMed] [Google Scholar]
  • 15.Pesavento JJ, Kim YB, Taylor GK, Kelleher NL. Shotgun annotation of histone modifications: a new approach for streamlined characterization of proteins by top down mass spectrometry. J Am Chem Soc. 2004;126:3386–3387. doi: 10.1021/ja039748i.
  • 16.Forbes AJ, Patrie SM, Taylor GK, Kim YB, Jiang L, Kelleher NL. Targeted analysis and discovery of posttranslational modifications in proteins from methanogenic archaea by top-down MS. Proc Natl Acad Sci U S A. 2004;101:2678–2683. doi: 10.1073/pnas.0306575101.
  • 17.Meng F, Cargile BJ, Miller LM, Forbes AJ, Johnson JR, Kelleher NL. Informatics and multiplexing of intact protein identification in bacteria and the archaea. Nat Biotechnol. 2001;19:952–957. doi: 10.1038/nbt1001-952.
  • 18.Amunugama R, Hogan JM, Newton KA, McLuckey SA. Whole protein dissociation in a quadrupole ion trap: identification of an a priori unknown modified protein. Anal Chem. 2004;76:720–727. doi: 10.1021/ac034900k.
  • 19.Meng F, Du Y, Miller LM, Patrie SM, Robinson DE, Kelleher NL. Molecular-Level Description of Proteins from Saccharomyces cerevisiae Using Quadrupole FT Hybrid Mass Spectrometry for Top Down Proteomics. Anal Chem. 2004;76:2852–2858. doi: 10.1021/ac0354903.
  • 20.Cargile BJ, McLuckey SA, Stephenson JL Jr. Identification of bacteriophage MS2 coat protein from E. coli lysates via ion trap collisional activation of intact protein ions. Anal Chem. 2001;73:1277–1285. doi: 10.1021/ac000725l.
  • 21.Meng F, Cargile BJ, Patrie SM, Johnson JR, McLoughlin SM, Kelleher NL. Processing complex mixtures of intact proteins for direct analysis by mass spectrometry. Anal Chem. 2002;74:2923–2929. doi: 10.1021/ac020049i.
  • 22.Forbes AJ, Mazur MT, Patel HM, Walsh CT, Kelleher NL. Toward efficient analysis of >70 kDa proteins with 100% sequence coverage. Proteomics. 2001;1:927–933. doi: 10.1002/1615-9861(200108)1:8<927::AID-PROT927>3.0.CO;2-T.
  • 23.Zabrouskov V, Giacomelli L, Van Wijk KJ, McLafferty FW. A New Approach for Plant Proteomics: Characterization of Chloroplast Proteins of Arabidopsis thaliana by Top-down Mass Spectrometry. Mol Cell Proteomics. 2003;2:1253–1260. doi: 10.1074/mcp.M300069-MCP200.
  • 24.Allis CD, Glover CV, Gorovsky MA. Micronuclei of Tetrahymena contain two types of histone H3. Proc Natl Acad Sci U S A. 1979;76:4857–4861. doi: 10.1073/pnas.76.10.4857.
  • 25.Senko MW, Hendrickson CL, Pasa-Tolic L, Marto JA, White FM, Guan S, Marshall AG. Electrospray ionization Fourier transform ion cyclotron resonance at 9.4 T. Rapid Commun Mass Spectrom. 1996;10:1824–1828. doi: 10.1002/(SICI)1097-0231(199611)10:14<1824::AID-RCM695>3.0.CO;2-E.
  • 26.Horn DM, Zubarev RA, McLafferty FW. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom. 2000;11:320–332. doi: 10.1016/s1044-0305(99)00157-9.
  • 27.Taylor GK, Kim YB, Forbes AJ, Meng F, McCarthy R, Kelleher NL. Web and database software for identification of intact proteins using “top down” mass spectrometry. Anal Chem. 2003;75:4081–4086. doi: 10.1021/ac0341721.
  • 28.O’Donovan C, Apweiler R, Bairoch A. The human proteomics initiative (HPI). Trends Biotechnol. 2001;19:178–181. doi: 10.1016/s0167-7799(01)01598-0.
  • 29.LeDuc RD, Taylor GK, Kim YB, Januszyk TE, Bynum LH, Sola JV, Garavelli JS, Kelleher NL. ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res. 2004;32:W340–345. doi: 10.1093/nar/gkh447.
  • 30.Telmer CA, Retchless AC, Kinsey AD, Conley Y, Rigatti B, Gorin MB, Jarvik JW. Detection and assignment of mutations and minihaplotypes in human DNA using peptide mass signature genotyping (PMSG): application to the human RDS/peripherin gene. Genome Res. 2003;13:1944–1951. doi: 10.1101/gr.995103.
  • 31.Pineiro A, Cordero OJ, Nogueira M. Fifteen years of prothymosin alpha: contradictory past and new horizons. Peptides. 2000;21:1433–1446. doi: 10.1016/s0196-9781(00)00288-6.
  • 32.Eschenfeldt WH, Manrow RE, Krug MS, Berger SL. Isolation and partial sequencing of the human prothymosin alpha gene family. Evidence against export of the gene products. J Biol Chem. 1989;264:7546–7555.
  • 33.Manrow RE, Berger SL. GAG triplets as splice acceptors of last resort. An unusual form of alternative splicing in prothymosin alpha pre-mRNA. J Mol Biol. 1993;234:281–288. doi: 10.1006/jmbi.1993.1583.
  • 34.Segura-Totten M, Kowalski A, Craigie R, Wilson K. Barrier-to-autointegration factor: major roles in chromatin decondensation and nuclear assembly. J Cell Biol. 2002;158:475–485. doi: 10.1083/jcb.200202019.
  • 35.Mann M, Jensen ON. Proteomic analysis of post-translational modifications. Nat Biotechnol. 2003;21:255–261. doi: 10.1038/nbt0303-255.
  • 36.Horvath MM, Fondon JW 3rd, Garner HR. Low hanging fruit: a subset of human cSNPs is both highly non-uniform and predictable. Gene. 2003;312:197–206. doi: 10.1016/s0378-1119(03)00628-0.
  • 37.Lee C, Atanelov L, Modrek B, Xing Y. ASAP: the Alternative Splicing Annotation Project. Nucleic Acids Res. 2003;31:101–105. doi: 10.1093/nar/gkg029.
  • 38.Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370. doi: 10.1093/nar/gkg095.
