Abstract
The reliability of the transmembrane (TM) sequence assignments for membrane proteins (MPs) in standard sequence databases is uncertain because the vast majority are based on hydropathy plots. A database of MPs with dependable assignments is necessary for developing new computational tools for the prediction of MP structure. We have therefore created MPtopo, a database of MPs whose topologies have been verified experimentally by means of crystallography, gene fusion, and other methods. Tests using MPtopo strongly validated four existing MP topology-prediction algorithms. MPtopo is freely available over the internet and can be queried by means of an SQL-based search engine.
Keywords: Protein structure prediction, prediction accuracy, hydropathy plots
The number of protein sequences in the Protein Information Resource (PIR; Barker et al. 2000) and SWISS-PROT (Bairoch and Boeckmann 1991) databases has exploded as a result of genome sequencing efforts. The PIR database presently contains over 142,000 nonredundant entries, while SWISS-PROT contains over 80,000. A simple search of these databases returns a large number of entries classified as membrane proteins (MPs): 12,000 in the PIR and 9000 in SWISS-PROT. These MP entries provide assignments for transmembrane (TM) segments, but their reliability is uncertain. In a recent survey of SWISS-PROT, Senes et al. (2000) determined that almost 94% of the TM segments were annotated as potential, possible, or probable, indicating that these segments were identified through the use of prediction algorithms, primarily hydropathy plots. Therefore, the majority of TM segment assignments within these public databases must be used with caution. Several collections of MPs have been compiled directly from SWISS-PROT, using either SWISS-PROT annotations or criteria that are sometimes ambiguous (Hofmann and Stoffel 1993; Jones et al. 1994; Cserzö et al. 1997; Gromiha 1999). To avoid propagation of errors that may have been present in the original predictions underlying the database annotations, a curated database of membrane proteins is needed that contains only proteins for which direct experimental evidence of TM segment assignments exists. We have, therefore, created MPtopo, a modest but growing database of MP TM sequences whose topologies have been verified experimentally by means of crystallography, gene fusion, and other methods. The purpose of this note is to introduce the database and to report the results of using it to evaluate and compare four existing MP topology-prediction algorithms (Claros and von Heijne 1994; Milpetz et al. 1995; Rost et al. 1995; Tusnády and Simon 1998).
The assembly of a dependable database of MP topology from literature reports was less straightforward than expected. Even in the case of membrane proteins whose three-dimensional (3D) structures have been determined, TM segment assignments are often not identified in the original publications and often are not readily determinable from the Protein Data Bank coordinate files. Beyond the MPs of known structure, we sought papers in the MP literature that contained keywords, such as gene fusion, suggestive of direct experimental studies of topology. Reported MP topologies were included in the database only after careful evaluation of published experimental results. For example, in the case of gene fusion data (Boyd 1994), the density of fusions had to be sufficient to inspire confidence that the topology had been explored thoroughly, as in the case of lac permease (Calamia and Manoil 1990). MPtopo has now grown to 90 proteins or subunits contributing 534 TM segments, a size we believe sufficient for evaluating existing prediction algorithms and creating new ones.
Protein selection criteria and database characteristics
The TM segment assignments of membrane proteins or subunits of known 3D structure, labeled alphabetically in the data records, were obtained from the published reports or by examination of the PDB coordinate files. In a few cases, secondary structure determinations obtained using the Kabsch and Sander (1983) DSSP program were used to establish segment assignments.
Some surface-bound, monotopic membrane proteins without TM segments, such as prostaglandin synthase (Picot et al. 1994), were also included in MPtopo to aid the development of algorithms for distinguishing monotopic from TM proteins. These proteins are identified by an asterisk following the protein name in the data record, appropriate comments in the remarks field, and an asterisk on the TM segment alphabetic assignment, indicating surface-lying helices. The recently reported water and glycerol channel proteins (Fu et al. 2000; Murata et al. 2000) are also marked with asterisks because they have TM segments composed of two end-to-end helices that are distant in the sequence. In addition to comments in the remarks field, each partial helix is recorded as a TM segment with an asterisk on the alphabetic identifier. To identify the segment pairs constituting the full TM segment, the first partial segment in the sequence is identified, for example, as C*, and the second one as *C.
In the absence of 3D structures, TM sequence assignments were obtained from published reports of topology that included experimental confirmation using techniques such as gene fusion (Manoil and Beckwith 1986), Asn-linked glycosylation (Pan et al. 1999), or amino acid deletions (Wolin and Kaback 1999). In some cases, such as rhodopsin (whose 3D structure was recently reported [Palczewski et al. 2000]), an overwhelming amount of data of all sorts from a large number of publications provided strong, coherent evidence for TM segment assignments. Even when noncrystallographic experimental data are used to validate topology, however, most specific TM sequence assignments in published reports originate from hydropathy plots. In most cases, authors provided topology diagrams that assigned the TM segments believed to be located within the membrane bilayer. Such assignments were used for specifying TM segments in MPtopo. In a few cases, topologies of proteins with long interhelix connecting loops were specified without specific assignment of the membrane-buried segments. Under those circumstances, we identified likely membrane-buried segments by seeking long runs of hydrophobic residues bounded by charged residues. Our assignments included only the intervening uncharged residues.
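The fallback rule described above (long hydrophobic runs bounded by charged residues, keeping only the intervening uncharged residues) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used to build MPtopo: the function name, the minimum run length of 15, the hydrophobic-fraction cutoff of 0.5, the hydrophobic residue set, and the treatment of His as uncharged are all hypothetical choices that the text does not specify.

```python
import re

# Assumption: Asp, Glu, Lys, and Arg count as charged; His is treated as uncharged.
UNCHARGED_RUN = re.compile(r"(?<=[DEKR])[^DEKR]+(?=[DEKR])")

# Assumption: a simple hydrophobic residue set, chosen only for illustration.
HYDROPHOBIC = set("AILMFVWC")


def candidate_buried_segments(sequence, min_length=15, min_hydrophobic_fraction=0.5):
    """Return (start, end, segment) tuples, 1-based and inclusive, for runs of
    uncharged residues that are flanked on both sides by charged residues."""
    hits = []
    for match in UNCHARGED_RUN.finditer(sequence.upper()):
        segment = match.group()
        if len(segment) < min_length:
            continue  # too short to be considered a long run
        fraction = sum(aa in HYDROPHOBIC for aa in segment) / len(segment)
        if fraction >= min_hydrophobic_fraction:
            # match.start() is 0-based; match.end() is exclusive, so it already
            # equals the 1-based inclusive end position.
            hits.append((match.start() + 1, match.end(), segment))
    return hits
```

Because the flanking charged residues are matched with lookarounds, they are excluded from the reported segment, which mirrors the statement that the assignments include only the intervening uncharged residues. Runs at the sequence termini that lack a charged residue on one side are not reported by this sketch.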
Hydropathy plots vary in important details even among closely related proteins (White and Jacobs 1990); seemingly subtle differences in sequences can have large effects on decision thresholds (Edelman and White 1989; Edelman 1993). Because of our interest in prediction tools based strictly on physicochemical criteria (White and Wimley 1999), we did not reject any protein because of high homology or sequence identity with a protein already in MPtopo.
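To make the sensitivity to decision thresholds concrete, the following sketch computes a sliding-window hydropathy profile and lists the windows that meet a cutoff. It is a generic illustration, not the method used by any of the cited tools: the Kyte-Doolittle scale, the 19-residue window, the 1.6 cutoff, and the function names are assumptions introduced here.

```python
# Kyte-Doolittle hydropathy values (assumption: the text does not name a scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence.

    Element i of the result covers residues i+1 .. i+window (1-based).
    Non-standard residue codes raise KeyError rather than being guessed.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window > len(values):
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one residue
        profile.append(total / window)
    return profile


def candidate_windows(profile, threshold=1.6):
    """1-based start positions of windows whose mean meets the threshold."""
    return [i + 1 for i, value in enumerate(profile) if value >= threshold]
```

Since each window mean is compared with a fixed cutoff, a single substitution that replaces a hydrophobic residue with a charged one lowers the mean of every window containing it, and can move a borderline window from one side of the threshold to the other.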
The data fields and structure of each MPtopo entry are summarized in Figure 1B. We have divided the entries into three subsets: 3D_helix, 1D_helix, and 3D_other.
Fig. 1. Web tools for using MPtopo. (A) MPtopo Querier, a Java applet designed to search the MPtopo database using an SQL-based server. With Querier, MPtopo may be searched by protein name, authors, number of transmembrane (TM) segments, Protein Information Resource (PIR) identifier, Protein Data Bank (PDB) identifier, or any combination of these fields. The search can be performed on the whole MPtopo database or limited to one of the subsets. MPtopo Querier is available for use over the World Wide Web from our Web site at http://blanco.biomol.uci.edu/mptopo. (B) Search results from Querier are displayed within the results window. Each returned result is displayed as the complete database entry. Each entry contains 15 fields including the complete protein sequence, the number of TM segments, TM segment start and end positions, PIR and PDB identifiers (when available), a complete reference citation, and the topology (Nterm = in or out). Selected entries may be sent to MPEx, a hydropathy plot TM segment prediction tool developed in our laboratory (S. Jayasinghe, K. Hristova, and S.H. White, in prep.). The complete database is also available for anonymous FTP download as a plain text file at blanco.biomol.uci.edu/mptopo.
The first two contain helix-bundle proteins segregated according to the existence or absence, respectively, of 3D structures. 3D_other includes β-barrel and monotopic MPs whose structures have been determined crystallographically. The general characteristics of the database are summarized in Table 1. The lengths of TM segments show a wide distribution. Within 3D_helix, the average TM helix length is 28 residues, ranging from 17 to 43 residues. These values are quite similar to those observed by Bowie (1997) for 45 TM helices from three helix-bundle MP structures. The length distribution for 1D_helix is slightly broader, 9 to 46 residues, with an average length of 22 residues. This shorter average undoubtedly reflects the influence of hydropathy plots performed with window lengths of 19 or 21 residues.
Table 1.
General characteristics of the MPtopo database
| MPtopo subset | 3D_helix | 1D_helix | 3D_other |
|---|---|---|---|
| No. of proteins^a | 41 | 38 | 11 |
| No. of total residues | 8960 | 15018 | 4171 |
| Average sequence length^b | 218 | 395 | 379 |
| No. of residues in TM segments | 4186 | 5426 | 1671 |
| No. of total TM segments | 150 | 242 | 142 |
| Average TM segment length^b | 28 ± 5 | 22 ± 4 | 12 ± 3 |
| TM segment length range^b | 17–43 | 9–46 | 4–20 |
a Includes protein subunits.
b Given as the number of residues.
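Summary values of the kind listed in Table 1 follow directly from the TM segment start and end positions stored in each entry. The sketch below shows one way to compute them; the function name and the use of the sample standard deviation are assumptions, since the text does not state how the ± values were obtained.

```python
from statistics import mean, stdev


def segment_length_summary(segments):
    """Summarize TM segment lengths from (start, end) pairs.

    Positions are taken to be 1-based and inclusive, so a segment spanning
    residues 10 to 30 has length 21.
    """
    lengths = [end - start + 1 for start, end in segments]
    if not lengths:
        raise ValueError("at least one segment is required")
    return {
        "n_segments": len(lengths),
        "n_residues": sum(lengths),
        "mean": mean(lengths),
        # Assumption: sample standard deviation; undefined for a single segment.
        "sd": stdev(lengths) if len(lengths) > 1 else 0.0,
        "min": min(lengths),
        "max": max(lengths),
    }
```

Dividing the number of residues in TM segments by the number of TM segments in Table 1 gives the same averages, for example 4186/150 ≈ 28 for 3D_helix.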
Accuracies of prediction algorithms
We used the 3D_helix and 1D_helix subsets of MPtopo to determine the TM-segment prediction accuracy of four algorithms designed for predicting TM helices: HMM (Tusnády and Simon 1998), TopPred II (von Heijne 1992), TMAP (Persson and Argos 1994; Milpetz et al. 1995), and PHDhtm (Rost et al. 1995, 1996). Prediction accuracy Q was computed using the per-segment method of Tusnády and Simon (1998). The results are summarized in Table 2. All four algorithms yield impressive per-segment prediction accuracies, the highest reaching 97% for the 3D_helix set. Interestingly, the prediction accuracies for the 1D_helix set are systematically lower than for 3D_helix. As shown in Table 2, the reduced accuracies are mainly caused by false-positive TM segment predictions. The causes of this result are uncertain. Two simple possibilities include algorithmic bias toward MPs of known structure and imperfections in the experimental methods for validating topology. A third possibility is the existence of exceptionally hydrophobic extramembrane domains in the 1D_helix set. We tested this possibility using automated hydropathy analysis of a collection of ∼1000 soluble proteins. One or two potential TM segments were found for ∼10% of the proteins.
Table 2.
Prediction accuracy of various algorithms using MPtopo
| MPtopo subset | Algorithm | Npredicted^a | Ncorrect^a | Q (%)^b |
|---|---|---|---|---|
| 3D_helix (Nknown^a = 150) | PHDhtm^c | 152 | 146 | 97 |
| | HMM^d | 154 | 145 | 95 |
| | TopPred II^e | 162 | 148 | 95 |
| | TMAP^f | 139 | 136 | 94 |
| 1D_helix (Nknown^a = 242) | PHDhtm | 250 | 228 | 93 |
| | HMM | 264 | 240 | 95 |
| | TopPred II | 259 | 224 | 89 |
| | TMAP | 241 | 221 | 92 |
a Number of transmembrane helices. Nknown, Npredicted, and Ncorrect are, respectively, the number of experimentally known helices, the total number of predicted helices, and the number of helices predicted correctly. A predicted helix was counted as correct if it exhibited at least a 50% overlap with a known transmembrane helix.
b Prediction accuracy Q was determined as described in Tusnády and Simon (1998): $Q = 100\sqrt{(N_{\text{correct}}/N_{\text{known}})\,(N_{\text{correct}}/N_{\text{predicted}})}$.
c From the PredictProtein automatic prediction server (Rost et al. 1996) using the default settings.
d Hidden Markov Model (Tusnády and Simon 1998) (HMM) used with single sequence information from MPtopo.
e TopPred II (von Heijne 1992) used with default settings: window size top = 11, window size bottom = 21, upper cut-off = 1.0, lower cut-off = 0.6.
f TMAP (Persson and Argos, 1994; Milpetz et al. 1995) was used with single sequence information from MPtopo.
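The per-segment bookkeeping behind Table 2 can be sketched as follows. Two points are assumptions rather than statements from the text: the 50% overlap is measured here relative to the length of the known helix, and Q is taken to be 100 times the geometric mean of Ncorrect/Nknown and Ncorrect/Npredicted, which is our reading of the per-segment measure of Tusnády and Simon (1998). The function names are hypothetical.

```python
from math import sqrt


def overlap(a, b):
    """Number of residues shared by two (start, end) segments, inclusive."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_correct(known, predicted, min_fraction=0.5):
    """Count predicted helices that overlap a known helix sufficiently.

    Assumption: the overlap fraction is measured against the known helix,
    and each known helix can be matched by at most one prediction.
    """
    matched = set()
    n_correct = 0
    for pred in predicted:
        for i, ref in enumerate(known):
            if i in matched:
                continue
            ref_length = ref[1] - ref[0] + 1
            if overlap(pred, ref) >= min_fraction * ref_length:
                matched.add(i)
                n_correct += 1
                break
    return n_correct


def per_segment_q(n_known, n_predicted, n_correct):
    """Per-segment accuracy in percent (assumed geometric-mean form)."""
    if n_known == 0 or n_predicted == 0:
        return 0.0
    return 100.0 * sqrt((n_correct / n_known) * (n_correct / n_predicted))
```

With the PHDhtm counts for 3D_helix in Table 2 (Nknown = 150, Npredicted = 152, Ncorrect = 146), `per_segment_q(150, 152, 146)` rounds to 97, in agreement with the tabulated value. Because the measure penalizes both missed helices and false positives, overprediction lowers Q even when nearly every known helix is found.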
Database accessibility and availability
MPtopo is available at http://blanco.biomol.uci.edu/mptopo. It can be downloaded as a composite text file or searched using a Java applet (MPtopo Querier) connected to an SQL-based server (Fig. 1A). Search results are returned as complete database entries, displayed in a separate results window (Fig. 1B). We would be pleased to receive suggestions for other membrane proteins to include in MPtopo.
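For readers who download the text file and wish to reproduce this kind of combined-field search locally, the sketch below builds a parameterized SQL query with Python's built-in sqlite3 module. Everything about the schema is hypothetical: the table name `mptopo`, the column names, and the demonstration rows are invented for illustration and do not describe the actual server or its data.

```python
import sqlite3

# Hypothetical column names modeled on the searchable fields listed for Querier.
TEXT_FIELDS = ("protein_name", "authors")
EXACT_FIELDS = ("n_tm_segments", "pir_id", "pdb_id", "subset")


def build_query(**criteria):
    """Combine any subset of the searchable fields into one parameterized query."""
    clauses, params = [], []
    for field, value in criteria.items():
        if field in TEXT_FIELDS:
            clauses.append(f"{field} LIKE ?")  # substring match on text fields
            params.append(f"%{value}%")
        elif field in EXACT_FIELDS:
            clauses.append(f"{field} = ?")
            params.append(value)
        else:
            raise ValueError(f"unknown search field: {field}")
    sql = "SELECT * FROM mptopo"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def demo_connection():
    """In-memory database holding two made-up rows, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mptopo (protein_name TEXT, authors TEXT, "
        "n_tm_segments INTEGER, pir_id TEXT, pdb_id TEXT, subset TEXT)"
    )
    conn.executemany(
        "INSERT INTO mptopo VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("example receptor", "Author A", 7, None, "XXXX", "3D_helix"),
            ("example permease", "Author B", 12, None, None, "1D_helix"),
        ],
    )
    return conn
```

Calling `build_query(subset="3D_helix", n_tm_segments=7)` and passing the result to `conn.execute(sql, params)` restricts the search to one subset and one segment count. Field names are checked against a fixed list and values are bound as parameters, so user input is never interpolated into the SQL text.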
Acknowledgments
We are pleased to acknowledge Michael Myers' assistance in maintaining the MPtopo database and in editing this manuscript. This work is supported by the National Institute of General Medical Sciences (GM-46823).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at www.proteinscience.org/cgi/doi/10.1110/ps.43501
References
- Bairoch, A. and Boeckmann, B. 1991. The SWISS-PROT protein sequence data bank. Nucleic Acids Res. 19: 2247–2248.
- Barker, W.C., Garavelli, J.S., Huang, H.Z., McGarvey, P.B., Orcutt, B.C., Srinivasarao, G.Y., Xiao, C.L., Yeh, L.-S.L., Ledley, R.S., Janda, J.F., et al. 2000. The Protein Information Resource (PIR). Nucleic Acids Res. 28: 41–44.
- Bowie, J.U. 1997. Helix packing in membrane proteins. J. Mol. Biol. 272: 780–789.
- Boyd, D. 1994. Use of gene fusions to determine membrane protein topology. In Membrane protein structure: Experimental approaches (ed. S.H. White), pp. 144–163. Oxford University Press, New York.
- Calamia, J. and Manoil, C. 1990. lac permease of Escherichia coli: Topology and sequence elements promoting membrane insertion. Proc. Natl. Acad. Sci. 87: 4937–4941.
- Claros, M.G. and von Heijne, G. 1994. TopPred II: An improved software for membrane protein structure predictions. CABIOS 10: 685–686.
- Cserzö, M., Wallin, E., Simon, I., von Heijne, G., and Elofsson, A. 1997. Prediction of transmembrane α-helices in prokaryotic membrane proteins: The dense alignment surface method. Protein Eng. 10: 673–676.
- Edelman, J. 1993. Quadratic minimization of predictors for protein secondary structure: Application to transmembrane α helices. J. Mol. Biol. 232: 165–191.
- Edelman, J. and White, S.H. 1989. Linear optimization of predictors for secondary structure: Application to transbilayer segments of membrane proteins. J. Mol. Biol. 210: 195–209.
- Fu, D., Libson, A., Miercke, L.J.W., Weitzman, C., Nollert, P., Krucinski, J., and Stroud, R.M. 2000. Structure of a glycerol-conducting channel and the basis for its selectivity. Science 290: 481–486.
- Gromiha, M.M. 1999. A simple method for predicting transmembrane α helices with better accuracy. Protein Eng. 12: 557–561.
- Hofmann, K. and Stoffel, W. 1993. A database of membrane spanning protein segments. Biol. Chem. Hoppe-Seyler 374: 166.
- Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038–3049.
- Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
- Manoil, C. and Beckwith, J. 1986. A genetic approach to analyzing membrane protein topology. Science 233: 1403–1408.
- Milpetz, F., Argos, P., and Persson, B. 1995. TMAP: A new email and WWW service for membrane-protein structural predictions. Trends Biochem. Sci. 20: 204–205.
- Murata, K., Mitsuoka, K., Hirai, T., Walz, T., Agre, P., Heymann, J.B., Engel, A., and Fujiyoshi, Y. 2000. Structural determinants of water permeation through aquaporin-1. Nature (Lond.) 407: 599–605.
- Palczewski, K., Kumasaka, T., Hori, T., Behnke, C.A., Motoshima, H., Fox, B.A., le Trong, I., Teller, D.C., Okada, T., Stenkamp, R.E., et al. 2000. Crystal structure of rhodopsin: A G protein-coupled receptor. Science 289: 739–745.
- Pan, C.-J., Lin, B.C., and Chou, J.Y. 1999. Transmembrane topology of human glucose 6-phosphate transporter. J. Biol. Chem. 274: 13865–13869.
- Persson, B. and Argos, P. 1994. Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol. 237: 182–192.
- Picot, D., Loll, P.J., and Garavito, R.M. 1994. The x-ray crystal structure of the membrane protein prostaglandin H2 synthase-1. Nature (Lond.) 367: 243–249.
- Rost, B., Casadio, R., Fariselli, P., and Sander, C. 1995. Transmembrane helices predicted at 95% accuracy. Protein Sci. 4: 521–533.
- Rost, B., Fariselli, P., and Casadio, R. 1996. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci. 5: 1704–1718.
- Senes, A., Gerstein, M., and Engelman, D.M. 2000. Statistical analysis of amino acid patterns in transmembrane helices: The GxxxG motif occurs frequently and in association with β-branched residues at neighboring positions. J. Mol. Biol. 296: 921–936.
- Tusnády, G.E. and Simon, I. 1998. Principles governing amino acid composition of integral membrane proteins: Application to topology prediction. J. Mol. Biol. 283: 489–506.
- von Heijne, G. 1992. Membrane protein structure prediction: Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225: 487–494.
- White, S.H. and Jacobs, R.E. 1990. Observations concerning topology and locations of helix ends of membrane proteins of known structure. J. Membr. Biol. 115: 145–158.
- White, S.H. and Wimley, W.C. 1999. Membrane protein folding and stability: Physical principles. Annu. Rev. Biophys. Biomol. Struct. 28: 319–365.
- Wolin, C.D. and Kaback, H.R. 1999. Estimating loop-helix interfaces in a polytopic membrane protein by deletion analysis. Biochemistry 38: 8590–8597.

