Nucleic Acids Research. 2006 Nov 11;35(Database issue):D241–D246. doi: 10.1093/nar/gkl850

EVEREST: a collection of evolutionary conserved protein domains

Elon Portugaly 1,*, Nathan Linial 1, Michal Linial 1
PMCID: PMC1669739  PMID: 17099230

Abstract

Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20 029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (total over 3 000 000 sequences) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at http://www.everest.cs.huji.ac.il/. The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows for browsing through the families of each of those sources, graphically visualizing the domain organization of the proteins in the family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families matching criteria regarding novelty, phylogenetic composition and more.

INTRODUCTION

Proteins are composed of one or several domains. The literature in protein science teems with definitions that attempt to capture the correct notion of a protein domain. Employing a structural point of view, domains are sometimes defined as minimal segments of the protein that will fold to their native shape should they be isolated from the rest of the peptide chain. Other definitions take an evolutionary perspective and define domains as segments of the sequence that recur in different proteins. Based on these definitions, several systems attempt to define and classify domains within protein databases. These systems vary both in the type of data they analyze and in the amount of manual input they incorporate. SCOP (1) and CATH (2) are both classifications of domains that analyze protein structures. SCOP is a manual classification while CATH classification is determined using a combination of automated and manual procedures. The relative scarcity of protein structures has led to the development of protein domain classification systems that take as input only protein sequence information. Databases such as Pfam A (3), BLOCKS (4) and SMART (5) offer comprehensive collections of families that were compiled by human experts, with the aid of computational tools [see review in (6,7)]. These methods provide high quality definitions that are most useful for biologists. However, they incorporate a great deal of human labor and expertise and require external information to identify new domain families. Several automatic systems for the identification and classification of domains in a database of protein sequences have been described in the literature. These include the ProDom algorithm (8) that was adopted by Pfam and forms Pfam B, and the more recent ADDA (9). EVEREST is our attempt at creating such an automatic system.

The different definitions for protein domains and for protein domain families do not always agree. In some cases these disagreements are the results of mistakes and inaccuracies. However, in many cases, more than one interpretation of the sequence or structure data is valid. The protein domain world is highly complex. For example, domains are hierarchical in nature, in two different senses. First, one domain may be composed of two or more sub-domains. Second, domain families may be grouped into super-families or divided into sub-families. Due to this complexity, several domain definition systems may disagree on the interpretation of a protein, and yet all be correct in some sense. It is therefore important to develop tools for browsing protein domain families and for comparing them, both within and across domain definition systems.

The EVEREST process

We have developed EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), an automatic computational process identifying protein domains and classifying them into families. The EVEREST process begins by constructing a database of protein segments that emerge in an all versus all pairwise sequence comparison. It then proceeds to cluster these segments, choosing the best clusters using machine learning techniques and creating a statistical model for each of them. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a database of segments and to cluster them again.

EVEREST has been thoroughly tested and evaluated, and has been shown to reconstruct 56% of Pfam A families and 63% of SCOP families with high accuracy, and to suggest many new domain families. A recently published manuscript describes the EVEREST process and its evaluation in detail (10).

THE EVEREST DATABASE AND WEBSITE

The EVEREST database contains 20 029 families, each defined by one or more HMMER HMMs (http://hmmer.wustl.edu/). The current release of the EVEREST database was constructed by scanning UniProt release 8.1 (11) and the sequences of all PDB (12) structures (total over 3 million sequences) with each of the EVEREST families. EVEREST annotates 93% of all Swiss-Prot sequences and 62% of all TrEMBL sequences (64% over all UniProt), and covers 84% of all residues in Swiss-Prot (56% for TrEMBL, 59% over all UniProt). For PDB, 88% of all sequences are annotated, and 84% of all residues are covered.

The EVEREST database of protein domain families can be accessed through the EVEREST website (http://www.everest.cs.huji.ac.il/). The website allows browsing through EVEREST domain families as well as domain families defined by SCOP, CATH and Pfam A. EVEREST families contain domains on both UniProt and PDB sequences. SCOP and CATH families only contain domains on PDB sequences and Pfam families only contain domains on UniProt sequences. A family page in the website provides a graphical representation of all proteins containing a domain of the family, and of all domains, as defined by the above four domain definition systems, on these proteins.

EVEREST families are denoted as EVRR.NNNNN where RR stands for the release number and NNNNN stands for the family number within the release.
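The naming convention above can be handled programmatically. The following is a minimal sketch, assuming a two-digit release number and a five-digit family number (consistent with identifiers such as EV02.00096 used in this article); the names `EverestId`, `parse_everest_id` and `format_everest_id` are hypothetical and are not part of any EVEREST software.

```python
import re
from typing import NamedTuple


class EverestId(NamedTuple):
    """Release number and family number of an EVEREST family (hypothetical helper type)."""
    release: int
    family: int


# Assumed layout: "EV", two release digits, a dot, five family digits.
_ID_PATTERN = re.compile(r"EV(\d{2})\.(\d{5})")


def parse_everest_id(identifier: str) -> EverestId:
    """Split an identifier such as 'EV02.00096' into its release and family numbers."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an EVEREST family identifier: {identifier!r}")
    return EverestId(release=int(match.group(1)), family=int(match.group(2)))


def format_everest_id(eid: EverestId) -> str:
    """Render the release and family numbers back into the EVRR.NNNNN form."""
    return f"EV{eid.release:02d}.{eid.family:05d}"


if __name__ == "__main__":
    parsed = parse_everest_id("EV02.00096")
    print(parsed.release, parsed.family)
    print(format_everest_id(parsed))
```

Malformed identifiers raise `ValueError` rather than being silently accepted, which is a design choice of this sketch.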

The website also features analyses of relationships between families and searches for proteins and families on the basis of keywords, family statistics, family phylogenetic profile and more. Finally, the user may upload a sequence to be scanned for EVEREST families and stored for future browsing by that user.

At any stage of the browsing, the user may customize the set of databases used. By default, non-redundant subsets of UniProt and PDB are used. The user may instead select to view the full versions of the sequence databases or to limit the view to the Swiss-Prot subset of UniProt. The user may also select which of the external domain definition systems to show, and at what level of classification (super-families or families for SCOP, homologous superfamily or S35 clusters for CATH and clans or families for Pfam).

Protein page

The protein page is accessible by textual search for keywords, accession numbers and names, as well as through links from domain family pages of all domains on the protein. The main body of the page starts with general information regarding the protein, followed by the sequence of the protein. Below that is a graphical representation of the domains on the protein. Domains are shown for all systems selected for view by the user. Each domain segment serves as a hyperlink to the family page of the represented domain's family. For all but EVEREST domains, the segments representing the domains are color coded for family. EVEREST families are color coded by the best score they receive with respect to any reference family in the database (see section 3.2 ‘Evaluating domain families using reference systems’). See Figure 1 for an example.

Figure 1.

Example protein record. Excerpt from the protein page of HMUU_YERPE – ‘Hemin transport system permease protein hmuU’ showing the graphical representation of the domains on the protein. The width of the record is proportional to the length of the protein sequence. Colored segments mark domains found by different systems (here EVEREST and Pfam) on the sequence. EVEREST segments are color coded for the best score their family receives with respect to any reference family in the database. Other segments are color coded for family. A color legend is available in a vertical strip on the left side of the page.

Domain family page

A family page can be produced for families of the EVEREST, SCOP, CATH and Pfam systems. The main part of the page contains general information about the family followed by records describing all proteins containing domains of the family.

The general information part contains the family's name and links to the home page of the family for families defined by the external systems, followed by download links for the HMMs defining the family for EVEREST families. Below those is a link to a list of the domains of the family in tabular format, followed by links to pages describing the scoring of the family by reference families from other systems and to the scoring of families from other systems using this family as a reference. See section 3.2 ‘Evaluating domain families using reference systems’ for further details on the scoring of families.

Below the general family information part, each protein record contains textual information about the protein and a schematic representation of all domains on the protein, in the same format as in the protein page, with the exception that EVEREST families are not coded for score. The main family of the page is always color coded red.

At the left of the page is a vertical strip containing links to other parts of the website, followed by a legend for the color coding of the domain families appearing in the page. The legend also provides information about relationships between those families and the main family of the page, as illustrated in Figure 2.

Figure 2.

Relationship between EV02.00096 and SCOP c.69.1.12. Excerpt from the family page of EV02.00096 is shown. (A) Record for PDB sequence 1BRT is highlighted. The EV02.00096 domain, in red, is a sub-domain of the SCOP c.69.1.12 domain, in striped dark blue. (B) The relationship between EV02.00096 and c.69.1.12 is described by (1) the keyword ‘Super’ indicating that c.69.1.12 domains are super-domains of EV02.00096 domains, (2) the left bar graph, which through the height of the bar indicates that less than a quarter of EV02.00096 domains participate in this relationship and (3) the right bar graph, indicating that all of the domains of c.69.1.12 participate in this relationship. (C) EV02.00096 is also a super-family of sub-domains of c.69.1.11.

Relationship between families

Our database describes relationships between domain families, both within and across domain definition systems. These relationships allow for the comparison of families and for browsing the domain family space from one family to related families. We define two dimensions of relations between protein domain families. The first dimension describes the relationship between ‘typical’ domains of the two families. The second dimension describes the relationship between the two domain families in terms of set inclusion. For example, let us review the relationship between EV02.00096 and SCOP family c.69.1.12: Haloperoxidase. All 6 c.69.1.12 domains are super-domains of domains of EV02.00096, but EV02.00096 contains 21 other domains, unrelated to c.69.1.12 domains. Ascending one level in the SCOP hierarchy, all of EV02.00096 domains are sub-domains of SCOP super-family c.69.1: alpha/beta-Hydrolases domains, which in turn contains domains unrelated to EV02.00096 domains. Thus, c.69.1.12 is a sub-family of super-domains of EV02.00096, which is a sub-family of sub-domains of c.69.1. See Figure 2 for an excerpt from the family page of EV02.00096 describing its relationships with SCOP families. Section 3.3 ‘Relationships between domain families’ describes the definitions we use for marking relationships between families.

Family query page

The website allows querying for domain families by several criteria. The user may select one or more criteria to apply in conjunction. Following are the different criteria types available:

  • Textual search in family name.

  • Family size limits.

  • Average domain size limits.

  • Family taxonomical composition as defined by limits on the proportion of the domains in the family in user requested taxa. Taxa from all levels of the phylogenetic tree are available.

  • Criteria regarding the novelty of the family as defined by limits on the proportion of domains in the family that are known to other domain definition systems (see section 3.2 ‘Evaluating domain families using reference systems’).

  • Limits on the scoring of the family by the best matching reference family of user selected reference domain definition systems (see section 3.2 ‘Evaluating domain families using reference systems’).

Some criteria definitions, especially those involving phylogenetic profiling, may produce searches that require several minutes to complete. Therefore, users are asked to provide an email address to which we send an email with a hyperlink to the results of the search once it is completed.

As an example of a search, suppose we wish to look for a new target for structural determination that might be applicable to medical research. We set the number of domains found on UniProt to be between 50 and 500. We request that the average size of the domain be between 100 and 200 amino acids, the usual range for structural domains. We ask that there be no domains in PDB, because we want an unknown structure. Furthermore, we request that the proportion of the family covered by Pfam A be at most 10%, since Pfam families are already on the structural genomics target lists. Finally, because we wish for applicability to medical research, we ask that the family contain human proteins and rodent proteins. We set the search in motion. After a few seconds we are asked to be more precise regarding the taxa criteria. Since we knew of the many human virus taxa, we asked for ‘human-virus’, so we only have to select ‘Homo sapiens’ from among the many human bacteria and other parasites. For ‘rodent’ we select the ‘Rodentia’ order. Because our search contains phylogenetic criteria, it could take a while. Finally, when the search is over we receive an email containing a hyperlink to the list of 89 families it produced.
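The conjunction of criteria in this example search can be written as a simple predicate. The sketch below is illustrative only: the record type `FamilyStats`, its field names and the function `is_structure_target` are hypothetical, and do not reflect the schema or query interface of the EVEREST website. The sample records are invented for the demonstration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FamilyStats:
    """Per-family summary statistics (hypothetical record layout)."""
    family_id: str
    uniprot_domains: int            # number of domains found on UniProt
    pdb_domains: int                # number of domains found on PDB sequences
    mean_domain_length: float       # average domain size in amino acids
    pfam_a_covered_fraction: float  # proportion of the family covered by Pfam A
    taxon_fraction: Dict[str, float] = field(default_factory=dict)


def is_structure_target(f: FamilyStats) -> bool:
    """Apply all criteria of the example search in conjunction."""
    return (
        50 <= f.uniprot_domains <= 500
        and 100 <= f.mean_domain_length <= 200
        and f.pdb_domains == 0                        # no known structure
        and f.pfam_a_covered_fraction <= 0.10         # mostly novel w.r.t. Pfam A
        and f.taxon_fraction.get("Homo sapiens", 0.0) > 0
        and f.taxon_fraction.get("Rodentia", 0.0) > 0
    )


def select_targets(families: List[FamilyStats]) -> List[str]:
    """Return the identifiers of the families passing every criterion."""
    return [f.family_id for f in families if is_structure_target(f)]


if __name__ == "__main__":
    # Invented records, for illustration only.
    demo = [
        FamilyStats("demo-1", 120, 0, 150.0, 0.05,
                    {"Homo sapiens": 0.2, "Rodentia": 0.3}),
        FamilyStats("demo-2", 120, 4, 150.0, 0.05,
                    {"Homo sapiens": 0.2, "Rodentia": 0.3}),
    ]
    print(select_targets(demo))
```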

EVEREST annotation of user sequences

Users may also upload their own sequences to be scanned for EVEREST families. The scan takes several minutes to a few hours, and the user is notified by email upon completion. The email contains a hyperlink to a protein page of the uploaded sequence. Furthermore, during sessions starting from the hyperlink in the email, the user's uploaded sequence will show in the family pages of all domains found on this sequence.

Registration

Users may choose to register with our database. Registration provides the users with a private space in our database, in which the user's searches and uploaded sequences are stored. Thereafter, upon logging in, the user may access lists of all searches they performed and of all sequences they uploaded. Furthermore, all sequences uploaded by the user will show in the family pages of all domains found on those sequences.

Downloads

The EVEREST database is available for download through the downloads link in the website. Available for download are the HMMs defining the families, in HMMER format, and flat files listing the EVEREST domains found on UniProt and on the PDB sequences.

TECHNICAL DETAILS

Data sources

Protein sequences were taken from UniProt release 8.1 (11) and PDB (as downloaded from the PDB server in February 2006) (12).

EVEREST release 2.0 family models were generated by applying the EVEREST algorithm to release 49.2 of the Swiss-Prot database (11). These models were then used to identify family members on all sequences in our database.

SCOP domains were taken from ASTRAL release 1.69 (13). CATH release 2.6.0 was used. Pfam A families and clans (14) were taken from the InterPro database, release 12.1 (15).

The phylogenetic tree was downloaded from the NCBI Taxonomy FTP site (16).

Evaluating domain families using reference systems

The EVEREST system is evaluated by computing its coverage of reference systems and its accuracy when taking those reference systems as gold standards for domain family definitions. To this end we have developed a scoring scheme that enables scoring an evaluated domain family with respect to a reference domain family in the context of a reference system of domain families. A detailed description of the scoring scheme and the results of applying it to EVEREST is given in (10). Briefly, for an evaluated family e, let π(e) be a collection of reference domains given by allowing each domain in the evaluated family to collect those reference domains that significantly intersect with it. Then, when evaluating e with respect to a reference family r, a true positive would be a member of π(e) that is also a member of r, a false positive would be a member of π(e) that is not a member of r, and a false negative would be a member of r that is not a member of π(e). The score of e with respect to r would be the size of the intersection of π(e) and r divided by the size of their union. We have calculated the scores of EVEREST families with respect to Pfam families and with respect to SCOP and CATH families. We have also calculated the scores of SCOP families with respect to CATH families and vice versa. Since Pfam families are defined on UniProt sequences, while SCOP and CATH families are defined on PDB sequences, we cannot score Pfam with respect to SCOP and CATH. Furthermore, since, a priori, EVEREST is less reliable than SCOP, CATH and Pfam, and Pfam is less reliable than SCOP and CATH, we do not score the latter systems with respect to the former.
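The score described above is the ratio of the intersection to the union of two sets, which equals TP / (TP + FP + FN). The following minimal sketch illustrates it, assuming that π(e) has already been computed and that reference domains are represented by hashable identifiers; the function names and the convention of returning 0.0 for two empty sets are assumptions of this sketch. The criterion for a "significant" intersection is given in (10) and is not reproduced here.

```python
from typing import Hashable, Set, Tuple


def confusion_counts(pi_e: Set[Hashable], r: Set[Hashable]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) of pi(e) against r."""
    true_positives = len(pi_e & r)    # in pi(e) and in r
    false_positives = len(pi_e - r)   # in pi(e) but not in r
    false_negatives = len(r - pi_e)   # in r but not in pi(e)
    return true_positives, false_positives, false_negatives


def family_score(pi_e: Set[Hashable], r: Set[Hashable]) -> float:
    """Size of the intersection of pi(e) and r divided by the size of their union."""
    union = pi_e | r
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(pi_e & r) / len(union)


if __name__ == "__main__":
    # Invented identifiers, for illustration only.
    pi_e = {"d1", "d2", "d3"}
    r = {"d2", "d3", "d4", "d5"}
    print(confusion_counts(pi_e, r))  # (2, 1, 2)
    print(family_score(pi_e, r))      # 0.4
```

A score of 1 is reached only when π(e) and r contain exactly the same reference domains.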

Relationships between domain families

Observing two domain instances on the same protein, we mark five relations, namely sub-domain, super-domain, same, N-neighbor and C-neighbor, as illustrated in Figure 3. When marking these relations, we allow each pair of domain instances a and b to be either strongly following, possibly following, contradicting or none of the above, with respect to each of the possible relationship types. Strongly following is always also possibly following. A pair of domain instances can be possibly following two different relations, but a pair that is strongly following a relation cannot be possibly following any other relation.

Figure 3.

Five types of relations between domain instances. Illustration of the five defined relation types between two domain instances on the same protein. 1. sub-domain: domain a is a sub-segment of domain b. 2. super-domain: domain a is a super-segment of domain b. 3. same: domain a is the same segment as domain b. 4. N-neighbor: domain a is N-terminal to domain b. 5. C-neighbor: domain a is C-terminal to domain b.

Let Pa be the proportion of domain a that is covered by domain b and Pb be the proportion of domain b that is covered by domain a. Let Ca be the middle position of domain a and Cb be the middle position of domain b. Table 1 shows the different conditions used for defining strongly following and possibly following for the different relations. For N-neighbor and C-neighbor relations, if the pair is not possibly following, it is defined to be contradicting. For sub-domain, super-domain and same relations, if a pair is not possibly following the relation and is not strongly following either of the two neighbor relations, it is defined to be contradicting. We also note the natural notion of reciprocity of relation. Namely sub-domain is reciprocal to super-domain, N-neighbor is reciprocal to C-neighbor and same is reciprocal to itself.

Table 1. Parameters for defining relations between two domain instances

Relation     | Conditions for strongly following | Conditions for possibly following
Sub-domain   | Pa ≥ 0.9, Pb < 0.65               | Pa ≥ 0.75, Pb < 0.8
Super-domain | Pa < 0.65, Pb ≥ 0.9               | Pa < 0.8, Pb ≥ 0.75
Same         | Pa ≥ 0.9, Pb ≥ 0.9                | Pa ≥ 0.75, Pb ≥ 0.75
N-neighbor   | Pa ≤ 0.1, Pb ≤ 0.1, Ca < Cb       | Pa ≤ 0.25, Pb ≤ 0.25, Ca < Cb
C-neighbor   | Pa ≤ 0.1, Pb ≤ 0.1, Ca > Cb       | Pa ≤ 0.25, Pb ≤ 0.25, Ca > Cb
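The conditions of Table 1, together with the rules for marking a pair as contradicting, can be sketched in code as follows. This is an illustration under stated assumptions, not the EVEREST implementation: domains are taken to be inclusive residue ranges, the middle position is taken as the mean of the first and last residue, and the names `Segment`, `coverage`, `TABLE_1` and `classify` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Segment:
    """A domain instance as an inclusive residue range on one protein (assumed convention)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1

    @property
    def middle(self) -> float:
        return (self.start + self.end) / 2


def coverage(a: Segment, b: Segment) -> Tuple[float, float]:
    """Return (Pa, Pb): the proportion of a covered by b, and of b covered by a."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    return overlap / a.length, overlap / b.length


# Thresholds copied from Table 1: relation -> (strongly following, possibly following).
TABLE_1 = {
    "sub-domain": (lambda pa, pb, ca, cb: pa >= 0.9 and pb < 0.65,
                   lambda pa, pb, ca, cb: pa >= 0.75 and pb < 0.8),
    "super-domain": (lambda pa, pb, ca, cb: pa < 0.65 and pb >= 0.9,
                     lambda pa, pb, ca, cb: pa < 0.8 and pb >= 0.75),
    "same": (lambda pa, pb, ca, cb: pa >= 0.9 and pb >= 0.9,
             lambda pa, pb, ca, cb: pa >= 0.75 and pb >= 0.75),
    "N-neighbor": (lambda pa, pb, ca, cb: pa <= 0.1 and pb <= 0.1 and ca < cb,
                   lambda pa, pb, ca, cb: pa <= 0.25 and pb <= 0.25 and ca < cb),
    "C-neighbor": (lambda pa, pb, ca, cb: pa <= 0.1 and pb <= 0.1 and ca > cb,
                   lambda pa, pb, ca, cb: pa <= 0.25 and pb <= 0.25 and ca > cb),
}
NEIGHBORS = ("N-neighbor", "C-neighbor")


def classify(a: Segment, b: Segment) -> Dict[str, str]:
    """Label the pair (a, b) as strong, possible, contradicting or none per relation."""
    pa, pb = coverage(a, b)
    args = (pa, pb, a.middle, b.middle)
    strong = {name: tests[0](*args) for name, tests in TABLE_1.items()}
    possible = {name: tests[1](*args) for name, tests in TABLE_1.items()}
    strong_neighbor = any(strong[name] for name in NEIGHBORS)

    status = {}
    for name in TABLE_1:
        if strong[name]:
            status[name] = "strong"
        elif possible[name]:
            status[name] = "possible"
        elif name in NEIGHBORS or not strong_neighbor:
            # Neighbor relations contradict whenever they are not possible; the
            # other relations contradict unless the pair is a strong neighbor.
            status[name] = "contradicting"
        else:
            status[name] = "none"
    return status


if __name__ == "__main__":
    # A 50-residue domain lying inside a 100-residue domain.
    print(classify(Segment(10, 59), Segment(1, 100)))
```

In the example, a is fully covered by b (Pa = 1.0) while only half of b is covered by a (Pb = 0.5), so the pair strongly follows the sub-domain relation and contradicts the other four.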

Observing two domain families A and B, we count for each of the above five relations the number of domains a of A for which there exists a domain b in B such that the pair a, b is strongly following, possibly following and contradicting the relation. These counts form the basis of the second dimension of the relationship between the families. If all, or nearly all of the domains of A have a certain relation with a domain of B, but a significant number of the domains of B do not have the reciprocal relation with a domain of A, then B is a super-family of A with respect to that relation, and A is a sub-family of B with respect to that relation. If all, or nearly all of the domains of A have a certain relation with a domain of B and all or nearly all of the domains of B have the reciprocal relation with a domain of A then A and B are matching families with respect to that relation. We do not provide exact definitions and thresholds for these terms. Instead we provide, and graphically visualize, the counts of the domains in each family sharing the relation, and let the user decide how to name the relationship between the families.
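The per-family counts can be sketched as follows. This is a minimal illustration under stated assumptions: each family is given as a list of (protein identifier, domain) pairs, only domains on the same protein are compared, and the pairwise classifier is supplied by the caller as a function returning a status per relation. The function name `count_relations` and the toy classifier in the demonstration are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, Iterable, Tuple

Domain = Tuple[int, int]  # (start, end) on a protein; an assumed representation


def count_relations(
    family_a: Iterable[Tuple[Hashable, Domain]],
    family_b: Iterable[Tuple[Hashable, Domain]],
    classify: Callable[[Domain, Domain], Dict[str, str]],
) -> Counter:
    """Count domains a of A having some b in B with a given (relation, status).

    Each domain a contributes at most once to each (relation, status) key,
    however many domains b of B satisfy it.
    """
    b_by_protein = defaultdict(list)
    for protein, b in family_b:
        b_by_protein[protein].append(b)

    counts: Counter = Counter()
    for protein, a in family_a:
        seen = set()
        for b in b_by_protein.get(protein, []):
            for relation, status in classify(a, b).items():
                seen.add((relation, status))
        counts.update(seen)
    return counts


if __name__ == "__main__":
    def toy_classify(a: Domain, b: Domain) -> Dict[str, str]:
        # Stand-in classifier for the demonstration only:
        # identical segments are "same", anything else contradicts.
        return {"same": "strong" if a == b else "contradicting"}

    family_a = [("P1", (1, 50)), ("P2", (1, 50)), ("P3", (5, 60))]
    family_b = [("P1", (1, 50)), ("P2", (10, 80))]
    counts = count_relations(family_a, family_b, toy_classify)
    print(counts[("same", "strong")], "of", len(family_a), "domains of A")
```

Dividing a count by the size of the family gives the proportion of its domains participating in the relation; running the function with A and B swapped and the reciprocal relation gives the corresponding proportion for the other family.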

MAINTENANCE AND FUTURE DEVELOPMENTS

The EVEREST database is designed to handle multiple versions of EVEREST and of all other information sources (sequence database and domain definition systems). In fact, EVEREST families defined by a scan of an older Swiss-Prot version are available by choosing to view EVEREST release 1.0. We will run the EVEREST process at least once a year to define new families and update the database as new releases of UniProt, PDB, SCOP, CATH and Pfam become available.

Storing search results opens many options for combining the results of different searches. We plan to enable more sophisticated searches by adding tools for conjunction and disjunction of result sets, as well as tools for combining result sets via the family relations defined in section 3.3 ‘Relationships between domain families’. An example search using such a tool would be to define two sets of SCOP families of two different functions using keyword search, and then to look for EVEREST families that are super-families of members of both SCOP sets.

Acknowledgments

The authors thank Alex Savenok for the design and much of the programming of the EVEREST website. The authors thank Reuven Abliyev and Menachem Fromer for their support. The authors also thank the system team in the Hebrew University School of Computer Science and Engineering. E.P. is supported by an Eshkol fellowship from the Israeli Ministry of Science and by the Sudarsky Center for Computational Biology. This work is partially funded by NoE (Framework VI) BioSapiens consortium. The Open Access publication charges for this article were waived by Oxford University Press.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Hubbard T.J., Ailey B., Brenner S.E., Murzin A.G., Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res. 1999;27:254–256. doi: 10.1093/nar/27.1.254.
  • 2. Orengo C.A., Pearl F.M., Bray J.E., Todd A.E., Martin A.C., Lo Conte L., Thornton J.M. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res. 1999;27:275–279. doi: 10.1093/nar/27.1.275.
  • 3. Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L. The Pfam protein families database. Nucleic Acids Res. 2002;30:276–280. doi: 10.1093/nar/30.1.276.
  • 4. Henikoff S., Henikoff J.G., Pietrokovski S. Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics. 1999;15:471–479. doi: 10.1093/bioinformatics/15.6.471.
  • 5. Schultz J., Copley R.R., Doerks T., Ponting C.P., Bork P. SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 2000;28:231–234. doi: 10.1093/nar/28.1.231.
  • 6. Henikoff S. Comparative methods for identifying functional domains in protein sequences. Biotechnol. Annu. Rev. 1995;1:129–147. doi: 10.1016/s1387-2656(08)70050-4.
  • 7. Liu J., Rost B. Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol. 2003;7:5–11. doi: 10.1016/s1367-5931(02)00003-0.
  • 8. Servant F., Bru C., Carrere S., Courcelle E., Gouzy J., Peyruc D., Kahn D. ProDom: automated clustering of homologous domains. Brief Bioinform. 2002;3:246–251. doi: 10.1093/bib/3.3.246.
  • 9. Heger A., Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol. 2003;328:749–767. doi: 10.1016/s0022-2836(03)00269-9.
  • 10. Portugaly E., Harel A., Linial N., Linial M. EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics. 2006;7:277. doi: 10.1186/1471-2105-7-277.
  • 11. Wu C.H., Apweiler R., Bairoch A., Natale D.A., Barker W.C., Boeckmann B., Ferro S.E.G., Huang H., Lopez R., Magrane M., et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006;34:D187–D191. doi: 10.1093/nar/gkj161.
  • 12. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 13. Chandonia J.M., Walker N.S., Lo Conte L., Koehl P., Levitt M., Brenner S.E. ASTRAL compendium enhancements. Nucleic Acids Res. 2002;30:260–263. doi: 10.1093/nar/30.1.260.
  • 14. Finn R.D., Mistry J., Schuster-Bockler B., Griffiths-Jones S., Hollich V., Lassmann T., Moxon S., Marshall M., Khanna A., Durbin R., et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251. doi: 10.1093/nar/gkj149.
  • 15. Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Bradley P., Bork P., Bucher P., Cerutti L. InterPro, progress and status in 2005. Nucleic Acids Res. 2005;33:D201–D205. doi: 10.1093/nar/gki106.
  • 16. Wheeler D.L., Chappey C., Lash A.E., Leipe D.D., Madden T.L., Schuler G.D., Tatusova T.A., Rapp B.A. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2000;28:10–14. doi: 10.1093/nar/28.1.10.
