BMC Bioinformatics. 2004 Dec 17;5:204. doi: 10.1186/1471-2105-5-204

An SVD-based comparison of nine whole eukaryotic genomes supports a coelomate rather than ecdysozoan lineage

Gary W Stuart 1,2, Michael W Berry 3
PMCID: PMC544558  PMID: 15606920

Abstract

Background

Eukaryotic whole genome sequences are accumulating at an impressive rate. Effective methods for comparing multiple whole eukaryotic genomes on a large scale are needed. Most attempted solutions involve the production of large scale alignments, and many of these require a high stringency pre-screen for putative orthologs in order to reduce the effective size of the dataset and provide a reasonably high but unknown fraction of correctly aligned homologous sites for comparison. As an alternative, highly efficient methods that do not require the pre-alignment of operationally defined orthologs are also being explored.

Results

A non-alignment method based on the Singular Value Decomposition (SVD) was used to compare the predicted protein complement of nine whole eukaryotic genomes ranging from yeast to man. This analysis resulted in the simultaneous identification and definition of a large number of well conserved motifs and gene families, and produced a species tree supporting one of two conflicting hypotheses of metazoan relationships.

Conclusions

Our SVD-based analysis of the entire protein complement of nine whole eukaryotic genomes suggests that highly conserved motifs and gene families can be identified and effectively compared in a single coherent definition space for the easy extraction of gene and species trees. While this occurs without the explicit definition of orthologs or homologous sites, the analysis can provide a basis for these definitions.

Background

Several methods have been developed for the detailed global comparison of multiple whole genomes and the production of global phylogenies. Most of these methods require the prior identification and selection of a reasonably small subset of putative orthologs within which individual homologous sites are identified with some degree of confidence using alignment [1-7]. Frequently, detailed alignment information is subdivided and compressed into a smaller number of complex characters (such as gene content or gene order), which are then used for quantitative comparison [[4,5]; see [6] for review], but the more or less direct use of large scale sequence alignments has also been attempted [7].

Though generally less developed, many non-alignment methods, considered initially by Blaisdell [8], are currently being explored for a similar purpose [[9-15]; see [17] for review]. Rarely do such methods simultaneously provide 1) detailed and unbiased comparisons of a high fraction of biomolecular sequences within full genome datasets, and 2) globally consistent gene and species trees based on this exhaustive comparison. We have recently developed an SVD-based phylogenetic method that provides accurate comparisons of a high fraction of sequences within whole genomes without the prior identification of orthologs or homologous sites [13]. This method has been successfully applied to a number of diverse genome datasets, including mitochondrial genomes, bacterial genomes, and viral genomes [13-15]. Here we apply this method to a diverse set of nine complete eukaryotic nuclear genomes, resulting in the production of a species tree based on the automatic identification and simultaneous comparison of over 400 conserved amino acid motifs and gene families.

Results and discussion

Proteome data sets and sequence conversion

The nine eukaryotic genomes compared in this analysis are listed in Table 1. The protein sets obtained from NCBI for the malaria parasite (Pfal) and the budding yeast (Scer) each contributed only 3% of the 175,559 total proteins in the dataset, while the proteins for Frub provided nearly 21% of the total. Only the Frub proteins were obtained from the Ensembl Genome Browser [16], since protein predictions for this organism were not available from NCBI. Differences in the methods used by these two organizations to predict proteins might be responsible for the large difference in the number of proteins predicted for comparable vertebrate genomes (>37,000 for Frub, but only 21–25,000 for all other vertebrates). These differences could, in principle, drastically affect the gene and species trees derived from a global comparison of all proteins. However, the position of Frub in the final species tree suggests that these effects were relatively minor (see below). We have noted previously that even drastic genome size differences can be accommodated by our method [14].

Table 1.

Genes and Genomes Compared

Organism SVD "top 5" Genome Total
Hsap 996 (23%) 25,319 (14%)
Mmus 881 (20%) 25,371 (14%)
Rnov 670 (15%) 21,204 (12%)
Frub 536 (12%) 37,439 (21%)
Agam 573 (13%) 16,091 (9%)
Dmel 443 (10%) 18,107 (10%)
Cele 135 (3%) 21,124 (12%)
Scer 113 (3%) 5,855 (3%)
Pfal 23 (1%) 5,049 (3%)
Total 4,370 (100%) 175,559 (100%)

SVD-derived vector definitions for motifs and gene families

All the proteins in the dataset were recoded as overlapping tetrapeptide frequency vectors and the resulting data matrix was decomposed by the SVD. A total of 437 singular triplets were obtained as output. The "protein" vectors provided in the "right" factor matrix are known to provide reduced dimensional definitions for all proteins in the dataset as linear combinations of the orthogonal "right" singular vectors (rsv's). Conversely, the right singular vectors themselves frequently represent "ideal" versions of proteins defining a given gene family [13,14]. Protein vectors having the strongest projections on a given rsv are therefore likely to represent members of a given gene family. In this analysis, the proteins with the five strongest projections (referred to as the "top 5") for each rsv were used to identify and summarize a number of gene families. The total numbers of proteins from each species that appear in the "top 5" for all 437 right singular vectors are listed in Table 1. Although the fraction of "top 5" proteins identified by the SVD roughly parallels the fraction of total proteins from each species, the mammalian proteins tend to dominate the analysis.
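The recoding step can be illustrated with a minimal Python sketch. It assumes the simplest reading of the text: each protein becomes a vector of counts of its overlapping tetrapeptides, and these vectors form the columns of a peptide-by-protein matrix. The function names and toy sequences are hypothetical, and only observed peptides are kept as rows, whereas the full matrix would span all possible tetrapeptides.

```python
from collections import Counter

def tetrapeptide_counts(sequence, k=4):
    """Count the overlapping k-residue peptides (k=4: tetrapeptides) in one protein."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def build_count_matrix(proteins, k=4):
    """Return (peptides, columns) with one count column per protein.

    peptides: sorted list of every peptide seen in any protein (row labels).
    columns:  columns[j][i] is the count of peptides[i] in proteins[j].
    """
    counts = [tetrapeptide_counts(seq, k) for seq in proteins]
    peptides = sorted(set().union(*counts))
    columns = [[c[p] for p in peptides] for c in counts]
    return peptides, columns

# Toy sequences, invented for illustration only.
peptides, columns = build_count_matrix(["MKSALKSAL", "KSALD"])
```

A matrix of this kind would then be handed to a sparse SVD routine; that step is not shown here.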

Each right singular vector can potentially define two distinct gene families. In this case, the highest positive elemental values within a vector identify proteins associated with one protein family, while the most negative values identify proteins associated with an anti-correlated family (i.e. proteins that rarely share the same tetrapeptides). Frequently, however, a strong family definition is provided for just one protein family. In this case, the anti-correlated proteins are seen to be derived from a mixture of two or more families. Since the choice of sign is arbitrary, strong family definitions are equally likely to be provided by either the positive or the negative values within a vector. Family definitions provided by positive vector values are denoted below using the simple vector index (e.g. 277 = the 277th singular vector). Those provided by negative vector values are followed by an "a" (e.g. 277a). It is worth noting at this point that protein family definitions provided by the SVD necessarily account for not only "what is there" (tetrapeptides that form the motifs that define the family), but also "what is not there" (tetrapeptides excluded by that family of proteins, but likely to form anti-correlated motifs within other families of proteins).
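The labelling convention can be sketched as follows. Given the elements of one right singular vector and the matching protein identifiers, the sketch returns the strongest positive and the strongest negative projections under the labels used in the text (e.g. 277 and 277a). The function name, the identifiers p1 to p6 and the values are hypothetical, and n is set to 2 only to keep the toy example short.

```python
def family_candidates(vector, proteins, index, n=5):
    """Split one right singular vector into its two candidate families.

    Returns a dict keyed by the labels used in the text: str(index) for the
    n largest positive elements and str(index) + "a" for the n most negative.
    """
    ranked = sorted(zip(vector, proteins))  # ascending by element value
    negative = [(p, v) for v, p in ranked[:n] if v < 0]
    positive = [(p, v) for v, p in reversed(ranked[-n:]) if v > 0]
    return {str(index): positive, f"{index}a": negative}

# Invented values for six hypothetical proteins.
vector = [0.07, -0.04, 0.01, -0.05, 0.06, 0.0]
proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
families = family_candidates(vector, proteins, index=277, n=2)
```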

Protein family definitions provided by right singular vectors

An abbreviated list of 58 protein families identified within the 437 SVD-derived singular triplets is provided in Table 2. For each listed singular triplet, the gi# of an example protein chosen from among the "top 5" values within the right singular vector is provided, along with its corresponding Name and a Protein Description provided within the NCBI annotation for that protein. In general, proteins described by the more dominant singular triplets were selected for presentation from the complete list of 437 triplets. However, some were chosen due to their historical utility for evolutionary comparisons (ribosomal proteins) and/or their tendency to be accompanied by strongly correlated peptide motifs (last column of Table 2). Relatively few families appear in the table because some vectors strongly describe only one family rather than two, some vectors describe only families from species that lack annotation or are poorly annotated at NCBI (i.e. Frub proteins, Agam proteins, etc.), some vectors describe protein families listed by NCBI merely as "unknown" or "conserved unknown", some vectors describe proteins with weakly conserved motifs, and some vectors describe distinct subfamilies of proteins. In the latter case, multiple right singular vectors are apparently required in combination to describe some of the more diverse families of proteins. Included in Table 2 is the number of singular vectors that include the chosen example protein within their "top 5". When multiple vectors are involved in defining multiple related subfamilies, the most "dominant" vector (the one on which the example protein casts its strongest projection) is listed in the first column. Thus, some proteins are seen to have multiple subfamily affiliations. The multiple vectors observed per family effectively subdivide the 58 families into 179 distinct subfamilies.
For instance, Table 2 includes a set of 18 ribosomal protein families described by a total of 65 singular vectors (highlighted in bold). Ribosomal proteins are frequently well conserved, effectively aligned, and commonly used for estimating evolutionary relationships. Their presence within our list of dominant singular vectors suggests their utility for establishing evolutionary relationships even in the absence of explicit alignments and explicit a priori assignments of orthology.

Table 2.

A selected list of protein family/motifs identified by SVD-derived singular triplets (st's). In this summary table, unique example proteins (rsv-gi#) were chosen from the 5 to 40 "top five" proteins identified as members of a given family by as many as 8 distinct right singular vectors. As examples, six individual ras proteins representing six broad categories of ras (highlighted in italics) are defined by a total of 13 right singular vectors, and 18 ribosomal proteins (highlighted in bold) are defined by a total of 65 right singular vectors. The lengths of continuous copep strings identified from the corresponding left singular vectors and their specificities (E-values) as revealed by pairwise BLAST are also provided.

triplet # rsv-gi# Name Protein Description lsv copep string (E-value)
421a 1 11415030 HIST1H4J H4 histone family, member E 62 aa's (1e-54)
417a 2 21166389 HIST1H2BC H2B histone family, member L 75 aa's (4e-67)
413a 1 31560385 Rpl21 ribosomal protein L21 60 aa's (2e-55)
408 1 4501885 ACTB beta actin; beta cytoskeletal actin 42 aa's (9e-38)
405 1 4506661 Rpl7a ribosomal protein 7a 79 aa's (3e-62)
392a 1 5174735 TUBB2 tubulin, beta, 2 45 aa's (7e-41)
389a 2 13569962 RAB1B RAB1B, RAS oncogene family; small GTP-binding 14 aa's (2e-11)
389 3 6677781 Rpl29 ribosomal protein L29 77 aa's (3e-60)
387 3 31981690 Hspa8 heat shock 70kD protein 8 40 aa's (2e-35)
385a 1 11024714 UBB ubiquitin B precursor; polyubiquitin B 77 aa's (2e-68)
378a 5 26051216 CAMK2B calmodulin-dependent protein kinase IIB isoform 7 14 aa's (2e-10)
373a 2 4502201 ARF1 ADP-ribosylation factor 1 86 aa's (1e-41)
371a 3 6679439 Ppia peptidylprolyl isomerase A; cyclophilin A 55 aa's (2e-48)
368a 5 25150942 Tcb-1 transposable element tcb1 transposase (1O615) 88 aa's (7e-74)
363 3 33149310 UBE2D3 ubiquitin-conjugating enzyme E2D 3 isoform 1 138 aa's (7e-91)
354 3 4502549 CALM2 calmodulin 2; phosphorylase kinase delta 40 aa's (1e-19)
352a 4 17105394 RPL23A ribosomal protein L23a 44 aa's (3e-33)
350a 4 9845511 RAC1 ras-related C3 botox sub 1 isoform Rac1, rho 15 aa's (2e-12)
347a 3 51873060 Eef1a1 eukaryotic translation elongation factor 1 alpha 1 24 aa's (4e-19)
345 2 27679110 Rpl17 ribosomal protein L17 (L23) 92 aa's (2e-89)
341a 5 31980772 Ppp1cc protein phosphatase 1, catalytic, gamma isoform 20 aa's (5e-17)
337 5 24648716 mod(mdg4) modifier of mdg4 32 aa's (2e-29)
334 5 24653107 Galpha49B G protein alpha49B 19 aa's (9e-18)
333a 3 4506633 RPL31 ribosomal protein L31 78 aa's (8e-74)
329a 2 34878793 Pcdha13 protocadherin alpha 13 17 aa's (8e-14)
327 3 32307119 PPP2R2B Serine/threonine protein phosphatase 2A, neuronal 23 aa's (7e-20)
324 1 31982919 ZNF430 zinc finger protein 430 18 aa's (3e-11)
322a 3 34871376 LOC287293 similar to high mobility group 1 protein 15 aa's (9e-13)
321a 3 4504445 HNRPA1 heterogeneous nuclear ribonucleoprotein A1 23 aa's (2e-18)
320a 2 25141298 kin-1 cyclic AMP-dependent catalytic subunit (kin-1) 66 aa's (4e-62)
316a 5 22094075 Slc25a5 solute carrier family 25; adenine nucleotide 27 aa's (7e-22)
308a 3 9845502 LAMR1 laminin receptor 1 (67kD, ribosomal protein SA) 68 aa's (1e-60)
304 3 6978809 Eno1 enolase 1, alpha 32 aa's (3e-27)
301 4 27676004 LOC365206 similar to ribosomal protein L9 139 aa's (1e-13)
295 2 31083250 PPP2R5C Ser/threo protein phosphatase 2A, 56 kD regulator, 16 aa's (6e-12)
292 4 31560517 Rpl27a ribosomal protein L27a 58 aa's (7e-56)
291 2 15011936 RPS26 ribosomal protein S26 77 aa's (7e-64)
288 1 22129671 Olfr493 olfactory receptor MOR204–35 12 aa's (3e-08)
287 2 38076430 LOC193565 similar to T-cell receptor alpha chain 16 aa's (2e-12)
285a 3 6754140 H2-Q7 histocompatibility 2, Q region locus 7 19 aa's (5e-16)
280a 5 16418339 Rpl10 ribosomal protein 10 27 aa's (4e-23)
277a 1 15718763 KRAS2 cellular c-Ki-ras2 proto-oncogene 9 aa's (2e-06)
277 2 27689505 Rab5c similar to Rab5c protein 17 aa's (4e-13)
276 4 24580529 M(2)21AB Minute (2) 21AB CG2674-PA 25 aa's (5e-20)
272 1 25742772 Kcna2 potassium voltage-gated channel, shaker-related, 12 aa's (1e-09)
270 4 33186863 Rpl13 ribosomal protein L13 11 aa's (3e-09)
266 4 4506697 RPS20 ribosomal protein S20 54 aa's (2e-49)
256 3 4506597 RPL12 ribosomal protein L12 34 aa's (8e-30)
253a 6 15809016 MRLC2 myosin regulatory light chain MRCL2 19 aa's (7e-16)
247 3 31981515 Rpl7 ribosomal protein L7 10 aa's (4e-08)
240a 5 24639734 Dlc dynein light chain ATPase 22 aa's (4e-21)
237a 4 34865959 gpdh similar to glyceraldehyde-3-phosphate 16 aa's (7e-13)
236a 2 10835049 ARHA Aplysia ras-related homolog 12; oncogene RHO 9 aa's (9e-07)
230 6 15431293 RPL15 ribosomal protein L15 11 aa's (6e-09)
224 5 13592069 Rps10 ribosomal protein S10 81 aa's (1e-78)
197a 2 14249144 Rab11b RAB11B, member RAS oncogene family 15 aa's (4e-12)
190a 6 4506621 RPL26 ribosomal protein L26 16 aa's (8e-14)
183a 5 14277700 RPS12 ribosomal protein S12 13 aa's (1e-10)

The diverse families of ras proteins present within the eukaryotic data set provide good examples of the ability of SVD-derived singular triplets to identify and describe both superfamilies and subfamilies of proteins. The ras proteins are well described by at least 13 vectors, including the 6 dominant vectors highlighted in italics in Table 2. All the "top 5" members of the protein families identified by these 6 vectors are listed in Table 3. Vector 197a summarizes the brain-associated ras11 subfamily (Rab11), vector 236a summarizes the Aplysia-related ras subfamily (ApRas), vectors 277 and 277a summarize the brain-associated ras 5 subfamily (Rab5) and the complex Ha/K/Nras subfamily (HaRas) respectively, vector 350a summarizes the ras-related C3 botulinum toxin substrate 1 subfamily (Rac1), and vector 387a summarizes the brain-associated ras1B subfamily (Rab1B). The most dominant ras vector, 389a, appears to describe a more generalized version of the Rab1 subfamily, since this vector includes both Rab1A and Rab1B proteins within the "top five". In addition, as explained below, this vector also summarizes a high fraction of the entire set of 34 ras sequences within all subfamilies.

Table 3.

Comparison of seven ras family clusters provided by right singular vectors with KOG and HomoloGene clusters. Only proteins having one of the five strongest projections ("top five") for a given singular vector are used in the comparison. Few genomes have KOG members specifically identified by NCBI; however, most or all of the "top 5" proteins for a given rsv would likely be identified as members of the same KOG family. For 197a (Rab11), the KOG # provided in parentheses is that of the closely related human protein.

rsv# gi# projection organism GeneName kog# hg#
197a 6679583 0.06900 Mmus Rab11b (0087) 3109
(Rab11) 14249144 0.06892 Rnov Rab11b na 3109
31209781 0.06827 Agam na na 3109
31209783 0.06827 Agam na na 3109
31209785 0.06826 Agam na na 3109
236a 31542143 0.05883 Mmus Arha na 1257
(ApRas) 16923986 0.05883 Rnov Arha2 na 1257
10835049 0.05873 Hsap RHOA 0393 1257
28395033 0.05610 Hsap ARHC 0393 22408
en131312 0.05412 Frub na na na
277 27689505 0.07229 Rnov Rab5c na 20961
(Rab5) 4759020 0.07214 Hsap RAB5C 0092 20961
31225537 0.07022 Agam na na 20961
31225545 0.07022 Agam na na 20961
31225553 0.07022 Agam na na 20961
277a 15718763 0.04278 Hsap KRAS2 0395 2159
(HaRas) 4885425 0.04243 Hsap HRAS 0395 3907
34861217 0.04243 Rnov Hras1 na 3907
4505451 0.04176 Hsap NRAS 0395 20564
34859609 0.04165 Rnov Nras na 20564
350a 9845511 0.07403 Hsap RAC1 0393 23126
(RasC3) 38081613 0.07403 Mmus Rac1 na 23126
9845509 0.06942 Hsap RAC1b 0393 23126
4826962 0.06820 Hsap RAC3 0393 3705
18875380 0.06820 Mmus Rac3 na 3705
387a 34861437 0.03486 Rnov Rab1B na 23689
(Rab1) 21313162 0.03413 Mmus Rab1B na 23689
13569962 0.03400 Hsap RAB1B 0084 23689
27709432 0.03400 Rnov Rab1B-like na 27733
en156199 0.03396 Frub na na na
389a 4758988 0.04851 Hsap RAB1A 0084 3067
(Rab/Ras) 6679587 0.04851 Mmus Rab1A na 3067
13569962 0.04840 Hsap RAB1B 0084 23689
en160503 0.04824 Frub na na na
13592035 0.04811 Rnov Rab1A na 3067

For comparison, KOG and HomoloGene memberships are also listed, when available, for each of the "top 5" proteins listed in Table 3. Table 4 provides a similar comparison for a set of four arbitrarily selected protein families unrelated to ras or to each other (potassium channel, enolase, solute carrier protein, and ADP-ribosylation factor). Since most of the genomes used in our study have not yet been included within the KOG classification scheme, only fly and human proteins have official KOG affiliations. However, we expect with high likelihood that most if not all of the top 5 proteins listed in Tables 3 and 4 would also be members of the particular KOG family listed for each vector. Given this, there would be a good correspondence in Tables 3 and 4 between KOG family members and the proteins identified by singular vectors. In contrast, the HomoloGene resource appears to provide a more selective classification method, dividing the KOG protein families into two or more subfamilies within which members are more likely to represent specific orthologs.

Table 4.

Comparison of four unrelated protein clusters provided by right singular vectors with KOG and HomoloGene clusters. Descriptions for each of these clusters are provided in Table 2. Only proteins having one of the five strongest projections ("top five") for a given singular vector are used in the comparison.

rsv# gi# projection organism GeneName kog# hg#
272a en165011 0.06928 Frub na na na
(Kcna) 25742772 0.06865 Rnov Kcna2 na 21034
4826782 0.06834 Hsap Kcna2 1545 21034
31543024 0.06821 Mmus Kcna2 na 21034
27465523 0.06632 Rnov Kcna1 na 183
304 12963491 0.101507 Mmus Eno1 na 1093
(Eno) 6978809 0.101252 Rnov Eno1 na 1093
4503571 0.097337 Hsap Eno1 2670 1093
51770896 0.092899 Mmus Eno1 na 1093
en150208 0.091209 Frub na na na
316a 32189350 0.11376 Rnov Slc25a5 na 37448
(Slc25) 22094075 0.11343 Mmus Slc25a5 na 37448
4502099 0.11202 Hsap Slc25a5 0749 37448
en159404 0.1034 Frub na na na
20863388 0.10117 Mmus Slc25a4 na 36058
373a 4502201 0.12887 Hsap Arf1 0070 1253
(Arf) 6680716 0.12887 Mmus Arf1 na 1253
11968098 0.12887 Rnov Arf1 na 1253
24668762 0.12856 Dmel Arf79F 0070 1253
24668773 0.12856 Dmel Arf79F 0070 1253

Conserved motif definitions provided by left singular vectors

Members of any particular ras subfamily represented by a given right singular vector share a uniquely conserved set of correlated tetrapeptides we have previously referred to as a "copep motif". These motifs are explicitly described by the corresponding "left" singular vectors (lsv's) comprising a given singular triplet. The lsv's describe these copep motifs as linear combinations of the 160,000 possible tetrapeptides. Those with high positive values identify peptides found with high probability in the conserved motif of a given subfamily, while those with a high negative value identify peptides excluded with high probability. Therefore, like the rsv's, the lsv's frequently describe two distinct anti-correlated entities (in this case motifs rather than protein families) using either positive or negative values within the vector. Using essentially the same procedure described above for any given rsv, the tetrapeptides having the largest positive or largest negative projections on any given lsv were identified in order to provide a focused summary of the motifs described by that vector. For motif extraction, however, an arbitrary cut-off value (absolute value > 0.025) was used to identify dominant tetrapeptides.

In most cases, it is possible to cluster the resulting short list of dominant tetrapeptides into several uninterrupted copep strings formed by tetrapeptides that overlap in 3 of 4 consecutive amino acid positions. Using this procedure, one long copep string was identified for each of the singular triplets listed in Table 2. The length of the identified long copep string and its corresponding E-value (resulting from pairwise BLAST) are provided as a summary in the last column. The precise amino acid sequences of the long copep strings identified for all listed vectors are provided in a supplementary table [see Additional file 1]. The E-values listed provide a measure of the specificity with which each corresponding protein is identified by the copep string extracted from a given lsv. It is important to note that the long copep string provides only an approximate summary of the lsv from which it is extracted, yet the small E-values clearly indicate that the vast majority of the proteins identified in Table 2 are very specifically recognized by their corresponding copep string.
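The extraction and chaining of dominant tetrapeptides can be sketched in Python as shown here. This is one possible reading of the procedure, not the authors' implementation: tetrapeptides beyond the 0.025 cut-off are split by sign, and within each group a peptide is appended to a string when its first three residues match the last three residues of the current one. The function names and the projection values are hypothetical; chains stop at ambiguous branches, and purely cyclic peptides (such as a homopolymer) are not reported.

```python
def dominant_tetrapeptides(lsv, cutoff=0.025):
    """Split a left singular vector (peptide -> value) at +/- cutoff."""
    positive = sorted(p for p, v in lsv.items() if v > cutoff)
    negative = sorted(p for p, v in lsv.items() if v < -cutoff)
    return positive, negative

def copep_strings(tetrapeptides):
    """Chain tetrapeptides that overlap in 3 of 4 positions into longer strings."""
    peptides = set(tetrapeptides)
    by_prefix = {}
    for p in peptides:
        by_prefix.setdefault(p[:3], []).append(p)
    suffixes = {p[1:] for p in peptides}
    # A chain starts at a peptide that no other peptide can precede.
    starts = sorted(p for p in peptides if p[:3] not in suffixes)
    strings = []
    for start in starts:
        chain, current, seen = start, start, {start}
        while True:
            nxt = [q for q in by_prefix.get(current[1:], []) if q not in seen]
            if len(nxt) != 1:
                break  # dead end or ambiguous branch
            current = nxt[0]
            seen.add(current)
            chain += current[-1]
        strings.append(chain)
    return strings

# Invented projection values for a handful of tetrapeptides.
lsv = {"DTAG": 0.04, "TAGQ": 0.05, "AGQE": 0.03, "KSAL": 0.03,
       "GGGG": 0.01, "PPKK": -0.06, "PKKK": -0.04}
positive, negative = dominant_tetrapeptides(lsv)
motif = copep_strings(positive)
anti_motif = copep_strings(negative)
```

Keeping the two signs apart reflects the point made earlier that positive and negative elements of a vector describe anti-correlated motifs.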

Figure 1 provides a more detailed demonstration of how correlated peptide motifs and their associated gene families are simultaneously identified and described by SVD-derived singular vectors. In order to allow a clear comparison of SVD-derived motifs with alignment-derived motifs, the dominant tetrapeptides were superimposed over matching regions of a standard ClustalX alignment of the 34 ras proteins identified in the "top five" of the corresponding right singular vectors listed in Table 3. In this example, the dominant tetrapeptides extracted from the six selected left singular vectors are demarcated within (shaded/colored) boxes. Many of the dominant tetrapeptides are seen to form extended strings of overlapping peptides that correspond well to conserved contiguous regions within particular subsets of the ras proteins. For example, vectors 350a and 236a identify and provide distinct descriptions for motifs within the Ras-related botulinum toxin C3 substrate proteins (RasC3) and the Aplysia-related ras proteins (ApRas), respectively. The two most dominant left singular vectors of Figure 1 (389a and 387a) describe motifs within overlapping subsets of the nine Rab1 proteins. In addition, the most dominant left singular vector (389a) appears to describe a highly conserved motif within the entire set of 34 ras proteins reasonably well (solid clear boxes). This vector conspicuously identifies dominant tetrapeptides that span the two regions of the alignment in which unbroken strings of two or more invariant amino acids (asterisks) are present. These two regions are known to be required for ras GTPase activity [18]. It is notable that although these 34 ras proteins have only one stretch with more than two globally conserved consecutive amino acids (DTAGQE), vector 389a is capable of describing large regions of all 34 proteins by recognizing the latent similarity of multiple equivalent tetrapeptides. 
For example, this single vector recognizes KSAL, KSCL, and KTCL (residues 18–21 of the alignment) as dominant tetrapeptides that occupy equivalent positions within four of the six subtypes of ras proteins (Figure 1). Vector 389a also provides a reasonably strong summary of the large number of other ras proteins present within the genomes of these organisms, but not included in Figure 1 (not shown). In general, the most dominant singular vectors appear to identify highly conserved peptides present in a high fraction of individual members of a protein family or superfamily, while the less dominant vectors appear to describe conserved tetrapeptides present within a restricted set of proteins comprising a subfamily.

Figure 1.

Ras families and sub-families defined by singular vectors (labeled at right). For comparison, dominant peptide strings identified by SVD (boxes) are shown within a Clustal-X alignment. The aligned region corresponds to the first 181 aa's of the 192 aa human Rac3 protein. Protein sequences are labeled by gi# (or Ensembl# for Frub). Asterisks (*) indicate globally conserved residues. Subfamily motifs associated with negative vector values are denoted with an "a" suffix (e.g. 350a).

Instead of simply providing restricted motif summaries using the most dominant elements of the left singular vectors, we have also attempted to examine entire vectors in order to gain a better understanding of the motifs (and associated protein families) they describe. A reasonably efficient method for depicting left singular vectors is presented in Figure 2, using vectors 389 and 277 as examples. Both vectors are shown as frequency distributions (purple bars) that summarize the approximate magnitudes of the projections provided by all 160,000 tetrapeptides on the vector in question. These distributions are compared to a normal distribution having the same standard deviation (blue bars). In both examples, a significant fraction of tetrapeptides have high or low values in considerable excess of that expected from a normal distribution. Many of these also exceed the arbitrary cut-off value of 0.025 (dashed lines) used to extract the dominant tetrapeptides that serve to summarize the corresponding motifs. Parts of the Rab5 and HaRas motifs extracted from vector 277 are shown in Figure 2A as overlapping dominant tetrapeptides with associated projection values. Similar motifs extracted from vector 389 are shown in Figure 2B. In the latter case, a motif from the large subunit ribosomal protein rpL29 represents the "anti-motif" of the Ras/Rab proteins described by the extreme vector elements of opposite sign.
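The comparison with a normal distribution can be made concrete with a small sketch. It assumes the reference is a normal distribution with the same mean and standard deviation as the vector elements, and it reports the observed and expected fractions of elements lying beyond the cut-off. The function name and the toy values are hypothetical.

```python
import math

def tail_fractions(values, cutoff=0.025):
    """Return (observed, expected) fractions of elements with |value| > cutoff.

    expected is the two-sided tail mass of a normal distribution that has the
    same mean and standard deviation as the supplied values.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    observed = sum(1 for v in values if abs(v) > cutoff) / n
    scale = sd * math.sqrt(2)
    expected = 0.5 * (math.erfc((cutoff - mean) / scale)
                      + math.erfc((cutoff + mean) / scale))
    return observed, expected

# Toy vector: mostly zeros with two extreme elements of opposite sign.
observed, expected = tail_fractions([0.0] * 98 + [0.05, -0.05])
```

In this toy case the observed tail fraction is far larger than the normal expectation, which is the kind of heavy-tailed pattern the figure describes.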

Figure 2.

Left singular vectors depicted as tetrapeptide projection value frequency distributions. Distributions for singular vectors 277 (A) and 389 (B) are shown in purple, normal distributions having the same standard deviation are shown in blue. For both distributions, the vast majority of values fall between 0.015 and -0.015. Dashed lines mark the cut-off values used to extract dominant tetrapeptides summarizing correlated peptide (copep) motifs. Selected strings of overlapping tetrapeptides describing parts of these motifs are shown boxed above the approximate regions in the distribution in which they appear.

Species vectors for the production of species phylogenies

The detailed comparative information contained within the hundreds of singular vectors and their corresponding motifs and gene families was subsequently used to build a species phylogeny by summing all the SVD-derived right protein vectors separately for each organism and then comparing the relative orientation of the resulting species vectors [13]. Figure 3A shows the SVD-based topology obtained for the nine eukaryotes compared in this study. This tree supports a coelomate rather than ecdysozoan lineage. Two distinct re-sampling methods were used to estimate branch statistics for this tree. The top value of each pair of support values for each branch shown in Figure 3A was generated using a traditional bootstrap procedure [19]. In this case, 100 random sets of 437 re-sampled singular vectors were made and used to construct 100 species trees. Alternatively, a novel "successive, delete one" jackknife procedure [14] was used to generate the bottom value shown for each branch. In this case, the least dominant singular vector was removed successively (down to 10 vectors) to generate 427 ordered sets of singular vectors, and a new tree was estimated following each removal. Although bootstrap support values for the branches grouping arthropods with vertebrates (37%) and worms with other metazoa (49%) are relatively weak, support values for these branches are strong (100%) using the modified jackknife procedure. All other branches are strongly supported by both procedures. The branch separating Cele from the coelomates is of special interest, since the weak bootstrap support observed (37%) might suggest a significant affinity between Cele and the arthropods consistent with the "ecdysozoan" model (Figure 3A – alternative branching pattern shown in red). Bootstrap support for the alternative ecdysozoan cluster, however, was only 24%.
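The summation and comparison steps can be sketched as follows. The sketch assumes that "relative orientation" is measured by the cosine of the angle between species vectors, and that a bootstrap replicate resamples the singular-vector dimensions with replacement. All function names and toy values are hypothetical, and the construction of a tree from the pairwise comparisons is not shown.

```python
import math
import random

def species_vectors(protein_vectors, organisms):
    """Sum the reduced protein vectors separately for each organism."""
    sums = {}
    for vec, org in zip(protein_vectors, organisms):
        total = sums.setdefault(org, [0.0] * len(vec))
        for i, x in enumerate(vec):
            total[i] += x
    return sums

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bootstrap_columns(n_vectors, n_replicates=100, seed=0):
    """Index sets that resample the singular-vector dimensions with replacement."""
    rng = random.Random(seed)
    return [rng.choices(range(n_vectors), k=n_vectors)
            for _ in range(n_replicates)]

def resample(vector, columns):
    """Rebuild a vector from a resampled set of dimensions."""
    return [vector[i] for i in columns]

# Toy data: three proteins described in two dimensions.
vectors = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.3]]
sums = species_vectors(vectors, ["Hsap", "Hsap", "Scer"])
similarity = cosine(sums["Hsap"], sums["Scer"])
replicates = bootstrap_columns(437)
```

Each replicate would be applied to every species vector with `resample` before the pairwise comparison is repeated.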

Figure 3.

SVD-based proteome phylogeny (A) of nine eukaryotes with percentage branch support: top – bootstrap; bottom – novel jackknife. An unsupported alternative phylogeny containing the "ecdysozoan" lineage is indicated by the dashed red branches. Percentage branch support values for the various clades of the tree are also provided to the left (B) for trees built using all proteins, as well as trees built after poorly described proteins are removed using either of two alternative vector magnitude inclusion values (>0.005, >0.05).

Use of the "successive, delete-one" jackknife procedure as a species tree branch statistic is justified by the fact that SVD provides singular triplets in order of their "dominance" in explaining the data set [20]. Mathematical dominance provides an objective measure of importance that can be utilized to weight characters. Since the modified jackknife procedure used here deletes the least dominant singular vectors one at a time in order, the more dominant singular vectors (i.e. conserved motifs/families) are automatically weighted more heavily within the consensus tree. Hence, one can argue that our novel jackknife procedure provides stronger support for the derived phylogeny because the most dominant singular vectors generally contain stronger information about gene and species relationships.
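The implicit weighting can be checked with a short calculation. The sketch assumes that each jackknife set keeps the k most dominant vectors, with k running from 436 down to 10, and counts how many sets still contain the vector of a given rank. The function names are hypothetical.

```python
def jackknife_sizes(n_vectors=437, minimum=10):
    """Sizes of the ordered vector sets left after each successive deletion."""
    return list(range(n_vectors - 1, minimum - 1, -1))

def times_retained(rank, sizes):
    """Number of sets that still contain the vector of this rank (1 = most dominant)."""
    return sum(1 for k in sizes if rank <= k)

sizes = jackknife_sizes()
retained = {rank: times_retained(rank, sizes) for rank in (1, 10, 11, 436, 437)}
```

Under this assumption there are 427 sets, the ten most dominant vectors contribute to all of them, and the count falls by one per rank after that, so the least dominant vector contributes to none.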

Poorly described proteins and species tree quality

While our SVD-based analysis technically considers all proteins present within all nine genomes of the data set, it is likely that accurate vector definitions are provided for only a small fraction of these proteins. Theoretically, the 437 singular triplets could effectively describe as many as 2 × 437 = 874 protein families. However, many of these vectors appear to best describe particular subfamilies of larger groups of closely related proteins. Thus, the 58 protein families listed in Table 2 are each represented by anywhere from 1 to 8 triplets. Although, as mentioned earlier, some protein families lacking clear functional annotation were omitted from this table, it still serves to provide a conservative lower estimate of the number of well-described protein families provided by the SVD. Assuming the number of identifiable protein families in our nine genome data set significantly exceeds the 58 to 179 protein families unambiguously demarcated and subdivided in our analysis, then hundreds or perhaps thousands of the poorly described proteins included in our species vector sums might be contributing a high fraction of "noise" to the definition of species.

In an attempt to increase the fraction of well described proteins used to define species, proteins having poor projections on all 437 right singular vectors were ignored during the summation process. Arbitrary vector magnitude cut-off values of 0.005 or 0.05 were applied to reduce the number of poorly described proteins used to build species trees. Even though the higher and more stringent cut-off value removed the majority of proteins during summation, both new species trees had identical topologies to that of the tree shown in Figure 3A in which all proteins were included. Bootstrap and modified jackknife branch support values for these trees are shown in Figure 3B along with those derived from the inclusive analysis. The removal of only a small fraction of poorly described proteins (cut-off = 0.005, about 10^3 proteins removed) resulted in 22% bootstrap and 100% modified jackknife support for the coelomate lineage, but 0% support for the ecdysozoan lineage. Removal of a much higher fraction of poorly described proteins (cut-off = 0.05, about 10^5 proteins removed) produced an equivalent result. Hence, poorly described proteins contribute little to the support that our analysis provides for the coelomate model.
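The filtering step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "vector magnitude" means the Euclidean length of a protein's coordinate vector in the reduced space, and the function name and the three-dimensional toy vectors are hypothetical.

```python
import math

def species_vector(protein_vectors, cutoff=0.0):
    """Sum protein vectors, skipping any whose magnitude falls below cutoff.

    Assumes a non-empty list of equal-length vectors. Returns the summed
    species vector and the number of proteins that were ignored.
    """
    kept = [v for v in protein_vectors
            if math.sqrt(sum(x * x for x in v)) >= cutoff]
    total = [0.0] * len(protein_vectors[0])
    for v in kept:
        for d, x in enumerate(v):
            total[d] += x
    return total, len(protein_vectors) - len(kept)

# Toy protein vectors (hypothetical values, three dimensions instead of 437).
proteins = [
    [0.30, -0.10, 0.02],     # well described: large magnitude
    [0.001, 0.002, -0.001],  # poorly described: magnitude about 0.0024
    [0.01, 0.03, 0.02],      # intermediate: magnitude about 0.037
]

for cutoff in (0.0, 0.005, 0.05):
    vec, removed = species_vector(proteins, cutoff)
    print(cutoff, removed, vec)
```

Raising the cut-off discards more proteins but, as long as the dominant proteins point in a consistent direction, changes the orientation of the summed species vector very little.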

Conclusions

As demonstrated above, an SVD-based analysis of multiple genomes automatically interprets proteins from input genomes as potential members of a limited list of hierarchically defined protein families and subfamilies. Each subfamily is defined in detail by one or more singular vectors as linear combinations of a large number of peptides (160,000 tetrapeptides, in this case). Potentially, a large number of proteomes lacking annotation can be directly interpreted using this method, assuming a sufficient number of annotated proteomes are included in the analysis. Although most of the genomes used in the present analysis were already accompanied by detailed protein annotations, formal annotations of the Frub and Agam proteins were not readily available. Nevertheless, our SVD-based analysis was able to provide precise protein motif descriptions and subfamily affiliations, not only for the six Frub or Agam proteins shown in Figure 1, but also for any of the hundreds of other Frub or Agam proteins exhibiting strong vector projections on any of the 437 derived singular vectors (see "SVD top five" of Table 1).

Our method bears partial resemblance to a recently described graph-theoretic method for rapidly clustering massive datasets of whole genome protein sequence [22]. In this case, the protein definitions generated were not used to derive gene or species trees, but to provide for a comprehensive clustering of all proteins into families having one or more members. The nodes of their graphs, like the vectors from the right matrix in our analysis, represent proteins, while the edges between nodes in their graphs, like the angles between vectors in our analysis, contain the distance information used to compare proteins. However, the distance information in their analysis was obtained ultimately from exhaustive pairwise BLAST alignments. In contrast, our distance information was derived without alignment, by reference to the 437 most dominant SVD-derived orthonormal left singular vectors. These vectors provide "motif models" expressed as particular linear combinations of the 160,000 possible tetrapeptides. The projections of these motif models on a given protein vector serve to quantitatively define the protein. Since no more than 874 motif models would be provided by our truncated SVD, our method would be less effective than other methods for providing comprehensive family designations for all proteins in a dataset [22,23]. However, a high fraction of these protein families are found to contain only one or a few members [22]. Singletons and small families would generally provide unimportant contributions to relative species definitions, since the majority of species would lack a homolog for comparison. Hence small or poorly conserved protein families, presumably represented by the weaker singular triplets in a complete SVD, are profitably ignored in our analysis.

Although our descriptive analysis of singular triplets (e.g. Table 2, Figure 1) suggests that the protein vectors in our high dimensional definition space can be effectively clustered, we have not applied any specific clustering algorithm. Hence no explicit clustering of proteins, equivalent to the identification of orthologs or homologs, is required. Nevertheless, the application of a clustering algorithm to our vector based symmetric protein distances is clearly feasible and results in accurate clustering for a high fraction of proteins. In fact, the accuracy with which proteins are clustered into known families via Neighbor Joining was used previously to establish optimal dimensionality for a well characterized data set [13]. In addition, unlike other methods, our method provides a straightforward vector addition mechanism for converting relative protein definitions into relative species definitions for the production of species phylogenies.

Alternative non-alignment methods exist for comparing sequences [reviewed in [17]]. Some of these methods may prove to be scalable and adaptable to the problem of whole genome phylogeny. For example, a comprehensive bacterial phylogeny was recently derived using species vectors that include a set of background corrected pentapeptide or hexapeptide (K-tuple) frequency values [12]. Although apparently effective for producing global species phylogenies, this method fails to provide quantitatively comparable protein definitions or interpretable predictions for conserved motifs. While many phylogenetically informative pentapeptides and hexapeptides are likely derived from homologs or orthologs, no mechanism exists for extracting, summarizing, and interpreting this information in terms of motif and gene family definitions. This high stringency method provides a low false positive rate (strong connections between probable orthologous peptides), but comes at the expense of a high false negative rate (little or no recognition of other homologous regions within proteins). For organisms exhibiting a significant level of horizontal gene transfer [24-26], models for motifs and protein families may be crucial tools for identifying "borrowed" genes and assessing their impact on phylogenetic hypotheses.

Our SVD-based species tree supports the traditional "coelomate" model of animal phylogeny. Other large-scale, genome level analyses also tend to support this model [27,28]. The alternative "ecdysozoan" model is supported by comparative analyses of rRNA and analyses that include morphological characters [28,29]. Although genome-scale analyses should perhaps carry considerable weight due to the higher fraction of "total information" used as input, the separation of "signal" from "noise" represents a serious hurdle for these methods. Our method represents a uniquely independent solution that provides a noise-reduced simultaneous global comparison of all proteins within multiple genomes without the need for alignments and without the prior application of operational definitions of orthology. As such, it provides a global perspective on gene and species relationships that is based on a much larger subset of information than that normally used. Since it is a non-alignment method, it provides a fundamentally different kind of analysis, and to the extent that the resulting species phylogenies agree with those provided by other analyses that depend upon highly filtered subsets of aligned orthologs or close homologs, we may derive an additional degree of confidence in these relationships. However, the balanced comparison of a large number of additional whole genome sequences from a variety of animals will likely be required in order to produce an unambiguous and universally accepted animal phylogeny.

Methods

Datasets

Complete reference protein sequences for nine whole eukaryotic genomes ranging from yeast to man were compiled into a single dataset (Figure 1a). Curated protein sequence files were obtained from NCBI dated as follows: human (Hsap) 10/10/03, mouse (Mmus) 10/31/03, rat (Rnov) 9/23/03, mosquito (Agam) 10/24/03, fly (Dmel) 10/24/03, worm (Cele) 11/12/03, malaria (Pfal) 10/17/02, and yeast (Scer) 11/12/03. Pufferfish (Frub) protein sequences dated 8/26/02 were obtained from the DOE Joint Genome Institute.

Peptide frequencies and SVD

Each protein sequence in the dataset was recoded as a high dimensional vector containing raw frequencies for each of the 160,000 possible tetrapeptides. Previous work has established that although tripeptides work well for estimating similarities between highly divergent proteins contained within small sets of viral genomes [15], tetrapeptides work better for larger data sets derived from vertebrate mitochondrial genomes or whole bacterial genomes [13,14]. Although pentapeptides also worked well with the mitochondrial datasets (unpublished), our computational capacity precluded the use of pentapeptides (3.2 million patterns) with larger data sets, like the one used here. Following a log-entropy transformation [21], the singular value decomposition of the resulting data matrix was computed. The log-entropy transformation tends to down-weight evenly distributed high frequency peptides that are likely sources of homoplasy. After 1500 Lanczos iterations (residual errors less than 10^-6), three output matrices were obtained, consisting of 437 singular triplets (left and right singular vectors and their corresponding singular value). Each left singular vector produced by the SVD defines one or two conserved motifs within the dataset as particular linear combinations of tetrapeptides [13,14]. Similarly, each of the right singular vectors defines one or two conserved gene families (or subfamilies) as particular linear combinations of proteins. Each gene family identified by a given right singular vector contains motifs described by the corresponding left singular vector. Two distinct motif/families are frequently identified per triplet, since each triplet describes both a correlated motif/family (positive values) and an anti-correlated motif/family (negative values).
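The recoding and weighting steps can be sketched in a few lines. The example below counts overlapping tetrapeptides (20^4 = 160,000 possible patterns) and applies a log-entropy weighting in the form commonly given in the information retrieval literature: a local weight log2(1 + tf) multiplied by a global weight 1 + Σ p log p / log n, where p is the share of a peptide's total count found in one protein and n is the number of proteins. The exact variant used by the authors is not specified here, so this form, the function names and the toy sequences are all assumptions.

```python
import math
from collections import Counter

def tetrapeptide_counts(seq, k=4):
    """Count overlapping k-mers in one protein sequence (k=4: tetrapeptides)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def log_entropy(counts_per_protein):
    """Weight raw counts by log2(1 + tf) times an entropy-based global weight."""
    n = len(counts_per_protein)
    # Total count of each peptide across all proteins.
    totals = Counter()
    for counts in counts_per_protein:
        totals.update(counts)
    # Global weight: near 0 for evenly spread peptides, 1 for peptides
    # confined to a single protein.
    global_weight = {}
    for pep, gf in totals.items():
        entropy_sum = 0.0
        for counts in counts_per_protein:
            tf = counts.get(pep, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        global_weight[pep] = 1.0 + entropy_sum / math.log(n) if n > 1 else 1.0
    return [
        {pep: math.log2(1 + tf) * global_weight[pep] for pep, tf in counts.items()}
        for counts in counts_per_protein
    ]

proteins = ["ACDEF", "ACDEG"]  # toy sequences, not real data
weighted = log_entropy([tetrapeptide_counts(s) for s in proteins])
print(weighted)
```

In this toy case the shared peptide ACDE is evenly distributed across both sequences and receives a weight of zero, while the sequence-specific peptides keep their full local weight, which mirrors the down-weighting of evenly distributed peptides described above. The weighted values would then populate the sparse peptide-by-protein matrix passed to a truncated SVD routine.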

Vector based motif and protein family models

"Dominant" vector elements (absolute values in excess of 0.025) were extracted from the left singular vectors and summarized using the C++ program "Copepx" [14]. These values were associated with the most "dominant" (i.e. highly conserved) tetrapeptides found within the motifs described by a given left vector. In addition, the "top five" positive and "top five" negative elements were extracted from the right singular vectors and summarized using the C++ program "Coprotx". These values represent the most dominant members of the gene families described by a given right vector.

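The extraction of dominant elements described in this section can be sketched as follows. The functions below are illustrative stand-ins and are not the Copepx or Coprotx programs; the vector values and labels are invented toy data, and only the 0.025 threshold and the "top five" rule come from the text.

```python
def dominant_elements(vector, labels, threshold=0.025):
    """Return (label, value) pairs with |value| > threshold, largest magnitude first."""
    hits = [(lab, val) for lab, val in zip(labels, vector) if abs(val) > threshold]
    return sorted(hits, key=lambda t: abs(t[1]), reverse=True)

def top_five(vector, labels, n=5):
    """Return the n most positive and the n most negative elements of a vector."""
    ranked = sorted(zip(labels, vector), key=lambda t: t[1])
    negative = [t for t in ranked[:n] if t[1] < 0]
    positive = [t for t in reversed(ranked[-n:]) if t[1] > 0]
    return positive, negative

# Toy left singular vector over five tetrapeptides (hypothetical values).
lsv = [0.30, -0.01, -0.26, 0.02, 0.027]
peptides = ["GKST", "AAAA", "DTAG", "LLLL", "QEEY"]
print(dominant_elements(lsv, peptides))

# Toy right singular vector over eight proteins (hypothetical values).
rsv = [0.5, -0.4, 0.1, -0.2, 0.3, 0.05, 0.2, -0.1]
protein_ids = ["p%d" % i for i in range(len(rsv))]
positive, negative = top_five(rsv, protein_ids)
print(positive)
print(negative)
```

Positive and negative elements are reported separately because each triplet can describe both a correlated and an anti-correlated motif or family.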
Species trees and branch support

Distance matrices were derived by summing all the SVD-derived right protein vectors for a given organism and then comparing the relative orientation of the resulting species vectors using the program Cosdist [13,14]. Species trees were subsequently derived from distance matrices using Phylip-Neighbor [30]. Two distinct resampling methods were used to provide branch support: a traditional bootstrap procedure [19], and a modified jackknife procedure. For the bootstrap, 100 random sets of 437 resampled singular vectors were made and used to construct 100 species trees. For the "successive, delete one" jackknife procedure [14], the least dominant singular vector was removed successively (down to 10 vectors) to generate 427 ordered sets of singular vectors, and a new tree was estimated following each removal.
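The species comparison and the bootstrap resampling can be sketched as below. This is a hedged illustration, not the Cosdist program: it assumes the distance between two species vectors is 1 minus the cosine of the angle between them, that bootstrap resampling draws vector dimensions with replacement, and that a zero-length vector is treated as maximally distant. All names and toy values are hypothetical.

```python
import math
import random

def species_vectors(proteins_by_species):
    """Sum the protein vectors of each species into one species vector."""
    return {sp: [sum(col) for col in zip(*vecs)]
            for sp, vecs in proteins_by_species.items()}

def cosine_distance(u, v):
    """Return 1 - cos(angle) between u and v (1.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def distance_matrix(vectors, dims=None):
    """Pairwise distances between species, optionally on a subset of dimensions."""
    names = sorted(vectors)
    if dims is None:
        dims = range(len(vectors[names[0]]))
    sub = {n: [vectors[n][d] for d in dims] for n in names}
    return names, [[cosine_distance(sub[a], sub[b]) for b in names] for a in names]

def bootstrap_dims(n_dims, n_replicates=100, seed=0):
    """Draw n_replicates resampled sets of n_dims dimension indices."""
    rng = random.Random(seed)
    return [rng.choices(range(n_dims), k=n_dims) for _ in range(n_replicates)]

# Toy data: three species, protein vectors in three dimensions instead of 437.
toy = {
    "A": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "B": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1]],
    "C": [[0.0, 0.0, 1.0]],
}
vectors = species_vectors(toy)
names, dist = distance_matrix(vectors)
replicates = bootstrap_dims(n_dims=3)
print(names, dist)
```

Each bootstrap replicate, or each successive jackknife set, would be passed as `dims` to `distance_matrix`, and the resulting matrix handed to a neighbor-joining program to estimate one tree per resampled set.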

List of abbreviations used

Homo sapiens (Hsap), Mus musculus (Mmus), Rattus norvegicus (Rnov), Anopheles gambiae (Agam), Drosophila melanogaster (Dmel), Caenorhabditis elegans (Cele), Plasmodium falciparum (Pfal), Saccharomyces cerevisiae (Scer), Fugu rubripes (Frub), correlated peptide (copep), correlated protein (coprot), right singular vector (rsv), left singular vector (lsv).

Authors' contributions

GS conceived the study, gathered the input data, provided primary interpretation of the output, and drafted the manuscript. MB wrote and adapted software, performed computational analysis on the input data, and provided manuscript modifications. All authors read and approved the final manuscript.

Supplementary Material

Additional File 1

Copep Motifs. Long copep strings identified within the left singular vectors of a given s-triplet.



Acknowledgements

This work was supported in part by a Computer and Information Technology Resources challenge grant from the University of Tennessee, Knoxville.

Contributor Information

Gary W Stuart, Email: G-Stuart@indstate.edu.

Michael W Berry, Email: berry@cs.utk.edu.

References

  1. House CH, Fitz-Gibbon ST. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J Mol Evol. 2002;54:539–547. doi: 10.1007/s00239-001-0054-5.
  2. Lerat E, Daubin V, Moran NA. From Gene Trees to Organismal Phylogeny in Prokaryotes: The Case of the gamma-Proteobacteria. PLoS Biol. 2003;1:E19. doi: 10.1371/journal.pbio.0000019.
  3. Wolf YI, Rogozin IB, Grishin NV, Koonin EV. Genome trees and the tree of life. Trends Genet. 2002;18:472–479. doi: 10.1016/S0168-9525(02)02744-0.
  4. Lin J, Gerstein M. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 2000;10:808–818. doi: 10.1101/gr.10.6.808.
  5. Snel B, Bork P, Huynen MA. Genome phylogeny based on gene content. Nat Genet. 1999;21:108–110. doi: 10.1038/5052.
  6. Sawa G, Dicks J, Roberts IN. Current approaches to whole genome phylogenetic analysis. Brief Bioinform. 2003;4:63–74. doi: 10.1186/1471-2105-4-63.
  7. Wolf YI, Rogozin IB, Koonin EV. Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 2004;14:29–36. doi: 10.1101/gr.1347404.
  8. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci U S A. 1986;83:5155–5159. doi: 10.1073/pnas.83.14.5155.
  9. Chu KH, Qi J, Yu ZG, Anh V. Origin and phylogeny of chloroplasts revealed by a simple correlation analysis of complete genomes. Mol Biol Evol. 2004;21:200–206. doi: 10.1093/molbev/msh002.
  10. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics. 2001;17:149–154. doi: 10.1093/bioinformatics/17.2.149.
  11. Pe'er I, Felder CE, Man O, Silman I, Sussman JL, Beckmann JS. Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins. 2004;54:20–40. doi: 10.1002/prot.10559.
  12. Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J Mol Evol. 2004;58:1–11. doi: 10.1007/s00239-003-2493-7.
  13. Stuart GW, Moffett K, Leader JJ. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol Biol Evol. 2002;19:554–562. doi: 10.1093/oxfordjournals.molbev.a004111.
  14. Stuart GW, Berry MW. A comprehensive whole genome bacterial phylogeny using correlated peptide motifs defined in a high dimensional vector space. J Bioinfo Comp Biol. 2003;19:475–494. doi: 10.1142/S0219720003000265.
  15. Stuart GW, Moffet K, Bozarth R. A whole genome perspective on the phylogeny of the plant virus family Tombusviridae. Arch Virol. 2004;149:1595–1610. doi: 10.1007/s00705-004-0298-7.
  16. Ensemble Genome Browser http://www.ensembl.org
  17. Vinga S, Almeida J. Alignment free sequence comparison – a review. Bioinformatics. 2002;4:513–523. doi: 10.1093/bioinformatics/btg005.
  18. Macaluso M, Russo G, Cinti C, Bazan V, Gebbia N, Russo A. Ras family genes: an interesting link between cell cycle and cancer. J Cell Physiol. 2002;192:125–130. doi: 10.1002/jcp.10109.
  19. Felsenstein J. Confidence-limits on phylogenies – an approach using the bootstrap. Evolution. 1985;39:783–791. doi: 10.1111/j.1558-5646.1985.tb00420.x.
  20. Berry MW, Browne M. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia; 1999.
  21. Frakes WB, Baeza-Yates R. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ; 1992.
  22. Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 2003;31:4632–4638. doi: 10.1093/nar/gkg495.
  23. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, et al. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 2004;5:R7. doi: 10.1186/gb-2004-5-2-r7.
  24. Daubin V, Moran NA, Ochman H. Phylogenetics and the cohesion of bacterial genomes. Science. 2003;301:829–832. doi: 10.1126/science.1086568.
  25. Philippe H, Douady CJ. Horizontal gene transfer and phylogenetics. Curr Opin Microbiol. 2003;6:498–505. doi: 10.1016/j.mib.2003.09.008.
  26. Raymond J, Zhaxybayeva O, Gogarten JP, Blankenship RE. Evolution of photosynthetic prokaryotes: a maximum-likelihood mapping approach. Philos Trans R Soc Lond B Biol Sci. 2003;358:223–230. doi: 10.1098/rstb.2002.1181.
  27. Blair JE, Ikeo K, Gojobori T, Hedges SB. The evolutionary position of nematodes. BMC Evol Biol. 2002;2:7. doi: 10.1186/1471-2148-2-7.
  28. Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature. 1997;387:489–493. doi: 10.1038/387489a0.
  29. Mallatt J, Winchell CJ. Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol Biol Evol. 2002;19:289–301. doi: 10.1093/oxfordjournals.molbev.a004082.
  30. Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6a3. Department of Genome Sciences, University of Washington, Seattle. 2002.


Articles from BMC Bioinformatics are provided here courtesy of BMC
