Abstract
Microarrays measure the expression of large numbers of genes simultaneously and can be used to delve into interaction networks involving many genes at a time. However, it is often difficult to decide to what extent knowledge about the expression of genes gleaned in one model organism can be transferred to other species. This can be examined either by measuring the expression of genes of interest under comparable experimental conditions in other species, or by gathering the necessary data from comparable microarray experiments. However, it is essential to know which genes to compare between the organisms. To facilitate comparison of expression data across different species, we have implemented a Web-based software tool that provides information about sequence orthologs across a range of Affymetrix microarray chips. AffyTrees provides a quick and easy way of assigning which probe sets on different Affymetrix chips measure the expression of orthologous genes. Even in cases where gene or genome duplications have complicated the assignment, groups of comparable probe sets can be identified. The phylogenetic trees provide a resource that can be used to improve sequence annotation and detect biases in the sequence complement of Affymetrix chips. Being able to identify sequence orthologs and recognize biases in the sequence complement of chips is necessary for reliable cross-species microarray comparison. As the amount of work required to generate a single phylogeny in a nonautomated manner is considerable, AffyTrees can greatly reduce the workload for scientists interested in large-scale cross-species comparisons.
Microarray experiments have made it possible to rapidly quantify the expression of large numbers of genes for a given experimental condition. The rapidity and ease of use of this technology have enabled research into complex aspects of growth and development involving multiple genes at a time. However, it remains difficult to extend findings from one organism to another, as it is often not known which of the spots on different microarray chips measure the expression of comparable (i.e. orthologous) genes.
The basic idea of using model organisms is that the knowledge gained from studying such an organism will, to a large extent, be transferable to other species. Taking the regulatory feedback loop controlling branching in Arabidopsis (Arabidopsis thaliana) as an example, validating analyses needed to be performed in a range of other species to determine to what extent this mechanism was conserved and how far the knowledge gained in Arabidopsis could be applied to other plants (Johnson et al., 2006).
Approaches to validate such regulatory networks range from crudely determining whether the necessary genes might be present in another genome and then assuming the complete network of gene interaction to be conserved, to quantifying the expression of the corresponding genes under comparable experimental conditions and verifying that the genes actually do behave in a similar manner. The former is a crude but quick, cheap, and easy approach, while the latter is more refined, but work intensive, expensive, and complicated. Data-mining available microarray data may provide an intermediate solution to the problem. Microarray data repositories such as the Gene Expression Omnibus (Edgar et al., 2002) provide a wealth of information about how an organism responds to a wide variety of experimental conditions and may provide information about the expression of a gene of interest in a species of interest under an experimental condition of interest.
Regardless of the approach used, it is necessary to know which genes can be compared between organisms. In many cases, available gene annotation or best BLAST (Altschul et al., 1997) hits are used. However, gene annotation is not always correct or up to date, and best BLAST hits do not always correspond to the closest phylogenetic relative (Koski and Golding, 2001). The orthology of genes, i.e. gene copies that arose due to a speciation event, is the quintessential feature to look for when attempting to compare genes or gene products. The underlying assumption is that a gene in an emergent species will continue to perform the same function it had in the ancestral species. Genes that arose via duplication (i.e. paralogous genes) are a different matter. Because two copies of the gene are present in the genome of the organism, changes in one of the duplicates are less likely to cause a noticeable reduction in fitness and are therefore more likely to be passed on to the next generation. The paralogous genes we observe today were therefore less restrained in their ability to change, be lost, be inactivated, or evolve toward a new function. Alternatively, both of the duplicates may have changed only slightly, each continuing to perform a subset of the original gene's tasks, or both may have remained fully functional, accumulating only minor changes in the regulation of their expression to counteract potential dosage effects. This freedom of paralogs to change is the main reason why comparison of paralogous genes is unlikely to be beneficial or intended, and cross-species comparisons should be confined to orthologous or co-orthologous genes.
A number of tools and databases exist that attempt to determine which genes are orthologous and therefore comparable across organisms (e.g. COG [Tatusov et al., 1997], Orthomcl [Li et al., 2003], KOG [Tatusov et al., 2003], Genome Clusters Database [Horan et al., 2005], Inparanoid [O'Brien et al., 2005], Multiparanoid [Alexeyenko et al., 2006], and Orthologid [Chiu et al., 2006]). Unfortunately, some of these provide orthology assignments for only a very restricted set of species, while others require completed genomes on which to base their predictions. Both these points make these databases next to useless for researchers wanting to compare sequences from organisms for which completed genomes are not yet available and that were not part of the select set of species that were included in the databases. For such organisms, researchers generally have to rely on sequence similarity searches to determine potential sequence orthologs in better-described species. In addition, the majority of the methods do not base their orthology predictions on phylogenetic trees but on other clustering methods and only use phylogenies to visualize the results. Finally, none of the methods provide an easy lookup of which Affymetrix sequences are comparable across chips, making an additional mapping of Affymetrix exemplar sequences to predicted sequence orthologs necessary.
Our Web-based software tool provides a quick and easy way of assessing the orthology of protein-coding genes for a variety of plant microarray chips, irrespective of whether the genome of the organism is completed or not. We focused on Affymetrix chips, as the overwhelming majority of microarray data present in public repositories is based on these (Gene Expression Omnibus; Edgar et al., 2002). These chips generally provide a reasonable coverage of the transcriptome of an organism, and the corresponding sequence data are readily available. As many chips are designed and sold before the corresponding organism is completely sequenced, there may be cases where sequences spotted on a chip are thought no longer to be present in the genome, or some genes in the genome may be represented multiple times or missing on the chip. In contrast to other methods, we do not use open reading frames (ORFs) predicted from genomic data but the sequences from which the probe sets for a given chip were derived, hereafter referred to as either exemplar or consensus sequences. We thereby avoid problems arising from inaccurate ORF prediction, genome sequences being revised and changed, as well as errors in assigning the various probe sets to predicted genomic ORFs. For each of the consensus sequences, we provide the results of sequence similarity searches against a number of sequence databases, a Profile-Hidden Markov model (HMM) representative of the sequence family, as well as a multiple sequence alignment and phylogenetic tree for that family. An additional utility permits determining sequence orthologs in a species of choice for the sequences present on an Affymetrix chip. A Web interface is provided to PHAT, part of the PhyloGenie package (Frickey and Lupas, 2004), which allows the repository of phylogenetic trees to be mined for trees corresponding to specific topological or species constraints.
CONSTRUCTION AND CONTENT
The National Center for Biotechnology Information (NCBI) nonredundant protein database “nr” and 6-frame translations of the plant microarray chip consensus sequences provided by Affymetrix provide the set of sequences on which we base our predictions. The 6-frame translations of the consensus sequences provide information as to what proteins are represented on the various microarray chips. The “nr” database contains a wide variety of species suitable as outgroups for the phylogenies and provides sequences that may have failed to be included on the microarray chips of the various organisms. The latter are of special importance, as they provide critical data when attempting to assess whether two sequences are orthologous or paralogous (Fig. 1).
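As an illustration of the 6-frame translation step, the following is a minimal Python sketch and not the code used to build AffyTrees. It assumes the standard genetic code, translates unrecognized codons (for example those containing N) as X, and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3 (hypothetical labels)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Calling `six_frame_translations` on a consensus sequence yields six protein strings, each of which could then be used as a database entry or search query.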
Figure 1.
An ancestral gene undergoes a duplication and gives rise to two paralogous genes, A and B. Some time later, a speciation event gives rise to two species (light and dark). Each of these has retained both paralogs in its genome (A and B in the dark species, A′ and B′ in the light species), but only gene A for the dark species and gene B′ for the light species are included on the chips. Simple pairwise comparison of the chip sequences alone would predict A (dark) and B′ (light) to be sequence orthologs, as these would appear to be reciprocal closest relatives. Including additional sequence data, such as sequences of outgroup species or the sequences A′ (light) and B (dark) that are missing from the chips but present in the genomes of the two species, can help clarify relationships and allow unambiguous assignment of sequence orthologs.
PhyloGenie is used to automatically search for sequence homologs and infer phylogenetic trees for all consensus sequences on a chip. This tool was originally developed to generate and analyze phylomes with regard to gene duplications and lateral gene transfers and can be briefly described as follows. Each microarray consensus sequence is compared against the above-mentioned databases using BLAST. The result of these sequence similarity searches is used to identify potential sequence homologs. BLAST high-scoring segment pairs (HSPs) with greater than 70% coverage of the query and E values better than 1e-5 are extracted and aligned to one another. These parameters were chosen to be lax enough to detect nontrivial sequence similarities yet stringent enough to exclude high-scoring local similarities that would, by themselves, not warrant the assignment of two sequences as being orthologous. The resulting alignment contains the sequence regions we regard as homologous to the query. Hmmer (http://hmmer.janelia.org/) is used to derive an HMM from this alignment and search the full-length sequences of all BLAST-HSPs with E values better than 1. Deriving an HMM from the above alignment gives a better representation of the sequence family. Using this HMM to search against full-length sequences of even marginal BLAST hits allows detection of more of the distant sequence homologs and better defines the start and end of homologous sequence regions than a single BLAST search could. Sequence regions matching the full-length HMM with E values better than 1e-5 are combined into a multiple sequence alignment. A phylogenetic tree with 100 bootstrap replicates is inferred from this alignment. Due to limited computational resources, we use neighbor-joining (Saitou and Nei, 1987) to infer phylogenies. All intermediary files are made available so that the process can be followed from beginning to end, and alternative approaches, for example, a different method of tree inference, could be used.
The trees are rooted at the phylogenetic node closest to the “last universal common ancestor,” as described in the PhyloGenie manuscript (Frickey and Lupas, 2004).
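The two BLAST thresholds described above can be sketched as a simple filter. This is an illustrative Python sketch under stated assumptions, not PhyloGenie's implementation: the `Hsp` record, its field names, and the way coverage is computed from query coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical minimal record for one BLAST high-scoring segment pair."""
    subject_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    evalue: float


def query_coverage(hsp, query_length):
    """Fraction of the query covered by this HSP (assumed definition)."""
    return (hsp.query_end - hsp.query_start + 1) / query_length


def select_hsps(hsps, query_length, min_coverage=0.70,
                align_evalue=1e-5, search_evalue=1.0):
    """Split HSPs into the set used to seed the alignment and the
    broader set whose full-length sequences are searched with the HMM."""
    to_align = [
        h for h in hsps
        if h.evalue < align_evalue
        and query_coverage(h, query_length) > min_coverage
    ]
    to_search = [h for h in hsps if h.evalue < search_evalue]
    return to_align, to_search
```

The stricter set seeds the profile, while the looser set only nominates sequences for the subsequent HMM search, which applies its own E-value cutoff.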
The set of trees generated by PhyloGenie provides the basis of our prediction of sequence orthologs. The actual prediction requires a number of user-specified parameters and is performed on-the-fly, allowing for a high degree of flexibility. Detection of sequence orthologs is based on the number of nodes separating the query sequence, i.e. the sequence for which a tree was derived, from sequences of any given species in the tree. In the following examples, we assume that the user selected the Arabidopsis ATH1-121501 chip and was attempting to find sequence orthologs in Medicago truncatula.
Determining sequence orthologs is done in the following manner (Fig. 2). The number of nodes separating each M. truncatula sequence (yellow) from the query (purple) is determined (minimum no. 4, sd 2.87). An additional scaling factor (default, 0.5) allows users to specify the range in which they are willing to accept M. truncatula sequences as potential sequence orthologs. Increasing this value causes the program to take into account more distant sequence relatives as potential orthologs, while decreasing this value causes the program to focus on the most closely related sequences only. In the presented analysis, we used a value of 0.5, as this allowed us to determine orthologs for most of the chip sequences while not causing too many of the query sequences to be assigned multiple orthologs in the other species. The distance within which sequences are accepted as potential sequence orthologs is referred to as the permissive range in this manuscript. The permissive range is calculated as the minimal number of nodes separating the query sequence from a M. truncatula homolog in the tree plus the sd multiplied by the scaling factor. The sd reflects the dispersal pattern of M. truncatula sequences throughout the tree. The more clades in a tree containing M. truncatula sequences, the greater the uncertainty about which of these clades contains sequences orthologous to the query. We therefore use the sd of the number of nodes separating M. truncatula sequences from the query as a measure for how uncertain we are that the sequences closest to each other, in number of nodes, really are the sequence orthologs. For the tree shown in Figure 2, the permissive range is highlighted in green and encompasses all sequences less than six nodes removed from the query. Affymetrix Arabidopsis ATH1-121501 sequences less than six nodes removed from the query are regarded as sequence paralogs of the query (here, 260439_at). M. truncatula sequences within the permissive range are regarded as potential sequence orthologs (Mtr.28509.1.S1_at, Mtr.17370.1.S1_at, and Mtr.21922.1.S1_at).
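The permissive range described above can be written out as a short Python sketch. With the values quoted for Figure 2 (minimum 4, sd 2.87, scaling factor 0.5), it gives 4 + 0.5 × 2.87 = 5.435, which matches the statement that sequences less than six nodes from the query are accepted. The function names and the example distances are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def permissive_range(min_nodes, sd_nodes, scaling=0.5):
    """Minimum node distance plus the scaled standard deviation."""
    return min_nodes + scaling * sd_nodes


def candidates_in_range(node_distances, scaling=0.5):
    """Return the names whose node distance from the query lies within
    the permissive range. `node_distances` maps a target-species
    sequence name to its node count from the query.
    Assumption: population standard deviation (statistics.pstdev)."""
    values = list(node_distances.values())
    limit = permissive_range(min(values), statistics.pstdev(values), scaling)
    return sorted(name for name, d in node_distances.items() if d <= limit)
```

Raising `scaling` widens the limit and admits more distant relatives, while lowering it toward zero keeps only the sequences at the minimum node distance.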
Figure 2.
Determining sequence orthologs based on the number of nodes separating them from the query. This example provides a case where multiple clades containing both M. truncatula and closely related Arabidopsis homologs are present. Sequences from the Arabidopsis microarray chip ATH1-121501 are highlighted in blue, the query sequence for which this tree was computed is highlighted in magenta, and sequences from the M. truncatula microarray chip are highlighted in yellow. The permissive range for the query is shown with a colored background (green), and red and blue lines, above and below the tree, respectively, show the permissive range for the reverse lookup for two of the three potential sequence orthologs. Circles show which of the Arabidopsis ATH1-121501 sequences were recovered in the respective reverse lookups. [See online article for color version of this figure.]
For each of the potential orthologs, we subsequently perform a reverse lookup. We calculate the minimum and sd of the number of nodes separating each potential ortholog from the Affymetrix Arabidopsis ATH1-121501 sequences present in the tree. As the minimum and sd are greatly influenced by the position in the tree of the sequence for which the values are being calculated, the permissive ranges of the potential orthologs may be quite different from one another. A red and a blue line show the permissive ranges for two of our three potential orthologs. The query sequence does not lie within the permissive range of Mtr.21922.1.S1_at (blue line). This sequence is therefore removed from the set of potential orthologs, as it appears much more closely related to the Affymetrix Arabidopsis sequence “257728_at” than to the query. Mtr.28509.1.S1_at (red line) and Mtr.17370.1.S1_at (not shown) recover the query sequence in their permissive ranges, and both are retained as sequence orthologs to the query. Analysis of this tree therefore tells us that our query sequence “245641_at” has a sequence paralog (260439_at) on the Affymetrix Arabidopsis ATH1-121501 chip and two sequence orthologs (or co-orthologs) on the Affymetrix M. truncatula chip.
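The reverse lookup can be sketched in Python as a second application of the permissive-range test, this time from each candidate back toward the source chip. This is a hedged illustration rather than the AffyTrees code: the function names, the dictionary layout, and the node counts used in any example are hypothetical, and the population standard deviation is again an assumption.

```python
import statistics


def within_range(distances, scaling=0.5):
    """Names whose node distance lies inside the permissive range
    (minimum + scaling * population sd, an assumed estimator)."""
    values = list(distances.values())
    limit = min(values) + scaling * statistics.pstdev(values)
    return {name for name, d in distances.items() if d <= limit}


def confirmed_orthologs(query, query_to_target, target_to_source, scaling=0.5):
    """Keep a candidate only if the query falls inside the candidate's
    own permissive range.

    query_to_target:  {target_seq: nodes from query}
    target_to_source: {target_seq: {source_chip_seq: nodes from target_seq}}
    """
    confirmed = []
    for candidate in sorted(within_range(query_to_target, scaling)):
        if query in within_range(target_to_source[candidate], scaling):
            confirmed.append(candidate)
    return confirmed
```

A candidate that sits much closer to a different sequence on the source chip than to the query fails the second test and is dropped, which mirrors the removal of Mtr.21922.1.S1_at in the example above.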
The aim of this tool is two-fold: it offers a fully automated way of retrieving sequence orthologs for microarray consensus sequences from a wide variety of species and provides the results of a BLAST search, multiple sequence alignment, and phylogenetic inference for every consensus sequence on a chip. This allows manual validation of any dubious orthology predictions by comparing the various intermediate results leading to the phylogeny against the corresponding phylogenetic trees and alignments. In addition, the large number of alignments generated in the process of constructing the phylogenies are a useful resource on which to base further analyses, as they provide sets of aligned sequence homologs for every consensus sequence on a chip.
UTILITY
The user interface has five Web pages. The home page allows querying of individual genes and links to the remaining pages, some help, and supplemental data. The other four pages of the interface deal with batch requests, analysis of chip phylomes, generation of phylogenies for sequences provided by the user, and prediction of sequence orthologs between the consensus sequences represented on a chip and other species.
The results of an individual query are shown in Figure 3. Tabs at the top of the page allow navigation between the results of a BLAST search (BLAST), alignment of HSPs (CLN), the derived HMM (HMM), results of the HMM search (HMS), alignment of high-scoring HMM hits (HLN), and either a textual or applet-based representation of a Neighbor-Joining tree (TRE). The tabs allow the user to retrace every step leading from query sequence to phylogeny and are very useful to gain a better understanding of why two genes were regarded as homologous, included in the same tree, or predicted to be sequence orthologs. To facilitate interpretation of batch requests and complete phylome analyses, intermediate pages can be generated that gather the results, order them, and link to the results pages of the various genes. Prediction of sequence orthologs between microarray chip consensus sequences and a species of choice generates a tab-delimited list containing information about which sequences on the chip could be assigned sequence orthologs in another species, which sequences should be regarded as co-orthologous or paralogous, and which other homologous sequences were present in the phylogenies but could not be assigned a more precise relationship.
Figure 3.
Screenshot of results using the Arabidopsis ATH1-121501 chip consensus sequence 261590_at as a query. Part of the corresponding phylogenetic tree is displayed. Red (dark) dots highlight M. truncatula sequences, yellow (light) dots highlight ATH1-121501 sequences, and a blue dot (bottom) highlights the query sequence. The tabs at the top of the page allow navigation between BLAST results (BLAST), the alignment of HSPs (CLN), the derived HMM (HMM), the HMM search results (HMS), the alignment from which the phylogeny is inferred (HLN), and either a text or graphical representation of the phylogenetic tree (TRE). [See online article for color version of this figure.]
Supplemental data, providing further information about the programs used, the individual steps performed to generate the data, as well as the parameters the user can tweak, are available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php. Results of phylome analyses, custom phylogenetic trees, and orthology predictions are stored for a week and can be accessed by referring to the job identifier provided in the results.
This tool differs from other databases and programs in a number of ways. It provides the data on which tree inference and orthology prediction is based and thereby allows the user to retrace each step of the decision process. Our trees include sequences from the “nr” database that greatly facilitate correct rooting and interpretation. In addition, this allows us to potentially detect sequence orthologs for any species represented in “nr” instead of being limited to those species for which complete genomes or proteomes are available. The use of a user-defined “scaling factor” avoids problems co-orthologous genes cause for approaches relying solely on reciprocal best hits between genomes. If, for example, a species has a gene of interest, gene A, that was duplicated in another species, giving rise to genes B and B′, reciprocal best hit approaches may identify genes A and B or A and B′ as reciprocal best hits and assign them as sequence orthologs. However, if A appears most similar to B but B′ appears most similar to A, a possible scenario if nonsymmetric scoring schemes such as employed by BLAST are used, then no reciprocal best hits can be determined and no sequence orthologs are assigned. All of the above cases produce an incorrect assignment of gene orthology, as B and B′ are co-orthologous to A (i.e. duplicates derived from a gene that was orthologous to A) and should be treated as such.
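The first failure mode of reciprocal best hits described above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical gene names and hand-written best-hit tables; it does not run BLAST and only shows that a one-to-one reciprocal criterion returns a single pair even when two co-orthologs exist.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where each gene is the other's best hit.

    best_a_to_b: {gene in species 1: its best hit in species 2}
    best_b_to_a: {gene in species 2: its best hit in species 1}
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Gene A in species 1; duplicates B and B' in species 2 (hypothetical).
best_1_to_2 = {"A": "B"}
best_2_to_1 = {"B": "A", "B'": "A"}
pairs = reciprocal_best_hits(best_1_to_2, best_2_to_1)
```

Here `pairs` contains only the pair of A and B, so B′ is silently dropped even though it is co-orthologous to A, whereas a range-based criterion can retain both duplicates.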
Another part of this tool allows the user to search through the trees of a given species or chip for those corresponding to specific topological selection criteria. For example, to find all trees in which a clade contains at least one M. truncatula and Arabidopsis sequence, but no sequences from the Arabidopsis ATH1-121501 chip, the selection string “((Medicago truncatula & Arabidopsis) & !Arabidopsis ATH1-121501)” could be used. Trees containing such clades could identify sequences present in M. truncatula, the orthologs of which cannot be measured using the Affymetrix Arabidopsis ATH1-121501 chip, as no sequence orthologs are present on that chip. As an example of such a case (Fig. 4), we show a tree derived for a hypothetical protein from M. truncatula, the ortholog of which was not included on the ATH1-121501 chip, even though orthologous sequences are present in the Arabidopsis genome as well as throughout the plant, fungal, and animal kingdoms.
Figure 4.
Phylogenetic tree of a protein-coding gene present in a wide variety of eukaryotes that is not represented on either of the Affymetrix Arabidopsis chips. This is recognizable by the sequence identifiers. The Arabidopsis sequences (yellow, light dot) have NCBI gi-numbers instead of Affymetrix identifiers, signifying that these sequences were taken from the “nr” database and not one of the 6-frame translations of the microarray chip consensus sequences. The bottom-most sequence is the M. truncatula query sequence for which this tree was generated. Other M. truncatula sequences are highlighted with a red (dark) dot. [See online article for color version of this figure.]
Future developments include, as a first step, extending this tool beyond the currently available seven chips to include all publicly available Affymetrix plant microarray chips. Because this system is not limited as to what species can be analyzed, provided some sequence information for the species is available, it is conceivable that the system may be extended to cover all available Affymetrix microarray chips. Beyond that, the aim will be to develop and implement methods that further facilitate comparative analysis of microarray expression data across species.
RESULTS AND DISCUSSION
To determine whether the AffyTrees orthology predictions were comparable to, less, or more accurate than reciprocal best BLAST hits, the most widely used method to identify sequence orthologs, we compared the orthology predictions generated by both methods. Phylogenetically orthologous sequences are generally expected to fulfill the same function in different species, and functionally orthologous sequences are expected to be similarly expressed across different species. Therefore, phylogenetic orthologs can be expected to show a certain degree of similarity in their expression across species. We based our comparison on prediction of sequence orthologs between the Arabidopsis ATH1-121501 and M. truncatula Affymetrix chips. These species were chosen specifically, because sets of comparable microarray experiments were available and provided us with the opportunity to test whether and how well sequence orthology, as predicted by reciprocal best BLAST hits and AffyTrees, was reflected in similarity of expression.
The results of comparing the orthology predictions for these two microarray chips are shown in Figure 5A. BLAST produced many more reciprocal best hits (7,025) than AffyTrees predicted orthologs (5,793). Of these, 2,926 predictions of sequence orthologs coincided, 4,099 orthology predictions were unique to the reciprocal best BLAST hits, and 2,867 orthology predictions were unique to AffyTrees. Even though BLAST produced nearly 30% more orthology predictions, fewer individual sequences were assigned an ortholog in BLAST than in AffyTrees. This was due to many of the BLAST hits having multiple ortholog assignments. On average, each M. truncatula chip sequence was assigned 1.78 Arabidopsis chip sequences as reciprocal best BLAST hits, and every Arabidopsis chip sequence was assigned 1.57 M. truncatula chip sequences. This artificially inflated the number of “orthology” predictions provided by BLAST. Dividing the number of reciprocal best BLAST hits by the average number of predictions per sequence for each species gives us the number of individual genes for each species that could be assigned at least one ortholog in the other species: the exclusively BLAST-based predictions assigned 2,303 sequences from Medicago one or more orthologs in Arabidopsis, and 2,611 sequences in Arabidopsis could be assigned one or more orthologs in Medicago. The exclusively AffyTrees-based predictions assigned 2,515 Medicago sequences orthologs in Arabidopsis and 2,537 Arabidopsis sequences orthologs in Medicago, 138 more sequences than assigned by reciprocal best BLAST hits.
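The gene counts in this paragraph can be checked with a short Python calculation. The reported figures are reproduced when the 4,099 BLAST-only predictions are used as the numerator; the variable and function names are hypothetical, and rounding to the nearest integer is an assumption.

```python
def genes_with_orthologs(pair_count, mean_assignments_per_gene):
    """Estimate the number of distinct genes behind a set of pairwise
    predictions, given the mean number of assignments per gene."""
    return round(pair_count / mean_assignments_per_gene)


blast_only_pairs = 4099
blast_medicago = genes_with_orthologs(blast_only_pairs, 1.78)     # Medicago genes
blast_arabidopsis = genes_with_orthologs(blast_only_pairs, 1.57)  # Arabidopsis genes
blast_total = blast_medicago + blast_arabidopsis

affytrees_total = 2515 + 2537  # values quoted in the text
difference = affytrees_total - blast_total
```

The two totals also agree with the 4,914 and 5,052 genes quoted later for the expression comparison.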
Figure 5.
A, Overlap of reciprocal best BLAST hits (yellow) with AffyTrees orthology predictions (blue). B, Histogram and fitted EVD of ortholog pairs predicted by both BLAST and AffyTrees over the correlation coefficient of their expression values across the microarray experiments. For comparison purposes, the fitted EVD curve (green) for this data is represented in 5C and 5D as well. A vertical dotted line is placed at the peak of the EVD, and the correlation coefficient at which the peak is found is stated in black numbers at the bottom. The median value of each dataset is marked in the top left corner. C, Histogram and fitted EVD of the genes assigned orthologs in either BLAST (yellow) or AffyTrees (blue) over the average correlation coefficient of the assigned orthologs. D, Histogram and fitted EVD over the average correlation coefficient for genes assigned orthologs randomly (black) or by indiscriminately using any sequences present in the AffyTrees phylogenies as orthologs (magenta).
To determine which of the methods provided a more accurate orthology prediction, we compared the expression of predicted sequence orthologs in two sets of microarray experiments, one for Arabidopsis (Schmid et al., 2005) and one for M. truncatula (V. Benedito, I. Torres-Jerez, J. Murray, A. Andriankaja, S. Allen, K. Kakar, M. Wandrey, J. Verdier, H. Zuber, T. Ott, S. Moreau, A. Niebel, T. Frickey, G. Weiller, J. He, X. Dai, P. Zhao, Y. Tang, and M. Udvardi, unpublished data; Medicago Gene Atlas, ArrayExpress accession E-MEXP-1097). The expression of genes was compared across seven tissue types: stems, petioles, leaves, vegetative buds, flowers, roots, and seeds. Different laboratories generated the data, and differences in harvesting, preparation, experimental procedure, growth conditions, and of course the plants themselves undoubtedly will have affected the experiments and provide ample explanation for why some sequence orthologs might not be correlated in their expression in these two species. Therefore, we did not expect all sequence orthologs to show a strong positive correlation in their expression, but a general positive trend in correlation was certainly expected. However, our aim was not to show that sequence orthologs share similar expression patterns but to use the available expression data to assess the accuracy of the two prediction methods.
Accepting the 2,926 orthology assignments both BLAST and AffyTrees agreed upon as “true” orthologs, we used the Pearson (linear) correlation coefficient of the expression values to measure the coexpression of all predicted ortholog pairs. The histogram in Figure 5B shows the number of predicted ortholog pairs for a given correlation coefficient as well as a fitted scaled extreme value distribution (EVD; Fig. 5B). Most of the predicted ortholog pairs produced positive correlation coefficients, supporting our expectation that sequence orthologs, in general, should show similar expression across different organisms. In addition, the graph provides us with a means of testing the accuracy of reciprocal best BLAST hits and AffyTrees orthology predictions as seen in Figure 5C. Rather than comparing histograms directly, we approximated the histograms by a distribution with a small number of parameters to facilitate comparison of multiple datasets. The EVD approximates the various histograms depicted in Figure 5 quite well. The more accurate the set of orthologs predicted by each method, the better the corresponding fitted EVD should approximate the EVD derived from our set of 2,926 true orthologs.
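For reference, the Pearson correlation used here can be computed for one ortholog pair as in the following Python sketch. It assumes each gene is represented by one expression value per tissue, in the same tissue order for both species; the function name and any example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length
    expression profiles, e.g. one value per tissue type."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)
```

With seven tissue types, each coefficient rests on only seven paired observations, which is one reason to compare whole distributions of coefficients rather than individual pairs.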
We then compared the sets of genes for which sequence orthologs could only be predicted by either BLAST or AffyTrees. Whenever one gene was assigned multiple sequence orthologs, we averaged their correlation coefficients to reflect that the method generating the prediction could not decide in more detail which of the predicted orthologs should be used. A total of 4,914 genes were assigned sequence orthologs only in reciprocal BLAST hits and 5,052 genes were assigned sequence orthologs only in AffyTrees. The graphs of the histograms and fitted EVDs for these sets of genes are shown in Figure 5C. Both BLAST and AffyTrees were able to predict orthologs for similar numbers of genes; however, the maximum of the BLAST-EVD lies at 0.47, while the maximum of the AffyTrees-EVD lies at 0.66. The EVD based on the AffyTrees predictions also better approximates the EVD based on the set of true orthologs. Taking the median of the correlation coefficients as the comparison metric leads to similar results (Fig. 5, B–D). Bootstrap sampling of the BLAST and AffyTrees distributions (10,000 samples, 1,000 replicates) showed the median values of the distributions to be very resilient to change. The probability of generating a randomly sampled distribution with the median value observed in the other method was, in both cases, very low (BLAST, 2.1e-36; AffyTrees, 6.2e-26). Both the median values of the distributions as well as the maximum of the fitted EVDs show that the histogram of the AffyTrees predictions (blue) is more similar to the histogram of the true orthologs (green) than the histogram of the best BLAST-based predictions (yellow) is to the true orthologs. This points to the AffyTrees predictions being more reliable than the predictions based on best BLAST hits.
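The per-gene averaging and the bootstrap of the median can be sketched in Python as follows. This is an assumption-laden illustration, not the procedure actually used: the text does not specify the resampling scheme in detail, so the sketch simply resamples with replacement and records the median of each replicate, and all function and parameter names are hypothetical.

```python
import random
import statistics


def mean_correlation(per_gene_correlations):
    """Average the coefficients of all orthologs assigned to each gene.
    Input: {gene: [correlation with each assigned ortholog]}."""
    return {
        gene: statistics.fmean(values)
        for gene, values in per_gene_correlations.items()
    }


def bootstrap_medians(values, n_replicates=1000, sample_size=None, seed=0):
    """Resample `values` with replacement and return one median per
    replicate. The fixed seed is only there to make runs repeatable."""
    rng = random.Random(seed)
    size = sample_size if sample_size is not None else len(values)
    return [
        statistics.median(rng.choices(values, k=size))
        for _ in range(n_replicates)
    ]
```

The spread of the returned medians indicates how stable the observed median is, and comparing it with the median of the other method gives a rough sense of how unlikely that value would be under resampling.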
However, it was recently shown that GCRMA (Wu et al., 2004) normalization can lead to overprediction of correlated genes (Lim et al., 2007). To see whether this was affecting our results, we repeated the above analysis using MAS5 (Hubbell et al., 2002) normalized data. The median values of the resulting distributions were 0.417 for our set of true orthologs, 0.339 for the AffyTrees orthologs, 0.275 for the BLAST predictions, 0.267 for AffyTrees homologs, and 0.018 for random sequence pairs. These values are similar to those calculated based on the GCRMA normalized data, indicating that, although GCRMA normalization does seem to increase the median value of the distributions, the increase is slight, and no qualitative difference in how the methods compare to one another is apparent.
In an attempt to determine why the BLAST-based prediction fared poorly, we examined how various modes of orthology assignment influence the fitted EVD. We show the histograms and fitted EVDs for two further datasets (Fig. 5D). The first set was generated by randomly pairing sequences from within our set of true orthologs (black) and the second by accepting all sequence homologs present in the AffyTrees phylogenies as sequence orthologs (pink). These phylogenies provide a large number of groupings of homologous sequences. We know that many of the trees contain paralogous sequences, and misassigning sequence paralogs as orthologs is one of the key difficulties in accurately detecting sequence orthologs. The graph shows that an EVD fitted to the random orthology assignments (black) has its maximum close to zero. Indiscriminately assigning all sequence homologs present in a tree as sequence orthologs generates many more orthology predictions, as can be seen from the increased amplitude of the EVD. However, the maximum of the fitted EVD is close to 0.5, well below the 0.68 maximum we determined for the EVD of the set of true orthologs (green). We therefore expect the maximum of EVDs fitted to various methods of orthology assignment, for this dataset, to lie between 0 and 0.7. The closer the maximum lies to 0.7 (or beyond), the better the prediction method is likely to be. Not differentiating between orthology and homology, thereby causing too many sequences to be assigned as sequence orthologs, shifts the maximum of the fitted EVD to around 0.5. BLAST-based predictions more frequently assigned multiple sequence orthologs to genes than the AffyTrees predictions. This might explain why the maximum of the BLAST-EVD lies at 0.47. The reciprocal best BLAST hit approach, while well suited to detecting sequence homologs, therefore does not appear very accurate when used to distinguish between sequence orthologs and other homologs. The AffyTrees method, in contrast, appears far better at reliably determining orthologous sequences.
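To make the notion of the "maximum of a fitted EVD" concrete, the sketch below fits a Gumbel (minimum) distribution to a list of correlation coefficients by the method of moments and evaluates a scaled density whose peak lies at the fitted location parameter. The source specifies neither the exact form of the EVD nor the fitting procedure, so the choice of the left-skewed Gumbel form, the moment-based fit, and all names here are assumptions made for illustration.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_min(values):
    """Method-of-moments fit of a Gumbel (minimum) distribution.

    For this distribution: variance = pi^2 * beta^2 / 6,
    mean = mu - gamma * beta, and the mode (maximum of the density) = mu.
    """
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("cannot fit a distribution to constant values")
    beta = spread * sqrt(6) / pi
    mu = mean(values) + EULER_GAMMA * beta
    return mu, beta


def scaled_gumbel_min(x, mu, beta, amplitude=1.0):
    """Gumbel (minimum) density at x, multiplied by `amplitude`."""
    z = (x - mu) / beta
    return amplitude / beta * exp(z - exp(z))


# Hypothetical correlation coefficients with a long tail toward low values.
coefficients = [0.20, 0.50, 0.60, 0.70, 0.80]
mu, beta = fit_gumbel_min(coefficients)

# To overlay the curve on a histogram of counts, the amplitude would be
# the number of values times the bin width (0.1 is assumed here).
amplitude = len(coefficients) * 0.1
peak_height = scaled_gumbel_min(mu, mu, beta, amplitude)
```

Under these assumptions `mu` plays the role of the EVD maximum compared across methods in the text, while `amplitude` grows with the number of predictions, which is why assigning many more orthologs raises the curve without necessarily shifting its peak.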
CONCLUSION
AffyTrees provides a repository of phylogenetic trees inferred from every consensus sequence represented on a variety of Affymetrix plant microarray chips. This repository can be used to gain insights into the relationship of sequence homologs, improve annotation data, or automatically generate a list of sequence orthologs between a species and the consensus sequences represented on a specific microarray chip. The inclusion of sequences from the “nr” database and our method of detecting sequence orthologs circumvent the problems reciprocal best hit approaches have when dealing with co-orthologous genes. For sequences represented on Affymetrix plant microarray chips, AffyTrees can identify sequence orthologs present on other Affymetrix plant microarray chips, as well as sequence orthologs present in the “nr” database.
The ability to filter chip phylomes for specific selection criteria allows discrepancies or systematic biases between the sequence complements of chips and the corresponding genomes to be detected. Affymetrix chips were designed to measure the transcription of genes and therefore are biased toward highly expressed and protein-coding genes. This is a known and useful bias of these chips. However, other biases, for example, systematic preference for long or short sequences, differences in the EST libraries on which the chips were based, or differences in the ability to successfully predict short genes in different species, will have affected which sequences were included on a chip and thereby influence the results.
We provide a means of comparing the sequence complement of microarray chips to the publicly available sequence data of the corresponding organism as well as to the microarrays of other species. Robust ways of assessing sequence orthologs and knowledge about systematic differences in the sequence complement of various chips are prerequisites to making cross-species analyses of microarray expression data feasible. Without knowledge of the sequence orthologs present on other microarray chips, there is no way of determining which probe sets are comparable across chips. Similarly, without a way of estimating sequence biases or genes missing on a chip, the conclusions drawn from the presence or absence of groups of genes derived from expression data are likely to be flawed.
We show, to the extent that the limitations of the available experimental data permitted, that the majority of genes predicted to be orthologous show a similar expression across the two examined species. We also show that AffyTrees is able to assign sequence orthologs to more genes than a comparable approach relying on reciprocal best BLAST hits and, by comparing the expression of predicted sequence orthologs, that the AffyTrees orthologs appear more reliable than the BLAST-based predictions.
AffyTrees provides prediction of sequence orthologs for a wide variety of species at greater accuracy than reciprocal best BLAST hits. Combined with the available phylogenetic trees, sequence alignments, and additional utilities, AffyTrees should provide a useful resource for comparative analyses of transcriptomes and proteomes.
MATERIALS AND METHODS
The sequences we based our sequence-similarity searches on originated from either the “nr” database, downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), or from 6-frame translations of exemplar sequences for a variety of Affymetrix chips. The nucleotide exemplar sequences were downloaded, after registration, from the Affymetrix Web site by following the links to the various species (http://www.affymetrix.com/support/technical/byproduct.affx?cat=exparrays). BLAST searches were performed against the NCBI nonredundant protein database “nr” and 6-frame translation of consensus sequences for the Affymetrix microarray chips ATH1-121501, AtGenome1, Barley1, Citrus, Cotton, Grape, Maize, Medicago, Poplar, Rice, Soybean, Sugar Cane, Tomato, and Wheat. The BLAST results for sequences represented on the Arabidopsis (Arabidopsis thaliana) ATH1-121501 and Medicago truncatula chips were retrieved via the AffyTrees Web interface. Putative sequence orthologs between M. truncatula and Arabidopsis sequences were predicted as described above (scaling factor = 0.5) based on the phylogenies provided by AffyTrees. To keep the results as comparable as possible, the same cutoffs used to generate the phylogenies (i.e. >70% coverage of the query and E values better than 1e-5) were used as a lower limit for analysis of the reciprocal best BLAST hits. BLAST hits that did not satisfy these cutoffs were not taken into account. In cases where multiple BLAST hits had identical best E values, all of these best hits were taken into account. This made it possible for some genes to be assigned multiple reciprocal best BLAST hits. The method of orthology prediction we describe allows genes in one species to be assigned multiple orthologs in another. In such cases, all of the predicted sequence orthologs were taken into account. A noticeable discrepancy was apparent in the number of predicted sequence orthologs compared to the number of reciprocal best BLAST hits. 
To keep both approaches of detecting sequence orthologs as comparable as possible, we compared reciprocal AffyTrees orthologs to the reciprocal best BLAST hits. This allowed both methods to use “reciprocality” as a further criterion to reduce the number of false positive orthology predictions.
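The reciprocal best BLAST hit procedure described above, including the coverage and E-value cutoffs and the handling of tied best hits, can be sketched roughly as follows. The tuple layout of a hit, the function names, and the sequence identifiers are hypothetical; only the cutoff values (>70% query coverage, E value better than 1e-5) come from the text.

```python
def best_hits(hits, min_coverage=0.70, max_evalue=1e-5):
    """Map each query to the set of subjects sharing its best E value.

    `hits` is an iterable of (query, subject, evalue, query_coverage)
    tuples, with coverage given as a fraction between 0 and 1. Hits that
    fail either cutoff are ignored.
    """
    best = {}
    for query, subject, evalue, coverage in hits:
        if coverage <= min_coverage or evalue >= max_evalue:
            continue
        current = best.get(query)
        if current is None or evalue < current[0]:
            best[query] = (evalue, {subject})
        elif evalue == current[0]:
            current[1].add(subject)  # tie: keep every best hit
    return {query: subjects for query, (_, subjects) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **cutoffs):
    """Pairs (a, b) in which each sequence is among the other's best hits."""
    best_ab = best_hits(a_vs_b, **cutoffs)
    best_ba = best_hits(b_vs_a, **cutoffs)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, ())
    }


# Hypothetical hits: A1 has two tied best hits; A2 and A3 fail a cutoff.
a_vs_b = [
    ("A1", "B1", 1e-50, 0.90),
    ("A1", "B2", 1e-50, 0.80),
    ("A2", "B3", 1e-3, 0.90),
    ("A3", "B3", 1e-20, 0.50),
]
b_vs_a = [
    ("B1", "A1", 1e-48, 0.95),
    ("B2", "A1", 1e-40, 0.90),
    ("B3", "A2", 1e-30, 0.90),
]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Because tied best hits are all retained, a single gene such as `A1` can end up with more than one reciprocal best hit, which is the situation in which the correlation coefficients of its predicted orthologs were averaged.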
For each plant species, the Affymetrix CEL files of the experiments we wanted to compare were normalized using both GCRMA (Wu et al., 2004) and MAS5 (Hubbell et al., 2002) for comparison. All experimental files for a species were normalized at the same time, as normalizing each set of experiments individually would have artificially increased the differences observed between the experimental conditions. Linear correlation coefficients were calculated using the average expression value of each gene over the three available experimental replicates.
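The replicate averaging that precedes the correlation calculation can be illustrated with a short sketch. It assumes that each replicate is available as a list of normalized expression values, one per experimental condition and in the same order; the function name and the values are hypothetical.

```python
from statistics import mean


def average_replicates(replicates):
    """Average a gene's expression over replicates, condition by condition.

    `replicates` is a list of profiles, one per replicate, each holding
    one normalized expression value per experimental condition.
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    n_conditions = len(replicates[0])
    if any(len(profile) != n_conditions for profile in replicates):
        raise ValueError("all replicates must cover the same conditions")
    return [mean(values) for values in zip(*replicates)]


# Hypothetical normalized values for one gene: three replicates, four conditions.
replicate_values = [
    [7.9, 8.6, 5.1, 6.0],
    [8.1, 8.4, 5.3, 6.2],
    [8.0, 8.5, 4.9, 6.1],
]
profile = average_replicates(replicate_values)
```

The resulting per-condition profile is what would be correlated against the corresponding profile of a predicted ortholog in the other species.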
Availability and Requirements
The tool is freely accessible at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/. Further information and help are available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php. JavaScript should be enabled in the browser, and a Java 1.5 or later browser plugin should be installed for visualization of phylogenetic trees.
This work was supported by the Australian Research Council Centre of Excellence. Funding to pay for the publication charges was provided by the same grant.
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Georg Weiller (georg.weiller@anu.edu.au).
Some figures in this article are displayed in color online but in black and white in the print edition.
References
- Alexeyenko A, Tamas I, Liu G, Sonnhammer EL (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22: e9–e15
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
- Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R (2006) OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics 22: 699–707
- Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207–210
- Frickey T, Lupas AN (2004) PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res 32: 5231–5238
- Horan K, Lauricha J, Bailey-Serres J, Raikhel N, Girke T (2005) Genome cluster database. A sequence family analysis platform for Arabidopsis and rice. Plant Physiol 138: 47–54
- Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression analysis. Bioinformatics 18: 1585–1592
- Johnson X, Brcich T, Dun EA, Goussot M, Haurogne K, Beveridge CA, Rameau C (2006) Branching genes are conserved across species. Genes controlling a novel signal in pea are coregulated by other long-distance signals. Plant Physiol 142: 1014–1026
- Koski LB, Golding GB (2001) The closest BLAST hit is often not the nearest neighbor. J Mol Evol 52: 540–542
- Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189
- Lim WK, Wang K, Lefebvre C, Califano A (2007) Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23: 282–288
- O'Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33: D476–D480
- Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406–425
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet 37: 501–506
- Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278: 631–637
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41
- Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F (2004) A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Technical Report. Department of Biostatistics Working Papers. Johns Hopkins University, Baltimore, MD





