Abstract
With the recent increase in the number of mammalian genomes being sequenced, large-scale genome scans for human-specific positive selection are now possible. Selection can be inferred through phylogenetic analysis by comparing the rates of silent and replacement substitution between related species. Maximum-likelihood (ML) analysis of codon substitution models can be used to identify genes with an accelerated pattern of amino acid substitution on a particular lineage. However, the ML methods are computationally intensive and awkward to configure. We have created a database that contains the results of tests for positive selection along the human lineage in 13 721 genes with orthologs in the UCSC multispecies genome alignments. The Human PAML Browser is a resource through which researchers can search for a gene of interest or groups of genes by Gene Ontology category, and obtain coding sequence alignments for the gene and as well as results from tests of positive selection from the software package Phylogenetic Analysis by Maximum Likelihood. The Human PAML Browser is available at http://mendel.gene.cwru.edu/adamslab/pbrowser.py.
INTRODUCTION
How are humans so genetically similar to the great apes, yet so phenotypically divergent from the other members of this family? This question has intrigued researchers for decades (1,2). One approach to identifying genetic determinants of phenotypic divergence is to examine protein-coding genes for evidence of positive selection (3,4). This type of selection is characterized by a new allele that offers a fitness advantage to an organism, and is rapidly pulled through the population until fixation of the beneficial allele. Genes that have been positively selected along the human lineage, yet remain constrained or selectively neutral in our closest living relatives may offer insight into the biologically significant genetic changes that have occurred since the Pan–Homo split.
A variety of methods have been developed for the prediction of positive selection. Some take advantage of population-specific genetic patterns. One such method uses allele frequency differences between populations (5) to uncover loci that have been affected by a genetic hitchhiking event. Similarly, the extended haplotype heterozygosity method (6) measures linkage disequilibrium between two markers with the intent of uncovering a recent selective sweep and is particularly useful in uncovering population-specific selection. Another technique compares the rate of polymorphism within a species to the rate of divergence (fixed difference) between species (7). While these methods are extremely useful for ascertaining genes subject to recent positive selection, they are unable to fully uncover those very ancient and fundamental changes that occurred around the time of human-chimp divergence and are shared by all human populations. For this purpose, analysis of evolutionary rates in a phylogenetic context (3,8) can be used to identify genes that are evolving more rapidly on a particular lineage compared to the rest of the tree (9). The pattern of codon substitution across a phylogenetic tree can be inferred from a multiple sequence alignment using maximum-likelihood methods (10). When the rate of non-synonymous codon changes (dN) exceeds the rate of synonymous codon changes (dS), positive selection can be inferred. The codeml program from the Phylogenetic Analysis by Maximum Likelihood (PAML) package can be used to test different codon substitution models and perform a likelihood ratio test of positive selection along specified lineages based on the dN/dS ratio (11).
Anisimova et al. (12,13) have performed large-scale simulation studies to test the effect that parameters such as the number of species, branch lengths, sequence length and sequence divergence has on the tests of positive selection as implemented in PAML. They concluded that predictions of positive selection are unreliable when the sequences being compared are highly similar and when only a small number of species is used. To expand the power and accuracy of the predictions, it is suggested that the number of lineages used in the analysis is increased. Finally, they conclude that multiple models should be used in any analysis of selection in order to ensure robustness in the predictions and to protect against spurious results. In accordance with these findings, our study was designed to include the maximum number of mammalian sequences available to increase the power to detect selection in sequences as similar as human and chimpanzee. Multiple models have also been included in the database to help differentiate between positive selection and relaxation of selective constraint.
Essential to the prediction of selection by phylogenetic analysis is the availability of sequence data from a variety of species. Multispecies alignments for orthologous protein-coding genes are used to infer the ancestral sequence at each internal node within a phylogenetic tree that is necessary in the calculation of the rate and direction of codon substitution. In this way, differences in selective constraint can be predicted on one or many lineages within a phylogeny. Currently, the National Human Genome Research Institute (NHGRI) has approved 43 mammalian species as sequencing targets, many of which are currently underway (http://www.genome.gov/10002154). Five of these have been completed or are in genome refinement (14–20), 21 are slated for draft assembly and the remaining 17 are to be sequenced at low (∼2X) coverage. The Genome Browser group at UC Santa-Cruz has produced full-genome multisequence alignments using all of the available vertebrate genome assemblies (21). With the increasing number of genomes being sequenced, phylogenetic analyses can now be performed on a genome-wide scale.
We have collected alignments containing multiple mammalian species for 13 721 orthologous protein-coding genes and examined each for evidence of human-specific positive selection using phylogenetic analysis. The multispecies alignments and the results from the genome-wide selection scan are housed in a web-accessible database named the Human PAML Browser. Users can search by gene or gene family and obtain results from likelihood tests of positive selection on the gene of interest.
These types of analyses can be computationally intensive and time consuming, and the database offers an alternative for many researchers to investigate selection without the difficulty of performing the analysis themselves. The wide variety of species represented in the database also avoids the need to sequence multiple organisms for a comprehensive analysis. The data presented in the Human PAML Browser provides the opportunity for researchers to easily examine their gene(s) of interest for human-specific positive selection and may be a stepping stone for many future studies of selection and its effect on the human genome.
DATA SOURCES AND PROCESSING
Input data
The availability of pre-computed genome alignments for a diverse set of mammalian species represents an excellent starting point for phylogenetic analysis of the pattern of selection operating on individual genes. An assumption of phylogenetic analysis is that the aligned sequences are in fact orthologous. Several groups have constructed sets of orthologous genes (22–25), but the genome-based alignments have certain advantages. Orthology relationships in the multiple alignments are in the context of genomic segments, rather than inferred on the basis of protein alignments, and are thus expected to be reasonably robust. Studies comparing primary reads and finished sequence from a selection of mammals with the human genome suggests that >97% of alignable sequences match at orthologous locations (26). Furthermore, coding sequence information from incomplete (and incompletely annotated) genomes can be used that is not available in protein sequence databases. The disadvantage of using genome alignments is that they do not completely account for lineage-specific duplication and deletion, which can make inference of orthology difficult (27–29).
Alignments of 16 vertebrate species with the human reference sequence were downloaded from the UCSC Genome Browser [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/, (21)]. The genome assembly multiple alignments were constructed from draft genome assemblies from the following organisms: chimpanzee, rhesus macaque, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum, chicken, frog, zebrafish, tetraodon and fugu. Initial results using protein-coding sequence (CDS) coordinates resulted in an unacceptable number of frameshifts, presumably due to small alignment errors near exon/intron junctions. We therefore determined CDS regions by alignment of the longest representative RefSeq mRNA for each gene to the human genome. CDS alignments were then extracted representing all of the mammalian species from the genomic alignment files, representing most mammalian orders including the subclass Metatheria (Figure 1). For codeml analysis, alignments were extracted corresponding to each mammalian species with sequence data at a given position. Total 13 721 genes were analyzed; 12 905 of these had at least a portion of the sequence represented in at least eight species. At the time of the analysis, more than 5× coverage of the orangutan genome was available as shotgun sequence reads in the NCBI Trace Archive, but these data had not yet been assembled. We felt that including all available primate sequence was important, so we added inferred protein coding sequence from orangutan to the mammalian CDS alignments by comparing human genomic sequence to the orangutan Trace Archive at NCBI, initially using BLAST, then assembling reads using phrap (www.phrap.org) and producing final alignments with the lalign algorithm implemented in matcher from the EMBOSS package (30,31), and extracting CDS regions based on exon coordinates from the human genomic sequence. Inclusion of orangutan sequence in the alignments had only minimal impact on the human-specific dN/dS, but improved prediction of human-specific substitutions when using the branch + site models described below (data not shown). Columns with gaps in the human sequence were removed from the alignment to facilitate analysis.
Statistical analysis
Markov process models of codon substitution, as implemented in the PAML software package (11), were used to analyze the selective pressures affecting each gene along the human lineage. The PAML software package is available at http://abacus.gene.ucl.ac.uk/software/paml.html. The program codeml v3.14b within the PAML package was used to analyze the data. codeml allows specification of several different codon substitution models that allow testing of hypotheses related to selection along certain branches of the tree and/or at a subset of codons (sites).
For codeml analysis, a directory was made for each gene containing the multiple sequence alignment in PHYLIP format, codeml control files, and an appropriate phylogenetic tree file for the species represented in the alignment. Species tree files were constructed from the consensus mammalian tree (Figure 1) (32) essentially by pruning branches that were not represented for a given gene. codeml was run on an Apple OSX compute cluster consisting of 20 dual-processor nodes, each equipped with 2 GB of RAM. The compute time for codeml analyses under all five models for the 13 721 alignments was 19 days using an average of 30 CPUs concurrently. Following codeml analysis, a python script parsed the codeml results and alignment files and loaded information to a MySQL database. A web interface was developed using python to provide interactive access to the codeml results on a gene by gene basis.
For each gene, codeml was run under two test models and three null models, and results from each were then compared as tests of positive selection on the human gene. Each run produced a maximum-likelihood estimate, which is the probability of observing the data under the evolutionary conditions implemented in the model. A likelihood ratio test (LRT) was used to determine whether the test model is a significantly better fit to the data than the null model. A P-value was calculated by comparing two times the difference in log likelihood values to a chi-squared distribution, with the degrees of freedom equal to the difference in number of parameters between the pair of nested tests. The branch model allows the dN/dS ratio on a specified branch of the tree to differ from the average dN/dS ratio across the rest of the tree. The matching null model fixes dN/dS = 1 on the specified branch. If dN/dS > 1 in the branch model and the LRT is significant, positive selection can be inferred. For tests of selection on the human branch, we refer to these branch models as model H and model Hnull, respectively. A particular problem in the branch tests, which is due to the short evolutionary time since the human-chimpanzee divergence, occurs when there are only non-synonymous substitutions on the branch between human and the human-chimpanzee ancestor. In this situation, the rate of synonymous substitution dS = 0 and dN/dS is undefined; this is represented in the database as 999. The branch test also requires dN/dS to be elevated across the entire gene, and therefore is considered somewhat unrealistic, particularly for multi-domain proteins where positive selection may be acting on only one domain.
The branch + site models were developed to address positive selection at a subset of sites (codons) on one branch of the tree (33,34). Model A defines four classes of sites, where two of the classes have dN/dS ≤ 1 on all branches while two additional classes have dN/dS > 1 on the lineage of interest (human), but dN/dS ≤ 1 on the other branches of the tree. Model Anull fixes dN/dS = 1 for the latter two classes and thus comparison of Model A with Model Anull is a strict test of positive selection. Model 1a assumes dN/dS ≤ 1 at all sites across all branches. A significant LRT in the comparison of Model A with Model 1a can be due to either positive selection or relaxation of selective constraint, since there is no formal test of whether dN/dS is >1 on the human branch, just whether there is a subset of sites on the human branch with a dN/dS ratio that is elevated relative to the rest of the tree. A summary of the models and likelihood ratio tests performed in this analysis are found in Table 1 and more thorough information is available in the PAML documentation. A small number of genes have also been run under site models in which dN/dS is allowed to vary among sites, instead of between lineages. This is helpful in interpreting whether positive selection is occurring specifically along the human lineage or if the gene is rapidly evolving in multiple species.
Table 1.
Model | Description | ||||
---|---|---|---|---|---|
Model H | Non-neutral model: human ωa allowed to vary from other branches ωHUMAN ≠ ωOTHERS | ||||
Model Hnull | Neutral model: human ω fixed at ω = 1 and allowed to vary from other branches ωHUMAN = 1 and ωHUMAN ≠ ωOTHERS | ||||
Model A | Sites on the human branch allowed to differ | ||||
Background lineages | 4 site classes: | 0 < ω0 < 1 | ω1 = 1 | 0 < ω2a < 1 | ω2b = 1 |
Foreground lineage-human | 4 site classes: | 0 < ω0 < 1 | ω1 = 1 | ω2a ≥ 1 | ω2b ≥ 1 |
Model Anull | Sites on the human branch fixed at ω=1 | ||||
Background lineages | 4 site classes: | 0 < ω0 < 1 | ω1 = 1 | 0 < ω2a < 1 | ω2b = 1 |
Foreground lineage-human | 4 site classes: | 0 < ω0 < 1 | ω1 = 1 | ω2a = 1 | ω2b = 1 |
Model 1a | Sites on all branches nearly neutral | ||||
All lineages | 2 site classes: 0 < ω0 < 1, ω1 = 1 | ||||
Compare | Likelihood ratio tests of significance | ||||
Model H, Model Hnull | Branch test: non-neutral evolution | ||||
MA, M1a | Relaxed branch + site test: positive selection or relaxation of selective constraint | ||||
MA, MAnull | Strict branch+site test: positive selection |
aω = dN/dS.
RESULTS OF TESTS OF SELECTION
Table 2 presents a summary of the results of tests of positive selection for human genes. As expected, the branch test, which requires dN/dS to exceed 1 over the entire coding sequence, returned few significant results. In the strict branch + site test that compares results from Models A and Anull, 244 genes met the nominal threshold for significance of P < 0.05. More than twice as many genes met the significance threshold in a comparison of Model A with Model 1a, which can indicate a relaxation of selective constraint or positive selection. Gene Ontology categories were matched to the 244 genes with P < 0.05 in the strict branch + site test to assess whether certain categories might be over-represented among genes predicted to be positively selected (Table 3). As has been reported previously (17,35,36), transcription factors and olfactory receptors are over-represented among positively selected genes. Reflecting the abundance of transcription factors, the cellular component category most over-represented is the nucleus.
Table 2.
Test | Significance threshold | Number of genesa |
---|---|---|
Model A versus Model Anull | 0.05 | 244 |
[strict branch + site test] | 0.01 | 152 |
0.001 | 48 | |
Model A versus Model 1a | 0.05 | 611 |
[relaxed branch + site test] | 0.01 | 276 |
0.001 | 114 | |
Model H versus Model Hnull | 0.05 | 16 |
[strict branch test] | 0.01 | 3 |
0.001 | 2 |
aOR5B3 is the only gene with P < 0.05 in both branch (Model H versus Model Hnull) and branch + site (Model A versus Model Anull) tests.
Table 3.
Gene Ontology ID | Category name | Corrected P-value |
---|---|---|
Biological process | ||
GO:0006350 | Transcription | 0.003 |
GO:0006355 | Regulation of transcription, DNA-dependent | 0.003 |
GO:0007608 | Sensory perception of smell | 0.02 |
GO:0006814 | Sodium ion transport | 0.02 |
GO:0007165 | Signal transduction | 0.02 |
Molecular function | ||
GO:0005515 | Protein binding | 0 |
GO:0046872 | Metal ion binding | 0.00005 |
GO:0016740 | Transferase activity | 0.00009 |
GO:0008270 | Zinc ion binding | 0.001 |
GO:0003677 | DNA binding | 0.001 |
GO:0000166 | Nucleotide binding | 0.001 |
GO:0003676 | Nucleic acid binding | 0.006 |
GO:0004984 | Olfactory receptor activity | 0.006 |
GO:005524 | ATP binding | 0.01 |
GO:0031402 | Sodium ion binding | 0.02 |
Cellular component | ||
GO:0005634 | Nucleus | 0 |
GO:0016020 | Membrane | 0.0004 |
GO:0005737 | Cytoplasm | 0.002 |
Gene Ontology classification was performed using Onto-Express (43,44).
Several groups have performed genome scans for positive selection on the human lineage using divergence data (17,35,37,38). One feature that sets this study apart is the large number and wide variety of mammalian species used. When only a small number of species is used (e.g. human–chimpanzee–mouse or human–chimpanzee–macaque) or outgroups that are too divergent (35,38,39) there can be a high degree of uncertainty as to whether a substitution occurred on the human or chimpanzee lineage. This can lead to both an increase in the false positive rate, as well as a lack of power to detect selection if multiple substitutions have occurred at the same codon. The inclusion of a third great ape, the orangutan, the Old World monkey rhesus macaque, and several non-primate mammals adds considerable power to model the substitution pattern. As recent studies have concluded (35,37), there is a relative paucity of human-specific positively selected genes and these events appear to be relatively rare in the genome. Bakewell (37) suggests that the small long-term effective population size of humans as they migrated out of Africa may be masking positive selection, as many advantageous alleles have yet to become fixed in the human population.
USER INTERFACE
The Human PAML Browser can be accessed at http://mendel.gene.cwru.edu/adamslab/pbrowser.py and there are a number of ways to query the database. A gene of interest can be searched for by gene symbol, Entrez Gene ID, mRNA accession (RefSeq) and gene name. To examine genes by their biological process, molecular function or cellular component, Gene Ontology IDs (http://www.geneontology.org/) may also be used to query the database. For sets of genes with similar names, a wildcard * can be inserted. A more recent addition is the ability to query the database by statistical significance. The user can set the threshold P-value for one of the three likelihood ratio tests to extract genes by significance. The return screen from a query contains one or more genes related to the search, and the user has the ability to chose the alignment set used in the analysis, which is primarily the UCSC alignments plus orangutan.
Database organization
Upon gene choice, the user is presented with a summary screen from the PAML analysis (Figure 2). This screen contains direct links to the Entrez ID, the mRNA accession and the GO terms for the gene. The organisms used in this analysis are ordered by evolutionary distance from humans, and information is given on the percent coverage and average percent identity of the aligned sequence from each organism as compared to the orthologous human sequence. The results from the likelihood ratio tests in the form of a P-value are displayed for all three tests of positive selection: the relaxed branch + site test, the strict branch + site test and the branch test. Also contained are links to the DNA and protein multispecies alignments (Figure 3) as well as the sequences in FASTA format to facilitate subsequent analysis. It is of note that an asterisk in the protein alignments refer to an unknown amino acid or gap in the sequence, not a stop codon.
The bulk of the data is contained under the Branch and Site Models link (Figure 4). To the right is the table of results from the branch Models H and Hnull (Figure 4A). The numbers under the branch heading refer to the relationship of the branch from an internal node to a terminal node or to another internal node. Terminal branches are labeled by organism. One dN/dS ratio is present across the entire tree with the exception of the terminal branch to human (Hsa). dN/dS on the human branch is calculated in Model H and fixed at 1 in Model Hnull.
Figure 4B contains the results from branch + site Model A. Model A is run under the assumption of four site classes: class 0, sites under negative selection in all branches; class 1, sites evolving neutrally in all branches; class 2a, sites positively selected on the human branch, but negatively selected on the other branches and class 2b, sites positively selected on the human branch, but neutrally evolving on the other branches. The proportion refers to the fraction of sites within the protein that fall into each of the four site classes. Foreground and background ω is the dN/dS ratio for the human lineage and all the other lineages, respectively. The version of codeml implemented uses a Bayes Empirical Bayes (BEB) method for calculating the posterior probability that each site is from a particular site class; sites with a high posterior probability from site class 2a and 2b are inferred to be positively selected (34). A summary table shows the total number of sites predicted to be positively selected as well as the number with a BEB posterior probability ≥95 and ≥99%. Finally, all sites predicted to be positively selected are listed along with the residue number, the affected amino acid and the BEB probability specific to that residue. The results from the null Model 1a can be retrieved through the Site Models link on the original summary page (Figure 2).
IMPORTANT LIMITATIONS
There are several limitations of the data and analyses presented in the Human PAML Browser that the user should consider. The results presented in the browser were obtained using an automated process at multiple steps from alignment creation to extraction of coding sequences, to maximum-likelihood analysis, to database loading and web display. Most alignments and result sets have not been examined manually or curated in any way. Bad alignments lead to bad phylogenetic inferences. In particular, some genes with very low P-values in tests of selection appear to have alignment problems with many human-specific substitutions clustered at adjacent residues. Incorrect assignment of orthology could also result in misleading P-values. The whole-genome alignments used here have the advantage of aligning coding sequences in a larger (often much larger) syntenic framework, but a gene family approach that accounts for gene duplication and deletion events would be a useful adjunct to interpretation of the information presented in the Human PAML Browser. Differential gene loss following duplication can lead to 1:1 paralogs being mistaken for 1:1 orthologs. This problem is particularly acute in organisms, such as yeast, which have experienced whole-genome duplication events (40), but could also affect vertebrate alignments, particularly given the extent of lineage-specific segmental duplication in primates (41). Gene trees have not been constructed for the alignments used here, but reconciliation of the gene tree with the species tree using a program such as NOTUNG (42) would be a useful exercise in the context of follow-up study.
The P-values in the Human PAML Browser have not been corrected for multiple tests, and so care should be taken in interpretation of results. Another factor that potentially impacts the P-values is that in some cases, the ML methods fail to converge to a global optimum, resulting in an inappropriately low P-value. Finally, genome assemblies and assembly alignments are improving over time. It is strongly recommended that the user validate the results presented here for any gene of interest by re-extracting gene sequences and repeating the codeml analysis. This will serve the dual purposes of incorporating new and improved genome assembly data and ensuring that the ML analysis is stable.
CONCLUSION
We have generated a database containing the PAML results for tests of human-specific positive selection. The simple web interface makes access to PAML results readily available, compared to the tasks of preparing properly formatted alignment, phylogenetic tree, and codeml control files and parsing five different codeml output files. The multispecies alignments used in each analysis are readily available in FASTA file format for further analysis using other methods or to examine selection among other orders/families of mammals or in individual species. The Human PAML Browser is intended to aide other researchers as they search for the selective pressures that have affected their gene or gene family of interest, and can be a stepping stone for many future studies of positive selection and the human genome.
ACKNOWLEDGEMENTS
We thank Neil Molyneaux for additional contributions to the Human PAML Browser and Ulrich Neuss for maintaining the compute cluster. Funds for this work were provided by a start-up award to M.D.A. from Case Western Reserve University. Funding to pay Open Access publication charges for this article was provided by Case Western Reserve University.
Conflict of interest statement. None declared.
REFERENCES
- 1.King M.-C, Wilson AC. Evolution at two levels in humans and chimpanzees. Science. 1975;188:107–116. doi: 10.1126/science.1090005. [DOI] [PubMed] [Google Scholar]
- 2.Carroll SB. Genetics and the making of Homo sapiens. Nature. 2003;422:849–857. doi: 10.1038/nature01495. [DOI] [PubMed] [Google Scholar]
- 3.Li WH, Wu CI, Luo CC. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 1985;2:150–174. doi: 10.1093/oxfordjournals.molbev.a040343. [DOI] [PubMed] [Google Scholar]
- 4.Kimura M. The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press; 1983. [Google Scholar]
- 5.Tajima F. Statistical analysis of DNA polymorphism. Jpn. J. Genet. 1993;68:567–595. doi: 10.1266/jjg.68.567. [DOI] [PubMed] [Google Scholar]
- 6.Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, Schaffner SF, Gabriel SB, Platko JV, Patterson NJ, et al. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002;419:832–837. doi: 10.1038/nature01140. [DOI] [PubMed] [Google Scholar]
- 7.McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991;351:652–654. doi: 10.1038/351652a0. [DOI] [PubMed] [Google Scholar]
- 8.Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. doi: 10.1007/BF01734359. [DOI] [PubMed] [Google Scholar]
- 9.Yang Z. Inference of selection from multiple species alignments. Curr. Opin. Genet. Dev. 2002;12:688–694. doi: 10.1016/s0959-437x(02)00348-9. [DOI] [PubMed] [Google Scholar]
- 10.Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 1994;11:725–736. doi: 10.1093/oxfordjournals.molbev.a040153. [DOI] [PubMed] [Google Scholar]
- 11.Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 1997;13:555–556. doi: 10.1093/bioinformatics/13.5.555. [DOI] [PubMed] [Google Scholar]
- 12.Anisimova M, Bielawski JP, Yang Z. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 2001;18:1585–1592. doi: 10.1093/oxfordjournals.molbev.a003945. [DOI] [PubMed] [Google Scholar]
- 13.Anisimova M, Bielawski JP, Yang Z. Accuracy and power of bayes prediction of amino acid sites under positive selection. Mol. Biol. Evol. 2002;19:950–958. doi: 10.1093/oxfordjournals.molbev.a004152. [DOI] [PubMed] [Google Scholar]
- 14.International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature. 409:860–921. doi: 10.1038/35057062. [DOI] [PubMed] [Google Scholar]
- 15.Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans C, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040. [DOI] [PubMed] [Google Scholar]
- 16.International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature. 431:931–945. doi: 10.1038/nature03001. [DOI] [PubMed] [Google Scholar]
- 17.Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 437:69–87. doi: 10.1038/nature04072. [DOI] [PubMed] [Google Scholar]
- 18.Rat Genome Sequencing Project Consortium (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature. 428:493–521. doi: 10.1038/nature02426. [DOI] [PubMed] [Google Scholar]
- 19.Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas E, III, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005;438:803–819. doi: 10.1038/nature04338. [DOI] [PubMed] [Google Scholar]
- 20.Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature. 420:520–562. doi: 10.1038/nature01262. [DOI] [PubMed] [Google Scholar]
- 21.Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. doi: 10.1101/gr.1933104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007;35:D5–D12. doi: 10.1093/nar/gkl1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, et al. The Mouse Genome Database (MGD): from genes to mice – a community resource for mouse biology. Nucleic Acids Res. 2005;33:D471–D475. doi: 10.1093/nar/gki113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, et al. Ensembl 2007. Nucleic Acids Res. 2007;35:D610–D617. doi: 10.1093/nar/gkl996. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, et al. Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA) Genome Res. 2002;12:493–502. doi: 10.1101/gr.212002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Margulies EH, Vinson JP, Miller W, Jaffe DB, Lindblad-Toh K, Chang JL, Green ED, Lander ES, Mullikin JC, et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci. USA. 2005;102:4795–4800. doi: 10.1073/pnas.0409882102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Ohno S. Evolution by Gene Duplication. Heidelberg: Springer-Verlag; 1970. [Google Scholar]
- 28.Prince VE, Pickett FB. Splitting pairs: the diverging fates of duplicated genes. Nat. Rev. Genet. 2002;3:827–837. doi: 10.1038/nrg928. [DOI] [PubMed] [Google Scholar]
- 29.Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA. Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. J. Exp. Zoolog. B Mol. Dev. Evol. 2007;308:58–73. doi: 10.1002/jez.b.21124. [DOI] [PubMed] [Google Scholar]
- 30.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
- 31.Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294:2348–2351. doi: 10.1126/science.1067179. [DOI] [PubMed] [Google Scholar]
- 33.Yang Z, Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 2002;19:908–917. doi: 10.1093/oxfordjournals.molbev.a004148. [DOI] [PubMed] [Google Scholar]
- 34.Yang Z, Wong WS, Nielsen R. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 2005;22:1107–1118. doi: 10.1093/molbev/msi097. [DOI] [PubMed] [Google Scholar]
- 35.Arbiza L, Dopazo J, Dopazo H. Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Comput. Biol. 2006;2:e38. doi: 10.1371/journal.pcbi.0020038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Todd Hubisz M, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez R, et al. Natural selection on protein-coding genes in the human genome. Nature. 2005;437:1153–1157. doi: 10.1038/nature04240. [DOI] [PubMed] [Google Scholar]
- 37.Bakewell MA, Shi P, Zhang J. More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc. Natl Acad. Sci. USA. 2007;104:7489–7494. doi: 10.1073/pnas.0701705104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, Todd MA, Tanenbaum DM, Civello D, Lu F, et al. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science. 2003;302:1960–1963. doi: 10.1126/science.1088821. [DOI] [PubMed] [Google Scholar]
- 39.Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, et al. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316:222–234. doi: 10.1126/science.1139247. [DOI] [PubMed] [Google Scholar]
- 40.Scannell DR, Frank AC, Conant GC, Byrne KP, Woolfit M, Wolfe KH. Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proc. Natl Acad. Sci. USA. 2007;104:8397–8402. doi: 10.1073/pnas.0608218104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet. 2006;7:552–564. doi: 10.1038/nrg1895. [DOI] [PubMed] [Google Scholar]
- 42.Chen K, Durand D, Farach-Colton M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 2000;7:429–447. doi: 10.1089/106652700750050871. [DOI] [PubMed] [Google Scholar]
- 43.Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res. 2003;31:3775–3781. doi: 10.1093/nar/gkg624. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Khatri P, Bhavsar P, Bawa G, Draghici S. Onto-Tools: an ensemble of web-accessible, ontology-based tools for the functional design and interpretation of high-throughput gene expression experiments. Nucleic Acids Res. 2004;32:W449–W456. doi: 10.1093/nar/gkh409. [DOI] [PMC free article] [PubMed] [Google Scholar]