Proceedings of the National Academy of Sciences of the United States of America. 2015 Nov 17;112(48):14918–14923. doi: 10.1073/pnas.1507669112

Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling

Jun Inoue a,b,1, Yukuto Sato c,d,1, Robert Sinclair a,1, Katsumi Tsukamoto b,e, Mutsumi Nishida b,f,2
PMCID: PMC4672829  PMID: 26578810

Significance

All genes are duplicated by whole-genome duplication (WGD), reverting in number over time, but the actual timing of genome reshaping through gene loss remains poorly understood. We estimated the spatiotemporal loss/persistence pattern of 6,892 gene lineage pairs after the teleost-specific WGD, using careful orthology assignment and a reliable time-calibrated tree. We found that massive gene loss did occur in the first 60 My, mainly due to events involving the simultaneous loss of multiple redundant genes, and the rate of loss then slowed to an approximately constant level for the subsequent 250 My. Similar genomic gene arrangements within teleosts imply that rapid gene loss led to the reshaping of the teleost genomes before their major divergence.

Keywords: orthologous gene, bony vertebrates, post-WGD genome evolution

Abstract

Whole-genome duplication (WGD) is believed to be a significant source of major evolutionary innovation. Redundant genes resulting from WGD are thought to be lost or acquire new functions. However, the rates of gene loss and thus temporal process of genome reshaping after WGD remain unclear. The WGD shared by all teleost fish, one-half of all jawed vertebrates, was more recent than the two ancient WGDs that occurred before the origin of jawed vertebrates, and thus lends itself to analysis of gene loss and genome reshaping. Using a newly developed orthology identification pipeline, we inferred the post–teleost-specific WGD evolutionary histories of 6,892 protein-coding genes from nine phylogenetically representative teleost genomes on a time-calibrated tree. We found that rapid gene loss did occur in the first 60 My, with a loss of more than 70–80% of duplicated genes, and produced similar genomic gene arrangements within teleosts in that relatively short time. Mathematical modeling suggests that rapid gene loss occurred mainly by events involving simultaneous loss of multiple genes. We found that the subsequent 250 My were characterized by slow and steady loss of individual genes. Our pipeline also identified about 1,100 shared single-copy genes that are inferred to have become singletons before the divergence of clupeocephalan teleosts. Therefore, our comparative genome analysis suggests that rapid gene loss just after the WGD reshaped teleost genomes before the major divergence, and provides a useful set of marker genes for future phylogenetic analysis.


The recent rapid growth of genome data has made it possible to clarify major evolutionary events that have shaped eukaryote genomes, such as gene duplication, chromosomal rearrangement, and whole-genome duplication (WGD) (1). In particular, WGD events, known to have occurred in several major lineages of flowering plants (2), budding yeasts (3), and vertebrates (4) (Fig. 1A), are considered to have had a major impact on genomic architecture and consequently organismal features.

Fig. 1.

Inferred spatiotemporal process of gene loss and persistence after TGD in teleost ancestors. (A) The estimated numbers of gene loss events in the teleost phylogeny, time-scaled tree of vertebrates (11, 41) with the timing of genome duplication events at the base of vertebrates (VGD1/2) and teleosts (TGD), and the number of extant species (26). Species used in this study are connected by solid branches. The numbers were parsimoniously inferred from the presence or absence of TGD-derived gene lineage pairs belonging to 6,892 orthogroups and mapped onto the time points of TGD (306 Mya), nodes a–g (a: 245 Mya; b: 158; c: 120; d: 105; e: 41; f: 164; g: 86) (11), and h (74 Mya) (28). On the left side of the tree, ortholog arrangements are compared between representatives (connected by bold branches in the tree) by CIRCOS (circos.ca) using orthology information for 5,655 orthogroups belonging to the 1to1 category (Fig. S2). (B) Definition of terms relating to WGD events. An orthogroup is a monophyletic group containing WGD-derived paralogs (gene lineages) of all focal species (Sp1) and orthologs of their sister species (Sp2), ignoring lineage-specific gene duplications (GeneA-1′ and -1″) or gene loss (GeneA-1″). (C) Approximation of the pattern of the number of gene loss and persistence events associated with TGD. The estimated numbers of retained paired gene lineages at nodes a to h and in current teleosts (Ca, Ze, Co, Ti, Pl, Me, St, Te, and Fu) were used to compare the fit of the one-phase [$\alpha e^{-2\mu t}$ (14)] and two-phase models. (D) Region of C detailing the recent pattern of gene loss. The solid and dashed curves have been corrected upward to remove the bias expected to result from parsimony analysis. These approximations are effectively insensitive to fluctuations in the estimated numbers of gene lineage pairs and times for the TGD event and ancestral nodes (a to h) (SI Text). The evolutionary scenario is essentially unchanged if the number of gene lineage pairs estimated without the BS 70% criterion or the divergence times estimated by nuclear gene (28)/mitochondrial genome (42) data were used. Note that the two-phase model can be roughly approximated by a double-exponential curve.

Duplicate genes generated by WGD are typically assumed to be redundant and therefore subsequently lost in a stochastic manner. Comparative genome studies have suggested that 90% of duplicate genes were rapidly lost (5) by a neutral process (6) after WGD in budding yeast, but 20–30% of them were retained in human (7) even after several hundred million years. However, few genome-wide studies have addressed the temporal pattern of gene loss or persistence after WGD with reference to a reliable timescale (but see refs. 6 and 8). Such examination is indispensable for understanding when duplicate genes were lost and, consequently, genome structures were reshaped, during vertebrate diversification after the WGD (Fig. 1).

To examine the detailed process of duplicate gene loss after WGD, one needs to estimate the number (proportion) of remaining duplicates in extant and ancestral species. For this purpose, both (i) reliably time-calibrated phylogenetic trees of species and (ii) well-annotated genomes are required. These two requirements have been met for several vertebrate lineages, including some teleost fishes. Given this, the next step should be to accurately estimate orthology and paralogy relationships of all of the genes that experienced WGD. For the analysis of gene orthology and paralogy, a homology search- or synteny-based approach has usually been used (9). In addition to the homology search-based approach (e.g., COGs and OrthoDB), a phylogenetic tree-based approach has also been introduced (e.g., Ensembl and PhylomeDB) (9). Recent developments of tree search algorithms and increased computing power allow a sophisticated tree-based approach, comparing each gene tree with the species tree. Such an approach is indispensable for the effective analysis of gene orthology and paralogy across many species, providing us with a powerful opportunity to investigate genome evolution after WGD.

Here, we aim to investigate the gene loss/persistence pattern using genome-wide data, focusing on what is known as the teleost genome duplication (TGD). TGD is estimated to have occurred in an ancestor of teleosts (Fig. 1A) but after the divergence of tetrapods and teleosts (10). Thus, it is a relatively recent WGD shared by a large vertebrate group, i.e., the Teleostei. For teleosts, reliably time-calibrated phylogenies, including phylogenetic position and timing of the TGD event, are available (e.g., ref. 11). In addition, well-annotated whole-genome data from at least nine phylogenetically representative teleost species (cave fish, zebrafish, cod, tilapia, platyfish, medaka, stickleback, Tetraodon, and fugu) are now available from Ensembl (12). In the present study, we inferred the timing of rapid genome reshaping through gene loss after TGD by estimating the temporal and genomic positional (spatiotemporal) loss/persistence pattern of TGD-derived gene lineage pairs (Fig. 1B) over the past several hundred million years, using accurate tree-based orthology estimation (Fig. S1) and a reliable time-calibrated teleost tree. We investigated the mechanism of rapid gene loss after TGD by fitting a newly developed model for the observed temporal pattern of gene loss. This new model is necessary because standard models, based upon random and independent loss of duplicate genes, fail to fit our data. Our model analysis explicitly includes both the possibility of the loss of multiple genes in single events, and also the known phylogeny of the relevant species. The significance of the inclusion of events that result in the loss of multiple genes is that it reproduces the two phases of loss. The inclusion of known phylogeny allows us to correct for the bias associated with parsimony analysis.

Results

Automated Pipeline Analysis for WGD-Derived Genes.

To estimate the spatiotemporal loss/persistence pattern of TGD-derived gene (hereafter gene lineage; Fig. 1B) pairs, we accurately identified orthologous and paralogous relationships between tetrapod and teleost genes (involving TGD-derived gene lineage pairs in teleosts) by conducting rigorous phylogenetic and reconciliation analyses with the species tree for all protein-coding gene sequences retrieved mainly from Ensembl (Fig. S1). Homologous gene clades identified in the above procedure were regarded as “orthogroups,” including paralogous gene lineages derived from TGD in teleosts and their ortholog in tetrapods. We regard “gene loss” as the absence of a gene from a specific species in an identified orthogroup, according to well-established genomic and gene annotation datasets such as those provided by Ensembl (12). In the present study (Fig. 1A), we focus specifically on gene loss events during post-TGD evolution. Using human and medaka protein-coding gene sequences as representatives of tetrapods and teleosts, we used the following three-step approach in our novel automated pipeline: candidate ortholog selection using BLAST search and neighbor-joining (NJ) analysis (Fig. S1A), orthogroup identification by ML gene tree analysis using the candidates (Fig. S1B), and reliable orthogroup recognition using a 70% bootstrap criterion (BS 70% criterion) (Materials and Methods) (Fig. S1C). Finally, to identify all reliable orthogroups of bony vertebrates, we integrated the results from a medaka-centered analysis into those from a human-centered analysis (Fig. S1D). We tested the sensitivity of our automated pipeline by comparing with the dataset used in our previous study (13) (SI Text). Note that such an approach could easily be adapted to other evolutionary lineages.

In general, our pipeline analysis differs from Ensembl in the following points: (i) taxon sampling with special reference to the TGD by distinguishing duplicates derived from WGD or lineage-specific duplication and identifying reciprocal patterns of gene lineage loss among teleost lineages (but see SI Text); (ii) sophisticated gene tree estimation by deleting distantly related sequences such as those from nonbilaterians, potentially erroneous sequences published in Ensembl, and ambiguously aligned sites; and (iii) filtering of orthologs/paralogs by excluding ambiguously estimated gene trees using a specific criterion.

Analyses of the type we have performed always involve finding a balance between overly permissive acceptance of both real and spurious data, and overzealous rejection of data that may be real but fails to pass certain criteria. We have chosen to take a conservative approach, consistent with our objective of producing a reliable set of marker genes. To demonstrate that our main conclusions are not artifacts of the strict criteria (e.g., BS 70% criterion and long-branch deletion option in Fig. S1), we have also performed our analysis without them, as reported in multiple places below. In particular, our conclusions regarding the temporal pattern of gene loss are shown to be robust.

Orthology/Paralogy Identification from Genome-Wide Analysis.

Automated analysis using 20,368 human and 19,686 medaka protein-coding genes inferred 6,892 orthogroups with high reliability (Table 1, Dataset S1, and SI Text). Among the 6,892 orthogroups, we identified 1,237 pairs of TGD-derived gene lineages in more than one teleost species (1to2 category) (ignoring lineage-specific duplications). The remaining 5,655 orthogroups (1to1 category) had only one gene lineage in all nine teleost species. These orthogroups can be considered to have lost one of the pair of TGD-derived gene lineages during the period between the TGD event and the basal separation of otocephalan and neoteleost lineages (Fig. 1A, node a).

Table 1.

The number of orthogroups identified between tetrapods and teleosts

Gene/orthogroup category; Human centered: without criterion†, with criterion‡; Medaka centered: without criterion†, with criterion‡; Integrated*: without criterion†, with criterion‡
Loci in query genome 20,368 (human) 19,686 (medaka)
No clear orthogroups 4,809 (human) 4,134 (medaka)
Putative orthogroups 15,559 6,673 15,552 6,471
Additional putative orthogroups from medaka-centered analysis§ 1,617 684
Integrated putative orthogroups 17,176 (15,559 + 1,617) 7,357 (6,673 + 684)
Multiply counted orthogroups 1,823 (1,544 + 279) 465 (409 + 56)
Identified orthogroups 15,353 (14,015 + 1,338) 6,892 (6,264 + 628)
1to2 category (paired by TGD) 4,262 (3,723 + 539) 1,237 (1,102 + 135)
1to1 category (losing one gene lineage) 11,091 (10,292 + 799) 5,655 (5,162 + 493)
TGD-derived paired gene lineages/total loci (gene lineages) in common ancestor 27.76% (4,262/15,353) 17.95% (1,237/6,892)

* See Dataset S1 for the results from all protein-coding gene analyses. The numbers in parentheses indicate estimates derived from human- or medaka-centered analyses.

† Orthogroups without the BS 70% criterion (Fig. S1C).

‡ Orthogroups with the BS 70% criterion.

§ The orthogroups without human ortholog.

cDNA sequence alignments and rearranged ML trees are available at fish-evol.unit.oist.jp/db/TGD16/6892data.tar.gz.

The above estimation indicates that, in the most recent common ancestor (MRCA) of nine teleosts, orthology/paralogy of a total of 8,129 (5,655 singleton + 2 × 1,237 duplicate pairs) loci (gene lineages) was successfully elucidated for the 6,892 (5,655 + 1,237) orthogroups. This means that 18% (1,237/6,892) of gene lineages in the MRCA have remained paired since the TGD event. When the BS 70% criterion was not applied, 28% (4,262/15,353) of the gene lineages remained paired.
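The bookkeeping behind these figures can be checked with a few lines of Python. This is only an arithmetic sketch using the counts reported in Table 1; the function and variable names are introduced here for illustration and are not part of the authors' pipeline.

```python
# Counts taken from Table 1 (integrated analysis).
# Key names are illustrative, not identifiers from the original pipeline.
COUNTS = {
    "with_bs70": {"one_to_one": 5655, "one_to_two": 1237},
    "without_bs70": {"one_to_one": 11091, "one_to_two": 4262},
}


def summarize(one_to_one, one_to_two):
    """Return (orthogroups, loci in the MRCA, fraction of orthogroups still paired).

    Each 1to1 orthogroup contributes one locus (singleton);
    each 1to2 orthogroup contributes two loci (a retained TGD pair).
    """
    orthogroups = one_to_one + one_to_two
    loci = one_to_one + 2 * one_to_two
    paired_fraction = one_to_two / orthogroups
    return orthogroups, loci, paired_fraction


for label, counts in COUNTS.items():
    groups, loci, frac = summarize(**counts)
    print(f"{label}: {groups} orthogroups, {loci} loci, {frac:.2%} paired")
```

With the BS 70% criterion this reproduces the 6,892 orthogroups, 8,129 loci, and 17.95% paired fraction quoted above; without it, the paired fraction is 27.76%.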

The orthologs belonging to 6,892 orthogroups were plotted along each chromosome to compare conserved synteny blocks among species. The result (Fig. 1A, Left, Fig. S2, and Dataset S2) shows that the genomes of teleosts are organized similarly to each other but not to those of tetrapods.

Gene ontology (GO) analysis revealed that 141 GO terms were significantly overrepresented among the TGD-derived paired gene lineages, with Padj < 0.05 (Table S1). Most of the highly significant genes associated with the 141 terms were those of proteins involved in signaling (e.g., glutamate receptor signaling pathway) and ion transport (e.g., ion channel complex). On the other hand, five GO terms were significantly underrepresented among TGD-derived paired gene lineages, with Padj < 0.05. These underrepresented genes included those for housekeeping functions such as RNA processing and DNA repair.

Table S1.

GO enrichment analysis for 1,102 orthogroups belonging to 1to2 category obtained from human-centered analysis

GO ID GO name No. of 1to2 category (1,102) No. of all orthogroups (6,264) Padj*
Overrepresented GO ID
 0005891 Voltage-gated calcium channel complex 12 15 <0.001
 0007215 Glutamate receptor signaling pathway 14 19 <0.001
 0022843 Voltage-gated cation channel activity 26 51 <0.001
 0007156 Homophilic cell adhesion 26 54 <0.001
 0015085 Calcium ion transmembrane transporter activity 29 62 <0.001
 0009581 Detection of external stimulus 31 72 <0.001
 0009582 Detection of abiotic stimulus 31 72 <0.001
 0034703 Cation channel complex 33 65 <0.001
 0007611 Learning or memory 33 77 <0.001
 0045202 Synapse 38 99 <0.001
 0097060 Synaptic membrane 40 99 <0.001
 0005261 Cation channel activity 43 117 <0.001
 0034765 Regulation of ion transmembrane transport 47 126 <0.001
 0034762 Regulation of transmembrane transport 47 127 <0.001
 0043235 Receptor complex 48 123 <0.001
 0034702 Ion channel complex 50 115 <0.001
 0005911 Cell–cell junction 50 123 <0.001
 0022836 Gated channel activity 50 131 <0.001
 1902495 Transmembrane transporter complex 52 122 <0.001
 1990351 Transporter complex 52 122 <0.001
 0016337 Cell–cell adhesion 54 139 <0.001
 0005216 Ion channel activity 58 164 <0.001
 0022838 Substrate-specific channel activity 58 165 <0.001
 0044456 Synapse part 59 163 <0.001
 0015267 Channel activity 61 171 <0.001
 0022803 Passive transmembrane transporter activity 61 171 <0.001
 0050767 Regulation of neurogenesis 66 200 <0.001
 0007268 Synaptic transmission 67 192 <0.001
 0043005 Neuron projection 72 237 <0.001
 0043269 Regulation of ion transport 73 195 <0.001
 0016477 Cell migration 73 238 <0.001
 0060284 Regulation of cell development 75 247 <0.001
 0051960 Regulation of nervous system development 76 226 <0.001
 0048646 Anatomical structure formation involved in morphogenesis 83 271 <0.001
 0007267 Cell–cell signaling 86 285 <0.001
 0005509 Calcium ion binding 87 269 <0.001
 0015075 Ion transmembrane transporter activity 88 302 <0.001
 0051094 Positive regulation of developmental process 89 300 <0.001
 0022891 Substrate-specific transmembrane transporter activity 92 319 <0.001
 0023052 Signaling 93 298 <0.001
 0044700 Single organism signaling 93 298 <0.001
 0050877 Neurological system process 94 284 <0.001
 0007155 Cell adhesion 96 315 <0.001
 0022610 Biological adhesion 96 316 <0.001
 0007154 Cell communication 100 347 <0.001
 0022857 Transmembrane transporter activity 102 356 <0.001
 0055085 Transmembrane transport 115 411 <0.001
 0042995 Cell projection 118 423 <0.001
 0030054 Cell junction 122 354 <0.001
 0097458 Neuron part 122 371 <0.001
 0005887 Integral component of plasma membrane 122 416 <0.001
 0006811 Ion transport 128 452 <0.001
 0006928 Cellular component movement 132 468 <0.001
 2000026 Regulation of multicellular organismal development 133 509 <0.001
 0003008 System process 134 437 <0.001
 0051049 Regulation of transport 136 529 <0.001
 0019899 Enzyme binding 137 531 <0.001
 0004872 Receptor activity 142 554 <0.001
 0050793 Regulation of developmental process 165 659 <0.001
 0032879 Regulation of localization 185 714 <0.001
 0065008 Regulation of biological quality 224 933 <0.001
 0044765 Single-organism transport 230 949 <0.001
 0048856 Anatomical structure development 244 1,002 <0.001
 0044459 Plasma membrane part 249 833 <0.001
 0044707 Single–multicellular-organism process 272 1,098 <0.001
 0032501 Multicellular organismal process 276 1,118 <0.001
 0006810 Transport 287 1,226 <0.001
 0051234 Establishment of localization 300 1,259 <0.001
 0044767 Single-organism developmental process 332 1,412 <0.001
 0032502 Developmental process 362 1,566 <0.001
 0005886 Plasma membrane 364 1,325 <0.001
 0016021 Integral component of membrane 364 1,684 <0.001
 0044425 Membrane part 487 2,182 <0.001
 0016020 Membrane 510 2,212 <0.001
 0005515 Protein binding 605 2,999 <0.001
 0044763 Single-organism cellular process 679 3,285 <0.001
 0065007 Biological regulation 700 3,492 <0.001
 0044699 Single-organism process 771 3,819 <0.001
 0005575 Cellular_component 1,013 5,474 <0.001
 0005245 Voltage-gated calcium channel activity 14 21 0.002
 0048870 Cell motility 75 252 0.002
 0005244 Voltage-gated ion channel activity 30 71 0.003
 0022832 Voltage-gated channel activity 30 71 0.003
 0072509 Divalent inorganic cation transmembrane transporter activity 30 71 0.003
 0050953 Sensory perception of light stimulus 34 86 0.003
 0040011 Locomotion 85 297 0.003
 0051239 Regulation of multicellular organismal process 190 798 0.003
 0007165 Signal transduction 356 1,656 0.003
 0050789 Regulation of biological process 659 3,337 0.003
 0007612 Learning 21 42 0.004
 0007600 Sensory perception 55 169 0.004
 0034220 Ion transmembrane transport 77 264 0.004
 0008092 Cytoskeletal protein binding 93 336 0.004
 0005215 Transporter activity 120 461 0.004
 0050794 Regulation of cellular process 631 3,185 0.004
 0034704 Calcium channel complex 14 22 0.006
 0007601 Visual perception 33 84 0.006
 0050890 Cognition 34 88 0.006
 0005516 Calmodulin binding 35 92 0.006
 0001525 Angiogenesis 38 103 0.006
 0022892 Substrate-specific transporter activity 101 375 0.006
 0045595 Regulation of cell differentiation 121 469 0.006
 0044464 Cell part 922 4,925 0.006
 0051899 Membrane depolarization 18 34 0.007
 0010646 Regulation of cell communication 217 945 0.007
 0008066 Glutamate receptor activity 11 15 0.008
 0005262 Calcium channel activity 23 50 0.008
 0046873 Metal ion transmembrane transporter activity 52 161 0.008
 0022890 Inorganic cation transmembrane transporter activity 55 173 0.008
 0045597 Positive regulation of cell differentiation 66 220 0.008
 0048869 Cellular developmental process 186 789 0.008
 0034330 Cell junction organization 31 79 0.009
 0023051 Regulation of signaling 215 939 0.009
 0009653 Anatomical structure morphogenesis 120 470 0.01
 0050804 Regulation of synaptic transmission 33 87 0.012
 0044708 Single-organism behavior 44 130 0.012
 0016529 Sarcoplasmic reticulum 12 18 0.015
 0048167 Regulation of synaptic plasticity 22 48 0.015
 0045664 Regulation of neuron differentiation 50 156 0.019
 0007610 Behavior 56 181 0.019
 0030001 Metal ion transport 65 220 0.019
 0043167 Ion binding 426 2,064 0.019
 0050770 Regulation of axonogenesis 19 39 0.022
 0050839 Cell adhesion molecule binding 15 27 0.023
 0042391 Regulation of membrane potential 35 97 0.023
 0003674 Molecular_function 937 5,039 0.023
 0019901 Protein kinase binding 55 179 0.024
 0008324 Cation transmembrane transporter activity 66 227 0.024
 0004871 Signal transducer activity 142 584 0.024
 0060089 Molecular transducer activity 142 584 0.024
 0046872 Metal ion binding 286 1,322 0.028
 0030315 T tubule 12 19 0.029
 0008150 Biological_process 964 5,218 0.035
 0045216 Cell–cell junction organization 26 65 0.036
 0044463 Cell projection part 70 249 0.04
 0044449 Contractile fiber part 29 77 0.045
 0019900 Kinase binding 59 201 0.045
 0043169 Cation binding 290 1,353 0.045
 0035637 Multicellular organismal signaling 12 20 0.049
 0051270 Regulation of cellular component movement 65 229 0.049
 0045211 Postsynaptic membrane 29 78 0.05
Underrepresented GO ID
 0006396 RNA processing 12 192 0.006
 0044260 Cellular macromolecule metabolic process 282 1,944 0.022
 0006974 Cellular response to DNA damage stimulus 15 202 0.041
 0034470 ncRNA processing 0 54 0.05
 0005730 Nucleolus 17 215 0.05
* Adjusted P value from the results of 1,000 simulated null hypothesis queries.
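To give a feel for how strong the enrichment in individual rows of Table S1 is, the sketch below computes a fold enrichment and an uncorrected hypergeometric tail probability for two rows. This is a swapped-in textbook test, not the authors' procedure: the Padj values in the table come from 1,000 simulated null hypothesis queries, so the raw probabilities here are not comparable with them. Function names are illustrative.

```python
from math import comb

# Table S1 totals: 6,264 orthogroups (human-centered analysis), 1,102 of them in the 1to2 category.
N_ALL = 6264
N_1TO2 = 1102


def fold_enrichment(hits, annotated):
    """Observed share of 1to2 orthogroups for a GO term, relative to the overall share."""
    return (hits / annotated) / (N_1TO2 / N_ALL)


def hypergeom_pmf(i, annotated):
    """P(exactly i of the annotated orthogroups fall in the 1to2 category)."""
    return comb(N_1TO2, i) * comb(N_ALL - N_1TO2, annotated - i) / comb(N_ALL, annotated)


def upper_tail(hits, annotated):
    """P(X >= hits): uncorrected probability of at least this much overrepresentation."""
    return sum(hypergeom_pmf(i, annotated) for i in range(hits, min(annotated, N_1TO2) + 1))


def lower_tail(hits, annotated):
    """P(X <= hits): uncorrected probability of at least this much underrepresentation."""
    return sum(hypergeom_pmf(i, annotated) for i in range(0, hits + 1))


# GO:0005891, voltage-gated calcium channel complex: 12 of 15 orthogroups are 1to2.
print(fold_enrichment(12, 15), upper_tail(12, 15))
# GO:0006396, RNA processing: 12 of 192 orthogroups are 1to2.
print(fold_enrichment(12, 192), lower_tail(12, 192))
```

The first row is retained as pairs roughly 4.5 times more often than the overall share would predict, whereas RNA processing orthogroups are retained well below that share.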

Temporal Loss/Persistence Pattern of Genes After TGD.

To estimate the proportion of TGD-derived gene lineage pairs remaining in the genomes of current teleosts, we counted the number of pairs in each teleost (SI Text and Fig. S3). Of the gene lineage pairs present in the ancestral genome soon after the TGD, on average 10% (692/6,892) remained paired until the present (Fig. 1A, Right); 16% (2,398/15,353) remained when the BS 70% criterion was not used.

Using these numbers of TGD-derived paired gene lineages, we parsimoniously estimated the ancestral number of gene lineage pairs at each node (a–h in Fig. 1A) of the teleost time-calibrated phylogenetic tree (11). Plotting the number of gene lineage pairs at every node against its divergence time, we estimated the temporal pattern of gene loss/persistence (Fig. 1C). The estimated pattern suggests that many of the gene lineage pairs (5,655/6,892) rapidly lost one gene lineage during the initial 60 My (from the TGD to node a). On the other hand, on average, more than one-half (692/1,237) of the gene lineage pairs in the ancestor of the nine teleosts (node a) persisted over the next 250 My (from node a to the present). This two-phase curve was also observed when the BS 70% criterion was not used (Fig. S4).
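One simple way to picture a parsimonious ancestral estimate is a loss-only rule: a gene lineage is taken to have existed at an ancestral node if at least one descendant species still carries it, and a pair counts as retained at that node only if both lineages pass this test. The sketch below applies that rule to a toy tree. The rule, the tree, the species names, and the presence data are all assumptions made for illustration; the text does not spell out the exact procedure used in the study.

```python
# Toy species tree as nested tuples; leaves are hypothetical species names.
TOY_TREE = (("sp1", "sp2"), ("sp3", "sp4"))

# For one orthogroup: which of the two TGD-derived lineages (A, B) each species retains.
TOY_ORTHOGROUP = {
    "sp1": {"A", "B"},
    "sp2": {"A"},
    "sp3": {"B"},
    "sp4": {"A"},
}


def lineages_at(node, presence):
    """Loss-only parsimony: a lineage existed at a node if any descendant retains it."""
    if isinstance(node, str):
        return set(presence[node])
    left, right = node
    return lineages_at(left, presence) | lineages_at(right, presence)


def pair_retained(node, presence):
    """A pair is counted at a node only when both lineages are inferred to be present."""
    return lineages_at(node, presence) == {"A", "B"}


def count_pairs(node, orthogroups):
    """Number of orthogroups inferred to be paired at the given node."""
    return sum(pair_retained(node, presence) for presence in orthogroups)


# Reciprocal loss: sp3 kept only B and sp4 kept only A,
# so their common ancestor is inferred to have carried both.
print(pair_retained(("sp3", "sp4"), TOY_ORTHOGROUP))  # True
print(pair_retained("sp2", TOY_ORTHOGROUP))           # False
```

Counting pairs in this way at each node, and plotting the counts against node age, gives the kind of curve shown in Fig. 1C.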

Development of a Two-Phase Model.

We developed a new model for the observed temporal pattern of gene loss after the TGD event (Fig. 1C) to better understand the process and mechanism of gene loss after WGD (SI Text). Immediately after a WGD, unusually high levels of redundancy are to be expected. At such exceptional times, overall selection against gene loss may be reduced, allowing the loss of multiple genes in single events. Although such situations may be relatively rare over long evolutionary timescales, their possibility should be considered to allow for a full understanding of post-WGD gene loss patterns over both short and long timescales. Consequently, we have intentionally incorporated the possibility of events that result in the simultaneous loss of multiple genes in our new model (two-phase model).

The two-phase model is an extension of the classical one-phase model based on exponential decay of duplicated genes (14), extended to incorporate the possibility of simultaneous loss of multiple genes (Fig. 1C). Several studies have uncovered examples of the deletion of contiguous clusters of genes or large chromosomal segments (e.g., refs. 3 and 15). Some genes dispersed throughout a genome and coregulated (but not necessarily colocalized) may be simultaneously inactivated (16) by the deletion of a common enhancer. The model was further extended to estimate the number of gene lineage pairs counted parsimoniously using the species phylogenetic tree.
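For orientation, the two curve shapes being compared can be written compactly. The one-phase form is the one quoted in the Fig. 1 legend; reading α as the number of pairs at t = 0 and μ as a per-copy loss rate is an interpretation made here. The second expression is only the rough double-exponential approximation of the two-phase model mentioned in that legend, and the symbols A, B, λ_fast, and λ_slow are notation introduced for illustration, not parameters reported by the study (the full model is described in SI Text).

```latex
% One-phase model, as given in the Fig. 1 legend (ref. 14):
N_{\mathrm{one}}(t) = \alpha \, e^{-2\mu t}

% Rough double-exponential stand-in for the two-phase model
% (A, B, \lambda_{\mathrm{fast}}, \lambda_{\mathrm{slow}} are illustrative symbols, not fitted values):
N_{\mathrm{two}}(t) \approx A \, e^{-\lambda_{\mathrm{fast}} t} + B \, e^{-\lambda_{\mathrm{slow}} t},
\qquad \lambda_{\mathrm{fast}} \gg \lambda_{\mathrm{slow}} > 0
```

In this reading, the fast term dominates the initial collapse in pair number and the slow term describes the long tail.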

We found that the estimated temporal pattern of gene loss/persistence is fitted by the two-phase model (Fig. 1C, solid line) far better than the classic one-phase model (dashed line). The new model fits the recent part of the process particularly well (Fig. 1D; from node a to the present).
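A back-of-the-envelope calculation shows why a single exponential struggles with these data. The sketch below is not the authors' fitting procedure: it simply anchors one exponential to the two early data points (6,892 pairs at the TGD, 306 Mya, and 1,237 pairs at node a, 245 Mya) and extrapolates to the present, then compares the implied decay rates of the two phases. The rate `k` corresponds to 2μ in the one-phase form; all names are illustrative.

```python
import math

# Time points (Mya) from the Fig. 1 legend and pair counts from the text.
T_TGD, T_NODE_A = 306.0, 245.0
PAIRS_AT_TGD = 6892      # every orthogroup starts as a pair
PAIRS_AT_NODE_A = 1237   # 1to2 category
PAIRS_NOW_MEAN = 692     # average over the nine extant teleosts


def decay_rate(n_start, n_end, elapsed_my):
    """Rate k (per My) such that n_end = n_start * exp(-k * elapsed_my)."""
    return math.log(n_start / n_end) / elapsed_my


def extrapolate(n_start, rate, elapsed_my):
    """Pairs expected after elapsed_my under a single exponential with the given rate."""
    return n_start * math.exp(-rate * elapsed_my)


# Rate implied by the first phase (TGD to node a) and by the second (node a to present).
k_first = decay_rate(PAIRS_AT_TGD, PAIRS_AT_NODE_A, T_TGD - T_NODE_A)
k_second = decay_rate(PAIRS_AT_NODE_A, PAIRS_NOW_MEAN, T_NODE_A)

# If the first-phase rate had continued unchanged until today:
predicted_now = extrapolate(PAIRS_AT_TGD, k_first, T_TGD)

print(f"first-phase rate:  {k_first:.4f} per My")
print(f"second-phase rate: {k_second:.4f} per My")
print(f"one-rate extrapolation to the present: {predicted_now:.1f} pairs (observed mean: {PAIRS_NOW_MEAN})")
```

Under these assumptions the early rate is more than ten times the later one, and carrying the early rate forward would leave only a handful of pairs today rather than the roughly 700 observed, which is the mismatch the two-phase model is designed to resolve.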

Discussion

Rapid Reshaping of Teleost Genome After TGD.

The present study suggests that rapid reshaping of the teleost genome occurred just after the TGD event. Our analysis showed the rapid loss of genes generated by TGD just after the event. We found 82% [5,655/(5,655 + 1,237)] of those gene lineages [72% (11,091/15,353) without the BS 70% criterion] were rapidly lost during the initial 60 My (first phase), whereas the paired gene lineages still present 60 My after the WGD (node a in Fig. 1C) were more slowly lost after that time (second phase). Although rapid gene loss after the TGD has previously been suggested on the basis of analyses of specific gene families (8, 13), the actual temporal pattern of gene loss along a time-calibrated phylogenetic tree has not previously been examined by genome-wide analysis. Recently, the conventional view that WGD is followed by rapid gene loss has been questioned by studies of gene loss rate after lineage-specific WGD events in salmonids (17) and carp (18). This discrepancy may be resolved through detailed comparative analyses using the pipeline and gene loss model developed in this study in the light of the different possible mechanisms such as autopolyploidization vs. allopolyploidization (19).

We suggest that the rapid gene loss after the TGD detected here may be one of the key factors that shaped the basic structure of teleost genomes. Considering the close similarity of gene arrangements between zebrafish and medaka genomes (Fig. 1A, Left), we propose that the genome reshaping after the TGD was rapid, mostly completed before the divergence of zebrafish and other major teleosts (node a in Fig. 1 A and C). Further analysis of the eel transcriptome (SI Text and Dataset S3) suggested that the temporal gene loss pattern around the divergence of the eel lineage was also explained by the two-phase model, at least according to our tree-based analysis, implying that the genome reshaping was almost finished before the divergence of all extant teleosts.

We found that the ancestral genomic structure of bony vertebrates was conserved between the ancestor of amniotes and the teleost ancestor just before the TGD. The gene arrangements of the human and medaka genomes (Fig. 1A, Left) are different, probably due to the TGD and the subsequent genomic reshaping discussed above. Those of human and chicken exhibit substantial similarity, although partial chromosomal rearrangements (e.g., fusions, fissions, and translocations) are reported within amniotes (10). In addition, our preliminary analysis of the garpike genome implied that garpike has genomic gene arrangements similar to those of amniotes, in particular chicken (see also ref. 20). These observations further support the notion that the ancestral teleost genome was rapidly reshaped with extensive rearrangements after the TGD. WGD-associated genomic rearrangement is also reported in angiosperms (2). Detailed analysis of the garpike genome is expected to provide a comprehensive view of the origin and early evolution of bony vertebrate genomes.

Mechanisms of Gene Loss After TGD.

To understand the mechanisms of gene loss after the WGD event, we fitted different loss models to the observed data (Fig. 1C) and found that the two-phase model, which includes multiple gene loss, provided a better fit than did the single gene loss model. This suggests that a major cause of the rapid gene loss may have been multiple gene loss events, such as deletions of chromosomal segments containing multiple genes as suggested by studies of yeast (3), bird (21), and teleost (15) genomes. Furthermore, inactivation of coregulated gene groups (cis-regulated gene sets [e.g., in plant (16)] and trans-regulated gene networks) might also be regarded as the simultaneous loss of multiple genes. Well-annotated whole-genome data of representatives from the two basal teleost lineages will allow us to better trace the multiple-gene loss process in the first phase (Fig. 1C).

In the second phase of gene loss after TGD (Fig. 1D), the loss rate could have slowed for two possible reasons, which are not mutually exclusive. First, in the context of the two-phase model, the slow decay observed in the second phase can be explained by the decrease in loss of redundant multiple-gene blocks or coregulated gene groups, simply because fewer of them remain as time goes by after the WGD event. Second, there is the possible involvement of natural selection in retaining paired gene lineages. We had originally speculated that the rate of gene loss in the second phase was slowed mainly due to natural selection for the retention of paired gene lineages, each of which acquired a different function, potentially by neofunctionalization or subfunctionalization and dosage constraints (22, 23). To examine the existence of such selective forces, we added an additional parameter to model gene retention (specifying a fixed fraction of gene lineages for which both duplicates are marked as essential and thus protected from loss in the model) to the two-phase model and fitted it to the observed data. However, the most likely value of the parameter was estimated to be 0 (SI Text). Although this result indicates that our analysis of the temporal gene loss pattern cannot detect any selective force of gene retention, the analysis does not deny the existence of any such force. For example, a general observation that many isozyme loci appear to be present in multiple copies in teleost lineages, although present as single copies in tetrapods (ref. 19 and references therein), implies the existence of such forces retaining the paired gene lineages derived from the TGD event. Recently, Kassahn et al. (24) suggested that the TGD-derived paired gene lineages in teleost genomes generally show signs of having experienced subfunctionalization or neofunctionalization (22) from the analysis of protein domain repertoire and gene expression localization. Moreover, an additional analysis of the shared preserved paired gene lineages across all nine teleosts (Fig. S5) indicates that the number of such lineages (136 orthogroups) cannot be explained by chance (SI Text and Table S2). Also, an analysis comparing lengths of coding sequences (SI Text and Fig. S6) indicated that the longer genes, likely containing more protein domains and motifs, are more common among 1to2 than 1to1 orthogroups, as shown in previous studies (13, 25). These analyses also imply that natural selection contributed to retention of some of the paired gene lineages.

Table S2.

GO enrichment analysis for 136 orthogroups including shared preserved paired gene lineages across all nine teleosts from human-centered analysis

| Overrepresented GO ID | GO name | No. of the 136 orthogroups (129) | No. of all orthogroups (6,264) | Padj* |
|---|---|---|---|---|
| 31988 | Membrane-bounded vesicle | 39 | 912 | 0.003 |
| 31982 | Vesicle | 40 | 948 | 0.003 |
| 30335 | Positive regulation of cell migration | 11 | 108 | 0.024 |
| 2000147 | Positive regulation of cell motility | 11 | 108 | 0.024 |
| 31175 | Neuron projection development | 8 | 56 | 0.026 |
| 51272 | Positive regulation of cellular component movement | 11 | 109 | 0.026 |
| 40017 | Positive regulation of locomotion | 11 | 111 | 0.026 |
| 32879 | Regulation of localization | 32 | 714 | 0.026 |
| 44699 | Single-organism process | 101 | 3,819 | 0.026 |
| 30139 | Endocytic vesicle | 6 | 30 | 0.027 |
| 31410 | Cytoplasmic vesicle | 17 | 265 | 0.027 |
| 51049 | Regulation of transport | 26 | 529 | 0.027 |

*Adjusted P value from the results of 1,000 simulated null hypothesis queries.

Altogether, if the redundant gene lineages mostly disappeared in the first phase through multiple gene loss (Fig. S7), the low gene loss rate observed in the second phase (Fig. 1D) may reflect general gene turnover of established genes. Interestingly, the diversification of major lineages of teleosts [Fig. 1A; ∼27,000 spp. placed in 40 orders (26)] seems to have occurred mainly in this second phase.

Significance of Orthology Identification for Comparative Genomics.

The present gene tree-based pipeline can produce gene orthology identifications, anchoring orthologous or paralogous chromosomal regions (Fig. S8). Such a tree-based orthology identification approach may be crucial for comparative genome analysis in lineages that have experienced WGD, such as flowering plants and budding yeast (4). Nonetheless, we cannot rule out the possibility that tree-based analysis may fail to identify a gene that has evolved much more rapidly than other WGD-derived paired gene lineages (20), even if such a gene might be important for teleost evolution. Using the inferred orthology of physically separated chromosomal regions, one may find additional orthologous genes, as well as the traces of early multiple gene deletions, that cannot be identified solely by tree-based analysis. However, because our results are formulated in terms of gene lineages rather than individual genes, it is not straightforward to incorporate such a synteny analysis here.

Phylogenetic Marker Genes.

It should be noted that our pipeline offers unique opportunities to establish a large set of orthology-confirmed phylogenetic marker genes for bony vertebrates (including teleosts). So far, the phylogenetic relationships of teleosts have been analyzed using data obtained from mitochondrial genomes (27), nuclear genes [10–20 or so nuclear single-copy genes (11, 28)], or conserved noncoding elements (29). Most relationships among major teleost lineages have been resolved but controversies remain, especially within the Percomorpha (27) due to the limited availability of reliable nuclear sequence markers.

For further progress in the molecular phylogenetics of vertebrates, a greater number of reliable, orthology-confirmed nuclear gene markers is required. For teleosts, these should ideally be 1:1 single-copy genes (30) that lost one member of a pair after TGD but before teleost diversification. We found about 1,100 such genes (fish-evol.unit.oist.jp/db/TGD16/1139phyMarker_CDNAalignments.tar.gz) belonging to 1to1 orthogroups between four tetrapods and nine teleosts by excluding cases of reciprocal gene lineage loss between teleost lineages (Fig. S5). Because such patterns of loss between otocephalan and neoteleost lineages are unrecognizable in the tree-based approach, some of these markers need more detailed assessment that pays due attention to synteny. Our analysis suggests that rapid gene loss has left useful markers for the estimation of detailed phylogenetic relationships of teleosts and other vertebrates.

Materials and Methods

BLAST Search.

The human and medaka protein-coding sequences (amino acids) were used as queries for a BLASTP search (31) against all protein-coding sequences in 17 selected animal genomes (see below) in Ensembl release 76 (12) except for lancelet (Assembly, version 1.0: genome.jgi.doe.gov/Brafl1/Brafl1.home.html) (Fig. S1A1). These include 13,937 protein-coding loci in fruitfly (Drosophila melanogaster), 50,817 (predictions) in lancelet (Branchiostoma floridae), 16,671 in sea squirt (Ciona intestinalis), 10,415 in lamprey (Petromyzon marinus), 18,442 in Xenopus (Xenopus tropicalis), 18,596 in Anole (Anolis carolinensis), 15,508 in chicken (Gallus gallus), 20,368 (excluding genes on alternative genome assemblies and mitochondrial genes) in human (Homo sapiens), 23,042 in cave fish (Astyanax mexicanus), 26,459 in zebrafish (Danio rerio), 20,095 in cod (Gadus morhua), 21,437 in tilapia (Oreochromis niloticus), 20,379 in platyfish (Xiphophorus maculatus), 19,686 (excluding mitochondrial genes) in medaka (Oryzias latipes), 20,787 in stickleback (Gasterosteus aculeatus), 19,602 in Tetraodon (Tetraodon nigroviridis), and 18,523 in fugu (Takifugu rubripes). The resulting BLAST top 10 hits were screened using an E-value cutoff of $<10^{-3}$ (13). Where transcript variants existed for a single locus, only the longest sequence was used in the present analysis.
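The screening rules in this paragraph can be sketched in a few lines. This is a minimal illustration rather than the pipeline itself: the function names, the `(query, subject, evalue)` tuple layout, and the assumption that hits arrive sorted best-first within each query are all hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3  # cutoff stated in the text
MAX_HITS = 10          # "top 10 hits" per query


def screen_hits(hits):
    """Take the top MAX_HITS hits per query, then keep those below the cutoff.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples, assumed
    to be ordered best-first within each query (hypothetical input layout).
    """
    seen = defaultdict(int)
    kept = defaultdict(list)
    for query, subject, evalue in hits:
        seen[query] += 1
        if seen[query] <= MAX_HITS and evalue < E_VALUE_CUTOFF:
            kept[query].append(subject)
    return dict(kept)


def longest_per_locus(variants):
    """Map each locus to its longest transcript variant (first one wins ties)."""
    return {locus: max(seqs, key=len) for locus, seqs in variants.items()}


# Hypothetical toy input: 12 strong hits for q1, one strong and one weak for q2.
example_hits = [("q1", f"s{i}", 1e-5) for i in range(12)]
example_hits += [("q2", "a", 1e-10), ("q2", "b", 0.5)]
screened = screen_hits(example_hits)
```

Here `screened["q1"]` holds ten subjects and `screened["q2"]` holds only `"a"`, because the second hit fails the E-value cutoff.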

Alignment.

The primary sequences of the proteins obtained by the BLASTP search (Fig. S1A1) were aligned using MAFFT (32). The multiple sequence alignments were trimmed by removing poorly aligned regions using TRIMAL 1.2 (33) with option “gappyout.” Corresponding cDNA sequences were forced onto the amino acid alignment using PAL2NAL (34) to generate nucleotide alignments for later comparative analysis. Each gene sequence was checked and removed from the alignment as a spurious BLAST hit (Fig. S1A2) if it was shorter than 55% of the length of the query sequence in the unambiguously aligned sites. The percentage of removed BLAST hits ranged from 8% to 10%, leading to no significant change in our main conclusion.
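The 55% length rule can be illustrated as follows. This is a sketch under stated assumptions: the alignment is represented as a plain dictionary of already-trimmed aligned strings, `-` is taken as the only gap character, and the function names are hypothetical.

```python
MIN_FRACTION = 0.55  # threshold stated in the text


def ungapped_length(aligned_seq, gap_chars="-"):
    """Number of residues a sequence contributes to the trimmed alignment."""
    return sum(1 for residue in aligned_seq if residue not in gap_chars)


def drop_short_hits(alignment, query_id, min_fraction=MIN_FRACTION):
    """Remove sequences shorter than min_fraction of the query's aligned length.

    `alignment` maps sequence names to aligned strings of equal length
    (hypothetical layout; the query itself is always retained).
    """
    query_len = ungapped_length(alignment[query_id])
    return {
        name: seq
        for name, seq in alignment.items()
        if ungapped_length(seq) >= min_fraction * query_len
    }


# Hypothetical toy alignment: hitB covers only 5 of the query's 10 sites.
toy_alignment = {
    "query": "MKLVAAGGTT",
    "hitA": "MKLVAA----",
    "hitB": "MKLVA-----",
}
filtered = drop_short_hits(toy_alignment, "query")
```

With this toy input, `hitA` (6 of 10 sites) is retained and `hitB` (5 of 10 sites) is removed.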

Gene Tree Search.

Phylogenetic analyses were conducted by NJ and maximum-likelihood (ML) methods using the first and second codon positions of each aligned gene sequence, with bootstrap analysis based upon 100 replicates. To select tetrapod/teleost ortholog candidates primarily from the BLASTP hit sequences, an NJ analysis was first conducted using the software package Ape in R with the TN93 model (35) and gamma-distributed rate heterogeneity (36) (Fig. S1A3). Based on the resultant NJ tree, gene sequences with a branch length from the root more than three times that of the query sequence were removed to avoid including dubious sequences (Fig. S1A4), although the analysis without this option produced a similar result (SI Text).
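A possible reading of the long-branch deletion step is sketched below. The tree encoding (a child-to-parent map plus a branch-length map), the function names, and the interpretation of the rule as "root-to-tip distance greater than three times that of the query" are assumptions made for illustration only.

```python
def root_to_tip(parent, length, tip):
    """Sum branch lengths from a tip up to the root.

    `parent` maps each non-root node to its parent; `length` maps each
    non-root node to the length of the branch above it (hypothetical encoding).
    """
    distance = 0.0
    node = tip
    while node in parent:
        distance += length[node]
        node = parent[node]
    return distance


def long_branch_deletion(parent, length, tips, query_id, factor=3.0):
    """Return the tips whose root-to-tip distance is at most factor x the query's."""
    limit = factor * root_to_tip(parent, length, query_id)
    return [tip for tip in tips if root_to_tip(parent, length, tip) <= limit]


# Hypothetical toy tree: root -> n1 -> {query, geneA}; root -> geneB (long branch).
toy_parent = {"n1": "root", "query": "n1", "geneA": "n1", "geneB": "root"}
toy_length = {"n1": 0.1, "query": 0.1, "geneA": 0.3, "geneB": 0.9}
retained = long_branch_deletion(
    toy_parent, toy_length, ["query", "geneA", "geneB"], "query"
)
```

In the toy tree the query lies 0.2 from the root, so `geneB` at 0.9 exceeds the limit of 0.6 and is dropped, while `geneA` at 0.4 is kept.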

The resultant gene trees, however, often have some weakly supported nodes. In such cases, one needs to revise ambiguous nodes in comparison with the topology of the broadly accepted phylogenetic relationships—the species tree. For this purpose, we then conducted rearrangement/reconciliation analysis using a method implemented in NOTUNG (37) for the NJ (Fig. S1A5) gene tree in comparison with the species tree. As a first step, NOTUNG rearranges weakly supported nodes of the gene tree, to minimize duplication and extinction of genes, using parsimony with equal weights. We set the threshold to 70% for bootstrap support values of nodes. Then, the rearranged tree was reconciled with the species tree. Ortholog candidates were selected from the rearranged NJ tree (marked with open circles).

To identify orthologs and TGD-derived paired gene lineages, the selected ortholog candidates from rearranged NJ analysis were realigned and subjected to codon-partitioned ML analysis (Fig. S1B6). The analysis was performed by RAxML 7.2.8 (38), which invokes a rapid bootstrap analysis and a search for the best-scoring ML tree with the GTRGAMMA model [general time-reversible (39) with gamma-distributed rate heterogeneity]. The resulting ML trees were also subjected to rearrangement/reconciliation analysis for identification (Fig. S1B7). Gene trees with excessively long branch lengths (i.e., ≥2.0 base substitutions per site) between tetrapods and teleosts were removed from subsequent analysis.

Criterion-Based Orthogroup Selection.

To select reliable orthogroups, the ML trees derived from rearrangement/reconciliation analysis were filtered using a criterion (BS 70% criterion) based upon the bootstrap value of key nodes (Fig. S1C8): nodes BV (monophyly of the bony-vertebrate genes) and TO (monophyly of the teleost genes). For the filtration of orthogroups of 1to2 category, we evaluated bootstrap support values for two additional key nodes: nodes D1 and D2 (monophylies of daughter clades Teleost-1 and -2). For a cutoff value, we set the bootstrap probability at 70% for the key nodes to avoid including orthogroups identified with ambiguous gene trees.
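The BS 70% criterion amounts to a simple filter on key-node support values. The sketch below is illustrative only: the dictionary layout and function name are hypothetical, "at 70%" is read as "at least 70", and, following the SI Text remark that 1to2 orthogroups have three or four key nodes, D1 and D2 are checked only when a support value exists for them (with at least one required).

```python
BS_CUTOFF = 70  # bootstrap cutoff stated in the text


def passes_bs70(category, support, cutoff=BS_CUTOFF):
    """Check the key-node bootstrap supports of one orthogroup.

    `support` maps key-node labels ("BV", "TO", "D1", "D2") to bootstrap
    percentages (hypothetical layout). BV and TO are always required; for
    the 1to2 category, every available daughter-clade node must also pass.
    """
    required = ["BV", "TO"]
    if category == "1to2":
        daughters = [node for node in ("D1", "D2") if node in support]
        if not daughters:
            return False  # no daughter-clade support to evaluate
        required += daughters
    return all(support.get(node, 0) >= cutoff for node in required)


# Hypothetical support values for three orthogroups.
ok_1to1 = passes_bs70("1to1", {"BV": 95, "TO": 70})
weak_1to2 = passes_bs70("1to2", {"BV": 95, "TO": 88, "D1": 65, "D2": 90})
ok_1to2 = passes_bs70("1to2", {"BV": 95, "TO": 88, "D1": 75})
```

Because 1to2 orthogroups must clear more nodes than 1to1 orthogroups, a filter of this kind is expected to remove proportionally more of them, which is the bias discussed in SI Text.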

GO.

To investigate gene functions of orthogroups in the 1to2 category, GOs were analyzed using FuncAssociate 2.0 (40). GOs were assigned to each orthogroup on the basis of the Ensembl gene ID of the human ortholog as a representative of each orthogroup, because human genes are generally well characterized with respect to their gene function. After the GO assignments, we tested whether the particular GOs were overrepresented or underrepresented in the 1to2-category orthogroups (1,102 including human genes) compared with all orthogroups belonging to 1to2 and 1to1 categories (6,264 including human genes). Statistical significance of the odds ratio was assessed by Fisher’s exact test with adjusted P value (Padj) based on 1,000-fold replication of null-hypothesis simulations conducted by FuncAssociate.
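For readers unfamiliar with the test, a one-sided Fisher's exact test for overrepresentation reduces to a hypergeometric tail probability. The sketch below is not FuncAssociate and does not reproduce its simulation-based adjustment; the function name is hypothetical, and the example counts are taken from the first row of Table S2 (39 of 129 annotated orthogroups versus 912 of 6,264) purely as an illustration.

```python
from math import comb


def overrepresentation_p(k, n, K, N):
    """One-sided Fisher's exact (hypergeometric) P value for enrichment.

    k: orthogroups in the study set carrying the GO term
    n: size of the study set
    K: orthogroups in the background carrying the GO term
    N: size of the background
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n), unadjusted.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Counts from the "Membrane-bounded vesicle" row of Table S2.
p_vesicle = overrepresentation_p(39, 129, 912, 6264)
```

The value returned here is a raw P value; the Padj values reported in Table S2 come from the resampling procedure in FuncAssociate and are therefore not expected to match it.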

Temporal Pattern of Gene Loss/Persistence.

The overall temporal pattern of gene loss/persistence following the TGD was explored on the basis of our orthogroup gene trees and time-calibrated teleost phylogeny (11). In this study, “gene loss” indicates the absence of a gene from a specific lineage in an identified orthogroup in the post-TGD evolution (Fig. 1A). In the pipeline analysis, an inferred gene loss may also be due to the absence of sequences from the Ensembl database and/or sequencing errors. The sequence data retrieved from Ensembl (pep/cdna.all) do not include genes supported neither by detection of their product nor by prediction from cross-species sequence analysis (see ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/pep/README for details).

The number of gene lineage losses that occurred among ancestral nodes of the teleost tree (Fig. 1A) was estimated using the parsimony method of Sato et al. (13) (Table S3). Next, the estimated number of remaining gene lineage pairs after the TGD was used to plot the temporal pattern of gene loss (Fig. 1C). In this plot, the number of gene lineage pairs at TGD, ancestral nodes (a to h), and current teleosts (Ca, Ze, Co, Ti, Pl, Me, St, Te, and Fu) were used as the vertical coordinate, and the absolute time of the TGD and teleost divergence (11) was used as the horizontal coordinate. We developed the two-phase model by extending the classical exponential model for single-gene loss (14) by including the possibility of multiple gene loss and by considering the underestimation bias of parsimony inferences (SI Text). The data points were fitted by the one-phase [exponential: $\alpha e^{2\mu t}$ (14)] and two-phase models of gene loss, and parameters were estimated by the least-squares method using the software package R. The time ranges of TGD and ancestral nodes were taken from the literature (11, 28). For the analysis of gene loss process, the tacit assumption was that the genome rediploidization processes have been completed before the divergence of clupeocephalan lineages.
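As an illustration of least-squares fitting for the one-phase model only, the sketch below fits $N(t) = \alpha e^{2\mu t}$ to a set of points. The fitting in the study was done in R; this Python version, the function name, the golden-section search, the search bounds, and the synthetic data points are all assumptions made for illustration, and the two-phase model (defined in SI Text) is not reproduced here. With the exponent written as in the text, a decaying curve corresponds to a negative fitted $\mu$.

```python
import math


def fit_one_phase(times, counts, rate_bounds=(-0.1, 0.0), iters=200):
    """Least-squares fit of counts ~ alpha * exp(rate * t), with rate = 2 * mu.

    For a fixed rate the optimal alpha has a closed form, so the fit reduces
    to a one-dimensional golden-section search over the rate.
    Returns (alpha, mu).
    """
    def best_alpha(rate):
        basis = [math.exp(rate * t) for t in times]
        return sum(y * b for y, b in zip(counts, basis)) / sum(b * b for b in basis)

    def sse(rate):
        alpha = best_alpha(rate)
        return sum((y - alpha * math.exp(rate * t)) ** 2
                   for t, y in zip(times, counts))

    lo, hi = rate_bounds
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        left = hi - phi * (hi - lo)
        right = lo + phi * (hi - lo)
        if sse(left) < sse(right):
            hi = right   # minimum lies in [lo, right]
        else:
            lo = left    # minimum lies in [left, hi]
    rate = (lo + hi) / 2
    return best_alpha(rate), rate / 2


# Synthetic, noise-free points (not the data of Fig. 1C); times in My after TGD.
true_alpha, true_mu = 1000.0, -0.004
toy_times = [0, 60, 100, 200, 310]
toy_counts = [true_alpha * math.exp(2 * true_mu * t) for t in toy_times]
fitted_alpha, fitted_mu = fit_one_phase(toy_times, toy_counts)
```

On noise-free synthetic input the search recovers the generating parameters; on real data the residual sum of squares from this fit would be compared with that of the two-phase model.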

Table S3.

The number of orthogroups identified between tetrapods and teleosts for 130 human protein-coding genes used in Sato et al. (13)

| Gene/orthogroup category | Sato et al. | This study,* without criterion | This study,* with criterion† |
|---|---|---|---|
| Loci in human genome | 130 | 130 | |
| No clear orthogroups | 11 | 8 | |
| Putative orthogroups | 119 | 122 | 58 |
| Multiply counted orthogroups | 3 | 5 | 3 |
| Identified orthogroups | 116 | 117 | 55 |
| 1to2 category (paired by TGD) | 45 | 48 | 17 |
| 1to1 category (losing one gene lineage) | 71 | 69 | 38 |
| TGD-derived paired gene lineages/total loci (gene lineages) in common ancestor | 38.79% (45/116) | 41.02% (48/117) | 30.90% (17/55) |

*For the results from the 130 human protein-coding gene analyses, see fish-evol.unit.oist.jp/db/TGD16/SHN/SB70BL10/130results.html.

†Orthogroups fulfilling the BS 70% criterion (Fig. S1C).

SI Text

Orthology/Paralogy Identification from Genome-Wide Analysis.

To accurately identify tetrapod/teleost orthologs and teleost genome duplication (TGD)-derived paralogs (paired gene lineages), orthogroups were estimated (Table 1) on the basis of phylogenetic analysis using all 40,054 protein-coding gene sequences from human (20,368) and medaka (19,686) as queries through our automated pipeline (Fig. S1).

Human-centered analysis.

First, our automated analysis (Fig. S1 A–C) was conducted using 20,368 human protein-coding gene sequences as queries. The analysis identified 15,559 putative orthogroups (Table 1). No clear orthogroups were found for the remaining 4,809 genes for the following reasons (Dataset S1): (i) human query sequences shorter than 100 aa (818 loci; sequence length limitation for accurate gene tree estimation), (ii) no hits in BLAST search from teleost genomes with E-value cutoff of $<10^{-3}$ (1,280 loci), (iii) no clear orthologs identified between tetrapods and teleosts in the estimated gene tree (1,874 loci), (iv) limited length of unambiguously aligned sites for accurate gene tree estimation (135 loci), and (v) failure of BLAST search or amino acid translation due to erroneous sequence data (702 loci). Regarding the reliability of gene phylogenies, 6,673 out of 15,559 putative orthogroups fulfilled our criterion of 70% bootstrap support for key nodes (the BS 70% criterion in Fig. S1C). We considered these 6,673 orthogroups to be reliable and the 15,559 to be putative.

Integration with medaka-centered analysis.

To obtain the final dataset, we estimated the orthogroups lost in the tetrapod lineage on the basis of medaka genome analysis. The same analysis as above (Fig. S1 A–C) was applied to 19,686 protein-coding gene sequences of medaka as representative sequences of a major teleost lineage (Table 1). As a result, we identified 15,552 orthogroups, which is almost the same number of putative orthogroups found in the human-centered analysis (15,559). Among these, we considered the 6,471 putative orthogroups as reliable orthogroups in the medaka-centered analysis because their gene trees fulfilled the BS 70% criterion. Of these, orthogroups including human genes were excluded because they were regarded to be specific to the human-centered approach, and the remaining 684 orthogroups were considered to have lost the corresponding gene in the lineage leading to human after the separation of ray-finned fish and tetrapods. We integrated the 684 orthogroups with those from human-centered analysis. Thus, we accurately estimated 7,357 orthogroups comprised of 6,673 and 684 orthogroups from human- and medaka- centered analyses, respectively. Of these, we excluded 465 orthogroups from the subsequent analyses due to their redundancy (Figs. S1D and S5D). Consequently, we used the 6,892 tetrapod/teleost orthogroups for the gene loss analysis in the present study.
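The bookkeeping in the two preceding subsections can be checked directly from the counts quoted in the text. The snippet below is only a consistency check; the variable names are hypothetical, and the 1to1/1to2 split (5,655 and 1,237) is the one quoted under Comparison of Synteny Information.

```python
# Counts quoted in SI Text for the human- and medaka-centered analyses.
human_queries = 20_368
putative_human = 15_559
no_orthogroup_reasons = {
    "query shorter than 100 aa": 818,
    "no teleost BLAST hit below the E-value cutoff": 1_280,
    "no clear tetrapod/teleost ortholog in the gene tree": 1_874,
    "too few unambiguously aligned sites": 135,
    "BLAST or translation failure": 702,
}
reliable_human = 6_673   # human-centered, BS 70% criterion
medaka_only = 684        # medaka-centered, no human gene
redundant = 465          # excluded as redundant
category_counts = {"1to1": 5_655, "1to2": 1_237}

unassigned = human_queries - putative_human
integrated = reliable_human + medaka_only
final_orthogroups = integrated - redundant

# The five listed reasons account for every query without an orthogroup.
assert unassigned == sum(no_orthogroup_reasons.values())
# The two categories partition the final dataset.
assert final_orthogroups == sum(category_counts.values())

print(unassigned, integrated, final_orthogroups)  # 4809 7357 6892
```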

Sensitivity Analysis of Automated Pipeline.

We tested the sensitivity of our automated pipeline (Fig. S1) by analyzing the dataset used in our previous study (13). Several studies analyzed the TGD-derived paired gene lineages based on the metazoan gene trees obtained from EnsemblCompara (9). It should be noted that the metazoan gene tree in EnsemblCompara may not be appropriate for the analysis of the genes derived from whole-genome duplications (WGDs), because the EnsemblCompara is not designed for such a purpose. On the other hand, our pilot study (13) used taxonomic sampling with special reference to the TGD-derived gene lineages by focusing on tetrapods and teleosts and manually conducted tree-based analysis, which we applied to this study. Based mainly on phylogenetic analysis, Sato et al. (13) explored teleost orthologs of 130 human protein-coding genes from four teleost genomes. As a result, they identified 116 orthogroups between tetrapods and teleosts, and 45 pairs of TGD-derived gene lineages among them (Table S3). This suggests that 38.8% (45/116) of protein-coding gene lineages were still duplicated (paired) in the MRCA of the four teleosts after the TGD event.

Application of our automated pipeline to the above 130 gene sequences produced almost the same result as that of Sato et al. (13). Among the 130 genes, orthogroups were identified for 117 genes (Table S3 and fish-evol.unit.oist.jp/db/TGD16/SHN/SB70BL10/130results.html). Of these, 48 were classified into the 1to2 category and 69 were classified into the 1to1 category (41.02% [48/(48+69)] gene lineage pairs in the MRCA). This proportion (41.02%) obtained by the automated analysis in the present study is close to the 38.8% obtained by the manual analysis in Sato et al. (13). Therefore, we conclude that our automated pipeline is reliable. In the present study, we mainly used the BS 70% criterion to exclude results derived from ambiguous gene trees even though this reduced the ratio of the number of 1to2 orthogroups to that of 1to1 orthogroups (down to 30.9% [17/(17 + 38)]). The underestimate of the number of 1to2 orthogroups compared with 1to1 orthogroups due to the use of the BS 70% criterion can be partly explained by the larger number of key nodes applied to the 1to2 orthogroups (three or four key nodes, BV, TO, D1, and/or D2, in Fig. S1C8) compared with the 1to1 orthogroups (two key nodes, BV and TO). To demonstrate that our main conclusion is not changed by the strict BS 70% criterion, we also provided the estimates obtained from the analysis without the criterion in the main text.

Comparison of Synteny Information.

We investigated the distribution of orthologs and TGD-derived paired gene lineages by mapping ortholog/paralogs inferred in orthogroups of the 1to1 (5,655) and 1to2 (1,237) categories onto chromosomes (Fig. S2).

To compare our synteny information with that of previous studies, we examined how many orthogroups of the 1to2 category are supported by previously published information on the results of analysis of a paralogous chromosomal region derived from WGD (5), namely doubly conserved synteny (DCS), in the medaka genome (supplementary table 15 in ref. 43). The comparison of the TGD-derived pair set estimated in this study and Kasahara et al. (43) showed that 48% of orthogroups in our 1to2 category (Dataset S1) have DCS support. The remaining 52% of orthogroups were not supported by DCS. Inconsistencies with respect to Kasahara et al. (43) would be reasonably explained by the following: in the DCS analysis, (i) an ortholog has been lost specifically in medaka but retained in some other teleosts, (ii) an ortholog has been lost only in human but retained in some other tetrapods, and (iii) synteny around the medaka ortholog was not conserved; and/or (iv) the identified gene lineage pairs derived from TGD or those from a local chromosomal duplication just after TGD cannot be distinguished in the present pipeline analysis with the available genome data, although we expect the number of latter cases would be small for the following reasons: (i) most of the lines shown between the identified TGD-derived gene lineage pairs (Fig. S2B) distributed widely across all chromosomes consistent with what would be expected as a consequence of TGD, and (ii) only 13 of 1to2 orthogroups (out of 1,237) contained multiple duplicates possibly derived from both the TGD and the local chromosomal duplications (Fig. S5F).

Counting Method of TGD-Derived Paired Gene Lineages in Current Teleost Genomes.

To count the number of TGD-derived gene lineage pairs in current genomes of teleost species (Fig. 1A), one needs to exclude genes derived from lineage-specific duplication. In the orthogroup of 1to2 category (Fig. S3A1), the absence or presence (0 or 1) of TGD-derived gene lineage pairs excluding lineage-specific duplicates is “1” for zebrafish and fugu (A2) because they have a TGD-derived gene lineage pair and “0” for cod because it has no pair. In the orthogroup of 1to1 category (Fig. S3B), on the other hand, the number of TGD-derived gene lineage pairs is “0” by definition. Summed over orthogroups, the number of TGD-derived gene lineage pairs is 692 on average (belonging to the 1,237 orthogroups of 1to2 category) as summarized in Fig. S3C and Fig. 1A.

To calculate the proportion of TGD-derived paired gene lineages in current teleost genomes, one needs to count the total number of gene lineages derived from TGD for each species (Fig. S3 A and B) as the denominator. In the orthogroup of 1to2 category (A1), the number of gene lineages (0, 1, or 2) excluding species-specific duplicates is assigned as “2” for zebrafish and fugu genomes (A2) because they have orthologs in both teleost daughter clades (Teleost-1 and -2), and “1” for cod genome because it has lost an ortholog in the Teleost-1 clade. In the orthogroup of 1to1 category (Fig. S3B), the numbers of gene lineages excluding lineage-specific duplicates are assigned as “1” for zebrafish and fugu and “0” for cod. In total, the number of gene lineages derived from TGD (belonging to the 6,892 orthogroups) is 6,378 on average as summarized in Fig. S3C. Taken together, the proportion of TGD-derived paired gene lineages in current teleost genomes can be calculated as 21.7% (692 × 2/6,378) on average.
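The counting rules of this section can be written compactly. In the sketch below, each orthogroup is a dictionary from daughter-clade name to the set of species present in that clade; this layout, the function names, and the toy orthogroups (modeled on the zebrafish/fugu/cod example in the text) are assumptions. Using sets records presence only, which is how lineage-specific duplicates are excluded here.

```python
def count_for_species(orthogroups, species):
    """Return (pairs, lineages) for one species.

    A species contributes one gene lineage per daughter clade in which it is
    present, and one TGD-derived pair when it is present in both clades.
    """
    pairs = 0
    lineages = 0
    for clades in orthogroups:
        present = sum(1 for members in clades.values() if species in members)
        lineages += present
        if present == 2:
            pairs += 1
    return pairs, lineages


def paired_fraction(pairs, lineages):
    """Proportion of gene lineages that belong to a TGD-derived pair."""
    return 2 * pairs / lineages


# Toy orthogroups following the example in the text.
og_1to2 = {
    "Teleost-1": {"zebrafish", "fugu"},        # cod has lost this lineage
    "Teleost-2": {"zebrafish", "fugu", "cod"},
}
og_1to1 = {"Teleost": {"zebrafish", "fugu"}}   # cod absent
toy_orthogroups = [og_1to2, og_1to1]

zebrafish_counts = count_for_species(toy_orthogroups, "zebrafish")  # (1, 3)
cod_counts = count_for_species(toy_orthogroups, "cod")              # (0, 1)

# Averages reported in the text: 692 pairs among 6,378 gene lineages.
average_fraction = paired_fraction(692, 6378)
```

With the averages quoted in the text, `average_fraction` is about 0.217, matching the reported 21.7%.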

Model Fittings Under Alternative Conditions.

To assess any possible bias due to the criterion requiring a bootstrap value of at least 70% (BS 70% criterion in Fig. S1C8), we estimated the gene loss process by using the gene lineage pairs obtained from the analysis without this criterion. In the original result (Table 1), the pipeline analysis produced 15,353 putative orthogroups. Among them, 4,262 orthogroups were classified into 1to2 and 11,091 were classified into 1to1 categories. This result (Fig. S4A) suggests that 72% [11,091/(4,262 + 11,091)] of gene lineages were rapidly lost during the initial 60 My (first phase) before node a, whereas the remaining paired gene lineages were more slowly lost after that time (second phase). When we fitted the two models to the plots obtained without the BS 70% criterion, the two-phase model still showed a better fit than the classical one-phase model. Thus, the gene loss process estimated without the BS 70% criterion also showed the rapid gene loss and can be expressed by the two-phase model. We used the BS 70% criterion in the present study to estimate the fraction of gene lineage pairs more precisely.

To examine the sensitivity of the two-phase model to divergence times, the plots were also fitted to the model using the time-scaled tree based on nuclear gene (28) and mitochondrial genome (42) data (Fig. S4 B and C). The two-phase model was also supported by these two time-calibrated trees. Note that the two-phase model (D) can be roughly approximated by a double-exponential curve (see below).

Analysis Including Eel Data.

To investigate the gene loss pattern before the divergence of major clupeocephalan lineages, we conducted the pipeline analysis using data including transcriptome data [45,975 protein-coding gene sequences (44)] of the European freshwater eel (Elopomorpha), as a representative of basal teleost lineage (Fig. S4E1). The same analysis shown in Fig. S1 was applied to 20,368 human and 19,686 medaka protein-coding genes. As a result (Dataset S3), we found that 85.1% [5,198/(911 + 5,198)] of gene lineages were rapidly lost during the initial 20 My between TGD and the divergence of the eel lineage (node i). The estimated gene loss pattern can be fitted by the two-phase model around node i (Fig. S4 E2 and E3).

The estimated number of retained gene lineage pairs in eel (105) is smaller than those of other teleosts (379–544). To examine this point, the orthologous relationships of genes between eel and the other teleost lineages need to be identified more carefully. The identification of the orthologs of the clupeocephalan genes from eel data, however, is a challenging issue due to the following technical reasons: (i) The transcriptome data of eel might not cover all protein-coding gene sequences. Some gene lineages are difficult to sequence because of differential expression patterns between the products of paired gene lineages through subfunctionalization. (ii) Some eel sequences might not be identified as a member of orthogroups in their gene trees. The poor taxon sampling from basal teleost lineages makes the eel orthologs difficult to identify, possibly misleading the phylogenetic analysis (45).

Furthermore, recent studies (20, 30) suggest that phylogenetic methods frequently fail in assessing the orthology of clupeocephalan to the two basal teleost lineages, whereas the assessment is often straightforward within clupeocephalans. Despite the uncertainties described above, the eel data are consistent with our two-phase model, and this is the best that can be achieved at present given the currently available genome data.

Example of Gene Tree for Gene Characterization.

To explain how to assign gene categories, typical examples of gene trees are shown in Fig. S5. The gene histories of some orthogroups (Fig. S5 E–H) were difficult to trace even though their gene trees fulfilled our criterion (Fig. S1C). Considering the numbers of these cases are small in comparison with the total number of identified orthogroups in 1to2 (1,237) or 1to1 (5,655) categories, we believe that they do not have a considerable impact on our major conclusion.

Fig. S5A (1to2 category).

The orthogroup (thick branch) of the ACAP3 gene (A1) is classified as belonging to the 1to2 category because it contains a clade with the human query sequence (bold) and its orthologs from tetrapods and teleosts forming two teleost daughter clades (filled and open triangles). The orthogroup of the NT5C1A gene (A2) contains shared preserved paired gene lineages across all nine teleosts.

Fig. S5B (1to1 category).

The orthogroup of the AGRN gene (B1) is classified as belonging to the 1to1 category because it contains the human query sequence and its tetrapod/teleost orthologs forming one teleost clade (open circles). The NOC2L gene (B2) is identified as a phylogenetic marker of vertebrates including teleosts because its orthogroup consists of 1:1 single-copy genes and can be considered to have lost one of its paired gene lineages after TGD but before teleost diversification.

Fig. S5C (no clear orthologs).

The gene tree of ATP6V1G3 shows no clear orthologous relationships between the human query and teleost gene sequences.

Fig. S5D (redundancy).

The orthogroup of the CLCNKB gene (D2) has the same human ortholog as the CLCNKA gene (D1), and these orthogroups were considered to be redundant. The former (D2) was excluded from the gene loss pattern analysis.

Fig. S5E (1to2 category with ambiguously clustered teleost orthologs).

The orthogroup (thick branch) of the DNALI1 gene includes clades containing only one lineage (open circles) placed at the outside of the identified teleost daughter clades (triangles). These single clades containing only one teleost lineage were ignored in the gene loss pattern analysis (Fig. 1C). Although the orthogroup can be classified as belonging to the 1to1 category, including tetraodontiform-specific duplication (Tetraodon and fugu), our pipeline identified this kind of orthogroup as belonging to the 1to2 category. Similar patterns were found in 58 orthogroups (Dataset S1).

Fig. S5F (1to2 category but having more than two daughter clades).

The orthogroup of the MC6AST2 gene clearly has more than two teleost daughter clades. These additional daughter clades might be derived from local chromosomal duplications just after TGD. Similar patterns were found in at least 13 orthogroups (Dataset S1).

Fig. S5G (1to1 category with one of paired gene lineages eliminated by long-branch deletion option).

The orthogroup of the SPZ1 gene (G1) was classified as belonging to the 1to1 category by an accidental deletion of the gene lineage with accelerated evolutionary rate from the gene tree (G2) obtained without the long-branch deletion (LBD) option (Fig. S1A4). Although such patterns were found in 26 orthogroups (fish-evol.unit.oist.jp/db/TGD16/3103_noLBD/results.htm), we left the LBD option in our pipeline analysis to exclude erroneously assembled sequences (see below).

Fig. S5H (reciprocal pattern of gene lineage loss).

The orthogroup in Fig. S5H1 (thick branch) is classified as belonging to the 1to2 category by identification of a reciprocal pattern of gene lineage loss. The orthogroup in Fig. S5H2 should actually belong to the 1to2 category but is misclassified as belonging to the 1to1 category by our tree-based method. The identification of reciprocal patterns of gene lineage loss, in cases such as those shown in Fig. S5 H1 and H2, depends upon the reconciliation of NOTUNG (37) by comparison with the species tree as follows: in the schematic orthogroup of Fig. S5H1, one of the teleost daughter clades consists of tetraodontiforms, and the other clade contains clupeocephalans and no tetraodontiforms. In this orthogroup, the tree-based approach identifies two daughter clades because tetraodontiforms are derived clupeocephalans. By comparing with the species tree, NOTUNG identifies the duplication event as occurring at the node marked D, and the orthogroup is classified as belonging to the 1to2 category. In this manner, reciprocal loss of gene lineages between daughter clades is detected in cases of the type depicted in Fig. S5H1. Such a reciprocal pattern of gene lineage loss was found in 10 orthogroups. In the schematic orthogroup shown in Fig. S5H2, one of the daughter clades consists of otocephalans and the other of neoteleosts. In this case, the tree-based approach identifies only one daughter clade, including both otocephalan and neoteleost lineages, because otocephalans are phylogenetically more basal than neoteleosts within the clupeocephalans, so NOTUNG does not detect any genome duplication event in nodes within the orthogroup by comparison with the species tree. As a result, the orthogroup is misclassified as belonging to the 1to1 category. Therefore, the reciprocal pattern of gene lineage loss is not detected in cases of the type illustrated in Fig. S5H2. 
Note that the tree-based approach cannot identify reciprocal gene lineage loss between the basalmost and other derived lineages.

Gene Function of TGD-Derived Paired Gene Lineages.

Using gene ontology (GO) annotations, we examined whether teleost genes in orthogroups belonging to the 1to2 category have any characteristics in gene function differing from those in the 1to1 category (Table S1). The results showed that there was an overabundance of genes coding for proteins involved in signaling and ion transport in the 1to2 category (main text) as seen by Kassahn et al. (24). Through a comparative analysis of five teleost genomes, Kassahn et al. (24) also pointed out that the TGD-derived paired gene lineages were enriched in proteins involved in cellular signaling, metabolism, and transcription. Their and our results may mean that various receptors, including chemosensory and immune receptors, and their downstream signaling molecules, tend to remain duplicated possibly due to subfunctionalization/neofunctionalization (22). Genome studies of budding yeasts, flowering plants, and land vertebrates have suggested that genes having functions of transcription and regulatory controls tend to be retained in duplicate after WGD, possibly due to constraint on their product–dosage balance (24). This kind of constraint may also have affected the retention of a part of the TGD-derived paired gene lineages we identified.

Our pilot study based on 130 gene sets (13) suggested that the TGD-derived gene lineages have significantly longer protein-coding sequences. In the analysis based on the genome-wide data of this study, we also found a significant difference in gene lengths between 1to2 and 1to1 orthogroups (Fig. S6). The peptides encoded by the genes belonging to 1to2 orthogroups tended to be long (>1,000 aa) rather than short (<200 aa) among nine teleosts (with the BS 70% criterion: $\chi^2$ = 1,110.8, df = 2, $P = 6.14 \times 10^{-242}$; without the criterion: $\chi^2$ = 327.2, df = 2, $P = 8.69 \times 10^{-72}$) and four tetrapods (with the criterion: $\chi^2$ = 289.8, df = 2, $P = 1.18 \times 10^{-63}$; without the criterion: $\chi^2$ = 89.1, df = 2, $P = 4.60 \times 10^{-20}$). Considering that longer sequences likely contain more protein domains and motifs, this result implies that some of the gene lineage pairs have been maintained by subfunctionalization (13, 25).

Investigation of Possible Action of Natural Selection on Pair Retention.

To test for the action of natural selection on the retention of TGD-derived paired gene lineages, we examined an alternative hypothesis that paired gene lineages may be retained solely by chance. This is one of the predictions of the passive or random loss model (6). We observed that 136 orthogroups retain gene lineage pairs among all nine teleosts (Fig. S5A2 and Dataset S1) and tested whether this number is consistent with chance or purely random loss in each lineage independently.

Under this alternative hypothesis, retention and loss of gene lineage pairs are considered to occur by chance in the ancestors of all nine species (Fig. 1A). Thus, whenever the genomes of two descendants of a MRCA have fewer TGD-derived gene lineage pairs than the MRCA, we assume that the choice of which gene lineage pairs are passed down to the descendants is purely random and also independent for the two descendants’ lineages. For example, if one were to assume that the MRCA of Tetraodon and fugu had $N_e$ TGD-derived gene lineage pairs, then, given that the Tetraodon genome contains $N_{Te}$ = 622 such pairs, the probability of any given pair being transmitted from this MRCA to Tetraodon would be $622/N_e$, or $N_{Te}/N_e$. In the case of fugu, the corresponding probability would be $N_{Fu}/N_e$. Assuming independent loss in each lineage, the probability that any given TGD-derived gene lineage pair present in the MRCA of Tetraodon and fugu would also be present in both Tetraodon and fugu must be $N_{Te}/N_e \times N_{Fu}/N_e$. Continuing in this manner, the probability of parallel retention among all nine species would be as follows:

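$$p = \left(\frac{N_f}{N_a} \times \frac{N_b}{N_a}\right) \times \left(\frac{N_{Co}}{N_b} \times \frac{N_c}{N_b}\right) \times \left(\frac{N_g}{N_c} \times \frac{N_d}{N_c}\right) \times \left(\frac{N_{St}}{N_d} \times \frac{N_e}{N_d}\right) \times \left(\frac{N_{Te}}{N_e} \times \frac{N_{Fu}}{N_e}\right) \times \left(\frac{N_{Ca}}{N_f} \times \frac{N_{Ze}}{N_f}\right) \times \left(\frac{N_{Ti}}{N_g} \times \frac{N_h}{N_g}\right) \times \left(\frac{N_{Pl}}{N_h} \times \frac{N_{Me}}{N_h}\right),$$

or

$$p = \frac{N_{Ca} \times N_{Ze} \times N_{Co} \times N_{Ti} \times N_{Pl} \times N_{Me} \times N_{St} \times N_{Te} \times N_{Fu}}{N_a^2 \times N_b \times N_c \times N_d \times N_e \times N_f \times N_g \times N_h}.$$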

The nine complete genome sequences of extant teleosts provide us with (Fig. 1A) $N_{Ca}$ = 736, $N_{Ze}$ = 778, $N_{Co}$ = 560, $N_{Ti}$ = 815, $N_{Pl}$ = 740, $N_{Me}$ = 612, $N_{St}$ = 674, $N_{Te}$ = 622, and $N_{Fu}$ = 691.

Our parsimony analysis provides us with lower bounds on the numbers of retained TGD-derived gene lineage pairs in the hypothetical MRCA genomes at nodes a–h (Fig. 1A). These lower bounds are $M_a$ = 1,237, $M_b$ = 982, $M_c$ = 967, $M_d$ = 875, $M_e$ = 769, $M_f$ = 976, $M_g$ = 909, and $M_h$ = 849, so we can write $N_a$ ≥ 1,237, $N_b$ ≥ 982, $N_c$ ≥ 967, $N_d$ ≥ 875, $N_e$ ≥ 769, $N_f$ ≥ 976, $N_g$ ≥ 909, and $N_h$ ≥ 849.

We conclude that, under the assumption of independent loss in each lineage,

$$p \le 0.047.$$

We therefore expect to see at most 1,237 × 0.047 ≈ 58 out of 1,237 orthogroups retain TGD-derived gene lineage pairs in all nine extant teleost genomes, and yet, in the present study, we observe 136 (Dataset S1). Taking the largest possible value of p consistent with our data (p = 0.047, which will make our analysis as conservative as possible), the predicted distribution of retained pairs is, under the assumption of random assortment, the binomial distribution B(1,237, 0.047). When gene lineage pairs are assumed to follow the hypothesis of random loss, the probability that 136 orthogroups retain gene lineage pairs in all nine teleosts is $P = 6.64 \times 10^{-21}$. Thus, our analysis suggests that about 78 (136 − 58) orthogroups may not have been retained by chance, but possibly by selection acting to preserve paired gene lineages arising from TGD.
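The bound on p and the expected number of orthogroups retained in all nine genomes can be recomputed directly from the counts quoted above. The following is a minimal Python sketch; the dictionary and function names are our own illustrative choices, and the calculation simply substitutes the parsimony lower bounds for the unknown ancestral counts, which yields the largest value of p consistent with the data.

```python
from math import prod

# Gene lineage pair counts in the nine extant genomes (values quoted from Fig. 1A).
extant_pairs = {"Ca": 736, "Ze": 778, "Co": 560, "Ti": 815, "Pl": 740,
                "Me": 612, "St": 674, "Te": 622, "Fu": 691}

# Parsimony lower bounds M_a..M_h for the ancestral nodes a-h.
node_lower_bounds = {"a": 1237, "b": 982, "c": 967, "d": 875,
                     "e": 769, "f": 976, "g": 909, "h": 849}


def retention_probability_upper_bound(extant, bounds):
    """Upper bound on p: lower bounds in the denominator maximize the ratio."""
    numerator = prod(extant.values())
    # N_a appears squared in the denominator; every other node appears once.
    denominator = bounds["a"] * prod(bounds.values())
    return numerator / denominator


p_max = retention_probability_upper_bound(extant_pairs, node_lower_bounds)
expected_pairs = node_lower_bounds["a"] * p_max

print(f"upper bound on p: {p_max:.3f}")
print(f"expected orthogroups retained in all nine genomes: {expected_pairs:.0f}")
```

Under these inputs the sketch should reproduce the bound of about 0.047 and the expectation of about 58 orthogroups quoted in the text.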

To investigate gene functions for the 136 orthogroups, GO analysis was conducted (Table S2). The analysis revealed that 12 GO terms were significantly overrepresented among the 136 orthogroups, with Padj < 0.05. Most of the highly significant genes associated with the 12 terms were those of proteins related to vesicle or positive regulation of cell migration/motility. On the other hand, no GO term was significantly underrepresented among the 136 orthogroups, with Padj < 0.05.

Phylogenetic Marker Selection.

Our pipeline analysis produced about 1,100 sets of phylogenetic marker genes for bony vertebrates with special reference to teleosts. In this study, these markers were identified only from genes belonging to 1to1 orthogroups. In principle, orthology-confirmed marker genes can also be selected from genes belonging to 1to2 orthogroups such as the 136 orthogroups including the paired gene lineages across all nine teleosts (Fig. S5A2). For phylogenetic questions based on the genome-wide data, transcriptome sequencing has been applied when whole-genome sequencing is impractical (46). Sequencing of some genes, however, is difficult because of differential expression patterns between the products of paired gene lineages through subfunctionalization. For this reason, we focused on phylogenetic markers from genes belonging to the 1to1 orthogroups for future phylogenetic study.

Effect of LBD on the Inference of Rapid Gene Loss.

We incorporated the LBD option (Fig. S1A4) to exclude erroneously assembled sequences from the downloaded datasets. However, when evolutionary rates are significantly different between gene lineage pairs (e.g., the POMC gene in ref. 20), the LBD option could incorrectly identify some orthogroups of the 1to2 category as belonging to the 1to1 category by exclusion of gene lineages with accelerated evolutionary rate (Fig. S5G). If this type of misidentification were to occur in a large number of orthogroups, the increase in the number of orthogroups of the 1to1 category could lead to an incorrect or exaggerated inference of rapid gene loss after TGD. To exclude the possibility that our inference of rapid gene loss could be an artifact caused by use of the LBD option, we conducted the analysis without this option and compared the result with that from the original analysis.

For the 3,103 gene analyses that used the LBD option in the original pipeline analysis, we also conducted an additional analysis without this option (fish-evol.unit.oist.jp/db/TGD16/3103_noLBD/results.htm). The analysis without the option identified 464 orthogroups as belonging to the 1to2 category without the BS 70% criterion. Among them, the analysis using the LBD option identified only 26 orthogroups as belonging to the 1to1 category with the BS 70% criterion by excluding gene lineages with accelerated evolutionary rates. For confirmation, when we replaced the 3,103 results from the original analysis with those from the corresponding gene analyses without the LBD option, the fraction of retained gene lineage pairs at node a [18.6%; 1,229/(1,229 + 5,394)] showed only a slight difference from that of the original analysis (18.0% in Table 1). Thus, we conclude that the possible exclusion of gene lineages with accelerated rates by the LBD option does not change our major conclusion that the gene loss process after TGD exhibits rapid gene loss and can be described by the two-phase model. Following the result of this sensitivity analysis, we retained the LBD option in the original pipeline analysis.

Sensitivity Analysis of Model Prediction.

To test the sensitivity of parameters for gene loss models estimated in Fig. 1C, the parameters were evaluated using 1,000 randomly generated plots (Fig. S7). To generate the 1,000 random plots for the sensitivity test, the values of estimated numbers of gene lineage pairs and times at ancestral nodes (a to h) and TGD were varied within their respective 95% confidence intervals. To generate the plots randomly, 95% confidence intervals of retained gene lineage pairs were estimated for each time point using the SD of the number of gene lineage pairs persisting in current teleosts. Times at ancestral nodes were randomly chosen within the range of 95% confidence interval of estimates, and the time of TGD was randomly chosen within the range of estimated times between the basal neopterygian and teleost nodes (11). In general, parameter estimations are not very sensitive to the fluctuations of 1,000 randomly generated plots and the two-phase model fits the randomly generated plots.

Gene Orthology Database.

All data (gene trees and sequence alignments, etc.) used in 21,706 gene analyses (20,368 human- and 1,338 medaka-centered analyses; Table 1 and Dataset S1) are accessible at the Orthology Database of Fish-Specific WGD-Derived Genes (FishOrthoDB: fish-evol.unit.oist.jp/cgi-bin/TGD.cgi). Fig. S8 shows the web page of results from GLI2 gene analysis as an example.

  • A: The number of orthologs identified from the GLI2 gene tree (B1).

  • B1: The GLI2 tree derived from rearrangement/reconciliation analysis of the ML tree (B2). Thick branches lead to the orthogroup members of GLI2. The gene tree indicates that the orthogroup of GLI2 is classified as belonging to the 1to2 category because its gene tree has two daughter clades of teleost orthologs (Teleost-1 and -2).

  • B2: The ML tree estimated using the selected candidate sequences of tetrapod/teleost ortholog based on rearranged NJ tree (B3).

  • B3: The rearranged NJ tree derived from NJ tree (B4). Thick branches lead to the ortholog candidates and open circles (connected with thin branches) indicate the outgroups.

  • B4: The NJ tree reestimated without the sequences leading to long branches (see below).

  • B5: The NJ tree based on primary sequences obtained from a BLAST search using the human GLI2 gene sequence as a query. If the tree contained a sequence whose branch length was very different from that of the query sequence, the sequence was removed from realignment and the NJ tree was reconstructed again (B4). See main text for details.

The results of the above process for 21,706 query gene sequences are summarized in the online database (FishOrthoDB). This database provides the basic information necessary for the estimation of the evolutionary history of teleost protein-coding genes.

Introduction of the Two-Phase Model.

Here, we explain the two-phase model, a mathematical expression of the common observation that most paired gene lineages (Fig. 1B) resulting from a WGD are lost over time. It is an extension of the passive (6) or random loss model (e.g., ref. 14), with only one additional component: The possibility of loss of multiple genes in single events, such as the deletion of a chromosomal segment or loss of coregulated gene groups (cis-regulated gene sets/trans-regulatory–associated gene networks). The model does not include mechanisms for preservation of gene lineage pairs such as subfunctionalization and/or neofunctionalization or selection for dosage balance. If, given data, it cannot be rejected as a null model, then, on that basis alone, we can neither state that we have evidence for such mechanisms, nor that we have evidence against them.

The model does not take into account the special properties of any specific genes or gene families. In particular, it does not attempt to model in any detail possible mechanisms leading to the loss of gene blocks or coregulated gene groups, although it is clear that segmental deletions resulting from errors in meiosis, illegitimate recombination, or double-strand break repair are candidates (47, 48). In principle, the model also allows for events that would result in the loss of expression (i.e., inactivation, rather than deletion) of single or multiple genes. Possible dependencies of mutation, inactivation, or deletion rates on 3D chromosome structure, or even chromosome length play no role in this model, and the chromosomal locations of genes are ignored. To be very clear, the model is designed to be compatible with a high or low rate of local gene order rearrangement. As a result of this in-built ambiguity, the two-phase model cannot be said to take hitchhiking (49) into account. Furthermore, the model treats gene lineages as basic units, not individual genes, and therefore ignores lineage-specific duplications, making no use of the number of genes in an orthogroup or gene lineage. The point we are trying to stress here is that the model is minimalist, and intentionally so. From a theoretical point of view, one advantage of this is that we can present an explicit solution, which we do here.

It will be useful to begin with a reformulation of the passive loss model, from the specific point of view we are taking. We make the following assumptions:

  • i) All genes that were present in the pre-WGD genome are essential.

  • ii) WGD-derived paired gene lineages are treated as indistinguishable, ignoring the possibility of subfunctionalization and/or neofunctionalization.

  • iii) When both WGD-derived paired gene lineages are present in the post-WGD genome, both are labeled as redundant.

  • iv) When both WGD-derived paired gene lineages are present in the post-WGD genome, the loss of one of them makes the remaining gene lineage essential.

  • v) The total number of gene lineages in the genome is always large (this justifies our use of a differential equation).

  • vi) Redundant gene lineages are deleted one by one, randomly and independently (passive model assumption only; see below for the two-phase model).

As a consequence, if both WGD-derived paired gene lineages are present in the post-WGD genome, simultaneous loss of both is forbidden.

Let f(t) denote the fraction of redundant gene lineage pairs in the post-WGD genome at time t. As a fraction, it only takes on values between 0 and 1:

$$0 \le f(t) \le 1.$$

[f(t) is the fraction of redundant gene lineage pairs at a given time t. If we were to have 15,000 singleton (essential) gene lineages and 10,000 WGD-derived gene lineage pairs (redundant) in a hypothetical genome, then the total number of orthogroups would be 25,000. A total of 10,000 of the total number of orthogroups would have redundant (paired) gene lineages, so, in this way of thinking, the fraction of redundant gene lineage pairs would be f(t)=10,000/25,000=0.4 at the given time t.] Because both WGD-derived paired genes are, according to our assumptions, considered to be redundant, all (i.e., 100% of) genes in the genome immediately after the WGD event (at t=0) are considered to be redundant, and we express this with the following initial condition:

f(0)=1. [S1]

The differential equation which expresses the passive model of redundant gene loss can be written as follows:

$$\frac{df(t)}{dt} = -\alpha \times f(t), \tag{S2}$$

where

α>0

is a constant that represents the single gene lineage loss rate. Because f(t) can be interpreted as the probability that a given, randomly chosen, orthogroup’s gene lineages are redundant (i.e., paired), one can think of f(t) as the probability of fixation of a single gene lineage inactivation or deletion.

The well-known solution to Eq. S2 with initial condition [S1] is a single-exponential decay curve:

$$f(t) = e^{-\alpha t}. \tag{S3}$$

This is the prediction of the passive or random loss model.

Assumptions of the Two-Phase Model.

Before altering any equations, let us state the assumptions we are making for the two-phase model. The first five are identical to what we wrote above for the passive loss model:

  • i) All genes that were present in the pre-WGD genome are essential.

  • ii) WGD-derived paired gene lineages are treated as indistinguishable, ignoring the possibility of subfunctionalization and/or neofunctionalization.

  • iii) When both WGD-derived paired gene lineages are present in the post-WGD genome, both are labeled as redundant.

  • iv) When both WGD-derived paired gene lineages are present in the post-WGD genome, the loss of one of them makes the remaining lineage essential.

  • v) The total number of gene lineages in the genome is always large (this justifies our use of a differential equation).

  • vi) Single redundant gene lineages can be deleted one by one, randomly and independently.

  • vii) Gene blocks or coregulated gene groups can also be lost if (and only if) they do not contain an essential gene, with the choice of gene block or coregulated gene set being random and independent of others.

Also note that assumption v (that the number of genes in the genome is always large, meaning that it is taken to be infinite) minimizes the influence of assumptions iv and vii, which prevent multiple (gene block/coregulated gene group) loss events from resulting in the loss of both WGD-derived gene lineages of any gene present in the pre-WGD genome. The reason is that, for an infinite number of genes, the chance of attempting to delete a block of multiple genes, containing both WGD-derived paired gene lineages of any pre-WGD gene, is vanishingly small. For this reason, there is no term in our differential equation [S4] (below) that takes either assumption iv or assumption vii into account.

Let us now add a new term to the differential equation, which will represent the possibility of loss of blocks of n genes (actually, gene lineages). Note that we must now relax the assumption that gene lineages are lost independently, to allow for the loss of blocks of multiple gene lineages. The new differential equation, which, together with the initial condition [S1], defines the two-phase model, is as follows:

$$\frac{df(t)}{dt} = -\alpha \times f(t) - n \times \beta \times \{f(t)\}^{n}, \tag{S4}$$

where

$$\beta \ge 0$$

is a constant that represents the rate of events in which n gene lineages are simultaneously lost. n appears as the exponent in $\{f(t)\}^{n}$, which is intended to represent the probability that, in a given random selection of n orthogroups, all of them consist of redundant gene lineage pairs. According to our model assumptions, a loss event can only be fixed if all gene lineages that would be lost are redundant. Therefore, the additional term in the two-phase model explicitly takes into account the possibility of single events that result in the loss of n gene lineages. Note that the two-phase model can only differ from the passive loss model if

n>1.

We suggest that n be interpreted as an average number of gene lineages lost in a single event in which multiple gene lineages are lost. It would be possible to formulate a more complex model, with a spectrum of block loss sizes, but we have chosen not to do this to avoid introducing more unknowns than we have data points.

If β=0, the two-phase model reverts to the passive loss model, and the solution is the single exponential decay given by Eq. S3.

If β>0, the solution to the two-phase model, given by differential equation [S4] and initial condition [S1], is as follows:

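$$f(t) = \left(\frac{\alpha}{\alpha + n \times \beta}\right)^{\frac{1}{n-1}} \times e^{-\alpha t} \times \left[\frac{\alpha + n \times \beta}{\alpha + n \times \beta \times \left(1 - e^{-(n-1)\alpha t}\right)}\right]^{\frac{1}{n-1}} = e^{-\alpha t} \times \left[\frac{1}{1 - \left(\frac{n \times \beta}{\alpha}\right) \times \left(e^{-(n-1)\alpha t} - 1\right)}\right]^{\frac{1}{n-1}}. \tag{S5}$$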
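For readers who wish to verify Eq. S5, one possible route is the standard substitution for a Bernoulli-type equation. The sketch below is not taken from the original derivation; the auxiliary function u(t) is a symbol introduced here for illustration, and the two-phase equation is taken in the form df/dt = −αf − nβf^n with f(0) = 1 and n > 1.

```latex
\begin{align*}
\frac{df}{dt} &= -\alpha f - n\beta f^{n}, \qquad f(0) = 1,\\
u(t) &:= \{f(t)\}^{1-n}, \qquad u(0) = 1,\\
\frac{du}{dt} &= (1-n)\,f^{-n}\,\frac{df}{dt} = (n-1)\left(\alpha u + n\beta\right),\\
u(t) &= \left(1 + \frac{n\beta}{\alpha}\right)e^{(n-1)\alpha t} - \frac{n\beta}{\alpha}
      = e^{(n-1)\alpha t}\left[1 - \frac{n\beta}{\alpha}\left(e^{-(n-1)\alpha t} - 1\right)\right],\\
f(t) &= u(t)^{-\frac{1}{n-1}}
      = e^{-\alpha t}\left[\frac{1}{1 - \frac{n\beta}{\alpha}\left(e^{-(n-1)\alpha t} - 1\right)}\right]^{\frac{1}{n-1}}.
\end{align*}
```

The last line agrees with the second form of Eq. S5, and setting β = 0 recovers the single-exponential decay of Eq. S3.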

Application to the Number of Retained WGD-Derived Gene Lineage Pairs.

To apply Eq. S5 to data involving actual numbers of WGD-derived gene lineage pairs, we need to multiply by the total number of orthogroups (i.e., the number of genes in the pre-WGD genome), N, which gives us the total number of retained WGD-derived gene lineage pairs at any given time:

F(t)=N×f(t). [S6]

In comparing with data, we therefore have a function F, which depends upon four parameters (α, β, n, and N) and the time elapsed since the WGD (t).
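As a numerical cross-check, the closed-form solution can be compared with a direct integration of the two-phase differential equation, taken here as df/dt = −αf − nβf^n with f(0) = 1. The following minimal Python sketch does this with a fixed-step fourth-order Runge-Kutta scheme. The function names and the parameter values (alpha = 0.002, beta = 0.01, n = 5) are illustrative assumptions, not the fitted values of this study; only N = 6,892 and the elapsed time of 306 My are taken from the text.

```python
import math


def f_two_phase(t, alpha, beta, n):
    """Closed-form fraction of retained pairs (Eq. S5); assumes n > 1."""
    if beta == 0:
        return math.exp(-alpha * t)  # passive loss model (Eq. S3)
    decay = math.exp(-(n - 1) * alpha * t)
    bracket = alpha / (alpha + n * beta * (1.0 - decay))
    return math.exp(-alpha * t) * bracket ** (1.0 / (n - 1))


def F_two_phase(t, N, alpha, beta, n):
    """Number of retained pairs (Eq. S6)."""
    return N * f_two_phase(t, alpha, beta, n)


def integrate_two_phase(t_end, alpha, beta, n, steps=20000):
    """Fixed-step RK4 integration of df/dt = -alpha*f - n*beta*f**n, f(0) = 1."""
    def rhs(f):
        return -alpha * f - n * beta * f ** n

    h = t_end / steps
    f = 1.0
    for _ in range(steps):
        k1 = rhs(f)
        k2 = rhs(f + 0.5 * h * k1)
        k3 = rhs(f + 0.5 * h * k2)
        k4 = rhs(f + h * k3)
        f += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return f


# Hypothetical parameter values, chosen only to exercise the formulas.
alpha, beta, n, N = 0.002, 0.01, 5, 6892
closed_form = f_two_phase(306, alpha, beta, n)
numerical = integrate_two_phase(306, alpha, beta, n)
print("closed form:", closed_form)
print("RK4        :", numerical)
print("F(306)     :", F_two_phase(306, N, alpha, beta, n))
```

Agreement between the two values to many decimal places is what one would expect if the closed form solves the differential equation.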

The correct interpretation of F(t) is that it is the two-phase model prediction for the number of retained WGD-derived gene lineage pairs in a single (organismal) lineage from the time of the WGD to the present. Given that the strength of the genome data analysis presented in this paper lies in comparisons between gene lineages based mainly upon genomic data collected only in the present, we must take an extra step of modeling the outcome of our pipeline parsimony analysis on the raw predictions of the two-phase model. Therefore, we must develop a model of the process of parsimony analysis. The goal is to overcome the inherent tendency of parsimony analysis to underestimate numbers of genes.

The central issue in parsimony analysis, as performed by our pipeline, is the inference of the pattern of presence and absence of gene lineage pairs in ancestral genomes, on the basis of the pattern of presence of pairs in extant genomes. If a pair is present in one or more of the extant genomes included in our analysis, then parsimony implies that the pair was present in their common ancestors. If a pair is not present in any of the extant genomes included in our analysis, then the logic of parsimony analysis dictates that we should infer the loss of that pair in a common ancestor (i.e., absence of the pair in the most recent common ancestor), although there remains the nonparsimonious possibility that the pair was in fact present in the most recent common ancestor but was lost independently in the several (organismal) lineages leading to the extant genomes included in our analysis. It is for this reason that it is expected that parsimony analysis will give us a lower bound on the number of pairs present in the most recent common ancestor.

In order that we can apply the principles of parsimony analysis to the two-phase model (or indeed any other such model, including the passive loss model), we need to make an assumption concerning the dependence or independence of paired gene lineage loss in separate organismal lineages. In the same minimalist spirit underlying the two-phase model itself, we have chosen to assume that paired gene lineage loss in separate organismal lineages is independent. We are aware of the fact that there are many reasons to expect convergent loss for specific gene families, but we once again take the point of view of the hypothetical “average gene,” for which convergent loss would not be expected.

With this assumption of independent loss, we can now proceed to estimate, from F(t), the fraction of gene lineage pairs which would be lost in all descendants of a given lineage at a given time, given a specific phylogenetic tree. This becomes our parsimony estimate of the number of pairs absent in the common ancestor. It is then a simple matter to estimate the number of pairs present in that same common ancestor.

Let us begin with the most recent common ancestor of Tetraodon and fugu, represented by node e in Fig. 1A. It is estimated to have lived around 41 Mya. The teleost WGD is estimated to have occurred around 306 Mya. Equivalently, we can say that the most recent common ancestor of Tetraodon and fugu is estimated to have lived 265 My after the TGD event. If we measure time in millions of years, the two-phase model would predict that the most recent common ancestor of Tetraodon and fugu had F(265) retained WGD-derived gene lineage pairs, and that the genomes of extant Tetraodon and fugu fishes have F(306) retained WGD-derived gene lineage pairs. Keeping within the confines of the two-phase model, we would then expect that F(265) − F(306) pairs of gene lineages have been lost. To be specific, we estimate that, compared with the most recent common ancestor of Tetraodon and fugu, the fugu genome has lost F(265) − F(306) paired gene lineages, and the Tetraodon genome has also lost F(265) − F(306) paired gene lineages. Our assumption of independent loss requires us to treat these losses as truly independent, meaning that we have the following (model) result:

The probability that a gene lineage pair, which was present in the most recent common ancestor of Tetraodon and fugu, has been lost in both the Tetraodon and also the fugu lineages is as follows:

$$P_e = \left(\frac{F(265) - F(306)}{F(265)}\right) \times \left(\frac{F(265) - F(306)}{F(265)}\right).$$

The two-phase model parsimony estimate for the number of retained gene lineage pairs in the most recent common ancestor of Tetraodon and fugu must then be as follows:

$$N_e = F(265) \times (1 - P_e),$$

because it is the number of gene lineage pairs that were not lost in both the Tetraodon and fugu lineages.

We can apply the same reasoning to any other model, such as the passive loss model, by substituting the appropriate function F. When we are speaking of the two-phase model, we mean the function F defined in Eq. S6, which is in turn defined in terms of the function f in Eq. S5. For the passive or random loss model, we can still define the function F using Eq. S6, but must use the function f defined in Eq. S3. In either case, we have a model-specific function F, which we use to compute model-specific values of Pe and Ne as shown above. We fit Ne to the value 769 in Fig. 1A.
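The calculation for node e can be written as a short Python sketch that works for any model-specific function F. The function names are ours, and the example F used below is a passive-loss curve with a made-up rate (alpha = 0.005 per My), chosen only for illustration; N = 6,892 and the times 265 and 306 My are taken from the text.

```python
import math


def both_lineages_lost(F, t_node, t_present):
    """Probability that a pair present at t_node is lost independently in
    both descendant lineages by t_present (the expression for P_e)."""
    per_lineage_loss = (F(t_node) - F(t_present)) / F(t_node)
    return per_lineage_loss ** 2


def parsimony_pair_count(F, t_node, t_present):
    """Model parsimony estimate N = F(t_node) * (1 - P) for a two-leaf node."""
    return F(t_node) * (1.0 - both_lineages_lost(F, t_node, t_present))


def F_example(t, N=6892, alpha=0.005):
    """Hypothetical passive-loss curve; alpha is an assumption, not a fit."""
    return N * math.exp(-alpha * t)


P_e = both_lineages_lost(F_example, 265, 306)
N_e = parsimony_pair_count(F_example, 265, 306)
print("P_e =", P_e)
print("N_e =", N_e, "(model value at node e, before fitting to 769)")
```

Because 1 − P_e is smaller than 1, the parsimony estimate always lies below F(265) and above F(306), which reflects the tendency of parsimony to underestimate ancestral pair counts.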

Let us now move on to node d of Fig. 1A, representing the most recent common ancestor of stickleback, Tetraodon, and fugu, which is estimated to have lived 201 My after the TGD. Once again, we will first compute the probability that a gene lineage pair present in this ancestor could be lost in all three lineages (stickleback, Tetraodon, and fugu), calling the result Pd. Using the same logic applied above, we can already state that

$$N_d = F(201) \times (1 - P_d).$$

Because we already have the probability that a gene lineage pair, present in the most recent common ancestor of Tetraodon and fugu, is lost in both Tetraodon and fugu (that is Pe), the calculation of Pd is not as complicated as it might have been. The idea is simple: We need to lose gene lineage pairs in the stickleback lineage and also in the most recent common ancestor of Tetraodon and fugu or, if they are still present in the most recent common ancestor of Tetraodon and fugu, in these lineages separately. The final component of the calculation is in fact Pe, which we earlier defined as the “probability that a gene lineage pair, which was present in the most recent common ancestor of Tetraodon and fugu, has been lost in both the Tetraodon and also the fugu lineages.”

The probability that a gene lineage pair, present in the most recent common ancestor of stickleback, Tetraodon, and fugu, is lost in the stickleback lineage is as follows:

$$\frac{F(201) - F(306)}{F(201)}.$$

The probability that a gene lineage pair, present in the most recent common ancestor of stickleback, Tetraodon, and fugu, is lost in the most recent common ancestor of Tetraodon and fugu is as follows:

$$P_{de} = \frac{F(201) - F(265)}{F(201)}.$$

The probability that a gene lineage pair, present in the most recent common ancestor of stickleback, Tetraodon, and fugu, is not lost in the most recent common ancestor of Tetraodon and fugu but is lost in both the Tetraodon and fugu lineages is as follows:

$$(1 - P_{de}) \times P_e.$$

Taking all of these results together, the probability that a gene lineage pair, which was present in the most recent common ancestor of stickleback, Tetraodon, and fugu, is lost in all these organismal lineages is as follows:

$$P_d = \frac{F(201) - F(306)}{F(201)} \times \left[P_{de} + (1 - P_{de}) \times P_e\right].$$

This style of argument works analogously for nodes b and g (of Fig. 1A) also, and nodes a and c are obvious extensions of the same idea. So far, we have been using our numerical best estimates for the number of million years after teleost-specific WGD of various branchings, but these are of course only estimates. If we write Te instead of our best estimate of 265 My, and Tp (where the “p” stands for “present”) instead of our best estimate of 306 My, etc., then we can write the full set of (probability) equations as follows:

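$$
\begin{aligned}
P_h &= \left(\frac{F(T_h) - F(T_p)}{F(T_h)}\right)^{2},\\
P_{gh} &= \frac{F(T_g) - F(T_h)}{F(T_g)},\\
P_g &= \frac{F(T_g) - F(T_p)}{F(T_g)} \times \left[P_{gh} + (1 - P_{gh}) \times P_h\right],\\
P_{cg} &= \frac{F(T_c) - F(T_g)}{F(T_c)},\\
P_f &= \left(\frac{F(T_f) - F(T_p)}{F(T_f)}\right)^{2},\\
P_{af} &= \frac{F(T_a) - F(T_f)}{F(T_a)},\\
P_e &= \left(\frac{F(T_e) - F(T_p)}{F(T_e)}\right)^{2},\\
P_{de} &= \frac{F(T_d) - F(T_e)}{F(T_d)},\\
P_d &= \frac{F(T_d) - F(T_p)}{F(T_d)} \times \left[P_{de} + (1 - P_{de}) \times P_e\right],\\
P_{cd} &= \frac{F(T_c) - F(T_d)}{F(T_c)},\\
P_c &= \left[P_{cg} + (1 - P_{cg}) \times P_g\right] \times \left[P_{cd} + (1 - P_{cd}) \times P_d\right],\\
P_{bc} &= \frac{F(T_b) - F(T_c)}{F(T_b)},\\
P_b &= \frac{F(T_b) - F(T_p)}{F(T_b)} \times \left[P_{bc} + (1 - P_{bc}) \times P_c\right],\\
P_{ab} &= \frac{F(T_a) - F(T_b)}{F(T_a)},\\
P_a &= \left[P_{af} + (1 - P_{af}) \times P_f\right] \times \left[P_{ab} + (1 - P_{ab}) \times P_b\right],
\end{aligned}
$$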

and our full set of parsimony estimates for the number of gene lineage pairs at nodes a to h (in Fig. 1A) as follows:

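$$
\begin{aligned}
N_h &= F(T_h) \times (1 - P_h) \approx 849,\\
N_g &= F(T_g) \times (1 - P_g) \approx 909,\\
N_f &= F(T_f) \times (1 - P_f) \approx 976,\\
N_e &= F(T_e) \times (1 - P_e) \approx 769,\\
N_d &= F(T_d) \times (1 - P_d) \approx 875,\\
N_c &= F(T_c) \times (1 - P_c) \approx 967,\\
N_b &= F(T_b) \times (1 - P_b) \approx 982,\\
N_a &= F(T_a) \times (1 - P_a) \approx 1{,}237,
\end{aligned}
$$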

where we have included the full genomic parsimony pipeline-derived estimates for these numbers in the rightmost column as targets for fitting.
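The full set of probability equations follows one recursive rule: for each child branch of a node, a pair is either lost along that branch or survives it and is then lost in every descendant of the child. The following Python sketch implements that rule for the tree topology implied by the equations above. Several inputs are assumptions: the function and variable names are ours, the example F is a passive-loss curve with a made-up rate, and the node times are placeholders except for T_d = 201, T_e = 265, and T_p = 306 My, which are quoted in the text.

```python
import math

# Tree topology implied by the probability equations; leaves are extant genomes.
CHILDREN = {
    "a": ["f", "b"],
    "f": ["Ca", "Ze"],
    "b": ["Co", "c"],
    "c": ["g", "d"],
    "g": ["Ti", "h"],
    "h": ["Pl", "Me"],
    "d": ["St", "e"],
    "e": ["Te", "Fu"],
}

T_PRESENT = 306  # My after the TGD (from the text)
# Node times in My after the TGD. Only d and e are from the text;
# the other values are hypothetical placeholders for illustration.
TIMES = {"a": 60, "b": 120, "c": 180, "d": 201,
         "e": 265, "f": 150, "g": 190, "h": 220}


def F_example(t, N=6892, alpha=0.005):
    """Hypothetical passive-loss curve; alpha is an assumption, not a fit."""
    return N * math.exp(-alpha * t)


def loss_in_all_descendants(node, F, times, t_present):
    """Probability that a pair present at `node` is absent from every
    extant genome descending from it, assuming independent loss."""
    if node not in CHILDREN:  # an extant genome: the pair is observed there
        return 0.0
    prob = 1.0
    for child in CHILDREN[node]:
        t_child = times.get(child, t_present)  # leaves sit at the present
        edge_loss = (F(times[node]) - F(t_child)) / F(times[node])
        below = loss_in_all_descendants(child, F, times, t_present)
        prob *= edge_loss + (1.0 - edge_loss) * below
    return prob


def parsimony_estimates(F, times, t_present):
    """Model parsimony estimates N_x = F(T_x) * (1 - P_x) for all nodes."""
    return {node: F(times[node])
            * (1.0 - loss_in_all_descendants(node, F, times, t_present))
            for node in CHILDREN}


estimates = parsimony_estimates(F_example, TIMES, T_PRESENT)
for node in sorted(estimates):
    print(node, round(estimates[node], 1))
```

Writing the rule recursively avoids transcribing the fifteen equations by hand and makes it straightforward to swap in a different F or a different set of node times.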

Now we can turn directly to the question of parameter fitting. In the case of the two-phase model, we have four parameters: N, α, β, and n. We already have eight equations (of the sort $N_a$ ≈ 1,237) that are suitable for a fitting procedure, but it is always advisable to have many more equations than parameters, so we can add nine equations from the WGD-derived gene lineage pair counts in the nine extant species’ genomes (note that these are not results of parsimony analysis):

$$
\begin{aligned}
F(T_p) &\approx 736,\\
F(T_p) &\approx 778,\\
F(T_p) &\approx 560,\\
F(T_p) &\approx 815,\\
F(T_p) &\approx 740,\\
F(T_p) &\approx 612,\\
F(T_p) &\approx 674,\\
F(T_p) &\approx 622,\\
F(T_p) &\approx 691,
\end{aligned}
$$

and we have the total number of orthogroups considered in the analysis, without which the node gene lineage pair counts would have no meaning, giving us a consistency requirement that

F(0)=6,892.

If we fit using this (absolute) requirement and the 17 equations above to the two-phase and passive loss models, then the results are as illustrated in Fig. 1C. Clearly, the passive loss model can be rejected, whereas the two-phase model fits very well.

Returning to the two-phase model fit, it is useful to note that the data points that are being fitted to are all in what we can call the “second phase” of decay, when the rate of loss of gene lineage pairs has slowed. This long tail is very well defined in the data, and we are therefore able to infer the value of α fairly precisely, which is useful because α plays the role of 2μ in standard models of gene loss via random mutations, except that it applies to gene lineages here instead of single genes. The connection is not accidental: A long time after WGD, the solution to the two-phase model approaches the simple (single) exponential decay associated with single gene lineage loss:

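$$f(t) \approx \left(\frac{\alpha}{\alpha + n \times \beta}\right)^{\frac{1}{n-1}} \times e^{-\alpha t} \quad (\text{for } t \text{ large}).$$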

The reason for this is that, in the context of our model, it becomes increasingly difficult to find new blocks of redundant gene lineages to delete as time goes by, so losses of blocks of genes are essentially never fixed long after a WGD. In the context of the two-phase model, this is the natural explanation of the slow decay observed in the “second phase” of gene loss. Given the uncertainties in our data, we must however caution against any denial of subfunctionalization or neofunctionalization, selection, or hitchhiking solely on the basis of our failure to reject the two-phase model. The two-phase model remains a null model, and one that aims only to explain “most” genes. Detailed studies of specific genes or gene families are required before one can make any concrete statements regarding any selection in the evolution of teleosts.

Approximation of Solutions to the Two-Phase Model by Double Exponentials.

It is sometimes reasonable to simply try to fit a curve to data, without an underlying model. The two-phase gene lineage loss we observe looks very much like a double-exponential function, which one might describe in terms of the following function:

$$\tilde{f}(t) = (1 - q) \times e^{-\alpha t} + q \times e^{-\gamma t},$$

where $0 \le q \le 1$.

Does Eq. S5 have an approximation of this sort? In a rough sense, the answer is yes. If we take a very low rate of block loss (i.e., β ≈ 0), we can consider a Taylor expansion of Eq. S5 around β=0:

$$f(t) \approx \left(1 - \frac{n\beta}{(n-1)\alpha}\right) \times e^{-\alpha t} + \frac{n\beta}{(n-1)\alpha} \times e^{-n\alpha t} + O(\beta^{2}),$$

where O(β²) represents the error (growing with β²). Note that we have the approximate relations q ≈ nβ/[(n − 1)α] and γ ≈ nα. For larger β, these relations cannot be expected to hold very accurately, but the general idea, that Eq. S5 has an approximation that looks like a double exponential, will continue to be valid.
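The quality of the double-exponential approximation can be checked numerically. The following Python sketch compares the closed-form two-phase solution with the first-order expansion, using q = nβ/[(n − 1)α] and γ = nα, and looks at how the error shrinks when β is halved. The function names and all parameter values are hypothetical choices for illustration only.

```python
import math


def f_two_phase(t, alpha, beta, n):
    """Closed-form two-phase solution (Eq. S5); assumes n > 1 and alpha > 0."""
    decay = math.exp(-(n - 1) * alpha * t)
    bracket = alpha / (alpha + n * beta * (1.0 - decay))
    return math.exp(-alpha * t) * bracket ** (1.0 / (n - 1))


def f_double_exponential(t, alpha, beta, n):
    """First-order (small beta) double-exponential approximation."""
    q = n * beta / ((n - 1) * alpha)
    gamma = n * alpha
    return (1.0 - q) * math.exp(-alpha * t) + q * math.exp(-gamma * t)


def approximation_error(t, alpha, beta, n):
    return abs(f_two_phase(t, alpha, beta, n)
               - f_double_exponential(t, alpha, beta, n))


# Hypothetical values: a small block-loss rate relative to alpha.
alpha, n, t = 0.01, 4, 50.0
error_full = approximation_error(t, alpha, 1e-4, n)
error_half = approximation_error(t, alpha, 5e-5, n)
error_ratio = error_full / error_half
print("error at beta = 1e-4:", error_full)
print("error at beta = 5e-5:", error_half)
print("ratio:", error_ratio)
```

A ratio close to 4 when β is halved is the signature of an error term that scales with β², as stated above.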

An Extension of the Two-Phase Model That Includes Subfunctionalization or Neofunctionalization.

Let us now consider including subfunctionalization or neofunctionalization. We will do so in a simple and straightforward manner, leaving out many facts that are known about these phenomena. The simplest change we can make to the two-phase model in this direction is to “reserve” some fraction (δ) of gene lineage pairs. This fraction of pairs will not be lost, but rather be said to become essential due to subfunctionalization or neofunctionalization. The new differential equation would be as follows:

$$\frac{df(t)}{dt} = -\alpha \times \{f(t) - \delta\} - n \times \beta \times \{f(t) - \delta\}^{n}.$$

The solution to this extended two-phase model is as follows:

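$$f(t) = (1-\delta) \times e^{-\alpha t} \times \left[\frac{\alpha}{\alpha - n \times (1-\delta)^{n-1} \times \beta \times \left(e^{-(n-1)\alpha t} - 1\right)}\right]^{\frac{1}{n-1}} + \delta.$$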

It has one more parameter than the original two-phase model. If we set δ=0, then we revert to the original two-phase model.
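As a sanity check on the extended model, one can verify numerically that the closed-form solution satisfies the differential equation, starts at f(0) = 1, and levels off at the reserved fraction δ. The sketch below uses a central finite difference for the derivative; the function names, the step size, and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def extended_two_phase(t, alpha, beta, n, delta):
    """Closed-form solution of the extended two-phase model with f(0) = 1."""
    denom = alpha - n * (1.0 - delta) ** (n - 1) * beta * (math.exp(-(n - 1) * alpha * t) - 1.0)
    return (1.0 - delta) * math.exp(-alpha * t) * (alpha / denom) ** (1.0 / (n - 1)) + delta


def rhs(f, alpha, beta, n, delta):
    """Right-hand side of the extended differential equation."""
    g = f - delta
    return -alpha * g - n * beta * g ** n


def ode_residual(t, alpha, beta, n, delta, h=1e-3):
    """Central-difference df/dt minus the right-hand side; should be close to 0."""
    dfdt = (extended_two_phase(t + h, alpha, beta, n, delta)
            - extended_two_phase(t - h, alpha, beta, n, delta)) / (2.0 * h)
    return dfdt - rhs(extended_two_phase(t, alpha, beta, n, delta), alpha, beta, n, delta)


# Hypothetical illustrative values (not the fitted parameters of the paper).
ALPHA, BETA, N = 0.002, 0.05, 3

for delta in (0.0, 0.1):  # delta = 0 corresponds to the original two-phase model
    worst = max(abs(ode_residual(t, ALPHA, BETA, N, delta)) for t in (1.0, 10.0, 100.0, 500.0))
    plateau = extended_two_phase(5000.0, ALPHA, BETA, N, delta)
    print(f"delta = {delta}: max |residual| = {worst:.2e}, f(5000) = {plateau:.6f}")
```

The residual check with δ = 0 doubles as a check of the original two-phase equation, because the extended equation reduces to it in that case.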

Acknowledgments

We thank S. Kinjo and the Information Service Section of the Okinawa Institute of Science and Technology Graduate University (OIST) for putting our database online. The manuscript has benefited from the comments of two anonymous reviewers. Cluster computing resources were provided by OIST and Human Genome Center of The University of Tokyo. This work was supported in part by funding from the Mathematical Biology Unit of OIST, and Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research 24770070 (to J.I.) and 21228005 (to K.T.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: All data are accessible at our institutional website: Orthology Database of Fish-Specific WGD-Derived Genes (FishOrthoDB) (fish-evol.unit.oist.jp/cgi-bin/TGD.cgi).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1507669112/-/DCSupplemental.
