Abstract
As the only surviving lineages of jawless fishes, hagfishes and lampreys provide a crucial window into early vertebrate evolution1–3. Here we investigate the complex history, timing and functional role of genome-wide duplications4–7 and programmed DNA elimination8,9 in vertebrates in the light of a chromosome-scale genome sequence for the brown hagfish Eptatretus atami. Combining evidence from syntenic and phylogenetic analyses, we establish a comprehensive picture of vertebrate genome evolution, including an auto-tetraploidization (1RV) that predates the early Cambrian cyclostome–gnathostome split, followed by a mid–late Cambrian allo-tetraploidization (2RJV) in gnathostomes and a prolonged Cambrian–Ordovician hexaploidization (2RCY) in cyclostomes. Subsequently, hagfishes underwent extensive genomic changes, with chromosomal fusions accompanied by the loss of genes that are essential for organ systems (for example, genes involved in the development of eyes and in the proliferation of osteoclasts); these changes account, in part, for the simplification of the hagfish body plan1,2. Finally, we characterize programmed DNA elimination in hagfish, identifying protein-coding genes and repetitive elements that are deleted from somatic cell lineages during early development. The elimination of these germline-specific genes provides a mechanism for resolving genetic conflict between soma and germline by repressing germline and pluripotency functions, paralleling findings in lampreys10,11. Reconstruction of the early genomic history of vertebrates provides a framework for further investigations of the evolution of cyclostomes and jawed vertebrates.
Subject terms: Phylogenetics, Molecular evolution, Evolutionary genetics, Genome evolution
A chromosome-scale genome assembly for the hagfish Eptatretus atami, combined with a series of phylogenetic analyses, sheds light on ancient polyploidization events that had a key role in the early evolution of vertebrates.
Main
Hagfishes are deep-sea scavengers with a prodigious capacity for producing slime12 (Fig. 1a). As one of only two surviving lineages of jawless fishes, hagfishes provide a unique comparative perspective on early vertebrate evolution. Both hagfishes and lampreys stand apart from jawed vertebrates (gnathostomes) in the absence of jaws, bone and dentine3, and they have been grouped together as cyclostomes13, the sister group to gnathostomes. However, hagfishes lack several key characteristics that are shared by lampreys and gnathostomes, including definitive vertebrae14, lensed eyes with oculomotor control and electroreceptive sensory organs1,2. The relative simplicity of the hagfish body plan suggests an alternative hypothesis whereby hagfishes diverged before a craniate clade that groups lampreys with jawed vertebrates3. Early molecular phylogenies (albeit with limited sequence datasets and taxonomic sampling) have consistently supported cyclostome monophyly15–17, which implies that hagfishes are secondarily simplified. But the molecular bases of this derived body plan in the light of the tumultuous genomic history of vertebrates are poorly understood.
Fig. 1. Phylogenetic relationships and syntenic architecture of cyclostomes and gnathostomes.
a, The brown hagfish, Eptatretus atami (photo credit, M. Suzuki). b, Summary of deuterostome phylogeny based on 176 selected genes (61,939 positions) using a site-heterogeneous model (CAT+GTR). This topology is robust to compositional heterogeneity and similar to what was obtained with 1,467 genes using a site-homogeneous partitioned model (see Methods, Supplementary Note 1 and Extended Data Fig. 2). c, Karyograms showing the ancestry of hagfish, lamprey and gar chromosomes in terms of chordate linkage groups (CLGs A1, A2 and B–Q) described previously19,35 (see also ref. 20 and Supplementary Note 2). Coloured bins contain 20 genes and only genes from CLGs with significant enrichment (Fisher’s exact test) are counted (Methods). Hagfish, lamprey and gar silhouettes downloaded from PhyloPic (credit to Gareth Monger for lamprey). d, Conserved syntenies show that hagfish chromosomes are typically fusions of multiple lamprey chromosomes. Lines connect orthologous genes and are coloured according to the ancestral chordate linkage groups (colour legend in c).
Early vertebrate evolution was punctuated by multiple polyploidizations, although the nature and timing of these ancient events, and their effects on vertebrate biology, remain elusive4–6. An early duplication preceding the gnathostome–cyclostome split (1RV) is generally accepted but has not been clearly resolved by molecular phylogenetics18. A gnathostome-specific allo-tetraploidization (2RJV) was definitively established on the basis of chromosomal rearrangements observed in gnathostomes but not lampreys11,19,20, leading to the rejection of the hypothesis that two rounds of genome duplication (1R and 2R) occurred before the cyclostome–gnathostome split21–23. Conversely, lampreys experienced additional independent duplication(s) not found in gnathostomes11,19,20. A hexaploidy inferred in lampreys was further hypothesized to be ancestral to cyclostomes20, on the basis of the observation that lampreys and hagfishes both possess six Hox clusters, although their respective orthology remains unclear24. Definitive resolution of the duplication history of cyclostome genomes must also account for the disparate karyotypes of hagfishes (2n ≈ 34 somatic chromosomes) and lampreys (2n ≈ 168) (refs. 7,11,25,26).
Notably, hagfishes25,27,28 and lampreys10,11,29 perform programmed elimination of germline-specific chromosomes from the genomes of somatic cells8,9. In lampreys, germline-specific chromosomes encode numerous genes with putative functions in the maintenance and development of the germline10,11. Although hagfishes were the first vertebrate species shown to experience developmentally programmed DNA elimination25, only germline-enriched satellite repeats have been characterized30–33. Because no germline-specific protein-coding genes have been reported in hagfish so far, their possible germline functions, evolutionary origin and relationship to germline-specific genes in lampreys have not been addressed.
Here we report a chromosome-scale genome assembly for the brown hagfish E. atami. Using a synteny-based phylogenetic approach, we definitively resolve and date the timing of duplication and divergence events that shaped the genomes of extant vertebrate lineages, and assess rediploidization after duplication. We further dissect the effect of these events on the emergence of genes that are involved in vertebrate and hagfish characteristics, and find that hagfish genomes are derived by extensive gene loss, consistent with their morphological simplification. We also find that hagfish genes that are programmatically eliminated during early embryonic development contribute to several aspects of germ cell biology, and reveal the evolution of vertebrate germline-specific chromosomes.
Evolution of cyclostome genomes
We sequenced the germline genome of the brown hagfish E. atami (formerly Paramyxine atami) using a combination of long and short reads from testes, and organized the assembly into chromosomes using proximity ligation data from somatic tissue (Supplementary Table 1). Our E. atami assembly spans 2.52 Gb and includes 17 large chromosomal scaffolds, consistent with the expected somatic karyotype (2n = 34) (Extended Data Fig. 1a and Supplementary Table 2). The length of the assembly is intermediate between the fluorescence-based estimates of genome size for somatic (2.01 Gb) and germline (3.37 Gb) cells25,28, consistent with k-mer estimates (2.02 Gb and 3.28 Gb, respectively, Extended Data Fig. 1b and Methods). The E. atami germline genome also includes seven highly repetitive chromosomes that are completely eliminated during development and whose sequences are present in our assembly as sub-chromosomal fragments, similar to what is seen in the highly repetitive germline-specific chromosomes of lampreys7,11 and songbirds34. We annotated 28,469 genes, of which 22,663 show similarity with the protein-coding complement of other species.
Extended Data Fig. 1. Genome content and architecture of E. atami.
a, Hi-C contact map visualizing the density of interactions between binned genomic regions in the proximity ligation data. The high contact regions are consistent with the 17 somatic chromosomes. b, Density of 21-mers of increasing multiplicity in the somatic (blood) and germline (testes) shotgun sequence data, indicating estimated genome sizes of 2.02 and 3.28 Gb, respectively. c,d, Repeat landscape summarizing the fraction of regions diverging from consensus repeats at varying levels of divergence (Kimura 2-parameter distance) in lampreys (c) and hagfish (d). Lamprey and hagfish show a markedly different profile with respect to the number and diversity of repetitive element classes.
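As an illustration of how a k-mer multiplicity spectrum translates into a genome-size estimate, the following minimal Python sketch divides the total number of non-error k-mers by the multiplicity of the main coverage peak. It is not the procedure used for E. atami (see Methods); the function name, the valley heuristic and the toy histogram are assumptions made for this example.

```python
def estimate_genome_size(hist):
    """Rough genome-size estimate from a k-mer multiplicity histogram.

    hist maps multiplicity -> number of distinct k-mers seen that many times.
    Hypothetical helper for illustration; real spectra need a fitted model.
    """
    mults = sorted(hist)
    # Sequencing-error k-mers pile up at low multiplicity. The first point at
    # which the histogram starts rising again is taken as the error/signal valley.
    valley = mults[0]
    for prev, cur in zip(mults, mults[1:]):
        if hist[cur] > hist[prev]:
            valley = prev
            break
    genomic = [m for m in mults if m >= valley]
    # The main peak approximates the k-mer sequencing depth.
    peak = max(genomic, key=lambda m: hist[m])
    # Total k-mer occurrences divided by depth approximates genome length.
    total_kmers = sum(m * hist[m] for m in genomic)
    return total_kmers / peak


# Toy histogram (invented numbers): an error tail at 1-3 and a peak at 10.
toy_hist = {1: 500, 2: 50, 3: 10, 8: 20, 9: 60, 10: 100, 11: 60, 12: 20}
toy_estimate = estimate_genome_size(toy_hist)
```

Applied separately to somatic and germline read sets, an estimate of this kind gives the two genome sizes whose difference reflects the eliminated fraction.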
We first used our hagfish gene set to test the monophyly of cyclostomes, extending pioneering early studies16,17 by introducing (i) broader taxonomic sampling of cyclostomes (including new data for the Atlantic hagfish Myxine glutinosa) and (ii) improved modelling of site heterogeneity and compositional bias (Methods, Supplementary Note 1 and Supplementary Table 3). A new set of 1,467 orthologues informed by complete hagfish and lamprey genomes alleviates possible paralogy issues, and includes eightfold more markers than earlier studies did (Methods). These analyses confirm the monophyly of cyclostomes with both partitioned analysis (Extended Data Fig. 2a) and site-heterogeneous model analysis (Fig. 1b and Extended Data Fig. 2b). Robustness to compositional heterogeneity is further supported by six-category amino acid recoding validated by posterior predictive tests (Extended Data Fig. 2c,d and Supplementary Table 4).
Extended Data Fig. 2. Phylogenetic reconstruction of deuterostome relationships with a focus on cyclostome position.
a, Tree reconstructed with IQ-TREE assuming the LG4X model using a dataset of 1,467 single-copy orthologues and a partitioned model. b, Tree reconstructed using PhyloBayes and a CAT+GTR+G4 model using a subset of 176 orthologues showing the lowest saturation (see Methods). c, Tree reconstructed using the same set of orthologues after Dayhoff six-category amino acid recoding to account for possible compositional heterogeneity due to high GC% in cyclostome genomes. d, z-score of posterior predictive analyses to assess compositional heterogeneity. Positive z-scores indicate that average amino acid diversity is underestimated (negative z-scores indicate an overestimation), which highlights the compositional bias existing in some lamprey and hagfish species and shows that recoding (Dayhoff 6) alleviates these biases.
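For readers unfamiliar with amino acid recoding, the sketch below shows what a Dayhoff six-category recoding does to a protein sequence: each residue is replaced by the label of its physicochemical group, which reduces sensitivity to compositional bias at the cost of some signal. The grouping is the commonly used Dayhoff-6 scheme; the digit labels and the function name are arbitrary choices for this example and are not taken from the Methods.

```python
# Commonly used Dayhoff six-category grouping of the 20 amino acids.
# The digit labels are arbitrary (an assumption of this sketch).
DAYHOFF6 = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}

# Flatten to a per-residue lookup table.
RECODE = {aa: label for group, label in DAYHOFF6.items() for aa in group}


def recode_dayhoff6(seq):
    """Recode a protein sequence; gaps and unknown symbols pass through."""
    return "".join(RECODE.get(ch, ch) for ch in seq.upper())


example = recode_dayhoff6("MKC-W")
```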
Despite their disparate karyotypes, the chromosomes of hagfish and lamprey are simply related (Fig. 1c,d), although after around 457 million years of independent evolution, gene order is highly scrambled (Extended Data Fig. 3a,b) and repetitive landscapes are distinct (Extended Data Fig. 1c,d). Each hagfish chromosome is typically orthologous to a fusion of between two and six lamprey chromosomes and, conversely, each lamprey chromosome is typically associated with a single hagfish chromosome, with few exceptions (Extended Data Fig. 3b). To differentiate between possible chromosomal fusions in the hagfish lineage and/or fissions (or duplication) in the lamprey lineage, we used the previously reconstructed ancestral chordate linkage groups (CLGs A1, A2 and B–Q)20,35 (Supplementary Note 2). Although lamprey chromosomes are typically derived from single CLGs (consistent with previous analyses of lamprey genomes19,20), hagfish chromosomes are evidently derived from these ancestral elements by irreversible fusions35, analogous to but distinct from the fusions observed on the gnathostome stem lineage19 (Fig. 1c). The direct and largely one-to-one segmental correspondence between hagfish and lamprey is consistent with the previous assumption20,24 that the cyclostomes share the same duplication history, although more detailed phylogenetic analysis is required to rule out scenarios of convergent duplications.
Extended Data Fig. 3. Comparison of the chromosomal architectures of cyclostome genomes.
a, Comparison between two lampreys (Lethenteron reissneri and P. marinus) highlighting the conservation of both chromosomal identity and extensive collinear segments. b, Comparison between the hagfish E. atami and the lamprey P. marinus. In both panels, dots show the relative location of orthologous genes between two species, coloured if the chromosome:chromosome enrichment is significant by Fisher’s exact test (Methods); others shown in grey. The colours in a and b are based on P. marinus and E. atami chromosomes, respectively. In b, P. marinus chromosomes are sorted to aid in visualizing many-to-one mappings shown in Fig. 1d.
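The chromosome:chromosome enrichment test mentioned in this legend can be sketched as a one-sided Fisher's exact test on a 2 × 2 table of orthologue counts. The Python example below computes the hypergeometric upper tail with the standard library only. Whether the original analysis was one- or two-sided, and how multiple testing was handled, is described in Methods and not reproduced here; the function name and the toy counts are assumptions.

```python
from math import comb


def enrichment_pvalue(shared, on_chr_a, on_chr_b, total):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    shared   -- orthologue pairs located on chromosome A in species 1
                and on chromosome B in species 2
    on_chr_a -- orthologue pairs on chromosome A (species 1)
    on_chr_b -- orthologue pairs on chromosome B (species 2)
    total    -- all orthologue pairs considered
    Returns P(X >= shared) under random placement of genes.
    """
    upper = min(on_chr_a, on_chr_b)
    # Sum integer counts first, divide once, to limit rounding error.
    numerator = sum(
        comb(on_chr_a, k) * comb(total - on_chr_a, on_chr_b - k)
        for k in range(shared, upper + 1)
    )
    return numerator / comb(total, on_chr_b)


# Toy case (invented counts): 20 pairs, 10 on each chromosome, all 10 shared.
p_perfect = enrichment_pvalue(10, 10, 10, 20)
# With no shared pairs the upper tail covers every outcome.
p_none = enrichment_pvalue(0, 10, 10, 20)
```

Dots in such a comparison would be coloured only when the p-value for their chromosome pair falls below the chosen significance threshold.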
Genome duplications in early vertebrates
We used two complementary phylogenetic approaches to fully resolve the sequence of early vertebrate polyploidization events: (i) model-based polyploidization inference from a large number of individual gene trees and (ii) concatenation of genes with similar evolutionary histories on the basis of chromosome-scale synteny. We first tested alternative scenarios for the sequence of whole-genome duplications (WGDs) based on 8,931 individual gene trees by probabilistic reconciliation of gene and species trees using WHALE36 (Methods). This analysis provided significant support for the occurrence of a single genome duplication in the vertebrate stem lineage (1RV), followed by independent polyploidizations on the gnathostome (2RJV) and cyclostome (2RCY) stem lineages (all Bayes factors BFNull_vs_WGD < 10⁻³) (Fig. 2a and Extended Data Fig. 4). By contrast, we found no support for a second round of polyploidization on the vertebrate stem lineage (2RV) or for polyploidization events specific to the lamprey or hagfish lineages (Fig. 2a and Extended Data Fig. 4), consistent with synteny-based analysis19,20 and Fig. 1d.
Fig. 2. History of genome duplications in vertebrates.
a, Probabilistic inference of polyploidization events in early vertebrate evolution on the basis of gene tree–species tree reconciliation (WHALE36; Extended Data Fig. 4, Supplementary Table 8 and Methods) supports an initial tetraploidization shared by all vertebrates (1RV), a jawed-vertebrate-specific tetraploidization (2RJV) and a cyclostome-specific polyploidization (2RCY). Supported polyploidization events (Bayes factors BFNull_vs_WGD < 10⁻³) are shown in colour (1RV, 2RJV and 2RCY) and non-supported ones in grey (2RV, hagfish-specific, lamprey-specific). The WHALE method cannot distinguish between tetraploidization and hexaploidization events. b, Paralogon-based polyploidization inference using molecular phylogenies reconstructed for each of the 17 informative CLGs (Supplementary Fig. 1). Successive polyploidization events during vertebrate evolution are shown as coloured polygons and the proportion of CLG trees displaying these duplication nodes is indicated below. c, Sample paralogon tree for CLGJ. As for gene trees, in paralogon trees some nodes correspond to speciation events (grey) and others to duplication events (coloured); both types of events can be dated using a molecular clock. Species and datasets used are listed in Supplementary Table 8, and dating was performed with PhyloBayes (Methods) using fossil calibrations reported in Supplementary Table 7. d, Molecular dating of the polyploidizations and speciation events in early vertebrate evolution. Divergence times are indicated for speciation nodes (grey) and duplication nodes (coloured as in a). In c,d, each node is labelled with the mean divergence time across CLGs. Ediac., Ediacaran; Cambr., Cambrian; Ord., Ordovician; Sil., Silurian; Devon., Devonian; Carbon., Carboniferous; Perm., Permian; Trias., Triassic.
Extended Data Fig. 4. Tests of genome duplication hypotheses on the vertebrate tree.
a, Species phylogeny and polyploidization hypotheses tested with WHALE36 using 8,931 gene families (Methods, see Supplementary Table 8 for details of the genomes used in the analysis). Polyploidization hypotheses are indicated by circles on the corresponding branches, with supported polyploidizations indicated with solid circles. Inferred background gene duplication and loss rates are presented on the branches. b, Posterior distribution obtained for the WHALE post-duplication retention parameter q, for each hypothesis presented in a. Stars indicate distributions significantly different from 0 (Bayes factors BFNull_vs_WGD < 10⁻³), which correspond to the supported polyploidization events. c, Alternative set of polyploidization hypotheses tested, as in a, but with two successive duplications proposed in the ancestral vertebrate lineage (1RV and 2RV). d, Posterior distribution obtained for the WHALE post-duplication retention parameter, for each hypothesis presented in c. Here, the posterior distributions for the retention parameters of the 1RV and 2RV events are bimodal, suggesting that the method cannot effectively separate parameters estimated for 1RV and 2RV when starting from identical priors. e, Use of distinct priors on 1RV (Beta(8, 2)) and 2RV (Beta(2, 8)) separates the estimated posterior distribution into distinct unimodal posterior distributions and provides support for a single shared 1RV event in the vertebrate stem lineage. This analysis was performed on a random subset of 1,000 gene families, to reduce computational time (Methods).
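To make the choice of priors in panel e concrete, the short Python sketch below computes the mean and mode of the two Beta distributions using the standard closed-form expressions. It only summarizes the priors named in the legend and does not reproduce any part of the WHALE analysis; the helper names are illustrative.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution (defined here for a > 1 and b > 1)."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)


# Priors on the retention parameter q named in the legend.
prior_1rv = (8, 2)  # favours high retention
prior_2rv = (2, 8)  # favours low retention

summary = {
    "1RV": (beta_mean(*prior_1rv), beta_mode(*prior_1rv)),
    "2RV": (beta_mean(*prior_2rv), beta_mode(*prior_2rv)),
}
```

The two priors are mirror images of each other, which breaks the label symmetry between two successive events on the same branch without changing the overall prior mass assigned to retention.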
We also developed a synteny-based approach that takes advantage of the shared evolutionary history of persistently linked genes to enhance the limited phylogenetic signal of individual gene trees and avoid the confounding effects of differential gene loss21,37,38. In this approach, we determined the duplication history of each CLG by concatenating genes from its derivatives in hagfish, lamprey and several jawed vertebrates (for example, chromosomes or chromosomal segments of the same colour in Fig. 1d; Methods). Within each species, the paralogous chromosome segments are called ‘paralogons’39. Because the CLGs are preserved in diverse invertebrates35, corresponding sets of chromosomally linked genes can be concatenated to provide outgroups for phylogenetic analysis and molecular dating. We reconstructed paralogon-based molecular phylogenies for 17 of the 18 CLGs or proto-vertebrate chromosomes (PVCs) (Supplementary Table 5); the 18th group (CLG A2 in the notation of ref. 35, and PVC18 in ref. 20) contains a relatively small number of genes consistently linked across vertebrate taxa and has anomalous properties in both lampreys and gnathostomes20.
Paralogon-based molecular phylogenies support a single early vertebrate auto-tetraploidization (1RV) before the cyclostome–gnathostome split, followed by a later gnathostome-specific allo-tetraploidization (2RJV) and a cyclostome-specific polyploidy (2RCY) (for example, CLGJ in Fig. 2c; Extended Data Fig. 5 and Supplementary Fig. 1). The 1RV duplication node precedes cyclostome–gnathostome speciation in 12 of the 14 CLG paralogon phylogenies with bootstrap support BP > 60 (Supplementary Table 5 and Supplementary Fig. 1). A single shared tetraploidization on the vertebrate stem (1RV) is therefore consistent with both probabilistic inference of genome duplications from single gene trees and paralogon-based phylogenies. Molecular dating indicates that these duplication and speciation events occurred in close succession, and we estimate that the divergence of 1RV paralogons was completed by around 527 million years ago (Ma) and that the cyclostome–gnathostome split occurred around 520 Ma.
Extended Data Fig. 5. Timescale of vertebrate genome evolution.
a. Distributions of timings for speciation and duplication events derived from paralogon phylogenies, showing details of the distributions indicated in Fig. 3c. b. Scenario for genome duplication and speciation events during early vertebrate evolution. Filled black circles or ovals mark speciation events; horizontal rectangles indicate presumptive auto-tetraploidies; starbursts indicate allo-polyploidies arising from hybridization of distinct progenitors (for example, alpha–beta in gnathostomes). Timings are based on a. Note that although speciation times (for example, the split between gnathostome progenitors alpha and beta, divergence of lamprey and hagfish lineages) can be estimated from gene or paralogon trees, hybridization times (for example, 2RJV, shown as green starburst) cannot be estimated from gene-tree analysis. Similarly, homoeologous recombination after auto-tetraploidization implies that the auto-tetraploidization event itself cannot be timed, but only the cessation of homoeologous recombination. Thus, the estimate of around 527 Ma for 1RV (horizontal blue rectangle) represents the cessation of recombination after this presumptive auto-tetraploidy (open rectangle on vertebrate stem) with homoeologous recombination represented by blue shading. The absolute timing of 1RV itself is unknown. (Auto-tetraploidy is suggested by the lack of differential gene loss between the two paralogous branches after 1R, as noted previously19.) The rough estimate of a 10-million-year interval between the alpha–beta split and 2RJV allo-tetraploidy is based on analogy with recent vertebrate allo-tetraploidies in frogs and goldfish. Cyclostome hexaploidization 2RCY is shown as a two-step process culminating in the hybridization of diploid and tetraploid stem cyclostomes (orange starburst). This scenario follows the recent model of hexaploidy in sturgeon in which auto-tetraploids and diploid species coexist and hybridize48.
In this scenario, the earliest divergences among cyclostome paralogues occur around 511 Ma when the diploid and future tetraploid lineages split, which could be coincident with the early tetraploidization itself. Homoeologous recombination (shown as orange shading) is largely complete by around 493 Ma, defining a second peak in paralogue divergence (horizontal orange rectangle). Not shown is ongoing homoeologous recombination in CLGB, which continues into the stem hagfish and lamprey lineages, as discussed further in the main text.
Our estimated date for the divergence of 1RV paralogons corresponds to the cessation of homoeologous recombination (rediploidization) rather than the WGD itself, as noted in a previous study40. We tested for lineage-specific rediploidization across CLGs (relative to the gnathostome–cyclostome divergence) by comparing the likelihoods of gene trees under the ancestral and lineage-specific rediploidization models, as previously proposed41–45 (Fig. 3a and Extended Data Fig. 6). We found that ancestral rediploidization after 1RV was supported by a larger number of significant gene trees for all CLGs (Fig. 3b), indicating that meiotic rediploidization was essentially complete by the time of the cyclostome–gnathostome split. This contrasts with other more recent vertebrate auto-polyploidizations, in which a number of homoeologous chromosomes have maintained tetrasomic inheritance through subsequent speciation events41–44. Unfortunately, molecular phylogenetics can only estimate a later bound on the timing of the 1RV auto-tetraploidization event itself, because it is likely to be obscured by a period of homoeologous recombination of unknown duration: the 1RV duplication event could have predated the divergence time of 1RV ohnologues (around 527 Ma) by millions of years40.
Fig. 3. Limited lineage-specific rediploidization after vertebrate genome duplications.
a, Gene-tree topologies expected under the ancestral rediploidization (left) and lineage-specific rediploidization (right) models after the 1RV (Methods and Extended Data Fig. 6). In the ancestral rediploidization scenario, paralogous gene sequences diverge before the cyclostome–gnathostome split and thus group by duplicated gene copy. In the lineage-specific rediploidization scenario, paralogue sequences diverge independently in the stem gnathostome and cyclostome lineages, and thus genes are grouped by lineage. b, Number of significantly supported gene trees in favour of ancestral and lineage-specific rediploidization scenarios after 1RV, for each of 17 informative ancestral linkage groups (CLGs). c, Tree topologies expected under the ancestral and lineage-specific rediploidization models after 2RCY. The CLGB paralogon tree shows an ancestral rediploidization topology for 1RV copy 2, but lineage-specific rediploidization for 1RV copy 1, where two hagfish (chr. 4 and chr. 8) and two lamprey (chr. 10 and chr. 2) paralogons independently rediploidized. Myr, million years. d, Evolutionary history of vertebrate Hox gene clusters resolved by the CLGB paralogon phylogeny (see bottom of c).
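The contrast between the two topologies in panel a can be expressed as a simple rule on a rooted four-gene tree: if each side of the root contains a single lineage, genes group by lineage (lineage-specific rediploidization); if each side contains both lineages, genes group by duplicated copy (ancestral rediploidization). The Python sketch below encodes that rule for nested-tuple trees. It is a toy classifier under those assumptions, not the likelihood-based topology test used in the analysis, and the gene names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    found = set()
    for child in tree:
        found |= leaves(child)
    return found


def classify(tree, lineage_of):
    """Classify a rooted tree whose root has exactly two children.

    lineage_of maps each leaf name to 'gnathostome' or 'cyclostome'.
    """
    left, right = tree
    sides = [{lineage_of[leaf] for leaf in leaves(side)} for side in (left, right)]
    if all(len(s) == 1 for s in sides) and sides[0] != sides[1]:
        return "lineage-specific"  # genes group by lineage
    if all(len(s) == 2 for s in sides):
        return "ancestral"  # genes group by duplicated copy
    return "unresolved"


# Hypothetical paralogues: copies a and b in each lineage.
LINEAGE = {
    "gn_a": "gnathostome",
    "gn_b": "gnathostome",
    "cy_a": "cyclostome",
    "cy_b": "cyclostome",
}
by_copy = classify((("gn_a", "cy_a"), ("gn_b", "cy_b")), LINEAGE)
by_lineage = classify((("gn_a", "gn_b"), ("cy_a", "cy_b")), LINEAGE)
```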
Extended Data Fig. 6. Method for construction of post-1R ancestral rediploidization constrained gene-tree topologies, using CLGM as an example.
a, Gnathostome 1R 1 and 1R 2 copies can be confidently identified and serve as a skeleton to build ancestral rediploidization tree topologies (blue-purple groups). Hagfish and lamprey chromosomes confidently grouped in a clade from the CLGM paralogon tree are defined as potential 1R-derived paralogons (yellow-orange groups) and kept together in the constrained ancestral rediploidization tree topology (see b). All sets of cyclostome chromosomes that were kept together for other CLGs are indicated in Supplementary Table 5. b, Possible groupings of hagfish and lamprey genes with gnathostome genes based on their chromosomal location, following 1R ancestral rediploidization. c. Genes located on hagfish and lamprey chromosomes that are not considered in the reconstructed paralogon tree (due to low representation because of small-scale rearrangements displacing them on different chromosomes) can each be placed on either side of the duplication in the absence of any prior information. In the presented scenario, this results in six different possible ancestral rediploidization (i to vi) constrained tree topologies. Only topologies with a maximum of three lamprey genes and three hagfish genes on each side of the 1R are permitted, to remove possibly confounding effects of complex multicopy gene families.
Distinct duplications in cyclostomes
Paralogon-based molecular phylogenies also strongly support and refine the 2RJV allo-tetraploidization scenario19,20 (Supplementary Table 6). Molecular dating of paralogon trees places the split of the pre-2RJV alpha and beta progenitors in the middle Cambrian, around 508 Ma (Fig. 2d, Extended Data Fig. 5, Supplementary Table 7 and Methods). The allo-tetraploidization event itself (the hybridization of alpha and beta progenitors and subsequent associated genome doubling), however, occurred some time after the alpha–beta divergence and cannot itself be precisely placed using molecular phylogenies. In recent vertebrate allo-tetraploids such as Xenopus46 and goldfish47, hybridization occurred within 10–15 million years of the divergence of progenitors. If we take these as analogies for gnathostome allo-tetraploidy, then 2RJV probably occurred in the late Cambrian, long before the origin of crown-group gnathostomes around 456 Ma near the middle–late Ordovician boundary.
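The arithmetic behind this placement can be written out explicitly. The Python sketch below subtracts the 10–15-million-year lag seen in the recent allo-tetraploids from the estimated alpha–beta split and compares the result with the crown-gnathostome date. Treating the analogy as a fixed lag range is an assumption of the example, and the variable names are illustrative.

```python
# Values quoted in the text (Ma = million years ago).
ALPHA_BETA_SPLIT_MA = 508
CROWN_GNATHOSTOMES_MA = 456
# Lag between progenitor divergence and hybridization, by analogy with
# Xenopus and goldfish (assumed here to apply unchanged to 2RJV).
LAG_RANGE_MYR = (10, 15)


def hybridization_window(split_ma, lag_range_myr):
    """Return (oldest, youngest) plausible hybridization dates in Ma."""
    shortest, longest = lag_range_myr
    return split_ma - shortest, split_ma - longest


oldest_ma, youngest_ma = hybridization_window(ALPHA_BETA_SPLIT_MA, LAG_RANGE_MYR)
# Time between the youngest plausible 2RJV date and crown gnathostomes.
gap_to_crown_myr = youngest_ma - CROWN_GNATHOSTOMES_MA
```

Under these assumptions the hybridization falls between about 498 and 493 Ma, leaving a gap of several tens of millions of years before the origin of crown-group gnathostomes.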
Among cyclostomes, paralogon trees confirm the general orthology of hagfish and lamprey chromosome segments (Fig. 2c), as suggested above (Supplementary Table 5 and Supplementary Fig. 1). We typically observed one or two duplication nodes for each CLG, indicating shared cyclostome genome-wide duplications that took place before the hagfish–lamprey split around 457 Ma. Although the nature of the cyclostome-specific duplications is difficult to decipher owing to extensive losses, the net effect appears to be hexaploidization20 without the obvious patterns of differential gene retention that are typical of allo-polyploidy and are observed in the gnathostome lineage19,20 (Extended Data Fig. 7).
Extended Data Fig. 7. Orthologue retention rates after 2RCY.
Retention is computed for each P. marinus chromosomal segment derived from a single CLG as the fraction of orthologues maintained on the segment relative to the total number of orthologues for the same CLG in Branchiostoma floridae. a, Distribution of retention rates plotted for all CLGs. b, Distribution of retention rates plotted for each CLG. These distributions are not distinctly bimodal, in contrast to the finding for 2RJV19. Lamprey is used because it more closely preserves the ancestral cyclostome state, as proposed previously20.
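The retention rate defined in this legend is a simple ratio, illustrated by the minimal Python sketch below. The segment names and counts are invented for the example, and the function name is an assumption; only the definition (orthologues kept on a lamprey segment divided by the amphioxus total for the same CLG) comes from the legend.

```python
def retention_rates(segment_counts, clg_totals):
    """Compute per-segment orthologue retention.

    segment_counts -- {(segment, clg): orthologues retained on that segment}
    clg_totals     -- {clg: orthologues assigned to that CLG in amphioxus}
    Returns {(segment, clg): retained / total}.
    """
    return {
        (segment, clg): retained / clg_totals[clg]
        for (segment, clg), retained in segment_counts.items()
    }


# Invented counts: three lamprey segments derived from one CLG.
toy_totals = {"J": 200}
toy_counts = {("seg_1", "J"): 50, ("seg_2", "J"): 40, ("seg_3", "J"): 30}
toy_rates = retention_rates(toy_counts, toy_totals)
```

A clearly bimodal distribution of such rates across segments would indicate biased retention between subgenomes, the pattern reported for 2RJV but not observed here.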
The bimodal distribution of divergence times observed between homoeologous (that is, paralogous) cyclostome chromosomes (peaks at around 511 and around 493 Ma; see Extended Data Fig. 5a) is consistent with two-step hexaploidy through the hybridization of diploids and related tetraploids (Extended Data Fig. 5b), as seen in sturgeons48. Although the near one-to-one relationship between orthologous hagfish and lamprey chromosome segments (Fig. 1d) suggests that rediploidization after 2RCY was largely completed by the origination of crown-group cyclostomes, we formally tested for lineage-specific rediploidization (Fig. 3c). From the estimated paralogon divergence times and concatenated paralogon tree topologies, we identified a single case of lineage-specific rediploidization that affected CLGB paralogons in hagfish and lampreys after 2RCY. Specifically, the paralogon pairs hagfish chr. 4–chr. 8 and lamprey chr. 10–chr. 2 descending from the 1RV copy 1 of CLGB each rediploidized independently in hagfish and lampreys, as shown by the CLGB paralogon phylogeny (Fig. 3c) and estimated paralogon divergence times (cyclostome split around 457 Ma; hagfish chr. 4–chr. 8 paralogon divergence around 431 Ma; lamprey chr. 10–chr. 2 paralogon divergence around 442 Ma). This result is confirmed by gene-tree topology tests, albeit with a small number of testable gene trees (Methods; 30 tested trees of which 7 support lineage-specific rediploidization and none support ancestral).
Evolution of vertebrate Hox clusters
Notably, CLGB contains the Hox cluster, a key locus in early analyses of vertebrate WGD (ref. 40). A more targeted phylogenetic analysis of concatenated Hox and bystander genes also recovered the same lineage-specific rediploidization tree topology as the full CLG paralogon analysis (Extended Data Fig. 8). Through our CLGB paralogon and Hox-plus-bystander trees, we fully resolve the evolutionary history of the four gnathostome and six cyclostome Hox clusters after the 1RV, 2RCY and 2RJV events (Fig. 3d). Although previous studies did not identify one-to-one relationships between Hox clusters in lampreys and hagfishes24, we report unambiguous orthologies between lamprey Hox ζ, α, γ and δ and hagfish Hox II, III, IV and VI, with groupings as follows: ζ–II, α–III, γ–IV and δ–VI. By contrast, after lineage-specific rediploidization of CLGB, no true (that is, one-to-one) orthology relationships exist between lamprey Hox β and ε and hagfish Hox V and I, which should be considered as ‘tetralogues’49.
Extended Data Fig. 8. Phylogeny of the Hox clusters on the basis of a concatenation of Hox genes and bystanders.
Left, phylogeny of the Hox clusters, with node bootstrap support. One-to-one orthologies are well supported for gnathostome clusters and for cyclostome clusters, with the exception of hagfish V–I and lamprey β–ε. Dark grey boxes highlight cyclostome clusters that are expected to be orthologous to gnathostome clusters A/B on the basis of chromosomal orthology (Fig. 3d); light grey boxes mark expected orthologues of gnathostome clusters C/D. Right, schematic representation of cyclostome and gnathostome Hox clusters. Hox genes are shown as yellow boxes, 5’ bystanders as red boxes and 3’ bystanders as blue boxes. The order of genes reflects the actual arrangement of genes in each species.
Origin of neural crest
Our paralogon-based classification makes it possible to robustly assign paralogues to specific duplications, and this sheds some light on the relative origin, with respect to WGDs, of vertebrate characteristics such as neural crest, placodes and hormone systems50. As highlighted previously51, establishing whether both paralogous branches retain (or partition) the ancestral role can help to pinpoint whether a character is likely to have emerged before or after 1RV.
To assess the origin of neural crest, we considered a set of 22 gene families involved in the specification and migration of neural crest52,53 (Fig. 4a and Supplementary Table 8) and we asked whether corresponding 1RV paralogues perform neural crest cell (NCC)-related functions in gnathostomes and in lampreys on the basis of the literature54–56 and available RNA sequencing (RNA-seq) data56. We find that for many of these gene families, including Tfap2, SoxE, EdnR, Twist1 and Gata3, paralogues on both 1RV branches are involved in neural-crest-related functions (Supplementary Table 9). This pattern indicates that NCC-related functions were inherited from pre-1RV genes in the vertebrate ancestor, and thus suggests that the neural crest originated before 1RV. Post-WGD subfunctionalization therefore had a limited role in its emergence, in contrast to other gnathostome novelties such as limbs57. Consistent with this, in lampreys an alternative 1RV paralogue also seems to be involved in NCCs for several families, for example Gata3, Six1, Msx1 and possibly FoxD (Fig. 4a).
Fig. 4. Functional effects of vertebrate WGD and gene loss in vertebrates.
a, Key neural-crest-related gene families with members classified according to their functional role (colour) and paralogy status relative to 1RV and 2RJV. The involvement of paralogues derived from both copies of the 1RV in NCC-related function, in both gnathostomes and lampreys, supports the hypothesis that NCCs predate 1RV. b, Enrichment of functional annotation terms (gene ontology) in sets of genes showing a specific pattern of retention after vertebrate WGDs. Each column corresponds to a set of paralogous genes with a specific pattern of post-duplication retention in a given species. We distinguished cases in which both paralogues can be assigned to a specific duplication and are retained, cases in which at least one of the paralogues is retained and cases in which at least one of the two copies is lost. CNS, central nervous system. c, Distribution of the difference of positive organ-specific expression domains between selected vertebrate species and the amphioxus outgroup for ohnologue gene families59. A shift to the left in the distribution (as seen for the gar) indicates an extensive subfunctionalization through the restriction of gene-expression domains in vertebrates. d, Gene-family loss in deuterostomes, highlighting the severe loss in the hagfish lineage relative to that seen in other vertebrates and deuterostomes (grey). Species abbreviations are provided in Supplementary Table 8. e, Functional enrichment (gene ontology) for gene families lost in the hagfish lineages, highlighting a simplification of visual and hormonal systems (labels in orange). f, Structure of the two clusters of α-keratin genes on chromosomes 14 and 4, and their expression in the slime gland and the skin shown as a heat map (gene expression expressed as fragments per kilobase per million reads (FPKM)). Unchar is the prefix used for naming genes that did not receive a gene name by homology search. 
Genes are shown in the same order in the heat map as they are located in the two clusters. Stars indicate the two genes that are expressed preferentially in the skin (Extended Data Fig. 9e).
By contrast, the establishment of the trunk and cranial NCCs seems to differ among cyclostomes, osteichthyans and even amniotes, with distinct genes being involved56. Some but not all of the genes involved in this process seem to show a more recent occurrence of subfunctionalization. For instance, Lhx5, Id3, Gid2 and Dmbx, which have a role in gnathostome cranial NCCs, do not have 1RV or 2RJV paralogues with a similar function, whereas Tfap2 and SoxE, which are involved in the ancestral specification of both cranial and trunk NCCs, have paralogues on both of the 1RV branches that are involved in this function. Lampreys show marked differences: neither RhoB nor Ets are involved or expressed in lamprey migratory NCCs, and Lhx5, Dmbx and Ets1 are expressed in later NCC derivatives (Extended Data Fig. 10c and Supplementary Table 10). Despite the extensive gene loss experienced in the hagfish lineage (see below), we recovered homologues for most of the NCC-related genes that we investigated. Further functional studies will be necessary to determine whether subsequent 2RCY paralogues in hagfish were incorporated in NCC-related functions specific to this lineage52.
Extended Data Fig. 9. Evolution of duplicated genes and gene families in hagfish.
a, Counts of gene families containing the specified number of retained paralogues in gar, lamprey (P. marinus) and hagfish (E. atami). b, Comparison of the tissue-specificity of gene expression (tau index) for ohnologue gene families in lampreys, hagfish, gar and the (unduplicated) amphioxus outgroup (Methods). The distribution of the maximal tau value for each gene family is shown. c, Node-specific gene loss events inferred by GeneRax in a species–gene tree reconciliation framework (Methods). Species labels are specified in Supplementary Table 8. d, Loss of Panther families across deuterostome species inferred as the most parsimonious events from gene-family composition. e, Genome structure of the two clusters of expanded keratin genes, with mRNA expression in slime gland and skin (blue track).
A distinct fate for paralogues
Paralogues retained in gnathostomes after two rounds of genome duplication were previously shown to be functionally associated with the regulation of development and nervous system activity18. To determine whether similar genes were retained preferentially in multiple copies in the cyclostome lineage after 1RV and/or after cyclostome-specific 2RCY, we tested paralogue sets that show distinct retention patterns for functional enrichment (Fig. 4b). We recovered gene ontology terms that were previously found to be enriched in gnathostome paralogues (for example, axon guidance and embryonic organ development), but, notably, we found that they were preferentially associated with paralogues retained after the pan-vertebrate 1RV regardless of their fate after the gnathostome-specific 2RJV (Fig. 4b), suggesting that 1RV had a key role in the early elaboration of the vertebrate nervous system. In cyclostomes, however, these terms are preferentially associated with paralogues that were systematically retained after all polyploidizations (1RV and 2RCY); this suggests a distinct path of paralogue evolution at the functional level, possibly coupled with an increased retention after 2RCY compared with 2RJV.
The fate of paralogues after WGD is often related to their acquisition of more specific expression domains that can explain subfunctionalization and functional innovation58–60. To examine patterns of divergence in gene expression in gnathostomes and cyclostomes, we compared paralogues across a consistent set of six organs in amphioxus, gar, lampreys and hagfish. Considering 3,009 gene families, we found a higher level of gene-expression specificity in gar than in lampreys and hagfish, with the hagfish showing the least specificity (Extended Data Fig. 9b). We then counted the number of expression patterns that were gained or lost in the same gene family between amphioxus and the different vertebrate species, which also indicated a lower level of subfunctionalization in cyclostomes than in gnathostomes (Fig. 4c). Finally, we asked whether particular organs show a significant enrichment of paralogous genes using gene-expression clustering (weighted gene co-expression network analysis (WGCNA); Methods and Extended Data Fig. 10a,b). Of note, we found that only neural tissue exhibits enrichment in both gnathostomes (for example, gar) and hagfish, whereas many recently duplicated genes are expressed in an organ-specific manner (Extended Data Fig. 10b). Together, these results imply that cyclostomes (and, to a greater extent, hagfish) show more limited subfunctionalization or specialization of expression patterns than do gnathostomes.
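The tissue-specificity comparison relies on the tau index. As a minimal sketch, assuming the widely used definition tau = Σ(1 − x_i / x_max) / (n − 1) over n organs, the per-gene value and the per-family maximum could be computed as below. The function names, the input layout and the handling of unexpressed genes are illustrative assumptions, not the authors' pipeline.

```python
def tau_index(expression):
    """Tau for one gene: 0 = uniform across organs, 1 = restricted to one organ.

    `expression` is a list of non-negative expression values, one per organ.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs at least two organs")
    x_max = max(expression)
    if x_max <= 0:
        # Assumption: a gene with no detected expression is scored as non-specific.
        return 0.0
    return sum(1 - x / x_max for x in expression) / (n - 1)


def family_max_tau(family):
    """Maximal tau among the paralogues of one gene family.

    `family` maps a (hypothetical) gene identifier to its per-organ values.
    """
    return max(tau_index(values) for values in family.values())
```

For example, a gene expressed in only one of six organs scores 1, and a gene expressed equally in all six scores 0.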
Extended Data Fig. 10. Gene expression and gene duplications in vertebrates.
a,b, Weighted gene co-expression network analysis (WGCNA) among organs for hagfish (a) and gar (b). Each row corresponds to a WGCNA cluster (with an arbitrary colour name) and its expression specificity is shown in selected tissues on the left (a, hagfish, b, gar). The enrichment of genes duplicated at successive phylogenetic nodes in each WGCNA cluster is indicated on the right as the p-value (-log10) of hypergeometric tests. A significant enrichment is observed in genes with strong neural expression (brain, blue cluster). c, Expression of selected paralogues involved in neural crest specification and migration in cranial and trunk neural crest tissues from lamprey P. marinus. RNA-seq data from a previous study56 were quantified using the latest version of the lamprey genome and RefSeq annotation (kPetMar1). For each gene family, all paralogues derived from the vertebrate polyploidization events (1RV and 2RCY) are considered and classified (see Supplementary Tables 9 and 10). As denoted in the inset, 1 (green cells) and 2 (pink cells) refer to the two original paralogue branches derived from 1RV (see main Fig. 4a). Grey groups could not be definitively assigned.
Gene loss and hagfish novelties
Hagfish underwent the most extensive gene loss among vertebrates, with 1,386 missing gene families, of which 892 were present in the deuterostome ancestor (Fig. 4d and Extended Data Fig. 9d). Hagfishes stand out as having lost all members of several entire gene families, rather than exhibiting just an increased loss of paralogues (Extended Data Fig. 9c).
Several gene families lost in hagfish are functionally enriched for roles associated with missing characters in hagfish (Fig. 4e). For instance, γ-crystallins, which make up the lenses of vertebrate eyes, are absent in hagfish (but independently expanded in lampreys and gnathostomes), as are the EYS (eyes shut homologue) and RBP3 (retinol-binding protein 3) genes that are involved in photoreceptor maintenance and development61 (Supplementary Table 11). Several genes that are involved in bone development and its hormonal control in other vertebrates62 are absent in hagfish: two members of the RANK–osteoprotegerin pathway that control osteoclast proliferation in gnathostomes63, as well as the genes encoding the parathyroid hormones (PTH and PTHLH), which have a role in the regulation of calcium metabolism (their receptor is still present)64. These genes are present in lampreys and their loss in the hagfish lineage could be associated with the limited condensation of the hagfish vertebral cartilage.
Hagfish have also gained new traits, most notably their prodigious ability to secrete a highly viscous slime that acts as a defence against predators. We found two clusters of genes that are specifically and highly expressed in the slime gland (Extended Data Fig. 9e) and are related to intermediate filaments (α-keratin)65. One of these clusters contains a gene that is expressed mainly in the skin but not in the slime gland, consistent with the recent suggestion that the keratin threads of hagfish slime could have originated as elements of the skin66 (Fig. 4f). We found that the most highly expressed glycoproteins in the slime gland included von Willebrand A and D domains, rather than mucin-type domains as previously hypothesized67.
Programmed DNA and gene elimination
The somatic and germline cells of hagfish exhibit distinct karyotypes, owing apparently to the loss of germline-specific chromosomes through embryonically programmed DNA elimination. On the basis of k-mer counts (Extended Data Fig. 1b), we estimate that around 1.3 Gb is lost from the approximately 3.3-Gb germline genome of E. atami, consistent with cytofluorometry25,28. Analysis of the genome assembly identified a large number of germline-specific genes and confirmed that germline-specific regions contain large numbers of complex repetitive elements30–33, including one newly identified repeat that accounts for 4% of the genome (Fig. 5, Extended Data Figs. 11 and 12 and Supplementary Note 3).
Fig. 5. Germline-specific and enriched sequences and genes in hagfish.
a, Plot showing the degree of germline enrichment and estimated span of all predicted repetitive elements in the E. atami genome, focusing on elements with a cumulative span of less than 4 Mb (per family member). Previously identified elements30,33 are highlighted by coloured circles and newly identified high-copy elements are highlighted by coloured diamonds. Additional higher copy repeats are visible in Extended Data Fig. 12m,n. The colouring scheme is the same in b and in Extended Data Fig. 12m,n. b, Estimated cumulative span of the eight most highly abundant repeats shown as the percentage of the genome covered. c, Fluorescence in situ hybridization (FISH) of high-copy germline-specific repeats to a testes metaphase plate showing their distinct spatial clustering within chromosomes (blue counterstaining is NucBlue: Hoechst 33342; individual pairs of probes are shown in Extended Data Fig. 12m,n). d, Comparison of the sequence depth of DNA extracted from germline (testes) versus somatic (blood) tissues identifies a large number of genomic intervals with evidence for strong enrichment in the germline. The bin representing no enrichment contains a total of 2.3 Gb of the assembly. e, Genes encoded within germline-specific regions are enriched for several ontology terms related to regulation of cell cycle and cell motility (Panther Biological Processes: most specific subclass shown; Supplementary Table 14).
Extended Data Fig. 11. Eliminated genes and repeats identified in the hagfish genome.
a, Plot showing the degree of germline enrichment and estimated span of all predicted repetitive elements. Previously identified elements30,33 are highlighted by coloured circles and new high-copy elements are highlighted by coloured diamonds. b, PCR validation illustrating germline enrichment and tandem repetition of predicted satellite elements. g: germline (testes) DNA used as template, s: somatic (blood) DNA used as template. c–e, Gene trees for homologues that are eliminated in both lamprey and hagfish. Gnathostome clades are highlighted in shades of green and cyclostome clades are highlighted in shades of purple. Individual germline-specific genes are highlighted in red (hagfish) or blue (lamprey). c, Tree for YTHDC2 homologues. d, Tree for WNT7 homologues. e, Tree for MSH4 homologues. f,g, Gene trees for homologues that are highly duplicated in hagfish. Gnathostome clades are highlighted in shades of green and cyclostome clades are highlighted in shades of purple. Individual germline-specific genes are highlighted in red. f, Tree for FBXL4 homologues. g, Tree for TRRAP homologues.
Extended Data Fig. 12. FISH of repeats to germline and somatic interphase nuclei.
Nuclei are labelled with the DNA stain NucBlue (blue) and for all panels labelled “no signal” fluorescence images are overexposed to both show background signal and aid in confirming the location of nuclei in those images. a,b, Germline enriched EEPs2 (red) and HFR10 (magenta), and the somatic repeat EEPs1 (green) are hybridized to nuclei isolated from a, germline: testes and b, soma: blood. c,d, Germline enriched repeats EEPs4 (red) and HFR5 (green), and the somatic repeat Soma3 (magenta) are hybridized to nuclei isolated from c, germline: testes and d, soma: blood. e,f, Germline enriched repeats HFR13 (magenta) and HFR6 (green), and the somatic repeat Soma1 (red) are hybridized to nuclei isolated from e, germline: testes and f, soma: blood. g,h, Germline enriched repeats EEEo2 (red) and HFR16 (magenta), and the somatic repeat EEPs1 (green) are hybridized to nuclei isolated from g, germline: testes and h, soma: blood. i,j, Germline enriched repeats HFR4 (red) and HFR8 (green), and the somatic repeat Soma3.1 (magenta) are hybridized to nuclei isolated from i, germline: testes and j, soma: blood. k,l, Germline enriched repeats EEPs2 (red) and EEPs3 (green), and the somatic repeat Soma3 (magenta) are hybridized to nuclei isolated from k, germline: testes and l, soma: blood. m,n, In situ Hybridization of probes for ten germline-enriched satellite sequences. m, Probes are hybridized to germline interphase nuclei. n, Probes are hybridized to germline interphase nuclei. The location of hybridization signals for telomere probes and approximate bounds of 18 germline-specific dyads, corresponding to nine distinct germline-specific chromosomes. For all images, pairs of repeats are shown to aid in visualizing the relative location of individual probes.
So far, no germline-specific genes have been identified in any hagfish species. We identified germline-specific genes in E. atami by comparing the read depth of germline and somatic reads across low- to medium-copy regions of the assembled genome (Methods and Fig. 5d). We discovered 81 Mb of germline-specific sequence that encodes 1,654 genes, 226 of which have identifiable human homologues (to 121 non-redundant human genes) (Supplementary Table 12). We confirmed that 44 of 46 tested germline-specific intervals can be PCR-amplified from testes but not blood DNA (95.7% validation rate) (Supplementary Table 13). Germline-specific genes in hagfish are enriched in several biological functions on the basis of gene ontology analyses, including functions related to cell cycle, cell motility and chromatin or DNA repair (Fig. 5e and Supplementary Tables 14 and 15). Similar functions were also enriched among germline-specific genes in sea lamprey, and support the hypothesis that somatically eliminated genes generally perform functions that benefit the development and maintenance of the germline11.
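The depth comparison described above can be illustrated with a short sketch. It assumes per-interval read counts from a germline (testes) and a somatic (blood) library, normalizes each by library size, and scores the log2 ratio. The function names, the pseudocount and the `min_log2` threshold are hypothetical choices for illustration; the text does not state the cut-offs that were actually used.

```python
import math


def log2_germline_enrichment(germ_reads, soma_reads, germ_total, soma_total,
                             pseudocount=1.0):
    """log2 ratio of library-size-normalized germline versus somatic depth."""
    # The pseudocount (an assumption) avoids division by zero when an
    # interval has no somatic reads at all.
    germ = (germ_reads + pseudocount) / germ_total
    soma = (soma_reads + pseudocount) / soma_total
    return math.log2(germ / soma)


def germline_specific_intervals(depths, germ_total, soma_total, min_log2=3.0):
    """Names of intervals whose enrichment reaches the (hypothetical) threshold.

    `depths` maps an interval name to a (germline_reads, somatic_reads) tuple.
    """
    return sorted(
        name
        for name, (germ, soma) in depths.items()
        if log2_germline_enrichment(germ, soma, germ_total, soma_total) >= min_log2
    )
```

An interval covered equally in both libraries scores close to 0, whereas an interval with reads only in testes scores strongly positive and would be passed on to PCR validation.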
The broad functional similarity between germline-specific genes of hagfish and lamprey suggests that DNA elimination could be a shared ancestral feature of the cyclostome lineage8. To attempt to identify the vestiges of theoretical ancestral germline-specific chromosomes in the cyclostome lineage, we examined patterns of orthology and paralogy for eliminated genes. Despite the general functional similarity of eliminated genes in hagfish and sea lampreys, few orthologous genes were found to be eliminated in both genomes. In total, 7 of 121 non-redundant hagfish gene families were also eliminated in sea lampreys (CDH1, CDH2 and CDH4; GJC1; MSH4; NCAM1; SEMA4B and SEMA4C; WNT5A, WNT5B, WNT7A and WNT7B; and YTHDC2; Extended Data Fig. 11). An analysis of gene trees indicates that three of these (orthologues of MSH4, WNT7A and YTHDC2; Extended Data Fig. 11) share a last common ancestor that can be traced to a single lineage after the basal vertebrate divergence and duplication events. This small set of genes might reflect the vestiges of shared germline-specific sequences that were eliminated early in the cyclostome lineage, or, alternatively, these genes might have been independently recruited to the germline-specific fraction during the early evolution of both lineages.
Germline-specific chromosomes in songbirds and lampreys are continuously capturing duplicates of somatic genes, establishing new germline-specific genes that often evolve rapidly owing to their unique selective and mutational genomic environment7,34. In E. atami, we observe several germline-specific genes that have undergone extra rounds of duplication after duplicating or translocating to germline chromosomes. The genes with the highest germline-specific copy numbers are homologues of FBXL4, a modulator of E3 ubiquitin ligase that regulates the proteasomal turnover of the histone demethylase KDM4A (ref. 68) (25 copies); and TRRAP, a component of several histone acetyltransferase complexes (18 copies) (Extended Data Fig. 11 and Supplementary Table 12). The FBXL4 orthogroup also contains 45 paralogues in the draft genome of the closely related hagfish Eptatretus burgeri (ref. 69), indicating that the origin of germline-specific FBXL4 and the expansion of this gene family predates the split between the two hagfish species, with additional lineage-specific expansions and losses underlying differences in paralogue numbers over the past few million years. These gene families seem to have undergone substantial expansion even in the recent past, emphasizing their high rate of turnover.
The accumulation of epigenetic silencing marks and regulated degradation have been implicated in the cellular mechanisms that underlie the elimination of lamprey germline-specific chromosomes8,70. This suggests that some components of hagfish DNA elimination mechanisms might be encoded by the germline-specific chromosomes themselves, or contribute to other aspects of hagfish germ cell development. Other genes involved in the same pathways as FBXL4 and TRRAP have also duplicated in the context of the E. atami germline-specific chromosomes, albeit to a lesser extent. These include KLHL10, a component of the E3 ubiquitin ligase complex involved in spermatogenesis (4 copies); SIN3, a transcriptional repressor whose human homologue is highly expressed in the testis (4 copies); and DNMT1, the primary enzyme responsible for maintaining silencing DNA methylation marks after DNA replication (2 copies). Notably, each of these five families of germline-specific genes also possesses at least one somatically retained paralogue, indicating that germline-specific expansion of gene families related to ubiquitination and regulation of chromatin state has evolved in the context of largely intact ancestral somatic pathways.
Conclusion
Early vertebrate evolution was accompanied by a series of ancient polyploidization events that have been difficult to unambiguously resolve using conventional sequence-based molecular phylogenetics. Challenges include the antiquity of these events and the relatively short intervals between them18,21, as well as lineage-specific evolution and gene loss after duplication21,37,38. We used the hagfish genome and an approach focused on chromosome-scale phylogenetics to fully resolve this history of ancient vertebrate polyploidies (Fig. 2 and Extended Data Fig. 5). The earliest duplication, 1RV, occurred on the vertebrate stem lineage in the early Cambrian (around 527 Ma), around 10 million years before the appearance of Haikouichthys and Myllokunmingia (ref. 3), the earliest vertebrate fossils. Whether the similarity in timing is coincidental or causal remains to be seen.
After the shared duplication, cyclostomes and gnathostomes experienced independent polyploidizations during the late Cambrian–early Ordovician, coinciding with a gap in the vertebrate fossil record. We can, however, begin to relate early genomic events to the emergence and elaboration of vertebrate innovations by correlating the contemporary functions of gene duplicates with their appearance at specific duplication events. For example, we find that when one gene functions in neural crest, its 1RV paralogues also do, as expected if the neural crest regulatory circuits already existed before 1RV. More speculatively, we note that Evx homeobox genes, which have a role in the development and patterning of paired fins and limbs, duplicated at 1RV, with both lineages being retained in gnathostomes (Evx1–HoxA and Evx2–HoxD), but that cyclostomes are missing HoxC and HoxD-associated Evx paralogues owing to lineage-specific loss. This observation suggests that 1RV duplicates might have acquired roles in fin bud development and patterning very early in the evolution of the gnathostome lineage, consistent with the observation of paired fin fold morphologies in early diverging galeaspids71.
Finally, analysis of E. atami germline-specific chromosomes in comparison with other vertebrates supports the hypothesis that these chromosomes encode functions that are advantageous for the development of germ cells and the production of gametes, and indicates that rapid turnover of germline-specific gene content might be a common feature across highly divergent lineages. As with other features of their biology, differences in the gene content of lamprey and hagfish germline-specific chromosomes might reflect their long history of independent evolution and the marked differences in their reproductive, ecological and developmental biology that have accumulated over the approximately 460 million years since the last common cyclostome ancestor.
Methods
Genome sequencing and assembly
DNA was extracted from the testis of a male E. atami individual using proteinase K digestion and phenol:chloroform extraction72. Animals were sampled in Suruga Bay, off Yaizu (300–330-m depth) and maintained in seawater aquariums at 11–13 °C. In agreement with procedures authorized by the Guidelines for Proper Conduct of Animal Experiments by the Science Council of Japan (2006), animals were anaesthetized using Tricaine (MS222, Sigma) before euthanasia and dissection. Paired-end and mate-pair Illumina libraries were generated using Illumina Truseq and Nextera Mate-pair kits and sequenced on HiSeq2000 and HiSeq2500 instruments (Supplementary Table 1). The Illumina dataset was assembled using Meraculous (v.2.2.2.5) with a k-mer of 71 and ‘diploid mode’ set to ‘1’ to attempt to merge haplotypes73, and subsequently scaffolded using mate-pair information (Supplementary Table 2). PacBio long-read data at around 35× coverage were generated on a PacBio RSII instrument (Supplementary Table 2) and incorporated using PBJelly (v.15.8.24)74. PBJelly aligns the PacBio reads to the assembly using the Blasr aligner and collects reads surrounding and spanning gaps. Sequences assembled from these spanning reads are used to fill gaps and extend scaffolds. We used the parameters ‘-minMatch 8 -sdpTupleSize 8 -minPctIdentity 75 -bestn 1 -nCandidates 10 -maxScore -500’ for Blasr alignment.
The gap-filled assembly was further scaffolded using proximity ligation information. We used both Chicago libraries, which rely on in vitro reconstituted chromatin, and Hi-C libraries, which capture native chromatin contacts, and scaffolding was performed using the HiRise package75. Hagfish liver was cross-linked in 1% paraformaldehyde, and chromatin was subsequently extracted, immobilized on SPRI beads, washed and digested with DpnII (ref. 76). After end-labelling, proximity ligation was performed using T4 DNA ligase and cross-linking was reversed using proteinase K. The DNA fragments were removed from the beads and then purified again on SPRI beads. The sequencing library was constructed using the NEB Ultra Library Preparation Kit (New England Biolabs).
The genome-wide heterozygosity was estimated to be 0.9%. The final BUSCO score (Metazoa) is C:90.0% (S:89.8%, D:0.2%), F:4.0%, M:6.0%, n:954. The size of the hagfish genome was estimated by counting 21-mers with Meryl (v.1.1)77. By fitting a four-peak model as implemented in Genomescope2, we estimated the genome size to be 2.02 Gb and 3.28 Gb using sequencing data from blood and testis DNA, respectively78 (Extended Data Fig. 1b).
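The two size estimates imply the amount of DNA eliminated from somatic cells. The sketch below works through that arithmetic and, for orientation only, includes a naive k-mer-based size estimator (total k-mer occurrences divided by the modal depth). The naive estimator is an illustrative simplification, not the four-peak model fitted by Genomescope2; `naive_genome_size` and its `min_multiplicity` cut-off are hypothetical.

```python
def naive_genome_size(histogram, min_multiplicity=5):
    """Rough genome size from a k-mer histogram {multiplicity: count}.

    Assumption: k-mers below `min_multiplicity` are sequencing errors and the
    most frequent remaining multiplicity is the homozygous coverage peak.
    """
    kept = {m: c for m, c in histogram.items() if m >= min_multiplicity}
    peak = max(kept, key=kept.get)
    total_kmers = sum(m * c for m, c in kept.items())
    return total_kmers / peak


# Estimates reported in the text (Gb).
germline_gb = 3.28   # testis DNA
soma_gb = 2.02       # blood DNA

eliminated_gb = germline_gb - soma_gb              # about 1.26 Gb
eliminated_fraction = eliminated_gb / germline_gb  # about 38% of the germline genome
```

The difference of about 1.26 Gb agrees with the figure of around 1.3 Gb quoted for programmed DNA elimination.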
Transcriptome and genome annotation
We generated RNA-seq data for 13 organs with an average depth of 25 million reads. We aligned the reads to the genome using STAR (v.2.5.2b), with an average of 78.7% uniquely mapping reads79. These alignments were used to assemble transcriptomes for each organ using StringTie (v.1.3.3b), which were subsequently merged using Taco80. In parallel, a de novo assembly of the bulk RNA-seq data was performed using Trinity (v.2.11.0) in both reference-free and genome-guided modes81.
We also sequenced full-length cDNA from brain RNA on eight SMRT cells of a PacBio RSII instrument. Following the Iso-Seq protocol, circular consensuses of subreads were calculated and validated as full length on the basis of the presence of SMART adaptors at both extremities. Full-length transcripts were clustered and polished using all circular consensus reads with Quiver (v.2.0.0), yielding 23,343 high-quality transcripts.
Assembled transcripts from de novo and genome-guided Trinity and high-quality Iso-Seq transcripts were aligned to the genome using GMAP (v.2018-03-25). Mikado (v.1.2.1) was used to generate a high-quality reference transcriptome leveraging (i) the aligned Trinity de novo and genome-based transcriptomes; (ii) the Iso-Seq transcripts; (iii) the StringTie transcriptomes merged with Taco; and (iv) a set of curated splice junctions generated from RNA-seq alignments using Portcullis (v.1.0.2). Putative fusion transcripts were detected by Blast comparison against Swiss-Prot and ORFs were annotated using TransDecoder82. Transcripts derived from the reference transcriptome were selected to train the Augustus de novo gene prediction tool83. Intron positions and exon positions were converted into hints for Augustus gene prediction.
Finally, we constructed a database of repetitive elements using RepeatModeler (v.1.0.11) and used it for masking repetitive sequences with RepeatMasker (v.4.0.7). Gene models with half or more of their exons showing 50% overlap with repeats were discarded, yielding 46,822 filtered gene models. Alternative transcripts and UTRs were subsequently incorporated using the PASA pipeline82. These gene models contain a total number of 4,915 distinct PFAM domains.
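The repeat-based filter can be sketched as follows. This is one reading of the rule, stated as an assumption: an exon counts as repeat-overlapping when at least 50% of its length is covered by masked repeats, and a gene model is discarded when at least half of its exons are flagged. Coordinates are half-open intervals on a single scaffold, and all names are hypothetical.

```python
def covered_length(exon, repeats):
    """Number of exon bases covered by repeat intervals (overlaps counted once)."""
    start, end = exon
    covered = 0
    cursor = start  # everything left of the cursor has already been counted
    for r_start, r_end in sorted(repeats):
        lo = max(r_start, cursor)
        hi = min(r_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered


def is_repeat_derived(exons, repeats, exon_fraction=0.5, model_fraction=0.5):
    """True if the gene model should be discarded under the assumed rule."""
    flagged = sum(
        1
        for exon in exons
        if covered_length(exon, repeats) >= exon_fraction * (exon[1] - exon[0])
    )
    return flagged >= model_fraction * len(exons)
```

Under this reading, a two-exon model with one exon mostly inside a repeat is discarded, whereas a three-exon model with a single flagged exon is kept.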
Phylogenomics and molecular dating
To obtain sequences from a previously unsampled hagfish group, we extracted RNA from M. glutinosa liver preserved in RNAlater using the RNeasy kit (Qiagen). The RNA-seq library was constructed using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina (NEB) and sequenced on a NovaSeq 6000 instrument (SRR). The transcriptome was assembled using Trinity (v.2.11.0)81, enabling read trimming, and was translated using TransDecoder (v.5.5.0)82.
We inferred a set of 1,467 single-copy orthologues suitable for phylogenetic reconstruction by applying the OMA tool (v.2.4.1)84 to a subset of deuterostome proteomes including lamprey and the newly generated hagfish gene models (Supplementary Table 8). Selected transcriptomes were assembled using Trinity (v.2.11.0) and translated using TransDecoder (v.5.5.0)82. We built hidden Markov model (HMM) profiles using Hmmer (v.3.1b2) for each orthologue family and extracted orthologues for phylogenetic reconstruction using a previously described approach85. The resulting sequences were aligned using Msaprobs86, mistranslated stretches were filtered out using HmmCleaner87 and diverging regions intractable for phylogenetic analysis were removed using BMGE (-g 0.9)88. Phylogenetic trees were reconstructed for each alignment using IQ-TREE (v.2.1.1) with a LGX+R model89. For computationally intensive analyses, such as site-heterogeneous reconstruction with CAT+GTR, we selected the 20% of orthologues with the lowest saturation. Molecular dating analysis was conducted using PhyloBayes (v.4.1e)90 with the CAT+GTR+G4 model and the CIR relaxed clock (with soft bounds), assuming fossil calibrations2,91,92 (Supplementary Table 7).
Synteny reconstruction
Pairs of orthologous genes were obtained by mutual best hit after reciprocal proteome comparison using MMseqs2 (r12-113e3), and were used to create a system of joint coordinates to plot orthologue position in two species. Fisher’s exact test was used to determine mutual enrichment of orthologues between chromosomes, and only significant enrichments were incorporated in binned orthologous content representations (Fig. 1c). Plots connecting orthologues in multiple species (Fig. 1d) were generated using RIdeogram (v.0.2.2).
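The two steps of this procedure, reciprocal best hits followed by a chromosome-pair enrichment test, can be sketched with the standard library alone. The input layout (dictionaries of scored hits), the function names and the use of a one-sided test are assumptions made for illustration; the text states only that Fisher's exact test was applied.

```python
from math import comb


def mutual_best_hits(a_to_b, b_to_a):
    """Reciprocal best-hit pairs from two dicts {query: [(target, score), ...]}."""
    best_ab = {q: max(hits, key=lambda h: h[1])[0] for q, hits in a_to_b.items() if hits}
    best_ba = {q: max(hits, key=lambda h: h[1])[0] for q, hits in b_to_a.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def fisher_enrichment(k, row_total, col_total, n):
    """One-sided Fisher's exact P(X >= k) for a 2x2 table.

    k: orthologue pairs on both chromosome A (species 1) and chromosome B (species 2)
    row_total: pairs on chromosome A; col_total: pairs on chromosome B
    n: all orthologue pairs between the two species
    """
    upper = min(row_total, col_total)
    tail = sum(
        comb(col_total, i) * comb(n - col_total, row_total - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(n, row_total)
```

In practice the p-values for all chromosome pairs would also need correction for multiple testing; the correction method is not specified in the text, so none is shown here.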
Gene-family analyses and phylogenetic analysis of paralogons
We reconstructed gene families using Broccoli (ref. 93) for a set of genomes from deuterostome species (Supplementary Table 6). For gene families that included at least 6 genes and 3 species but fewer than 450 sequences in total, we applied GeneRax to infer the losses and duplications that affected a given gene family94. To do so, we generated individual alignments using MAFFT (v.7.305)95, filtered them using BMGE and reconstructed a tree using IQ-TREE and an LG+R model89. These curated alignments and trees were used as input for GeneRax (v.1.2.2) assuming a D+L (duplication plus loss) model. Reconciled trees in the RecPhyloXML format were parsed to estimate the duplications and lineage-specific losses at each node of the species tree96, as seen in Extended Data Fig. 10c. Reconciled trees were split if they showed a duplication at the ‘deuterostomia’ node indicative of a deep paralogy relationship.
For each gene family, we first assigned the CLG by considering the location of amphioxus and sea urchin genes and the corresponding CLG-to-chromosome assignment, and then evaluated the occurrence of the paralogues derived from the 1R and 2RJV in gnathostomes on the basis of the vertebrate classification that was previously established19 and has been revised in this study (Supplementary Table 6). For selected species, we collected gene families that included derivatives of the 1R paralogons and at least three of the four possible gnathostome paralogons (α1, α2, β1, β2) (Supplementary Table 5). These genes were concatenated for each CLG on the basis of their paralogon identity in gnathostomes, and the chromosomal identity of the CLG derivatives in cyclostomes. Two datasets were generated: a ‘strict’ one, in which at least three distinct gnathostome paralogons were required for each retained gene family; and a ‘relaxed’ one, in which only two or more gnathostome paralogons were required (Supplementary Table 5). A similar approach was used to classify individual genes depending on the duplication events from which they derive. We collected gene ontology terms and functional classification information by applying eggNOG (ref. 97) to the proteomes of our species of interest, and term enrichment analysis was conducted using the topGO package (v.2.50.0).
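The size criteria used to select gene families for reconciliation can be written as a simple filter. The sketch below is illustrative only: the function name and the representation of a family as a list of (species, gene identifier) tuples are assumptions, and it treats 'at least 6 genes' and 'fewer than 450 sequences' as bounds on the same count.

```python
def keep_for_reconciliation(family, min_genes=6, min_species=3, max_sequences=450):
    """Return True if a gene family meets the size criteria stated in the text.

    family: list of (species, gene_id) tuples (hypothetical layout).
    """
    n_genes = len(family)
    n_species = len({species for species, _ in family})
    return min_genes <= n_genes < max_sequences and n_species >= min_species
```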
For analyses of gain and loss, we used gene-family reconstruction that incorporated the gene models of the related hagfish E. burgeri69 to assess recent gene-family expansions or contractions in the hagfish lineage. Gene functions were assigned using the PANTHER classification98.
Tests of WGD hypotheses on the vertebrate phylogeny
We used WHALE (v.2.1.0)36 to rigorously test WGD hypotheses on a reduced vertebrate species tree (Fig. 2a and Extended Data Fig. 4). We leveraged a total of 8,931 gene families in this analysis, selected to contain at least one gene copy in each clade from the root, in compliance with the assumption of WHALE that genes were acquired in a common ancestor of all included species. We further filtered large families to reduce the computational burden. For each of the 8,931 retained families, we built a multiple sequence alignment based on the amino acid sequences with MAFFT (v.7.508)95 and reconstructed 1,000 bootstrap trees with IQ-TREE (v.2.2.0.3)89 under the LG+G model. We summarized clade conditional distribution (CCD) from bootstrapped trees using the ALEobserve tool from the ALE software99. We ran WHALE on the dated species trees and CCD data to test five different WGD hypotheses on the vertebrate species tree: 1RV, 2RJV, 2RCY, a hagfish-specific duplication and a lamprey-specific duplication (Extended Data Fig. 4a). We used the variable rate DLWGD WHALE model, which models independent duplication and loss rates across branches. We assumed a normal distribution N(log(0.15), 2) on the mean log-scaled duplication and loss rate, an exponential distribution (mean = 0.1) prior on its variance, a Beta(3, 1) hyperprior on the η parameter (distribution of the number of genes at the root) and uniform priors on the retention parameters (q parameter) for all WGDs. We obtained significant Bayes factors (BFNull_vs_WGD < 10^−3) in support of large-scale duplication (post-duplication retention parameter q ≠ 0) for the 1RV, 2RJV and 2RCY events (Extended Data Fig. 4b). These results were reproduced using the simpler constant rate DLWGD model. We similarly tested an alternative scenario with two duplications on the vertebrate stem (1RV and 2RV; Extended Data Fig. 4c,d).
In this configuration, and using uniform priors on the retention parameters, we observed that WHALE could not distinguish retention parameters for 1RV and 2RV: this is revealed by the bimodality of the estimated posterior distribution for each of these two parameters. We found that using distinct priors on retention parameters allows the estimation of distinct retention parameters for 1RV and 2RV and shows support for a single 1RV event (Extended Data Fig. 4e). This investigation of alternative priors was conducted on a pilot run of 1,000 randomly selected gene families, to reduce computation time (1,000 families were previously suggested to be sufficient for parameter estimation36).
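For reference, the prior specification and the decision rule described above can be summarized in compact notation. This is a restatement under assumptions rather than the exact parameterization of the WHALE implementation: the symbols \bar{r}, \sigma^2 and D (the gene-family data) are introduced here for illustration, the text does not specify whether the second argument of the normal distribution is a variance or a standard deviation, and the support (0, 1) of the uniform prior is assumed because q is a retention probability.

```latex
\begin{align}
  \bar{r} &\sim \mathcal{N}\!\left(\log 0.15,\; 2\right)
    && \text{mean log-scaled duplication and loss rate} \\
  \sigma^{2} &\sim \operatorname{Exponential}(\text{mean} = 0.1)
    && \text{variance of the branch-wise rates} \\
  \eta &\sim \operatorname{Beta}(3,\, 1)
    && \text{distribution of the number of genes at the root} \\
  q_{k} &\sim \operatorname{Uniform}(0,\, 1)
    && \text{retention parameter of WGD } k \\
  \mathrm{BF}_{\mathrm{Null\,vs\,WGD}}
    &= \frac{p(D \mid q_{k} = 0)}{p(D \mid q_{k} \neq 0)} < 10^{-3}
    && \text{criterion for supporting WGD } k
\end{align}
```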
Ancestral and lineage-specific meiotic rediploidization
We selected a set of 1,247 gene families, including genes of 6 vertebrate species (bamboo shark Chiloscyllium plagiosum, spotted gar Lepisosteus oculatus, chicken Gallus gallus, western clawed frog Xenopus tropicalis, brown hagfish E. atami and sea lamprey Petromyzon marinus) and the closest outgroup (depending on taxonomic availability), to test for ancestral and lineage-specific rediploidization after the 1R genome duplication. These 1,247 families were selected so as to result in distinct tree topologies under the ancestral and lineage-specific rediploidization models, on the basis of the following criteria: (i) at least one gnathostome species has retained both 1R_1 (that is, alpha1 and/or beta1) and 1R_2 (that is, alpha2 and/or beta2) gene copies; (ii) at least one hagfish gene and one lamprey gene; (iii) at least one non-vertebrate outgroup gene; and (iv) a non-prohibitive number of hagfish and lamprey genes so that a maximum of 10 possible ancestral rediploidization topologies can be derived for the family (Extended Data Fig. 6). For each gene family, we designed constrained tree topologies as expected under the lineage-specific and ancestral rediploidization models (Fig. 3a). More specifically, for the constrained ancestral rediploidization topologies, we built constrained topologies as follows: we first placed 1R_1 and 1R_2 gnathostome gene copies in two different clades following 1R and then derived possible combinations of hagfish and lamprey genes to be placed on the 1R_1 and 1R_2 clades, using well-supported hagfish and lamprey chromosomal orthologies to limit the number of combinations (Extended Data Fig. 6 and Supplementary Table 5). Next, for each of these 1,247 families, we built gene trees using RAxML (v.8.2.12)100, with 10 distinct starting trees and the PROTGAMMAJTT model, for: the unconstrained maximum likelihood (ML) tree; the constrained ancestral rediploidization topologies; and the constrained lineage-specific rediploidization topology. 
We then used the AU-test implemented in CONSEL (ref. 101) to test for significant differences in log-likelihoods reported by RAxML (ref. 100). A tree topology was rejected when significantly less likely than the ML tree at α = 0.05.
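The decision rule applied to each gene family can be sketched as follows. This is an illustrative reading, not the deposited analysis code: the function name and the output labels are hypothetical, and the rule that the ancestral model is rejected only when every constrained ancestral topology is rejected is an assumption about how multiple candidate topologies are combined.

```python
def classify_family(p_ancestral, p_lineage_specific, alpha=0.05):
    """Classify a gene family from AU-test P values.

    p_ancestral: P values for each constrained ancestral rediploidization topology.
    p_lineage_specific: P value for the constrained lineage-specific topology.
    A topology is rejected when its P value is below alpha.
    """
    # Assumption: the ancestral model survives if any one of its topologies does.
    ancestral_rejected = all(p < alpha for p in p_ancestral)
    lineage_rejected = p_lineage_specific < alpha
    if ancestral_rejected and not lineage_rejected:
        return "lineage-specific"
    if lineage_rejected and not ancestral_rejected:
        return "ancestral"
    if ancestral_rejected and lineage_rejected:
        return "both rejected"
    return "unresolved"
```

Families labelled "unresolved" are those for which the likelihood test cannot discriminate between the two models at the chosen threshold.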
We used the same approach to test for lineage-specific rediploidization in lampreys and hagfish on CLGB-1R after the 2R cyclostome hexaploidization. We ran likelihood tests on 30 informative gene families, constraining the lineage-specific rediploidization gene-tree topology as presented in Fig. 3c and constraining the ancestral rediploidization topologies according to the two possible ways of grouping hagfish and lamprey genes together (that is, either grouping genes from hagfish chr. 4 with lamprey chr. 10 and hagfish chr. 8 with lamprey chr. 2, or hagfish chr. 8 with lamprey chr. 10 and hagfish chr. 4 with lamprey chr. 2).
The code to reproduce the analysis, as well as the associated resulting gene trees, has been deposited in GitHub (https://github.com/fmarletaz/hagfish/tree/main/rediploidization).
Phylogenetic tree based on concatenation of Hox clusters
We investigated the phylogenetic relationships between Hox clusters and bystander genes in seven genomes: amphioxus, sea lamprey, hagfish, human, mouse, chicken and spotted gar. We identified Hox and bystander genes in three steps: (i) starting from human gene names, we searched for orthologues in the other species using our set of reconciled gene trees (GeneRax trees); (ii) we used NCBI blastp (ref. 102) to confirm identified Hox genes and to search for Hox genes missed by the gene-trees approach; and (iii) we used miniprot (v.0.5-r179)103 with the sets of human and E. burgeri Hox proteins24 to search for Hox genes missing from genome annotations of other species. We next aligned each gene family using their amino acid sequences with MAFFT (v.7.508)95 and concatenated the alignments from each cluster. Finally, we used the concatenation matrix to build a phylogenetic tree with RAxML-NG (v.1.1)100 using the LG+G4+F model, 10 different starting parsimony trees and 100 bootstrap replicates.
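The concatenation step can be illustrated with a short Python sketch in which each row of the matrix is a Hox cluster and each block of columns is one aligned gene family. The function name, the dictionary-based input and the use of gap characters for genes absent from a cluster are assumptions made for illustration, not a description of the scripts used here.

```python
def concatenate_alignments(alignments, taxa, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping a taxon (here, a Hox cluster label) to its
    aligned sequence; sequences within one dict must have equal length.
    taxa: ordered list of all row labels. Absent genes are padded with `missing`.
    """
    matrix = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = lengths.pop()
        for taxon in taxa:
            matrix[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in matrix.items()}
```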
Comparative transcriptomics
RNA-seq reads for hagfish (this study), lamprey Lampetra japonica (PRJNA354821, PRJNA349779 and PRJNA312435), gar Lepisosteus oculatus (PRJNA255881) and the cephalochordate amphioxus (PRJNA416977) were aligned with STAR (v.2.5.2b)79, and counts for annotated genes were obtained using featureCounts from the Subread package (v.1.6.3)104. Counts were converted to FPKM in R for subsequent analyses: WGCNA (v.1.7.0) was used to cluster gene expression in the full organ set and, after filtering out genes with limited variance and coverage, the ‘softpower’ parameter was estimated to be 20, and clustering was run with a ‘signed’ network type105. The gene-expression specificity index (or τ) was calculated as described previously106 on sets of organs (brain or neural tube; gills; heart; intestine; kidney; liver or hepatic tissues; ovary or female gonad; skin or epidermis; and muscle). For comparative analyses, gene families with paralogues derived from the vertebrate WGD were selected on the basis of their duplication history, and the gene-expression specificity index was compared across species for the same gene families (Extended Data Fig. 10b). We also compared gains and losses of expression domains for a given gene family by binarizing gene expression across a reduced set of six organs (brain, gills, intestine, liver, muscle and ovary) and counting expression-pattern gains or losses between the genes belonging to a given gene family, including paralogues and outgroup. The number of gain and loss events was then plotted as a distribution centred around zero (Fig. 4c).
The expression of paralogous genes in lamprey neural crest was assessed by quantifying gene expression using Salmon (v.1.10.0)107 from RNA-seq data generated in a previous study56 on dissected cranial and trunk tissues using the latest lamprey genome and annotation7. Paralogy status and expression are specified in Supplementary Table 10.
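As an illustration of the two quantities used above, the following Python sketch converts a read count to FPKM and computes a tissue-specificity index. It assumes the commonly used definition τ = Σ(1 − x_i/x_max)/(n − 1) over n organs; whether expression values were log-transformed beforehand is not stated here, so the sketch works on whatever values it is given. The function names are hypothetical, and the analysis itself was carried out in R.

```python
def fpkm(count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (gene_length_bp * total_mapped_fragments)


def tau(expression):
    """Tissue-specificity index: 0 for uniform expression, 1 for a single organ.

    expression: non-negative expression values, one per organ.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs at least two organs")
    x_max = max(expression)
    if x_max <= 0:
        # Gene not expressed anywhere: specificity is undefined, reported as 0 here.
        return 0.0
    return sum(1 - x / x_max for x in expression) / (n - 1)
```

For example, a gene expressed at 10, 5 and 0 in three organs has τ = (0 + 0.5 + 1)/2 = 0.75.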
Detection of germline-enriched and germline-specific regions
DNA was extracted from testes and blood by phenol-chloroform extraction72. To enrich for germ cells, testes tissue was ground gently with a plastic pestle in a 1.5-ml microfuge tube and residual connective tissues were discarded before proteinase K digestion. Library preparation and Illumina sequencing (HiSeq2500 V4, 150-bp paired-end reads) were outsourced to the HudsonAlpha Genome Services Laboratory.
Sequence data were aligned to the E. atami genome assembly using BWA-MEM (v.0.7.5a-r416)108 with option -a and filtered using samtools view108 with option -F 2308. Only primary alignments with a mapping quality of 5 or higher were retained for further analysis. The resulting files were processed using DifCover (v.3.0.1)7,11,34 to calculate the degree of germline enrichment across all discontiguous 500-base intervals of low-copy sequence, using modal coverages for testes and blood of 32× and 54×, respectively, low-coverage masking of regions with a read depth of less than 1/3× in both samples and high-coverage masking of sequences with a read depth greater than 3× modal coverage in both samples. To identify germline-specific genes that are present at a higher copy number, we ran DifCover using low-coverage masking with a read depth of less than 10× in both samples and high-coverage masking of sequences with a read depth greater than 30× modal coverage.
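The filtering and enrichment logic can be sketched in Python. The flag value 2308 is the sum of the SAM flag bits for unmapped (4), secondary (256) and supplementary (2,048) alignments, so excluding it keeps primary alignments only. The rest of the sketch is an assumption-laden illustration rather than a reimplementation of DifCover: the function names are hypothetical, the enrichment score is taken to be the log2 ratio of modal-normalized coverages, and the '1/3×' and '3×' masking bounds are interpreted as multiples of each sample's modal coverage.

```python
import math

# SAM flag bits excluded by `samtools view -F 2308`
FLAG_UNMAPPED = 0x4          # 4
FLAG_SECONDARY = 0x100       # 256
FLAG_SUPPLEMENTARY = 0x800   # 2048
EXCLUDE = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY  # 2308


def keep_alignment(flag, mapq, exclude=EXCLUDE, min_mapq=5):
    """Keep mapped primary alignments with mapping quality >= min_mapq."""
    return (flag & exclude) == 0 and mapq >= min_mapq


def enrichment_score(testes_cov, blood_cov, modal_testes=32.0, modal_blood=54.0):
    """log2 ratio of modal-normalized coverage (assumed form of the score).

    Both coverages must be positive; zero-coverage handling is left to the caller.
    """
    if testes_cov <= 0 or blood_cov <= 0:
        raise ValueError("coverage must be positive")
    return math.log2((testes_cov / modal_testes) / (blood_cov / modal_blood))


def is_masked(testes_cov, blood_cov, modal_testes=32.0, modal_blood=54.0,
              low=1 / 3, high=3.0):
    """Mask intervals that are low in both samples or high in both samples."""
    low_both = testes_cov < low * modal_testes and blood_cov < low * modal_blood
    high_both = testes_cov > high * modal_testes and blood_cov > high * modal_blood
    return low_both or high_both
```

Under this reading, an interval covered at the modal depth in both tissues scores 0, and an interval at twice the modal depth in testes but the modal depth in blood scores 1.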
PCR validation of germline-enriched loci
Primers were designed using a coverage-masked version of the E. atami genome using Primer3 (v.4.1.0)109. PCR validation reactions were performed using GoTaq DNA polymerase (Promega, 1.2 units per 50 μl reaction), Colorless GoTaq Reaction Buffer, 1 μg genomic DNA template and 100 ng oligonucleotide primer. PCR cycling conditions included a 3-min initial denaturation step at 95 °C, followed by 34 cycles of three-step thermal cycling consisting of a 30-s denaturation at 95 °C, a 30-s primer annealing step at 55–65 °C (Supplementary Table 13) and a 30-s extension step at 72 °C. A final extension at 72 °C was performed on all reactions to ensure that full-length amplicons were produced. Amplification was assessed by agarose gel electrophoresis. Eight primer pairs with an ambiguous signal in the first round of PCR were redesigned and retested (Supplementary Table 13). We note that some PCR markers might not be fully diagnostic with respect to germline specificity, as somatic gene duplicates are continuously captured by the germline-specific chromosomes in both lamprey7,11,34 and songbird lineages34.
Computational prediction of germline-enriched and highly abundant somatic repeats
Abundant k-mers (k = 31) were identified from testes and blood DNA-sequencing data using Jellyfish (v.2.2.4)110. Minimal copy-number thresholds for defining abundant k-mers were set at 3× the modal copy number: 72 for testes and 120 for blood. Abundant k-mers were extracted and assembled into a set of de-novo-assembled repetitive sequences using Velvet (v.1.2.10)111 with a hash length of 29. These sequences were aligned (blastn with -word_size 17) to repetitive elements generated from the E. atami genome assembly by RepeatModeler112 and sequences that aligned with less than 90% identity or less than 80% of their length were added to the set of reference-derived repeats to form a union set.
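The thresholding step can be illustrated with a small Python sketch. The stated thresholds of 72 and 120 at 3× the modal copy number imply modal k-mer copy numbers of 24 (testes) and 40 (blood). The sketch is a toy stand-in for Jellyfish, not a description of its behaviour: the function names are hypothetical, k-mers are counted on the forward strand only (no canonicalization) and the threshold is treated as inclusive because it is described as a minimum.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundant_kmers(counts, modal_copy_number, fold=3):
    """Return k-mers whose count reaches fold x the modal copy number."""
    threshold = fold * modal_copy_number
    return {kmer for kmer, n in counts.items() if n >= threshold}
```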
Enrichment analysis was performed by separately aligning paired-end reads from testes and blood to the union set of repeats. Primary alignments, identified by samtools view113 with option -F 2308, were also filtered to retain only alignments that either cover more than 80% of a repeat or have more than 80% of read bases aligned. Enrichment scores were calculated with the DifCover pipeline (v.3)11. Stage 2 of the pipeline was run with parameters v=10000, l=0, a=b=10, A=B=10^8. Stage 3 of the pipeline was modified by using a subroutine from DNAcopy114 without ‘smoothing’ the data before analysis. From a set of 180,032 intervals generated by DifCover, we chose 138 highly abundant and germline-specific sequences with enrichment scores of more than 10 and an estimated span size of more than 100 kb. The estimated genomic span of these repeats was computed as [(testes coverage/modal testes coverage) × (number of bases with read depth coverage > 10)], in which modal testes coverage = 32.
Clustering of the 138 highly abundant and germline-specific sequences was performed using CD-HIT-EST (v.4.6, with parameters: -c 0.8, -G 0, -aS 0.3, -aL 0.3, -sc 1, -g 1, -b 4)115, resulting in the formation of 38 clusters that were further merged into 24 by manual curation and cross-alignment of sequences from the initial clusters. For characterization of repetitive structures and identification of motifs, representative sequences from each cluster were mapped to the assembly (blastn, -word_size 15) and to a collection of published hagfish repeats. We found that 4 of 24 clusters have sequences that are homologous to the published repeats of Paramyxine sheni EEPs2, EEPs3 and EEPs4 and Eptatretus okinoseanus EEEo2 (refs. 30,33). Primers for these and representatives of 7 other clusters were designed with the Primer3 (v.0.4.0) tool (Supplementary Table 16).
To facilitate FISH visualization, we also searched for possible candidates for centromeric repeats. Such candidates are expected to be (1) highly abundant in both somatic and germline sequence and (2) enriched in a ‘centromeric’ region of every chromosome. From the union set, we chose repeats with blood coverage > 10^5 or span > 1 Mb and aligned them to the assembly (blastn -word_size 15, p>75, coverage > 80%). Repeats with more than 200 hits in a 1-Mb window were grouped into three families labelled Soma1–Soma3. Soma2 seemed to be homologous to the P. sheni repeat EEPs1 (ref. 33) and Soma1 and Soma3 to the E. burgeri contigs LC047612.1 and LC047003.1. FISH analysis confirmed the in silico prediction that EEPs1 is highly abundant in both testes and blood DNA of E. atami.
To estimate more accurately the genomic span of chosen germline-enriched and somatic repeats, we realigned reads from blood and testes to the sequences of these repeats or to the sequence extended as a tandem repetition of repeat motifs spanning at least 150 bp (Supplementary Table 12), and applied all of the previously described steps for filtering and for coverage and span estimation.
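The span formula and the selection criteria stated above translate directly into code. The following Python sketch is illustrative only: the function names are hypothetical, and the testes coverage of a repeat is passed in as a precomputed value because the way it is summarized across bases is not specified here.

```python
def estimated_span_bp(testes_coverage, covered_bases, modal_testes_coverage=32.0):
    """(testes coverage / modal testes coverage) x (bases with read depth > 10)."""
    return testes_coverage / modal_testes_coverage * covered_bases


def is_candidate(enrichment, span_bp, min_enrichment=10, min_span_bp=100_000):
    """Highly abundant, germline-specific: enrichment > 10 and span > 100 kb."""
    return enrichment > min_enrichment and span_bp > min_span_bp
```

For example, a 1,500-base repeat covered at 3,200× in testes corresponds to an estimated span of (3,200/32) × 1,500 = 150 kb.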
In situ hybridization
Slide preparation
Snap-frozen samples of blood and testes were used for slide preparation of somatic and germline cells for validation of the presence and specificity of repeats in different cell types. A small amount of blood (about 20 mg) was gently thawed on ice, mixed with 2 ml buffered hypotonic solution (0.4% KCl, 0.01 M HEPES, pH 6.8) and incubated for 30 min at room temperature. The cells were prefixed by gently mixing the suspension with several drops of fixative solution (methanol:acetic acid 3:1). After centrifugation (5,000g for 10 min), the supernatant was removed and cells were resuspended and fixed with methanol:acetic acid 3:1. Three further fixative solution changes were performed to ensure that cells were fully equilibrated to fixative solution. Fixed cells were stored at −20 °C. One fixative change was made before spreading the cell suspension onto slides. A drop of about 20 μl was applied to a steamed slide, which was immediately placed on a heating block in a humidity chamber at 60 °C for 1–2 min. After air drying, slides were examined with a microscope using a low condenser position to aid in viewing unstained nuclei and metaphases. Slides were aged for 1–3 days on a warming stage at 37 °C before hybridization. For germline cells, a piece of testis (30–40 mg) was minced with a razor blade, placed in a homogenizer and disaggregated in hypotonic solution. Testis cell suspensions were filtered through a 40–50-μm cell strainer to remove excess tissue. Subsequent steps of fixation and slide preparation for testis tissue were as described for blood.
Probe labelling
Probes for FISH were generated using a modified conventional PCR: the reaction mix with a final volume of 25 μl contained 0.1 mM each of unlabelled dATP, dCTP and dGTP and 0.03 mM of dTTP; 0.5 μl of one fluorophore-conjugated dUTP (cyanine 3-dUTP (Enzo), cyanine 5-dUTP (Enzo) or fluorescein-12-dUTP (Thermo Fisher Scientific)); 1× Taq-buffer; and 0.625 U GoTaq DNA Polymerase (Promega). Each PCR amplification was performed using 0.5 μg of genomic DNA template from testes, 34 PCR cycles and a 30-s extension step to obtain appropriately sized probes for FISH. After cycling, the 25 μl PCR mix was combined with 5 μl sheared salmon sperm DNA (1 mg ml−1; Thermo Fisher Scientific), 3 μl 3 M sodium acetate, pH 5.2 and 80 μl 100% cold ethanol, and kept overnight at −20 °C for probe precipitation. After spinning and supernatant removal, the pellet was dissolved in 25–30 μl of 50% formamide and stored at −20 °C before use.
FISH
FISH on chromosome preparations was performed according to a standard protocol for chromosome spreads116 with modifications117. Before hybridization, slides were incubated in 2× SSC for 30 min at 37 °C, passed through an ethanol series (70%, 80%, 100%), dried and denatured for 2 min in formamide (70% in 2× SSC) prewarmed to 70 °C. After the formamide denaturation, slides were placed immediately in cold (−20 °C) 70% ethanol, further dehydrated in 80% and 100% ethanol, and kept on a slide warmer at 37 °C until the hybridization mix with probe was applied.
Differently labelled hybridization probes were mixed (1 μl of each per slide) with hybridization master mix (60% formamide, 10% dextran sulfate and 1.2× SSC) to a final volume of 10 μl. The hybridization mix was denatured at 95 °C for 7 min, cooled on ice, prewarmed to 37 °C, applied to the slide, coverslipped and sealed with rubber cement. After overnight incubation in a humidity chamber at 37 °C, slides were washed in 0.4× SSC and 0.3% NP-40 for 3 min at 70 °C and in 2× SSC, 0.1% NP-40 for 5 min at room temperature. One drop of ProLong Glass Antifade Mountant with NucBlue Stain was placed in the centre of an area to be examined and covered with a coverslip.
Microscopy and image analysis
Slides were analysed with an Olympus-BX63 microscope using filter sets for DAPI, FITC, Cy3 and Cy5. Images were captured using CellSens software (Olympus) and processed with Adobe Photoshop CC 2019 and ImageJ 1.53k (NIH).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-024-07070-3.
Supplementary information
This file contains Supplementary Notes 1–3, Supplementary References, and Supplementary Fig. 1 (Phylogenetic trees inferred for paralogons in each CLG assuming the C20+R model).
This file contains Supplementary Tables 1–16.
Acknowledgements
We thank B. Venkatesh and S. Kuraku for early discussions; N. Segi, T. Suzuki, B. Muramatsu and H. Dohra for technical support; H. Hasegawa and K. Hasegawa for hagfish supply; and M. Levine, M. Martik, H. van Mullem, C. Amemiya and L. Piovani for comments on the manuscript. Work at the OIST Molecular Genetics Unit (D.S.R., O.S., D.G. and F.M.) was supported by OIST internal funds. F.M. is supported by the Royal Society Fellowship URF\R1\191161 and the BBSRC grant BB/V01109X/1. D.S.R. is a Chan Zuckerberg Biohub Investigator and is supported by the Marthella Foskett Brown Family Chair of Biological Sciences at the University of California, Berkeley. J.J.S. is supported by grants from the National Institutes of Health (NIH) (R35GM130349) and the National Science Foundation (NSF) (MCB1818012). M.S. was in part supported by the Field Science Center and Research Institute of Green Science and Technology, Shizuoka University. E.P. is supported by a Newton International Fellowship from the Royal Society (NIF\R1\222125). We thank the OIST Sequencing Section for DNA and RNA sequencing and acknowledge the support of OIST supercomputing and the University of Kentucky High-Performance Computing complex.
Author contributions
D.S.R., O.S., F.M., J.J.S. and S.B. conceived the study, which was led by F.M., J.J.S. and D.S.R. F.M. and O.S. sequenced, assembled and annotated the genome. N.T., V.A.T. and J.J.S. performed the DNA elimination analysis. F.M., O.S. and D.S.R. performed synteny analyses. F.M. completed phylogenetic and paralogon analysis. E.P. contributed duplication phylogenetic modelling, lineage-specific rediploidization and Hox cluster analysis. D.G. analysed transcriptomic data. M.S. provided biological samples and figure material. K.K. provided biological material and took part in transcriptomic analyses. D.S.R., F.M., E.P. and J.J.S. wrote the paper with input from N.T. and V.A.T. All authors read and approved the manuscript.
Peer review
Peer review information
Nature thanks Sebastian Shimeld and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Raw and processed sequences have been deposited in the NCBI Sequence Read Archive (SRA) (PRJNA953751) and Gene Expression Omnibus (GEO) (GSE230176). The RNA-seq data for M. glutinosa are available on the SRA (SRR25213276). The resequenced somatic tissues are also available on the SRA (blood, SRR24133795; testes, SRR24130678). RNA-seq datasets used for comparative analyses are publicly available for Japanese lamprey (PRJNA354821, PRJNA349779 and PRJNA312435), gar (PRJNA255881), amphioxus (PRJNA416977) and sea lamprey (PRJNA497902). The genome and its annotation are also deposited in Zenodo: https://zenodo.org/records/10227719. Source data are provided with this paper.
Code availability
The code used is available at https://github.com/fmarletaz/hagfish.
Competing interests
The authors declare no competing interests.
Footnotes
Deceased: Sydney Brenner
Contributor Information
Ferdinand Marlétaz, Email: f.marletaz@ucl.ac.uk.
Jeramiah J. Smith, Email: jjsmit3@uky.edu
Daniel S. Rokhsar, Email: dsrokhsar@gmail.com
References
- 1. Shimeld SM, Donoghue PCJ. Evolutionary crossroads in developmental biology: cyclostomes (lamprey and hagfish). Development. 2012;139:2091–2099. doi: 10.1242/dev.074716.
- 2. Miyashita T, et al. Hagfish from the Cretaceous Tethys Sea and a reconciliation of the morphological–molecular conflict in early vertebrate phylogeny. Proc. Natl Acad. Sci. USA. 2019;116:2146–2151. doi: 10.1073/pnas.1814794116.
- 3. Janvier P. Facts and fancies about early fossil chordates and vertebrates. Nature. 2015;520:483–489. doi: 10.1038/nature14437.
- 4. Ohno, S. Evolution by Gene Duplication (Springer, 1970).
- 5. Holland PW, Garcia-Fernàndez J, Williams NA, Sidow A. Gene duplications and the origins of vertebrate development. Dev. Suppl. 1994;1994:125–133.
- 6. Donoghue PCJ, Purnell MA. Genome duplication, extinction and vertebrate evolution. Trends Ecol. Evol. 2005;20:312–319. doi: 10.1016/j.tree.2005.04.008.
- 7. Timoshevskaya N, et al. An improved germline genome assembly for the sea lamprey Petromyzon marinus illuminates the evolution of germline-specific chromosomes. Cell Rep. 2023;42:112263. doi: 10.1016/j.celrep.2023.112263.
- 8. Smith JJ, Timoshevskiy VA, Saraceno C. Programmed DNA elimination in vertebrates. Annu. Rev. Anim. Biosci. 2021;9:173–201. doi: 10.1146/annurev-animal-061220-023220.
- 9. Drotos KHI, Zagoskin MV, Kess T, Gregory TR, Wyngaard GA. Throwing away DNA: programmed downsizing in somatic nuclei. Trends Genet. 2022;38:483–500. doi: 10.1016/j.tig.2022.02.003.
- 10. Smith JJ, Antonacci F, Eichler EE, Amemiya CT. Programmed loss of millions of base pairs from a vertebrate genome. Proc. Natl Acad. Sci. USA. 2009;106:11212–11217. doi: 10.1073/pnas.0902358106.
- 11. Smith JJ, et al. The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution. Nat. Genet. 2018;50:270–277. doi: 10.1038/s41588-017-0036-1.
- 12. Spitzer, R. H. & Koch, E. A. in The Biology of Hagfishes (eds Jørgensen, J. M. et al.) 109–132 (Springer, 1998).
- 13. Duméril, A. M. C. Dissertation sur la famille des poissons cyclostomes, pour démontrer leurs rapports avec les animaux sans vertèbres (Didot, 1812).
- 14. Ota KG, Fujimoto S, Oisi Y, Kuratani S. Identification of vertebra-like elements and their possible differentiation from sclerotomes in the hagfish. Nat. Commun. 2011;2:373. doi: 10.1038/ncomms1355.
- 15. Mallatt J, Sullivan J. 28S and 18S rDNA sequences support the monophyly of lampreys and hagfishes. Mol. Biol. Evol. 1998;15:1706–1718. doi: 10.1093/oxfordjournals.molbev.a025897.
- 16. Kuraku S, Kuratani S. Time scale for cyclostome evolution inferred with a phylogenetic diagnosis of hagfish and lamprey cDNA sequences. Zoolog. Sci. 2006;23:1053–1064. doi: 10.2108/zsj.23.1053.
- 17. Delsuc F, Tsagkogeorga G, Lartillot N, Philippe H. Additional molecular support for the new chordate phylogeny. Genesis. 2008;46:592–604. doi: 10.1002/dvg.20450.
- 18. Putnam NH, et al. The amphioxus genome and the evolution of the chordate karyotype. Nature. 2008;453:1064–1071. doi: 10.1038/nature06967.
- 19.Simakov O, et al. Deeply conserved synteny resolves early events in vertebrate evolution. Nat. Ecol. Evol. 2020;4:820–830. doi: 10.1038/s41559-020-1156-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Nakatani Y, et al. Reconstruction of proto-vertebrate, proto-cyclostome and proto-gnathostome genomes provides new insights into early vertebrate evolution. Nat. Commun. 2021;12:4489. doi: 10.1038/s41467-021-24573-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Kuraku S, Meyer A, Kuratani S. Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 2009;26:47–59. doi: 10.1093/molbev/msn222. [DOI] [PubMed] [Google Scholar]
- 22.Smith JJ, et al. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nat. Genet. 2013;45:415–421. doi: 10.1038/ng.2568. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Sacerdot C, Louis A, Bon C, Berthelot C, Roest Crollius H. Chromosome evolution at the origin of the ancestral vertebrate genome. Genome Biol. 2018;19:166. doi: 10.1186/s13059-018-1559-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Pascual-Anaya J, et al. Hagfish and lamprey Hox genes reveal conservation of temporal colinearity in vertebrates. Nat. Ecol. Evol. 2018;2:859–866. doi: 10.1038/s41559-018-0526-2. [DOI] [PubMed] [Google Scholar]
- 25.Nakai Y, Kubota S, Kohno S. Chromatin diminution and chromosome elimination in four Japanese hagfish species. Cytogenet. Cell Genet. 1991;56:196–198. doi: 10.1159/000133087. [DOI] [PubMed] [Google Scholar]
- 26.Caputo Barucchi V, Giovannotti M, Nisi Cerioni P, Splendiani A. Genome duplication in early vertebrates: insights from agnathan cytogenetics. Cytogenet. Genome Res. 2013;141:80–89. doi: 10.1159/000354098. [DOI] [PubMed] [Google Scholar]
- 27.Nakai Y, Kohno S. Elimination of the largest chromosome pair during differentiation into somatic cells in the Japanese hagfish, Myxine garmani (Cyclostomata, Agnatha) Cytogenet. Genome Res. 1987;45:80–83. doi: 10.1159/000132434. [DOI] [Google Scholar]
- 28.Nakai Y, et al. Chromosome elimination in three Baltic, south Pacific and north-east Pacific hagfish species. Chromosome Res. 1995;3:321–330. doi: 10.1007/BF00713071.
- 29.Smith JJ, Baker C, Eichler EE, Amemiya CT. Genetic consequences of programmed genome rearrangement. Curr. Biol. 2012;22:1524–1529. doi: 10.1016/j.cub.2012.06.028.
- 30.Kubota S, Kuro-o M, Mizuno S, Kohno S. Germ line-restricted, highly repeated DNA sequences and their chromosomal localization in a Japanese hagfish (Eptatretus okinoseanus). Chromosoma. 1993;102:163–173. doi: 10.1007/BF00387731.
- 31.Goto Y, Kubota S, Kohno S. Highly repetitive DNA sequences that are restricted to the germ line in the hagfish Eptatretus cirrhatus: a mosaic of eliminated elements. Chromosoma. 1998;107:17–32. doi: 10.1007/s004120050278.
- 32.Nabeyama M, Kubota S, Kohno S. Concerted evolution of a highly repetitive DNA family in eptatretidae (Cyclostomata, agnatha) implies specifically differential homogenization and amplification events in their germ cells. J. Mol. Evol. 2000;50:154–169. doi: 10.1007/s002399910017.
- 33.Kojima NF, et al. Whole chromosome elimination and chromosome terminus elimination both contribute to somatic differentiation in Taiwanese hagfish Paramyxine sheni. Chromosome Res. 2010;18:383–400. doi: 10.1007/s10577-010-9122-2.
- 34.Kinsella CM, et al. Programmed DNA elimination of germline development genes in songbirds. Nat. Commun. 2019;10:5468. doi: 10.1038/s41467-019-13427-4.
- 35.Simakov O, et al. Deeply conserved synteny and the evolution of metazoan chromosomes. Sci. Adv. 2022;8:eabi5884. doi: 10.1126/sciadv.abi5884.
- 36.Zwaenepoel A, Van de Peer Y. Inference of ancient whole-genome duplications and the evolution of gene duplication and loss rates. Mol. Biol. Evol. 2019;36:1384–1404. doi: 10.1093/molbev/msz088.
- 37.Holland PWH, Marlétaz F, Maeso I, Dunwell TL, Paps J. New genes from old: asymmetric divergence of gene duplicates and the evolution of development. Philos. Trans. R. Soc. B. 2017;372:20150480. doi: 10.1098/rstb.2015.0480.
- 38.Kuraku S. Palaeophylogenomics of the vertebrate ancestor—impact of hidden paralogy on hagfish and lamprey gene phylogeny. Integr. Comp. Biol. 2010;50:124–129. doi: 10.1093/icb/icq044.
- 39.Coulier F, Popovici C, Villet R, Birnbaum D. MetaHox gene clusters. J. Exp. Zool. 2000;288:345–351. doi: 10.1002/1097-010X(20001215)288:4<345::AID-JEZ7>3.0.CO;2-Y.
- 40.Furlong RF, Holland PWH. Were vertebrates octoploid? Philos. Trans. R. Soc. B. 2002;357:531–544. doi: 10.1098/rstb.2001.1035.
- 41.Robertson FM, et al. Lineage-specific rediploidization is a mechanism to explain time-lags between genome duplication and evolutionary diversification. Genome Biol. 2017;18:111. doi: 10.1186/s13059-017-1241-z.
- 42.Gundappa MK, et al. Genome-wide reconstruction of rediploidization following autopolyploidization across one hundred million years of salmonid evolution. Mol. Biol. Evol. 2022;39:msab310. doi: 10.1093/molbev/msab310.
- 43.Parey E, et al. An atlas of fish genome evolution reveals delayed rediploidization following the teleost whole-genome duplication. Genome Res. 2022;32:1685–1697. doi: 10.1101/gr.276953.122.
- 44.Redmond AK, Casey D, Gundappa MK, Macqueen DJ, McLysaght A. Independent rediploidization masks shared whole genome duplication in the sturgeon-paddlefish ancestor. Nat. Commun. 2023;14:2879. doi: 10.1038/s41467-023-38714-z.
- 45.Lien S, et al. The Atlantic salmon genome provides insights into rediploidization. Nature. 2016;533:200–205. doi: 10.1038/nature17164.
- 46.Session AM, et al. Genome evolution in the allotetraploid frog Xenopus laevis. Nature. 2016;538:336–343. doi: 10.1038/nature19840.
- 47.Chen Z, et al. De novo assembly of the goldfish (Carassius auratus) genome and the evolution of genes after whole-genome duplication. Sci. Adv. 2019;5:eaav0547. doi: 10.1126/sciadv.aav0547.
- 48.Fontana F, et al. Evidence of hexaploid karyotype in shortnose sturgeon. Genome. 2008;51:113–119. doi: 10.1139/G07-112.
- 49.Martin KJ, Holland PWH. Enigmatic orthology relationships between Hox clusters of the African butterfly fish and other teleosts following ancient whole-genome duplication. Mol. Biol. Evol. 2014;31:2592–2611. doi: 10.1093/molbev/msu202.
- 50.Shimeld SM, Holland PW. Vertebrate innovations. Proc. Natl Acad. Sci. USA. 2000;97:4449–4452. doi: 10.1073/pnas.97.9.4449.
- 51.Wada H, Makabe K. Genome duplications of early vertebrates as a possible chronicle of the evolutionary history of the neural crest. Int. J. Biol. Sci. 2006;2:133–141. doi: 10.7150/ijbs.2.133.
- 52.Martik ML, Bronner ME. Riding the crest to get a head: neural crest evolution in vertebrates. Nat. Rev. Neurosci. 2021;22:616–626. doi: 10.1038/s41583-021-00503-2.
- 53.Simões-Costa M, Bronner ME. Establishing neural crest identity: a gene regulatory recipe. Development. 2015;142:242–257. doi: 10.1242/dev.105445.
- 54.Sauka-Spengler T, Meulemans D, Jones M, Bronner-Fraser M. Ancient evolutionary origin of the neural crest gene regulatory network. Dev. Cell. 2007;13:405–420. doi: 10.1016/j.devcel.2007.08.005.
- 55.Hockman D, et al. A genome-wide assessment of the ancestral neural crest gene regulatory network. Nat. Commun. 2019;10:4689. doi: 10.1038/s41467-019-12687-4.
- 56.Martik ML, et al. Evolution of the new head by gradual acquisition of neural crest regulatory circuits. Nature. 2019;574:675–678. doi: 10.1038/s41586-019-1691-4.
- 57.Minguillon C, Gibson-Brown JJ, Logan MP. Tbx4/5 gene duplication and the origin of vertebrate paired appendages. Proc. Natl Acad. Sci. USA. 2009;106:21726–21730. doi: 10.1073/pnas.0910153106.
- 58.Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155. doi: 10.1126/science.290.5494.1151.
- 59.Marlétaz F, et al. Amphioxus functional genomics and the origins of vertebrate gene regulation. Nature. 2018;564:64–70. doi: 10.1038/s41586-018-0734-6.
- 60.Force A, et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999;151:1531–1545. doi: 10.1093/genetics/151.4.1531.
- 61.Dong EM, Allison WT. Vertebrate features revealed in the rudimentary eye of the Pacific hagfish (Eptatretus stoutii). Proc. Biol. Sci. 2021;288:20202187. doi: 10.1098/rspb.2020.2187.
- 62.Venkatesh B, et al. Elephant shark genome provides unique insights into gnathostome evolution. Nature. 2014;505:174–179. doi: 10.1038/nature12826.
- 63.Theill LE, Boyle WJ, Penninger JM. RANK-L and RANK: T cells, bone loss, and mammalian evolution. Annu. Rev. Immunol. 2002;20:795–823. doi: 10.1146/annurev.immunol.20.100301.064753.
- 64.Poole KES, Reeve J. Parathyroid hormone—a bone anabolic and catabolic agent. Curr. Opin. Pharmacol. 2005;5:612–617. doi: 10.1016/j.coph.2005.07.004.
- 65.Fudge DS, et al. From ultra-soft slime to hard α-keratins: the many lives of intermediate filaments. Integr. Comp. Biol. 2009;49:32–39. doi: 10.1093/icb/icp007.
- 66.Zeng Y, et al. Epidermal threads reveal the origin of hagfish slime. eLife. 2023;12:e81405. doi: 10.7554/eLife.81405.
- 67.Fudge DS, Levy N, Chiu S, Gosline JM. Composition, morphology and mechanics of hagfish slime. J. Exp. Biol. 2005;208:4613–4625. doi: 10.1242/jeb.01963.
- 68.Van Rechem C, et al. The SKP1-Cul1-F-box and leucine-rich repeat protein 4 (SCF-FbxL4) ubiquitin ligase regulates lysine demethylase 4A (KDM4A)/Jumonji domain-containing 2A (JMJD2A) protein. J. Biol. Chem. 2011;286:30462–30470. doi: 10.1074/jbc.M111.273508.
- 69.Nishimura O, et al. Inference of a genome-wide protein-coding gene set of the inshore hagfish Eptatretus burgeri. F1000Res. 2022;11:1270. doi: 10.12688/f1000research.124719.1.
- 70.Timoshevskiy VA, Herdy JR, Keinath MC, Smith JJ. Cellular and molecular features of developmentally programmed genome rearrangement in a vertebrate (sea lamprey: Petromyzon marinus). PLoS Genet. 2016;12:e1006103. doi: 10.1371/journal.pgen.1006103.
- 71.Gai Z, et al. Galeaspid anatomy and the origin of vertebrate paired appendages. Nature. 2022;609:959–963. doi: 10.1038/s41586-022-04897-6.
- 72.Green, M. R. & Sambrook, J. Molecular Cloning: A Laboratory Manual 4th edn (Cold Spring Harbor Laboratory Press, 2012).
- 73.Chapman JA, et al. Meraculous: de novo genome assembly with short paired-end reads. PLoS ONE. 2011;6:e23501. doi: 10.1371/journal.pone.0023501.
- 74.English AC, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE. 2012;7:e47768. doi: 10.1371/journal.pone.0047768.
- 75.Putnam NH, et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 2016;26:342–350. doi: 10.1101/gr.193474.115.
- 76.Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb. Protoc. 2010;2010:pdb.prot5448. doi: 10.1101/pdb.prot5448.
- 77.Miller JR, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24:2818–2824. doi: 10.1093/bioinformatics/btn548.
- 78.Vurture GW, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–2204. doi: 10.1093/bioinformatics/btx153.
- 79.Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 80.Niknafs YS, Pandian B, Iyer HK, Chinnaiyan AM, Iyer MK. TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat. Methods. 2017;14:68–70. doi: 10.1038/nmeth.4078.
- 81.Grabherr MG, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883.
- 82.Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:R7. doi: 10.1186/gb-2008-9-1-r7.
- 83.Stanke M, et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–W439. doi: 10.1093/nar/gkl200.
- 84.Altenhoff AM, et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 2019;29:1152–1163. doi: 10.1101/gr.243212.118.
- 85.Marlétaz F, Peijnenburg KTCA, Goto T, Satoh N, Rokhsar DS. A new spiralian phylogeny places the enigmatic arrow worms among gnathiferans. Curr. Biol. 2019;29:312–318. doi: 10.1016/j.cub.2018.11.042.
- 86.Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010;26:1958–1964. doi: 10.1093/bioinformatics/btq338.
- 87.Di Franco A, Poujol R, Baurain D, Philippe H. Evaluating the usefulness of alignment filtering methods to reduce the impact of errors on evolutionary inferences. BMC Evol. Biol. 2019;19:21. doi: 10.1186/s12862-019-1350-2.
- 88.Criscuolo A, Gribaldo S. BMGE (block mapping and gathering with entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 2010;10:210. doi: 10.1186/1471-2148-10-210.
- 89.Minh BQ, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 2020;37:1530–1534. doi: 10.1093/molbev/msaa015.
- 90.Rodrigue N, Lartillot N. Site-heterogeneous mutation-selection models within the PhyloBayes-MPI package. Bioinformatics. 2014;30:1020–1021. doi: 10.1093/bioinformatics/btt729.
- 91.Irisarri I, et al. Phylotranscriptomic consolidation of the jawed vertebrate timetree. Nat. Ecol. Evol. 2017;1:1370–1378. doi: 10.1038/s41559-017-0240-5.
- 92.Kuraku, S., Ota, K. G. & Kuratani, S. in The Timetree of Life (eds Blair Hedges, S. & Kumar, S.) 317–319 (Oxford Univ. Press, 2009).
- 93.Derelle R, Philippe H, Colbourne JK. Broccoli: combining phylogenetic and network analyses for orthology assignment. Mol. Biol. Evol. 2020;37:3389–3396. doi: 10.1093/molbev/msaa159.
- 94.Morel B, Kozlov AM, Stamatakis A, Szöllősi GJ. GeneRax: a tool for species-tree-aware maximum likelihood-based gene family tree inference under gene duplication, transfer, and loss. Mol. Biol. Evol. 2020;37:2763–2774. doi: 10.1093/molbev/msaa141.
- 95.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010.
- 96.Duchemin W, et al. RecPhyloXML: a format for reconciled gene trees. Bioinformatics. 2018;34:3646–3652. doi: 10.1093/bioinformatics/bty389.
- 97.Huerta-Cepas J, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47:D309–D314. doi: 10.1093/nar/gky1085.
- 98.Thomas PD, et al. PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 2022;31:8–22. doi: 10.1002/pro.4218.
- 99.Szöllősi GJ, Rosikiewicz W, Boussau B, Tannier E, Daubin V. Efficient exploration of the space of reconciled gene trees. Syst. Biol. 2013;62:901–912. doi: 10.1093/sysbio/syt054.
- 100.Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35:4453–4455. doi: 10.1093/bioinformatics/btz305.
- 101.Shimodaira H, Hasegawa M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics. 2001;17:1246–1247. doi: 10.1093/bioinformatics/17.12.1246.
- 102.Johnson M, et al. NCBI BLAST: a better web interface. Nucleic Acids Res. 2008;36:W5–W9. doi: 10.1093/nar/gkn201.
- 103.Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023;39:btad014. doi: 10.1093/bioinformatics/btad014.
- 104.Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656.
- 105.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008;9:559. doi: 10.1186/1471-2105-9-559.
- 106.Yanai I, et al. Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005;21:650–659. doi: 10.1093/bioinformatics/bti042.
- 107.Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197.
- 108.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 109.Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinformatics. 2007;23:1289–1291. doi: 10.1093/bioinformatics/btm091.
- 110.Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
- 111.Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. doi: 10.1101/gr.074492.107.
- 112.Flynn JM, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA. 2020;117:9451–9457.
- 113.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 114.Seshan, V. E. & Olshen, A. DNAcopy: DNA copy number data analysis. R package version 1.76.0 https://bioconductor.org/packages/DNAcopy (2023).
- 115.Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
- 116.Rooney, D. E. Human Cytogenetics: Constitutional Analysis (Oxford Univ. Press, 2001).
- 117.Timoshevskiy VA, Sharma A, Sharakhov IV, Sharakhova MV. Fluorescent in situ hybridization on mitotic chromosomes of mosquitoes. J. Vis. Exp. 2012;67:e4215. doi: 10.3791/4215.
Associated Data
Supplementary Materials
This file contains Supplementary Notes 1–3, Supplementary References, and Supplementary Fig. 1 (Phylogenetic trees inferred for paralogons in each CLG assuming the C20+R model).
This file contains Supplementary Tables 1–16.
Data Availability Statement
Raw and processed sequences have been deposited in the NCBI Sequence Read Archive (SRA) (PRJNA953751) and Gene Expression Omnibus (GEO) (GSE230176). The RNA-seq data for M. glutinosa are available on the SRA (SRR25213276). The resequencing data are also available on the SRA (blood, SRR24133795; testes, SRR24130678). RNA-seq datasets used for comparative analyses are publicly available for Japanese lamprey (PRJNA354821, PRJNA349779 and PRJNA312435), gar (PRJNA255881), amphioxus (PRJNA416977) and sea lamprey (PRJNA497902). The read data are available at PRJNA953751. The genome and its annotation are also deposited in Zenodo (https://zenodo.org/records/10227719). Source data are provided with this paper.
The code used is available at https://github.com/fmarletaz/hagfish.