
This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 May 17:2024.05.16.593026. [Version 1] doi: 10.1101/2024.05.16.593026

Combining DNA and protein alignments to improve genome annotation with LiftOn

Kuan-Hao Chao 1,2,*, Jakob M Heinz 3, Celine Hoh 1,2, Alan Mao 1,2,4, Alaina Shumate 2,4, Mihaela Pertea 1,2,4, Steven L Salzberg 1,2,4,5,*
PMCID: PMC11118573  PMID: 38798552

Abstract

As the number and variety of assembled genomes continues to grow, the number of annotated genomes is falling behind, particularly for eukaryotes. DNA-based mapping tools help to address this challenge, but they are only able to transfer annotation between closely-related species. Here we introduce LiftOn, a homology-based software tool that integrates DNA and protein alignments to enhance the accuracy of genome-scale annotation and to allow mapping between relatively distant species. LiftOn's protein-centric algorithm considers both types of alignments, chooses optimal open reading frames, resolves overlapping gene loci, and finds additional gene copies where they exist. LiftOn can reliably transfer annotation between genomes representing members of the same species, as we demonstrate on human, mouse, honey bee, rice, and Arabidopsis thaliana. It can further map annotation effectively across species pairs as far apart as mouse and rat or Drosophila melanogaster and D. erecta.

1. Introduction

Recent advancements in sequencing technologies have resulted in an exponential increase in available genome assemblies. Long-read sequencing technology, which has become faster, more cost-effective, and more accurate over the past decade 1, 2, 3, 4, 5, has notably enhanced the accuracy and contiguity of genome assemblies. A prominent example is the telomere-to-telomere (T2T) assembly of the human genome CHM13 6, which has subsequently been used to guide the assembly of multiple additional human genomes 7, 8, 9. As of December 2023, the NCBI database contained 30,530 eukaryotic genomes and 567,228 prokaryotic genomes 10, 11, 12.

Genome annotation – identifying genes and other biological features – is essential to understanding the biology of these genomes 13. Annotating eukaryotic genomes presents particular challenges 14, 15 due to their multi-exon gene structures, extensive intergenic regions, and sparse gene density. This complexity necessitates detailed analysis and manual curation, making the process slower and more difficult to automate than genome assembly 14, 16, 17, 18. One widely used approach to annotation is ab initio gene prediction, but even the best ab initio systems miss many genes and struggle to get exon-intron structures precisely correct 19, 20, 21, 22, 23, 24, 25. RNA sequencing can more accurately identify genes by directly capturing gene transcripts, although it may miss genes if they are only expressed at low levels or in certain hard-to-capture tissues 26, 27. Annotation "lift-over" strategies, which transfer annotations from a well-annotated genome to a newly sequenced one based on homology and synteny, provide a more efficient and cost-effective annotation method, particularly when annotations from the same or closely related species are available.

Currently, the best approach for transferring annotation between assemblies is DNA-based, exemplified by the Liftoff 28 and CAT 29 programs, which were used to create the initial annotation of the human T2T-CHM13 genome 6 based on the annotation of the human reference genome, GRCh38. However, in cases where the DNA sequence of a newly assembled genome deviates substantially from the reference genome, a DNA-based alignment process sometimes produces transcripts with invalid open reading frames or with erroneous splice sites. DNA-based mapping becomes even more challenging when the new genome is less closely related to the available reference.

By comparison with DNA, protein sequences of orthologous genes are conserved at much greater evolutionary distances. This observation motivated us to develop a new method that integrates protein sequence alignment into the lift-over process. A key step in this approach is to align protein sequences from the reference genome to the target, considering all six reading frames and allowing for spliced alignments to span introns. However, protein-based alignment alone cannot map all annotation between two genomes. First and most obviously, it cannot capture untranslated regions (UTRs) on either end of a transcript. Second, when small exons are separated by much longer introns, as is common in the human genome, protein-based aligners may miss some small exons entirely (as illustrated in Figures S1A-B). Third, because intronic alignments are not considered, protein-to-DNA alignment is susceptible to aligning proteins to pseudogenes. Fourth, this approach sometimes combines coding sequences (CDSs) from distinct genes when multiple members of a gene family are arranged in tandem along a genome (Figure S1C). Finally, another obvious limitation is that a protein alignment strategy cannot transfer annotations of non-coding genes or other features.

LiftOn is a homology-based genome annotation tool that uses both DNA and protein sequence alignment, and that builds on Liftoff, the current leading homology-based annotation lift-over tool. LiftOn includes several key functions to create annotation: (1) it uses a protein-maximization algorithm that combines both DNA and protein sequence alignment to generate protein-coding gene annotations that maximize similarity to the reference proteins; (2) it checks alternative open reading frames (ORFs) for truncated proteins to identify the ORF that yields the longest match to the reference protein; (3) similarly to Liftoff, it finds extra copies of protein-coding genes in the target genome; (4) it reports the various types of mutations for proteins that fail to match the reference perfectly, similarly to LiftoffTools 30; and (5) it resolves issues such as overlapping gene loci and multi-mapping for genes within a large gene family. By combining the advantages of DNA and protein sequence alignment, LiftOn generates better protein-coding gene annotation than either alignment method can achieve on its own.

In our experiments below, we demonstrate how LiftOn can yield substantial improvements in mapping annotations from one human genome, GRCh38, to another, T2T-CHM13, using three different human annotation sets: MANE 31, CHESS 32, and RefSeq 33. Additionally, we show that LiftOn is effective for mapping annotation between members of non-human species, including Mus musculus (mouse), Apis mellifera (honey bee), Arabidopsis thaliana (thale cress), and Oryza sativa (rice). To demonstrate LiftOn's effectiveness at mapping annotation between distinct but closely related species, we mapped human genes onto Pan troglodytes (chimpanzee). Finally, we illustrate that LiftOn works on more distantly related species by mapping annotation from Drosophila melanogaster to Drosophila erecta and from Mus musculus to Rattus norvegicus.

2. Results

2.1. A two-step protein-maximization algorithm to improve protein-coding gene annotations

LiftOn implements a two-step protein-maximization (PM) algorithm (illustrated in Figure 1 and discussed in more detail in Methods) to find the best annotations at protein-coding gene loci. First, it uses a chaining algorithm, described below, to find the exon-intron boundaries of protein coding transcripts. Second, if the coding sequence (CDS) does not preserve the entire protein sequence of the reference protein, it adjusts the CDS boundaries to preserve as much of the reference protein as possible. (Note that CDS features are defined as the coding portions of exons, from start to stop, and excluding untranslated portions of exons.)

Figure 1.

The protein-maximization (PM) algorithm consists of two modules: (A-E) the chaining algorithm, and (F-K) the open reading frame search algorithm. (A) Matched protein-coding transcripts mapped by Liftoff (green) and miniprot (orange) at the same location in a target genome. The transcript in blue represents the correct transcript annotation on the target genome. Liftoff's mapping has an erroneous splice junction between L3 and L4, while miniprot's mapping has a missing splice junction in M6. (B) Pairwise alignment results of the proteins mapped by Liftoff and miniprot to the reference protein. The figure shows a premature stop codon in the Liftoff protein, while the miniprot alignment has a mismatched protein sequence at the end. (C) Pairwise alignment mappings with added exon/CDS boundaries. (D) CDSs are grouped based on the cumulative lengths of the amino acids in the reference protein as described in the main text. In this example, CDSs are organized into groups: GL1={L1}, GM1={M1}, GL2={L2}, GM2={M2}, GL3={L3,L4}, GM3={M3,M4}, GL4={L5}, GM4={M5}, GL5={L6,L7} and GM5={M6}. The chaining algorithm iterates through each group, comparing the corresponding partial protein sequences to the reference protein and chaining those with higher protein sequence identity. (E) In this example, GL1, GL2, GM3, GL4 and GL5 are chained, forming the new protein-coding transcript CDS list. This list includes L1, L2, M3, M4, L5, L6 and L7 in the LiftOn annotation. (F-K) Schematic diagrams illustrating how the ORF search algorithm handles various types of sequence mutations. This process leads to changes in the gene annotation of both translated and untranslated regions. (F) Frameshift mutation: a variation caused by the insertion or deletion of a sequence of nucleotides whose length is not divisible by three. In this example, the indel introduces a premature stop codon. (G-H) Point mutations leading to premature stop codons. 
LiftOn searches for the longest ORF, considering two scenarios: (G) depicts the selection of the first encountered stop codon, while (H) illustrates the switch to the downstream start codon. (I) Stop codon loss: when a stop codon is deleted, LiftOn identifies a new stop codon in the 3' UTR. (J-K) Start codon loss: in this scenario, LiftOn searches for a new start codon, exploring both downstream in the coding region (J) and upstream in the 5’ UTR (K).

The chaining algorithm (Figure 1A-E) starts by pairing up miniprot alignments with transcripts identified by Liftoff (see Methods and Algorithm S1 for details on the pairing approach). After two transcripts are paired, the protein sequences from the Liftoff and miniprot annotations are then aligned to the full-length reference protein, as illustrated in Figure 1B. Subsequently, LiftOn maps the CDS boundaries from both the Liftoff and miniprot annotations onto the protein alignment (Figure 1C and Algorithm S2).

The CDSs within the Liftoff and miniprot annotations are grouped in the 5’ to 3’ direction. The CDS groups are represented as GL_i and GM_i for Liftoff and miniprot, respectively, where i denotes the ith group in that annotation (Figure 1D).

The grouping process begins with the first CDS in each annotation and continues until reaching the endpoints of the downstream CDS in Liftoff and miniprot, where the number of aligned amino acids from the reference protein is equal. This forms the first CDS group in Liftoff, denoted as GL_1, and the first CDS group in miniprot, denoted as GM_1. Subsequent groups start from the previous endpoint in both Liftoff and miniprot, extending until the number of aligned amino acids from the reference protein matches for both annotations again. These subsequent groups are represented as GL_2 and GM_2, respectively. The grouping process concludes upon reaching the last CDS in both annotations (see Algorithm S3 for more details).
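
The grouping rule can be illustrated with a minimal Python sketch. It is not the LiftOn implementation (that is Algorithm S3, which is not reproduced here): it assumes each CDS is summarized by the cumulative number of reference amino acids aligned up to its end, and the function name `group_cds` and the example numbers are hypothetical, chosen only to reproduce the grouping of Figure 1D.

```python
from typing import List, Tuple


def group_cds(liftoff_ends: List[int],
              miniprot_ends: List[int]) -> List[Tuple[List[int], List[int]]]:
    """Group CDSs so that each pair of groups ends at the same reference position.

    Each input list holds, per CDS in 5' to 3' order, the cumulative number of
    reference amino acids aligned up to the end of that CDS (an assumed summary).
    Returns a list of (Liftoff CDS indices, miniprot CDS indices) pairs.
    """
    groups = []
    i = j = 0                  # CDS currently examined in each annotation
    start_i = start_j = 0      # first CDS of the group being built
    while i < len(liftoff_ends) and j < len(miniprot_ends):
        if liftoff_ends[i] == miniprot_ends[j]:
            # Both annotations have consumed the same number of reference
            # amino acids: close the current group in each annotation.
            groups.append((list(range(start_i, i + 1)),
                           list(range(start_j, j + 1))))
            i += 1
            j += 1
            start_i, start_j = i, j
        elif liftoff_ends[i] < miniprot_ends[j]:
            i += 1             # Liftoff is behind: extend its group
        else:
            j += 1             # miniprot is behind: extend its group
    # Any CDSs left over at the 3' end form a final group.
    if start_i < len(liftoff_ends) or start_j < len(miniprot_ends):
        groups.append((list(range(start_i, len(liftoff_ends))),
                       list(range(start_j, len(miniprot_ends)))))
    return groups


# Hypothetical cumulative counts for L1..L7 and M1..M6 (indices are 0-based).
liftoff_ends = [40, 85, 120, 170, 210, 250, 300]
miniprot_ends = [40, 85, 126, 170, 210, 300]
for gl, gm in group_cds(liftoff_ends, miniprot_ends):
    print([f"L{k + 1}" for k in gl], [f"M{k + 1}" for k in gm])
```

With these made-up numbers the sketch yields five group pairs, {L1}/{M1}, {L2}/{M2}, {L3,L4}/{M3,M4}, {L5}/{M5}, and {L6,L7}/{M6}, matching the example in the Figure 1 caption.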

For each pair of groups, GL_i and GM_i, we calculate the partial protein sequence identity and select the group with the higher protein sequence identity score (Figure 1D and Algorithm S4). In case of a tie, LiftOn prioritizes the Liftoff annotation, GL_i, so that it will include UTRs in its output. The selected group of CDSs is represented by GSEL_i. All CDSs in GSEL_i are then concatenated to form the final LiftOn transcript, as shown in Figure 1E. This transcript is an ordered sequence of CDSs sourced from either Liftoff or miniprot, with the goal of maximizing protein similarity to the reference protein. This approach is particularly effective in addressing issues such as in-frame indels or mis-splicing that may arise from misalignments as illustrated in Figure 1A.
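
The selection step can be sketched as follows. This is an illustration under stated assumptions, not LiftOn's code: the `Group` class, the `chain` function, and all identity values are hypothetical, and the partial identities are supplied by hand rather than computed from alignments. Only the decision rule (higher identity wins, ties go to Liftoff) comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Group:
    cds: List[str]      # CDS labels in 5' to 3' order
    identity: float     # partial protein identity to the reference (assumed given)


def chain(pairs: List[Tuple[Group, Group]]) -> List[str]:
    """Concatenate, group by group, the CDSs with the higher partial identity.

    Each pair is (Liftoff group GL_i, miniprot group GM_i). The miniprot group
    is chosen only when it is strictly better, so ties fall to Liftoff.
    """
    chained: List[str] = []
    for liftoff_group, miniprot_group in pairs:
        if miniprot_group.identity > liftoff_group.identity:
            chosen = miniprot_group
        else:
            chosen = liftoff_group
        chained.extend(chosen.cds)
    return chained


# Invented scores mimicking Figure 1: Liftoff has a bad splice junction in
# group 3, and miniprot has a mismatched sequence at the end (group 5).
pairs = [
    (Group(["L1"], 1.00), Group(["M1"], 1.00)),        # tie, keep Liftoff
    (Group(["L2"], 1.00), Group(["M2"], 1.00)),        # tie, keep Liftoff
    (Group(["L3", "L4"], 0.55), Group(["M3", "M4"], 0.98)),
    (Group(["L5"], 1.00), Group(["M5"], 1.00)),        # tie, keep Liftoff
    (Group(["L6", "L7"], 0.97), Group(["M6"], 0.70)),
]
print(chain(pairs))
```

With these invented scores the chained CDS list is L1, L2, M3, M4, L5, L6, L7, the same combination described for Figure 1E.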

After applying the chaining algorithm, LiftOn attempts to make further improvements in the CDS regions, as illustrated in Figure 1F-K. It searches the translations of protein-coding transcripts and adjusts CDS boundaries to avoid early stop codons (Figure 1F, G), choose better translation start sites (Figure 1H,J,K), or extend proteins with stop codon loss (Figure 1I). Figure S2 provides additional IGV 34, 35 screenshots that illustrate LiftOn’s results for each scenario. After making these adjustments, LiftOn evaluates the differences between the reference and target transcripts and, similarly to LiftoffTools 30, produces a mutation report. Transcripts are deemed "identical" when their target and reference gene DNA sequences are entirely the same. For non-identical sequences, LiftOn categorizes their differences using these categories: synonymous, non-synonymous, in-frame insertion, in-frame deletion, frameshift, stop codon gain, stop codon loss, and start codon loss.
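
The idea behind the ORF search can be illustrated with a simplified sketch. It is not LiftOn's algorithm: LiftOn compares candidate ORFs against the reference protein, whereas this stand-in simply returns the longest ATG-to-stop ORF on the forward strand of a transcript sequence. The function name `longest_orf` and the toy sequence are hypothetical.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> Tuple[int, int]:
    """Return (start, end) of the longest ATG..stop ORF, end-exclusive.

    All three forward frames are scanned. The returned interval includes the
    stop codon. (0, 0) is returned when no complete ORF exists.
    """
    seq = transcript.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if pos + 3 - start > best[1] - best[0]:
                    best = (start, pos + 3)      # longest ORF so far
                start = None                     # look for the next ATG
    return best


# Toy transcript: a premature stop right after the annotated start codon,
# followed by a downstream ATG that opens a longer ORF (compare Figure 1H).
toy = "CC" + "ATGGCTTAA" + "ATGAAACCCGGGTTT" + "TGA" + "GG"
start, end = longest_orf(toy)
print(start, end, toy[start:end])
```

In the toy transcript the annotated start codon is followed almost immediately by a stop codon, so the sketch switches to the downstream start codon, which is the situation shown in Figure 1H.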

2.2. LiftOn improves human annotation lift-over

For our first experiment, we mapped annotations from GRCh38 onto T2T-CHM13 using RefSeq 33 (release 220) as the reference annotation (Figure S3). In total, there were 37,986 protein-coding and non-coding genes, with 160,561 transcripts in the reference annotation, of which 130,528 were protein-coding (see Table S1 for all mapped genes and Methods 4.7 for our gene counting approach). We focused our analysis on the protein-coding transcripts because those are the ones where LiftOn can produce improvements over DNA-based methods. At the gene level, LiftOn successfully lifted over 37,828 genes, while 158 genes remained unmapped, of which 103 were protein-coding. The overall gene mapping rate was 99.6%. Out of the successfully mapped genes, 37,453 were mapped as single copies and 375 genes were mapped with extra copies. In total, 38,916 gene loci were mapped onto T2T-CHM13 (Table 1).

Table 1.

Statistics for LiftOn at both the gene and transcript levels, as a result of mapping RefSeq release 220 annotation from the GRCh38 human genome to T2T-CHM13.

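| Feature | Genome | Total feature count | Protein-coding total | Protein-coding single copy | Protein-coding >1 copy | Protein-coding extra copy | Non-coding total | Non-coding single copy | Non-coding >1 copy | Non-coding extra copy |
|---|---|---|---|---|---|---|---|---|---|---|
| Gene | Reference (GRCh38) | 37,986 | 19,927 | - | - | - | 18,059 | - | - | - |
| Gene | Target (CHM13) | 38,916 | 20,144 | 19,738 | 86 | 320 | 18,772 | 17,715 | 289 | 768 |
| Transcript | Reference (GRCh38) | 160,561 | 130,528 | - | - | - | 30,033 | - | - | - |
| Transcript | Target (CHM13) | 161,701 | 130,779 | 129,967 | 239 | 573 | 30,922 | 29,488 | 410 | 1,024 |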

In assessing overall protein-coding gene annotations, we computed gap-compressed protein sequence identity scores for LiftOn, Liftoff, and miniprot. Figure 2A compares the mapping of protein-coding genes using Liftoff, which is based on DNA alignment, to miniprot, which uses protein-to-DNA alignment. Dots in the lower right part of the panel indicate transcripts where Liftoff was superior to miniprot, as measured by protein sequence identity, while the upper left part of the plot shows transcripts where miniprot was superior. This comparison shows that neither approach dominates the other. In the LiftOn versus Liftoff comparison (Figure 2B), 866 transcripts exhibit higher protein sequence identity with LiftOn, with 113 of those achieving 100% identity. Similarly, the LiftOn versus miniprot comparison (Figure 2C) shows that LiftOn finds better matches for 30,266 protein-coding transcripts, improving 22,746 to 100% identity. The protein sequence identity distributions are shown in Figure S4.
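
The text above does not spell out the formula for gap-compressed identity. The sketch below assumes the commonly used definition, in which a run of consecutive gap columns counts as a single difference regardless of its length, so identity = matches / (matches + mismatches + gap openings). The function name and the example alignment are hypothetical, and LiftOn's own calculation may differ in detail.

```python
def gap_compressed_identity(aligned_ref: str, aligned_query: str) -> float:
    """Identity of a pairwise alignment with each gap run counted once.

    Both strings must be the same length and use '-' for gap columns.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_opens = 0
    previous_gap_side = None   # which sequence had a gap in the previous column
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char == "-" or query_char == "-":
            side = "ref" if ref_char == "-" else "query"
            if side != previous_gap_side:
                gap_opens += 1             # a new gap run starts here
            previous_gap_side = side
        else:
            previous_gap_side = None
            if ref_char == query_char:
                matches += 1
            else:
                mismatches += 1
    total = matches + mismatches + gap_opens
    return matches / total if total else 0.0


# Five matches, one mismatch (L vs I), and one two-column gap counted once.
print(gap_compressed_identity("MKTAYLLV", "MKT--LIV"))
```

In this toy alignment the two-residue deletion contributes one difference rather than two, giving 5 / (5 + 1 + 1). Counting gaps this way keeps a single long indel from dominating the score.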

Figure 2.

Comparative analysis of various tools for mapping RefSeq protein annotations from GRCh38 to T2T-CHM13 v2.0. (A-C) Scatter plots of protein sequence identity. (A) Comparison between miniprot (y-axis) and Liftoff (x-axis), (B) comparison between LiftOn (y-axis) and Liftoff (x-axis), and (C) comparison between LiftOn (y-axis) and miniprot (x-axis). (D-G) Examples of improved annotation due to LiftOn’s two-step PM algorithm. (D) LiftOn employs Liftoff's annotation to correct a splice junction missed by miniprot in transcript NM_001083965.2 of the TDRKH gene. (E) For transcript NM_001384763.1 of the SLC22A31 gene, LiftOn uses miniprot’s annotation to resolve an incorrect acceptor site from Liftoff’s annotation. (F) For transcript XM_011517662.4 of the WASHC1 gene, LiftOn combines both annotations to rectify an omitted CDS by miniprot and a misidentified splice junction by Liftoff between the fifth and sixth exons. (G) LiftOn's ORF search algorithm selects an alternative downstream start codon for a frameshift mutation in transcript NM_001004692.2 of the OR2T12 gene, thereby conserving the majority of the protein sequence.

The LiftOn PM algorithm can improve protein-coding transcript annotations in four scenarios illustrated in Figure 2. First, as observed with the TDRKH gene (Figures 2D and S5A), miniprot missed a splice junction, leading to the retention of an intron and a premature stop codon. Here LiftOn adopted Liftoff's CDS, which yielded the correct protein. Second, as shown in Figures 2E and S5B for the SLC22A31 gene, miniprot correctly identified an AG acceptor site that Liftoff missed, yielding the correct CDS, which LiftOn combined with the UTRs from Liftoff. Third, LiftOn sometimes was able to fix errors in both methods, as shown in Figures 2F and S5C for the gene WASHC1. In this example, LiftOn corrected the omission of the first coding exon by miniprot and fixed an incorrect splice junction found by Liftoff, one that erroneously introduced a premature stop codon between the fifth and sixth exons. LiftOn's chaining algorithm effectively consolidates these segments, improving the protein sequence identity to 99.4%. Fourth, as shown in Figure 2G for the OR2T12 gene, a mutation introduced a premature stop codon (Figure 2G, inset), and the ORF search algorithm in LiftOn adjusted by scanning for ORFs that best match the reference protein sequence. Here it found an alternative start codon just two codons downstream that maintained a near full-length match to the reference protein.

LiftOn identifies and reports additional copies of lifted-over features, similarly to Liftoff. In the mapping from GRCh38 to CHM13, LiftOn identified 86 protein-coding genes with at least one extra copy, for a total of 320 new protein-coding gene loci. Of these, 289 extra gene copies were detected by Liftoff and 31 by miniprot, detailed in Tables S2 and S3. Figure 3 shows the relative positions of extra gene copies between the two genomes, illustrating that most were located on the same chromosomes.

Figure 3.

Plot illustrating the locations of extra gene copies found on T2T-CHM13 (left side) compared to GRCh38 (right side). The Circos plot was generated using pyCircos (https://github.com/ponnhide/pyCircos). Each line represents a gene copy mapped from the reference genome to the target genome and is color-coded by the chromosome of the original copy. The bands within the plot are sized in proportion to the actual sizes of the chromosomes.

We further compared the LiftOn-generated annotation of CHM13 with the current release of the T2T-CHM13 annotation (available from the T2T github site, https://github.com/marbl/CHM13). We excluded the annotation of chromosome Y from this analysis due to the extremely complex repeat structure of that chromosome, which confounds most attempts to align genes consistently to it 36. Our comparison considered only protein-coding transcripts on chromosomes 1-22 and X. LiftOn's annotation contained 665 protein-coding transcripts that had a higher protein sequence identity to the corresponding proteins on GRCh38, as compared to the current CHM13 annotation. Four examples are shown in Figure 4, each of which illustrates how LiftOn can improve the fidelity of the match between the source protein and the mapped-over version.

Figure 4.

Examples where LiftOn improves the current T2T-CHM13 annotation. (A) In the NM_006065.5 transcript of the SIRPB1 gene, the current CHM13 annotation omits three coding exons. The LiftOn version finds those exons and increases the DNA sequence identity from 81% to 98%. (B) In transcript NM_001134939.1 of OAZ3, the CHM13 annotation incorporates a partial CDS in the second exon, leading to a truncated protein. LiftOn corrects this mis-annotation, increasing the protein sequence identity from 5.42% to 100%. (C) In transcript XM_047448259.1 of EPHA2, the published annotation chooses the wrong start codon. LiftOn finds a better start codon that improves the protein sequence identity from 2.4% to 98.7%. (D) In transcript NM_001099772.2 from CYP4B1, LiftOn shifts the donor site of the seventh coding exon by 11 nucleotides, fixing a frameshift and improving the protein sequence identity from 53% to 99%.

Among the 130,528 protein-coding transcripts mapped onto CHM13 by LiftOn, 13,486 transcripts were identical and 86,335 had only synonymous mutations. Of the remaining 30,707 transcripts, only 1,046 had a protein with less than 95% identity to the reference protein. Overall, LiftOn’s mapped-over annotation has greater fidelity to the source annotation (from GRCh38) than the current T2T-CHM13 annotation (Figure S6).

We observed similar improvements (as compared to both Liftoff and miniprot) when we lifted over different human annotations, including MANE 31 (Note S1, Table S4 and Figure S7) and CHESS3 32 (Note S2, Table S5 and Figure S8). Additionally, to assess the ability of LiftOn to map annotation between genomes of non-human species, we ran experiments on two different genomes from each of four plant and animal species: Mus musculus 37, 38 (Note S3, Table S6 and Figure S9), Apis mellifera 39, 40 (honey bee, see Note S4, Table S7 and Figure S10), Oryza sativa 41, 42 (Asian rice, Note S5, Table S8 and Figure S11), and Arabidopsis thaliana 43, 44 (Note S6, Table S9 and Figure S12), and obtained comparable enhancements. These analyses demonstrate the robust performance of LiftOn in handling annotation lift-over across a broad spectrum of organisms.

2.3. LiftOn improves both closely and distantly related species annotation lift-over

To demonstrate that LiftOn can reliably lift-over cross-species annotations, we used it to transfer genes between both closely-related and more distantly-related species. We measured genomic distance using Dashing 45 and Mash 46.

2.3.1. Mapping from Homo sapiens to Pan troglodytes

Chimpanzees are the closest evolutionary relatives of humans 47, 48, 49, 50, so we chose this example to demonstrate the applicability of LiftOn between closely-related species. LiftOn successfully lifted over 37,509 genes (Table 2, single copy + >1 copy), leaving 477 genes unmapped, of which 285 were protein-coding and 192 noncoding. The overall gene mapping rate was 98.7%. Out of all mapped genes, 37,081 were mapped uniquely, while 428 genes were mapped to multiple locations, producing a total of 38,879 gene loci. LiftOn obtained higher protein sequence identity scores than Liftoff and miniprot for 4,332 and 33,509 transcripts respectively (Figure 5A, Note S7, and Figure S13).
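
As a check on the arithmetic, the gene mapping rates quoted in the text can be recomputed from the table counts. The sketch below assumes, as stated above, that mapped genes are the single-copy plus >1-copy genes (protein-coding and non-coding together) and that the denominator is the reference gene count. The function and dictionary names are illustrative only.

```python
def mapping_rate(single_copy: int, multi_copy: int, reference_total: int) -> float:
    """Percentage of reference features mapped at least once onto the target."""
    return 100.0 * (single_copy + multi_copy) / reference_total


# Gene-level counts taken from Tables 1 and 2 (protein-coding + non-coding).
experiments = {
    "GRCh38 -> T2T-CHM13": dict(single_copy=19_738 + 17_715,
                                multi_copy=86 + 289,
                                reference_total=37_986),
    "human -> chimpanzee": dict(single_copy=19_485 + 17_596,
                                multi_copy=157 + 271,
                                reference_total=37_986),
}
for name, counts in experiments.items():
    mapped = counts["single_copy"] + counts["multi_copy"]
    unmapped = counts["reference_total"] - mapped
    print(f"{name}: {mapped:,} mapped, {unmapped:,} unmapped, "
          f"{mapping_rate(**counts):.1f}%")
```

Extra copies are deliberately left out of the numerator: they add gene loci on the target (for example 535 + 835 for chimpanzee, bringing 37,509 mapped genes to 38,879 loci) but do not change how many reference genes were mapped.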

Table 2.

Statistics for LiftOn at both the gene and transcript levels, after mapping RefSeq release v220 annotation from the human genome to Pan troglodytes (mPanTro3-v1.1), from Drosophila melanogaster (genome assembly release 6 + ISO1MT) to Drosophila erecta (Prin_Dsim_3.1), and from Mus musculus (GRCm39) to Rattus norvegicus (mRatBN7.2).

| Lift-over experiment | Feature | Genome | Total feature count | Protein-coding total | Protein-coding single copy | Protein-coding >1 copy | Protein-coding extra copy | Non-coding total | Non-coding single copy | Non-coding >1 copy | Non-coding extra copy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Homo sapiens to Pan troglodytes | Gene | Reference (GRCh38) | 37,986 | 19,927 | - | - | - | 18,059 | - | - | - |
| Homo sapiens to Pan troglodytes | Gene | Target (Pan_trog_v1) | 38,879 | 20,177 | 19,485 | 157 | 535 | 18,702 | 17,596 | 271 | 835 |
| Homo sapiens to Pan troglodytes | Transcript | Reference (GRCh38) | 160,561 | 130,528 | - | - | - | 30,033 | - | - | - |
| Homo sapiens to Pan troglodytes | Transcript | Target (Pan_trog_v1) | 161,222 | 130,208 | 128,859 | 450 | 899 | 31,014 | 29,233 | 404 | 1,377 |
| Drosophila melanogaster to Drosophila erecta | Gene | Reference (D. melanogaster) | 16,005 | 13,962 | - | - | - | 2,043 | - | - | - |
| Drosophila melanogaster to Drosophila erecta | Gene | Target (D. erecta) | 15,276 | 13,472 | 13,180 | 140 | 152 | 1,804 | 1,804 | 0 | 0 |
| Drosophila melanogaster to Drosophila erecta | Transcript | Reference (D. melanogaster) | 33,176 | 30,749 | - | - | - | 2,427 | - | - | - |
| Drosophila melanogaster to Drosophila erecta | Transcript | Target (D. erecta) | 32,143 | 30,027 | 29,739 | 138 | 150 | 2,116 | 2,116 | 0 | 0 |
| Mus musculus to Rattus norvegicus | Gene | Reference (M. musculus) | 35,551 | 22,192 | - | - | - | 13,359 | - | - | - |
| Mus musculus to Rattus norvegicus | Gene | Target (R. norvegicus) | 33,706 | 20,780 | 20,490 | 126 | 164 | 12,926 | 12,926 | 0 | 0 |
| Mus musculus to Rattus norvegicus | Transcript | Reference (M. musculus) | 119,745 | 96,192 | - | - | - | 23,553 | - | - | - |
| Mus musculus to Rattus norvegicus | Transcript | Target (R. norvegicus) | 115,855 | 93,055 | 92,767 | 125 | 163 | 22,800 | 22,800 | 0 | 0 |

Figure 5.

(A) Comparative analysis of lifting over RefSeq v220 annotations from Homo sapiens (GRCh38) to Pan troglodytes (NHGRI_mPanTro3-v1.1). (B) Comparative analysis of lifting over annotations from Drosophila melanogaster (genome assembly release 6 + ISO1MT) to Drosophila erecta (Prin_Dsim_3.1). (C) Comparative analysis of lifting over annotations from Mus musculus (GRCm39) to Rattus norvegicus (mRatBN7.2). Graphs labeled (a) show protein-gene order plots, with the x-axis representing the reference genome and the y-axis representing the target genome. The protein sequence identities are color-coded on a logarithmic scale, ranging from green (1) to red (0), where 1 indicates identical sequences and 0 indicates no shared amino acids. The gene order plot script was customized from LiftoffTools. Graphs labeled (b) are 3-D protein sequence identity plots comparing Liftoff on the x-axis, miniprot on the y-axis, and LiftOn on the z-axis. Each dot represents a protein-coding transcript. If a dot is above the x=y plane, LiftOn's mapping produced a higher protein sequence identity score than the other programs. Graphs labeled (c) are frequency plots on a logarithmic scale of protein sequence identity for LiftOn (left), Liftoff (middle), and miniprot (right).

2.3.2. Mapping from Drosophila melanogaster to Drosophila erecta

We then mapped the annotation between two Drosophila species 51, 52 that are considerably more distant from each other than human and chimpanzee: Drosophila melanogaster 53 and D. erecta 54, 55, with a Dashing 45 similarity score of 0.07 and a Mash 46 distance of 0.08. A protein-coding gene order plot (Figure 5B) shows multiple genome rearrangements, illustrating the divergence between the genomes. The LiftOn gene- and transcript-level statistics are summarized in Table 2, showing a gene mapping rate of 94.5% and a transcript mapping rate of 96.4%. LiftOn yields improved annotations for 4,583 and 7,763 protein-coding transcripts compared to Liftoff and miniprot respectively (Note S8 and Figure S14).

2.3.3. Mapping from Mus musculus to Rattus norvegicus

Finally, we chose two model organisms, Mus musculus 56 and Rattus norvegicus 57 (mouse and rat), to showcase LiftOn’s ability to map annotation between even more distant species, with a Dashing 45 similarity score of 0.01 and a Mash 46 distance of 0.12. The results are illustrated in Figure 5C and Table 2, which show a gene mapping rate of 94.3% and a transcript mapping rate of 96.6%. For the mouse to rat mapping, LiftOn improves the annotations for 15,420 and 30,574 protein-coding transcripts as compared to Liftoff and miniprot respectively. The frequency plots in Figure 5C(c) reveal that LiftOn had only 2,492 protein-coding transcripts below the 40% identity threshold, a notable improvement over Liftoff (9,574) and miniprot (9,292). More details are included in Supplementary Note S9 and Figures S15-S16.

3. Discussion

Annotation lift-over is a powerful means to transfer annotation from one genome to another, but mapping requires accurate alignment, which increases in difficulty as genomes become more divergent. Incorporating protein sequence alignment, as we have done here, addresses some of the challenges, but factors such as pseudogenes, large gene families, structural variations, and the quality of the source annotation all affect the difficulty of the problem.

LiftOn’s accuracy will always be critically dependent on the accuracy of the source annotation. As an example, consider the mapping of two uncharacterized protein-coding genes, LOC124905331 and LOC112268317, both of which mapped onto CHM13 in multiple copies. LOC124905331 appears on an unplaced contig (NT_187388.1) in GRCh38, and on CHM13, LiftOn mapped it to chromosome 21, spanning positions 3,205,883 to 5,522,421, with 50 tandemly repeated copies. LOC112268317 was initially on another unplaced contig (NT_187499.1) in GRCh38, and LiftOn mapped it in 30 copies on CHM13, near the telomeric regions of chromosomes 14, 15, 21, and 22. Both of these genes overlap with the ribosomal DNA (rDNA) array, which occurs in many copies on the acrocentric chromosomes 13, 14, 15, 21, and 22 6, 9, 58. Given this overlap with rDNA, we suspect that the source genes are not genes at all, and manual curation would be required to clean up the results of the automated lift-over process.

LiftOn is a versatile homology-based annotation mapping tool, designed as a successor to Liftoff, the leading homology-based annotation lift-over tool. LiftOn improves on Liftoff by mapping protein-coding transcripts more accurately, which it achieves through the use of a protein-to-DNA aligner, miniprot, a chaining algorithm to generate protein sequences with high similarity to reference proteins, and post-processing methods to identify and correct open reading frames that might otherwise produce truncated proteins. Our experiments demonstrate that LiftOn outperforms other methods that rely solely on DNA or proteins for mapping annotations from one genome to another. Its use of protein sequence alignment allows it to effectively map annotation between different species, even those as distantly related as mouse and rat.

4. Online Methods

LiftOn is implemented as a Python package that maps gene features from a reference genome to a target genome using DNA- and protein-alignment-based methods, namely Liftoff and miniprot. For inputs, LiftOn requires a reference annotation in either GFF or GTF format, as well as reference and target genomes provided as FASTA files. LiftOn uses gffutils version 0.12 (http://github.com/daler/gffutils) to create an sqlite3 database, and it uses Biopython 59 and Pyfaidx 60 to extract both DNA and protein transcript sequences from the reference genome. Subsequently, LiftOn runs the embedded Liftoff code and runs miniprot through Python's subprocess module, producing separate Liftoff and miniprot annotations. Alternatively, users have the option to run Liftoff and miniprot separately and then input the annotations using the `--liftoff` and `--miniprot` arguments to LiftOn. The two annotations are then used to create gffutils sqlite3 databases.
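
The way an external aligner can be driven through `subprocess` is sketched below. The exact arguments LiftOn passes to miniprot are not given in the text, so this is only an assumption-laden illustration: it uses miniprot's `-t` (threads) and `--gff` (GFF3 output) options with the target genome and the protein FASTA as positional arguments, and the function names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_miniprot_cmd(target_fasta: str, protein_fasta: str,
                       threads: int = 1) -> List[str]:
    """Assemble a miniprot command line that writes GFF3 to standard output."""
    return ["miniprot", "-t", str(threads), "--gff",
            str(target_fasta), str(protein_fasta)]


def run_miniprot(target_fasta: str, protein_fasta: str,
                 out_gff: str, threads: int = 1) -> Path:
    """Run miniprot and capture its GFF3 output in out_gff.

    Requires the miniprot executable on PATH. check=True raises
    CalledProcessError if miniprot exits with a non-zero status.
    """
    cmd = build_miniprot_cmd(target_fasta, protein_fasta, threads)
    with open(out_gff, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)
    return Path(out_gff)


# Hypothetical file names: print the command without executing it.
print(" ".join(build_miniprot_cmd("target.fa", "reference_proteins.faa", threads=4)))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names. A GFF3 file produced this way is the kind of precomputed annotation that could then be supplied through the `--miniprot` argument mentioned above.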

4.1. Special handling of the human reference in annotation lift-over

For all mappings of the human annotation, we excluded all alternative scaffolds and patches from the GRCh38 genome and its annotation. Specifically, we excluded scaffolds ending in “_fix” and “_alt”, because they are duplicates or variants of sequences found on the primary chromosomes. We also excluded rRNAs, which occur in hundreds of identical copies that vary widely among humans and create problems for the alignment programs 9, 58, 61.
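The exclusion rules above can be sketched as a simple line filter over a GFF file. This is only an illustration, not LiftOn's own preprocessing code: the function names `keep_gff_line` and `filter_gff` are hypothetical, sequence names are assumed to carry the "_fix" and "_alt" suffixes in column 1, and rRNAs are assumed to be labeled "rRNA" in column 3. A real filter would also need to drop the parent genes and child exons of each removed rRNA.

```python
def keep_gff_line(line: str) -> bool:
    """Return True if a GFF line survives the human-specific exclusions."""
    if line.startswith("#"):
        return True  # keep header and comment lines
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return True  # not a feature line; leave it untouched
    seqid, feature_type = cols[0], cols[2]
    # Assumption: patches and alternative scaffolds end in "_fix" or "_alt".
    if seqid.endswith(("_fix", "_alt")):
        return False
    # Assumption: rRNA features are typed "rRNA" in the third column.
    if feature_type == "rRNA":
        return False
    return True


def filter_gff(lines):
    """Keep only the lines that pass keep_gff_line, preserving order."""
    return [ln for ln in lines if keep_gff_line(ln)]
```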

4.2. Pairing Liftoff and miniprot protein-coding transcript annotations

Liftoff was designed to map any genomic interval, containing genes, transcripts, and exons, from one genome onto another. Liftoff’s output uses a hierarchy where genes contain one or more transcripts, and transcripts contain exons and/or CDS features. A CDS feature defines the coding sequences within a protein-coding transcript; therefore, some CDS features span the same intervals as exons, while others span only parts of exons when the exons also include UTRs (untranslated regions). In contrast, miniprot was designed to map protein sequences to a genome and can only generate annotations of transcripts containing CDS features.

After running both Liftoff and miniprot, LiftOn goes through the miniprot mappings and finds, for each one, the Liftoff genomic interval to which it corresponds. A miniprot transcript is considered to match a Liftoff transcript if: (1) the loci of the two transcripts overlap, and (2) the locus of the miniprot transcript does not overlap with any other Liftoff gene locus, except those that the matched Liftoff locus also overlaps. Note that the genomic intervals found by Liftoff are mostly distinct, but Liftoff does allow up to 10% overlap between adjacent gene features by default. On the other hand, miniprot lacks the capability to reconcile overlapping gene loci, which sometimes results in a protein-coding transcript from a large gene family mapping to multiple gene loci instead of one, or to both a pseudogene and the correct locus.
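The two matching conditions can be written as a short predicate. The sketch below is an assumption-laden illustration rather than LiftOn's implementation: loci are reduced to inclusive (start, end) pairs on a single chromosome and strand, and the names `overlaps` and `is_match` are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # inclusive (start, end); same chromosome assumed


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_match(miniprot_locus: Interval, liftoff_id: str,
             liftoff_loci: Dict[str, Interval]) -> bool:
    """Apply conditions (1) and (2) to one miniprot/Liftoff pair."""
    matched = liftoff_loci[liftoff_id]
    # Condition (1): the two loci must overlap.
    if not overlaps(miniprot_locus, matched):
        return False
    # Condition (2): no overlap with any other Liftoff locus, unless the
    # matched Liftoff locus overlaps that other locus as well.
    for other_id, other in liftoff_loci.items():
        if other_id == liftoff_id:
            continue
        if overlaps(miniprot_locus, other) and not overlaps(matched, other):
            return False
    return True
```

With loci A = (100, 200), B = (300, 400), and C = (150, 250), a miniprot mapping at (120, 180) matches A even though it touches C, because A itself overlaps C; a mapping at (180, 320) does not match A, because it reaches into B while A does not.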

In most cases, LiftOn is able to find a one-to-one mapping between the miniprot transcripts and those from Liftoff. For instance, mapping RefSeq release 220 annotations from GRCh38.p14 to T2T-CHM13 v2 results in 128,351 of 136,978 protein-coding transcripts (93.7%) being uniquely matched between Liftoff and miniprot (see Note S10 and Figure S17).

For the cases where miniprot identifies multiple copies for a protein-coding transcript, LiftOn checks if there is at least one copy overlapping with a Liftoff gene locus. Miniprot transcript mappings spanning multiple Liftoff loci are removed to prevent erroneous “read through” annotations. Among the remaining options, the transcript with the highest protein sequence identity score is selected. Note that in the GRCh38.p14 to T2T-CHM13 v2 lift-over there were 355 protein-coding transcripts where none of the miniprot-discovered protein-coding transcripts overlapped with a corresponding Liftoff gene locus.
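A minimal sketch of this selection step follows. It simplifies the rule by keeping only the copies that overlap the corresponding Liftoff locus and no other Liftoff locus; the function name `pick_miniprot_copy` and the dictionary keys `locus` and `identity` are hypothetical and are not taken from the LiftOn code.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end); same chromosome assumed


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def pick_miniprot_copy(copies: List[dict], gene_id: str,
                       liftoff_loci: Dict[str, Interval]) -> Optional[dict]:
    """Choose the best miniprot copy for the Liftoff gene `gene_id`."""
    candidates = []
    for copy in copies:
        hit_ids = {gid for gid, locus in liftoff_loci.items()
                   if overlaps(copy["locus"], locus)}
        # Simplification: a copy spanning several Liftoff loci is treated as
        # a possible read-through and dropped; a copy missing the gene's
        # own locus is dropped as well.
        if hit_ids == {gene_id}:
            candidates.append(copy)
    if not candidates:
        return None  # no miniprot copy overlaps the Liftoff locus
    # Among the remaining copies, keep the highest protein identity.
    return max(candidates, key=lambda c: c["identity"])
```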

Once the one-to-one correspondence between two transcripts is established, LiftOn considers both Liftoff and miniprot CDS features of the two transcripts and initiates its protein-maximization algorithm as described in the next Methods section.

4.3. Protein-maximization (PM) algorithm

The PM algorithm consists of two components: the chaining algorithm and the open reading frame (ORF) search algorithm. The pseudocode for the protein_maximization function is outlined in Algorithm S1.

4.3.1. Step 1: the chaining algorithm

The chaining algorithm is a method for collecting and concatenating the optimal segments of CDSs obtained through the DNA-based mapping approach (Liftoff) and the protein-based approach (miniprot) into a consistent CDS chain, with the goal of maximizing the similarity of the mapped protein to the reference protein. In this context, the reference protein refers to the protein extracted from the reference annotation and genome; e.g., RefSeq annotation on GRCh38.

LiftOn uses Biopython 59 to extract and translate mapped protein-coding transcript sequences into their corresponding proteins, and the Parasail 62 Python package to align these protein sequences with the reference proteins. Following the alignment step, LiftOn maps the CDS boundaries onto the pairwise protein alignment results (see Figure 1C). The algorithm initially maps the CDS boundaries onto the protein (get_cds_protein_boundary function in Algorithm S2) and then adjusts these boundaries using the cigar string obtained from the protein alignment if gaps exist in the target protein (adjust_cds_protein_boundary function in Algorithm S2).

Subsequently, the chaining algorithm clusters CDSs from Liftoff and miniprot. It starts by grouping the initial CDSs from both annotations until it encounters a boundary where the aligned amino acid counts in the reference protein up to that boundary match for both Liftoff and miniprot. These CDSs form a comparison block where LiftOn calculates a partial protein sequence identity between the reference protein and those annotated by Liftoff and miniprot, as described in Methods section 4.4. CDS groups with higher identity scores relative to the reference protein are selected for the final LiftOn annotation. After this block is processed, LiftOn identifies a new group of CDSs starting from the previously found boundary and repeats the same selection process. The chaining algorithm's full pseudocode is provided in Algorithm S3.
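The block-wise selection can be illustrated with a simplified two-pointer sketch. It is not a transcription of Algorithm S3: each CDS is assumed to arrive already annotated with the cumulative number of reference amino acids aligned up to its end (`ref_aa_end`) and with its own match and column counts, and the names `Cds`, `block_identity`, and `chain_cds` are hypothetical. Breaking ties in favor of Liftoff is also an arbitrary choice made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cds:
    name: str
    ref_aa_end: int  # reference amino acids aligned up to this CDS's end
    matches: int     # identical residues contributed by this CDS
    columns: int     # alignment columns contributed by this CDS


def block_identity(block: List[Cds]) -> float:
    """Partial identity of a comparison block: matches over columns."""
    cols = sum(c.columns for c in block)
    return sum(c.matches for c in block) / cols if cols else 0.0


def chain_cds(liftoff: List[Cds], miniprot: List[Cds]) -> List[Cds]:
    """Pick, block by block, the CDS group closer to the reference."""
    chained: List[Cds] = []
    i = j = 0
    while i < len(liftoff) and j < len(miniprot):
        l_block, m_block = [liftoff[i]], [miniprot[j]]
        i, j = i + 1, j + 1
        # Grow the lagging side until both blocks end at the same
        # position in the reference protein.
        while l_block[-1].ref_aa_end != m_block[-1].ref_aa_end:
            if l_block[-1].ref_aa_end < m_block[-1].ref_aa_end and i < len(liftoff):
                l_block.append(liftoff[i])
                i += 1
            elif l_block[-1].ref_aa_end > m_block[-1].ref_aa_end and j < len(miniprot):
                m_block.append(miniprot[j])
                j += 1
            else:
                # No shared boundary remains: compare everything left.
                l_block += liftoff[i:]
                m_block += miniprot[j:]
                i, j = len(liftoff), len(miniprot)
                break
        # Ties go to Liftoff here (an assumption of this sketch).
        best = l_block if block_identity(l_block) >= block_identity(m_block) else m_block
        chained.extend(best)
    # CDSs left on one side only have no counterpart; this sketch keeps them.
    chained.extend(liftoff[i:] or miniprot[j:])
    return chained
```

In this toy setting, if Liftoff splits a region into one poorly matching CDS while miniprot covers the same reference residues with two well-matching CDSs, the two miniprot CDSs replace the Liftoff one and the surrounding Liftoff CDSs are retained.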

4.3.2. Step 2: the open reading frame (ORF) search algorithm

Following the chaining algorithm, LiftOn performs an open reading frame search algorithm on the protein-coding regions of the mapped transcripts that have mutations likely to be more deleterious, such as “frameshift”, “stop codon gain”, “stop codon loss”, and “start codon loss” mutations. The objective is to generate the longest valid protein sequences that align with the full-length reference proteins.

For each selected protein-coding transcript, LiftOn’s open reading frame search algorithm iterates through three reading frames (0, 1, 2). In each frame, it identifies potential ORFs by searching for start codons (“ATG”) and locating the corresponding stop codons (“TAA”, “TAG”, “TGA”), and retains the longest one it finds in that frame.
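The per-frame search can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LiftOn's code: the function name `longest_orf_per_frame` is hypothetical, coordinates are 0-based with an exclusive end that includes the stop codon, and candidate ORFs that run off the end of the sequence without a stop codon are ignored.

```python
from typing import Dict, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_per_frame(seq: str) -> Dict[int, Optional[Tuple[int, int]]]:
    """Return the longest ATG-to-stop ORF in each of the three frames."""
    seq = seq.upper()
    best: Dict[int, Optional[Tuple[int, int]]] = {}
    for frame in (0, 1, 2):
        start = None    # position of the first ATG of the current ORF
        longest = None  # longest (start, end) seen so far in this frame
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # exclusive end, stop codon included
                if longest is None or end - start > longest[1] - longest[0]:
                    longest = (start, end)
                start = None  # look for the next ORF in this frame
        best[frame] = longest
    return best
```

Each retained ORF would then be translated and scored against the reference protein, as described in the next paragraph.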

After identifying potential ORFs, LiftOn proceeds to compare the longest ORF in each frame to the reference protein sequence. It calculates and compares the sequence identity score of each candidate and selects the ORF with the highest sequence identity. If the selected ORF's identity exceeds that of the original annotation, LiftOn updates the CDS boundaries of the protein-coding transcript.

4.4. DNA and protein transcript sequence identity score calculation

To compute DNA sequence identity scores, LiftOn first extracts transcript sequences by concatenating all exonic regions in a transcript. Subsequently, it aligns each transcript sequence mapped on the target genome by LiftOn, Liftoff, or miniprot to its respective reference sequence. This alignment is performed using the nw_trace_scan_sat function from the Parasail 62 Python package, configured with a match score of 1, mismatch penalty of −3, gap opening penalty of 2, and gap extension penalty of 2. LiftOn then reports the percent identity between the two aligned sequences, defined as in BLAST 63 as the number of matching bases in the two sequences over the number of alignment columns. The pseudocode of the algorithm described in this paragraph is illustrated in the get_DNA_id_fraction function from Algorithm S4. An example is provided in Figure S18A.
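The BLAST-style percent identity can be computed directly from a finished pairwise alignment. The sketch below assumes the two aligned strings are already available (for example, from a Parasail traceback) with "-" marking gaps; the function name `dna_identity` is hypothetical and the alignment step itself is not reproduced here.

```python
def dna_identity(aligned_ref: str, aligned_target: str) -> float:
    """Matching bases divided by the number of alignment columns."""
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_ref)
    if columns == 0:
        return 0.0
    # A column counts as a match only if both rows hold the same base;
    # gap characters never count as matches.
    matches = sum(1 for r, t in zip(aligned_ref, aligned_target)
                  if r == t and r != "-")
    return matches / columns
```

For the toy alignment of `ACGT-ACGT` against `ACGTTAC-T`, seven of the nine columns match, giving an identity of 7/9.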

To compute protein sequence identity scores, LiftOn first generates the protein sequence for each mapped transcript by translating the sequence obtained from concatenating all CDS regions in the transcript. Then, it aligns each protein sequence with the full-length reference protein by performing a pairwise alignment using the BLOSUM 62 matrix 64, a gap opening penalty of 11 for insertions and deletions (INDELs), and a gap extension penalty of 2. The protein sequence identity score is calculated up to the first encountered stop codon in the target protein. Differing slightly from the BLAST-style metric employed for DNA sequence identity, LiftOn compresses the gaps in the reference alignment to prevent over-penalization of longer proteins in the target genome. These proteins might be mapped as longer due to potential repeat regions within the target genome or because of potential truncations in the proteins of the reference genome. The pseudocode for this process is presented in the get_AA_id_fraction function in Algorithm S4. An example is provided in Figure S18B.
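One possible reading of this gap-compressed score is sketched below. Two points are assumptions of the sketch rather than statements about get_AA_id_fraction: each run of consecutive gap columns in the reference row is counted as a single column in the denominator, and matches are counted only up to the first stop codon ("*") in the target while the denominator still covers the whole alignment. The function name `protein_identity` is hypothetical.

```python
def protein_identity(aligned_ref: str, aligned_target: str) -> float:
    """Identity with runs of reference gaps compressed to one column."""
    matches = 0
    columns = 0
    in_ref_gap = False
    stopped = False  # becomes True after the first stop codon in the target
    for r, t in zip(aligned_ref, aligned_target):
        if r == "-":
            # Assumption: a whole run of reference gaps costs one column.
            if not in_ref_gap:
                columns += 1
            in_ref_gap = True
        else:
            in_ref_gap = False
            columns += 1
            if not stopped and r == t:
                matches += 1
        if t == "*":
            stopped = True  # residues after a stop codon no longer match
    return matches / columns if columns else 0.0
```

Under these assumptions, a three-residue insertion in the target (`MKT---LLA` against `MKTGGGLLA`) scores 6/7 rather than the 6/9 that a column-by-column count would give, while a premature stop (`MKTLLA` against `MK*LLA`) scores 2/6.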

Additionally, the get_partial_id_fraction function in Algorithm S4 describes this process when the sequence identity computation is restricted to evaluating matches between partial proteins, as done in the chaining algorithm, to determine the best matching CDS group to the reference.

4.5. Resolving overlapping gene loci and finding extra copies of protein-coding genes

LiftOn employs both Liftoff and miniprot to identify additional copies of protein-coding genes, prioritizing Liftoff for its capability to also map UTR regions. First, LiftOn uses the embedded Liftoff module to iteratively map new gene copies and verify whether the lifted-over gene loci overlap with existing annotations. If so, it allows overlaps that are also present in the reference genome; if not, it permits a maximum overlap of 10% with other gene loci. The Liftoff module also ensures that each gene copy meets the user-specified minimum DNA sequence identity between the reference and target gene loci. This process continues until all possible copies have been identified.

Then, LiftOn applies the miniprot module to identify any additional copies of protein-coding genes that Liftoff might have missed. LiftOn utilizes the intervaltree package, a self-balancing interval tree implemented in Python (https://github.com/chaimleib/intervaltree), to manage gene loci intervals and enable fast detection of overlaps. Since miniprot maps annotations at the protein-coding transcript (mRNA) level, LiftOn copies the corresponding gene-level feature from the reference annotation as the parent of the mRNA feature identified exclusively by miniprot to ensure consistency within the “gene-mRNA-exon” hierarchy. LiftOn also enforces the constraint that any extra gene copies identified by miniprot should have at most a 10% overlap with other gene loci (computed as the ratio of the mapped miniprot length to the smaller of the two values: the mapped miniprot length or the reference protein length).

Furthermore, since miniprot maps only proteins and does not consider the untranslated regions in gene loci, it is very likely that miniprot maps to a processed pseudogene or only maps a partial gene segment. To remove potential pseudogenes, we implemented stricter filtering rules for gene loci identified exclusively by miniprot. First, if miniprot identifies only one CDS (i.e. with no intervening introns), the gene locus is annotated only if there is also a single CDS in the reference annotation. Second, the ratio of the coding regions in the miniprot annotation to the coding regions in the longest isoform of a reference gene locus must be between 0.9 and 1.5. Note that any extra gene copies identified exclusively by miniprot will not include UTRs.
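The two filtering rules for miniprot-only loci can be expressed as a small predicate. This sketch assumes that the same reference transcript (the longest isoform) is used for both rules, that CDS intervals are inclusive (start, end) pairs, and that the 0.9 and 1.5 bounds are inclusive; the names `coding_length` and `keep_miniprot_only_locus` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def coding_length(cds: List[Interval]) -> int:
    """Total number of coding bases across a list of CDS intervals."""
    return sum(end - start + 1 for start, end in cds)


def keep_miniprot_only_locus(miniprot_cds: List[Interval],
                             ref_cds: List[Interval],
                             min_ratio: float = 0.9,
                             max_ratio: float = 1.5) -> bool:
    """Apply the single-CDS rule and the coding-length ratio rule."""
    # Rule 1: an intronless miniprot hit is kept only if the reference
    # transcript also has a single CDS (guards against processed pseudogenes).
    if len(miniprot_cds) == 1 and len(ref_cds) != 1:
        return False
    # Rule 2: mapped coding length relative to the reference coding length.
    ratio = coding_length(miniprot_cds) / coding_length(ref_cds)
    return min_ratio <= ratio <= max_ratio
```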

4.6. LiftOn arguments used in the study

All analyses in this study were performed using LiftOn with its default parameters along with the additional arguments `-copies`, `-sc 0.95`, and `-polish`. The `-copies` argument prompts Liftoff to search for extra gene copies following the initial lift-over process. A gene copy will be annotated at a specific locus only if there is no overlap with an already annotated feature. Accompanying the `-copies` argument, the `-sc 0.95` parameter relaxes the default Liftoff requirement of 100% sequence identity between mapped exons/CDSs and their reference counterparts to 95%. The `-copies` argument also triggers the annotation of extra copies of gene loci that were found exclusively by miniprot.

The `-polish` argument enables the Liftoff step in LiftOn to realign exons to correct CDSs that may have been altered during the lift-over process, such as the loss of start/stop codons or the introduction of in-frame stop codons, although this step extends the processing time.

To map RefSeq release v220 (Results 2.2), MANE version 1.2 (Note S1, Table S4 and Figure S7), and CHESS 3 (Note S2, Table S5 and Figure S8) human annotations from GRCh38 to CHM13 (Table 1 and Figure 2), the `-chroms <chromosome_mapping.txt>` argument was also employed in order to first perform a mapping step on a chromosome-by-chromosome basis. Once this step is complete, any genes that remain unmapped are remapped to the entire target assembly. This setting resulted in improved human annotation mapping accuracy. For all other experiments, LiftOn was run without the `-chroms` argument, mapping gene sequences to the full genome.

The `-f` argument and its input file `features.txt` indicate all types of parent features to lift over. To ensure LiftOn does not mistakenly identify pseudogenes as extra gene copies, we ran LiftOn with the `features.txt` file specifying both “gene” and “pseudogene” features, to lift them over along with their children.

LiftOn runs miniprot with default parameters plus the `--gff-only` option to exclusively generate a GFF output file.

All experiments were conducted on a 24-core, 48-thread Intel(R) Xeon(R) Gold 6248R Linux computer with 1024 GB memory, using a single thread of execution.

4.7. Gene and transcript feature counting

The counts of protein-coding and non-coding genes and transcripts reported in this study were calculated as follows:

• Gene counting

Gene features were classified as “protein-coding”, “non-coding”, and “others” based on the “gene_biotype” attribute in NCBI's RefSeq or the “gene_type” attribute in EMBL-EBI’s Ensembl/GENCODE and CHESS 3. More specifically, a gene feature was categorized as a protein-coding gene if the feature type was “gene” and its type attribute was “protein_coding”; a gene was categorized as non-coding if its type attribute was either “lncRNA” or “ncRNA”. Other gene features, including “Pseudogene”, “miRNA”, “snoRNA”, “tRNA”, “V_segment”, “snRNA”, “J_segment”, “misc_RNA”, “C_region”, “antisense_RNA”, etc., were categorized as “others”. The full list is provided in Table S1.

• Transcript counting

Transcript features were also categorized into “protein-coding”, “non-coding”, and “others”. The criteria for classification were based on both the type of transcript and the type of its parent gene. A transcript was counted as protein-coding if its feature type was “mRNA”, and its parent gene was a protein-coding gene; a transcript was classified as non-coding if its feature type was either “lncRNA” or “ncRNA” and its parent gene was also a non-coding gene. The remaining transcripts were categorized as “others”.
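The counting rules above can be summarized as two small classification functions. The sketch takes the feature type and the value of the “gene_biotype” or “gene_type” attribute as plain strings; the function names `classify_gene` and `classify_transcript` are hypothetical, and parsing the annotation file is left out.

```python
NON_CODING_TYPES = {"lncRNA", "ncRNA"}


def classify_gene(feature_type: str, biotype: str) -> str:
    """Classify a gene feature by its type attribute."""
    if feature_type == "gene" and biotype == "protein_coding":
        return "protein-coding"
    if biotype in NON_CODING_TYPES:
        return "non-coding"
    return "others"  # pseudogenes, miRNA, tRNA, V_segment, etc.


def classify_transcript(feature_type: str, parent_gene_class: str) -> str:
    """Classify a transcript by its own type and its parent gene's class."""
    if feature_type == "mRNA" and parent_gene_class == "protein-coding":
        return "protein-coding"
    if feature_type in NON_CODING_TYPES and parent_gene_class == "non-coding":
        return "non-coding"
    return "others"
```

Under these rules, an mRNA whose parent gene falls into “others” is itself counted as “others”, since both the transcript type and the parent class must agree.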

4.8. Running LiftOn on a test dataset

We provide sample files for users who want to test LiftOn at our GitHub repository: https://github.com/Kuanhao-Chao/LiftOn/tree/main/test. To execute LiftOn, the user can type a single command that uses the `-g` flag to specify a reference annotation file in GFF format (e.g., `NCBI_RefSeq_no_rRNA.gff`), the `-o` flag to specify the output file (e.g., `lifton.gff3`), plus both the reference genome file (e.g., `GCF_000001405.40_GRCh38.p14_genomic.fna`) and the target genome file (e.g., `chm13v2.0.fa`). For example:

lifton -g NCBI_RefSeq_no_rRNA.gff -o lifton.gff3 -copies chm13v2.0.fa GCF_000001405.40_GRCh38.p14_genomic.fna

4.9. Genomes used in this study

The rationale for selecting the genomes used in our experiments is briefly described below.

Apis mellifera.

At the time of this study, GenBank contained no Apis mellifera assemblies of the same strain as the reference genome, Amel_HAv3.1 (Strain DH4) 40. We considered both ASM1932182v1 39 (strain ligustica) and ASM1384124v2 (strain carnica), and chose ASM1932182v1 because it was more contiguous (42 contigs) than ASM1384124v2 (313 contigs).

Arabidopsis thaliana.

The A. thaliana reference genome, TAIR10.1 43, represents the Columbia (Col-0) strain. Of the other Col-0 assemblies, Col-CC and Col-CEN (ASM2311539v1) 44 were of similar quality, having all chromosomes completely assembled. We chose Col-CEN (ASM2311539v1) because Col-CC was a consensus assembly of 13 independent assemblies, whereas Col-CEN (ASM2311539v1) was derived from a single organism.

Mus musculus.

At the time of this study, GenBank contained no high-quality contiguous assemblies of the same strain (C57BL/6J) as the reference genome, GRCm39 56. The only other assembly of the C57BL/6J strain (ASM377452v2) is quite fragmented, with 36,193 contigs. We chose NOD_SCID 37 (strain NOD/SCID) instead because it has the fewest contigs (281) of all other M. musculus assemblies from known strains.

Oryza sativa.

The reference genome for rice, IRGSP-1.0 42, belongs to the Japonica subgroup and the Nipponbare cultivar. From the three other assemblies that were also from the Nipponbare cultivar, we chose ASM3414082v1 41 as it was the most contiguous, with every chromosome fully assembled into a single scaffold.

Supplementary Material

Supplement 1
media-1.pdf (290.5KB, pdf)
Supplement 2
media-2.pdf (36.4MB, pdf)
Supplement 3
media-3.pdf (279.1KB, pdf)
Supplement 4
media-4.pdf (256.2KB, pdf)

Funding

This research was supported in part by the U.S. National Institutes of Health under grants R01-HG006677 and R35-GM130151. J.M.H. is supported by the U.S. National Institutes of Health training grant T32HG002295.

Footnotes

Data and Code availability

LiftOn is implemented as a Python package. The LiftOn project is freely available on GitHub at github.com/Kuanhao-Chao/LiftOn and on PyPI at https://pypi.org/project/LiftOn/. The LiftOn documentation is available at ccb.jhu.edu/lifton.

References

  • 1.Kovaka S., Ou S., Jenike K.M. & Schatz M.C. Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing. Nature Methods 20, 12–16 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Logsdon G.A., Vollger M.R. & Eichler E.E. Long-read human genome sequencing and its applications. Nature Reviews Genetics 21, 597–614 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Van Dijk E.L., Jaszczyszyn Y., Naquin D. & Thermes C. The third revolution in sequencing technology. Trends in Genetics 34, 666–681 (2018). [DOI] [PubMed] [Google Scholar]
  • 4.Marx V. Method of the year: long-read sequencing. Nature Methods 20, 6–11 (2023). [DOI] [PubMed] [Google Scholar]
  • 5.Shendure J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017). [DOI] [PubMed] [Google Scholar]
  • 6.Nurk S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Zimin A.V. et al. A reference-quality, fully annotated genome from a Puerto Rican individual. Genetics 220, iyab227 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Shumate A. et al. Assembly and annotation of an Ashkenazi human reference genome. Genome biology 21, 1–18 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Chao K.-H., Zimin A.V., Pertea M. & Salzberg S.L. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3: Genes, Genomes, Genetics 13, jkac321 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for 10 000 vertebrate species. Journal of Heredity 100, 659–674 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Benson D.A. et al. GenBank. Nucleic acids research 41, D36–D42 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Leinonen R. et al. The European nucleotide archive. Nucleic acids research 39, D28–D31 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Stein L. Genome annotation: from sequence to biology. Nature reviews genetics 2, 493–503 (2001). [DOI] [PubMed] [Google Scholar]
  • 14.Salzberg S.L., Vol. 20 1–3 (BioMed Central, 2019). [Google Scholar]
  • 15.Danchin A., Ouzounis C., Tokuyasu T. & Zucker J.D. No wisdom in the crowd: genome annotation in the era of big data-current status and future prospects. Microbial Biotechnology 11, 588–605 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.O'Leary N.A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research 44, D733–D745 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Frankish A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic acids research 47, D766–D773 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Yandell M. & Ence D. A beginner's guide to eukaryotic genome annotation. Nature Reviews Genetics 13, 329–342 (2012). [DOI] [PubMed] [Google Scholar]
  • 19.Stanke M., Schöffmann O., Morgenstern B. & Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC bioinformatics 7, 1–11 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Scalzitti N., Jeannin-Girardon A., Collet P., Poch O. & Thompson J.D. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC genomics 21, 1–20 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Burge C. & Karlin S. Prediction of complete gene structures in human genomic DNA. Journal of molecular biology 268, 78–94 (1997). [DOI] [PubMed] [Google Scholar]
  • 22.Stanke M. & Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics-Oxford 19, 215–225 (2003). [DOI] [PubMed] [Google Scholar]
  • 23.Lomsadze A., Ter-Hovhannisyan V., Chernoff Y.O. & Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic acids research 33, 6494–6506 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Salzberg S.L., Pertea M., Delcher A.L., Gardner M.J. & Tettelin H. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31 (1999). [DOI] [PubMed] [Google Scholar]
  • 25.Majoros W.H., Pertea M. & Salzberg S.L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004). [DOI] [PubMed] [Google Scholar]
  • 26.Lonsdale J. et al. The genotype-tissue expression (GTEx) project. Nature genetics 45, 580–585 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Weinstein J.N. et al. The cancer genome atlas pan-cancer analysis project. Nature genetics 45, 1113–1120 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Shumate A. & Salzberg S.L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Lenzi V.B., Moretti G. & Sprugnoli R. in LREC 333–338 (2012). [Google Scholar]
  • 30.Shumate A. & Salzberg S. Liftofftools: a toolkit for comparing gene annotations mapped between genome assemblies. F1000Research 11, 1230 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Morales J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Varabyou A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure. Genome Biology 24, 1–16 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Pruitt K.D., Tatusova T. & Maglott D.R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 35, D61–D65 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Thorvaldsdóttir H., Robinson J.T. & Mesirov J.P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics 14, 178–192 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Robinson J.T. et al. Integrative genomics viewer. Nature biotechnology 29, 24–26 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Rhie A. et al. The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Schmid-Siegert E. et al. Reference mouse strain assemblies for BALB/c Nude and NOD/SCID mouse models. bioRxiv, 2023.2003. 2016.532783 (2023). [Google Scholar]
  • 38.Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002). [DOI] [PubMed] [Google Scholar]
  • 39.Cao L., Zhao X., Chen Y. & Sun C. Chromosome-scale genome assembly of the high royal jelly-producing honeybees. Scientific Data 8, 302 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Wallberg A. et al. A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds. BMC genomics 20, 1–19 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Shang L. et al. A complete assembly of the rice Nipponbare reference genome. Molecular Plant 16, 1232–1236 (2023). [DOI] [PubMed] [Google Scholar]
  • 42.Kawahara Y. et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6, 1–10 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Lamesch P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic acids research 40, D1202–D1210 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Naish M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Baker D.N. & Langmead B. Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2. Genome Research 33, 1218–1227 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Ondov B.D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome biology 17, 1–14 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). [DOI] [PubMed] [Google Scholar]
  • 48.Goodman M. The genomic record of Humankind's evolutionary roots. The American Journal of Human Genetics 64, 31–39 (1999). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Prüfer K. et al. The bonobo genome compared with the chimpanzee and human genomes. Nature 486, 527–531 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Ebersberger I., Metzler D., Schwarz C. & Pääbo S. Genomewide comparison of DNA sequences between humans and chimpanzees. The American Journal of Human Genetics 70, 1490–1497 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007). [DOI] [PubMed] [Google Scholar]
  • 52.Adams M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000). [DOI] [PubMed] [Google Scholar]
  • 53.Hoskins R.A. et al. The Release 6 reference sequence of the Drosophila melanogaster genome. Genome research 25, 445–458 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Rio B., Couturier G., Lemeunier F. & Lachaise D. in Annales de la Société entomologique de France (NS), Vol. 19 235–248 (Taylor & Francis, 1983). [Google Scholar]
  • 55.David J.R., Lemeunier F., Tsacas L. & Yassin A. The historical discovery of the nine species in the Drosophila melanogaster species subgroup. Genetics 177, 1969–1973 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Church D.M. et al. Modernizing reference genome assemblies. PLoS biology 9, e1001091 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Howe K. et al. The genome sequence of the Norway rat, Rattus norvegicus Berkenhout 1769. Wellcome Open Research 6 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Agrawal S. & Ganley A.R. The conservation landscape of the human ribosomal RNA gene repeats. PloS one 13, e0207531 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Cock P.J. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Shirley M.D., Ma Z., Pedersen B.S. & Wheelan S.J. (PeerJ PrePrints, 2015). [Google Scholar]
  • 61.Hori Y., Shimamoto A. & Kobayashi T. The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome research 31, 1971–1982 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Daily J. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC bioinformatics 17, 1–11 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Altschul S.F., Gish W., Miller W., Myers E.W. & Lipman D.J. Basic local alignment search tool. Journal of molecular biology 215, 403–410 (1990). [DOI] [PubMed] [Google Scholar]
  • 64.Henikoff S. & Henikoff J.G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences 89, 10915–10919 (1992). [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints