Abstract
Background
Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative.
Results
We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species.
Conclusions
FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision—ideal for bench researchers with limited experience in handling computational tools.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-021-04120-9.
Keywords: Genomics, Transcriptomics, Eukaryotic gene annotation, Gene prediction, Optimized RNA-Seq alignment, Changepoint detection
Background
Recent advances in sequencing technology enable the construction of chromosomal-level assemblies for even non-model organisms. As of December 2020, genomes of 16,108 eukaryotes, 295,784 prokaryotes, 41,936 viruses, 26,079 plasmids and 17,820 organelles are sequenced and available through GenBank [1], a considerable increase over the 1,500 sequences reported two decades ago (see Additional file 1: Fig. S1). Therefore, to annotate the ever-rising number of genome sequences, annotation software applications need to be fast, accurate, and designed to handle large amounts of expression data to facilitate discovery of novel genes across different conditions [2–5]. Extensive analysis of this available data is the key to achieving exhaustive gene discovery by analyzing samples from multiple tissues and conditions, obviating the need for additional sequencing.
Genome annotation is the process of identifying transcriptionally active regions of the genome and defining gene structures. Decoding the correct structures of genes is essential since several downstream applications rely on accurate annotations: detecting interactions between proteins [6–14], identifying post-translational modifications [15–23], mining effectors [24–28], and determining protein structure [29–32]. Although we have seen a significant improvement in genome sequencing technology, annotation methods continue to underperform [33, 34]. Obtaining accurate gene annotations is challenging, especially in recently sequenced non-model organisms. The presence of sequences exchanged through horizontal gene transfer in such genomes and the existence of fragmented assemblies make it difficult to predict gene structures [35]. Multiple groups working on the same species have different and oftentimes conflicting annotations that are difficult to merge into a common consensus.
The early 2000s saw initial genome annotation attempts with the introduction of PASA [36], which was developed to map full-length transcripts and Expressed Sequence Tags (ESTs) in order to annotate genomes. In parallel, FGENESH [37, 38], GeneGenerator [39], mGene [40] and GeneSeqer [41] were introduced, which predicted gene structures directly from genome sequence. Tools such as MAKER [42–45] and PASA [36] closely depend on pre-assembled full-length transcripts to generate annotations. ESTs and/or de novo assembled transcriptomes have often been provided as inputs to these tools to generate annotations [46–52]. Transcripts constructed via de novo [53–57] or genome-guided [58–63] approaches are sensitive to the nature of the assembler and its parameter settings. Such assemblers report sequences that are highly similar to one another, making the process of sifting the correct assemblies from artefacts difficult. This issue is moderately mitigated by BRAKER2 [64, 65], which uses read splice information instead of full-length assemblies to predict gene structures and has been shown to perform better than de novo approaches [66]. BRAKER2 entails a round of unsupervised ab initio gene prediction using GeneMark-ET [67], followed by a second round in which AUGUSTUS [68] is trained on a subset of the gene models created by GeneMark-ET [64]. All variations of MAKER (MAKER, MAKER2 and MAKER-P) use a combination of AUGUSTUS [68] and SNAP [69] to generate gene predictions. Unlike BRAKER2 or PASA, users need to run MAKER for multiple rounds to improve annotation. With no standard technique to optimize the number of rounds, users often undertake a trial-and-error approach to decide what data is supplied to MAKER in each execution round. These unguided choices can create different annotations based on the same data sets. Thus, current approaches either report incomplete genes or derive annotations that lack alternatively spliced transcripts.
In addition to MAKER, BRAKER, and PASA, another gene annotator, GeMoMa [70], uses protein-coding genes from a reference genome and transfers them to the target genome. Since it does not make optimal use of expression data, it has not been compared with the other gene annotators discussed in this manuscript.
To overcome the drawbacks described above, we developed FINDER, a new, automated annotation pipeline that downloads RNA-Seq data from NCBI SRA [71], conducts genome-guided assembly of short reads, predicts gene structure, and annotates genes. FINDER annotates both untranslated and coding regions of genes, categorizes transcripts based on the tissue/conditions where they are expressed, and outputs a complete set of alternatively spliced transcripts. FINDER analyzes the spatial expression profile of each transcript to redefine its boundaries and/or even create newer transcripts and employs an optimized strategy to locate transcripts housing micro-exons. Finally, gene models predicted by BRAKER2 are incorporated into the annotation along with assemblies generated by PsiCLASS [63]. We show that FINDER outperforms state-of-the-art annotation tools in constructing accurate gene structures, when executed with the same expression data.
Implementation
The detailed workflow of FINDER is outlined in Fig. 1. The pipeline accepts metadata via a comma-separated values (csv) file (see Additional file 2: Table S1). Users can verify the input data using the `verifyInputsToFINDER` utility (see Sect. 1.5.1 of Additional file 9). Both single-end and paired-end data are accepted. The pipeline automatically downloads RNA-Seq data from NCBI SRA, or the samples can be accessed locally. Multiple rounds of alignment are conducted using STAR [72, 73] with short reads, which ensures the capture of tissue-specific splice junctions and ultimately yields the most comprehensive set of alternatively spliced transcripts. FINDER uses PsiCLASS [63] to generate transcripts at the tissue level and to consolidate them into a consensus annotation. It employs change-point detection (CPD) on coverage data to polish intron/exon boundaries where needed. Polished transcripts are then supplied to GeneMarkS-T [74] to predict protein-coding regions. In addition to constructing genes from expression data, FINDER uses BRAKER2 [65] to predict genes de novo. Finally, gene models are assigned scores that reflect the confidence of prediction and evidence across different data sets. Throughout the pipeline run, intermediate temporary data is removed to optimize space usage. Execution logging is implemented through ruffus [75].
Read alignments to the genome
Reads from each sample are aligned to the genome using STAR [73]. FINDER accepts the location of the genomic STAR indices. If indices are not provided, then FINDER will generate them locally. FINDER implements multiple strategies to detect as many correct splice-junctions as possible. Several studies use a multi-step approach where splice junctions are detected in the first pass and then those junctions are used to guide the alignments in future passes [76, 77]. FINDER employs a similar strategy to align reads and obtain the most confident splice junctions in each tissue type and/or condition by conducting mapping in four passes (Please check Sect. 1.3 of Additional file 9 for more details).
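The idea of reusing first-pass junctions to guide later passes can be illustrated with a short sketch. It pools junctions from several STAR `SJ.out.tab` tables and keeps those with enough uniquely mapped read support. The function name `pool_junctions`, the summing of support across samples, and the `min_unique_reads` threshold are assumptions made for illustration; the actual four-pass strategy used by FINDER is described in Additional file 9.

```python
from collections import defaultdict


def pool_junctions(sj_tables, min_unique_reads=3):
    """Pool splice junctions from several STAR SJ.out.tab tables.

    sj_tables: iterable of tables, each an iterable of tab-separated lines.
    SJ.out.tab columns used here: 1 chromosome, 2 intron start, 3 intron end,
    4 strand code (0 undefined, 1 plus, 2 minus), 7 uniquely mapped read count.
    min_unique_reads is a hypothetical cut-off, not a FINDER setting.
    """
    support = defaultdict(int)
    for table in sj_tables:
        for line in table:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip malformed or empty lines
            key = (fields[0], int(fields[1]), int(fields[2]), fields[3])
            support[key] += int(fields[6])  # sum unique-read support across samples
    return sorted(j for j, reads in support.items() if reads >= min_unique_reads)


if __name__ == "__main__":
    sample_a = ["1\t100\t200\t1\t1\t0\t2\t0\t30", "1\t300\t400\t2\t2\t0\t1\t0\t20"]
    sample_b = ["1\t100\t200\t1\t1\t0\t2\t0\t25"]
    print(pool_junctions([sample_a, sample_b]))
```

A pooled list of this kind could then be supplied to a later alignment pass so that reads from every sample can use junctions first discovered in any one of them.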
Annotating transcripts with micro-exons
Certain genes in eukaryotes have micro-exons (i.e., exons with fewer than 50 nucleotides) [78–81] which impart important biological properties both in plants [82–86] and animals [87–91]. FINDER uses OLego [92] to map the reads which were reported unmapped by STAR, because OLego optimizes micro-exon sensitivity by checking intron signatures when no hits of seed sequences (~ 14 nt) are found. It is configured to align reads to exons of minimum length 2, with a minimum and maximum intron size of 20 and 10 K respectively.
Generating exon-exon transcript structure annotation with PsiCLASS
Alignments reported by STAR and OLego are combined and provided as input to PsiCLASS [63]. Unlike traditional assemblers, PsiCLASS accepts alignments from multiple samples at the same time. It generates annotations for each sample and one consolidated gene annotation for all the samples. FINDER runs PsiCLASS with the --bamGroup option enabled, which instructs PsiCLASS to preserve tissue/condition-specific features. PsiCLASS is a fast meta-assembler: it generated output for 350 samples in less than three hours while running on 30 cores and consuming less than 50 GB of memory.
Polishing gene structures to optimize gene discovery
Gene structure annotations reported by PsiCLASS were polished to generate the best assemblies. Annotations generated by assemblers often have three kinds of errors that impact accuracy: (1) presence of redundant transcripts that are proper subsets of other transcripts, (2) multiple transcripts on the same strand merged into one, and (3) transcripts with ill-defined exon boundaries. Most assemblers ignore such cases to boost the speed of operation. Developing solutions to deal with these kinds of errors increases the number of correct structural annotations thereby improving downstream analysis.
FINDER uses different algorithmic and statistical approaches to deal with the above cases. To eliminate redundant transcripts, exon–intron structure of all transcripts is compared with each other to retain only unique transcripts. Even though eukaryotes possess large genomes, certain genes/transcripts are closely packed and are overlapping (Fig. 2). Reads originating from one of those genes often map to nearby overlapping genes making the task of distinctly recognizing the transcripts very challenging.
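The redundancy filter described above can be sketched as follows. Each transcript is reduced to its intron chain, and a transcript is dropped when its chain duplicates, or is a contiguous part of, the chain of another transcript on the same chromosome and strand. Treating "proper subset" as a contiguous sub-chain of introns is an assumption made for this sketch, as are the names `is_contiguous_subchain` and `drop_redundant`; FINDER's own comparison may differ in detail.

```python
def is_contiguous_subchain(short, long):
    """True if the intron chain `short` appears as a contiguous run inside `long`."""
    n, m = len(short), len(long)
    return 0 < n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def drop_redundant(transcripts):
    """Keep only transcripts whose intron chain is not contained in another one.

    transcripts: dict mapping id -> (chrom, strand, intron_chain), where
    intron_chain is a tuple of (start, end) pairs sorted by position.
    Single-exon transcripts (empty chain) are always kept in this sketch.
    """
    kept, seen = {}, set()
    # Visit longer chains first so that shorter ones are tested against them.
    ordered = sorted(transcripts.items(), key=lambda item: -len(item[1][2]))
    for tid, (chrom, strand, introns) in ordered:
        key = (chrom, strand, introns)
        if introns and key in seen:
            continue  # exact duplicate of an already kept transcript
        if any(c == chrom and s == strand and is_contiguous_subchain(introns, other)
               for c, s, other in kept.values()):
            continue  # contained in a longer transcript
        kept[tid] = (chrom, strand, introns)
        seen.add(key)
    return kept


if __name__ == "__main__":
    models = {
        "t1": ("1", "+", ((100, 200), (300, 400), (500, 600))),
        "t2": ("1", "+", ((300, 400), (500, 600))),  # contained in t1
        "t3": ("1", "+", ((100, 200), (350, 400))),  # alternative acceptor site
    }
    print(sorted(drop_redundant(models)))
```

Comparing intron chains rather than exon coordinates keeps transcripts that differ only at a splice site, such as `t3` in the example, while discarding fragments of longer models.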
FINDER is configured to use changepoint detection (CPD) analysis to detect the descent in read coverage at the junction of two overlapping transcripts. Statistical CPD is a procedure to detect changes in the probability distribution of a stochastic process. CPD is most widely used to detect changes in time series [93–97], but can be extended to other applications as well [98, 99]. We have found that even though CPD was developed under the assumption of normality, it can also be used where normality is violated.
In the first step of FINDER’s CPD, short read alignments to the genome are converted into read counts per nucleotide using bedtools [100]. A custom Python script transfers the per-nucleotide coverage data from the genome to the transcriptome reported by PsiCLASS. Each internal exon is considered a potential site for changepoints if premature stop codons exist in all three translation frames. CPD therefore considers only exons that have a high chance of housing a changepoint, which reduces run time. The coverage pattern of each such exon is probed to detect changepoints. The data are modeled using an exponential distribution, and binary segmentation is used to determine the changepoints in the exonic coverage using the ‘changepoints’ package [101]. Read coverage of exons mimics a time series in which each nucleotide position of an exon can be treated as a single unit of time. Coverage patterns of exons suspected to be merged contain a characteristic depression in the signal, which is used to split the gene models (Fig. 2a). Overlapping transcripts on opposite strands sometimes share a common exon (Fig. 2b). This negatively impacts precision since the boundaries of the predicted transcript exceed the boundaries of the transcript in the reference annotation. FINDER trims the transcript boundaries, using the changepoints, to better model the RNA-Seq coverage (Fig. 2b). These strategies improve the annotation by increasing the transcript F1 scores (Table 1).
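To make the procedure concrete, the following is a minimal, self-contained sketch of binary segmentation under an exponential model applied to per-nucleotide exon coverage. It is not the package cited above: the function names, the pseudocount added to avoid zero coverage, the minimum segment length, and the default penalty of 2 log(n) are all assumptions chosen for illustration.

```python
import math


def exp_loglik(total, n):
    """Maximised log-likelihood of n exponential values that sum to `total`."""
    return n * (math.log(n / total) - 1.0)


def best_split(values, min_len):
    """Return (gain, index) of the single split that most improves the fit."""
    n, total = len(values), sum(values)
    whole = exp_loglik(total, n)
    best_gain, best_idx = 0.0, None
    left = sum(values[:min_len - 1])
    for i in range(min_len, n - min_len + 1):
        left += values[i - 1]  # left now equals sum(values[:i])
        gain = exp_loglik(left, i) + exp_loglik(total - left, n - i) - whole
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx


def binary_segmentation(coverage, penalty=None, min_len=5, pseudocount=1.0):
    """Return sorted 0-based positions at which a new coverage segment starts.

    penalty, min_len and pseudocount are hypothetical defaults for this sketch.
    """
    values = [c + pseudocount for c in coverage]  # keep every value positive
    if not values:
        return []
    if penalty is None:
        penalty = 2.0 * math.log(len(values))
    changepoints, stack = [], [(0, len(values))]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2 * min_len:
            continue  # too short to hold two segments
        gain, idx = best_split(values[lo:hi], min_len)
        if idx is None or 2.0 * gain <= penalty:
            continue  # likelihood-ratio statistic does not justify a split
        changepoints.append(lo + idx)
        stack.extend([(lo, lo + idx), (lo + idx, hi)])
    return sorted(changepoints)


if __name__ == "__main__":
    merged_exon = [100] * 40 + [5] * 40  # abrupt descent in coverage
    print(binary_segmentation(merged_exon))
```

Each accepted split is applied recursively to the two resulting segments, which is what distinguishes binary segmentation from a single changepoint test. Because the exponential model has a variance equal to the squared mean, short or shallow dips in coverage yield small likelihood gains, so the penalty largely determines how aggressively exons are split.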
Table 1.
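Metric | Arabidopsis thaliana: BRAKER2 | Arabidopsis thaliana: MAKER2 | Arabidopsis thaliana: PASA | Arabidopsis thaliana: FINDER | Arabidopsis thaliana: FINDER + BRAKER2 | Arabidopsis thaliana: FINDER + BRAKER2 + PROTEIN | Oryza sativa: BRAKER2 | Oryza sativa: MAKER2 | Oryza sativa: PASA | Oryza sativa: FINDER | Oryza sativa: FINDER + BRAKER2 | Oryza sativa: FINDER + BRAKER2 + PROTEIN | Zea mays (NCBI): BRAKER2 | Zea mays (NCBI): MAKER2 | Zea mays (NCBI): PASA | Zea mays (NCBI): FINDER | Zea mays (NCBI): FINDER + BRAKER2 | Zea mays (NCBI): FINDER + BRAKER2 + PROTEIN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|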
Base specificity | 91.08 | 74.87 | 62.71 | 74.46 | 75.01 | 75.04 | 57.63 | 52.67 | 36.77 | 42.46 | 42.6 | 42.75 | 6.52 | 45.22 | 60.39 | 62.97 | 62.96 | 62.92 |
Base sensitivity | 60.27 | 52.55 | 69.41 | 71.45 | 74.14 | 74.23 | 36.85 | 40.17 | 59.19 | 61.43 | 61.83 | 62.37 | 50.43 | 54.38 | 66.7 | 72.1 | 72.15 | 72.16 |
Base F1 score | 72.54 | 61.76 | 65.89 | 72.92 | 74.57 | 74.63 | 44.95 | 45.58 | 45.36 | 50.21 | 50.44 | 50.73 | 11.55 | 49.38 | 63.39 | 67.23 | 67.24 | 67.22 |
Exon specificity | 80.28 | 95.35 | 90.74 | 91.79 | 91.43 | 91.43 | 38.4 | 72.31 | 67.37 | 67.67 | 67.5 | 67.62 | 15.96 | 65.14 | 84.26 | 79.71 | 79.59 | 79.57 |
Exon sensitivity | 73.29 | 55.04 | 67.49 | 69.93 | 71.62 | 71.7 | 53.74 | 50.59 | 63.99 | 64.81 | 65.22 | 66.06 | 65.65 | 61.84 | 71.84 | 74.38 | 74.4 | 74.4 |
Exon F1 score | 76.63 | 69.79 | 77.41 | 79.38 | 80.32 | 80.37 | 44.79 | 59.53 | 65.64 | 66.21 | 66.34 | 66.83 | 25.68 | 63.45 | 77.56 | 76.95 | 76.91 | 76.9 |
Intron specificity | 86 | 98.26 | 95.98 | 96.55 | 96.26 | 96.26 | 52.8 | 76.87 | 72.36 | 73.24 | 73.19 | 73.26 | 21.77 | 74.41 | 88.45 | 84.86 | 84.78 | 84.76 |
Intron sensitivity | 84.25 | 60.64 | 76.11 | 76.51 | 78.33 | 78.42 | 76.78 | 55.77 | 70.44 | 71.03 | 71.52 | 72.34 | 80.88 | 64.99 | 75.19 | 78.12 | 78.15 | 78.15 |
Intron F1 score | 85.12 | 75 | 84.9 | 85.37 | 86.37 | 86.43 | 62.57 | 64.64 | 71.39 | 72.12 | 72.35 | 72.8 | 34.31 | 69.38 | 81.28 | 81.35 | 81.33 | 81.32 |
Transcript specificity | 49.91 | 76.9 | 55.21 | 60.04 | 59.82 | 59.82 | 12.71 | 40.13 | 23.17 | 24.54 | 24.59 | 24.54 | 2.75 | 32.67 | 48.6 | 44.78 | 44.84 | 44.78 |
Transcript sensitivity | 30.26 | 21.74 | 28.62 | 37.21 | 39.21 | 39.28 | 16.35 | 18.01 | 29.51 | 33 | 33.32 | 33.82 | 19.57 | 25.01 | 37.94 | 42.25 | 42.3 | 42.31 |
Transcript F1 score | 37.68 | 33.9 | 37.7 | 45.95 | 47.37 | 47.42 | 14.3 | 24.86 | 25.96 | 28.15 | 28.3 | 28.44 | 4.82 | 28.33 | 42.61 | 43.48 | 43.53 | 43.51 |
Gene specificity | 51.58 | 76.9 | 61.79 | 67.33 | 66.56 | 66.55 | 13.46 | 40.13 | 36.81 | 32.79 | 32.76 | 32.38 | 2.78 | 32.67 | 55.69 | 52.3 | 52.29 | 52.19 |
Gene sensitivity | 50.43 | 37.2 | 44.39 | 57.96 | 61.3 | 61.4 | 18.07 | 19.4 | 31.37 | 35.05 | 35.4 | 35.96 | 29.02 | 38.14 | 51.57 | 57.35 | 57.42 | 57.43 |
Gene F1 score | 51 | 50.14 | 51.66 | 62.29 | 63.82 | 63.87 | 15.43 | 26.16 | 33.87 | 33.88 | 34.03 | 34.08 | 5.07 | 35.19 | 53.55 | 54.71 | 54.74 | 54.68 |
Best performance is indicated by bold values
De novo gene prediction from expression data and proteins from closely related species
Certain genes are expressed only in specific tissues and under specific conditions [102]. However, constructing an exhaustive set of genes expressed across all possible tissues and conditions is a daunting task due to the mammoth volume of potential expression data. Hence, approaches that can predict structures of unknown genes using information obtained from known genes are needed. Within the FINDER framework, we used BRAKER2 [103] to predict the structure of protein-coding genes. BRAKER2 is provided with alignment files generated by STAR and an optional, user-provided protein data file. If this execution fails, a second execution of BRAKER2 is launched without protein information. Genes predicted by BRAKER2 are compared to the genes obtained from expression data. To limit false positives, predictions made by BRAKER2 are considered high confidence only if they are supported by expression-level or protein-level evidence.
In addition to RNA-Seq data, FINDER also uses protein data (when provided) in two ways: (1) to assess the veracity of the transcript models generated by BRAKER2, and (2) to align those proteins not recognized by BRAKER2 or PsiCLASS. Protein-coding genes obtained from expression data and predicted by BRAKER2 are compared using BLAST [104] against the protein set provided by the user. Proteins without any hits are aligned to the genome using exonerate [105] with a minimum threshold of 90% similarity. These alignments are added to the final set of gene predictions. Since these transcripts are obtained solely from proteins, they lack UTR sequences.
Prediction of coding regions
We leveraged GeneMarkS-T [74] to predict protein-coding regions of genes constructed from expression data. GTF files are first converted to FASTA files using the provided genome. Those FASTA files are supplied to GeneMarkS-T as inputs. GeneMarkS-T outputs coding sequence for the transcripts. CDS annotations are incorporated into the final GTF file by converting the transcriptomic coordinates to genomic coordinates.
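The final step, converting transcriptomic coordinates back to genomic coordinates, can be sketched as below. The sketch assumes 1-based inclusive coordinates and exons sorted by genomic position; the function name `transcript_to_genome` and its argument layout are illustrative and do not come from the FINDER code base.

```python
def transcript_to_genome(exons, strand, t_start, t_end):
    """Map a 1-based inclusive transcript interval onto genomic intervals.

    exons: list of (start, end) genomic coordinates, 1-based inclusive,
           sorted by ascending start.
    strand: "+" or "-"; on the minus strand, transcript position 1 is the
            last base of the right-most exon.
    Returns the genomic pieces sorted by ascending start.
    """
    ordered = exons if strand == "+" else exons[::-1]
    pieces, offset = [], 0  # offset = transcript bases consumed so far
    for g_start, g_end in ordered:
        length = g_end - g_start + 1
        lo = max(t_start, offset + 1)    # part of the interval inside this exon
        hi = min(t_end, offset + length)
        if lo <= hi:
            if strand == "+":
                pieces.append((g_start + lo - offset - 1, g_start + hi - offset - 1))
            else:
                pieces.append((g_end - (hi - offset - 1), g_end - (lo - offset - 1)))
        offset += length
    return sorted(pieces)


if __name__ == "__main__":
    exons = [(101, 200), (301, 400), (501, 600)]
    # A CDS spanning transcript positions 51..250 on the plus strand.
    print(transcript_to_genome(exons, "+", 51, 250))
```

Each returned piece corresponds to one CDS record in the output GTF, so a coding region that spans three exons yields three genomic intervals.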
Tissue/condition specific transcripts/gene models
Most eukaryotic genes have multiple isoforms which are derived from alternative transcripts. Expression of different transcripts can occur under different conditions in different tissues at different time points. FINDER compares assembled transcripts from each condition and prints out an association between each transcript and the provided tissue/condition (Additional file 9: Sect. 1.5).
Scoring gene models
FINDER groups genes into multiple categories based on supporting evidence. Genes that are expressed in RNA-Seq datasets, predicted by BRAKER2, and have protein evidence, are put into the high-confidence gene set. BRAKER2-predicted genes with no evidence of expression and/or proteins are treated as low confidence genes. FINDER expects a soft masked genome since it is a BRAKER2 requirement. Genes which are located in the repeat regions are marked as such and moved to the set of low-confidence genes.
Results and discussion
Choice of species for comparison
We tested the performance of FINDER primarily on three well-annotated plant species: Arabidopsis thaliana [106], Oryza sativa [107–109] and Zea mays [110, 111]. The genome assemblies of these model organisms have been frequently updated and are almost complete, with telomere-to-telomere sequences and few gaps or unknown nucleotides. In addition, their gene annotations have undergone regular improvement by mining the large number of RNA-Seq datasets available in the literature. Also, The Arabidopsis Information Resource (TAIR) provides a five-star rating system based on available evidence for each gene. Such a system offers a platform to test the quality of gene annotation software. For further evaluation, and to ensure that FINDER is able to annotate a wider range of genome types, we selected the following additional species to test: Caenorhabditis elegans [112], Drosophila melanogaster [113, 114], Homo sapiens [115, 116], Hordeum vulgare [117], and Triticum aestivum [117–120]. The genomes of these species range from small (C. elegans, D. melanogaster, A. thaliana) through medium (O. sativa) to large (H. sapiens, Z. mays, H. vulgare, and T. aestivum). Finally, we evaluated FINDER on three different versions of Z. mays annotations: RefSeq [121], AGPv3 [111, 122] and AGPv4 [110, 123].
Metrics to assess quality of annotation
We used four metrics to compare the quality of annotations generated by each pipeline: (1) Annotation Edit Distance (AED) [42, 43, 124], (2) sensitivity, (3) specificity, and (4) F1 score. Although these metrics could be computed both at the nucleotide- and exon-level we chose to make comparisons at the transcript level since it encompasses bases, exons, and introns. An AED score of 0 indicates complete agreement of the predicted annotation with the reference, and a score of 1 denotes that the reference has not been identified in the annotation. A transcript is considered to be “recognized” only when all its intron definitions agree with at least one transcript from the predicted set. We used the Mikado “compare” utility to compare the predictions with the reference annotations [125]. A highly sensitive annotation is one that can correctly recognize more reference transcripts. A set of annotations has high specificity when it reports minimal incorrect transcripts. For an annotation to be of good quality, both sensitivity and specificity should be high. A balanced metric is the F1 score which is the harmonic mean of sensitivity and specificity. While AED provides a good numeric assessment of how well the ground truth evidence is represented in an annotation, when individually used, it fails to capture the extent to which false positives are reported. Hence, F1 score complements AED since it incorporates both specificity and sensitivity. For evaluation purposes, we assume that the annotations achieved through community efforts are the ground truth and contain no errors.
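As a small worked illustration of these metrics, the sketch below computes the F1 score as the harmonic mean of sensitivity and specificity, where specificity is used, as in the text above, for the fraction of reported features that are correct. It also computes AED from nucleotide overlap using the commonly cited form AED = 1 - (SN + SP)/2; that this matches the exact computation behind the figures in this paper is an assumption, and both function names are illustrative.

```python
def f1_score(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity (both on the same scale)."""
    if sensitivity + specificity == 0:
        return 0.0
    return 2.0 * sensitivity * specificity / (sensitivity + specificity)


def aed(reference_bases, predicted_bases):
    """Annotation edit distance from two collections of genomic positions.

    Assumed definition: 1 - (SN + SP) / 2, with SN = shared / reference size
    and SP = shared / prediction size, both at the nucleotide level.
    """
    ref, pred = set(reference_bases), set(predicted_bases)
    shared = len(ref & pred)
    if shared == 0:
        return 1.0  # the reference is not represented at all
    return 1.0 - (shared / len(ref) + shared / len(pred)) / 2.0


if __name__ == "__main__":
    # Transcript-level values reported for FINDER on A. thaliana in Table 1.
    print(round(f1_score(sensitivity=37.21, specificity=60.04), 2))
    # A prediction covering half of the reference and extending beyond it.
    print(aed(range(1, 101), range(51, 151)))
```

With the Table 1 values for FINDER on A. thaliana (transcript sensitivity 37.21, specificity 60.04), the harmonic mean reproduces the reported transcript F1 score of 45.95.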
FINDER generates more accurate gene models than BRAKER2, MAKER2 and PASA
FINDER leverages expression data to construct transcript models and employs statistical changepoint detection to enhance their structures (see “Implementation” section). Both MAKER2 and PASA were run with transcript sequences reported by PsiCLASS.
To assess FINDER’s performance, we compared the AED scores of transcript models generated by FINDER with those generated by other commonly used annotation methods. As shown in Fig. 3a, d, g, the violin plots for FINDER are broader at the base, indicating a greater number of transcripts with lower AED scores as compared to BRAKER2, MAKER, and PASA. We compared the FINDER AED scores with the AED scores reported by other pipelines using Wilcoxon’s signed rank test (more details in Additional file 9: Sect. 2.5). For all organisms (Fig. 3, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2), the AED scores reported by FINDER were significantly lower (p < 0.01) than those of any other pipeline. Figure 3c, f, i shows stacked bar plots representing the fraction of transcripts in each category of AED values. In all cases, a higher percentage of transcripts reported by FINDER have lower AED scores (Additional file 1: Figs. S2–S5). This indicates that FINDER is capable of constructing gene structures that better comply with the reference annotations.
High-quality exhaustive annotations predict the fewest false positives thereby boosting the transcript F1 score. The transcript F1 scores of the gene models that were reported by FINDER for A. thaliana, O. sativa and Z. mays were higher than the models generated by BRAKER2, MAKER, and PASA (Fig. 3b, e, h). This same trend is observed for other tested organisms where FINDER was successful in detecting nucleotides, exons, introns, transcripts and genes (Table 1, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2). MAKER2 and BRAKER2 registered a high specificity for most of the organisms because fewer transcripts were reported than FINDER. MAKER2 and BRAKER2 also had lower F1 scores, indicating less sensitivity than FINDER. Additionally, we compared the CDS regions of genes reported by FINDER with those of BRAKER2. For most of the organisms, FINDER generated transcript models with a higher F1 score (Additional file 4: Table S3). These results show that the better performance of FINDER is ensured not only due to the presence of UTRs but also due to enhanced CDS structure of gene models.
Finally, adding BRAKER2 predictions and protein sequences to FINDER enhanced the gene model predictions. About 15% of the gene models reported by BRAKER2, namely those with high sequence similarity to the provided protein sequences, were included in the final annotations (Table 2). As shown in Table 1 and Additional file 5: Table S4, including evidence at the protein level led to the identification of more genes.
Table 2.
Organism | Number of transcript models borrowed from BRAKER | Percentage of transcript models borrowed from BRAKER | Improvement in average annotation score (BRAKER models) | Number of transcript models from protein alignments | Percentage of transcript models from protein alignments | Improvement in average annotation score (protein alignments) |
---|---|---|---|---|---|---|
Arabidopsis thaliana | 1692 | 5 | 1.43 | 185 | 0.01 | 0.05 |
Oryza sativa | 5662 | 10 | 0.15 | 440 | 0.01 | 0.15 |
Zea mays | 1061 | 2 | 0.05 | 452 | 0.01 | − 0.02 |
Caenorhabditis elegans | 4807 | 18 | 0.48 | 389 | 0.01 | 0.58 |
Drosophila melanogaster | 2421 | 9 | 0.44 | 481 | 0.02 | 0.22 |
Homo sapiens | 5776 | 16 | 0.05 | 229 | 0.01 | 0.15 |
Hordeum vulgare | 1065 | 3 | 0.01 | 19 | 0 | − 0.57 |
Unlike BRAKER2, FINDER does not assume a homogeneous nucleotide composition of the genome [103]. FINDER outperforms BRAKER2 while constructing gene models in complex organisms like H. sapiens, H. vulgare, and Z. mays since assemblers generating transcriptomes from alignments do not require a genome to possess homogeneous nucleotide composition.
FINDER in itself is restricted to annotate genes only in regions of the genome that are transcriptionally active. Recognizing that BRAKER2, being a gene predictor, can construct gene models in transcriptionally silent regions of the genome, FINDER is designed to incorporate the gene models predicted by BRAKER2 into the final annotations.
Distinct gene groups are accurately annotated with FINDER
Although eukaryotic genes differ from one another in terms of location, structure and the isoforms they encode, most annotation pipelines annotate and evaluate gene predictions with a global and uniform approach. The problem arises when these variances prompt each pipeline to perform differently on dissimilar groups of genes. To avoid this pitfall, we created groups of genes and transcripts based on various criteria (Table 3) and compared the performance of FINDER with BRAKER2, MAKER, and PASA for each of these sets.
Table 3.
Group | Name | Description
---|---|---|
Group 1 | Uni-exon transcripts | Transcripts having a single exon and no introns |
Group 2 | Transcripts without UTRs | Transcripts missing either the 5′ or the 3′ UTR sequence |
Group 3 | Transcripts with UTRs | Transcripts having both UTRs |
Group 4 | Transcripts with micro-exons | Transcripts where at least one exon has length less than 50 nucleotides |
Group 5 | Transcripts with long introns | Transcripts where at least one intron has a length greater than 10,000 bp |
Group 6 | Closely placed transcripts on same strand | Transcripts on the same strand having less than 250 nucleotides between each other |
Group 7 | Closely placed transcripts on opposite strand | Transcripts on the opposite strands having less than 250 nucleotides between each other |
Group 8 | Multi transcript gene | Transcripts of a gene that have multiple transcripts |
Group 9 | Single transcript gene | Transcripts of a gene that have single transcript |
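Three of the groups in Table 3 depend only on the exon coordinates of a single transcript, so their membership rules can be written down directly. The sketch below assigns Groups 1, 4 and 5 using the thresholds stated in the table; the function name `structural_groups` and the 1-based inclusive coordinate convention are assumptions, and the remaining groups are omitted because they require UTR annotations or neighbouring transcripts.

```python
def structural_groups(exons, micro_exon_max=50, long_intron_min=10_000):
    """Assign a transcript to Groups 1, 4 and 5 of Table 3 from its exons.

    exons: list of (start, end) genomic coordinates, 1-based inclusive.
    Returns a set of group labels; a transcript may belong to several groups.
    """
    exons = sorted(exons)
    # Introns are the gaps between consecutive exons.
    introns = [(left_end + 1, right_start - 1)
               for (_, left_end), (right_start, _) in zip(exons, exons[1:])]
    groups = set()
    if len(exons) == 1:
        groups.add("Group 1")  # uni-exon transcript
    if any(end - start + 1 < micro_exon_max for start, end in exons):
        groups.add("Group 4")  # at least one exon shorter than 50 nt
    if any(end - start + 1 > long_intron_min for start, end in introns):
        groups.add("Group 5")  # at least one intron longer than 10,000 bp
    return groups


if __name__ == "__main__":
    print(structural_groups([(100, 300), (400, 430), (20000, 20500)]))
```

Because the groups are not mutually exclusive, returning a set lets the same transcript contribute to the evaluation of every category it satisfies.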
On the set of UTR-containing transcripts, FINDER reported the best transcript F1 scores (Fig. 4, Additional file 1: Figs. S6, S7). Unlike BRAKER2, FINDER uses GeneMarkS-T to predict CDS from the transcript sequences assembled by PsiCLASS and can hence annotate UTR regions. For most of the organisms, BRAKER2 and MAKER2 gene models register a low transcript F1 score in this category of genes. Next, we tested the performance of the annotation pipelines on transcripts that are closely located in the genome. On this set of transcripts, FINDER reported the best transcript F1 score for A. thaliana, O. sativa, and Z. mays (Fig. 4), and scores comparable with those of BRAKER2 for D. melanogaster (Additional file 1: Fig. S6), H. vulgare (Additional file 1: Fig. S8), and C. elegans (Additional file 1: Fig. S7). Most eukaryotic genes have multiple isoforms which differ from one another by their exon–intron definition. Splice sites and coverage information provide clues to construct such alternatively spliced transcripts. We selected genes with more than one transcript to check how well each annotation pipeline was able to detect transcript isoforms. For this case, FINDER was able to generate the best transcript structures, with the highest transcript F1 score among all the gene annotation pipelines (Fig. 4 and Additional file 1: Figs. S6–S9). Surprisingly, BRAKER2 fared poorly in this category despite training with all the detected splice sites from RNA-Seq data. This demonstrates that FINDER is capable of leveraging both intron splice sites and read coverage to report the best transcript structures. For H. sapiens, PASA was able to generate the best transcript structures across all categories of transcripts. Adding transcripts from BRAKER2 and protein evidence improved the transcript F1 score for all the organisms, signifying the importance of incorporating de novo gene models and protein evidence.
BRAKER2 generated the best transcript annotation for the set of transcripts with a single exon (Fig. 4a, b and Additional file 1: Figs. S6–S9). Such transcripts, devoid of any introns, are difficult to construct from RNA-Seq alone. In addition, the orientation of the splice sites indicates the direction of a transcript. Without any introns, a single-exon transcript has to be probed for the presence of a CDS to infer directionality. BRAKER2 was configured to optimally predict only the CDS regions of genes; hence, it performs well with the set of transcripts that have missing UTRs for organisms with small and moderate-sized genomes (Fig. 4a, b and Additional file 1: Figs. S6–S9). The average number of transcripts per gene reported by BRAKER2 is lower than that reported by FINDER. While this boosts specificity, it compromises recall since BRAKER2 is not sensitive to alternatively spliced transcripts. Hence, BRAKER2 achieves the best F1 score when tested on a set of single-transcript genes but performs poorly on a set of multi-transcript genes (Fig. 4a, b and Additional file 1: Figs. S6–S9).
Performance comparison on TAIR’s 5-star System
In order to assess the performance of the annotation pipelines on groups of genes constructed from varying levels of evidence, we used the TAIR10 5-star system. TAIR assigns a quality score to each A. thaliana transcript based on the evidence used to construct the models, with five stars designating the best evidence and zero stars the least [126]. The three categories with limited evidence (< 3 stars) have fewer than 3,000 transcripts each. BRAKER2's performance on the genes in these three categories was slightly better than that of the other annotation pipelines (Fig. 5). The other two categories (five star and four star) have 9,067 and 18,374 transcripts respectively. In both of these categories, FINDER was able to detect more transcripts than any other annotation pipeline. Of the genes in the 5-star and 4-star categories, 51.5% and 86.4% respectively were multi-exonic. In both these categories, FINDER correctly constructed more gene models than any other annotation pipeline (Fig. 5). FINDER reported 80% of the gene models belonging to the 4-star category, 18% more than BRAKER2 (Fig. 5). Hence, it is evident from this analysis that FINDER can reconstruct the structures of most of the genes that are well-supported by underlying evidence.
Improving transcript annotations using changepoint analysis
The co-location of multiple overlapping genes on the genome strands makes it difficult to correctly annotate their structures (see “Polishing gene structures to optimize gene discovery” section). FINDER employs changepoint detection (CPD) [101] to split the merged transcripts reported by PsiCLASS (Fig. 2). To gauge the magnitude of improvement in transcript structures brought about by CPD, we compared the accuracy of the predicted transcriptome before and after applying CPD based on read coverage. As shown in Table 4 and Additional file 6: Table S5, applying CPD improved both specificity and sensitivity in organisms with small or medium-sized genomes. In A. thaliana, the transcript F1 score increased from 40.78 to 45.95 (Table 4 and Additional file 6: Table S5), and in C. elegans it increased from 40 to 50. In large genomes, the improvement was less pronounced, mainly because few genes overlap with one another.
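The idea of splitting a merged transcript where read coverage shifts can be illustrated with a minimal sketch. It is not FINDER's implementation: it assumes a single mean-shift changepoint scored with a least-squares cost, the function name `best_changepoint` and the `min_segment` parameter are hypothetical, and the coverage values are invented.

```python
from typing import Optional, Sequence, Tuple


def best_changepoint(
    coverage: Sequence[float], min_segment: int = 3
) -> Optional[Tuple[int, float]]:
    """Return (index, gain) for the best single split of a coverage vector.

    The split index i partitions the vector into coverage[:i] and coverage[i:].
    The cost of a segment is its sum of squared deviations from the segment
    mean; gain is the cost of the unsplit vector minus the cost of the split.
    """
    n = len(coverage)
    if n < 2 * min_segment:
        return None  # too short to hold two segments of the minimum length

    def sse(total: float, total_sq: float, count: int) -> float:
        # Sum of squared errors around the mean, from running sums.
        return total_sq - total * total / count

    total = float(sum(coverage))
    total_sq = float(sum(x * x for x in coverage))
    whole_cost = sse(total, total_sq, n)

    best: Optional[Tuple[int, float]] = None
    left_sum = left_sq = 0.0
    for i in range(1, n):
        x = coverage[i - 1]
        left_sum += x
        left_sq += x * x
        if i < min_segment or n - i < min_segment:
            continue
        split_cost = sse(left_sum, left_sq, i) + sse(
            total - left_sum, total_sq - left_sq, n - i
        )
        gain = whole_cost - split_cost
        if best is None or gain > best[1]:
            best = (i, gain)
    return best


# Invented per-base coverage: a highly expressed gene next to a weakly
# expressed neighbour that an assembler might report as one transcript.
coverage = [50, 52, 49, 51, 50, 5, 6, 4, 5, 6]
print(best_changepoint(coverage))
```

In practice a split would be accepted only when the gain exceeds a threshold or a penalty term, since the function above always returns the best candidate even for flat coverage.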
Table 4.
Metric | Arabidopsis thaliana | | | | | Oryza sativa | | | | | Zea mays (RefSeq) | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tool | STRAWBERRY | SCALLOP | STRINGTIE | PSICLASS | FINDER | STRAWBERRY | SCALLOP | STRINGTIE | PSICLASS | FINDER | STRAWBERRY | SCALLOP | STRINGTIE | PSICLASS | FINDER |
Base specificity | 38.41 | 37.34 | 58.35 | 62.63 | 74.46 | 22.6 | 24.23 | 39.83 | 36.64 | 42.46 | 30.06 | 29.6 | 49.33 | 55.61 | 62.97 |
Base sensitivity | 87.06 | 85.3 | 80.22 | 70.83 | 71.45 | 78.2 | 77.64 | 70.87 | 60.27 | 61.43 | 81.2 | 79.08 | 76.98 | 70.52 | 72.1 |
Base F1 score | 53.3 | 51.94 | 67.56 | 66.48 | 72.92 | 35.07 | 36.93 | 51 | 45.57 | 50.21 | 43.88 | 43.08 | 60.13 | 62.18 | 67.23 |
Exon specificity | 43.86 | 70.64 | 74.82 | 89.82 | 91.79 | 23.51 | 42.97 | 51.7 | 66.29 | 67.67 | 37.18 | 52.37 | 60.33 | 77.76 | 79.71 |
Exon sensitivity | 85.3 | 79.67 | 79.29 | 69.54 | 69.93 | 79.08 | 76.65 | 75.47 | 65.75 | 64.81 | 85.03 | 81.68 | 81.77 | 75.88 | 74.38 |
Exon F1 score | 57.93 | 74.88 | 76.99 | 78.39 | 79.38 | 36.24 | 55.07 | 61.36 | 66.02 | 66.21 | 51.74 | 63.82 | 69.43 | 76.81 | 76.95 |
Intron specificity | 55.32 | 78.7 | 80.58 | 95.29 | 96.55 | 29.13 | 48.56 | 56.5 | 71.41 | 73.24 | 43.79 | 56.74 | 64.79 | 82.8 | 84.86 |
Intron sensitivity | 92.06 | 89.99 | 87.75 | 77.63 | 76.51 | 85.84 | 85.41 | 83.28 | 71.72 | 71.03 | 90.19 | 86.69 | 86.05 | 78.99 | 78.12 |
Intron F1 score | 69.11 | 83.97 | 84.01 | 85.56 | 85.37 | 43.5 | 61.92 | 67.32 | 71.56 | 72.12 | 58.96 | 68.59 | 73.92 | 80.85 | 81.35 |
Transcript specificity | 6.88 | 24.84 | 35.02 | 56.82 | 60.04 | 1.59 | 9.03 | 14.26 | 24.22 | 24.54 | 6.96 | 17.8 | 26.23 | 43.96 | 44.78 |
Transcript sensitivity | 31.68 | 32.19 | 35.23 | 31.8 | 37.21 | 26.69 | 29.43 | 31.2 | 32.37 | 33 | 48.71 | 46.71 | 47.76 | 42.48 | 42.25 |
Transcript F1 score | 11.3 | 28.04 | 35.12 | 40.78 | 45.95 | 3 | 13.82 | 19.57 | 27.71 | 28.15 | 12.18 | 25.78 | 33.86 | 43.21 | 43.48 |
Gene specificity | 33.16 | 36.51 | 63.7 | 65.67 | 67.33 | 17.61 | 20.16 | 35.69 | 40.39 | 32.79 | 21.83 | 26.33 | 47.43 | 54.04 | 52.3 |
Gene sensitivity | 44.77 | 46.18 | 50.59 | 49.33 | 57.96 | 28.91 | 31.79 | 33.65 | 34.54 | 35.05 | 55.2 | 55.85 | 58.03 | 57.22 | 57.35 |
Gene F1 score | 38.1 | 40.78 | 56.39 | 56.34 | 62.29 | 21.89 | 24.67 | 34.64 | 37.24 | 33.88 | 31.29 | 35.79 | 52.2 | 55.58 | 54.71 |
Best performance is indicated by bold values
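The F1 scores in Table 4 can be reproduced from the specificity and sensitivity rows. The sketch below assumes that specificity is used in the sense of precision, as is common in transcript-assembly benchmarking, and that F1 is the harmonic mean of the two; the function name `f1_score` is illustrative.

```python
def f1_score(specificity: float, sensitivity: float) -> float:
    """Harmonic mean of specificity (precision) and sensitivity (recall).

    Both inputs and the result are percentages.
    """
    if specificity + sensitivity == 0:
        return 0.0
    return 2 * specificity * sensitivity / (specificity + sensitivity)


# Values taken from Table 4, A. thaliana, FINDER column.
print(round(f1_score(60.04, 37.21), 2))  # transcript level
print(round(f1_score(74.46, 71.45), 2))  # base level
```

Under these assumptions the two calls give 45.95 and 72.92, matching the transcript and base F1 scores reported for FINDER on A. thaliana.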
PsiCLASS meta-assembly works better than other approaches
We explored three widely used software applications for merging transcriptome assemblies, StringTie-merge [77, 127–133], TACO [134–139] and Cuffmerge [140–145], to combine 116 A. thaliana assemblies constructed by StringTie [59], Scallop [61] and Strawberry [60] (see Sect. 3 of Additional file 9 for more details). The best assembly was reported by StringTie-merge, which was therefore used for all other organisms. We compared the accuracy of the consensus transcript models generated by StringTie-merge with the transcript models reported by PsiCLASS [63]. As shown in Table 4 and Additional file 6: Table S5, PsiCLASS generated the best transcript models for all organisms, registering the highest transcript F1 score and improving upon the StringTie models by up to 15%. Hence, FINDER uses only PsiCLASS to generate assemblies from short-read data.
Impact of missing untranslated region on annotation of transcripts
Gene transcription is triggered by the binding of a transcription factor to the promoter region of a gene. Promoters are typically located within 1,000 bp upstream of a gene’s transcription start site (TSS) [146–148]. Determining the TSS from sequencing data is best facilitated by RAMPAGE [149, 150] or CAGE-Seq [151], but such data are usually unavailable because of cost and time constraints. Nevertheless, a good estimate can be obtained from RNA-Seq data by taking the start coordinates of the assembled genes as the TSS. Thus, researchers often localize their investigation to a section 500–1000 bp upstream of the assumed TSS [152, 153]. Without 5′ UTR annotation, a good approximation of the TSS cannot be deduced, so promoter mining is conducted at an incorrect genome location. To assess the quality of 5′ UTR annotation, we plotted the difference in TSS position between the reference genes and the genes reported by BRAKER2 and FINDER as a violin plot (Fig. 6). Further, we applied Wilcoxon’s rank-sum test and found that the TSS distances reported by FINDER were significantly smaller than those of BRAKER2 for A. thaliana and Z. mays. This result illustrates that more FINDER transcripts have a TSS closer to the reference TSS than do the transcripts reported by BRAKER2, which is expected since BRAKER2 was configured to annotate only the CDS regions of transcripts. Interestingly, for O. sativa, BRAKER2 generated better gene models for more transcripts. Over 25% of reference gene models in O. sativa have no UTRs annotated, compared with 15% in A. thaliana and Z. mays. Table 5 highlights the number of transcripts that have better agreement with the reference TSS for FINDER and BRAKER2.
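The effect of a missing 5′ UTR on promoter mining can be made concrete with a small strand-aware sketch. The function names, the 1-based inclusive coordinate convention, the default window length of 1,000 bp and the example coordinates are all assumptions made for illustration.

```python
from typing import Tuple


def upstream_region(tss: int, strand: str, length: int = 1000) -> Tuple[int, int]:
    """Return the 1-based, inclusive (start, end) of the region upstream of a TSS.

    On the forward strand, upstream means smaller coordinates; on the reverse
    strand, upstream means larger coordinates. The start is clamped at 1.
    """
    if strand == "+":
        return max(1, tss - length), tss - 1
    if strand == "-":
        return tss + 1, tss + length
    raise ValueError("strand must be '+' or '-'")


def overlap(a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


# Invented example: the true TSS is at 10,000 on the forward strand, but a
# CDS-only gene model starts 300 bp downstream, at the start codon.
true_window = upstream_region(10_000, "+")
cds_only_window = upstream_region(10_300, "+")
print(true_window, cds_only_window, overlap(true_window, cds_only_window))
```

In this example the window derived from the CDS-only model shares only 700 of its 1,000 bases with the window anchored at the true TSS, and the shortfall grows with the length of the unannotated 5′ UTR.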
Table 5.
Species | Number of FINDER1 transcripts having TSS better than BRAKER2 | Number of BRAKER2 transcripts having TSS better than FINDER1 |
---|---|---|
Arabidopsis thaliana | 15,063 (65%) | 8022 (35%) |
Oryza sativa | 11,089 (66%) | 5762 (34%) |
Zea mays (NCBI) | 20,721 (76%) | 6628 (24%) |
Zea mays (AGPv3) | 7618 (28%) | 19,731 (72%) |
Zea mays (AGPv4) | 18,114 (69%) | 8297 (31%) |
Caenorhabditis elegans | 8681 (33%) | 17,730 (67%) |
Drosophila melanogaster | 10,238 (63%) | 5917 (37%) |
Homo sapiens | 10,158 (74%) | 3486 (26%) |
Hordeum vulgare | 10,373 (65%) | 5607 (35%) |
Best performance is indicated by bold values
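A tally like the one in Table 5 could be produced by comparing, for each reference transcript, the absolute distance of each predicted TSS from the reference TSS. The sketch below is an assumption about the procedure, not the authors' code: the record layout, the function names and the handling of ties are hypothetical, and the toy coordinates are invented.

```python
from typing import Dict, Iterable, Tuple


def tally_closer_tss(records: Iterable[Tuple[int, int, int]]) -> Dict[str, int]:
    """Count which tool places the TSS nearer to the reference.

    Each record is (reference_tss, finder_tss, braker2_tss).
    """
    counts = {"FINDER": 0, "BRAKER2": 0, "tie": 0}
    for reference_tss, finder_tss, braker2_tss in records:
        finder_gap = abs(finder_tss - reference_tss)
        braker2_gap = abs(braker2_tss - reference_tss)
        if finder_gap < braker2_gap:
            counts["FINDER"] += 1
        elif braker2_gap < finder_gap:
            counts["BRAKER2"] += 1
        else:
            counts["tie"] += 1
    return counts


def percent_share(count: int, other: int) -> int:
    """Share of `count` in `count + other`, rounded to a whole percent."""
    return round(100 * count / (count + other))


toy_records = [(1000, 990, 1120), (5000, 5200, 5010), (8000, 8000, 8000)]
print(tally_closer_tss(toy_records))

# Percentages for the A. thaliana row of Table 5.
print(percent_share(15063, 8022), percent_share(8022, 15063))
```

Applied to the A. thaliana counts, `percent_share` gives 65 and 35, the percentages shown in the first row of Table 5.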
Enhancing ground truth annotations by extending untranslated regions
Official annotations of several model organisms, used as ground truth for this study, contain transcripts with missing UTR sequences. Even though UTRs do not code for proteins, they are relevant segments of a transcript involved in several important biological processes, such as mRNA translation [154–156] and regulation of expression [157–161], and in a number of diseases [162–166]. In the A. thaliana TAIR10 annotations, there are 7,888 transcripts missing either UTR; 50% of these had a rating below 2 stars.
PacBio (Menlo Park, CA) offers long-read sequencing that yields reads containing both CDS and UTRs. Therefore, we used the PacBio annotations instead of the incomplete TAIR10 transcripts to assess FINDER’s performance on transcripts that were missing UTRs (see Sect. 2.6 in Additional file 9 for more details). Out of the 7,888 TAIR10 transcripts with missing UTRs, 113 transcripts were found both in the PacBio data and in the 116 short-read RNA-Seq samples. We compared the FINDER annotations against these 113 transcripts. FINDER annotations recalled 91.55% of the nucleotides in the 113 TAIR10 transcripts and 97.86% of the nucleotides in the corresponding PacBio transcripts. The specificity of the FINDER annotations is markedly higher with PacBio transcripts (79.67%) than with TAIR10 transcripts (72.14%). This demonstrates that FINDER improves upon the existing annotation.
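Nucleotide-level recall and specificity of the kind quoted above can be computed from exon coordinates. The following sketch assumes 1-based inclusive intervals on a single sequence and strand, and again treats specificity as precision; the function names and coordinates are invented for illustration, and a set of positions is used only because the example is small.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def covered_bases(intervals: Iterable[Interval]) -> Set[int]:
    """Collect every position covered by at least one interval."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def base_level_metrics(
    predicted: Iterable[Interval], reference: Iterable[Interval]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent at the nucleotide level."""
    pred = covered_bases(predicted)
    ref = covered_bases(reference)
    shared = len(pred & ref)
    sensitivity = 100 * shared / len(ref) if ref else 0.0
    specificity = 100 * shared / len(pred) if pred else 0.0
    return sensitivity, specificity


# Invented exons: the prediction extends the first exon by 50 bp (a putative
# 5' UTR absent from the reference) and truncates the last exon by 10 bp.
reference = [(101, 200), (301, 400)]
predicted = [(51, 200), (301, 390)]
sensitivity, specificity = base_level_metrics(predicted, reference)
print(sensitivity, round(specificity, 2))
```

The example also shows why a prediction that adds genuine UTR sequence is penalized in specificity when the reference lacks UTRs, which is consistent with the higher specificity observed against the PacBio transcripts. For whole genomes, sorted interval arithmetic would replace the per-base sets.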
The TRITEX H. vulgare annotation (Morex version r2) [117], released by the International Barley Sequencing Consortium (IBSC), is devoid of UTRs. We used FINDER to update and enrich the existing annotations by flanking the CDS region with UTRs on both sides. To verify the accuracy of the gene models reported by FINDER, we used PacBio full-length mRNA sequences derived from a time course of powdery mildew-infected barley leaf tissue [167, 168]. A total of 7,352 gene models from IBSC, FINDER, and PacBio had a complete intron-chain match with each other. The gene structures for more than 93% (6,886 out of 7,352) of the FINDER models were improved when compared against the PacBio full-length sequences (Additional file 7: Table S6). The highest F1 score achieved was 87.16. This shows that FINDER is capable of constructing accurate gene structures comprising both CDS and UTRs.
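A complete intron-chain match means that two models share every intron, while their outer exon boundaries may differ. A minimal sketch, assuming 1-based inclusive exon coordinates on the same strand and treating single-exon models as non-matching, is given below; the function names and coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end)


def intron_chain(exons: Sequence[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by consecutive exons, in genomic order."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def same_intron_chain(a: Sequence[Exon], b: Sequence[Exon]) -> bool:
    """True when both models have at least one intron and identical introns."""
    chain_a, chain_b = intron_chain(a), intron_chain(b)
    return bool(chain_a) and chain_a == chain_b


# Invented models: a CDS-only model and the same gene extended with UTRs.
cds_only = [(150, 200), (300, 400), (500, 550)]
with_utrs = [(100, 200), (300, 400), (500, 620)]
print(same_intron_chain(cds_only, with_utrs))
```

The two invented models match on their intron chain even though the second starts 50 bp earlier and ends 70 bp later, which is the situation in which a CDS-only annotation and a UTR-extended model can be compared against full-length sequences.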
Evaluating performance with different annotations of Zea mays
Z. mays is an important model organism for crops and has been one of the most studied plants for genetics by researchers in several different fields [169–172]. Genes have been annotated in multiple ways using different kinds of data, resulting in substantial differences in gene structures [122]. Here we evaluate FINDER against three alternative annotation sets of Z. mays (RefSeq, AGPv3, and AGPv4), and FINDER outperformed the other pipelines on all three. The transcript F1 score for FINDER gene models compared against the NCBI gene models was 43.48, whereas the F1 scores for AGPv3 and AGPv4 were 26.69 and 22.51, respectively. We observed the same trend for the other annotation pipelines, with higher transcript F1 scores against the NCBI annotation than against the AGP annotations (Table 1 and Additional file 3: Table S2). Hence, FINDER generated high-quality gene structures with high transcript F1 scores for different Z. mays annotations.
Evaluating FINDER on different clades reported by Phylostratr
Genes in each organism can be categorized by their evolutionary history [173, 174]. We used Phylostratr [175] to classify genes into evolutionary strata. Here we present our results on three model organisms: A. thaliana, O. sativa, and Z. mays. For all three, FINDER accurately detected more genes in highly populated strata (Fig. 7). The performance of FINDER and PASA was comparable in strata with few genes. Surprisingly, BRAKER2 was unable to identify highly conserved genes (those from the “cellular organisms” stratum), even though these should be easier to predict than organism-specific genes. This demonstrates that FINDER is capable of effectively constructing genes from different evolutionary backgrounds.
FINDER constructs gene models for polyploid genomes
As a general-purpose genome annotator, FINDER can annotate the genomes of polyploid as well as diploid organisms. We generated gene structures of Triticum aestivum, a hexaploid with 120,744 annotated genes and 146,597 transcripts [117]. FINDER was able to detect 48,129 transcripts (39.9%). Out of the 130,582 transcripts predicted by FINDER, 48,104 (36.83%) matched perfectly with at least one reference annotation.
Conclusion
Identifying genes on chromosomes and deducing their structures from a plethora of evidence has been undertaken in multiple ways, with each method having advantages and disadvantages. Herein, we propose FINDER, an entirely automated, general-purpose pipeline to annotate genes in eukaryotic genomes. FINDER (1) implements an optimized mapping strategy that reduces the number of spurious mappings, (2) produces complete full-length transcripts comprising UTRs while identifying transcripts with micro-exons, (3) employs statistical CPD to modify gene boundaries and construct new genes, (4) reports more alternatively spliced transcripts than other state-of-the-art annotation pipelines, and (5) assigns a confidence class to each transcript based on the evidence used to construct it.
While FINDER’s performance has been superior to that of other gene annotation software, all the gene models reported by FINDER are predictions. Hence, validation is necessary to ensure that false positives are detected and removed. Also, future versions of FINDER will offer functionality to leverage data from CAGE-Seq and Ribo-Seq to better annotate transcription start sites and translation start sites, respectively.
With a wide variety of available data for annotation, researchers often struggle to manage and optimize their use. Several gene annotation software packages also offer users complicated configurations without providing substantial guidance. FINDER makes the job of gene annotation easy for bench scientists by automating the entire process from RNA-Seq data processing to gene prediction. Since FINDER does not assume the ploidy or the nucleotide composition of a genome, it can be applied to derive gene structures for a wide range of species, including non-model organisms. FINDER constructs gene models primarily from RNA-Seq data and is therefore capable of constructing tissue- and/or condition-specific isoforms, which would have been impossible to obtain from ESTs alone. FINDER surpasses the performance of existing software applications by utilizing read coverage information to fine-tune gene model boundaries. Instead of removing low-quality transcripts, FINDER flags them as low confidence, giving users the choice of using them as they see fit. As a proof of concept, we provided evidence that using the read coverage signal enhances gene structures in a diverse set of organisms. Thus, we are confident that FINDER will pave the way for improved gene structure annotation in the future.
Availability and requirements
Project name: FINDER.
Project home page: https://github.com/sagnikbanerjee15/Finder.
Operating system(s): Linux, MacOS.
Programming language: Python, C, C++, Perl, Shell.
License: MIT.
Other software requirements: All software requirements are listed in https://github.com/sagnikbanerjee15/Finder/blob/master/environment.yml.
Any restrictions to use by non-academics: MIT licensing restrictions apply.
Supplementary Information
Acknowledgements
This research used resources provided by the SCINet project of the USDA Agricultural Research Service, ARS Project No. 0500-00093-001-00-D. The authors are also thankful to Dr. Karin Dorman (Professor, Department of Statistics, Iowa State University) for providing insightful feedback on comparing annotations and on implementing changepoint detection. The authors thank Gregory Fuerst for submitting data to NCBI. Finally, the authors thank Dr. Eve Wurtele (Professor, Department of Genetics, Development and Cell Biology, Iowa State University) for permitting her student Priyanka Bhandari to collaborate on this work.
Abbreviations
- ESTs: Expressed sequence tags
- NGS: Next generation sequencing
- NCBI: National Center for Biotechnology Information
- SRA: Sequence read archive
- UTR: Untranslated region
- CSV: Comma separated values
- AED: Annotation edit distance
- CPD: Changepoint detection
- TSS: Transcription start site
- CDS: Coding sequence
- CPU: Central processing unit
- cDNA: Complementary DNA
Authors' contributions
SB: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software design, Validation, Visualization, Support personnel, Writing—Original Draft Preparation, Writing—Review and Editing. PB: Formal Analysis, Writing—Review and Editing. MGW: Conceptualization, Resources, Supervision, Writing—Review and Editing. TZS: Conceptualization, Resources, Supervision, Writing—Review and Editing. RPW: Conceptualization, Investigation, Resources, Supervision, Writing—Review and Editing. CMA: Conceptualization, Funding Acquisition, Investigation, Project Administration, Resources, Supervision, Writing—Review and Editing. All the authors have read and approved the final manuscript.
Funding
This research was supported by the US Department of Agriculture, Agricultural Research Service, Project No. (5030-21000-068-00D) and (3625-21000-067-00D) through the Corn Insects and Crop Genetics Research Unit and Project No. (2030-21000-024-00D) through the Crop Improvement and Genetics Research Unit. This research was also supported in part by the Oak Ridge Institute for Science and Education (ORISE) under US Department of Energy (DOE) contract number DE-SC0014664 to SB and by National Science Foundation Plant Genome Research Program Grant 13-39348 to RPW. PB received funding from National Science Foundation Grant (IOS 1546858, in part); Orphan Genes, “An Untapped Genetic Reservoir of Novel Traits”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA, ARS, DOE, ORAU/ORISE or the National Science Foundation. USDA is an equal opportunity provider and employer.
Availability of data and materials
FINDER can be accessed from https://github.com/sagnikbanerjee15/Finder. The RNA-Seq samples used for annotation are listed in Additional file 8: Table S7. Barley PacBio sequences have been deposited in NCBI (Project Id: GSE165730).
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Genome List-Genome-NCBI. https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/. Accessed 12 Jan 2021.
- 2. Morganti S, Tarantino P, Ferraro E, D’Amico P, Viale G, Trapani D, et al. Complexity of genome sequencing and reporting: next generation sequencing (NGS) technologies and implementation of precision medicine in real life. Crit Rev Oncol Hematol. 2019;133:171–182. doi: 10.1016/j.critrevonc.2018.11.008.
- 3. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38. doi: 10.1016/j.cell.2013.09.006.
- 4. Phillips KA, Douglas MP. The global market for next-generation sequencing tests continues its torrid pace. J Precis Med. 2018;2018:4.
- 5. Kulski JK. Next-generation sequencing—an overview of the history, tools, and “Omic” applications. Next Generation Sequencing–Advances, Applications and Challenges. 2016;3–60.
- 6. Banerjee S, Mitra B, Chatterjee A, Santra A, Chatterjee B. Identification of relevant physico chemical properties of amino acids with respect to protein glycosylation prediction. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
- 7. Banerjee S, Basu S, Nasipuri M. Big Data Analytics and Its Prospects in Computational Proteomics. In: Information systems design and intelligent applications. Springer; 2015. p. 591–8.
- 8. Banerjee S, Velásquez-Zapata V, Fuerst G, Elmore JM, Wise RP, Elmore M. NGPINT: a next-generation protein–protein interaction software. Brief Bioinform. 2020;2020:1–14. doi: 10.1093/bib/bbaa351.
- 9. Rao VS, Srinivas K, Sujini GN, Kumar GN. Protein–protein interaction detection: methods and analysis. Int J Proteom. 2014;2014:147648. doi: 10.1155/2014/147648.
- 10. Patel S, Tripathi R, Kumari V, Varadwaj P. DeepInteract: deep neural network based protein–protein interaction prediction tool. Curr Bioinform. 2017;12:551–557. doi: 10.2174/1574893611666160815150746.
- 11. Chen M, Ju CJ-T, Zhou G, Chen X, Zhang T, Chang K-W, et al. Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics. 2019;35:305–314. doi: 10.1093/bioinformatics/btz328.
- 12. Yang S, Li H, He H, Zhou Y, Zhang Z. Critical assessment and performance improvement of plant–pathogen protein–protein interaction prediction methods. Brief Bioinform. 2019;20:274–287. doi: 10.1093/bib/bbx123.
- 13. Li Y, Ilie L. SPRINT: ultrafast protein–protein interaction prediction of the entire human interactome. BMC Bioinform. 2017;18:485. doi: 10.1186/s12859-017-1871-x.
- 14. Velásquez-Zapata V, Elmore JM, Banerjee S, Dorman KS, Wise RP. Next-generation yeast-two-hybrid analysis with Y2H-SCORES identifies novel interactors of the MLA immune receptor. PLoS Comput Biol 2021.
- 15. Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_MLP: Prediction of phosphorylation sites using a consensus of MLP classifiers. 2016.
- 16. Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_SVM : Prediction of Phosphorylation Sites using a consensus of SVM classifiers. In: Proceedings of Fifth International Conference on Soft Computing for Problem Solving. Springer; 2016. p. 1–8.
- 17. Banerjee S, Nag S, Tapadar S, Ghosh S, Guha S, Bakshi S. Improving protein protein interaction prediction by choosing appropriate physiochemical properties of amino acids. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–8.
- 18. Banerjee S, Basu S, Ghosh D, Nasipuri M. PhospredRF: Prediction of protein phosphorylation sites using a consensus of random forest classifiers. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
- 19. Luo F, Wang M, Liu Y, Zhao X-M, Li A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics. 2019;35:2766–2773. doi: 10.1093/bioinformatics/bty1051.
- 20. Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics. 2018;34:4223–4231. doi: 10.1093/bioinformatics/bty522.
- 21. Song J, Wang H, Wang J, Leier A, Marquez-Lago T, Yang B, et al. PhosphoPredict: a bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep. 2017;7:1–19. doi: 10.1038/s41598-016-0028-x.
- 22. Chen H, Xue Y, Huang N, Yao X, Sun Z. MeMo: a web tool for prediction of protein methylation modifications. Nucl Acids Res. 2006;34 suppl_2:W249–53.
- 23. Eisenhaber B, Eisenhaber F. Prediction of posttranslational modification of proteins from their amino acid sequence. In: Data mining techniques for the life sciences. Springer; 2010. p. 365–84.
- 24. Elmore MG, Banerjee S, Pedley KF, Ruck A, Whitham SA. De novo transcriptome of Phakopsora pachyrhizi uncovers putative effector repertoire during infection. Physiol Mol Plant Pathol. 2020;110:101464. doi: 10.1016/j.pmpp.2020.101464.
- 25. Frantzeskakis L, Kracher B, Kusch S, Yoshikawa-Maekawa M, Bauer S, Pedersen C, et al. Signatures of host specialization and a recent transposable element burst in the dynamic one-speed genome of the fungal barley powdery mildew pathogen. BMC Genomics. 2018;19:381. doi: 10.1186/s12864-018-4750-6.
- 26. Sperschneider J. Machine learning in plant–pathogen interactions: empowering biological predictions from field scale to genome scale. New Phytologist. 2019;nph.15771. doi: 10.1111/nph.15771.
- 27. Sperschneider J, Dodds PN, Singh KB, Taylor JM. ApoplastP: prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist. 2017.
- 28. Sperschneider J, Gardiner DM, Dodds PN, Tini F, Covarelli L, Singh KB, et al. EffectorP: predicting fungal effector proteins from secretomes using machine learning. New Phytol. 2016;210:743–761. doi: 10.1111/nph.13794.
- 29. Magnan CN, Baldi P. SSpro/ACCpro 5: Almost Perfect Prediction of Protein Secondary Structure and Relative Solvent Accessibility Using Profiles, Machine Learning, and Structural Similarity. Bioinformatics. 2014;btu352.
- 30. McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England) 2000;16:404–405. doi: 10.1093/bioinformatics/16.4.404.
- 31. Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. J Mol Biol. 2005;351:614–626. doi: 10.1016/j.jmb.2005.05.067.
- 32. Banerjee S, Guha S, Dutta A, Dutta S. Improvement of protein disorder prediction by brainstorming consensus. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
- 33. Salzberg SL. Next-generation genome annotation: we still struggle to get it right. BioMed Central; 2019. doi: 10.1186/s13059-019-1715-2.
- 34. del Angel VD, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Pettersson OV, et al. Ten steps to get started in genome assembly and annotation. F1000Research. 2018;7.
- 35. Richards S. Full disclosure: genome assembly is still hard. PLoS Biol. 2018;16:e2005894. doi: 10.1371/journal.pbio.2005894.
- 36. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Jr, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucl Acids Res. 2003;31:5654–5666. doi: 10.1093/nar/gkg770.
- 37. Salamov A, Solovyev V. Fgenesh multiple gene prediction program; 1998.
- 38. Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7:S10. doi: 10.1186/gb-2006-7-s1-s10.
- 39. Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V. GeneGenerator—a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics (Oxford) 1998;14:232–243. doi: 10.1093/bioinformatics/14.3.232.
- 40. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009;19:2133–2143. doi: 10.1101/gr.090597.108.
- 41. Schlueter SD, Dong Q, Brendel V. GeneSeqer@ PlantGDB: gene structure prediction in plant genomes. Nucl Acids Res. 2003;31:3597–3600. doi: 10.1093/nar/gkg533.
- 42. Cantarel BL, Korf I, Robb SMCC, Parra G, Ross E, Moore B, et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–196. doi: 10.1101/gr.6743907.
- 43. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 2011;12:491. doi: 10.1186/1471-2105-12-491.
- 44. Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, et al. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 2014;164:513–524. doi: 10.1104/pp.113.230144.
- 45. Campbell MS, Holt C, Moore B, Yandell M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinform. 2014;48:4–11. doi: 10.1002/0471250953.bi0411s48.
- 46. Vonk FJ, Casewell NR, Henkel CV, Heimberg AM, Jansen HJ, McCleary RJR, et al. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proc Natl Acad Sci. 2013;110:20651–20656. doi: 10.1073/pnas.1314702110.
- 47. Keane M, Semeiks J, Webb AE, Li YI, Quesada V, Craig T, et al. Insights into the evolution of longevity from the bowhead whale genome. Cell Rep. 2015;10:112–122. doi: 10.1016/j.celrep.2014.12.008.
- 48. Zhang J, Fu X-X, Li R-Q, Zhao X, Liu Y, Li M-H, et al. The hornwort genome and early land plant evolution. Nature plants. 2020;6:107–118. doi: 10.1038/s41477-019-0588-4.
- 49. Gray MW, Burger G, Derelle R, Klimeš V, Leger MM, Sarrasin M, et al. The draft nuclear genome sequence and predicted mitochondrial proteome of Andalucia godoyi, a protist with the most gene-rich and bacteria-like mitochondrial genome. BMC Biol. 2020;18:1–35. doi: 10.1186/s12915-020-0741-6.
- 50. Peng C, Ren J-L, Deng C, Jiang D, Wang J, Qu J, et al. The genome of Shaw’s sea snake (Hydrophis curtus) reveals secondary adaptation to its marine environment. Mol Biol Evol; 2020.
- 51. Weitemier K, Straub SCK, Fishbein M, Bailey CD, Cronn RC, Liston A. A draft genome and transcriptome of common milkweed (Asclepias syriaca) as resources for evolutionary, ecological, and molecular studies in milkweeds and Apocynaceae. PeerJ. 2019;7:e7649. doi: 10.7717/peerj.7649.
- 52. Zhang J, Zhang X, Tang H, Zhang Q, Hua X, Ma X, et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet. 2018;50:1565–1573. doi: 10.1038/s41588-018-0237-2.
- 53. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. doi: 10.1101/gr.074492.107.
- 54. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011;29:644. doi: 10.1038/nbt.1883.
- 55. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883.
- 56. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–477. doi: 10.1089/cmb.2012.0021.
- 57. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660–1666. doi: 10.1093/bioinformatics/btu077.
- 58. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511. doi: 10.1038/nbt.1621.
- 59. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
- 60. Liu R, Dickerson J. Strawberry: fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq. PLoS Comput Biol. 2017;13:e1005851. doi: 10.1371/journal.pcbi.1005851.
- 61. Shao M, Kingsford C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol. 2017;35:1167–1169. doi: 10.1038/nbt.4020.
- 62. Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:1–13. doi: 10.1186/s13059-019-1910-1.
- 63. Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun. 2019;10:5000. doi: 10.1038/s41467-019-12990-0.
- 64. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS: Table 1. Bioinformatics. 2016;32:767–769. doi: 10.1093/bioinformatics/btv661.
- 65. Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. Whole-genome annotation with BRAKER. In: Gene prediction. Springer; 2019. p. 65–95.
- 66. Steijger T, Abril JF, Engström PG, Kokocinski F, Akerman M, Alioto T, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10:1177–1184. doi: 10.1038/nmeth.2714.
- 67. Lomsadze A, Burns PD, Borodovsky M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucl Acids Res. 2014;42:e119–e119. doi: 10.1093/nar/gku557.
- 68. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–644. doi: 10.1093/bioinformatics/btn013.
- 69.Korf I. Gene finding in novel genomes. BMC Bioinform. 2004;5:59. doi: 10.1186/1471-2105-5-59. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Keilwagen J, Hartung F, Grau J. GeMoMa: Homology-based gene prediction utilizing intron position conservation and RNA-seq data. In: Methods in molecular biology. 2019. [DOI] [PubMed]
- 71.Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucl Acids Res. 2010;39 suppl_1:D19–21. [DOI] [PMC free article] [PubMed]
- 72.Dobin A, Gingeras TR, Spring C, Flores R, Sampson J, Knight R, et al. Mapping RNA-seq with STAR. Curr Protoc Bioinform. 2016;51:586–597. [Google Scholar]
- 73.Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Tang S, Lomsadze A, Borodovsky M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 2015;43:e78. doi: 10.1093/nar/gkv227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010;26:2778–2779. doi: 10.1093/bioinformatics/btq524. [DOI] [PubMed] [Google Scholar]
- 76.Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Alioto T, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10:1185–1191. doi: 10.1038/nmeth.2722. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Rapazote-Flores P, Bayer M, Milne L, Mayer C-D, Fuller J, Guo W, et al. BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq. BMC Genomics. 2019;20:1–17. doi: 10.1186/s12864-019-6243-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Ustianenko D, Weyn-Vanhentenryck SM, Zhang C. Microexons: discovery, regulation, and function. Wiley Interdiscip Rev RNA. 2017;8:e1418. doi: 10.1002/wrna.1418. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Curry-Hyde A, Chen BJ, Mills JD, Janitz M. Microexons: novel regulators of the transcriptome. J Hum Transcript. 2018;2:1–6. doi: 10.1080/23324015.2018.1491940. [DOI] [Google Scholar]
- 80.Wen F, Li F, Xia H, Lu X, Zhang X, Li Y. The impact of very short alternative splicing on protein structures and functions in the human genome. Trends Genet. 2004;20:232–236. doi: 10.1016/j.tig.2004.03.005. [DOI] [PubMed] [Google Scholar]
- 81.Sakharkar MK, Chow VTK, Kangueane P. Distributions of exons and introns in the human genome. silico biology. 2004;4:387–393. [PubMed] [Google Scholar]
- 82.Mano F, Aoyanagi T, Kozaki A. Atypical splicing accompanied by skipping conserved micro-exons produces unique WRINKLED1, an AP2 domain transcription factor in rice plants. Plants. 2019;8:207. doi: 10.3390/plants8070207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Song Q, Lv F, Tahir ul Qamar M, Xing F, Zhou R, Li H, et al. Identification and analysis of micro-exon genes in the rice genome. Int J Mol Sci. 2019;20:2685. doi: 10.3390/ijms20112685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Bulman S, Ridgway HJ, Eady C, Conner AJ. Intron-rich gene structure in the intracellular plant parasite Plasmodiophora brassicae. Protist. 2007;158:423–433. doi: 10.1016/j.protis.2007.04.005. [DOI] [PubMed] [Google Scholar]
- 85.Wang X. Protein and proteome atlas for plants under stresses: new highlights and ways for integrated Omics in post-genomics era; 2019. [DOI] [PMC free article] [PubMed]
- 86.Guo L, Liu C-M. A single-nucleotide exon found in Arabidopsis. Sci Rep. 2015;5:18087. doi: 10.1038/srep18087. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Gonatopoulos-Pournatzis T, Wu M, Braunschweig U, Roth J, Han H, Best AJ, et al. Genome-wide CRISPR-Cas9 interrogation of splicing networks reveals a mechanism for recognition of autism-misregulated neuronal microexons. Mol Cell. 2018;72:510–524. doi: 10.1016/j.molcel.2018.10.008. [DOI] [PubMed] [Google Scholar]
- 88.Consortium Gte. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348:660–5. [DOI] [PMC free article] [PubMed]
- 89.Irimia M, Weatheritt RJ, Ellis JD, Parikshak NN, Gonatopoulos-Pournatzis T, Babor M, et al. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell. 2014;159:1511–1523. doi: 10.1016/j.cell.2014.11.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Torres-Méndez A, Bonnal S, Marquez Y, Roth J, Iglesias M, Permanyer J, et al. A novel protein domain in an ancestral splicing factor drove the evolution of neural microexons. Nat Ecol Evol. 2019;3:691–701. doi: 10.1038/s41559-019-0813-6. [DOI] [PubMed] [Google Scholar]
- 91.Parras A, Anta H, Santos-Galindo M, Swarup V, Elorza A, Nieto-González JL, et al. Autism-like phenotype and risk gene mRNA deadenylation by CPEB4 mis-splicing. Nature. 2018;560:441–446. doi: 10.1038/s41586-018-0423-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Wu J, Anczukow O, Krainer AR, Zhang MQ, Zhang C. OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res. 2013;41:5149–5163. doi: 10.1093/nar/gkt216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Kawahara Y, Sugiyama M. Change-point detection in time-series data by direct density-ratio estimation. In: Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM; 2009. p. 389–400.
- 94.Lund R, Wang XL, Lu QQ, Reeves J, Gallagher C, Feng Y. Changepoint detection in periodic and autocorrelated time series. J Clim. 2007;20:5178–5190. doi: 10.1175/JCLI4291.1. [DOI] [Google Scholar]
- 95.Kawahara Y, Yairi T, Machida K. Change-point detection in time-series data based on subspace identification. In: Seventh IEEE international conference on data mining (ICDM 2007). IEEE; 2007. p. 559–64.
- 96.Takeuchi J, Yamanishi K. A unifying framework for detecting outliers and change points from time series. IEEE Trans Knowl Data Eng. 2006;18:482–492. doi: 10.1109/TKDE.2006.1599387. [DOI] [Google Scholar]
- 97.Aalvik Stranden S. A supervised sliding window approach for change point detection in multivariate time series; 2020.
- 98.Tartakovsky AG, Rozovskii BL, Blazek RB, Kim H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Trans Signal Process. 2006;54:3372–3382. doi: 10.1109/TSP.2006.879308. [DOI] [Google Scholar]
- 99.Klanderman MC, Newhart KB, Cath TY, Hering AS. Fault isolation for a complex decentralized waste water treatment facility. J R Stat Soc Ser C. 2020;69:931–951. doi: 10.1111/rssc.12429. [DOI] [Google Scholar]
- 100.Quinlan AR. BEDTools: the Swiss-army tool for genome feature analysis. Curr Protoc Bioinformatics. 2014;47:11–12. doi: 10.1002/0471250953.bi1112s47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Killick R, Eckley I. changepoint: an R package for changepoint analysis. J Stat Softw. 2014;58:1–19. doi: 10.18637/jss.v058.i03. [DOI] [Google Scholar]
- 102.Xiang S, Huang Z, Wang T, Han Z, Christina YY, Ni D, et al. Condition-specific gene co-expression network mining identifies key pathways and regulators in the brain tissue of Alzheimer’s disease patients. BMC Med Genomics. 2018;11:115. doi: 10.1186/s12920-018-0431-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Bruna T, Hoff K, Stanke M, Lomsadze A, Borodovsky M. BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database. bioRxiv. 2020. [DOI] [PMC free article] [PubMed]
- 104.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
- 105.Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 2005;6:31. doi: 10.1186/1471-2105-6-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Cheng C, Krishnakumar V, Chan AP, Thibaud-Nissen F, Schobel S, Town CD. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017;89:789–804. doi: 10.1111/tpj.13415. [DOI] [PubMed] [Google Scholar]
- 107.Li J-Y, Wang J, Zeigler RS. The 3,000 rice genomes project: new opportunities and challenges for future rice research. Gigascience. 2014;3:2047–2217. doi: 10.1186/2047-217X-3-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, et al. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection. PLoS ONE. 2015;10:e0124617. doi: 10.1371/journal.pone.0124617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M, et al. Genomic diversity and introgression in O. sativa reveal the impact of domestication and breeding on the rice genome. PLoS ONE. 2010;5:e10780. doi: 10.1371/journal.pone.0010780. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, et al. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun. 2016;7:11708. doi: 10.1038/ncomms11708. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–5. [DOI] [PubMed]
- 112.The_C_elegans_Sequencing_Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–8. [DOI] [PubMed]
- 113.Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 2008;18:1979–1990. doi: 10.1101/gr.081612.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Drosophila_consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203. [DOI] [PubMed]
- 115.International_Human_Genome_Sequencing_consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [DOI] [PubMed]
- 116.Hood L, Rowen L. The human genome project: big science transforms biology and medicine. Genome Med. 2013;5:79. doi: 10.1186/gm483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Monat C, Padmarasu S, Lux T, Wicker T, Gundlach H, Himmelbach A, et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 2019;20:284. doi: 10.1186/s13059-019-1899-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118.Appels R, Eversole K, Stein N, Feuillet C, Keller B, Rogers J, et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 2018;361. [DOI] [PubMed]
- 119.Krasileva KV, Vasquez-Gross HA, Howell T, Bailey P, Paraiso F, Clissold L, et al. Uncovering hidden variation in polyploid wheat. Proc Natl Acad Sci. 2017;114:E913–E921. doi: 10.1073/pnas.1619268114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017;27:885–896. doi: 10.1101/gr.217117.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121.Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucl Acids Res. 2014;42:D756–D763. doi: 10.1093/nar/gkt1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Tello-Ruiz MK, Naithani S, Stein JC, Gupta P, Campbell M, Olson A, et al. Gramene 2018: unifying comparative genomics and pathway resources for plant research. Nucl Acids Res. 2018;46:D1181–D1189. doi: 10.1093/nar/gkx1111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, et al. Improved maize reference genome with single-molecule technologies. Nature. 2017;546:524–527. doi: 10.1038/nature22971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Eilbeck K, Moore B, Holt C, Yandell M. Quantitative measures for the management and comparison of annotated genomes. BMC Bioinform. 2009;10:67. doi: 10.1186/1471-2105-10-67. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience. 2018;7. 10.1093/gigascience/giy093. [DOI] [PMC free article] [PubMed]
- 126.TAIR. Documentation for the TAIR gene model and exon confidence ranking system. 2009. http://plantta.jcvi.org/. Accessed 9 Oct 2020.
- 127.Sreenivasamurthy SK, Madugundu AK, Patil AH, Dey G, Mohanty AK, Kumar M, et al. Mosquito-borne diseases and Omics: tissue-restricted expression and alternative splicing revealed by transcriptome profiling of Anopheles stephensi. Omics J Integr Biol. 2017;21:488–497. doi: 10.1089/omi.2017.0073. [DOI] [PubMed] [Google Scholar]
- 128.Azlan A, Obeidat SM, Yunus MA, Azzam G. Transcriptome profiles and novel lncRNA identification of Aedes aegypti cells in response to dengue virus serotype 1. BioRxiv. 2018;:422170.
- 129.Azlan A, Halim MA, Azzam G. Genome-wide identification and characterization of long intergenic noncoding RNAs in the regenerative flatworm Macrostomum lignano. Genomics. 2020;112:1273–1281. doi: 10.1016/j.ygeno.2019.07.016. [DOI] [PubMed] [Google Scholar]
- 130.Qi S, Akter S, Li S. Identification of Novel lincRNA and Co-Expression Network Analysis Using RNA-Sequencing Data in Plants. In: Plant long non-coding RNAs. Springer; 2019. p. 207–21. [DOI] [PubMed]
- 131.Beisel NS, Noble J, Barbazuk WB, Paul A-L, Ferl RJ. Spaceflight-induced alternative splicing during seedling development in Arabidopsis thaliana. NPJ Micrograv. 2019;5:1–5. doi: 10.1038/s41526-019-0070-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Wang C, Wallerman O, Arendt M-L, Sundstrom E, Karlsson A, Nordin J, et al. A new long-read dog assembly uncovers thousands of exons and functional elements missing in the previous reference. bioRxiv. 2020.
- 133.Liu S, Aagaard A, Bechsgaard J, Bilde T. DNA methylation patterns in the social spider. Stegodyphus dumicola Genes. 2019;10:137. doi: 10.3390/genes10020137. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Wu S, Gao S, Wang S, Meng J, Wickham J, Luo S, et al. A reference genome of bursaphelenchus mucronatus provides new resources for revealing its displacement by pinewood nematode. Genes. 2020;11:570. doi: 10.3390/genes11050570. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Wang P, Luo Y, Huang J, Gao S, Zhu G, Dang Z, et al. The genome evolution and domestication of tropical fruit mango. Genome Biol. 2020;21:1–17. doi: 10.1186/s13059-019-1906-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Cieślik M, Chinnaiyan AM. Cancer transcriptome profiling at the juncture of clinical translation. Nat Rev Genet. 2018;19:93. doi: 10.1038/nrg.2017.96. [DOI] [PubMed] [Google Scholar]
- 137.Lorenzi L, Avila Cobos F, Decock A, Everaert C, Helsmoortel H, Lefever S, et al. Long noncoding RNA expression profiling in cancer: challenges and opportunities. Genes Chromosom Cancer. 2019;58:191–199. doi: 10.1002/gcc.22709. [DOI] [PubMed] [Google Scholar]
- 138.Yang J, Moeinzadeh M-H, Kuhl H, Helmuth J, Xiao P, Haas S, et al. Haplotype-resolved sweet potato genome traces back its hexaploidization history. Nature plants. 2017;3:696–703. doi: 10.1038/s41477-017-0002-z. [DOI] [PubMed] [Google Scholar]
- 139.Sun Z, Nair A, Chen X, Prodduturi N, Wang J, Kocher J-P. UClncR: ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci Rep. 2017;7:1–10. doi: 10.1038/s41598-016-0028-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–578. doi: 10.1038/nprot.2012.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 141.Arrigoni A, Ranzani V, Rossetti G, Panzeri I, Abrignani S, Bonnal RJP, et al. Analysis RNA-seq and Noncoding RNA. In: Polycomb group proteins. Springer; 2016. p. 125–35. [DOI] [PubMed]
- 142.Ghosh S, Chan C-KK. Analysis of RNA-Seq data using TopHat and Cufflinks. In: Plant Bioinformatics. Springer; 2016. p. 339–61. [DOI] [PubMed]
- 143.Qi X, Xie S, Liu Y, Yi F, Yu J. Genome-wide annotation of genes and noncoding RNAs of foxtail millet in response to simulated drought stress by deep sequencing. Plant Mol Biol. 2013;83:459–473. doi: 10.1007/s11103-013-0104-6. [DOI] [PubMed] [Google Scholar]
- 144.Marchant A, Mougel F, Mendonça V, Quartier M, Jacquin-Joly E, da Rosa JA, et al. Comparing de novo and reference-based transcriptome assembly strategies by applying them to the blood-sucking bug Rhodnius prolixus. Insect Biochem Mol Biol. 2016;69:25–33. doi: 10.1016/j.ibmb.2015.05.009. [DOI] [PubMed] [Google Scholar]
- 145.Li A, Zhang J, Zhou Z, Wang L, Liu Y, Liu Y. ALDB: a domestic-animal long noncoding RNA database. PLoS ONE. 2015;10:e0124003. doi: 10.1371/journal.pone.0124003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 146.Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006;16:1–10. doi: 10.1101/gr.4222606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 147.Brown RH, Gross SS, Brent MR. Begin at the beginning: predicting genes with 5′ UTRs. Genome Res. 2005;15:742–747. doi: 10.1101/gr.3696205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148.Ohler U, Liao G, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002;3:research0087–1. [DOI] [PMC free article] [PubMed]
- 149.Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5′-complete cDNAs. Curr Protoc Mol Biol. 2013;104:25B–B11. doi: 10.1002/0471142727.mb25b11s104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 150.Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat Methods. 2018;15:505–511. doi: 10.1038/s41592-018-0014-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151.Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci. 2003;100:15776–15781. doi: 10.1073/pnas.2136655100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 152.Holmqvist E, Wright PR, Li L, Bischler T, Barquist L, Reinhardt R, et al. Global RNA recognition patterns of post-transcriptional regulators Hfq and CsrA revealed by UV crosslinking in vivo. EMBO J. 2016;35:991–1011. doi: 10.15252/embj.201593360. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 153.Hickman R, van Verk MC, van Dijken AJH, Mendes MP, Vroegop-Vos IA, Caarls L, et al. Architecture and dynamics of the jasmonic acid gene regulatory network. Plant Cell Online. 2017;:tpc-00958. [DOI] [PMC free article] [PubMed]
- 154.Jackson RJ, Standart N. Do the poly (A) tail and 3′ untranslated region control mRNA translation? Cell. 1990;62:15–24. doi: 10.1016/0092-8674(90)90235-7. [DOI] [PubMed] [Google Scholar]
- 155.Meijer HA, Thomas AAM. Control of eukaryotic protein synthesis by upstream open reading frames in the 5′-untranslated region of an mRNA. Biochem J. 2002;367:1–11. doi: 10.1042/bj20011706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 156.Miller GM, Madras BK. Polymorphisms in the 3′-untranslated region of human and monkey dopamine transporter genes affect reporter gene expression. Mol Psychiatry. 2002;7:44–55. doi: 10.1038/sj.mp.4000921. [DOI] [PubMed] [Google Scholar]
- 157.Wu S, Huang S, Ding J, Zhao Y, Liang L, Liu T, et al. Multiple microRNAs modulate p21Cip1/Waf1 expression by directly targeting its 3′ untranslated region. Oncogene. 2010;29:2302–2308. doi: 10.1038/onc.2010.34. [DOI] [PubMed] [Google Scholar]
- 158.Dixon DA, Kaplan CD, McIntyre TM, Zimmerman GA, Prescott SM. Post-transcriptional control of cyclooxygenase-2 gene expression The role of the 3′-untranslated region. J Biol Chem. 2000;275:11750–11757. doi: 10.1074/jbc.275.16.11750. [DOI] [PubMed] [Google Scholar]
- 159.Gu S, Jin L, Zhang F, Sarnow P, Kay MA. Biological basis for restriction of microRNA targets to the 3′ untranslated region in mammalian mRNAs. Nat Struct Mol Biol. 2009;16:144. doi: 10.1038/nsmb.1552. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 160.Eberle AB, Stalder L, Mathys H, Orozco RZ, Mühlemann O. Posttranscriptional gene regulation by spatial rearrangement of the 3′ untranslated region. PLoS Biol. 2008;6:e92. doi: 10.1371/journal.pbio.0060092. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 161.Halterman DA, Wise RP. Upstream open reading frames of the barley Mla13 powdery mildew resistance gene function co-operatively to down-regulate translation. Mol Plant Pathol. 2006;7:167–176. doi: 10.1111/j.1364-3703.2006.00329.x. [DOI] [PubMed] [Google Scholar]
- 162.Awata T, Inoue K, Kurihara S, Ohkubo T, Watanabe M, Inukai K, et al. A common polymorphism in the 5′-untranslated region of the VEGF gene is associated with diabetic retinopathy in type 2 diabetes. Diabetes. 2002;51:1635–1639. doi: 10.2337/diabetes.51.5.1635. [DOI] [PubMed] [Google Scholar]
- 163.Rogers JT, Randall JD, Cahill CM, Eder PS, Huang X, Gunshin H, et al. An iron-responsive element type II in the 5′-untranslated region of the Alzheimer’s amyloid precursor protein transcript. J Biol Chem. 2002;277:45518–45528. doi: 10.1074/jbc.M207435200. [DOI] [PubMed] [Google Scholar]
- 164.Chin LJ, Ratner E, Leng S, Zhai R, Nallur S, Babar I, et al. A SNP in a let-7 microRNA complementary site in the KRAS 3′ untranslated region increases non-small cell lung cancer risk. Can Res. 2008;68:8535–8540. doi: 10.1158/0008-5472.CAN-08-2129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 165.Halterman DA, Wei F, Wise RP. Powdery mildew-induced Mla mRNAs are alternatively spliced and contain multiple upstream open reading frames. Plant Physiol. 2003;131:558–567. doi: 10.1104/pp.014407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 166.Conne B, Stutz A, Vassalli J-D. The 3′ untranslated region of messenger RNA: a molecular ‘hotspot’for pathology? Nat Med. 2000;6:637–641. doi: 10.1038/76211. [DOI] [PubMed] [Google Scholar]
- 167.Hunt M, Banerjee S, Surana P, Liu M, Fuerst G, Mathioni S, et al. Small RNA discovery in the interaction between barley and the powdery mildew pathogen. BMC Genomics. 2019;20:610. doi: 10.1186/s12864-019-5947-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 168.Chapman AVE, Matthew H, Surana P, Velásquez-Zapata V, Xu W, Fuerst G, et al. Disruption of barley immunity to powdery mildew by an in-frame Lys-Leu deletion in the essential protein SGT1. Oxford Genetics. 2020. [DOI] [PMC free article] [PubMed]
- 169.Dai X, Xu Z, Liang Z, Tu X, Zhong S, Schnable JC. Non-homology-based prediction of gene functions. 2019;1–18. [DOI] [PubMed]
- 170.Duvick DN. The contribution of breeding to yield advances in maize (Zea mays L.). Adv Agronomy. 2005;86:83–145.
- 171.Agrama HAS, Moussa ME. Mapping QTLs in breeding for drought tolerance in maize (Zea mays L.). Euphytica. 1996;91:89–97.
- 172.Maazou A-RS, Tu J, Qiu J, Liu Z. Breeding for drought tolerance in maize (Zea mays L.). Am J Plant Sci. 2016;7:1858.
- 173.Bhandary P, Seetharam AS, Arendsee ZW, Hur M, Wurtele ES. Raising orphans from a metadata morass: a researcher’s guide to re-use of public ’omics data. Plant Sci. 2018 doi: 10.1016/j.plantsci.2017.10.014. [DOI] [PubMed] [Google Scholar]
- 174.Arendsee ZW, Li L, Wurtele ES. Coming of age: orphan genes in plants. Trends Plant Sci. 2014;19:698–708. doi: 10.1016/J.TPLANTS.2014.07.003. [DOI] [PubMed] [Google Scholar]
- 175.Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES. phylostratr: a framework for phylostratigraphy. Bioinformatics. 2019;35:3617–3627. doi: 10.1093/bioinformatics/btz171. [DOI] [PubMed] [Google Scholar]
Data Availability Statement
FINDER is available at https://github.com/sagnikbanerjee15/Finder. The RNA-Seq samples used for annotation are listed in Additional file 8: Table S7. Barley PacBio sequences have been deposited in NCBI (Project ID: GSE165730).