Molecular Biology and Evolution. 2022 Jun 14;39(7):msac133. doi: 10.1093/molbev/msac133

Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses

Chuanyi Zhang, Palash Sashittal, Michael Xiang, Yichi Zhang, Ayesha Kazi, Mohammed El-Kebir
Editor: Thomas Leitner
PMCID: PMC9214144  PMID: 35700225

Abstract

Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5′ end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS Identification problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS and Gene Identification problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem, which includes a web-based visualization tool to explore the space of near-optimal solutions. We show that CORSID outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.

Keywords: core sequences, gene identification, coronavirus, motif finding, local alignment, maximum weight independent set, interval graph

Introduction

Coronaviruses are enveloped viruses comprised of a positive-sense, single-stranded RNA genome that is ready to be translated by the host ribosome. While the majority of messenger RNA (mRNA) in eukaryotes is monocistronic, that is, each mRNA is translated into a single gene product, the coronavirus RNA genome is comprised of several structural, non-structural and accessory genes (fig. 1a). These genes are necessary for the viral life cycle and are expressed and translated using three distinct mechanisms (Sola et al. 2015).

Fig. 1.

Overview. (a) A coronavirus genome $v$ consists of a leader region $v_{\text{leader}}$ and a body region $v_{\text{body}}$. (b) Structural and accessory genes are expressed via discontinuous transcription with template switching occurring at transcription regulatory sequences (TRS, indicated in red), resulting in subgenomic messenger RNAs (sgmRNAs) for each gene. (c) In the TRS Identification (TRS-ID) problem, we wish to identify TRSs given a genome $v$ with genes $x_0, \ldots, x_n$. The TRS and Gene Identification (TRS-Gene-ID) problem asks to simultaneously identify genes and their associated TRSs given genome $v$. Throughout this manuscript, we use “T” (thymine) rather than “U” (uracil).

First, upon cell entry, the viral genome is translated to produce polypeptides corresponding to one or two overlapping open reading frames (ORFs). The resulting polypeptides undergo auto-cleavage, producing many non-structural proteins, including the RNA-dependent RNA polymerase (RdRP). Second, the viral RdRP mediates the expression of the remaining viral genes via discontinuous transcription (Sola et al. 2015). That is, the RdRP is prone to perform template switching, predominantly upon encountering transcription regulatory sequences (TRSs), located in the 5′ untranslated region (UTR) of the genome (called TRS-L, where L stands for leader) and upstream of viral genes (called TRS-B, where B stands for body) (fig. 1b). Note that while previous studies have found evidence of TRS-independent template switching leading to non-canonical transcripts, the function of these transcripts is still unknown (Kim et al. 2020; Finkel et al. 2021; Sashittal et al. 2021). Third, occasionally certain genes are expressed via leaky scanning, where a weak initiation codon leads to the translation of the next downstream ORF (Jungreis et al. 2021). Not only is the identification and characterization of TRS sites crucial to understanding the regulation and expression of the viral proteins, but here we hypothesize that the existence of these regulatory sequences can be leveraged to simultaneously identify TRS sites and associated viral genes in unannotated coronavirus genomes with high accuracy.

While there exist methods for identifying either TRS sites or viral genes, no method exists that does so simultaneously (supplementary table S1, Supplementary Material online). More specifically, since each TRS contains a 6–7 nt long conserved sequence, called a core sequence (Sola et al. 2015; Finkel et al. 2021), general-purpose motif finding methods (Pavesi et al. 2004; Down and Hubbard 2005; Yao et al. 2006; Bailey et al. 2009) can be employed to identify TRS-L and TRS-Bs in coronaviruses. For instance, MEME (Bailey et al. 2009) is a widely used method that employs expectation maximization to identify multiple appearances of multiple motifs simultaneously. The only method that is specifically developed for identifying TRS sites in coronaviruses is SuPER (Yang et al. 2021), which takes as input a coronavirus genome sequence with specified gene locations as well as additional taxonomic and secondary structure information. Importantly, SuPER as well as other general-purpose motif finding algorithms are unable to identify viral genes in unannotated coronavirus genome sequences.

On the other hand, gene prediction is a well-studied problem with many methods including Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007), Prodigal (Hyatt et al. 2010, 2012) and VADR (Schäffer et al. 2020). Glimmer3 uses a Markov model to assign scores to ORFs, and then processes overlapping genes to generate the final list of predicted genes. By contrast, Prodigal employs a more heuristic approach with fine-tuned parameters that are optimized to identify genes in prokaryotes. While Glimmer3 and Prodigal are designed for prokaryotic genomes, VADR is specifically designed for identifying genes in viral genomes. To that end, VADR first classifies the input sequence and finds the most similar sequence in a pre-specified database, maps the curated annotations to the input based on a covariance model, and then uses BLAST (Altschul et al. 1990) to validate the annotated genes. Importantly, current gene finding tools do not leverage the genomic structure of coronaviruses, specifically the TRS sites located upstream of the genes in the genome, nor are they able to directly identify these regulatory sequences.

In this study, we introduce the TRS Identification (TRS-ID) and the TRS and Gene Identification (TRS-Gene-ID) problems, to identify TRS sites in a coronavirus genome with specified gene annotations, and to simultaneously identify TRS sites and genes in an unannotated coronavirus genome, respectively (fig. 1c). Underpinning our approach is the concept of a TRS alignment, which is a multiple sequence alignment of TRS sites with additional constraints that result from template switching by RdRP. We introduce CORSID-A, a dynamic programming (DP) algorithm to solve the TRS-ID problem, adapting the recurrence that underlies the Smith–Waterman algorithm (Smith and Waterman 1981) for local sequence alignment. Additionally, we introduce CORSID to solve the TRS-Gene-ID problem via a maximum-weight independent set problem (Hsiao et al. 1992) on an interval graph defined by the candidate ORFs in the genome with weights obtained from the previous DP.

CORSID enables de novo identification of viral genes in coronaviruses using only the nucleotide sequence of the viral genome. CORSID-A, on the other hand, is designed to identify all the TRSs in a coronavirus genome annotated with gene locations. We evaluate the performance of our methods on 468 coronavirus genomes downloaded from GenBank, demonstrating that CORSID-A outperforms MEME and SuPER in identifying TRS sites and, unlike these methods, possesses the ability to identify recombination events. Moreover, we find that CORSID outperforms state-of-the-art gene finding methods. Finally, we illustrate how CORSID enables de novo identification of TRS sites and genes in previously unannotated coronaviruses. In summary, CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of prior taxonomic or secondary structure information.

New Approaches

Viewing TRS and gene identification as a multiple sequence alignment problem is a novel approach, which we outline in this section. We begin by introducing notation and key definitions, then state the TRS Identification problem and the TRS and Gene Identification problem. Finally, we give an overview of the key methodological contributions of our algorithms.

Preliminaries

A genome $v = v_1 \ldots v_{|v|}$ is a sequence from the alphabet $\Sigma = \{\mathrm{A}, \mathrm{T}, \mathrm{C}, \mathrm{G}\}$. The first position of the genome is known as the 5′ end whereas the last position of the genome is known as the 3′ end. We denote the contiguous subsequence $v_p \ldots v_q$ of $v$ by $v[p, q]$. We also call a contiguous subsequence $x$ of $v$ a region, denoted as $x = [x^-, x^+]$ such that $x = v[x^-, x^+]$. Thus, coordinates $x^-$ and $x^+$ of a subsequence $x$ are in terms of the reference genome $v$, that is, $x = v_{x^-} \ldots v_{x^+}$. Alternatively, we may refer to individual characters in a subsequence $x$ using relative indices, that is, $x = x_1 \ldots x_{|x|}$. Our goals are twofold: given a coronavirus genome $v$, we aim to identify (i) TRS-L and TRS-Bs, and optionally, (ii) the associated genes (fig. 1c). To begin, recall the following definition of an alignment.

Definition 1.

Matrix $A = [a_{ij}]$ with $n + 1$ rows is an alignment of sequences $b_0, \ldots, b_n \in \Sigma^*$ provided (i) entries $a_{ij}$ either correspond to a letter in the alphabet $\Sigma$ or a gap denoted by “–” such that (ii) no column of $A$ is composed of only gaps, and (iii) the removal of gaps of row $i$ of $A$ yields sequence $b_i$.

Here, we seek an alignment with two additional constraints, called a TRS alignment defined as follows.

Definition 2.

An alignment $A = [a_0, \ldots, a_n]$ is a TRS alignment provided (i) $a_0$ does not contain any gaps, and (ii) $a_1, \ldots, a_n$ do not contain any internal gaps.

Intuitively, the first sequence $a_0$ in the alignment $A$ represents TRS-L, whereas $a_1, \ldots, a_n$ represent TRS-Bs, each upstream of an accessory or structural gene. We do not allow gaps in the TRS-L sequence $a_0$ as template switching by RdRP occurs due to complementary base pairing between TRS-L and the nascent strand of TRS-B (Sola et al. 2005). For the same reason, we do not allow internal gaps in TRS-Bs $a_i$. However, as each TRS-B may match a different region of the TRS-L, we do allow flanking gaps in these sequences (fig. 1c). We score a TRS alignment $A$ using a scoring function $\delta : \Sigma \times (\Sigma \cup \{-\}) \to \mathbb{R}$ in the following way.

Definition 3.

The score $s(A)$ of a TRS alignment $A = [a_0, \ldots, a_n]$ is given by $\sum_{i=1}^{n} s(a_0, a_i) = \sum_{i=1}^{n} \sum_{j=1}^{|a_0|} \delta(a_{0j}, a_{ij})$, whereas the minimum score $s_{\min}(A)$ is defined as $\min_{i \in \{1, \ldots, n\}} s(a_0, a_i)$.

In other words, we score each TRS-B $a_i$ (where $i \geq 1$) by comparing it to the TRS-L sequence $a_0$ in a way that is consistent with the mechanism of template switching during discontinuous transcription. As such, our scoring function differs from the traditional sum-of-pairs scoring function (Carrillo and Lipman 1988) where every unordered pair $(a_i, a_j)$ of sequences contributes to the score of the alignment. Furthermore, each TRS alignment uniquely determines the core sequence as follows.

Definition 4.

Sequence $c(A)$ is the core sequence of a TRS alignment $A = [a_0, \ldots, a_n]$ provided $c(A)$ is the largest contiguous subsequence of $a_0$ such that no character of $c(A)$ is aligned to a gap in any of $a_1, \ldots, a_n$.

Note that the core sequence is a subsequence of the TRS sequences. As such, the TRS alignment can include nucleotides immediately flanking the core sequence, which have been shown to play an important role in discontinuous transcription in previous experiments (Sola et al. 2005).
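To make Definitions 3 and 4 concrete, the following is a minimal Python sketch that scores a toy TRS alignment and extracts its core sequence. The scoring values (match +1, mismatch −2, flanking gap 0), the function names, and the example rows are assumptions chosen for illustration; they are not taken from the text or from the CORSID implementation.

```python
GAP = "-"

def delta(a: str, b: str) -> int:
    """Hypothetical scoring function delta (assumed values, illustration only)."""
    if b == GAP:
        return 0                    # flanking gap in a TRS-B row
    return 1 if a == b else -2      # match / mismatch

def pair_score(a0: str, ai: str) -> int:
    """s(a0, ai): sum of delta over the columns of the two aligned rows."""
    if len(a0) != len(ai):
        raise ValueError("rows of an alignment must have equal length")
    return sum(delta(x, y) for x, y in zip(a0, ai))

def alignment_score(rows: list[str]) -> tuple[int, int]:
    """Return (s(A), s_min(A)) for rows [a0, a1, ..., an] (Definition 3)."""
    a0, bodies = rows[0], rows[1:]
    scores = [pair_score(a0, ai) for ai in bodies]
    return sum(scores), min(scores)

def core_sequence(rows: list[str]) -> str:
    """Longest run of a0 whose columns contain no gap in any TRS-B row (Definition 4)."""
    a0, bodies = rows[0], rows[1:]
    best_start, best_len, run_start = 0, 0, None
    for j in range(len(a0) + 1):            # one extra step closes a trailing run
        gap_free = j < len(a0) and all(ai[j] != GAP for ai in bodies)
        if gap_free:
            if run_start is None:
                run_start = j
        else:
            if run_start is not None and j - run_start > best_len:
                best_start, best_len = run_start, j - run_start
            run_start = None
    return a0[best_start:best_start + best_len]

# Toy alignment: a0 has no gaps; a1 and a2 have flanking gaps only.
rows = ["TACGAACTT",
        "-ACGAACT-",
        "--CGAACAT"]
total, minimum = alignment_score(rows)   # (11, 4) under the assumed scores
core = core_sequence(rows)               # "CGAACT"
```

Because each TRS-B row is compared only against $a_0$, the total is a plain sum over rows, which is what later allows each row to be optimized independently.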

The TRS Identification Problem

The first problem we consider is that of identifying TRS sites given a viral genome with known genes $x_0, \ldots, x_n$. Specifically, we are given a candidate region $w_0$ that contains the unknown TRS-L $a_0$ upstream of gene $x_0$ as well as candidate regions $w_1, \ldots, w_n$ that contain the unknown TRS-Bs $a_1, \ldots, a_n$ of genes $x_1, \ldots, x_n$. We detail in Materials and Methods how to obtain these candidate regions when only given the gene locations. To further guide the optimization problem, we impose an additional constraint on the sought TRS alignment $A$ in the form of a minimum length $\omega$ on the core sequence $c(A)$ as well as a threshold $\tau$ on the minimum score $s_{\min}(A)$ of the TRS alignment. We formalize this problem as follows.

Problem 1 (TRS Identification (TRS-ID)).

Given non-overlapping sequences $w_0, \ldots, w_n$, core-sequence length $\omega > 0$ and score threshold $\tau > 0$, find a TRS alignment $A = [a_0, \ldots, a_n]$ such that (i) $a_i$ corresponds to a subsequence in $w_i$ for all $i \in \{0, \ldots, n\}$, (ii) the core sequence $c(A)$ has length at least $\omega$, (iii) the minimum score $s_{\min}(A)$ is at least $\tau$, and (iv) the alignment has maximum score $s(A)$.
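Since $s(A)$ is a sum of pairwise scores against $a_0$, once the TRS-L row and a window of $\omega$ columns inside it are held fixed, each TRS-B can be chosen independently of the others. The sketch below illustrates that single-row subproblem in Python: it finds the best gapless placement of a substring of a candidate region against a fixed $a_0$ that covers the window. It is an illustrative simplification, not the CORSID-A recurrence; the function name `best_trs_b`, the scoring values (match +1, mismatch −2, flanking gaps 0), and the 0-based window coordinates are all assumptions.

```python
def best_trs_b(a0: str, w: str, win_start: int, omega: int,
               match: int = 1, mismatch: int = -2):
    """Best gapless placement of a substring of w against a0 that covers
    columns [win_start, win_start + omega) of a0 (0-based, half-open).

    Returns (score, row), where row is the TRS-B padded with flanking gaps
    to the length of a0, or None if w is too short to cover the window.
    """
    n, m = len(a0), len(w)
    if win_start < 0 or win_start + omega > n:
        raise ValueError("window must lie inside a0")

    def col(j: int, d: int) -> int:
        # Column j of a0 is paired with w[j + d] on diagonal d.
        return match if a0[j] == w[j + d] else mismatch

    best = None
    # Only diagonals on which the whole window falls inside w are considered.
    for d in range(-win_start, m - win_start - omega + 1):
        lo, hi = win_start, win_start + omega
        score = sum(col(j, d) for j in range(lo, hi))
        # Extend to the left, keeping the prefix with the largest gain.
        run = gain = 0
        for j in range(lo - 1, max(0, -d) - 1, -1):
            run += col(j, d)
            if run > gain:
                gain, lo = run, j
        score += gain
        # Extend to the right in the same way.
        run = gain = 0
        for j in range(hi, min(n, m - d)):
            run += col(j, d)
            if run > gain:
                gain, hi = run, j + 1
        score += gain
        row = "-" * lo + w[lo + d:hi + d] + "-" * (n - hi)
        if best is None or score > best[0]:
            best = (score, row)
    return best

# Toy example: the window "CGAA" starts at column 3 of a0.
result = best_trs_b("GGACGAACTT", "CCCACGAACAAA", win_start=3, omega=4)
# result == (6, "--ACGAAC--")
```

A caller would repeat this for every candidate region $w_1, \ldots, w_n$ and every window position, discard windows where some row scores below $\tau$, and keep the window with the largest total.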

The TRS and Gene Identification Problem

In the second problem, we are no longer given an annotated genome with gene locations. Rather, we seek to simultaneously identify genes and TRS sites given a viral genome sequence $v$ split into a leader region $v_{\text{leader}}$ and body region $v_{\text{body}}$. We describe in Materials and Methods a heuristic for identifying these two regions when only given $v$. The key idea here is that each TRS alignment will uniquely determine a set of genes it encodes. To make this relationship clear, we begin by defining an open reading frame as follows.

Definition 5.

A contiguous subsequence $x = [x^-, x^+]$ of $v$ is an open reading frame provided $x$ (i) has a length $|x|$ that is a multiple of 3, (ii) starts with a start codon, that is, $x_1 \ldots x_3 = \mathrm{ATG}$, (iii) ends at a stop codon, that is, $x_{|x|-2} \ldots x_{|x|} \in \{\mathrm{TAA}, \mathrm{TAG}, \mathrm{TGA}\}$, and (iv) does not contain an internal in-frame stop codon, that is, for all $j \in \{1, \ldots, |x|/3 - 1\}$ it holds that $x_{3(j-1)+1} \ldots x_{3(j-1)+3} \notin \{\mathrm{TAA}, \mathrm{TAG}, \mathrm{TGA}\}$.
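Definition 5 translates almost directly into code. The following is a minimal Python sketch rather than the CORSID implementation: `is_orf` checks conditions (i) to (iv) on a string, and the hypothetical helper `find_orfs` lists candidate ORFs by extending every ATG to its first in-frame stop codon, reporting 1-based inclusive coordinates in the style of $[x^-, x^+]$. It does not model leaky scanning or alternative start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_orf(x: str) -> bool:
    """Check conditions (i)-(iv) of Definition 5 on sequence x."""
    # (i) length is a multiple of 3; at least a start and a stop codon are needed.
    if len(x) % 3 != 0 or len(x) < 6:
        return False
    codons = [x[k:k + 3] for k in range(0, len(x), 3)]
    return (codons[0] == "ATG"                                  # (ii) start codon
            and codons[-1] in STOP_CODONS                       # (iii) stop codon
            and all(c not in STOP_CODONS for c in codons[:-1])) # (iv) no internal stop

def find_orfs(v: str) -> list[tuple[int, int]]:
    """Enumerate ORFs in v as (start, end), 1-based and inclusive.

    Every ATG is extended to the first in-frame stop codon; an ATG with no
    in-frame stop before the end of v yields no ORF.
    """
    orfs = []
    for start in range(len(v) - 2):
        if v[start:start + 3] != "ATG":
            continue
        for k in range(start + 3, len(v) - 2, 3):
            if v[k:k + 3] in STOP_CODONS:
                orfs.append((start + 1, k + 3))
                break
    return orfs

genome = "CCATGAAATGCCCTAGGG"
orfs = find_orfs(genome)   # [(8, 16)]: the first ATG has no in-frame stop
```

Note that this enumeration can return several nested ORFs that share a stop codon; choosing among them is part of what the TRS-Gene-ID problem resolves.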

Naively, to identify the ORF associated with TRS-B $a_i$, one could simply scan downstream of the TRS-B for the first occurrence of a start codon and continue scanning to identify the corresponding in-frame stop codon. However, this would not take leaky scanning into account, where the ribosome does not initiate translation at the first encountered “ATG.” We provide a more robust definition of a downstream ORF that takes leaky scanning into account in Materials and Methods. To summarize, we have that a TRS alignment $A = [a_0, \ldots, a_n]$ uniquely determines a set $\Gamma(A)$ of candidate genes.

Definition 6.

A set $\Gamma(A)$ of ORFs are induced genes of a TRS alignment $A = [a_0, \ldots, a_n]$ provided $\Gamma(A)$ is composed of the ORFs that occur downstream of each TRS-B $a_1, \ldots, a_n$ in $v_{\text{body}}$.

Note that there may not be an ORF downstream of each TRS-B $a_i$. As such, we have that $|\Gamma(A)| \leq n$. While in theory multiple TRS-Bs of a TRS alignment $A = [a_0, \ldots, a_n]$ may induce the same gene in $v_{\text{body}}$, in practice each coronavirus gene typically has a unique TRS-B. Moreover, these viral genes are typically non-overlapping in the genome. To that end, we have the following definition.

Definition 7.

A TRS alignment $A = [a_0, \ldots, a_n]$ is concordant provided (i) each TRS-B $a_i$ corresponds to a unique gene in $\Gamma(A)$, and (ii) there are no two ORFs in $\Gamma(A)$ whose positions in $v_{\text{body}}$ overlap.

In practice, since some coronavirus genes may overlap, we later relax this definition to allow some overlap between ORFs (details in supplementary methods, Supplementary Material online). Finally, coronavirus genomes tend to be compact with most positions coding for genes. To capture these biological constraints, we introduce the following definition.

Definition 8.

The genome coverage $g(A)$ of a TRS alignment $A$ is the number of positions in $v_{\text{body}}$ that are covered by the set $\Gamma(A)$ of induced genes.

This leads to the following problem.

Problem 2 (TRS and Gene Identification (TRS-Gene-ID)).

Given leader region $v_{\text{leader}}$, body region $v_{\text{body}}$, core-sequence length $\omega > 0$ and score threshold $\tau > 0$, find a TRS alignment $A = [a_i]$ such that (i) $a_0$ corresponds to a subsequence in $v_{\text{leader}}$, (ii) $a_i$ corresponds to a subsequence in $v_{\text{body}}$ for all $i \geq 1$, (iii) the core sequence $c(A)$ has length at least $\omega$, (iv) the minimum score $s_{\min}(A)$ is at least $\tau$, (v) $A$ is concordant, and (vi) $A$ induces a set $\Gamma(A)$ of genes with maximum genome coverage $g(A)$ and subsequently maximum score $s(A)$.

In other words, there are two objectives, genome coverage g(A) and alignment score s(A), that are ordered lexicographically. That is, if there exist multiple TRS alignments that have maximum genome coverage, we break ties using the alignment score.
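The lexicographic objective can be sketched in a few lines of Python. In this illustrative example, genes are given as 1-based inclusive `(start, end)` pairs, coverage is computed as the size of the union of those intervals (Definition 8), and candidate solutions are compared by the tuple (coverage, score). The function names and the toy candidates are assumptions made for the example, not part of CORSID.

```python
def genome_coverage(genes: list[tuple[int, int]]) -> int:
    """Number of positions covered by at least one gene (1-based, inclusive ends)."""
    covered, last_end = 0, 0
    for start, end in sorted(genes):
        start = max(start, last_end + 1)   # skip positions already counted
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered

def pick_best(candidates):
    """Choose the candidate with maximum coverage, breaking ties by score.

    candidates: list of (genes, alignment_score) pairs.
    Python compares tuples lexicographically, which matches the ordering
    of the two objectives g(A) and s(A).
    """
    return max(candidates, key=lambda c: (genome_coverage(c[0]), c[1]))

candidates = [
    ([(10, 30), (25, 40)], 12),   # coverage 31, score 12
    ([(10, 40)], 15),             # coverage 31, score 15
    ([(10, 35)], 99),             # coverage 26, score 99
]
best = pick_best(candidates)      # ([(10, 40)], 15)
```

The third candidate has by far the highest alignment score yet loses, because score is consulted only among solutions that tie on coverage.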

Overview of CORSID and CORSID-A

Our algorithm CORSID-A solves the TRS-ID problem to optimality. The key algorithmic insight is that the problem decomposes into $n$ independent pairwise alignment problems when fixing a window in the leader candidate region $w_0$ that contains the core sequence. Our second problem, the TRS-Gene-ID problem, is solved by CORSID. We use the same insight of sliding a window through the leader region $v_{\text{leader}}$ and show that the constrained problem corresponds to a maximum-weight independent set on an interval graph, which can be solved in polynomial time. We implemented both methods in Python. Moreover, we implemented a web-based visual analytics tool for exploring the space of near-optimal solutions. The source code is available at https://github.com/elkebir-group/CORSID. The results of CORSID and CORSID-A are available at https://github.com/elkebir-group/CORSID-data and we also built a web application to visualize these results for easier exploration of the solution space and manual annotation (https://elkebir-group.github.io/CORSID-viz/). We refer to Materials and Methods for further details, including a description of scope, recommendations, practical considerations and heuristics for obtaining the required input to each problem.
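A maximum-weight independent set on an interval graph can be found with the standard weighted interval scheduling dynamic program. The Python sketch below shows that textbook algorithm; it is not the CORSID source code. Intervals stand in for candidate ORFs with 1-based inclusive coordinates, and the numeric weights are invented for illustration, whereas CORSID derives its weights from the alignment DP.

```python
from bisect import bisect_left

def max_weight_independent_set(intervals):
    """Weighted interval scheduling.

    intervals: list of (start, end, weight), inclusive coordinates.
    Returns (best_weight, chosen), where chosen is a list of pairwise
    non-overlapping intervals of maximum total weight.
    """
    items = sorted(intervals, key=lambda t: t[1])   # sort by end coordinate
    ends = [t[1] for t in items]
    best = [0] * (len(items) + 1)    # best[k]: optimum over the first k items
    prev = [0] * len(items)          # prev[k]: number of items ending before item k starts
    for k, (start, _end, weight) in enumerate(items):
        prev[k] = bisect_left(ends, start)          # ends strictly less than start
        best[k + 1] = max(best[k], best[prev[k]] + weight)

    # Trace back which intervals were taken.
    chosen, k = [], len(items)
    while k > 0:
        if best[k] == best[k - 1]:   # item k-1 is not needed for the optimum
            k -= 1
        else:
            chosen.append(items[k - 1])
            k = prev[k - 1]
    return best[-1], chosen[::-1]

# Four overlapping candidates; weights are hypothetical.
candidates = [(1, 30, 30), (20, 50, 31), (40, 90, 51), (60, 80, 21)]
weight, chosen = max_weight_independent_set(candidates)
# weight == 81, chosen == [(1, 30, 30), (40, 90, 51)]
```

After sorting, the loop performs one binary search per interval, so the whole procedure runs in $O(m \log m)$ time for $m$ candidate intervals, consistent with the polynomial-time claim above.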

Results

To evaluate the performance of CORSID-A and CORSID, we downloaded the same set of 505 assembled coronavirus genomes previously analyzed by SuPER (Yang et al. 2021) from GenBank along with their annotation GFF files, indicating gene locations. To benchmark methods for the TRS-ID problem, we assessed each method’s ability to correctly identify TRS-L as well as identify a TRS-B upstream of each gene. For the TRS-Gene-ID problem, we additionally assessed each method’s ability to identify ground-truth genes. To account for missing genes in annotation GFF files, we used BLASTx to identify an extended set of ground-truth genes (Altschul et al. 1990) (supplementary fig. S1, Supplementary Material online). We refer to supplementary methods, Supplementary Material online for additional details on how we established the set of genes and locations of TRS sites in the coronavirus genomes (supplement section 2.1 and fig. S2, Supplementary Material online). We excluded 35 genomes with incomplete leader sequences, thus lacking TRS-L. We excluded two more genomes due to empty GFF files, thus lacking gene annotations. The remaining 468 genomes comprised all four genera of the Coronaviridae family and spanned a total of 22 subgenera (supplementary table S2, Supplementary Material online).

CORSID-A Finds TRS-L and TRS-Bs with High Accuracy

We begin by comparing the performance of CORSID-A with MEME and SuPER for the TRS Identification problem. Recall that MEME is a general-purpose motif detection algorithm (Bailey et al. 2009), whereas SuPER is specifically designed for identifying core sequences within coronavirus genomes annotated with genes (Yang et al. 2021). To run CORSID-A, we extracted candidate regions $w_0, \ldots, w_n$ upstream of annotated genes $x_0, \ldots, x_n$ (see supplementary methods, Supplementary Material online for a precise definition of candidate region). The minimum length $\omega$ of the core sequence is set to seven following existing literature (Alonso et al. 2002; Sola et al. 2015), and we use a minimum alignment score of $\tau = 2$. We provided MEME with the same candidate regions $w_0, \ldots, w_n$, and ran it in “zero or one occurrence per sequence” mode. For SuPER, we analyzed the previously reported results on the same 468 sequences. We refer to the supplementary results, Supplementary Material online for detailed commands and parameters.

As shown in figure 2b, CORSID-A correctly identified TRS-Ls in 466 out of 468 genomes, reaching a higher accuracy (99.6%) than MEME (442 genomes, 94.4%), but was outperformed by SuPER, which was correct in 467 genomes (99.8%). The two genomes where our method failed are outliers in their respective subgenera, indicative of possible sequencing errors (supplementary results, supplementary figs. S3 and S4, Supplementary Material online). We discuss the one genome (MN996532) where SuPER failed to identify TRS-L correctly in supplementary fig. S5, Supplementary Material online, showing that the TRS-L sequence identified by our method is supported by both secondary structure information as well as a split read in a corresponding RNA sequencing sample (SRR11085797). Split reads from RNA-sequencing data map to non-contiguous regions of the viral genome and provide direct evidence of template switching at TRS sites during viral replication in infected cells (Sashittal et al. 2021).

Fig. 2.

CORSID-A accurately identifies TRS-Ls and TRS-Bs. (a) We used SuPER (Yang et al. 2021), MEME (Bailey et al. 2009) and CORSID-A to identify TRS sites in 468 coronavirus genomes with known gene locations. (b) The fraction of genomes for which the three methods identified the TRS-L correctly. (c) The fraction of genes of the genomes for which the three methods identified the corresponding TRS-B site correctly. (d) Number of coronavirus genomes of the four genera of the Coronaviridae family with different lengths of the TRS-L identified by the three methods. The TRS alignment identified by CORSID-A for the genome indicated by “*” is shown in supplementary fig. S7, Supplementary Material online.

Of note, SuPER uses additional information to identify TRS-L and TRS-B sites compared to MEME and CORSID-A. That is, SuPER requires the user to specify the genus of origin for each input sequence, which is used to obtain a genus-specific motif of the core sequence from a look-up table. This motif is used to identify matches along the genome. In addition, SuPER takes as input the 5′ UTR secondary structure, restricting the region in which the TRS-L occurs until the fourth stem loop (SL4). Importantly, while CORSID-A does not rely on any pre-specified motif, taxonomic or secondary structure information, our method identified more TRS-Bs than either SuPER or MEME (fig. 2c). Specifically, we define the TRS-B recall as the fraction of genes for which TRS-Bs were identified. While the median TRS-B recall of all three methods is 1, CORSID-A found putative TRS-Bs of all genes in 387 genomes (82.7%), while SuPER and MEME did so in only 290 (62.0%) and 315 (67.3%) genomes, respectively.

To validate the identified TRS sites, we examined split reads in publicly available RNA-sequencing data of cells infected by coronaviruses. Here we considered two samples, SRR1942956 and SRR1942957, of SARS-CoV-1-infected cells (NC_004718) with a median depth of 2940× and 2765×, respectively. The TRS-B region for ORF7b predicted by CORSID-A is supported by 246 reads in sample SRR1942956 and 233 reads in sample SRR1942957. On the other hand, SuPER found a different TRS-B region for this gene, which it marked as not recommended, and is supported by only 1 read in each sample (supplementary fig. S6a, Supplementary Material online). We suspect our method successfully identified the TRS-B region by using matching flanking positions of the core sequence rather than restricting the search to a short 6–7 nt motif in a fixed length region as done by SuPER.

As CORSID-A does not restrict the length of regulatory sequences, our method is able to find evidence for homologous recombination and/or putative TRS-L derived insertions. Specifically, even though the length of the core sequence is fixed at 7, the length of the TRSs identified by our method can be longer due to matching sequences in regions flanking the core sequence. While the core sequences identified by SuPER and MEME (fig. 2d and supplementary fig. S9, Supplementary Material online) are at most 10 nt long, the length of TRSs identified by CORSID-A ranges from 9 to 45 (median: 22, supplementary fig. S7, Supplementary Material online). Across all the genomes for which CORSID-A identifies a TRS-L longer than 25 nt (42 genomes), the median length of the core sequences is 7 and the median number of mismatches between the core sequences within the longest TRS-B and the core sequences within TRS-L is only 1 (supplementary fig. S8, Supplementary Material online). In particular, in betacoronavirus genome NC_006577, CORSID-A identifies a TRS-B upstream of ORF4 with a length of 36 nucleotides that perfectly matches the TRS-L as well as another TRS-B preceding gene HE with a length of 27 nucleotides with only 1 mismatch, showing strong evidence of recombination and/or TRS-L derived insertion (supplementary fig. S7, Supplementary Material online). Thus, we corroborate previous findings showing numerous genomic insertions of 5′ UTR in betacoronaviruses (Patarca and Haseltine 2022) and that recombination hotspots in coronaviruses are colocated with TRS sites (Yang et al. 2021). Furthermore, we note that there is experimental evidence that, besides the core sequences, flanking nucleotides also play an important role in discontinuous transcription (Sola et al. 2005).

In summary, by considering matches in the regions flanking the core sequences using the TRS alignment, CORSID-A finds evidence for putative recombination and/or TRS-L derived insertion events and more accurately identifies regulatory sequences compared to existing motif finding methods such as SuPER and MEME.

CORSID Identifies Genes with High Accuracy

We now focus on the TRS-Gene-ID problem, where we compared CORSID to three gene finding methods: Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007), Prodigal (Hyatt et al. 2010, 2012), and VADR (Schäffer et al. 2020). Each method was given as input the complete, unannotated genome sequence of each of the 468 coronaviruses. Following recommended instructions, we ran Glimmer3 by first building the required interpolated context model (ICM) on each genome sequence separately. We ran Prodigal in meta-genomics mode. We ran VADR using their recommended parameters as well as their released database for coronaviruses. For CORSID, we used window length $\omega = 7$ and progressively reduced the score threshold $\tau$ from 7 to 2. These parameter values were determined from a fivefold cross validation study (details in supplementary section 2.3, figs. S10 and S11, Supplementary Material online). We refer the reader to supplementary results, Supplementary Material online for the precise commands used to run previous tools and details on how the predicted set of genes are compared to ground truth.

Figure 3a shows that CORSID outperformed Glimmer3 and Prodigal in terms of both precision and recall, and achieved higher recall than VADR. The median precision and recall of CORSID are 0.889 and 1.00, respectively, compared to 0.625 and 0.600 for Glimmer3 and 0.714 and 0.636 for Prodigal. Although VADR has a higher median precision (1.00) than CORSID, its median recall is lower (0.900), and its median F1 score is 0.909, less than CORSID’s F1 score of 0.923.

Fig. 3.

CORSID accurately identifies TRS-Ls, TRS-Bs, and genes. (a) Precision and recall of Glimmer3 (Delcher et al. 2007), Prodigal (Hyatt et al. 2010), VADR (Schäffer et al. 2020), and CORSID for gene prediction in 468 genomes. For clarity, we added a small jitter (drawn from $\mathcal{N}(0, 2.5 \times 10^{-5})$) to the 2D distribution plot. (b) Confusion matrices of the ground-truth genes and the genes predicted by each method. Supplementary fig. S13, Supplementary Material online shows that CORSID incurs a modest reduction in TRS-L accuracy compared to CORSID-A (0.947 vs. 0.996).

The same trends are observed when pooling all gene predictions as shown in fig. 3b. CORSID achieved the highest pooled recall (0.926) and F1 score (0.895), while the precision (0.865) is only slightly lower than VADR’s (0.876). The higher precision achieved by VADR can be explained by the fact that its reference database contains 55 coronavirus sequences, 48 of which are included in the 468 complete genomes we test on. If these 48 genomes are removed from the test set, CORSID achieves better overall performance than VADR in the remaining 420 genomes (supplementary fig. S12, Supplementary Material online).
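As a quick consistency check on the pooled figures, the F1 score is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the rounded pooled values quoted above; the helper `precision_recall` assumes, purely for illustration, that a predicted gene counts as correct only when its coordinates match a ground-truth gene exactly, which may differ from the matching criterion described in the supplementary results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    """Precision and recall under an assumed exact-coordinate match."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Pooled CORSID precision and recall as quoted in the text (rounded values).
pooled_f1 = f1_score(0.865, 0.926)
```

With these rounded inputs the result is about 0.894, which agrees with the reported 0.895 up to rounding of the inputs.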

While Prodigal, Glimmer3, and VADR do not have the capability to identify TRS sites, CORSID identifies these regulatory sites in addition to the genes. Specifically, compared to CORSID-A, which identified TRS-L correctly for 466 (99.6%) genomes, CORSID does so for 443 (94.7%) genomes (supplementary fig. S13, Supplementary Material online). This is a modest reduction in performance, especially when taking into account that CORSID, unlike CORSID-A, is not given any additional information apart from the complete, unannotated genome sequence. Analyzing the previously discussed SARS-CoV-1 genome (NC_004718), we found that CORSID identified the same 10 genes as CORSID-A, while Prodigal missed four genes and Glimmer3 missed two genes (supplementary fig. S6b, Supplementary Material online). Although VADR found all genes, including three genes missed by CORSID, SARS-CoV-1 is contained in its reference database as mentioned earlier.

In summary, CORSID accurately identifies TRS sites and genes given just the unannotated genome, outperforming existing gene finding methods.

CORSID Enables De Novo Identification of TRS Sites and Genes

To demonstrate how users can use CORSID to annotate genes and identify TRS-L and TRS-Bs given a newly assembled genome, we analyzed a previously excluded genome that lacks gene annotation (genome DQ288927). This genome is 27,534 nt long, which we provided as input to CORSID, Glimmer3, Prodigal and VADR. We note that this genome is absent from VADR’s reference database. CORSID identified nine genes spanning 91.66% of the genome, all of which match annotated genes in other Igacovirus sequences in the BLASTx database (fig. 4). By contrast, VADR found eight genes, missing gene 4b, covering 88.03% of the genome. Glimmer3 identified a total of six genes spanning 80.52% of the genome, five of which match genes in the BLASTx database. Finally, Prodigal found six genes, all of which were present in the database, spanning 84.22% of the genome. In summary, CORSID identified more genes than existing methods, all of which occurred in homologous previously annotated genomes in the BLASTx database, demonstrating that CORSID can be used to accurately annotate coronavirus genomes.

Fig. 4.

CORSID accurately finds genes in an unannotated Igacovirus genome (DQ288927). (a) The position of the genes identified by CORSID. The Venn diagram shows every gene found by CORSID, Glimmer3, Prodigal and VADR. “*” indicates at least 95% query/hit coverage by BLASTx, “**” indicates a BLASTx hit with query/hit coverage less than 95%, and “?” represents a predicted gene with no BLASTx hit. (b) TRS alignment for genes identified by CORSID. (c) The fraction of positions in $v_{\text{body}}$ covered by genes identified by the four methods.

Discussion

In this paper, we demonstrated that transcription regulatory sequences in coronavirus genomes can be leveraged to simultaneously infer these regulatory sequences and their associated genes in a synergistic manner. To that end, we formulated the TRS Identification (TRS-ID) problem of identifying TRS sites in a coronavirus genome with given gene locations, and the general problem, the TRS and Gene Identification (TRS-Gene-ID) problem of simultaneous identification of genes and TRS sites given only the coronavirus genome. Underpinning both problems is the notion of a TRS alignment, which extends the previous concept of core sequences to include flanking nucleotides that provide additional signal. Our proposed method for the first problem, CORSID-A, is based upon a dynamic programming formulation which extends the classical Smith–Waterman recurrence (Smith and Waterman 1981). CORSID, which solves the general problem, additionally incorporates a maximum-weight independent set formulation on an interval graph to identify TRS sites and genes.

Using extensive experiments on 468 coronavirus genomes, we showed that CORSID-A outperformed two motif-based approaches, MEME (Bailey et al. 2009) and SuPER (Yang et al. 2021). Additionally, we showed that CORSID outperformed two general-purpose gene finding algorithms, Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007) and Prodigal (Hyatt et al. 2010). We performed direct validation of TRS sites predicted for the SARS-CoV-1 genome (accession NC_004718), showing that the TRS sites identified by our method are more strongly supported by split reads in RNA-seq samples than the TRS sites identified by SuPER. Lastly, we demonstrated that CORSID enables de novo identification of TRSs and genes in newly assembled coronavirus genomes by applying it to a previously unannotated coronavirus (accession DQ288927) belonging to the Igacovirus subgenus.

There are several limitations and avenues for future research. First, the accuracy of identifying genes can be improved by accounting for alternative start codons (supplementary table S3, Supplementary Material online) to improve recall and by incorporating Kozak sequence information to improve precision. Second, CORSID is designed for de novo gene annotation of novel coronaviruses given only the nucleotide sequence of the genome. However, RNA-sequencing data, when aligned to the reference genome, contain split reads, that is, reads that span non-contiguous regions of the genome, which can be leveraged for identifying candidate regions that contain TRSs. We plan to extend our method by supporting the use of RNA-sequencing data to improve gene annotation. Third, CORSID currently requires the complete genome as input to identify the TRS sites and the genes. CORSID can be extended to allow gene identification in the several coronaviruses available in GenBank with only partial reference genomes, such as accession NC_014470, by leveraging knowledge from other coronaviruses with complete genomes and similar TRS sites. Fourth, while in this study we focused only on coronaviruses, discontinuous transcription occurs in all viruses in the taxonomic order Nidovirales. However, CORSID, which assumes a single TRS-L region in the genome, cannot be directly applied to other families of viruses within Nidovirales, such as the family Mesoniviridae, that contain multiple TRS-L regions in the genome (Zirkel et al. 2013; Vasilakis et al. 2014). Incorporating such features and extending CORSID to all Nidovirales viruses is a useful direction of future work. Finally, CORSID currently requires the reference genome of the virus as input. In the future, we plan to perform de novo assembly jointly with core sequence and TRS site identification, facilitating comprehensive analysis from raw sequencing data of novel coronaviruses.

Materials and Methods

We begin by discussing CORSID-A, which solves the TRS-ID problem. Next, we introduce CORSID, which solves the TRS-Gene-ID problem. Finally, we discuss a web-based visual analytics tool to explore the space of near-optimal solutions.

Solving the TRS Identification Problem

Recall that in the TRS-ID problem we seek a TRS alignment $A$ given input candidate region sequences $w_0, \ldots, w_n$ that each occur upstream of genes $x_0, \ldots, x_n$. Intuitively, we define the candidate region for a gene $x_i$ as the region $w_i = [w_i^-, w_i^+]$ composed of positions $w_i^- \le p \le w_i^+$ such that any sgRNA starting at $p$ will lead to the translation of ORF $x_i$ by the ribosome. SuPER (Yang et al. 2021), the only other method for identifying TRSs in annotated coronavirus genomes, employs a heuristic by defining the candidate region $w_i$ of a gene $x_i$ as $v_{x_i^- - 170} \ldots v_{x_i^- - 1}$, that is, the candidate region $w_i$ is a subsequence of 170 nt immediately upstream of gene $x_i$ for $1 \le i \le n$. Here, we take a more rigorous and flexible approach that takes leaky scanning into account by skipping over previous ORFs with length smaller than 100 nt (details in supplementary methods section 1.1, figs. S14 and S15, Supplementary Material online).
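For illustration, the fixed-window heuristic attributed to SuPER above can be sketched in a few lines. This is not SuPER's code, and it is not the ORF-aware procedure used by CORSID-A (which is described in the supplement). The function name `fixed_upstream_region` and the 0-based coordinate convention are assumptions made for this example.

```python
def fixed_upstream_region(genome, gene_start, width=170):
    """Return the `width` nt immediately upstream of a gene.

    genome: nucleotide string.
    gene_start: 0-based index of the first nucleotide of the gene (assumed convention).
    The region is clipped at the 5' end of the genome when fewer than
    `width` nucleotides are available upstream.
    """
    lo = max(0, gene_start - width)
    return genome[lo:gene_start]


# Toy genome (hypothetical): 200 nt of filler followed by a start codon.
toy_genome = "A" * 200 + "ATGAAATAA"
region = fixed_upstream_region(toy_genome, gene_start=200)
print(len(region))
```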

Recall that in a TRS alignment $A = [a_0, \ldots, a_n]$ only the TRS-Bs $a_1, \ldots, a_n$ are allowed to have gaps (restricted to the flanks), and that the TRS-L $a_0$ is gapless. To score a TRS alignment, we use a simple scoring function $\delta : \Sigma \times (\Sigma \cup \{-\}) \to \mathbb{R}$ such that $\delta(x, y)$ equals $+1$ for matches (i.e., $x = y$), $-2$ for mismatches (i.e., $x \ne y$ and $y \ne -$), and $0$ for gaps (i.e., $y = -$). In other words, while we reward matches and penalize mismatches, we do not penalize flanking gaps.
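The scoring function can be written down directly. The sketch below assumes that the score of a TRS alignment is the sum, over all TRS-Bs, of the column-wise scores against the TRS-L; that aggregation, the helper names, and the toy sequences are assumptions for illustration rather than details stated in this section.

```python
GAP = "-"


def delta(x, y):
    """Column score: +1 match, -2 mismatch, 0 when the TRS-B has a gap."""
    if y == GAP:
        return 0
    return 1 if x == y else -2


def score_pair(trs_l, trs_b):
    """Score one TRS-B row against the gapless TRS-L row."""
    if len(trs_l) != len(trs_b):
        raise ValueError("aligned rows must have equal length")
    return sum(delta(x, y) for x, y in zip(trs_l, trs_b))


def score_alignment(alignment):
    """Sum of pairwise scores of every TRS-B against the TRS-L (assumed aggregation)."""
    trs_l, *trs_bs = alignment
    return sum(score_pair(trs_l, b) for b in trs_bs)


# Toy alignment (hypothetical sequences): the first row is the TRS-L.
toy = ["CTAAACGAAC",
       "--AAACGATC",   # two flanking gaps, one mismatch
       "CTAAACGAAC"]   # perfect match
print(score_alignment(toy))
```

In the toy alignment the second row has seven matches, one mismatch, and two free flanking gaps, so it contributes 7 - 2 = 5, while the third row contributes 10.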

Recall that the sought TRS alignment $A$ must induce a core sequence $c(A)$ of length at least $\omega$. Due to this constraint, the input sequences $w_0, \ldots, w_n$ depend on one another and cannot be considered in isolation. We break this dependency by considering a subsequence $u$ within $w_0$ of length $\omega$, restricting the induced core sequence $c(A)$ of output TRS alignments $A$ to contain $u$. We solve this constrained version of the TRS-ID problem using dynamic programming in time $O(|w_0| L)$, where $L$ is the total length of candidate regions $w_1, \ldots, w_n$ (details are in supplementary methods, supplementary fig. S16, Supplementary Material online, fig. 5a). We obtain the solution to the original TRS-ID problem by identifying the window $u$ that induces a TRS alignment $A$ with maximum score. As there are $O(|w_0|)$ windows in $w_0$ of fixed length $\omega$, this procedure takes $O(|w_0|^2 L)$ time.
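The sliding-window idea can be sketched as follows. This is a simplified, brute-force illustration and not the dynamic program from the supplement: for each window it tries every gapless placement of each candidate region over the window and greedily trims the flanks (which is free because flanking gaps score 0), so it does not achieve the running time stated above. All function names and the toy sequences are hypothetical.

```python
def delta(x, y):
    """+1 for a match, -2 for a mismatch (gaps are handled by trimming flanks)."""
    return 1 if x == y else -2


def extend(w0, region, i, j, step):
    """Best non-negative gain from extending a gapless alignment from (i, j)
    in direction `step` (-1 = left, +1 = right), and the extension length."""
    best = run = best_len = k = 0
    while 0 <= i < len(w0) and 0 <= j < len(region):
        run += delta(w0[i], region[j])
        k += 1
        if run > best:
            best, best_len = run, k
        i += step
        j += step
    return best, best_len


def best_hit(w0, start, omega, region):
    """Best gapless alignment of `region` to w0 covering w0[start:start+omega].
    Returns (score, region_start, region_end) or None if the region is too short."""
    best = None
    for j in range(len(region) - omega + 1):
        core = sum(delta(w0[start + k], region[j + k]) for k in range(omega))
        left, left_len = extend(w0, region, start - 1, j - 1, -1)
        right, right_len = extend(w0, region, start + omega, j + omega, +1)
        cand = (core + left + right, j - left_len, j + omega + right_len)
        if best is None or cand[0] > best[0]:
            best = cand
    return best


def solve_trs_id(w0, regions, omega):
    """Try every window of length omega in w0; keep the one with the best total score."""
    best = None
    for start in range(len(w0) - omega + 1):
        hits = [best_hit(w0, start, omega, r) for r in regions]
        if any(h is None for h in hits):
            continue
        total = sum(h[0] for h in hits)
        if best is None or total > best[0]:
            best = (total, start, hits)
    return best


# Toy input (hypothetical): a leader region and two candidate regions.
w0 = "GGCTAAACGAACTT"
regions = ["TTTAAACGAACAAA", "CCCAAACGAACGGG"]
total, start, hits = solve_trs_id(w0, regions, omega=6)
print(total, start, [r[s:e] for r, (_, s, e) in zip(regions, hits)])
```

Because each candidate region is scored against the window independently of the others, the per-window work decomposes into one pairwise problem per gene, which mirrors the independence that the constrained formulation is designed to provide.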

Fig. 5.

Algorithm details. (a) Given genes $x_0, \ldots, x_n$, we obtain candidate regions $w_0, \ldots, w_n$ by identifying upstream ORFs, skipping over ORFs if they are of length less than 100 nt (indicated by “*”). CORSID-A solves the TRS-ID problem by sliding a window $u$ through $w_0$, solving $n$ independent pairwise dynamic programming problems, which together yield the optimal TRS alignment $A$ for window $u$. (b) To solve the TRS-Gene-ID problem, CORSID additionally solves a maximum-weight independent set problem (Hsiao et al. 1992) on an interval graph defined by the candidate ORFs to simultaneously identify an optimal pair $(A, \Gamma(A))$ for window $u$.

Solving the TRS and Gene Identification Problem

In the TRS-Gene-ID problem, we require two sequences: $v_{\mathrm{leader}}$, which contains TRS-L $a_0$, and $v_{\mathrm{body}}$, which contains each TRS-B $a_1, \ldots, a_n$. We propose a heuristic to partition a genome $v$ into $v_{\mathrm{leader}}$ and $v_{\mathrm{body}}$. The heuristic takes $O(m^2)$ time, where $m$ is the number of ORFs in $v$, and incorporates a classifier to identify truncated genomes missing TRS-L in the 5′ UTR (supplementary methods, supplementary fig. S17, Supplementary Material online).

We will now define the relationship between a TRS alignment $A = [a_0, \ldots, a_n]$ and the set $\Gamma(A)$ of induced genes. Upon removing (flanking) gaps, each aligned sequence $a_i$ corresponds to a contiguous subsequence $v_i$ of the viral genome $v$. Specifically, $v_0$ occurs in $v_{\mathrm{leader}}$ and $v_i$ occurs in $v_{\mathrm{body}}$ (where $i \ge 1$). By Definition 4, each subsequence $v_i$ has positions that are aligned with the core sequence $c(A)$. These aligned positions induce the subsequence $c_i = [c_i^-, c_i^+]$ of length equal to $|c(A)|$. Note that while $c_0 = c(A)$, it may be that $c_i \ne c(A)$ where $i \ge 1$ due to mismatches. Importantly, there are coronaviruses where the last three nucleotides of the core sequence within a TRS-B coincide with the start codon of the associated gene (supplementary fig. S18, Supplementary Material online). As such, we have the following definition.

Definition 9.

Let $A = [a_0, \ldots, a_n]$ be a TRS alignment and let $c_i = [c_i^-, c_i^+]$ be the subsequence of $a_i$ that is aligned to the core sequence $c(A)$. The ORF associated with TRS-B $a_i$ is the unique ORF $x$ where position $c_i^+$ occurs within the candidate region of $x$.

As discussed, there may not exist an ORF associated with a TRS-B $a_i$, which may happen when the TRS-B is located near the 3′ end of the genome. Given a TRS alignment $A = [a_0, \ldots, a_n]$, the set $\Gamma(A)$ of induced genes equals the set of ORFs that are associated with $a_1, \ldots, a_n$.
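Definition 9 amounts to a position lookup. The sketch below assumes that candidate regions are available as inclusive `(lo, hi)` coordinate pairs keyed by ORF name and that they do not overlap, so that at most one ORF can contain a given position; the data layout, function names, and coordinates are hypothetical.

```python
def associated_orf(core_end, candidate_regions):
    """Return the ORF whose candidate region contains position `core_end`
    (the last core-aligned position of a TRS-B), or None if there is none.

    candidate_regions: dict mapping ORF name -> (lo, hi), both inclusive.
    """
    hits = [name for name, (lo, hi) in candidate_regions.items()
            if lo <= core_end <= hi]
    if len(hits) > 1:
        # The definition speaks of a unique ORF; overlapping regions violate
        # the assumption made in this sketch.
        raise ValueError("candidate regions are assumed to be disjoint")
    return hits[0] if hits else None


def induced_genes(core_ends, candidate_regions):
    """Set of ORFs associated with the given TRS-B core end positions."""
    genes = set()
    for end in core_ends:
        orf = associated_orf(end, candidate_regions)
        if orf is not None:
            genes.add(orf)
    return genes


# Toy data (hypothetical coordinates): the third TRS-B has no associated ORF.
toy_regions = {"S": (100, 150), "E": (300, 340)}
print(induced_genes([120, 310, 500], toy_regions))
```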

To solve the TRS-Gene-ID problem, we take a sliding window approach similar to the one we used to solve the TRS-ID problem. That is, we consider all subsequences $u$ within $v_{\mathrm{leader}}$ of length $\omega$ and solve a constrained version of the TRS-Gene-ID problem, additionally requiring that the sought TRS alignment $A$ has a core sequence $c(A)$ that fully contains $u$, using the following two steps. First, we construct a DP table similar to the table used for the TRS-ID problem in $O(|v_{\mathrm{leader}}| \, |v_{\mathrm{body}}|)$ time, and for each ORF, we select the alignment with the highest score in the corresponding candidate region. Second, given these ORFs and corresponding alignments, we build a vertex-weighted interval graph combining ORF lengths and alignment scores as weights. To identify the optimal TRS alignment $A$ and associated genes $\Gamma(A)$, we solve a maximum-weight independent set (MWIS) problem on this graph in $O(m)$ time, where $m$ is the number of candidate ORFs in $v_{\mathrm{body}}$ (supplementary methods, Supplementary Material online and fig. 5b). Each instance of the constrained TRS-Gene-ID problem takes $O(|v_{\mathrm{leader}}| \, |v_{\mathrm{body}}| + m)$ time. Since the number of windows of length $\omega$ in $v_{\mathrm{leader}}$ is $O(|v_{\mathrm{leader}}|)$, the total running time of CORSID to solve the TRS-Gene-ID problem is $O(|v_{\mathrm{leader}}|^2 \, |v_{\mathrm{body}}| + |v_{\mathrm{leader}}| \, m)$. In practice, the number $m$ of candidate ORFs in $v_{\mathrm{body}}$ ranges from 2192, the length $|v_{\mathrm{leader}}|$ of the leader region ranges from 171 to 716, and the length $|v_{\mathrm{body}}|$ of the body region ranges from 6280 to 11,462 across all the coronaviruses studied in this paper. Finally, to obtain biologically meaningful solutions, we employ a progressive approach and consider overlapping genes (see supplementary methods, Supplementary Material online for details and supplementary fig. S19, Supplementary Material online).
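The MWIS step on an interval graph can be illustrated with the textbook weighted interval scheduling recurrence. The sketch below is that standard technique, not the algorithm of Hsiao et al. (1992) and not CORSID's implementation; it sorts the intervals itself and therefore runs in $O(m \log m)$ rather than $O(m)$. The half-open interval convention and the toy weights are assumptions, and the way CORSID combines ORF length and alignment score into a weight is not reproduced here.

```python
from bisect import bisect_right


def max_weight_independent_set(intervals):
    """Select pairwise non-overlapping intervals of maximum total weight.

    intervals: list of (start, end, weight) with half-open [start, end).
    Returns (best_weight, chosen_intervals_in_genome_order).
    """
    items = sorted(intervals, key=lambda t: t[1])  # sort by end position
    ends = [end for _, end, _ in items]
    n = len(items)
    best = [0] * (n + 1)   # best[k]: optimum using the first k sorted intervals
    take = [False] * n     # whether interval k is used in best[k + 1]
    prev = [0] * n         # number of earlier intervals compatible with interval k
    for k, (start, _, weight) in enumerate(items):
        # Count earlier intervals that end at or before this one starts.
        p = bisect_right(ends, start, 0, k)
        prev[k] = p
        if best[p] + weight > best[k]:
            best[k + 1] = best[p] + weight
            take[k] = True
        else:
            best[k + 1] = best[k]
    # Trace back the chosen intervals.
    chosen = []
    k = n
    while k > 0:
        if take[k - 1]:
            chosen.append(items[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return best[n], chosen[::-1]


# Toy candidate ORFs (hypothetical coordinates and weights).
toy_orfs = [(0, 50, 5), (40, 100, 8), (100, 160, 4), (30, 160, 10)]
print(max_weight_independent_set(toy_orfs))
```

In the toy instance the single long ORF with weight 10 is beaten by the two compatible ORFs with weights 8 and 4, which shows why a global optimization over ORFs is preferable to greedily picking the highest-weight candidate.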

Web Application to Explore Solution Space

In order to present a comprehensive overview of identified TRS sites and genes across solutions, we created a web application that visualizes all solutions and allows for manual annotation. After obtaining solutions from CORSID and CORSID-A, users can launch the application with the output JSON file, then inspect all possible solutions. Specifically, we show a summary table of all solutions, followed by the optimal solution for which we show a sequence logo of the identified TRS-L and TRS-Bs, a genome coverage map, and a detailed table of each identified gene. Users can click the summary table and show other alternative solutions below the fixed optimal solution for comparison. A demo of the visualization can be found at https://elkebir-group.github.io/CORSID-viz. We also made an integrated dockerized workflow including CORSID, CORSID-A, BLASTx, and the visualization web application. After obtaining the docker images, users can easily analyze a new genome by running the workflow without any manual configuration. Additional details about the scope and recommendations for using CORSID and CORSID-A, including the combination of running CORSID-A after VADR, are provided in Section 2.7 of the supplement (supplementary fig. S20, Supplementary Material online).

Supplementary Material

msac133_Supplementary_Data

Acknowledgments

This material is based upon work supported by the National Science Foundation under award numbers CCF-1850502, CCF-2027669, and CCF-2046488. This work used resources, services, and support provided via the Greg Gulick Honorary Research Award Opportunity supported by a gift from Amazon Web Services.

Contributor Information

Chuanyi Zhang, Department of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Palash Sashittal, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Michael Xiang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Yichi Zhang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Ayesha Kazi, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Mohammed El-Kebir, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

References

  1. Alonso S, Izeta A, Sola I, Enjuanes L. 2002. Transcription regulatory sequences and mRNA expression levels in the coronavirus transmissible gastroenteritis virus. J Virol. 76(3):1293–1308.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215(3):403–410.
  3. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37(Suppl 2):W202–W208.
  4. Carrillo H, Lipman D. 1988. The multiple sequence alignment problem in biology. SIAM J Appl Math. 48(5):1073–1082.
  5. Delcher AL, Bratke KA, Powers EC, Salzberg SL. 2007. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23(6):673–679.
  6. Down TA, Hubbard TJP. 2005. NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence. Nucleic Acids Res. 33(5):1445–1453.
  7. Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O, et al. 2021. The coding capacity of SARS-CoV-2. Nature 589(7840):125–130.
  8. Hsiao JY, Tang CY, Chang RS. 1992. An efficient algorithm for finding a maximum weight 2-independent set on interval graphs. Inf Process Lett. 43(5):229–235.
  9. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11(1):1–11.
  10. Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. 2012. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28(17):2223–2230.
  11. Jungreis I, Sealfon R, Kellis M. 2021. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat Commun. 12(1):1–20.
  12. Kim D, Lee J-Y, Yang J-S, Kim JW, Kim VN, Chang H. 2020. The architecture of SARS-CoV-2 transcriptome. Cell 181(4):914–921.
  13. Patarca R, Haseltine WA. 2022. Intragenomic rearrangements of SARS-CoV-2 and other β-coronaviruses. bioRxiv.
  14. Pavesi G, Mereghetti P, Mauri G, Pesole G. 2004. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32(Suppl 2):W199–W203.
  15. Salzberg SL, Delcher AL, Kasif S, White O. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26(2):544–548.
  16. Sashittal P, Zhang C, Peng J, El-Kebir M. 2021. Jumper enables discontinuous transcript assembly in coronaviruses. Nat Commun. 12:67280.
  17. Schäffer AA, Hatcher EL, Yankie L, Shonkwiler L, Brister JR, Karsch-Mizrachi I, Nawrocki EP. 2020. VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinform. 21:1–23.
  18. Smith TF, Waterman MS. 1981. Identification of common molecular subsequences. J Mol Biol. 147(1):195–197.
  19. Sola I, Almazan F, Zúñiga S, Enjuanes L. 2015. Continuous and discontinuous RNA synthesis in coronaviruses. Annu Rev Virol. 2:265–288.
  20. Sola I, Moreno JL, Zúñiga S, Alonso S, Enjuanes L. 2005. Role of nucleotides immediately flanking the transcription-regulating sequence core in coronavirus subgenomic mRNA synthesis. J Virol. 79(4):2506–2516. doi: 10.1128/JVI.79.4.2506-2516.2005
  21. Vasilakis N, Guzman H, Firth C, Forrester NL, Widen SG, Wood TG, Rossi SL, Ghedin E, Popov V, Blasdell KR, et al. 2014. Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range. Virol J. 11(1):1–12.
  22. Yang Y, Yan W, Hall AB, Jiang X. 2021. Characterizing transcriptional regulatory sequences in coronaviruses and their role in recombination. Mol Biol Evol. 38(4):1241–1248.
  23. Yao Z, Weinberg Z, Ruzzo WL. 2006. CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics 22(4):445–452.
  24. Zirkel F, Roth H, Kurth A, Drosten C, Ziebuhr J, Junglen S. 2013. Identification and characterization of genetically divergent members of the newly established family Mesoniviridae. J Virol. 87(11):6346–6358.

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
