Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses

Chuanyi Zhang; Palash Sashittal; Michael Xiang; Yichi Zhang; Ayesha Kazi; Mohammed El-Kebir

doi:10.1093/molbev/msac133

. 2022 Jun 14;39(7):msac133. doi: 10.1093/molbev/msac133

Editor: Thomas Leitner

PMCID: PMC9214144 PMID: 35700225

Abstract

Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the $5^{'}$ end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS Identification problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS and Gene Identification problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem, which includes a web-based visualization tool to explore the space of near-optimal solutions. We show that CORSID outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.

Keywords: core sequences, gene identification, coronavirus, motif finding, local alignment, maximum weight independent set, interval graph

Introduction

Coronaviruses are enveloped viruses comprised of a positive-sense, single-stranded RNA genome that is ready to be translated by the host ribosome. While the majority of messenger RNA (mRNA) in eukaryotes is monocistronic, that is, each mRNA is translated into a single gene product, the coronavirus RNA genome is comprised of several structural, non-structural and accessory genes (fig. 1a). These genes are necessary for the viral life cycle and are expressed and translated using three distinct mechanisms (Sola et al. 2015).

Fig. 1. — Overview. (a) A coronavirus genome v consists of a leader region $v_{leader}$ and a body region $v_{body}$ . (b) Structural and accessory genes are expressed via discontinuous transcription with template switching occurring at transcription regulatory sequences (TRS, indicated in red), resulting in subgenomic messenger RNAs (sgmRNAs) for each gene. (c) In the TRS Identification (TRS-ID) problem, we wish to identify TRSs given a genome v with genes $x_{0}, \dots, x_{n}$ . The TRS and Gene Identification (TRS-Gene-ID) asks to simultaneously identify genes and their associated TRSs given genome v. Throughout this manuscript, we use “T” (thymine) rather than “U” (uracil).

First, upon cell entry, the viral genome is translated to produce polypeptides corresponding to one or two overlapping open reading frames (ORFs). The resulting polypeptides undergo auto-cleavage, producing many non-structural proteins, including the RNA-dependent-RNA-polymerase (RdRP). Second, the viral RdRP mediates the expression of the remaining viral genes via discontinuous transcription (Sola et al. 2015). That is, the RdRP is prone to perform template switching, predominantly upon encountering transcription regulatory sequences (TRSs), located in the $5^{'}$ untranslated region (UTR) of the genome—called TRS-L where L stands for leader—and upstream of viral genes—called TRS-B where B stands for body (fig. 1b). Note that while previous studies have found evidence of TRS-independent template switching leading to non-canonical transcripts, the function of these transcripts is still unknown (Kim et al. 2020; Finkel et al. 2021; Sashittal et al. 2021). Third, occasionally certain genes are expressed via leaky scanning, where a weak initiation codon leads to the translation of the next downstream ORF (Jungreis et al. 2021). Not only is the identification and characterization of TRS sites crucial to understanding the regulation and expression of the viral proteins, but here we hypothesize that the existence of these regulatory sequences can be leveraged to simultaneously identify TRS sites and associated viral genes in unannotated coronavirus genomes with high accuracy.

While there exist methods for identifying either TRS sites or viral genes, no method exists that does so simultaneously (supplementary table S1, Supplementary Material online). More specifically, since each TRS contains a 6–7 nt long conserved sequence, called a core sequence (Sola et al. 2015; Finkel et al. 2021), general-purpose motif finding methods (Pavesi et al. 2004; Down and Hubbard 2005; Yao et al. 2006; Bailey et al. 2009) can be employed to identify TRS-L and TRS-Bs in coronaviruses. For instance, MEME (Bailey et al. 2009) is a widely used method that employs expectation maximization to identify multiple occurrences of multiple motifs simultaneously. The only method that is specifically developed for identifying TRS sites in coronaviruses is SuPER (Yang et al. 2021), which takes as input a coronavirus genome sequence with specified gene locations as well as additional taxonomic and secondary structure information. Importantly, neither SuPER nor general-purpose motif finding algorithms are able to identify viral genes in unannotated coronavirus genome sequences.

On the other hand, gene prediction is a well-studied problem with many methods including Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007), Prodigal (Hyatt et al. 2010, 2012) and VADR (Schäffer et al. 2020). Glimmer3 uses a Markov model to assign scores to ORFs, and then processes overlapping genes to generate the final list of predicted genes. By contrast, Prodigal employs a more heuristic approach with fine-tuned parameters that are optimized to identify genes in prokaryotes. While Glimmer3 and Prodigal are designed for prokaryotic genomes, VADR is specifically designed for identifying genes in viral genomes. To that end, VADR first classifies the input sequence and finds the most similar sequence in a pre-specified database, maps the curated annotations to the input based on a covariance model, and then uses BLAST (Altschul et al. 1990) to validate the annotated genes. Importantly, current gene finding tools do not leverage the genomic structure of coronaviruses, specifically the TRS sites located upstream of the genes in the genome, nor are they able to directly identify these regulatory sequences.

In this study, we introduce the TRS Identification (TRS-ID) and the TRS and Gene Identification (TRS-Gene-ID) problems, to identify TRS sites in a coronavirus genome with specified gene annotations, and to simultaneously identify TRS sites and genes in an unannotated coronavirus genome, respectively (fig. 1c). Underpinning our approach is the concept of a TRS alignment, which is a multiple sequence alignment of TRS sites with additional constraints that result from template switching by RdRP. We introduce CORSID-A, a dynamic programming (DP) algorithm to solve the TRS-ID problem, adapting the recurrence that underlies the Smith–Waterman algorithm (Smith and Waterman 1981) for local sequence alignment. Additionally, we introduce CORSID to solve the TRS-Gene-ID problem via a maximum-weight independent set problem (Hsiao et al. 1992) on an interval graph defined by the candidate ORFs in the genome with weights obtained from the previous DP.

CORSID enables de novo identification of viral genes in coronaviruses using only the nucleotide sequence of the viral genome. CORSID-A, on the other hand, is designed to identify all the TRSs in a coronavirus genome annotated with gene locations. We evaluate the performance of our methods on 468 coronavirus genomes downloaded from GenBank, demonstrating that CORSID-A outperforms MEME and SuPER in identifying TRS sites and, unlike these methods, possesses the ability to identify recombination events. Moreover, we find that CORSID outperforms state-of-the-art gene finding methods. Finally, we illustrate how CORSID enables de novo identification of TRS sites and genes in previously unannotated coronaviruses. In summary, CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of prior taxonomic or secondary structure information.

New Approaches

Viewing TRS and gene identification as a multiple sequence alignment is a novel approach that we will outline in this section. We begin by introducing notation and key definitions, followed by stating the TRS Identification problem and then the TRS and Gene Identification problem. Finally, we overview the key methodological contributions of our algorithms.

Preliminaries

A genome $v = v_{1} \dots v_{| v |}$ is a sequence from the alphabet $Σ = \{A, T, C, G\}$. The first position of the genome is known as the $5^{'}$ end whereas the last position of the genome is known as the $3^{'}$ end. We denote the contiguous subsequence $v_{p} \dots v_{q}$ of v by $v [p, q]$. We also call a contiguous subsequence x of v a region, denoted as $x = [x^{-}, x^{+}]$ such that $x = v [x^{-}, x^{+}]$. Thus, coordinates $x^{-}$ and $x^{+}$ of a subsequence x are in terms of the reference genome v, that is, $x = v_{x^{-}} \dots v_{x^{+}}$. Alternatively, we may refer to individual characters in a subsequence x using relative indices, that is, $x = x_{1} \dots x_{| x |}$. Our goals are twofold: given a coronavirus genome v, we aim to identify (i) TRS-L and TRS-Bs, and optionally, (ii) the associated genes (fig. 1c). To begin, recall the following definition of an alignment.

Definition 1.

Matrix $A = [a_{i j}]$ with $n + 1$ rows is an alignment of sequences $b_{0}, \dots, b_{n} \in Σ^{*}$ provided (i) entries $a_{i j}$ either correspond to a letter in the alphabet $Σ$ or a gap denoted by “–” such that (ii) no column of A is composed of only gaps, and (iii) the removal of gaps of row i of A yields sequence $b_{i}$ .

Here, we seek an alignment with two additional constraints, called a TRS alignment defined as follows.

Definition 2.

An alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ is a TRS alignment provided (i) $a_{0}$ does not contain any gaps, and (ii) $a_{1}, \dots, a_{n}$ do not contain any internal gaps.

Intuitively, the first sequence $a_{0}$ in the alignment A represents TRS-L, whereas $a_{1}, \dots, a_{n}$ represent TRS-Bs, each upstream of an accessory or structural gene. We do not allow gaps in the TRS-L sequence $a_{0}$ as template switching by RdRP occurs due to complementary base pairing between TRS-L and the nascent strand of TRS-B (Sola et al. 2005). For the same reason, we do not allow internal gaps in TRS-Bs $a_{i}$. However, as each TRS-B may match a different region of the TRS-L, we do allow flanking gaps in these sequences (fig. 1c). We score a TRS alignment A using a scoring function $δ : Σ \times (Σ \cup \{-\}) \to \mathbb{R}$ in the following way.

Definition 3.

The score $s (A)$ of a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ is given by $\sum_{i = 1}^{n} s (a_{0}, a_{i}) = \sum_{i = 1}^{n} \sum_{j = 1}^{| a_{0} |} δ (a_{0, j}, a_{i, j})$, whereas the minimum score $s_{min} (A)$ is defined as $\min_{i \in \{1, \dots, n\}} s (a_{0}, a_{i})$.
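As a minimal sketch of this definition, the following Python snippet scores a toy TRS alignment row by row against $a_{0}$. The scoring values (+1 for a match, −2 for a mismatch, 0 for a flanking gap), the toy sequences and all function names are illustrative assumptions; they are not the parameters or the code used by CORSID.

```python
GAP = "-"

def delta(x: str, y: str) -> int:
    """Hypothetical scoring function: +1 match, -2 mismatch, 0 for a gap."""
    if y == GAP:
        return 0
    return 1 if x == y else -2

def pair_score(a0: str, ai: str) -> int:
    """s(a_0, a_i): column-wise sum of delta over the columns of a_0."""
    assert len(a0) == len(ai) and GAP not in a0
    return sum(delta(x, y) for x, y in zip(a0, ai))

def alignment_scores(rows):
    """Return (s(A), s_min(A)) for rows [a_0, a_1, ..., a_n]."""
    a0, bodies = rows[0], rows[1:]
    scores = [pair_score(a0, ai) for ai in bodies]
    return sum(scores), min(scores)

# Toy alignment (made-up sequences): a_0 has no gaps, a_1..a_3 only flanking gaps.
toy = [
    "TCTAAACGAAC",  # a_0 (TRS-L)
    "--TAAACGAAC",  # a_1: 9 matches
    "TCTAAACGA--",  # a_2: 9 matches
    "---AAACGTAC",  # a_3: 7 matches, 1 mismatch
]
total, worst = alignment_scores(toy)
print(total, worst)  # 23 5
```

Note that only comparisons against $a_{0}$ contribute, so rows $a_{1}, \dots, a_{n}$ are never compared with each other.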

In other words, we score each TRS-B $a_{i}$ (where $i \geq 1$ ) by comparing it to the TRS-L sequence $a_{0}$ in a way that is consistent with the mechanism of template switching during discontinuous transcription. As such, our scoring function differs from the traditional sum-of-pairs scoring function (Carrillo and Lipman 1988) where every unordered pair $(a_{i}, a_{j})$ of sequences contributes to the score of the alignment. Furthermore, each TRS alignment uniquely determines the core sequence as follows.

Definition 4.

Sequence $c (A)$ is the core sequence of a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ provided $c (A)$ is the largest contiguous subsequence of $a_{0}$ such that no character of c is aligned to a gap in any of $a_{1}, \dots, a_{n}$ .

Note that the core sequence is a subsequence of the TRS sequences. As such, the TRS alignment can include nucleotides immediately flanking the core sequence, which have been shown to play an important role in discontinuous transcription in previous experiments (Sola et al. 2005).
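The definition of the core sequence can be illustrated with a short Python sketch that scans the columns of a toy TRS alignment and returns the longest run of columns in which no TRS-B row has a gap. The function name and the toy sequences are assumptions made for illustration only.

```python
def core_sequence(rows):
    """Longest contiguous part of a_0 whose columns are gap-free in every a_i."""
    a0, bodies = rows[0], rows[1:]
    best_start, best_len, run_start = 0, 0, None
    # Iterate one column past the end so that a run reaching the last column is closed.
    for j in range(len(a0) + 1):
        gap_free = j < len(a0) and all(ai[j] != "-" for ai in bodies)
        if gap_free:
            if run_start is None:
                run_start = j
        else:
            if run_start is not None and j - run_start > best_len:
                best_start, best_len = run_start, j - run_start
            run_start = None
    return a0[best_start:best_start + best_len]

# Toy alignment (made-up sequences): flanking gaps only in the TRS-B rows.
toy = [
    "TCTAAACGAAC",  # a_0 (TRS-L)
    "--TAAACGAAC",  # a_1
    "TCTAAACGA--",  # a_2
    "-CTAAACGTAC",  # a_3
]
print(core_sequence(toy))  # TAAACGA
```

In this toy example the TRS-L row spans 11 columns, while the core sequence covers only the 7 columns shared by all rows; the remaining columns are the flanking positions discussed above.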

The TRS Identification Problem

The first problem we consider is that of identifying TRS sites given a viral genome with known genes $x_{0}, \dots, x_{n}$ . Specifically, we are given a candidate region $w_{0}$ that contains the unknown TRS-L $a_{0}$ upstream of gene $x_{0}$ as well as candidate regions $w_{1}, \dots, w_{n}$ that contain the unknown TRS-Bs $a_{1}, \dots, a_{n}$ of genes $x_{1}, \dots, x_{n}$ . We detail in Materials and Methods how to obtain these candidate regions when only given the gene locations. To further guide the optimization problem, we impose an additional constraint on the sought TRS alignment A in the form of a minimum length $ω$ on the core sequence $c (A)$ as well as a threshold $τ$ on the minimum score $s_{min} (A)$ of the TRS alignment. We formalize this problem as follows.

Problem 1 (TRS Identification (TRS-ID)) —

Given non-overlapping sequences $w_{0}, \dots, w_{n}$, core-sequence length $ω > 0$ and score threshold $τ > 0$, find a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ such that (i) $a_{i}$ corresponds to a subsequence in $w_{i}$ for all $i \in \{0, \dots, n\}$, (ii) the core sequence $c (A)$ has length at least $ω$, (iii) the minimum score $s_{min} (A)$ is at least $τ$, and (iv) the alignment has maximum score $s (A)$.
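To make the problem statement concrete, the sketch below solves a toy TRS-ID instance by brute force: it fixes a window of length $ω$ in $w_{0}$, finds for each $w_{i}$ the best gap-free site whose core aligns to that window (extending left and right only while the score improves), and keeps the window with the largest total score that satisfies the threshold $τ$. This is not the CORSID-A dynamic program; the scoring values (+1 match, −2 mismatch, 0 for flanking gaps), the sequences and the function names are all assumptions.

```python
def delta(x, y):
    """Hypothetical scores: +1 match, -2 mismatch (flanking gaps score 0)."""
    return 1 if x == y else -2

def best_extension(pairs):
    """Best score over all prefixes of a gap-free extension (empty prefix = 0)."""
    best = running = 0
    for x, y in pairs:
        running += delta(x, y)
        best = max(best, running)
    return best

def best_body_score(w0, p, wi, omega):
    """Best score of a gap-free site in wi whose core aligns to w0[p:p+omega]."""
    best = None
    for q in range(len(wi) - omega + 1):
        core = sum(delta(x, y) for x, y in zip(w0[p:p + omega], wi[q:q + omega]))
        left = best_extension(zip(reversed(w0[:p]), reversed(wi[:q])))
        right = best_extension(zip(w0[p + omega:], wi[q + omega:]))
        if best is None or core + left + right > best:
            best = core + left + right
    return best

def solve_trs_id(regions, omega, tau):
    """Return (s(A), window start in w_0, per-row scores), or None if infeasible."""
    w0, bodies = regions[0], regions[1:]
    best = None
    for p in range(len(w0) - omega + 1):
        scores = [best_body_score(w0, p, wi, omega) for wi in bodies]
        if any(s is None or s < tau for s in scores):
            continue  # violates the minimum-score constraint
        if best is None or sum(scores) > best[0]:
            best = (sum(scores), p, scores)
    return best

# Toy candidate regions (made-up sequences): w_0, w_1, w_2.
regions = ["GGTCTAAACGAACTT", "CCCTAAACGAACAA", "TTTTAAACGAAGGG"]
total, start, scores = solve_trs_id(regions, omega=7, tau=2)
print(total, scores)  # 18 [10, 8]
```

Because each row is scored only against $a_{0}$, the rows can be optimized independently once the window is fixed, which is what keeps this enumeration polynomial in the input size.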

The TRS and Gene Identification Problem

In the second problem, we are no longer given an annotated genome with gene locations. Rather, we seek to simultaneously identify genes and TRS sites given a viral genome sequence v split into a leader region $v_{leader}$ and body region $v_{body}$ . We describe in Materials and Methods a heuristic for identifying these two regions when only given v. The key idea here is that each TRS alignment will uniquely determine a set of genes it encodes. To make this relationship clear, we begin by defining an open reading frame as follows.

Definition 5.

A contiguous subsequence $x = [x^{-}, x^{+}]$ of v is an open reading frame provided x (i) has a length $| x |$ that is a multiple of 3, (ii) starts with a start codon, that is, $x_{1} \dots x_{3} = ATG$, (iii) ends at a stop codon, that is, $x_{| x | - 2} \dots x_{| x |} \in \{TAA, TAG, TGA\}$, and (iv) does not contain an internal in-frame stop codon, that is, for all $j \in \{1, \dots, | x | / 3 - 1\}$ it holds that $x_{3 (j - 1) + 1} \dots x_{3 (j - 1) + 3} \notin \{TAA, TAG, TGA\}$.
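The four conditions translate directly into code. The following Python sketch checks whether a string is an ORF and enumerates all ORFs of a toy sequence using 1-based inclusive coordinates $[x^{-}, x^{+}]$; the function names and the example sequence are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_orf(x: str) -> bool:
    """Check conditions (i)-(iv) of the ORF definition on sequence x."""
    if len(x) < 6 or len(x) % 3 != 0:
        return False
    codons = [x[k:k + 3] for k in range(0, len(x), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(c in STOP_CODONS for c in codons[:-1])
    )

def find_orfs(v: str):
    """Enumerate all ORFs of v as 1-based inclusive (x_minus, x_plus) pairs."""
    orfs = []
    for start in range(len(v) - 2):
        if v[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame of this ATG until the first stop.
        for k in range(start + 3, len(v) - 2, 3):
            if v[k:k + 3] in STOP_CODONS:
                orfs.append((start + 1, k + 3))
                break
    return orfs

v = "ATGAAAATGCCCTAA"  # toy sequence with two in-frame start codons
print(find_orfs(v))  # [(1, 15), (7, 15)]
```

The toy sequence yields two ORFs that share a stop codon but start at different ATGs, which is the situation that arises under leaky scanning.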

Naively, to identify the ORF associated with TRS-B $a_{i}$ , one could simply scan downstream of the TRS-B for the first occurrence of a start codon and continue scanning to identify the corresponding in-frame stop codon. However, this would not take leaky scanning into account, where the ribosome does not initiate translation at the first encountered “ATG.” We provide a more robust definition of a downstream ORF that takes leaky scanning into account in Materials and Methods. To summarize, we have that a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ uniquely determines a set $Γ (A)$ of candidate genes.

Definition 6.

A set $Γ (A)$ of ORFs are induced genes of a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ provided $Γ (A)$ is composed of the ORFs that occur downstream of each TRS-B $a_{1}, \dots, a_{n}$ in $v_{body}$ .

Note that there may not be an ORF downstream of each TRS-B $a_{i}$ . As such, we have that $| Γ (A) | \leq n$ . While in theory multiple TRS-Bs of a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ may induce the same gene in $v_{body}$ , in practice each coronavirus gene typically has a unique TRS-B. Moreover, these viral genes are typically non-overlapping in the genome. To that end, we have the following definition.

Definition 7.

A TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ is concordant provided (i) each TRS-B $a_{i}$ corresponds to a unique gene in $Γ (A)$ , and (ii) there are no two ORFs in $Γ (A)$ whose positions in $v_{body}$ overlap.

In practice, since some coronavirus genes may overlap, we later relax this definition to allow some overlap between ORFs (details in supplementary methods, Supplementary Material online). Finally, coronavirus genomes tend to be compact with most positions coding for genes. To capture these biological constraints, we introduce the following definition.

Definition 8.

The genome coverage $g (A)$ of a TRS alignment A is the number of positions in $v_{body}$ that are covered by the set $Γ (A)$ of induced genes.
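Genome coverage counts each position at most once, even when induced genes overlap. A minimal Python sketch, assuming genes are given as 1-based inclusive intervals (the function name and the coordinates are made up for illustration):

```python
def genome_coverage(genes):
    """Number of positions covered by the union of closed intervals (start, end)."""
    covered = 0
    current_end = 0  # right-most position counted so far
    for start, end in sorted(genes):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered

genes = [(10, 30), (25, 40), (50, 60)]  # the first two intervals overlap
print(genome_coverage(genes))  # 42
```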

This leads to the following problem.

Problem 2 (TRS and Gene Identification (TRS-Gene-ID)) —

Given leader region $v_{leader}$ , body region $v_{body}$ , core-sequence length $ω > 0$ and score threshold $τ > 0$ , find a TRS alignment $A = [a_{i}]$ such that (i) $a_{0}$ corresponds to a subsequence in $v_{leader}$ , (ii) $a_{i}$ corresponds to a subsequence in $v_{body}$ for all $i \geq 1$ , (iii) the core sequence $c (A)$ has length at least $ω$ , (iv) the minimum score $s_{min} (A)$ is at least $τ$ , (v) A is concordant, and (vi) A induces a set $Γ (A)$ of genes with maximum genome coverage $g (A)$ and subsequently maximum score $s (A)$ .

In other words, there are two objectives, genome coverage $g (A)$ and alignment score $s (A)$ , that are ordered lexicographically. That is, if there exist multiple TRS alignments that have maximum genome coverage, we break ties using the alignment score.
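The lexicographic ordering of the two objectives corresponds to comparing (coverage, score) pairs. In the Python sketch below, the candidate names and numbers are invented purely to illustrate the tie-breaking rule.

```python
# Each candidate is (label, genome coverage g(A), alignment score s(A)); values are made up.
candidates = [
    ("A1", 24100, 55),
    ("A2", 25230, 48),
    ("A3", 25230, 61),
]

# Python compares tuples element by element: coverage first, score only on ties.
best = max(candidates, key=lambda c: (c[1], c[2]))
print(best[0])  # A3
```

Candidate A1 loses on coverage despite its higher score than A2, and A3 beats A2 only because their coverage is tied.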

Overview of CORSID and CORSID-A

Our algorithm CORSID-A solves the TRS-ID problem to optimality. The key algorithmic insight is that the problem decomposes into n independent pairwise alignment problems when fixing a window in the leader candidate region $w_{0}$ that contains the core sequence. Our second problem, the TRS-Gene-ID problem, is solved by CORSID. We use the same insight of sliding a window through the leader region $v_{leader}$ and show that the constrained problem corresponds to a maximum-weight independent set on an interval graph, which can be solved in polynomial time. We implemented both methods in Python. Moreover, we implemented a web-based visual analytics tool for exploring the space of near-optimal solutions. The source code is available at https://github.com/elkebir-group/CORSID. The results of CORSID and CORSID-A are available at https://github.com/elkebir-group/CORSID-data and we also built a web application to visualize these results for easier exploration of the solution space and manual annotation (https://elkebir-group.github.io/CORSID-viz/). We refer to Materials and Methods for further details, including a description of scope, recommendations, practical considerations and heuristics for obtaining the required input to each problem.
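To illustrate why a maximum-weight independent set on an interval graph is solvable in polynomial time, the following Python sketch implements the textbook weighted interval scheduling dynamic program. It is not the CORSID implementation: the interval coordinates, the weights and the function name are assumptions, and in CORSID the weights are derived from the TRS alignment step rather than chosen by hand.

```python
from bisect import bisect_left

def max_weight_independent_set(intervals):
    """intervals: (start, end, weight) with closed coordinates.
    Returns (best total weight, chosen pairwise non-overlapping intervals)."""
    items = sorted(intervals, key=lambda t: t[1])  # sort by right endpoint
    ends = [e for _, e, _ in items]
    n = len(items)
    best = [0] * (n + 1)      # best[k]: optimum using the first k intervals
    take = [False] * (n + 1)
    prev = [0] * (n + 1)
    for k in range(1, n + 1):
        s, _, w = items[k - 1]
        # Number of earlier intervals that end strictly before s (compatible prefix).
        p = bisect_left(ends, s, 0, k - 1)
        prev[k] = p
        if best[p] + w > best[k - 1]:
            best[k], take[k] = best[p] + w, True
        else:
            best[k] = best[k - 1]
    chosen, k = [], n
    while k > 0:  # trace back the decisions
        if take[k]:
            chosen.append(items[k - 1])
            k = prev[k]
        else:
            k -= 1
    return best[n], chosen[::-1]

# Toy candidate ORFs with hypothetical weights.
orfs = [(1, 100, 100), (90, 200, 111), (150, 300, 151), (301, 400, 100)]
weight, chosen = max_weight_independent_set(orfs)
print(weight, chosen)  # 351 [(1, 100, 100), (150, 300, 151), (301, 400, 100)]
```

After sorting, each interval needs one binary search, so the sketch runs in O(n log n) time for n candidate intervals.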

Results

To evaluate the performance of CORSID-A and CORSID, we downloaded the same set of 505 assembled coronavirus genomes previously analyzed by SuPER (Yang et al. 2021) from GenBank along with their annotation GFF files, indicating gene locations. To benchmark methods for the TRS-ID problem, we assessed each method’s ability to correctly identify TRS-L as well as identify a TRS-B upstream of each gene. For the TRS-Gene-ID problem, we additionally assessed each method’s ability to identify ground-truth genes. To account for missing genes in annotation GFF files, we used BLASTx to identify an extended set of ground-truth genes (Altschul et al. 1990) (supplementary fig. S1, Supplementary Material online). We refer to supplementary methods, Supplementary Material online for additional details on how we established the set of genes and locations of TRS sites in the coronavirus genomes (supplement section 2.1 and fig. S2, Supplementary Material online). We excluded 35 genomes with incomplete leader sequences, thus lacking TRS-L. We excluded two more genomes due to empty GFF files, thus lacking gene annotations. The remaining 468 genomes comprised all four genera of the Coronaviridae family and spanned a total of 22 subgenera (supplementary table S2, Supplementary Material online).

CORSID-A Finds TRS-L and TRS-Bs with High Accuracy

We begin by comparing the performance of CORSID-A with MEME and SuPER for the TRS Identification problem. Recall that MEME is a general-purpose motif detection algorithm (Bailey et al. 2009), whereas SuPER is specifically designed for identifying core sequences within coronavirus genomes annotated with genes (Yang et al. 2021). To run CORSID-A, we extracted candidate regions $w_{0}, \dots, w_{n}$ upstream of annotated genes $x_{0}, \dots, x_{n}$ (see supplementary methods, Supplementary Material online for a precise definition of candidate region). The minimum length $ω$ of the core sequence is set to seven following existing literature (Alonso et al. 2002; Sola et al. 2015), and we use a minimum alignment score of $τ = 2$. We provided MEME with the same candidate regions $w_{0}, \dots, w_{n}$, and ran it in “zero or one occurrence per sequence” mode. For SuPER, we analyzed the previously reported results on the same 468 sequences. We refer to the supplementary results, Supplementary Material online for detailed commands and parameters.

As shown in figure 2b, CORSID-A correctly identified TRS-Ls in 466 out of 468 genomes, reaching a higher accuracy (99.6%) than MEME (442 genomes, 94.4%), but was outperformed by SuPER, which was correct in 467 genomes (99.8%). The two genomes where our method failed are outliers in their respective subgenera, indicative of possible sequencing errors (supplementary results, supplementary figs. S3 and S4, Supplementary Material online). We discuss the one genome (MN996532) where SuPER failed to identify TRS-L correctly in supplementary fig. S5, Supplementary Material online, showing that the TRS-L sequence identified by our method is supported by both secondary structure information as well as a split read in a corresponding RNA sequencing sample (SRR11085797). Split reads from RNA-sequencing data map to non-contiguous regions of the viral genome and provide direct evidence of template switching at TRS sites during viral replication in infected cells (Sashittal et al. 2021).

Fig. 2. — CORSID-A accurately identifies TRS-Ls and TRS-Bs. (a) We used SuPER (Yang et al. 2021), MEME (Bailey et al. 2009) and CORSID-A to identify TRS sites in 468 coronavirus genomes with known gene locations. (b) The fraction of genomes for which the three methods identified the TRS-L correctly. (c) The fraction of genes of the genomes for which the three methods identified the corresponding TRS-B site correctly. (d) Number of coronavirus genomes of the four genera of the *Coronaviridae* family with different lengths of the TRS-L identified by the three methods. The TRS alignment identified by CORSID-A for the genome indicated by “*” is shown in supplementary fig. S7, Supplementary Material online.

Of note, SuPER uses additional information to identify TRS-L and TRS-B sites compared to MEME and CORSID-A. That is, SuPER requires the user to specify the genus of origin for each input sequence, which is used to obtain a genus-specific motif of the core sequence from a look-up table. This motif is used to identify matches along the genome. In addition, SuPER takes as input the $5^{'}$ UTR secondary structure, restricting the region in which the TRS-L can occur to the sequence up to the fourth stem loop (SL4). Importantly, while CORSID-A does not rely on any pre-specified motif, taxonomic or secondary structure information, our method identified more TRS-Bs than either SuPER or MEME (fig. 2c). Specifically, we define the TRS-B recall as the fraction of genes for which TRS-Bs were identified. While the median TRS-B recall of all three methods is 1, CORSID-A found putative TRS-Bs of all genes in 387 genomes (82.7%), while SuPER and MEME did so in only 290 (62.0%) and 315 (67.3%) genomes, respectively.

To validate the identified TRS sites, we examined split reads in publicly available RNA-sequencing data of cells infected by coronaviruses. Here we considered two samples, SRR1942956 and SRR1942957, of SARS-CoV-1-infected cells (NC_004718) with a median depth of 2940× and 2765×, respectively. The TRS-B region for ORF7b predicted by CORSID-A is supported by 246 reads in sample SRR1942956 and 233 reads in sample SRR1942957. On the other hand, SuPER found a different TRS-B region for this gene, which it marked as not recommended, and is supported by only 1 read in each sample (supplementary fig. S6a, Supplementary Material online). We suspect our method successfully identified the TRS-B region by using matching flanking positions of the core sequence rather than restricting the search to a short 6–7 nt motif in a fixed length region as done by SuPER.

As CORSID-A does not restrict the length of regulatory sequences, our method is able to find evidence for homologous recombination and/or putative TRS-L derived insertions. Specifically, even though the minimum length of the core sequence is set to 7, the TRSs identified by our method can be longer due to matching sequences in regions flanking the core sequence. While the core sequences identified by SuPER and MEME (fig. 2d and supplementary fig. S9, Supplementary Material online) are at most 10 nt long, the length of TRSs identified by CORSID-A ranges from 9 to 45 nt (median: 22, supplementary fig. S7, Supplementary Material online). Across all the genomes for which CORSID-A identifies a TRS-L longer than 25 nt (42 genomes), the median length of the core sequences is 7 and the median number of mismatches between the core sequences within the longest TRS-B and the core sequences within TRS-L is only 1 (supplementary fig. S8, Supplementary Material online). In particular, in betacoronavirus genome NC_006577, CORSID-A identifies a TRS-B upstream of ORF4 with a length of 36 nucleotides that perfectly matches the TRS-L as well as another TRS-B preceding gene HE with a length of 27 nucleotides with only 1 mismatch, showing strong evidence of recombination and/or TRS-L derived insertion (supplementary fig. S7, Supplementary Material online). Thus, we corroborate previous findings showing numerous genomic insertions of $5^{'}$-UTR in betacoronaviruses (Patarca and Haseltine 2022) and that recombination hotspots in coronaviruses are colocated with TRS sites (Yang et al. 2021). Furthermore, we note that there is experimental evidence that, besides the core sequences, flanking nucleotides also play an important role in discontinuous transcription (Sola et al. 2005).

In summary, by considering matches in the regions flanking the core sequences using the TRS alignment, CORSID-A finds evidence for putative recombination and/or TRS-L derived insertion events and more accurately identifies regulatory sequences compared to existing motif finding methods such as SuPER and MEME.

CORSID Identifies Genes with High Accuracy

We now focus on the TRS-Gene-ID problem, where we compared CORSID to three gene finding methods: Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007), Prodigal (Hyatt et al. 2010, 2012), and VADR (Schäffer et al. 2020). Each method was given as input the complete, unannotated genome sequence of each of the 468 coronaviruses. Following recommended instructions, we ran Glimmer3 by first building the required interpolated context model (ICM) on each genome sequence separately. We ran Prodigal in meta-genomics mode. We ran VADR using their recommended parameters as well as their released database for coronaviruses. For CORSID, we used window length $ω = 7$ and progressively reduced the score threshold $τ$ from 7 to 2. These parameter values were determined from a fivefold cross validation study (details in supplementary section 2.3, figs. S10 and S11, Supplementary Material online). We refer the reader to supplementary results, Supplementary Material online for the precise commands used to run previous tools and details on how the predicted set of genes is compared to ground truth.

Figure 3a shows that CORSID outperformed Glimmer3 and Prodigal in terms of both precision and recall, and achieved higher recall than VADR. The median precision and recall of CORSID are 0.889 and 1.00, respectively, compared with 0.625 and 0.600 for Glimmer3 and 0.714 and 0.636 for Prodigal. Although VADR has a higher median precision (1.00) than CORSID, its median recall is lower (0.900), and its median $F_{1}$ score is 0.909, less than CORSID’s median $F_{1}$ score of 0.923.

The same trends are observed when pooling all gene predictions as shown in fig. 3b. CORSID achieved the highest pooled recall (0.926) and $F_{1}$ score (0.895), while the precision (0.865) is only slightly lower than VADR’s (0.876). The higher precision achieved by VADR can be explained by the fact that its reference database contains 55 coronavirus sequences, 48 of which are included in the 468 complete genomes we test on. If these 48 genomes are removed from the test set, CORSID achieves better overall performance than VADR in the remaining 420 genomes (supplementary fig. S12, Supplementary Material online).
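As a quick consistency check of the pooled numbers, the $F_{1}$ score is the harmonic mean of precision and recall. The Python sketch below recomputes it from the rounded values reported above; because the inputs are rounded to three decimals, the result can differ from the reported 0.895 in the last digit. The function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Pooled CORSID precision and recall as reported in the text (rounded values).
pooled_f1 = f1_score(0.865, 0.926)
print(f"{pooled_f1:.4f}")
```

The same formula does not link the per-genome medians: the median $F_{1}$ score is the median of per-genome $F_{1}$ values, which need not equal the $F_{1}$ of the median precision and median recall (approximately 0.941 for 0.889 and 1.00).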

While Prodigal, Glimmer3, and VADR do not have the capability to identify TRS sites, CORSID identifies these regulatory sites in addition to the genes. Specifically, compared to CORSID-A, which identified TRS-L correctly for 466 (99.6%) genomes, CORSID does so for 443 (94.7%) genomes (supplementary fig. S13, Supplementary Material online). This is a modest reduction in performance, especially when taking into account that CORSID, unlike CORSID-A, is not given any additional information apart from the complete, unannotated genome sequence. Analyzing the previously discussed SARS-CoV-1 genome (NC_004718), we found that CORSID identified the same 10 genes as CORSID-A, while Prodigal missed four genes and Glimmer3 missed two genes (supplementary fig. S6b, Supplementary Material online). Although VADR found all genes, including three genes missed by CORSID, SARS-CoV-1 is contained in its reference database as mentioned earlier.

In summary, CORSID accurately identifies TRS sites and genes given just the unannotated genome, outperforming existing gene finding methods.

CORSID Enables De Novo Identification of TRS Sites and Genes

To demonstrate how users can use CORSID to annotate genes and identify TRS-L and TRS-Bs given a newly assembled genome, we analyzed a previously excluded genome that lacks gene annotation (genome DQ288927). This genome is 27,534 nt long, which we provided as input to CORSID, Glimmer3, Prodigal and VADR. We note that this genome is absent from VADR’s reference database. CORSID identified nine genes spanning 91.66% of the genome, all of which match annotated genes in other Igacovirus sequences in the BLASTx database (fig. 4). By contrast, VADR found eight genes, missing gene 4b, covering 88.03% of the genome. Glimmer3 identified a total of six genes spanning 80.52% of the genome, five of which match genes in the BLASTx database. Finally, Prodigal found six genes, all of which were present in the database, spanning 84.22% of the genome. In summary, CORSID identified more genes than existing methods, all of which occurred in homologous previously annotated genomes in the BLASTx database, demonstrating that CORSID can be used to accurately annotate coronavirus genomes.

Fig. 4. — CORSID accurately finds genes in an unannotated *Igacovirus* genome (accession DQ288927). (a) The position of the genes identified by CORSID. The Venn diagram shows every gene found by CORSID, Glimmer3, Prodigal, and VADR. “*” indicates $\geq$ 95% query/hit coverage by BLASTx, “**” indicates a BLASTx hit with query/hit coverage less than 95%, and “?” represents a predicted gene with no BLASTx hit. (b) TRS alignment for genes identified by CORSID. (c) The fraction of positions in $v_{body}$ covered by genes identified by the four methods.

Discussion

In this paper, we demonstrated that transcription regulatory sequences in coronavirus genomes can be leveraged to simultaneously infer these regulatory sequences and their associated genes in a synergistic manner. To that end, we formulated the TRS Identification (TRS-ID) problem of identifying TRS sites in a coronavirus genome with given gene locations, and the general problem, the TRS and Gene Identification (TRS-Gene-ID) problem of simultaneous identification of genes and TRS sites given only the coronavirus genome. Underpinning both problems is the notion of a TRS alignment, which extends the previous concept of core sequences to include flanking nucleotides that provide additional signal. Our proposed method for the first problem, CORSID-A, is based upon a dynamic programming formulation which extends the classical Smith–Waterman recurrence (Smith and Waterman 1981). CORSID, which solves the general problem, additionally incorporates a maximum-weight independent set formulation on an interval graph to identify TRS sites and genes.

Using extensive experiments on 468 coronavirus genomes, we showed that CORSID-A outperformed two motif-based approaches, MEME (Bailey et al. 2009) and SuPER (Yang et al. 2021). Additionally, we showed that CORSID outperformed two general-purpose gene finding algorithms, Glimmer3 (Salzberg et al. 1998; Delcher et al. 2007) and Prodigal (Hyatt et al. 2010). We performed direct validation of TRS sites predicted for the SARS-CoV-1 genome (accession NC_004718), showing that the TRS sites identified by our method are more strongly supported by split reads in RNA-seq samples than the TRS sites identified by SuPER. Lastly, we demonstrated that CORSID enables de novo identification of TRSs and genes in newly assembled coronavirus genomes by applying it to a previously unannotated coronavirus (accession DQ288927) belonging to the Igacovirus subgenus.

There are several limitations and avenues for future research. First, the accuracy of identifying genes can be improved by accounting for alternative start codons (supplementary table S3, Supplementary Material online) to improve recall and by incorporating Kozak sequence information to improve precision. Second, CORSID is designed for de novo gene annotation of novel coronaviruses given only the nucleotide sequence of the genome. However, RNA-sequencing data, when aligned to the reference genome, contain split reads, that is, reads that span non-contiguous regions of the genome, which can be leveraged for identifying candidate regions that contain TRSs. We plan to extend our method by supporting the use of RNA-sequencing data to improve gene annotation. Third, CORSID currently requires the complete genome as input to identify the TRS sites and the genes. CORSID can be extended to allow gene identification in the several coronaviruses available in GenBank with only partial reference genomes, such as accession NC_014470, by leveraging knowledge from other coronaviruses with complete genomes and similar TRS sites. Fourth, while in this study we focused only on coronaviruses, discontinuous transcription occurs in all viruses in the taxonomic order Nidovirales. However, CORSID, which assumes a single TRS-L region in the genome, cannot be directly applied to other families of viruses within Nidovirales, such as the family Mesoniviridae, that contain multiple TRS-L regions in the genome (Zirkel et al. 2013; Vasilakis et al. 2014). Incorporating such features and extending CORSID to all Nidovirales viruses is a useful direction for future work. Finally, CORSID currently requires the reference genome of the virus as input. In the future, we plan to perform de novo assembly jointly with core sequence and TRS site identification, facilitating comprehensive analysis from raw sequencing data of novel coronaviruses.

Materials and Methods

We begin by discussing CORSID-A, which solves the TRS-ID problem. Next, we introduce CORSID, which solves the TRS-Gene-ID problem. Finally, we discuss a web-based visual analytics tool to explore the space of near-optimal solutions.

Solving the TRS Identification Problem

Recall that in the TRS-ID problem we seek a TRS alignment A given input candidate region sequences $w_{0}, \dots, w_{n}$ that each occur upstream of genes $x_{0}, \dots, x_{n}$ . Intuitively, we define the candidate region for a gene $x_{i}$ as the region $w_{i} = [w_{i}^{-}, w_{i}^{+}]$ composed of positions $w_{i}^{-} \leq p \leq w_{i}^{+}$ such that any sgRNA starting at p will lead to the translation of ORF $x_{i}$ by the ribosome. SuPER (Yang et al. 2021), the only other method for identifying TRSs in annotated coronavirus genomes, employs a heuristic that defines the candidate region $w_{i}$ of a gene $x_{i}$ as $v_{x_{i}^{-} - 170} \dots v_{x_{i}^{-} - 1}$ , that is, the candidate region $w_{i}$ is a subsequence of 170 nt immediately upstream of gene $x_{i}$ for $1 \leq i \leq n$ . Here, we take a more rigorous and flexible approach that takes leaky scanning into account by skipping over previous ORFs with length smaller than 100 nt (details in supplementary methods section 1.1, figs. S14 and S15, Supplementary Material online).

Recall that in a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ only the TRS-Bs $a_{1}, \dots, a_{n}$ are allowed to have gaps (restricted to the flanks), and that the TRS-L $a_{0}$ is gapless. To score a TRS alignment, we use a simple scoring function $δ : Σ \times (Σ \cup \{-\}) \to \mathbb{R}$ such that $δ (x, y)$ equals $+ 1$ for matches (i.e., $x = y$ ), $- 2$ for mismatches (i.e., $x \neq y$ and $y \neq -$ ), and 0 for gaps (i.e., $y = -$ ). In other words, while we reward matches and penalize mismatches, we do not penalize flanking gaps.
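As a minimal sketch of the scoring function described above, the following Python snippet scores one aligned column and sums the column scores for a gapless TRS-L row against a single TRS-B row. The names `delta` and `pair_score` and the toy sequences are illustrative assumptions and are not taken from the CORSID implementation; how row scores are combined into the score of a full alignment is defined elsewhere in the paper.

```python
GAP = "-"  # gap symbol; only allowed in TRS-B rows

def delta(x: str, y: str) -> int:
    """Score one column: x comes from TRS-L (never a gap), y from a TRS-B."""
    if y == GAP:
        return 0                  # flanking gaps are not penalized
    return 1 if x == y else -2    # +1 for a match, -2 for a mismatch

def pair_score(trs_l: str, trs_b: str) -> int:
    """Sum the column scores of a gapless TRS-L row and one TRS-B row."""
    if len(trs_l) != len(trs_b):
        raise ValueError("aligned rows must have equal length")
    return sum(delta(x, y) for x, y in zip(trs_l, trs_b))

# Hypothetical rows: one flanking gap followed by five matches.
print(pair_score("ACGAAC", "-CGAAC"))
# Hypothetical rows: two flanking gaps, three matches, one mismatch.
print(pair_score("ACGAAC", "--GAGC"))
```

Because a mismatch costs twice as much as a match earns, extending a TRS-B into a flank only pays off when matches clearly outnumber mismatches there; otherwise a free flanking gap is preferred.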

Recall that the sought TRS alignment A must induce a core sequence $c (A)$ of length at least $ω$ . Due to this constraint, the input sequences $w_{0}, \dots, w_{n}$ depend on one another and cannot be considered in isolation. We break this dependency by considering a subsequence u within $w_{0}$ of length $ω$ , restricting the induced core sequence $c (A)$ of output TRS alignments A to contain u. We solve this constrained version of the TRS-ID problem using dynamic programming in time $O (| w_{0} | L)$ where L is the total length of candidate regions $w_{1}, \dots, w_{n}$ (details are in supplementary methods, supplementary fig. S16, Supplementary Material online, fig. 5a). We obtain the solution to the original TRS-ID problem by identifying the window u that induces a TRS alignment A with maximum score. As there are $O (| w_{0} |)$ windows in $w_{0}$ of fixed length $ω$ , this procedure takes $O (| w_{0} |^{2} L)$ time.
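The sliding-window idea can be illustrated with a simplified sketch. It assumes that, because flanking gaps are free and TRS-L is gapless, each candidate region can be handled independently once the window u is fixed: the window is aligned without gaps to every position of the region and the alignment is extended into each flank only as long as that increases the score. This is a brute-force stand-in for the authors' dynamic program (whose recurrence is given in the supplementary methods), and all function names and toy sequences are hypothetical.

```python
def delta(x, y):
    """+1 for a match, -2 for a mismatch (gaps are handled by not extending)."""
    return 1 if x == y else -2

def best_extension(pairs):
    """Best prefix sum over a flank; 0 means 'use flanking gaps instead'."""
    best = total = 0
    for x, y in pairs:
        total += delta(x, y)
        best = max(best, total)
    return best

def anchored_score(w0, start, omega, wi, j):
    """Best ungapped score when wi[j:j+omega] is aligned to w0[start:start+omega]."""
    core = sum(delta(x, y)
               for x, y in zip(w0[start:start + omega], wi[j:j + omega]))
    left = best_extension(zip(reversed(w0[:start]), reversed(wi[:j])))
    right = best_extension(zip(w0[start + omega:], wi[j + omega:]))
    return core + left + right

def best_window(w0, regions, omega):
    """Try every window of length omega in w0; return (total score, window start).

    Assumes every candidate region has length at least omega.
    """
    best = None
    for start in range(len(w0) - omega + 1):
        total = sum(
            max(anchored_score(w0, start, omega, wi, j)
                for j in range(len(wi) - omega + 1))
            for wi in regions
        )
        if best is None or total > best[0]:
            best = (total, start)
    return best

# Toy leader region and two toy candidate regions sharing the motif ACGAAC.
print(best_window("TTACGAACTT", ["GGACGAACGG", "CCCACGAACA"], omega=6))
```

For a fixed window, each of the L anchor positions needs at most $O (| w_{0} |)$ work for the extensions, so this sketch stays within the $O (| w_{0} |^{2} L)$ bound stated above, although it does not share work between windows the way a DP table can.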

Fig. 5. — Algorithm details. (a) Given genes $x_{0}, \dots, x_{n}$ , we obtain candidate regions $w_{0}, \dots, w_{n}$ by identifying upstream ORFs, skipping over ORFs if they are of length less than 100 nt (indicated by “*”). CORSID-A solves the TRS-ID problem by sliding a window u through $w_{0}$ , solving n independent pairwise dynamic programming problems, which together yield the optimal TRS alignment A for window u. (b) To solve the TRS-Gene-ID problem, CORSID additionally solves a maximum-weight independent set problem (Hsiao et al. 1992) on an interval graph defined by the candidate ORFs to simultaneously identify an optimal pair $(A, Γ (A))$ for window u.

Solving the TRS and Gene Identification Problem

In the TRS-Gene-ID problem, we require two sequences: $v_{leader}$ , which contains TRS-L $a_{0}$ , and $v_{body}$ , which contains each TRS-B $a_{1}, \dots, a_{n}$ . We propose a heuristic to partition a genome v into $v_{leader}$ and $v_{body}$ . The heuristic takes $O (m^{2})$ time, where m is the number of ORFs in v, and incorporates a classifier to identify truncated genomes missing TRS-L in the $5^{'}$ UTR (supplementary methods, supplementary fig. S17, Supplementary Material online).

We will now define the relationship between a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ and the set $Γ (A)$ of induced genes. Upon removing (flanking) gaps, each aligned sequence $a_{i}$ corresponds to a contiguous subsequence $v_{i}$ of the viral genome v. Specifically, $v_{0}$ occurs in $v_{leader}$ and $v_{i}$ occurs in $v_{body}$ (where $i \geq 1$ ). By Definition 4, each subsequence $v_{i}$ has positions that are aligned with the core sequence $c (A)$ . These aligned positions induce the subsequence $c_{i} = [c_{i}^{-}, c_{i}^{+}]$ of length equal to $| c (A) |$ . Note that while $c_{0} = c (A)$ , it may be that $c_{i} \neq c (A)$ where $i \geq 1$ due to mismatches. Importantly, there are coronaviruses where the last three nucleotides of the core sequence within a TRS-B coincide with the start codon of the associated gene (supplementary fig. S18, Supplementary Material online). As such, we have the following definition.

Definition 9.

Let $A = [a_{0}, \dots, a_{n}]^{⊤}$ be a TRS alignment and let $c_{i} = [c_{i}^{-}, c_{i}^{+}]$ be the subsequence of $a_{i}$ that is aligned to the core sequence $c (A)$ . The ORF associated with TRS-B $a_{i}$ is the unique ORF x where position $c_{i}^{+}$ occurs within the candidate region of x.

As discussed, there may not exist an ORF associated with a TRS-B $a_{i}$ , which may happen when the TRS-B is located near the $3^{'}$ end of the genome. Given a TRS alignment $A = [a_{0}, \dots, a_{n}]^{⊤}$ , the set $Γ (A)$ of induced genes equals the set of ORFs that are associated with $a_{1}, \dots, a_{n}$ .
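The association in Definition 9 amounts to an interval lookup, which can be sketched as follows. The `Orf` record, its field names, and the coordinates are hypothetical; the sketch assumes each candidate region is stored as an inclusive interval and returns `None` when no ORF (or no unique ORF) contains position $c_{i}^{+}$ .

```python
from typing import Iterable, List, NamedTuple, Optional, Set

class Orf(NamedTuple):
    name: str
    region_start: int  # w^-, first position of the candidate region
    region_end: int    # w^+, last position of the candidate region (inclusive)

def associated_orf(c_plus: int, orfs: List[Orf]) -> Optional[Orf]:
    """Return the ORF whose candidate region contains c_plus, if exactly one does."""
    hits = [o for o in orfs if o.region_start <= c_plus <= o.region_end]
    return hits[0] if len(hits) == 1 else None

def induced_genes(c_plus_positions: Iterable[int], orfs: List[Orf]) -> Set[str]:
    """Names of the ORFs associated with the given TRS-B core end positions."""
    found = (associated_orf(p, orfs) for p in c_plus_positions)
    return {o.name for o in found if o is not None}

# Hypothetical candidate regions and TRS-B core end positions.
orfs = [Orf("S", 100, 180), Orf("E", 400, 470), Orf("M", 500, 560)]
print(induced_genes([150, 300, 520], orfs))  # position 300 has no associated ORF
```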

To solve the TRS-Gene-ID problem, we take a sliding window approach similar to the one we used to solve the TRS-ID problem. That is, we consider all subsequences u within $v_{leader}$ of length $ω$ and solve a constrained version of the TRS-Gene-ID problem, additionally requiring that the sought TRS alignment A has a core sequence $c (A)$ that fully contains u, using the following two steps. First, we construct a DP table similar to the table used in the TRS-ID problem in $O (| v_{leader} | | v_{body} |)$ time, and for each ORF, we select the alignment with the highest score in the corresponding candidate region. Second, given these ORFs and corresponding alignments, we build a vertex-weighted interval graph combining ORF lengths and alignment scores as weights. To identify the optimal TRS alignment A and associated genes $Γ (A)$ , we solve a maximum-weight independent set (MWIS) problem on this graph in $O (m)$ time, where m is the number of candidate ORFs in $v_{body}$ (supplementary methods, Supplementary Material online and fig. 5b). Each instance of the constrained TRS-Gene-ID problem takes $O (| v_{leader} | | v_{body} | + m)$ time. Since the number of windows of length $ω$ in $v_{leader}$ is $O (| v_{leader} |)$ , the total running time of CORSID to solve the TRS-Gene-ID problem is $O (| v_{leader} |^{2} | v_{body} | + | v_{leader} | m)$ . In practice, the number m of candidate ORFs in $v_{body}$ ranges from 21 to 92, the length $| v_{leader} |$ of the leader region ranges from 171 to 716, and the length $| v_{body} |$ of the body region ranges from 6280 to 11,462 across all the coronaviruses studied in this paper. Finally, to obtain biologically meaningful solutions, we employ a progressive approach and consider overlapping genes (see supplementary methods, Supplementary Material online for details and supplementary fig. S19, Supplementary Material online).
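The MWIS step on an interval graph can be illustrated with the standard weighted interval scheduling dynamic program. This is a generic textbook sketch rather than the authors' implementation: the weights are taken as given (the paper combines ORF lengths and alignment scores, but the exact combination is not reproduced here), coordinates are assumed inclusive, and the progressive handling of overlapping genes is omitted. Because the sketch sorts the intervals first, it runs in $O (m \log m)$ time rather than the $O (m)$ reported for CORSID.

```python
from bisect import bisect_left

def max_weight_independent_set(intervals):
    """Pick non-overlapping intervals of maximum total weight.

    intervals: list of (start, end, weight) with inclusive coordinates.
    Returns (best_weight, chosen_intervals_sorted_by_end).
    """
    items = sorted(intervals, key=lambda t: t[1])   # sort by end position
    ends = [end for _, end, _ in items]
    best = [0] * (len(items) + 1)   # best[k]: optimum over the first k intervals
    take = [False] * len(items)     # whether interval k is used to reach best[k+1]
    prev = [0] * len(items)         # number of intervals ending before start of k
    for k, (start, _, weight) in enumerate(items):
        prev[k] = bisect_left(ends, start)  # intervals with end < start
        if best[prev[k]] + weight > best[k]:
            best[k + 1] = best[prev[k]] + weight
            take[k] = True
        else:
            best[k + 1] = best[k]
    # Trace back the chosen intervals.
    chosen, k = [], len(items)
    while k > 0:
        if take[k - 1]:
            chosen.append(items[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    chosen.reverse()
    return best[-1], chosen

# Hypothetical candidate ORFs as (start, end, weight).
orfs = [(1, 3, 5), (2, 5, 6), (4, 6, 5), (6, 7, 4), (5, 8, 11), (7, 9, 2)]
print(max_weight_independent_set(orfs))
```

In this toy instance the heaviest single interval, (5, 8, 11), is combined with the only interval that ends before it starts, (1, 3, 5), which beats any chain of three lighter intervals.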

Web Application to Explore Solution Space

In order to present a comprehensive overview of identified TRS sites and genes across solutions, we created a web application that visualizes all solutions and allows for manual annotation. After obtaining solutions from CORSID and CORSID-A, users can launch the application with the output JSON file, then inspect all possible solutions. Specifically, we show a summary table of all solutions, followed by the optimal solution for which we show a sequence logo of the identified TRS-L and TRS-Bs, a genome coverage map, and a detailed table of each identified gene. Users can click the summary table and show other alternative solutions below the fixed optimal solution for comparison. A demo of the visualization can be found at https://elkebir-group.github.io/CORSID-viz. We also made an integrated dockerized workflow including CORSID, CORSID-A, BLASTx, and the visualization web application. After obtaining the docker images, users can easily analyze a new genome by running the workflow without any manual configuration. Additional details about the scope and recommendations for using CORSID and CORSID-A, including the combination of running CORSID-A after VADR, are provided in Section 2.7 of the supplement (supplementary fig. S20, Supplementary Material online).


Acknowledgments

This material is based upon work supported by the National Science Foundation under award numbers CCF-1850502, CCF-2027669, and CCF-2046488. This work used resources, services, and support provided via the Greg Gulick Honorary Research Award Opportunity supported by a gift from Amazon Web Services.

Contributor Information

Chuanyi Zhang, Department of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Palash Sashittal, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Michael Xiang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Yichi Zhang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Ayesha Kazi, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Mohammed El-Kebir, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

References

Alonso S, Izeta A, Sola I, Enjuanes L. 2002. Transcription regulatory sequences and mRNA expression levels in the coronavirus transmissible gastroenteritis virus. J Virol. 76(3):1293–1308.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215(3):403–410.
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37(Suppl 2):W202–W208.
Carrillo H, Lipman D. 1988. The multiple sequence alignment problem in biology. SIAM J Appl Math. 48(5):1073–1082.
Delcher AL, Bratke KA, Powers EC, Salzberg SL. 2007. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23(6):673–679.
Down TA, Hubbard TJP. 2005. NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence. Nucleic Acids Res. 33(5):1445–1453.
Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O, et al. 2021. The coding capacity of SARS-CoV-2. Nature 589(7840):125–130.
Hsiao JY, Tang CY, Chang RS. 1992. An efficient algorithm for finding a maximum weight 2-independent set on interval graphs. Inf Process Lett. 43(5):229–235.
Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11(1):1–11.
Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. 2012. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28(17):2223–2230.
Jungreis I, Sealfon R, Kellis M. 2021. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat Commun. 12(1):1–20.
Kim D, Lee J-Y, Yang J-S, Kim JW, Kim VN, Chang H. 2020. The architecture of SARS-CoV-2 transcriptome. Cell 181(4):914–921.
Patarca R, Haseltine WA. 2022. Intragenomic rearrangements of SARS-CoV-2 and other β-coronaviruses. bioRxiv.
Pavesi G, Mereghetti P, Mauri G, Pesole G. 2004. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32(Suppl 2):W199–W203.
Salzberg SL, Delcher AL, Kasif S, White O. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26(2):544–548.
Sashittal P, Zhang C, Peng J, El-Kebir M. 2021. Jumper enables discontinuous transcript assembly in coronaviruses. Nat Commun. 12:67280.
Schäffer AA, Hatcher EL, Yankie L, Shonkwiler L, Brister JR, Karsch-Mizrachi I, Nawrocki EP. 2020. VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinform. 21:1–23.
Smith TF, Waterman MS. 1981. Identification of common molecular subsequences. J Mol Biol. 147(1):195–197.
Sola I, Almazan F, Zúñiga S, Enjuanes L. 2015. Continuous and discontinuous RNA synthesis in coronaviruses. Annu Rev Virol. 2:265–288.
Sola I, Moreno JL, Zúñiga S, Alonso S, Enjuanes L. 2005. Role of nucleotides immediately flanking the transcription-regulating sequence core in coronavirus subgenomic mRNA synthesis. J Virol. 79(4):2506–2516. doi:10.1128/JVI.79.4.2506-2516.2005
Vasilakis N, Guzman H, Firth C, Forrester NL, Widen SG, Wood TG, Rossi SL, Ghedin E, Popov V, Blasdell KR, et al. 2014. Mesoniviruses are mosquito-specific viruses with extensive geographic distribution and host range. Virol J. 11(1):1–12.
Yang Y, Yan W, Hall AB, Jiang X. 2021. Characterizing transcriptional regulatory sequences in coronaviruses and their role in recombination. Mol Biol Evol. 38(4):1241–1248.
Yao Z, Weinberg Z, Ruzzo WL. 2006. CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics 22(4):445–452.
Zirkel F, Roth H, Kurth A, Drosten C, Ziebuhr J, Junglen S. 2013. Identification and characterization of genetically divergent members of the newly established family Mesoniviridae. J Virol. 87(11):6346–6358.
