PLOS ONE. 2022 Sep 1;17(9):e0260331. doi: 10.1371/journal.pone.0260331

A Putative long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

Okiemute Beatrice Omoru 1, Filipe Pereira 2,3, Sarath Chandra Janga 1,4,5, Amirhossein Manzourolajdad 1,6,*
Editor: Danny Barash 7
PMCID: PMC9436084  PMID: 36048827

Abstract

SARS-CoV-2 has affected people worldwide as the causative agent of COVID-19. The virus is related to the highly lethal SARS-CoV-1 responsible for the 2002–2003 SARS outbreak in Asia. Research is ongoing to understand why the two viruses differ in spreading capacity and mortality rate. As in other beta-coronaviruses, RNA-RNA interactions occur between different parts of the viral genomic RNA, resulting in discontinuous transcription and production of various sub-genomic RNAs. These sub-genomic RNAs are then translated into viral proteins. In this work, we performed a comparative analysis for novel long-range RNA-RNA interactions that may involve the Spike region. Comparing in-silico fragment-based predictions between reference sequences of SARS-CoV-1 and SARS-CoV-2 revealed several candidates, among them a thermodynamically stable long-range RNA-RNA interaction between (23660–23703 Spike) and (28025–28060 ORF8) that is unique to SARS-CoV-2. Patterns of sequence variation in data gathered worldwide further supported the predicted stability of the sub-interacting region (23679–23690 Spike) and (28031–28042 ORF8). Such RNA-RNA interactions can potentially impact the viral life cycle, including sub-genomic RNA production rates.

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible and pathogenic coronavirus that emerged in late 2019 and has caused a pandemic of acute respiratory disease, named coronavirus disease 2019 (COVID-19) [1]. SARS-CoV-2 is related to SARS-CoV-1, a life-threatening virus responsible for an outbreak in 2002–2003 that was contained after intense public health mitigation measures [2]. Coronaviruses belong to the Coronaviridae family. They are enveloped, have a positive-sense, single-stranded RNA genome [3], and are categorized into different genera based on their protein sequences [4]. While certain genera are non-pathogenic in humans [5, 6], the genus of beta-coronaviruses comprises most human coronaviruses (HCoVs), including SARS-CoV-1, MERS-CoV, HCoV-OC43, HCoV-HKU1, and SARS-CoV-2 [7]. Beta-coronaviruses, including SARS-CoV-2, are highly pathogenic and are responsible for life-threatening respiratory infections in humans.

The SARS-CoV-2 genome is approximately 30,000 nucleotides long. The genome consists mainly of two large open reading frames (ORF1a and ORF1b), the genes for the structural proteins spike (S), envelope (E), membrane (M), and nucleocapsid (N), and several accessory genes known as open reading frames (ORFs) 3a, 6, 7a, 7b, 8, and 10 [5, 8]. The structural proteins are responsible for viral assembly and suppressing the host’s immune response [9, 10].

The first steps of coronavirus infection involve viral entry into the host cell: the Spike (S) protein binds to cellular entry receptors on the host cell membrane, followed by fusion and release of the viral RNA into the cell. In humans, the host cellular receptor for SARS-CoV-2 is human angiotensin-converting enzyme 2 (ACE2) [11]. The interaction between Spike and ACE2 determines the viral response and pathogenicity [12–14]. After entry, SARS-CoV-2 expresses and replicates its genomic RNA to produce full-length copies, which are incorporated into the newly created viral particles [11]. SARS-CoV-2’s genome encodes non-structural proteins (NSPs), which are essential for viral RNA synthesis, and structural proteins necessary for virion assembly [15].

Coronavirus RNA-dependent RNA synthesis includes two distinct processes: genome replication and transcription of a collection of sub-genomic RNAs. The sub-genomic RNAs encode the viral structural and accessory proteins. These RNAs are produced by discontinuous transcription, in which the synthesis of the negative-sense strand is interrupted. The resulting strand then serves as the template for a positive-sense sub-genomic RNA. The complex replication/transcription machinery produces a series of sub-genomic RNAs through template switching during negative-sense RNA synthesis [16, 17].

Beta-coronaviruses can form long-range high-order RNA-RNA interactions that contribute to template switching and consequently regulate the viral transcription and regulatory pathways for the production of sub-genomic mRNAs [16]. Long-range interactions are generally found in positive-strand viruses [18, 19]. The longest RNA-RNA interaction found so far spans ~26,000 nt and is involved in sub-genomic RNA synthesis in coronaviruses [18]. Mediated by stabilizing proteins, such interactions impact the tertiary structure of the genomic RNA, facilitating binding of the 5’ UTR Transcript Regulatory Sequence (TRS) to the regulatory sequence upstream of a particular gene, leading to the switching of the minus-strand template to that of the gene’s sub-genomic transcript. Regulation of the N-gene sub-genomic transcript is a good example of such high-order RNA-RNA interactions [16]. Although some efforts have been made to investigate RNA-RNA interactions of SARS-CoV-2 in general [20], it is very difficult to identify all the genomic RNA regions that are involved in such intricate interactions, which makes it challenging to find novel interacting regions within the virus [18].

The co-evolution of coronaviruses with their hosts is driven by genetic variation made possible by their large genome [21], recombination frequency (of up to 25% for the entire genome in vivo) [22, 23], and a high mutation rate [24, 25]. Mutations in SARS-CoV-2 arise spontaneously during replication. Thousands of aggregate mutations have occurred since the emergence of the virus [26]. A significant cause of concern about SARS-CoV-2’s mutations is a change that could lead to a highly lethal infection or reduce the effectiveness of the current vaccines [27]. It is known that the strain with the highest similarity to SARS-CoV-2 is SARS-CoV-1. Similar to SARS-CoV-2, SARS-CoV-1 has a genome length of around 30 kb (29,751 nt), and its similarity ratio to the SARS-CoV-2 genome is 82.45% [28]. The genomic differences explain the disparities in the two viruses’ dispersal and immune evasion [29]. The percentage similarity of the Spike protein of SARS-CoV-2 and SARS-CoV-1 is 97.71%. The Receptor Binding Domain (RBD) of the Spike protein, which is the most variable part of the coronavirus genome [30], has 74.41% similarity. In fact, computational analysis has affirmed that the RBD sequence of SARS-CoV-2 differs from those observed to be ideal in SARS-CoV-1 [31]; hence, the high-affinity binding of the SARS-CoV-2 RBD to human ACE2 is attributed to natural selection on human ACE2 that permits a different binding solution [32]. A significant difference between the Spike regions of the two viruses is a polybasic insertion at the S1/S2 cleavage site, resulting from a 12-nt insert in the Spike region of SARS-CoV-2 that does not exist in SARS-CoV-1. In addition to increasing Spike protein infectivity, the 12-nt insert may also have a role on the RNA level, since it has a high GC content (CCUCGGCGGGCA; positions 23,603–23,614 of the reference). Similarities of the other structural proteins are as follows: E, 96%; M, 89.41%; and N, 85.41%. The similarity between the structural proteins of SARS-CoV-2 and those of other coronaviruses is less than 50% [33].

RNA structures can play critical roles in the life cycle of beta-coronaviruses. For instance, studies have reported that SARS-CoV-2’s genomic RNA occupies some of the host’s miRNAs that control immune-regulated genes, thus depriving them of their function [34]. Recent studies have found locally stable RNA structures within the SARS-CoV-2 genome [35–38]. Moreover, in-vivo RNA structure prediction methods such as dimethyl sulfate mutational profiling with sequencing (DMS-MaP-seq) suggest that SARS-CoV-2 forms RNA structures within most of its genome [37], some of possible relevance to the virus life cycle. These RNA structures can potentially be the target of RNA-based therapeutic applications [39, 40], or may lead to methods for inhibiting viral growth [41].

The Spike gene has been observed to have conserved RNA structural elements [42]. The 12-nt insert, which does not exist in the Spike region of SARS-CoV-1, also has an unusually high GC composition, increasing its likelihood of having a role on the RNA level as well as the protein level. In this work, we investigate the Spike gene on an RNA level. Using an in-silico fragment-based method, we compare the original SARS-CoV-2 sequence with its closest relative, SARS-CoV-1, for any sign of major long-range RNA-RNA interactions that involve a genomic segment in the Spike region. The impact of locally stable RNA structures on the long-range predictions is also investigated. Subsequently, we considered the population of evolving SARS-CoV-2 sequences available worldwide to further investigate the conservation of our inferred interactions.

Materials and methods

Data

We used the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) and SARS-CoV-1 (NC_004718.2) reference sequences for identifying long-range RNA-RNA interactions in each of the viruses. For population-based sequence-covariance analyses, a set of 2,348,494 aligned full-length SARS-CoV-2 genome sequences was taken from the Nextstrain project [43] on December 9, 2021. The sequences were originally from the Global Initiative on Sharing All Influenza Data (GISAID) platform [44–46] (https://www.gisaid.org/) and were subsequently filtered for high-quality sequences (nextstrain.org, filename: filtered.fasta.xz). We further filtered the sequences for having no ambiguous nucleotides in the desired locations, which resulted in a total of 2,068,427 sequences. Finally, we down-sampled to around 10 percent of that size (206,745 sequences) due to computational complexity constraints. S1 Table contains the corresponding GISAID accession numbers for the 206,745 sequences.
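The filtering and down-sampling steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the treatment of alignment gaps as ambiguous, and the use of seeded uniform random sampling are all assumptions, since the text does not state how the 10 percent subset was drawn.

```python
import random

# Assumption: sequences are DNA-alphabet FASTA strings (T rather than U),
# and anything other than A/C/G/T (including N and gap characters) is
# treated as ambiguous.
UNAMBIGUOUS = set("ACGT")


def clean_in_regions(aligned_seq, regions):
    """Return True if every base inside the given regions is A, C, G or T.

    regions: list of (start, end) tuples, 1-based and inclusive.
    """
    for start, end in regions:
        segment = aligned_seq[start - 1:end].upper()
        if any(base not in UNAMBIGUOUS for base in segment):
            return False
    return True


def filter_and_downsample(records, regions, fraction=0.1, seed=0):
    """Keep records that are clean in the regions, then sample a fraction.

    records: list of (accession, aligned_sequence) pairs.
    The seed is a hypothetical choice that makes the sketch reproducible.
    """
    kept = [rec for rec in records if clean_in_regions(rec[1], regions)]
    rng = random.Random(seed)
    n_sample = round(len(kept) * fraction)
    return rng.sample(kept, n_sample)
```

Checking only the regions of interest, rather than the whole genome, keeps a sequence whose ambiguous bases fall elsewhere, which matches the "in the desired locations" wording.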

Predicting RNA-RNA interactions

Genome-wide RNA-RNA interactions between the Spike region (query) and the genomic RNA of SARS-CoV-2 (target) were predicted using IntaRNA [47–50]. First, the Spike region was divided into smaller segments using a sliding window of length 500 nt and overlap of 50 nt. Each segment was then used as the query by IntaRNA with search-mode parameters (--mode H --outNumber 5 --outOverlap Q). These parameters allowed for extracting the top 5 non-overlapping targets on the full genome that form thermodynamically favorable RNA-RNA base-pairing interactions with a region on the corresponding query segment. Targets that were at least 1000 nt apart from their query counterparts were subsequently kept. A similar procedure was carried out on SARS-CoV-1.
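The windowing and distance filter can be sketched in a few lines. The helper names are hypothetical, and two details are assumptions rather than statements from the text: the step between windows is taken as window length minus overlap, and "at least 1000 nt apart" is measured as the number of nucleotides strictly between the two intervals. The IntaRNA call itself is not reproduced here.

```python
def sliding_windows(start, end, length=500, overlap=50):
    """Split the 1-based inclusive range [start, end] into overlapping windows.

    Assumption: consecutive windows start (length - overlap) nt apart;
    the last window is truncated at `end`.
    """
    step = length - overlap
    windows = []
    pos = start
    while pos <= end:
        win_end = min(pos + length - 1, end)
        windows.append((pos, win_end))
        if win_end == end:
            break
        pos += step
    return windows


def gap_between(a, b):
    """Number of nucleotides strictly between two 1-based inclusive intervals."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return b_start - a_end - 1
    if b_end < a_start:
        return a_start - b_end - 1
    return 0  # the intervals touch or overlap


def is_long_range(query, target, min_gap=1000):
    """Keep a hit only if query and target are at least min_gap nt apart."""
    return gap_between(query, target) >= min_gap


print(sliding_windows(1, 1400))
# [(1, 500), (451, 950), (901, 1400)]
print(is_long_range((23660, 23703), (28025, 28060)))
# True
```

The second call uses the Spike and ORF8 intervals reported in the abstract, which are separated by several thousand nucleotides under this definition of distance.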

Different components of the RNAstructure software package [51] along with other tools were used for secondary structure predictions. Individual base-pair probabilities are according to McCaskill’s partition function [52, 53].

Compensatory mutations analysis of long-range RNA-RNA interactions

Compensatory mutations within the multiple sequence alignments were investigated using the R-scape software package [54–57], which analyzes covariation in nucleotide pairs in the population to infer possible compensatory mutations in an RNA base pair. If the consensus RNA secondary structure is not provided by the user, the software is also capable of predicting the consensus structure from the population of sequences using an implementation of the CaCoFold algorithm.

Compensatory (covarying) mutations for long-range RNA-RNA interactions were analyzed by retrieving the two sequence segments that constitute the desired RNA-RNA interaction for all downloaded SARS-CoV-2 sequences. Pairs of sequence segments were extended on each of their ends by 5nt (totaling 20nt) and concatenated. Then, the long-range RNA-RNA interacting structure was predicted by finding the consensus secondary structure within the population of sequences in the dataset using R-scape implementation of CaCoFold. The consensus structure was compared to bifold predictions for verification. Nucleotide pairs belonging to the consensus structure were then examined within the dataset for evidence of covariation using the built-in survival function that plots the distribution of base pairs with respect to their corresponding covariation scores.
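The segment preparation described above amounts to simple coordinate arithmetic. The sketch below, with hypothetical helper names, extends each interval by 5 nt on both ends and concatenates the two slices of an aligned genome; the R-scape and CaCoFold steps are not reproduced.

```python
def extend(interval, flank=5):
    """Extend a 1-based inclusive interval by `flank` nt on each end."""
    start, end = interval
    return (max(1, start - flank), end + flank)


def slice_1based(seq, interval):
    """Return the subsequence covered by a 1-based inclusive interval."""
    start, end = interval
    return seq[start - 1:end]


def interaction_segment(genome, first, second, flank=5):
    """Concatenate the two extended segments of one aligned genome."""
    return (slice_1based(genome, extend(first, flank))
            + slice_1based(genome, extend(second, flank)))


spike = (23660, 23703)
orf8 = (28025, 28060)
print(extend(spike), extend(orf8))
# (23655, 23708) (28020, 28065)
```

For the Spike-ORF8 intervals, the two original segments total 80 nt and the four 5-nt extensions add 20 nt, so each concatenated sequence is 100 nt long.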

Results

Long-range RNA-RNA base-pairing interactions were predicted between the Spike region and the full genome for both SARS-CoV-1 and SARS-CoV-2 using the IntaRNA software package (Fig 1). For each genome, the Spike region was extended by 50 nt in both directions. Spike sequence segments of length 500 nt were analyzed separately for possible long-range interactions with their corresponding genomes (See Materials and Methods for details). We considered an arbitrary maximum of five hits (the optimal interaction and another four sub-optimal interactions) for each analysis. Fig 1 shows the location of all the hits in both genomes.

Fig 1. Predicted long-range RNA-RNA base-pairing interactions between Spike and the full genomic RNA.

Fig 1

Spike sequence segments of length 500nt and overlap of 50nt were queried against the full genomes using IntaRNA software package. Each individual test resulted in at most five hits. All hits are summarized for both SARS-CoV-1 and SARS-CoV-2 (See Materials and Methods for details).

Long-range RNA-RNA predictions between Spike and the full genome are spread across almost all other genes for both SARS-CoV-1 and SARS-CoV-2 genomes. These interactions had different thermodynamic stabilities and included interacting regions as short as around 20 nt. S2 Table contains details about each hit. There were some major observations in our comparison. First, no interacting candidate was observed between the Spike and E genes for either of the strains. Second, unlike SARS-CoV-2, a considerably long segment of the SARS-CoV-1 Spike gene did not contain any prediction with the rest of the genome. In fact, the query segment of SARS-CoV-1 Spike (23,238–23,737) contained only two hits, while the other segments (on both strains) resulted in at least four long-range predictions. Specifically, the no-hit region corresponded to (23238–23698) on SARS-CoV-1. Finally, no prediction was observed between the SARS-CoV-1 Spike and ORF8 regions, while this was not true for SARS-CoV-2. As we can see in Fig 1, there are multiple hits between Spike and ORF8 for SARS-CoV-2.

There was a total of 69 long-range interactions across both viruses. Table 1 summarizes the top quantile hits. The interactions were ranked by their residual values against a generalized linear model that estimates interaction energy from interaction length. This model was chosen because the expected interaction energy is related to sequence length. The built-in function glm(energy~length, data = data, family = "gaussian") in the R programming language was used to fit the model. Length was a significant factor in the model (Pr(>|t|) for length = 0.00067), which confirmed our assumption about the impact of length on interaction energy (See Table 1 caption for model details). Residual values were used to rank the interactions, since a hit with a lower (more negative) residual is more stable than its length alone would predict.
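The ranking idea can be sketched in plain Python. A Gaussian GLM with the default identity link, as in the R call above, is fitted by ordinary least squares, so the sketch uses the closed-form least-squares line. The five hits are a subset of Table 1 rows chosen for illustration; because the authors fitted all 69 hits, the coefficients and residuals obtained here will differ from the reported ones. Function and variable names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rank_by_residual(hits):
    """Sort hits by residual (observed minus predicted energy), lowest first."""
    lengths = [h["length"] for h in hits]
    energies = [h["energy"] for h in hits]
    intercept, slope = fit_line(lengths, energies)
    scored = [
        dict(h, residual=h["energy"] - (intercept + slope * h["length"]))
        for h in hits
    ]
    return sorted(scored, key=lambda h: h["residual"])


# Illustrative subset of Table 1 (total length in nt, energy in kcal/mol).
HITS = [
    {"name": "rank 1", "length": 207, "energy": -26.29},
    {"name": "rank 2", "length": 54, "energy": -21.23},
    {"name": "rank 11", "length": 80, "energy": -18.07},
    {"name": "rank 16", "length": 22, "energy": -15.37},
    {"name": "rank 18", "length": 68, "energy": -16.67},
]

for hit in rank_by_residual(HITS):
    print(hit["name"], round(hit["residual"], 3))
```

Ranking on residuals rather than raw energies avoids favoring long interactions simply because they contain more base pairs.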

Table 1. Top quantile predicted long-range RNA-RNA base-pairing interactions between the Spike region and the full genome for both SARS-CoV-1 and SARS-CoV-2 using the IntaRNA software package.

| Rank | SARS-CoV | Hit Start | Hit End | Target Start | Target End | Total Length (nt) | Energy (kcal/mol) | Residual | Target Gene |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 21639 | 21750 | 12261 | 12355 | 207 | -26.29 | -7.3397887 | ORF1a |
| 2 | 1 | 22604 | 22631 | 12507 | 12532 | 54 | -21.23 | -7.1602061 | ORF1a |
| 3 | 2 | 24114 | 24157 | 5367 | 5402 | 80 | -19.93 | -5.0308541 | ORF1a |
| 4 | 1 | 24396 | 24414 | 25582 | 25602 | 40 | -18.19 | -4.5667802 | ORF3a |
| 5 | 1 | 24841 | 24877 | 2247 | 2288 | 79 | -19.26 | -4.3927522 | ORF1a |
| 6 | 1 | 25198 | 25239 | 17014 | 17053 | 82 | -19.1 | -4.1370578 | ORF1b |
| 7 | 2 | 24084 | 24114 | 17012 | 17046 | 66 | -18.4 | -3.9474282 | ORF1b |
| 8 | 1 | 23698 | 23734 | 26957 | 27000 | 81 | -18.87 | -3.9389559 | M |
| 9 | 2 | 23271 | 23295 | 19401 | 19423 | 48 | -17.4 | -3.521595 | ORF1b |
| 10 | 2 | 22846 | 22862 | 18954 | 18970 | 34 | -16.92 | -3.4881691 | ORF1b |
| 11* | 2 | 23660 | 23703 | 28025 | 28060 | 80 | -18.07 | -3.1708541 | ORF8 |
| 12 | 2 | 22303 | 22337 | 24984 | 25023 | 75 | -17.79 | -3.0503449 | Spike |
| 13 | 2 | 24984 | 25023 | 22303 | 22337 | 75 | -17.79 | -3.0503449 | Spike |
| 14 | 1 | 21523 | 21559 | 2321 | 2354 | 71 | -17.53 | -2.9179375 | ORF1a |
| 15 | 2 | 25331 | 25358 | 18595 | 18620 | 54 | -16.79 | -2.7202061 | ORF1b |
| 16 | 1 | 24667 | 24677 | 24104 | 24114 | 22 | -15.37 | -2.320947 | Spike |
| 17 | 2 | 24648 | 24660 | 19153 | 19165 | 26 | -15.44 | -2.2633543 | ORF1b |
| 18 | 1 | 22530 | 22558 | 20039 | 20077 | 68 | -16.67 | -2.1536319 | ORF1b |

See Materials and Methods for details. There was a total of 69 independent hits across both genomes. Complete results are included as S2 Table. Column SARS-CoV denotes the strain. Column Total Length denotes the length of the interacting regions (query + target). Ranking is according to residual values against the generalized linear model in which length of interaction was used to estimate interaction energy. The built-in function glm(energy~length, data = data, family = "gaussian") in the R programming language was used to fit the model. Length coefficient = -0.03190. Length was a significant factor in the model (Pr(>|t|) for length = 0.00067). Median of residuals = -0.2287. First quartile (1Q) of residuals = -2.1536. SARS-CoV-2 hits are shown in bold. Rank 11, marked with *, denotes the SARS-CoV-2 Spike-ORF8 interaction.

Focusing only on SARS-CoV-2 hits, the top hit corresponds to the beginning of the Spike gene. In fact, the interaction overlaps with the upstream region of Spike. Interestingly, the second and third top hits are exactly adjacent to each other on the Spike region. The query regions shown in rank 3 and rank 7 are (24114–24157 Spike) and (24084–24114 Spike), and they interact with their corresponding regions on ORF1a and ORF1b, respectively. Base-pair level interaction details for the top three interactions can be found in S1 Fig. From amongst the predicted interactions, we decided to focus on further investigating the major hit between Spike and ORF8 of SARS-CoV-2. This rather qualitative choice was based on the following: as mentioned before, we were primarily interested in novel interactions, and ORF8 was not observed to contain any long-range interaction in SARS-CoV-1. In addition, as will be explained in the next section, the above hit is within the top quantile predictions (Table 1) and is not sensitive to the top-5-hit choice of cut-off (data not shown).

Spike-ORF8 RNA-RNA interaction

The highest-ranking interaction between Spike and ORF8 appears as the 11th top hit out of a total of 69, under a generalized linear model that estimates interaction energy from the sum of lengths of the interacting sequences. It is also the 6th top hit within SARS-CoV-2. Base-pairing interactions between SARS-CoV-2 Spike and ORF8 are shown in Fig 2. Intervals (23660–23703 Spike) and (28025–28060 ORF8) consist of a total of 80 nt and have a stabilizing energy of -18.07 kcal/mol. Fig 2 shows the individual base pairs of the above hit, denoted here as the Spike-ORF8 interaction. Pairs shown by the ‘+’ symbol point to stable sub-interactions and are thus likely to be starting points of the full long-range RNA-RNA interaction (predictions according to IntaRNA). This sub-interaction is shown within the red rectangle in Fig 2 and denoted as the Core interacting region.

Fig 2. Long-range RNA-RNA interaction between Spike and ORF8 regions of SARS-CoV-2 genome.

Fig 2

Interacting intervals are (23660–23703 Spike) and (28025–28060 ORF8). Prediction was done with the IntaRNA software. Base pairs with ‘plus’ notation denote stable sub-interactions. The stable sub-interaction is shown within the red rectangle and denoted as the Core interacting region: (23679–23690 Spike) and (28031–28042 ORF8).

The predicted Spike-ORF8 interaction was analyzed for compensatory mutations. Sequence segments were extended by 5 nt on each end to avoid unwanted base-pairing in the consensus structure prediction. The resulting intervals were (23655–23708 Spike) and (28020–28065 ORF8). A total of 206,745 sequence segments, each corresponding to a particular viral sequence, were used for the analysis. Sequences were a down-sampled selection of nearly two million SARS-CoV-2 sequences (See Materials and Methods for details). No significantly covarying mutations were detected by R-scape. Table 2 shows the coordinates of all base pairs for which variation was observed. Column Power, an output of the R-scape software, denotes the statistical power of substitutions.

Table 2. Coordinates of interacting base pairs between (23660–23703 Spike) and (28025–28060 ORF8) for which nucleotide variations were observed.

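| Spike | ORF8 | Power |
|---|---|---|
| 23660 | 28060 | 0 |
| 23661 | 28059 | 0 |
| 23662 | 28058 | 0 |
| 23663 | 28057 | 0.08 |
| 23664 | 28056 | 0.11 |
| 23671 | 28050 | 0 |
| 23672 | 28049 | 0 |
| 23673 | 28048 | 0.39 |
| 23674 | 28046 | 0 |
| 23675 | 28045 | 0.05 |
| 23676 | 28044 | 0.04 |
| 23677 | 28043 | 0 |
| 23679 | 28042 | 0 |
| 23680 | 28041 | 0.01 |
| 23681 | 28040 | 0 |
| 23682 | 28039 | 0 |
| 23683 | 28038 | 0 |
| 23684 | 28037 | 0 |
| 23685 | 28036 | 0 |
| 23686 | 28035 | 0 |
| 23687 | 28034 | 0 |
| 23688 | 28033 | 0.01 |
| 23689 | 28032 | 0 |
| 23690 | 28031 | 0 |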

Total number of sequences was 206,745. Column Power is an output of the R-scape software that is proportional to the statistical power of substitutions. Mutations at coordinates in black bold are shown in Fig 3. Base pairs within the Core interacting region (Fig 2) are shown in cells shaded red.

Interestingly, comparing Fig 2 and Table 2, we can see that the base pairs within the Core interacting region also show lower variation in the population of sequences than the other predicted base pairs of the interaction. They correspond to the Table 2 rows from 23679–28042 through 23690–28031.

Fig 3 illustrates the individual pairing configurations within the RNA-RNA interaction. Results are according to the consensus structure prediction algorithm CaCoFold built into the R-scape software. The two structure prediction methods, IntaRNA (thermodynamic long-range) and CaCoFold (consensus structure), gave consistent results for most base pairs, including those in the Core interacting region. The first part of the interaction, however, is predicted by IntaRNA but not by CaCoFold. The coordinates of this region are (23698–23703 Spike) and (28025–28030 ORF8). The four base pairs with the highest number of mutations are shown in bold black in Fig 3. The most frequently observed mutation was G28048U ORF8, with 36,366 occurrences in a total of 206,745 viral sequences. This mutation does not support the predicted interaction. Mutation C23664U Spike was observed 2,207 times and was the second most frequent. This mutation accommodates the Spike-ORF8 interaction. Adjacent to this base pair, mutation G28045U ORF8 with frequency 329 also accommodates the interaction stability. The fourth most frequent mutation was C28045U ORF8. It was observed 329 times and also accommodated the predicted Spike-ORF8 interaction. Base pairs falling within the Core interacting region are shown in cells shaded red in Table 2. Sequence variation in almost all these base pairs is zero. The flanking sequences on both ends of the interaction did not form any base pairs with each other, as expected from the IntaRNA results.
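Whether a single substitution is compatible with a predicted base pair can be checked with a small lookup. This sketch assumes that "accommodating" the interaction means the mutated base can still form a Watson-Crick or G-U wobble pair with its partner; that reading, and the function names, are assumptions rather than the authors' stated criterion. The example uses C28045U, whose predicted partner is reported elsewhere in the article as G23675.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair (RNA alphabet).
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def can_pair(x, y):
    """True if bases x and y can form a canonical or wobble pair."""
    return (x.upper(), y.upper()) in PAIRS


def mutation_effect(ref_base, alt_base, partner_base):
    """Return (paired_before, paired_after) for a substitution."""
    return can_pair(ref_base, partner_base), can_pair(alt_base, partner_base)


# C28045U with partner G23675: a C-G pair becomes a U-G wobble pair.
print(mutation_effect("C", "U", "G"))
# (True, True)
```

A substitution that turns a paired position into a non-pairing combination would return `(True, False)` and would count against the predicted interaction under this criterion.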

Fig 3. Consensus structure of the predicted Spike-ORF8 RNA-RNA interaction.

Fig 3

RNA-RNA interaction coordinates were (23660–23703 Spike) and (28025–28060 ORF8). Total number of sequences was 206,745. Number of mutations observed for four locations with highest power are shown. The Core interacting region is shown by the transparent red rectangle.

Local RNA analysis in Spike

The local stability of RNA structure in the vicinity of (23660–23703 Spike) was evaluated and compared to its SARS-CoV-1 counterpart. The original interval was extended by 100 nt in both directions on the SARS-CoV-2 genome, resulting in region (23560–23803 Spike). The region of SARS-CoV-1 that aligned with the above selection was selected for comparison (23447–23650 Spike). S2 and S3 Figs show the base pair probabilities for both SARS-CoV-2 (23560–23803 Spike) and its corresponding region in SARS-CoV-1 (23447–23650 Spike). Base pairs colored in red are those with a higher likelihood of forming. As we can see, there are major differences in the base-pairing probability patterns between the two sequences. The black bar shows the approximate location of the Spike-ORF8 interaction. This location seems to contain many bases that can form local base pairs. The corresponding location on SARS-CoV-1, for which no long-range interaction with ORF8 was observed, seems to have relatively fewer locally stable base pairs (comparing red base pairs between S2 and S3 Figs). This observation was also true for another arbitrary selection of sequence segments. Overall, the region of Spike that is predicted to base pair with ORF8 also tends to form a local structure, which seems to be mutually exclusive with the ORF8 interaction.

Discussion

The Spike region of SARS-CoV-2 RNA was investigated for novel genomic long-range RNA-RNA interaction. Fragment-based in-silico predictions were performed on the reference sequence and compared to those for the reference sequence of SARS-CoV-1 that was responsible for the 2002–2003 outbreak.

The predictions were inclusive and made in favor of more sub-optimal but diverse results. They provide a collection of top non-overlapping candidate regions on the reference sequences that can potentially form thermodynamically favorable RNA-RNA base pairing with a sub-region on their corresponding Spike (Fig 1). We found RNA structural differences between corresponding regions in SARS-CoV-2 and SARS-CoV-1. It is worth noting, however, that the cut-off for the number of stored interactions was chosen arbitrarily, which implies there may be more predictions that are not included in Fig 1.

Top interacting regions were ranked according to their relative thermodynamic stabilities with regard to total length of interaction, using a generalized linear model. Table 1 shows the top quantile of results (See S2 Table for full results). Some of the predictions are as follows. The strongest interactions on the SARS-CoV-2 genomic RNA were between Spike and ORF1ab. A region at the beginning of SARS-CoV-2 Spike (21639–21750) formed an interaction with a region on ORF1ab (12261–12355) with a predicted free energy of -26.29 kcal/mol, the most stable across both viruses. Details about the base-pairing interactions of the top three hits are presented in S1 Fig. The second and third strongest predictions on SARS-CoV-2 also involved ORF1ab but formed a continuous region on the side of Spike. Regions (24114–24157 Spike) and (24084–24114 Spike) intersect at position 24114 but interact with distant regions on ORF1ab, namely the beginning (5367–5402) and the middle (17012–17046), respectively. This observation is notable: IntaRNA allowed interactions to overlap on Spike, yet these two share only a single nucleotide on Spike (See Table 1, rows 3 and 7 for details). Another interesting observation was that a long region of SARS-CoV-1 Spike (23238–23698), around 460 nt, did not form any predicted long-range RNA-RNA interaction with any part of the genome, despite the software’s flexibility to allow for sub-optimal hits. This lack of predictions was not observed in SARS-CoV-2 Spike.

Most genes and annotated regions contained several interacting regions with the Spike gene in both reference genomes, SARS-CoV-1 and SARS-CoV-2 (comparing Fig 1A and 1B). The rankings of base-pairing strength, however, were dramatically different between corresponding genes. For instance, the strongest interaction between Spike and M ranked 8th in SARS-CoV-1, while it dropped to 40th for SARS-CoV-2 (See S2 Table). The only gene that did not contain any predictions was the E gene. No thermodynamically stable interacting candidate was observed on either the SARS-CoV-1 or the SARS-CoV-2 reference genome.

SARS-CoV-2 contained a few regions that can potentially form long-range RNA-RNA interactions between the Spike and ORF8 regions at different locations (Fig 1B, red links), while SARS-CoV-1 did not contain any. The most stable of these interactions fell within the first quantile of results (ranking 11). Regions (23660–23703 Spike) and (28025–28060 ORF8) formed an RNA-RNA interaction with a free energy of -18.07 kcal/mol. While the other results were interesting and worth further investigation, our focus was further analysis of the above Spike-ORF8 interaction, due to the strong gene-based contrast observed between SARS-CoV-1 and SARS-CoV-2. A sub-interacting region (23679–23690 Spike) and (28031–28042 ORF8) within the above interval was predicted to have a higher likelihood of forming thermodynamically stable base pairs, denoted here as the Core interacting region (Fig 2).

The population of SARS-CoV-2 sequences was analyzed for signs of sequence co-variation that might validate the above Spike-ORF8 RNA-RNA base-pairing interaction. From amongst the nearly two million sequences, 206,745 (roughly 10%) were randomly selected for the analysis, due to limitations in computational complexity. The aligned SARS-CoV-2 sequences were investigated for compensatory mutations that might occur within and between the Spike-ORF8 binding locations. Although no significantly covarying mutations were observed, the positions of polymorphisms were in support of the in-silico results. Interestingly, the Core interacting region was observed to tolerate fewer mutations (location shown in bold red, Table 2). The lower variance in the more stable base pairs is in support of the Spike-ORF8 RNA-RNA interaction. Other regions of the interaction either had higher variation or did not even appear in the consensus structure predicted by CaCoFold. The integration of thermodynamic-based predictions and sequence variation identifies a region (Core region) for the predicted Spike-ORF8 RNA-RNA interaction.

Observed mutations within the interacting region, however, had conflicting implications, with some such as C28045U ORF8, G28048U ORF8, C23664U Spike being in favor of the interactions and some such as G28048U not accommodating base pairing (Fig 3). Being a synonymous mutation, C28045U has been previously identified as one of the polymorphic positions of ORF8 [58]. In the mentioned work, in the local RNA secondary structure prediction of ORF8, C28045U is unpaired, while in the predicted long-range RNA-RNA interaction with Spike, it pairs with G23675. The C28045U variation, hence, is suggestive of the long-range Spike-ORF8 interaction. Further investigation of the above set of mutations along the evolutionary trajectory of the virus is needed for a more comprehensive conclusion about their possible roles. In addition, since the data were filtered and aligned to contain no long insertions, deletions, or ambiguous nucleotides, certain meaningful sequence variations might not have been accounted for in the analysis.

Local RNA structure analyses on the Spike region suggest an increase in locally stable RNA structures in the vicinity of the Spike-ORF8 interaction. There is a conserved RNA stem-loop, namely S1, which has been previously found in SARS-CoV-1 sequences [42]. This stem is roughly 30 nt upstream of the Spike-ORF8 interaction, and its stability was confirmed by different in-silico programs in both SARS-CoV-1 and SARS-CoV-2 sequences. Immediately upstream of the conserved stem, there is the high-GC-content 12-nt insert in the Spike region (23603–23614), which is present in SARS-CoV-2 but absent in SARS-CoV-1. The insert is roughly 50 nt upstream of the predicted Spike-ORF8 interaction. Given the above comparisons to SARS-CoV-1, it seems that this region of Spike is undergoing local RNA structural changes as well as having affinity to form a long-range interaction with ORF8.

Locally stable RNA base pairs and the long-range Spike-ORF8 base-pairing interactions are mutually exclusive. Base pair probability distributions of corresponding regions on Spike in both SARS-CoV-1 and SARS-CoV-2 reveal that the same nucleotides that can pair with ORF8 are also likely to form local base pairs within Spike (Fig 3, red base pairs falling under the bar). Ironically, the corresponding region on SARS-CoV-1, for which there were no signs of long-range interaction with ORF8, is observed to have less deterministic local base-pairing probabilities (comparing S2 and S3 Figs, range indicated by black bar). One possibility is that a complex RNA structure may be emerging within the specified region of Spike in SARS-CoV-2 that can form an RNA-RNA interaction with ORF8 at certain times and avoid the interaction at others. Whether the predicted long-range Spike-ORF8 interaction is in competition or cooperation with other local elements of Spike, such as the 12-nt polybasic insert in SARS-CoV-2, is subject to speculation about in-vivo conformational specifics.

Given our methodology, it cannot be inferred whether the predicted Spike-ORF8 RNA-RNA interaction could form in the genomic RNA, within a sub-genomic RNA, or even between two different sub-genomic RNAs, since only small fragments of sequences were effectively considered in our predictions. An interesting possibility is the genomic scenario, where the hypothesized interaction can potentially impact template switch during negative-strand synthesis. Template switch in beta-coronaviruses might occur if the TRS element downstream of the 5’UTR is in proximity to the TRS element immediately upstream of a viral gene. Such a complex genomic conformation may involve other RNA-RNA interactions as mediators. The dE-pE interaction (Fig 2 of [16]) provides such mediator RNA binding locations, facilitating discontinuous negative-strand synthesis of the viral genome and leading to the N-gene sub-genomic RNA. The coronavirus nucleocapsid (N) is known to be a structural protein that forms complexes with genomic RNA, interacts with the viral membrane protein during virion assembly, and plays a critical role in enhancing the efficiency of virus transcription and assembly [16]. The predicted Spike-ORF8 interaction here is 200 nt upstream of the N-gene TRS [58]. Although the high-order RNA-RNA interactions needed for template switch can be more complex and may involve the 5’UTR as well, the predicted Spike-ORF8 interaction could be acting as an additional mediator step that brings the TRS elements of the 5’UTR and the coronavirus N-gene closer to each other. It could be speculated that the Spike-ORF8 interaction takes part in regulating sub-genomic RNA production. Since the first gene downstream of the Spike-ORF8 interaction happens to be the N-gene, the binding location might be affecting N-gene sub-genomic RNA production.

Amongst coronaviruses, ORF8 is a rapidly evolving hypervariable gene that undergoes deletions, possibly to adapt to the human host [58–61]. It has also been previously observed that patients infected with SARS-CoV-2 variants with a 382-nucleotide deletion (Δ382) in ORF8 had milder symptoms [62]. In addition, ORF8 contains RNA structural features [58]. While the milder symptoms may very well be due to the absence of the translated protein, the RNA structural characteristics of ORF8 in the genome may also play a role in the viral life cycle, making the predicted long-range RNA-RNA interaction with Spike a less remote possibility. A comprehensive exploration of the predicted Spike-ORF8 interaction amongst SARS-CoV-2 variants, together with an evaluation of the corresponding sub-genomic RNA production rates of these variants, may lead to further clues about the predicted long-range Spike-ORF8 RNA-RNA interaction, which could be rewarding for therapeutic purposes.

Supporting information

S1 Fig. Top three interacting regions with SARS-CoV-2 Spike.

The corresponding rankings of the hits are also included. A generalized linear model was used to rank hits with the highest interaction energy relative to interaction length (Table 1).

(TIF)

S2 Fig. Base pair probabilities for aligned segments of SARS-CoV-1 Spike.

Probabilities were calculated using McCaskill’s partition function [51, 52]. Coordinates: SARS-CoV-1 (23447–23650 Spike). The SARS-CoV-2 Spike/ORF8 region (23660–23703) was mapped onto SARS-CoV-1, with coordinates SARS-CoV-1 (23534–23575 Spike), and extended in both directions by 100 nt. The resulting region is highlighted as a black bar.

(TIF)

S3 Fig. Base pair probabilities for aligned segments of SARS-CoV-2 Spike.

Probabilities were calculated using McCaskill’s partition function [51, 52]. Coordinates: SARS-CoV-2 (23560–23803 Spike). The Spike/ORF8 region (23660–23703), shown as a black bar, was extended in both directions by 100 nt.

(TIF)

S1 Table. GISAID accession numbers.

Accession numbers of the 206,745 SARS-CoV-2 sequences used in the study.

(TXT)

S2 Table. Ranking of predicted RNA-RNA base-pairing interactions.

Predicted long-range RNA-RNA base-pairing interactions between the Spike region and the full genome for both SARS-CoV-1 and SARS-CoV-2, using the IntaRNA software package. See Materials and Methods for details. There was a total of 69 independent hits across both genomes. Column SARS-CoV denotes the strain. Column TotalLength denotes the length of the interacting regions (query + target). Ranking is according to residual values against the generalized linear model, where length of interaction was used to estimate interaction energy. The built-in function glm(energy ~ length, data = data, family = "gaussian") in the R programming language was used to fit the model. Length coefficient = -0.03190. Length was a significant factor in the model (Pr(>|t|) for length = 0.00067). Median of residuals = -0.2287. First quartile (1Q) of residuals = -2.1536. SARS-CoV-2 hits are shown in bold. Rank 11, also marked with *, denotes the SARS-CoV-2 Spike-ORF8 interaction.

(CSV)
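The residual-based ranking described for S2 Table can be sketched as follows. A Gaussian generalized linear model with the default identity link is equivalent to ordinary least squares, so this Python sketch fits energy against length in closed form and ranks hits by residual. The data points and function names are hypothetical, and sorting from the most negative residual upward (hits more stable than expected for their length come first) is an assumption about the ranking direction. This is not the R code or the hits used in the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rank_by_residual(lengths, energies):
    """Return residuals and hit indices sorted by ascending residual."""
    intercept, slope = fit_line(lengths, energies)
    residuals = [e - (intercept + slope * l) for l, e in zip(lengths, energies)]
    order = sorted(range(len(residuals)), key=lambda i: residuals[i])
    return residuals, order

# Hypothetical hits: total interaction length (nt) and energy (kcal/mol).
lengths = [20, 40, 60, 80]
energies = [-10.0, -16.0, -14.0, -20.0]
residuals, order = rank_by_residual(lengths, energies)
```

Here `order[0]` is the index of the hit whose energy lies furthest below the fitted line, that is, the hit that is most stable relative to its length in this toy data set.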

Data Availability

Yes, all data are fully available without restriction. S1 Table provides GISAID accession numbers for sequences used in this study. The sequences are publicly available in the GISAID database for download.

Funding Statement

This research was funded by the National Institute of General Medical Sciences of the NIH under Award Number R01GM123314 (SCJ), Indo-U.S. Science and Technology Forum under Award Number IUSSTF/VN-COVID/005/2020 (SCJ) and IUPUI's Office of the Vice Chancellor for Research COVID-19 Rapid Response Grant (SCJ). There are no grant numbers for funding received from within IU. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The two authors Sarath Chandra Janga and Okiemute B. Omoru were supported by the R01GM123314 grant number from NIH.

References

  • 1. Hu B., et al., Characteristics of SARS-CoV-2 and COVID-19. Nat Rev Microbiol, 2020.
  • 2. Stang A., et al., Excess mortality due to COVID-19 in Germany. J Infect, 2020. 81(5): p. 797–801. doi: 10.1016/j.jinf.2020.09.012
  • 3. Cui J., Li F., and Shi Z.L., Origin and evolution of pathogenic coronaviruses. Nat Rev Microbiol, 2019. 17(3): p. 181–192. doi: 10.1038/s41579-018-0118-9
  • 4. Phan M.V.T., et al., Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved protein domains. Virus Evol, 2018. 4(2): p. vey035. doi: 10.1093/ve/vey035
  • 5. Su S., et al., Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses. Trends Microbiol, 2016. 24(6): p. 490–502. doi: 10.1016/j.tim.2016.03.003
  • 6. Wong L.Y., Lui P.Y., and Jin D.Y., A molecular arms race between host innate antiviral response and emerging human coronaviruses. Virol Sin, 2016. 31(1): p. 12–23. doi: 10.1007/s12250-015-3683-3
  • 7. Malik Y.A., Properties of Coronavirus and SARS-CoV-2. Malays J Pathol, 2020. 42(1): p. 3–11.
  • 8. Chen Y., Liu Q., and Guo D., Emerging coronaviruses: Genome structure, replication, and pathogenesis. J Med Virol, 2020. 92(4): p. 418–423. doi: 10.1002/jmv.25681
  • 9. Kim D., et al., The Architecture of SARS-CoV-2 Transcriptome. Cell, 2020. 181(4): p. 914–921.e10. doi: 10.1016/j.cell.2020.04.011
  • 10. Yao H., et al., Molecular Architecture of the SARS-CoV-2 Virus. Cell, 2020. 183(3): p. 730–738.e13. doi: 10.1016/j.cell.2020.09.018
  • 11. V’Kovski P., et al., Coronavirus biology and replication: implications for SARS-CoV-2. Nat Rev Microbiol, 2021. 19(3): p. 155–170. doi: 10.1038/s41579-020-00468-6
  • 12. Tortorici M.A. and Veesler D., Structural insights into coronavirus entry. Adv Virus Res, 2019. 105: p. 93–116. doi: 10.1016/bs.aivir.2019.08.002
  • 13. Li F., Structure, Function, and Evolution of Coronavirus Spike Proteins. Annu Rev Virol, 2016. 3(1): p. 237–261. doi: 10.1146/annurev-virology-110615-042301
  • 14. Letko M., Marzi A., and Munster V., Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat Microbiol, 2020. 5(4): p. 562–569. doi: 10.1038/s41564-020-0688-y
  • 15. Shin J.S., et al., Saracatinib Inhibits Middle East Respiratory Syndrome-Coronavirus Replication In Vitro. Viruses, 2018. 10(6). doi: 10.3390/v10060283
  • 16. Sola I., et al., Continuous and Discontinuous RNA Synthesis in Coronaviruses. Annual Review of Virology, 2015. 2(1): p. 265–288. doi: 10.1146/annurev-virology-100114-055218
  • 17. Snijder E.J., et al., A unifying structural and functional model of the coronavirus replication organelle: Tracking down RNA synthesis. PLoS biology, 2020. 18(6): p. e3000715–e3000715. doi: 10.1371/journal.pbio.3000715
  • 18. Nicholson B.L. and White K.A., Functional long-range RNA–RNA interactions in positive-strand RNA viruses. Nature Reviews Microbiology, 2014. 12(7): p. 493–504. doi: 10.1038/nrmicro3288
  • 19. Chkuaseli T. and White K.A., Intragenomic Long-Distance RNA–RNA Interactions in Plus-Strand RNA Plant Viruses. Frontiers in Microbiology, 2018. 9. doi: 10.3389/fmicb.2018.00529
  • 20. Ziv O., et al., The Short- and Long-Range RNA-RNA Interactome of SARS-CoV-2. Molecular Cell, 2020. 80(6): p. 1067–1077.e5. doi: 10.1016/j.molcel.2020.11.004
  • 21. Woo P.C.Y., et al., Coronavirus genomics and bioinformatics analysis. Viruses, 2010. 2(8): p. 1804–1820. doi: 10.3390/v2081803
  • 22. Baric R.S., et al., Establishing a genetic recombination map for murine coronavirus strain A59 complementation groups. Virology, 1990. 177(2): p. 646–56. doi: 10.1016/0042-6822(90)90530-5
  • 23. Singh J., et al., Evolutionary trajectory of SARS-CoV-2 and emerging variants. Virology Journal, 2021. 18(1): p. 166. doi: 10.1186/s12985-021-01633-w
  • 24. Vijgen L., et al., Genetic variability of human respiratory coronavirus OC43. Journal of virology, 2005. 79(5): p. 3223–3225. doi: 10.1128/JVI.79.5.3223-3225.2005
  • 25. Sánchez C.M., et al., Genetic evolution and tropism of transmissible gastroenteritis coronaviruses. Virology, 1992. 190(1): p. 92–105. doi: 10.1016/0042-6822(92)91195-z
  • 26. Li X., et al., Transmission dynamics and evolutionary history of 2019-nCoV. J Med Virol, 2020. 92(5): p. 501–511. doi: 10.1002/jmv.25701
  • 27. Collier D.A., et al., SARS-CoV-2 B.1.1.7 sensitivity to mRNA vaccine-elicited, convalescent and monoclonal antibodies. medRxiv, 2021. doi: 10.1101/2021.01.19.21249840
  • 28. Chen Z., et al., Genomic and evolutionary comparison between SARS-CoV-2 and other human coronaviruses. J Virol Methods, 2021. 289: p. 114032. doi: 10.1016/j.jviromet.2020.114032
  • 29. Ortiz-Fernández L. and Sawalha A.H., Genetic variability in the expression of the SARS-CoV-2 host cell entry factors across populations. Genes Immun, 2020. 21(4): p. 269–272. doi: 10.1038/s41435-020-0107-7
  • 30. Tai W., et al., Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine. Cell Mol Immunol, 2020. 17(6): p. 613–620. doi: 10.1038/s41423-020-0400-4
  • 31. Wan Y., et al., Receptor Recognition by the Novel Coronavirus from Wuhan: an Analysis Based on Decade-Long Structural Studies of SARS Coronavirus. J Virol, 2020. 94(7). doi: 10.1128/JVI.00127-20
  • 32. Walls A.C., et al., Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell, 2020. 181(2): p. 281–292.e6. doi: 10.1016/j.cell.2020.02.058
  • 33. Cosar B., et al., SARS-CoV-2 Mutations and their Viral Variants. Cytokine & Growth Factor Reviews, 2021. doi: 10.1016/j.cytogfr.2021.06.001
  • 34. Zhang S., et al., RNA–RNA interactions between SARS-CoV-2 and host benefit viral development and evolution during COVID-19 infection. Briefings in Bioinformatics, 2021. 23(1).
  • 35. Andrews R.J., et al., An in silico map of the SARS-CoV-2 RNA Structurome. bioRxiv, 2020: p. 2020.04.17.045161. doi: 10.1101/2020.04.17.045161
  • 36. Bartas M., et al., In-Depth Bioinformatic Analyses of Nidovirales Including Human SARS-CoV-2, SARS-CoV, MERS-CoV Viruses Suggest Important Roles of Non-canonical Nucleic Acid Structures in Their Lifecycles. Frontiers in microbiology, 2020. 11: p. 1583–1583. doi: 10.3389/fmicb.2020.01583
  • 37. Lan T.C.T., et al., Structure of the full SARS-CoV-2 RNA genome in infected cells. bioRxiv, 2020: p. 2020.06.29.178343.
  • 38. Simmonds P., Pervasive RNA Secondary Structure in the Genomes of SARS-CoV-2 and Other Coronaviruses. mBio, 2020. 11(6): p. e01661–20. doi: 10.1128/mBio.01661-20
  • 39. Manfredonia I., et al., Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeutically-relevant elements. Nucleic Acids Research, 2020. 48(22): p. 12436–12452. doi: 10.1093/nar/gkaa1053
  • 40. Rouskin S., et al., Insights into the secondary structural ensembles of the full SARS-CoV-2 RNA genome in infected cells. 2021, Research Square.
  • 41. Huston N.C., et al., Comprehensive in vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms. Molecular Cell, 2021. 81(3): p. 584–598.e5. doi: 10.1016/j.molcel.2020.12.041
  • 42. Rangan R., Zheludev I.N., and Das R., RNA genome conservation and secondary structure in SARS-CoV-2 and SARS-related viruses. bioRxiv: the preprint server for biology, 2020: p. 2020.03.27.012906.
  • 43. Hadfield J., et al., Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 2018. 34(23): p. 4121–4123. doi: 10.1093/bioinformatics/bty407
  • 44. Elbe S. and Buckland-Merrett G., Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 2017. 1(1): p. 33–46. doi: 10.1002/gch2.1018
  • 45. Shu Y. and McCauley J., GISAID: Global initiative on sharing all influenza data – from vision to reality. Euro surveillance: bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin, 2017. 22(13): p. 30494.
  • 46. Khare S., et al., GISAID’s Role in Pandemic Response. China CDC weekly, 2021. 3(49): p. 1049–1051. doi: 10.46234/ccdcw2021.255
  • 47. Mann M., Wright P.R., and Backofen R., IntaRNA 2.0: enhanced and customizable prediction of RNA–RNA interactions. Nucleic Acids Research, 2017. 45(W1): p. W435–W439. doi: 10.1093/nar/gkx279
  • 48. Wright P.R., et al., CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains. Nucleic Acids Research, 2014. 42(W1): p. W119–W123. doi: 10.1093/nar/gku359
  • 49. Busch A., Richter A.S., and Backofen R., IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics, 2008. 24(24): p. 2849–2856. doi: 10.1093/bioinformatics/btn544
  • 50. Raden M., et al., Freiburg RNA tools: a central online resource for RNA-focused research and teaching. Nucleic Acids Research, 2018. 46(W1): p. W25–W29. doi: 10.1093/nar/gky329
  • 51. Reuter J.S. and Mathews D.H., RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 2010. 11(1): p. 129. doi: 10.1186/1471-2105-11-129
  • 52. Mathews D.H., Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA (New York, N.Y.), 2004. 10(8): p. 1178–1190. doi: 10.1261/rna.7650904
  • 53. McCaskill J.S., The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 1990. 29(6-7): p. 1105–1119. doi: 10.1002/bip.360290621
  • 54. Rivas E., RNA structure prediction using positive and negative evolutionary information. bioRxiv, 2020: p. 2020.02.04.933952. doi: 10.1371/journal.pcbi.1008387
  • 55. Rivas E., Clements J., and Eddy S.R., A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nature Methods, 2017. 14(1): p. 45–48. doi: 10.1038/nmeth.4066
  • 56. Rivas E., Clements J., and Eddy S.R., Estimating the power of sequence covariation for detecting conserved RNA structure. Bioinformatics, 2020. 36(10): p. 3072–3076. doi: 10.1093/bioinformatics/btaa080
  • 57. Rivas E. and Eddy S.R., Response to Tavares et al., “Covariation analysis with improved parameters reveals conservation in lncRNA structures”. bioRxiv, 2020: p. 2020.02.18.955047.
  • 58. Pereira F., Evolutionary dynamics of the SARS-CoV-2 ORF8 accessory gene. Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, 2020. 85: p. 104525–104525. doi: 10.1016/j.meegid.2020.104525
  • 59. Zinzula L., Lost in deletion: The enigmatic ORF8 protein of SARS-CoV-2. Biochemical and biophysical research communications, 2021. 538: p. 116–124. doi: 10.1016/j.bbrc.2020.10.045
  • 60. Su Y.C.F., et al., Discovery and Genomic Characterization of a 382-Nucleotide Deletion in ORF7b and ORF8 during the Early Evolution of SARS-CoV-2. mBio, 2020. 11(4): p. e01610–20. doi: 10.1128/mBio.01610-20
  • 61. Gong Y.N., et al., SARS-CoV-2 genomic surveillance in Taiwan revealed novel ORF8-deletion mutant and clade possibly associated with infections in Middle East. Emerg Microbes Infect, 2020. 9(1): p. 1457–1466. doi: 10.1080/22221751.2020.1782271
  • 62. Young B.E., et al., Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study. The Lancet, 2020. 396(10251): p. 603–611.

Decision Letter 0

Danny Barash

13 Dec 2021

PONE-D-21-35236

Evidence for a long-range RNA-RNA interaction between ORF8 and the downstream region of the Spike polybasic insertion of SARS-CoV-2

PLOS ONE

Dear Dr. Manzourolajdad,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The revised manuscript should address all the critical points raised by all reviewers.

Please submit your revised manuscript by Jan 27 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Danny Barash

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. We note that you have included the phrase “data not shown” in your manuscript. Unfortunately, this does not meet our data sharing requirements. PLOS does not permit references to inaccessible data. We require that authors provide all relevant data within the paper, Supporting Information files, or in an acceptable, public repository. Please add a citation to support this phrase or upload the data that corresponds with these findings to a stable repository (such as Figshare or Dryad) and provide any URLs, DOIs, or accession numbers that may be used to access these data. Or, if the data are not a core part of the research being presented in your study, we ask that you remove the phrase that refers to these data.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: # Review for PONE-D-21-35236

Evidence for a long-range RNA-RNA interaction between ORF8 and the downstream region of the Spike polybasic insertion of SARS-CoV-2

Filipe Pereira and Amirhossein Manzourolajdad

# Summary

The article summarizes the investigation of potential long-range RNA-RNA interactions (RRIs) within the SARS-CoV-2 genome and related sequences in SARS-CoV-1. In detail, the study focuses on the genomic subregion 23600-24200 that is part of the Spike gene. The region is defined by the intra-molecular helix formed by a SARS-CoV-2-specific 12nt insert located at the beginning of the region that forms stable base pairing with the region's end.

The authors performed IntaRNA-based RRI prediction to screen for RRIs that interlink the genomic subregion with other parts of the genome respectively for both viruses. Two high-scoring interactions were identified, i.e. a stable and conserved RRI of the region's end with ORF1ab and a non-conserved RRI that interlinks in SARS-CoV-2 the beginning of the region with ORF8.

Following the headlines of the Results section, the main findings by the authors are (1) locally stable structure within the region (2) the two stable interactions mentioned above and (3) support for the formation of the Spike-ORF8 RRI via covariation analyses.

# General remarks

I have a couple of central issues with this manuscript in its current form.

## (1) 12 nt insert irrelevant and obscures the focus and informative value

Beside the definition of the genomic subregion of interest, the 12 nt insert is eventually unimportant for any statement supported by the study (beside observing that S1 stem loop is stable with and without the insert).

The authors write themselves "The contribution of the 12-nt insert [...] is still unclear".

Thus, I have to ask: why then use the 12-nt fragment as such a rigid anchor for the study? The authors use about 2/3 of the Spike gene somewhat at random (based on the 12nt and its structure formation) instead of the whole gene! The rationale for taking the whole could be the same, i.e. "12-nt insertion is inside", but now the SARS-CoV-1 subregion is clearly defined (as the related full gene sequence) and not based on the mapping of a substructure region...

Using just a subregion of the gene

(a) causes problems with local structure prediction AT LEAST at the ends of the regions if not overall,

(b) has thus implications on RRI prediction, since IntaRNA incorporates the accessibility of interacting subregions, which is based on local structure prediction, and

(c) restricts the insights of the study to a hypothetical local fragment rather than a valid mRNA-like molecule.

## (2) mfe structures based on global folding are no local structure signal

Using minimum free energy (mfe) estimates based on GLOBAL structure prediction (i.e. allowing base pairing over the range of the RNA) and respective structure plots to discuss local structure formation is inappropriate. The mfe is only ONE possible stable structure and strongly depends on the underlying thermodynamic model. Mfe prediction gets decreasingly reliable (in its details) the longer the RNA. Thus, instead of individual base pairing (of the presented mfe structure), LOCAL base pair probabilities (eg visualized via dot plots) are a much sounder tool to investigate local structure formation of RNAs or genomic regions. And if the details of base pairing are of no importance, even local unpaired probabilities (related to the accessibility values used by IntaRNA) are a simple tool to identify locally (un)structured regions etc. without detailing base pairing.

When presenting mfe predictions of different tools (Fig-1) that are based more or less on the same thermodynamic energy model, no further insight is gained...

Furthermore, structure representations (Fig-1, Fig-4) are

-) so small that they are not readable in print

-) due to the latter, coloring is completely lost in print

-) eventually unimportant and not discussed in details for the RRI prediction

## (3) There is no covariation as far as I can see...

Please check Fig-1a of

https://academic.oup.com/bioinformatics/article/36/10/3072/5729989

the only thing I observe in the provided data (Fig-3) is VARIATION, i.e. only one side of the base pair shows variants...

The whole covariation part has several flaws..

(a) the authors use the concatenated interacting subsequences as an input to the CO-FOLDING BASED R-scape software... This is terribly wrong, since no LINKER sequence spacing the two FRAGMENTS is used. Without it, the helix close to the end of the interaction that is close to the point of concatenation cannot be formed due to STERICAL reasons respected by the structure prediction software... Thus, the co-variation study misses eventually 5 base pairs and even more important, the Spike region is ONE TOO SHORT! missing the final interacting U...

?!?

If done that way, ie. using the co-folding based approach, one has to insert a linker region of at least 3-5 nt, eg. poly-A or if possible poly-N, to allow for enough flexibility that the concatenated ends of the sequences can form a structure. Or why not extending the subregion range by 5nt in both directions? would be simple and avoid this central error..

(b) how do the counts in Fig-3b relate to the substitution count in Table-1? The authors refer to the manual of R-scape, which is in no way a proper scientific explanation, even more so since these counts are central for selecting pairs and position within the argumentation of the authors..

(c) the substitution counts incorporate AMBIGUOUS nucleotides (Fig-3b)... ?!? What about sequencing errors etc. as a source of such ambiguity? For other parts of the study such subsequences were excluded, why not here?

(d) the authors MISS an important point: the IntaRNA predictions are annotated with base pairs that are likely to be formed first, the seed base pairs represented by "+" within the plot. These base pairs seem to correlate with LOW substitution scores, which I would find a supportive and good sign! That is low sequence variation in the important sub-RRI!

Thus, an investigation and relation of VARIATION with RRI formation would be more appropriate than a co-variation study that shows there is no co-variation..
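The linker fix suggested under (3a) could be sketched as follows. This is only an illustration, assuming the two interacting fragments are joined into a single sequence before being passed to a single-sequence folding tool; the function names and the default linker of five `N` characters are hypothetical choices, not taken from the manuscript or from R-scape.

```python
def concatenate_with_linker(frag5, frag3, linker_len=5, linker_base="N"):
    """Join two interacting fragments with a flexible linker.

    Returns the joined sequence and the 0-based half-open index spans
    of each fragment inside it. Names and defaults are illustrative.
    """
    if linker_len < 3:
        # A loop closed by a base pair needs at least 3 unpaired nt
        # in the usual nearest-neighbour folding models.
        raise ValueError("linker should be at least 3 nt long")
    joined = frag5 + linker_base * linker_len + frag3
    span5 = (0, len(frag5))
    span3 = (len(frag5) + linker_len, len(joined))
    return joined, span5, span3


def to_fragment_index(pos, span5, span3):
    """Map a 0-based position in the joined sequence back to its fragment."""
    if span5[0] <= pos < span5[1]:
        return ("frag5", pos - span5[0])
    if span3[0] <= pos < span3[1]:
        return ("frag3", pos - span3[0])
    return ("linker", None)


# Toy fragments, not real SARS-CoV-2 sequence.
joined, span5, span3 = concatenate_with_linker("GGGAAA", "UUUCCC", linker_len=4)
```

Keeping the spans makes it possible to translate any predicted base pair back to fragment coordinates and to discard pairs that involve the linker.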

## (4) Why showing bi-/tri-molecule co-folding results?

I find no rationale for nor any additional insights from the bi-/tri-molecule folding part of the manuscript. The description of the assembly of the tri-molecule prediction from bi-molecule graphs is very vague.

Maybe I missed it, but what's the point? Fig-4a shows the same RRI patterns as Fig-2a. The lack of structure relation of Fig-4b and Fig-2b can/will have multiple possible reasons that are not discussed at all..

If you are interested in co-formation of multiple RRIs, why not using a constraint IntaRNA prediction? Therein you can mark regions that are involved in base pairing (eg. an RRI) and you will get corrected predictions for the rest of the molecule EXCLUDING the ballast of local intra-molecular mfe structure (see (2)).

But eventually, that wasn't even the point, or was it? I am confused by this part..

## (5) No suboptimal RRIs investigated/shown/discussed ...

For RRIs, the same holds as discussed for (2). While base pairing probability prediction is much harder for RRI prediction compared to intra-molecular structure prediction, one can relate to suboptimal interactions as a substitute to assess the structural variety. This is needed at multiple parts within this manuscript:

(a) the extremely long UTF interaction in Fig-2b is most likely an artifact of the summation-based prediction scheme. Simply spoken: the more base pairs the better until accessibility penalties overweight the gain.

BUT: did you check suboptimal interactions of the region? I would expect much shorter sub-RRIs with similar stability (i.e. energy).

(b) Furthermore, did you check if a similar ORF8 interaction is among the suboptimals of the SARS-CoV-1 predictions? Since the ORF8 SARS-CoV-2 RRI shows a lower energy than the UTF-RRI, the chance is high that it is "hidden" there...
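The trade-off described in (5a) can be written out. In accessibility-based predictors such as IntaRNA, the score of an interaction between site $[i,k]$ of one RNA and site $[j,l]$ of the other is, to my understanding of the published model, the hybridization energy plus the penalties for making both sites accessible. The symbols below are notation chosen for this note, not taken from the manuscript:

```latex
E(i,k,j,l) \;=\; E_{\mathrm{hyb}}(i,k,j,l) \;+\; ED_1(i,k) \;+\; ED_2(j,l),
\qquad
ED_x(a,b) \;=\; -RT \,\ln P^{u}_{x}(a,b)
```

Here $P^{u}_{x}(a,b)$ is the probability that positions $a..b$ of RNA $x$ are unpaired. Each additional stacked base pair usually lowers $E_{\mathrm{hyb}}$, while $ED_x \ge 0$ can only grow as the site gets longer, so the reported optimum may be a long interaction even when much shorter sub-interactions reach a similar total energy.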

# Conclusion

Given the length of my remarks and the bulk of flaws, I cannot recommend the manuscript in its current form for publication.

I hope the authors take the time to really reshape and rework this study before resubmitting as I took the time to point out the weaknesses.

Reviewer #2: Prior studies have discovered many long-range RNA-RNA interactions within the genomes of positive-strand RNA viruses, which have functional roles in fundamental viral processes. In this submission, Pereira and Manzourolajdad analyze SARS-CoV-2 genomic sequences from the GISAID database. They report a long-range interaction of a region containing a 12-nt insertion in the Spike protein region that is not present in SARS-CoV. They use standard methods to analyze RNA sequence data, and find locally stable structural features downstream of the insertion region, evidence for potentially functional interactions with ORF8 region, and some evidence for sequence covariation in the interacting regions. These findings can be useful to understand the biology of SARS-CoV-2 and develop therapies. Eventually, experimental data will be needed to ascertain the presence of these interactions in vivo and its functional significance. However, that may be beyond the scope of the present study. I have focused my comments on the computational analysis, which I hope the authors will consider addressing.

1. The 12 nt insertion site is 23,603-23,614 of the genome. The authors searched for long range interactions of any SARS-CoV-2 region with the region 23,600-24,107 and 23,917-24,118, i.e., only downstream region of the insertion site. Shouldn't equally long stretches upstream of the insertion site be considered as well? It is not clear to me why the present study has focused on the downstream region only.

2. Similar to my previous question, it is not clear why the overlapping regions 23,600-24,107 and 23,917-24,118 were analyzed separately. It will be helpful if the authors can comment on this.

3. The developers of IntaRNA have shown that integrating experimentally obtained chemical probing data with IntaRNA can significantly improve RNA-RNA interaction prediction. Since such data is available for SARS-CoV-2, have the authors considered using it in their analysis?

4. In the second subsection of the Results, the authors report long-range interaction predictions from IntaRNA. Are these all the predictions from IntaRNA or have the authors reported the subset of the most significant predictions? It might be worthwhile to add a Supplementary table listing all the interactions that meet a reasonable significance cutoff.

5. In the paragraph above Table 1, it seems there is a typo in the second sentence. For the region 23,675-28,045, the table shows n=92 while the text says n=2.

6. There are many grammatical mistakes scattered throughout the manuscript, which should be corrected, if necessary using free software such as Grammarly.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Martin Raden

Reviewer #2: No


PLoS One. 2022 Sep 1;17(9):e0260331. doi: 10.1371/journal.pone.0260331.r002

Author response to Decision Letter 0


26 Feb 2022

# Review for PONE-D-21-35236

Response to referees:

We are grateful to both reviewers for their enthusiastic feedback and constructive criticism on our manuscript. We are also thankful for the helpful suggestions in improving the different sections. We have now incorporated the suggestions provided by the reviewers and hope that the revised version of the manuscript will now meet the expectations of the referees. In particular, the revised manuscript has been re-organized to address some of the major points raised by both reviewers regarding the underlying assumption of the work. In addition, we considered a bigger and more up-to-date dataset for our population analysis (around 200,000 sequences rather than the previous 27,000). Below are the responses to the raised queries:

Point-by-point response to reviewer’s concerns:

Reviewer #1:

# General remarks

I have a couple of central issues with this manuscript in its current form.

## (1) 12 nt insert irrelevant and obscures the focus and informative value

Beside the definition of the genomic subregion of interest, the 12 nt insert is eventually unimportant for any statement supported by the study (beside observing that S1 stem loop is stable with and without the insert).

The authors write themselves "The contribution of the 12-nt insert [...] is still unclear".

Thus, I have to ask: why then using the 12-nt fragment as such a rigid anchor for the study? The authors use about 2/3 of the Spike gene somewhat at random (based on the 12nt and its structure formation) instead of the whole gene! The rationale for taking the whole could be the same, i.e. "12-nt insertion is inside", but now the SARS-CoV-1 subregion is clearly defined (as the related full gene sequence) and not based on the mapping of a substructure region...

Using just a subregion of the gene

(a) causes problems with local structure prediction AT LEAST at the ends of the regions if not overall,

(b) has thus implications on RRI prediction, since IntaRNA incorporates the accessibility of interacting subregions, which is based on local structure prediction, and

(c) restricts the insights of the study to a hypothetical local fragment rather than a valid mRNA-like molecule.

Response: We agree with the reviewer’s concern about the 12-nt-insert bias and appreciate the pertinent feedback. Hence, as suggested above, the rationale presented in the revised manuscript is now based on investigating the complete Spike region, as opposed to being exclusive to the regions surrounding the 12-nt polybasic insert. In the previous version of the submission, the introduction and a great part of the results (Figures 1 and 2) were based on the 12-nt polybasic insertion narrative, whereas the current version is based on the suggested rationale. Our hypothesis is now to see whether the SARS-CoV-2 Spike region has evolved at the RNA level compared to SARS-CoV-1. The emergence of the GC-rich 12-nt polybasic insertion in SARS-CoV-2 Spike is presented in the introduction as the background and motivation for a detailed analysis of the RNA structural evolution in Spike. The newly presented work (Figure 1 and Table 1) is the result of scanning the complete Spike region and is not biased to the 12-nt insert. We have also significantly revised and improved the introduction as well as most of the results sections, considering the comments from both reviewers.

## (2) mfe structures based on global folding are no local structure signal

Using minimum free energy (mfe) estimates based on GLOBAL structure prediction (i.e. allowing base pairing over the range of the RNA) and respective structure plots to discuss local structure formation is inappropriate. The mfe is only ONE possible stable structure and strongly depends on the underlying thermodynamic model. Mfe prediction gets decreasingly reliable (in its details) the longer the RNA. Thus, instead of individual base pairing (of the presented mfe structure), LOCAL base pair probabilities (eg visualized via dot plots) are a much sounder tool to investigate local structure formation of RNAs or genomic regions. And if the details of base pairing are of no importance, even local unpaired probabilities (related to the accessibility values used by IntaRNA) are a simple tool to identify locally (un)structured regions etc. without detailing base pairing.

When presenting mfe predictions of different tools (Fig-1) that are based more or less on the same thermodynamic energy model, no further insight is gained...

Furthermore, structure representations (Fig-1, Fig-4) are

-) so small that they are not readable in print

-) due to the latter, coloring is completely lost in print

-) eventually unimportant and not discussed in details for the RRI prediction

Response: We thank the reviewer for this insightful comment and agree with the reviewer’s concern that MFE-based structural specifics are not critical and can be omitted. All figures relating to RNA structure prediction are omitted from the revised manuscript as they did not carry critical information with regard to the main point of the manuscript. As suggested by the reviewer, visualizations via dot plots are now used instead to study local structures (Figure 4). Please also note that our arguments and conclusions are based on inferences about base-pairing probabilities rather than MFE-based base pair formation (see Discussion Paragraph 7, Locally stable RNA base pairs …). We have also improved the quality and layout of the figures so that they are easily readable and interpretable.

## (3) There is no covariation as far as I can see...

Please check Fig-1a of

https://academic.oup.com/bioinformatics/article/36/10/3072/5729989

the only thing I observe in the provided data (Fig-3) is VARIATION, i.e. only one side of the base pair shows variants...

The whole covariation part has several flaws..

(a) the authors use the concatenated interacting subsequences as an input to the CO-FOLDING BASED R-scape software... This is terribly wrong, since no LINKER sequence spacing the two FRAGMENTS is used. Without it, the helix close to the end of the interaction that is close to the point of concatenation cannot be formed due to STERICAL reasons respected by the structure prediction software... Thus, the co-variation study misses eventually 5 base pairs and even more important, the Spike region is ONE TOO SHORT! missing the final interacting U...?!?

If done that way, ie. using the co-folding based approach, one has to insert a linker region of at least 3-5 nt, eg. poly-A or if possible poly-N, to allow for enough flexibility that the concatenated ends of the sequences can form a structure. Or why not extending the subregion range by 5nt in both directions? would be simple and avoid this central error..

Response: We agree with the reviewer about this error and have taken measures to apply the linker in the corresponding analysis. One of the suggestions was to extend the actual sequence by 5 nt on both ends (totalling 20 nt for each pair of segments). The original segments were extended from 23660-23703 and 28025-28060 to 23655-23708 and 28020-28065. Section Materials and Methods, subsection Compensatory Mutations Analysis of Long-range RNA-RNA interactions, Paragraph 2 contains the details about the modification in this revision.
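The coordinate arithmetic behind this extension can be checked with a few lines. The sketch below assumes 1-based inclusive genome coordinates; `extend_region` is a hypothetical helper written for this illustration, not part of the analysis pipeline.

```python
def extend_region(start, end, flank=5, genome_length=None):
    """Extend a 1-based inclusive region by `flank` nt on both sides.

    Clamps at position 1 and, if given, at the genome length.
    """
    new_start = max(1, start - flank)
    new_end = end + flank
    if genome_length is not None:
        new_end = min(genome_length, new_end)
    return new_start, new_end


# Segment coordinates as quoted in the response above.
spike = extend_region(23660, 23703)
orf8 = extend_region(28025, 28060)

# Each segment gains 2 * 5 = 10 nt, so the pair gains 20 nt in total.
added = ((spike[1] - spike[0]) - (23703 - 23660)
         + (orf8[1] - orf8[0]) - (28060 - 28025))
```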

(b) how do the counts in Fig-3b relate to the substitution count in Table-1? The authors refer to the manual of R-scape, which is in no way a proper scientific explanation, even more so since these counts are central for selecting pairs and position within the argumentation of the authors..

Response: This point has now been disambiguated. Actual counts of mutations are explicitly extracted from data and presented in Figure 3 for the top four cases.

(c) the substitution counts incorporate AMBIGUOUS nucleotides (Fig-3b)... ?!? What about sequencing errors etc. as a source of such ambiguity? For other parts of the study such subsequences were excluded, why not here?

Response: As suggested, we removed any ambiguity from our dataset. All remaining symbols are strictly {A,G,C,U} at all steps. Mutation counts in Figure 3 also reflect this change.
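A minimal sketch of such a filter is shown below, assuming sequences are held in a plain name-to-sequence mapping and that DNA-style `T` should be read as `U`. The function names are hypothetical, and the actual filtering used for the revised dataset may differ (for instance in how alignment gaps are treated; this strict version rejects them as well).

```python
ALLOWED = frozenset("ACGU")


def to_rna(seq):
    """Upper-case a sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def is_unambiguous(seq):
    """True if every symbol is one of A, C, G, U after conversion."""
    return set(to_rna(seq)) <= ALLOWED


def drop_ambiguous(records):
    """Keep only records made of A, C, G, U; returns RNA-alphabet sequences."""
    return {name: to_rna(seq) for name, seq in records.items()
            if is_unambiguous(seq)}


# Toy records, not real isolates.
records = {"s1": "acgt", "s2": "ACNU", "s3": "AYGU", "s4": "AC-U"}
clean = drop_ambiguous(records)
```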

(d) the authors MISS an important point: the IntaRNA predictions are annotated with base pairs that are likely to be formed first, the seed base pairs represented by "+" within the plot. These base pairs seem to correlate with LOW substitution scores, which I would find a supportive and good sign! That is low sequence variation in the important sub-RRI!

Response: We thank the reviewer for bringing this observation to our attention. The “supportive and good sign…low sequence variation in the important sub-RRI” mentioned by the reviewer remains consistent even in our newer and bigger dataset. Comparing Figure 2 with the “power” column of Table 2, we can clearly see that the reviewer’s observation holds and that the more critical bonds (shown by ‘+’ in Figure 2) have lower variation (0.0 power). This point has been mentioned in the Discussion (paragraph 4, The population of SARS-CoV-2 sequences …) as one of the two main pieces of evidence for the Spike-ORF8 RNA-RNA interaction.

Thus, an investigation and relation of VARIATION with RRI formation would be more appropriate than a co-variation study that shows there is no co-variation..

Response: We agree with the reviewer on no sign of covariation in the data and have explicitly mentioned that in the discussion.
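The distinction between variation and covariation drawn in point (3) can be illustrated with a simple tally over two alignment columns. This is a sketch under stated assumptions, not the R-scape statistic: the consensus pair is taken as the most frequent pair, G-U wobble pairs count as canonical, and the category names are invented for this example.

```python
from collections import Counter

CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def classify_pair(alignment, i, j):
    """Tally how sequences deviate from the consensus pair at columns i, j.

    alignment: equal-length RNA strings; i, j: 0-based column indices.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    (ref_i, ref_j), _ = Counter(pairs).most_common(1)[0]
    counts = Counter()
    for a, b in pairs:
        if (a, b) == (ref_i, ref_j):
            counts["consensus"] += 1
        elif a != ref_i and b != ref_j:
            # Both sides changed: covariation only if pairing is preserved.
            key = "compensatory" if (a, b) in CANONICAL else "double_disruptive"
            counts[key] += 1
        else:
            # Only one side changed: plain variation.
            key = ("one_sided_compatible" if (a, b) in CANONICAL
                   else "one_sided_disruptive")
            counts[key] += 1
    return dict(counts)


# Toy alignment; columns 0 and 4 are the putative base pair.
toy = ["GAAAC", "GAAAC", "GAAAC", "AAAAU", "GAAAU", "AAAAC"]
result = classify_pair(toy, 0, 4)
```

A base pair supported only by one-sided counts shows variation without covariation, which is the situation described for the Spike-ORF8 data.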

## (4) Why showing bi-/tri-molecule co-folding results?

I find no rationale for nor any additional insights from the bi-/tri-molecule folding part of the manuscript. The description of the assembly of the tri-molecule prediction from bi-molecule graphs is very vague.

Maybe I missed it, but what's the point? Fig-4a shows the same RRI patterns as Fig-2a. The lack of structure relation of Fig-4b and Fig-2b can/will have multiple possible reasons that are not discussed at all..

Response: As suggested by the reviewer, bi/tri-molecule co-folding results are now eliminated as they were mostly based on speculation and did not have a clear hypothesis. The corresponding figures and discussion were also omitted in the revised version.

If you are interested in co-formation of multiple RRIs, why not using a constraint IntaRNA prediction? Therein you can mark regions that are involved in base pairing (eg. an RRI) and you will get corrected predictions for the rest of the molecule EXCLUDING the ballast of local intra-molecular mfe structure (see (2)). But eventually, that wasn’t even the point, or was it? I am confused by this part..

Response: As indicated above, we have now removed the co-formation of multiple RRIs as it is not directly relevant to the core hypothesis of the manuscript and the supporting data and discussion was weak. We thank the reviewer for pointing to this weakness of this section.

## (5) No suboptimal RRIs investigated/shown/discussed ...

For RRIs, the same holds as discussed for (2). While base pairing probability prediction is much harder for RRI prediction compared to intra-molecular structure prediction, one can relate to suboptimal interactions as a substitute to assess the structural variety. This is needed at multiple parts within this manuscript:

Response: As indicated above, for each individual search, four sub-optimal predictions have been included, making a total of 5 predictions. The ranking column of Supplementary Table 1 indicates the ranking of each hit. With regard to applying local stability: although including local stabilities as a constraint on the long-range interaction, as suggested above, can be informative about the validity of the long-range binding, we did not explore this analysis in this work. Our reasoning was that local stability constraints may be too stringent for initial analyses, especially for complex RNA structural mechanisms, where two regions may be in competition to bind.

(a) the extremely long UTF interaction in Fig-2b is most likely an artifact of the summation-based prediction scheme. Simply spoken: the more base pairs the better until accessibility penalties overweight the gain.

BUT: did you check suboptimal interactions of the region? I would expect much shorter sub-RRIs with similar stability (i.e. energy).

Response: We absolutely agree. Sub-optimal interactions, as suggested, were considered in the revised version. Sequence segments of 500nt (with 50nt overlap) were considered by the software, rather than applying the complete Spike gene in only one try. In addition, we used IntaRNA parameters to include top five hits rather than just one optimal hit for each run. Finally, using IntaRNA parameters, we constrained the program to exclude overlapping targets, to avoid repetitive results.
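The segmentation described here can be sketched as follows, assuming 1-based inclusive coordinates and a truncated final window. The Spike coordinates used in the example (21563-25384) are an assumption taken from the NC_045512.2 reference annotation rather than from the text above, the 50-nt flanks follow the description given in the response to Reviewer #2, and `sliding_windows` is a hypothetical helper, not the script used in the study.

```python
def sliding_windows(start, end, size=500, overlap=50):
    """Split [start, end] (1-based, inclusive) into overlapping windows.

    Consecutive windows share `overlap` positions; the last window is
    truncated at `end` if the region length is not a multiple of the step.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if start > end:
        raise ValueError("start must not exceed end")
    step = size - overlap
    windows = []
    pos = start
    while True:
        win_end = min(pos + size - 1, end)
        windows.append((pos, win_end))
        if win_end == end:
            break
        pos += step
    return windows


# Assumed Spike CDS coordinates (NC_045512.2) plus 50-nt flanks.
SPIKE_START, SPIKE_END, FLANK = 21563, 25384, 50
windows = sliding_windows(SPIKE_START - FLANK, SPIKE_END + FLANK)
```

Each window would then be submitted as one query against the rest of the genome, with the predictor configured to report several non-overlapping hits per run.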

(b) Furthermore, did you check if a similar ORF8 interaction is among the sub-optimals of the SARS-CoV-1 predictions? Since the ORF8 SARS-CoV-2 RRI shows a lower energy than the UTF-RRI, the chance is high that it is "hidden" there...

Response: We thank the reviewer for the suggestion and have now investigated this possibility. However, this was not the case: even in the most inclusive sense, no hit between Spike and ORF8 in SARS-CoV-1 ever appeared (Figure 1 and Supplementary Table 1). We find this to be a meaningful result. The observation that ORF8 appeared only in the SARS-CoV-2 analysis is detailed in the manuscript as one of the main pieces of evidence for the Spike-ORF8 RNA-RNA interaction being specific to SARS-CoV-2.

Although it may be possible to increase our prediction thresholds to generate more than 5 hits and re-investigate this further, we doubt that this will lead to meaningful results, as we already have many short-sequence hits in our results which might already be false positives. Secondly, in order to minimize the effect of length in ranking the stability of interacting regions, we used a generalized linear model and trained it on the results (Table 1 and Supplementary Table 1). This estimation also addresses the issue of length in the UTF-RRI (UTR-RRI) region raised by the reviewer. Indeed, the Spike-ORF8 interaction was in the top quantile of our results.
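One way to picture a length correction of this kind is to regress interaction energy on interaction length and rank hits by their residuals. The sketch below uses ordinary least squares, which corresponds to a Gaussian generalized linear model with identity link; the model family, predictors, and fitting procedure actually used in the revision are not stated here, and the hit names and numbers are made up purely for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def length_adjusted_ranking(hits):
    """Rank hits by energy residual after removing the length trend.

    hits: list of (name, length_nt, energy_kcal_per_mol).
    Most negative residual (more stable than expected for its length) first.
    """
    slope, intercept = fit_line([h[1] for h in hits], [h[2] for h in hits])
    residuals = [(name, energy - (slope * length + intercept))
                 for name, length, energy in hits]
    return sorted(residuals, key=lambda item: item[1])


# Invented toy hits: three follow the length trend, one is unusually stable.
toy_hits = [("a", 10, -10.0), ("b", 20, -20.0), ("c", 30, -30.0), ("d", 20, -26.0)]
ranking = length_adjusted_ranking(toy_hits)
```

In this toy set the long hit "c" has the lowest raw energy, but the shorter hit "d" ranks first once length is accounted for.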

Reviewer #2:

1. The 12-nt insertion site is at position 23,603-23,614 of the genome. The authors searched for long range interactions of any SARS-CoV-2 region with the region 23,600-24,107 and 23,917-24,118, i.e., only downstream region of the insertion site. Shouldn't equally long stretches upstream of the insertion site be considered as well? It is not clear to me why the present study has focused on the downstream region only.

Response: We completely agree with the reviewer's suggestion. Hence, in the revised version, we focused on scanning the complete Spike gene, which includes not only the region upstream of the insert but also 50 nt upstream of the start codon of Spike. We have also included 50 nt downstream of the stop codon of Spike.

2. Similar to my previous question, it is not clear why the overlapping regions 23,600-24,107 and 23,917-24,118 were analyzed separately. It will be helpful if the authors can comment on this.

Response: In the original submission of this study, the choice of intervals was largely arbitrary. We agree with the reviewer’s comment that this introduces a bias and is ad hoc. Hence, in the revised analysis we now equally consider all segments within Spike, overlapping by 50 nt, and have significantly revised the manuscript’s figures and results section.

3. The developers of IntaRNA have shown that integrating experimentally obtained chemical probing data with IntaRNA can significantly improve RNA-RNA interaction prediction. Since such data is available for SARS-CoV-2, have the authors considered using it in their analysis?

Response: Unfortunately, we did not consider the probing data. We agree that experimental data on full-genome of the virus (rather than on the individual sub-genomic RNAs which to our knowledge is currently available in the public domain in the form of SHAPE profiles from Anna Pyle’s lab) would have led to a more reliable long-range prediction analysis.

4. In the second subsection of the Results, the authors report long-range interaction predictions from IntaRNA. Are these all the predictions from IntaRNA or have the authors reported the subset of the most significant predictions? It might be worthwhile to add a Supplementary table listing all the interactions that meet a reasonable significance cutoff.

Response: The supplementary Table 1 includes all possible interactions found between Spike and the rest of the genome including very weak and sub-optimal interactions. We agree that these predictions rely on assumptions and more comprehensive experimental verifications are needed to confirm their in-vivo validity as suggested by the reviewer. However, high confidence predictions as listed and discussed in our study would form a roadmap for further understanding, interpretation and validation of these theoretically sound predictions.

5. In the paragraph above Table 1, it seems there is a typo in the second sentence. For the region 23,675-28,045, the table shows n=92 while the text says n=2.

Response: We apologize for this oversight and have corrected/updated these values.

6. There are many grammatical mistakes scattered throughout the manuscript, which should be corrected, if necessary, using free software such as Grammarly.

Response: The revised manuscript has been re-written significantly and now proofread by multiple native English-speaking colleagues.

Attachment

Submitted filename: Response to reviewers-final.docx

Decision Letter 1

Danny Barash

21 Apr 2022

PONE-D-21-35236R1

Evidence for a long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

PLOS ONE

Dear Dr. Manzourolajdad,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The revised manuscript should address all the critical points raised by the important reviewer.

Please submit your revised manuscript by Jun 05 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Danny Barash

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: # Review for PONE-D-21-35236-R1

Evidence for a long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

Okiemute B. Omoru, Filipe Pereira, Sarath Chandra Janga, Amirhossein Manzourolajdad

# Summary

In short, the authors want to identify putative long-range RRIs within the SARS-CoV-2 genome that have a high chance to be virus specific. To this end, they do a comparative investigation with SARS-CoV-1. Since there is a SARS-CoV-2-specific high-GC insert in Spike, the authors focus on this gene. Using a sliding window approach, putative interaction sites of Spike with other genomic regions are identified for both viruses. Since only SARS-CoV-2 shows a stable RRI with ORF8, the authors focus on the support of one of the predictions using structure prediction, base pair probabilities and (co-)variation analyses.

# General remarks

The revised work has positively sharpened the focus of the manuscript and provides more motivation for the whole endeavor. While improved, it still shows flaws and is still, in my opinion, of limited interest.

My major points are:

(*) There is no evidence anywhere that the identified RRIs are true "long-range" interactions!

The best one can speak of are "putative long-range" interactions, since (a) the whole identification and prediction is *fragment-based*, *in-silico* and without evidence that the full genome forms the interaction. It is also quite likely that the interaction is formed but only by (m)RNA fragments such that the whole hypothesis of the manuscript would be lost...

This is nowhere discussed within the manuscript!

Furthermore, both title and main contributions of the article need to be rephrased that way..

(*) The presented Pseudoknot structure prediction is neither local nor in line with the other used tools.

While it is hard to impossible to study or compare the dot-bracket-reported PK structures (Table 3), I find their presentation of no use and eventually wrong in the used context. First, the ProbKnot tool does a GLOBAL mfe prediction, with the limitations I discussed in the last review. Thus to infer local structure information from single mfes is, in my opinion, optimistic and wrong. Furthermore, all other used models within the study (IntaRNA and bp-prob computation) are based on nested structure models.. Thus, the base pair probs discussed along with the PK are using a different energy AND structure model and are thus hard to compare and a wrong intuition is triggered by the current text layout (namely that bp probs and PK structure are related).

Why using it at all? In the end, only a single minor crossing helix (5bp) is found and none for SARS-CoV-1. I doubt the relevance of this observation. Even more so while the authors have no mechanistic or whatsoever explanation or discussion of the observation (beside that it is observed).

To sanitize the course of the manuscript, I recommend dropping the PK part.

(*) Inconsistencies within the manuscript

- the Methods section still lists tons of tools that are not used in the current version

- the manuscript still refers to bifold predictions that are not present

- the 2nd and 4th paragraph of the introduction are mainly redundant

- the RRI visualizations (taken from IntaRNA) are using local fragment indices for Spike rather than the genomic positions, which makes it hard to follow and map the information. Just edit! (both in Fig-2 and supplement)

(*) Missing rationales

- the reason why the authors investigate the Spike-ORF8 RRI is only given within the discussion and one is lost wondering in the result section

- the use and interpretation of the AIC is nowhere to be found and it stays unclear if the reported values are good or bad or how to interpret them at all.

(*) Missing data

- the genome versions are not given (important since the reference genomes are undergoing some changes)

- the supplement lists the whole set of 2 million genome IDs but not the 200k used for the analyses

- how was the linear energy model derived? tool? library? handwritten?

(*) Overstating seed base pairs of RRIs

The "+" annotated bps in IntaRNA outputs are from stable subinteractions (of a used/default defined length, typically 7bp). Thus, these so called seed interactions are (in itself) stable enough to form (i.e. typically in unstructured regions) and thus likely to be starting points of the full RRI formation. Since an interaction can cover multiple such regions, all respective bps are annotated.

Currently, the manuscript describes these basepairs as "those that pair earlier than other base pairs", which is not true but again just a hypothesis.

Respective formulations should be amended respectively.

(*) Artifacts from subopt-limit

Since the authors limited the predicted suboptimals to 5 per fragment, the interaction atlas presented in Fig-1 is limited too. It could be that certain regions could interact with even more regions, which just don't pop up due to the hard "top-5" limit.

While this is no big drawback, it needs discussion. Even more so since the lack of predicted RRIs is a central point of discussion within both the result as well as discussion section!

(*) Suggestion: alpha via p-values

Just a suggestion that could improve the presentation. Given a large amount of genome-wide predictions (as done here) it is possible to estimate p-values for the energy scores (as presented on the IntaRNA webserver). The IntaRNA package even provides a respective script for computation.

The p-values could be used to set the alpha channel of the arcs within the circle plots to highlight highly probable RRIs and to distinguish them from weaker ones.

Currently, it is hard to say what interactions are stable. Even more so, since there is no general energy cut-off etc.

(*) Fig 4 not interpretable in printed version (and hard in pdf)

The dot plots are so small that dots are hard to spot or interpret/compare in print.

But even in pdf this is hard since only a pixel graphics is provided that cannot be zoomed without getting pixelated.

I suggest to move both figures in vector graphics format (PDF) to the supplement in full page width each to allow for detailed investigation.

The authors could present a respective cutout for the main manuscript if needed.

(*) Minor issues that caught the eye

- the text uses "Orf.." instead of "ORF.."

- Table 1 (and the supplement table) do not use "ORF8" but rather just "8" etc.

- Table 1 shows a red highlight not discussed in its caption

- "11th top hit within a total of 66" .. what 66? or is about the 69?

- "segments each corresponding to a particular viral strain" .. nope, each corresponds to a full genome sequencing (i.e. sample) but not necessarily strain!

- Table 2: lines Spike 23679-80 should be bold too

- it is not discussed that the CaCoFold predictions (Table 2 + Fig 3) miss the left-most RRI part from Fig-2, thus, rendering that RRI part less likely

- Fig-3 seems to be of low quality

- no vector graphics.. zoom in provides pixel art ..

- "contains five pseudoknots" .. NO, only ONE KNOT but "5 crossing base pairs". A BIG difference!

- Fig-4: it would be helpful to state in the caption that most likely base pairs are colored in red (for non-math readers)

- the text often states "SARS-CoV" instead of "SARS-CoV-1"

- Fig-4 caption "using the partition function" is a useless comment. better name the used tool.

- Fig-4 "shown in bold" .. better use "highlighted with a black bar". You can also annotate the same region on the y-axis (same coordinates) and draw horizontal/vertical lines at the respective bar ends to guide the eye to the important corridors within the plot

- "as well as well"

- it would be interesting to relate the S1 hairpin with the dot plot or annotate it(s position) within

- the supplementary figure needs a caption or the figure has to be extended to be self-explaining (what is relating to what and where)

(*) Carving out the core RRI based on the variation and stability investigation

Eventually, I think the authors miss a central outcome of the study or do not present it as such. While all the "long-range" part and the hopeful hypothesis of its impact on virulence, regulation etc. is quite speculative, the authors fail to highlight that the integration of RRI prediction and variation analyses strongly identifies the core of the putative Spike-ORF8 interaction, namely the seed-region stretch 23679-23690 (Spike). The left part of Fig-2 is not predicted in Fig-3 and no variation is seen in this area.

Thus, one can conclude that the true RRI part (or at least the most likely part) is defined by that region and that it might be relevant (but maybe not exclusively for the RRI) since it is not mutated in both genes.

Why is it that most core conclusions of that manuscript are by reviewers?

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Martin Raden

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Sep 1;17(9):e0260331. doi: 10.1371/journal.pone.0260331.r004

Author response to Decision Letter 1


7 Jun 2022

Response to Reviewer’s Comments

Reviewer #1: # Review for PONE-D-21-35236-R1

Evidence for a long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

Okiemute B. Omoru, Filipe Pereira, Sarath Chandra Janga, Amirhossein Manzourolajdad

# Summary

In short, the authors want to identify putative long-range RRIs within the SARS-CoV-2 genome that have high chance to be virus specific. To this end, they do a comparative investigation with SARS-CoV-1. Since there is a SARS-CoV-2-specific high-GC insert in Spike, the authors focus on this gene. Using a sliding window approach, putative interaction sites of Spike with other genomic regions are identified for both viruses. Since only SARS-CoV-2 shows a stable RRI with ORF8, the authors focus on the support of one of the predictions using structure prediction, base pair probabilities and (co-)variation analyses.

# General remarks

The revised work has positively sharpened the focus of the manuscript and provides more motivation for the whole endeavor. While improved, it still shows flaws and is still, in my opinion, of limited interest.

My major points are:

(*) There is no evidence anywhere that the identified RRIs are true "long-range" interactions!

The best one can speak of are "putative long-range" interactions, since (a) the whole identification and prediction is *fragment-based*, *in-silico* and without evidence that the full genome forms the interaction. It is also quite likely that the interaction is formed but only by (m)RNA fragments such that the whole hypothesis of the manuscript would be lost...

This is nowhere discussed within the manuscript!

Furthermore, both title and main contributions of the article need to be rephrased that way..

Response: We thank the reviewer for these suggestions and have now revised the title and discussion to reflect this input. The main hypothesis is the predicted Spike-ORF8 interaction. Various mechanistic speculations as discussed by the reviewer are mentioned in the discussion. The in-silico fragment-based nature of our approach is also emphasized in Abstract, Introduction, and Discussion. The main contribution of the work is further clarified. We thank the reviewer for encouraging us to clarify our main finding regarding integration of thermodynamic-based modeling and mutation patterns to identify the core sub-interacting region in the Spike-ORF8 prediction.

(*) The presented Pseudoknot structure prediction is neither local nor in line with the other used tools.

While it is hard to impossible to study or compare the dot-bracket-reported PK structures (Table 3), I find their presentation of no use and eventually wrong in the used context. First, the ProbKnot tool does a GLOBAL mfe prediction, with the limitations I discussed in the last review. Thus to infer local structure information from single mfes is, in my opinion, optimistic and wrong. Furthermore, all other used models within the study (IntaRNA and bp-prob computation) are based on nested structure models.. Thus, the base pair probs discussed along with the PK are using a different energy AND structure model and are thus hard to compare and a wrong intuition is triggered by the current text layout (namely that bp probs and PK structure are related).

Why using it at all? In the end, only a single minor crossing helix (5bp) is found and none for SARS-CoV-1. I doubt the relevance of this observation. Even more so while the authors have no mechanistic or whatsoever explanation or discussion of the observation (beside that it is observed).

To sanitize the course of the manuscript, I recommend dropping the PK part.

Response: As recommended by the reviewer, we have now removed the PK part from the manuscript to improve the clarity and flow of the manuscript.

(*) Inconsistencies within the manuscript

- the Methods section still lists tons of tools that are not used in the current version

Response: We have now removed the listing of tools in the methods section which are not used in the current version of the manuscript.

- the manuscript still refers to bifold predictions that are not present

Response: We have now removed the bifold predictions from the manuscript.

- the 2nd and 4th paragraph of the introduction are mainly redundant

Response: Thank you for this suggestion. We have now reduced the redundancy in these paragraphs of the introduction.

- the RRI visualizations (taken from IntaRNA) are using local fragment indices for Spike rather than the genomic positions, which makes it hard to follow and map the information. Just edit! (both in Fig-2 and supplement)

Response: We appreciate the reviewer pointing out this inconsistency and have now edited Fig 2 to show genomic positions, so that the information is easy to map and follow across the study.

(*) Missing rationales

- the reason why the authors investigate the Spike-ORF8 RRI is only given within the discussion, and one is lost wondering in the result section

Response: To improve the flow and logic for investigating the Spike-ORF8 RRI in this study, we have now included a transition in the Results section, right before the Spike-ORF8 section, providing a rationale for choosing these regions for RRI analysis.

- the use and interpretation of the AIC is nowhere to be found and it stays unclear if the reported values are good or bad or how to interpret them at all.

Response: We have now removed AIC from the main text. As a measure of model fit, we instead report the significance of the Length parameter (Pr(>|t|) for length is reported in the Table 1 caption, along with other details of the model used). We have also explained details regarding model derivation.

(*) Missing data

- the genome versions are not given (important since the reference genomes are undergoing some changes)

Response: We have now included the specific version of the genomes that were used in the study.

- the supplement lists the whole set of 2 million genome IDs but not the 200k used for the analyses

Response: As suggested by the reviewer, we have now included the genome sequence IDs for the 200K genomes that were used in the analyses.

- how was the linear energy model derived? tool? library? handwritten?

Response: The linear energy model was fitted in the R statistical package, and we have now explained this in the manuscript.

(*) Overstating seed base pairs of RRIs

The "+" annotated bps in IntaRNA outputs are from stable subinteractions (of a used/default defined length, typically 7bp). Thus, these so called seed interactions are (in itself) stable enough to form (i.e. typically in unstructured regions) and thus likely to be starting points of the full RRI formation. Since an interaction can cover multiple such regions, all respective bps are annotated.

Currently, the manuscript describes these basepairs as "those that pair earlier than other base pairs", which is not true but again just a hypothesis.

Respective formulations should be amended respectively.

Response: We have now changed multiple places in the manuscript to reflect the above interpretation recommended by the reviewer. We refrained from speculations regarding the above results.

(*) Artifacts from subopt-limit

Since the authors limited the predicted suboptimals to 5 per fragment, the interaction atlas presented in Fig-1 is limited too. It could be that certain regions could interact with even more regions, which just don't pop up due to the hard "top-5" limit.

While this is no big drawback, it needs discussion. Even more so since the lack of predicted RRIs is a central point of discussion within both the result as well as discussion section!

Response: As suggested by the reviewer, we have now expanded the discussion to reflect the possibility that additional RRIs, which may have fallen outside our limit of the top 5 hits, could still be biologically interesting.

(*) Suggestion: alpha via p-values

Just a suggestion that could improve the presentation. Given a large amount of genome-wide predictions (as done here) it is possible to estimate p-values for the energy scores (as presented on the IntaRNA webserver). The IntaRNA package even provides a respective script for computation.

The p-values could be used to set the alpha channel of the arcs within the circle plots to highlight highly probable RRIs and to distinguish them from weaker ones.

Currently, it is hard to say what interactions are stable. Even more so, since there is no general energy cut-off etc.

Response: We agree that p-values and/or a cut-off are also good methods for producing meaningful results. In this work, however, we decided to use an alternative approach for ranking predictions, namely the residual values of our model, as discussed in the Results section.

(*) Fig 4 not interpretable in printed version (and hard in pdf)

The dot plots are so small that dots are hard to spot or interpret/compare in print.

But even in pdf this is hard since only a pixel graphics is provided that cannot be zoomed without getting pixelated.

I suggest to move both figures in vector graphics format (PDF) to the supplement in full page width each to allow for detailed investigation.

The authors could present a respective cutout for the main manuscript if needed.

Response: We have now significantly increased the resolution of Figure 4 for better readability. However, please note that the PLOS ONE submission system often decreases the resolution of submitted figures to generate smaller files for reviewers, and this may have reduced the resolution of the version provided for review. Nevertheless, as suggested, we have now also included the figure as a supplementary file.

(*) Minor issues that caught the eye

- the text uses "Orf.." instead of "ORF.."

- Table 1 (and the supplement table) do not use "ORF8" but rather just "8" etc.

- Table 1 shows a red highlight not discussed in its caption

- "11th top hit within a total of 66" .. what 66? or is about the 69?

- "segments each corresponding to a particular viral strain" .. nope, each corresponds to a full genome sequencing (i.e. sample) but not necessarily strain!

- Table 2: lines Spike 23679-80 should be bold too

- it is not discussed that the CaCoFold predictions (Table 2 + Fig 3) miss the left-most RRI part from Fig-2, thus, rendering that RRI part less likely

- Fig-3 seems to be of low quality

- no vector graphics.. zoom in provides pixel art ..

- "contains five pseudoknots" .. NO, only ONE KNOT but "5 crossing base pairs". A BIG difference!

- Fig-4: it would be helpful to state in the caption that most likely base pairs are colored in red (for non-math readers)

- the text often states "SARS-CoV" instead of "SARS-CoV-1"

- Fig-4 caption "using the partition function" is a useless comment. better name the used tool.

- Fig-4 "shown in bold" .. better use "highlighted with a black bar". You can also annotate the same region on the y-axis (same coordinates) and draw horizontal/vertical lines at the respective bar ends to guide the eye to the important corridors within the plot

- "as well as well"

- it would be interesting to relate the S1 hairpin with the dot plot or annotate it(s position) within

- the supplementary figure needs a caption or the figure has to be extended to be self-explaining (what is relating to what and where)

Response: We sincerely thank the reviewer for these suggestions and have now made every effort to address all these minor issues to significantly clean up the manuscript.

(*) Carving out the core RRI based on the variation and stability investigation

Eventually, I think the authors miss a central outcome of the study or do not present it as such. While all the "long-range" part and the hopeful hypothesis of its impact on virulence, regulation etc. is quite speculative, the authors fail to highlight that the integration of RRI prediction and variation analyses strongly identifies the core of the putative Spike-ORF8 interaction, namely the seed-region stretch 23679-23690 (Spike). The left part of Fig-2 is not predicted in Fig-3 and no variation is seen in this area.

Thus, one can conclude that the true RRI part (or at least the most likely part) is defined by that region and that it might be relevant (but maybe not exclusively for the RRI) since it is not mutated in both genes.

Why is it that most core conclusions of that manuscript are by reviewers?

Response: We appreciate the input from the reviewer, but we want to emphasize that the study's main goal, and its main contribution, was to identify the core RRIs in the Spike-ORF8 regions. Co-variation analysis was initially anticipated to be an independent means of understanding the functional meaning and evolutionary conservation of these inferred associations.

Attachment

Submitted filename: Response to reviewers-final-R2-SCJ-edits.pdf

Decision Letter 2

Danny Barash

23 Jun 2022

A Putative long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

PONE-D-21-35236R2

Dear Dr. Manzourolajdad,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Danny Barash

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Martin Raden

**********

Acceptance letter

Danny Barash

20 Jul 2022

PONE-D-21-35236R2

A Putative long-range RNA-RNA interaction between ORF8 and Spike of SARS-CoV-2

Dear Dr. Manzourolajdad:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Danny Barash

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Top three interacting regions with SARS-CoV-2 Spike.

    The corresponding rankings of the hits are also included. A generalized linear model was used to rank hits with the highest interaction energy relative to interaction length (Table 1).

    (TIF)

    S2 Fig. Base pair probabilities for aligned segments of SARS-CoV-1 Spike.

    Probabilities were calculated using McCaskill’s partition function [51, 52]. Coordinates: SARS-CoV-1 (23447–23650 Spike). The SARS-CoV-2 Spike/ORF8 region (23660–23703) was mapped onto SARS-CoV-1, with coordinates SARS-CoV-1 (23534–23575 Spike), and extended in both directions by 100 nt. The resulting region is highlighted as a black bar.

    (TIF)

    S3 Fig. Base pair probabilities for aligned segments of SARS-CoV-2 Spike.

    Probabilities were calculated using McCaskill’s partition function [51, 52]. Coordinates: SARS-CoV-2 (23560–23803 Spike). The Spike/ORF8 region (23660–23703), shown as a black bar, was extended in both directions by 100 nt.

    (TIF)

    S1 Table. GISAID accession numbers.

    Accession numbers of the 206,745 SARS-CoV-2 sequences used in the study.

    (TXT)

    S2 Table. Ranking of predicted RNA-RNA base-pairing interactions.

    Predicted long-range RNA-RNA base-pairing interactions between the Spike region and the full genome for both SARS-CoV-1 and SARS-CoV-2 using the IntaRNA software package. See Materials and Methods for details. There was a total of 69 independent hits across both genomes. Column SARS-CoV denotes the strain. Column TotalLength denotes the length of the interacting regions (query + target). Ranking is according to residual values against the generalized linear model, where length of interaction was used to estimate interaction energy. The built-in function glm(energy~length, data = data, family = "gaussian") in the R programming language was used to fit the model. Length coefficient = -0.03190. Length was a significant factor in the model (Pr(>|t|) for length = 0.00067). Median of residuals = -0.2287. 1-Quantile of residuals = -2.1536. SARS-CoV-2 hits are shown in bold. Rank 11, also marked with *, denotes the SARS-CoV-2 Spike-ORF8 interaction.

    (CSV)
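
The residual-based ranking described for S2 Table can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: a Gaussian GLM with the default identity link gives the same fit as an ordinary least-squares line, so the sketch fits that line by hand and sorts hits by residual. The toy hits, the names `fit_line` and `rank_by_residual`, and the choice to list the most negative residual first are assumptions made for illustration, not values or conventions taken from the study.

```python
from statistics import mean


def fit_line(lengths, energies):
    """Ordinary least squares for energy = intercept + slope * length.

    Equivalent in fit to a Gaussian GLM with identity link.
    """
    x_bar, y_bar = mean(lengths), mean(energies)
    sxx = sum((x - x_bar) ** 2 for x in lengths)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lengths, energies))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rank_by_residual(hits):
    """Return (name, residual) pairs, most negative residual first.

    A negative residual means the hit's energy is lower than the
    fitted line predicts for its length (assumed ranking direction).
    """
    lengths = [h["length"] for h in hits]
    energies = [h["energy"] for h in hits]
    intercept, slope = fit_line(lengths, energies)
    scored = [
        (h["name"], h["energy"] - (intercept + slope * h["length"]))
        for h in hits
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data: total interaction length (nt) and energy (kcal/mol).
toy_hits = [
    {"name": "hit_A", "length": 40, "energy": -12.0},
    {"name": "hit_B", "length": 60, "energy": -13.0},
    {"name": "hit_C", "length": 80, "energy": -20.0},
    {"name": "hit_D", "length": 100, "energy": -15.0},
]

ranking = rank_by_residual(toy_hits)
for rank, (name, residual) in enumerate(ranking, start=1):
    print(rank, name, round(residual, 2))
```

With these toy values, the hit whose energy lies furthest below the fitted line is listed first. The study's own ranking, computed in R, placed the Spike-ORF8 interaction at rank 11 among 69 hits.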

    Attachment

    Submitted filename: Response to reviewers-final.docx

    Attachment

    Submitted filename: Response to reviewers-final-R2-SCJ-edits.pdf

    Data Availability Statement

    Yes, all data are fully available without restriction. S1 Table provides GISAID accession numbers for the sequences used in this study. The sequences are publicly available for download in the GISAID database.


    Articles from PLoS ONE are provided here courtesy of PLOS
