Skip to main content
Oxford University Press - PMC COVID-19 Collection logoLink to Oxford University Press - PMC COVID-19 Collection
. 2014 Mar 3;30(13):1793–1799. doi: 10.1093/bioinformatics/btu105

Challenges in RNA virus bioinformatics

Manja Marz 1,*, Niko Beerenwinkel 1,1, Christian Drosten 1, Markus Fricke 1, Dmitrij Frishman 1,1,1, Ivo L Hofacker 1, Dieter Hoffmann 1, Martin Middendorf 1, Thomas Rattei 1, Peter F Stadler 1,1,1,1, Armin Töpfer 1,1
PMCID: PMC7110044  PMID: 24590443

Abstract

Motivation: Computer-assisted studies of structure, function and evolution of viruses remains a neglected area of research. The attention of bioinformaticians to this interesting and challenging field is far from commensurate with its medical and biotechnological importance. It is telling that out of >200 talks held at ISMB 2013, the largest international bioinformatics conference, only one presentation explicitly dealt with viruses. In contrast to many broad, established and well-organized bioinformatics communities (e.g. structural genomics, ontologies, next-generation sequencing, expression analysis), research groups focusing on viruses can probably be counted on the fingers of two hands.

Results: The purpose of this review is to increase awareness among bioinformatics researchers about the pressing needs and unsolved problems of computational virology. We focus primarily on RNA viruses that pose problems to many standard bioinformatics analyses owing to their compact genome organization, fast mutation rate and low evolutionary conservation. We provide an overview of tools and algorithms for handling viral sequencing data, detecting functionally important RNA structures, classifying viral proteins into families and investigating the origin and evolution of viruses.

Contact: manja@uni-jena.de

Supplementary information: Supplementary data are available at Bioinformatics online. The references for this article can be found in the Supplementary Material.

1 INTRODUCTION

Viruses have been the first biological systems for which complete genomic information became available: bacteriophage MS2 in 1976 and Inline graphic X174 in 1977. By the mid 1990s, multiple strains of important human pathogens, in particular HIV, had been sequenced, laying the foundation for a systematic comparative genomics of the diverse virus families. With the completion of the human genome and the sequencing of hundred of eukaryotic and thousands of prokaryotic genomes, however, the bioinformatics community focussed almost entirely on these much larger and more complex systems. With few exceptions, most viral genomes are thus relatively poorly annotated and few computational tools and techniques have been developed specifically for the many idiosyncratic features of individual virus families.

The small size of viral genomes makes it possible to sequence large numbers of isolates, usually in clinical context, that are unavailable for any living systems. This flood of sequencing data in itself calls for specific methods of analysis, which so far are available in part at best.

In this survey we concentrate on RNA viruses. They form a highly diverse grouping, usually classified in terms of their genome organization. They may have a single-stranded (ss) genome in either plus (e.g. Poliovirus, Enterovirus, Hepatitis-A-virus) or in minus orientation (e.g. Rabiesvirus, maize-mosaic-virus), or a double-stranded (ds) RNA genome (e.g. Rotavirus), or ssRNA with a dsDNA as an intermediate product (e.g. Retroviruses including HIV). Nevertheless, they share several common features. In particular, their genomes are small, ranging from 3400 (Enterobacteria phage BZ13) to 31 000 nt (coronavirus), encoding just a handful of proteins. Furthermore, their mutation rates are large enough to form a quasi-species rather than a single genetically homogenous species in the classical sense.

Many of the common questions raised in virology call for specific bioinformatics support: How is viral gene expression regulated? How do RNA viruses evolve? How common are they? Have we already seen the whole diversity of RNA virus families or genera? How quickly do they change? How often do variations occur? How important is recombination in viral evolution? Is there a single common viral origin or do we find clearly independent origins?

In this contribution we review the state of the art of the computational methods that have helped to address some of these questions. In particular, we will be concerned with (i) finding and assembling viral genomes based on RNA-seq data, (ii) regulatory RNA elements and processed subgenomic RNA species, (iii) classification problems regarding viral protein families, (iv) phylogenetics and evolution of RNA viruses and (v) virus–host associations as a basis for biomedical applications.

2 DISCOVERY OF VIRAL SEQUENCES

Viruses display high genetic diversity both within and among viral species as well as within and among infected hosts. Although next-generation sequencing provides cost-effective access to high-throughput data, inferring the viral genetic diversity of a mixed sample from deep-coverage sequencing data has remained a challenging task. The reasons lie in difficulties of sample preparation and sequencing errors, short read length and, in particular, an incomplete a priori knowledge of existing viruses and their diversity. Viral diversity estimation may range from identifying viral species in metagenomics studies to reconstructing the individual mutants in the intra-host population of a single species.

The composition of a mixed sample can be assessed by metagenomics approaches, reviewed in Fancello et al. (2012), Mokili et al. (2012) and Reyes et al. (2012). A main approach is sequence read annotation by taxonomic classification using existing reference genomes and databases ( http://www.rna.uni-jena.de/rna.php ). In many metagenomics applications, however, classification is not possible because the majority of sequences have no known reference genome or homolog (Edwards and Rohwer, 2005). In this case, de novo discovery of viral species can be performed by state-of-the-art de novo genome assemblers. These methods try to assemble the genomes of the major species in the sample, ignoring low-frequency variants and technical errors.

Once the (reference) genomes are known and reads are classified, the resolution of diversity estimation can be increased by inferring the viral population structure of each individual species. Intra-host virus populations consist of many related mutants, generated by mutation, recombination and selection. Even low-frequency variants can be of great interest, for example, because they may harbor drug resistance mutations (Barzon et al. , 2011), facilitate immune escape (Luciani et al. , 2012) or affect virulence (Töpfer et al. , 2013a). Estimating intra-host viral genetic diversity and reconstructing the individual haplotype sequences relies on both error correction and read assembly. It can be performed on different spatial scales, including single sites of the genome (Single-Nucleotid-Varient calling), small sliding windows (local reconstruction) or complete genomes (global reconstruction). Current viral haplotype reconstruction tools, reviewed in Beerenwinkel et al. (2012), Beerenwinkel and Zagordi (2011) and Vrancken et al. (2010), can quantify viral diversity from NGS data, with recombinant population structure (Töpfer et al. , 2013b), provided that haplotypes differ enough, reads are not too short and coverage is high (Zagordi et al. , 2012). A common prerequisite for these tools is a high-quality alignment of the reads.

As of today, NGS-based discovery of viral sequences in mixed samples remains challenging because most analysis steps are not easily automated and each one has technical or biological limitations. There is a need for an integrated workflow combining the different processing steps in viral diversity studies to discover the underlying virus populations that can be used on a daily basis by clinicians and virologists.

3 STRUCTURAL RNA ELEMENTS

3.1 Detection and distribution of structured RNAs

The realization that conserved RNA structure plays a role in virology dates back to the beginning of the 1980s (Ahlquist et al. , 1981). Most of the structured viral RNA elements contained in the Rfam database are cis-acting elements, in particular internal ribosomal entry sites (IRES), cis-acting replication elements and other elements located in the untranslated regions (UTRs) of RNA viruses. Functional RNA structures also appear to be abundant within the viral coding regions. Furthermore, regular arrangements of hairpins throughout the genomic RNA have been shown to be instrumental for packaging in Leviviridae (Dykeman et al. , 2011) and some satellite viruses (Schroeder et al. , 2011). Evolutionary conserved large-scale ordering of RNA virus genomes seems to be abundant in many animal and plant viruses (Davis et al. , 2008).

The first systematic searches for conserved, and hence likely functional, RNA secondary structure elements were performed in RNA viruses more than a decade ago (Rauscher et al. , 1997). This stimulated the development of early computational methods (Hofacker and Stadler, 1999; Hofacker et al. , 1998) capable of surveying alignments of complete virus genomes (Thurner et al. , 2004; Witwer et al. , 2001) for local RNA motifs in which the structure is more conserved than the underlying sequence. Somewhat surprisingly, however, the next generation of comparative RNA secondary structure predictors such as RNAz (Washietl et al. , 2005) and evofold (Pedersen et al. , 2006) apparently have not been used extensively on virus data. The results of (Davis et al. , 2008) suggest that coverage with conserved secondary structure varies substantially between virus families.

Viral RNAs have recently become accessible to structural probing at larger scales using combinations of SHAPE and sequencing. The analysis of these data requires both elaborate processing of the raw SHAPE data (Pang et al. , 2011) and the incorporation of these data into RNA structure prediction algorithms in the form of constraints (Reuter and Mathews, 2010; Washietl et al. , 2012). First results include the HCV 5′ UTR (Pang et al. , 2011) and the secondary structure of a complete HIV-1 genome (Watts et al. , 2009). As essentially all RNA molecules form secondary structures, one has to keep in mind that the entire structure is not necessarily of functional relevance.

3.2 Viral non-coding RNAs

In addition to proteins, viruses may also encode non-coding RNAs (ncRNAs). Although most of the well-described examples have been found in viruses with DNA genomes, we include a brief overview here for two reasons: first, ncRNAs do appear in retroviruses, and second, it is at least conceivable that processing products of viral RNAs might act as ncRNA species.

The formation of independent functional RNA species is a widespread phenomenon among diverse virus families, best known but apparently not limited to DNA viruses ( Table 1 ). The largest class are virus-encoded microRNAs (miRNA), of which >200 distinct types have been reported during the past decade (Grundhoff, 2011; Grundhoff and Sullivan, 2011), with herpesviruses accounting for the overwhelming majority of examples (Boss et al. , 2009). Smaller numbers of examples have been reported also in Polyomaviridae, Ascoviridae, Baculoviridae and Retroviridae. Viral miRNAs appear to regulate viral-encoded transcripts and/or networks of host genes predominantly using the host miRNA regulation systems.

Table 1.

Examples of virus-encoded small RNAs

ncRNA Virus References
MiRNAs
    BART cluster Epstein–Barr Edwards et al. (2008)
    BHRF1 cluster Epstein–Barr Pfeffer et al. (2004)
    TAR-mir HIV-1 Klase et al. (2007)
Small nuclear RNAs
    EBER-1,2 Epstein–Barr virus Klase et al. (2007)
    HSURs 1-6 Herpes saimiri virus Klase et al. (2007)
    VA-I,II Adenoviruses Mathews (1995)
    telomerase RNA Marek’s disease virus Fragnet et al. (2005)
    v-snoRNA-1 Epstein–Barr virus Hutzinger et al. (2009)
Long ncRNAs
    PAN KSAH Klase et al. (2007)

In contrast to animal and plant miRNAs, their viral counterparts are often poorly conserved. This complicates both their annotation in newly sequenced genomes (Grundhoff, 2011; Grundhoff and Sullivan, 2011) and the computational reconstruction of their interaction networks (Ghosh et al. , 2009; Kim et al. , 2012). Nowadays, new viral miRNAs are usually found by means of deep sequencing, see e.g. (Tuddenham et al. , 2012). A recurring computational problem in this context is to distinguish bona fide miRNAs from small degradation products.

The RNA repertoire of herpesvirus species is by no means restricted to miRNAs, Table 1 . They also encode a diverse set of small nuclear RNAs with diverse functions, a small nucleolar RNA and even a derived copy of the telomerase RNA component. In some cases, unrelated ncRNAs from different families have analogous functions. For instance, both the EBER-1 RNA of the herpesvirus EBV and the VA-I RNAs of adenoviruses are effective inhibitors of PKR activation (McKenna et al. , 2006). Again, the oftentimes poor conservation and the diversity of the viral RNAs complicate their annotation.

A related topic is subviral RNAs and satellite RNAs. In particular, plant viruses often bring with them non-coding deletion mutants. These defective interfering RNAs often maintain crucial cis-acting RNA elements (Pathak and Nagy, 2009), which are described in more detail below.

3.3 Secondary structures in the mRNA coding regions

The existence of extensive secondary structures in native mRNAs is well supported by experimental evidence, and in silico with the assumption that they have lower folding energies and are thus more stable than codon-randomized sequences (Katz and Burge, 2003). However, in general and especially for comparatively variable viral sequences, MFE is considered to be minor relevant (Rivas and Eddy, 2000; Workman and Krogh, 1999) and compensatory mutations analysis over a broader range of individuals would be more sophisticated. On the other hand, computational analysis suggests that the three mRNA functional domains—5′UTR, CDS and 3′UTR—form largely independent folding units, while base pairing across domain borders is rare (Shabalina et al. , 2006). Global architectures appear to be poorly conserved between sequence-similar mRNA molecules (Chursov et al. , 2012b), but evolutionary conserved functional local secondary structures are abundant (Findeiss et al. , 2011; Meyer and Miklós, 2005; Olivier et al. , 2005). The relationship between mRNA structure and gene expression has been demonstrated both computationally and experimentally (Carlini et al. , 2001; Duan et al. , 2003; Ilyinskii et al. , 2009; Kudla et al. , 2009; Nackley et al. , 2006). For example, in the influenza virus a novel structural feature was identified in a functionally important region of the NS1 mRNA (Ilyinskii et al. , 2009). Synonymous mutations altering this mRNA element lead to significantly reduced protein expression, whereas non-synonymous mutations designed to preserve this local structure do not affect expression, implying that distinct secondary structure elements may be important for viral gene expression. Reduced mRNA stability near the start codon has been observed in a wide range of species, including dsDNA viruses (Zhou and Wilke, 2011), probably as a mechanism to facilitate ribosome binding or start codon recognition by initiator-tRNA (Gu et al. , 2010). There is also computational evidence that temperature-induced changes in mRNA structures may constitute a yet unappreciated molecular mechanism of the virus cold adaptation/temperature sensitivity phenomena (Chursov et al. , 2012a).

In a few cases, extensive and well-conserved RNA structures are superimposed on the coding sequence. Maybe the most impressive example is the IRES of HIV-2 (Herbreteau et al. , 2005) and the Rev-response element in HIV-1 (Pallesen et al. , 2009). Internal RNA elements are also located in the ORF1b of group 2 coronavirus mouse hepatitis virus (MHV); here deletion analysis has identified a 69-nt bulged stem loop required for packaging RNAs into particles (Fosmire et al. , 1992) or the cis-active elements involved in picornavirus replication (Steil and Barton, 2009). The latter initiate plus and minus strand RNA synthesis and cloverleaf elements, controlling both translation and replication (Liu et al. , 2009a). Another example is the ribosomal frameshift known, e.g. in coronavirus ORF1, induced by a short hairpin of 4–11 bp, which also affects genomic and subgenomic RNA production (Plant et al. , 2013).

One particularly intriguing aspect of mRNA life—the one that makes it distinctly different from any other kind of RNA—is the dual selection pressure toward maintaining both stable RNA structures of CDSs and the 3D folds of their encoded proteins (White et al. , 1972). Additional layers of selection arise owing to miRNA and protein binding sites within the mRNA coding regions. It has been argued that the redundancy of the genetic code plays an important role in satisfying these requirements (Shabalina et al. , 2006). Evolutionary models able to make a distinction between the evolutionary pressure at the RNA and protein level have been proposed (Rubinstein et al. , 2011).

In general, structure prediction of the mRNA coding regions remains an underappreciated area of RNA bioinformatics, arguably because these molecules are large and do not easily yield to current structure prediction methods and experimental structure-probing data are only beginning to emerge (Kertesz et al. , 2010). Most of the insights into the evolutionary constraints acting on mRNAs therefore come from correlating predicted base-paring patterns with the effects of site-directed mutagenesis on mRNA expression and degradation as well as on the expression levels and activity of encoded protein products. On the other hand, prediction of mRNA secondary structure is facilitated by the availability of abundant comparative sequence information both from viruses and cellular organisms. RNAdecoder (Meyer and Miklós, 2005; Pedersen et al. , 2004) implements a comparative method for finding and folding RNA secondary structures within protein coding regions. A recent survey of fly genomes (Findeiss et al. , 2011), however, indicated that the specificity of this combined approach is insufficient for genome-wide application. The problem of cataloging conserved secondary structure motifs within coding regions, in particular in viruses, remains open.

3.4 The secrets of viral UTRs

Functionally important viral RNA structures tend to be concentrated in the UTRs. This is not unexpected, of course, as UTRs are typically the only non-coding regions within the densely packed virus genomes. A wide range of experimental data demonstrates that the UTRs are essential for determining the efficiency of translation, mRNA lifetime and localization. Functional UTR elements are often binding sites for viral or host proteins, but can also be involved in RNA–RNA interactions either within the genome (cyclization) or with host RNAs (e.g. the ribosome). UTR structures have been studied most intensively in positive-strand RNA viruses (Liu et al. , 2009b), in particular in those affecting humans and/or animal livestock.

In positive-strand RNA viruses, the genomic RNA has to function directly as an mRNA. However, the viral RNA often lacks the 5′cap as well as the poly-A tail of canonical eukaryotic mRNAs. Eukaryotic translation usually starts with binding of initiation factors to the 5′cap, and these in turn recruit the small ribosomal subunit, which then scans along the mRNA. In 1988, two independent studies (Jang et al. , 1988; Pelletier and Sonenberg, 1988) showed that certain picornaviruses exhibit a cap-independent translation initiation mechanism. The structured RNA region of some 300–700 nt responsible for this mechanism was termed ‘internal ribosomal entry site’ (IRES) and is perhaps the best-studied example of a viral UTR structure. IRES structures seem to be present in all Picornaviridae (Witwer et al. , 2001). While viruses in the genus Flavivirus have a 5′cap, the other genera of the family Flaviviridae , such as Pestivirus and Hepacivirus, seem to use IRES structures (Thurner et al. , 2004). In addition, IRESs have been found or implicated in several other viruses including the subgenomic mRNAs of retroviruses, see e.g. (Vallejos et al. , 2012). While IRES regions within, for example, the Picornaviridae show significant similarity of RNA secondary structure, no such similarities are obvious across family boundaries, arguing against a common origin of different IRES structures.

IRES structures are also common in some positive strand RNA plant viruses, while others replace the IRES with a structure in the 3′UTR, called 3′cap-independent translation enhancers (3′CITE) (Nicholson and White, 2011). 3′CITE structures are much shorter (∼100 nt) and can be grouped into several distinct classes. Some of them bind translation initiation factors such as elF4E, while others seem to interact directly with the ribosome. Cyclization of the RNA is required to then bring these initiation factors/ribosome close to the translation start site.

Genome cyclization through complementary sequences in the 5′ and 3′UTR is a common theme observed in many virus families. The flaviviruses are an example among the positive strand RNA viruses. Presumably, cyclization improves translation rates by allowing the ribosome to transfer from the 3′end back to the start of the coding region. Among negative strand viruses, the Bunyaviridae , including Hanta virus, or Orthomyxoviridae , including Influenza virus exhibits a segmented genome where each segment has strong complementarity between 3′ and 5′ end. In UTRs of plant virus genomes or HCV, tRNA-like secondary structures are known, which are believed to interact with the viral genome and the ribosome may interact during translation (Annamalai and Rao, 2006; Piron et al. , 2005).

The 5′UTR and the 3′UTR feature translation efficiency and replication. Both Picornaviridae and Flaviviridae contain highly structured UTRs, which, however, differ significantly between genera. In Enterovirus, these structures have been shown to be essential for the assembly of the RNA-dependent RNA polymerase (RdRp) complex (Zoll et al. , 2009).

3.5 Cis-acting elements

Apart from target prediction for viral miRNAs, interactions of structured RNA elements in viruses have remained largely unexplored. 5′ and 3′ UTRs containing cis-active elements are essential for viral genome replication (Ulferts and Ziebuhr, 2011; van den Born and Snijder, 2008). A complex example is given by nidoviruses, which synthesize a nested set of 3′/5′-coterminal subgenomic (sg)mRNAs (Stern and Kennedy, 1980; van Berlo et al. , 1982) from which the structural and accessory protein genes are expressed. The 5′ ends of nidovirus sgRNAs share a leader sequence that is identical to the 5′-end of the genomic RNA (de Vries et al. , 1990; Spaan et al. , 1983; van Vliet et al. , 2002). A copy of this leader sequence is fused to the 3′-ends of nascent sg minus-strand RNAs in a process called discontinuous extension of minus strands (Sawicki and Sawicki, 1995; Sawicki et al. , 2007). The so-called transcription regulation sequences (TRSs) are located upstream of each of the structural and accessory protein genes; however, TRSs is also found downstream of the 5′-leader sequence on the viral genomic RNA (Ulferts and Ziebuhr, 2011). The proposed coronavirus transcription mechanism implies a close interaction between TRS-L and each of the cTRS-B present in the genomic RNA, imposing strong constraints of the evolution of the TRS sequences (Enjuanes et al. , 2001; Zúñiga et al. , 2004). The hypothesis of sgmRNA synthesis in coronaviruses requires a minimum thermodynamic stability in the TRS-L and cTRS-B duplex (Dufour et al. , 2011; Sola et al. , 2005) could not be proven in silico for all coronaviruses (Fricke and Marz, 2013). Corona-, Bafini- and Arteriviruses feature such a leader sequence, while Okaviruses do not (Cowley et al. , 2002); Equine torovirus contains one sgRNA with a 5′-leader sequence while the other sgmRNA species are leaderless (van Vliet et al. , 2002).

Coronavirus harbors several additional cis-active elements forming distinct stem loops involved in regulating sgRNA transcription and RNA replication (Li et al. , 2008; Liu et al. , 2009a; Raman and Brian, 2005; Raman et al. , 2003). The 3′UTR of MHV and BoCV upstream end contains a bulged stem loop and a pseudoknot, which cannot form simultaneously. This has led to a proposal that these structures are part of a molecular switch that regulates different steps of replication (Hsue and Masters, 1997; Williams et al. , 1999). Several cis-active structures are located within the coding regions.

The examples given here are by no means exhaustive. In fact, the Rfam database features several dozens of distinct families. Despite their importance for viral control, however, there is no comprehensive analysis and only a few computational surveys, see e.g. (Li et al. , 2010), have been attempted following a few family-specific studies almost a decade ago (Hofacker et al. , 2004; Thurner et al. , 2004; Witwer et al. , 2001). Comparative investigations across families and detailed studies into the evolution of these elements are largely lacking.

4 CLASSIFICATION OF VIRAL PROTEIN FAMILIES

Already early in the era of genomics, NCBI’s viral genomes project established a large-scale comparative resource providing information on orthology and paralogy of viral proteins (Bao et al. , 2004). Clusters of related viral proteins (viral orthologous groups, VOG), as well as the specialized collection of phage orthologous groups, are available as part of the NCBI protein clusters (Kristensen et al. , 2013; Sayers et al. , 2012). Widely used databases of orthologous proteins such as EggNOG (Powell et al. , 2012), OMA (Altenhoff et al. , 2011) or KEGG (Kanehisa et al. , 2012) do not consider virus proteins at all. A variety of software tools for orthology detection have been proposed (Koonin, 2005), falling into four large groups (Kristensen et al. , 2010): phylogenetic tree-based approaches, heuristic best-match methods, synteny-based and hybrid approaches. None of the available tools such as EnsemblCompara (Vilella et al. , 2009), OrthoMCL (Li et al. , 2003) or InParanoid (Alexeyenko et al. , 2006) have been specifically designed for viral genome analysis.

Remarkably, even the VOG data are rarely used for comparative genomics of viruses, despite their potential for studying the natural history of viral genes (Koonin et al. , 2006). So, what are the problems with the currently available VOG? Three main limitations are most evident: current VOG (i) are rarely updated and are not hierarchical, (ii) lack remote and short homologs and (iii) are not linked to their hosts and other cellular organisms.

A significant fraction of current limitations in comparative genomics of viruses derives from the divergent sequences. Analysis tools that include a broad analysis of compensatory mutations of a wide range and conserved motifs for further interactions do not exist and would result in higher computational costs of the underlying calculations. Recently developed tools such as Phamerator rely on fast, but even less sensitive, approaches for sequence similarity calculations (Cresawn et al. , 2011). Considering the explosive growth of the genome databases, such all-versus-all comparisons can be expected to become even more crucial in the future. Efficient approaches such as incrementally calculated matrices of sequence similarities (Arnold et al. , 2005) are therefore promising tools for the next generation of classification systems for viral protein families.

5 VIRUS EVOLUTION AND PHYLOGENETICS

Phylogenetic analysis is a ubiquitous method in virology, forming an essential element of investigations describing viruses or viral epidemiology. However, several characteristics of viruses pose specific challenges for phylogenetics: (i) strong differences in evolution rates, typically high on a short-term and much lower on the long-term, (ii) large potential for recombination and gene transfer even between distant viruses or their host species, (iii) often strong evolutionary relationships between viruses and their hosts, (iv) lack of physical ‘fossil records’ of viruses and (v) abundance of genomic ‘fossil records’ (viral fossils) as parts of ancient viral genomes that occur within the genomes of extant species.

Phylogenetic trees are the most widespread presentation for virus phylogenies in the literature and several tree-building methods and software exists [e.g. MrBayes (Ronquist and Huelsenbeck, 2003), BEAST (Drummond et al. , 2012), PhyloBayes (Lartillot et al. , 2009), RAxML (Stamatakis et al. , 2008)]. However, trees cannot represent complex evolutionary relations that are relevant for viruses as horizontal gene transfer, interspecific recombination or the evolutionary relations between viruses and their hosts. Different types of phylogenetic networks have been developed in recent years to represent such relations (e.g. Huson et al. , 2011). However, there is still much need for research on how to reconstruct such aspects of virus phylogeny.

5.1 Short-term viral evolution

The short-term evolution rates of many viruses are so high that genomic evolution can already be observed over the course of years or even days. For analyzing viral short-term sequence evolution, it is important that the phylogenetic methods can include the sampling dates of the sequences, such as TipDate (Rambaut, 2000). Moreover, spatial dispersal processes play an important role, e.g. the spatial distribution of a virus within the host’s body or the geographic spread of an infectious disease. Several methods and tools have been developed to analyze and reconstruct the history of such complex phylogenetic and phylogeographic processes [overviews are (Bloomquist et al. , 2010), (Faria et al. , 2011) and (Lemey et al. , 2009)]. Recent tools that implement Bayesian approaches are based on Markov chain models or continuous diffusion models [ BayArea (Landis et al. , 2013), SPREAD (Bielejec et al. , 2011)]. It is not easy to interpret whether the delivered phylogeographic reconstructions and visualization tools, e.g. Phylowood (Landis and Bedford, 2014), can help.

The evolutionary rates of viruses can differ even for short-term evolutionary scenarios, for example, between different lineages (infectious, non-infectious) or between different time intervals (e.g. states of an infection or seasons). One reason is that substitution rates reflect a complex product of mutation rate, generation time, effective population size and fitness (Jenkins et al. , 2002; Sanjuán et al. , 2010). Vastly different replication profiles [stamping machines versus geometric replication (Martínez et al. , 2011)] make the estimation of substitution rates difficult. In viruses in particular, substitutions may also be an artifact caused by polymerase errors and nucleotide modifications (Domingo and Holland, 1997). For all these reasons, the classical assumption of a time-homogeneous substitution process that is used by several phylogenetic and phylogeographic statistical inference methods does not hold and new approaches that can include varying evolutionary rates have been proposed, such as (Bielejec et al. , 2013). Unfortunately, the computational effort of such complex statistical inference methods is high. One remedy is to use parallelized versions that could offer a dramatic speedup on various parallel architectures, e.g. computer cluster (Baele and Lemey, 2013), graphics processing units with BEAGLE (Ayres et al. , 2012) or multiple Field Programmable Gate Arrays via extended BEAGLE (Jin and Bakos, 2013).

5.2 Viral ‘deep phylogeny’

Because physical fossils of viruses do not exist, there is no direct evidence about the time when viruses have emerged, and their origin is still not clear. There is indication that viruses are polyphyletic and several hypothesis exists about their relation to cellular life (Wessner, 2010): (i) they might be precursors of cellular life or (ii) they might have originated from cellular life via a regressive, or reductive, process from whole cells or via a progressive process from genetic elements.

A problem for ‘deep phylogeny’ reconstruction is that the genetic distance between viruses can be so large that reasonable alignments become impossible to calculate, and therefore, standard alignment-based phylogenetic methods cannot be applied. The development of advanced approaches to achieve biologically correct alignments would help, but can only marginally alleviate the problem of saturated substitution processes. Other approaches might be based on using aspects of genome organization or protein structures as phylogenetic characters (Holmes, 2011).

However, some ancient viruses have left parts of their genome (or other traces) in the genome of germ line cells of their hosts. Such parts, called endogenous viral elements (EVEs), have survived as non-functional, neutrally evolving pseudogenes or even became fixed as functional. Most EVEs stem from retroviruses because they integrate into host genomes as part of their life cycle. For example, ∼8% of the human genome is derived from >100 000 retroviral fossils (Lander et al. , 2001). However, in recent years, EVEs from many other viruses have been found (Horie and Tomonaga, 2011; Katzourakis and Gifford, 2010; Patel et al. , 2011). Some paleoviruses could even be almost entirely reconstructed from EVEs. To detect EVEs in complete genome sequences, different programs have been developed, e.g. RepeatMasker (Smit et al. , 2010), LTR_STRUC (McCarthy and McDonald, 2003), RetroTector (Sperber et al. , 2009), and using a combination of several of them seems most promising (Lerat, 2009).

Orthologues EVEs that are found in multiple host species indicate a single integration event that happened before the divergence of the host species group and therefore can be used to infer the phylogeny of ancient viruses and to calibrate the long-term evolutionary timelines for viruses (Feschotte and Gilbert, 2012). With EVEs it was possible to stretch back the history of several RNA virus families [e.g. bornaviruses (Horie and Tomonaga, 2011)] over ∼40 million years. This example shows that EVEs might help to solve the following problem of RNA virus phylogenetic dating. Studies that are based on genomic sequences of extant species often came to the conclusion that large taxonomic units of viruses (on the rank of genera) must have evolved from a common ancestor several ten-thousand years ago, whereas there are contrasting ideas based on virus–host coevolution over similarly wide taxonomic entities, suggesting bifurcation ages in the range of several millions of years (Buckling and Brockhurst, 2012; Fraile and García-Arenal, 2010; Marques and Carthew, 2007). However, in general, the calibration of phylogenies with fossil dates is difficult when the evolutionary rates are heterogenous and new algorithmic methods have to be developed for this, see (Heath et al. , 2013).

5.3 Virus–host associations

Associations between viruses and their hosts can have an important influence on the phylogeny of both partners. A divergence of the host might lead to a divergence of the virus (codivergence) and hence to a (local) congruence of both phylogenies. Such a match of the virus phylogeny with host evolutionary events at known dates can be used to calibrate the virus phylogeny or corresponding molecular clocks (Sharp and Simmonds, 2011). The property of viruses to switch their hosts may enable viruses to replicate and spread much more efficiently, a process commonly referred to as an epidemic is observed in pathogenic viruses (Weiss, 2003). Owing to the advantages conferred by the conquest of new host territory, some researchers assume that host switching is an elementary component of virus evolution and might also initiate viral speciation (Kitchen et al. , 2011).

Because virologists are highly interested to reconstruct the common history of viruses and their hosts, several bioinformatics tools have been developed for this purpose [for an overview see (de Vienne et al. , 2013; Doyon et al. , 2011)]. A program for testing of congruence between host phylogenies and parasite phylogenies is ParaFit (Legendre et al. , 2002). A fast implementation of ParaFit [ AxParafit (Stamatakis et al. , 2007)] is integrated via a wrapper in the tool CopyCat (Meier-Kolthoff et al. , 2007), which incorporates also a graphical user interface. CopyCat was used, e.g. to investigate the codivergence between mycoviruses and their hosts (Göker et al. , 2011).

Most programs for inferring reconciliations use a parsimony criterion where a reconciliation of minimum total cost is sought for. In this approach, a cost is given to each evolutionary event type (e.g. codivergence or host switch) and the total cost of a reconciliation is the sum of the costs of its events. The most often used programs are CoRe-Pa (Merkle et al. , 2010), Jane (Conow et al. , 2010) and TreeMap (Charleston and Page, 2002). An evaluation of the different reconciliation programs using a new model for cophylogeny generation can be found in (Keller-Schmidt et al. , 2011). These programs have been used, for example, to reconstruct the phylogenetic relationship between orbiviruses (Dilcher et al. , 2012), papillomaviruses (Gottschling et al. , 2011) or arenaviruses in Africa (Coulibaly-N’Golo et al. , 2011) and their respective hosts.

There are still many research problems and a need for new bioinformatics methods that can, for example, include biogeographic information and ecological traits, preferential host switching (Cuthill and Charleston, 2013) or different mutation rates. A better knowledge of the timing and underlying conditions of those processes could enable projections into the future and thereby contribute to the tackling of one of the major issues in today’s infectious diseases research, i.e. the prediction (and prevention) of future pandemics and outbreaks.

6 MEDICAL AND BIOTECHNOLOGICAL APPLICATIONS

Viral evolution has many implications for clinical virology. Emergence of resistance mutations is among the biggest obstacles to a successful antiviral therapy (Richman, 2006). The molecular mechanisms selecting resistance mutations are complex, particularly when multiple antiviral agents are used, e.g. for HIV (Shibata et al. , 2011). Computational analysis of HIV-1 genome variation correlating with CCR5 or CXCR4 coreceptors led to AIDS therapy strategy (Lengauer et al. , 2007).

Thus, new bioinformatics approaches to characterize viral evolution both on an intra- and interindividual level would be an important asset. A better understanding and knowledge of additional selection mechanisms such as RNA secondary structures could explain new pathways leading to resistance and immune escape mutations. New sequence-based therapeutic concepts, such as RNA interference, could prevent the selection of resistance mutations (Schopman et al. , 2012).

Previous work showed that the host immune system influences the genetic variability in chronically infected individuals (Hoffmann et al. , 2008). This can be analyzed with population genetic methods (Hoffmann et al. , 2012). In this setting, viral species acquire numerous mutations over time (Hoffmann et al. , 2010). An increasing number of immunocompromised patients susceptible to chronic infections represents an important reservoir for new viral genotypes (Siebenga et al. , 2008).

By integrating bioinformatic methods, it might be possible to predict viral evolution in patients from their individual viral population, including lower prevalent individuals with single genetic variations. Thus, the goal is to forecast the course of a virus infection and adjust the treatment accordingly.

7 CONCLUDING REMARKS

RNA viruses pose a wide variety of challenges to computational methodology owing to their staggering diversity, compact genome organization and rapid rate of evolution. On the other hand, the availability of large numbers of complete sequences and the small to at most moderate size of RNA virus genomes holds particular promises for specialized bioinformatics approaches. The latter two characteristics are shared with (animal) mitochondrial genomes. In contrast to viral sequences, however, mitogenomes have attracted considerable interest in the bioinformatics community, resulting in the development of a wide array of specialized tools [reviewed (Bernt et al. , 2012)]. This software often capitalizes on the fact that the small size of the mitogenomes makes it possible to use much more expensive algorithms than could be feasibly used in the context of prokaryotic or even eukaryotic genomics. Given the many specific questions and importance of RNA viruses in both basic research and in medical sciences, it is hard to understand why the bioinformatics community has shown little interest in developing a comprehensive suite of methods and tools for RNA virology. The open problems remain many and diverse, ranging from orthology detection, protein annotation and deep phylogeny to the evaluation of multiple superimposed selection pressures, the evolution of viral gene regulation and the understanding of the rapidly evolving populations of viruses and their arms race with the host immune system.

Supplementary Material

Supplementary Data

ACKNOWLEDGEMENT

The authors thank Allan Clark for critically reading the manuscript

Funding : Deutsche Forschungsgemeinschaft (STA 850/7-2 to P.F.S., MI 439/14-1 to M.Mi. and P.F.S., MA5082/1-1 to M.Ma.) and the Swiss National Science Foundation (CR32I2_127017). SPP 1596 - DR772-10/1 to D.R.

Conflicts of Interest : none declared.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES