Author manuscript; available in PMC: 2019 Apr 15.
Published in final edited form as: Mol Biosyst. 2016 Apr 26;12(5):1507–1526. doi: 10.1039/c6mb00122j

Functional correlations of respiratory syncytial virus proteins to intrinsic disorder

Jillian N Whelan a, Krishna D Reddy b, Vladimir N Uversky b,c,d, Michael N Teng a
PMCID: PMC6464112  NIHMSID: NIHMS1009524  PMID: 27062995

Abstract

Protein intrinsic disorder is an important characteristic demonstrated by the absence of higher order structure, and is commonly detected in multifunctional proteins encoded by RNA viruses. Intrinsically disordered regions (IDRs) of proteins exhibit high flexibility and solvent accessibility, which permit several distinct protein functions, including but not limited to binding of multiple partners and accessibility for post-translational modifications. IDR-containing viral proteins can therefore execute various functional roles to enable productive viral replication. Respiratory syncytial virus (RSV) is a globally circulating, non-segmented, negative sense (NNS) RNA virus that causes severe lower respiratory infections. In this study, we performed a comprehensive evaluation of predicted intrinsic disorder of the RSV proteome to better understand the functional role of RSV protein IDRs. We included 27 RSV strains to sample major RSV subtypes and genotypes, as well as geographic and temporal isolate differences. Several types of disorder predictions were applied to the RSV proteome, including per-residue (PONDR®-FIT and PONDR® VL-XT), binary (CH, CDF, CH–CDF), and disorder-based interactions (ANCHOR and MoRFpred). We classified RSV IDRs by size, frequency and function. Finally, we determined the functional implications of RSV IDRs by mapping predicted IDRs to known functional domains of each protein. Identification of RSV IDRs within functional domains improves our understanding of RSV pathogenesis in addition to providing potential therapeutic targets. Furthermore, this approach can be applied to other NNS viruses that encode essential multifunctional proteins for the elucidation of viral protein regions that can be manipulated for attenuation of viral replication.

Introduction

The order Mononegavirales encompasses the spectrum of non-segmented, negative sense (NNS) RNA viruses. NNS viruses are characterized by a single-stranded RNA genome ranging from 15 to 19 kb in length, encoding a linear array of genes in antisense orientation to mRNA. The genomic organization of the Mononegavirales is similar and the gene products share sequence and functional homology. Among the NNS viruses are notable viruses such as measles, rabies, and Ebola viruses. There are several families within the Mononegavirales order. In particular, the Paramyxoviridae consists of a large array of human pathogens, including measles, mumps, Nipah, parainfluenza viruses 1–4, and respiratory syncytial virus (RSV).1 Human RSV is the prototype member of the Pneumovirus genus within the Pneumovirinae subfamily of paramyxoviruses.1,2

RSV is a global threat, causing severe lower respiratory tract infections in infants and young children. RSV is also a significant health problem in the elderly and immunocompromised patients. Unlike most paramyxoviruses, RSV commonly causes recurrent infection in healthy, immunocompetent individuals, resulting in mild, self-limiting upper respiratory tract disease. This is likely due to the induction of a short-lived anti-RSV adaptive immune response and poor memory to infection.3–5 However, RSV interaction with the host and the subsequent immune response to RSV are poorly understood.

RSV is divided into two subtypes, RSV A and RSV B, which are based on the antigenicity of the RSV attachment (G) and fusion (F) surface glycoproteins. These RSV subtypes co-circulate globally at relatively equal levels, although RSV A is slightly more prevalent and pathogenic.6 In addition, RSV subtypes will reappear consistently throughout the year, as opposed to other respiratory diseases such as influenza that circulate annually before being replaced by a variant strain.7–10 Until the late 2000s, there were few whole RSV genome sequences published and available for reference. Due to recent technological advances in whole genome sequencing, as well as the use of the highly variable regions of the RSV surface glycoprotein G as a marker for diagnostics and genotyping, we have a much greater knowledge of the RSV phylogenetic tree.11–15 The RSV A subtype contains 9 known genotypes, with the two largest clades GA2 and GA5 encompassing most of the RSV A genotypes. The RSV B subtype consists of 10 known genotypes; interestingly, the RSV BA genotype first emerged in the 1990s and has rapidly become the predominant B genotype globally.16–19

RSV is an enveloped virus containing a 15 kb NNS RNA genome that is replicated exclusively in the cytoplasm of the infected cell. The viral genome consists of a linear array of 10 genes, encoding 11 proteins, with cis-acting transcription initiation and transcription termination sequences flanking each gene. The 11 RSV proteins are categorized as structural when packaged within the virion, or non-structural if expressed solely within the infected cell. Most of the structural proteins of RSV have functional homologues among the paramyxoviruses. As with other paramyxoviruses, the RSV ribonucleoprotein (RNP) consists of the viral genome encapsidated by the nucleoprotein (N) to form the template for viral RNA synthesis while simultaneously protecting the RNA from cellular nucleases.20 The viral polymerase (L) protein encodes the enzymatic activities necessary for RNA synthesis and requires a cofactor, the phosphoprotein (P), which is essential for L interaction with the RNP and polymerase function.21–24 Unique to the Pneumovirinae, the viral polymerase complex is regulated by two small proteins encoded by the M2 gene, M2-1 and M2-2. M2-1 functions as an antitermination factor to allow for elongation of mRNA transcripts; M2-2 appears to act as a molecular switch for the polymerase to toggle between transcription and genome replication modes.24–28 The RSV matrix (M) protein is required for assembly of progeny virions, directing the RNP–polymerase complexes to the plasma membrane for budding from the cell.29–31 RSV encodes three surface glycoproteins, the small hydrophobic (SH), attachment (G), and fusion (F) proteins.32 RSV F is structurally and functionally homologous to other paramyxovirus F proteins, responsible for fusion of the viral and target cell membranes.33 G and SH are not structurally similar to their counterparts in other paramyxoviruses; however, all paramyxoviruses express an attachment protein while SH proteins are present in other subfamilies of Paramyxoviridae. The RSV nonstructural proteins, NS1 and NS2, are the first proteins expressed during infection and function primarily in blocking the innate immune response to allow for optimal viral transcription and replication.34–37 Their sequence does not share homology with any mammalian or viral proteins, and little is understood about their structure.

Small RNA viruses such as RSV encode relatively few viral proteins with which to carry out key life cycle events like RNA transcription and viral replication. In order to thrive in an intracellular environment, these viruses have evolved proteins that perform multiple functions. These multifunctional proteins often contain intrinsically disordered regions (IDRs) that are highly flexible to facilitate a diverse array of functions, such as transient protein–protein interactions, signal transduction, post-translational modifications, and other crucial biological roles.38,39 In general, IDRs are defined by unique physicochemical features such as low hydrophobicity, low sequence complexity, and high net charge.40–42 Using both computational and experimental techniques, several studies have shown that IDRs are prominent and play important roles in viral systems.43,44 Among other functions, IDRs allow for promiscuous binding between the various components of the host, including membranes, DNA, RNA, and protein. Additionally, it has been proposed that IDRs allow for higher tolerance of the rapid mutation that occurs in viral genomes, which is in the range of 10^-5 to 10^-3 mutations per position per generation for RNA viruses. Due to the numerous functional applications of intrinsic protein disorder, any particular functional advantage conferred by protein disorder can be determined by examining the degree of sequence conservation, or the location and frequency of polymorphisms within an IDR.45

For the Paramyxoviridae, the relevance of intrinsic disorder in viral function has already been described in the specific cases of the N and P proteins of Nipah (NiV), Hendra (HeV), and measles (MeV) viruses.46–55 The C-terminus of the N protein (NTAIL) consists of a region of disorder-to-order transition, also known as a molecular recognition feature (MoRF). This NTAIL MoRF binds to the C-terminal X domain of the P protein in a “fuzzy” fashion, a term which describes significant freedom of both bound and unbound states. In agreement with this paradigm, the NTAIL of MeV N protein also interacts with the folded RNA-binding domain of N (NCORE), providing a framework where IDRs regulate the assembly and positioning of the polymerase complex.

As intrinsic disorder is prominent in negative-strand RNA viral proteomes, and several IDRs have been identified which are important to viral function, understanding the level of intrinsic disorder in RSV may allow for the identification of viral targets for antiviral therapeutics and an avenue to better defining the interaction of RSV with the immune system. To determine key elements of RSV protein structure, we implemented several bioinformatics predictors of intrinsically disordered protein regions (IDRs) to evaluate the degree and location of intrinsic disorder among the 11 RSV proteins. We compared proteins from 27 fully sequenced RSV genomes across various genotypes within the two RSV subtypes to determine a potential relationship between intrinsic disorder status and protein function. We also evaluated the existence of polymorphisms within IDRs to expand on our assessment of functionally relevant, conserved RSV protein sequence. While traditional drug design targets structural regions required for protein function, our analysis instead reveals functional domains within intrinsically disordered regions of RSV proteins. In addition, mutation or deletion of these IDRs that have specific function may alter RSV infectivity and associated immune responses without affecting protein stability and expression, allowing for rational design of live-attenuated RSV vaccine candidates.

Materials and methods

Dataset

The 27 RSV clinical isolates used in this study are fully sequenced and reviewed.16 The two RSV subtypes, A and B, are represented by 17 RSV A and 10 RSV B genomes, including the classically referenced RSV A2 and B1 laboratory strains. Included are isolates from four RSV A genotypes – GA1, GA2, GA5 and ON1. RSV B isolates are from one of six genotypes – GB1, GB3, GB4, BA, BA2 and BA4. Accession IDs were collected from the National Center for Biotechnology Information (NCBI) GenBank and FASTA files were obtained for each RSV protein from Uniprot. GenBank accession IDs: M74568, KF826836, KF826846, KF826824, KF826847, KF826832, JQ901451, KC731482, KC731483, KF826848, KF826855, KF826821, KF826838, KF826840, KF826831, JX015499, JX015483, AY353550, AF013254, KF826853, JQ582844, KF826829, KF826845, KF530259, KF826851, KF826858, JX576761. Genotypes were chosen from several RSV phylogenetic clades based on predominance in current circulation patterns. Isolates were collected between 1961 and 2011 to include RSV global sequence divergence from the past 50 years in our analysis. We have also included an extensive sampling of locations throughout the world (Table 1).

Table 1.

RSV clinical isolates in this study. Clinical isolates were collected from the NCBI GenBank, with GenBank accession number shown

GenBank Acc Subtype Genotype Location Collection date
M74568 A GA1 Australia 1961
KF826836 A GA5 Mexico 2006
KF826846 A GA5 Argentina 2008
KF826824 A GA5 USA 1998
KF826847 A GA5 Australia 2007
KF826832 A GA5 Italy 2009
JQ901451 A GA5 Netherlands 2001
KC731482 A ON1 India 2011
KC731483 A GA2 India 2011
KF826848 A GA2 Australia 2007
KF826855 A GA2 Italy 2009
KF826821 A GA2 USA 2007
KF826838 A GA2 Argentina 2006
KF826840 A GA2 Mexico 2007
KF826831 A GA2 Germany 2009
JX015499 A GA2 Belgium 2008
JX015483 A GA2 Netherlands 2008
AY353550 B GB1 USA 1977
AF013254 B GB4 USA 1985
KF826853 B GB3 Germany 2008
JQ582844 B BA2 USA 2002
KF826829 B BA Mexico 2005
KF826845 B BA Argentina 2008
KF530259 B BA South Africa 2006
KF826851 B BA USA 2007
KF826858 B BA Italy 2009
JX576761 B BA4 Netherlands 2002

RSV isolates range from 15 106 bp to 15 283 bp, depending on the genotype, with proteins ranging from 64 to 2166 amino acids (aa). Figures depicting averaged RSV A and RSV B data include the sequences from all 17 A and 10 B genomes. For the G protein, genotypes ON1 (one isolate) and BA (averaged five isolates) are represented separately from all other averaged RSV A and B genotypes. With 11 RSV proteins per isolate, there are a total of 297 proteins for classifications of IDRs by functional domain and post-translational modification.

Diversity analysis

The RSV proteome of each isolate was concatenated to yield 27 strings, each containing the 11 RSV proteins. These strings were then aligned with the ClustalW function in the MEGA 7.0 software, using a PAM matrix and otherwise default parameters.56 The A and B isolates were then split into two alignment files, and the entropy (H(x)) plot function in BioEdit 7.0 was used to produce the plot values.
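The per-position entropy calculation can be illustrated with a minimal sketch. It assumes the usual Shannon form H(x) = −Σ f·ln(f) over the residue frequencies f in each alignment column. The natural logarithm and the treatment of gap characters as an ordinary symbol are assumptions made for illustration, and the function names are hypothetical rather than part of BioEdit.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def alignment_entropy(aligned_seqs):
    """Per-position entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*seqs) walks the alignment column by column
    return [column_entropy(col) for col in zip(*aligned_seqs)]

# Toy alignment (not RSV data): position 1 is invariant, positions 2 and 3 vary
profile = alignment_entropy(["MKT", "MKS", "MRS", "MRS"])
```

A fully conserved column gives an entropy of zero, and the value grows as more residues share the position more evenly, which is why entropy peaks mark variable stretches of the concatenated proteome.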

Per-residue analysis of RSV proteins

In order to predict level of disorder in single sequences, per-residue disorder predictors were used. Different predictors take different sequence characteristics into account, but all consider scores above 0.5 to be disordered, whereas scores below 0.5 are considered ordered. Two predictors were primarily used. While PONDR® VL-XT57 sacrifices accuracy compared to more recent predictors, it is useful for detection of potential interaction regions because it is sensitive to local compositional biases.58,59 For example, IDRs are known to assume secondary structures known as molecular recognition features (MoRFs) when binding their partners, which are often represented in PONDR® VL-XT plots as sharp dips from disorder to order, and back to disorder.59,60 PONDR®-FIT61 is a highly accurate meta-predictor that utilizes PONDR® VL-XT as an input, as well as PONDR®-VSL2,62 PONDR®-VL3,63 FoldIndex,64 IUPred,65 and TopIDP.66 As it utilizes several different input features for prediction, it achieves ~85% accuracy, which is more accurate than the individual predictors. Throughout the study, PONDR® VL-XT is generally used to correlate IDRs to potential functional regions such as MoRFs, whereas PONDR®-FIT is generally used to understand overall level of intrinsic disorder.

Analysis of consensus intrinsic disorder and putative interaction regions of a representative RSV strain

The proteome of RSV strain A2 (UniProt IDs: P04544, P04543, P03418, P03421, P03419, P04852, P03423, P03420, P04545, P88812, P28887) was analyzed for consensus disorder using the MobiDB database, where consensus disorder is defined as incorporating data from multiple data sources including X-ray/NMR structures and intrinsic disorder predictors.67 Regions are classified as either ordered, disordered, or ‘ambiguous’ when different sources disagree regarding levels of intrinsic disorder. In order to quantitatively evaluate regions of potential interactions to support PONDR® VL-XT analysis, the proteome of RSV strain A2 was evaluated using ANCHOR, which identifies regions that are unlikely to fold independently but may fold in the presence of a binding partner,65,68 and MoRFpred, a predictor which can identify multiple MoRF types (α, β, coil, and complex).69

Charge-hydropathy (CH) plot

One established binary method of order–disorder classification is the CH plot, where ordered and disordered proteins plotted in charge-hydropathy space can be separated by a linear boundary.42 Absolute mean net charge for each protein was determined by calculating the total amount of charged amino acids (Lys, Arg, Asp, Glu), then dividing by the total number of residues to obtain the average charge per residue. Histidines were excluded as these residues are highly ionizable at physiological pH. Kyte-Doolittle hydropathy was calculated for each protein using a sliding window of 5 amino acids.70 The disorder/order boundary line is a modified, optimized version based on the original by Uversky and colleagues, represented by the equation ⟨charge⟩ = 2.743⟨hydropathy⟩ – 1.109.42,71 The boundary margins of the line were set to ±0.045, which reaches accuracy up to 95% for disordered proteins and 97% for ordered proteins.
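The CH calculation described above can be sketched as follows. Several points are assumptions for illustration only: net charge is read as |Lys + Arg − Asp − Glu| divided by sequence length, Kyte-Doolittle values are rescaled to the 0 to 1 range (as is conventional for CH plots, though not stated in the text), and the function names are hypothetical. Only the boundary coefficients, the window size and the ±0.045 margin come from the text.

```python
# Kyte-Doolittle hydropathy scale (standard published values)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def mean_net_charge(seq):
    """Absolute mean net charge per residue; His is excluded as in the text."""
    positive = sum(aa in "KR" for aa in seq)
    negative = sum(aa in "DE" for aa in seq)
    return abs(positive - negative) / len(seq)

def mean_hydropathy(seq, window=5):
    """Mean of sliding-window hydropathy, rescaled to 0..1 (an assumption)."""
    scaled = [(KD[aa] + 4.5) / 9.0 for aa in seq]
    if len(scaled) < window:
        return sum(scaled) / len(scaled)
    means = [sum(scaled[i:i + window]) / window
             for i in range(len(scaled) - window + 1)]
    return sum(means) / len(means)

def ch_distance(seq):
    """Vertical distance from the boundary <charge> = 2.743<hydropathy> - 1.109."""
    return mean_net_charge(seq) - (2.743 * mean_hydropathy(seq) - 1.109)

def ch_class(seq, margin=0.045):
    """Above the boundary: disordered; below: ordered; within the margin: mixed."""
    d = ch_distance(seq)
    if d > margin:
        return "disordered"
    if d < -margin:
        return "ordered"
    return "mixed"
```

The sketch raises a `KeyError` for non-standard residue codes such as X, so real sequences would need to be cleaned first.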

Cumulative distribution function (CDF)

CDF is a binary analysis based on per-residue local sequence predictions,72 and in this case PONDR® VL-XT scores were used. Disorder scores were plotted against their cumulative frequency. The resulting distributions were classified based on the distance from a previously validated linear boundary, which is a measurement of the proportion of residues with high vs. low disorder scores. The x, y coordinates of the boundary line are 0.60, 0.6948; 0.65, 0.7323; 0.70, 0.7736; 0.75, 0.8141; 0.80, 0.8538; 0.85, 0.9051; 0.90, 0.9467. CDF curves above and below the boundary represent ordered and disordered, respectively, while curves that cross the boundary line are predicted to be a mixture of order and disorder.
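A minimal sketch of this classification is shown below, using the seven boundary points listed above. The empirical CDF is taken as the fraction of residues with a score at or below each boundary x value. The function names and the rule that a curve is "mixed" when it lies on both sides of the boundary are illustrative assumptions.

```python
# Boundary points (x = disorder score, y = cumulative fraction) from the text
BOUNDARY = [
    (0.60, 0.6948), (0.65, 0.7323), (0.70, 0.7736), (0.75, 0.8141),
    (0.80, 0.8538), (0.85, 0.9051), (0.90, 0.9467),
]

def cdf_value(scores, x):
    """Fraction of residues whose disorder score is <= x."""
    return sum(s <= x for s in scores) / len(scores)

def cdf_differences(scores):
    """CDF minus boundary at each of the seven boundary points."""
    return [cdf_value(scores, x) - y for x, y in BOUNDARY]

def cdf_distance(scores):
    """Average distance between the CDF curve and the boundary."""
    diffs = cdf_differences(scores)
    return sum(diffs) / len(diffs)

def cdf_class(scores):
    """Curve above the boundary: ordered; below: disordered; crossing: mixed."""
    diffs = cdf_differences(scores)
    if all(d > 0 for d in diffs):
        return "ordered"
    if all(d < 0 for d in diffs):
        return "disordered"
    return "mixed"
```

A protein whose residues mostly score low accumulates its CDF early and sits above the boundary, whereas a protein dominated by high scores accumulates late and falls below it.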

Combined CH–CDF plot

While CH and CDF analyses are valuable separately, combination of these can yield even more information about the native state of a protein and roughly classify proteins as structured (Q2, lower right), mixture of order and disorder (Q3, lower left), disordered (Q4, upper left), and rare (Q1, upper right).73 Values for CH–CDF analysis were determined by calculating the distance between the selected point and the CH or CDF boundary line.
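The quadrant assignment can be written as a small lookup, sketched here under the assumption that a positive CDF distance means "structured by CDF" and a positive CH distance means "unstructured by CH", which matches the quadrant layout described for the combined plot. The function name and the handling of points lying exactly on an axis are hypothetical choices.

```python
QUADRANT_LABELS = {
    "Q1": "rare (structured by CDF, unstructured by CH)",
    "Q2": "structured",
    "Q3": "mixture of order and disorder",
    "Q4": "disordered",
}

def ch_cdf_quadrant(cdf_dist, ch_dist):
    """Return the CH-CDF quadrant for one protein.

    cdf_dist: distance from the CDF boundary (x-axis, positive = ordered)
    ch_dist:  distance from the CH boundary (y-axis, positive = disordered)
    Points exactly on an axis are assigned to the right/lower side (arbitrary).
    """
    right = cdf_dist >= 0
    upper = ch_dist > 0
    if right and upper:
        return "Q1"
    if right:
        return "Q2"
    if upper:
        return "Q4"
    return "Q3"
```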

Graphics software

Clustal Omega (EMBL-EBI, ebi.ac.uk) was used for multiple sequence alignment for determination of amino acid polymorphisms. Illustrator for Biological Sequences (IBS) 1.0 and Adobe Illustrator CS6 were used for domain mapping. GraphPad Prism 6.0 was used for PONDR®-FIT and PONDR® VL-XT graphs. PyMOL was used for imaging of molecular structures (The PyMOL Molecular Graphics System, Version 1.8 Schrödinger, LLC).

Results and discussion

Intrinsic disorder of the respiratory syncytial virus proteome

We collected 27 fully sequenced RSV clinical isolates, 17 from the RSV subtype A and 10 from the RSV subtype B. Of the RSV A isolates, nine were from the GA2 genotype, six from the GA5 genotype, and one isolate from the ON1 genotype. In addition, we included the prototype A2 laboratory strain, which is of the GA1 genotype. The RSV B isolates include one GB3, one GB4, one BA2, one BA4, and five BA genotype isolates. Similarly, we included the RSV B prototype B1 laboratory strain, which is genotype GB1. All of the isolates have been reviewed and can be found in the NCBI GenBank database. Table 1 describes the 27 clinical isolates used in this study.

We strategically selected clinical isolates whose collection dates encompass the last fifty years to account for any sequence diversity over the past half-century. Furthermore, we chose isolates with a worldwide distribution to include any global sequence diversity among samples. To verify that these sampling methods captured a high degree of variation within the selected samples, we performed diversity analysis on both RSV A and B isolates (Fig. 1). Our diversity analysis confirmed our sampling methods as effective, and was generally in agreement with previously published research, showing that the G protein is highly polymorphic relative to other RSV proteins.16

Fig. 1.

Protein entropy plots of 27 RSV proteomes. Entropy plots of concatenated and aligned proteomes of 17 RSV A isolates (red) and 10 RSV B isolates (black). Values were calculated using the entropy H(x) function in BioEdit 7.0. Entropy value (y-axis) is directly correlated to positional variation.

Disorder predictions for each RSV protein are shown in Fig. 2. The PONDR®-FIT and PONDR® VL-XT predictors consider an amino acid residue to be disordered if its disorder score is >0.5 and ordered if its score is <0.5; this threshold is marked by the horizontal line at 0.5 on the y-axis of each graph. Intrinsically disordered regions (IDRs) are, by definition, four or more consecutive residues with disorder scores above 0.5. Peaks and valleys represent the degree of disorder and order, respectively. The proteins are organized in the order in which they are transcribed, and roughly scaled to represent the size of their amino acid sequence. The 17 RSV A genotypes (blue line) and 10 RSV B genotypes (green line) were averaged and compared for each RSV protein. Certain G protein genotypes contain a 24aa (RSV A) or 20aa (RSV B) duplication in their C-terminal sequence; these sequences are therefore shown separately to avoid distorting the graph in the C-terminal direction. In addition, we compared the IDR differences between the more recently isolated ON1 and BA genotypes and the remaining RSV A and RSV B genotypes, respectively. The RSV A ON1 genotype (one isolate) is shown in purple and the RSV B BA genotype (five averaged isolates) is shown in orange.
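The IDR definition above (four or more consecutive residues scoring above 0.5) translates directly into a run-finding routine. The sketch below is illustrative: the function name is hypothetical, and it assumes the per-residue scores are already available as a list in sequence order.

```python
def find_idrs(scores, threshold=0.5, min_len=4):
    """Return IDRs as (start, end) residue positions, 1-based and inclusive.

    An IDR is a run of at least `min_len` consecutive residues whose
    disorder score is strictly above `threshold`.
    """
    idrs = []
    start = None  # 0-based index where the current disordered run began
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                idrs.append((start + 1, i))
            start = None
    # Close a run that extends to the C-terminus
    if start is not None and len(scores) - start >= min_len:
        idrs.append((start + 1, len(scores)))
    return idrs

# Toy profile (not RSV data): a 4-residue run, a 3-residue run, a 5-residue run
toy_scores = [0.6] * 4 + [0.4] + [0.7] * 3 + [0.2] + [0.9] * 5
toy_idrs = find_idrs(toy_scores)  # the 3-residue run is too short to count
```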

Fig. 2.

Intrinsic disorder of the RSV proteome. RSV proteome map of averaged PONDR®-FIT data (A) and PONDR® VL-XT data (B) for RSV A (blue) and B (green) subtypes. The G protein ON1 (purple) and BA (orange) genotypes are shown separately. Disorder score (disordered >0.5, ordered <0.5) on the y-axis and amino acid residue position on the x-axis.

PONDR®-FIT predictions for the RSV proteome are shown in Fig. 2A. For each protein, RSV A and B subtypes generally display a similar pattern of intrinsic disorder throughout the amino acid sequence, with a few minor differences. The SH protein RSV A isolates have highly disordered C-terminal peaks that remain disordered throughout the C-terminus. The SH RSV B isolates’ C-terminal peaks remain ordered, and the C-terminal IDRs are much shorter. Other minor variances involve the short IDR peaks in the RSV B proteins F and L near the N- and C-termini, respectively, that are present in their RSV A equivalents but as smaller, more ordered peaks.

Fig. 2B shows PONDR® VL-XT predictions for the RSV proteome. As we observed in the PONDR®-FIT predictions, there is an overall comparable trend between RSV A and RSV B. However, there are major differences between the two subtypes, namely in the SH and G graphs. For the SH protein, the RSV A isolates maintain the C-terminal IDRs displayed in the PONDR®-FIT data while the RSV B isolates remain highly ordered throughout the entirety of the protein sequence. The G graph for RSV A isolates contains two separate ordered regions near amino acid positions 100 and 220 whereas the parallel region of the RSV B G graph maintains its disordered status throughout those regions. The ON1 and BA isolate C-terminal disordered regions are not only different from one another, but they also display different C-terminal disorder patterns than their respective RSV A and B predecessor genotypes. There are noticeable differences in IDR prediction for each RSV protein between the two bioinformatics tools, many of which highlight the purpose of PONDR® VL-XT predictions. Many RSV proteins – N, M, SH, G, F, M2-1, and L – reveal an enhanced disorder profile, or higher peaks and lower valleys (Fig. 2A and B).

Overall disorder level for each RSV protein was calculated using the PONDR® VL-XT predictions (Table 2). Proteins were classified as either disordered (>15% disordered) or ordered (<15% disordered). Ordered proteins of the RSV proteome include NS1, NS2, N, M, SH, F, M2-2 and L. Proteins P, G and M2-1 are disordered, which is apparent in Fig. 2. All four G protein genotypes display a high degree of disorder; however, ON1 and BA isolates containing the C-terminal duplication display an increased degree of disorder overall, compared to their RSV A and B counterparts.
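The overall disorder level is the percentage of residues scoring above 0.5, compared against the 15% cutoff. The sketch below shows this calculation for a single sequence. The function names are hypothetical, and treating a value of exactly 15% as ordered is an assumption, since the text only specifies >15% and <15%.

```python
def percent_disordered(scores, threshold=0.5):
    """Percentage of residues with a disorder score above the threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)

def classify_protein(scores, cutoff=15.0):
    """Label a protein 'disordered' if more than `cutoff` percent is disordered."""
    return "disordered" if percent_disordered(scores) > cutoff else "ordered"

# Toy profile (not RSV data): 4 of 20 residues are above 0.5, i.e. 20%
toy_label = classify_protein([0.9] * 4 + [0.1] * 16)
```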

Table 2.

Level of RSV protein intrinsic disorder. PONDR® VL-XT predictions for 17 RSV A and 10 RSV B isolates were averaged for each RSV protein, with ON1 and BA isolates shown separately for G. RSV proteins were classified as ordered (<15%) or disordered (>15%, bold) by percentage of total amino acid residues above the disorder score of 0.5

RSV NS1 (%) NS2 (%) N (%) P (%) M (%) SH (%) G (%) F (%) M2-1 (%) M2-2 (%) L (%)
A 5.7 13.7 14.6 59.3 9.0 12.5 53.4 12.0 42.8 8.9 7.6
B 2.1 12.0 14.3 58.1 11 0.0 64.8 15.0 41.5 9.7 7.2
ON1 – – – – – – 76.0 – – – –
BA – – – – – – 65.2 – – – –

To obtain a conservative and accurate view of the level of intrinsic disorder in the RSV proteome, we analyzed a representative proteome of RSV strain A2 using the ‘consensus’ prediction of the MobiDB database (Table 3). As expected, the consensus predicted values generally differed from the PONDR® VL-XT predictions, likely because of the relatively low accuracy of PONDR® VL-XT and its propensity to predict MoRFs. The structural data generally agreed with the predictions, with the exception of the F protein, which had an X-ray structure with much higher intrinsic disorder than the predictions would suggest. Potential reasons for this discrepancy will be discussed further in subsequent sections.

Table 3.

Consensus disorder predictions of RSV strain A2. The MobiDB database was used to determine the consensus disorder prediction of RSV strain A2 (UniProt IDs: P04544, P04543, P03418, P03421, P03419, P04852, P03423, P03420, P04545, P88812, P28887), where regions of disagreement between two structure determination methods are defined as ‘ambiguous’ rather than structured or unstructured. Bold represents the overall consensus disorder prediction per protein

RSV A2 NS1 NS2 N P M SH G F M2-1 M2-2 L
Predict 6.47 11.29 5.88 66.39 3.91 17.19 60.07 2.61 14.43 4.4 2.4
NMR 0.0 10.82
X-ray 3.07 0.78 19.34 10.31

To expand the analysis of overall intrinsic disorder of the RSV proteome, binary predictors of intrinsic disorder were used on the proteomes of the 27 isolates, thereby depicting all 297 proteins individually (Fig. 3). Charge-hydropathy analysis revealed that all proteins besides G, M2-1, and P clustered on the ordered side of the modified Uversky line. P was the only cluster to fall in the definitively disordered region, while G and M2-1 clustered within the line’s range of error, representing a mixture of order and disorder. Cumulative distribution function (CDF) analysis revealed similar results to CH analysis, with the notable exception of the M2-1 cluster, which was predicted as structured. Combined CH–CDF analysis revealed that M2-1 is clustered close to the center in the unusual quadrant, which is consistent with the high degree of variability in its disorder profile. P is clustered in the highly disordered quadrant, while the G protein showed very high variability, with several instances either surrounding or on the border between the nonstructured and unusual quadrants. All other proteins were predicted as structured by CH–CDF analysis.

Fig. 3.

Combined CH–CDF analysis of 27 RSV proteomes. Q1 (upper right) contains proteins predicted as structured by CDF, and unstructured by CH (unusual). Q2 (lower right) contains proteins predicted as structured by CDF, and structured by CH (ordered). Q3 (lower left) contains proteins predicted as unstructured by CDF, and structured by CH (mix). Q4 (upper left) contains proteins predicted as unstructured by CDF, and unstructured by CH (disordered). CH values were calculated by taking the vertical distance between the point and the modified Uversky line. CDF values were calculated by taking the average distance between the CDF line and the boundary line.

Classification of intrinsic disorder regions of the RSV proteome

To further understand the prevalence and function of identified IDRs in RSV proteins, we grouped the IDRs using the PONDR®-FIT predictions from each protein of all 27 isolates based on size. The size groupings, derived from a previous publication, are shown as colored bars ranging from 4–10aa to 91–300aa (Fig. 4A).74 The RSV proteins are arranged on the x-axis in genomic order. In addition to IDR size, the number of IDRs within that size group for each RSV protein is shown on the y-axis. The shorter IDRs in the 4–10aa and 11–20aa range are more abundant overall, and each appears in every protein except NS2.

Fig. 4.

Number and size of intrinsically disordered regions of RSV proteins. Number of IDRs shown on the y-axis. (A) On the x-axis, IDR size for each RSV protein in genomic order, with light blue as shortest IDRs and maroon as longest IDRs. (B) On the x-axis, each functional classification in order of most IDRs to least, with light blue for shortest IDRs and dark blue as the longest IDRs.

Every IDR was also classified by functional domain, including post-translational modifications with implications in protein function. In Fig. 4B, the number of IDRs is again shown on the y-axis, and the functional classifications are arranged along the x-axis in order of high to low abundance within the RSV proteome. The colored bars depict IDR size groupings ranging from 5–10aa to 101–300aa. As above, phosphorylation and glycosylation events appear to frequently occur within longer IDRs. Unsurprisingly, protein-binding domains contained the most IDRs of all of the functional categories, as intrinsic disorder is known to facilitate multivalent, promiscuous interactions. This is also likely due to the large number of known RSV protein binding domains, including those with either cellular or viral binding partners. Notably, a common mechanism by which viruses exploit host machinery during infection is by expressing short linear motifs (SLiMs) of host proteins, which often reside within viral protein IDRs, to mimic cellular proteins and thereby interact with their functional counterparts.75 Using the eukaryotic linear motif (ELM) database (http://elm.eu.org), we found numerous cellular protein binding ELMs present within the NS2 protein N-terminal IDR (data not shown), indicative of ELM-expressing IDRs as potential sites of NS2 activity.76

One notable advantage of having a large number of complete RSV clinical isolate sequences available is the ease with which protein sequences can be aligned to determine the positions of amino acid polymorphisms across a diverse sampling of RSV strains. Table 4 reports each polymorphism identified for all RSV proteins within the 17 RSV A and 10 RSV B isolates examined in this study. IDRs containing polymorphisms are considered less functionally relevant due to a lower degree of conservation within that region of the protein. Importantly, targeting these polymorphism-containing IDRs for therapeutic development would be less effective than targeting protein sequence exhibiting a high degree of conservation; it is therefore important to identify polymorphisms that fall within IDRs (italicized in Table 4). The majority of the polymorphic residues were outside of predicted IDRs; however, the polymorphisms in two proteins were largely, or exclusively, in IDRs. For the phosphoprotein (P), the polymorphisms are clustered in regions that have no known function, mostly in the N-terminal region (see below). Unsurprisingly, the attachment protein (G) is the most polymorphic protein in RSV, as it is one of the major antigens and is under constant immune selection. The G polymorphisms lie within the IDRs and thus would be considered poor candidates for therapeutic targeting.
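The polymorphism mapping described in this paragraph can be sketched in two steps: find alignment columns that contain more than one residue type, then check which of those positions fall inside predicted IDRs. The function names are hypothetical, gaps are ignored when comparing residues (an assumption), and column numbers equal residue numbers only when the alignment has no gaps upstream of the position.

```python
def polymorphic_positions(aligned_seqs, gap="-"):
    """Return 1-based alignment columns containing more than one residue type."""
    positions = []
    for pos, column in enumerate(zip(*aligned_seqs), start=1):
        residues = {aa for aa in column if aa != gap}
        if len(residues) > 1:
            positions.append(pos)
    return positions

def split_by_idr(positions, idrs):
    """Split positions into those inside and outside IDRs.

    idrs: list of (start, end) tuples, 1-based and inclusive.
    """
    inside = [p for p in positions
              if any(start <= p <= end for start, end in idrs)]
    outside = [p for p in positions if p not in inside]
    return inside, outside

# Toy alignment (not RSV data): columns 2 and 4 vary; one IDR spans 3..5
toy_sites = polymorphic_positions(["MKTAY", "MRTAY", "MKTSY"])
toy_inside, toy_outside = split_by_idr(toy_sites, [(3, 5)])
```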

Table 4.

RSV A and B protein polymorphisms. The Clustal Omega multiple sequence alignment tool was used for determination of amino acid polymorphisms within the 17 RSV A and 10 RSV B isolates sampled in this study. Polymorphisms within IDRs are italicized

Protein | RSV A | RSV B
NS1 | 5, 36, 105 | 6, 45, 105, 124, 139
NS2 | 6, 7, 8, 10, 25, 38, 43, 100 | 6, 10, 24, 63, 68, 82
N | 64, 84, 216 | 57, 115, 201, 264, 372, 380
P | 66, 73, 75, 171 | 61–63, 75, 77, 80, 205
M | 157, 168, 254 | 28, 136, 166
SH | 27, 31 | 2, 49, 53, 57, 61, 65
G | 4, 13, 15, 38, 52, 57, 67, 71, 81, 95, 99102, 104, 106–108, 110–111, 113, 117, 120, 121–127, 131, 133, 136, 140–142, 146, 151, 153, 156–157, 160–161, 187, 191, 196, 205–206, 208, 215, 222–223, 225–226, 232–233, 236–238, 241, 244, 250–254, 256, 258, 262, 265, 269–271, 273–274, 279–280, 286, 289–290, 292–293, 295–297 | 4, 6, 32, 89, 95, 101, 103, 107, 109, 118, 131, 133, 135, 137–138, 140, 143, 150, 152, 158–159, 191, 200, 207–208, 219, 222–223, 227, 229, 233–235, 238–239, 268, 271–272, 276, 278–279, 282, 288, 290, 292–294, 301–302, 305, 307
F | 2, 4, 8, 13, 15–17, 19–20, 25, 101–103, 105, 119, 122, 124–125, 127, 152, 276, 324, 356, 378–380, 384, 447, 482, 510, 518, 535, 540, 547, 555, 574 | 8, 45, 65, 67, 97, 100, 103, 117, 152, 197, 215, 234, 278, 326, 467, 490, 527, 529
M2-1 | 4, 105, 118, 125, 180, 182, 185 | 107, 142, 172, 179–180, 185, 181–182
M2-2 | 18–19, 23, 25, 39, 44, 48, 50, 52, 54, 64, 68–70, 77–80 | 28–29, 31, 38, 52, 56, 71, 82, 86, 88
L | 6, 37, 59, 65, 81, 102–104, 144, 148, 162, 173–174, 177, 200, 216, 224, 232, 234, 236–237, 240, 257, 388, 445, 555, 575, 598, 754, 821, 839, 955, 967, 970, 1113–1114, 1124, 1180, 1206, 1238, 1471, 1489, 1551, 1599, 1657, 1700, 1715, 1718, 1721, 1723–1725, 1730–1731, 1745, 1754–1757, 1761, 1764, 1773, 1778, 1832, 1847, 1940, 1969, 1980, 2009, 2016, 2076, 2120, 2135, 2154, 2163 | 8, 138, 165–166, 177, 183–184, 191, 195, 254, 354, 374, 445, 564, 578, 943, 948, 973, 1032, 1043, 1087, 1250, 1314, 1325, 1413, 1471, 1489, 1546, 1593, 1667, 1700, 1716, 1718, 1723, 1726–1727, 1731, 1735, 1740, 1742, 1760, 1764, 1773, 1780, 1787, 1794, 1843, 1930, 1942–1943, 1972, 2014, 2021, 2030, 2042, 2065, 2119, 2164
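
As a minimal sketch of how polymorphic positions can be checked against disordered segments, the example below tests the RSV A phosphoprotein positions from Table 4 against the two PONDR® VL-XT disorder peaks quoted for P in the phosphoprotein section (aa29–32 and aa47–77). The function name and the choice of those two peaks as the IDR set are assumptions made for illustration; the IDR boundaries actually used to italicize Table 4 entries are not reproduced here.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # inclusive, 1-based residue range


def positions_in_intervals(positions: Iterable[int],
                           intervals: Iterable[Interval]) -> List[int]:
    """Return the positions that fall inside at least one inclusive interval."""
    interval_list = list(intervals)
    return [p for p in positions
            if any(start <= p <= end for start, end in interval_list)]


# RSV A phosphoprotein polymorphisms listed in Table 4.
p_polymorphisms = [66, 73, 75, 171]
# Assumption: disorder peaks quoted for P in the text (aa29-32 and aa47-77).
p_disorder_peaks = [(29, 32), (47, 77)]

in_idr = positions_in_intervals(p_polymorphisms, p_disorder_peaks)
print(in_idr)  # [66, 73, 75]
```

Under these assumed boundaries, three of the four RSV A polymorphisms in P fall inside a disordered peak, in line with the statement that P polymorphisms lie largely within IDRs.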

Functional analysis of RSV protein intrinsic disorder predictions

Nonstructural proteins NS1 and NS2.

NS1 and NS2 accessory proteins are expressed solely within infected cells, and while their repertoire of functions is not yet fully understood, it is clear that both nonstructural proteins play a major role in antagonism of type I interferon (IFN-α/β) via several innate immune response pathways.77 At aa22–29, NS1 contains a BC box consensus sequence for binding to the cellular protein Elongin C, an adaptor protein subunit of E3 ubiquitin ligases.78,79 NS1 also contains a putative binding domain for Cul2, a member of the cullin family of E3 ligase scaffolding proteins, identified by comparison of the NS1 amino acid sequence with those of other Cul2-binding proteins. This putative binding sequence is likely required for NS1 association with Cul2.78 These two cellular proteins are subunits of an E3 ligase complex that polyubiquitinates signal transducer and activator of transcription 2 (STAT2), an essential transcription factor required in the type I and type III IFN signaling pathways, and targets STAT2 for proteasomal degradation during RSV infection. Expression of NS1 enhances proteasomal degradation of STAT2, an activity induced primarily by NS2.78,80 It is possible that NS1 interacts with the Elongin C and Cul2 E3 ligase components to aid in inhibition of IFN signaling via STAT2. However, mutation of the Elongin C-binding domain of NS1 destabilized NS1, suggesting these residues may also play a structural role.79 Unlike NS1, NS2 does not contain direct binding domains for E3 ligase proteins, although its C-terminus is essential for STAT2 degradation, indicating that NS1 and NS2 form a multi-subunit STAT2-degradation complex.35,78,81

Upstream of STAT2 degradation, NS1 and NS2 alter the expression levels of two cellular signaling proteins, inhibitor of nuclear factor kappa-B kinase subunit epsilon (IKKε) and tumor necrosis factor (TNF) receptor-associated factor 3 (TRAF3), which are important for IFN-α/β production. While the precise mechanisms by which NS1 and NS2 affect IKKε or TRAF3 activity are unclear, it is apparent that the core of both NS1 and NS2 is essential. The 10 C-terminal amino acids of NS1 and the 20 N-terminal amino acids of NS2 are dispensable for interaction with IKKε. This is consistent with the disorder predictions, in which these dispensable regions are predicted to be relatively flexible and are therefore not required for stable interactions. NS1 interaction with TRAF3 excludes the 20 N-terminal and C-terminal amino acids, whereas NS2 interaction with TRAF3 requires aa21–94.81,82 NS2 residues aa64–65 are predicted as a potential MoRF (Table 5), and may therefore be an interesting target within this region for TRAF3 interaction.

Table 5.

Potential interaction sites within regions of intrinsic disorder. The ANCHOR and MoRFpred algorithms were applied to the proteome of RSV strain A2, and the list of potential interaction residues and the percentage of potential interaction sites are displayed. Bolded residues in the ANCHOR predictions represent regions of high confidence

Protein | UniProt ID | ANCHOR | % ANCHOR | MoRFpred | % MoRFpred
NS1 | P04544 | - | - | 11–14, 131–132, 135, 138 | 5.76
NS2 | P04543 | - | - | 12, 64–65, 116–123 | 8.87
N | P03418 | 47–51, 159–166 | 3.32 | 9–13, 162–166, 245, 368, 391 | 3.32
P | P03421 | 1–8, 18–27, 39–51, 60–63, 79–87, 98–108, 118–121, 141–156, 170–175, 192–192, 220–228, 235–241 | 40.66 | 12, 14, 18, 22, 44–47, 60, 62–64, 81–85, 100–107, 220–222, 235–241 | 14.52
M | P03419 | - | - | - | -
SH | P04852 | - | - | 11–15 | 7.81
G | P03423 | 111–116, 162–190, 239–255, 257–269, 278–286 | 24.83 | 19, 112–113, 164–171, 242, 279–284, 298 | 6.38
F | P03420 | 139–141 | 0.52 | 16, 18, 58, 113–115, 138–143, 569–570, 572–573 | 2.79
M2-1 | P04545 | 127–134, 148–156, 165–171 | 12.37 | 11–12, 128–134 | 4.64
M2-2 | P88812 | - | - | 11–12, 82–89 | 11.11
L | P28887 | 124–128, 600–600 | 0.28 | 17–19, 155–156, 208, 584–588, 668, 714, 731–732, 838–842, 847, 1041, 1059, 1174, 1247, 1294–1295, 1313–1314, 1331–1334, 1336, 1486, 1505, 1569, 2052–2057, 2157–2158, 2164–2165 | 2.17
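
The percentage columns of Table 5 appear to be the number of predicted interaction residues divided by protein length. The sketch below illustrates that arithmetic for three proteins whose lengths are stated in the text (N, 391aa; P, 241aa; F, 574aa). The function names are hypothetical, and the residue ranges are transcribed from Table 5 with single sites written as one-residue ranges.

```python
def count_residues(ranges):
    """Count residues covered by inclusive (start, end) ranges."""
    covered = set()
    for start, end in ranges:
        covered.update(range(start, end + 1))
    return len(covered)


def percent_covered(ranges, protein_length):
    """Percentage of the protein covered by the ranges, to two decimals."""
    return round(100 * count_residues(ranges) / protein_length, 2)


# Ranges transcribed from Table 5; lengths are those stated in the text.
n_anchor = [(47, 51), (159, 166)]                      # N, 391 aa
p_morfpred = [(12, 12), (14, 14), (18, 18), (22, 22),  # P, 241 aa
              (44, 47), (60, 60), (62, 64), (81, 85),
              (100, 107), (220, 222), (235, 241)]
f_anchor = [(139, 141)]                                # F, 574 aa

print(percent_covered(n_anchor, 391))    # 3.32
print(percent_covered(p_morfpred, 241))  # 14.52
print(percent_covered(f_anchor, 574))    # 0.52
```

These values match the corresponding table entries, which supports reading the percentage columns as per-protein residue coverage.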

In addition to interacting with innate immune response proteins, NS1 and NS2 both contain putative microtubule-associated protein 1B (MAP1B) binding domains at their C-terminal ends, both of which coincide with predicted MoRFs (Table 5).81 Interestingly, this binding site coincides with the DLNP sequence, the four extreme C-terminal amino acids of each nonstructural protein and the only sequence that is conserved between the two. The NS1 DLNP sequence is not required for any known function; however, the STAT2- and IKKε-related functions of NS2 require its DLNP sequence. MAP1B expression appears to enhance NS2-induced STAT2 degradation,81 indicating a conserved region of the nonstructural proteins that is used for one of their synergistic functions.

The RSV nonstructural proteins are thought to exist as dimers or oligomers during infection.82 The functional overlap and synergistic behavior of NS1 and NS2 likely require dimerization or oligomerization for full activity; co-expression of the two proteins results in enhanced expression or stabilization. Similar to regions involved in cellular protein binding, NS1 and NS2 homo- and hetero-dimerization with one another require central aa21–119 and aa21–104, respectively, each excluding 20aa C-terminal regions.81

Small, viral proteins with multiple functions or binding partners often encode intrinsically disordered regions to allow different functions of the same amino acids, at specific points throughout infection. The IDRs of NS1 and NS2, which reside at the extreme N- and C-termini, are potentially sites of protein–protein binding, since both NS proteins bind several cellular interacting partners (Fig. 5). In the case of most putative functions of NS1 and NS2, the N- and C-terminal domains are not required and these activities require structured domains in the central region of the protein. However, several regions within the core of NS1 and NS2 are predicted to be relatively flexible, raising the possibility that these regions may not be directly involved in stable interactions.

Fig. 5.

Nonstructural proteins NS1 and NS2. PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; dark blue domain = protein binding with labeled binding partner Elongin C (EC); line = putative domain with labeled binding partner or type of oligomerization. NS1 (A) NS2 (B).
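
The 0.5 cutoff used in these plots can be turned into discrete IDR calls by collecting contiguous runs of residues that score above the threshold. The sketch below is a minimal illustration of that idea, not the procedure used in this study: the function name is hypothetical, the scores are invented toy values rather than PONDR® output, and the minimum run length of 5 residues is an assumption based on the smallest IDR size group (5–10aa).

```python
def call_disordered_segments(scores, threshold=0.5, min_length=5):
    """Return inclusive 1-based (start, end) runs where score > threshold."""
    segments = []
    run_start = None
    for position, score in enumerate(scores, start=1):
        if score > threshold:
            if run_start is None:
                run_start = position  # open a new disordered run
        elif run_start is not None:
            # The run ended at the previous residue; keep it if long enough.
            if position - run_start >= min_length:
                segments.append((run_start, position - 1))
            run_start = None
    # Close a run that extends to the C-terminus.
    if run_start is not None and len(scores) - run_start + 1 >= min_length:
        segments.append((run_start, len(scores)))
    return segments


# Toy per-residue scores (invented for illustration only).
toy_scores = [0.8, 0.7, 0.9, 0.6, 0.7, 0.6, 0.3, 0.2, 0.6,
              0.7, 0.1, 0.2, 0.55, 0.6, 0.7, 0.8, 0.9]

print(call_disordered_segments(toy_scores))  # [(1, 6), (13, 17)]
```

The two-residue run at positions 9–10 is discarded because it is shorter than the assumed minimum length.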

Nucleoprotein.

The N protein protects the RSV genome from cellular RNases and cytosolic antiviral sensors by encapsidating the RNA in a ring structure termed the nucleocapsid, with one N decamer per ring of RNA.20 The crystal structure of the 391aa N in complex with RNA revealed several key residues in N-RNA binding, and more recent studies further characterized the specific amino acid residues of N required for oligomerization and interaction with viral RNA.20,83 Residues involved in RNA-binding include K170, D175, R184, R185, Y337 and R338, indicating two potential N domains critical for nucleocapsid formation. Mutation of residues K170 and R185 inhibits N oligomerization in addition to RNA binding, but does not affect other N interactions.83

The RNA-dependent RNA polymerase, consisting of the large subunit L polymerase and P phosphoprotein cofactor, must gain access to RNA within the nucleocapsid for transcription and RNA synthesis. The P protein interacts with N to unwind the nucleocapsid and allow entry and initiation of RdRp activity. The P-binding region of N has been mapped to its central aa36–253 core.23,84 Interaction with P while arranged as part of the nucleocapsid requires N residues K46, M50, I53, R132, Y135, R150 and H151, and the first two residues are predicted to be potential interaction residues (ANCHOR). N–P binding is also correlated with inclusion body formation as well as polymerase activity, for which both N residues I53 and R132 are essential.23

Alternatively, N–P binding occurs during recruitment of monomeric N to genomic RNA for nucleocapsid formation by P chaperone activity. Furthermore, P binds monomeric N to prevent N from self-oligomerizing or interacting with cellular RNA. K170 and R185 are required for higher-order N structure and RNA-binding; however, they are dispensable for P interaction with monomeric N. It is possible that P binding to monomeric and oligomeric N requires separate domains of N. Further studies are needed to distinguish P binding of monomeric N from P binding of N present in nucleocapsids.

Structural data depict N (in complex with RNA) as a highly ordered protein with C- and N-terminal domains that are connected by a hinge region where the RNA groove exists. These domains interact with adjacent equivalents within the ring structure. Each domain is composed of α-helices with a disordered region at the extreme C- or N-terminal end that projects from the decameric N-RNA ring.20 The N-terminal projection, aa1–35, functions to stabilize the nucleocapsid structure, while the disordered C-terminal aa361–391 reside in the space between helical nucleocapsid turns.20,55

Overall, the ring assumes a highly stable yet flexible structure, which is somewhat reflected in the PONDR® VL-XT data (Fig. 6A). The N-terminus is predicted to have a MoRF at aa9–13 (Table 5), and this region is within a helix in the crystal structure (Fig. 6B). Therefore, the intrinsic disorder of this region may have a stabilizing function on the nucleocapsid structure. The larger peak at aa184–195 could be the result of the RNA-binding residues at K170, D175, R184 and R185, since these sites would extend from the ring structure for interaction with the RSV genome. The salt bridge formed between D175 and R338 punctuates the IDR peak at residues aa332–338. The C-terminal 20 amino acids are disordered, and the N-RNA ring structure supports the lack of order in this region.20

Fig. 6.

Nucleoprotein N. (A) PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; line = putative domain with labeled binding partner; RNA = residues required for RNA-binding. (B) A predicted N-terminal MoRF makes intramolecular and intermolecular contacts with a region in the C-terminus. PDB structure 2WJ8 was used for this analysis. Red represents aa1–35, green represents aa282–312 on the same protein, while magenta represents aa282–312 on a different subunit. The orange line represents bound RNA. (C) A predicted MoRF in the N protein is surrounded by several helices. PDB structure 2WJ8 was imaged using Pymol analysis. Red indicates the region predicted to be a MoRF by ANCHOR and MoRFpred, aa159–166. The orange line represents bound RNA.

We observe IDR peaks at residues aa120–125 and aa184–195 that coincide with α-helices described by the nucleocapsid crystal structure. The first appears to be part of a larger α-helix, while the second is a kinked helix surrounded by long stretches of disordered residues. Both ANCHOR and MoRFpred predict a MoRF around aa159–166, which is helical in the crystal structure and flanked by long stretches of disordered residues (Table 5). Interestingly, this region is surrounded by several α-helices in the tertiary structure, potentially providing several weak stabilizing interactions that promote helix formation (Fig. 6C). Of course, other external factors such as solvent conditions or the packing of the crystal lattice itself may also stabilize these flexible regions.85 We must assume variations in intrinsic disorder between monomeric and nucleocapsid-bound N based on the known experimental differences in residue usage for the different N assemblies. The residues required for protein- and RNA-binding to decameric N are dispensable for monomeric N; the IDR data can therefore be interpreted differently for N–P assemblies (Fig. 6).

Phosphoprotein.

The 241 amino acid RSV P protein is the essential RdRp cofactor, vital to viral RNA synthesis. P interacts with each component of the ribonucleoprotein complex: L, N, viral genomic RNA and the M2-1 transcription anti-termination factor. As described for the N protein, P binds both monomeric N at the P N-terminus, and N in the nucleocapsid at the P C-terminus, although the P C-terminus can also bind monomeric N.23,83,84 N-terminal P amino acids 1–29, which contain residues predicted for interaction (Table 5), are involved in P chaperone activity. Residues aa2–10 (ANCHOR) and aa20–26 (ANCHOR, MoRFpred) are directly involved in recruiting monomeric N to the ribonucleoprotein complex. P mutations in F4, F8, F20, L21 and I24 (ANCHOR) inhibit interaction with N (Table 5).83

Immediately upstream of the C-terminal RNA-binding domain, P contains an L-binding region, from aa212–239 (Table 5).86 The proximity of the RNA- and L-binding domains within the P protein sequence is suitable for P contribution to transcription and RNA synthesis. Concurrently, P binds M2-1, which is also required for RSV transcription. Mason et al. mapped an M2-1-binding region to aa100–120, just upstream of the P oligomerization domain.87 Additionally, mutation of residues L101, Y102, T108 or F109 results in inhibition of P interaction with M2-1, indicative of an M2-1 binding domain within this region of the P protein.88

P exists as a homotetramer when bound to the ribonucleoprotein complex. The oligomerization domain is at the core of the P protein, residues aa120–150 (ANCHOR), within what is predicted to be a coiled-coil domain (aa120–160).89 Fig. 7 agrees with previously published predictions of intrinsically disordered regions flanking the P oligomerization domain, while the coiled-coil domain itself is ordered.90 Circular dichroism studies indicated a high α-helical content in the central, ordered region of P, further supporting our PONDR® VL-XT predictions (Fig. 7).90

Fig. 7.

Phosphoprotein P. PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; dark blue domain = protein binding with labeled binding partners N, M2-1 and L; light blue domain = oligomerization; orange domain = RNA-binding; blue hexagon = phosphorylation site.

P is phosphorylated at numerous sites: S30, S39, S45, T46, S54, T108, S116, S117, S119, S143, S156, T160, S161, T210, S215, S232, and S237.88,91–95 Phosphorylation varies between transient and constitutive, and it is unclear how these modifications are related to function, or whether they are sequential or co-dependent. Some studies have found phosphorylation dispensable for oligomerization, while others have determined it is required.89,96 Phosphorylation is not necessary for replication, P–N or P–M2-1 interactions, suggesting L-binding, transcriptional activity or possibly budding as potential purposes for modification.89 However, phosphorylation at P residues S116, S117 and S119 is required for M2-2 regulation of the switch from viral transcription to replication, suggestive of a potential M2-2 binding site overlapping that of M2-1.97 In the PONDR® VL-XT plot, the phosphorylated residues are scattered throughout ordered and disordered regions of P without any noticeable pattern (Fig. 7). However, many of the phosphorylated residues fall within sharp dips likely to correspond with MoRFs, indicating that these residues may still be natively disordered (Table 5). Indeed, when phosphorylated residues are plotted against the more accurate PONDR®-FIT plot, there is a clear enrichment of phosphorylated residues within disordered regions, a phenomenon that has been well documented in other systems.98

P is a highly disordered protein, which is to be expected considering its various binding partners during infection. The high level of disorder in the region required for polymerase activity suggests structural flexibility necessary for binding multiple partners simultaneously. The PONDR® VL-XT plot demonstrates that the regions required for interaction with N, M2-1, and L fall within dips of the plot, which may indicate that the flexible P protein coordinates highly transient interactions during infection. Unlike previous reports, we find the majority of the sequence upstream of the oligomerization domain to be ordered, with disordered peaks at aa29–32 and aa47–77 (Fig. 7). The latter contains MoRFs predicted by ANCHOR and MoRFpred (Table 5).

Matrix protein.

The 256aa RSV M protein assembles the encapsidated RNA genome and associated structural proteins into the progeny virion in preparation for budding from the host plasma membrane. As part of the ribonucleoprotein complex, M2-1 binds M during virus assembly as a mediator of M interaction with genomic RNA. The M2-1 binding region has been assigned to the N-terminal 110aa of the M protein.29 The sequential process of virus assembly requires M interaction with RNA to facilitate incorporation of genomic and protein products into a virion.99 Sites of RNA-binding (K121, K130, K156, K157, R170) have delineated a putative RNA-binding domain from amino acids 120–170, which overlaps the zinc-finger and central oligomerization domains.99

The stable and biologically active form of M is a dimer. The oligomerization region at aa92–105 is responsible for M dimerization specifically, which is critical to virus-like particle (VLP) formation and budding.100 Subsequently, M forms higher-order oligomers to induce a switch from RNA synthesis to virus assembly and budding. M higher-order oligomerization is key to formation of a viral structure comprised of M oligomers, encompassing the viral nucleocapsid that will bud through the plasma membrane to form a virus particle. There are several well-defined oligomerization domains dispersed across the protein at aa63–68, aa129–134, aa144–163 and aa225–235, emphasizing the importance of M oligomerization.100 Interestingly, many of the oligomerization regions coincide with peaks on the disorder plots, potentially indicating that stable dimerization is coordinated by multiple, weak interactions by means of flexible regions in the protein. In addition, a putative oligomerization domain has been mapped to aa205–220, near the C-terminus.99,101

M nuclear trafficking during infection is one key feature of M distinct from any other RSV protein, although it is shared among other M proteins of the order Mononegavirales.102-104 Early in infection, M localizes to the nucleus, potentially for inhibition of host cell transcription.105 The M nuclear localization sequence is limited to aa110–183, encompassing the zinc-finger and central oligomerization domains, as well as the putative central RNA-binding domain.106 The nuclear export signal for trafficking back into the cytoplasm for virus assembly late in infection is located at aa194–206.107 M protein aa114–144 were initially identified as containing a putative zinc-finger domain via sequence alignment with closely related viruses.108 Indeed, M nuclear accumulation depends upon metal ion availability, indicating the presence of a metal-binding domain critical to nuclear trafficking.100,106

M undergoes phosphorylation at T205, the final residue of its nuclear export signal. T205 also marks the beginning of a putative oligomerization domain from aa205–220. This phosphorylation is essential for higher-order oligomerization of M during assembly, and mutation of T205 therefore attenuates RSV.101

The crystal structure of M was initially solved for the monomeric form of M, but has also recently been solved for the more biologically relevant dimeric form of M.100,109 The monomer structure describes M as composed primarily of β-sheets, with some α-helices interspersed. Connecting the N-terminal domain (aa1–126) with the C-terminal domain (aa162–255) is an unstructured 36aa linker.109 Overall, M is a highly ordered protein with two predicted IDR peaks (Fig. 8A). While M is similar to P with its various domains and binding partners, it displays more order within and between those domains. The second, dimeric structure of M depicts the two IDR peaks as each coinciding with oligomerization domains.100 In particular, the disordered region from aa63–68 seems to facilitate inter-molecular contacts with other known oligomerization regions at aa129–134 and aa227–231 (Fig. 8B). Additionally, it should be noted that the M protein has no MoRFs predicted by either ANCHOR or MoRFpred (Table 5).

Fig. 8.

Matrix protein M. (A) PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; line = putative domain with labeled function (NLS = nuclear localization sequence, Olig. = oligomerization); light blue domain = oligomerization; central red domain = zinc-finger domain (ZFD); C-terminal red domain = nuclear export signal; RNA = residues required for RNA-binding; blue hexagon = phosphorylation site. (B) Intrinsic disorder regulates inter-molecular contacts of dimerization regions. Analysis of PDB structure 4V23 using Pymol reveals that the intrinsically disordered loop (red, aa63–68) likely facilitates contacts with two other oligomerization regions (blue, aa127–131; magenta, aa227–231). Black spheres represent potassium ions.

Small hydrophobic protein.

The RSV SH protein is a type II integral membrane protein and a member of the viroporin family of small, viral membrane proteins that oligomerize to enhance fusion and entry into the host. Little is understood regarding the function of SH, which at 64aa is the smallest of the three RSV surface proteins. Infection with SH-deleted RSV results in attenuation of RSV and decreased apoptosis of infected cells.110

The SH transmembrane domain from aa18–43 takes on an α-helical secondary structure, which is depicted as highly ordered in our IDR predictions (Fig. 9). The N-terminal 17aa comprise the intracellular, cytoplasmic domain, while the C-terminal 21aa reside in the extracellular space.111 Our predictions suggest an IDR composed of the C-terminal aa58–64 of the extracellular domain, with increases in intrinsic disorder beginning around residue 48. A flexible region on the extracellular surface may hint at its function, potentially as a target for transient protein–protein interactions. The only predicted MoRF (MoRFpred) is found at aa11–15, which may be an interesting target for future studies (Table 5).
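
The SH topology described above can be written as three inclusive residue ranges and checked for consistency with the 64aa length. The sketch below is illustrative only: the helper names are hypothetical, and the extracellular range aa44–64 is inferred from the statement that the C-terminal 21aa are extracellular.

```python
SH_LENGTH = 64
SH_DOMAINS = {
    "cytoplasmic": (1, 17),      # N-terminal 17 aa
    "transmembrane": (18, 43),   # aa18-43
    "extracellular": (44, 64),   # C-terminal 21 aa (inferred boundaries)
}


def domains_tile_protein(domains, length):
    """True if the inclusive ranges cover 1..length with no gaps or overlaps."""
    expected_start = 1
    for start, end in sorted(domains.values()):
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == length + 1


def domains_for_range(domains, start, end):
    """Names of domains that overlap the inclusive range start..end."""
    return [name for name, (d_start, d_end) in domains.items()
            if start <= d_end and end >= d_start]


print(domains_tile_protein(SH_DOMAINS, SH_LENGTH))  # True
print(domains_for_range(SH_DOMAINS, 11, 15))        # ['cytoplasmic']
print(domains_for_range(SH_DOMAINS, 58, 64))        # ['extracellular']
```

With these boundaries, the predicted MoRF at aa11–15 maps to the cytoplasmic domain, whereas the predicted C-terminal IDR (aa58–64) lies in the extracellular domain.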

Fig. 9.

Small hydrophobic protein SH. PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Bright green domain = cytoplasmic; yellow domain = transmembrane; purple domain = extracellular.

Glycoprotein G.

RSV G is the major surface glycoprotein for virus attachment to the host cell.112 Depending on the genotype, G is anywhere from 298 to 319 amino acids long. G is a type II integral membrane protein with the N-terminal 36aa residing in the cytoplasm and aa67–298 in the extracellular space. The helical transmembrane domain is found in aa37–66.113 Most of the cytoplasmic domain and the entire transmembrane domain are predicted to be ordered (Fig. 10A). A soluble form of G (sG), which is 65–74aa shorter at the N-terminus, is secreted during infection, presumably to act as an antibody decoy.114–116 It is interesting to note that the majority of the ordered sequence of G is absent in the secreted form.

Fig. 10.

Glycoprotein G. PONDR® VL-XT plot of A2 (A) or averaged BA strains (B) overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Dark blue domain = protein binding with labeled binding partners M, CX3C motif and heparin (HB); bright green domain = cytoplasmic; yellow domain = transmembrane; purple = extracellular mucin-like domains (MLDI and MLDII) with textured 20aa duplication in (B); red domain = central conserved region; red pentagon = potential glycosylation sites; brackets = disulfide bond cysteine noose.

G is a heavily glycosylated protein, with 30–40 O-linked glycans and 4–5 N-linked glycans.117 The glycosylated sites are within one of the two mucin-like domains (MLDI and MLDII), which are highly variable in sequence. MLDI and MLDII distinguish G as the most variable RSV protein, which is useful for classification and diagnostic purposes.14,118 With few exceptions, the glycosylation sites reside within the highly disordered variable regions, which interact with many different antigenic sites (Fig. 10A). Appropriately, IDRs are well suited to interacting with a wide variety of binding partners, and are generally enriched in post-translational modifications.

The first six N-terminal residues of G, which are part of the cytoplasmic tail, interact with the M protein.30 This interaction is important during virus assembly, as M transports the viral nucleocapsids to sites on the plasma membrane from which budding will occur. Here, RSV surface proteins are embedded in the membrane, with exposed cytoplasmic tails for M-binding and subsequent nucleocapsid envelopment by the plasma membrane before budding from the infected cell.30,119 The M-binding site of G is the only region of the cytoplasmic domain that is predicted to be disordered (Fig. 10A). As described for other RSV protein–protein binding sites, disorder is likely required for the G–M interaction and subsequent budding.

The central conserved domain (CCD) sits between the two MLDs, from aa164–177 (Table 5).113 The C-terminal end of the CCD contains two cysteine residues, C173 and C176, which form disulfide bonds with C186 and C182, respectively. These four cysteines are linked in a 1–4 and 2–3 manner to create a cysteine noose.120 The C-terminal cysteine residues of the cysteine noose also function as part of the CX3C motif, which competes with the chemokine CX3CL1, also known as fractalkine, for its receptor CX3CR1.121 Immediately downstream of the CX3C motif is a heparin-binding domain at residues 184–198.122 This basic domain is the site of attachment to the cell surface receptor heparan sulfate on immortalized cells, although there are likely alternative cellular receptors for RSV attachment to the host.113,123,124 The highly conserved CCD displays the most consistent disorder predictions among the different G genotypes (Fig. 10A). Recent studies have shown that the CX3C domain is essential for RSV attachment to primary human airway epithelial cells, indicating that this region controls virus binding.125–127
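
The "1–4 and 2–3" description of the cysteine noose follows directly from ranking the four cysteines by sequence position. The sketch below illustrates this with the bonds stated above; the function name is hypothetical, and the spacing check assumes that the CX3C motif is formed by C182 and C186, the two C-terminal noose cysteines.

```python
# Disulfide bonds stated in the text: C173-C186 and C176-C182.
disulfides = [(173, 186), (176, 182)]


def linkage_pattern(bonds):
    """Describe each bond by the 1-based rank of its cysteines along the sequence."""
    cysteines = sorted(position for bond in bonds for position in bond)
    rank = {position: index for index, position in enumerate(cysteines, start=1)}
    return sorted((rank[a], rank[b]) for a, b in bonds)


print(linkage_pattern(disulfides))  # [(1, 4), (2, 3)]

# Assumption: CX3C is formed by C182 and C186, so three residues lie between them.
residues_between = 186 - 182 - 1
print(residues_between)  # 3
```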

Within the last two decades, new genotypes of G have evolved from both A and B subgroups. In 1999, the BA genotype from subgroup B was discovered, which contains an exact 20aa duplication inserted as aa260–279 (ANCHOR), within the MLDII domain.17,128–130 In 2009, isolation of a new A genotype, ON1, was first reported.131 Comparable to the BA genotype, the ON1 genotype contains a 24aa duplication of aa261–283, inserted as aa285–307, within its MLDII.131 The insertions provide up to seven additional glycosylation sites within the expanded variable region.131 The BA genotype has rapidly become the most prevalent circulating B genotype worldwide, suggesting that this duplication event confers a fitness advantage to RSV. It will be interesting to observe whether ON1 follows a circulation pattern resembling that of the current BA strain within the next decade.132 Increased glycosylation sites, combined with added sequence in the region responsible for RSV antigenic drift, may account for the rapid circulation of ON1 and BA throughout the world.

As shown in Fig. 10B, G is a highly disordered protein. When comparing ON1 and BA sequences with other A and B genotypes, respectively, the overall disorder of G is increased in ON1 and BA strains with the MLDII duplication. We compared the averaged PONDR® VL-XT BA data to the A2 data both to showcase the changes resulting from the introduction of 20aa in the MLDII, and also to exhibit the RSV B subtype. Comparing G–A with G–BA, it is apparent that the increased overall disorder of BA is at least partially a result of the duplication near the C-terminus. While most of the extracellular domains of both graphs display disorder, the sequence from aa223–246 (Table 5), which drops into the ordered section of the graph in G–A, is no longer ordered in G–BA. Since the only difference between ON1 and BA genotypes and their older A and B counterparts is the duplication, we can theoretically correlate increased disorder with the insertion. Furthermore, these data suggest an indirect link between increased disorder of G and higher circulation of ON1 and BA genotypes.

Interestingly, there is another key difference between the G–A and G–BA graphs, upstream of the duplication. The dip in the disordered region of the N-terminal end of G at aa91–109 disappears, and instead the G–BA sequence remains disordered from aa91–156 (Fig. 10B). Given its location within MLDI, this difference probably reflects the high level of variation in the two mucin-like domains of G, which may represent modified protein function.

Glycoprotein F.

In contrast to SH and G, the 574aa F fusion protein is a type I integral membrane protein. F functions primarily to fuse the viral envelope with the plasma membrane to release the viral nucleocapsid into the cytoplasm of the host cell.2 The first 528 amino acids make up the extracellular domain, aa529–550 the transmembrane domain and aa551–574 reside in the cytoplasm.2 The RSV F protein is synthesized as the inactive precursor protein F0 and modified with a C-terminal palmitoylation at C550 and 5–6 N-linked glycans, depending on the strain of RSV.133,134 Here, we modeled domain maps around the A2 strain, which undergoes glycosylation at 5 residues: N27, N70, N116, N126 and N500.134 F0 is cleaved by a cellular furin-like protease at extracellular domain residues 109 and 136, yielding three distinct peptides: the F1 and F2 active subunits, and a 27aa peptide (p27) derived from the intervening sequence.135–138 The function of p27 is unknown and it dissociates shortly after cleavage.139,140 The PONDR® VL-XT data show p27 to be highly disordered, although there is also a trend of disorder peaks at potential glycosylation sites, of which there are two within p27 (Fig. 11). The signal peptide, the N-terminal 25aa of F, is also cleaved at its C-terminal end in generation of the active F monomer.134 The resulting F1 and F2 chains remain attached via a disulfide bond linking C69 with C212.134,141 Fully mature F thus contains just three N-linked glycans.
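
The fragment and glycan counts in this paragraph can be reproduced from the stated coordinates. The sketch below is a worked check under one stated assumption, namely that each cleavage occurs immediately after the listed residue (25, 109 and 136); the fragment boundaries and helper names are illustrative rather than taken from the cited structural work.

```python
F_LENGTH = 574
# N-linked glycosylation sites of strain A2 listed in the text.
GLYCOSYLATION_SITES = [27, 70, 116, 126, 500]

# Assumption: cleavage occurs immediately after residues 25, 109 and 136.
fragments = {
    "signal peptide": (1, 25),
    "F2": (26, 109),
    "p27": (110, 136),
    "F1": (137, F_LENGTH),
}


def fragment_length(name):
    """Number of residues in the named fragment (inclusive coordinates)."""
    start, end = fragments[name]
    return end - start + 1


def glycans_in(name):
    """Glycosylation sites that fall within the named fragment."""
    start, end = fragments[name]
    return [site for site in GLYCOSYLATION_SITES if start <= site <= end]


print(fragment_length("p27"))  # 27
print(glycans_in("p27"))       # [116, 126]

# Mature F retains only the F2 and F1 chains.
mature_glycans = glycans_in("F2") + glycans_in("F1")
print(mature_glycans)          # [27, 70, 500]
```

Under this assumption, p27 is 27 residues long and carries two of the five glycosylation sites, leaving three N-linked glycans on mature F, consistent with the text.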

Fig. 11.

Glycoprotein F. (A) PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Red domains from N- to C-termini = signal peptide (SP), p27, fusion peptide (FP); dark green = heptad repeat domains (HRA and HRB) for intra-protein interactions; yellow domain = transmembrane; bright green domain = cytoplasmic; purple domain = extracellular; red pentagon = potential glycosylation sites; green diamond = palmitoylation site; bracket = disulfide bond; parallel lines depict proteolytic cleavage sites, resulting in F1 and F2 subunits designated by bracketed lines above domain map. (B) The prefusion state (PDB ID: 4JHW) bound to antibody D25 (gray), imaged using PyMOL. The red residues (aa63–74, aa200–213) represent the antibody recognition epitope ASØ.

Upon triggering, the coiled-coil heptad repeat domain HRA near the N-terminus of the F1 subunit, aa157–209, lengthens and trimerizes with the adjacent F protein HRA domains.113,142 Specifically, the unstructured regions connecting the short α-helices that comprise HRA refold into α-helices themselves, generating an extended α-helix.113 This remodeling causes HRA trimerization and insertion of the hydrophobic stretch of amino acids at the N-terminus of the F1 subunit, termed the fusion peptide (FP, from aa136–157), into the plasma membrane of the host cell.134,142 FP incorporation into the host membrane allows for further intra-protein interactions, in which the HRB domains, from aa476–524 of each F protein, fold to interact with the HRA trimer to form a stable α-helical trimer consisting of HRA-HRB heterodimers. This drives the viral and host membranes together for fusion.143 Each heptad repeat domain induces a single IDR peak (Fig. 11). Although we know all HRA and HRB domains are α-helical, they are responsible for refolding F for its key fusion activity. In addition, the HRA IDR peak resides within the region of F that switches from unstructured to α-helical in the fusion process.
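Statements such as "each heptad repeat domain induces a single IDR peak" rest on reading a per-residue disorder plot against a domain map. A minimal sketch of that reading is given below, using the 0.5 cutoff stated in the figure legends. It does not reproduce PONDR® VL-XT. The scores are synthetic toy values, and the function names `disordered_segments` and `overlaps`, as well as the `min_length` parameter, are hypothetical.

```python
def disordered_segments(scores, threshold=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs where score > threshold.

    `scores` is a list of per-residue disorder scores; runs shorter than
    `min_length` residues are discarded.
    """
    segments = []
    start = None
    for i, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = i  # a new disordered run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        segments.append((start, len(scores)))
    return segments


def overlaps(segment, domain):
    """True if two 1-based inclusive intervals share at least one residue."""
    return segment[0] <= domain[1] and domain[0] <= segment[1]


# Synthetic scores for a 10-residue toy sequence (not RSV data).
toy_scores = [0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.1, 0.55, 0.9, 0.3]
toy_domain = (4, 6)  # hypothetical functional domain

segments = disordered_segments(toy_scores)
peaks_in_domain = [s for s in segments if overlaps(s, toy_domain)]

print(segments)
print(peaks_in_domain)
```

With real predictor output, the same two functions could be applied to the HRA (aa157–209) and HRB (aa476–524) intervals given in the text to count the predicted IDR peaks overlapping each domain.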

The C-terminal tail of F facilitates the release of M-ribonucleoprotein complexes from the inclusion bodies, which are the sites of RNA synthesis and translation.144,145 The phenylalanine residue F572 within the F cytoplasmic domain is critical for mediating assembly (Table 5). While our predictions display the transmembrane and cytoplasmic domains as ordered, there is a peak in the data corresponding to F572 (Fig. 11).

Crystal structures have been solved for both the pre- and post-fusion forms of F.33,146 Of note, pre-fusion F contains an unstructured region from aa62–69 that connects the structured N- and C-terminal portions of F2. This region shifts drastically during the switch from pre- to post-fusion F, in comparison to the F2 peptide as a whole. Similarly, the F1 α-helix from aa196–210 alters its orientation during pre- to post-fusion transformation. These two regions are part of the antigenic site Ø (ASØ), located at the apex of the pre-fusion F trimer, and they account for most of the F variability in the otherwise highly conserved protein.33

Crystal structures have been solved for several complexes in which an antibody is bound to F at an antigenic site.33,147,148 For our purposes, these complex structures are useful in determining correlations, if any, between our predicted IDRs and F-antibody binding sites. Interestingly, all identified antigenic sites are positioned within PONDR® VL-XT dips, potentially indicating a MoRF (Fig. 11A and B). This observation that the residues involved in ASØ fall within regions of predicted MoRFs is consistent with the high sequence variability and recognition of this region by multiple antibodies. Overall, F is a moderately ordered protein, which is reflected in its high global sequence conservation as well as its use of higher order oligomerization to carry out its fundamental fusion activity.

M2-1 and M2-2 proteins.

The M2 gene encodes two proteins, M2-1 and M2-2, from two distinct, overlapping ORFs.149 M2-1 is an antitermination factor that is important for processive transcription by the RSV polymerase. M2-2 is approximately half the size of M2-1 and functions to switch RNP activity from viral transcription to genomic RNA synthesis.25 While it was previously noted that M2-2 inhibits viral transcription late in infection, which is dependent on P phosphorylation, the exact mechanism of this activity is unknown.97,150 Our intrinsic disorder predictions thus do not help to elucidate M2-2’s role in transcription and replication; therefore we will focus on M2-1 (Fig. 12B).

Fig. 12.

Transcriptional regulators M2-1 and M2-2. PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; red = zinc-finger domain (ZFD)/N-binding domain; light blue domain = oligomerization; orange domains = RNA-binding; dark blue domains = protein binding with labeled binding partner P; blue hexagon = phosphorylation sites. (A) M2-1. (B) M2-2.

M2-1 regulates transcription through its anti-termination activity.25,151 The 194aa M2-1 protein is found within cytoplasmic inclusion bodies, where it associates with the RNP complex via P, the N-terminal domain of M and RNA.29,87,152,153 Core M2-1 residues aa53–177 (ANCHOR predicts 3 MoRFs, MoRFpred predicts 1, Table 5) bind P and RNA in a competitive manner that is independent of M2-1 phosphorylation.87,154 The more recently determined NMR structure of M2-1 revealed a partial overlap between RNA- and P-binding domains, supporting their competitive binding. In particular, residues K92-V97, L149-L152 and D155-K159 are sites of RNA-binding (ANCHOR). Interaction with P occurs at residues V127-S137 and L152-T164 (MoRFpred).155 M2-1 is recruited to viral inclusion bodies for RNA synthesis by association with P, while direct interaction with RNA is required for M2-1 transcription anti-termination and processivity activity.88,153,155 Based on our PONDR® VL-XT data, four out of the five RNA- and P-binding domains are of predicted structured sequence, within putative MoRFs. The disorder peak from aa144–160 contains one of the RNA-binding domains, as well as the N-terminal region of the P-binding domain from aa152–164 (ANCHOR) (Fig. 12A).
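The partial overlap between RNA- and P-binding residues, and their position relative to the aa144–160 disorder peak, can be checked with a short interval calculation. The sketch below is illustrative and uses only the residue ranges quoted in the text. The helper name `intersection` is hypothetical, and the way adjacent sites are grouped into "domains" is not specified in the text, so the tally here is by individual residue range.

```python
# Residue ranges quoted in the text (1-based, inclusive).
RNA_BINDING = [(92, 97), (149, 152), (155, 159)]
P_BINDING = [(127, 137), (152, 164)]
DISORDER_PEAK = (144, 160)


def intersection(a, b):
    """Return the shared (start, end) of two inclusive intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


# Residues annotated as both RNA-binding and P-binding.
shared = []
for rna_site in RNA_BINDING:
    for p_site in P_BINDING:
        common = intersection(rna_site, p_site)
        if common is not None:
            shared.append(common)

# Portion of each binding site that lies inside the aa144-160 disorder peak.
in_peak = {site: intersection(site, DISORDER_PEAK)
           for site in RNA_BINDING + P_BINDING}

print(shared)
print(in_peak)
```

By these coordinates, the shared residues are L152 and D155–K159, all of which fall inside the disorder peak, while only the N-terminal part (aa152–160) of the second P-binding range does. This is consistent with the competitive binding described above.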

M2-1 forms a disc-like assembly as a tetramer.156 The oligomerization domain for M2-1 is located from aa32–63. M2-1 activity and optimal transcription will not occur without proper M2-1 tetramer formation.154 The majority of the oligomerization domain contains sequence predicted to have high disorder, although the phosphorylated residues within this domain appear to either reside within a disordered region, or they themselves prompt disorder within the M2-1 structure (Fig. 12A).

While interactions with P and RNA are independent of phosphorylation, M2-1’s transcription anti-termination function is dependent on its phosphorylation.153 Though the major M2-1 species is not phosphorylated, the functionally active form is the minor, phosphorylated species.154,157 Host kinases phosphorylate M2-1 at serines 58 and 61 of its oligomerization domain.156 An additional residue at position T56 is potentially phosphorylated.157 As previously mentioned, the three potential phosphorylated residues all lie within a peak of predicted high disorder, indicating that an unstructured region of M2-1 is likely required for phosphorylation (Fig. 12A).

M2-1 contains an N-terminal zinc-finger domain from aa1–32, which interacts with the viral nucleocapsid, although this interaction is not required for M2-1 transcriptional activity.22,158 The ZFD is required for phosphorylation of M2-1, as well as its ability to bind zinc. It is possible the M2-1 zinc-binding activity is necessary for its anti-termination function through the interaction with the RNP complex.158

Unlike the predictions of a highly structured sequence overall for M2-2, M2-1 is one of the three RSV proteins predicted to display high intrinsic disorder (Table 2). The crystal structure of M2-1 revealed four functionally significant regions, including the aforementioned zinc-finger and oligomerization domains. In addition, there is a core domain important for antitermination activity as it contains RNA- and P-binding domains. Finally, there is an unstructured C-terminal region, which is supported by our IDR predictions (Fig. 12A).154,156

Large RNA-dependent RNA polymerase subunit L.

The large subunit of the RdRp, L, is essential for RSV transcription and replication. Due to the transcription gradient from the 3′ to the 5′ end of the RSV genome, the 2,165aa L protein is itself transcribed last and expressed at low levels during infection.2 Although there are numerous peaks of IDR spanning the lengthy L amino acid sequence, L is predicted to be a highly ordered protein, as shown in Table 2. Due to its low copy number and low stability, few structural details are known; however, several L functions critical to RSV infection have been described.

L contains two variable regions, VRI and VRII, at aa137–184 and aa1718–1764.159 Demonstrated by our PONDR® VL-XT predictions in Fig. 13, both variable regions are found within peaks of intrinsic disorder. There are six conserved regions throughout the sequence as well, labeled CRI-CRVI. The CRs were determined by sequence comparison of five NNS virus L proteins.160 While the variable regions each contain IDRs, there does not appear to be any correlation between the CRs and protein disorder and/or structure (Fig. 13). The CRs are often used as reference points across the vast L amino acid sequence for describing domain locations.

Fig. 13.

Large polymerase subunit L. PONDR® VL-XT A2 plot overlaid with protein domain map depicting amino acids assigned to specific function and post-translational modification, x-axis = amino acid position, y-axis = disorder score (ordered <0.5, disordered >0.5). Gray = unassigned protein sequence; red domains = variable regions (VRI and VRII); orange domains = RNA-binding (from N- to C-termini contain RdRp, PRNTase and 2′-O-MTase catalytic activities); conserved regions CRI to CRVI designated with bracketed lines above domain map.

The RNA-dependent RNA polymerase catalytic activity of L falls within CRII and CRIII, from aa693–877 (Table 5). This domain contains the signature GDNQ polymerase active motif, at aa810–813, which is responsible for phosphodiester bond formation during nucleotide incorporation.159,161 The GDNQ motif itself is predicted to be within an ordered structure; however immediately downstream of the active site is a predicted IDR within the RdRp catalytic domain (Fig. 13).

NNS virus mRNA cap formation requires a unique mechanism that differs from eukaryotic mRNA capping: it utilizes the L polyribonucleotidyl transferase (PRNTase) domain to create the 5′ mRNA cap independent of external transferase activity. Located at aa1152–1228 within CRIV, L mRNA capping activity is conserved amongst the NNS viruses.162,163 PONDR® VL-XT predictions show an IDR peak within the PRNTase domain, in CRIV (Fig. 13). After mRNA is transcribed and undergoes mRNA capping, the 5′-triphosphate is then methylated by a methyltransferase. Also conserved amongst all NNS viruses is the methyltransferase (2′-O-MTase) domain for cap 1 methylation of the mRNA 5′-cap, located after VRII in CRVI from aa1820–2008.164 This domain is found within a region of L predicted to be highly ordered. The inconsistencies between L conserved functional domains and predicted disorder regions support the poor correlation between L CRs and intrinsic protein disorder.
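The L coordinates quoted in this section can be collected into a small lookup table, sketched below, to see which annotated feature (if any) a given residue falls in and how much of L remains unassigned. This is an illustrative data structure only, not the authors' domain-mapping procedure. The residue ranges come from the text, whereas the names `L_FEATURES` and `features_at` are hypothetical, and the CR boundaries are omitted because they are not given here.

```python
L_LENGTH = 2165  # length of RSV L in amino acids, from the text

# Annotated features of L (1-based, inclusive), as quoted in the text.
L_FEATURES = {
    "VRI": (137, 184),
    "RdRp": (693, 877),
    "GDNQ motif": (810, 813),
    "PRNTase": (1152, 1228),
    "VRII": (1718, 1764),
    "2'-O-MTase": (1820, 2008),
}


def features_at(position):
    """Return the names of all annotated features containing a residue."""
    if not 1 <= position <= L_LENGTH:
        raise ValueError(f"position {position} is outside L (aa1-{L_LENGTH})")
    return [name for name, (start, end) in L_FEATURES.items()
            if start <= position <= end]


# Union of annotated residues; a set avoids double-counting the GDNQ motif,
# which is nested inside the RdRp catalytic domain.
annotated = set()
for start, end in L_FEATURES.values():
    annotated.update(range(start, end + 1))

annotated_fraction = len(annotated) / L_LENGTH

print(features_at(811))
print(features_at(1000))
print(len(annotated), round(annotated_fraction, 3))
```

On these coordinates alone, roughly a quarter of L is covered by the variable regions and catalytic domains, so most of the sequence is unassigned in this simplified map.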

IDR predictions label L as one of the most ordered RSV proteins. However, there are several peaks throughout the disorder predictions, potentially indicating the presence of several small linkers or solvent-accessible sites for modification or interaction. Additionally, numerous sites uncharacterized by the literature are predicted to be MoRFs, potentially indicating sites of previously unknown function (Table 5). We also know that overall polymerase sequence and structure are well conserved amongst negative-strand RNA viruses; therefore, we can infer a high level of RSV L order from other MNV RdRp structures. This is demonstrated by the NNS vesicular stomatitis virus L RdRp structure, which contains two structural domains in addition to the three catalytic domains and six CRs illustrated by our RSV L domain map.165

Conclusions

The predictions presented in this study have established a comparable intrinsic disorder status between RSV subtypes A and B, which have persisted and co-circulated globally for over fifty years. SH was the only RSV protein exhibiting drastic divergence in intrinsic disorder between RSV subtypes, although the functional relevance of these findings is unknown since SH is not essential for RSV infectivity.110

With a better understanding of the RSV IDRs that are well conserved and critical for viral activity, efforts can be made to mutate IDRs and thereby abolish protein functions essential to the virus. In terms of vaccine development, the objective is to achieve a functionally incompetent protein while preserving structural stability and expression, to attenuate viral growth while maintaining virus viability. Therefore, targeting an unstructured yet functionally active sequence is favorable to targeting a structured sequence. Due to the consistency in degree and location of IDRs in the RSV proteome across all screened strains, targeting identified IDRs would be effective against all RSV strains.

Acknowledgements

J. N. W. acknowledges support from the University of South Florida Signature Research Doctoral Fellowship in Drug Design and Delivery. M. N. T. acknowledges support from NIAID R01 AI081977. We would like to acknowledge Insung Na for assistance with disorder prediction.

References
