Abstract
Mutational correlation patterns found in population-level sequence data for the Human Immunodeficiency Virus (HIV) and the Hepatitis C Virus (HCV) have been demonstrated to be informative of viral fitness. Such patterns can be seen as footprints of the intrinsic functional constraints placed on viral evolution under diverse selective pressures. Here, considering multiple HIV and HCV proteins, we demonstrate that these mutational correlations encode a modular co-evolutionary structure that is tightly linked to the structural and functional properties of the respective proteins. Specifically, by introducing a robust statistical method based on sparse principal component analysis, we identify near-disjoint sets of collectively-correlated residues (sectors) having mostly a one-to-one association to largely distinct structural or functional domains. This suggests that the distinct phenotypic properties of HIV/HCV proteins often give rise to quasi-independent modes of evolution, with each mode involving a sparse and localized network of mutational interactions. Moreover, individual inferred sectors of HIV are shown to carry immunological significance, providing insight for guiding targeted vaccine strategies.
Author summary
HIV and HCV cause devastating infectious diseases for which no functional vaccine exists. A key problem is that while individual mutations in viral epitopes under immune pressure may substantially compromise viral fitness, immune escape is typically facilitated by other “compensatory” mutations that restore fitness. These compensatory pathways are complicated and remain poorly understood. They do, however, leave co-evolutionary markers which may be inferred from measured sequence data. Here, by introducing a new robust statistical method, we demonstrate that the compensatory networks employed by both viruses exhibit a remarkably simple decomposition involving small and near-distinct groups of protein residues, with most groups having a clear association to biological function or structure. This provides insights that can be harnessed for the purpose of vaccine design.
Introduction
HIV and HCV are the cause of devastating infectious diseases that continue to wreak havoc worldwide. Both viruses are highly variable, possessing an extraordinary ability to tolerate mutations while remaining functionally fit. While individual residue mutations may be deleterious, these are often compensated by changes elsewhere in the protein which restore fitness [1, 2]. These interacting residues form compensatory networks which provide mutational escape pathways against immune-mediated defense mechanisms, presenting a major challenge for the design of effective vaccines [3].
The compensatory interaction networks that exist for HIV and HCV—and for other viruses more generally—are complicated and far from being well understood. Resolving these by experimentation is difficult, due largely to the overwhelming number of possible mutational patterns which must be examined. An alternative approach is to employ computational methods to study the statistical properties of sequence data, under the basic premise that the residue interactions which mediate viral fitness manifest as observable mutational correlation patterns. For HIV, recent analytical, numerical and experimental studies [4–7] provide support for this premise, indicating that these patterns may be seen as population-averaged evolutionary “footprints” of viral escape during the host-pathogen combat in individual patients. This idea has been applied to propose quantitative fitness landscapes for both HIV [8, 9] and HCV [10] which are predictive of relative viral fitness, as verified through experimentation and clinical data.
Fitness is a broad concept that is ultimately mediated through underlying biochemical activity. For both HIV and HCV, experimental efforts have provided increased biochemical understanding of the constituent proteins, leading in particular to the discovery of small and often distinct groups of residues with functional or structural specificity (Fig 1 and S1 File). These include, for example, sets of protein residues lying at important structural interfaces, those involved with key virus-host protein-protein interactions, or those found experimentally to directly affect functional efficacy. An important open question is how these biochemically important groups relate to the interaction networks formed during viral evolution.
Some insights have been provided for a specific protein of HIV [11] and HCV [12]. The main objective of these investigations was to identify potential groups of co-evolving residues (referred to as “sectors”) which may be most susceptible to immune targeting. In each case, a sector with potential immunological vulnerability was inferred, and this was found to embrace some residues with functional or structural significance. It is noteworthy that the employed inference methods were designed to produce strictly non-overlapping groups of co-evolving residues, which may hinder identification of inherent co-evolutionary structure and associated biochemical interpretations.
In a parallel line of work, computational methods have been used to understand co-evolution networks for various protein families, with compelling results (reviewed in [13]). Notably, for the family of S1A serine proteases [14], a correlation-based method termed statistical coupling analysis (SCA) uncovered a striking modular co-evolutionary structure comprising a small number of near-independent groups of co-evolving residues (again referred to as sectors), each bearing a clear and distinct biochemical association, in addition to other qualitative properties. Sectors have also been obtained for other protein families using SCA, and the functional relevance of these has been confirmed through experimentation [14–17]. A natural question is to what extent such modular, biochemically-linked co-evolutionary organization exists for the viral proteins of HIV and HCV. The answer is not obvious, particularly when one considers the evolutionary dynamics of these viruses, which are complex and very distinct from those of protein families. Specifically, they involve greatly accelerated mutation rates, and are shaped by effects including intrinsic fitness, host-specific but population-diverse immunity, recombination, reversion, and genetic drift. The sampling process is also complicated and subject to potential biases, making inference of co-evolutionary structure difficult.
In this paper, considering multiple proteins of HIV and HCV, we identify in each case a sparse and modular co-evolutionary structure, involving near-independent sectors. This is established by introducing a statistical method, which we refer to as “robust co-evolutionary analysis (RoCA)”, that learns the inherent co-evolutionary structure by providing resilience to the statistical noise caused by limited data. Strikingly, the sectors are shown to distinctly associate with often unique functional or structural domains of the respective viral protein, indicating clear and well-resolved linkages between the evolutionary dynamics of HIV and HCV viral proteins and their underlying biochemical properties. Our results suggest that distinct functional or structural domains associated with each of the viral proteins give rise to quasi-independent modes of evolution. This, in turn, points to the existence of simplified networks of sparse interactions used by both HIV and HCV to facilitate immune escape, with these networks being quite localized with respect to specific biochemical domains. The insights provided by the inferred sectors also carry potential importance from the viewpoint of immunology and vaccine design, which we demonstrate for a specific protein of HIV.
Results
RoCA method for inferring co-evolutionary sectors of viral proteins
By employing available sequence data, we investigated the co-evolutionary interaction networks for multiple HIV and HCV proteins. Our proposed approach RoCA resolves co-evolutionary structures by applying a spectral analysis to the mutational (Pearson) correlation matrices and identifying inherent structure embedded within the principal components (PCs), which are representative of strong modes of correlation in the data. The RoCA algorithm is designed to be robust against statistical noise, which is a significant issue since the number of available sequences for each protein is rather limited, being comparable to the protein size (Fig 2A).
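To make the starting point of the spectral analysis concrete, the following is a minimal sketch of how a mutational (Pearson) correlation matrix could be computed from aligned sequences. It assumes that each sequence has already been reduced to a binary vector (1 for a mutation relative to a reference such as the consensus, 0 otherwise); this encoding, the function name `pearson_correlation_matrix`, and the toy alignment are illustrative assumptions rather than details taken from the text above.

```python
from math import sqrt

def pearson_correlation_matrix(binary_msa):
    """Pearson correlation between the columns of a 0/1 mutation matrix.

    binary_msa: list of sequences, each a list of 0 (reference) / 1 (mutant).
    Fully conserved columns have zero variance; here they are simply given
    zero correlation with every other column (an illustrative convention).
    """
    n_seq = len(binary_msa)
    n_res = len(binary_msa[0])
    means = [sum(row[j] for row in binary_msa) / n_seq for j in range(n_res)]
    stds = [
        sqrt(sum((row[j] - means[j]) ** 2 for row in binary_msa) / n_seq)
        for j in range(n_res)
    ]
    corr = [[0.0] * n_res for _ in range(n_res)]
    for i in range(n_res):
        corr[i][i] = 1.0
        for j in range(i + 1, n_res):
            if stds[i] == 0 or stds[j] == 0:
                continue  # undefined correlation, left at zero
            cov = sum(
                (row[i] - means[i]) * (row[j] - means[j]) for row in binary_msa
            ) / n_seq
            corr[i][j] = corr[j][i] = cov / (stds[i] * stds[j])
    return corr

# Hypothetical toy alignment: 4 sequences, 4 residues.
toy_msa = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
C = pearson_correlation_matrix(toy_msa)
```

In this toy example, residues 0 and 1 always mutate together, residue 2 mutates only when they do not, and residue 3 mutates independently of all three. The PCs discussed in the text are the leading eigenvectors of such a matrix.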
The RoCA method comprises two main steps. Similar to the previous PCA-based co-evolutionary methods [11, 12, 14], the first step involves isolating PCs which carry correlation information from those which are supposedly dominated by statistical noise (Fig 2B, top panel). The second and most significant step of RoCA provides robust estimates of the PCs (Fig 2B, bottom panel) selected in the first step. We developed an automated and suitably adapted version of a sparse PCA technique [19], which is based on the standard orthogonal iteration procedure used to obtain the PCs of a matrix [20]. To summarize, this step (Fig 2B, bottom panel) involves: (i) a data-driven thresholding procedure applied to the PCs—designed based on ideas from random matrix theory—that distinguishes, for each PC, the significantly correlated residues from those residues whose correlations are consistent with statistical noise, and (ii) an iterative procedure that tries to robustly estimate the correlation structure between the selected residues across different PCs (see Materials and methods for details). Based on the resulting PCs, the RoCA algorithm directly infers co-evolutionary sectors, representing groups of residues whose mutations are collectively coupled (Fig 2C).
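The second step can be illustrated with a small sketch of orthogonal iteration with hard thresholding, the general idea behind this family of sparse PCA techniques. This is not the RoCA implementation: the threshold here is a fixed constant chosen by hand, whereas the text describes a data-driven threshold motivated by random matrix theory, and the helper names (`matmul`, `orthonormalize`, `thresholded_orthogonal_iteration`) and the toy correlation matrix are hypothetical.

```python
from math import sqrt

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def orthonormalize(V):
    """Gram-Schmidt on the columns of V; a numerically zero column stays zero."""
    basis = []
    for col in zip(*V):
        c = list(col)
        for q in basis:
            dot = sum(x * y for x, y in zip(c, q))
            c = [x - dot * y for x, y in zip(c, q)]
        norm = sqrt(sum(x * x for x in c))
        basis.append([x / norm for x in c] if norm > 1e-12 else [0.0] * len(c))
    return [list(row) for row in zip(*basis)]

def thresholded_orthogonal_iteration(C, V0, threshold, n_iter=50):
    """Orthogonal iteration in which small entries are zeroed at every pass."""
    V = orthonormalize(V0)
    for _ in range(n_iter):
        T = matmul(C, V)                                  # multiplication step
        T = [[x if abs(x) > threshold else 0.0 for x in row] for row in T]
        V = orthonormalize(T)                             # re-orthonormalization
    return V

# Hypothetical correlation matrix: two correlated groups plus weak cross terms.
C = [
    [1.00, 0.80, 0.80, 0.05, 0.05],
    [0.80, 1.00, 0.80, 0.05, 0.05],
    [0.80, 0.80, 1.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.00, 0.70],
    [0.05, 0.05, 0.05, 0.70, 1.00],
]
V0 = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0]]  # arbitrary starting vectors
V = thresholded_orthogonal_iteration(C, V0, threshold=0.3)

# A sector is read off as the set of residues with non-zero loading on a PC.
sectors = [[i for i, row in enumerate(V) if abs(row[k]) > 1e-9] for k in range(2)]
```

With these toy inputs the weak cross-group entries are removed at every pass, so each estimated PC is supported on a single group of residues. Without the thresholding step, every residue would carry a small non-zero loading on both PCs.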
We note that while many sparse PCA methods have been developed [19, 21–24] and applied previously to problems in different fields (e.g., senate voting and finance [25], network systems [26], image processing [27], and genome-wide association studies in cancer [28]), to our knowledge, RoCA is the first application of sparse PCA techniques to the co-evolutionary analysis of proteins.
The automated robust estimation of PCs produced by RoCA (Fig 2B, bottom panel) bears an important distinction from previously proposed sectoring methods [11, 12, 14] which (indirectly) attempted to reduce the effect of statistical noise in the PCs using either visual inspection or an ad hoc thresholding procedure. Moreover, other than applying a suitable data-driven thresholding step to remove statistical noise, RoCA makes no structural imposition on the inferred sectors (e.g., enforcing the sectors to be non-overlapping as in [11, 12, 15]), and it is therefore designed to reveal inherent co-evolutionary networks as reflected by the data.
Modular and sparse co-evolutionary structure of HIV/HCV proteins
We used RoCA to infer the co-evolutionary structure of two proteins each of HIV and HCV, which represent a good mix of structural (HIV Gag), accessory (HIV Nef), and non-structural (HCV NS3-4A and NS4B) proteins. For each viral protein, the RoCA method identified a small number of sectors (Fig 3A and S1 Fig) which together embraced a rather sparse set of residues (i.e., between 35% and 60% of the protein; see S2 File for the complete list). In some cases the sector residues were localized in the primary sequence, while in others they were quite well spread (Fig 3B and S1 Fig). Importantly, while each sector was identified from a distinct PC, the sectors were found to be largely disjoint (Fig 3A and S1 Fig). This suggests that the co-evolutionary structures are highly modular, with the different modules (sectors) being nearly uncorrelated with each other. In fact, further statistical tests demonstrated that the inferred sectors are nearly independent (Fig 3C and S1 Fig).
This identified modular co-evolutionary structure is in fact reminiscent of ‘community structure’ that has been observed in numerous complex networks, e.g., metabolic, webpage, and social networks [29]. In such applications, the identified modules or communities have been shown to represent dense sub-networks which perform different functions with some degree of autonomy. For our co-evolutionary sectors, in line with previous studies on the fitness landscape of HIV [6–8] and HCV [10], they appear to represent groups of epistatically-linked residues which work together to restore or maintain viral fitness when subjected to strong selective pressures during evolution (e.g., as a consequence of immune pressure). In light of this, one anticipates that the co-evolutionary sectors should afford an even deeper interpretation in terms of the underlying biochemical properties of the viral proteins, which fundamentally mediate viral fitness.
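One simple way to quantify how disjoint a set of sectors is would be a pairwise overlap score such as the Jaccard index, sketched below. This is only an illustration of the notion of "largely disjoint" sectors and is not the statistical independence test referred to above; the sector names and residue indices are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(sectors):
    """Jaccard index for every unordered pair of named sectors."""
    names = sorted(sectors)
    return {
        (p, q): jaccard(sectors[p], sectors[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }

# Hypothetical sectors given as lists of residue indices.
toy_sectors = {
    "sector 1": [1, 2, 3, 4],
    "sector 2": [4, 5, 6],
    "sector 3": [7, 8],
}
overlaps = pairwise_overlap(toy_sectors)
```

A value of 0 means that two sectors share no residues, while a value of 1 means that they are identical. Near-disjoint sectors therefore give values close to 0 for every pair.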
Modularity in HIV/HCV is tightly coupled to biochemical domains
To explore potential correspondences between the identified RoCA sectors and basic biochemical properties, we compiled information determined by experimental studies for each of the viral proteins. This consists of residue groups having prescribed functional or structural specificity (see Fig 1; also S1 File for a more extensive list including small groups). These groups, which are seen to occupy sparse and largely distinct regions of the primary structure (Fig 1), are collectively referred to as “biochemical domains”. (This should not be confused with the term “domain”, as classically used for a folding unit in structural biology and biochemistry.) For each viral protein, structural domains were defined based on spatial proximity of residues in the available protein crystal structure; they include, for example, residues which lie on critical interfaces needed to form stable viral complexes, or those involved in essential virus-host protein-protein interactions. Functional domains, on the other hand, were typically identified using site-directed mutagenesis or truncation experiments, and they include groups of residues found to have a direct influence on the efficacy of specific protein functions. It is important to note, however, that while structural domains are typically clearly specified, functional domains are expected to be less so, due to experimental limitations. Results reported based on truncation experiments, for example, may comprise false positives. This is because the truncation experiment (see [30, 31] for specific examples) typically involves a coarse procedure to predict the functionally important residues by removing different groups of contiguous residues from the protein, and studying the effect of each truncation on the protein function. This procedure may suggest a particular group of residues to be important for a protein function when only one or two residues may be critical. Thus, the remaining residues in the reported important group would be false positives.
Despite potential limitations of the compiled biochemical domains, contrasting these domains with the RoCA sectors (for all four viral proteins) revealed a striking pattern, with most sectors showing a clear and highly significant association with a unique biochemical domain (Fig 4). This is most marked for the HIV Gag protein, where there is a one-to-one correspondence. These observations carry important evolutionary insights. Not only are the co-evolutionary networks of both HIV and HCV proteins modular, but the modules (sectors) seem to be intimately connected to distinct biological phenotypes. Our results suggest that the fundamental structural or functional domains of these viral proteins spawn quasi-independent co-evolutionary modes, each involving a simplified sparse network of largely localized mutational interactions. The observed phenomenon is seemingly a natural manifestation of immune targeting against residues within the biochemical domains, since escape mutations at these residues likely lead to structural instability or functional degradation, necessitating the formation of compensatory mutations to restore fitness.
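The significance of a sector/domain association of this kind can be assessed with a one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability. The sketch below shows that calculation under the assumption that a sector and a domain are both given as sets of residue indices within a protein of known length; the function name `enrichment_p_value` and the toy numbers are hypothetical and are not taken from the data analyzed here.

```python
from math import comb

def enrichment_p_value(protein_length, sector, domain):
    """One-sided Fisher's exact test for sector/domain overlap.

    Returns the probability of observing at least the actual overlap when a
    sector of the same size is drawn uniformly at random from the protein.
    """
    sector, domain = set(sector), set(domain)
    k = len(sector & domain)            # observed overlap
    n_s, n_d = len(sector), len(domain)
    total = comb(protein_length, n_s)   # number of possible sectors of this size
    tail = sum(
        comb(n_d, x) * comb(protein_length - n_d, n_s - x)
        for x in range(k, min(n_s, n_d) + 1)
    )
    return tail / total

# Hypothetical example: 20-residue protein, 5-residue sector and domain
# sharing 4 residues.
p = enrichment_p_value(20, sector={0, 1, 2, 3, 4}, domain={0, 1, 2, 3, 10})
```

A small p-value indicates that the overlap is larger than expected by chance. When several sectors and domains are compared, a multiple-testing correction would normally also be considered.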
Co-evolutionary structure identified with previous sectoring methods
We investigated whether our main findings could also be revealed by other sectoring methods. We first considered a method that we proposed previously based on classical PCA [12], that sought to identify groups of collectively-correlated viral residues which may be susceptible to immune targeting. Note that this PCA-based method [12] is very similar to the method introduced in [11], mainly differing in the procedure to form sectors from PCs; specifically, the method in [11] formed sectors from visual inspection of PC biplots, whereas an automated procedure was applied in [12] (see S1 Text for implementation details). Moreover, while the PCA-based method [12] was used to study only a specific HCV protein, it is general in its application (similar to the method presented in [11] and the proposed RoCA method) and can be used to infer co-evolutionary sectors for the viral proteins under study (S1 Text).
An important feature of the PCA method [12] (and also [11]) was the imposition of a structural constraint on the inferred sectors, which were enforced to be disjoint; this may compromise the ability of the method to infer natural co-evolutionary structure (S1 Text). Despite the imposed zero inter-sector-overlap constraint, sectors produced by the PCA-based method [12] for the studied viral proteins tended to be larger than RoCA sectors (S3 Fig), and they collectively embraced a larger set of residues (covering 40%–80% of the protein). Comparing the sectors inferred by this method with those obtained using the RoCA method revealed that they included a mix of residues from multiple RoCA sectors (Fig 5A). This was also reflected in the biochemical associations of the sectors, where many of the resolved (unique) sector/domain associations shown by RoCA (Fig 4) were no longer revealed (Fig 5C). We found that these key differences were attributable to the sensitivity of the PCA-based approach to sampling noise, as reflected by the noisy and significantly overlapping principal components (Fig 5B). This was corroborated with a ground-truth simulation study, through which the ability to infer co-evolutionary structure was tested in synthetic model constructions (S2 Text). The RoCA method resolved all the individual (true) sectors with high accuracy, whereas our previous method [12] inferred comparatively large sectors, which often included false positives and merged residues from different true sectors (S3 Fig).
We also tested other co-evolutionary methods, which tended to return results very different from those of RoCA, and generally revealed little biochemical association for the studied viral proteins (S5 Fig). Most notable is the limited biochemical association of sectors identified using the benchmark SCA method [14] (S5 Fig), which has shown much success in resolving co-evolutionary structure for certain protein families [15, 32]. Aside from the noise sensitivities shared by both SCA and classical PCA-based methods (S5 Fig), the surprising disparity in this case appears to be due to the weighted covariance construction employed by SCA (as opposed to the Pearson correlation) which, while apparently suited to the analysis of data from certain protein families [14–17], does not seem suitable for identifying the co-evolutionary structure in the considered HIV and HCV proteins (see S3 Text for details).
Detailed analysis of the biochemical associations of the inferred sectors
In the following, we provide details on the biochemical associations of the identified RoCA sectors for each of the four viral proteins.
HIV Gag
The Gag poly-protein encodes the matrix (p17), capsid (p24), spacer peptide 1 (SP1), nucleocapsid (p7), spacer peptide 2 (SP2), and p6 proteins. Since Gag is a core structural poly-protein of HIV, its experimentally identified domains consist of critical structural interfaces involving either virus-host or virus-virus protein interactions (Fig 1).
Strikingly, five of the six identified RoCA sectors were individually associated with distinct structural domains (Fig 6A). Sector 1 was enriched (52%) with N-terminal residues of p17 involved in a virus-host protein interaction (binding of Gag to the plasma membrane) that is critical for viral assembly and release [33]. The remaining sectors were associated with virus-virus protein interactions. In particular, sector 2 comprised 56% of the residues that form the p24-SP1 interface, which is considered to be important for viral assembly and maturation [34]. Sector 3 was dominated by the residues belonging to the capsid protein p24. The HIV capsid exists as a fullerene cone with 250 hexamers and 12 pentamers that cap the ends of the cone. The monomer-monomer interface formed in the oligomerization of p24 (in both the hexamer and pentamer structures) has been shown to be important for structural assembly of the HIV capsid [35, 36]. Sector 3 was enriched with ∼50% of the residues in the largely overlapping interfaces of these p24 oligomers (S1 File). In addition to these residues, the hexamer-hexamer interface in p24 has been shown to be important for proper capsid formation [37]. Sector 6 was found to comprise 36% of the residues within this interface. Sector 5 consisted of 40% of the residues involved in a critical functional domain in p7 (two zinc finger structures separated by a basic domain), which is important for packaging genomic RNA [38].
We found that sector 4 (indicated as a star in Fig 4) was not significantly associated with any of the large biochemical domains identified for HIV Gag (listed in Fig 1). This sector comprised the complete SP2 protein and a few N-terminal residues of the p6 protein, for which little experimental information is available. To our knowledge, only five of these residues were experimentally studied previously [39, 40], wherein mutations were shown to alter protein processing and abolish viral infectivity and replication. While the biochemical implications for the remaining residues in sector 4 are not known, our result suggests that they could also be important for the mentioned functions. These residues thus serve as potential candidates for further experimental studies. A similar comment applies for each of the proteins discussed below in relation to those sectors with as yet unspecified biochemical association.
HIV Nef
Nef is an accessory protein which is a critical determinant of HIV pathogenesis and is involved in multiple important functions.
Of the four RoCA sectors revealed for Nef, two of these (sectors 1 and 3) were associated with distinct biochemical domains, while one (sector 4) was associated with two domains (Fig 6B). Specifically, sector 1 was enriched (90%) with residues in a functional domain consisting of the proline-x-x repeat, shown to be critical for the enhancement of viral infectivity [41], while sector 3 consisted of 44% of residues considered to be crucial in virus-host protein interactions that down-regulate the surface expression of HLA1 molecules [42]. In contrast, sector 4 comprised (i) 33% of the residues involved in the virus-host protein interaction that results in the down-regulation of CD4 surface expression [43], and (ii) 44% of the residues involved in the virus-virus protein interaction critical for Nef dimerization [44]. Although the residues in these two biochemical domains are close in the primary structure (Fig 1), they are largely distinct with only one residue in common. CD4 down-regulation is one of the best characterized functions of Nef and is well-known to weaken the immune response against HIV, resulting in a viral production increase [45]. On the other hand, Nef has been observed in vivo to form dimers, suggesting the importance of this structure for Nef function [44]. The association of a single co-evolutionary sector with these two domains suggests that these may be biochemically related, with mutations in one domain influencing the other. Interestingly, this is corroborated by a recent study [46] which shows that while the wild-type dimeric Nef induced marked CD4 down-regulation, all mutations affecting the Nef dimer structure highly disrupted this function. Further experimental work is still required to more finely resolve the dependencies between these multiple Nef domains. 
Nonetheless, the association of viral infectivity enhancement and CD4 down-regulation/Nef dimerization with distinct sectors (1 and 4) is remarkably in line with the experimentally reported dissociation of these functions [41, 46].
We found a single sector (sector 2) that was not associated with any known biochemical domain (Figs 4 and 6B). The available crystal structure suggests that these residues are predominantly located away from the dimer interface, yet the biochemical significance of these residues remains unknown.
HCV NS3-4A
NS3 is a large protein that performs serine protease and helicase functions. Based on these functions, NS3 is divided into two domains: the protease domain, consisting of the N-terminal one-third of the protein residues, and the helicase domain, comprising the remaining C-terminal two-thirds. NS4A is a very short protein that functions as a co-factor for the serine protease activity of NS3.
Of the four RoCA sectors, two (sectors 3 and 4) were individually associated with distinct biochemical domains, while one (sector 1) was associated with multiple domains (Fig 6C). Specifically, the small sector 3 contained all the residues of a relatively conserved motif in the NS3 helicase domain, considered to be important for ATPase and duplex unwinding activities [47], while sector 4 comprised 31% of the residues involved in dimerization of NS3, important for helicase activity and viral replication [48]. In contrast, sector 1 was a mixture of the NS3 protease domain and NS4A residues, encompassing multiple functional domains. In particular, the N-terminal residues of the NS3 protease domain have been reported to be involved in three virus-virus protein interactions with NS4A to mediate multiple functions including: (i) activation of the NS3 protease function [49]; (ii) membrane association and assembly of a functional HCV replication complex [50]; and (iii) NS5A hyper-phosphorylation [51, 52]. Sector 1 comprised 40%–52% of the residues associated with these functions. The association of a single sector with these multiple domains (Fig 1) in NS3-4A suggests that they may be functionally coupled. Interestingly, this is in line with experimental studies which report the dependence of NS3-4A membrane association and NS5A hyper-phosphorylation on an active NS3 protease [50, 52]. Further work is still required to more finely resolve these dependencies.
For NS3-4A, sector 2 was not associated with any known biochemical domain (Figs 4 and 6C). This sector included residues that are well-distributed in the NS3 protease and helicase domains as well as in NS4A.
HCV NS4B
NS4B is a small hydrophobic membrane protein which is involved in multiple functions including viral replication and assembly. Compared to other HCV proteins, NS4B is relatively poorly characterized and its full-length crystal structure is still not available.
Of the four RoCA sectors, two were associated with known biochemical domains (Fig 4). Sector 3 was strongly associated with a functional domain comprising the C-terminal α-helix 1 (H1), shown to be important for HCV RNA replication and viral assembly [53]. This sector consisted of all H1 residues, but none of the α-helix 2 (H2) residues. This is in agreement with [53], which reported a comparatively higher impact of H1 than of H2 on viral replication and assembly.
Sector 4 comprised half of the residues considered to be involved in the virus-host protein interaction between a basic leucine zipper (bZIP) motif at the N-terminus of NS4B and the central part of the human protein ATF6beta (activating transcription factor 6 beta) [54]. Moreover, this sector also contained residues within a region that, at a coarse level, was identified to be sufficient for NS4B oligomerization by a truncation procedure [30]. Thus, while the specific residues involved in NS4B oligomerization remain unknown, sector 4 may assist in a more accurate identification of these residues, thereby refining the coarse analysis of [30]. The bZIP motif residues completely overlap with this biochemical domain (Fig 1) and thus both were associated with the same sector.
With the limited current understanding of the functional and structural characteristics of NS4B, sectors 1 and 2 could not be associated with any known biochemical domain (Fig 4). Nonetheless, we examined the predicted secondary structure of the NS4B protein [55] to gain some insight. Both sectors consisted of residues present in the central part of NS4B that contains the trans-membrane (TM) segments. Specifically, sector 1 comprised the majority of residues in TM3, while sector 2 consisted of residues in TM1 and TM2. These TM segments are considered to be important for mediating the membrane-association of NS4B [55]. However, the specific residues involved with this function are still unresolved, and therefore the corresponding association with sectors 1 or 2 could not be clearly established.
Association of sectors with HIV long-term non-progressors and rapid progressors
Our main results carry potential immunological significance, which may provide useful input for vaccine design. To demonstrate this, we considered the HIV Gag protein, and contrasted the inferred RoCA sectors with the epitope residues targeted by T cells of HIV “long-term non-progressors” (LTNP) and “rapid progressors” (RP). LTNP correspond to rare individuals who keep the virus in check without drugs, whereas RP are individuals who tend to progress to AIDS in less than 5 years (compared to the population average of 10 years [56]). Clinical studies of HIV-infected cohorts have demonstrated a high correlation between possession of specific human leukocyte antigen (HLA) alleles and the disease outcome (LTNP or RP) [57, 58]. Information on the epitopes targeted by the T cells in HIV-infected individuals with these specific HLA alleles was obtained from the Los Alamos HIV Molecular Immunology Database (http://www.hiv.lanl.gov/content/immunology) and is presented in S1 Table. Our analysis revealed that LTNP elicit immune responses strongly directed towards residues in sector 3, whereas RP elicit responses against residues in sector 2 (Fig 7). Recalling the sector biochemical associations (Figs 4 and 6), these observations seem to promote the design of T-cell vaccine strategies which target sector residues lying on the p24 intra-hexamer interfaces, while avoiding targeting residues on the p24-SP1 interface. In the former case, such targeting seemingly compromises viral fitness by disrupting the formation of a stable HIV capsid [35], which appears quite difficult to restore through compensatory mutations. In contrast, restoring fitness costs associated with destabilization of the p24-SP1 interface appears less difficult.
These results were contrasted against a previous analysis of HIV Gag [11], in which an inferred sector based on a classical PCA approach (a slight variant of the approach [12], discussed earlier) was also found to associate with LTNP. The residues defining this immunologically important sector were directly extracted from [11]. Analyzing the biochemical association of the residues in this sector (similar to the analysis done in Fig 4) revealed a significant association (P < 0.05, Fisher’s exact test) with the p24 intra-hexamer interface (as pointed out in [11]), but also with the p24-SP1 interface (see S6 Fig for details). Hence, while the sector of [11] reaffirms the importance of targeting interfaces within p24 hexamers, it leads to a different conclusion regarding p24-SP1, suggesting that this interface should be targeted rather than avoided. This important distinction arises as a consequence of the methodological differences between RoCA and the previous PCA-based methods [11, 12], as discussed previously.
By integrating our observations with population-specific HLA allele and haplotype information, candidate HIV immunogens eliciting potentially robust T cell responses can be proposed [11, 12]. A more detailed investigation along these lines, as well as broadening the analysis to other viral proteins, is planned to be carried out in future work.
Discussion
Characterizing the co-evolutionary interactions employed by HIV and HCV is an important problem. These interactions reflect the mutational pathways used by each virus to maintain fitness while evading host immunity. However, they are not well understood and pose a significant challenge for vaccine development. By applying statistical analysis to the available cross-sectional sequence data, we showed that for multiple HIV/HCV proteins the interaction networks possess notable simplicity, involving mainly distinct and sparse groups of interacting residues, which bear a strikingly modular association with biochemical function and structure. Essential to unraveling this phenomenon was the introduction of a robust inference method.
Our approach is particularly well suited for the “internal” proteins of chronic viruses such as HIV and HCV that are subjected to broadly directed T cell responses. For such proteins, and for HIV in particular, recent experimental and computational work has provided evidence that the population-averaged mutational correlations are reflective of intrinsic interactions governing viral fitness. This was shown to be a consequence of multiple factors which influence the complex evolutionary dynamics of HIV, including the extraordinary diversity of HLA genes in the human population, which places selective pressure on diverse regions of the protein and thereby promotes wide exploration of sequence space, in addition to the tendency of mutations to revert upon transmission between hosts [4–6]. An additional important evolutionary factor is recombination, which introduces diversity through template switching during viral replication. A consequence of recombination is that it breaks mutational correlations between residues that are distant in the primary structure. That is, higher rates of recombination should lead to shorter-range correlations and vice versa. Thus, the recombination involved in HIV and HCV evolution [59] may distort the mapping between biochemical domains and the inferred sectors, and possibly result in inference of multiple distinct sectors associated with the same biochemical domain. However, such an effect of recombination is not evident from our results. Nonetheless, the higher recombination rate of HIV compared to HCV [59] seems to be reflected in the separation among residues involved in the inferred sectors. Specifically, the HIV protein sectors are quite localized, with a median separation in the primary structure of up to 140 residues (sector 6 of HIV Gag), while those of the HCV proteins are well separated, with a median separation of up to 480 residues (sector 1 of HCV NS3-4A).
In general, the predicted sectors primarily comprise residues within the corresponding biochemical domains and a few other residues which are close in either primary or tertiary structure. However, these sectors also include a small proportion of residues which are distant from those within the respective biochemical domains (S7 Fig) and thus appear to influence the associated structure or function through an allosteric mechanism. Such long-range interactions have been reported to play a role in maintaining viral fitness and facilitating immune evasion [60–62]. Allosteric interactions have also been observed in the co-evolutionary sectors of different protein families obtained previously with the SCA method [14–17].
The identified sectors for each viral protein together comprise between 35%–60% of the total residues in the protein (Fig 3A and S1 Fig). This is consistent with the sparse sectors of co-evolving residues observed in different protein families using the SCA method [14–17]. However, one may ask about the role of non-sector residues, i.e., those not allocated to any sector. Similar to the observations in other proteins [14–17], our analysis suggests that non-sector residues evolve nearly independently, with associated biochemical domains being impacted only by individual mutations at these residues.
Our co-evolutionary analysis is based on a binary approximation of the amino acid sequences, with mutants distinguished from the most frequent amino acid at each position (see Materials and methods for details). This is a reasonable approximation due to the high conservation of the studied internal viral proteins (S8 Fig). For other viral proteins which are comparatively less conserved, such as the surface proteins HIV Env and HCV E1/E2, a similar co-evolutionary analysis may require refining this approximation by incorporating the amino acid identities of the different mutants. This is a worthwhile problem to pursue in a future study, in particular given the relevance of surface proteins for vaccine design, as these are a major target of neutralizing antibodies. We point out, however, that multiple HIV and HCV clinical studies have also demonstrated a strong correlation of a broadly directed cellular (T-cell-based) immune response against internal proteins with HIV control [57, 58] and spontaneous HCV clearance [63, 64]. These reports suggest that, for HIV and HCV, the immune response against internal proteins (similar to those studied in this work) may be just as important as, if not more important than, the antibody response against surface proteins.
While our analysis has focused primarily on viral proteins, the proposed RoCA approach is general and may be applied more broadly, provided that the studied proteins are reasonably conserved. As an example, we computed sectors for the S1A family of serine proteases and compared these with results obtained previously with the SCA algorithm [14, 15]. Similar to SCA, RoCA yielded three co-evolutionary sectors which had statistically significant associations with distinct phenotypic properties, namely thermal stability, enzymic activity, and catalytic specificity (S9 Fig). We point out, however, that the notion of a “sector” as defined previously for SCA [14–17] differs somewhat from that of the RoCA sectors. Specifically, for the HIV/HCV proteins in this work we employed an unweighted Pearson correlation measure, and the inferred sectors were interpreted simply as groups of correlated residues. For the protein families [14–17], in contrast, SCA involved a conservation-weighted correlation measure, and thus the inferred sectors represented groups of residues that are not only correlated but also conserved (see S3 Text for details). For serine proteases, the relatively higher statistical significance of biochemical association for SCA sectors (S9 Fig) suggests that using a measure that also incorporates conservation may be useful for identifying biologically important residues in this case. Nonetheless, the RoCA sectors produced for the serine proteases, based on an unweighted Pearson correlation measure, further attest to the importance of residue interactions in mediating fundamental protein functions.
For the HIV/HCV viral proteins under study, the relation between the biologically important residues (reflected by the biochemical domains) and conservation was not clearly apparent (S10 Fig). In fact, a significant and particularly surprising aspect of our analysis is the substantial extent to which the correlation patterns, with no regard for conservation, encode information regarding qualitatively distinct phenotypes, including structural units (virus-host and virus-virus protein interactions) and functional domains. The identified sectors may therefore also be seen as predictors of important biochemical domains. For each of the four viral proteins under study, there is at least one sector with unknown biochemical significance. Subsequent experimentation, such as mutagenesis experiments targeted at the identified sector residues, could therefore provide new insight which furthers the current understanding of HIV and HCV. Particularly interesting is the poorly understood NS4B protein of HCV, for which any biochemical activity underpinning the leading two sectors, which represent the strongest co-evolutionary modes, has yet to be resolved.
Materials and methods
Sequence data: Acquisition and pre-processing
The sequence data for HIV-1 clade B Gag and Nef were obtained from the Los Alamos National Laboratory HIV database, http://www.hiv.lanl.gov/. We restricted our analysis to drug-naive sequences, and any sequence marked as problematic in the database was excluded. To avoid any patient bias, only one sequence per patient was selected. After aligning the sequences based on the HXB2 reference, they were converted to an N × M amino acid multiple sequence alignment (MSA) matrix, where N denotes the number of sequences and M denotes the number of amino acid sites (residues) in the protein. The downloaded sequences may include a few outliers due to mis-classification (e.g., sequences assigned to an incorrect subtype or clade) in the database. Such outlying sequences were identified and removed using a standard PCA clustering approach (see [11] for details). This yielded N = 1897 and N = 2805 sequences for HIV Gag and Nef, respectively. Moreover, the fully conserved and problematic residues (those with more than 12.5% blanks or gaps) were eliminated, resulting in M = 451 variable residues for Gag and M = 202 for Nef. Similarly, the sequence data for HCV subtype 1a NS3-4A and NS4B were downloaded from the Los Alamos National Laboratory HCV database, http://www.hcv.lanl.gov/. The downloaded HCV sequences were then aligned based on the H77 reference and converted to the amino acid MSA. Applying the above-mentioned pre-processing resulted in N = 2832 sequences for NS3-4A and N = 675 sequences for NS4B, with an effective length of M = 482 for NS3-4A and M = 190 for NS4B.
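The processed amino acid MSA matrix $A = (A_{ij})$ was converted into a binary matrix $B$, with $(i, j)$th entry
$$B_{ij} = \begin{cases} 0, & \text{if } A_{ij} \text{ is the most frequent amino acid at residue } j, \\ 1, & \text{otherwise.} \end{cases} \tag{1}$$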
Thus, ‘0’ represents the most prevalent amino acid at a given residue and ‘1’ represents a mutant (substitution). This is a reasonable approximation of the amino acid MSA, given the high conservation of the internal viral proteins under study (S8 Fig).
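The binary conversion described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `binarize_msa` and the toy alignment are hypothetical, and the handling of ties and of gap characters (here simply treated like any other symbol) is an assumption of the sketch rather than something stated in the text.

```python
from collections import Counter


def binarize_msa(msa):
    """Map an amino acid MSA (list of equal-length strings) to rows of 0/1.

    0 marks the most frequent symbol in a column, 1 marks any other symbol.
    Ties are broken by first occurrence (an assumption of this sketch).
    """
    n_cols = len(msa[0])
    consensus = []
    for j in range(n_cols):
        column = [seq[j] for seq in msa]
        consensus.append(Counter(column).most_common(1)[0][0])
    return [[0 if seq[j] == consensus[j] else 1 for j in range(n_cols)]
            for seq in msa]


# Toy alignment with four sequences and four residues (hypothetical data).
toy_msa = ["MGAR", "MGAK", "MSAR", "MGTR"]
binary = binarize_msa(toy_msa)
```

In the toy alignment each of the last three sequences carries a single substitution relative to the column-wise consensus, so each of the corresponding rows contains exactly one entry equal to 1.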
The binary sequences in B are generally corrupted by the so-called phylogenetic effect, which represents ancestral correlations. A comparatively large eigenvalue is observed in the associated correlation matrix due to these phylogenetic correlations [11, 12]. Following previous ideas (for details, see Sections 3 and 4 in the Supporting Information of [11] and Section 2 in the Appendix of [12]), such effects are reduced using standard linear regression. The resulting data matrix, denoted by $\hat{B}$, was the basis for computing the correlations used to infer sectors. Specifically, we computed the M × M sample correlation matrix, along with its spectral decomposition, given by
$$C = V^{-1/2} S V^{-1/2} = \sum_{k=1}^{M} \lambda_k q_k q_k^{\intercal}. \tag{2}$$
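Here, S is the sample covariance matrix of the regressed data matrix $\hat{B}$, with entries $S_{ij} = \frac{1}{N} \sum_{n=1}^{N} (\hat{B}_{ni} - \bar{b}_i)(\hat{B}_{nj} - \bar{b}_j)$, where $\bar{b}_i = \frac{1}{N} \sum_{n=1}^{N} \hat{B}_{ni}$ is the sample mean, while V is a diagonal matrix containing the sample variances, i.e., $V_{ii} = S_{ii}$, and $\lambda_k$ and $q_k$ are the kth-largest eigenvalue of C and its corresponding eigenvector, respectively. The superscript ⊺ denotes vector transposition.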
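As an illustration of Eq (2), the following NumPy sketch forms the sample correlation matrix from an N × M data matrix and sorts its eigenpairs in decreasing order. The names `correlation_spectrum` and `B_hat` are chosen for this sketch, the toy data are random, and the phylogenetic regression step is not included; every column is assumed to have non-zero variance.

```python
import numpy as np


def correlation_spectrum(B_hat):
    """Sample correlation matrix of an N x M data matrix and its eigenpairs."""
    X = np.asarray(B_hat, dtype=float)
    S = np.cov(X, rowvar=False)            # M x M sample covariance
    d = np.sqrt(np.diag(S))                # sample standard deviations
    C = S / np.outer(d, d)                 # C = V^{-1/2} S V^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # re-sort so lambda_1 is the largest
    return C, eigvals[order], eigvecs[:, order]


# Toy data (hypothetical): 200 binary sequences over 6 residues.
rng = np.random.default_rng(0)
toy = (rng.random((200, 6)) < 0.2).astype(float)
toy[:, 1] = toy[:, 0]                      # make two residues perfectly correlated
C, lam, Q = correlation_spectrum(toy)
```

Because the correlation matrix is invariant to the normalization of the covariance (1/N or 1/(N − 1)), the default of `np.cov` can be used without affecting C.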
RoCA method
We introduced an approach based on robust PCA methods to accurately estimate the PCs (i.e., the leading eigenvectors) of the correlation matrix, which were then directly used to identify sectors. In particular, we considered the iterative thresholding sparse PCA (ITSPCA) method which, in short, is a combination of the standard orthogonal iteration method [20], used to compute the eigenvectors of a given matrix, and an intermediate thresholding step which filters out noise in the estimated PCs. However, the original ITSPCA method was not directly applicable to our correlation-based sectoring problem, since it was designed primarily for covariance matrices, and it involved a variance-dependent coordinate pre-selection algorithm which is no longer suitable. As such, for RoCA, we developed a version (called Corr-ITSPCA, see Algorithm 1) which is appropriately adapted to operate on correlation matrices, and we designed automated methods for tuning the relevant parameters; specifically, the number of significant PCs α and the noise threshold γk.
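Algorithm 1 Corr-ITSPCA Method
Inputs:
1. Correlation matrix of size $M \times M$, $C$;
2. Number of PCs to be estimated, $\alpha$;
3. Noise threshold, $\gamma_k$, $k = 1, 2, \ldots, \alpha$.
Output: Robust estimates of the PCs, $p_k$, given as columns of the $M \times \alpha$ matrix $P = Q^{(\infty)}$, where $Q^{(\infty)}$ denotes $Q^{(i)}$ at convergence.
1: Initialization: $i = 1$;
2: Initial orthonormal matrix of size $M \times \alpha$, $Q^{(0)} = Q_\alpha$; here $Q_\alpha$ is a matrix whose columns are the $\alpha$ leading eigenvectors of $C$, i.e., $Q_\alpha = [q_1 \; q_2 \; \cdots \; q_\alpha]$.
3: repeat
4: Multiplication: $T^{(i)} = (t^{(i)}_{mk}) = C Q^{(i-1)}$;
5: Thresholding: $\hat{T}^{(i)} = (\hat{t}^{(i)}_{mk})$, with $\hat{t}^{(i)}_{mk} = t^{(i)}_{mk} \, 1\{|t^{(i)}_{mk}| > \gamma_k\}$, where $1\{E\}$ is the indicator function of an event $E$;
6: QR Factorization: $Q^{(i)} R^{(i)} = \hat{T}^{(i)}$;
7: $i = i + 1$;
8: until convergence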
Such automated design is crucial to obtain accurate results, as these parameters control respectively the number of sectors that we infer and the number of residues included in each sector. Note that this is a principled design approach, as opposed to an ad hoc approach considered previously by the authors to uncover vaccine targets against the NS5B protein of HCV [65]. These parameters are designed as follows:
Number of significant PCs, α
The design relies on the observed deviations from a null model. Specifically, we generate randomized alignments under the null hypothesis that no genuine correlations are present in the data. This is simply obtained by randomly shuffling the entries of each column of the regressed data matrix $\hat{B}$, effectively breaking any existing genuine correlation in the data set, and yielding a randomized (null-model) alignment. This shuffling procedure is repeated 105 times and, for each randomized alignment, the maximum eigenvalue of the sample correlation matrix is recorded. The maximum of the recorded eigenvalues, denoted by $\lambda_{\max}^{\mathrm{rnd}}$, is used as an upper bound of statistical noise so that any eigenvalue of C exceeding it is identified as a relevant spectral mode, i.e., as a genuine contribution to the correlation. Thus, the number of significant eigenvectors is set to
$$\alpha = \#\{k : \lambda_k > \lambda_{\max}^{\mathrm{rnd}}\}. \tag{3}$$
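The column-shuffling null model can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: the function names, the toy data and the small number of shuffles are hypothetical choices made to keep the example fast, and every column is assumed to have non-zero variance.

```python
import numpy as np


def max_null_eigenvalue(X, n_shuffles=100, seed=0):
    """Largest correlation-matrix eigenvalue seen over column-shuffled copies of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best = 0.0
    for _ in range(n_shuffles):
        # Permute every column independently: marginals kept, correlations broken.
        shuffled = np.column_stack([rng.permutation(col) for col in X.T])
        C_null = np.corrcoef(shuffled, rowvar=False)
        best = max(best, np.linalg.eigvalsh(C_null)[-1])
    return best


def count_significant_pcs(X, n_shuffles=100, seed=0):
    """Number of eigenvalues of the sample correlation matrix above the null maximum."""
    C = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    lam = np.linalg.eigvalsh(C)
    return int(np.sum(lam > max_null_eigenvalue(X, n_shuffles, seed)))


# Toy data (hypothetical): residues 0-2 co-vary, residues 3-7 are independent.
rng = np.random.default_rng(1)
base = rng.random(300) < 0.3
X_toy = (rng.random((300, 8)) < 0.3).astype(float)
for j in range(3):
    flip = rng.random(300) < 0.05
    X_toy[:, j] = np.logical_xor(base, flip)
alpha = count_significant_pcs(X_toy, n_shuffles=50)
```

Taking the maximum over all shuffles, rather than a percentile, makes the cut-off conservative: a spectral mode is retained only if it exceeds every eigenvalue observed under the null model.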
Noise threshold, γk
During the initial iteration, the matrix T subject to the thresholding step in Algorithm 1 is given by
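$$T = [t_1 \;\; t_2 \;\; \cdots \;\; t_\alpha] = C Q_\alpha = [\lambda_1 q_1 \;\; \lambda_2 q_2 \;\; \cdots \;\; \lambda_\alpha q_\alpha]. \tag{4}$$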
Our design looks for a suitable threshold which eliminates variables that appear uncorrelated (i.e., non-sector variables), for which the corresponding entries of qi correspond purely to sampling noise for every i = 1, …, α. (Note that Eq (4) only holds for the first iteration of Algorithm 1. However, from [19], coordinates that are set to zero in the first iteration remain zero in subsequent iterations.) Assume that there are Mns (unknown) non-sector residues, which we denote by , and let represent qk but restricted to the coordinates in only. The proposed threshold design is based on a statistical description of the worst-case non-sector coordinate; such description relies on the observation that the sample correlation matrices generated with HCV and HIV sequence data have spectral characteristics that are reminiscent of so-called “spiked” correlation models of random matrix theory (Fig 2). To be more specific, we exploit theoretical properties derived for spiked models (Theorems 4 and 6 in [66]) concerning the asymptotic distributions of sample eigenvalues and eigenvectors. Upon particularizing those results to the specific eigenvector structure conveyed by the sectors, they indicate that the coordinates of are distributed, up to scaling, as those of a vector that is uniformly distributed on the (M − α)-dimensional unit sphere. Such a vector is well-known to admit an equivalent representation as a rescaled vector of independent standard Gaussians (i.e., scaled to unit norm). Hence, with N and M both sufficiently large, M ≫ α, and η = M/N, these results along with some basic arguments lead to
(5) |
for each k = 1, …, α, where the yk, i are independent standard Gaussian random variables, and where the notation represents “equivalence in distribution”. Here,
which is a function of the quantity
The factor in Eq (5) stems from a complex statistical analysis given in [66], and represents the total statistical noise variance accumulated across all non-sector coordinates of qk. This quantity increases with increasing η = M/N, and its value may be quite large, particularly for scenarios in which the number of samples N are not substantially greater than the number of protein residues M (i.e., the relevant case for our viral data sets). [As a technical aside, we note that there is a condition on the minimum value of ℓk (equivalently λk) in order for the above relations to hold (see [66]); though, this condition will typically be obeyed as a consequence of the shuffling procedure used to infer α, which selects “sufficiently strong” spectral modes.]
Based on Eq (5), by standard arguments from order statistics, the cumulative distribution function of is given by
(6) |
where
(7) |
with erf the Gaussian “error function”. Note that all parameters of this distribution are functions of observable quantities (e.g., λk, M, and N), with the exception of the number of non-sector coordinates, Mns. To account for this, we invoke a worst-case assumption, replacing Mns with its upper bound, M. We may then set a threshold for the coordinates of qk based on this distribution, considering a suitable percentile. Here, taking a 95% cut-off,
(8) |
Numerical simulations (S2 Text) demonstrate that this choice of serves as a good compromise between the ability to accurately capture the sector residues of interest (i.e., a high true positive rate), while rejecting most of non-sector residues (i.e., a low false discovery rate). Finally, since the threshold is applied to the column vectors tk (Eq (4)), rather than the qk, the noise threshold γk is chosen as
(9) |
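To make the threshold design more concrete, the sketch below computes a threshold of this kind under explicitly stated assumptions, which may differ in detail from Eqs (5)–(9). It assumes the classical spiked-model limits: a sample eigenvalue λ relates to the population spike ℓ through λ = ℓ(1 + η/(ℓ − 1)), and the squared overlap between the sample and population eigenvectors tends to (1 − η/(ℓ − 1)²)/(1 + η/(ℓ − 1)). The remaining mass is treated as noise spread evenly over M Gaussian coordinates, with Mns replaced by M as in the worst-case assumption. All function and variable names are hypothetical.

```python
import math
from statistics import NormalDist


def spike_from_eigenvalue(lam, eta):
    """Invert lam = ell * (1 + eta / (ell - 1)) for the population spike ell."""
    if lam <= (1.0 + math.sqrt(eta)) ** 2:
        raise ValueError("eigenvalue too small for the spiked-model inversion")
    b = lam + 1.0 - eta
    return 0.5 * (b + math.sqrt(b * b - 4.0 * lam))


def noise_threshold(lam, M, N, level=0.95):
    """Threshold for the entries of t_k = lam * q_k (assumed form, see lead-in)."""
    eta = M / N
    ell = spike_from_eigenvalue(lam, eta)
    # Limiting squared overlap between the sample and population eigenvectors.
    overlap = (1.0 - eta / (ell - 1.0) ** 2) / (1.0 + eta / (ell - 1.0))
    sigma = math.sqrt((1.0 - overlap) / M)     # per-coordinate noise scale
    # Percentile of the largest of M independent |N(0, sigma^2)| variables.
    p = level ** (1.0 / M)
    q_cut = sigma * NormalDist().inv_cdf(0.5 * (1.0 + p))
    return lam * q_cut                         # rescale from q_k to t_k


# Toy numbers (hypothetical): M = 100 residues, N = 400 sequences.
gamma = noise_threshold(lam=5.3125, M=100, N=400)
```

The check on the eigenvalue mirrors the technical condition mentioned above: the inversion is only meaningful for spectral modes that lie above the bulk edge (1 + √η)².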
The iterative procedure in the Corr-ITSPCA method (Algorithm 1) is similar to the standard orthogonal iteration procedure used to obtain the eigenvectors of a matrix [20]. However, the addition of the intermediate thresholding step helps to identify a subspace spanned by the significant PCs such that there is no contribution from the non-sector residues in the estimated PCs. Moreover, in the process of obtaining an orthogonal subspace, this iterative procedure accurately infers the coordinates contributing to each PC by resolving spurious overlap between the supports (non-zero components) of the significant PCs, as demonstrated by ground-truth simulations (S11 Fig).
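The iteration in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the released RoCA code: the function name, the stopping rule (change in the projector onto the estimated subspace), the iteration cap and the toy matrix are assumptions of the sketch, and hard thresholding with a strict inequality is assumed.

```python
import numpy as np


def corr_itspca(C, alpha, gamma, max_iter=200, tol=1e-8):
    """Orthogonal iteration on C with hard thresholding of each column."""
    C = np.asarray(C, dtype=float)
    gamma = np.asarray(gamma, dtype=float)            # one threshold per PC
    eigvals, eigvecs = np.linalg.eigh(C)
    Q = eigvecs[:, np.argsort(eigvals)[::-1][:alpha]]  # Q^(0): leading eigenvectors
    for _ in range(max_iter):
        T = C @ Q                                      # multiplication step
        T_hat = np.where(np.abs(T) > gamma, T, 0.0)    # thresholding step
        Q_new, _ = np.linalg.qr(T_hat)                 # QR factorization step
        # Compare projectors so that sign flips from QR do not block convergence.
        change = np.linalg.norm(Q_new @ Q_new.T - Q @ Q.T)
        Q = Q_new
        if change < tol:
            break
    return Q


# Toy correlation matrix (hypothetical): two blocks of three residues plus noise.
rng = np.random.default_rng(0)
C_toy = np.eye(10)
C_toy[0:3, 0:3] = 0.8
C_toy[3:6, 3:6] = 0.6
np.fill_diagonal(C_toy, 1.0)
E = 0.02 * rng.standard_normal((10, 10))
E = (E + E.T) / 2.0
np.fill_diagonal(E, 0.0)
C_toy = C_toy + E
P = corr_itspca(C_toy, alpha=2, gamma=[0.3, 0.3])
```

In the toy example the two estimated PCs are supported on the two planted blocks, up to entries of negligible magnitude, which illustrates how the thresholding step removes the small spurious loadings that plain eigenvectors would carry on the remaining residues.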
From the sample correlation matrix C, the robust estimate of the PCs was obtained using the Corr-ITSPCA method (Algorithm 1), with α and γk designed as above. The estimated PCs pk, k = 1, …, α, produced by Algorithm 1, were then used to form α sectors as
$$\mathcal{S}_k = \{ m \in \{1, \ldots, M\} : |p_k(m)| > \epsilon \}, \quad k = 1, \ldots, \alpha, \tag{10}$$
where $|p_k(m)|$ is the absolute value of the mth coordinate of $p_k$ and $\epsilon$ is a small positive threshold. Note that the estimated $p_k$ do not generally have strict zero entries in the non-sector coordinates, but may contain very small values due to residual noise. As such, the small threshold $\epsilon$ was applied (Eq (10)) to form sectors in an automated way. The spurious entries were generally quite distinguishable from the relevant coordinates, even from simple visual inspection of $p_k$ (Fig 3A and S1 Fig).
As mentioned above, all fully conserved residues in the MSA were initially excluded from our analysis, as the Pearson correlation involving these residues was not defined. Given the lack of information regarding their potential interactions with other residues, and considering the tendency of neighboring residues in the primary structure to interact with each other, any fully conserved residue in the immediate neighborhood of a sector residue was incorporated into that sector.
Heat map of cleaned correlations for visualization
In Figs 3 and 6, we used heat maps to illustrate the computed sample correlation matrix C. As discussed above, the sample correlations were generally corrupted by statistical noise due to the finite number of available sequences. Thus, for a better visualization and, in particular, to appreciate the strong correlations within the inferred sectors, the sample correlation matrix was cleaned from statistical noise by thresholding the sample eigenvalues in such a way that the significant α spectral modes (Eq (3)) were kept unaltered, while the remaining eigenvalues (which do not appear to contribute genuine correlations) were collapsed to a constant. Specifically, the cleaned sample correlation matrix was obtained as
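$$\tilde{C} = \sum_{k=1}^{M} \tilde{\lambda}_k q_k q_k^{\intercal}, \tag{11}$$
where
$$\tilde{\lambda}_k = \begin{cases} \lambda_k, & k \le \alpha, \\ \zeta, & k > \alpha, \end{cases}$$
with ζ a constant value such that the trace of $\tilde{C}$ remained normalized (equal to M). Note that $\tilde{C}$ is not a standard correlation matrix, as its diagonal entries are in general not equal to one ($\tilde{C}_{ii} \neq 1$). A standardized version was then computed as
$$\tilde{C}^{\mathrm{std}} = D^{-1/2} \tilde{C} D^{-1/2}, \tag{12}$$
where D is a diagonal matrix with $D_{ii} = \tilde{C}_{ii}$, and used to depict the cleaned correlations as a heat map.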
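The eigenvalue cleaning and re-standardization described above can be sketched as follows. The function name and the toy data are hypothetical; the constant ζ is computed from the trace condition, which for a correlation matrix makes it the average of the discarded eigenvalues.

```python
import numpy as np


def cleaned_correlation(C, alpha):
    """Keep the alpha leading spectral modes, flatten the rest, then standardize."""
    C = np.asarray(C, dtype=float)
    M = C.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    lam, Q = eigvals[order], eigvecs[:, order]
    lam_clean = lam.copy()
    # Collapse the remaining eigenvalues to one constant that keeps the trace at M.
    lam_clean[alpha:] = (M - lam[:alpha].sum()) / (M - alpha)
    C_clean = (Q * lam_clean) @ Q.T        # sum_k lam_clean[k] * q_k q_k^T
    d = np.sqrt(np.diag(C_clean))
    return C_clean / np.outer(d, d)        # rescale so that the diagonal equals one


# Toy example (hypothetical data): five variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X[:, 1] += X[:, 0]
C_raw = np.corrcoef(X, rowvar=False)
C_std = cleaned_correlation(C_raw, alpha=1)
```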
Statistical independence of inferred sectors
We introduced a metric called “normalized entropy deviation (NED)” to quantify the extent to which two groups of residues are statistically independent of each other. The NED between two sectors i and j is defined as
(13) |
where $s_i$ is a set comprising the five residues with largest correlation magnitude of sector i and $H(s_i)$ is the entropy of $s_i$ computed from the binary MSA matrix. Specifically, this entropy is computed over all combinations of the residues in set $s_i$ as follows
$$H(s_i) = -\sum_{\kappa=1}^{2^{\#(s_i)}} f_\kappa \log f_\kappa, \tag{14}$$
where fκ is the frequency of the combination κ in the MSA and #(si) is the cardinality of set si. In theory, if two given sectors are perfectly independent, the sum of the entropies of the individual sectors must be equal to the entropy of both sectors taken together, resulting in NEDinter = 0. In practice however, a small non-zero value of NEDinter is expected due to finite-sampling noise, even if the sectors are independent. We obtain an estimate of it by constructing a null case, where the entries of the MSA corresponding to the sets si and sj are randomly shuffled in such a way that any correlation between the two sets is essentially eliminated, while the correlations between residues in an individual set remain unaltered. Using Eq (13), NEDinter is computed for 500 such randomly shuffled realizations of the MSA and the average value (referred to as NEDrandom) represents the null (lower) reference value for NEDinter which is expected if the two sectors are independent. Substantial deviations from NEDrandom should reflect correlation between the sectors. In order to quantify the extent of such deviations in a clearly correlated case, we computed an upper reference NEDintra, obtained when the residues in both sets si and sj belong to the same sector. It is defined as
(15) |
where is the set comprising the five residues with largest correlation magnitude of sector i with the residues in si excluded.
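The entropy computation and the shuffled reference described in this section can be illustrated with the following pure-Python sketch. It computes the joint entropy of Eq (14) from a binary alignment and the raw difference between the sum of the individual entropies and the joint entropy; the normalization used in Eq (13) is deliberately not applied, and the natural logarithm, the function names and the toy alignments are assumptions of the sketch.

```python
import math
import random
from collections import Counter


def joint_entropy(rows, cols):
    """Entropy of the 0/1 patterns observed at the residues listed in cols."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    # Patterns that never occur contribute nothing to the sum.
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def entropy_deviation(rows, s_i, s_j):
    """H(s_i) + H(s_j) minus the joint entropy; zero for independent sets."""
    return (joint_entropy(rows, s_i) + joint_entropy(rows, s_j)
            - joint_entropy(rows, list(s_i) + list(s_j)))


def shuffled_reference(rows, s_i, s_j, n_rep=500, seed=0):
    """Average deviation after permuting the s_j block across sequences."""
    rng = random.Random(seed)
    n, m = len(rows), len(rows[0])
    total = 0.0
    for _ in range(n_rep):
        perm = list(range(n))
        rng.shuffle(perm)
        # The s_j entries of a sequence move together, so correlations
        # within s_j (and within s_i) are left unaltered.
        mixed = [[rows[perm[r]][c] if c in s_j else rows[r][c] for c in range(m)]
                 for r in range(n)]
        total += entropy_deviation(mixed, s_i, s_j)
    return total / n_rep


# Toy alignments (hypothetical): two fully dependent and two independent residues.
dependent = [[0, 0], [0, 0], [1, 1], [1, 1]]
independent = [[0, 0], [0, 1], [1, 0], [1, 1]]
dev_dep = entropy_deviation(dependent, [0], [1])
dev_ind = entropy_deviation(independent, [0], [1])
```

For the dependent toy alignment the deviation equals log 2, whereas for the independent one it vanishes, in line with the interpretation given above.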
Clinical data used in the immunological study
The HLA alleles associated with LTNP and RP were reported in [58]. A list of HIV Gag epitopes that are presented by HLA alleles and targeted by T cells of either HIV LTNP or RP was compiled using the data from the Los Alamos HIV Molecular Immunology Database (http://www.hiv.lanl.gov/content/immunology) and is presented in S1 Table.
Supporting information
Acknowledgments
We are particularly grateful to Arup Chakraborty for extensive discussions over the duration of this work. We also thank Iain Johnstone, Karthik Shekhar, John Barton, and Raymond Louie for providing useful input on an earlier version of the manuscript.
Data Availability
Accession numbers of all sequences used in this work are provided in S3 File. Source code for the proposed RoCA method along with the code for reproducing all figures is available at https://github.com/ahmedaq/RoCA.
Funding Statement
This work was supported by the General Research Fund of the Hong Kong Research Grants Council (RGC) (grant numbers 16207915, 16234716). AAQ was also supported by the Hong Kong Ph.D. Fellowship Scheme (HKPFS) and MRM by a Hari Harilela endowment. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Goulder PJR, Watkins DI. HIV and SIV CTL escape: Implications for vaccine design. Nat Rev Immunol. 2004;4(8):630–640. 10.1038/nri1417 [DOI] [PubMed] [Google Scholar]
- 2. Oniangue-Ndza C, Kuntzen T, Kemper M, Berical A, Wang YE, Neumann-Haefelin C, et al. Compensatory mutations restore the replication defects caused by cytotoxic T lymphocyte escape mutations in hepatitis C virus polymerase. J Virol. 2011;85(22):11883–11890. 10.1128/JVI.00779-11 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. John M, Gaudieri S. Influence of HIV and HCV on T cell antigen presentation and challenges in the development of vaccines. Front Microbiol. 2014;5(SEP):1–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Zanini F, Brodin J, Thebo L, Lanz C, Bratt G, Albert J, et al. Population genomics of intrapatient HIV-1 evolution. eLife. 2015;4:e11282 10.7554/eLife.11282 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Shekhar K, Ruberman C, Ferguson A, Barton J, Kardar M, Chakraborty A. Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes. Phys Rev E. 2013;88(6):062705 10.1103/PhysRevE.88.062705 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Barton JP, Goonetilleke N, Butler TC, Walker BD, McMichael AJ, Chakraborty AK. Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable. Nat Commun. 2016;7(May):11660 10.1038/ncomms11660 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Louie RHY, Kaczorowski KJ, Barton JP, Chakraborty AK, McKay MR. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc Natl Acad Sci. 2018;115(4):E564–E573. 10.1073/pnas.1717765115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Ferguson AL, Mann JK, Omarjee S, Ndung’u T, Walker BD, Chakraborty AK. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity. 2013;38(3):606–617. 10.1016/j.immuni.2012.11.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Mann JK, Barton JP, Ferguson AL, Omarjee S, Walker BD, Chakraborty AK, et al. The fitness landscape of HIV-1 Gag: Advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput Biol. 2014;10(8):e1003776 10.1371/journal.pcbi.1003776 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Hart GR, Ferguson AL. Empirical fitness models for hepatitis C virus immunogen design. Phys Biol. 2015;12(6):066006 10.1088/1478-3975/12/6/066006 [DOI] [PubMed] [Google Scholar]
- 11. Dahirel V, Shekhar K, Pereyra F, Miura T, Artyomov M, Talsania S, et al. Coordinate linkage of HIV evolution reveals regions of immunological vulnerability. Proc Natl Acad Sci. 2011;108(28):11530–11535. 10.1073/pnas.1105315108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Quadeer AA, Louie RHY, Shekhar K, Chakraborty AK, Hsing IM, McKay MR. Statistical linkage analysis of substitutions in patient-derived sequences of genotype 1a hepatitis C virus nonstructural protein 3 exposes targets for immunogen design. J Virol. 2014;88(13):7628–44. 10.1128/JVI.03812-13 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat Rev Genet. 2013;14(4):249–261. 10.1038/nrg3414 [DOI] [PubMed] [Google Scholar]
- 14. Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: Evolutionary units of three-dimensional structure. Cell. 2009;138(4):774–786. 10.1016/j.cell.2009.07.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Rivoire O, Reynolds KA, Ranganathan R. Evolution-based functional decomposition of proteins. PLoS Comput Biol. 2016;12(6):e1004817 10.1371/journal.pcbi.1004817 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Novinec M, Korenč M, Caflisch A, Ranganathan R, Lenarčič B, Baici A. A novel allosteric mechanism in the cysteine peptidase cathepsin K discovered by computational methods. Nat Commun. 2014;5 10.1038/ncomms4287 [DOI] [PubMed] [Google Scholar]
- 17. Smock RG, Rivoire O, Russ WP, Swain JF, Leibler S, Ranganathan R, et al. An interdomain sector mediating allostery in Hsp70 molecular chaperones. Mol Sys Biol. 2010;6(414):1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Johnstone IM. On the distribution of the largest eigenvalue in principal components analysis. Ann Stat. 2001; p. 295–327. 10.1214/aos/1009210544 [DOI] [Google Scholar]
- 19. Ma Z. Sparse principal component analysis and iterative thresholding. Ann Stat. 2013;41(2):772–801. 10.1214/13-AOS1097 [DOI] [Google Scholar]
- 20. Golub GH, Van Loan CF. Matrix computations. Johns Hopkins University Press, Baltimore MD, USA; 1996. [Google Scholar]
- 21. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat. 2006;15(2):265–286. 10.1198/106186006X113430 [DOI] [Google Scholar]
- 22. Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. J Am Stat Assoc. 2009;104(486):682–693. 10.1198/jasa.2009.0121 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Paul D, Johnstone I. Augmented sparse principal component analysis for high dimensional data. arXiv Prepr arXiv12021242. 2012; p. 1–45.
- 24. Yuan X, Zhang T. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research. 2013;14:899–925. [Google Scholar]
- 25. Zhang Y, D’Aspremont A, Ghaoui LE. In: Anjos MF, Lasserre JB, editors. Sparse PCA: Convex relaxations, algorithms and applications. Boston, MA: Springer US; 2012. p. 915–940. Available from: 10.1007/978-1-4614-0769-0_31. [DOI] [Google Scholar]
- 26. Jiang R, Fei H, Huan J. A family of joint sparse PCA algorithms for anomaly localization in network data streams. IEEE Trans Knowl Data Eng. 2013;25(11):2421–2433. doi:10.1109/TKDE.2012.176
- 27. Wang D, Lu H, Yang MH. Online object tracking with sparse prototypes. IEEE Trans Image Process. 2013;22(1):314–325. doi:10.1109/TIP.2012.2202677
- 28. Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat Methods. 2016;13(5):443–445. doi:10.1038/nmeth.3809
- 29. Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci. 2006;103(23):8577–8582. doi:10.1073/pnas.0601602103
- 30. Yu GY, Lee KJ, Gao L, Lai MMC. Palmitoylation and polymerization of hepatitis C virus NS4B protein. J Virol. 2006;80(12):6013–6023. doi:10.1128/JVI.00053-06
- 31. Tong W, Nagano-Fujii M, Hidajat R. Physical interaction between hepatitis C virus NS4B protein and CREB-RP/ATF6β. Biochem Biophys Res Commun. 2002;299:366–372. doi:10.1016/S0006-291X(02)02638-4
- 32. McLaughlin RN Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. The spatial architecture of protein function and adaptation. Nature. 2012;491:138–142. doi:10.1038/nature11500
- 33. Zhou W, Parent LJ, Wills JW, Resh MD. Identification of a membrane-binding domain within the amino-terminal region of human immunodeficiency virus type 1 Gag protein which interacts with acidic phospholipids. J Virol. 1994;68(4):2556–2569.
- 34. Datta SAK, Temeselew LG, Crist RM, Soheilian F, Kamata A, Mirro J, et al. On the role of the SP1 domain in HIV-1 particle assembly: A molecular switch? J Virol. 2011;85(9):4111–4121. doi:10.1128/JVI.00006-11
- 35. Pornillos O, Ganser-Pornillos BK, Kelly BN, Hua Y, Whitby FG, Stout CD, et al. X-ray structures of the hexameric building block of the HIV capsid. Cell. 2009;137(7):1282–1292. doi:10.1016/j.cell.2009.04.063
- 36. Pornillos O, Ganser-Pornillos BK, Yeager M. Atomic-level modelling of the HIV capsid. Nature. 2011;469(7330):424–427. doi:10.1038/nature09640
- 37. Byeon IJL, Meng X, Jung J, Zhao G, Yang R, Ahn J, et al. Structural convergence between Cryo-EM and NMR reveals intersubunit interactions critical for HIV-1 capsid function. Cell. 2009;139(4):780–790. doi:10.1016/j.cell.2009.10.010
- 38. Guo J, Wu T, Anderson J, Kane BF, Johnson DG, Gorelick RJ, et al. Zinc finger structures in the human immunodeficiency virus type 1 nucleocapsid protein facilitate efficient minus- and plus-strand transfer. J Virol. 2000;74(19):8980–8988. doi:10.1128/JVI.74.19.8980-8988.2000
- 39. Hill MK, Shehu-Xhilaga M, Crowe SM, Mak J. Proline residues within spacer peptide p1 are important for human immunodeficiency virus type 1 infectivity, protein processing, and genomic RNA dimer stability. J Virol. 2002;76(22):11245–11253. doi:10.1128/JVI.76.22.11245-11253.2002
- 40. Coren LV, Thomas JA, Chertova E, Sowder RC, Gagliardi TD, Gorelick RJ, et al. Mutational analysis of the C-terminal Gag cleavage sites in human immunodeficiency virus type 1. J Virol. 2007;81(18):10047–10054. doi:10.1128/JVI.02496-06
- 41. Goldsmith MA, Warmerdam MT, Atchison RE, Miller MD, Greene WC. Dissociation of the CD4 downregulation and viral infectivity enhancement functions of human immunodeficiency virus type 1 Nef. J Virol. 1995;69(7):4112–4121.
- 42. Greenberg ME, Iafrate AJ, Skowronski J. The SH3 domain-binding surface and an acidic motif in HIV-1 Nef regulate trafficking of class I MHC complexes. EMBO J. 1998;17(10):2777–2789. doi:10.1093/emboj/17.10.2777
- 43. Grzesiek S, Stahl SJ, Wingfield PT, Bax A. The CD4 determinant for downregulation by HIV-1 Nef directly binds to Nef. Mapping of the Nef binding surface by NMR. Biochemistry. 1996;35(32):10256–10261. doi:10.1021/bi9611164
- 44. Arold S, Hoh F, Domergue S, Birck C, Delsuc MA, Jullien M, et al. Characterization and molecular basis of the oligomeric structure of HIV-1 Nef protein. Protein Sci. 2000;9(6):1137–1148. doi:10.1110/ps.9.6.1137
- 45. Lama J, Mangasarian A, Trono D. Cell-surface expression of CD4 reduces HIV-1 infectivity by blocking Env incorporation in a Nef- and Vpu-inhibitable manner. Curr Biol. 1999;9(12):622–631. doi:10.1016/S0960-9822(99)80284-X
- 46. Poe JA, Smithgall TE. HIV-1 Nef dimerization is required for Nef-mediated receptor downregulation and viral replication. J Mol Biol. 2009;394(2):329–342. doi:10.1016/j.jmb.2009.09.047
- 47. Kim DW, Kim J, Gwack Y, Han JH, Choe J. Mutational analysis of the hepatitis C virus RNA helicase. J Virol. 1997;71(12):9400–9409.
- 48. Mackintosh SG, Lu JZ, Jordan JB, Harrison MK, Sikora B, Sharma SD, et al. Structural and biological identification of residues on the surface of NS3 helicase required for optimal replication of the hepatitis C virus. J Biol Chem. 2006;281(6):3528–3535. doi:10.1074/jbc.M512100200
- 49. Lin C. HCV NS3-4A serine protease. In: Hepatitis C Viruses: Genomes and Molecular Biology. Horizon Scientific Press; 2006. p. 163–206.
- 50. Brass V, Berke JM, Montserret R, Blum HE, Penin F, Moradpour D. Structural determinants for membrane association and dynamic organization of the hepatitis C virus NS3-4A complex. Proc Natl Acad Sci. 2008;105(38):14545–14550. doi:10.1073/pnas.0807298105
- 51. Koch JO, Bartenschlager R. Modulation of hepatitis C virus NS5A hyperphosphorylation by nonstructural proteins NS3, NS4A, and NS4B. J Virol. 1999;73(9):7138–7146.
- 52. Neddermann P, Clementi A, De Francesco R. Hyperphosphorylation of the hepatitis C virus NS5A protein requires an active NS3 protease, NS4A, NS4B, and NS5A encoded on the same polyprotein. J Virol. 1999;73(12):9984–9991.
- 53. Jones DM, Patel AH, Targett-Adams P, McLauchlan J. The hepatitis C virus NS4B protein can trans-complement viral RNA replication and modulates production of infectious virus. J Virol. 2009;83(5):2163–2177. doi:10.1128/JVI.01885-08
- 54. Li S, Yu X, Guo Y, Kong L. Interaction networks of hepatitis C virus NS4B: Implications for antiviral therapy. Cell Microbiol. 2012;14(7):994–1002. doi:10.1111/j.1462-5822.2012.01773.x
- 55. Gouttenoire J, Penin F, Moradpour D. Hepatitis C virus nonstructural protein 4B: A journey into unexplored territory. Rev Med Virol. 2010;20:117–129. doi:10.1002/rmv.640
- 56. Gao X, Nelson GW, Karacki P, Martin MP, Phair J, Kaslow R, et al. Effect of a single amino acid change in MHC class I molecules on the rate of progression to AIDS. N Engl J Med. 2001;344(22):1668–1675. doi:10.1056/NEJM200105313442203
- 57. Deeks SG, Walker BD. Human immunodeficiency virus controllers: Mechanisms of durable virus control in the absence of antiretroviral therapy. Immunity. 2007;27(3):406–416. doi:10.1016/j.immuni.2007.08.010
- 58. Pereyra F, Jia X, McLaren PJ, Telenti A, de Bakker PIW, Walker BD, et al. The major genetic determinants of HIV-1 control affect HLA Class I peptide presentation. Science. 2010;330(6010):1551–1557. doi:10.1126/science.1195271
- 59. Simon-Loriere E, Holmes EC. Why do RNA viruses recombine? Nat Rev Microbiol. 2011;9(8):617–626. doi:10.1038/nrmicro2614
- 60. Ternois F, Sticht J, Duquerroy S, Kräusslich HG, Rey FA. The HIV-1 capsid protein C-terminal domain in complex with a virus assembly inhibitor. Nat Struct Mol Biol. 2005;12(8):678–682. doi:10.1038/nsmb967
- 61. Saalau-Bethell S, Woodhead A. Discovery of an allosteric mechanism for the regulation of HCV NS3 protein function. Nat Chem Biol. 2012;8(11):920–925. doi:10.1038/nchembio.1081
- 62. Drummer HE. Challenges to the development of vaccines to hepatitis C virus that elicit neutralizing antibodies. Front Microbiol. 2014;5(July):1–10.
- 63. McKiernan SM, Hagan R, Curry M, McDonald GSA, Kelly A, Nolan N, et al. Distinct MHC class I and II alleles are associated with hepatitis C viral clearance, originating from a single source. Hepatology. 2004;40(1):108–114. doi:10.1002/hep.20261
- 64. Grakoui A, Shoukry NH, Woollard DJ, Han JH, Hanson HL, Ghrayeb J, et al. HCV persistence and immune evasion in the absence of memory T cell help. Science. 2003;302(5645):659–662. doi:10.1126/science.1088774
- 65. Quadeer AA, Morales-Jimenez D, McKay MR. A tailored sparse PCA method for finding vaccine targets against hepatitis C. 50th Asilomar Conference on Signals, Systems and Computers. 2016; p. 100–104.
- 66. Paul D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Stat Sinica. 2007;17(4):1617.
- 67. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci. 2011;108(49):E1293–E1301. doi:10.1073/pnas.1111471108
- 68. Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008;24(3):333–340. doi:10.1093/bioinformatics/btm604
- 69. Göbel U, Sander C, Schneider R, Valencia A. Correlated mutations and residue contacts in proteins. Proteins. 1994;18(4):309–317. doi:10.1002/prot.340180402
- 70. Kass I, Horovitz A. Mapping pathways of allosteric communication in GroEL by analysis of correlated mutations. Proteins. 2002;48(4):611–617. doi:10.1002/prot.10180
- 71. Mihalek I, Reš I, Lichtarge O. A family of evolution-entropy hybrid methods for ranking protein residues by importance. J Mol Biol. 2004;336(5):1265–1282. doi:10.1016/j.jmb.2003.12.078
- 72. Fodor AA, Aldrich RW. Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins. 2004;56(2):211–221. doi:10.1002/prot.20098
- 73. Rausell A, Juan D, Pazos F, Valencia A. Protein interactions and ligand binding: From protein subfamilies to functional specificity. Proc Natl Acad Sci. 2010;107(5):1995–2000. doi:10.1073/pnas.0908044107
- 74. Hedstrom L, Perona JJ, Rutter WJ. Converting trypsin to chymotrypsin: Residue 172 is a substrate specificity determinant. Biochemistry. 1994;33(29):8757–8763. doi:10.1021/bi00195a018
- 75. Hedstrom L. Serine protease mechanism and specificity. Chem Rev. 2002;102(12):4501–4524. doi:10.1021/cr000033x
- 76. Huntington JA, Esmon CT. The molecular basis of thrombin allostery revealed by a 1.8Å structure of the “slow” form. Structure. 2003;11(4):469–479. doi:10.1016/S0969-2126(03)00049-2
- 77. Guinto ER, Caccia S, Rose T, Fütterer K, Waksman G, Di Cera E. Unexpected crucial role of residue 225 in serine proteases. Proc Natl Acad Sci. 1999;96(5):1852–1857. doi:10.1073/pnas.96.5.1852
Data Availability Statement
Accession numbers of all sequences used in this work are provided in S3 File. Source code for the proposed RoCA method along with the code for reproducing all figures is available at https://github.com/ahmedaq/RoCA.