Abstract
Background
Proteases of human pathogens are becoming increasingly important drug targets, hence it is necessary to understand their substrate specificity and to interpret this knowledge in practically useful ways. New methods are being developed that produce large amounts of cleavage information for individual proteases and some have been applied to extract cleavage rules from data. However, the hitherto proposed methods for extracting rules have been neither easy to understand nor very accurate. To be practically useful, cleavage rules should be accurate, compact, and expressed in an easily understandable way.
Results
A new method is presented for producing cleavage rules for viral proteases with seemingly complex cleavage profiles. The method is based on orthogonal search-based rule extraction (OSRE) combined with spectral clustering. It is demonstrated on substrate data sets for human immunodeficiency virus type 1 (HIV-1) protease and hepatitis C (HCV) NS3/4A protease, showing excellent prediction performance for both HIV-1 cleavage and HCV NS3/4A cleavage, agreeing with observed HCV genotype differences. New cleavage rules (consensus sequences) are suggested for HIV-1 and HCV NS3/4A cleavages. The practical usability of the method is also demonstrated by using it to predict the location of an internal cleavage site in the HCV NS3 protease and to correct the location of a previously reported internal cleavage site in the HCV NS3 protease. The method is fast to converge and yields accurate rules, on par with previous results for HIV-1 protease and better than previous state-of-the-art for HCV NS3/4A protease. Moreover, the rules are fewer and simpler than previously obtained with rule extraction methods.
Conclusion
A rule extraction methodology by searching for multivariate low-order predicates yields results that significantly outperform existing rule bases on out-of-sample data, but are more transparent to expert users. The approach yields rules that are easy to use and useful for interpreting experimental data.
Background
The human body contains almost 600 proteases [1] that are involved in a number of important biological pathways such as blood coagulation, fibrinolysis, complement activation, hormone production and digestion [2]. These proteases are often essential players in elaborate networks, or cascades, where zymogens (catalytically inactive protease precursors) are activated in order to activate a downstream zymogen or digest/inactivate a structural or biologically active protein. Proteases therefore tend to have highly specific substrate repertoires and are regulated by endogenous protease inhibitors, with a delicate balance between these counteracting forces. An imbalance between active proteases and protease inhibitors may cause, or contribute to, many diseases. A classical example is hereditary deficiency of α1 proteinase inhibitor (also known as antitrypsin), that can lead to early-onset emphysema [3]. This process is greatly accelerated upon smoking, partly because antitrypsin is susceptible to inactivation by oxidation by cigarette smoke and because neutrophil elastase, a potent protease with a broad substrate specificity including lung elastin, is up-regulated and released from neutrophils by nicotine. Similarly, the progression of some severe diseases can be successfully slowed down with the use of protease inhibitors. Proteases are consequently important drug targets [4] and the list of protease inhibitors approved for clinical use is steadily growing, including drugs to treat, e.g., hypertension, thrombosis, pancreatitis, periodontitis, respiratory diseases, cancer, HIV/AIDS and probably soon hepatitis C [5]. However, the road to success has been paved with numerous failures, because of many unexpected (and in most cases unwanted) side effects. An ideal protease inhibitor should have a well defined substrate specificity, broad enough to treat the disease efficiently, but narrow enough not to interfere with other proteases or biological processes, combined with low toxicity to the host cells. HIV-1 protease has an important function for virus maturation during HIV infection, but its substrate specificity is complex and endogenous protease inhibitors of the host cells infected with HIV do not seem to be able to stop its action. Therefore, artificial HIV-1 protease inhibitors are needed and have been in clinical use for HIV/AIDS treatment for over a decade. Among other viral targets, the hepatitis C virus (HCV) NS3 protease is also a candidate target for antiviral drugs, since currently available HCV therapy is effective in only half of the patients and limited by serious side effects [6,7].
A challenge in the hunt for better protease inhibitors is to make maximum use of available experimental protease data and provide accurate rules for the substrate specificity, rules that can be used to estimate the effect of proteases in biological pathways. There are three important criteria that should be met by these specificity rules: they should be accurate in terms of out-of-sample prediction; they should be comprehensible (understandable); and they should have a high fidelity to the data from which they were extracted. The aim of the present work was to present a method for extracting cleavage rules from viral protease data. A method that provides rules that meet these three criteria better than previously presented approaches. The method is exemplified on HIV-1 and HCV NS3 protease data sets.
The structure of cleavage rules
The first step towards meeting the three requirements above (accuracy, comprehensibility, and fidelity) is to understand how cleavage rules are typically formulated. Rules that are expressed in the standard form are probably more likely to be comprehensible, and they may even be more accurate if the standard form has developed over time so that people have included their knowledge of how protease cleavage works into these rules.
Cleavage specificities tend to be described assuming that certain positions should contain certain amino acids. Common terms used are "consensus motif" (or "consensus sequence") and "sub-site preference" (see e.g. [8]), where the latter refers to those amino acids that seem to match the active site in the protease, and sub-site preferences are typically illustrated with specificity profiles or histograms showing the frequency of amino acids in the different positions of cleaved substrates. This assumption is evident in the rules used for different proteases in the tools PeptideCutter and PeptideMass on http://www.expasy.ch; they are all expressed as lists of allowed (or disallowed) amino acids in the positions of the peptides (examples are provided below).
The tradition of expressing cleavage rules in this fashion may not be a coincidence, and a rule extraction method that explicitly produces conjunctive rules may therefore be the right way to go for these problems.
Conjunctive rules are rules that are formulated as lists of requirements that must all be true (i.e. the requirements for each position are connected with a logical AND). Using conjunctive rules puts no restriction on the possible cleavage specificities that can be described; conjunctive rules can be combined with the logical OR function into any logical rule. This is the so-called Disjunctive Normal Form Theorem [9].
A very simple cleavage rule is, e.g., the Arg-C proteinase cleavage rule [10]. It cleaves peptides on the C-terminal side of Arginine (Arg, R). This is expressed as
using the standard Schechter and Berger notation [11] where the substrate sites are denoted by ...-P3-P2-P1-P1'-P2'-P3'-..., with the scissile bond located between P1 and P1'.
An effective immune response of the human host against a viral attack involves the generation of virus-specific cytotoxic T lymphocytes (CTLs), whose main function is to kill virus-infected cells. The two important countermeasures used by the CTLs are perforin/granzyme mediated apoptosis and Fas-mediated apoptosis. Granzyme B, a member of the hematopoietic serine protease superfamily, is stored in granules of cytotoxic T cells and natural killer cells and has a requirement for Aspartate (Asp, D) in the P1 position [12]. However, the full rule is a conjunctive rule that involves more positions [13]:
which means that it cleaves on the C-terminal side of the sequence IEPD, Isoleucine-Glutamate-Proline-Aspartate (Ile-Glu-Pro-Asp), i.e. all four positions must have specific amino acids in them for cleavage to occur.
Each position can also allow a set of amino acids, as is the case for the bovine coagulation factor, Factor Xa [14]:
This means that position P4 should be occupied by any of {A,G,I,L,T,V,M}, but nothing else, that position P3 should be occupied by any of {D,E}, but nothing else, that P2 should be occupied by G, and that P1 should be occupied by R for cleavage to occur (and it occurs on the C-terminal side of R).
There are also proteases that have more than one rule, e.g. if there are two types of cleavages that occur. This is the case for Thrombin [10], which has the following two cleavage rules:
and
The notation for positions P1' and P2' means that cleavage will not occur if Aspartate (Asp, D) or Glutamate (Glu, E) are in any of those positions.
Combinations of conjunctive rules can also be used to describe cleavage rules if there are interactions and competitions among the positions. Assume for instance a cleavage site with two nearby positions (P2 and P1) with preference for hydrophobic amino acids, e.g. L, M, F, Y (Leucine, Methionine, Phenylalanine and Tyrosine). Assume also that it is impossible to fit both positions with F or Y at the same time (due to space constraints) but that it is sufficient that one of them is present in one of the two positions. This would lead to a set of two conjunctive rules
and
To conclude, conjunctive rules follow the notation used in standard tools for describing cleavage of peptides and they can be used to describe any type of rule. If conjunctive rules can be used to produce cleavage rules that are simpler but as accurate as those from other methods that do not follow the standard notation, then that would speak in favour of a conjunctive rule approach.
Results
Data
HIV-1 protease data
You et al. [15] collected a HIV-1 protease substrate data set containing 746 octamers, of which 401 are cleaved and 345 are non-cleaved. In this data set, the octamer SQNYAIVQ was originally erroneously labeled as non-cleaved [16]. This error was corrected and the resulting data set is referred to as the HIV-1 PR 746 data set in this study. The octamers are denoted P4-P3-P2-P1-P1'-P2'-P3'-P4'.
Kontijevskis et al. [17] collected a substantially larger HIV-1 protease substrate data set containing 1625 octamers, where 374 are cleaved and 1251 are non-cleaved. This data set also has the wrong label for SQNYAIVQ, which we corrected. The corrected data set is referred to as the HIV-1 PR 1625 data set in this study.
The similarity between the HIV-1 PR 746 and the HIV-1 PR 1625 data sets is very high since they were partly collected from the same literature sources. There are 659 octamers that are common between the HIV-1 PR 746 and the HIV-1 PR 1625 data sets. Of these, seven are labeled differently in the two data sets: AAAMSSAI, ARVLAQAM, GRINVALV, SGVFSVNG and SGVYQLSA are labeled as cleaved in the 746 data set and as non-cleaved in the 1625 data set; AEAMSQVT and FRSGVETT are labeled as non-cleaved in the 746 data set and as cleaved in the 1625 data set.
Recently, Schilling and Overall [8] collected an even larger set of peptides cleaved by HIV-1 protease. This set was published after the rules described herein were constructed and can thus be used as independent test data. A set of octamers was generated from this data in the following way: cleaved octamers were taken from all peptides that contained at least four residues on the prime and the non-prime sides of the scissile bond; non-cleaved octamers were generated by sliding a window of size eight residues on both sides of the scissile bond (but not across the scissile bond) if the prime or the non-prime side had at least eight residues. For example, Schilling and Overall [8] (supplementary Table nineteen) report the cleaved peptide PLLGGSLMEYAILSAIAAMNEPK, where the cleavage site is between Y and A. This produces the cleaved octamer LMEYAILS and the non-cleaved octamers PLLGGSLM, LLGGSLME, LGGSLMEY, AILSAIAA, ILSAIAAM, LSAIAAMN, SAIAAMNE, AIAAMNEP and IAAMNEPK. Peptides that had an ambiguous P1 residue or that were marked "excluded" in Tables nineteen and twenty in the supporting material to their paper [8] were removed.
The final data set contains 3261 octamers, of which 436 are cleaved and 2825 are non-cleaved. We refer to this as the HIV-1 PR 3261 data set.
There is no overlap between the HIV-1 PR 746 and the HIV-1 PR 3261 data sets. There are twenty common octamers in the HIV-1 PR 1625 and the HIV-1 PR 3261 data sets, of which two are in conflict (EENFAVEA and QEEMLQRE, which are both labeled as non-cleaved in the 1625 data set but as cleaved in the 3261 data set). There is only one sequence in the HIV-1 PR 3261 data set that differs by one position from any octamer from any sequence in the HIV-1 PR 746 or HIV-1 PR 1625 data sets. This sequence is GWVLGEHG, which is labeled as cleaved and differs in one position from the cleaved GWILGEHG in the HIV-1 PR 746 and HIV-1 PR 1625 data sets. There are two sequences in the HIV-1 PR 746 data set that differ in two positions from sequences in the HIV-1 PR 3261 data set: the cleaved ARVLFDAL, which is similar to the non-cleaved APVLLDAL in the HIV-1 PR 3261 data set, and the cleaved GWILAEHG, which is similar to the cleaved GWVLGEHG in the HIV-1 PR 3261 data set. There are four sequences in the HIV-1 PR 1625 data set that differ in two positions from sequences in the HIV-1 PR 3261 data set: the non-cleaved NKILLAEL, VDKLVSAG and TEEKIKAL, which are similar to the non-cleaved octamers NKVNLAEL, VDVLVSSG and TEDKINAL, and the cleaved octamer GWILAEHG that differs in two positions from the cleaved GWVLGEHG in the HIV-1 PR 3261 data set. All other sequences in the HIV-1 PR 3261 data set differ in at least three positions out of eight from any octamer in the HIV-1 PR 746 or the HIV-1 PR 1625 data sets.
It is also relevant to check how many sequences in the data sets that have identical residues in the most important positions in the octamer: P2-P1-P1'-P2'. About 92% of the octamers in the HIV-1 PR 746 data set are identical, in the four central positions, to sequences in the HIV-1 PR 1625 data set. However, only 7% of the octamers in the HIV-1 PR 746 data set match to sequences in the HIV-1 3261 data set. About 41% of the octamers in the HIV-1 PR 1625 data set are identical to sequences in the HIV-1 PR 746 data set, but only 7% of the octamers in the HIV-1 PR 1625 data set match sequences in the HIV-1 PR 3261 data set. About 1% of the sequences in the HIV-1 PR 3261 data set match to sequences in the HIV-1 PR 746 data set, and 4% match to sequences in the HIV-1 PR 1625 data set.
There is thus very little sequence similarity between the HIV-1 PR 3261 data set and the HIV-1 PR 746 or the HIV-1 PR 1625 data sets, while there is a lot of sequence similarity between the two smaller data sets. The smaller data sets are used to extract cleavage rules, which are then tested on the larger (3261) data set.
HCV protease data
We initially intended to use a HCV NS3 protease data set used previously by other researchers [18-22]. There were, however, several uncertainties and conflicts between this data set and available references on HCV NS3 cleavage [23-30], which made us doubt the quality of this data set. A new HCV NS3 data set was therefore created from scratch from the references. A procedure described by [18] was followed to generate 706 additional non-cleaved decamers by moving a ten residue window over the 4B, 5A and 5B non-structural proteins [31] of the HCV polyprotein [GenBank: AJ238799], excluding the cleavage sites themselves since they were already in the data set. The 4A part was excluded since it is a protease co-factor. The decamers are of the form P6-P5-P4-P3-P2-P1-P1'-P2'-P3'-P4', i.e. the cleavage occurs between position six and seven in the decamer. The final HCV NS3 data set contains 939 decamers, of which 199 are cleaved and 740 are non-cleaved. We denote this data set the HCV NS3 data set.
The HCV NS3 data set is quite different from the data set used in previous rule extraction studies [18-22]: 8% of the decamers that occur in both sets are labeled differently.
Three separate out-of-sample test data sets for HCV were created. The NS3 protease itself [GenBank: NP_803144] was used to generate one test data set with 621 decamers, of which none are in the HCV NS3 data set. This test data set was intended for comparison with reported internal cleavage sites in the NS3 protease [32-34]. Four proteins from the TLR3 pathway were used for another test data set: IκB kinase ε (IKKε) [GenBank: AAC51216]; TRAF family member-associated NF-κB activator-binding kinase 1 (TBK1) [GenBank: NP_037386]; Toll-like receptor 3 (TLR3) [GenBank: NP_003256]; and Toll-IL-1 receptor domain-containing adaptor inducing IFN-β (TRIF or TICAM-1) [GenBank: BAC55579]. These four proteins have been tested for HCV NS3 cleavage by Li et al. [35]. The four proteins produced a total of 2805 decamers, of which one is also in the HCV NS3 data set (the only observed cleavage site in TRIF). A third test data set was made up of 69 in vivo tested NS3 substrates from Kim et al. [36], none of which are in the HCV NS3 data set. The three out-of-sample test sets are denoted NS3 internal, TLR3, and NS3 in vivo, respectively.
Artificial data
Two artificial data sets were created to measure the orthogonal search-based rule extraction (OSRE) method's ability to extract rules of the form that we are looking for. The data sets were designed using conjunctive rules to mimic typical cleavage rules as described in the introduction.
The two artificial rule sets are shown in Table 1. The simpler problem (A) was modeled partly after the Thermolysin [10] specificity, as described for the PeptideCutter tool. Four positions were used instead of the minimum two in order to see how well OSRE could deal with irrelevant information. The more complex problem (B) was modeled after the Thrombin [10] specificity, as described for the PeptideCutter tool. Training peptides from data set A were sampled randomly. Training peptides from data set B were picked with balanced sampling (i.e. achieving a 1:1 ratio of cleaved to non-cleaved sequences). This was done because the random probability for observing a cleaved peptide for data set B is very low.
Table 1.
Artificial data set A (4-mers: P2-P1-P1'-P2') | Artificial data set B (6-mers: P4-P3-P2-P1-P1'-P2') |
P1' ∈ {A,I,L,M,F,V} | P4 ∈ {A,G,I,L,M,F,T,V} |
P2' ∉ {P} | P3 ∈ {A,G,I,L,F,T,V,W} |
P2 ∈ {P} | |
P1 ∈ {R} | |
P1' ∉ {D,E} | |
P2' ∉ {D,E} |
Rules for the two artificial data sets used. occur.
Any amino acid could occupy the two first positions in artificial data set A (the generated peptides were longer than the actual rule). One letter amino acid abbreviations are used. The sign ∈ means "in" and the sign ∉ means "not in". The rules are connected with the Boolean AND operator, which means that all position rules must be true for cleavage to
The power of OSRE – the artificial data
Table 2 lists the OSRE rules extracted for the two artificial protease specificity problems for different sizes of the training data sets. Problem A is easy but problem B is quite a lot trickier. OSRE quickly finds the rule (P2 ∈ {P} AND P1 ∈ {R}) and this simple rule is sufficient to get very high classification accuracy (approximately 99.7% correct) on the data. A huge amount of additional data is then required before the full rule is extracted. This is because a very low fraction of the negative examples are in conflict with this rule. OSRE extracts the almost correct rule when it is presented with a data set with 100,000 examples.
Table 2.
Artificial data set A | Artificial data set B | ||
Data | OSRE rule | Data | OSRE rule |
10 | P2 ∈ {D,E,G,S,W,V} | 102 | P2 ∈ {P} |
P1 ∈ {R} | |||
102 | P1' ∈ {A,I,L,M,F,V} | 103 | P4 ∉ {S} |
P2 ∈ {P} | |||
P1 ∈ {R} | |||
103 | P1' ∈ {A,I,L,M,F,V} | 104 | P4 ∉ {R,N,Q,E,K,S} |
P2' ∉ {P} | P3 ∈ {Q} | ||
P2 ∈ {P} | |||
P1 ∈ {R} | |||
105 | P4 ∈ {A,G,I,L,M,F,T,V} | ||
P3 ∈ {A,G,I,L,F,T,V,W} | |||
P2 ∈ {P} | |||
P1 ∈ {R} | |||
P1' ∉ {D,E} |
Rules extracted with OSRE from the artificial data sets (cf. Table 1). The left column for each data set shows the number of peptides used to extract the rules.
HIV-1 protease
Rule extraction
Rules for the HIV-1 PR 746 and HIV-1 PR 1621 data sets were extracted using OSRE, as described in the Methods section. OSRE produced slightly different numbers of rules for each cross validation (CV) subset, varying between 7 and 10 rules for the 746 peptide data set and between 6 and 9 rules for the 1625 peptide data set. The CV generalization error was estimated for the rules when one, two, three, four, five and all rules were used to predict cleavage for the hold-out CV data set (rules were ordered in priority order by OSRE); this CV error is shown in Figure 1 and Figure 2. The 746 peptide set is a bit more difficult to predict because it is a balanced data set with fewer negative examples than the 1625 peptide set. The CV performance improves until five rules are used (this was one motivation for using five clusters in the rule clustering). The CV prediction accuracies of the OSRE method when using all rules are 87% for the 746 peptide data set and 93% for the 1625 peptide set.
The fidelity is measured by using the rules to label the peptide data sets from which they were generated. The OSRE consensus rules, produced by using spectral clustering on the OSRE rule sets (as described in the Methods section), are listed in Table 3. These consensus rules were used to label the HIV-1 PR peptides in the data sets and the resulting in-sample accuracy (we denote this the fidelity) is shown in Table 4. Table 4 also lists, for reference, the fidelities for rule sets generated using the rough set theory approach [17].
Table 3.
HIV-1 PR 746 peptide set | HIV-1 PR 1625 peptide set | ||
HIVA1 | P3 ∉ {N} | HIVA2 | P4 ∉ {N,C,I} |
P2 ∉ {Q,L,K,S} | P2 ∉ {Q,K,P,S} | ||
P1 ∈ {G,L,M,F,Y} | P1 ∈ {L,M,F,Y} | ||
P1' ∈ {H,I,L,M,F,P,Y,V} | P1' ∉ {R,N,D,C,Q,E,K,S,T} | ||
P2' ∉ {R,N,D,G,H,K,P,S} | P2' ∉ {R,N,D,G,H,K} | ||
P3' ∉ { K } | |||
HIVB1 | P1 ∉ {Q,K,T,V} | HIVB2 | P4 ∉ {N,C,I,K,M,F,W,Y} |
P1' ∈ {L,F,P,Y} | P3 ∉ {N,C,V} | ||
P2' ∉ {R,N,D,G,H,K,S,T,Y} | P2 ∉ {K} | ||
P3' ∉ {N,H,I,L,K,P,S} | P1 ∉ {R,Q,E,K,T} | ||
P4' ∉ {N,Q,I,M,V} | P1' ∉ {R,N,D,C,Q,E,K,S,T} | ||
P2' ∈ {A,C,Q,E,I,L,S,V} | |||
P3' ∈ {A,C,Q,M,F,Y} | |||
HIVC1 | P2 ∉ {Q,L,K,S} | HIVC2 | P4 ∉ {N,C,I,K,M,F,W,Y} |
P1 ∉ {K,T,V} | P3 ∉ {N,V} | ||
P2' ∈ {E} | P2 ∈ {A,C,I,M,F,V} | ||
P3' ∉ {N} | P1 ∉ {R,Q,E,K,P,T} | ||
P1' ∉ {R,N,D,C,Q,E,K,S,T} | |||
P2' ∈ {A,C,Q,I,L,S,V} | |||
P3' ∉ {N,K,P} | |||
HIVD1 | P3 ∉ {N,C,S,V} | HIVD2 | P4 ∉ {C,I,Y} |
P2 ∈ {I,V} | P3 ∉ {N,V} | ||
P1 ∉ {Q,K,P,T,V} | P2 ∈ {A,C,I,M,F,T,V} | ||
P1' ∉ {D,Q,K,S} | P1 ∈ {A,L,M,F,Y} | ||
P2' ∉ {R,N,D,G,K} | P1' ∉ {K} | ||
P3' ∉ {N} | P2' ∈ {A,E,I,L,S,V} | ||
P4' ∉ {A,N,Q,I,M,W,Y,V} | P3' ∉ {N,I,K,P} | ||
HIVE1 | P1 ∈ {L,M,F,Y} | HIVE2 | P2 ∈ {A,C,I,M,F,V} |
P1' ∈ {L,M,F,P,Y} | P1 ∉ {R,Q,E,K} | ||
P4' ∉ {A,N,I,M,Y,V} | P2' ∈ {A,C,Q,E,I,L,S,T,V} | ||
P3' ∈ {A,C,M,F,Y} |
Consensus rules for the HIV-1 PR data sets. The rules are listed in the priority order given by the OSRE method. The text in italics is the label for each rule.
Table 4.
Rules used | Accuracy 746 HIV-1 | Accuracy 1621 HIV-1 |
HIVA1, HIVB1, ..., HIVE1 | 92.9% | 90.0% |
HIVA2, HIVB2, ..., HIVE2 | 92.2% | 94.9% |
[17] Fig. 1a | 63.4% | 85.4% |
[17] Fig. 1b | 75.3% | 65.2% |
[17] Table III | 63.0% | 85.1% |
[17] Table IV | 75.5% | 65.3% |
The fidelity of the rules for the HIV-1 PR data sets (cf. Table 3).
Out-of-sample tests
The OSRE rules' prediction accuracy was tested on the HIV-1 PR 3261 data set. This data set was published after the rules were extracted and there is almost no sequence overlap with the data set used to generate the rules. It therefore constitutes a true out-of-sample test of the rules' ability to predict cleavage for novel sequences. However, the HIV-1 PR 3261 data set has many more non-cleaved octamers than cleaved octamers; the prediction accuracy is almost 87% if all sequences are predicted as non-cleaved (which is not a very useful prediction). The prediction accuracy is therefore not a good quality measure for the performance of the rules, it is better to present the sensitivity (true positive fraction), specificity (true negative fraction), and positive likelihood ratio. These are defined as:
(1) |
(2) |
(3) |
The positive likelihood ratio measures how much better the odds of correctly predicting a cleavage location with the rule set are than predicting randomly according to prevalence.
The sensitivity and specificity are usually shown together in a so-called receiver operator characteristic (ROC) curve. The ROC plots for the OSRE HIV-1 rules are shown in Figure 3 and Figure 4, together with the same values for the rough set theory rules [17], the recently published HIVcleave web-server [37], and a linear support vector machine [38], which was the best predictor we had hitherto tried for this problem. The positive likelihood ratio is shown in Table 5, together with corresponding sensitivity and specificity values.
Table 5.
Rules used | Positive likelihood ratio | Sensitivity | Specificity |
HIVA1 | 11.3 | 31% | 97% |
HIVA1, HIVB1 | 6.2 | 41% | 93% |
HIVA1, HIVB1, HIVC1 | 5.2 | 57% | 89% |
HIVA1, HIVB1, HIVC1, HIVD1 | 4.7 | 61% | 87% |
HIVA1, HIVB1, HIVC1, HIVD1, HIVE1 | 4.7 | 64% | 86% |
HIVA2 | 11.3 | 33% | 97% |
HIVA2, HIVB2 | 6.8 | 40% | 94% |
HIVA2, HIVB2, HIVC2 | 5.6 | 44% | 92% |
HIVA2, HIVB2, HIVC2, HIVD2 | 5.2 | 48% | 91% |
HIVA2, HIVB2, HIVC2, HIVD2, HIVE2 | 4.5 | 49% | 89% |
Kontijevskis et al. [17] Fig. 1a | 2.1 | 3% | 99% |
Kontijevskis et al. [17] Fig. 1b | 1.7 | 87% | 49% |
Kontijevskis et al. [17] Table III | 0.8 | 3% | 97% |
Kontijevskis et al. [17] Table IV | 1.7 | 87% | 49% |
HIVcleave [37] | 45.4 | 2% | 100% |
HIVcleave [37] | 3.0 | 32% | 89% |
HIVcleave [37] | 2.8 | 50% | 82% |
L-SVM [38] | 29.1 | 2% | 100% |
L-SVM [38] | 10.6 | 31% | 97% |
L-SVM [38] | 6.7 | 50% | 93% |
The positive likelihood ratios for different rules and predictors on the HIV-1 3261 data set. The results are shown for the OSRE HIV-1 rules and the Kontijevskis et al. rough set theory rules [17], together with some reference values for the HIVcleave web server [37] and a linear support vector machine [38].
HCV NS3 protease
Rule extraction
The OSRE method was applied to the HCV NS3 data in a corresponding way as for the HIV-1 PR data. The CV out-of-sample error as a function of the number of rules is shown in Figure 5. The accuracy when using all rules is 95%. This is close to the performance of a non-rule-based linear support vector machine classifier with tuned "slack", which has an accuracy of 97%, and it is the hitherto best result using a rule based method for HCV NS3. The consensus rules for HCV NS3 protease are listed in Table 6. Their fidelity, shown in Table 7, is similar to the out of sample accuracy (95%).
Table 6.
HCV NS3 peptide set | |
HCVA | P3 ∉ {K} |
P1 ∈ {C} | |
P1' ∉ {P} | |
HCVB | P6 ∈ {A,N,D,C,E,I,K,M,F,P} |
P5 ∉ {I} | |
P4 ∉ {N,D,Q,G,L,F,P,S,V} | |
P3 ∈ {R,E,G,I,L,T,V} | |
P1 ∉ {R,P,Y} | |
P1' ∈ {A,R,N,D,H,M,S,W,Y} | |
P3' ∉ {Q,E,I,L,M,P,V} | |
HCVC | P4 ∉ {S} |
P3 ∈ {R,E,G,I,F,T,V} | |
P2 ∉ {C,L,S,T} | |
P1 ∉ {R,D,P,Y} | |
P1' ∉ {C,Q,E,G,I,L,K,P,T,V} | |
P2' ∉ {P} | |
P3' ∉ {P} | |
P4' ∉ {A,R,D,E,G,I,P,S,T} | |
HCVD | P6 ∈ {C} |
P4 ∈ {R,C,E,I,M,T,V} | |
P4' ∈ {C,W,Y,V} |
Consensus rules for the HCV NS3 data set (939 peptides). The rules are listed in the priority order given by the OSRE method. The text in italics is the label for each rule.
Table 7.
Rules used | Accuracy |
HCV NS3 | |
HCVA, HCVB, HCVC, HCVD | 94.7% |
The fidelity of the rules for the HCV NS3 data set.
Out-of-sample tests
Applying the HCV rules to the NS3 protease sequence (the NS3 internal test data set) produces 46 predicted internal cleavage sites, which are listed in Table 8. These can be compared to observed internal cleavages for the NS3 protease [32-34].
Table 8.
P1 | Decamer | Rule match | Mass (kDa) | Mass (kDa) |
C16 | RGLLGCIITS | HCVA | 1.7 | 65.6 |
C47 | TFLATCINGV | HCVA | 5.0 | 62.3 |
C52 | CINGVCWTVY | HCVA, HCVC | 5.5 | 61.8 |
Q73 | KGPVIQMYTN | HCVC | 7.7 | 59.6 |
M74 | GPVIQMYTNV | HCVC | 7.8 | 59.4 |
V78 | QMYTNVDQDL | HCVC | 8.3 | 59.0 |
C97 | RSLTPCTCGS | HCVA | 10.3 | 56.9 |
C99 | LTPCTCGSSD | HCVA | 10.5 | 56.7 |
S102 | CTCGSSDLYL | HCVB | 10.7 | 56.5 |
T108 | DLYLVTRHAD | HCVB | 11.4 | 55.8 |
G120 | PVRRRGDSRG | HCVB | 12.8 | 54.4 |
S122 | RRRGDSRGSL | HCVC | 13.0 | 54.2 |
I132 | LSPRPISYLK | HCVC | 14.1 | 53.1 |
G137 | ISYLKGSSGG | HCVB | 14.7 | 52.6 |
C159 | FRAAVCTRGV | HCVA | 16.7 | 50.5 |
V163 | VCTRGVAKAV | HCVC | 17.1 | 50.1 |
K165 | TRGVAKAVDF | HCVC | 17.3 | 49.9 |
Q195 | PPAVPQSFQV | HCVC | 20.6 | 46.6 |
A240 | TLGFGAYMSK | HCVC | 25.1 | 42.2 |
C279 | LADGGCSGGA | HCVA | 29.2 | 38.1 |
C289 | YDIIICDECH | HCVA, HCVC | 30.2 | 37.1 |
C292 | IICDECHSTD | HCVA | 30.5 | 36.7 |
G314 | QAETAGARLV | HCVC | 32.6 | 34.6 |
V339 | PNIEEVALST | HCVB | 35.2 | 32.0 |
L341 | IEEVALSTTG | HCVB | 35.4 | 31.9 |
C368 | RHLIFCHSKK | HCVA, HCVC | 38.3 | 28.9 |
N387 | VALGINAVAY | HCVC | 40.3 | 26.9 |
L395 | AYYRGLDVSV | HCVB, HCVC | 41.2 | 26.0 |
T402 | VSVIPTSGDV | HCVC | 41.9 | 25.3 |
C428 | DSVIDCNTCV | HCVA, HCVC | 44.5 | 22.7 |
C431 | IDCNTCVTQT | HCVA | 44.9 | 22.4 |
S488 | SGMFDSSVLC | HCVC | 51.2 | 16.1 |
C492 | DSSVLCECYD | HCVA | 51.6 | 15.7 |
C494 | SVLCECYDAG | HCVA | 51.8 | 15.4 |
A497 | CECYDAGCAW | HCVD | 52.1 | 15.1 |
C499 | CYDAGCAWYE | HCVA | 52.3 | 14.9 |
V511 | PAETTVRLRA | HCVB | 53.6 | 13.6 |
L513 | ETTVRLRAYM | HCVB, HCVC | 53.9 | 13.3 |
C525 | PGLPVCQDHL | HCVA | 55.2 | 12.0 |
V535 | EFWEGVFTGL | HCVC | 56.5 | 10.8 |
I542 | TGLTHIDAHF | HCVC | 57.2 | 10.0 |
Q552 | LSQTKQSGEN | HCVC | 58.4 | 8.8 |
A562 | FPYLVAYQAT | HCVB | 59.5 | 7.8 |
C568 | YQATVCARAQ | HCVA, HCVC | 60.1 | 7.1 |
C584 | DQMWKCLIRL | HCVA | 62.0 | 5.2 |
C622 | KYIMTCMSAD | HCVA | 66.3 | 1.0 |
The table shows the result of applying the HCV NS3 cleavage rules to the NS3 protease amino acid sequence. Only decamers that are predicted to be cleaved are listed. The first column mentions the amino acid and the position in the NS3 sequence that occupies the P1 position in the decamer. The second column lists the decamer. The third column shows which HCV NS3 rules that the decamer matches to. The fourth and fifth column show the molecular masses of the cleavage products if NS3 is cleaved in that position.
Yang et al. [32] report two internal cleavages of NS3 which are cleaved by the NS3 protease itself together with the NS4A co-factor. These are HLIFCH-SKKK (H369) and VSVIPT-SGDV (T402). The latter of these is predicted by the HCV rules (Table 8). The first one is not predicted to be cleaved but the nearby cleavage RHLIFC-HSKK (C368) is predicted. It is probably also the correct cleavage site (see discussion).
Kou et al. [33] tested the internal cleavages at HLIFCH-SKKK (which we believe is actually RHLIFC-HSKK) and VSVIPT-SGDV for sensitivity to genotype. They found that the latter cleavage had genotype specificity, i.e. NS3 protease from HCV-1b could not cleave NS3 protease from genotype HCV-2a, and vice versa. The HCV rules predict the same; NS3 from genotypes HCV-1a and HCV-1b is predicted to be cleaved at this site by the HCV rules, but not NS3 from genotypes HCV-2a, HCV-2b or HCV-2c.
Hou et al. [34] report an alternative cleavage of NS3/4A, which can be observed when the wild type NS3/4A cleavage site is mutated so that it is not cleaved. They estimate that the alternative cleavage site (an internal NS3 cleavage site) is located about 6 kDa upstream from the wild type cleavage site, based on their observations of two 12 kDa and 67 kDa fragments that appear when the wild type cleavage site is mutated, but they were unable to find the location of the alternative cleavage site although they tested to mutate sites that were 6 kDa upstream from the NS3/4A cleavage site [34]. It is therefore not possible to test whether the HCV rules would match this site. However, we can predict the possible cleavage sites under the assumption that the HCV rules are correct. It is peculiar that Hou et al. report observed masses for the NS3 and the NS3-NS4A complex which are larger (73 kDa and 80 kDa, respectively) than their molecular masses (which are 67 kDa and 73 kDa, respectively). It is possible that the reason is that there was something extra (with mass 6–7 kDa) sitting on the N-terminal side of the cloned NS3+NS4A complex, which was not cleaved off. This means that one should look for NS3 cleavage fragments of about 61 kDa and 6 kDa (since the NS4A has a mass just below 6 kDa). If this is correct then the alternative cleavage site could be in any of four positions (cf. Table 8): C47, C52, C568 or C584.
The HCV rules predict at least two correct cleavage sites (out of 46 potential sites), no false negatives and 575 (out of 621) true negatives for the NS3 internal test data. This indicates an out-of-sample specificity (true negative rate) of 93% and a sensitivity (true positive rate) of 100% (no false negatives) for the HCV rules.
The HCV rules produce 196 possible cleavage sites on the TLR3 test data set (2805 decamers). Of these, only one is a verified cleavage site, PSSTP-CSAHL in TICAM-1 [35], which matches a decamer in the HCV NS3 data set. None of the remaining 2804 decamers match any decamers in the HCV NS3 data set. Thus, there are (at most) 195 false positives and 2609 true negatives. This indicates an out-of-sample specificity of 2609/(195 + 2609) = 93%, which agrees with the estimate from the NS3 internal test data.
Kim et al. [36] performed an in vivo determination of HCV NS3 substrate specificity, using a genetic assay that produces random sequences based on the NS4B/5A cleavage site in HCV. They list 69 decamers that are especially good substrates for NS3 in vivo. Table 9 shows these 69 decamers (plus a consensus sequence suggested by Kim et al.) and how they match the HCV rules. Only one of the 69 decamers does not match any HCV rule. Most of the other decamers match the rule HCVA and many match more than one rule. This indicates an out-of-sample sensitivity of 68/(68 + 1) = 99%, which agrees well with the estimate from the NS3 internal test data.
Table 9.
Decamer | Rule match | Decamer | Rule match |
DCYVYCSGSW | HCVA, HCVB, HCVC | DCSQPCAGSW | HCVA |
DCAVTCSGSW | HCVA, HCVB | DCIIVCAGSW | HCVA, HCVB, HCVC |
DCAVRCSGSW | HCVA, HCVB, HCVC | DCQQLCAGSW | HCVA |
DCIKCCSGSW | - | DCPSPCAGSW | HCVA |
DCVSNCSGSW | HCVA | DCAILCAGSW | HCVA, HCVB |
DCVMKCSGSW | HCVA | DCIMPCAGSW | HCVA |
DCATTCSGSW | HCVA, HCVB | DCVRMCAGSW | HCVA, HCVC |
DCTQMCSGSW | HCVA | DCSVLCAGSW | HCVA |
DCLDLCSGSW | HCVA | DCYRPCAGSW | HCVA, HCVB, HCVC |
DCVTPCSGSW | HCVA, HCVC | DCLCLCAGSW | HCVA |
DCVSLCSGSW | HCVA | DCAVCCAGSW | HCVA, HCVB |
DCPLACSGSW | HCVA | DCPIMCAGSW | HCVA, HCVC |
DCMCDCSGSW | HCVA | DCHQMCAGSW | HCVA |
DCVLRCSGSW | HCVA | DCWTPCAGSW | HCVA, HCVB, HCVC |
DCAVTCSGSW | HCVA, HCVB | DCIIKCCGSW | HCVA |
DCSVRCSGSW | HCVA | DCIMECCGSW | HCVA |
DCSHPCSGSW | HCVA | DCSQLCCGSW | HCVA |
DCCVRCSGSW | HCVA, HCVB, HCVC | DCTVACTGSW | HCVA |
DCSVLCSGSW | HCVA | DCLVACTGSW | HCVA |
DCRVRCSGSW | HCVA, HCVB, HCVC | DCPCPCTGSW | HCVA |
DCVTPCSGSW | HCVA, HCVC | DCLELCTGSW | HCVA |
DCIYICSGSW | HCVA | DCLVACTGSW | HCVA |
DCRLPCSGSW | HCVA, HCVB | DCQIICTGSW | HCVA |
DCYERCSGSW | HCVA, HCVB, HCVC | DCVVCCGGSW | HCVA |
DCAVLCSGSW | HCVA, HCVB | DCTVTTSGSW | HCVB |
DCVRLCSGSW | HCVA | DCTVETSGSW | HCVB, HCVC |
DCPTNCSGSW | HCVA, HCVC | DCSVVCSSSW | HCVA |
DCPRLCSGSW | HCVA | DCSVACSQSW | HCVA |
DCVSNCSGSW | HCVA | DCSTLCSTSW | HCVA |
DCRIPCSGSW | HCVA, HCVB, HCVC | DCSITCAQSW | HCVA |
DCVSNCSGSW | HCVA | DCSVLCARSW | HCVA |
DCFAMCSGSW | HCVA | DCSVPCTGSW | HCVA |
DCTIKCAGSW | HCVA, HCVB, HCVC | DCSLPCGSSW | HCVA |
DCLITCAGSW | HCVA | DCSAPCGSSW | HCVA |
DCKVTCAGSW | HCVA, HCVB | DCVVPCSGSW | HCVA, HCVC (*) |
The table shows the result of applying the HCV NS3 cleavage rules to the 69 decamers listed as especially good substrates in vivo by Kim et al. [36]. The first and third column show the decamers and the second and fourth column show the HCV rules that it matches (if any). The last (70th) sequence (DCVVPCSGSW), marked with (*), is the consensus sequence tested by Kim et al., this sequence is already in the HCV NS3 data set and is thus not an out-of-sample prediction.
If the estimates of sensitivity and specificity are correct, then the positive likelihood ratio for the OSRE HCV rules would be 14.
Discussion
The artificial data
The results on the two artificial problems illustrate two things. First, OSRE quickly finds rules that are very close to the generating rules in terms of classification accuracy. Secondly, where an identified rule is not completely correct but the correction refers to rare events, then many examples are needed before the rare cases are represented in sufficient number to affect the performance ranking of the rule set and hence generate the required update.
The difficulty with artificial data set B lies in the very low probability for observing false positives once the rule (P2 ∈ {P} AND P1 ∈ {R}) has been learned. For example, the training data set with 100,000 observations has 50,002 positive examples and 49,998 negative examples. All the positive examples (of course) match the rule but only 97 of the negative examples (0.2% of the negative examples) match the rule and are thus counterexamples that point to that the rule must be modified.
In summary, OSRE finds a very good rule already with 100 samples and is able to find the corrections to this rule from a mere additional 97 false positives in the case of data set B. This must be considered a very good performance.
HIV-1 PR data
Table 4 shows that the OSRE rules have a higher fidelity to the HIV-1 PR data sets than rules extracted using rough set theory [17], i.e. the OSRE rules are a more faithful description of the data used to generate them. The out-of-sample test, Figure 3, shows that the OSRE rules have excellent predictive power, matching that of the hitherto state-of-the-art predictor, the linear support vector machine [38], which is not a rule based method. This is a positive surprise since there is usually a trade-off between accuracy and comprehensibility. Other previously suggested predictors (rule based or not) have significantly worse accuracy than the OSRE rules and the linear support vector machine. These include a recently published HIV cleave web-server [37] and the rough set theory rules [17]. The rough set theory can produce better results if many more (hundreds of) rules are used but the comprehensibility is then definitely sacrificed [17]. The OSRE rules for HIV-1 protease are, on the other hand, both compact and accurate.
The rough set theory approach uses several physicochemical properties for the amino acids and one might believe that this should produce a better model than just using letter codes. The results here, however, indicate that this is not at all the case; conjunctive rules with simple amino acid letter codes are shorter, better predictors, and have higher fidelity to the data. Also, the OSRE method is fast; it will produce a small (comprehensible) rule set for the data in this study in the matter of minutes (the timing is discussed in the Methods section).
The OSRE rules for positions P1 and P1' are the most consistent rules and agree with earlier findings [15,39,40] for HIV-1 PR: large hydrophobic residues are preferred in position P1 and hydrophobic residues are preferred in position P1'. The OSRE consensus rules and the rough set rules do not agree completely. The rough set rules for cleaved peptides have, e.g., P3 ∈ {C} which does not occur very often in the OSRE consensus rules. Another difference is P1' ∈ {D,C,K} for the rough set rules but not for the OSRE rules. We cannot with certainty say whether the OSRE or the rough set rules are more correct, but the fact that the OSRE rules have a higher fidelity to the data sets indicates that the OSRE rules would be more correct.
The HIV-1 protease cleavage rules based on the 1625 data set, listed in in Table 3, require at least four residues to be specified for cleavage. This is different from Kontijevskis et al. [17], who find that three specified residues are sufficient. However, this is probably not a significant difference (the HIV-1 protease cleavage rules for the 746 data set contain one rule, HIVE1, where three positions are sufficient).
It is difficult, and hardly worthwhile, to compare the OSRE results for the HIV-1 PR data with other earlier rule extraction approaches, e.g. [18,20,41,42], except the rough set rule method. Earlier results have been based on a much smaller data set (362 peptides) and tend to refer to single amino acids in single positions and not groups of amino acids.
The two rules that agree most for the HIV-1 PR 746 and HIV-1 PR 1625 data sets are HIVA1 and HIVA2; 76% of the octamers that match HIVA1 also match HIVA2 and 69% of the octamers that match HIVA2 also match HIVA1, see Table 10. It is therefore reasonable to create a new consensus sequence by building a joint rule from HIVA1 and HIVA2, i.e. a rule that describes all octamers that match both HIVA1 and HIVA2. This consensus sequence is X-X- [ARNDCEGHIMFTWYV]- [LMFY]- [HILMFPYV]- [ACQEILMFTWYV]-X-X, where X denotes any amino acid. The sensitivity, specificity and positive likelihood ratio for this consensus sequence, when evaluated on the HIV-1 3261 PR data set, are 28%, 99% and 24, respectively. No other predictor is as accurate as this at this sensitivity level (cf. Table 5). It is notable that the consensus sequence only considers positions P2-P1-P1'-P2', i.e. not the full eight residue sequence.
Table 10.
HIV.. | ..A1 | ..B1 | ..C1 | ..D1 | ..E1 | ..A2 | ..B2 | ..C2 | ..D2 | ..E2 |
HIVA1 | 4.56% | 1.90% | 0.36% | 0.27% | 1.60% | 3.45% | 0.43% | 0.41% | 0.43% | 0.29% |
HIVB1 | 4.29% | 0.31% | 0.19% | 0.93% | 0.73% | 0.53% | 0.31% | 0.16% | 0.28% | |
HIVC1 | 3.23% | 0.14% | 0.12% | 0.29% | 0.24% | 0.25% | 0.25% | 0.26% | ||
HIVD1 | 2.05% | 0.17% | 0.28% | 0.12% | 0.31% | 0.20% | 0.22% | |||
HIVE1 | 3.50% | 1.58% | 0.20% | 0.17% | 0.22% | 0.12% | ||||
HIVA2 | 4.97% | 0.54% | 0.53% | 0.67% | 0.32% | |||||
HIVB2 | 2.56% | 0.62% | 0.22% | 0.63% | ||||||
HIVC2 | 1.86% | 0.53% | 0.55% | |||||||
HIVD2 | 1.53% | 0.41% | ||||||||
HIVE2 | 2.70% |
There are 208 possible octamers.
The fractions of these octamers that match to each OSRE HIV-1 rule, and to more than one rule, are shown here. The diagonal elements in the table show the fraction of the possible octamers that match one rule. The off-diagonal elements in the table show what fraction of the octamers that match both rules.
HCV NS3 data
A cysteine (C) in position P1 is considered to be the most important determinant for cleavage by HCV NS3/4A [31]. This is only reflected in the rule HCVA; the other rules are much less specific about the P1 position. Urbani et al. [27] conclude that the specificity is quite broad unless for the requirement for a small hydrophobic residue (e.g. C or T) in position P1. The OSRE HCV rule set indicates that the cysteine in position P1 is the dominating cleavage process: 10% of all decamers match the rule P1 ∈ {T,C}, 9% of all decamers match the rule HCVA and about 11% of all decamers match any of the OSRE HCV rules. This means that roughly 80% of the decamers that are cleaved by HCV NS3 (as predicted by the OSRE rules) match the rule HCVA.
The standard HCV NS3/4A cleavage rule [23,31,43] (P6 ∈ {E,D} AND P1 ∈ {T,C} AND P1' ∈ {A,S}) is much more specific than the OSRE HCV rules; only 0.1% of all possible decamers match this standard rule. Kim et al. [36] suggest an even more restrictive cleavage rule (consensus sequence) based on their in vivo studies: (P6 ∈ {E,D} AND P4 ∈ {V} AND P3 ∈ {L,P} AND P1 ∈ {C} AND P1' ∈ {A,S}). Only one in eight million decamers match this consensus sequence. None of the decamers studied by Kim et al. [36], except the one they handcraft to fit, match their consensus sequence.
None of the OSRE HCV rules support P6 ∈ {E,D} and P1' ∈ {A,S}. Thus, it cannot be said that the OSRE HCV rules are in strong agreement with previously reported consensus sequences. However, the OSRE HCV rules are more accurate at predicting the cleavage sites. They are dominated by HCVA and HCVC, which together match about 10% of all decamers, see Table 11. The two rules have quite a small overlap, 7% of the decamers that match HCVA also match HCVC, and 6% of the decamers that match HCVC also match HCVA (cf. Table 11). This indicates that there are two different processes taking place, one described by HCVA and the other described by HCVB and HCVC, which have a higher overlap. The contribution from HCVD is quite small and it can probably be ignored.
Table 11.
HCVA | HCVB | HCVC | HCVD | |
HCVA | 4.51% | 0.16% | 0.33% | 0.02% |
HCVB | 2.27% | 0.77% | 0.02% | |
HCVC | 5.28% | 0.04% | ||
HCVD | 0.35% |
There are 2010 possible decamers. The fractions of these decamers that match to each OSRE HCV NS3 rule, and to more than one rule, are shown here. The diagonal elements in the table show the fraction of the possible octamers that match one rule. The off-diagonal elements in the table show what fraction of the decamers that match both rules.
The positive likelihood ratio for the OSRE HCV rules is estimated to be 0.99/(1-0.93) ≈ 14, which is the same as for the HIV-1 PR consensus sequence, but the sensitivity is much higher for HCV NS3/4A than for HIV-1 PR.
It was mentioned in the result section that we believe the internal cleavage of NS3 to be at C368 (RHLIFC-HSKK) and not at H369 (HLIFCH-SKKK) that Yang et al. [32] report. There are two reasons for this. One is that the HCV rules match one site and not the other. The other reason is that the experimental data presented by Yang et al. fits also with a cleavage at C368. They [32] test their predicted location of the cleavage site by several pairwise mutations in the HCV polyprotein: ...RHLIFCHSKKKC... → ...RHLPGCHSKKKC... (pNS34A-M4); ...RHLIFCPGKKKC... (pNS34A-M5); ...RHLIFCHSKPGC... (pNS34A-M6); and ...RHLIFCHSKKPG... (pNS34A-M7). Yang et al. note that the internal cleavage is blocked when the mutation is pNS34A-M5, whereas the internal cleavage occurs in all other cases. These observations are perfectly consistent with cleavage at C368 (between C and H) if the rule HCVA is correct.
It is not worthwhile to compare the OSRE HCV rules to previous automated rule extraction (and prediction) work on HCV NS3 protease since these have been based on a data set with many errors in it.
The HCV NS3 problem seems to be easier than the HIV-1 PR problem. Fewer rules are required and the OSRE NS3 rules are more accurate than the OSRE HIV-1 PR rules.
Conclusion
A methodology that combines the OSRE rule extraction method and spectral clustering was introduced for efficiently, i.e. quickly and accurately, extracting accurate and comprehensible specificity rules for proteases, rules that also had a high fidelity to the data used to create them. The approach was demonstrated on two medically important protease cases, HIV-1 protease and HCV NS3/4A protease. The HIV-1 protease rules were shown to be more accurate than previous state-of-the-art rule extraction results on a large HIV-1 protease out-of-sample test data set. The proposed methodology achieved this performance using fewer rules than previous approaches and with a higher fidelity to the data set that had been used to create the rules.
The HCV NS3/4A protease rules were shown to fit very well with experimental findings for the HCV NS3 protease. The rules were used to correct the position of an internal cleavage site in HCV NS3, which demonstrates the usefulness of accurate and comprehensible rules when interpreting experimental data.
The results indicate that the HCV NS3/4A protease cleavage is a simpler problem than the HIV-1 protease cleavage.
In summary, the OSRE approach yields rules that are simpler and more accurate than other rule extraction methods for protease specificity problems, and it is significantly faster. It does so by using conjunctive low-order rules, i.e. rules with few arguments and in a form that is commonly used to describe protease specificities.
Methods
Orthogonal search-based rule extraction (OSRE)
The OSRE algorithm [44] finds conjunctive rules for classifications from any classifier that produces a smooth response surface. It is an efficient method to find low-order rules from labeled data. Given a data set with an associated classification label, the method starts by fitting the response surface that best classifies the data. This is a smooth surface in data space which separates it into different regions by means of generic, non-linear decision boundaries. This fit to the data must be obtained by a robust statistical methodology, to ensure good out-of-sample generalization [44]. This step is essential in order to obtain a smooth fit of the data, cutting through noise and thus avoiding over-fitting. In the first stage of the rule-extraction algorithm, rules are fitted to the response surface of this statistical model, rather than to the data themselves. For this reason, the non-linear model is a multi-layer perceptron neural network with strong regularization using the Bayesian framework of Automatic Relevance Determination [45]. In cases where the data are linearly separable, this heavy robust regularization will default to a linear decision boundary. Cycling through the data, this stage results in the maximal multivariate box, centered on each data point, which is to one side of the decision boundary, returning an initial number of rules equal to the number of data points.
The second stage in the application of OSRE, is to sort the rules by their performance on the data set, measuring performance by the proportion of actual data points within each multivariate box, i.e. conjunctive rule, which belongs in the correct class. This process includes removing boxes within boxes and selecting the rules with the best balance between coverage, i.e. true detection rate, and specificity, i.e. a low false detection rate. This is achieved by starting with the individual rule whose performance in the ROC plot is closest to the ideal point with unit sensitivity and specificity, then adding more rules in a stepwise manner, each time selecting the best additional rule by measuring the position of the aggregate rule set in the ROC plot. The automatic forward selection of the minimal rule set that best approximates the ideal performance of unit sensitivity and specificity is a development of the OSRE methodology since the publication of the original paper.
By not imposing a requirement of mutual exclusivity between individual rules and, instead, searching directly in a multivariate space, rather than in a sequential univariate manner as with most rule extraction methods, it is found that the well performing rules are of low-order. This means that the rules are more readily interpretable by expert users, as they involve fewer arguments. The representation framework using multivariate conjunctive rules at each node in the search hierarchy, rather than univariate nodes in a decision tree, is a particular feature of the OSRE methodology which makes it more suitable to the derivation of rule sets that are interpretable by human experts. In effect, it represents a trade-off between the simplicity in the definition of each node in the tree, which is now a multivariate vector rather than a scalar node, for simplicity of the rule set as whole, with fewer and simpler rules.
The method was originally validated on artificial data [44] and has since been successfully applied in a number of practical applications, e.g. [46].
Creating the decision surface is what tends to take the longest time. This is, however, sometimes available from a previous study. The rule extraction phases (first and second stages) scale, in the current implementation, approximately quadratically with the number of observations. Running OSRE on HIV-1 data sets of different sizes, excluding the decision surface creation, takes approximately 0.5 seconds for 100 observations, 12 seconds for 1600 observations, and 60 seconds for 3200 observations (on an AMD Athlon 64 Processor 2 GHz 3 GB RAM with Windows XP SP3). These are indicative times and individual runs may vary a lot.
Model validation
The OSRE method was applied using ten-fold cross-validation (CV) on the data sets. This means that the OSRE method extracted rules using 90% of the data and the rules were then tested on the remaining 10%, which thus was a hold-out sample. This was repeated ten times so that each peptide in the data set had been used out-of-sample once. The average out-of-sample accuracy over these ten runs was then used as the expected generalization performance of the OSRE method.
The performance obtained by OSRE on the cleavage data was obtained by the application of the standard methodology as outlined above, rather than by selecting the best performance from a range of possible methodologies, which may be biased towards the suitability of the chosen method to the particular nuances of the data sets under study. Therefore, it is expected that this performance benefit represents a generic feature of the proposed methodology, which will generalize to other cleavage data.
Rule clustering
OSRE produced slightly different rules for each CV subset. The rules were therefore grouped using the normalized cut spectral clustering algorithm [47]. The affinity matrix W used had elements
where dij was the fraction of peptides that the two rules disagreed on when evaluated on the data set that the rules were extracted for (i.e. the 1625 HIV-1 PR data set, the 746 HIV-1 PR data set or the 939 HCV NS3 data set). The number of clusters was set to five, which was a subjective choice based on the OSRE results and the quality of the clusterings measured with the Fowlkes-Mallows index [48]. The iterative clustering method gave slightly different results each time it was run. It was therefore repeated five times and the rules that clustered together all five times were grouped together. The most common amino acids in each position within each group then defined the consensus rule for that group. Two of the five clusters mixed very much for the HCV NS3 data. These two clusters where joined and there are therefore only four consensus rules for the HCV NS3 data although the clustering was done with five clusters.
The fidelity of the consensus rules was tested on the data after the clustering had been done. The consensus rules (and the number of clusters) were not changed once the rules had been tested on the data.
List of abbreviations
A Ala: Alanine; AIDS: Acquired Immune Deficiency Syndrome; CTL: Cytotoxic T Lymphocyte; C Cys: Cysteine; CV: Cross Validation; D Asp: Aspartate; E Glu: Glutamate; F Phe: Phenylalanine; G Gly: Glycine; H His: Histidine; HCV: Hepatitis C Virus; HIV-1: Human Immunodeficiency Virus type 1; IKKε: IκB kinase ε; I Ile: Isoleucine; K Lys: Lysine; L Leu: Leucine; M Met: Methionine; N Asn: Asparagine; NS3: Non Structural Protein 3; NS4A: Non Structural Protein 4A; OSRE: Orthogonal Search-based Rule Extraction; P Pro: Proline; Q Gln: Glutamine; R Arg: Arginine; ROC: receiver operator characteristic; S Ser: Serine; T Thr: Threonine; TBK1: TRAF family member-associated NF-κB activator-binding kinase 1; TLR: Toll-like receptor 3; TRIF TICAM1: Toll-IL-1 receptor domain-containing adaptor inducing IFN-β; V Val: Valine; W Trp: Tryptophan; Y Tyr: Tyrosine.
Authors' contributions
TR designed the study, did the rule clustering work, some analysis and drafted the manuscript. TAE implemented the OSRE methodology. LY collected the data, produced some of the results (rough set rules test and state-of-the-art classifiers) and did some of the analysis. DG contributed to the collection of data and interpretation of the results. TAE, IJH and PJGL converted the data into a form compatible with the OSRE methodology and extracted the OSRE rules from the 10 fold cross validation data sets. All authors contributed to the writing and revised the manuscript for intellectual content and approved the final version of the manuscript.
Acknowledgments
Acknowledgements
We thank Dr. Aleksejs Kontijevskis for sharing the 1625 HIV-1 protease peptide set with us. LY was funded within the National Research School in Genomics and Bioinformatics hosted by the University of Gothenburg, Sweden. DG acknowledges funding from the Lars Hierta Memorial foundation, the Stockholm County Council, and Karolinska Institutet. TAE, PJGL and IHJ acknowledge funding from the Biopattern Network of Excellence FP6/2002/IST/1; N° IST-2002-508803 http://www.biopattern.org.
Contributor Information
Thorsteinn Rögnvaldsson, Email: denni@ide.hh.se.
Terence A Etchells, Email: T.A.Etchells@ljmu.ac.uk.
Liwen You, Email: liwenyou@gmail.com.
Daniel Garwicz, Email: Daniel.Garwicz@ki.se.
Ian Jarman, Email: I.H.Jarman@ljmu.ac.uk.
Paulo JG Lisboa, Email: P.J.Lisboa@ljmu.ac.uk.
References
- Overall CM, Blobel CP. In search of partners: linking extracellular proteases to substrates. Nature Reviews: Molecular Cell Biology. 2007;8:245–257. doi: 10.1038/nrm2120. [DOI] [PubMed] [Google Scholar]
- Neurath H, Walsh KA. Role of proteolytic enzymes in biological regulation (a review) Proceedings of the National Academy of Sciences of the United States of America. 1976;73:3825–3832. doi: 10.1073/pnas.73.11.3825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stoller JK, Aboussouan LS. α1-antitrypsin deficiency. Lancet. 2005;365:2225–2236. doi: 10.1016/S0140-6736(05)66781-5. [DOI] [PubMed] [Google Scholar]
- Turk BE. Targeting proteases: successes, failures and future prospects. Nature Reviews: Drug Discovery. 2006;5:785–799. doi: 10.1038/nrd2092. [DOI] [PubMed] [Google Scholar]
- Malcolm BA, Liu R, Lahser F, Agrawal S, Belanger B, Butkiewicz N, Chase R, Gheyas F, Hart A, Hesk D, Ingravallo P, Jiang C, Kong R, Lu J, Pichardo J, Prongay A, Skelton A, Tong X, Venkatraman S, Xia E, Girijavallabhan V, Njoroge FG. SCH 50 a Mechanism-Based Inhibitor of Hepatitis C Virus NS3 Protease, Suppresses Polyprotein Maturation and Enhances the Antiviral Activity of Alpha Interferon in Replicon Cells. Antimicrobial agents and chemotherapy. 3034;50:1013–1020. doi: 10.1128/AAC.50.3.1013-1020.2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stauber RE, Stadlbauer V. Novel approaches for therapy of chronic hepatitis C (Review) The Journal of Clinical Virology. 2006;36:87–94. doi: 10.1016/j.jcv.2006.02.003. [DOI] [PubMed] [Google Scholar]
- Rönn R, Sandström A. New developments in the discovery of agents to treat hepatitis C. Current Topics in Medicinal Chemistry. 2008;8:533–562. doi: 10.2174/156802608783955647. [DOI] [PubMed] [Google Scholar]
- Schilling O, Overall CM. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nature Biotechnology. 2008;26:685–694. doi: 10.1038/nbt1408. [DOI] [PubMed] [Google Scholar]
- Howson C. Logic with Trees: An Introduction to Symbolic Knowledge. Oxford, UK: Routledge; 1997. [Google Scholar]
- Keil B. Specificity of proteolysis. Berlin – Heidelberg – New York: Springer-Verlag; 1992. [Google Scholar]
- Schechter I, Berger A. On the size of the active site in proteases. Biochemical Biophysics Research Communications. 1967;27:157–162. doi: 10.1016/S0006-291X(67)80055-X. [DOI] [PubMed] [Google Scholar]
- Best SM. Viral Subversion of Apoptotic Enzymes: Escape from Death Row. Annual Review of Microbiology. 2008;62:171–192. doi: 10.1146/annurev.micro.62.081307.163009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thornberry NA, Rano TA, Peterson EP, Rasper DM, Timkey T, Garcia-Calvo M, Houtzager VM, Nordstrom PA, Roy S, Vaillancourt JP, Chapman KT, Nicholson DW. A combinatorial approach defines specificities of members of the caspase family and Granzyme B. Journal of Biological Chemistry. 1997;272:17907–17911. doi: 10.1074/jbc.272.29.17907. [DOI] [PubMed] [Google Scholar]
- Fujikawa K, Titani K, Davie EW. Activation of bovine factor X (Stuart factor): conversion of factor Xa alpha to factor Xa beta. Proceedings of the National Academy of Sciences of the United States of America. 1975;72:3359–3363. doi: 10.1073/pnas.72.9.3359. [DOI] [PMC free article] [PubMed] [Google Scholar]
- You L, Garwicz D, Rögnvaldsson T. Comprehensive bioinformatic analysis of the specificity of human immun-odeficiency virus type 1 protease. Journal of Virology. 2005;79:12477–12486. doi: 10.1128/JVI.79.19.12477-12486.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Poorman RA, Tomasselli AG, Heinrikson RL, Kézdy FJ. A cumulative specificity model for protease from human immunodeficiency virus types 1 and 2, inferred from statistical analysis of an extended substrate data base. The Journal of Biological Chemistry. 1991;266:14554–14561. [PubMed] [Google Scholar]
- Kontijevskis A, Wikberg JES, Komorowski J. Computational Proteomics Analysis of HIV-1 Protease Interac-tome. Proteins: Structure, Function, and Bioinformatics. 2007;68:305–312. doi: 10.1002/prot.21415. [DOI] [PubMed] [Google Scholar]
- Narayanan A, Wu X, Yang Z. Mining Viral Protease Data to Extract Cleavage Knowledge. Bioinformatics. 2002;18:S5–S13. doi: 10.1093/bioinformatics/18.suppl_1.s5. [DOI] [PubMed] [Google Scholar]
- Berry E, Yang ZR, Wu X. A biology inspired neural learning algorithm for analysing protein sequences. Proc of 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI03) 2003. pp. 1–8.
- Yang ZR, Thomson R, Hodgman TC, Dry J, Doyle AK, Narayanan A, Wu XK. Searching for discrimination rules in protease proteolytic cleavage activity using genetic programming with a min-max scoring function. Biosystems. 2003;72:159–176. doi: 10.1016/S0303-2647(03)00141-2. [DOI] [PubMed] [Google Scholar]
- Yang ZR. Predicting Hepatitis C Virus Protease Cleavage Sites Using Generalized Linear Indicator Regression Models. IEEE Transactions on Biomedical Engineering. 2006;53:2119–2123. doi: 10.1109/TBME.2006.881779. [DOI] [PubMed] [Google Scholar]
- Yang ZR. A Probabilistic Peptide Machine for Predicting Hepatitis C Virus Protease Cleavage Sites. IEEE Transactions on Information Technology in Biomedicine. 2007;11:593–595. doi: 10.1109/TITB.2006.889314. [DOI] [PubMed] [Google Scholar]
- Grakoui A, McCourt DW, Wychowski C, Feinstone SM, Rice CM. Characterization of the hepatitis C virus-encoded serine proteinase: determination of proteinase-dependent polyprotein cleavage sites. Journal of Virology. 1993;67:2832–2843. doi: 10.1128/jvi.67.5.2832-2843.1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leinbach SS, Bhat RA, Xia SM, Hum WT, Stauffer B, Davis AR, Hung PP, Mizutani S. Substrate specificity of the NS3 serine proteinase of hepatitis C virus as determined by mutagenesis at the NS3/NS4A junction. Virology. 1994;204:163–169. doi: 10.1006/viro.1994.1520. [DOI] [PubMed] [Google Scholar]
- Kolykhalov AA, Agapov EV, Rice CM. Specificity of the hepatitis C virus NS3 serine protease: effects of substitutions at the 3/4A, 4A/4B, 4B/5A, and 5A/5B cleavage sites on polyprotein processing. Journal of Virology. 1994;68:7525–7533. doi: 10.1128/jvi.68.11.7525-7533.1994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bartenschlager R, Ahlborn-Laake L, Yasargil K, Mous J, Jacobsen H. Substrate determinants for cleavage in cis and in trans by the hepatitis C virus NS3 proteinase. Journal of Virology. 1995;69:198–205. doi: 10.1128/jvi.69.1.198-205.1995. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Urbani A, Bianchi E, Narjes F, Tramontano A, Francesco RD, Steinkühler C, Pessi A. Substrate specificity of the hepatitis C virus serine protease (NS3) The Journal of Biological Chemistry. 1997;272:9204–9209. doi: 10.1074/jbc.272.14.9204. [DOI] [PubMed] [Google Scholar]
- Zhang R, Durkin J, Windsor WT, McNemar C, Ramanathan L, Le HV. Probing the substrate specificity of hepatitis C virus NS3 serine protease by using synthetic peptides. Journal of Virology. 1997;71:6208–6213. doi: 10.1128/jvi.71.8.6208-6213.1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kwong AD, Kim JL, Rao G, Lipovsek D, Raybuck SA. Hepatitis C virus NS3/4A protease. Antiviral Research. 1998;40:1–18. doi: 10.1016/S0166-3542(98)00043-6. [DOI] [PubMed] [Google Scholar]
- Attwood MR, Bennett JM, Campbell AD, Canning GGM, Carr MG, Conway E, Dunsdon RM, Greening JR, Jones PS, Kay PB, Handa BK, Hurst DN, Jennings NS, Jordan S, Keech E, O'Brien MA, Overton HA, King-Underwood J, Raynham TM, Stenson KP, Wilkinson CS, Wilkinson TCI, Wilson FX. The design and synthesis of potent inhibitors of hepatitis C virus NS3-4A proteinase. Antiviral Chemistry & Chemotherapy. 1999;10:259–273. doi: 10.1177/095632029901000505. [DOI] [PubMed] [Google Scholar]
- Bartenschlager R. The NS3/4A proteinase of the hepatitis C virus: unraveling structure and function of an unusual enzyme and a prime target for antiviral therapy. Journal of Viral Hepatology. 1999;6:165–181. doi: 10.1046/j.1365-2893.1999.00152.x. [DOI] [PubMed] [Google Scholar]
- Yang SH, Lee CG, Song MK, Sung YC. Internal Cleavage of Hepatitis C Virus NS3 Protein is Dependent on the Activity of NS34A Protease. Virology. 2000;268:132–140. doi: 10.1006/viro.1999.0168. [DOI] [PubMed] [Google Scholar]
- Kou YH, Chang MF, Wang YM, Hung TM, Chang SC. Differential Requirements of NS4A for Internal NS3 Cleavage and Polyprotein Processing of Hepatitis C Virus. Journal of Virology. 2007;81:7999–8008. doi: 10.1128/JVI.00348-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hou X, Yang W, Zhao Y, Agarwal A, Huang M. Internal cleavages of hepatitis C virus NS3 induced by P1 mutations at the NS3/4A cleavage site. Virology. 2009;383:271–278. doi: 10.1016/j.virol.2008.10.037. [DOI] [PubMed] [Google Scholar]
- Li K, Foy E, Ferreon JC, Nakamura M, Ferreon ACM, Ikeda M, Ray SC, Gale M, Lemon SM. Immune evasion by hepatitis C virus NS3/4A protease-mediated cleavage of the Toll-like receptor 3 adaptor protein TRIF. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:2992–2997. doi: 10.1073/pnas.0408824102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim SY, Park KW, Lee YJ, Back SH, Goo JH, Park OK, Jang SK, Park WJ. In Vivo Determination of Substrate Specificity of Hepatitis C Virus NS3 Protease: Genetic Assay for Site-Specific Proteolysis. Analytical Biochemistry. 2000;284:42–48. doi: 10.1006/abio.2000.4662. [DOI] [PubMed] [Google Scholar]
- Shen HB, Chou KC. HIVcleave: a web-server for predicting human immunodeficiency virus protease cleavage sites in proteins. Analytical Biochemistry. 2008;375:388–390. doi: 10.1016/j.ab.2008.01.012. [DOI] [PubMed] [Google Scholar]
- Rögnvaldsson T, You L, Garwicz D. Bioinformatic approaches for modeling the substrate specificity of HIV-1 protease: an overview. Expert Reviews of Molecular Diagnostics. 2007;7:435–451. doi: 10.1586/14737159.7.4.435. [DOI] [PubMed] [Google Scholar]
- Pettit SC, Simsic J, Loeb DD, Everitt L, Hutchison CA, Swanstrom R. Analysis of Retroviral Protease Cleavage Sites Reveals Two Types of Cleavage Sites and the Structural Requirements of the P1 Amino Acid. The Journal of Biological Chemistry. 1991;266:14539–14547. [PubMed] [Google Scholar]
- Eizert H, Bander P, Bagossi P, Sperka T, Miklóssy G, Boross P, Weber IT, Tözsér J. Amino Acid Preferences of Retroviral Proteases for Amino-Terminal Positions in a Type 1 Cleavage Site. Journal of Virology. 2008;82:10111–10117. doi: 10.1128/JVI.00418-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- You L, Rögnvaldsson T. Why Neural Networks Should Not Be Used for HIV-1 Protease Cleavage Site Prediction. Bioinformatics. 2004;20:1702–1709. doi: 10.1093/bioinformatics/bth144. [DOI] [PubMed] [Google Scholar]
- Yang ZR, Dalby A, Qiu J. Mining HIV Protease Cleavage Data Using Genetic Programming with a Sum-Product Function. Bioinformatics. 2004;20:3398–3405. doi: 10.1093/bioinformatics/bth414. [DOI] [PubMed] [Google Scholar]
- Dunn BM. Proteases of Infectious Agents. Academic Press; 1999. [Google Scholar]
- Etchells TA, Lisboa PJG. Orthogonal Search-Based Rule Extraction (OSRE) for Trained Neural Networks: A Practical and Efficient Approach. IEEE Transactions on Neural Networks. 2006;17:374–384. doi: 10.1109/TNN.2005.863472. [DOI] [PubMed] [Google Scholar]
- MacKay DJC. The evidence framework applied to classification networks. Neural Computation. 1992;4:720–736. doi: 10.1162/neco.1992.4.5.720. [DOI] [Google Scholar]
- Jarman IH, Etchells TA, Martin JD, Lisboa PJ. An integrated framework for risk profiling of breast cancer patients following surgery. Artificial Intelligence in Medicine. 2008;42:165–188. doi: 10.1016/j.artmed.2007.11.005. [DOI] [PubMed] [Google Scholar]
- Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22:888–905. doi: 10.1109/34.868688. [DOI] [Google Scholar]
- Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. Journal of The American Statistical Association. 1983;78:553–569. doi: 10.2307/2288117. [DOI] [Google Scholar]