. 2015 Sep 24;163(1):187–201. doi: 10.1016/j.cell.2015.08.057

Figure 2.


Overview of the KINspect Algorithm

The KINspect workflow is designed to identify the specificity mask that best describes the importance of individual kinase domain residues for specificity. Different combinations of residue contributions to specificity are encoded as specificity masks (top left), in which each position within the kinase domain is given a score between 0 and 1. The masks are initialized with random values and then subjected to a machine-learning procedure that selects for and optimizes the masks with the highest predictive power toward specificity. This procedure, known as a learning classifier system, is divided into three steps.

In step 1, for each specificity mask, the system loops over all query kinases and, using a kinase domain alignment, compares each query kinase at the sequence level to all other kinases (except those belonging to the same kinase family, which are excluded only at this stage to avoid over-fitting), generating a similarity vector. This vector is combined with the specificity mask so that similarity at high-scoring positions of the mask is reinforced and similarity at low-scoring positions is silenced, producing a mask-weighted similarity vector and sum score for each kinase. These values are subsequently used to integrate the different observed PSSMs into a combined predicted PSSM for the query kinase (as further explained by the equations and text in the Supplemental Experimental Procedures and in Zhang et al., 2009).
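Step 1 can be illustrated with a minimal Python sketch. It is not the published implementation: the identity-based similarity (1 for a matching non-gap residue, 0 otherwise), the normalized weighted average used to combine PSSMs, and all function and variable names are assumptions made for illustration. The actual equations are given in the Supplemental Experimental Procedures.

```python
def similarity_vector(query_seq, other_seq):
    """Per-position similarity of two aligned sequences (assumed here:
    1.0 for an identical non-gap residue, 0.0 otherwise)."""
    return [1.0 if a == b and a != "-" else 0.0
            for a, b in zip(query_seq, other_seq)]


def mask_weighted_similarity(mask, query_seq, other_seq):
    """Scale each position's similarity by its mask score and sum the result."""
    sim = similarity_vector(query_seq, other_seq)
    weighted = [m * s for m, s in zip(mask, sim)]
    return weighted, sum(weighted)


def predict_pssm(mask, query, sequences, families, observed_pssms):
    """Combine the observed PSSMs of kinases outside the query's family,
    weighting each by its mask-weighted sum score (assumed here: a
    normalized weighted average)."""
    weights = {}
    for name, seq in sequences.items():
        if name == query or families[name] == families[query]:
            continue  # same-family kinases are excluded at this stage
        _, score = mask_weighted_similarity(mask, sequences[query], seq)
        weights[name] = score
    total = sum(weights.values())
    if total == 0:
        return None  # no kinase is similar to the query under this mask
    template = next(iter(observed_pssms.values()))
    predicted = [[0.0] * len(row) for row in template]
    for name, w in weights.items():
        for i, row in enumerate(observed_pssms[name]):
            for j, value in enumerate(row):
                predicted[i][j] += w / total * value
    return predicted
```

Here a PSSM is a list of rows (one per peptide position) and a mask is a list of scores between 0 and 1 with one entry per alignment column. A mask position scored 0 removes that column from the comparison entirely, which is the "silencing" described above.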

In step 2, once a predicted PSSM has been generated for every kinase in our set, fitness is computed as the median of the differences between the predicted and the experimentally determined PSSMs across all kinases obtained from the NetPhorest repository (Miller et al., 2008).
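The fitness computation in step 2 could be sketched as follows. The per-kinase difference measure (mean absolute difference over all PSSM cells) is an assumption, since the legend does not specify it, and the function names are hypothetical.

```python
from statistics import median


def pssm_error(predicted, observed):
    """Difference between two PSSMs (assumed here: mean absolute
    difference over all cells)."""
    diffs = [abs(p - o)
             for pred_row, obs_row in zip(predicted, observed)
             for p, o in zip(pred_row, obs_row)]
    return sum(diffs) / len(diffs)


def mask_fitness(predicted_pssms, observed_pssms):
    """Median error across all kinases; lower values mean a better mask."""
    errors = [pssm_error(predicted_pssms[name], observed_pssms[name])
              for name in predicted_pssms]
    return median(errors)
```

Using the median rather than the mean keeps a few poorly predicted kinases from dominating the score assigned to a mask.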

In step 3, the best-performing specificity masks are kept (“elite”), and new ones are generated by mutation (changing the value of a given position in the mask) and cross-over of the elite masks (combining segments of two different masks), as typically done in genetic algorithms. Once a new set of masks has been generated, the whole procedure (prediction, fitness evaluation, and generation of new masks) is repeated iteratively until fitness (defined as the median error between predicted and observed specificity profiles) cannot be improved any further (i.e., convergence is reached).
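The selection, mutation, and cross-over loop of step 3 can be sketched as a small elitist genetic algorithm. The elite size, the equal chance of mutation versus cross-over, single-point cross-over, and the patience-based convergence test are all illustrative assumptions, not parameters taken from the paper. The `fitness` argument is any function that maps a mask to its median error.

```python
import random


def mutate(mask, rng):
    """Copy a mask and replace the score at one random position."""
    child = list(mask)
    child[rng.randrange(len(child))] = rng.random()
    return child


def crossover(mask_a, mask_b, rng):
    """Join the start of one mask to the end of another (single cut point)."""
    cut = rng.randrange(1, len(mask_a))
    return mask_a[:cut] + mask_b[cut:]


def next_generation(masks, fitness, n_elite, rng):
    """Keep the n_elite lowest-error masks and refill the population."""
    elite = sorted(masks, key=fitness)[:n_elite]
    children = []
    while len(elite) + len(children) < len(masks):
        if rng.random() < 0.5:
            children.append(mutate(rng.choice(elite), rng))
        else:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(crossover(parent_a, parent_b, rng))
    return elite + children


def optimize(masks, fitness, n_elite=2, patience=5, max_iter=200, seed=0):
    """Iterate until the best error has not improved for `patience` rounds."""
    rng = random.Random(seed)
    best = min(fitness(m) for m in masks)
    stale = 0
    for _ in range(max_iter):
        masks = next_generation(masks, fitness, n_elite, rng)
        current = min(fitness(m) for m in masks)
        if current < best:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:
                break  # treated as convergence in this sketch
    return min(masks, key=fitness), best
```

Because the elite masks are carried over unchanged, the best error can never get worse from one generation to the next. The sketch requires `n_elite` of at least 2 and masks of at least two positions.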

Residues scoring high in the optimized specificity masks will be considered candidate DoS. For further details on this procedure, please refer to Supplemental Experimental Procedures.