Computational approaches for the classification of seed storage proteins

2014 Aug 14;52(7):4246–4255. doi: 10.1007/s13197-014-1500-x

NM Algorithm

Dataset description
Consider a dataset of seed storage protein sequences. Each protein is a sequence P = X_1 X_2 … X_r, where X_1, X_2, …, X_r are amino acids and r is the number of residues in the sequence.
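(1) For each protein P, compute the following attributes:
Length = r, the count of all residues X_i in P

Composition of amino acid a_{i} = \frac{\text{number of occurrences of } a_{i} \text{ in } P}{r}, \quad i = 1, 2, \dots, 20

where a_1, a_2, …, a_20 denote the 20 amino acid types.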

(2) Let {S(i, j): 1 ≤ i ≤ m, 1 ≤ j ≤ n} denote the attribute values of all proteins, where S(i, j) is the j-th attribute of the i-th protein, and m and n denote the number of proteins and attributes, respectively. There are 21 independent attributes in our study (the length and the 20 amino acid compositions), hence n = 21. Let S(i, 22) denote the seed storage class of the i-th protein (see Table 5 below for a sample dataset).
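As an illustration of steps (1) and (2), the following minimal sketch builds one row of the attribute matrix from a raw sequence. The function name `protein_attributes`, the one-letter amino acid ordering in `AMINO_ACIDS`, and the example sequence are assumptions made for this sketch and are not taken from the study.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
# The ordering is an assumption; the source does not fix one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def protein_attributes(sequence):
    """Return [length, composition_1, ..., composition_20] for one protein."""
    seq = sequence.upper()
    r = len(seq)
    if r == 0:
        raise ValueError("sequence must contain at least one residue")
    counts = Counter(seq)
    # Composition = occurrences of the amino acid divided by the length r.
    # Non-standard letters count towards r but get no composition entry.
    composition = [counts[aa] / r for aa in AMINO_ACIDS]
    return [r] + composition


# Hypothetical toy sequence, used only to show the shape of one row S(i, 1..21).
row = protein_attributes("MKVLAA")
print(len(row), row[0])
```

Stacking such rows for all m proteins, and appending the class label as a 22nd column, gives a table of the form S(i, j) described above.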
Algorithm
(3) Normalize the independent attributes using the formula:

S(i, j) := \frac{S(i, j) - \min_{i} S(i, j)}{\max_{i} S(i, j) - \min_{i} S(i, j)}, \quad j = 1, 2, \dots, 21
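The min-max normalization of step (3) can be sketched as follows. The name `min_max_normalize` is hypothetical, and the handling of a constant column (where the formula gives 0/0) is an assumption, since the source does not discuss that case.

```python
def min_max_normalize(rows):
    """Rescale every attribute column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant column (span == 0) is mapped to 0.0.
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Toy matrix with two attributes; the second attribute is constant.
print(min_max_normalize([[1, 10], [3, 10], [5, 10]]))
```

Only the attribute columns are passed in; the class column S(i, 22) is left out of the normalization.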

(4) Determine the most influential features using a correlation-based feature selection algorithm, which selects features that satisfy two criteria:
• They should be strongly correlated with the class attribute.
• They should be weakly correlated with one another.
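The two criteria above can be combined into a single score. The sketch below uses the merit heuristic commonly associated with correlation-based feature selection, merit = k·r_cf / sqrt(k + k(k − 1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation, together with a greedy forward search. Several choices here are assumptions and not details given in the source: the use of absolute Pearson correlation, the numeric encoding of class labels, the greedy search, and all function names.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def merit(subset, columns, labels):
    """High when features correlate with the class but not with each other."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def select_features(columns, labels):
    """Greedy forward search: add features while the merit strictly improves."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        score, j = max((merit(selected + [j], columns, labels), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected


# Toy data: features 0 and 1 are identical and predictive, feature 2 is noise.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
labels = [0, 0, 1, 1]  # numerically encoded classes (assumption)
print(select_features(columns, labels))
```

In the toy data only one of the two redundant features is kept, because adding its duplicate does not raise the merit.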
(5) Generate the testing and training sets using k-fold cross-validation as follows:
Let S = \bigcup_{i = 1}^{k} Q_{i} be a random partition of S into k approximately equal parts.
For i = 1 to k, let Q_i be the testing set and the union of the remaining parts be the training set.
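A minimal sketch of the partitioning in step (5) is given below. It works on row indices, and the function name `k_fold_splits` and the fixed `seed` parameter are assumptions introduced here for reproducibility of the illustration.

```python
import random


def k_fold_splits(m, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)  # random partition of the m proteins
    # Dealing indices round-robin gives parts whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(m=10, k=3):
    print(sorted(test), len(train))
```

Every protein appears in exactly one testing set, and in the training set of the other k − 1 folds.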
(6) For each pair of training and testing sets generated in the above step, a classification model is constructed as follows:
• For each member of the testing set, its distance from each member of the training set is calculated using the following similarity index:

Similarity(x, y) = \sqrt{\sum_{i = 1}^{n} f(x_{i}, y_{i})}

where f(x_{i}, y_{i}) = \begin{cases} (x_{i} - y_{i})^{2}, & \text{for numeric-valued attributes} \\ (x_{i} \neq y_{i}), & \text{for boolean and symbolic attributes} \end{cases}

and (x_{i} \neq y_{i}) = \begin{cases} 1, & x_{i} \neq y_{i} \\ 0, & x_{i} = y_{i} \end{cases}

Here, x is a member of the testing set and y is a member of the training set.
• The class of the training instance closest to the given test instance, that is, the one with the smallest value of the above index, is assigned to the test instance.
• Obtain the performance metrics accuracy (ACC), precision (p), recall (r) and F-measure (F) from the predicted and actual classes of the test instances, using the formulae:

p = \frac{TP}{TP + FP}, \quad r = \frac{TP}{TP + FN}

ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad F = \frac{2 \cdot p \cdot r}{p + r}

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.
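The distance computation and class assignment of step (6) can be sketched as below. The sketch reads the index as a distance, so a mismatch on a boolean or symbolic attribute contributes 1 and a match contributes 0, and the nearest training instance is the one with the smallest value; this reading is an assumption. The names `distance` and `predict_nearest` and the class labels in the example are hypothetical.

```python
import math


def _is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def distance(x, y):
    """Mixed-attribute distance between a test instance x and a training instance y."""
    total = 0.0
    for xi, yi in zip(x, y):
        if _is_numeric(xi) and _is_numeric(yi):
            total += (xi - yi) ** 2            # numeric-valued attribute
        else:
            total += 1.0 if xi != yi else 0.0  # boolean or symbolic attribute
    return math.sqrt(total)


def predict_nearest(test_row, train_rows, train_labels):
    """Assign the class of the closest training instance (ties: first found)."""
    nearest = min(range(len(train_rows)),
                  key=lambda i: distance(test_row, train_rows[i]))
    return train_labels[nearest]


# Toy training set with two normalized attributes and hypothetical class names.
train_rows = [[0.0, 0.0], [1.0, 1.0]]
train_labels = ["class_a", "class_b"]
print(predict_nearest([0.1, 0.2], train_rows, train_labels))
```

In the present dataset all 21 attributes are numeric, so only the squared-difference branch is exercised; the symbolic branch is kept to mirror the general definition of f.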
(7) Compute the average of each performance metric over all k training and testing splits. The values of accuracy, precision, recall and F-measure indicate how well the model fits the data. Hence, accuracy close to 100% and precision, recall and F-measure close to 1 indicate that the above model is suitable for classifying seed storage proteins into their respective classes.
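The per-fold metrics and their average in step (7) can be sketched as follows. The sketch treats one class as "positive" and all others as "negative", which is an assumption about how TP, TN, FP and FN are counted for more than two classes; returning 0.0 when a denominator is zero is also an assumption. The function names and toy labels are hypothetical.

```python
def fold_metrics(actual, predicted, positive):
    """Compute ACC, p, r and F for one fold, with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"ACC": accuracy, "p": precision, "r": recall, "F": f_measure}


def average_metrics(per_fold):
    """Average each metric over all folds."""
    return {key: sum(m[key] for m in per_fold) / len(per_fold)
            for key in per_fold[0]}


# Two toy folds with hypothetical labels.
fold_1 = fold_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"], positive="a")
fold_2 = fold_metrics(["a", "b"], ["a", "b"], positive="a")
print(average_metrics([fold_1, fold_2]))
```

Accuracy is returned as a fraction in [0, 1]; multiplying by 100 gives the percentage form used in the text.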