2014 Aug 14;52(7):4246–4255. doi: 10.1007/s13197-014-1500-x
NM Algorithm
Dataset description
Consider a dataset of seed storage protein sequences, where each protein is a sequence $P = X_1 X_2 \ldots X_r$ and $X_1, X_2, \ldots, X_r$ are amino acids.
(1) For each protein P compute the following attributes:
$$\text{Length} = \text{count of all } X_i\text{'s in } P$$
$$\text{Composition of amino acid } X_i = \frac{\text{No. of } X_i\text{'s in } P}{r}, \quad i = 1, 2, \ldots, 20$$
(2) Let $\{S(i, j) : 1 \le i \le m,\ 1 \le j \le n\}$ denote the set of all proteins, where $S(i, j)$ is the $j$th attribute of the $i$th protein, and $m$ and $n$ denote the number of proteins and attributes respectively. There are 21 independent attributes in our study, hence $n = 21$. Let $S(i, 22)$ denote the seed storage class of the protein (see Table 5 below for a sample dataset).
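Steps (1) and (2) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `protein_attributes`, the one-letter alphabet `AMINO_ACIDS` and the example sequence are assumptions introduced here.

```python
from collections import Counter

# Assumption: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def protein_attributes(sequence):
    """Return the 21 attributes of one protein: length, then 20 compositions."""
    sequence = sequence.upper()
    r = len(sequence)  # Length = count of all residues in P
    if r == 0:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    # Composition of each amino acid = (number of occurrences in P) / r
    return [r] + [counts[aa] / r for aa in AMINO_ACIDS]

# Hypothetical 8-residue sequence, used only to show the shape of one row of S.
row = protein_attributes("MKVLAAGA")
print(len(row), row[0])
```

Stacking one such row per protein, with the class label appended as a 22nd entry, gives the matrix S described above.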
Algorithm
(3) Normalize the independent attributes using the formula:
$$S_{ij} = \frac{S_{ij} - \min_i S_{ij}}{\max_i S_{ij} - \min_i S_{ij}}, \quad j = 1, 2, \ldots, 21$$
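The min-max scaling in step (3) might be implemented as below. The sketch works on the attribute columns only (the class column is left out), and mapping a constant column to 0.0 is an assumption made here, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_normalize(S):
    """Scale every attribute column of S (a list of rows) to the range [0, 1]."""
    columns = list(zip(*S))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in S:
        new_row = []
        for value, lo, hi in zip(row, lows, highs):
            # Assumption: a constant column (hi == lo) is mapped to 0.0.
            new_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized

# Hypothetical values: one length column and one composition column.
S = [[100, 0.10], [300, 0.30], [200, 0.20]]
S_norm = min_max_normalize(S)
print(S_norm)
```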
(4) Determine the most influential features using a correlation-based feature selection algorithm, which applies two criteria:
• The features should be correlated with the class attribute
• They should not be correlated among themselves
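One way to make these two criteria concrete is the merit heuristic commonly associated with correlation-based feature selection, which scores a subset of k features as k·r_cf / sqrt(k + k(k − 1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below is illustrative only: the source does not state the correlation measure or the search strategy, so the absolute Pearson correlation, numerically coded class labels, the greedy forward search and all function names are assumptions.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def merit(subset, columns, target):
    """Merit of a feature subset: high class correlation, low inter-correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def select_features(X, y):
    """Greedy forward search: add the feature that raises the merit most, else stop."""
    columns = list(zip(*X))
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        score, j = max((merit(selected + [j], columns, y), j) for j in remaining)
        if score <= best + 1e-12:  # no real improvement
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected

# Hypothetical data: features 0 and 1 are identical, feature 2 is unrelated to y.
X = [[0.1, 0.1, 0.5], [0.2, 0.2, 0.4], [0.8, 0.8, 0.5], [0.9, 0.9, 0.4]]
y = [0, 0, 1, 1]
selected = select_features(X, y)
print(selected)
```

In this toy example only one of the two duplicated features is kept, because adding the second does not increase the merit.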
(5) Generate the test and train sets using k-fold cross validation as follows:
Define $S = \bigcup_{i=1}^{k} Q_i$ as a random partition of $S$ into $k$ approximately equal parts.
For $i = 1$ to $k$, let $Q_i$ be the testing set and the remaining parts be the training set.
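The partition in step (5) could be generated as in the following sketch. The function name `k_fold_splits` and the fixed `seed` are assumptions added for reproducibility; the source only requires a random partition into k roughly equal parts.

```python
import random

def k_fold_splits(m, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over m instances."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)       # random partition of S
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]                        # Q_i is the testing set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(m=10, k=3))
for train, test in splits:
    print(len(train), len(test))
```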
(6) For each training and testing dataset generated in the above step, a classification model is constructed as follows:
• For each member of the testing set, its distance from each member of the training set is calculated using the following similarity index:
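$$\text{Similarity}(x, y) = \sum_{i=1}^{n} f(x_i, y_i)$$
where
$$f(x_i, y_i) = \begin{cases} (x_i - y_i)^2, & \text{for numeric-valued attributes} \\ (x_i \neq y_i), & \text{for boolean and symbolic attributes} \end{cases}$$
and
$$(x_i \neq y_i) = \begin{cases} 1, & x_i \neq y_i \\ 0, & x_i = y_i \end{cases}$$
Here $x$ is a member of the testing set and $y$ is a member of the training set.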
• The class of the training instance closest to the given test instance, based on the above similarity index, is assigned to the test instance.
• Obtain the performance metrics accuracy (ACC), precision (p), recall (r) and F-measure (F), based on the predicted and actual classes of the test instances, using the formulae:
$$p = \frac{TP}{TP + FP}, \quad r = \frac{TP}{TP + FN}$$
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad F = \frac{2 \cdot p \cdot r}{p + r}$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.
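Step (6) can be sketched for a single train/test split as shown below. Several choices here are assumptions rather than details from the source: the index is read as a distance (squared difference for numeric attributes, 1 for a mismatch and 0 for a match on symbolic ones), so a smaller value means a closer instance; one class is treated as "positive" when counting TP, TN, FP and FN; a ratio with a zero denominator is reported as 0.0; and the class labels "A" and "B" are placeholders.

```python
def f(a, b):
    """Per-attribute term of the similarity index."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return (a - b) ** 2      # numeric-valued attribute
    return 0 if a == b else 1    # boolean or symbolic attribute

def similarity(x, y):
    return sum(f(a, b) for a, b in zip(x, y))

def predict(x, train_X, train_y):
    # Assumption: the training instance with the smallest index value is the closest.
    nearest = min(range(len(train_X)), key=lambda i: similarity(x, train_X[i]))
    return train_y[nearest]

def performance(actual, predicted, positive):
    """Accuracy, precision, recall and F-measure with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"ACC": (tp + tn) / len(pairs), "p": precision, "r": recall, "F": f_measure}

# Hypothetical normalized attributes and placeholder class labels.
train_X, train_y = [[0.0, 0.0], [1.0, 1.0]], ["A", "B"]
test_X, actual = [[0.1, 0.2], [0.9, 0.7], [0.4, 0.4]], ["A", "B", "B"]
predicted = [predict(x, train_X, train_y) for x in test_X]
result = performance(actual, predicted, positive="A")
print(predicted, result)
```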
(7) Compute the average of the performance metrics over all the training and testing datasets. The values of accuracy, precision, recall and F-measure are measures of the goodness of fit of this model to the data. Hence, higher values of accuracy (close to 100 %) and of precision, recall and F-measure (close to 1) indicate the suitability of the above model for classifying seed storage proteins into their respective classes.
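The averaging in step (7) amounts to a per-metric mean over the k folds, as in this short sketch. The per-fold numbers are made up for illustration, and accuracy is kept as a fraction here (multiply by 100 for a percentage).

```python
def average_metrics(per_fold):
    """Average each metric over the list of per-fold metric dictionaries."""
    keys = per_fold[0].keys()
    return {key: sum(fold[key] for fold in per_fold) / len(per_fold) for key in keys}

# Hypothetical results for two folds; not values from the study.
per_fold = [
    {"ACC": 0.75, "p": 0.5, "r": 1.0, "F": 2 / 3},
    {"ACC": 1.0, "p": 1.0, "r": 1.0, "F": 1.0},
]
averaged = average_metrics(per_fold)
print(averaged)
```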