Abstract
This paper concerns classification by Boolean functions. We investigate the classification accuracy obtained by standard classification techniques on unseen points (elements of the domain, {0, 1}n, for some n) that are similar, in particular senses, to the points that have been observed as training observations. Explicitly, we use a new measure of how similar a point x ∈ {0, 1}n is to a set of such points to restrict the domain of points on which we offer a classification. For points sufficiently dissimilar, no classification is given. We report on experimental results which indicate that the classification accuracies obtained on the resulting restricted domains are better than those obtained without restriction. These experiments involve a number of standard data-sets and classification techniques. We also compare the classification accuracies with those obtained by restricting the domain on which classification is given by using the Hamming distance.
Keywords: Boolean Functions, Boolean Similarity Measure, Classification
1 Introduction
In [4], the authors proposed a way of measuring the similarity s(x, A) of a Boolean vector x to a set A of such vectors. The measure is based on the absence of certain substrings of x from the set of vectors in A. In the context of machine learning classification problems, we may think of A as a training data-set, a set of observations on which we know the correct classifications. For example, each observation in the data set might arise from a set of medical tests on a patient and may represent, suitably encoded, the absence or presence—or degree of presence—of a number of symptoms the patient may have. In this context, the similarity measure provides a plausible way of deciding which unseen possible observations it would be credible to classify with some confidence once a classifier has been found that correctly classifies all (or most of) the observations in the training data-set.
Elegant and useful theories of classification error and confidence have been developed, but these usually make probabilistic assumptions about the way in which the observations have been generated. Specifically, the PAC model of learning and its variants (see, for instance [19,21,6,2,10]) assume that each observation in the data set has been chosen independently of the others, at random, according to a fixed probability distribution on {0, 1}n, the set of all conceivable observations. Vovk et al. [23,24,20] have studied on-line learning in which one wants not only to predict classifications but also to give some indication of how “credible” such predictions are, or to decline to predict when the predictions would not be credible; and this is similar to the type of application we have in mind for the similarity measure. But in these papers, it is also assumed that the observations are generated independently according to the same probability distribution. In practice, what can one do without such probabilistic assumptions? It may be hard to prove anything sensible about classification accuracy in this case. Nonetheless, it might at least be useful not only to determine a classifier and to classify unseen observations with it, but also to attach to such predicted classifications the indication s(x, A) of how similar the observation x is to those in the training data-set. Equally, one may decide not to classify at all those unseen observations that have a low similarity with the training data-set. This paper reports on empirical investigations that suggest that a higher classification accuracy is then achieved on the region of the domain {0, 1}n on which we do decide to classify.
2 A Measure of Similarity
2.1 Definitions
Suppose x ∈ {0, 1}n, I ⊆ [n] = {1, 2, …, n}, and |I| = k. Then the projection x|I of x onto I is the k-vector obtained from x by considering only the coordinates in I. For example, if n = 5, I = {2, 4} and x = 01001 then x|I = 10.
By a positional substring of x ∈ {0, 1}n, we mean a pair (z, I) where z = x|I. The key point here is that the coordinates in I are specified: we will want, as part of our later definitions, to indicate that two vectors x and y have the same entries in exactly the same places, as specified by some I ⊆ [n]. For instance, although both x = 10101 and y = 01010 have substrings equal to 00, there is no I such that x|I = y|I = 00.
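To make the projection and positional-substring notions concrete, here is a minimal Python sketch (not from the paper; it uses 0-based coordinates, whereas the text indexes coordinates from 1) that verifies the claim about x = 10101 and y = 01010:

```python
from itertools import combinations

def project(x, I):
    """The projection x|I: the entries of x on the coordinate set I."""
    return tuple(x[i] for i in sorted(I))

x = (1, 0, 1, 0, 1)   # 10101
y = (0, 1, 0, 1, 0)   # 01010

# Both vectors contain 00 as an ordinary substring, but never in the
# same positions: x and y disagree in every coordinate, so there is no
# I with x|I == y|I == (0, 0).
shared = [I for I in combinations(range(5), 2)
          if project(x, I) == project(y, I) == (0, 0)]
print(shared)   # -> []
```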
We now give the definition of similarity from [4].
Definition 1
For A ⊆ {0, 1}n and x ∈ {0, 1}n, the similarity of x to A, s(x, A), is defined to be the largest s such that every positional substring (x|I, I) of x with |I| = s appears also as a positional substring (y|I, I) of some observation y ∈ A. That is,

s(x, A) = max{s : for every I ⊆ [n] with |I| = s, there is some y ∈ A with y|I = x|I}.

Here x|I denotes the projection of x onto the coordinates indicated by I.
Equivalently, if r is the smallest length of a positional substring possessed by x that does not appear (in the same positions) anywhere in A, then s(x, A) = r − 1.
Notice that s(x, A) is a measure of how similar x is to a set of vectors. It is not a metric or distance function. It can immediately be seen, indeed, that if A consists solely of one vector y, not equal to x, then s(x, A) = 0, since there must be some coordinate on which x and y differ (and hence a positional substring of length 1 of x that is absent from A).
Informally, the similarity of x to A is low if x has a short positional substring absent from A; and the similarity is high if all positional substrings of x of a fairly large length can be found in the same positions in some y ∈ A. For motivation for the similarity measure, see [4].
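As a concrete illustration, Definition 1 (in its equivalent “shortest absent substring” form) can be computed by brute force. This sketch is not from the paper; it is exponential in n and intended only for small examples:

```python
from itertools import combinations

def similarity(x, A):
    """s(x, A): one less than the smallest length of a positional
    substring of x that appears nowhere (in the same positions) in A.
    Brute force -- exponential in n, for illustration only."""
    n = len(x)
    for r in range(1, n + 1):
        for I in combinations(range(n), r):
            # is the positional substring (x|I, I) absent from A?
            if not any(all(y[i] == x[i] for i in I) for y in A):
                return r - 1
    return n   # every positional substring of x occurs in A (so x is in A)

A = [(0, 0), (1, 1)]
print(similarity((0, 1), A))         # -> 1: both length-1 substrings occur in A
print(similarity((1, 0), [(0, 1)]))  # -> 0: a singleton set A with A != {x}
```

The second call illustrates the remark above: if A consists of a single vector different from x, some length-1 positional substring of x is absent, so s(x, A) = 0.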
This definition of similarity requires the elements of A to be binary vectors. However, in many applications, the raw data that we work with in a particular classification problem might be more naturally encoded as a real-valued vector. In such cases, the data may be transformed into binary data through a process known as binarization (see [7] for example). The transformed data set may then be simplified or cleaned in a variety of ways, by the removal of repeated points, for instance, and the deletion of coordinates found to be statistically insignificant in determining the classification.
3 Hierarchies based on similarity and relationship with Hamming distance
The similarity measure provides a way of filtering, or grading, {0, 1}n according to similarity to a given set A. For 0 ≤ k ≤ n, let

Ak = {x ∈ {0, 1}n : s(x, A) ≥ k}

be the set of Boolean vectors which have similarity at least k to A. Then we have the following hierarchy:

{0, 1}n = A0 ⊇ A1 ⊇ A2 ⊇ ⋯ ⊇ An = A.
So, for large k, Ak is the set of vectors highly similar to A. Suppose that, in a machine learning problem, A is a training data-set. We might then decide to form a classifier of a particular type, using a particular learning algorithm, on the basis of A, but not to use it to predict classifications outside Ak for a particular choice of k. The rationale for this would be that vectors in {0, 1}n \Ak are judged to be too dissimilar to those in A. In this paper we explore empirically whether this is a good strategy.
For a particular A, the hierarchy will typically look as follows:

{0, 1}n = A0 = A1 = ⋯ = Ap ⊃ Ap+1 ⊃ ⋯ ⊃ Ae = Ae+1 = ⋯ = An = A,

where “⊃” denotes strict containment. (This is modified in the obvious way if p = e. Here, p = p(A) is the “pervasiveness” of A and e = e(A) is the “extent” of A, as defined in [4].)
Another very natural way to measure how “similar” a given x ∈ {0, 1}n is to A ⊆ {0, 1}n is to consider its Hamming distance. Recall that the Hamming distance d(x, y) between x, y in {0, 1}n is the number of entries on which they differ; and that, for A ⊆ {0, 1}n, the Hamming distance of x to the set A is defined by d(x, A) = min{d(x, y) : y ∈ A}. This leads, in a similar way, to a hierarchy of subsets of {0, 1}n: if, for 0 ≤ k ≤ n, we let Dk = {x ∈ {0, 1}n : d(x, A) ≤ n − k}, then we have the hierarchy

{0, 1}n = D0 ⊇ D1 ⊇ D2 ⊇ ⋯ ⊇ Dn = A.
It can be shown [4] that, for all k, Ak ⊆ Dk. So, in this sense, the hierarchy resulting from the use of similarity is a refinement of that resulting from Hamming distance. However, the two approaches are quite different. For example, as shown in [4], if Ak ≠ {0, 1}n, then {0, 1}n \ Ak contains an element of {0, 1}n that is at Hamming distance only 1 from A.
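The containment Ak ⊆ Dk can be checked exhaustively on a small example. The following sketch (brute-force implementations, with an arbitrary example set A not taken from the paper) builds both hierarchies over the whole of {0, 1}n for n = 4:

```python
from itertools import combinations, product

def similarity(x, A):
    """s(x, A), by brute force (see Definition 1)."""
    n = len(x)
    for r in range(1, n + 1):
        for I in combinations(range(n), r):
            if not any(all(y[i] == x[i] for i in I) for y in A):
                return r - 1
    return n

def hamming_to_set(x, A):
    """d(x, A): the minimum Hamming distance from x to a member of A."""
    return min(sum(a != b for a, b in zip(x, y)) for y in A)

n = 4
cube = list(product((0, 1), repeat=n))
A = [(0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1)]   # arbitrary example set

for k in range(n + 1):
    Ak = {x for x in cube if similarity(x, A) >= k}
    Dk = {x for x in cube if hamming_to_set(x, A) <= n - k}
    assert Ak <= Dk   # the refinement property: Ak is contained in Dk
    print(k, len(Ak), len(Dk))
```

The reason the containment holds is visible in the code: if every length-k positional substring of x occurs in A then, in particular, some y ∈ A agrees with x on k coordinates, so d(x, y) ≤ n − k.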
4 Classification accuracy and similarity
In this paper we explore, experimentally, the extent to which it appears that, on standard data-sets, standard learning algorithms produce more accurate classifications on unseen instances that have high similarity to those in a training set. We assume, therefore, that there is some underlying target concept c : {0, 1}n → {0, 1} that represents the “true” classifications of all x ∈ {0, 1}n. What we see when we learn is a subset A ⊆ {0, 1}n together with the corresponding values of c(y) for y ∈ A. On the basis of the training data-set and its classifications, we then produce a hypothesis h : {0, 1}n → {0, 1} that we hope to be a good approximation to c. Typically, we might aim to produce, using one of a standard range of learning algorithms, a function h such that h(y) = c(y) for all y ∈ A. Such a hypothesis is said to be consistent with the target concept on A (so that h extends the restriction of c to A). Ideally, we would hope that for many other points of {0, 1}n (not in A), we would also have h(x) = c(x). This has been thoroughly modelled and investigated within computational (or statistical) learning theory (see [2,6,10,19,21] for instance). However, as mentioned earlier, the theoretical results of computational learning theory require probabilistic assumptions about the way in which the data set is generated. Therefore, rather than require, as there, that highly probable instances be classified correctly, we might ask whether highly similar instances will be classified correctly by our hypothesis. That is, can we be sure that if the similarity of x to A is sufficiently high, then h(x) will indeed be correct?
There is some theoretical evidence that such an approach might work. Veal [22] has shown that if there is a “simple” underlying target concept, and if we use an algorithm that produces a simple classifier, then the classifications given to instances with sufficiently high similarity to the training data-set will be correct. More precisely, suppose the target concept, c, is an l-term k-DNF function and that the data-set is A. Suppose also that we have a hypothesis h which is an l′-term k′-DNF function and is consistent with the target concept on A. Then, for any x ∈ {0, 1}n, if s(x, A) ≥ max{l′ + k, l + k′}, then h(x) = c(x). Of course, we do not necessarily know a priori bounds on k and l, so this is not in practice necessarily very useful. However, it does show that if the similarity is sufficiently high, we will classify correctly. One might be tempted to think that, generally, an instance with a higher similarity to A is more likely to be correctly classified than one with a lower similarity. In the notation used above, this would mean that if r > s then the proportion of points in Ar misclassified by h would be smaller than the proportion of points in As that are misclassified by h. We investigate experimentally, on standard data-sets and using standard learning algorithms, whether this might be the case, and it does generally appear to be, at least for such standard data-sets. However, as shown in [22], it is possible to construct examples in which such a relationship does not hold: there is a target concept c, a training data-set A and a hypothesis h such that h is consistent with c on A (that is, h extends the restriction of c to A), but such that all the instances misclassified by h are of higher similarity than those correctly classified. It will not, therefore, be true in general that higher similarity necessarily implies higher classification accuracy, but this might, at least often, be the case for “real”, natural data-sets and target concepts.
5 Empirical results on classification accuracy for different data-sets
5.1 The data-sets
In our experiments we used the following nine data-sets, taken from the UCI Machine Learning Repository [18].
Cleveland heart disease (hea)
Pima Indian Diabetes (Pid)
German credit (nominal data from Statlog)
Hepatitis
Ionosphere
Mushroom
Tic-Tac-Toe
House Votes (vot)
Wisconsin breast cancer (bcw).
The data-sets were pre-processed in several ways before we ran our experiments. First, any observations with missing attribute values were deleted. Next, the data-sets were binarized, according to the method described in [7], so that numerical and nominal attribute values were converted to binary values. Techniques from [9] were then used to identify attributes (of the binarized data) that could be deemed irrelevant, and these were deleted. (Set covering was used to find a small “support set”.) The binarized data was then projected onto the remaining binary attributes. If this process produced repeated observations, only one copy of each was retained. If any processed observation appeared with both class labels, all of its occurrences were deleted. After pre-processing in this manner, the data-sets consisted of binary vectors, generally in a higher-dimensional space than the original data. Table 1 describes the characteristics of the data-sets before and after this pre-processing.
5.2 The learning algorithms
The classification methods, or learning algorithms, used in this experiment were taken from commonly used packages. These included Decision Trees (See5) [17] and LAD [9] (see [12,13] for background), the specific implementations used being Datascope [8] and Ladoscope [15]. For the other five commonly used machine learning algorithms we used the publicly available WEKA software package (see [25] and http://www.cs.waikato.ac.nz/~ml/weka/index.html), which consists of many algorithms. Those we used in our experiments are: Support Vector Machines (SMO), Simple Linear Logistic Regression (Simple Logistic), Neural Networks (Multilayer Perceptron), Nearest Neighbors (IBk), and Decision Trees (J48). In summary, the learning algorithms used are as follows.
Logical Analysis of Data (LAD) is a combinatorics- and optimization-based data analysis method that has been applied in business, economics, seismology, oil exploration, medicine and related disciplines. The problems analyzed by LAD are similar to those addressed by statistics, pattern recognition, clustering, machine learning and neural networks. One of the major goals of LAD is to learn, from an archive of examples, the correct way of classifying observations into several (usually two) categories.
SEE5 is a data mining tool that constructs a decision tree and extracts informative patterns from data. Patterns often concern the categories to which situations belong and are used to predict outcomes for future situations as an aid to decision-making.
SMO, which implements John Platt’s sequential minimal optimization algorithm for training a support vector classifier. It transforms the output of the support vector machine into probabilities by applying a standard sigmoid function that is not fitted to the data. This implementation globally replaces all missing values and transforms nominal attributes into binary ones [11,16,25].
Simple Logistic Regression Classifier (SL), which builds linear logistic regression models. LogitBoost with simple regression functions as base learners is used for fitting the logistic models. The optimal number of LogitBoost iterations to perform is cross-validated, which leads to automatic attribute selection. For more information see [14].
Multilayer Perceptron (MLP), which builds a neural model consisting of a network of processing elements or nodes arranged in layers. Typically it requires three or more layers of processing nodes: an input layer which accepts the input variables used in the classification procedure, one or more hidden layers, and an output layer with one node per class. When the data from an input pattern are presented at the input layer, the network nodes perform calculations in the successive layers until an output value is computed at each of the output nodes.
IBk is a k-nearest-neighbors classifier. This is a simple instance-based learner that uses the class of the k nearest training instances to predict the class of a test instance. It normalizes the attributes by default and can select an appropriate value of k by cross-validation. For more information, see [1].
J48 generates a pruned or unpruned decision tree where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes (see [17]).
6 Accuracy on similarity hierarchy
The first set of experiments we conducted was intended to investigate whether the classification accuracy improved as we restricted the domain on which we predict, according to similarity.
To describe this in more detail, we must first explain cross-validation estimates. Suppose we randomly partition the data-set into two equally-sized parts, S and R. Suppose, further, that we then use S as input to the learning (or classification) algorithm and measure the accuracy of the output hypothesis, hS, of the algorithm on R, by which is meant the proportion of observations in R that are correctly classified by hS. Then, suppose we instead use R as input to the learning algorithm and measure the accuracy of the output hypothesis, hR, of the algorithm on S. If these two accuracy rates are then averaged, we obtain what is known as a 2-fold cross-validation estimate of accuracy for that partitioning of the data-set. If we repeat this procedure ten times, each time with a different randomly chosen partitioning of the data into two parts, then, for our purposes, we refer to the average of the ten cross-validation estimates as the 10-times 2-fold CV (cross-validation) estimate of the accuracy. We shall sometimes find it more convenient to consider error rather than accuracy. Error measures the proportion of observations incorrectly classified, and so it is just 1 minus the accuracy.
Now, we are interested in the performance of a classifier on observations that have at least a given similarity to the observations that were used as input to the learning algorithm that produced the classifier (or hypothesis). Suppose that k is some positive integer. We might then adapt the cross-validation procedure outlined above as follows: instead of finding the accuracy of hS on R and then of hR on S, and averaging the two, we instead determine the accuracies of hS on R∩Sk and of hR on S ∩ Rk, and average the two. Recall that Sk is the set of points in the data-set that have similarity at least k to S (and Rk is similarly defined). Repeating this ten times and averaging, we obtain an estimate which we call the 10-times 2-fold CV estimate on observations of similarity at least k.
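The restricted cross-validation procedure just described can be sketched as follows. This is not the experimental code used in the paper: the classifier is a simple 1-nearest-neighbour rule by Hamming distance, standing in for the seven learning algorithms actually used, and the data-set is a hypothetical toy example in which the class is just the first coordinate.

```python
import random
from itertools import combinations, product

def similarity(x, A):
    """s(x, A), by brute force (see Definition 1)."""
    n = len(x)
    for r in range(1, n + 1):
        for I in combinations(range(n), r):
            if not any(all(y[i] == x[i] for i in I) for y in A):
                return r - 1
    return n

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def nn_classify(train, x):
    """1-NN by Hamming distance: a stand-in learning algorithm."""
    return min(train, key=lambda obs: hamming(obs[0], x))[1]

def cv_estimate(data, k, repeats=10, seed=0):
    """10-times 2-fold CV estimate of accuracy, restricted to test
    observations of similarity at least k to the training half."""
    rng, accs = random.Random(seed), []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        S, R = shuffled[:half], shuffled[half:]
        for train, test in ((S, R), (R, S)):
            pool = [x for x, _ in train]           # the set Sk is built from these
            kept = [(x, c) for x, c in test if similarity(x, pool) >= k]
            if kept:   # skip folds where no test point is similar enough
                accs.append(sum(nn_classify(train, x) == c
                                for x, c in kept) / len(kept))
    return sum(accs) / len(accs) if accs else None

# hypothetical toy data: the class label is the first coordinate
data = [(x, x[0]) for x in product((0, 1), repeat=4)]
print(cv_estimate(data, k=2))
```

Unrestricted estimates (k = 0) are recovered as a special case, since every observation then passes the similarity filter.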
For values of k between 2 and 6, and for each of the nine data-sets and each of the seven learning algorithms, we determined the 10-times 2-fold CV estimate on observations of similarity at least k. It is conceivable that any perceived improvement in the accuracy estimates as we increase the similarity might be an artefact of the use of a particular learning algorithm, so we report two types of result here. First, for each data-set, we report the average, over all seven learning algorithms, of the accuracy estimates. Secondly, we report, for each algorithm, the average of the accuracy estimates over all nine data-sets.
6.1 Performance on each data-set
For values of k between 2 and 6, and for each of the nine data-sets, Table 2 shows the average, over all seven learning algorithms, of the 10-times 2-fold CV estimate on observations of similarity at least k.
Table 2.
Average accuracy, over all seven learning algorithms, on observations with similarity at least k
| k | hea | Pid | GermanCredit | hepatitis | ionosphere | mushroom | tic-tac-toe | vot | bcw | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | | | | |
| 5 | 0.974 | 1 | 0.857 | 1 | 0.999 | 0.969 | 1 | 0.992 | 0.974 | |
| 4 | 0.893 | 0.987 | 0.803 | 0.998 | 0.990 | 0.998 | 0.899 | 0.996 | 0.983 | 0.950 |
| 3 | 0.814 | 0.790 | 0.751 | 0.960 | 0.948 | 0.997 | 0.900 | 0.957 | 0.951 | 0.897 |
| 2 | 0.802 | 0.746 | 0.720 | 0.845 | 0.855 | 0.993 | 0.900 | 0.935 | 0.927 | 0.858 |
| Average | 0.801 | 0.746 | 0.712 | 0.811 | 0.855 | 0.981 | 0.900 | 0.928 | 0.926 | 0.851 |
For values of k between 2 and 6, and for each of the seven learning algorithms, Table 3 shows the average, over all nine data-sets, of the 10-times 2-fold CV estimate on observations of similarity at least k.
Table 3.
Average accuracy, over all nine data-sets, on observations of similarity at least k
| k | LAD | SEE5 | SMO | SL | MLP | IB3 | J48 | Average |
|---|---|---|---|---|---|---|---|---|
| 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 5 | 0.965 | 0.984 | 0.972 | 0.972 | 0.991 | 0.979 | 0.967 | 0.976 |
| 4 | 0.943 | 0.966 | 0.967 | 0.960 | 0.958 | 0.942 | 0.941 | 0.954 |
| 3 | 0.893 | 0.928 | 0.913 | 0.913 | 0.903 | 0.881 | 0.893 | 0.904 |
| 2 | 0.850 | 0.880 | 0.874 | 0.874 | 0.864 | 0.842 | 0.851 | 0.862 |
| Average | 0.842 | 0.871 | 0.868 | 0.869 | 0.857 | 0.840 | 0.843 | 0.856 |
7 Accuracy on Hamming distance hierarchy
The next set of experiments investigates whether the same type of increased accuracy estimates result when the domain of prediction is determined by Hamming distance rather than similarity. We use the same cross-validation partitions as for the previous experiments. The Hamming-distance estimates we use are defined in a similar way to the CV estimates on observations of similarity at least k. For a range of values of d, we proceed exactly as described in Section 6, but instead of using the accuracies of hS on R ∩ Sk and of hR on S ∩ Rk, we instead find the accuracies of hS on {x ∈ R : d(x, S) ≤ d} and of hR on {x ∈ S : d(x, R) ≤ d}. We call the resulting version of the 10-times 2-fold CV estimate the 10-times 2-fold CV estimate on observations of Hamming distance at most d. Again, as for similarity, we report two types of result: for each data-set, the average, over all seven learning algorithms, of the accuracy estimates; and, for each algorithm, the average of the accuracy estimates over all nine data-sets.
7.1 Performance on each data-set
For values of d between 1 and 16, and for each of the nine data-sets, Table 4 shows the average, over all seven learning algorithms, of the 10-times 2-fold CV estimate on observations of Hamming distance at most d.
Table 4.
The average, over all seven learning algorithms, of the 10-times 2-fold CV estimates on observations at Hamming distance at most d for every data-set.
| | hea | pid | German Credit | hepatitis | ionosphere | mushroom | tic-tac-toe | vot | bcw | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| HD=1 | 0.936 | 0.980 | 0.858 | 0.996 | 0.949 | 0.993 | 0.964 | 0.984 | 0.958 | |
| HD≤2 | 0.923 | 0.842 | 0.854 | 0.996 | 0.965 | 0.995 | 0.917 | 0.938 | 0.986 | 0.935 |
| HD≤3 | 0.881 | 0.833 | 0.867 | 0.951 | 0.949 | 0.994 | 0.891 | 0.929 | 0.978 | 0.919 |
| HD≤4 | 0.853 | 0.832 | 0.825 | 0.886 | 0.934 | 0.992 | 0.901 | 0.929 | 0.972 | 0.903 |
| HD≤5 | 0.829 | 0.797 | 0.819 | 0.855 | 0.925 | 0.990 | 0.900 | 0.929 | 0.964 | 0.890 |
| HD≤6 | 0.813 | 0.769 | 0.790 | 0.823 | 0.901 | 0.988 | 0.900 | 0.929 | 0.951 | 0.874 |
| HD≤7 | 0.809 | 0.761 | 0.776 | 0.818 | 0.886 | 0.987 | 0.900 | 0.929 | 0.943 | 0.868 |
| HD≤8 | 0.806 | 0.752 | 0.761 | 0.816 | 0.873 | 0.985 | 0.900 | 0.929 | 0.934 | 0.862 |
| HD≤9 | 0.801 | 0.749 | 0.746 | 0.815 | 0.865 | 0.983 | 0.900 | 0.929 | 0.929 | 0.857 |
| HD≤10 | 0.801 | 0.748 | 0.737 | 0.815 | 0.861 | 0.982 | 0.900 | 0.929 | 0.927 | 0.856 |
| HD≤11 | 0.800 | 0.747 | 0.728 | 0.815 | 0.857 | 0.981 | 0.900 | 0.929 | 0.927 | 0.854 |
| HD≤12 | 0.800 | 0.747 | 0.722 | 0.815 | 0.855 | 0.981 | 0.900 | 0.929 | 0.926 | 0.853 |
| HD≤13 | 0.800 | 0.747 | 0.721 | 0.815 | 0.854 | 0.980 | 0.900 | 0.929 | 0.926 | 0.852 |
| HD≤14 | 0.800 | 0.747 | 0.720 | 0.815 | 0.853 | 0.980 | 0.900 | 0.929 | 0.926 | 0.852 |
| HD≤15 | 0.800 | 0.747 | 0.720 | 0.815 | 0.853 | 0.979 | 0.900 | 0.929 | 0.926 | 0.852 |
| HD≤16 | 0.800 | 0.747 | 0.720 | 0.815 | 0.853 | 0.979 | 0.900 | 0.929 | 0.926 | 0.852 |
7.2 Performance of each learning algorithm
For values of d between 1 and 16, and for each of the seven learning algorithms, Table 5 shows the average, over all nine data-sets, of the 10-times 2-fold CV estimate on observations of Hamming distance at most d.
Table 5.
The average, over all nine data-sets, of the 10-times 2-fold CV estimates on observations at Hamming distance at most d for every learning algorithm.
| | LAD | SEE5 | SMO | SL | MLP | IB3 | J48 | Average |
|---|---|---|---|---|---|---|---|---|
| HD=1 | 0.892 | 0.967 | 0.967 | 0.973 | 0.967 | 0.968 | 0.969 | 0.958 |
| HD≤2 | 0.907 | 0.936 | 0.952 | 0.951 | 0.937 | 0.930 | 0.932 | 0.935 |
| HD≤3 | 0.906 | 0.919 | 0.933 | 0.934 | 0.927 | 0.899 | 0.918 | 0.919 |
| HD≤4 | 0.894 | 0.899 | 0.917 | 0.913 | 0.907 | 0.889 | 0.900 | 0.903 |
| HD≤5 | 0.880 | 0.887 | 0.903 | 0.903 | 0.893 | 0.875 | 0.886 | 0.890 |
| HD≤6 | 0.862 | 0.871 | 0.888 | 0.887 | 0.878 | 0.863 | 0.868 | 0.874 |
| HD≤7 | 0.854 | 0.865 | 0.884 | 0.883 | 0.873 | 0.856 | 0.860 | 0.868 |
| HD≤8 | 0.848 | 0.859 | 0.878 | 0.878 | 0.867 | 0.850 | 0.853 | 0.862 |
| HD≤9 | 0.844 | 0.855 | 0.874 | 0.873 | 0.862 | 0.846 | 0.848 | 0.857 |
| HD≤10 | 0.843 | 0.852 | 0.872 | 0.872 | 0.860 | 0.843 | 0.846 | 0.855 |
| HD≤11 | 0.841 | 0.851 | 0.870 | 0.870 | 0.858 | 0.841 | 0.845 | 0.854 |
| HD≤12 | 0.840 | 0.850 | 0.869 | 0.869 | 0.857 | 0.840 | 0.844 | 0.853 |
| HD≤13 | 0.840 | 0.849 | 0.869 | 0.869 | 0.856 | 0.840 | 0.843 | 0.852 |
| HD≤14 | 0.840 | 0.849 | 0.869 | 0.869 | 0.856 | 0.840 | 0.843 | 0.852 |
| HD≤15 | 0.840 | 0.849 | 0.869 | 0.869 | 0.856 | 0.840 | 0.843 | 0.852 |
| HD≤16 | 0.840 | 0.849 | 0.869 | 0.869 | 0.856 | 0.840 | 0.843 | 0.852 |
8 Using similarity and Hamming distance together
An observation that has both high similarity and low Hamming distance to a given set A is, arguably, strongly “like” the members of A. We have seen that classification accuracy appears to improve when we, separately, restrict prediction to observations of high similarity to, or small Hamming distance from, those used to produce the classifier. In this section, we report experimental results examining the accuracy when prediction is restricted simultaneously by similarity and Hamming distance. Explicitly, for each d between 1 and 16, and each k between 2 and 6, we proceed exactly as described in Section 6, but using the accuracies of hS on {x ∈ R : d(x, S) ≤ d, s(x, S) ≥ k} and of hR on {x ∈ S: d(x, R) ≤ d, s(x, R) ≥ k}.
8.1 Performance on each data-set
Figure 1 and Figure 2 illustrate, respectively, the average accuracies on Hamming distance at most d and similarity at least k for the Cleveland Heart Disease and German Credit data-sets.
Fig. 1.
The average, over all seven learning algorithms, of the 10-times 2-fold CV estimates on observations of given similarity and Hamming distance for the Cleveland Heart Disease Data
Fig. 2.
The average, over all seven learning algorithms, of the 10-times 2-fold CV estimates on observations of given similarity and Hamming distance for the German Credit Data
Figure 3 shows the average, over all nine data sets, of the average, over all seven learning algorithms, of the accuracies on Hamming distance at most d and similarity at least k.
Fig. 3.
The average, over all nine data sets, of the average, over all seven learning algorithms, of the 10-times 2-fold CV estimates on observations of given similarity and Hamming distance.
Further data can be found in the Tables in Section 12.7 of the Appendix to [5], where numbers of observations of at most a given Hamming distance and at least a given similarity are also indicated.
8.2 Performance of each learning algorithm
Figure 4 and Figure 5 illustrate, respectively, the average accuracies, over all data-sets, on Hamming distance at most d and similarity at least k when using the LAD and SEE5 classification techniques. Further data can be found in the Tables in Section 12.8 of the Appendix to [5].
Fig. 4.
The average, over all nine data-sets, of the 10-times 2-fold CV estimates on observations of given similarity and Hamming distance when using the LAD classification technique.
Fig. 5.
The average, over all nine data-sets, of the 10-times 2-fold CV estimates on observations of given similarity and Hamming distance when using the SEE5 classification technique.
9 Conclusions
The experimental results here indicate that there is some advantage in using “similarity” and Hamming distance, separately and in combination, to restrict the observations on which one is willing to offer a confident prediction. As noted, there are provably cases in which this is not so, but the principle does appear generally to be borne out by the data-sets and algorithms used here.
Table 1.
Characteristics of the data-sets
| Dataset | # positive obs. | # negative obs. | # numeric attrs | # nominal attrs | # positive obs. (after) | # negative obs. (after) | # binary attrs (after) |
|---|---|---|---|---|---|---|---|
| Cleveland Heart Disease | 139 | 164 | 10 | 3 | 137 | 158 | 63 |
| Pima Indian Diabetes | 130 | 262 | 8 | 0 | 130 | 262 | 47 |
| German credit | 700 | 300 | 7 | 13 | 697 | 300 | 66 |
| Hepatitis | 123 | 32 | 6 | 13 | 92 | 19 | 28 |
| Ionosphere | 225 | 126 | 34 | 0 | 216 | 125 | 49 |
| Mushroom | 3916 | 4208 | 0 | 22 | 2188 | 2047 | 50 |
| Tic-Tac-Toe | 626 | 332 | 0 | 9 | 626 | 332 | 27 |
| Voting | 267 | 168 | 16 | 0 | 96 | 64 | 16 |
| Wisconsin Breast Cancer | 458 | 241 | 9 | 0 | 203 | 182 | 48 |
Acknowledgments
Martin Anthony’s work is supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. Thanks to Iain Morrow for advice on graphs. Peter L. Hammer’s work is partially supported by NSF Grant NSF-IIS-0312953 and NIH Grants NIH-002748-001 and NIH-HL-072771-01.
References
- 1. Aha D, Kibler D. Instance-based learning algorithms. Machine Learning. 1991;6:37–66.
- 2. Anthony M, Bartlett PL. Neural Network Learning: Theoretical Foundations. Cambridge University Press; Cambridge, UK: 1999.
- 3. Anthony M, Biggs N. Computational Learning Theory: An Introduction. Cambridge University Press; Cambridge, UK: 1992.
- 4. Anthony M, Hammer PL. A Boolean measure of similarity. Discrete Applied Mathematics. 2006;154(16):2242–2246.
- 5. Anthony M, Hammer PL, Subasi E, Subasi M. Using a similarity measure for credible classification. RUTCOR Research Report RRR-39-2005.
- 6. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM. 1989;36(4):929–965.
- 7. Boros E, Hammer PL, Ibaraki T, Kogan A. Logical analysis of numerical data. Mathematical Programming. 1997;79:163–190.
- 8. Alexe S. Datascope: A set of tools for the logical analysis of data. http://rutcor.rutgers.edu/~salexe/LAD\_kit/SETUP-LAD-DS-SE20.zip
- 9. Boros E, Hammer PL, Ibaraki T, Kogan A, Mayoraz E, Muchnik I. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering. 2000;12(1):292–306.
- 10. Kearns MJ, Vazirani U. An Introduction to Computational Learning Theory. MIT Press; Cambridge, MA: 1994.
- 11. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation. 2001;13(3):637–649.
- 12. Hammer PL. Partially defined Boolean functions and cause-effect relationships. Presented at the International Conference on Multi-Attribute Decision Making Via OR-Based Expert Systems; University of Passau, Passau, Germany; April 1986.
- 13. Crama Y, Hammer PL, Ibaraki T. Cause-effect relationships and partially defined Boolean functions. Annals of Operations Research. 1988;16:299–326.
- 14. Landwehr N, Hall M, Frank E. Logistic model trees. ECML; 2003.
- 15. Lemaire P. Ladoscope: A set of tools for the logical analysis of data. http://rutcor.rutgers.edu/~lemaire/LAD/
- 16. Platt J. Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf B, Burges C, Smola A, editors. Advances in Kernel Methods - Support Vector Learning. MIT Press; 1998.
- 17. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann; San Mateo, CA: 1993.
- 18. University of California at Irvine Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
- 19. Valiant LG. A theory of the learnable. Communications of the ACM. 1984;27(11):1134–1142.
- 20. Saunders C, Gammerman A, Vovk V. Transduction with confidence and credibility. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence; 1999. pp. 722–726.
- 21. Vapnik VN. Statistical Learning Theory. Wiley; 1998.
- 22. Veal B. Properties of a Binary Similarity Measure. CDAM Research Report LSE-CDAM-2005-06. Centre for Discrete and Applicable Mathematics, London School of Economics.
- 23. Vovk V. On-line confidence machines are well-calibrated. In: Proceedings of the 43rd Annual Symposium on Foundations of Computer Science; Los Alamitos, CA: IEEE Computer Society; 2002. pp. 187–196.
- 24. Vovk V. Asymptotic optimality of Transductive Confidence Machine. In: Cesa-Bianchi N, Numao M, Reischuk R, editors. Proceedings of the 13th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence; 2002. pp. 336–350.
- 25. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann; San Francisco: 2000.





