Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016 Mar;472(2187):20150551. doi: 10.1098/rspa.2015.0551

Maximum margin classifier working in a set of strings

Hitoshi Koyano 1, Morihiro Hayashida 2, Tatsuya Akutsu 2
PMCID: PMC4841474  PMID: 27118908

Abstract

Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein–protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.

Keywords: strings, machine learning, probability theory, statistical asymptotics, bioinformatics

1. Introduction

Mathematicians have conducted detailed examinations of a large number of objects, such as numbers, manifolds, equations, functions and operators, throughout the long history of mathematics, but they have not studied strings in detail. A string is an object that computer scientists have addressed in depth. Stringology, a field of computer science, has thoroughly investigated algorithms and data structures for string processing [1,2]. However, computer scientists have not studied strings using a mathematical approach; for example, functions, operators and probabilities on a set of strings provided with topological and algebraic structures have not been investigated.

Numbers and numerical vectors account for a large portion of data. However, in recent years, the amount of string data generated has increased dramatically. For example, large amounts of text data have been produced on the web. In the life sciences, large amounts of data regarding genes, RNAs and proteins have been generated. These data are nucleotide or amino acid sequences and can be represented as strings. Consequently, a random string that randomly generates strings based on a probability law is necessary for string data analysis, much as random variables that randomly generate numbers and stochastic processes that randomly generate functions are essential in various fields. Statistical methods for numerical data were rigorously constructed based on probability theory. Similarly, the development and systematization of methods on the basis of probability theory on a set of strings will be required for text mining techniques and methods for analysing biological sequences.

Let A* be a set of strings on an alphabet $A=\{a_1,\ldots,a_z\}$. From the viewpoint of mathematics, A* is a monoid with respect to concatenation (the operation of appending one string to the end of another) and a metric space with the Levenshtein distance (the minimal number of deletions, insertions or substitutions required to transform one string into another), and it forms a non-commutative topological monoid. Koyano and co-workers [3,4] were the first to construct a theory for treating strings that are randomly generated according to a probability law by developing a probability theory on the space A* with the above-mentioned mathematical structures and to construct statistical methods for analysing string data using this probability theory. They used the developed statistical methods to address problems that involved biological sequence data. However, those researchers developed specific methods for estimating the α and β diversities of biological communities using gene sequences rather than general-purpose methods for statistical analysis of string data. In this study, we first extend a limit theorem demonstrated in [3] on the asymptotic behaviour of a consensus sequence of strings. This theorem is an analogue of the strong law of large numbers in a p-dimensional real vector space $\mathbb{R}^p$, because a consensus sequence of strings is the counterpart in A* of a mean in $\mathbb{R}^p$. Using this result, we then develop the theory of statistical machine learning for string classification.

Classifying string data is a common problem in many fields, including computer science and the life sciences, because of recent significant increases in the prevalence of string data. The most widely used approach to this problem is to convert strings into numerical vectors using a string kernel and subsequently apply a support vector machine (SVM) [5–9] to the vectors. The earliest string kernels were developed by Haussler [10], Watkins [11] and Lodhi et al. [12]. These papers proposed that the similarity between strings should be defined based on the number of subsequences common to them. Leslie and Paaß and co-workers [13,14] used the spectrum kernel, a string kernel that quantifies the similarity between strings based on the number of common substrings, without considering common subsequences for which gaps are allowed. The spectrum kernel was subsequently extended by Leslie and Vishwanathan and co-workers [15–17]. In addition to these kernels, a number of novel string kernels have been developed and applied to problems in bioinformatics in recent studies [18–21]. The spectrum kernel has become the most widely used of these various string kernels, although this kernel discards considerable amounts of the information concerning the order of the letters that compose the strings.

Converting strings into numerical vectors using a string kernel is not bijective and involves information loss. A more serious problem is that this conversion makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. Consequently, the performance of a learning machine has been evaluated based on whether the machine yields better results compared with other machines in a certain simulation experiment or in the application to a certain real dataset, and the fundamental evaluation of a learning machine by theoretically evaluating its generalization error has been abandoned. In this study, we develop a learning machine that classifies strings without converting them into numerical vectors by constructing a direct sum decomposition of A* under the principle of margin maximization, as an SVM does in $\mathbb{R}^p$ (§§2 and 3). We then provide a theoretical evaluation of the generalization error of our learning machine (§5) by applying the theoretical result demonstrated in §4. We also demonstrate the usefulness of our machine for practical data analysis by applying it to predicting protein–protein interactions based on amino acid sequences and classifying RNAs by the secondary structure based on nucleotide sequences (§6).

2. Specification of the problem

In the following, we refer to a classifier that decomposes a space into two disjoint subsets by choosing a hyperplane under the principle of margin maximization as an SVM, although an SVM also has other characteristics, such as (i) learning on the dual of a vector space, (ii) extracting features from input vectors, and (iii) using kernel functions. We consider a plane $\mathbb{R}^2$ for the sake of simplicity. A line in $\mathbb{R}^2$ is represented as $\{z \in \mathbb{R}^2 : z = \alpha x + \beta y,\ \alpha + \beta = 1,\ \alpha, \beta \in \mathbb{R}\}$ for $x, y \in \mathbb{R}^2$ and as $\{(x, y) \in \mathbb{R}^2 : y = ax + b,\ x \in \mathbb{R}\}$ for $a, b \in \mathbb{R}$. The first representation uses the vector space structure of $\mathbb{R}^2$, because addition and scalar multiplication are used in the representation, and the second representation uses the field structure of $\mathbb{R}$, because addition and multiplication are used. We denote a set of strings on the alphabet $A = \{a_1, \ldots, a_z\}$ by A*. The intrinsic operation and distance on A* are concatenation (hereafter denoted by ⋅) and the Levenshtein distance (hereafter denoted by $d_L$), respectively. Therefore, we provide A* with algebraic and topological structures using ⋅ and $d_L$. A* forms a non-commutative topological monoid, but it does not form a vector space or field. Therefore, ‘a line’ cannot be defined in A* using the above two forms. However, this does not mean that a line cannot be defined in A*. Thus, we consider the following two questions: (i) Can ‘a line’ be defined in A* in some way? (ii) If so, can A* be decomposed into two disjoint subsets by using ‘the line’? The answer to the first question is ‘Yes’, whereas the answer to the second question is ‘No’.

By considering a curve in a space to be a subset of the space that is obtained by repeating the operation of connecting a point in the space to one of its contiguous points, we can roughly define ‘a curve’ in A*, for example, in the following manner: if $d_L(s_i, s_{i+1}) = 1$, $i = 1, \ldots, n-1$, holds for $s_1, \ldots, s_n \in A^*$, we call $\{s_1, \ldots, s_n\}$ ‘a curve’ in A*. Furthermore, considering a segment between two points in a space to be the shortest curve that connects the two points, we can define ‘a segment’ in A* as follows: we suppose that $d_L(s, s') = n$ for $s, s' \in A^*$. $s$ can be transformed into $s'$ by performing one of three types of operation, insertion, deletion and substitution, $n$ times. We denote a string obtained by performing the first $i$ operations of the $n$ operations on $s$ by $s^{(i)}$ for each $i = 1, \ldots, n-1$. For the uniqueness of ‘a segment’ that connects two given strings, we suppose that an order of priority is given among insertion, deletion and substitution and that a series of operations is performed on $s$ in ascending order with respect to the letter number in $s$ and according to the order of priority among the three operations. We call $\{s, s^{(1)}, \ldots, s^{(n-1)}, s'\}$ ‘a segment’ in A* that connects $s$ and $s'$.
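To make this construction concrete, the following minimal Python sketch computes the Levenshtein distance by the standard dynamic programme and backtraces one optimal edit script, producing a chain of intermediate strings, i.e. one ‘segment’ connecting $s$ and $s'$. The function names, the tie-breaking priority among the operations and the right-to-left order in which the edits are applied are illustrative choices of this sketch, not the specific convention fixed in the text.

```python
# A minimal sketch: Levenshtein distance by dynamic programming and one
# 'segment' (chain of intermediate strings) between two strings.

def levenshtein_table(s: str, t: str):
    """Dynamic-programming table d[i][j] = edit distance of s[:i] and t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete s[i-1]
                          d[i][j - 1] + 1,          # insert t[j-1]
                          d[i - 1][j - 1] + cost)   # substitute or match
    return d

def segment(s: str, t: str):
    """Return a chain [s, s(1), ..., s(n-1), t] along one optimal edit path."""
    d = levenshtein_table(s, t)
    ops, i, j = [], len(s), len(t)
    while i > 0 or j > 0:   # backtrace one optimal edit script
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            if s[i - 1] != t[j - 1]:
                ops.append(("sub", i - 1, t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", i, t[j - 1]))
            j -= 1
    # Apply the edits from the end of s towards its beginning so that the
    # recorded positions remain valid; each edit yields one intermediate string.
    chain, cur = [s], s
    for op, pos, ch in ops:
        if op == "sub":
            cur = cur[:pos] + ch + cur[pos + 1:]
        elif op == "del":
            cur = cur[:pos] + cur[pos + 1:]
        else:
            cur = cur[:pos] + ch + cur[pos:]
        chain.append(cur)
    return chain

# Consecutive strings in the chain differ by Levenshtein distance 1 and the
# chain contains d_L(s, t) + 1 strings, e.g.
print(segment("kitten", "sitting"))   # ['kitten', 'kitteng', 'kitting', 'sitting']
```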

Therefore, we consider the decomposition of a subset of A* that is sufficiently large for applications and is composed of strings whose length is less than or equal to that of s, rather than the entire space A*, by choosing a sufficiently long string s and drawing a segment between s and the empty string (a string composed of zero letters). The alphabet $A = \{a_1, \ldots, a_z\}$ forms a metric space with the Hamming distance $d_H(a_i, a_j) = 0$ (if $i = j$) or $1$ (if $i \neq j$). By comparison, the set of real numbers $\mathbb{R}$ also forms a metric space with the absolute value of the difference $d(x, y) = |x - y|$, as well as a totally ordered set with respect to the usual less-than-or-equal relation ≤. The distance $d$ and the total order ≤ on $\mathbb{R}$ are consistent in the sense that if $x \le y$ and $y \le z$, then $d(x, y) \le d(x, z)$ holds for any $x, y, z \in \mathbb{R}$. Such an intrinsic total order as the less-than-or-equal relation ≤ on $\mathbb{R}$ does not exist on $A$. By defining a total order that is consistent with the Hamming distance $d_H$ in the sense mentioned above, can we make $A$ form a totally ordered set without destroying its structure as a metric space? This task is impossible owing to the definition of $d_H$. Consequently, we have the following problem: $\mathbb{R}^2$ can be divided into upper and lower half-spaces $H_+ = \{(x, y) \in \mathbb{R}^2 : y \ge ax + b\}$ and $H_- = \{(x, y) \in \mathbb{R}^2 : y \le ax + b\}$ with a line $\ell = \{(x, y) \in \mathbb{R}^2 : y = ax + b\}$. $H_+$ and $H_-$ are defined using the total order ≤ on $\mathbb{R}$. In other words, for the concepts of the upper and lower areas of a line in the direct product space $\mathbb{R}^2$ to make sense, the total order on the direct product factor $\mathbb{R}$ is required. The analogues of a curve and a segment can be defined in A* in the above manner using the Levenshtein distance. However, the concepts of the upper and lower areas of a segment cannot make sense without destroying the structure of $A$ as a metric space, because a total order that is consistent with the Hamming distance cannot be defined on $A$. Consequently, A* cannot be divided by determining such a non-closed subset as a line, in contrast to $\mathbb{R}^2$.

However, the above discussion does not indicate that A* cannot be divided into two disjoint subsets in any manner. As the Jordan curve theorem [22] and the Jordan–Brouwer separation theorem [23] of topology state, $\mathbb{R}^2$ and $\mathbb{R}^p$ ($p \ge 3$) can be divided into two disjoint subsets by choosing a closed curve and a hypersphere, respectively, without determining a line or hyperplane. Can we decompose A* into two disjoint subsets by using a method other than drawing a line? We set $U(s, r) = \{t \in A^* : d_L(t, s) \le r\}$ for $s \in A^*$ and $r \in \mathbb{Z}_+$ ($\mathbb{Z}_+$ represents the set of positive integers) and consider the decomposition of A* into $U(s, r)$ and $U(s, r)^c = A^* - U(s, r)$. In other words, we examine a method of drawing a sphere in A* and subsequently decomposing A* into its interior and exterior. In this manner, the decomposition does not require the concepts of upper and lower areas. In the following, we refer to $\partial U(s, r) = \{t \in A^* : d_L(t, s) = r\}$ as a discriminant sphere and to the number of strings in $U(s, r)$ as the size of $\partial U(s, r)$. We set the convention that strings that lie on $\partial U(s, r)$ are classified into the class of positive examples.

3. Learning machine working in A*

To decompose A* in the manner described in §2, it is necessary to specify the centre $s \in A^*$ and the radius $r \in \mathbb{Z}_+$ of a discriminant sphere $\partial U(s, r)$, given positive and negative examples. We say that the positive examples $X_m = \{s_1, \ldots, s_m\}$ and negative examples $Y_n = \{t_1, \ldots, t_n\}$ are spherically separable if there exists $s_0 \in A^*$ such that $\max_{1 \le i \le m}\{d_L(s_i, s_0)\} < \min_{1 \le i \le n}\{d_L(t_i, s_0)\}$ holds, and that $X_m$ and $Y_n$ are spherically inseparable if they are not spherically separable. We denote the set of $m$-tuples of strings for which a consensus sequence is uniquely determined by $[(A^*)^m]$. A formal definition of a consensus sequence is provided in §S2 of the electronic supplementary material for this paper. We suppose $(s_1, \ldots, s_m) \in [(A^*)^m]$ in the following and choose the consensus sequence $\bar{s}_m$ of the positive examples $s_1, \ldots, s_m$ as the centre of a discriminant sphere.
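The formal definition of a consensus sequence is deferred to §S2 of the electronic supplementary material; purely as an illustration, the sketch below implements the position-wise majority-vote reading suggested by corollary 4.2 in §4, padding shorter strings with the empty letter and reporting failure when the vote is tied at some position (the case excluded by restricting to $[(A^*)^m]$). The function name and the padding convention are assumptions of this sketch, not the paper's definition.

```python
from collections import Counter

EMPTY = ""   # stands in for the empty letter e of the extended alphabet

def consensus_sequence(strings):
    """Position-wise majority vote over the extended alphabet (illustrative).

    At each position j the consensus letter is the most frequent letter,
    where a string shorter than j contributes the empty letter.  Returns
    None when the vote is tied at some position, i.e. when the consensus
    sequence is not uniquely determined.
    """
    out = []
    for j in range(max(len(s) for s in strings)):
        counts = Counter(s[j] if j < len(s) else EMPTY for s in strings)
        (top, c1), *rest = counts.most_common()
        if rest and rest[0][1] == c1:
            return None          # tie: consensus sequence not unique
        if top != EMPTY:
            out.append(top)
    return "".join(out)

print(consensus_sequence(["acgt", "aggt", "acg"]))   # 'acgt'
```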

We first consider the problem of choosing the radius of a discriminant sphere for the case in which the positive and negative examples are spherically separable. As with a discriminant hyperplane of an SVM in $\mathbb{R}^p$, the distance between a string and a discriminant sphere is the distance between the string and the string on the sphere that is nearest to it, and, given samples of positive and negative examples, the margin of a discriminant sphere is the distance between the sphere and the example in the samples that is nearest to the sphere. Under the principle of margin maximization, the following result is immediately obtained: if the positive examples $X_m = \{s_1, \ldots, s_m\}$ and negative examples $Y_n = \{t_1, \ldots, t_n\}$ are spherically separable with respect to $\bar{s}_m$, the radius of a discriminant sphere that maximizes the margin is given by

$$r^* = \frac{1}{2}\Bigl\{\max_{1 \le i \le m}\{d_L(s_i, \bar{s}_m)\} + \min_{1 \le i \le n}\{d_L(t_i, \bar{s}_m)\}\Bigr\}. \tag{3.1}$$

If $r^*$ is not an integer, then we arbitrarily choose one of the integers closest to $r^*$.
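As an illustration, the radius of equation (3.1) can be computed as follows; the two-row Levenshtein routine and the function names are ours, and flooring a half-integer value is one arbitrary choice of a closest integer.

```python
def lev(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def separable_radius(positives, negatives, centre):
    """Margin-maximizing radius of equation (3.1) in the separable case."""
    r_in = max(lev(s, centre) for s in positives)    # farthest positive example
    r_out = min(lev(t, centre) for t in negatives)   # closest negative example
    if r_in >= r_out:
        raise ValueError("samples are not spherically separable w.r.t. this centre")
    return (r_in + r_out) // 2   # floor of r* when r* is a half-integer
```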

Next, we consider the case in which the positive examples $X_m$ and negative examples $Y_n$ are spherically inseparable. We denote the subsamples of the positive and negative examples that a discriminant sphere $U(\bar{s}_m, r)$ with centre $\bar{s}_m$ and radius $r$ correctly classifies by $X_m(\bar{s}_m, r)$ and $Y_n(\bar{s}_m, r)$, respectively. We denote the number of elements of a finite set $S$ by $\sharp S$. The numbers of strings in $X_m$ and in $Y_n$ that $U(\bar{s}_m, r)$ misclassifies are then represented by $m - \sharp X_m(\bar{s}_m, r)$ and $n - \sharp Y_n(\bar{s}_m, r)$, respectively. If the two samples $X_m$ and $Y_n$ are spherically inseparable, we choose the radius of a discriminant sphere based on the principle of minimizing the number of misclassified inputs and maximizing the margin, which is a modification of the principle used by an ordinary SVM in $\mathbb{R}^p$ in soft margin optimization. If the positive and negative examples are spherically separable, the following procedure reduces to choosing the radius according to equation (3.1).

Step 1 (minimizing the number of misclassified inputs). Search for the set of radii that minimize the number of misclassified inputs, i.e. the set of positive integers $\tilde{r}$ that satisfy

$$\tilde{r} = \mathop{\arg\min}_{r \in \mathbb{Z}_+}\bigl\{m - \sharp X_m(\bar{s}_m, r) + n - \sharp Y_n(\bar{s}_m, r)\bigr\}$$

or, equivalently,

$$\tilde{r} = \mathop{\arg\max}_{r \in \mathbb{Z}_+}\bigl\{\sharp X_m(\bar{s}_m, r) + \sharp Y_n(\bar{s}_m, r)\bigr\}.$$

We denote this set by $\tilde{R}$. $\tilde{R}$ is a non-empty finite set.

Step 2 (maximizing the margin). Choose $r^* \in \tilde{R}$ that maximizes the distance to the closest string that is correctly classified (if such an $r^*$ is not uniquely determined, then we arbitrarily choose one of them). This step is formally written as follows: the distances between $s \in X_m(\bar{s}_m, r)$ and $U(\bar{s}_m, r)$ and between $t \in Y_n(\bar{s}_m, r)$ and $U(\bar{s}_m, r)$ are equal to $r - d_L(s, \bar{s}_m)$ and $d_L(t, \bar{s}_m) - r$, respectively. These distances are not necessarily equal when $s$ and $t$ are support strings, in contrast to support vectors for an ordinary SVM in $\mathbb{R}^p$, because their sum $(r - d_L(s, \bar{s}_m)) + (d_L(t, \bar{s}_m) - r) = d_L(t, \bar{s}_m) - d_L(s, \bar{s}_m)$ may be odd. The optimal radius $r^*$ is represented as $r^* = \arg\max_{\tilde{r} \in \tilde{R}} \rho(\tilde{r})$ for

$$\rho(\tilde{r}) = \min_{(s,t) \in X_m(\bar{s}_m, \tilde{r}) \times Y_n(\bar{s}_m, \tilde{r})} \min\bigl\{\tilde{r} - d_L(s, \bar{s}_m),\ d_L(t, \bar{s}_m) - \tilde{r}\bigr\}, \quad \tilde{r} \in \tilde{R},$$

and the support strings are given by

$$(s^*, t^*) = \mathop{\arg\min}_{(s,t) \in X_m(\bar{s}_m, r^*) \times Y_n(\bar{s}_m, r^*)} \min\bigl\{r^* - d_L(s, \bar{s}_m),\ d_L(t, \bar{s}_m) - r^*\bigr\}.$$

To search for the set $\tilde{R}$, it is sufficient to examine only those radii between $\max_{1 \le i \le m}\{d_L(s_i, \bar{s}_m)\}$ and $\min_{1 \le i \le n}\{d_L(t_i, \bar{s}_m)\} - 1$.
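The two steps can be summarized by a short sketch that works directly on the Levenshtein distances from the examples to the centre string (computed, for instance, with the `lev` helper above); the function name and the handling of degenerate cases are illustrative assumptions of this sketch.

```python
def choose_radius(dp, dn):
    """Radius selection for the possibly inseparable case (Steps 1 and 2).

    dp, dn: Levenshtein distances from the positive / negative examples to
    the centre string.  Strings lying exactly on the sphere are classified
    as positive, following the convention of the text.
    """
    candidates = range(1, max(dp + dn) + 1)          # r is a positive integer

    def n_correct(r):
        return sum(d <= r for d in dp) + sum(d > r for d in dn)

    best = max(n_correct(r) for r in candidates)
    R_tilde = [r for r in candidates if n_correct(r) == best]    # Step 1

    def margin(r):
        # distance from the sphere to the nearest correctly classified string
        gaps = [r - d for d in dp if d <= r] + [d - r for d in dn if d > r]
        return min(gaps) if gaps else 0

    return max(R_tilde, key=margin)                              # Step 2
```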

Let $N$ and $\ell$ be the number and the maximum length of the input strings, respectively. The time complexity is $O(N\ell)$ for constructing the centre string of a discriminant sphere, $O(N\ell^2)$ for computing the Levenshtein distances from all input strings to the centre string, $O(N\ell)$ for computing the number $\sharp X_m(\bar{s}_m, r) + \sharp Y_n(\bar{s}_m, r)$ for all candidates for $r$ (because there exist $O(\ell)$ candidates for $r$), $O(N\ell)$ for searching for $\tilde{r}$ and $O(N\ell)$ for searching for $r^*$. Therefore, the total time complexity for constructing the classifier is $O(N\ell^2)$.

4. Limit theorems in A*

In §3, we constructed a statistical learning machine that classifies string data by forming a direct sum decomposition of A*. In this section, we demonstrate a generalization of theorem 2 described in the electronic supplementary material of [3], which will be applied to an asymptotic analysis of the generalization error of our learning machine in §5. We also derive several corollaries from this generalization. The notation and definitions of the terms used in this section are provided in §S2 of the electronic supplementary material for this paper. In the following sections, the alphabet is $A = \{a_1, \ldots, a_{z-1}\}$. We set $a_z = e$ for the empty letter $e$ and refer to $\bar{A} = A \cup \{e\} = \{a_1, \ldots, a_z\}$ as the extended alphabet. We denote the set of all strings on $\bar{A}$ by $\bar{A}^*$. Roughly speaking, $(\Omega, \mathcal{F}, P)$ is an underlying probability space, and $M(\Omega, \bar{A})$ and $M(\Omega, A)$ are the sets of random letters and of random strings, respectively. $\mu(\alpha_1, \ldots, \alpha_n)$ denotes the consensus letter of random letters $\alpha_1, \ldots, \alpha_n$, and $\mu(\sigma_1, \ldots, \sigma_n)$ and $\kappa(\sigma_1, \ldots, \sigma_n)$ represent the consensus sequence and variance of random strings $\sigma_1, \ldots, \sigma_n$, respectively. $[M(\Omega, \bar{A})^n]$ and $[M(\Omega, A)^n]$ are the sets of $n$-tuples of random letters the consensus letter of which is uniquely determined and of $n$-tuples of random strings the consensus sequence of which is uniquely determined, respectively.

Let $\{\alpha_i : i \in \mathbb{Z}_+\} \subset M(\Omega, \bar{A})$. We set

$$p(i, h) = P(\{\omega \in \Omega : \alpha_i(\omega) = a_h\}), \qquad \bar{p}(h, n) = \frac{1}{n}\sum_{i=1}^{n} p(i, h)$$

for each $h = 1, \ldots, z$. $p(i, h)$ represents the probability that the $i$th random letter realizes the $h$th letter of the extended alphabet $\bar{A}$, and $\bar{p}(h, n)$ represents the average probability that the $h$th letter of $\bar{A}$ is observed when $n$ observations are made. For a statement $S$, ‘$S$ a.s.’ represents that $S$ holds with probability one. First, theorem 1 from the electronic supplementary material of [3] can be generalized as follows.

Theorem 4.1 —

We suppose that $\{\alpha_i : i \in \mathbb{Z}_+\} \subset M(\Omega, \bar{A})$ (i) satisfies $(\alpha_1, \ldots, \alpha_n) \in [M(\Omega, \bar{A})^n]$ for each $n \in \mathbb{Z}_+$, and (ii) is independent. If (iii) $\iota = \arg\max_{1 \le h \le z} \bar{p}(h, n)$ is uniquely determined independent of $n$, then there exists $n_0 \in \mathbb{Z}_+$ such that if $n \ge n_0$, then

$$\mu(\alpha_1, \ldots, \alpha_n) = a_\iota \quad \text{a.s.}$$

holds.

Proofs of all results described in this paper are provided in §S1 of the electronic supplementary material. In theorem 4.1, the independence of the random letters $\alpha_1, \ldots, \alpha_n$ is assumed, but an identical distribution is not. Therefore, the consensus letters can vary among the distributions of $\alpha_1, \ldots, \alpha_n$. Theorem 4.1 states that even if the distributions of $\alpha_1, \ldots, \alpha_n$ do not have an identical consensus letter, the consensus letter $\mu(\alpha_1, \ldots, \alpha_n)(\omega)$ of the observed letters converges to the letter $a_\iota$ under the above conditions. Even if $\alpha_1, \ldots, \alpha_n$ do not have an identical distribution, by the definition of $a_\iota$, we have

$$m(\alpha_1) = \cdots = m(\alpha_n) = a_\iota$$

if α1,…,αn have an identical consensus letter. Thus, theorem 1 from the electronic supplemental material of [3] is a special case of theorem 4.1.
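The following small simulation is not part of the paper but illustrates theorem 4.1: independent, non-identically distributed random letters are drawn from an extended alphabet, and the sample majority letter is compared with $a_\iota$, the letter maximizing the averaged probabilities $\bar{p}(h, n)$; the two agree once $n$ is large. The alphabet and the two letter distributions below are arbitrary choices.

```python
import random
from collections import Counter

random.seed(0)
alphabet = ["a", "c", "g", "t", "e"]   # 'e' plays the role of the empty letter

# The i-th letter is drawn from one of two distributions, so the letters are
# independent but not identically distributed; both have their mode at 'a',
# hence iota is uniquely determined and independent of n (condition (iii)).
dists = [[0.40, 0.25, 0.20, 0.10, 0.05],
         [0.30, 0.28, 0.22, 0.15, 0.05]]

for n in (10, 100, 10_000):
    sample = [random.choices(alphabet, dists[i % 2])[0] for i in range(n)]
    majority = Counter(sample).most_common(1)[0][0]
    p_bar = [sum(dists[i % 2][h] for i in range(n)) / n for h in range(len(alphabet))]
    a_iota = alphabet[p_bar.index(max(p_bar))]
    print(n, majority, a_iota)   # the two letters coincide for large n (here 'a')
```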

Let $\{\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\} : i \in \mathbb{Z}_+\} \subset M(\Omega, A)$. We set

$$p(i, j, h) = P(\{\omega \in \Omega : \alpha_{ij}(\omega) = a_h\}), \qquad \bar{p}(j, h, n) = \frac{1}{n}\sum_{i=1}^{n} p(i, j, h)$$

for each h=1,…,z.

Corollary 4.2 —

If (i) $\{\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\} : i \in \mathbb{Z}_+\} \subset M(\Omega, A)$ satisfies $(\sigma_1, \ldots, \sigma_n) \in [M(\Omega, A)^n]$ for each $n \in \mathbb{Z}_+$, (ii) $\{\alpha_{ij} : i \in \mathbb{Z}_+\}$ is independent for each $j \in \mathbb{Z}_+$ and (iii) $\iota(j) = \arg\max_{1 \le h \le z} \bar{p}(j, h, n)$ is uniquely determined independent of $n$, then there exists $n_0 \in \mathbb{Z}_+$ such that if $n \ge n_0$, then

$$\mu(\sigma_1, \ldots, \sigma_n) = \{a_{\iota(j)} : j \in \mathbb{Z}_+\} \quad \text{a.s.}$$

holds.

Theorem 2 from the electronic supplementary material of [3] can be obtained as a special case of corollary 4.2. We denote almost sure convergence by $\xrightarrow{\text{a.s.}}$ and the expectation of a random variable $X$ by $E[X]$. Using corollary 4.2, theorem 3 from the electronic supplementary material of [3] is extended as follows.

Corollary 4.3 —

We suppose that (i) $\{\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\} : i \in \mathbb{Z}_+\} \subset M(\Omega, A)$ satisfies $(\sigma_1, \ldots, \sigma_n) \in [M(\Omega, A)^n]$ for each $n \in \mathbb{Z}_+$ and that (ii) $\{\sigma_i : i \in \mathbb{Z}_+\}$ is independent. If (iii) $\iota(j) = \arg\max_{1 \le h \le z} \bar{p}(j, h, n)$ is uniquely determined independent of $n$, then we have

$$\kappa(\sigma_1, \ldots, \sigma_n) \xrightarrow{\text{a.s.}} \frac{1}{n}\sum_{i=1}^{n} E\bigl[d_L(\sigma_i, \{a_{\iota(j)} : j \in \mathbb{Z}_+\})\bigr]$$

as $n \to \infty$. More strictly than under condition (iii), if (iv) $\{\sigma_i\} \subset [M(\Omega, A)]$ holds and (v) $\{\sigma_i\}$ have an identical family of finite-dimensional distributions, then we have

$$\kappa(\sigma_1, \ldots, \sigma_n) \xrightarrow{\text{a.s.}} v(\sigma_1)$$

as $n \to \infty$.

Because a consensus sequence of strings is a majority vote, unlike a mean of real vectors, the distributions of random letters that compose a consensus sequence converge to the Dirac measures for letters of the extended alphabet, rather than to distributions such as the normal distribution, under the conditions of corollary 4.2.

Corollary 4.4 —

We suppose that $\{\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\} : i \in \mathbb{Z}_+\} \subset M(\Omega, A)$ satisfies the conditions of corollary 4.2. We denote the Dirac measure for the letter $a_{\iota(j)} \in \bar{A}$ such that $\iota(j) = \arg\max_{1 \le h \le z} \bar{p}(j, h, n)$ by $\delta_{\iota(j)}$ ($\iota(j)$ is independent of $n$ and unique). Then, there exists $n_0 \in \mathbb{Z}_+$ such that if $n \ge n_0$, then the sequence of distributions of the random letters that compose $\mu(\sigma_1, \ldots, \sigma_n)$ is equal to $\{\delta_{\iota(j)} : j \in \mathbb{Z}_+\}$.

Let $\{\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\} : i \in \mathbb{Z}_+\} \subset M(\Omega, A)$. We assume that $\{\sigma_i : i \in \mathbb{Z}_+\}$ is independent and that $\{\alpha_{ij} : i \in \mathbb{Z}_+\}$ is identically distributed (therefore, $p(i, j, h)$ is independent of $i$) for each $j \in \mathbb{Z}_+$. Then, noting lemmas 2 and 3 from the supplementary material of [3] and the proof of theorem 4.1, we observe that if there exists $j_0 \in \mathbb{Z}_+$ such that $\arg\max_{1 \le h \le z} p(i, j_0, h)$ has two or more elements, then a consensus letter of $\alpha_{1 j_0}, \ldots, \alpha_{n j_0}$, and consequently a consensus sequence of $\sigma_1, \ldots, \sigma_n$, is not determined with probability one as $n \to \infty$. Thus, if there exists $j_0 \in \mathbb{Z}_+$ such that $\{\alpha_{i j_0} : i \in \mathbb{Z}_+\}$ has the uniform distribution on $\bar{A}$, then a consensus sequence of $\sigma_1, \ldots, \sigma_n$ is not determined with probability one as $n \to \infty$, even if $\{\sigma_i : i \in \mathbb{Z}_+\}$ is independent.

5. Asymptotic optimality of the proposed learning machine

As described in §1, in the conventional framework of converting strings into numerical vectors using a string kernel and subsequently applying a classifier working in a numerical vector space to the vectors, it is impossible to theoretically evaluate the generalization error of the classifier for string classification, because conversion using a string kernel is not bijective. Consequently, to evaluate the performance of a classifier for string data, we have no option but to apply the classifier to certain datasets and repeat the cross-validation. In this section, by applying corollary 4.2 demonstrated in §4 on the asymptotic behaviour of a consensus sequence of random strings, we theoretically consider whether our statistical learning machine working in A* constructed in §3 is optimal in terms of its generalization error. Generally, in classical statistics, the problems of estimation and hypothesis testing are considered under the assumption that the data are generated according to an unknown but unique population distribution. However, machine learning is typically applied to mining datasets that are considerably larger than the datasets analysed using traditional statistical methods. Therefore, it is unrealistic to suppose that such data are generated according to an unknown but unique distribution. In this section, we theoretically analyse the generalization error of our learning machine in a setting in which both positive and negative examples are generated according to an unknown number of unknown distributions.

For $f_1, f_2 \in \mathbb{Z}_+$, we suppose that the $i$th positive example $s_i$ is generated according to one of $f_1$ unknown distributions with probability functions $p_1^{(1)}, \ldots, p_1^{(f_1)}$ on A* for each $i = 1, \ldots, m$, and that the $i$th negative example $t_i$ is generated according to one of $f_2$ unknown distributions with probability functions $p_2^{(1)}, \ldots, p_2^{(f_2)}$ on A* for each $i = 1, \ldots, n$. $f_1$ and $f_2$ are also unknown. Thus, the models that generate the positive and negative examples are the finite mixture models

$$p_1(s) = \sum_{k=1}^{f_1} \pi_1^{(k)} p_1^{(k)}(s) \quad \text{and} \quad p_2(t) = \sum_{k=1}^{f_2} \pi_2^{(k)} p_2^{(k)}(t),$$

respectively, where $\sum_{k=1}^{f_i} \pi_i^{(k)} = 1$ and $\pi_i^{(1)}, \ldots, \pi_i^{(f_i)} \in (0, 1)$ for $i = 1, 2$. (For $i = 1, 2$ and $k = 1, \ldots, f_i$, $p_i^{(k)}$ is introduced as $p_\sigma$ in §S2 of the electronic supplementary material for this paper.) $D_1$ and $D_2$ represent the supports of $p_1$ and $p_2$, respectively, i.e. $D_i = \{s \in A^* : p_i(s) > 0\}$ for $i = 1, 2$. We assume that $D_1 - D_2 \neq \emptyset$ and $D_2 - D_1 \neq \emptyset$. If $D_1 \cap D_2 = \emptyset$, then the probability that the generalization error becomes zero after a finite number of learning steps is equal to one. Therefore, we consider the case of $D_1 \cap D_2 \neq \emptyset$ in the following. The generalization error $E_0(s, r)$ of a discriminant sphere $\partial U(s, r)$ is written as $E_0(s, r) = E_1(s, r) + E_2(s, r)$ for

$$E_1(s, r) = \sum_{t \in D_1 \cap U(s, r)^c} p_1(t) \quad \text{and} \quad E_2(s, r) = \sum_{t \in D_2 \cap U(s, r)} p_2(t). \tag{5.1}$$

We formally set

$$(s^{\circ}, r^{\circ}) = \mathop{\arg\min}_{(s, r) \in A^* \times \mathbb{Z}_+} E_0(s, r), \qquad r^{\circ}(s_0) = \mathop{\arg\min}_{r \in \mathbb{Z}_+} E_0(s_0, r)$$

for each $s_0 \in A^*$. $r^{\circ}(s_0)$ is the radius of a discriminant sphere that is optimal in terms of the generalization error given the centre $s_0$. We denote the relative frequencies of $t$ in $X_m$ and in $Y_n$ by $\hat{p}_1(t)$ and $\hat{p}_2(t)$, respectively, for any $t \in A^*$. We set

$$\hat{E}_1(s, r) = \sum_{t \in X_m \cap U(s, r)^c} \hat{p}_1(t) \quad \text{and} \quad \hat{E}_2(s, r) = \sum_{t \in Y_n \cap U(s, r)} \hat{p}_2(t) \tag{5.2}$$

for $s \in A^*$ and $r \in \mathbb{Z}_+$.
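Since $\hat{p}_1$ and $\hat{p}_2$ are relative frequencies, the plug-in estimates (5.2) reduce to misclassification fractions over the training samples; minimizing their sum over $r$ is exactly step 1 of §3. The sketch below assumes a Levenshtein distance routine `lev` such as the one sketched in §3; the function name is ours.

```python
def empirical_errors(positives, negatives, centre, r, lev):
    """Plug-in estimates of (5.2): the fraction of positives outside
    U(centre, r) and the fraction of negatives inside or on U(centre, r)."""
    e1 = sum(lev(s, centre) > r for s in positives) / len(positives)
    e2 = sum(lev(t, centre) <= r for t in negatives) / len(negatives)
    return e1, e2
```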

Assuming that $\bar{s}_m$ is used as the centre of a discriminant sphere, we first consider whether $r^*$ converges to an optimal radius in terms of the generalization error as our learning machine updates $r^*$ through a learning process. Note that if the positive examples are generated in a manner in which the conditions of corollary 4.2 described in the previous section are satisfied, the optimal radius is $r^{\circ}(\{a_{\iota(j)}\})$, because $\bar{s}_m$ is equal to $\{a_{\iota(j)}\}$ with probability one, given a sufficient number of positive examples.

Theorem 5.1 (Asymptotic optimality of $r^*$) —

If (i) the positive examples $s_1, \ldots, s_m$ are realizations of random strings $\sigma_1, \ldots, \sigma_m$ that are independent and distributed according to the mixture model $p_1$ and the negative examples $t_1, \ldots, t_n$ are realizations of random strings $\tau_1, \ldots, \tau_n$ that are independent and distributed according to $p_2$, (ii) $\sigma_1, \ldots, \sigma_m$ satisfy the conditions of corollary 4.2 for each $m \in \mathbb{Z}_+$ and (iii) there uniquely exists $r^{\circ}(\{a_{\iota(j)}\})$, then we have

$$r^* \xrightarrow{\text{a.s.}} r^{\circ}(\{a_{\iota(j)}\})$$

as $m, n \to \infty$. In other words, $r^*$ converges with probability one to a radius that is asymptotically optimal given $\bar{s}_m$ as the centre of a discriminant sphere.

In the proof of theorem 5.1, only the principle of minimizing the number of misclassified inputs was used; the principle of maximizing the margin was not required (see §S1 of the electronic supplementary material), because the samples of positive and negative examples accurately reflect their population distributions in the asymptotic setting. This suggests why an ordinary SVM working in $\mathbb{R}^p$ has high predictive performance in many applications: margin maximization reduces the probability of misclassifying examples in a test sample when the positive and negative examples in a training sample do not necessarily reflect their population distributions accurately, for example, because the training sample is not sufficiently large. In other words, margin maximization is a reasonable principle for classifying data in a sample in cases where the sample does not contain sufficient information on the population distribution.

We next consider the optimality of $\bar{s}_m$. We address this problem in the following setting, which models the situation in which the positive and negative examples are spherically inseparable: $D_1 \cap D_2 \neq \emptyset$ holds, and $p_1$ and $p_2$ satisfy the conditions that (i) $p_1(s)$ is monotonically non-increasing with respect to $d_L(s, \{a_{\iota(j)}\})$ on $D_1$ and (ii) there exists $d_0 \in \mathbb{Z}_+$ such that $p_2(s)$ is monotonically non-decreasing with respect to $d_L(s, \{a_{\iota(j)}\})$ on $D_2' = \{s \in D_2 : d_L(s, \{a_{\iota(j)}\}) \le d_0\}$ and $D_1 \subset D_2'$. We assume that $d_0$ is sufficiently large and consider only discriminant spheres that are disjoint with $(D_1 \cup D_2')^c$. We do not assume that $r^*$ is chosen as the radius of a discriminant sphere. Note that $\sharp U(s, r)$ increases monotonically with respect to the length of $s$ and to $r$. We denote the sets of pairs $(s', r') \in A^* \times \mathbb{Z}_+$ such that $\sharp U(s', r') = \sharp U(\bar{s}_m, r)$, $\sharp U(s', r') \le \sharp U(\bar{s}_m, r)$ and $\sharp U(s', r') \ge \sharp U(\bar{s}_m, r)$ by $B_0(r)$, $B_1(r)$ and $B_2(r)$, respectively, for any $r \in \mathbb{Z}_+$.

Theorem 5.2 (Asymptotic optimality of $\bar{s}_m$) —

In the setting described above, if the random strings $\sigma_1, \ldots, \sigma_m$ that generate the positive examples satisfy the conditions of corollary 4.2 for each $m \in \mathbb{Z}_+$, then we have

$$E_j(\bar{s}_m, r) \le E_j(s', r') \quad \text{a.s.}$$

as $m, n \to \infty$ for any $r \in \mathbb{Z}_+$, $(s', r') \in B_j(r)$ and $j \in \{0, 1, 2\}$. In other words, for any radius $r$, a discriminant sphere with centre $\bar{s}_m$ is asymptotically optimal in the class of discriminant spheres that are equal to it in size, and it has the asymptotically minimum probability of false-negatives among discriminant spheres of equal or smaller size and the asymptotically minimum probability of false-positives among discriminant spheres of equal or larger size.

Let $\sigma_i = \{\alpha_{ij} : j \in \mathbb{Z}_+\}$ be the random string that generates the $i$th positive example for each $i = 1, \ldots, m$. If $\sigma_1, \ldots, \sigma_m$ are independent, then $\alpha_{1j}, \ldots, \alpha_{mj}$ are also independent for each $j \in \mathbb{Z}_+$, but the converse is not true. In theorem 5.2, the independence of $\alpha_{1j}, \ldots, \alpha_{mj}$ is assumed for each $j \in \mathbb{Z}_+$, but the independence of $\sigma_1, \ldots, \sigma_m$ is not. Furthermore, the independence of the random strings that generate the negative examples is not assumed either. If we choose two different points $x, x'$ in $\mathbb{R}^p$ and then form two hyperspheres with centres $x$ and $x'$ and equal radii, the measures (volumes for $p = 3$) of the hyperspheres are equal. However, if we choose two different strings $s, s'$ in A* and then form two spheres with centres $s$ and $s'$ and equal radii with respect to the Levenshtein distance, the numbers of strings that the spheres contain are not necessarily equal, which is an essential reason why the statement of theorem 5.2 is divided into the cases $j = 0, 1, 2$.

6. Applications to biological sequence analysis

Our statistical learning machine classifies strings in an almost optimal manner under the conditions of theorems 5.1 and 5.2 in §5 when the training samples are sufficiently large. However, large training samples are not necessarily obtainable in all problems of classifying strings. How accurately does our machine classify strings in such cases? In this section, we examine the usefulness of our learning machine in practical data analysis by applying it to analysing amino acid sequences of proteins and nucleotide sequences of RNAs, including cases in which sufficiently large samples cannot be obtained.

(a). Application to predicting protein–protein interactions

A protein is a polymer of 20 types of amino acids and can be represented as a string on an alphabet A={a,…,z}−{b,j,o,u,x,z} composed of 20 letters. Predicting protein–protein interactions is one of the most important problems in bioinformatics, because most proteins fulfill their functions after forming a complex with other proteins. A domain of a protein generally interacts with multiple domains of other proteins. However, only a few proteins have domains that interact with a number of domains of other proteins and function as a hub in a protein–protein interaction network [24,25]. Thus, large numbers of positive examples cannot necessarily be obtained in the problem of predicting protein–protein interactions. We formulated this prediction problem as the problem of classifying domains of proteins into two classes of domains that interact and do not interact with a given domain and applied our machine working in A* to this problem. We examined the classification accuracy of our machine through comparisons with the SVM with the two-spectrum kernel (the dot product of spectral representations of two strings [13]), the SVM with the spatial kernel with t=2,k=1, and d=2 (see [26] for these parameters), and the one-nearest-neighbour method that employs outputs from the spatial kernel as dissimilarities.

We first prepared positive examples by using the three-dimensional interacting domains (3did) database [27], which contains high-resolution, three-dimensional structural data for domain–domain interactions obtained from the Protein Data Bank (PDB) [28]. The PDB includes three-dimensional structures of proteins and protein complexes obtained from experiments such as X-ray crystal structural analysis and nuclear magnetic resonance spectroscopy. We randomly selected amino acid sequences of 20 protein domains, ‘1il1 A:134–215’ (which denotes the amino acid subsequence from residue 134 to 215 in chain A of PDB ID 1il1), ‘2bnq E:127–220’, ‘1inq B:1011–1092’, ‘1it9 H:133–216’, ‘1it9 L:123–210’, ‘1ikv A:317–419’, ‘1cff A:84–145’, ‘1iza A:1–20’, ‘1ifh L:119–206’, ‘1p2c F:1501–1627’, ‘1j5o B:238–307’, ‘1mcz A:190–325’, ‘1at1 B:101–152’, ‘1ikv B:1238–1307’, ‘1lw6 E:27–275’, ‘3tu4 F:24–93’, ‘1jgw H:11–137’, ‘1z8u B:6–106’, ‘4e5x A:1–179’ and ‘4e5x B:11–92’. For each of the amino acid sequences of these 20 domains, we collected sequences of interacting domains from 3did without any redundancy in sequences. Sequences having extremely different lengths relative to the sequences collected from the database were not included in the samples of positive examples. It should be noted that the same amino acid sequence can be included in different entries of the PDB.

We next consider the procedure for preparing negative examples. Real sequences that are presumed not to be positive examples, as well as artificial, randomly generated sequences, have been used as negative examples in the development and validation of classifiers for string and sequence data. For example, in promoter prediction in Escherichia coli, Horton & Kanehisa [29] used sequences randomly chosen from coding regions as the negative examples. However, as Larsen et al. [30] indicated, the use of such negative examples is very different from a real biological discrimination problem. For the problem of predicting the interaction between miRNA and mRNA, recent studies [31–33] used randomly generated artificial sequences as the negative examples. However, as other recent studies [34–36] demonstrated experimentally, such sequences often interact with miRNA, and therefore it is uncertain whether they are truly negative examples. Even if the randomly generated sequences are real negative examples, they may be unrealistically different from positive examples. In this case, as Bandyopadhyay & Mitra [37] indicated, the positive and negative examples are easily distinguishable, and a classifier that yields poor performance on other independent datasets may be produced.

Thus, in this study, we considered a procedure for preparing negative examples that were somewhat similar to the positive examples but were not likely to interact based on biophysical chemistry data. Bogan & Thorn [38] compiled a database of 2325 alanine mutants of heterodimeric protein–protein complexes and examined which amino acids are located in the interfaces of protein–protein complexes with a relatively high frequency. Ma et al. [39] identified amino acids that are located in the interfaces with a high frequency on the basis of 1629 two-chain interface entries in the PDB. Based on the results of these studies, Arg, Asp, Trp and Tyr are located at the interfaces of protein–protein interactions with high frequency, whereas Lys and Glu are located at interfaces with low frequency. Arg and Lys have a positive charge, and Arg tends to be located at interfaces; conversely, Lys does not tend to be located at interfaces. By contrast, Asp and Glu are negatively charged; whereas Asp tends to be located at interfaces, Glu does not. These observations imply that amino acids in which the side chain has relatively low entropy tend to be located at interfaces. π electrons in the side chains of aromatic amino acids interact strongly with positively charged amino acids [40], and, for this reason, aromatic amino acids such as Trp and Tyr are located at interfaces with high frequency. Therefore, we generated a negative example from each positive example using the following procedure.

Step (i). We first substituted Glu for Arg and Lys for Asp, Trp and Tyr in a positive example with a probability of 0.5 (because if all the Arg, Asp, Trp and Tyr were replaced, the resulting negative examples would not contain the four types of letters that represent these amino acids). According to van Holde et al. [41], the frequencies of Arg, Asp, Trp and Tyr in common proteins are 5.1%, 5.3%, 1.4% and 3.2%, respectively. Hence, in this step, approximately 7.5% of the letters in the positive example that represented amino acids that tend to be located in interfaces were replaced with letters representing amino acids that do not tend to be located in interfaces.

Step (ii). Next, we trisected a positive example with the substitutions in step (i), chose a letter in the intermediate substring at random, and transposed the order of the first half substring, from the first letter in the substituted positive example to the chosen letter, and the second half substring, composed of the other letters in the substituted positive example.
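A minimal sketch of this two-step construction of a negative example is given below, assuming the usual one-letter residue codes (Arg = R, Glu = E, Lys = K, Asp = D, Trp = W, Tyr = Y); the exact trisection boundaries and the random-number interface are illustrative choices of this sketch.

```python
import random

# Interface-preferring residues and their non-preferring replacements.
SUBS = {"R": "E", "D": "K", "W": "K", "Y": "K"}

def make_negative(seq: str, rng: random.Random) -> str:
    """Generate a negative example from a positive amino acid sequence."""
    # Step (i): replace each occurrence of Arg, Asp, Trp or Tyr with
    # probability 0.5 by a residue that does not prefer interfaces.
    s = "".join(SUBS[a] if a in SUBS and rng.random() < 0.5 else a for a in seq)
    # Step (ii): choose a cut point in the middle third of the substituted
    # sequence and transpose the two flanking substrings.
    n = len(s)
    cut = rng.randrange(n // 3, 2 * n // 3) + 1   # end of the 'first half'
    return s[cut:] + s[:cut]

rng = random.Random(0)
print(make_negative("MKRWDYARRDTWYLK", rng))
```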

The sum of the numbers of positive and negative examples is given as N in table 1 (the number of positive examples = the number of negative examples = N/2, from the procedure for preparing the negative examples described above). Our examination included cases ranging from an insufficient sample size of N=20 up to a large sample size of N=126. We constructed the above-mentioned four machines for each of the 20 domains. We evaluated the performance of the machines using the four indices accuracy = (TP+TN)/N, precision = TP/(TP+FP), recall = TP/(TP+FN) and F-measure = 2×precision×recall/(precision+recall), where TP, FP, TN and FN represent the numbers of true-positives, false-positives, true-negatives and false-negatives, respectively. After randomly dividing each of the samples of positive and negative examples into three subsamples of equal size, we used the union of two subsamples as a training sample and the remaining subsample as a test sample to compute the above indices.
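For reference, the four indices and the random threefold split can be written as follows; the helper names are ours, and the split is applied to the samples of positive and negative examples separately, as described above.

```python
import random

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-measure as defined in the text."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f

def threefold_splits(examples, rng: random.Random):
    """Randomly divide a sample into three subsamples of (nearly) equal size;
    each subsample serves once as the test sample, the union of the other two
    as the training sample."""
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    folds = [idx[k::3] for k in range(3)]
    for k in range(3):
        train = [examples[i] for j in range(3) if j != k for i in folds[j]]
        test = [examples[i] for i in folds[k]]
        yield train, test
```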

Table 1.

Results of the simulation experiments on the prediction of protein–protein interactions. The first row in each of the 10 panels presents ‘the PDB ID of the protein the chain: the initial residue number in the interaction site–the last residue number’. N and l¯ denote the sample size and the mean length of the positive and negative examples, respectively. MMC A*, SVM spec, SVM spat and NN spat represent the developed learning machine working in A*, the SVM with the spectrum kernel, the SVM with the spatial kernel and the nearest-neighbour method that uses outputs from the spatial kernel as dissimilarities, respectively.

MMC A* SVM spec SVM spat NN spat MMC A* SVM spec SVM spat NN spat
1il1 A:134–215 N=20 l¯=87.20 2bnq E:127–220 N=22 l¯=79.45
accuracy 0.9433 0.8267 0.8570 0.7163 1.0000 0.9475 1.0000 0.9864
precision 0.9500 0.8833 0.8538 0.6897 1.0000 1.0000 1.0000 0.9758
recall 0.9533 0.7933 0.8620 0.7900 1.0000 0.8950 1.0000 1.0000
F-measure 0.9516 0.8359 0.8554 0.7316 1.0000 0.9446 1.0000 0.9872
1inq B:1011–1092 N=32 l¯=83.63 1it9 H:133–216 N=36 l¯=87.28
accuracy 0.9140 0.8380 0.8475 0.5318 0.9817 0.8267 0.9194 0.7796
precision 0.9105 0.8860 0.8180 0.5275 0.9914 0.9971 0.8937 0.7600
recall 0.9520 0.8080 0.8963 0.6723 0.9733 0.6567 0.9533 0.8186
F-measure 0.9308 0.8452 0.8547 0.5907 0.9823 0.7919 0.9224 0.7875
1it9 L:123–210 N=40 l¯=84.05 1ikv A:317–419 N=46 l¯=171.83
accuracy 0.8900 0.8029 0.7840 0.5506 1.0000 1.0000 1.0000 0.8853
precision 0.8951 0.9527 0.7611 0.5443 1.0000 1.0000 1.0000 1.0000
recall 0.9086 0.6543 0.8300 0.7413 1.0000 1.0000 1.0000 0.7641
F-measure 0.9018 0.7758 0.7927 0.6264 1.0000 1.0000 1.0000 0.8653
1cff A:84–145 N=56 l¯=20.57 1iza A:1–20 N=88 l¯=25.02
accuracy 0.8211 0.7944 0.6225 0.3816 0.9727 0.8160 0.8945 0.7364
precision 0.8846 0.8698 0.6128 0.4129 0.9782 0.9669 0.8575 0.6857
recall 0.7511 0.7311 0.6621 0.5107 0.9693 0.6560 0.9477 0.9379
F-measure 0.8124 0.7944 0.6353 0.4555 0.9737 0.7817 0.9001 0.7920
1ifh L:119–206 N=96 l¯=82.08 1p2c F:1501–1627 N=126 l¯=111.83
accuracy 0.9738 0.9125 0.9567 0.7611 0.9562 0.9481 0.9649 0.8184
precision 0.9688 0.9663 0.9450 0.7218 0.9546 0.9841 0.9615 0.7952
recall 0.9825 0.8575 0.9700 0.8691 0.9610 0.9114 0.9689 0.8585
F-measure 0.9756 0.9087 0.9572 0.7882 0.9578 0.9464 0.9651 0.8254
1j5o B:238–307 N=20 l¯=171.60 1mcz A:190–325 N=20 l¯=169.00
accuracy 1.0000 1.0000 0.9990 0.9680 1.0000 0.9567 1.0000 1.0000
precision 1.0000 1.0000 0.9982 1.0000 1.0000 1.0000 1.0000 1.0000
recall 1.0000 1.0000 1.0000 0.9360 1.0000 0.9133 1.0000 1.0000
F-measure 1.0000 1.0000 0.9990 0.9663 1.0000 0.9547 1.0000 1.0000
1at1 B:101–152 N=22 l¯=93.18 1ikv B:1238–1307 N=30 l¯=114.33
accuracy 1.0000 0.9625 0.9982 0.9864 1.0000 0.9060 0.9193 0.9660
precision 1.0000 1.0000 0.9967 0.9750 1.0000 1.0000 0.8846 0.9376
recall 1.0000 0.9250 1.0000 1.0000 1.0000 0.8120 0.9667 0.9987
F-measure 1.0000 0.9610 0.9983 0.9870 1.0000 0.8962 0.9232 0.9671
1lw6 E:27–275 N=38 l¯=63.0 3tu4 F:24–93 N=40 l¯=74.20
accuracy 0.9800 0.9317 0.9347 0.9251 0.9986 0.9771 0.9900 0.9372
precision 1.0000 0.9810 0.9117 0.8905 1.0000 1.0000 0.9810 0.9177
recall 0.9600 0.8867 0.9653 0.9716 0.9971 0.9543 1.0000 0.9650
F-measure 0.9796 0.9315 0.9366 0.9284 0.9985 0.9766 0.9902 0.9398
1jgw H:11–137 N=52 l¯=252.77 1z8u B:6–106 N=66 l¯=105.00
accuracy 1.0000 0.9878 1.0000 0.9722 1.0000 0.9318 0.9906 0.9424
precision 1.0000 1.0000 1.0000 0.9479 1.0000 1.0000 0.9982 0.9503
recall 1.0000 0.9756 1.0000 1.0000 1.0000 0.8636 0.9830 0.9339
F-measure 1.0000 0.9876 1.0000 0.9732 1.0000 0.9268 0.9904 0.9419
4e5x A:1–179 N=80 l¯=110.25 4e5x B:11–92 N=126 l¯=178.84
accuracy 0.9754 0.9669 0.9235 0.6077 1.0000 1.0000 1.0000 0.9710
precision 0.9777 0.9895 0.9207 0.5942 1.0000 1.0000 1.0000 0.9867
recall 0.9754 0.9446 0.9275 0.7011 1.0000 1.0000 1.0000 0.9549
F-measure 0.9765 0.9665 0.9238 0.6420 1.0000 1.0000 1.0000 0.9703

We calculated the mean accuracy, precision, recall and F-measure by repeating this process 50 times. The mean values obtained in this procedure are shown in table 1. Furthermore, values obtained by averaging these mean accuracies, precisions, recalls and F-measures over the 20 protein domains, together with their standard deviations, are provided in table S1 in §S3 of the electronic supplementary material. Moreover, we tested the significance of the differences between the mean accuracies, precisions, recalls and F-measures for each pair of the four methods using the Wilcoxon signed-rank test. The results are shown in table S2. The differences are statistically significant, except for those between the mean precisions of the developed learning machine and the SVM with the spectrum kernel and between the mean recalls of the SVM with the spectrum kernel and the nearest-neighbour method. The mean precision of the SVM with the spectrum kernel is higher than that of the SVM with the spatial kernel, whereas the SVM with the spatial kernel scores higher on the other three indices. Therefore, the developed learning machine has the highest predictive power, and the nearest-neighbour method has the lowest predictive power; the predictive powers of the SVMs with the spectrum kernel and with the spatial kernel are almost equal.

(b). Application to classifying RNAs by the secondary structure

Next, we applied the learning machine developed in this study, the SVM with the spectrum kernel, the SVM with the spatial kernel and the nearest-neighbour method that uses outputs from the spatial kernel as dissimilarities to predicting whether RNAs belong to a family with similar secondary structures based on nucleotide sequences. We chose 10 RNA families registered in the Rfam database [42] at random. Each of the families is composed of RNAs that have similar secondary structures. We used sequence data of the RNAs in each family as positive examples after deleting gaps of the alignment in the sequences. Sequences that are not likely to form secondary structures similar to those of the RNAs in each family are needed for negative examples. Therefore, we prepared a negative sequence from each positive sequence according to step (ii) of the procedure described in §6a to change the positions of complementary pairs of nucleotides in the RNA sequences. For each family, a sample of negative sequences was prepared in the above-mentioned manner and sequences in other families were not used as negative examples.

As in §6a, we calculated the mean accuracy, precision, recall and F-measure by repeating the threefold cross-validation 50 times. The obtained mean values are shown in table 2. Furthermore, the means and standard deviations of the mean accuracies, precisions, recalls and F-measures of the four methods over the 10 RNA families are provided in table S1. Moreover, table S2 shows the results of testing the significance of the differences in the mean accuracies, precisions, recalls and F-measures between each pair of the four methods using the Wilcoxon signed-rank test. The differences in all four indices between the SVMs with the spectrum kernel and with the spatial kernel are not statistically significant. The differences between the other pairs of methods are statistically significant, except for those in the mean recalls between the SVM with the spectrum kernel and the nearest-neighbour method and between the SVM with the spatial kernel and the nearest-neighbour method. Therefore, from these tables, it is concluded that the developed learning machine has the highest predictive power, followed by the SVMs with the spectrum kernel and with the spatial kernel, and that the nearest-neighbour method has the lowest predictive power, a result similar to that obtained in §6a.

Table 2.

Results of the simulation experiments on the classification of RNAs. The first row in each of the five panels presents the ID of the RNA family. N, l¯, MMC A*, SVM spec, SVM spat, and NN spat are the same as in table 1.

MMC A* SVM spec SVM spat NN spat MMC A* SVM spec SVM spat NN spat
RF01987 N=24 l¯=78.33 RF02116 N=36 l¯=166.22
accuracy 1.0000 0.7625 0.5983 0.4630 0.9833 0.6283 0.5044 0.4413
precision 1.0000 0.7763 0.5998 0.4774 0.9943 0.6496 0.5043 0.4618
recall 1.0000 0.8150 0.6083 0.7683 0.9733 0.6833 0.4833 0.6364
F-measure 1.0000 0.7952 0.6016 0.5859 0.9820 0.6660 0.4905 0.5323
RF02142 N=42 l¯=181.52 RF00216 N=46 l¯=302.26
accuracy 0.9986 0.6486 0.7271 0.5726 1.0000 0.8175 0.8248 0.5996
precision 1.0000 0.8376 0.7394 0.5599 1.0000 0.9400 0.8073 0.5663
recall 0.9971 0.4514 0.7067 0.6353 1.0000 0.7000 0.8557 0.5105
F-measure 0.9985 0.5866 0.7203 0.5908 1.0000 0.8024 0.8302 0.5182
RF00228 N=46 l¯=571.39 RF02495 N=68 l¯=76.74
accuracy 1.0000 0.6500 0.7948 0.4696 0.9991 0.7445 0.8715 0.5825
precision 1.0000 0.6355 0.8062 0.4794 1.0000 0.8521 0.8785 0.5581
recall 1.0000 0.8175 0.7817 0.7696 0.9982 0.6164 0.8653 0.7153
F-measure 1.0000 0.7151 0.7917 0.5882 0.9990 0.7153 0.8706 0.6257
RF00042 N=74 l¯=90.05 RF00061 N=158 l¯=246.95
accuracy 1.0000 0.8333 0.8581 0.4922 0.6227 0.5131 0.4189 0.4842
precision 1.0000 0.8998 0.8600 0.4722 0.8550 0.5134 0.4212 0.4883
recall 1.0000 0.7633 0.8595 0.1493 0.3100 0.5277 0.4461 0.7100
F-measure 1.0000 0.8259 0.8583 0.2054 0.4438 0.5205 0.4326 0.5758
RF01794 N=182 l¯=162.14 RF00229 N=184 l¯=251.73
accuracy 0.9463 0.6103 0.8625 0.5443 1.0000 0.6348 0.8433 0.4682
precision 0.9551 0.6168 0.8259 0.5319 1.0000 0.6613 0.8434 0.4707
recall 0.9393 0.6013 0.9196 0.6229 1.0000 0.5587 0.8443 0.5471
F-measure 0.9453 0.6090 0.8699 0.5722 1.0000 0.6057 0.8435 0.5051

7. Conclusion

String data analysis encompasses a wide range of tasks, such as string comparison, quantifying string similarity, string clustering and identifying string features. Among these various problems, in this study we addressed the problem of classifying strings. As described in §1, it is impossible to theoretically evaluate, using probability theory, the performance of learning machines that convert strings into numerical vectors using string kernels, which are not bijective, and subsequently analyse the vectors. Therefore, to evaluate the performance of such learning machines, we have no option but to rely on the numerical approach of repeated cross-validation. However, as is often pointed out, the results of performance evaluations conducted in this manner vary greatly depending on the datasets used. Addressing these problems by applying the probability theory on A* developed in [3] is the main purpose of this study, and we conducted the study placing special emphasis on the theoretical aspects of string data analysis.

Many methods for the above-mentioned tasks of string data analysis are based on string representations obtained with kernel-based methods. Many previous studies have shown kernel-based methods for string representation to be useful in practical string data analysis; see, for example, [26,43–52] for important recent studies on kernel-based methods for string data. However, it is impossible to evaluate the performance of methods based on string kernels by applying traditional probability theory, which has been constructed on spaces such as the Euclidean space $\mathbb{R}^p$ and the Hilbert space $L^2$, while considering that a sample of observed strings is part of a population generated according to a probability law. In this study, we addressed the problem of supervised classification of strings and obtained theoretical results on the performance of the learning machine that works in A* by applying the limit theorems in probability theory on A* demonstrated in this paper. In [53], we approached the problem of unsupervised clustering of strings by introducing a parametric probability distribution on A* and developing a theory of the EM algorithm for a mixture model of such distributions, and we demonstrated the asymptotic optimality of the constructed clustering procedure by combining it with the limit theorems obtained in this paper. It remains a future challenge to develop a method for theoretically evaluating the performance of kernel-based methods of string data analysis by extending probability theory on A* and inspecting the correspondence between observed strings and their representations by string kernels.

Supplementary Material

Supplementary Material for Maximum Margin Classifier Working in a Set of Strings
rspa20150551supp1.pdf (55.1KB, pdf)

Acknowledgements

We are grateful to the anonymous referees for their valuable comments on the manuscript.

Ethics

There are no ethical concerns with regard to this manuscript.

Data accessibility

All data used in this study are openly available from the 3did and Rfam databases. The source code for classifying strings using the learning machine that works in A* constructed in this study is available if requested.

Authors' contributions

H.K. conceived of the study, constructed the theory, performed the simulation experiments and drafted the manuscript. M.H. helped to perform the simulation experiments. T.A. helped to construct the theory.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported in part by grant-in-aid for Challenging Exploratory Research from the Japan Society for the Promotion of Science (26610037).

References

  • 1.Gusfield D. 1997. Algorithms on strings, trees, and sequences. Cambridge, UK: Cambridge University Press. [Google Scholar]
  • 2.Crochemore M, Rytter W. 2002. Jewels of stringology. Singapore: World Scientific. [Google Scholar]
  • 3.Koyano H, Kishino H. 2010. Quantifying biodiversity and asymptotics for a sequence of random strings. Phys. Rev. E 81, 061912 (doi:10.1103/PhysRevE.81.061912) [DOI] [PubMed] [Google Scholar]
  • 4.Koyano H, Tsubouchi T, Kishino H, Akutsu T. 2014. Archaeal β diversity patterns under the seafloor along geochemical gradients. J. Geophys. Res. G (Biogeosciences) 119, 1770–1788. (doi:10.1002/2014JG002676) [Google Scholar]
  • 5.Aizerman MA, Braverman EM, Rozoner LI. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837. [Google Scholar]
  • 6.Boser BE, Guyon IM, Vapnik VN. 1992. A training algorithm for optimal margin classifiers. In Proc. 5th Annu. Workshop Comput. Learn. Theory, Pittsburgh, PA, 27–29 July (ed. D Houssler), pp. 144–152. New York, NY: ACM.
  • 7.Cortes C, Vapnik VN. 1995. Support-vector networks. Mach. Learn. 20, 273–297. (doi:10.1023/A:1022627411411) [Google Scholar]
  • 8.Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. 1997. Support vector regression machines. In Adv. Neural Inf. Process. Syst. vol. 9 (eds MC Mozer, MI Jordan, T Petsche), pp. 155–161. Cambridge, MA: MIT Press.
  • 9.Vapnik VN. 1998. Statistical learning theory. London, UK: Wiley. [DOI] [PubMed] [Google Scholar]
  • 10.Haussler D. 1999. Convolution kernels on discrete structures. Santa Cruz, CA: Department of Computer Science, University of California, Santa Cruz; UCSC-CRL-99-10. [Google Scholar]
  • 11.Watkins C. 1999. Dynamic alignment kernels. Royal Holloway: Computer Science Department, University of London; CSD-TR-98-11. [Google Scholar]
  • 12.Lodhi H, Shawe-Taylor J, Cristianini N, Watkins C. 2001. Text classification using string kernel. In Adv. Neural Inf. Process. Syst. vol. 13 (eds TK Leen, TG Dietterich, V Tresp), pp. 563–569. Cambridge, MA: MIT Press.
  • 13.Leslie CS, Eskin E, Noble WS. 2002. The spectrum kernel: a string kernel for SVM protein classification. In Proc. 7th Pacific Symp. Biocomput., Lihue, HI, 3–7 January, vol. 7 (eds RB Altman, AK Dunker, L Hunter, TE Klein, K Lauderdale), pp. 566–575. Singapore: World Scientific.
  • 14.Paaß G, Leopold E, Larson M, Kindermann J, Eickeler S. 2002. SVM classification using sequences of phonemes and syllables. In Proc. 6th Eur. Conf. Principles Data Min. Knowl. Discov., Helsinki, Finland, 19–23 August (eds T Elomaa, H Mannila, H Toivonen), pp. 373–384. Berlin, Germany: Springer.
  • 15.Leslie C, Eskin E, Weston J, Noble WS. 2003. Mismatch string kernels for SVM protein classification. In Adv. Neural Inf. Process. Syst., Vancouver, Canada, 9–14 December 2002, vol. 15 (eds S Becker, S Thrun, K Obermayer), pp. 1417–1424. Cambridge, MA: MIT Press.
  • 16.Leslie C, Kuang R. 2004. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res. 5, 1435–1455. [Google Scholar]
  • 17.Vishwanathan SVN, Smola AJ. 2004. Fast kernels for string and tree matching. In Kernel methods in computational biology (eds K Tsuda, B Schölkopf, JP Vert), pp. 113–130. Cambridge, MA: MIT Press.
  • 18.Zien A, Rätsch G, Mika S, Schölkopf B, Lengauer T, Müller KR. 2000. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16, 799–807. (doi:10.1093/bioinformatics/16.9.799) [DOI] [PubMed] [Google Scholar]
  • 19.Vert JP. 2002. Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings. In Proc. 7th Pacific Symp. Biocomput., Lihue, HI, 3–7 January, vol. 7 (eds RB Altman, AK Dunker, L Hunter, TE Klein, K Lauderdale), pp. 649–660. Singapore: World Scientific.
  • 20.Saigo H, Vert JP, Ueda N, Akutsu T. 2004. Protein homology detection using string alignment kernels. Bioinformatics 20, 1682–1689. (doi:10.1093/bioinformatics/bth141)
  • 21.Li H, Jiang T. 2005. A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs. J. Comput. Biol. 12, 702–718. (doi:10.1089/cmb.2005.12.702)
  • 22.Jordan C. 1887. Cours d'analyse de l'École Polytechnique. Paris, France: Gauthier-Villars.
  • 23.Brouwer LEJ. 1911. Beweis des Jordanschen Satzes für den n-dimensionalen Raum. Math. Ann. 71, 314–319. (doi:10.1007/BF01456847)
  • 24.Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. 2000. The large-scale organization of metabolic networks. Nature 407, 651–654. (doi:10.1038/35036627)
  • 25.Jeong H, Mason SP, Barabási AL, Oltvai ZN. 2001. Lethality and centrality in protein networks. Nature 411, 41–42. (doi:10.1038/35075138)
  • 26.Kuksa PP, Pavlovic V. 2010. Spatial representation for efficient sequence classification. In 2010 20th Int. Conf. Pattern Recogn. IEEE, Istanbul, Turkey, 23–26 August, pp. 3320–3323. New York, NY: IEEE.
  • 27.Mosca R, Céol A, Stein A, Olivella R, Aloy P. 2014. 3did: a catalog of domain-based interactions of known three-dimensional structure. Nucleic Acids Res. 42, D374–D379. (doi:10.1093/nar/gkt887)
  • 28.Rose PW. et al. 2011. The RCSB protein data bank: redesigned web site and web services. Nucleic Acids Res. 39, D392–D401. (doi:10.1093/nar/gkq1021)
  • 29.Horton PB, Kanehisa M. 1992. An assessment of neural network and statistical approaches for prediction of E. coli promoter sites. Nucleic Acids Res. 20, 4331–4338. (doi:10.1093/nar/20.16.4331)
  • 30.Larsen NI, Engelbrecht J, Brunak S. 1995. Analysis of eukaryotic promoter sequences reveals a systematically occurring CT-signal. Nucleic Acids Res. 23, 1223–1230. (doi:10.1093/nar/23.7.1223)
  • 31.Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. 2004. MicroRNA targets in Drosophila. Genome Biol. 5, R1 (doi:10.1186/gb-2003-5-1-r1)
  • 32.John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. 2004. Human microRNA targets. PLoS Biol. 2, 1862–1879. (doi:10.1371/journal.pbio.0020363)
  • 33.Yousef M, Jung S, Kossenkov AV, Showe LC, Showe MK. 2007. Naïve Bayes for microRNA target predictions-machine learning for microRNA targets. Bioinformatics 23, 2987–2992. (doi:10.1093/bioinformatics/btm484)
  • 34.Lewis BP, Shih Ih, Jones-Rhoades MW, Bartel DP, Burge CB. 2003. Prediction of mammalian microRNA targets. Cell 115, 787–798. (doi:10.1016/S0092-8674(03)01018-3)
  • 35.Rodriguez A, Griffiths-Jones S, Ashurst JL, Bradley A. 2004. Identification of mammalian microRNA host genes and transcription units. Genome Res. 14, 1902–1910. (doi:10.1101/gr.2722704)
  • 36.Krek A. et al. 2005. Combinatorial microRNA target predictions. Nat. Genet. 37, 495–500. (doi:10.1038/ng1536)
  • 37.Bandyopadhyay S, Mitra R. 2009. TargetMiner: microRNA target prediction with systematic identification of tissue-specific negative examples. Bioinformatics 25, 2625–2631. (doi:10.1093/bioinformatics/btp503)
  • 38.Bogan AA, Thorn KS. 1998. Anatomy of hot spots in protein interfaces. J. Mol. Biol. 280, 1–9. (doi:10.1006/jmbi.1998.1843)
  • 39.Ma B, Elkayam T, Wolfson H, Nussinov R. 2003. Protein–protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl Acad. Sci. USA 100, 5772–5777. (doi:10.1073/pnas.1030237100)
  • 40.Burley SK, Petsko GA. 1988. Weakly polar interactions in proteins. In Advances in protein chemistry vol. 39 (eds CB Anfinsen, JT Edsall, FM Richards, DS Eisenberg), pp. 125–189. Waltham, MA: Academic Press.
  • 41.van Holde KE, Johnson WC, Ho PS. 2006. Principles of physical biochemistry. Upper Saddle River, NJ: Prentice Hall.
  • 42.Nawrocki EP. et al. 2015. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137. (doi:10.1093/nar/gku1063)
  • 43.Jaakkola T, Diekhans M, Haussler D. 1999. Using the Fisher kernel method to detect remote protein homologies. In Proc. 7th Int. Conf. Intell. Syst. Mol. Biol., Heidelberg, Germany, 6–10 August (eds T Lengauer, R Schneider, P Bork, D Brutlag, J Glasgow, HW Mewes, R Zimmer), pp. 149–158. Menlo Park, CA: AAAI Press.
  • 44.Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C. 2005. Profile-based string kernels for remote homology detection and motif extraction. J. Bioinf. Comput. Biol. 3, 527–550. (doi:10.1142/S021972000500120X)
  • 45.Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS. 2005. Semi-supervised protein classification using cluster kernels. Bioinformatics 21, 3241–3247. (doi:10.1093/bioinformatics/bti497)
  • 46.Yang Y, Tantoso E, Li KB. 2008. Remote protein homology detection using recurrence quantification analysis and amino acid physicochemical properties. J. Theor. Biol. 252, 145–154. (doi:10.1016/j.jtbi.2008.01.028)
  • 47.Kuksa PP, Huang PH, Pavlovic V. 2009. Scalable algorithms for string kernels with inexact matching. In Adv. Neural Inf. Process. Syst., Vancouver, Canada, 8–11 December 2008, vol. 21 (eds D Koller, D Schuurmans, Y Bengio, L Bottou), pp. 881–888. Cambridge, MA: MIT Press.
  • 48.Kuksa P, Huang PH, Pavlovic V. 2009. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics 10, S2 (doi:10.1186/1471-2105-10-S4-S2)
  • 49.Toussaint NC, Widmer C, Kohlbacher O, Rätsch G. 2010. Exploiting physico-chemical properties in string kernels. BMC Bioinformatics 11, S7 (doi:10.1186/1471-2105-11-S8-S7)
  • 50.Webb-Robertson BJM, Ratuiste KG, Oehmen CS. 2010. Physicochemical property distributions for accurate and rapid pairwise protein homology detection. BMC Bioinformatics 11, 145 (doi:10.1186/1471-2105-11-145)
  • 51.Kuksa PP, Khan I, Pavlovic V. 2012. Generalized similarity kernels for efficient sequence classification. In Proc. SIAM Int. Conf. Data Mining, Anaheim, CA, 28–30 April 2011 (eds J Ghosh, H Liu, I Davidson, C Domeniconi, C Kamath), pp. 873–882. Philadelphia, PA: SIAM.
  • 52.Kuksa PP. 2013. Biological sequence classification with multivariate string kernels. IEEE/ACM Trans. Comput. Biol. Bioinf. 10, 1201–1210. (doi:10.1109/TCBB.2013.15)
  • 53.Koyano H, Hayashida M, Akutsu T. 2015. Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings. (http://arxiv.org/abs/1411.6471)

Supplementary Materials

Supplementary Material for Maximum Margin Classifier Working in a Set of Strings
rspa20150551supp1.pdf (55.1 KB)

Data Availability Statement

All data used in this study are openly available from the 3did and Rfam databases. The source code for the learning machine constructed in this study, which classifies strings by working directly in A*, is available upon request.

