Entropy. 2022 Oct 31;24(11):1567. doi: 10.3390/e24111567

Distributed Support Vector Ordinal Regression over Networks

Huan Liu 1, Jiankai Tu 1, Chunguang Li 1,*
Editors: Minyu Feng1, Liang-Jian Deng1, Feng Chen1
PMCID: PMC9689832  PMID: 36359657

Abstract

Ordinal regression methods are widely used to predict the ordered labels of data, among which support vector ordinal regression (SVOR) methods are popular because of their good generalization. In many realistic circumstances, data are collected by a distributed network. To protect privacy or because of practical constraints, the data cannot be transmitted to a center for processing. However, as far as we know, existing SVOR methods are all centralized. In the above situations, centralized methods are inapplicable, and distributed methods are more suitable choices. In this paper, we propose a distributed SVOR (dSVOR) algorithm. First, we formulate a constrained optimization problem for SVOR in distributed circumstances. Since the problem is difficult to solve with classical methods, we use the random approximation method and the hinge loss function to transform it into a convex optimization problem with constraints. Then, we propose a subgradient-based algorithm, dSVOR, to solve it. To illustrate its effectiveness, we theoretically analyze the consensus and convergence of the proposed method and conduct experiments on both synthetic data and a real-world example. The experimental results show that the proposed dSVOR achieves performance close to that of the corresponding centralized method, which needs all the data to be collected together.

Keywords: ordinal regression, support vector machine, support vector ordinal regression, distributed algorithm, subgradient method

1. Introduction

Many real-world data labels have natural orders that are usually called ordinal labels. For example, fault severity in industrial processes is usually divided into {harmless, slight, medium, severe}. Ordinal regression, which aims at predicting ordinal labels for given patterns, has attracted a great deal of research in many fields, such as disease severity assessment [1], satisfaction evaluation [2], wind-speed prediction [3], age estimation [4], credit-rating prediction [5], and fault severity diagnosis [6]. Although classical classification and regression methods can be applied to the ordinal regression problem [7,8], they require additional prior information about the distances between labels. Otherwise, they often perform unsatisfactorily since they cannot fully use ordering information [9,10].

To tackle the aforementioned problems of classical classification and regression methods, many ordinal regression methods have been proposed [10]. Among them, the most popular approaches are threshold models, which assume that a continuous latent variable underlies the ordinal response [10]. In threshold models, the order of the labels is represented by a set of ordered thresholds. These ordered thresholds define a series of intervals, and the data label depends on the interval the corresponding latent variable falls into. Among the threshold models, support vector ordinal regression (SVOR) [11,12] is widely used because of its good generalization performance. A representative work is the support vector ordinal regression with implicit constraints (SVORIM) proposed in [11,12]. SVORIM determines each threshold by taking all the samples into consideration, so that the threshold inequality constraints are satisfied without being imposed explicitly.

Most of the existing ordinal regression methods have been developed in a centralized framework. However, in practice, data used for ordinal regression may be distributed in a network [13]. Each node of the network collects and stores part of the data, and it is not enough for a single node to train a model with good performance. For instance, in industrial processes, sensors are often used in factories to monitor the operating status of equipment and diagnose fault severity. Due to the rarity of faults, a single sensor can only collect very few data, and the faults encountered by each factory may also be different. To train a proper model, we need to use as many data as possible. However, in some realistic scenarios, it is difficult for data to be transmitted to a central node for various reasons [13]. For example, factories may not want to leak data regarding their equipment in order to protect privacy. Moreover, if the data are collected by image sensors or video sensors, it may be difficult for a single machine to store and process such a large amount of data. In such situations, centralized methods are inapplicable, and distributed methods are more suitable choices.

In this paper, we propose a distributed support vector ordinal regression algorithm based on the SVORIM method to deal with more complex nonlinear problems in distributed ordinal regression. First, we formulate a constrained optimization problem for SVORIM in distributed scenarios. Classical methods usually solve the problem by transforming it into the dual problem. In distributed circumstances where the original data cannot be transmitted to others, it is difficult for classical methods to calculate the kernel function values and optimize the dual variables because both require data from different nodes. To overcome these difficulties, we adopt a random approximation method and the hinge loss function to transform the optimization problem. Increasing the number of random approximation dimensions improves the approximation accuracy but brings redundancy. To find an appropriate number of approximation dimensions, we further add a sparse regularization term to the objective function. Through the above steps, we transform the original problem into a convex optimization problem with consensus constraints. Then, to solve the problem, we propose a subgradient-based algorithm called distributed SVOR (dSVOR), in which each node only uses its own data and the parameter estimates exchanged with its neighbors. To verify the effectiveness of dSVOR, we theoretically analyze its consensus and convergence, and conduct experiments on synthetic data and a real-world example. The experimental results show that the proposed distributed algorithm, despite its additional constraints, achieves performance close to that of the corresponding centralized method, which needs all the data to be collected at a central node.

The main contributions of this paper are summarized as follows.

  1. Existing work on distributed ordinal regression [14] uses a linear model; therefore, it cannot deal with linearly inseparable data. We extend the SVOR method to distributed scenarios to solve distributed ordinal regression problems with linearly inseparable data.

  2. We develop a decentralized implementation of SVOR and propose the dSVOR algorithm. In the proposed algorithm, the kernel feature map is approximated by random feature maps to avoid transmitting the original data, and sparse regularization is added to avoid excessively high approximation dimensions.

  3. The consensus and convergence of the proposed algorithm are theoretically analyzed.

The rest of this paper is organized as follows. In Section 2, we introduce related works. The ordinal regression problem and the SVORIM method are introduced in Section 3 as preliminary knowledge. In Section 4, we formulate the distributed support vector ordinal regression problem, propose the dSVOR algorithm, and analyze the proposed algorithm theoretically. Experiments that evaluate the effectiveness of the proposed algorithm are presented in Section 5. Lastly, in Section 6, we draw some conclusions.

2. Related Works

Ordinal Regression Methods. Many ordinal regression methods have been proposed to solve ordinal regression problems. The ordered logit model [15,16] makes assumptions about the distribution of the prediction error of the latent variable, and uses the cumulative distribution function to build the label cumulative probability function. The support vector ordinal regression (SVOR) [11,12] maximizes margins between two adjacent labels. Variants of SVOR with nonparallel hyperplanes were discussed in [17,18]. There are also ordinal regression methods that solve ordinal regression problems by solving a series of binary classification subproblems. In [4,19], extended labels were extracted from the original ordinal labels to learn a binary classifier (such as support vector machine [19] or logistic regression [4]); then, a ranking rule was constructed from the binary classifier to predict ordinal labels. In [20], the authors used the stick-breaking process to construct a series of binary classification subproblems to guarantee that the cumulative probabilities were monotonically decreasing. However, the above ordinal regression methods are all centralized and are infeasible in distributed scenarios.

Distributed methods. Distributed methods were extensively studied in many fields, such as distributed estimation [21,22], distributed optimization [23,24], distributed clustering [25], distributed Kalman filter [26], and distributed anomaly detection [27]. However, as far as we know, there are few works investigating distributed ordinal regression [14]. In [14], the authors proposed a distributed generalized ordered logit model, which is a linear model and therefore cannot handle complex problems.

3. Preliminaries

3.1. Ordinal Regression Problem

The classification problem aims at classifying a K-dimensional input vector $x \in \mathcal{X} \subseteq \mathbb{R}^K$ into one of Q discrete categories $y \in \mathcal{Y} = \{C_1, C_2, \ldots, C_Q\}$. The ordinal regression problem is a type of classification problem in which the data labels have a natural order $C_1 \prec C_2 \prec \cdots \prec C_Q$, where ≺ is an order relation [10]. The purpose of ordinal regression is to find a mapping function $f: \mathcal{X} \to \mathcal{Y}$ that predicts the ordinal labels of new patterns, given a training set of N samples $D = \{(x_i, y_i), i = 1, \ldots, N\}$.

3.2. Support Vector Ordinal Regression with Implicit Constraints

Let ϕ(x) denote the feature vector of input vector x in a high-dimensional reproducing kernel Hilbert space (RKHS). The inner product in the RKHS is defined by the reproducing kernel function: $K(x, x') = \phi(x) \cdot \phi(x')$.

Support vector machines construct a discriminant hyperplane in the RKHS by maximizing the distance between support vectors and the discriminant hyperplane. The discriminant hyperplane is defined by an optimal direction w and a single optimal threshold b. It divides the feature space into two regions for two classes.

The support vector ordinal regression constructs $Q-1$ parallel discriminant hyperplanes for Q ordinal labels, where these hyperplanes are defined by an optimal direction w and $Q-1$ thresholds $\{b_q\}_{q=1,\ldots,Q-1}$. The ordinal information in the labels is represented by the threshold inequalities $b_1 \le b_2 \le \cdots \le b_{Q-1}$. For convenience, the vector $b = [b_1, b_2, \ldots, b_{Q-1}]^T$ is used to denote these thresholds.

In [11,12], the SVORIM method determines a threshold $b_q$ by utilizing the samples of all the labels. For threshold $b_q$, each sample belonging to $C_p$, $p \le q$, should have a function value less than $b_q - 1$; otherwise, $\xi_{pi}^{q} = w \cdot \phi(x_i^p) - (b_q - 1)$ is the empirical error of $x_i^p$ for $b_q$. Similarly, each sample belonging to $C_p$, $p > q$, should have a function value greater than $b_q + 1$; otherwise, $\xi_{pi}^{*q} = (b_q + 1) - w \cdot \phi(x_i^p)$ is the empirical error of $x_i^p$ for $b_q$.

As proved in [11,12], this approach has the property that the threshold inequalities can be automatically satisfied after convergence without explicitly including the corresponding constraints. This method is called support vector ordinal regression with implicit constraints and is formulated as follows:

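$$
\begin{aligned}
\min_{w, b, \xi, \xi^*} \quad & \frac{1}{2}\|w\|^2 + C \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N^p} \xi_{pi}^{q} + C \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N^p} \xi_{pi}^{*q} \\
\text{s.t.} \quad & w \cdot \phi(x_i^p) - b_q \le -1 + \xi_{pi}^{q}, \quad \xi_{pi}^{q} \ge 0, \quad \forall i, q \ \text{and} \ p = 1, \ldots, q, \\
& w \cdot \phi(x_i^p) - b_q \ge +1 - \xi_{pi}^{*q}, \quad \xi_{pi}^{*q} \ge 0, \quad \forall i, q \ \text{and} \ p = q+1, \ldots, Q,
\end{aligned} \tag{1}
$$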

where C is a predefined positive constant. The above problem can be solved by solving the dual problem, which can be derived with standard Lagrangian techniques. Let $\beta_{pi}^{q} \ge 0$, $\gamma_{pi}^{q} \ge 0$, $\beta_{pi}^{*q} \ge 0$, and $\gamma_{pi}^{*q} \ge 0$ be the Lagrangian multipliers for the constraints in the above problem. The dual problem is the following maximization problem [11,12]:

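$$
\begin{aligned}
\max_{\beta, \beta^*} \quad & -\frac{1}{2} \sum_{p,i} \sum_{p',i'} \Big( \sum_{q=1}^{p-1} \beta_{pi}^{*q} - \sum_{q=p}^{Q-1} \beta_{pi}^{q} \Big) \Big( \sum_{q=1}^{p'-1} \beta_{p'i'}^{*q} - \sum_{q=p'}^{Q-1} \beta_{p'i'}^{q} \Big) K(x_i^p, x_{i'}^{p'}) + \sum_{p,i} \Big( \sum_{q=1}^{p-1} \beta_{pi}^{*q} + \sum_{q=p}^{Q-1} \beta_{pi}^{q} \Big) \\
\text{s.t.} \quad & \sum_{p=1}^{q} \sum_{i=1}^{N^p} \beta_{pi}^{q} = \sum_{p=q+1}^{Q} \sum_{i=1}^{N^p} \beta_{pi}^{*q}, \quad \forall q, \\
& 0 \le \beta_{pi}^{q} \le C, \quad \forall i, q \ \text{and} \ p \le q, \\
& 0 \le \beta_{pi}^{*q} \le C, \quad \forall i, q \ \text{and} \ p > q.
\end{aligned} \tag{2}
$$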

For a new pattern x, SVORIM calculates the function value $w \cdot \phi(x)$ and then decides its category according to the interval the function value falls into, where the intervals are defined by the thresholds $\{b_q\}_{q=1,\ldots,Q-1}$.
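The interval lookup described above can be sketched in a few lines. This is only an illustration: the function name `predict_rank` is hypothetical, the function value is assumed to be computed elsewhere, and the handling of a value that lies exactly on a threshold (assigned to the lower rank here) is an assumption rather than something specified in [11,12].

```python
from bisect import bisect_left


def predict_rank(score, thresholds):
    """Return the rank q in {1, ..., Q} of the interval containing `score`.

    score:      the function value w . phi(x), assumed to be computed elsewhere.
    thresholds: sorted list [b_1, ..., b_{Q-1}].
    A score exactly equal to some b_q is assigned to rank q (assumed convention).
    """
    # bisect_left counts the thresholds that are strictly smaller than score.
    return bisect_left(thresholds, score) + 1


thresholds = [-1.0, 0.0, 1.0]  # Q = 4 ordinal labels
ranks = [predict_rank(s, thresholds) for s in (-2.0, -0.5, 0.5, 2.0)]
```

With three thresholds, the four example values fall into four different intervals, so `ranks` contains each of the ranks 1 to 4 once.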

4. Distributed Support Vector Ordinal Regression Algorithm

4.1. Network and Data Model

In this paper, we consider a network consisting of M nodes. We use a graph $\mathcal{G} = (\mathcal{M}, \mathcal{E})$ to represent this network, which consists of a set of nodes $\mathcal{M} = \{1, 2, \ldots, M\}$ and a set of edges $\mathcal{E}$. Each edge $(m, n) \in \mathcal{E}$ connects a pair of distinct nodes. We use $\mathcal{N}_m = \{n \mid (m, n) \in \mathcal{E}\}$ to represent the set of neighbors of node $m \in \mathcal{M}$.

Data used for ordinal regression are collected and stored in a distributed manner by the M nodes of this network. The i-th sample of node m is represented as $(x_{m,i}, y_{m,i})$, where $x_{m,i} \in \mathcal{X}$ and $y_{m,i} \in \mathcal{Y}$. More specifically, at node m, the total number of samples is $N_m$, the number of samples that belong to $C_q$ is $N_m^q$, and the i-th sample of $C_q$ is denoted as $(x_{m,i}^q, y_{m,i}^q)$.

Figure 1 shows a schematic of a distributed network. In distributed networks, due to limited storage, computation and communication resources and the need for privacy protection, node m can only transmit some parameters $\theta_m$ instead of the original data to its neighbor nodes in $\mathcal{N}_m$, and perform local computation using only its own data $\{(x_{m,i}, y_{m,i})\}_{1 \le i \le N_m}$ and the parameters received from its neighbors. Each node should eventually obtain a model that is in consensus with those obtained by the other nodes, and the performance of the model should be close to that of the model trained using all the data.

Figure 1. Schematic of a distributed network. Node m only exchanges its parameters $\theta_m$ with nodes in $\mathcal{N}_m$.

4.2. Problem Formulation

In centralized SVOR, the objective is to find an optimal direction w and a vector b. If the data from all the nodes of the distributed network can be collected together, then parameters θ={w,b} can be obtained by solving Problem (1).

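In distributed situations, data are not allowed to be transmitted to a central node. Each node can only use its own data and some parameters from its neighbors. In this case, each node m has a local estimate $\theta_m$ of θ. With a connected network, we impose the constraints $\theta_m = \theta_n, \forall (m, n) \in \mathcal{E}$ to ensure the consensus of $\{\theta_m\}_{m=1,\ldots,M}$. Then, the corresponding optimization problem in distributed scenarios can be written as follows:

$$
\begin{aligned}
\min \quad & \frac{1}{2} \sum_{m=1}^{M} \|w_m\|^2 + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} \xi_{m,pi}^{q} + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} \xi_{m,pi}^{*q} \\
\text{s.t.} \quad & w_m \cdot \phi(x_{m,i}^p) - b_{m,q} \le -1 + \xi_{m,pi}^{q}, \quad \xi_{m,pi}^{q} \ge 0, \quad \forall m, i, q \ \text{and} \ p = 1, \ldots, q, \\
& w_m \cdot \phi(x_{m,i}^p) - b_{m,q} \ge +1 - \xi_{m,pi}^{*q}, \quad \xi_{m,pi}^{*q} \ge 0, \quad \forall m, i, q \ \text{and} \ p = q+1, \ldots, Q, \\
& w_m = w_n, \quad b_m = b_n, \quad \forall (m, n) \in \mathcal{E},
\end{aligned} \tag{3}
$$

where $\xi_{m,pi}^{q}$ is the empirical error of $x_{m,i}^p$ for $b_{m,q}$ when $p = 1, \ldots, q$, and $\xi_{m,pi}^{*q}$ is the empirical error of $x_{m,i}^p$ for $b_{m,q}$ when $p = q+1, \ldots, Q$. With the help of the consensus constraints, this problem is equivalent to Problem (1).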

4.3. Problem Transformation

In classical solutions, a primal problem is solved by solving the corresponding dual problem. Applying such methods to Distributed Problem (3) is confronted with two major difficulties:

  1. For nonlinear kernel functions, the dimension of the RKHS is unknown, and we can only calculate the inner product of $\phi(x_{m,i})$ and $\phi(x_{n,j})$ rather than the feature vectors themselves. Because the data are distributed over various nodes of the network, the kernel function $K(x_{m,i}, x_{n,j})$, which requires data from different nodes, is difficult to calculate without transmitting the original data.

  2. The dual variables of the samples should satisfy the constraints in (2). In distributed scenarios, the dual variables in the first constraint of (2) usually belong to different nodes. Since each node is only allowed to exchange information with its neighbors, it is difficult to optimize these dual variables.

To overcome the first difficulty, we use a random approximate function [28] $z: \mathbb{R}^K \to \mathbb{R}^D$, where $D > K$, to map the data to a D-dimensional space instead of the RKHS. In this study, for the Gaussian kernel function

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma^2}\right), \tag{4}$$

we adopt $z(x) = [z_{\omega_1}(x), \ldots, z_{\omega_D}(x)]^T$, where each dimension $z_{\omega_i}(x)$ is

$$z_{\omega_i}(x) = \sqrt{\frac{2}{D}} \cos(\omega_i^T x + \psi_i), \tag{5}$$

where $\psi_i$ is drawn uniformly from $[0, 2\pi]$, and $\omega_i$ is drawn from the Fourier transform of the Gaussian kernel function

$$p(\omega) = (2\pi)^{-\frac{K}{2}} \exp\left(-\frac{\sigma^2 \|\omega\|^2}{2}\right). \tag{6}$$

As proved in [28], if the dimension D is large enough, $z(x)^T z(x')$ approximates $K(x, x')$ well, and z(x) approximates ϕ(x) well. According to Cover’s theorem [29], a complex pattern-classification problem nonlinearly cast in a high-dimensional space is more likely to be linearly separable than it is in a low-dimensional space. Therefore, to ensure good performance, we should set a relatively large D. For other shift-invariant kernels such as Laplacian and Cauchy, the authors in [28] provided corresponding finite-dimensional random approximate functions. For additive homogeneous kernels, such as Hellinger’s, $\chi^2$, intersection and Jensen-Shannon, the authors in [30] also provided efficient finite-dimensional approximate mapping functions. For a linear kernel function, random approximation is not necessary, so we define $z(x) = \phi(x) = x$.

With the random approximation, mapping function ϕ(x) in (3) is replaced by z(x). The calculation of z(x) only requires one data point from a single node instead of a pair of data from different nodes like the kernel function, so the first difficulty is solved.
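A minimal sketch of such a random cosine feature map is given below, using only the Python standard library. Several things here are assumptions made for illustration: the kernel is parameterized as exp(-gamma * ||x - x'||^2), for which the frequencies are drawn from N(0, 2 * gamma * I); the names `make_random_features`, `feature_map`, and `gaussian_kernel` are hypothetical; and the use of a shared seed so that every node builds the same map is a guess at how the nodes would keep their feature spaces aligned, not something stated in the text.

```python
import math
import random


def make_random_features(K, D, gamma, seed=0):
    """Draw (omega_i, psi_i), i = 1..D, for the kernel exp(-gamma * ||x - x'||^2)."""
    rng = random.Random(seed)  # assumed: all nodes share this seed
    std = math.sqrt(2.0 * gamma)  # omega ~ N(0, 2 * gamma * I) for this parameterization
    omegas = [[rng.gauss(0.0, std) for _ in range(K)] for _ in range(D)]
    psis = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, psis


def feature_map(x, omegas, psis):
    """z(x) with entries sqrt(2 / D) * cos(omega_i^T x + psi_i)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [
        scale * math.cos(sum(o * xi for o, xi in zip(omega, x)) + psi)
        for omega, psi in zip(omegas, psis)
    ]


def gaussian_kernel(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


omegas, psis = make_random_features(K=2, D=5000, gamma=1.0, seed=0)
x, y = [0.3, -0.2], [-0.1, 0.4]
zx = feature_map(x, omegas, psis)  # needs only the local point x
zy = feature_map(y, omegas, psis)  # needs only the local point y
approx = sum(a * b for a, b in zip(zx, zy))
exact = gaussian_kernel(x, y, gamma=1.0)
```

Each call to `feature_map` touches a single data point, so two nodes could compute `zx` and `zy` separately; the inner product `approx` should be close to `exact` when D is large.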

After the random approximation is performed, the data are mapped into a D-dimensional feature space instead of the RKHS with unknown dimension. Thus, we could directly solve the primal problem instead of the dual problem, which automatically tackles the second difficulty.

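With the use of the hinge loss function $L(x) = \max(1 - x, 0)$ [31], the problem can be rewritten as follows:

$$
\begin{aligned}
\min \quad & \frac{1}{2} \sum_{m=1}^{M} \|w_m\|^2 + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} L\big(b_{m,q} - w_m \cdot z(x_{m,i}^p)\big) + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} L\big(w_m \cdot z(x_{m,i}^p) - b_{m,q}\big) \\
\text{s.t.} \quad & w_m = w_n, \quad b_m = b_n, \quad \forall (m, n) \in \mathcal{E}.
\end{aligned} \tag{7}
$$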

4.4. Sparse Regularization

In the above steps, a D-dimensional random approximate function z(x) is used to approximate the unknown mapping function ϕ(x). In general, a large D leads to a small approximation error and good classification performance. However, an overly large D may cause redundancy, which wastes storage space and brings high computational complexity and high communication costs. To balance these two aspects, we add a sparse regularization term. The regularization term pushes some dimensions of $w_m$ to 0, which means that these dimensions are redundant and can be discarded. When some dimensions of $w_m$ converge to 0, these dimensions do not need to be calculated, stored, or transmitted.

The l0-norm is typically used to measure sparsity. However, it is nonconvex, and l0-norm-based problems are NP-hard. In practice, we can use the l1-norm as a convex approximation of the l0-norm. Introducing the l1-norm into the objective function in (7), we obtain

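$$
\begin{aligned}
\min \quad & \sum_{m=1}^{M} \Big[ (1 - \alpha) \frac{1}{2} \|w_m\|^2 + \alpha \|w_m\|_1 \Big] + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} L\big(b_{m,q} - w_m \cdot z(x_{m,i}^p)\big) + C \sum_{m=1}^{M} \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} L\big(w_m \cdot z(x_{m,i}^p) - b_{m,q}\big) \\
\text{s.t.} \quad & w_m = w_n, \quad b_m = b_n, \quad \forall (m, n) \in \mathcal{E},
\end{aligned} \tag{8}
$$

where $\alpha \in [0, 1]$ controls the proportion of the l1-norm sparsity regularization term in the entire regularization term. A larger α leads to a sparser solution $w_m$. Therefore, since we set a relatively large D to ensure good performance, we can set a relatively large α to reduce redundancy.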

We could view this problem from another perspective. If the last two terms in (8) are regarded to be the objective function, the first two terms combined together can be seen as a similar penalty to the elastic net penalty in [32], where α measures the weight of the l1-norm penalty term.

After the above steps, we transformed Problem (3) into a convex optimization problem with consensus constraints (8).

4.5. Distributed SVOR Algorithm

In this subsection, we propose the dSVOR algorithm to solve Problem (8). First, we used the following notation for convenience

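$$
J_m(\theta_m) = (1 - \alpha) \frac{1}{2} \|w_m\|^2 + \alpha \|w_m\|_1 + C \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} L\big(b_{m,q} - w_m \cdot z(x_{m,i}^p)\big) + C \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} L\big(w_m \cdot z(x_{m,i}^p) - b_{m,q}\big), \tag{9}
$$

which is a convex function. The calculation of $J_m(\theta_m)$ does not need the data or estimated parameters from other nodes. Then, Problem (8) can be rewritten as follows:

$$\min J = \sum_{m=1}^{M} J_m(\theta_m) \quad \text{s.t.} \quad \theta_m = \theta_n, \ \forall (m, n) \in \mathcal{E}. \tag{10}$$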

To deal with the consensus constraints $\theta_m = \theta_n, \forall (m, n) \in \mathcal{E}$, we adopt the penalty function method. The penalty function used in this paper is $\|\theta_m - \theta_n\|^2$, and the corresponding positive penalty coefficient is $\lambda_{mn}$. Then, the optimization problem becomes

$$\min \sum_{m=1}^{M} J_m(\theta_m) + \sum_{(m,n) \in \mathcal{E}} \lambda_{mn} \|\theta_m - \theta_n\|^2. \tag{11}$$

The larger $\lambda_{mn}$ is, the closer the solutions of Problems (11) and (10) are.

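We then apply the subgradient method to optimize Problem (11). For the hinge loss function $L(x) = \max(1 - x, 0)$, we adopt the following subgradient:

$$\partial L(x) = \begin{cases} -1, & x < 1 \\ 0, & x \ge 1, \end{cases} \tag{12}$$

and for the l1-norm, we adopt

$$\operatorname{sgn}(x) = \begin{cases} 1, & x > 0 \\ -1, & x < 0 \\ 0, & x = 0. \end{cases} \tag{13}$$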

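At step $k+1$, the iterative equation is

$$\theta_m^{k+1} = \theta_m^k - \eta^k \partial_{\theta_m} J_m(\theta_m^k) - 2 \eta^k \sum_{n \in \mathcal{N}_m} \lambda_{mn} (\theta_m^k - \theta_n^k), \tag{14}$$

where $\eta^k$ is the step size in step $k+1$, which is positive. The specific subgradients are

$$
\partial_{w_m} J_m(\theta_m^k) = (1 - \alpha) w_m^k + \alpha \operatorname{sgn}(w_m^k) - C \sum_{q=1}^{Q-1} \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} \partial L\big(b_{m,q}^k - w_m^k \cdot z(x_{m,i}^p)\big) z(x_{m,i}^p) + C \sum_{q=1}^{Q-1} \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} \partial L\big(w_m^k \cdot z(x_{m,i}^p) - b_{m,q}^k\big) z(x_{m,i}^p), \tag{15}
$$

$$
\partial_{b_{m,q}} J_m(\theta_m^k) = C \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} \partial L\big(b_{m,q}^k - w_m^k \cdot z(x_{m,i}^p)\big) - C \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} \partial L\big(w_m^k \cdot z(x_{m,i}^p) - b_{m,q}^k\big). \tag{16}
$$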
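The local cost and its subgradients can be evaluated with a single pass over the samples of one node. The sketch below follows the hinge-loss and l1-norm subgradient choices stated above; the function names and the representation of a node's data as a list of (feature vector, rank) pairs are assumptions made for illustration, not the authors' implementation.

```python
def hinge(x):
    return max(1.0 - x, 0.0)


def hinge_sub(x):
    # Subgradient choice for the hinge loss: -1 if x < 1, else 0.
    return -1.0 if x < 1.0 else 0.0


def sgn(x):
    return (x > 0) - (x < 0)


def local_cost_and_subgradient(w, b, samples, C, alpha):
    """Evaluate J_m and one subgradient with respect to w and b at a single node.

    w:       list of D weights; b: list of Q-1 thresholds.
    samples: list of (z, p) pairs, z = mapped feature vector, p = rank in {1..Q}
             (hypothetical data layout).
    """
    D, n_thresholds = len(w), len(b)
    J = sum((1 - alpha) * 0.5 * wi * wi + alpha * abs(wi) for wi in w)
    gw = [(1 - alpha) * wi + alpha * sgn(wi) for wi in w]
    gb = [0.0] * n_thresholds
    for z, p in samples:
        f = sum(wi * zi for wi, zi in zip(w, z))
        for q in range(1, n_thresholds + 1):
            if p <= q:
                arg = b[q - 1] - f  # sample should lie below b_q - 1
                d = hinge_sub(arg)
                for j in range(D):
                    gw[j] -= C * d * z[j]
                gb[q - 1] += C * d
            else:
                arg = f - b[q - 1]  # sample should lie above b_q + 1
                d = hinge_sub(arg)
                for j in range(D):
                    gw[j] += C * d * z[j]
                gb[q - 1] -= C * d
            J += C * hinge(arg)
    return J, gw, gb


# One sample of rank 1 with Q = 2: the margin b_1 - w.z = 0 is violated.
J, gw, gb = local_cost_and_subgradient([0.0], [0.0], [([1.0], 1)], C=1.0, alpha=0.0)
```

In this tiny case a subgradient step decreases w and increases the threshold, which moves the rank-1 sample further below $b_1$, as expected.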

In the subgradient method, in order to converge to the optimal solution, step size $\eta^k$ should satisfy [33]

$$\sum_{k=0}^{+\infty} \eta^k = +\infty, \quad \text{and} \quad \sum_{k=0}^{+\infty} (\eta^k)^2 < +\infty. \tag{17}$$

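We can rearrange Iterative Equation (14) as follows:

$$\theta_m^{k+1} = \Big(1 - 2 \eta^k \sum_{n \in \mathcal{N}_m} \lambda_{mn}\Big) \theta_m^k + \sum_{n \in \mathcal{N}_m} 2 \eta^k \lambda_{mn} \theta_n^k - \eta^k \partial_{\theta_m} J_m(\theta_m^k). \tag{18}$$

If we use the following notations for convenience,

$$c_{mn} = 2 \eta^k \lambda_{mn}, \quad c_{mm} = 1 - \sum_{n \in \mathcal{N}_m} 2 \eta^k \lambda_{mn}, \tag{19}$$

the iterative equation can be rewritten as

$$\theta_m^{k+1} = \sum_{n \in \mathcal{N}_m \cup \{m\}} c_{mn} \theta_n^k - \eta^k \partial_{\theta_m} J_m(\theta_m^k). \tag{20}$$

It can be divided into two steps, i.e., a combination step and an adaption step:

$$\phi_m^k = \sum_{n \in \mathcal{N}_m \cup \{m\}} c_{mn} \theta_n^k, \tag{21}$$

$$\theta_m^{k+1} = \phi_m^k - \eta^k \partial_{\theta_m} J_m(\theta_m^k). \tag{22}$$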

In Combination Step (21), node m combines the parameters estimated by its neighbors and itself to obtain an intermediate estimate $\phi_m^k$, where the combination coefficient of node m and its neighbor n is denoted as $c_{mn}$. In Adaption Step (22), node m uses the subgradient calculated from only its own data to update $\theta_m$.

Combination coefficients $\{c_{mn}\}_{(m,n) \in \mathcal{E}}$ represent a cooperation rule among nodes. Equation (19) is not used to define $\{c_{mn}\}$ because $\lambda_{mn}$ is not defined in advance. In distributed algorithms, combination coefficients are generally determined by a certain cooperative protocol. In this study, we use the Metropolis rule [34]:

$$
c_{mn} = \begin{cases} \dfrac{1}{\max(|\mathcal{N}_m|, |\mathcal{N}_n|)}, & n \in \mathcal{N}_m \\ 1 - \sum_{n \in \mathcal{N}_m} c_{mn}, & m = n \\ 0, & \text{otherwise}, \end{cases} \tag{23}
$$

where $|\mathcal{N}_m|$ denotes the degree of node m, and

$$C \mathbf{1} = \mathbf{1}, \quad \mathbf{1}^T C = \mathbf{1}^T, \tag{24}$$

where C is an M×M matrix whose entries are defined by (23), and $\mathbf{1}$ is the all-ones vector.
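As an illustration, the combination matrix can be built directly from neighbor sets. The sketch below follows the rule as written in Equation (23); the function name `metropolis_weights`, the dictionary-of-sets graph format, and the 4-node example graph are assumptions made for this example.

```python
def metropolis_weights(neighbors):
    """Build the combination matrix from neighbor sets.

    neighbors: dict mapping each node to the set of its neighbors
               (undirected graph, no self-loops; hypothetical input format).
    """
    nodes = sorted(neighbors)
    index = {m: i for i, m in enumerate(nodes)}
    size = len(nodes)
    weights = [[0.0] * size for _ in range(size)]
    for m in nodes:
        row = weights[index[m]]
        for n in neighbors[m]:
            row[index[n]] = 1.0 / max(len(neighbors[m]), len(neighbors[n]))
        # Self-weight: whatever is left so that the row sums to one.
        row[index[m]] = 1.0 - sum(row)
    return weights


# A small non-bipartite example graph (it contains the triangle 0-1-2).
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
Cmat = metropolis_weights(graph)
row_sums = [sum(row) for row in Cmat]
```

Because the off-diagonal entries are symmetric in m and n, the rows and the columns of `Cmat` both sum to one, which is the property stated in Equation (24).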

Equation (19) shows that $\lambda_{mn} = \frac{c_{mn}}{2 \eta^k}$. Step size $\eta^k$ satisfies (17), where the latter condition implies that $\lim_{k \to \infty} \eta^k = 0$. As $k \to \infty$, step size $\eta^k \to 0$ and penalty coefficient $\lambda_{mn} \to \infty$, which renders the solutions of Problems (11) and (10) nearly equal.

The whole process of dSVOR is summarized in Algorithm 1.

Algorithm 1 Distributed SVOR algorithm

Initialization: initialize hinge loss function weight C, sparsity regularization weight α, random approximate dimension D, and total iteration number T. Each node m initializes $\theta_m = \{w_m, b_m\}$.

for k = 1 : T

   for m = 1 : M

      Communication Step: communicate parameters $\theta_m$ with neighbors $n \in \mathcal{N}_m$.

   end for

   for m = 1 : M

      Combination Step: compute intermediate estimate $\phi_m^k$ via (21).

      Computation Step: compute the subgradients $\partial_{w_m} J_m(\theta_m^k)$ and $\partial_{b_{m,q}} J_m(\theta_m^k)$ via (15) and (16).

      Adaption Step: update $\theta_m^{k+1}$ via (22).

   end for

end for
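The loop structure of Algorithm 1 can be tried out on a toy problem. The sketch below is not the dSVOR implementation: each node holds a scalar estimate, the local cost is the hypothetical quadratic 0.5 * (theta - target_m)^2 instead of the SVOR cost, the 4-node combination matrix is hard-coded (symmetric, with rows summing to one), and a diminishing step size of the form eta0 / (1 + tau * k) is assumed.

```python
def combine_then_adapt(Cmat, theta0, subgrad, T, eta0=0.1, tau=0.01):
    """Run T iterations of the combination and adaption steps.

    Cmat:    M x M combination matrix (list of lists).
    theta0:  initial scalar estimates, one per node.
    subgrad: function (m, theta_m) -> subgradient of node m's local cost.
    """
    theta = list(theta0)
    M = len(theta)
    for k in range(T):
        eta = eta0 / (1.0 + tau * k)  # assumed diminishing step size
        # Communication + combination: weighted average over neighbors and self.
        phi = [sum(Cmat[m][n] * theta[n] for n in range(M)) for m in range(M)]
        # Computation + adaption: local subgradient evaluated at theta_m^k.
        theta = [phi[m] - eta * subgrad(m, theta[m]) for m in range(M)]
    return theta


t = 1.0 / 3.0
Cmat = [
    [0.0, t, t, t],
    [t, t, t, 0.0],
    [t, t, 0.0, t],
    [t, 0.0, t, t],
]
targets = [0.0, 1.0, 2.0, 3.0]  # toy local data, one value per node


def toy_subgrad(m, theta_m):
    # Gradient of the stand-in local cost 0.5 * (theta - targets[m])**2.
    return theta_m - targets[m]


final = combine_then_adapt(Cmat, [0.0] * 4, toy_subgrad, T=2000)
```

For this toy cost, the minimizer of the summed cost is the mean of the targets, 1.5, so the entries of `final` should be close to each other and to that value. No node ever reads another node's target, only its neighbors' current estimates.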

Remark 1.

In the above problems, ϕ(·) is a nonlinear mapping function that maps input x into an RKHS for classification, and input x is the original data or extracted features. In general, function ϕ(·) can also be regarded as a generalized feature mapping function that extracts features of x and maps x into a feature space for classification. Thus, ϕ(·) can also be implemented by an artificial neural network with learnable parameters. However, this may destroy the convexity of the problem, so that the algorithm is no longer guaranteed to converge to the global optimum.

4.6. Theoretical Analysis

In this subsection, we theoretically analyze the consensus and convergence of dSVOR.

We first introduce a reasonable assumption that is needed in analysis. According to [34], when the graph is not bipartite, this assumption can be guaranteed.

Assumption 1.

Spectral radius $\rho\left(C - \frac{1}{M} \mathbf{1} \mathbf{1}^T\right) < 1$, where C is the combination coefficient matrix set as in Equation (23).

We then state two theorems, on consensus and on convergence, respectively.

Theorem 1

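(Consensus). If Assumption 1 holds, and step size $\eta^k$ satisfies Condition (17), then $\lim_{k \to \infty} \|\theta_m^k - \bar{\theta}^k\| = 0, \forall m$, where $\bar{\theta}^k = \frac{1}{M} \sum_{m=1}^{M} \theta_m^k$.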

Theorem 2

(Convergence). If Assumption 1 holds, and step size $\eta^k$ satisfies Condition (17), then $\lim_{k \to \infty} \sum_{m=1}^{M} J_m(\theta_m^k) = J^*$, where $J^* = \min J$.

The proofs are given in Appendix A and Appendix B.

5. Experiments

In this section, we carry out experiments on synthetic data and a real-world example to demonstrate the performance of the proposed dSVOR algorithm.

We implemented the following algorithms for comparison:

  1. proposed dSVOR algorithm (dSVOR);

  2. centralized SVOR (cSVOR), which relies on all the data available in a central node;

  3. distributed SVOR with a noncooperative strategy (ncSVOR). In ncSVOR, each node uses only its own data to train a model without any information exchanged with other nodes.

All the algorithms were implemented using the PyTorch framework [35].

There are three points to emphasize:

  1. The centralized method needs the data in a central node. For comparison, we artificially collected all the data distributed over the nodes of the network to render it applicable, which is impractical in reality.

  2. In cSVOR [11,12], problems were solved by the SMO algorithm instead of subgradient-based algorithms, so we only display its final results.

  3. The distributed algorithms were subject to additional constraints, so a distributed algorithm is generally satisfactory if it can achieve performance comparable to that of the corresponding centralized algorithm.

In this study, we used the prediction accuracy (ACC) and mean absolute error (MAE) on the testing set as the performance evaluation metrics. ACC is a commonly used metric in classification problems, but it does not consider the ordered information of the labels. MAE is the mean absolute deviation of the predicted rank from the true one, which is commonly used in ordinal regression. Using a function O(·) to denote the position of a certain label in the ordinal scale, i.e., $O(C_q) = q, q = 1, \ldots, Q$, we have

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |O(y_i) - O(\hat{y}_i)| \in [0, Q-1]. \tag{25}$$
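Both metrics are simple to compute once labels are encoded by their positions in the ordinal scale. The following sketch assumes labels are already given as integer ranks 1 to Q; the function name `acc_and_mae` is hypothetical.

```python
def acc_and_mae(true_ranks, predicted_ranks):
    """Return (ACC, MAE) for two equally long lists of integer ranks."""
    n = len(true_ranks)
    pairs = list(zip(true_ranks, predicted_ranks))
    acc = sum(t == p for t, p in pairs) / n
    mae = sum(abs(t - p) for t, p in pairs) / n
    return acc, mae


# Two correct predictions, one off by one rank, one off by three ranks.
acc, mae = acc_and_mae([1, 2, 3, 4], [1, 2, 4, 1])
```

The two wrong predictions count equally toward ACC, whereas MAE penalizes the prediction that is three ranks away more heavily than the one that is one rank away.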

The performance of distributed algorithms (dSVOR and ncSVOR) is defined as the mean performance of models obtained by each node. The distributed algorithms ran on a randomly generated connected network that consisted of 20 nodes. For fair comparison, on a certain dataset, all implemented algorithms used the same parameters. All the results were obtained by averaging the results of 10 independent experiments.

5.1. Synthetic Data

In this subsection, we evaluate the performance of all algorithms on two synthetic datasets. In the first dataset, the samples can be separated by a set of parallel straight lines if noise is ignored; in the second dataset, the samples can be separated by a set of concentric circles. Figure 2a,b show some samples of these two datasets from one of the 10 independent experiments. Both datasets had 1200 samples: 1000 were used as the training set, and the remaining 200 as the testing set. The training samples were randomly assigned to 20 nodes to simulate the situation where the data were collected and stored by these nodes in a distributed manner.

Figure 2. Data visualization of the (a) first and (b) second synthetic datasets.

These two synthetic datasets were generated with the following methods. For the first dataset, we generated 1200 samples with uniform distribution from a rectangular area $x_1^{\min} \le x_1 \le x_1^{\max}$, $x_2^{\min} \le x_2 \le x_2^{\max}$. We then used three straight lines $\{x_1 = b_i\}_{i=1,2,3}$ to divide this area into 4 parts for 4 classes. The data labels were determined by their locations. Then, Gaussian noise with 0 mean and $\sigma_1$ standard deviation was added to each dimension of input vector $x = [x_1, x_2]^T$. After that, these samples were rotated around the origin by an angle β. Without loss of generality, in the experiments, these parameters were set as follows:

$$x_1^{\min} = -2, \ x_1^{\max} = 2, \ x_2^{\min} = -1, \ x_2^{\max} = 1, \ b_1 = -1, \ b_2 = 0, \ b_3 = 1, \ \sigma_1 = 0.5, \ \beta = \frac{\pi}{8}.$$

For the second dataset, we generated 1200 samples with uniform distribution from a circle $x_1^2 + x_2^2 < R^2$, which could be divided into four parts by three concentric circles $\{x_1^2 + x_2^2 = R_i^2\}_{i=1,2,3}$. The data labels were determined by their locations. Then, Gaussian noise with 0 mean and $\sigma_2$ standard deviation was added to each dimension of input vector $x = [x_1, x_2]^T$. Without loss of generality, the parameters were set to be $R = 4$, $R_1 = 1$, $R_2 = 2$, $R_3 = 3$, $\sigma_2 = 0.2$.
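A generation procedure along these lines could look as follows. This is a sketch under stated assumptions rather than the authors' script: the signs of the lower bounds and of $b_1$ are taken to be negative, the rotation is taken to be counterclockwise, points in the disc are drawn by rejection sampling, and the function names are hypothetical.

```python
import math
import random


def make_linear_dataset(n, rng, x1_range=(-2.0, 2.0), x2_range=(-1.0, 1.0),
                        lines=(-1.0, 0.0, 1.0), noise_std=0.5, beta=math.pi / 8):
    """Samples separable by parallel lines before noise is added."""
    data = []
    for _ in range(n):
        x1 = rng.uniform(*x1_range)
        x2 = rng.uniform(*x2_range)
        label = 1 + sum(x1 > b for b in lines)  # rank from the noise-free location
        x1 += rng.gauss(0.0, noise_std)
        x2 += rng.gauss(0.0, noise_std)
        # Rotate around the origin (counterclockwise direction is assumed).
        r1 = x1 * math.cos(beta) - x2 * math.sin(beta)
        r2 = x1 * math.sin(beta) + x2 * math.cos(beta)
        data.append(((r1, r2), label))
    return data


def make_circular_dataset(n, rng, R=4.0, radii=(1.0, 2.0, 3.0), noise_std=0.2):
    """Samples separable by concentric circles before noise is added."""
    data = []
    while len(data) < n:
        x1, x2 = rng.uniform(-R, R), rng.uniform(-R, R)
        sq = x1 * x1 + x2 * x2
        if sq >= R * R:
            continue  # rejection sampling keeps points uniform inside the disc
        label = 1 + sum(sq > r * r for r in radii)
        data.append(((x1 + rng.gauss(0.0, noise_std),
                      x2 + rng.gauss(0.0, noise_std)), label))
    return data


rng = random.Random(0)
first = make_linear_dataset(1200, rng)
second = make_circular_dataset(1200, rng)
```

Labels are assigned before the noise is added, so samples near a boundary may end up on the wrong side of it, which is what makes the classes overlap.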

On the first dataset, we used a linear kernel function. In all methods, positive constant C was set to be 1000/N, where N is the number of samples of all nodes. Because the feature space was only 2-dimensional, the sparse regularization term in our method was not necessary. Thus, we set the coefficient of sparse regularization term α=0. In the distributed algorithm, we used the following diminishing step size:

$$\eta^k = \frac{\eta_0}{1 + \tau k}, \tag{26}$$

which satisfied Condition (17). In (26), parameter η0 determines the initial step size, and τ determines the decreasing rate of the diminishing step size. We empirically set η0=0.1 and τ=0.01 in the following experiments.
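A quick numerical look at this schedule illustrates both parts of the step-size condition. The sketch assumes the form eta_0 / (1 + tau * k) with the values given above; the function name `step_size` and the horizon of 100,000 iterations are arbitrary choices for illustration.

```python
def step_size(k, eta0=0.1, tau=0.01):
    """Diminishing step size eta0 / (1 + tau * k)."""
    return eta0 / (1.0 + tau * k)


horizon = 100_000
partial_sum = sum(step_size(k) for k in range(horizon))
partial_sum_sq = sum(step_size(k) ** 2 for k in range(horizon))

# Comparing the sum of squares with an integral gives the bound
# sum_k eta_k^2 <= eta0^2 * (1 + 1 / tau), here 0.01 * 101 = 1.01.
square_sum_bound = 0.1 ** 2 * (1.0 + 1.0 / 0.01)
```

The partial sum of the step sizes keeps growing roughly like (eta_0 / tau) * ln(1 + tau * k), while the partial sum of their squares stays below `square_sum_bound` for every horizon.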

Figure 3a,b show the ACC and MAE curves of different algorithms on the first synthetic dataset. As the number of iterations increased, the MAE of our dSVOR algorithm decreased, and the ACC increased significantly. After about 500 iterations, the dSVOR algorithm converged to a value that was almost the same as that of cSVOR, while the result of ncSVOR was still some distance away from them. This means that a single node could not train a model with good performance using only its own data. The proposed dSVOR algorithm, which uses the local data of each node and the parameter estimates from neighbor nodes, achieved performance similar to that of the corresponding centralized method.

Figure 3. (a) ACC and (b) MAE curves of different algorithms on the first synthetic dataset.

Figure 4 gives the parameters of each node estimated by different algorithms. In the ncSVOR algorithm, the estimated parameters obtained by different nodes were quite different. Thus, the model obtained by each node with its own data was quite different from the model trained using all the data. In contrast, the estimated parameters of different nodes in dSVOR were almost the same as the parameters in cSVOR. This illustrates the consensus of the proposed dSVOR algorithm. Because we used a linear kernel function here, the optimal direction w in the centralized method had an explicit expression that allowed us to compare it with the estimates of the distributed algorithms. For the following experiments, which use nonlinear kernel functions, we do not report consensus results.

Figure 4. Final estimated parameters of different methods.

On the second dataset, we used a Gaussian kernel function. The kernel size was set to be σ=1K after Z-score normalization, where K is the dimension of input space. In all methods, positive constant C was set to be 1000/N. As analyzed before, in our method, we set a relatively large D and a relatively large α: D=200 and α=0.9. We did not set α to 1 because we wanted to use the strong convexity of the l2-norm regularization term to increase the convexity of the objective function, which is theoretically beneficial to the optimization of the problem. The learning rate parameters were still set to be η0=0.1 and τ=0.01.

Figure 5a,b show the ACC and MAE curves of different algorithms on the second synthetic dataset. The proposed dSVOR algorithm was able to obtain almost the same result as that of the centralized method, while ncSVOR could not.

Figure 5. (a) ACC and (b) MAE curves of different algorithms on the second synthetic dataset.

We also conducted experiments under different hyperparameters D and α to show the parameter sensitivity of dSVOR. Figure 6 gives the MAEs of dSVOR for different D when α was fixed at 0.9. As D increased, the performance of dSVOR gradually improved and was eventually almost the same as that of the centralized method. With a relatively large approximation dimension D ≥ 100, dSVOR could always obtain an MAE similar to that of cSVOR. However, as mentioned before, an overlarge D may cause redundancy. So, when using a large D to ensure good performance, it is better to use the sparse regularization term to reduce the redundancy. Figure 7a,b give the MAEs of dSVOR and the proportions of dimensions of w_m that were equal to 0 for different α when D was fixed at 200. The MAE was stable under different α, but the sparsity of w_m was greatly affected by α. A small α led to a dense w_m, which caused a lot of redundancy. A large α yielded a sparse w_m; the dimensions that converged to 0 no longer needed to be stored, computed, or transmitted, thus saving storage, computation, and communication resources.
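The sparsity measure used in this comparison, the proportion of dimensions of w_m equal to 0, can be computed as in the following sketch. The tolerance argument `tol` and the example vector are hypothetical additions for illustration.

```python
def zero_fraction(weights, tol=0.0):
    """Proportion of entries whose magnitude is at most tol."""
    return sum(1 for w in weights if abs(w) <= tol) / len(weights)


def nonzero_indices(weights, tol=0.0):
    """Indices that would still need to be stored and transmitted."""
    return [i for i, w in enumerate(weights) if abs(w) > tol]


# Hypothetical 5-dimensional estimate with three exact zeros.
w_m = [0.0, 0.5, 0.0, -1.2, 0.0]

sparsity = zero_fraction(w_m)   # 3 of 5 entries are zero
active = nonzero_indices(w_m)   # only entries 1 and 3 remain
print(sparsity, active)
```

With a sparse estimate, a node only needs to exchange the active entries with its neighbors, which is the resource saving described above.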

Figure 6. MAEs of dSVOR on the second synthetic dataset for different D when α is fixed as 0.9.

Figure 7. Results of dSVOR on the second synthetic dataset for different α when D is fixed as 200. (a) MAEs; (b) proportions of dimensions of w_m that are equal to 0.

5.2. A Real-World Example

We now take the distributed fault severity diagnosis of rolling element bearings as a real-world example to illustrate the effectiveness of dSVOR.

Rolling element bearings are widely used in factory equipment. The fault severity diagnosis of bearings is a crucial task for ensuring reliability in industrial processes. In recent years, data-driven methods have been widely used to identify faults and their severity [36]. To achieve good performance, these data-driven methods usually require a lot of data. However, because faults are rare, a single sensor can collect only very few fault data, and the faults encountered by each factory may also differ. Thus, data from many sensors in many factories are needed to train a proper model. Factories may be unwilling to disclose data about their equipment, in which case the data cannot be transmitted to others. Centralized methods, which need all the data to be available at a central node, then become inapplicable, and distributed methods become a better choice. Taking into account the ordinal information in the fault severity, the proposed dSVOR algorithm is a suitable choice.

In this study, we used the rolling element bearing data provided by Case Western Reserve University (CWRU) [37] for the experiments. The CWRU data are vibration signals of drive end and fan end bearings collected by sensors at 12,000 and 48,000 samples/s under four different loads of 0–3 hp. There are three types of faults: outer race (OR), inner race (IR), and ball (B) faults, and each type has at most four severity levels (fault width: 0.18, 0.36, 0.53, 0.71 mm). In the experiments, we used drive end bearing data collected at 12,000 samples/s, and performed 4-level fault severity diagnosis in a total of 12 situations (3 different fault types and 4 different loads).

We adopted the feature based on permutation entropy (PE) proposed in [38] as the input x. For one datum, we extracted a sequence of length 2400 from the vibration signal data. This sequence was decomposed into a series of intrinsic mode functions (IMFs) by ensemble empirical mode decomposition (EEMD) with 100 ensembles and a noise amplitude of 0.2 to capture information on multiple time scales. Then, the PE values of the first 5 IMFs were calculated as the input feature of this piece of data.
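The PE value of a single sequence (for instance, one IMF) can be sketched as below. This follows the standard ordinal-pattern definition of permutation entropy; the embedding order and delay are assumptions because they are not stated in this section (see [38] for the settings of the original feature), and the EEMD step is not reproduced here.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=False):
    """Shannon entropy (in nats) of the ordinal patterns found in series."""
    num_windows = len(series) - (order - 1) * delay
    if num_windows <= 0:
        raise ValueError("series is too short for the given order and delay")
    patterns = Counter()
    for start in range(num_windows):
        window = [series[start + j * delay] for j in range(order)]
        # The pattern is the index order that sorts the window (ties keep
        # their original order because sorted() is stable).
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1
    entropy = -sum((count / num_windows) * math.log(count / num_windows)
                   for count in patterns.values())
    if normalize:
        entropy /= math.log(math.factorial(order))
    return entropy


# Short hypothetical sequence with three distinct patterns of order 3.
pe_mixed = permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=3)
# A monotone sequence contains a single pattern, so its entropy is zero.
pe_monotone = permutation_entropy(list(range(20)), order=3)
print(f"{pe_mixed:.4f} {pe_monotone:.4f}")
```

Applying this function to each of the first 5 IMFs would give a 5-dimensional feature vector of the kind described above.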

For each fault severity level, we randomly took 300 training samples and 200 testing samples, and the samples in the testing set were different from those in the training set. For 4-level fault severity diagnosis, there were a total of 1200 training samples and 800 testing samples. These training samples were randomly assigned to 20 nodes to simulate the situation where the data were collected and stored by these nodes in a distributed manner.
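The random assignment of training samples to nodes can be sketched as follows. The equal-sized split and the function name `assign_to_nodes` are assumptions; the text only states that the samples were assigned randomly.

```python
import random


def assign_to_nodes(num_samples, num_nodes, seed=0):
    """Shuffle sample indices and deal them out to nodes in round-robin order."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[m::num_nodes] for m in range(num_nodes)]


# 1200 training samples spread over 20 nodes, as in the experiment above.
partition = assign_to_nodes(num_samples=1200, num_nodes=20)
sizes = [len(part) for part in partition]
print(sizes[:5])
```

Each node then trains only on the samples whose indices it received, which mimics data that are collected and stored locally.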

In the experiments, we used a Gaussian kernel function with kernel size σ=1K after Z-score normalization. In all methods, the positive constant C was set to 10,000/N. In our method, we again set relatively large hyperparameters D and α: D=200 and α=0.9. The other parameters used the same settings as before, i.e., η0=0.1 and τ=0.01.

Table 1 shows the experimental results, where each value is the mean ± standard deviation of 10 independent experiments. The performance of ncSVOR was worse than that of cSVOR because each node only had part of the training samples, which were not enough to represent the entire training set and train a proper model. In contrast to ncSVOR, the proposed dSVOR algorithm achieved results similar to those of cSVOR. In dSVOR, each node can only use its own data and exchange some estimated parameters with neighbor nodes. It is therefore satisfactory that dSVOR achieves performance close to that of the centralized method, which uses all the data from all nodes.

Table 1. ACCs and MAEs of different algorithms in a real-world example (mean ± std).

| Fault Type | Load (hp) | cSVOR ACC | cSVOR MAE | ncSVOR ACC | ncSVOR MAE | dSVOR ACC | dSVOR MAE |
|---|---|---|---|---|---|---|---|
| OR | 0 | 0.9585 ± 0.0057 | 0.0415 ± 0.0057 | 0.7977 ± 0.0211 | 0.2069 ± 0.0230 | 0.9553 ± 0.0064 | 0.0447 ± 0.0064 |
| OR | 1 | 0.9317 ± 0.0147 | 0.0683 ± 0.0147 | 0.7376 ± 0.0228 | 0.2726 ± 0.0264 | 0.9278 ± 0.0136 | 0.0727 ± 0.0138 |
| OR | 2 | 0.9547 ± 0.0091 | 0.0457 ± 0.0096 | 0.7901 ± 0.0136 | 0.2172 ± 0.0153 | 0.9517 ± 0.0094 | 0.0492 ± 0.0099 |
| OR | 3 | 0.9253 ± 0.0099 | 0.0747 ± 0.0099 | 0.7599 ± 0.0158 | 0.2489 ± 0.0173 | 0.9243 ± 0.0095 | 0.0758 ± 0.0096 |
| IR | 0 | 0.8853 ± 0.0133 | 0.1149 ± 0.0133 | 0.7472 ± 0.0087 | 0.2589 ± 0.0091 | 0.8844 ± 0.0120 | 0.1157 ± 0.0120 |
| IR | 1 | 0.8624 ± 0.0112 | 0.1376 ± 0.0112 | 0.7288 ± 0.0103 | 0.2781 ± 0.0110 | 0.8556 ± 0.0137 | 0.1444 ± 0.0137 |
| IR | 2 | 0.8435 ± 0.0109 | 0.1565 ± 0.0109 | 0.7071 ± 0.0116 | 0.3000 ± 0.0133 | 0.8391 ± 0.0113 | 0.1611 ± 0.0113 |
| IR | 3 | 0.8726 ± 0.0095 | 0.1291 ± 0.0091 | 0.7238 ± 0.0110 | 0.2918 ± 0.0122 | 0.8632 ± 0.0094 | 0.1392 ± 0.0091 |
| B | 0 | 0.7768 ± 0.0110 | 0.2586 ± 0.0129 | 0.5440 ± 0.0184 | 0.5975 ± 0.0311 | 0.7594 ± 0.0221 | 0.2771 ± 0.0245 |
| B | 1 | 0.7836 ± 0.0105 | 0.2419 ± 0.0099 | 0.5770 ± 0.0124 | 0.5284 ± 0.0195 | 0.7710 ± 0.0067 | 0.2540 ± 0.0106 |
| B | 2 | 0.8256 ± 0.0088 | 0.1886 ± 0.0088 | 0.5820 ± 0.0156 | 0.5341 ± 0.0264 | 0.8177 ± 0.0147 | 0.1980 ± 0.0150 |
| B | 3 | 0.8627 ± 0.0167 | 0.1541 ± 0.0193 | 0.6345 ± 0.0138 | 0.4648 ± 0.0253 | 0.8485 ± 0.0169 | 0.1710 ± 0.0204 |
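To make the comparison in Table 1 concrete, the sketch below computes the ACC gap to cSVOR for each of the 12 situations, using only the mean ACC values from the table. The variable names are illustrative.

```python
# Mean ACC values from Table 1: (fault type, load) -> (cSVOR, ncSVOR, dSVOR).
mean_acc = {
    ("OR", 0): (0.9585, 0.7977, 0.9553),
    ("OR", 1): (0.9317, 0.7376, 0.9278),
    ("OR", 2): (0.9547, 0.7901, 0.9517),
    ("OR", 3): (0.9253, 0.7599, 0.9243),
    ("IR", 0): (0.8853, 0.7472, 0.8844),
    ("IR", 1): (0.8624, 0.7288, 0.8556),
    ("IR", 2): (0.8435, 0.7071, 0.8391),
    ("IR", 3): (0.8726, 0.7238, 0.8632),
    ("B", 0): (0.7768, 0.5440, 0.7594),
    ("B", 1): (0.7836, 0.5770, 0.7710),
    ("B", 2): (0.8256, 0.5820, 0.8177),
    ("B", 3): (0.8627, 0.6345, 0.8485),
}

# Largest shortfall of dSVOR relative to cSVOR over all situations.
worst_dsvor_gap = max(c - d for c, _, d in mean_acc.values())
# Smallest shortfall of ncSVOR relative to cSVOR over all situations.
best_ncsvor_gap = min(c - n for c, n, _ in mean_acc.values())

print(f"largest cSVOR - dSVOR gap:   {worst_dsvor_gap:.4f}")
print(f"smallest cSVOR - ncSVOR gap: {best_ncsvor_gap:.4f}")
```

In terms of mean ACC, dSVOR is never more than 0.0174 below cSVOR (B fault, 0 hp), whereas ncSVOR is at least 0.1336 below cSVOR (IR fault, 1 hp).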

Taking the dataset of the IR fault type and 0 hp load as an example, we also show the results of dSVOR under different hyperparameters D and α in Figures 8 and 9. Figure 8 shows that, with a relatively large random approximation dimension D ≥ 100, dSVOR could obtain an MAE similar to that of cSVOR, which illustrates the effectiveness of the random approximation. Figure 9 shows that a relatively large α can lead to a sparse w_m without affecting the MAE performance, thus effectively reducing redundancy.

Figure 8. MAEs of dSVOR on the CWRU dataset of IR fault type and 0 hp load for different D when α is fixed as 0.9.

Figure 9. Results of dSVOR on the CWRU dataset of IR fault type and 0 hp load for different α when D is fixed as 200. (a) MAEs; (b) proportions of dimensions of w_m that are equal to 0.

6. Conclusions

When data are collected and stored by multiple nodes in a distributed manner and are difficult to transmit to a central node, existing centralized ordinal regression methods become inapplicable. To handle the ordinal regression problem in such distributed scenarios, we extended SVORIM to a distributed version and derived a distributed SVOR (dSVOR) algorithm. In dSVOR, each node combines the parameters estimated by its neighbors and performs local calculations using only its own data. After convergence, each node obtains a model whose performance is close to that obtained by the centralized method, which relies on all the data being available at a central node. Theoretically, we analyzed the consensus and the convergence of dSVOR. Practically, we carried out experiments on synthetic data and a real-world example to illustrate its effectiveness.

In our future work, we intend to consider how to automatically determine proper parameters in dSVOR, e.g., by introducing multi-kernel learning to automatically find suitable parameters for the random approximation. We also aim to design adaptive strategies for adjusting the combination coefficients.

Appendix A. Proof of Theorem 1

Proof. 

For convenience, we use the following notation:

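$$\theta^{k}=[\theta_{1}^{k},\ldots,\theta_{M}^{k}]^{T},\quad G_{\theta}^{k}=[\nabla_{\theta_{1}}J_{1}(\theta_{1}^{k}),\ldots,\nabla_{\theta_{M}}J_{M}(\theta_{M}^{k})]^{T}. \tag{A1}$$

Then, iterative Equation (20) can be written as follows.

$$\theta^{k+1}=C\theta^{k}-\eta^{k}G_{\theta}^{k}. \tag{A2}$$

Considering $\bar{\theta}^{k}=\frac{1}{M}\mathbf{1}^{T}\theta^{k}$, the proof of $\lim_{k\to\infty}\|\theta_{m}^{k}-\bar{\theta}^{k}\|=0,\ \forall m$, can be done by proving $\lim_{k\to\infty}\|\theta^{k}-\frac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}\|=0$. We first construct

$$\begin{aligned}\theta^{k+1}-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k+1}&=\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\theta^{k+1}\\&=\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\left(C\theta^{k}-\eta^{k}G_{\theta}^{k}\right)\\&=\left(C-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\theta^{k}-\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\eta^{k}G_{\theta}^{k}.\end{aligned} \tag{A3}$$

Notice that

$$C\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}=\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}=\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}. \tag{A4}$$

We have

$$\begin{aligned}\theta^{k+1}-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k+1}&=\left(C-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\theta^{k}-\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\eta^{k}G_{\theta}^{k}-C\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}+\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}\\&=\left(C-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\left(\theta^{k}-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}\right)-\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\eta^{k}G_{\theta}^{k}.\end{aligned} \tag{A5}$$

For convenience, we use the notation $\Delta\theta^{k}=\theta^{k}-\frac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}$; then

$$\Delta\theta^{k+1}=\left(C-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\Delta\theta^{k}-\left(I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right)\eta^{k}G_{\theta}^{k}. \tag{A6}$$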

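We then prove $\lim_{k\to\infty}\|\Delta\theta^{k}\|=0$.

Taking the $l_{2}$-norm on both sides of the above equation, we have

$$\|\Delta\theta^{k+1}\|\le\left\|C-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right\|\|\Delta\theta^{k}\|+\eta^{k}\left\|I-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\right\|\|G_{\theta}^{k}\|=\rho\|\Delta\theta^{k}\|+\eta^{k}c\|G_{\theta}^{k}\|, \tag{A7}$$

where $c=\|I-\frac{1}{M}\mathbf{1}\mathbf{1}^{T}\|$ is a positive constant, and $\rho$ denotes the spectral norm of $C-\frac{1}{M}\mathbf{1}\mathbf{1}^{T}$, which is less than 1 according to Assumption A1.

Since $J_{m}(\theta_{m})$ is Lipschitz continuous, $G_{\theta}^{k}$ is bounded, so there exists a positive constant $L$ satisfying $\|G_{\theta}^{k}\|\le L$. Thus,

$$\|\Delta\theta^{k+1}\|\le\rho\|\Delta\theta^{k}\|+\eta^{k}cL. \tag{A8}$$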

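Now we prove that $\lim_{k\to\infty}\|\Delta\theta^{k}\|=0$. To achieve this, we construct an auxiliary variable $u^{k}$ that satisfies

$$u^{k+1}=\rho u^{k}+\eta^{k}cL, \tag{A9}$$

and $u^{0}=\|\Delta\theta^{0}\|\ge 0$. If $u^{k}\ge\|\Delta\theta^{k}\|\ge 0$, then

$$u^{k+1}=\rho u^{k}+\eta^{k}cL\ge\rho\|\Delta\theta^{k}\|+\eta^{k}cL\ge\|\Delta\theta^{k+1}\|. \tag{A10}$$

So, by induction, $u^{k}\ge\|\Delta\theta^{k}\|\ge 0$ for all $k\ge 0$. With $\rho<1$ and $\lim_{k\to\infty}\eta^{k}=0$, we have $\lim_{k\to\infty}u^{k}=0$. Then

$$0\le\lim_{k\to\infty}\|\Delta\theta^{k}\|\le\lim_{k\to\infty}u^{k}=0. \tag{A11}$$

So we have

$$\lim_{k\to\infty}\left\|\theta^{k}-\tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}\theta^{k}\right\|=\lim_{k\to\infty}\|\Delta\theta^{k}\|=0. \tag{A12}$$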

The proof of Theorem 1 is completed. □
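The behavior of the auxiliary sequence in Equation (A9) can be checked numerically. The sketch below is only an illustration: the step-size schedule η^k = η0/(1 + τk) is an assumption (the section gives η0 and τ but not the formula), and the values of ρ, c, L, and the starting point are hypothetical.

```python
def step_size(k, eta0=0.1, tau=0.01):
    """Assumed diminishing schedule; it tends to 0 as k grows."""
    return eta0 / (1.0 + tau * k)


def auxiliary_sequence(u0, rho, c, lipschitz, num_iters):
    """Iterate u^{k+1} = rho * u^k + eta^k * c * L and record every value."""
    u = u0
    history = [u]
    for k in range(num_iters):
        u = rho * u + step_size(k) * c * lipschitz
        history.append(u)
    return history


# Hypothetical constants with rho < 1, as required by the proof.
history = auxiliary_sequence(u0=5.0, rho=0.8, c=1.0, lipschitz=1.0,
                             num_iters=10000)
print(history[100], history[1000], history[-1])
```

The initial value is forgotten geometrically at rate ρ, after which the sequence follows η^k cL/(1 - ρ) and therefore decays to 0 together with the step size. Because this sequence upper-bounds the disagreement between the nodes and the network average, the disagreement vanishes as well.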

Appendix B. Proof of Theorem 2

Proof. 

From Equation (20), we can obtain

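$$\sum_{m=1}^{M}\theta_{m}^{k+1}=\sum_{m=1}^{M}\sum_{n\in\mathcal{N}_{m}\cup\{m\}}c_{mn}\theta_{n}^{k}-\sum_{m=1}^{M}\eta^{k}\nabla_{\theta_{m}}J_{m}(\theta_{m}^{k})=\sum_{m=1}^{M}\theta_{m}^{k}-\sum_{m=1}^{M}\eta^{k}\nabla_{\theta_{m}}J_{m}(\theta_{m}^{k}). \tag{A13}$$

From Theorem 1, we have $\lim_{k\to\infty}\|\theta_{m}^{k}-\bar{\theta}^{k}\|=0$. So, for a sufficiently large $k=k_{1}$, the above equation can be written as follows:

$$\bar{\theta}^{k_{1}+1}=\bar{\theta}^{k_{1}}-\frac{\eta^{k_{1}}}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}}), \tag{A14}$$

where $\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})$ denotes the subgradient of $J_{m}(\theta_{m})$ with respect to $\theta_{m}$ when $\theta_{m}=\bar{\theta}^{k_{1}}$.

Supposing $\theta^{*}=\arg\min J$, we have

$$\begin{aligned}\|\bar{\theta}^{k_{1}+1}-\theta^{*}\|^{2}&=\left\|\bar{\theta}^{k_{1}}-\frac{\eta^{k_{1}}}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})-\theta^{*}\right\|^{2}\\&=\|\bar{\theta}^{k_{1}}-\theta^{*}\|^{2}+\left\|\frac{\eta^{k_{1}}}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})\right\|^{2}-\frac{2\eta^{k_{1}}}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})^{T}(\bar{\theta}^{k_{1}}-\theta^{*}).\end{aligned} \tag{A15}$$

Since $J_{m}(\theta_{m})$ is Lipschitz continuous for all $m$, there exists a positive constant $L$ satisfying $\left\|\frac{1}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\theta_{m})\right\|^{2}\le L^{2}$. Thus,

$$\|\bar{\theta}^{k_{1}+1}-\theta^{*}\|^{2}\le\|\bar{\theta}^{k_{1}}-\theta^{*}\|^{2}+(\eta^{k_{1}})^{2}L^{2}-\frac{2\eta^{k_{1}}}{M}\sum_{m=1}^{M}\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})^{T}(\bar{\theta}^{k_{1}}-\theta^{*}). \tag{A16}$$

Because $J_{m}(\theta_{m})$ is convex, we have

$$\nabla_{\theta_{m}}J_{m}(\bar{\theta}^{k_{1}})^{T}(\bar{\theta}^{k_{1}}-\theta^{*})\ge J_{m}(\bar{\theta}^{k_{1}})-J_{m}(\theta^{*}). \tag{A17}$$

Then

$$\begin{aligned}\|\bar{\theta}^{k_{1}+1}-\theta^{*}\|^{2}&\le\|\bar{\theta}^{k_{1}}-\theta^{*}\|^{2}+(\eta^{k_{1}})^{2}L^{2}-\frac{2\eta^{k_{1}}}{M}\sum_{m=1}^{M}\left(J_{m}(\bar{\theta}^{k_{1}})-J_{m}(\theta^{*})\right)\\&=\|\bar{\theta}^{k_{1}}-\theta^{*}\|^{2}+(\eta^{k_{1}})^{2}L^{2}-\frac{2\eta^{k_{1}}}{M}\left(J(\bar{\theta}^{k_{1}})-J^{*}\right).\end{aligned} \tag{A18}$$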

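If $\lim_{k\to\infty}\sum_{m=1}^{M}J_{m}(\theta_{m}^{k})\neq J^{*}$, then there exist $\varepsilon>0$ and $k_{2}>0$ such that $J(\bar{\theta}^{k})-J^{*}>\varepsilon$ for all $k\ge k_{2}$. Let $k_{3}=\max\{k_{1},k_{2}\}$. Then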

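$$\|\bar{\theta}^{k_{3}+1}-\theta^{*}\|^{2}<\|\bar{\theta}^{k_{3}}-\theta^{*}\|^{2}+(\eta^{k_{3}})^{2}L^{2}-\frac{2\eta^{k_{3}}}{M}\varepsilon. \tag{A19}$$

Taking the summation of both sides of the above equation over $k=k_{3},\ldots,k_{3}+k^{*}$, we obtain

$$\|\bar{\theta}^{k_{3}+k^{*}}-\theta^{*}\|^{2}<\|\bar{\theta}^{k_{3}}-\theta^{*}\|^{2}+\sum_{k=k_{3}}^{k_{3}+k^{*}}(\eta^{k})^{2}L^{2}-\sum_{k=k_{3}}^{k_{3}+k^{*}}\frac{2\eta^{k}}{M}\varepsilon. \tag{A20}$$

Since $\|\bar{\theta}^{k_{3}+k^{*}}-\theta^{*}\|^{2}\ge 0$, we have

$$\|\bar{\theta}^{k_{3}}-\theta^{*}\|^{2}+\sum_{k=k_{3}}^{k_{3}+k^{*}}(\eta^{k})^{2}L^{2}>\sum_{k=k_{3}}^{k_{3}+k^{*}}\frac{2\eta^{k}}{M}\varepsilon. \tag{A21}$$

Thus,

$$\frac{\|\bar{\theta}^{k_{3}}-\theta^{*}\|^{2}+L^{2}\sum_{k=k_{3}}^{k_{3}+k^{*}}(\eta^{k})^{2}}{\frac{2}{M}\sum_{k=k_{3}}^{k_{3}+k^{*}}\eta^{k}}>\varepsilon. \tag{A22}$$

Since $\sum_{k=0}^{+\infty}\eta^{k}=+\infty$ and $\sum_{k=0}^{+\infty}(\eta^{k})^{2}<+\infty$,

$$\lim_{k^{*}\to\infty}\frac{\|\bar{\theta}^{k_{3}}-\theta^{*}\|^{2}+L^{2}\sum_{k=k_{3}}^{k_{3}+k^{*}}(\eta^{k})^{2}}{\frac{2}{M}\sum_{k=k_{3}}^{k_{3}+k^{*}}\eta^{k}}=0, \tag{A23}$$

which conflicts with Equation (A22). Thus,

$$\lim_{k\to\infty}\sum_{m=1}^{M}J_{m}(\theta_{m}^{k})=J^{*}. \tag{A24}$$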

The proof of Theorem 2 is completed. □
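The combine-then-subgradient iteration analyzed in the two proofs can be illustrated on a toy problem. The sketch below is not the dSVOR algorithm itself: the scalar local costs J_m(θ) = |θ - a_m|, the 4-node ring with a doubly stochastic combination matrix, and the step-size schedule η^k = η0/(1 + τk) are all assumptions chosen so that the costs are convex and Lipschitz and the step sizes are non-summable but square-summable.

```python
def step_size(k, eta0=0.1, tau=0.01):
    """Assumed schedule: the sum of eta^k diverges, the sum of squares converges."""
    return eta0 / (1.0 + tau * k)


def sign(x):
    """A subgradient of |x| (0 is a valid choice at x = 0)."""
    return (x > 0) - (x < 0)


# Hypothetical local targets; node m holds J_m(theta) = |theta - targets[m]|.
targets = [1.0, 2.0, 3.0, 10.0]

# Doubly stochastic combination weights for a 4-node ring (illustrative).
weights = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]


def total_cost(theta):
    """Global objective J(theta) = sum_m J_m(theta)."""
    return sum(abs(theta - a) for a in targets)


theta = [0.0] * 4
for k in range(5000):
    eta = step_size(k)
    # Combine the estimates of neighbors, then subtract the local subgradient
    # evaluated at the node's own previous estimate.
    combined = [sum(w * t for w, t in zip(row, theta)) for row in weights]
    theta = [combined[m] - eta * sign(theta[m] - targets[m]) for m in range(4)]

average = sum(theta) / 4
gap = max(abs(t - average) for t in theta)
cost_at_average = total_cost(average)
optimal_cost = total_cost(2.5)  # any point in [2, 3] minimizes J; J* = 10
print(f"average = {average:.3f}, consensus gap = {gap:.4f}, "
      f"J(average) = {cost_at_average:.3f}, J* = {optimal_cost:.1f}")
```

In this toy run, the node estimates agree up to a gap that shrinks with the step size, in line with Theorem 1, and the cost at the network average approaches the optimal value, in line with Theorem 2.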

Author Contributions

Conceptualization, H.L. and C.L.; methodology, H.L. and C.L.; formal analysis, H.L., J.T. and C.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L., J.T. and C.L. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This work was supported by the National Natural Science Foundation of China (grant No. U20A20158), the Key-Area Research and Development Program of Guangdong Province (grant No. 2021B0101410004), and the National Program for Special Support of Eminent Professionals.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Doyle O.M., Westman E., Marquand A.F., Mecocci P., Vellas B., Tsolaki M., Kłoszewska I., Soininen H., Lovestone S., Williams S.C., et al. Predicting progression of Alzheimer’s disease using ordinal regression. PLoS ONE. 2014;9:e105542. doi: 10.1371/journal.pone.0105542.
2. Allen J., Eboli L., Mazzulla G., Ortúzar J.D. Effect of critical incidents on public transport satisfaction and loyalty: An Ordinal Probit SEM-MIMIC approach. Transportation. 2020;47:827–863. doi: 10.1007/s11116-018-9921-4.
3. Gutiérrez P.A., Salcedo-Sanz S., Hervás-Martínez C., Carro-Calvo L., Sánchez-Monedero J., Prieto L. Ordinal and nominal classification of wind speed from synoptic pressure patterns. Eng. Appl. Artif. Intell. 2013;26:1008–1015. doi: 10.1016/j.engappai.2012.10.018.
4. Cao W., Mirjalili V., Raschka S. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognit. Lett. 2020;140:325–331. doi: 10.1016/j.patrec.2020.11.008.
5. Hirk R., Hornik K., Vana L. Multivariate ordinal regression models: An analysis of corporate credit ratings. Stat. Method. Appl. 2019;28:507–539. doi: 10.1007/s10260-018-00437-7.
6. Zhao X., Zuo M.J., Liu Z., Hoseini M.R. Diagnosis of artificially created surface damage levels of planet gear teeth using ordinal ranking. Measurement. 2013;46:132–144. doi: 10.1016/j.measurement.2012.05.031.
7. Kotsiantis S.B., Pintelas P.E. A cost sensitive technique for ordinal classification problems; Proceedings of the 3rd Hellenic Conference on Artificial Intelligence; Samos, Greece. 5–8 May 2004; pp. 220–229.
8. Tu H.-H., Lin H.-T. One-sided support vector regression for multiclass cost-sensitive classification; Proceedings of the 27th International Conference on Machine Learning; Haifa, Israel. 21–24 June 2010; pp. 49–56.
9. Harrington E.F. Online ranking/collaborative filtering using the perceptron algorithm; Proceedings of the 20th International Conference on Machine Learning; Washington, DC, USA. 21–24 August 2003; pp. 250–257.
10. Gutiérrez P.A., Perez-Ortiz M., Sanchez-Monedero J., Fernandez-Navarro F., Hervas-Martinez C. Ordinal regression methods: Survey and experimental study. IEEE Trans. Knowl. Data Eng. 2015;28:127–146. doi: 10.1109/TKDE.2015.2457911.
11. Chu W., Keerthi S.S. New approaches to support vector ordinal regression; Proceedings of the 22nd International Conference on Machine Learning; Bonn, Germany. 7–11 August 2005; pp. 145–152.
12. Chu W., Keerthi S.S. Support vector ordinal regression. Neural Comput. 2007;19:792–815. doi: 10.1162/neco.2007.19.3.792.
13. Li T., Sahu A.K., Talwalkar A., Smith V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020;37:50–60. doi: 10.1109/MSP.2020.2975749.
14. Liu H., Tu J., Li C. Distributed Ordinal Regression Over Networks. IEEE Access. 2021;9:62493–62504. doi: 10.1109/ACCESS.2021.3074629.
15. McCullagh P. Regression models for ordinal data. J. Royal Stat. Soc. Ser. B Methodol. 1980;42:109–142. doi: 10.1111/j.2517-6161.1980.tb01109.x.
16. Williams R. Understanding and interpreting generalized ordered logit models. J. Math. Sociol. 2016;40:7–20. doi: 10.1080/0022250X.2015.1112384.
17. Wang H., Shi Y., Niu L., Tian Y. Nonparallel Support Vector Ordinal Regression. IEEE Trans. Cybern. 2017;47:3306–3317. doi: 10.1109/TCYB.2017.2682852.
18. Jiang H., Yang Z., Li Z. Non-parallel hyperplanes ordinal regression machine. Knowl.-Based Syst. 2021;216:106593. doi: 10.1016/j.knosys.2020.106593.
19. Li L., Lin H.-T. Ordinal regression by extended binary classification. Adv. Neural Inf. Process. Syst. 2006;19:865–872.
20. Liu X., Fan F., Kong L., Diao Z., Xie W., Lu J., You J. Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing. 2020;388:34–44. doi: 10.1016/j.neucom.2020.01.025.
21. Cattivelli F.S., Sayed A.H. Diffusion LMS strategies for distributed estimation. IEEE Trans. Signal Process. 2009;58:1035–1048. doi: 10.1109/TSP.2009.2033729.
22. Li C., Shen P., Liu Y., Zhang Z. Diffusion information theoretic learning for distributed estimation over network. IEEE Trans. Signal Process. 2013;61:4011–4024. doi: 10.1109/TSP.2013.2265221.
23. Boyd S., Parikh N., Chu E., Peleato B., Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011;3:1–122. doi: 10.1561/2200000016.
24. Yang T., Yi X., Wu J., Yuan Y., Wu D., Meng Z., Hong Y., Wang H., Lin Z., Johansson K.H. A survey of distributed optimization. Annu. Rev. Control. 2019;47:278–305. doi: 10.1016/j.arcontrol.2019.05.006.
25. Shen P., Li C. Distributed information theoretic clustering. IEEE Trans. Signal Process. 2014;62:3442–3453. doi: 10.1109/TSP.2014.2327010.
26. Olfati-Saber R. Distributed Kalman filtering for sensor networks; Proceedings of the 46th Conference on Decision and Control; New Orleans, LA, USA. 12–14 December 2007; pp. 5492–5498.
27. Miao X., Liu Y., Zhao H., Li C. Distributed online one-class support vector machine for anomaly detection over networks. IEEE Trans. Cybern. 2018;49:1475–1488. doi: 10.1109/TCYB.2018.2804940.
28. Rahimi A., Recht B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 2007;20:1177–1184.
29. Cover T.M. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Electron. Comput. 1965;14:326–334. doi: 10.1109/PGEC.1965.264137.
30. Vedaldi A., Zisserman A. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell. 2012;34:480–492. doi: 10.1109/TPAMI.2011.153.
31. Schölkopf B., Smola A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press; Cambridge, MA, USA: 2002.
32. Zou H., Hastie T. Regularization and variable selection via the elastic net. J. Royal Stat. Soc. Ser. B. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x.
33. Bertsekas D. Convex Optimization Algorithms. Athena Scientific; Belmont, MA, USA: 2015.
34. Xiao L., Boyd S. Fast linear iterations for distributed averaging. Syst. Control Lett. 2004;53:65–78. doi: 10.1016/j.sysconle.2004.02.022.
35. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019;32:8024–8035.
36. Cerrada M., Sánchez R.V., Li C., Pacheco F., Cabrera D., de Oliveira J.V., Vásquez R.E. A review on data-driven fault severity assessment in rolling bearings. Mech. Syst. Signal Process. 2018;99:169–196. doi: 10.1016/j.ymssp.2017.06.012.
37. Smith W.A., Randall R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015;64:100–131. doi: 10.1016/j.ymssp.2015.04.021.
38. Zhang X., Liang Y., Zhou J. A novel bearing fault diagnosis model integrated permutation entropy, ensemble empirical mode decomposition and optimized SVM. Measurement. 2015;69:164–179. doi: 10.1016/j.measurement.2015.03.017.


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
