Abstract
It has been shown that adding a chaotic sequence to the weight update during the training of neural networks makes the chaos injection-based gradient method (CIBGM) superior to the standard backpropagation algorithm. This paper presents a theoretical convergence analysis of CIBGM for training feedforward neural networks. We consider both batch learning and online learning. Under mild conditions, we prove the weak convergence, i.e., the training error tends to a constant and the gradient of the error function tends to zero. Moreover, the strong convergence of CIBGM is obtained under an additional condition. The theoretical results are substantiated by a simulation example.
Keywords: Feedforward neural networks, Chaos injection-based gradient method, Batch learning, Online learning, Convergence
Introduction
The gradient method (GM) has been widely used as a training algorithm for feedforward neural networks. GM can be implemented in two practical ways: batch learning and online learning (Haykin 2008). Batch learning accumulates the weight corrections over all the training samples before actually performing an update, whereas online learning updates the network weights immediately after each training sample is presented. Although GM is widely used in the neural network field, it suffers from slow learning and a tendency to become trapped in local minima. To overcome these problems, many heuristic improvements have been proposed, such as adding a penalty term to the error function (Karnin 1990), adding a momentum term to the weight update (Zhang et al. 2006), and injecting noise into the learning procedure (Sum et al. 2012a, b; Ho et al. 2010). Other nonlinear optimization algorithms, such as the Newton method (Osowski et al. 1996), the conjugate-gradient method (Charalambous 1992), extended Kalman filtering (Iiguni et al. 1992), and the Levenberg–Marquardt method (Hagan and Menhaj 1994), have also been used for training neural networks. Although these algorithms converge in fewer iterations than GM, they require much more computation per pattern, which makes them less suitable, especially for online learning (Behera et al. 2006). Thus, the gradient method remains attractive because of its simplicity and ease of implementation.
Since convergence is a precondition for the practical use of a learning algorithm, the convergence analysis of GM and its various modifications has attracted many researchers in the neural network field (Fine and Mukherjee 1999; Wu et al. 2005, 2011; Wang et al. 2011; Shao and Zheng 2011; Zhang et al. 2007, 2008, 2009, 2012, 2014; Fan et al. 2014; Yu and Chen 2012). Recently, Sum, Leung, and Ho theoretically investigated the convergence of noise injection-based online gradient methods (NIBOGM) in Sum et al. (2012a, b) and Ho et al. (2010), where the injected noises are independent zero-mean Gaussian random variables. For the stability of noise-induced neural systems and the effects of noise on neural networks, we refer to Wu et al. (2013), Zheng et al. (2014) and Guo (2011). Besides independent and identically distributed (i.i.d.) noise, chaotic noise is also widely used and has been shown to be effective (Ahmed et al. 2011; Uwate et al. 2004) when injected into the gradient training process of feedforward neural networks. Chaos injection enhances the resemblance to biological systems (Li and Nara 2008; Yoshida et al. 2010), and the dynamic variation that it introduces facilitates escape from local minima and thus improves convergence (Ahmed et al. 2011). However, since a chaotic sequence is not i.i.d., the existing convergence results and the corresponding analysis methods for noise injection-based online gradient methods cannot be directly applied to the chaos injection-based gradient method (CIBGM).
Motivated by the above issues, in this paper we theoretically analyze the convergence of CIBGM, covering both batch learning and online learning. The weak convergence and strong convergence of the algorithms will be established. The online learning considered in this paper is the case in which the training samples are fed into the network in a fixed sequence, which is also called cyclic learning in the literature (Heskes and Wiegerinck 1996). Thus, compared with the convergence results for NIBOGM (Sum et al. 2012a, b; Ho et al. 2010), where the training samples are fed into the network in a completely random sequence, our results are of a deterministic nature.
The remainder of this paper is organized as follows. The network structure and CIBGM are described in the "Network structure and chaos injection-based gradient method" section. The "Convergence results" section presents our assumptions and main theorems. The detailed proofs of the theorems are given in the "Proofs" section. In the "Simulation results" section, we use a simulation example to illustrate the theoretical analysis. We conclude the paper in the "Conclusion" section.
Network structure and chaos injection-based gradient method
In this section, we first introduce the network structure, which is a typical three-layer neural network. Then we describe the chaos injection-based batch gradient method and the chaos injection-based online gradient method.
Network structure
Consider a three-layer network consisting of p input nodes, q hidden nodes, and 1 output node. Let be the weight vector between all the hidden units and the output unit, and be the weight vector between all the input units and the hidden unit . To simplify the presentation, we write all the weight parameters in a compact form, i.e., and we define a matrix .
Given activation functions for the hidden layer and output layer, respectively, we define a vector function for . For an input , the output vector of the hidden layer can be written as and the final output of the network can be written as
1 |
where represents the inner product between the two vectors and .
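To make the forward computation in (1) concrete, the following Python sketch evaluates the hidden-layer output and the network output for a single input. The names p, q, V, w0 and the use of the logistic sigmoid for both activation functions are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid, a typical choice for the hidden and output activations."""
    return 1.0 / (1.0 + np.exp(-t))

def forward(w0, V, x):
    """Forward pass of a p-q-1 network (names assumed for illustration).

    w0 : (q,)   weights from the q hidden units to the output unit
    V  : (q, p) matrix whose i-th row is the weight vector of hidden unit i
    x  : (p,)   input vector
    """
    hidden = sigmoid(V @ x)             # output vector of the hidden layer
    return sigmoid(np.dot(w0, hidden))  # final network output, as in (1)

# Example: a 3-4-1 network evaluated on a random input.
rng = np.random.default_rng(0)
p, q = 3, 4
w0, V, x = rng.normal(size=q), rng.normal(size=(q, p)), rng.normal(size=p)
print(forward(w0, V, x))
```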
Chaos injection-based batch gradient method
Suppose that is a given set of training samples. The aim of network training is to find appropriate network weights that minimize the error function
2 |
where .
The gradient of the error function is given by
3 |
with
4a |
4b |
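As an illustration of the error function (2) and the gradients in (3), (4a) and (4b), the sketch below evaluates the squared error over the training set and its gradient with respect to the output and hidden weights, using the standard backpropagation formulas for a p-q-1 sigmoid network. The names V, w0, X, O are the same illustrative assumptions as in the previous sketch, so this should be read as a sketch rather than a transcription of the paper's formulas.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def error_and_gradient(w0, V, X, O):
    """Squared error over the training set and its gradient (illustrative sketch).

    X : (J, p) input samples, O : (J,) target outputs.
    Returns E(W), dE/dw0 (shape (q,)) and dE/dV (shape (q, p)).
    """
    H = sigmoid(X @ V.T)               # hidden outputs for all samples, (J, q)
    y = sigmoid(H @ w0)                # network outputs, (J,)
    r = y - O                          # residuals
    E = 0.5 * np.sum(r ** 2)           # error of the squared-error form (2)
    delta_out = r * y * (1.0 - y)      # uses g'(s) = g(s)(1 - g(s)) for the sigmoid
    grad_w0 = H.T @ delta_out                            # gradient w.r.t. output weights
    delta_hid = np.outer(delta_out, w0) * H * (1.0 - H)  # back-propagated hidden deltas
    grad_V = delta_hid.T @ X                             # gradient w.r.t. hidden weights
    return E, grad_w0, grad_V
```

A finite-difference check of grad_w0 and grad_V against E is a quick way to confirm that the formulas are consistent.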
Starting from an arbitrary initial value , the chaos injection-based batch gradient method updates the weights iteratively by
5 |
where is the learning rate, A is a positive parameter, , and
6 |
is the logistic map (Verhulst equation), which is highly sensitive to both the initial value and the parameter α. For specific choices of the initial value (e.g., ) and of α (e.g., ), the logistic map produces a chaotic time series.
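As an illustration of (5) and (6), the sketch below generates a chaotic sequence with the logistic map and adds it, scaled by a decaying factor, to a plain gradient step. The scaling A * eta_n, the values z0 = 0.1 and alpha = 4.0, and the schedule eta_n = eta0/(n+1) are assumptions made for this sketch; the exact magnitude factor and parameter values in (5) are those defined in the paper.

```python
import numpy as np

def logistic_map(z0=0.1, alpha=4.0, n_steps=1000):
    """Chaotic sequence from the logistic (Verhulst) map z_{n+1} = alpha*z_n*(1 - z_n)."""
    z = np.empty(n_steps)
    z[0] = z0
    for n in range(n_steps - 1):
        z[n + 1] = alpha * z[n] * (1.0 - z[n])
    return z

def cibgm_batch(grad, w_init, A=0.01, n_iter=500, eta0=0.5):
    """Batch gradient descent with a decaying chaotic perturbation (illustrative sketch).

    grad : callable returning the gradient of the error function at w.
    The chaos term is scaled by A * eta_n here purely for illustration; the
    exact factor appearing in (5) is not reproduced in this sketch.
    """
    z = logistic_map(n_steps=n_iter)
    w = w_init.astype(float).copy()
    for n in range(n_iter):
        eta = eta0 / (n + 1)                    # decreasing learning rate
        w = w - eta * grad(w) + A * eta * z[n]  # gradient step plus injected chaos
    return w

# Example on a simple quadratic error with gradient 2w: the iterates approach 0.
print(cibgm_batch(lambda w: 2.0 * w, np.array([3.0, -2.0])))
```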
Chaos injection-based online gradient method
The batch gradient method given in (5) updates the weights only after all the training samples have been fed into the network. This is inefficient when the training set contains a large number of samples; in this case, the online gradient method is preferred.
We consider the case in which the training samples are supplied to the network in a fixed order during training. Starting from an arbitrary initial value , the chaos injection-based online gradient method updates the weights iteratively by
7a |
7b |
with
8a |
8b |
for , where is the learning rate, whose value may be changed after each cycle of the training procedure, A is a positive parameter, and is defined by (6).
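A corresponding sketch of the online (cyclic) variant (7)–(8) is given below: the J samples are presented in a fixed order, the learning rate changes only between cycles, and the chaotic sequence is advanced at every weight update. As before, the scaling A * eta and the logistic-map parameters are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def cibgm_online(grad_j, w_init, J, A=0.01, n_cycles=200, eta0=0.5, alpha=4.0):
    """Online (cyclic) gradient descent with chaos injection (illustrative sketch).

    grad_j(w, j) must return the gradient of the error on the j-th training sample.
    The learning rate is held fixed within a cycle and decreased between cycles;
    the chaos term A * eta * z is an assumed form, not a transcription of (8).
    """
    w = w_init.astype(float).copy()
    z = 0.1                                   # initial value of the logistic map
    for m in range(n_cycles):
        eta = eta0 / (m + 1)                  # changed only after each full cycle
        for j in range(J):
            z = alpha * z * (1.0 - z)         # advance the chaotic sequence (6)
            w = w - eta * grad_j(w, j) + A * eta * z
    return w
```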
Remark 1
During the training process, the injected chaos should be large at the beginning, in order to help the gradient method avoid being trapped in a local minimum, and should then become smaller and smaller as the iterations (cycles) proceed, in order to ensure convergence of the algorithm to a minimum. Thus, in (5) and (8), we use to control the magnitude of the injected chaos. Here p is used to magnify the effect of the chaos in the early training stage, while [as suggested by Assumption (A2) in the next section, ] serves to diminish the effect of the injected chaos on the convergence of the algorithm as the iterations (cycles) increase.
Convergence results
In this section, we give the convergence results of the CIBGM, covering both the batch learning case (5) and the online learning case (7).
Let be the stationary point set of the error function , and let be the projection of onto the s-th coordinate axis, for . The following assumptions are needed for our convergence results.
(A1) and are Lipschitz continuous on any bounded closed interval;
(A2) ;
(A3) generated by (5) is bounded over ;
(A3′) (or simply denoted by with ) generated by (7) is bounded over ;
(A4) The set does not contain any interior point for every .
Remark 2
Assumption (A1) is satisfied by most commonly used activation functions, such as sigmoid functions and linear functions. Assumption (A2) is a traditional condition for the convergence analysis of the online gradient method (Sum et al. 2012a, b; Ho et al. 2010). Here we also use this condition in the convergence analysis of the chaos injection-based batch gradient method, in order to control the impact of the chaos on the convergence of the algorithm. Assumption (A3) [or Assumption (A3′)] is a commonly used condition in the convergence analysis of gradient methods (Wu et al. 2011). In fact, this condition can easily be satisfied by adding a penalty term to the error function (Zhang et al. 2009, 2012). Assumption (A4) is needed to establish the strong convergence.
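In the related convergence literature, the learning-rate condition corresponding to (A2) usually requires the rates to be positive with divergent sum and convergent sum of squares; under that reading (an assumption here, not a quotation of the paper), a harmonic-type schedule suffices:

```latex
% Assumed reading of (A2); a harmonic-type learning-rate schedule satisfies it.
\eta_n = \frac{c}{n}\ (c > 0)
\;\Longrightarrow\;
\sum_{n=1}^{\infty} \eta_n = \infty,
\qquad
\sum_{n=1}^{\infty} \eta_n^{2} = \frac{c^{2}\pi^{2}}{6} < \infty .
```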
Now we present our convergence results, where we use “” to denote the Euclidean norm of a vector.
Theorem 1
Suppose that the error function is given by (2) and that the weight sequence is generated by the algorithm (5) for any initial value . Assume the conditions (A1)–(A3) are valid. Then there hold the weak convergence results
9 |
10 |
Moreover, if Assumption (A4) is valid, then there holds the strong convergence, i.e., there exists a point such that
11 |
Theorem 2
Suppose that the conditions (A1), (A2) and (A3′) are valid. Then, starting from an arbitrary initial value , the weight sequence defined by (7) satisfies the following weak convergence
12 |
13 |
Moreover, if Assumption (A4) is valid, then there holds the strong convergence: there exists such that
14 |
Proofs
In this section, we first list several lemmas from the literature; we then prove Theorems 1 and 2 in the "Proof of Theorem 1" and "Proof of Theorem 2" subsections, respectively.
Lemma 1
(See Lemma 1 in Bertsekas and Tsitsiklis 2000) Let and be three sequences such that is nonnegative for all n. Assume that
and that the series is convergent. Then either or else converges to a finite value and .
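The cited result can be stated as follows, using symbols Y_n, W_n, Z_n that are chosen here for illustration and need not match the paper's notation.

```latex
% Reconstructed statement of Lemma 1 of Bertsekas and Tsitsiklis (2000);
% the symbols Y_n, W_n, Z_n are chosen here and may differ from the paper's.
\begin{aligned}
&\text{Let } \{Y_n\},\ \{W_n\},\ \{Z_n\} \text{ be sequences with } W_n \ge 0 \text{ for all } n, \text{ and assume}\\
&\qquad Y_{n+1} \le Y_n - W_n + Z_n \quad (n \ge 1), \qquad \sum_{n=1}^{\infty} Z_n < \infty.\\
&\text{Then either } Y_n \to -\infty, \text{ or } \{Y_n\} \text{ converges to a finite value and } \sum_{n=1}^{\infty} W_n < \infty.
\end{aligned}
```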
Lemma 2
(See Lemma 4.2 in Wu et al. 2011) Suppose that the learning rate satisfies Assumption (A2) and that the sequence satisfies and for some positive constants β and μ. Then there holds .
Lemma 3
(See Lemma 5.3 in Wang et al. 2011) Let be continuous on a bounded closed region, and . The projection of on each coordinate axis does not contain any interior point. Let the sequence satisfy:
- (i) ;
- (ii) .
Then, there exists a unique such that
Proof of Theorem 1
Lemma 4
Suppose that conditions (A1) and (A3) are valid. Then satisfies a Lipschitz condition; that is, there exists a positive constant L such that
15 |
In particular, for , there holds
16 |
Proof
The proof of this lemma is similar to that of Lemma 2 in Zhang et al. (2012) and is thus omitted.
Proof of (9)
Given that and , it is easy to see
17 |
By the differential mean value theorem, there exists a constant , such that
18 |
where the last inequality is due to (16). Considering (5) and (18), we have
19 |
Using (17) and the inequality , the second term on the right-hand side of (19) can be estimated as
20 |
Using the inequality , the third term on the right-hand side of (19) can be estimated as
21 |
22 |
By Assumptions (A1) and (A3), there is a constant such that for all
23 |
Thus, there exists a positive constant , such that
24 |
Combining , and according to Lemma 1, we can conclude that there exists a constant such that
25 |
and
26 |
This completes the proof of (9).
Proof of (10)
Using (5), (15) and (23), we have
27 |
where . Thus, by (26), (27), and Lemma 2, we conclude
Proof of (11)
Obviously, is a continuous function under Assumption (A1). Using (5), (17) and (23), we have
28 |
Furthermore, Assumption (A4) is valid. Thus, applying Lemma 3, there exists a unique such that .
Proof of Theorem 2
Let the sequence be generated by (7). For brevity, we introduce the following notations:
29a |
29b |
29c |
29d |
for
Lemma 5
(See Lemma 4.1 in Wu et al. 2011) Let be a function defined on a bounded closed interval such that is Lipschitz continuous with Lipschitz constant. Then, is differentiable almost everywhere in and
30 |
Moreover, there exists a constant such that
31 |
Lemma 6
Suppose the conditions (A1) and (A3′) are valid, and the sequence is generated by (7). Then there are such that
32 |
33 |
34 |
35 |
where
Proof
According to Assumption (A3′), we can define a constant . Then we have
36 |
Accordingly, there exist two positive constants and such that
37 |
Thus we have
38 |
and
39 |
Then, there is a positive constant such that
40 |
Using (8), (17), (29c), (38) and (40), we have
41 |
Similarly, for , we have
Let ; then we have for .
Using (29c), (29d), (33) and the mean value theorem, we have
42 |
where and .
As Assumptions (A1) and (A3′) are valid, it is easy to see that there exists a constant such that for any and , there holds
43 |
Combining (8), (29), (32)–(34) and (43), we have
44 |
where . Similarly, we can show the existence of a constant such that
45 |
Let ; then we have for .
Lemma 7
Let the sequence be generated by (7). Under Assumptions (A1) and , there holds
46 |
where is a positive constant.
Proof
By virtue of Assumption (A1) and Lemma 5, we know that is integrable almost everywhere on . Then, using Taylor’s mean value theorem we arrive at
47 |
By virtue of (8), (29), (47) and Lemma 5, there is a constant such that
48 |
where
49 |
and
50 |
Summing (48) for from 1 to J, and noting (2), (4a, 4b), (29a), (49) and (50), we have
51 |
where
52 |
Considering (17) and (32)–(40), it is easy to see that there exists a constant such that
53 |
Thus, the desired estimate is deduced by combining (51) and (53).
Now we are ready to prove the convergence theorem.
Proof of (12)
According to Lemmas 1 and 7, there exists a constant such that or . Recalling that , we obtain (12).
Proof of (13)
Using Lemmas 1 and 7, we have that
54 |
Similarly to Lemma 4, there exists a Lipschitz constant such that
55 |
where is the weight sequence generated by (7) and is a positive integer.
Using (55) and (33), for , and , we have
56 |
Combining (54), (56) and Lemma 2, we have
57 |
Since
58 |
we have for .
Proof of (14)
The proof is almost the same as that of (11) and is thus omitted here.
Simulation results
In this section, we illustrate the convergence behavior of the CIBGM using the sonar signal classification problem.
Sonar signal classification is one of the benchmark problems in the neural network field. Our task is to train a network to discriminate between sonar returns bounced off a metal cylinder and those bounced off a roughly cylindrical rock. We obtained the data set from the UCI machine learning repository (http://archive.ics.uci.edu/ml/); it comprises 208 samples, each with 60 components. In this simulation, we randomly choose 164 samples for training and 44 samples for testing.
The network used for training has the structure 60–25–2. The activation function for both the hidden and output layers is set to be in MATLAB, a commonly used sigmoid function. We choose the initial weights to be random numbers in the interval .
The simulation is carried out with the parameters and . We set the learning rate if and if , which satisfies Assumption . Here 164 is the number of training samples. The maximum number of training iterations (cycles) is 2,000. The learning curves of the chaos injection-based batch gradient method are depicted in Fig. 1, which shows that the training error tends to a constant and the gradient of the error function tends to zero. This supports our theoretical analysis. The learning curves of the chaos injection-based online gradient method are almost the same as those in Fig. 1, the only difference being that the x-axis label "Number of iterations" should be replaced by "Number of cycles".
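For reference, the following short sketch reproduces the experimental configuration described above (data split, network size, number of cycles). The weight-initialization interval, the random seed, and the learning-rate schedule shown here are assumptions made for this sketch, not the authors' exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)             # seed chosen arbitrarily for this sketch

# 208 sonar samples with 60 components each; 164 for training, 44 for testing.
n_samples, n_train = 208, 164
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# 60-25-2 network with small random initial weights (interval assumed).
p, q, n_out = 60, 25, 2
V  = rng.uniform(-0.5, 0.5, size=(q, p))
W0 = rng.uniform(-0.5, 0.5, size=(n_out, q))

max_cycles = 2000
eta = 0.5 / (1.0 + np.arange(max_cycles))  # a decreasing schedule of the kind discussed for (A2)
```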
In order to show the effectiveness of the chaos injection method, we compare the test error curves of CIBGM and the standard GM (with no chaos injected) in Fig. 2. We can see that the test error of CIBGM converges faster and tends to a smaller value than that of the standard GM.
We mention that, although there is no restriction on the parameter A in Theorems 1 and 2, the choice of A is still of great importance. If A is too small, CIBGM reduces to the standard GM. On the other hand, if A is too large, the chaos term dominates the update of the CIBGM, especially in the early stage of the training procedure. As a result, the algorithm converges very slowly and the performance may even be unacceptable. Figure 3 shows the results of the chaos injection-based batch gradient method for . We find that the algorithm still converges; however, the performance is much worse than that for .
Conclusion
This paper investigates the chaos injection-based gradient method (CIBGM) for feedforward neural networks. Two learning modes, batch learning and online learning, are considered. Under the conditions that the derivatives of the activation functions are Lipschitz continuous on any bounded closed interval and that the learning rate is positive and satisfies and , we derive the weak convergence of the CIBGM, that is, the gradient of the error function tends to zero and the error function tends to a constant. The strong convergence is also derived under the additional assumption that the set does not contain any interior point. The theoretical findings and the effectiveness of the CIBGM are illustrated by a simulation example. Future research includes the study of the convergence of the chaos injection-based stochastic gradient method.
Acknowledgments
This work is partly supported by the National Natural Science Foundation of China (Nos. 61101228, 61301202, 61402071), the China Postdoctoral Science Foundation (No. 2012M520623), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20122304120028).
References
- Ahmed SU, Shahjahan M, Murase K. Injecting chaos in feedforward neural networks. Neural Process Lett. 2011;34:87–100. doi: 10.1007/s11063-011-9185-x.
- Behera L, Kumar S, Patnaik A. On adaptive learning rate that guarantees convergence in feedforward networks. IEEE Trans Neural Netw. 2006;17(5):1116–1125. doi: 10.1109/TNN.2006.878121.
- Bertsekas DP, Tsitsiklis JN. Gradient convergence in gradient methods with errors. SIAM J Optim. 2000;10(3):627–642. doi: 10.1137/S1052623497331063.
- Charalambous C. Conjugate gradient algorithm for efficient training of artificial neural networks. Inst Electr Eng Proc. 1992;139:301–310.
- Fan QW, Zurada JM, Wu W. Convergence of online gradient method for feedforward neural networks with smoothing L1/2 regularization penalty. Neurocomputing. 2014;131:208–216. doi: 10.1016/j.neucom.2013.10.023.
- Fine TL, Mukherjee S. Parameter convergence and learning curves for neural networks. Neural Comput. 1999;11:747–769. doi: 10.1162/089976699300016647.
- Guo DQ. Inhibition of rhythmic spiking by colored noise in neural systems. Cogn Neurodyn. 2011;5(3):293–300. doi: 10.1007/s11571-011-9160-2.
- Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw. 1994;5(6):989–993. doi: 10.1109/72.329697.
- Haykin S. Neural networks and learning machines. New Jersey: Prentice Hall; 2008.
- Heskes T, Wiegerinck W. A theoretical comparison of batch-mode, on-line, cyclic, and almost-cyclic learning. IEEE Trans Neural Netw. 1996;7(4):919–925. doi: 10.1109/72.508935.
- Ho KI, Leung CS, Sum JP. Convergence and objective functions of some fault/noise-injection-based online learning algorithms for RBF networks. IEEE Trans Neural Netw. 2010;21(6):938–947. doi: 10.1109/TNN.2010.2046179.
- Iiguni Y, Sakai H, Tokumaru H. A real-time learning algorithm for a multilayered neural network based on extended Kalman filter. IEEE Trans Signal Process. 1992;40(4):959–966. doi: 10.1109/78.127966.
- Karnin ED. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans Neural Netw. 1990;1:239–242. doi: 10.1109/72.80236.
- Li Y, Nara S. Novel tracking function of moving target using chaotic dynamics in a recurrent neural network model. Cogn Neurodyn. 2008;2(1):39–48. doi: 10.1007/s11571-007-9029-6.
- Osowski S, Bojarczak P, Stodolski M. Fast second order learning algorithm for feedforward multilayer neural network and its applications. Neural Netw. 1996;9(9):1583–1596. doi: 10.1016/S0893-6080(96)00029-9.
- Shao HM, Zheng GF. Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing. 2011;74:765–770. doi: 10.1016/j.neucom.2010.10.005.
- Sum JP, Leung CS, Ho KI. Convergence analyses on on-line weight noise injection-based training algorithms for MLPs. IEEE Trans Neural Netw Learn Syst. 2012;23(11):1827–1840. doi: 10.1109/TNNLS.2012.2210243.
- Sum JP, Leung CS, Ho KI. On-line node fault injection training algorithm for MLP networks: objective function and convergence analysis. IEEE Trans Neural Netw Learn Syst. 2012;23(2):211–222. doi: 10.1109/TNNLS.2011.2178477.
- Uwate Y, Nishio Y, Ueta T, Kawabe T, Ikeguchi T. Performance of chaos and burst noises injected to the Hopfield NN for quadratic assignment problems. IEICE Trans Fundam. 2004;E87–A(4):937–943.
- Wang J, Wu W, Zurada JM. Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing. 2011;74:2368–2376. doi: 10.1016/j.neucom.2011.03.016.
- Wu W, Feng G, Li Z, Xu Y. Deterministic convergence of an online gradient method for BP neural networks. IEEE Trans Neural Netw. 2005;16:533–540. doi: 10.1109/TNN.2005.844903.
- Wu W, Wang J, Chen MS, Li ZX. Convergence analysis on online gradient method for BP neural networks. Neural Netw. 2011;24(1):91–98. doi: 10.1016/j.neunet.2010.09.007.
- Wu Y, Li JJ, Liu SB, Pang JZ, Du MM, Lin P. Noise-induced spatiotemporal patterns in Hodgkin–Huxley neuronal network. Cogn Neurodyn. 2013;7(5):431–440. doi: 10.1007/s11571-013-9245-1.
- Yoshida H, Kurata S, Li Y, Nara S. Chaotic neural network applied to two-dimensional motion control. Cogn Neurodyn. 2010;4(1):69–80. doi: 10.1007/s11571-009-9101-5.
- Yu X, Chen QF. Convergence of gradient method with penalty for Ridge Polynomial neural network. Neurocomputing. 2012;97:405–409. doi: 10.1016/j.neucom.2012.05.022.
- Zhang NM, Wu W, Zheng GF. Convergence of gradient method with momentum for two-layer feedforward neural networks. IEEE Trans Neural Netw. 2006;17(2):522–525. doi: 10.1109/TNN.2005.863460.
- Zhang C, Wu W, Xiong Y. Convergence analysis of batch gradient algorithm for three classes of sigma–pi neural networks. Neural Process Lett. 2007;26(1):177–180.
- Zhang C, Wu W, Chen XH, Xiong Y. Convergence of BP algorithm for product unit neural networks with exponential weights. Neurocomputing. 2008;72:513–520. doi: 10.1016/j.neucom.2007.12.004.
- Zhang HS, Wu W, Liu F, Yao MC. Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw. 2009;20(6):1050–1054. doi: 10.1109/TNN.2009.2020848.
- Zhang HS, Wu W, Yao MC. Boundedness and convergence of batch back-propagation algorithm with penalty for feedforward neural networks. Neurocomputing. 2012;89:141–146. doi: 10.1016/j.neucom.2012.02.029.
- Zhang HS, Liu XD, Xu DP, Zhang Y. Convergence analysis of fully complex backpropagation algorithm based on Wirtinger calculus. Cogn Neurodyn. 2014;8(3):261–266. doi: 10.1007/s11571-013-9276-7.
- Zheng YH, Wang QY, Danca MF. Noise induced complexity: patterns and collective phenomena in a small-world neuronal network. Cogn Neurodyn. 2014;8(2):143–149. doi: 10.1007/s11571-013-9257-x.