Cognitive Neurodynamics. 2014 Jan 3;8(3):261–266. doi: 10.1007/s11571-013-9276-7

Convergence analysis of fully complex backpropagation algorithm based on Wirtinger calculus

Huisheng Zhang 1,2, Xiaodong Liu 2, Dongpo Xu 3, Ying Zhang 1
PMCID: PMC4012068  PMID: 24808934

Abstract

This paper considers the fully complex backpropagation algorithm (FCBPA) for training fully complex-valued neural networks. We prove both the weak convergence and the strong convergence of FCBPA under mild conditions. The monotone decrease of the error function during the training process is also established. The derivation and analysis of the algorithm are carried out within the framework of Wirtinger calculus, which greatly reduces the description complexity. The theoretical results are substantiated by a simulation example.

Keywords: Complex-valued neural networks, Fully complex backpropagation algorithm, Wirtinger calculus, Convergence

Introduction

The theoretical study and practical implementation of complex-valued neural networks (CVNNs) have attracted considerable attention in signal processing, pattern recognition, and medical information processing (Fink et al. 2014; Hirose 2012; Nitta 2013). Based on different choices of the activation function, there are two main CVNN models: the split CVNN (Nitta 1997) and the fully CVNN (Kim and Adali 2003). The split CVNN uses a pair of real-valued functions to separately process the real part and the imaginary part of the neuron's input signal. This strategy effectively avoids the singularity problem during training. In contrast, the activation functions of the fully CVNN are fully complex-valued, which helps the network make full use of the phase information and thus achieve better performance in some applications. As one of the most popular training methods for neural networks, the backpropagation algorithm (BPA) has been extended from the real domain to the complex domain in order to train CVNNs. Accordingly, there are two types of complex BPA: one is the split-complex BPA (SCBPA) (Nitta 1997) for the split CVNN, and the other is the fully complex BPA (FCBPA) (Li and Adali 2008) for the fully CVNN.

Convergence is a precondition for any iterative algorithm to be used in real applications (Wei et al. 2013; Osborn 2010). The convergence of the BPA has been extensively studied in the literature (Wu et al. 2005, 2011; Zhang et al. 2007, 2008, 2009; Shao and Zheng 2011), where the boundedness and differentiability of the activation function are usually two necessary conditions for the convergence analysis. However, by Liouville's theorem (an entire and bounded function in the complex domain is a constant), a complex activation function cannot be both bounded and analytic. This conflict between boundedness and differentiability makes the theoretical convergence analysis for the complex BPA more difficult than that for the real-valued BPA. Fortunately, since the activation functions of the split CVNN can be split into two bounded and differentiable real-valued functions, the convergence analysis of SCBPA can be conducted in the real domain; for the corresponding convergence results, we refer to (Nitta 1997; Zhang et al. 2013; Xu et al. 2010). However, although FCBPA has been widely used in many applications and has been experimentally shown to be convergent for certain activation functions (Kim and Adali 2003), its theoretical convergence analysis remains challenging.

Besides the challenge posed by Liouville's theorem, another difficulty for the theoretical convergence analysis of FCBPA is that the traditional mean value theorem, which is vital for the convergence analysis of BPA, does not hold in the complex domain. (For example, take $f(z)=e^{z}$ and $z_2=z_1+2\pi i$. We have $f(z_2)-f(z_1)=0$, but $(z_2-z_1)f'(w)=2\pi i\,e^{w}\neq 0$ for all $w$.) By expanding the analytic function in a Taylor series and omitting the higher-order terms, some local stability results for complex ICA have been obtained (Adali et al. 2008). Under the assumption that the activation function is a contraction, the convergence of complex nonlinear adaptive filters has been proved (Mandic and Goh 2009). However, to the best of our knowledge, theoretical convergence results for FCBPA have not yet been established. This is the main concern of this paper. Specifically, we make the following contributions:

  • By introducing a mean value theorem for holomorphic functions (Mcleod 1965), we prove both the weak convergence and the strong convergence of FCBPA.

  • Instead of dropping the higher-order terms of the Taylor series (Adali et al. 2008), we use the mean value theorem to obtain an accurate estimate of the change of the error function between two consecutive iterations. As a result, our results are global in nature: they are valid for arbitrarily chosen initial weights.

  • The restrictive condition that the activation function is a contraction is not needed in our analysis.

  • The derivation and analysis of the algorithm are under the framework of Wirtinger calculus, which greatly reduces the description complexity.

The remainder of this paper is organized as follows. The network structure and the derivation of the FCBPA based on Wirtinger calculus are described in the next section. The “Main results” section presents the main convergence theorem of the paper. The detailed proof of the theorem is given in the “Proofs” section. In the “Simulation result” section we use a simulation example to support our theoretical results. The paper is concluded in the “Conclusion” section.

Network structure and FCBPA based on Wirtinger calculus

We consider a single-hidden-layer feedforward network consisting of $p$ input nodes, $q$ hidden nodes, and 1 output node. Let $w_0=(w_{01},w_{02},\ldots,w_{0q})^{T}\in\mathbb{C}^{q}$ be the weight vector between all the hidden units and the output unit, and $w_l=(w_{l1},w_{l2},\ldots,w_{lp})^{T}\in\mathbb{C}^{p}$ be the weight vector between all the input units and the hidden unit $l$ ($l=1,2,\ldots,q$). To simplify the presentation, we write all the weight parameters in a compact form, i.e., $w=(w_0^{T},w_1^{T},\ldots,w_q^{T})^{T}\in\mathbb{C}^{q+pq}$, and we define a matrix $V=(w_1,w_2,\ldots,w_q)^{T}\in\mathbb{C}^{q\times p}$.

Given activation functions $f,g:\mathbb{C}\to\mathbb{C}$ for the hidden layer and the output layer, respectively, we define a vector function $F(x)=(f(x_1),f(x_2),\ldots,f(x_q))^{T}$ for $x=(x_1,x_2,\ldots,x_q)^{T}\in\mathbb{C}^{q}$. For an input $z\in\mathbb{C}^{p}$, the output vector of the hidden layer can be written as $F(Vz)$, and the final output of the network can be written as

$$
y=g\big(w_0\cdot F(Vz)\big), \tag{1}
$$

where $w_0\cdot F(Vz)$ denotes the inner product of the two vectors $w_0$ and $F(Vz)$.
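For readers who wish to experiment, the forward pass (1) can be sketched in a few lines of Python/NumPy. This is a minimal illustration of our own, not code from the paper; the function name is ours, and we assume the dot product in (1) is taken without conjugation, which is consistent with Eqs. (4)-(6) below.

```python
import numpy as np

def forward(w0, V, z, f=np.sin, g=np.sin):
    """Output y = g(w0 . F(V z)) of the single-hidden-layer fully complex network, Eq. (1).

    w0 : (q,) complex hidden-to-output weight vector
    V  : (q, p) complex matrix whose rows are w_1, ..., w_q
    z  : (p,) complex input
    f, g : fully complex activation functions (sin is used in the simulation section)
    """
    hidden = f(V @ z)               # F(Vz): hidden-layer outputs
    return g(np.dot(w0, hidden))    # np.dot does not conjugate, matching w0 . F(Vz)
```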

Suppose that $\{z_k,d_k\}_{k=1}^{K}\subset\mathbb{C}^{p}\times\mathbb{C}$ is a given set of training samples, where $z_k$ is the input and $d_k$ is the desired output. The aim of the network training is to find appropriate network weights $w$ that minimize the error function

$$
E(w)=\sum_{k=1}^{K}(d_k-y_k)\big(\overline{d_k}-\overline{y_k}\big), \tag{2}
$$

where

$$
y_k=g\big(w_0\cdot F(Vz_k)\big) \tag{3}
$$

and the overbar $\overline{\,\cdot\,}$ denotes complex conjugation.
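Note that (2) is simply the squared prediction error written in complex form, $E(w)=\sum_{k=1}^{K}|d_k-y_k|^{2}$. In particular, $E(w)$ is real-valued and nonnegative, a fact used later in the proofs of (10) and (11).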

As noted by Adali et al. (2008), any function $h(z)$ that is analytic in a bounded zone $|z|<R$ and has a Taylor series expansion with all real coefficients in $|z|<R$ satisfies the property $\overline{h(z)}=h(\bar z)$: indeed, if $h(z)=\sum_n a_n z^{n}$ with real $a_n$, then $\overline{h(z)}=\sum_n a_n\bar z^{\,n}=h(\bar z)$. Examples of such functions include polynomials and most trigonometric functions and their hyperbolic counterparts, which are qualified activation functions for CVNNs (Kim and Adali 2003). We therefore suppose that both activation functions $f(\cdot)$ and $g(\cdot)$ satisfy $\overline{f(z)}=f(\bar z)$ and $\overline{g(z)}=g(\bar z)$. Consequently,

$$
\overline{y_k}=g\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big). \tag{4}
$$

$E(w)$ can be viewed as a function of the complex variable vector $w$ and its conjugate $\bar w$. According to Wirtinger calculus (Brandwood 1983; Bos 1994), we can define two gradient vectors: $\nabla_{w}E$ (taking partial derivatives with respect to $w$ while treating $\bar w$ as a constant vector in $E$) and $\nabla_{\bar w}E$ (taking partial derivatives with respect to $\bar w$ while treating $w$ as a constant vector). The gradient $\nabla_{\bar w}E$ then defines the direction of the maximum rate of change of $E(w)$ with respect to $w$. Since the network output $y_k$ does not explicitly contain the variable vector $\bar w$, we have $\nabla_{\bar w}y_k=0$. Thus, by the chain rule of Wirtinger calculus, we have

$$
\frac{\partial E(w)}{\partial \overline{w_0}}=\sum_{k=1}^{K}(y_k-d_k)\,g'\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big)\,F(\overline{V}\,\overline{z_k}), \tag{5}
$$
$$
\frac{\partial E(w)}{\partial \overline{w_l}}=\sum_{k=1}^{K}(y_k-d_k)\,g'\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big)\,\overline{w_{0l}}\,f'\big(\overline{w_l}\cdot\overline{z_k}\big)\,\overline{z_k},\qquad l=1,2,\ldots,q. \tag{6}
$$

Obviously,

$$
\nabla_{\bar w}E(w)=\left(\left(\frac{\partial E(w)}{\partial \overline{w_0}}\right)^{T},\left(\frac{\partial E(w)}{\partial \overline{w_1}}\right)^{T},\ldots,\left(\frac{\partial E(w)}{\partial \overline{w_q}}\right)^{T}\right)^{T}. \tag{7}
$$

Starting from an arbitrary initial value $w^{0}$, the FCBPA based on Wirtinger calculus updates the weights $\{w^{n}\}$ iteratively by

$$
w^{n+1}=w^{n}-\eta\,\nabla_{\bar w}E(w^{n}), \tag{8}
$$

where $\eta>0$ is the learning rate.
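A minimal NumPy sketch of one batch update (8), with the conjugate gradients assembled from (5) and (6), may help make the formulas concrete. It is an illustration under our reading of the notation (no conjugation in the dot products, and sin/cos standing in for f, f', g, g'); the helper name fcbpa_step is ours.

```python
import numpy as np

def fcbpa_step(w0, V, zs, ds, eta, f=np.sin, df=np.cos, g=np.sin, dg=np.cos):
    """One FCBPA iteration: gradients (5)-(6) followed by the update (8).

    zs : (K, p) complex training inputs, ds : (K,) complex targets.
    f, g are assumed to satisfy conj(f(z)) = f(conj(z)); sin does, and cos is its derivative.
    """
    grad_w0 = np.zeros_like(w0)     # dE/d(conj(w0)), Eq. (5)
    grad_V = np.zeros_like(V)       # row l holds dE/d(conj(w_l)), Eq. (6)
    for z, d in zip(zs, ds):
        net = V @ z                           # hidden pre-activations w_l . z
        h = f(net)                            # F(Vz)
        y = g(np.dot(w0, h))                  # network output, Eq. (1)
        s_bar = np.conj(np.dot(w0, h))        # conj(w0) . F(conj(V) conj(z))
        e = y - d
        grad_w0 += e * dg(s_bar) * np.conj(h)
        grad_V += np.outer(e * dg(s_bar) * np.conj(w0) * df(np.conj(net)),
                           np.conj(z))
    # Gradient step (8): w^{n+1} = w^n - eta * grad_{conj(w)} E(w^n)
    return w0 - eta * grad_w0, V - eta * grad_V
```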

Main results

The following assumptions are needed in our convergence analysis.

  • (A1) There exists a constant $c_1>0$ such that $\|w_l^{n}\|\le c_1$ for all $l=0,1,\ldots,q$ and $n=0,1,\ldots$;

  • (A2) The functions $f(z)$ and $g(z)$ are analytic in a bounded zone $|z|<R$ and have Taylor series expansions with all real coefficients in $|z|<R$, where $R>\max\{c_2,c_3\}$ ($c_2$ and $c_3$ are defined in (17) below).

  • (A3) The set $\Phi_1=\{w:\nabla_{\bar w}E(w)=0\}$ contains only finitely many points.

Remark 1

Assumption (A1) is the usual condition in the literature for the convergence analysis of gradient methods, both for real-valued neural networks (Zhang et al. 2007, 2008) and for CVNNs (Xu et al. 2010). As noted by Adali et al. (2008), Assumption (A2) is satisfied by quite a number of functions that are qualified activation functions for fully CVNNs. Assumption (A3) is used to establish the strong convergence result.

Now we present our convergence results.

Theorem 1

Suppose that the error function is given by (2), that the weight sequence $\{w^{n}\}$ is generated by algorithm (8) with an arbitrary initial value $w^{0}$, that $0<\eta<L$, where $L$ is defined in (27) below, and that Assumptions (A1) and (A2) are valid. Then we have

  1. $E(w^{n+1})\le E(w^{n})$, $n=0,1,2,\ldots$;   (9)
  2. There is $E^{*}\ge 0$ such that $\lim_{n\to\infty}E(w^{n})=E^{*}$;   (10)
  3. The weak convergence holds: $\lim_{n\to\infty}\|\nabla_{\bar w}E(w^{n})\|=0$.   (11)
     Moreover, if Assumption (A3) is valid, then the strong convergence holds: there exists a point $w^{*}\in\Phi_1$ such that
  4. $\lim_{n\to\infty}w^{n}=w^{*}$.   (12)

Proofs

Lemma 1

[see Theorem 10 in Mcleod (1965)] Suppose $h$ is a holomorphic function defined on a connected open set $G$ in the complex plane. If $z_1$ and $z_2$ are points in $G$ such that the segment joining them also lies in $G$, then

$$
h(z_2)-h(z_1)=(z_2-z_1)\big(\lambda_1 h'(\xi_1)+\lambda_2 h'(\xi_2)\big) \tag{13}
$$

for some $\xi_1$ and $\xi_2$ on the segment joining $z_1$ and $z_2$, and some nonnegative numbers $\lambda_1$ and $\lambda_2$ such that $\lambda_1+\lambda_2=1$.
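As a sanity check (ours, not the paper's), Lemma 1 accommodates the exponential counterexample from the introduction: for $h(z)=e^{z}$ and $z_2=z_1+2\pi i$, choosing $\xi_1=z_1$ and $\xi_2=z_1+\pi i$ (both on the segment) with $\lambda_1=\lambda_2=\tfrac12$ gives
$$
(z_2-z_1)\big(\lambda_1 h'(\xi_1)+\lambda_2 h'(\xi_2)\big)=2\pi i\cdot\tfrac12\big(e^{z_1}+e^{z_1+\pi i}\big)=0=h(z_2)-h(z_1),
$$
so the two-point mean value form (13) holds even though no single mean point exists.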

Lemma 2

Suppose Assumptions (A1) and (A2) are valid. Then $\nabla_{\bar w}E(w)$ satisfies a Lipschitz condition, that is, there exists a positive constant $L_1$ such that

$$
\big\|\nabla_{\bar w}E(w^{n+1})-\nabla_{\bar w}E(w^{n})\big\|\le L_1\big\|w^{n+1}-w^{n}\big\|. \tag{14}
$$

Similarly, there exists a positive constant $L_2$ such that

$$
\big\|\nabla_{w}E(w^{n+1})-\nabla_{w}E(w^{n})\big\|\le L_2\big\|w^{n+1}-w^{n}\big\|. \tag{15}
$$

Proof

For simplicity, we introduce the following notations:

$$
F^{n,k}=F(V^{n}z_k),\qquad \bar F^{n,k}=F\big(\overline{V^{n}}\,\overline{z_k}\big), \tag{16}
$$

for $n=1,2,\ldots$ and $k=1,2,\ldots,K$.

By Assumption (A2), $f$ and $g$ have derivatives of any order in the zone $\{z:|z|<R\}$. In addition, recalling that $\{z_k,d_k\}_{k=1}^{K}$ is finite and $\{w^{n}\}$ is bounded, we can define the constants

$$
c_2=\sup_{l,n,k}\big|\overline{w_l^{n}}\cdot\overline{z_k}\big|,\qquad c_3=\sup_{n,k}\big|\overline{w_0^{n}}\cdot F\big(\overline{V^{n}}\,\overline{z_k}\big)\big|, \tag{17}
$$
$$
c_4=\sup_{|z|<\max\{c_2,c_3\}}\big\{|f(z)|,\,|g(z)|,\,|f'(z)|,\,|g'(z)|,\,|g''(z)|\big\}. \tag{18}
$$

Using (18), Lemma 1 and the Cauchy–Schwarz inequality, for any $1\le k\le K$ and $n=0,1,\ldots$, we have

$$
\begin{aligned}
\big\|F^{n+1,k}-F^{n,k}\big\|
&=\left\|\begin{pmatrix} f(w_1^{n+1}\cdot z_k)-f(w_1^{n}\cdot z_k)\\ \vdots\\ f(w_q^{n+1}\cdot z_k)-f(w_q^{n}\cdot z_k)\end{pmatrix}\right\|
=\left\|\begin{pmatrix} (w_1^{n+1}\cdot z_k-w_1^{n}\cdot z_k)\big(\lambda_{11}f'(\xi_{11})+\lambda_{12}f'(\xi_{12})\big)\\ \vdots\\ (w_q^{n+1}\cdot z_k-w_q^{n}\cdot z_k)\big(\lambda_{q1}f'(\xi_{q1})+\lambda_{q2}f'(\xi_{q2})\big)\end{pmatrix}\right\|\\
&\le c_5\left\|\begin{pmatrix}\|w_1^{n+1}-w_1^{n}\|\\ \vdots\\ \|w_q^{n+1}-w_q^{n}\|\end{pmatrix}\right\|
\le c_5\sum_{l=1}^{q}\big\|w_l^{n+1}-w_l^{n}\big\|,
\end{aligned}
\tag{19}
$$

where $c_5=\sqrt{q}\,c_4\sup_{k}\|z_k\|$, $\lambda_{l1}\ge 0$, $\lambda_{l2}\ge 0$, $\lambda_{l1}+\lambda_{l2}=1$, and $\xi_{l1}$ and $\xi_{l2}$ lie on the segment joining $w_l^{n+1}\cdot z_k$ and $w_l^{n}\cdot z_k$, $l=1,\ldots,q$.

Similarly, we have

$$
\big\|\bar F^{n+1,k}-\bar F^{n,k}\big\|\le c_5\sum_{l=1}^{q}\big\|w_l^{n+1}-w_l^{n}\big\|. \tag{20}
$$

By (18), (19), Lemma 1 and the Cauchy–Schwarz inequality, we have, for any $1\le k\le K$ and $n=0,1,\ldots$,

$$
\begin{aligned}
\big|y_k^{n+1}-y_k^{n}\big|
&=\big|g(w_0^{n+1}\cdot F^{n+1,k})-g(w_0^{n}\cdot F^{n,k})\big|
=\big|(w_0^{n+1}\cdot F^{n+1,k}-w_0^{n}\cdot F^{n,k})\big(\eta_1 g'(\zeta_1)+\eta_2 g'(\zeta_2)\big)\big|\\
&\le c_4\big(\|w_0^{n+1}-w_0^{n}\|\,\|F^{n+1,k}\|+\|w_0^{n}\|\,\|F^{n+1,k}-F^{n,k}\|\big)\\
&\le \sqrt{q}\,c_4^{2}\,\|w_0^{n+1}-w_0^{n}\|+c_1c_4c_5\sum_{l=1}^{q}\|w_l^{n+1}-w_l^{n}\|
\le c_6\big\|w^{n+1}-w^{n}\big\|,
\end{aligned}
\tag{21}
$$

where $c_6=\sqrt{q+1}\,\max\{\sqrt{q}\,c_4^{2},\,c_1c_4c_5\}$, $\eta_1\ge 0$, $\eta_2\ge 0$, $\eta_1+\eta_2=1$, and $\zeta_1$ and $\zeta_2$ lie on the segment joining $w_0^{n+1}\cdot F^{n+1,k}$ and $w_0^{n}\cdot F^{n,k}$.

In the same way, we can prove that

$$
\big|g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)-g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\big|\le c_6\big\|w^{n+1}-w^{n}\big\|. \tag{22}
$$

With (18), (20), (22), and the Cauchy–Schwarz inequality, we obtain

$$
\begin{aligned}
&\big\|g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}-g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\bar F^{n,k}\big\|\\
&\quad=\big\|\big(g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big)\bar F^{n+1,k}+g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big(\bar F^{n+1,k}-\bar F^{n,k}\big)\big\|\\
&\quad\le\big|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big|\,\|\bar F^{n+1,k}\|+\big|g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big|\,\|\bar F^{n+1,k}-\bar F^{n,k}\|\\
&\quad\le \sqrt{q}\,c_4c_6\,\|w^{n+1}-w^{n}\|+c_4c_5\sum_{l=1}^{q}\|w_l^{n+1}-w_l^{n}\|
\le \sqrt{q}\,c_4(c_5+c_6)\,\big\|w^{n+1}-w^{n}\big\|.
\end{aligned}
\tag{23}
$$

Combining (18), (21), (23), and the Cauchy–Schwarz inequality, we conclude that

$$
\begin{aligned}
&\left\|\frac{\partial E(w^{n+1})}{\partial \overline{w_0}}-\frac{\partial E(w^{n})}{\partial \overline{w_0}}\right\|
=\left\|\sum_{k=1}^{K}(y_k^{n+1}-d_k)\,g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}-\sum_{k=1}^{K}(y_k^{n}-d_k)\,g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\bar F^{n,k}\right\|\\
&\quad=\left\|\sum_{k=1}^{K}\Big((y_k^{n+1}-y_k^{n})\,g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}+(y_k^{n}-d_k)\big(g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big)\Big)\right\|\\
&\quad\le\sum_{k=1}^{K}\Big(|y_k^{n+1}-y_k^{n}|\,\big|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\big|\,\|\bar F^{n+1,k}\|+|y_k^{n}-d_k|\,\big\|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big\|\Big)\\
&\quad\le\sum_{k=1}^{K}\Big(\sqrt{q}\,c_4^{2}\,|y_k^{n+1}-y_k^{n}|+\big(c_4+\sup_k|d_k|\big)\big\|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big\|\Big)\\
&\quad\le L_3\big\|w^{n+1}-w^{n}\big\|,
\end{aligned}
\tag{24}
$$

where $L_3=K\sqrt{q}\,c_4\big(c_4c_6+(c_4+\sup_k|d_k|)(c_5+c_6)\big)$.

Similarly, there exists a Lipschitz constant $L_4$ such that, for $l=1,\ldots,q$,

$$
\left\|\frac{\partial E(w^{n+1})}{\partial \overline{w_l}}-\frac{\partial E(w^{n})}{\partial \overline{w_l}}\right\|\le L_4\big\|w^{n+1}-w^{n}\big\|. \tag{25}
$$

Hence, (7), (24), and (25) validate (14) by setting $L_1=L_3+qL_4$.

Equation (15) can be proved in a similar way to (14). □

Now we proceed to the proof of Theorem 1 by dealing with Equations (9)–(12) separately.

Proof of (9)

By the differential mean value theorem, there exists a constant $\theta\in[0,1]$ such that, writing $\Delta w^{n}=w^{n+1}-w^{n}=-\eta\nabla_{\bar w}E(w^{n})$,

$$
\begin{aligned}
E(w^{n+1})-E(w^{n})
&=\big(\nabla_{w}E(w^{n}+\theta\Delta w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&=\big(\nabla_{w}E(w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&\quad+\big(\nabla_{w}E(w^{n}+\theta\Delta w^{n})-\nabla_{w}E(w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})-\nabla_{\bar w}E(w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&\le 2\,\mathrm{Re}\big(\nabla_{w}E(w^{n})^{T}\Delta w^{n}\big)+\big(\|\nabla_{w}E(w^{n}+\theta\Delta w^{n})-\nabla_{w}E(w^{n})\|+\|\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})-\nabla_{\bar w}E(w^{n})\|\big)\|\Delta w^{n}\|\\
&\le -2\eta\,\mathrm{Re}\big((\nabla_{\bar w}E(w^{n}))^{H}\nabla_{\bar w}E(w^{n})\big)+(L_1+L_2)\theta\|\Delta w^{n}\|^{2}\\
&=\big(-2\eta+(L_1+L_2)\theta\eta^{2}\big)\big\|\nabla_{\bar w}E(w^{n})\big\|^{2}.
\end{aligned}
\tag{26}
$$

To make (9) valid, we only require the learning rate η to satisfy

$$
0<\eta<L, \tag{27}
$$

where $L=\dfrac{2}{(L_1+L_2)\theta}$. □
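As a practical remark of ours (not stated in the paper): since $\theta\le 1$, we have $L=\tfrac{2}{(L_1+L_2)\theta}\ge\tfrac{2}{L_1+L_2}$, so any fixed learning rate with $0<\eta<\tfrac{2}{L_1+L_2}$ is sufficient for (27) regardless of the unknown value of $\theta$.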

Proof of (10)

Equation (10) follows directly from (9) and the fact that $E(w^{n})\ge 0$ for $n=1,2,\ldots$, since a monotonically decreasing sequence bounded below converges.

Proof of (11)

Let $\beta=2\eta-(L_1+L_2)\theta\eta^{2}$. By (26), we have

$$
E(w^{n+1})\le E(w^{n})-\beta\big\|\nabla_{\bar w}E(w^{n})\big\|^{2}\le\cdots\le E(w^{0})-\beta\sum_{t=0}^{n}\big\|\nabla_{\bar w}E(w^{t})\big\|^{2}. \tag{28}
$$

Since $E(w^{n+1})\ge 0$, letting $n\to\infty$ we obtain

$$
\beta\sum_{t=0}^{\infty}\big\|\nabla_{\bar w}E(w^{t})\big\|^{2}\le E(w^{0})<\infty. \tag{29}
$$

This immediately gives

$$
\lim_{n\to\infty}\big\|\nabla_{\bar w}E(w^{n})\big\|=0. \tag{30}
$$

The following lemma, which will be used in the proof of (12), is a generalization of Theorem 14.1.5 of Ortega and Rheinboldt (1970) from the real domain to the complex domain. The proof follows the same route as in Ortega and Rheinboldt (1970) and is omitted here.

Lemma 3

(Ortega and Rheinboldt 1970) Let $\varphi:\Phi\subset\mathbb{C}^{k}\to\mathbb{C}$ $(k\ge 1)$ be continuous on a bounded closed region $\Phi$, and let $\Phi_0=\{z\in\Phi:\varphi(z)=0\}$. Suppose that the set $\Phi_0$ contains only finitely many points and that the sequence $\{z^{n}\}\subset\Phi$ satisfies:

  • (i) $\lim_{n\to\infty}\varphi(z^{n})=0$;

  • (ii) $\lim_{n\to\infty}\|z^{n+1}-z^{n}\|=0$.

Then there exists a unique $z^{*}\in\Phi_0$ such that $\lim_{n\to\infty}z^{n}=z^{*}$.

Proof of (12)

Obviously, $\nabla_{\bar w}E(w)$ is continuous under Assumption (A2). Using (8) and (11), we have

$$
\lim_{n\to\infty}\big\|w^{n+1}-w^{n}\big\|=\eta\lim_{n\to\infty}\big\|\nabla_{\bar w}E(w^{n})\big\|=0. \tag{31}
$$

Furthermore, Assumption (A3) is valid. Thus, applying Lemma 3, there exists a unique $w^{*}\in\Phi_1$ such that $\lim_{n\to\infty}w^{n}=w^{*}$. □

Simulation result

In this section we illustrate the convergence behavior of the FCBPA on the problem of one-step-ahead prediction of complex-valued nonlinear signals. The nonlinear benchmark input signal is given by (Mandic and Goh 2009)

$$
z(t)=\frac{z(t-1)}{1+z^{2}(t-1)}+n^{3}(t), \tag{32}
$$

where n(t) is a complex white Gaussian noise with zero mean and unit variance.

This example uses a network with one input node, five hidden nodes, and one output node. We set the activation function of both the hidden layer and the output layer to $\sin(\cdot)$, which is analytic in the complex domain. The learning rate is $\eta=0.1$. The test is carried out with initial weights (both real and imaginary parts) drawn at random from the interval $[-0.1, 0.1]$. The simulation results are presented in Fig. 1, which shows that the gradient tends to zero and that the squared error decreases monotonically as the number of iterations increases, finally tending to a constant. This supports our theoretical findings.
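The experiment can be reproduced along the following lines. This is a sketch of ours that reuses the forward and fcbpa_step helpers outlined earlier; the sample size, number of epochs, and random seed are not specified in the paper and are chosen arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Benchmark signal (32): z(t) = z(t-1)/(1 + z^2(t-1)) + n^3(t),
# with n(t) complex white Gaussian noise of zero mean and unit variance.
T = 500                                            # sample count (our choice)
n = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
z = np.zeros(T, dtype=complex)
for t in range(1, T):
    z[t] = z[t - 1] / (1 + z[t - 1] ** 2) + n[t] ** 3

zs = z[:-1].reshape(-1, 1)                         # inputs z(t-1), p = 1
ds = z[1:]                                         # targets z(t)

q, p, eta = 5, 1, 0.1                              # 1-5-1 network, eta = 0.1
w0 = rng.uniform(-0.1, 0.1, q) + 1j * rng.uniform(-0.1, 0.1, q)
V = rng.uniform(-0.1, 0.1, (q, p)) + 1j * rng.uniform(-0.1, 0.1, (q, p))

for epoch in range(200):                           # number of epochs (our choice)
    w0, V = fcbpa_step(w0, V, zs, ds, eta)         # batch update (8)
    sq_err = sum(abs(forward(w0, V, zk) - dk) ** 2 for zk, dk in zip(zs, ds))
    # sq_err should decrease monotonically, in line with conclusion (9) of Theorem 1
```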

Fig. 1 Convergence behavior of FCBPA

Conclusion

In this paper, under the framework of Wirtinger calculus, we investigate the FCBPA for fully CVNNs. Using a mean value theorem for holomorphic functions, we prove under mild conditions that the gradient of the error function with respect to the network weight vector satisfies a Lipschitz condition. Based on this result, both the weak convergence and the strong convergence of the algorithm are proved. Simulation results substantiate the theoretical findings.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (61101228, 10871220), the China Postdoctoral Science Foundation (No. 2012M520623), the Research Fund for the Doctoral Program of Higher Education of China (No. 20122304120028), and the Fundamental Research Funds for the Central Universities.

References

  1. Adali T, Li H, Novey M, et al. Complex ICA using nonlinear functions. IEEE Trans Signal Process. 2008;56(9):4536–4544.
  2. Bos AVD. Complex gradient and Hessian. Proc Inst Elec Eng Vision Image Signal Process. 1994;141:380–382. doi: 10.1049/ip-vis:19941555.
  3. Brandwood D. Complex gradient operator and its application in adaptive array theory. Proc Inst Electr Eng. 1983;130:11–16.
  4. Fink O, Zio E, Weidmann U. Predicting component reliability and level of degradation with complex-valued neural networks. Reliab Eng Syst Safe. 2014;121:198–206. doi: 10.1016/j.ress.2013.08.004.
  5. Hirose A. Complex-valued neural networks. Berlin Heidelberg: Springer-Verlag; 2012.
  6. Kim T, Adali T. Approximation by fully complex multilayer perceptrons. Neural Comput. 2003;15:1641–1666. doi: 10.1162/089976603321891846.
  7. Li H, Adali T. Complex-valued adaptive signal processing using nonlinear functions. EURASIP J Adv Signal Process. 2008;2008:122.
  8. Mandic DP, Goh SL. Complex valued nonlinear adaptive filters. Chichester: Wiley; 2009.
  9. Mcleod RM. Mean value theorems for vector valued functions. Proc Edinburgh Math Soc. 1965;14(2):197–209. doi: 10.1017/S0013091500008786.
  10. Nitta T. An extension of the back-propagation algorithm to complex numbers. Neural Netw. 1997;10(8):1391–1415. doi: 10.1016/S0893-6080(97)00036-1.
  11. Nitta T. Local minima in hierarchical structures of complex-valued neural networks. Neural Netw. 2013;43:1–7. doi: 10.1016/j.neunet.2013.02.002.
  12. Osborn GW. A Kalman filtering approach to the representation of kinematic quantities by the hippocampal-entorhinal complex. Cogn Neurodyn. 2010;4:315–335. doi: 10.1007/s11571-010-9115-z.
  13. Shao HM, Zheng GF. Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing. 2011;74:765–770. doi: 10.1016/j.neucom.2010.10.005.
  14. Wei H, Ren Y, Wang ZY. A computational neural model of orientation detection based on multiple guesses: comparison of geometrical and algebraic models. Cogn Neurodyn. 2013;7:361–379. doi: 10.1007/s11571-012-9235-8.
  15. Wu W, Feng GR, Li ZX, et al. Deterministic convergence of an online gradient method for BP neural networks. IEEE Trans Neural Netw. 2005;16:533–540. doi: 10.1109/TNN.2005.844903.
  16. Wang J, Wu W, Zurada J. Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing. 2011;74:2368–2376. doi: 10.1016/j.neucom.2011.03.016.
  17. Xu DP, Zhang HS, Liu L. Convergence analysis of three classes of split-complex gradient algorithms for complex-valued recurrent neural networks. Neural Comput. 2010;22(10):2655–2677. doi: 10.1162/NECO_a_00021.
  18. Zhang C, Wu W, Xiong Y. Convergence analysis of batch gradient algorithm for three classes of sigma-pi neural networks. Neural Process Lett. 2007;26:177–180. doi: 10.1007/s11063-007-9050-0.
  19. Zhang C, Wu W, Chen XH, et al. Convergence of BP algorithm for product unit neural networks with exponential weights. Neurocomputing. 2008;72:513–520. doi: 10.1016/j.neucom.2007.12.004.
  20. Zhang HS, Wu W, Liu F, Yao MC. Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw. 2009;20(6):1050–1054. doi: 10.1109/TNN.2009.2020848.
  21. Zhang HS, Xu DP, Zhang Y. Boundedness and convergence of split-complex back-propagation algorithm with momentum and penalty. Neural Process Lett. 2013. doi: 10.1007/s11063-013-9305-x.
  22. Ortega JM, Rheinboldt WC. Iterative solution of nonlinear equations in several variables. New York: Academic Press; 1970.
