Cognitive Neurodynamics. 2014 Jan 3;8(3):261–266. doi: 10.1007/s11571-013-9276-7

Convergence analysis of fully complex backpropagation algorithm based on Wirtinger calculus

Huisheng Zhang 1,2, Xiaodong Liu 2, Dongpo Xu 3, Ying Zhang 1
PMCID: PMC4012068  PMID: 24808934

Abstract

This paper considers the fully complex backpropagation algorithm (FCBPA) for training fully complex-valued neural networks. We prove both the weak convergence and the strong convergence of FCBPA under mild conditions. The monotone decrease of the error function during the training process is also established. The derivation and analysis of the algorithm are carried out within the framework of Wirtinger calculus, which greatly reduces the description complexity. The theoretical results are substantiated by a simulation example.

Keywords: Complex-valued neural networks, Fully complex backpropagation algorithm, Wirtinger calculus, Convergence

Introduction

The theoretical study and practical implementation of complex-valued neural networks (CVNNs) have attracted considerable attention in signal processing, pattern recognition, and medical information processing (Fink et al. 2014; Hirose 2012; Nitta 2013). Based on different choices of the activation function, there are two main CVNN models: the split CVNN (Nitta 1997) and the fully CVNN (Kim and Adali 2003). The split CVNN uses a pair of real-valued functions to separately process the real part and the imaginary part of the neuron's input signal. This strategy effectively avoids the singularity problem during training. In contrast, the activation functions of the fully CVNN are fully complex-valued, which helps the network make full use of the phase information and thus achieve better performance in some applications. As one of the most popular training methods for neural networks, the backpropagation algorithm (BPA) has been extended from the real domain to the complex domain in order to train CVNNs. Accordingly, there are two types of complex BPA: one is the split-complex BPA (SCBPA) (Nitta 1997) for the split CVNN, and the other is the fully complex BPA (FCBPA) (Li and Adali 2008) for the fully CVNN.

Convergence is a precondition for any iterative algorithm to be used in real applications (Wei et al. 2013; Osborn 2010). The convergence of the BPA has been extensively studied in the literature (Wu et al. 2005, 2011; Zhang et al. 2007, 2008, 2009; Shao and Zheng 2011), where the boundedness and differentiability of the activation function are usually two necessary conditions for the convergence analysis. However, by Liouville's theorem (an entire and bounded function in the complex domain is a constant), a complex activation function cannot be both bounded and analytic. This conflict between boundedness and differentiability makes the theoretical convergence analysis for the complex BPA more difficult than that for the real-valued BPA. Fortunately, since the activation functions of the split CVNN can be split into two bounded and differentiable real-valued functions, the convergence analysis of SCBPA can be conducted in the real domain; for the corresponding convergence results, we refer to (Nitta 1997; Zhang et al. 2013; Xu et al. 2010). However, although FCBPA has been widely used in many applications and has been experimentally shown to be convergent for certain activation functions (Kim and Adali 2003), its theoretical convergence analysis remains challenging.

Besides the challenge posed by Liouville's theorem, another difficulty for the theoretical convergence analysis of FCBPA is that the traditional mean value theorem, which is vital for the convergence analysis of BPA, does not hold in the complex domain. (For example, take $f(z)=e^{z}$ and $z_2=z_1+2\pi i$. We have $f(z_2)-f(z_1)=0$, but $(z_2-z_1)f'(w)=2\pi i\,e^{w}\neq 0$ for all $w$.) By expanding the analytic function in a Taylor series and omitting the higher-order terms, some local stability results for complex ICA have been obtained (Adali et al. 2008). Under the assumption that the activation function is a contraction, the convergence of complex nonlinear adaptive filters has been proved (Mandic and Goh 2009). However, to the best of our knowledge, theoretical convergence results for FCBPA have not yet been established. This is the main concern of this paper. Specifically, we make the following contributions:

  • By introducing a mean value theorem for holomorphic functions (Mcleod 1965), we prove both the weak convergence and the strong convergence of FCBPA.

  • Instead of dropping the higher-order terms of the Taylor series (Adali et al. 2008), we use the mean value theorem to obtain an accurate estimate of the change of the error function between two consecutive iterations. As a result, our results are global in nature: they are valid for arbitrarily chosen initial weights.

  • The restrictive condition that the activation function is a contraction is not needed in our analysis.

  • The derivation and analysis of the algorithm are under the framework of Wirtinger calculus, which greatly reduces the description complexity.

The remainder of this paper is organized as follows. The network structure and the derivation of the FCBPA based on Wirtinger calculus are described in the next section. The “Main results” section presents the main convergence theorem of the paper. The detailed proof of the theorem is given in the “Proofs” section. In the “Simulation result” section we use a simulation example to support our theoretical results. The paper is concluded in the “Conclusion” section.

Network structure and FCBPA based on Wirtinger calculus

We consider a single-hidden-layer feedforward network consisting of $p$ input nodes, $q$ hidden nodes, and 1 output node. Let $w_0=(w_{01},w_{02},\ldots,w_{0q})^{T}\in\mathbb{C}^{q}$ be the weight vector between all the hidden units and the output unit, and $w_l=(w_{l1},w_{l2},\ldots,w_{lp})^{T}\in\mathbb{C}^{p}$ be the weight vector between all the input units and the hidden unit $l$ ($l=1,2,\ldots,q$). To simplify the presentation, we write all the weight parameters in a compact form, i.e., $w=(w_0^{T},w_1^{T},\ldots,w_q^{T})^{T}\in\mathbb{C}^{q+pq}$, and we define a matrix $V=(w_1,w_2,\ldots,w_q)^{T}\in\mathbb{C}^{q\times p}$.

Given activation functions $f,g:\mathbb{C}\to\mathbb{C}$ for the hidden layer and the output layer, respectively, we define a vector function $F(x)=(f(x_1),f(x_2),\ldots,f(x_q))^{T}$ for $x=(x_1,x_2,\ldots,x_q)^{T}\in\mathbb{C}^{q}$. For an input $z\in\mathbb{C}^{p}$, the output vector of the hidden layer can be written as $F(Vz)$, and the final output of the network can be written as

$$
y=g\big(w_0\cdot F(Vz)\big), \tag{1}
$$

where $w_0\cdot F(Vz)$ denotes the inner product of the two vectors $w_0$ and $F(Vz)$.
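For readers who wish to experiment, the forward pass (1) can be sketched in a few lines of Python/NumPy. This is a minimal illustration of our own, not code from the paper; the function name is ours, and we assume the dot product in (1) is taken without conjugation, which is consistent with Eqs. (4)-(6) below.

```python
import numpy as np

def forward(w0, V, z, f=np.sin, g=np.sin):
    """Output y = g(w0 . F(V z)) of the single-hidden-layer fully complex network, Eq. (1).

    w0 : (q,) complex hidden-to-output weight vector
    V  : (q, p) complex matrix whose rows are w_1, ..., w_q
    z  : (p,) complex input
    f, g : fully complex activation functions (sin is used in the simulation section)
    """
    hidden = f(V @ z)               # F(Vz): hidden-layer outputs
    return g(np.dot(w0, hidden))    # np.dot does not conjugate, matching w0 . F(Vz)
```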

Suppose that $\{z_k,d_k\}_{k=1}^{K}\subset\mathbb{C}^{p}\times\mathbb{C}$ is a given set of training samples, where $z_k$ is the input and $d_k$ is the desired output. The aim of the network training is to find appropriate network weights $w$ that minimize the error function

$$
E(w)=\sum_{k=1}^{K}(d_k-y_k)\big(\overline{d_k}-\overline{y_k}\big), \tag{2}
$$

where

$$
y_k=g\big(w_0\cdot F(Vz_k)\big) \tag{3}
$$

and the overbar $\overline{\,\cdot\,}$ denotes complex conjugation.
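Note that (2) is simply the squared prediction error written in complex form, $E(w)=\sum_{k=1}^{K}|d_k-y_k|^{2}$. In particular, $E(w)$ is real-valued and nonnegative, a fact used later in the proofs of (10) and (11).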

As noted by Adali et al. (2008), any function $h(z)$ that is analytic in a bounded zone $|z|<R$ and has a Taylor series expansion with all real coefficients in $|z|<R$ satisfies the property $\overline{h(z)}=h(\bar z)$: indeed, if $h(z)=\sum_n a_n z^{n}$ with real $a_n$, then $\overline{h(z)}=\sum_n a_n\bar z^{\,n}=h(\bar z)$. Examples of such functions include polynomials and most trigonometric functions and their hyperbolic counterparts, which are qualified activation functions for CVNNs (Kim and Adali 2003). We therefore suppose that both activation functions $f(\cdot)$ and $g(\cdot)$ satisfy $\overline{f(z)}=f(\bar z)$ and $\overline{g(z)}=g(\bar z)$. Consequently,

$$
\overline{y_k}=g\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big). \tag{4}
$$

$E(w)$ can be viewed as a function of the complex variable vector $w$ and its conjugate $\bar w$. According to Wirtinger calculus (Brandwood 1983; Bos 1994), we can define two gradient vectors: $\nabla_{w}E$ (taking partial derivatives with respect to $w$ while treating $\bar w$ as a constant vector in $E$) and $\nabla_{\bar w}E$ (taking partial derivatives with respect to $\bar w$ while treating $w$ as a constant vector). The gradient $\nabla_{\bar w}E$ then defines the direction of the maximum rate of change of $E(w)$ with respect to $w$. Since the network output $y_k$ does not explicitly contain the variable vector $\bar w$, we have $\nabla_{\bar w}y_k=0$. Thus, by the chain rule of Wirtinger calculus, we have

$$
\frac{\partial E(w)}{\partial \overline{w_0}}=\sum_{k=1}^{K}(y_k-d_k)\,g'\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big)\,F(\overline{V}\,\overline{z_k}), \tag{5}
$$
$$
\frac{\partial E(w)}{\partial \overline{w_l}}=\sum_{k=1}^{K}(y_k-d_k)\,g'\big(\overline{w_0}\cdot F(\overline{V}\,\overline{z_k})\big)\,\overline{w_{0l}}\,f'\big(\overline{w_l}\cdot\overline{z_k}\big)\,\overline{z_k},\qquad l=1,2,\ldots,q. \tag{6}
$$

Obviously,

$$
\nabla_{\bar w}E(w)=\left(\left(\frac{\partial E(w)}{\partial \overline{w_0}}\right)^{T},\left(\frac{\partial E(w)}{\partial \overline{w_1}}\right)^{T},\ldots,\left(\frac{\partial E(w)}{\partial \overline{w_q}}\right)^{T}\right)^{T}. \tag{7}
$$

Starting from an arbitrary initial value $w^{0}$, the FCBPA based on Wirtinger calculus updates the weights $\{w^{n}\}$ iteratively by

$$
w^{n+1}=w^{n}-\eta\,\nabla_{\bar w}E(w^{n}), \tag{8}
$$

where $\eta>0$ is the learning rate.
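A minimal NumPy sketch of one batch update (8), with the conjugate gradients assembled from (5) and (6), may help make the formulas concrete. It is an illustration under our reading of the notation (no conjugation in the dot products, and sin/cos standing in for f, f', g, g'); the helper name fcbpa_step is ours.

```python
import numpy as np

def fcbpa_step(w0, V, zs, ds, eta, f=np.sin, df=np.cos, g=np.sin, dg=np.cos):
    """One FCBPA iteration: gradients (5)-(6) followed by the update (8).

    zs : (K, p) complex training inputs, ds : (K,) complex targets.
    f, g are assumed to satisfy conj(f(z)) = f(conj(z)); sin does, and cos is its derivative.
    """
    grad_w0 = np.zeros_like(w0)     # dE/d(conj(w0)), Eq. (5)
    grad_V = np.zeros_like(V)       # row l holds dE/d(conj(w_l)), Eq. (6)
    for z, d in zip(zs, ds):
        net = V @ z                           # hidden pre-activations w_l . z
        h = f(net)                            # F(Vz)
        y = g(np.dot(w0, h))                  # network output, Eq. (1)
        s_bar = np.conj(np.dot(w0, h))        # conj(w0) . F(conj(V) conj(z))
        e = y - d
        grad_w0 += e * dg(s_bar) * np.conj(h)
        grad_V += np.outer(e * dg(s_bar) * np.conj(w0) * df(np.conj(net)),
                           np.conj(z))
    # Gradient step (8): w^{n+1} = w^n - eta * grad_{conj(w)} E(w^n)
    return w0 - eta * grad_w0, V - eta * grad_V
```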

Main results

The following assumptions are needed in our convergence analysis.

  • (A1) There exists a constant $c_1>0$ such that $\|w_l^{n}\|\le c_1$ for all $l=0,1,\ldots,q$ and $n=0,1,\ldots$;

  • (A2) The functions $f(z)$ and $g(z)$ are analytic in a bounded zone $|z|<R$ and have Taylor series expansions with all real coefficients in $|z|<R$, where $R>\max\{c_2,c_3\}$ ($c_2$ and $c_3$ are defined in (17) below).

  • (A3) The set $\Phi_1=\{w:\nabla_{\bar w}E(w)=0\}$ contains only finitely many points.

Remark 1

Assumption (A1) is the usual condition in the literature for the convergence analysis of gradient methods, both for real-valued neural networks (Zhang et al. 2007, 2008) and for CVNNs (Xu et al. 2010). As noted by Adali et al. (2008), Assumption (A2) is satisfied by quite a number of functions that are qualified activation functions for fully CVNNs. Assumption (A3) is used to establish the strong convergence result.

Now we present our convergence results.

Theorem 1

Suppose that the error function is given by (2), that the weight sequence $\{w^{n}\}$ is generated by algorithm (8) with an arbitrary initial value $w^{0}$, that $0<\eta<L$, where $L$ is defined in (27) below, and that Assumptions (A1) and (A2) are valid. Then we have

  1. $E(w^{n+1})\le E(w^{n})$, $n=0,1,2,\ldots$;   (9)
  2. There is $E^{*}\ge 0$ such that $\lim_{n\to\infty}E(w^{n})=E^{*}$;   (10)
  3. The weak convergence holds: $\lim_{n\to\infty}\|\nabla_{\bar w}E(w^{n})\|=0$.   (11)
     Moreover, if Assumption (A3) is valid, then the strong convergence holds: there exists a point $w^{*}\in\Phi_1$ such that
  4. $\lim_{n\to\infty}w^{n}=w^{*}$.   (12)

Proofs

Lemma 1

[see Theorem 10 in Mcleod (1965)] Suppose $h$ is a holomorphic function defined on a connected open set $G$ in the complex plane. If $z_1$ and $z_2$ are points in $G$ such that the segment joining them also lies in $G$, then

$$
h(z_2)-h(z_1)=(z_2-z_1)\big(\lambda_1 h'(\xi_1)+\lambda_2 h'(\xi_2)\big) \tag{13}
$$

for some $\xi_1$ and $\xi_2$ on the segment joining $z_1$ and $z_2$, and some nonnegative numbers $\lambda_1$ and $\lambda_2$ such that $\lambda_1+\lambda_2=1$.
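As a sanity check (ours, not the paper's), Lemma 1 accommodates the exponential counterexample from the introduction: for $h(z)=e^{z}$ and $z_2=z_1+2\pi i$, choosing $\xi_1=z_1$ and $\xi_2=z_1+\pi i$ (both on the segment) with $\lambda_1=\lambda_2=\tfrac12$ gives
$$
(z_2-z_1)\big(\lambda_1 h'(\xi_1)+\lambda_2 h'(\xi_2)\big)=2\pi i\cdot\tfrac12\big(e^{z_1}+e^{z_1+\pi i}\big)=0=h(z_2)-h(z_1),
$$
so the two-point mean value form (13) holds even though no single mean point exists.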

Lemma 2

Suppose Assumptions (A1) and (A2) are valid. Then $\nabla_{\bar w}E(w)$ satisfies a Lipschitz condition, that is, there exists a positive constant $L_1$ such that

$$
\big\|\nabla_{\bar w}E(w^{n+1})-\nabla_{\bar w}E(w^{n})\big\|\le L_1\big\|w^{n+1}-w^{n}\big\|. \tag{14}
$$

Similarly, there exists a positive constant $L_2$ such that

$$
\big\|\nabla_{w}E(w^{n+1})-\nabla_{w}E(w^{n})\big\|\le L_2\big\|w^{n+1}-w^{n}\big\|. \tag{15}
$$

Proof

For simplicity, we introduce the following notations:

$$
F^{n,k}=F(V^{n}z_k),\qquad \bar F^{n,k}=F\big(\overline{V^{n}}\,\overline{z_k}\big), \tag{16}
$$

for $n=1,2,\ldots$ and $k=1,2,\ldots,K$.

By Assumption (A2), $f$ and $g$ have derivatives of any order in the zone $\{z:|z|<R\}$. In addition, recalling that $\{z_k,d_k\}_{k=1}^{K}$ is finite and $\{w^{n}\}$ is bounded, we can define the constants

$$
c_2=\sup_{l,n,k}\big|\overline{w_l^{n}}\cdot\overline{z_k}\big|,\qquad c_3=\sup_{n,k}\big|\overline{w_0^{n}}\cdot F\big(\overline{V^{n}}\,\overline{z_k}\big)\big|, \tag{17}
$$
$$
c_4=\sup_{|z|<\max\{c_2,c_3\}}\big\{|f(z)|,\,|g(z)|,\,|f'(z)|,\,|g'(z)|,\,|g''(z)|\big\}. \tag{18}
$$

Using (18), Lemma 1 and the Cauchy–Schwarz inequality, for any $1\le k\le K$ and $n=0,1,\ldots$, we have

$$
\begin{aligned}
\big\|F^{n+1,k}-F^{n,k}\big\|
&=\left\|\begin{pmatrix} f(w_1^{n+1}\cdot z_k)-f(w_1^{n}\cdot z_k)\\ \vdots\\ f(w_q^{n+1}\cdot z_k)-f(w_q^{n}\cdot z_k)\end{pmatrix}\right\|
=\left\|\begin{pmatrix} (w_1^{n+1}\cdot z_k-w_1^{n}\cdot z_k)\big(\lambda_{11}f'(\xi_{11})+\lambda_{12}f'(\xi_{12})\big)\\ \vdots\\ (w_q^{n+1}\cdot z_k-w_q^{n}\cdot z_k)\big(\lambda_{q1}f'(\xi_{q1})+\lambda_{q2}f'(\xi_{q2})\big)\end{pmatrix}\right\|\\
&\le c_5\left\|\begin{pmatrix}\|w_1^{n+1}-w_1^{n}\|\\ \vdots\\ \|w_q^{n+1}-w_q^{n}\|\end{pmatrix}\right\|
\le c_5\sum_{l=1}^{q}\big\|w_l^{n+1}-w_l^{n}\big\|,
\end{aligned}
\tag{19}
$$

where $c_5=\sqrt{q}\,c_4\sup_{k}\|z_k\|$, $\lambda_{l1}\ge 0$, $\lambda_{l2}\ge 0$, $\lambda_{l1}+\lambda_{l2}=1$, and $\xi_{l1}$ and $\xi_{l2}$ lie on the segment joining $w_l^{n+1}\cdot z_k$ and $w_l^{n}\cdot z_k$, $l=1,\ldots,q$.

Similarly, we have

$$
\big\|\bar F^{n+1,k}-\bar F^{n,k}\big\|\le c_5\sum_{l=1}^{q}\big\|w_l^{n+1}-w_l^{n}\big\|. \tag{20}
$$

By (18), (19), Lemma 1 and the Cauchy–Schwarz inequality, we have, for any $1\le k\le K$ and $n=0,1,\ldots$,

$$
\begin{aligned}
\big|y_k^{n+1}-y_k^{n}\big|
&=\big|g(w_0^{n+1}\cdot F^{n+1,k})-g(w_0^{n}\cdot F^{n,k})\big|
=\big|(w_0^{n+1}\cdot F^{n+1,k}-w_0^{n}\cdot F^{n,k})\big(\eta_1 g'(\zeta_1)+\eta_2 g'(\zeta_2)\big)\big|\\
&\le c_4\big(\|w_0^{n+1}-w_0^{n}\|\,\|F^{n+1,k}\|+\|w_0^{n}\|\,\|F^{n+1,k}-F^{n,k}\|\big)\\
&\le \sqrt{q}\,c_4^{2}\,\|w_0^{n+1}-w_0^{n}\|+c_1c_4c_5\sum_{l=1}^{q}\|w_l^{n+1}-w_l^{n}\|
\le c_6\big\|w^{n+1}-w^{n}\big\|,
\end{aligned}
\tag{21}
$$

where $c_6=\sqrt{q+1}\,\max\{\sqrt{q}\,c_4^{2},\,c_1c_4c_5\}$, $\eta_1\ge 0$, $\eta_2\ge 0$, $\eta_1+\eta_2=1$, and $\zeta_1$ and $\zeta_2$ lie on the segment joining $w_0^{n+1}\cdot F^{n+1,k}$ and $w_0^{n}\cdot F^{n,k}$.

In the same way, we can prove that

$$
\big|g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)-g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\big|\le c_6\big\|w^{n+1}-w^{n}\big\|. \tag{22}
$$

With (18), (20), (22), and the Cauchy–Schwarz inequality, we obtain

$$
\begin{aligned}
&\big\|g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}-g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\bar F^{n,k}\big\|\\
&\quad=\big\|\big(g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big)\bar F^{n+1,k}+g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big(\bar F^{n+1,k}-\bar F^{n,k}\big)\big\|\\
&\quad\le\big|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big|\,\|\bar F^{n+1,k}\|+\big|g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\big|\,\|\bar F^{n+1,k}-\bar F^{n,k}\|\\
&\quad\le \sqrt{q}\,c_4c_6\,\|w^{n+1}-w^{n}\|+c_4c_5\sum_{l=1}^{q}\|w_l^{n+1}-w_l^{n}\|
\le \sqrt{q}\,c_4(c_5+c_6)\,\big\|w^{n+1}-w^{n}\big\|.
\end{aligned}
\tag{23}
$$

Combining (18), (21), (23), and the Cauchy–Schwarz inequality, we conclude that

$$
\begin{aligned}
&\left\|\frac{\partial E(w^{n+1})}{\partial \overline{w_0}}-\frac{\partial E(w^{n})}{\partial \overline{w_0}}\right\|
=\left\|\sum_{k=1}^{K}(y_k^{n+1}-d_k)\,g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}-\sum_{k=1}^{K}(y_k^{n}-d_k)\,g'\big(\overline{w_0^{n}}\cdot\bar F^{n,k}\big)\bar F^{n,k}\right\|\\
&\quad=\left\|\sum_{k=1}^{K}\Big((y_k^{n+1}-y_k^{n})\,g'\big(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k}\big)\bar F^{n+1,k}+(y_k^{n}-d_k)\big(g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big)\Big)\right\|\\
&\quad\le\sum_{k=1}^{K}\Big(|y_k^{n+1}-y_k^{n}|\,\big|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\big|\,\|\bar F^{n+1,k}\|+|y_k^{n}-d_k|\,\big\|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big\|\Big)\\
&\quad\le\sum_{k=1}^{K}\Big(\sqrt{q}\,c_4^{2}\,|y_k^{n+1}-y_k^{n}|+\big(c_4+\sup_k|d_k|\big)\big\|g'(\overline{w_0^{n+1}}\cdot\bar F^{n+1,k})\bar F^{n+1,k}-g'(\overline{w_0^{n}}\cdot\bar F^{n,k})\bar F^{n,k}\big\|\Big)\\
&\quad\le L_3\big\|w^{n+1}-w^{n}\big\|,
\end{aligned}
\tag{24}
$$

where $L_3=K\sqrt{q}\,c_4\big(c_4c_6+(c_4+\sup_k|d_k|)(c_5+c_6)\big)$.

Similarly, there exists a Lipschitz constant $L_4$ such that, for $l=1,\ldots,q$,

$$
\left\|\frac{\partial E(w^{n+1})}{\partial \overline{w_l}}-\frac{\partial E(w^{n})}{\partial \overline{w_l}}\right\|\le L_4\big\|w^{n+1}-w^{n}\big\|. \tag{25}
$$

Hence, (7), (24), and (25) validate (14) by setting $L_1=L_3+qL_4$.

Equation (15) can be proved in a similar way to (14). □

Now we proceed to the proof of Theorem 1 by dealing with Equations (9)–(12) separately.

Proof of (9)

By the differential mean value theorem, there exists a constant $\theta\in[0,1]$ such that, writing $\Delta w^{n}=w^{n+1}-w^{n}=-\eta\nabla_{\bar w}E(w^{n})$,

$$
\begin{aligned}
E(w^{n+1})-E(w^{n})
&=\big(\nabla_{w}E(w^{n}+\theta\Delta w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&=\big(\nabla_{w}E(w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&\quad+\big(\nabla_{w}E(w^{n}+\theta\Delta w^{n})-\nabla_{w}E(w^{n})\big)^{T}\Delta w^{n}+\big(\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})-\nabla_{\bar w}E(w^{n})\big)^{T}\overline{\Delta w^{n}}\\
&\le 2\,\mathrm{Re}\big(\nabla_{w}E(w^{n})^{T}\Delta w^{n}\big)+\big(\|\nabla_{w}E(w^{n}+\theta\Delta w^{n})-\nabla_{w}E(w^{n})\|+\|\nabla_{\bar w}E(w^{n}+\theta\Delta w^{n})-\nabla_{\bar w}E(w^{n})\|\big)\|\Delta w^{n}\|\\
&\le -2\eta\,\mathrm{Re}\big((\nabla_{\bar w}E(w^{n}))^{H}\nabla_{\bar w}E(w^{n})\big)+(L_1+L_2)\theta\|\Delta w^{n}\|^{2}\\
&=\big(-2\eta+(L_1+L_2)\theta\eta^{2}\big)\big\|\nabla_{\bar w}E(w^{n})\big\|^{2}.
\end{aligned}
\tag{26}
$$

To make (9) valid, we only require the learning rate η to satisfy

$$
0<\eta<L, \tag{27}
$$

where $L=\dfrac{2}{(L_1+L_2)\theta}$. □
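As a practical remark of ours (not stated in the paper): since $\theta\le 1$, we have $L=\tfrac{2}{(L_1+L_2)\theta}\ge\tfrac{2}{L_1+L_2}$, so any fixed learning rate with $0<\eta<\tfrac{2}{L_1+L_2}$ is sufficient for (27) regardless of the unknown value of $\theta$.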

Proof of (10)

Equation (10) follows directly from (9) and the fact that $E(w^{n})\ge 0$ for $n=1,2,\ldots$, since a monotonically decreasing sequence bounded below converges.

Proof of (11)

Let $\beta=2\eta-(L_1+L_2)\theta\eta^{2}$. By (26), we have

$$
E(w^{n+1})\le E(w^{n})-\beta\big\|\nabla_{\bar w}E(w^{n})\big\|^{2}\le\cdots\le E(w^{0})-\beta\sum_{t=0}^{n}\big\|\nabla_{\bar w}E(w^{t})\big\|^{2}. \tag{28}
$$

Since $E(w^{n+1})\ge 0$, letting $n\to\infty$ we obtain

$$
\beta\sum_{t=0}^{\infty}\big\|\nabla_{\bar w}E(w^{t})\big\|^{2}\le E(w^{0})<\infty. \tag{29}
$$

This immediately gives

$$
\lim_{n\to\infty}\big\|\nabla_{\bar w}E(w^{n})\big\|=0. \tag{30}
$$

The following lemma, which will be used in the proof of (12), is a generalization of Theorem 14.1.5 of Ortega and Rheinboldt (1970) from the real domain to the complex domain. The proof follows the same route as in Ortega and Rheinboldt (1970) and is omitted here.

Lemma 3

(Ortega and Rheinboldt 1970) Let $\varphi:\Phi\subset\mathbb{C}^{k}\to\mathbb{C}$ $(k\ge 1)$ be continuous on a bounded closed region $\Phi$, and let $\Phi_0=\{z\in\Phi:\varphi(z)=0\}$. Suppose that the set $\Phi_0$ contains only finitely many points and that the sequence $\{z^{n}\}\subset\Phi$ satisfies:

  • (i) $\lim_{n\to\infty}\varphi(z^{n})=0$;

  • (ii) $\lim_{n\to\infty}\|z^{n+1}-z^{n}\|=0$.

Then there exists a unique $z^{*}\in\Phi_0$ such that $\lim_{n\to\infty}z^{n}=z^{*}$.

Proof of (12)

Obviously, $\nabla_{\bar w}E(w)$ is continuous under Assumption (A2). Using (8) and (11), we have

$$
\lim_{n\to\infty}\big\|w^{n+1}-w^{n}\big\|=\eta\lim_{n\to\infty}\big\|\nabla_{\bar w}E(w^{n})\big\|=0. \tag{31}
$$

Furthermore, Assumption (A3) is valid. Thus, applying Lemma 3, there exists a unique $w^{*}\in\Phi_1$ such that $\lim_{n\to\infty}w^{n}=w^{*}$. □

Simulation result

In this section we illustrate the convergence behavior of the FCBPA on the problem of one-step-ahead prediction of complex-valued nonlinear signals. The nonlinear benchmark input signal is given by (Mandic and Goh 2009)

$$
z(t)=\frac{z(t-1)}{1+z^{2}(t-1)}+n^{3}(t), \tag{32}
$$

where n(t) is a complex white Gaussian noise with zero mean and unit variance.

This example uses a network with one input node, five hidden nodes, and one output node. We set the activation function of both the hidden layer and the output layer to $\sin(\cdot)$, which is analytic in the complex domain. The learning rate is $\eta=0.1$. The test is carried out with initial weights (both real and imaginary parts) drawn at random from the interval $[-0.1, 0.1]$. The simulation results are presented in Fig. 1, which shows that the gradient tends to zero and that the squared error decreases monotonically as the number of iterations increases, finally tending to a constant. This supports our theoretical findings.
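The experiment can be reproduced along the following lines. This is a sketch of ours that reuses the forward and fcbpa_step helpers outlined earlier; the sample size, number of epochs, and random seed are not specified in the paper and are chosen arbitrarily here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Benchmark signal (32): z(t) = z(t-1)/(1 + z^2(t-1)) + n^3(t),
# with n(t) complex white Gaussian noise of zero mean and unit variance.
T = 500                                            # sample count (our choice)
n = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
z = np.zeros(T, dtype=complex)
for t in range(1, T):
    z[t] = z[t - 1] / (1 + z[t - 1] ** 2) + n[t] ** 3

zs = z[:-1].reshape(-1, 1)                         # inputs z(t-1), p = 1
ds = z[1:]                                         # targets z(t)

q, p, eta = 5, 1, 0.1                              # 1-5-1 network, eta = 0.1
w0 = rng.uniform(-0.1, 0.1, q) + 1j * rng.uniform(-0.1, 0.1, q)
V = rng.uniform(-0.1, 0.1, (q, p)) + 1j * rng.uniform(-0.1, 0.1, (q, p))

for epoch in range(200):                           # number of epochs (our choice)
    w0, V = fcbpa_step(w0, V, zs, ds, eta)         # batch update (8)
    sq_err = sum(abs(forward(w0, V, zk) - dk) ** 2 for zk, dk in zip(zs, ds))
    # sq_err should decrease monotonically, in line with conclusion (9) of Theorem 1
```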

Fig. 1 Convergence behavior of FCBPA

Conclusion

In this paper, under the framework of Wirtinger calculus, we investigate the FCBPA for fully CVNNs. Using a mean value theorem for holomorphic functions, we prove under mild conditions that the gradient of the error function with respect to the network weight vector satisfies a Lipschitz condition. Based on this result, both the weak convergence and the strong convergence of the algorithm are proved. Simulation results substantiate the theoretical findings.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (61101228, 10871220), the China Postdoctoral Science Foundation (No. 2012M520623), the Research Fund for the Doctoral Program of Higher Education of China (No. 20122304120028), and the Fundamental Research Funds for the Central Universities.

References

  1. Adali T, Li H, Novey M, et al. Complex ICA using nonlinear functions. IEEE Trans Signal Process. 2008;56(9):4536–4544.
  2. Bos AVD. Complex gradient and Hessian. Proc Inst Elec Eng Vision Image Signal Process. 1994;141:380–382. doi: 10.1049/ip-vis:19941555.
  3. Brandwood D. Complex gradient operator and its application in adaptive array theory. Proc Inst Electr Eng. 1983;130:11–16.
  4. Fink O, Zio E, Weidmann U. Predicting component reliability and level of degradation with complex-valued neural networks. Reliab Eng Syst Safe. 2014;121:198–206. doi: 10.1016/j.ress.2013.08.004.
  5. Hirose A. Complex-valued neural networks. Berlin Heidelberg: Springer-Verlag; 2012.
  6. Kim T, Adali T. Approximation by fully complex multilayer perceptrons. Neural Comput. 2003;15:1641–1666. doi: 10.1162/089976603321891846.
  7. Li H, Adali T. Complex-valued adaptive signal processing using nonlinear functions. EURASIP J Adv Signal Process. 2008;2008:122.
  8. Mandic DP, Goh SL. Complex valued nonlinear adaptive filters. Chichester: Wiley; 2009.
  9. Mcleod RM. Mean value theorems for vector valued functions. Proc Edinburgh Math Soc. 1965;14(2):197–209. doi: 10.1017/S0013091500008786.
  10. Nitta T. An extension of the back-propagation algorithm to complex numbers. Neural Netw. 1997;10(8):1391–1415. doi: 10.1016/S0893-6080(97)00036-1.
  11. Nitta T. Local minima in hierarchical structures of complex-valued neural networks. Neural Netw. 2013;43:1–7. doi: 10.1016/j.neunet.2013.02.002.
  12. Osborn GW. A Kalman filtering approach to the representation of kinematic quantities by the hippocampal-entorhinal complex. Cogn Neurodyn. 2010;4:315–335. doi: 10.1007/s11571-010-9115-z.
  13. Shao HM, Zheng GF. Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing. 2011;74:765–770. doi: 10.1016/j.neucom.2010.10.005.
  14. Wei H, Ren Y, Wang ZY. A computational neural model of orientation detection based on multiple guesses: comparison of geometrical and algebraic models. Cogn Neurodyn. 2013;7:361–379. doi: 10.1007/s11571-012-9235-8.
  15. Wu W, Feng GR, Li ZX, et al. Deterministic convergence of an online gradient method for BP neural networks. IEEE Trans Neural Netw. 2005;16:533–540. doi: 10.1109/TNN.2005.844903.
  16. Wang J, Wu W, Zurada J. Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing. 2011;74:2368–2376. doi: 10.1016/j.neucom.2011.03.016.
  17. Xu DP, Zhang HS, Liu L. Convergence analysis of three classes of split-complex gradient algorithms for complex-valued recurrent neural networks. Neural Comput. 2010;22(10):2655–2677. doi: 10.1162/NECO_a_00021.
  18. Zhang C, Wu W, Xiong Y. Convergence analysis of batch gradient algorithm for three classes of sigma-pi neural networks. Neural Process Lett. 2007;26:177–180. doi: 10.1007/s11063-007-9050-0.
  19. Zhang C, Wu W, Chen XH, et al. Convergence of BP algorithm for product unit neural networks with exponential weights. Neurocomputing. 2008;72:513–520. doi: 10.1016/j.neucom.2007.12.004.
  20. Zhang HS, Wu W, Liu F, Yao MC. Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw. 2009;20(6):1050–1054. doi: 10.1109/TNN.2009.2020848.
  21. Zhang HS, Xu DP, Zhang Y. Boundedness and convergence of split-complex back-propagation algorithm with momentum and penalty. Neural Process Lett. 2013. doi: 10.1007/s11063-013-9305-x.
  22. Ortega JM, Rheinboldt WC. Iterative solution of nonlinear equations in several variables. New York: Academic Press; 1970.
