Universal Approximation with Quadratic Deep Networks

Fenglei Fan; Jinjun Xiong; Ge Wang

doi:10.1016/j.neunet.2020.01.007

. Author manuscript; available in PMC: 2021 Apr 1.

Published in final edited form as: Neural Netw. 2020 Jan 18;124:383–392. doi: 10.1016/j.neunet.2020.01.007

Universal Approximation with Quadratic Deep Networks

Fenglei Fan ^a, Jinjun Xiong ^b, Ge Wang ^a,^*

PMCID: PMC7076904 NIHMSID: NIHMS1569260 PMID: 32062373

Abstract

Recently, deep learning has achieved huge successes in many important applications. In our previous studies, we proposed quadratic/second-order neurons and deep quadratic neural networks. In a quadratic neuron, the inner product of a vector of data and the corresponding weights in a conventional neuron is replaced with a quadratic function. The resultant quadratic neuron enjoys an enhanced expressive capability over the conventional neuron. However, how quadratic neurons improve the expressing capability of a deep quadratic network has not been studied up to now, preferably in relation to that of a conventional neural network. Regarding this, we ask four basic questions in this paper: (1) for the one-hidden-layer network structure, is there any function that a quadratic network can approximate much more efficiently than a conventional network? (2) for the same multi-layer network structure, is there any function that can be expressed by a quadratic network but cannot be expressed with conventional neurons in the same structure? (3) Does a quadratic network give a new insight into universal approximation? (4) To approximate the same class of functions with the same error bound, is a quantized quadratic network able to enjoy a lower number of weights than a quantized conventional network? Our main contributions are the four interconnected theorems shedding light upon these four questions and demonstrating the merits of a quadratic network in terms of expressive efficiency, unique capability, compact architecture and computational capacity respectively.

Keywords: deep learning, quadratic networks, approximation theory

1. Introduction

Over recent years, deep learning has become the mainstream approach for machine learning. Since AlexNet [1], increasingly more advanced neural networks [2–6] are being proposed, such as GoogleNet, ResNet, DenseNet, GAN and their variants, to enable practical performance comparable to or beyond what human delivers in computer vision [7], speech recognition [8], language processing [9] game playing [10], medical imaging [11–13], and so on. A heuristic understanding of why these deep learning models are so successful is that these models represent knowledge in hierarchy and facilitate high-dimensional non-linear functional fitting. It seems that deeper structures are correlated with greater capacities to approximate more complicated functions.

The representation power of neural networks has been rigorously studied since the eighties. The first result is that a network with a single hidden layer can approximate a continuous function at any accuracy given an infinite number of neurons [14]. This means that the network can be extremely wide. Then, a number of papers were published investigating the complexity of approximation. For example, Barron et al. [49] studied one-hidden-layer feedforward networks using sigmoidal activations, which can approximate a function family, wherein the first moment of the magnitude distribution of the Fourier transform of the function is bounded by a constant, with an accuracy in the order of $\frac{1}{n}$ , where n is the number of nodes and Kainen et al. [50] demonstrated that for worst-case error, a generic function with dimensionality d in the Hilbert space can be approximated by a one-hidden-layer neural network with an error of ϵ(d)κ(n), where ϵ(d) is bounded above by a polynomial in d and κ(n) is n^−1/m, m > 0. Kon et al. shows that given that the number of nodes n is no smaller than the number of examples k, the optimal network to approximate a function in reproducing kernel Hilbert space is given by $\sum_{i}^{k} c_{i} G (t_{i}, x)$ , where t_i is the sample input and G is the RBF neuron [51]. Please note that the optimality means the minimal error achieved in the worst case over all the possible f. Schmitt borrowed the results based on Vapnik-Chervonenkis dimension to prove that to approximate any polynomials, the network size need to grow $O ({(logk)}^{1 / 4})$ , where k is the degree of the polynomial [52]. With the emergence of deep neural networks, studies have been performed on theoretical benefits of deep models over shallow ones [15–24]. One way [15, 19] is to construct a special kind of functions that are easy to be approximated by deep networks but hard by shallow networks. It has been reported in [16] that a fully-connected network with ReLU activation can approximate any Lebesgue integrable function in the L₁-norm sense, provided a sufficient depth and at most d+4 neurons in each layer, where d is the number of inputs. Through a similar analysis, it was reported in [22] that ResNet with one single neuron per layer is a universal approximator. Moreover, it was demonstrated in [15] that a special class of functions is hard to be approximated by a conventional network with a single hidden layer unless an exponential number of neurons are used. Bianchini et al. [23] utilized a topological measure to characterize the complexity of functions that are realized by neural networks, and proved that deep networks can represent functions of much higher complexity than the shallow counterparts. Kurkova et al. [24] showed that unless with sufficiently many network units (more than any polynomial of the logarithm of the size of the domain), a good approximation cannot be achieved with a shallow perceptron network for almost any uniformly chosen function on a specially constructed domain.

In our previous studies [25–28], we proposed quadratic neurons and deep quadratic networks. In a quadratic neuron, the inner product of an input vector and the corresponding weights in a conventional neuron is replaced with a quadratic function. The resultant quadratic neuron enjoys an enhanced expressive capability over the conventional neuron (Here conventional neurons refer to neurons that perform activation of an inner-product, and conventional networks refer to networks completely comprising conventional neurons). Actually, a quadratic neuron can be viewed as a fuzzy logic gate, and a deep quadratic network is nothing but a deep fuzzy logic system [27]. Furthermore, we successfully designed a quadratic autoencoder for a real-world low-dose CT problem [28]. Note that high-order neurons [29–30] were taken into account in the early stage of artificial intelligence, but they are not connected to deep networks and suffer from a combinatorial explosion with the number of parameters due to high order terms. In contrast, our quadratic neuron uses a limited number of parameters (tripled that of a conventional neuron) and performs a cost-effective high-order operation in the context of deep learning. For quadratic deep networks, we already developed a general backpropagation algorithm [26], enabling the network training process.

However, how quadratic neurons improve the expressing capability of a deep quadratic network has not been theoretically studied up to now, preferably in relation to that of a conventional neural network. In this paper, we ask four basic questions regarding the expressive capability of a quadratic network: (1) for the one-hidden-layer network structure, is there any function that a quadratic network can approximate much more efficiently than a conventional network? (2) for the same multi-layer network structure, is there any function that can be expressed by a quadratic network but cannot be expressed with conventional neurons in the same structure? (3) Does a quadratic network give a new insight and a new method for universal approximation? (4) To approximate the same class of functions with the same error bound, is a quadratic network able to enjoy a lower number of weights than a conventional network? If the answers to these questions are favorable, quadratic networks should be significantly more powerful in many machine learning tasks.

In this paper, our contributions are to present four theorems addressing the above questions respectively and positively, thereby establishing the intrinsic advantages of quadratic networks over conventional networks. More specifically, these theorems characterize the merits of a quadratic network in terms of expressive efficiency, unique representation, compact architecture and computational capacity. We answer the first question with the first theorem, given the network with only one hidden layer and admissible activation function, there exists a function that a quadratic network can approximate it with a polynomial number of neurons but a conventional network can only do the same level approximation with an exponential number of neurons. Regarding the second question and the second theorem below, with the ReLU activation function, any continuous radial function can be approximated by a quadratic network in a structure of no more than four neurons in each layer but the function cannot be approximated by a conventional network of the same structure [16], while with the third theorem below we provide a new insight and a new method for a universal presentation from the perspective of the Algebraic Fundamental Theorem. Without introducing complex numbers, a univariate polynomial of degree n can be uniquely factorized into a product of quadratic terms instead of first-order terms. Since a ReLU quadratic network can represent any univariate polynomial in a unique and global manner, and by the Weierstrass theorem and the Kolmogorov theorem that multivariate functions can be represented through summation and composition of univariate functions, we can approximate any multivariate function with a well-structured ReLU quadratic neural network, justifying the universal approximation power of the quadratic network. To our best knowledge, our quadratic network is the first-of-the-kind global universal approximator. Finally, with the fourth theorem below, we attempt to address the theoretical foundation for quantization of a neural network. To approximate the same class of functions with the same error bound, a quantized quadratic network demands a much lower number of weights than a quantized conventional network. The saving is almost in the order of $O (λ (l o g^{\frac{1}{λ - 1} + 1} (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ , λ ≥ 2, d, n ≥ 1. Compared to the previous studies that theoretically explored the properties of quantized networks, our result adds unique insights into this aspect.

There are prior papers related to but different from our contributions [40–45]. Motivated by a need for more powerful activation, Livni et al. [40] and Krotov et al. [41] proposed to use the quadratic activation: σ(z) = z² and rectified polynomials in the neuron respectively. Despite somewhat misleading in their name, networks with quadratic activation or rectified polynomials and our proposed networks that consist of quadratic neurons have fundamental differences. At the cellular level, a neuron with quadratic activation is still characterized with a linear decision boundary, while our quadratic neuron allows a quadratic decision boundary. In [40], the authors demonstrated that networks with quadratic activation are as expressive as networks with threshold activation, and constant-depth networks with quadratic activation can learn in polynomial time. In contrast, our work goes further showing that the expressibility of the quadratic network is superior to that of the conventional networks; for example, a single quadratic neuron can implement the XOR gate, and a quadratic network of finite width and depth can represent a finite-order polynomial up to any desirable accuracy. In [42], Du et al. showed that over-parameterization and weight decay are instrumental to the optimization aided by quadratic activation. [43] reported how a neural network can provably learn a low-degree polynomial with gradient descent search from scratch, with an emphasis on the effectiveness of the gradient descent method. [44] presents that $O (l o g \frac{p}{ϵ})$ layers of binary units and $O (plog \frac{p}{ϵ})$ ReLU units can approximate $f (x) = \sum_{i = 0}^{p} a_{i} x^{i}$ with closeness of ϵ. In contrast, our Theorem 3 is based on the Algebraic Fundamental Theorem to provide an exact representation of any finite-order polynomial. [45] is on factorization machine (FM) dedicated to combine high order features, clearly different from the polynomial factorization we propose to perform using a quadratic network.

2. Preliminaries

Quadratic/Second-order Neuron:

The n-input function of a quadratic/second-order neuron before being nonlinearly processed is expressed as:

h (x) = (\sum_{i = 1}^{n} w_{i r} x_{i} + b_{r}) (\sum_{i = 1}^{n} w_{i g} x_{i} + b_{g}) + \sum_{i = 1}^{n} w_{i b} x_{i}^{2} + c = (w_{r} x^{T} + b_{r}) (w_{g} x^{T} + b_{g}) + w_{b} {(x^{2})}^{T} + c,

(1)

where x denotes the input vector, w_r,w_g,w_b are vectors of the same dimension with x and b_r, bg, c are adjustable biases. Our quadratic function definition only utilizes 3n parameters, which is more compact than the general second-order representation requiring $\frac{n (n + 1)}{2}$ parameters. While our quadratic neuron design is unique, other papers on quadratic neurons are also in the later literature; for example, [31] proposed a type of neurons with paraboloid decision boundaries. It is underlined that the emphasis of our work is not only on quadratic neurons individually but also deep quadratic networks in general.

One-hidden-layer Networks:

The generic function represented by a one-hidden-layer conventional network is as follows:

f_{1} (x) = \sum_{l = 1}^{k} t^{l} σ_{l} (w^{l} x + b^{l}),

(2)

where l refers to the l^th layer. In contrast, the generic function represented by a one-hidden-layer quadratic network is:

f_{2} (x) = \sum_{l = 1}^{k} t^{l} σ_{l} [(w_{r}^{l} x^{T} + b_{r}^{l}) (w_{g}^{l} x^{T} + b_{g}^{l}) + w_{b}^{l} {(x^{2})}^{T} + c^{l}] .

(3)

In our Theorem 1 below, we will compare the representation capability of a quadratic network and that of a conventional network assuming that both networks have the same one-hidden-layer structure.

L-Lipschitz Function:

A L-Lipschitz function f from $ℝ^{n}$ to $ℝ$ is defined by the following property:

| f (x) - f (y) | \leq L ‖ x - y ‖

Radial Function:

A radial function only depends on the norm of its input vector, generically denoted as f(∥x∥). The functions mentioned in Theorems 1 and 2 are all radial functions. By its nature, the quadratic neuron is well suited for modeling of a radial function.

Euclidean Unit-volume Ball:

In a d-dimensional space, let R_d be the radius of a Euclidean ball B_d such that R_dB_d has the unit volume. Euclidean ball is used to define density function μ later. Here, the main reason why the μ is defined on the Euclidean ball is because the function f being constructed is a radial function, and the Fourier transform of f is also radial. Thus, it is appropriate to investigate f in the Euclidean ball, which simplifies the analysis and helps focusing on the essence of the problem. The extension of considering other domains (such as cube) will be considered for future work.

Bernstein Polynomial:

$l_{n, m} = C_{n}^{m} x^{m} {(1 - x)}^{n - m}$ , 0 ≤ m ≤ n. The n-th Bernstein polynomial of f(x) in (0, 1) is defined as

B_{n} (x) = \sum_{m = 0}^{n} f (\frac{m}{n}) l_{n, m} (x)

Previously, the Bernstein Polynomials are used to prove the Weierstrass theorem.

Admissible function:

An activation function $σ : ℝ \to ℝ$ is called admissible if:

σ is measurable
there are K, κ > 0, |σ(x)| ≤ K(1 + |x|^κ) for all $x \in ℝ$
given any L-Lipschitz function $f : ℝ \to ℝ$ which is constant outside a bounded interval [−R, R] and any δ > 0, there are scalars $a \in ℝ$ and (α_i, β_i, γ_i)_i=1,2,…,w such that

| f (x) - [a + \sum_{i = 1}^{w} α_{i} σ (β_{i} x - γ_{i})] | \leq δ,

where $w \leq c_{σ} \frac{R L}{δ}$ .

The standard activation functions such as ReLU and sigmoid satisfy the above three properties. Detailed proofs are out of the scope of this paper, please refer to [15,45].

Function space $F_{n}^{d}$ :

The Sobolev space $W^{n, \infty} ({[0, 1]}^{d})$ with n= 1, 2, … is defined on [0, 1]^d, lying in L^∞ together with their weak derivatives of up to order n. The function space $F_{n}^{d}$ is made of any function $f \in W^{n, \infty} ({[0, 1]}^{d})$ and

max_{_{n : | n | \leq n}} e s s {sup}_{x \in {[0, 1]}^{d}} | D^{n} f (x) | \leq 1,

where $| n | = \sum_{k = 1}^{d} n_{k}$ .

3. Four Theorems

First, we present four theorems, and then give their proofs respectively.

Theorem 1: For an admissible activation function σ(·) and for some universal constants c > 0, C > 0, C′ > 0, c_σ > 0, there exist a probability measure μ and a radial function $\tilde{g} : ℝ^{d} \to ℝ$ , where d > C, that is bounded on [−2,2] and supported on $| | x | | \leq C^{'} \sqrt{d}$ satisfying:

$\tilde{g}$ can be approximated by a single-hidden-layer quadratic network with C′ c_σd^3.75 neurons, which is denoted as f₂.
For every function f₁expressed by a single-hidden-layer conventional network with at most ce^cd neurons, we have:

E_{x ~ μ} {(f_{1} (x) - f_{2} (x))}^{2} \geq δ

for some positive constant δ.

Theorem 2: For any compactly supported, radial, continuous function $f : ℝ^{n} \to ℝ$ and any ϵ > 0, there exists a function that can be implemented by a ReLU-activated quadratic network with at most four neurons in each layer, such that:

sup_{x \in ℝ^{d}} | f (x) - g (x) | d x \leq ϵ .

Theorem 3: With ReLU as activation function, the quadratic network is a global universal approximator.

Theorem 4: For any $f \in F_{d}^{n}$ and any ϵ ∈ (0, 1), there is a ReLU quadratic network with λ distinct weights that can approximate f with ϵ, satisfying (i) the depth is $O (l o g (l o g (\frac{1}{ϵ})) + l o g (\frac{1}{ϵ}))$ (ii) the number of weights is $O (l o g (l o g (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}} + l o g (\frac{1}{ϵ}) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ .

3.1. Theorem 1

Key Idea for Proving Theorem 1:

The form of functions represented by a single-hidden-layer conventional network is $f_{1} (x) = \sum_{l = 1}^{k} t^{l} σ_{l} (w^{l} x + b^{l})$ . It is observed that the distribution of the Fourier transform of f₁(x) is supported on a finite collections of lines. The support covered by the finite lines are sparse in the Fourier space, especially for a high dimension and high frequency regions, unless an exponential number of lines are involved. Thus, a possible target function to be constructed should have major components at high-frequencies. A suitable candidate has been constructed in [15]:

\tilde{g} (x) = \sum_{i = 1}^{N} ϵ_{i} g_{i} (‖ x ‖),

where ϵ_i ∈ {−1, 1}, N is a polynomial function of d, g_i(∥x∥) = 1 {∥x|| ∈ Δ_i,} are radial indicator functions over disconnected intervals. According to the definition, $\tilde{g}$ is well bounded. Although the constructed $\tilde{g}$ is hard to approximate by a conventional network, it is easy to approximate by a quadratic network, because $\tilde{g}$ is a radial function. Consequently, it is feasible for a single-hidden-layer quadratic network to approximate the radial function with a polynomial number of neurons. Note that $\tilde{g} (x)$ is discontinuous, hence it cannot be perfectly expressed by a neural network with continuous activation functions. Therefore, let us use a probability measure μ for network quality assessment. With μ, the distance between $\tilde{g}$ and f represented by a network is characterized as $E_{x ~ μ} {(f (x) - \tilde{g} (x))}^{2}$ . In particular, μ = ϕ², where ϕ is the inverse Fourier transform of the indicator function defined on the Euclidean ball: 1{x ∈ B}. Therefore, $E_{x ~ μ} {(f (x) - \tilde{g} (x))}^{2} = ‖ \hat{f ϕ} - \hat{\tilde{g} ϕ} ‖_{L_{2}}$ , where $\hat{f ϕ}$ , $\hat{\tilde{g} ϕ}$ are the frequency forms of fϕ and $\tilde{g} ϕ$ . The physical meaning of $E_{x ~ μ} {(f (x) - \tilde{g} (x))}^{2}$ is to measure the closeness of $\tilde{g}$ and f in frequency domain within Euclidean ball B.

Please note that c and c_σ are different constants. In Lemma 1, as long as the number of neurons is no more than ce^cd, the distance between f₁ and $\tilde{g}$ in the sense of L₂(μ) will be greater than a constant. In Lemma 2, c_σ is a fixed constant after the activation function σ is pre-specified.

The contribution of our work:

In our Theorem 1, we demonstrate the utility of a quadratic network in approximating the constructed radial function $\tilde{g}$ , which is straight-forward but interesting and non-trivial. Furthermore, our result holds for all the admissible activation functions, while the proof in [15] is only for the ReLU activation function. Proposition 1 in [15] demonstrates that $\tilde{g}$ cannot be well approximated by a single-hidden-layer conventional network with a polynomial number of neurons. We put this proposition here as Lemma 1 for completeness. We would like to emphasize that although we put Lemma 1 and other Lemmas on a equal footing, Lemma 1 is more important than other lemmas. Actually, our Theorem 1 is merely a corollary of Proposition 1 in [15] when ReLU activation is used. Again, our contribution is the proof that the one-hidden-layer quadratic network using an admissible activation function can approximate $\tilde{g}$ .

Lemma 1:

There are universal constants c,C > 0 such that for every d > C, there is a probability measure μ on $ℝ^{d}$ such that for any α > C and N ≥ Cα^1.5d², there exists a function $\tilde{g} (x) = \sum_{i = 1}^{N} ϵ_{i} g_{i} (x)$ , where ϵ_i ∈ {−1, 1}, i = 1, 2, …, N, such that for any function of the form: $f_{1} (x) = \sum_{l = 1}^{k} t^{l} σ_{l} (w^{l} x + b^{l})$ with σ_l as admissible function and k ≤ ce^cd, we have:

{‖ f_{1} - \tilde{g} ‖}_{L_{2} (μ)} \geq δ / α,

where δ > 0 is a universal constant.

Universal constants means they are independent from the dimension d. To illustrate $\tilde{g}$ is expressible with a quadratic network, we know from Lemma 12 of [15] that a continuous Lipschitz function g can approximate $\tilde{g}$ , what is remained for us to do is to use a quadratic network with a polynomial number of neurons to approximate g.

Lemma 2:

Given an admissible activation function σ, there is a constant c_σ ≥ 1 (depending on σ and other parameters) such that for any L-Lipschitz function $g : ℝ \to ℝ$ , which is constant outside a bounded interval [r, R] (r ≥ 1) and any δ, there exist scalars a, {α_i, β_i, γ_i }, i = 1,…, w, $w \leq c_{σ} \frac{(R^{2} - r^{2}) L}{4 \sqrt{r} δ}$ with which we have:

h (‖ x ‖^{2}) = a + \sum_{i = 1}^{w} α_{i} σ (β_{i} ‖ x ‖^{2} - γ_{i})

satisfies:

sup_{‖ x ‖ \in [r, R]} | g (‖ x ‖) - h (‖ x ‖^{2}) | < δ .

Proof: $g : ℝ \to ℝ$ is an L-Lipschitz function and supported on [r, R], r > 0. Let us define $s (x) = g (\sqrt{x})$ , which is of $\frac{L}{2 \sqrt{r}}$ -Lipschitz and supported on [r², R²]. By the property of an admissible function, we construct a function $h (t) = a + \sum_{i = 1}^{w} α_{i} σ (β_{i} t - γ_{i})$ , which satisfies the following condition:

sup_{l \in [r^{2}, R^{2}]} | s (l) - h (l) | < δ,

Where $w = c_{σ} \frac{(R^{2} - r^{2}) L}{4 \sqrt{r δ}}$ . Next, we have:

sup_{‖ x ‖ \in [r, R]} | g (‖ x ‖) - h (‖ x ‖^{2}) | = sup_{‖ x ‖ \in [r, R]} | s (‖ x ‖^{2}) - h (‖ x ‖^{2}) | = sup_{l \in [r^{2}, R^{2}]} | s (l) - h (l) | < δ

Lemma 3:

There are a universal constant C > 0 and δ ∈ (0, 1), for d ≥ C and any choice of ϵ_i ∈ {−1, 1}, i = 1, 2, …, N, there exists a function f₂ expressed by a single-hidden-layer quadratic network of a width of at most $c_{σ} \frac{3 N α^{1.5} d^{1.75}}{4 δ}$ and with the range [−2,+2] such that

{‖ f_{2} - \tilde{g} ‖}_{L_{2} (μ)} \leq \frac{\sqrt{3}}{α d^{\frac{1}{4}}} + δ .

Proof: In Lemma 2, we make the following substitutions: $R = 2 α \sqrt{d}$ , $r = α \sqrt{d}$ , L = N, h = g. Notably, α can be greater than 1, and $r = α \sqrt{d} > 1$ satisfying the condition of Lemma 2. Thus, g(∥x∥) is expressible by a single-hidden-layer quadratic network with at most $c_{σ} \frac{3 N α^{1.5} d^{1.75}}{4 δ}$ neurons. Coupled with Lemma 12 of [15], Lemma 3 is immediately obtained.

Proof of Theorem 1:

By the combination of Lemmas 1 and 3, the proof for Theorem 1 is straightforward. In Lemma 1, by choosing α = C, N = Cα^1.5d², we have

{‖ f_{1} - \tilde{g} ‖}_{L_{2} (μ)} \geq δ_{1}

Let $δ \leq \frac{δ_{1}}{2} - \frac{\sqrt{3}}{C d^{\frac{1}{4}}}$ which demands d is very large so that δ > 0, to approximate $\tilde{g}$ we need the number of quadratic neurons being at most

c_{σ} \frac{3 N α^{1.5} d^{1.75}}{4 δ} = c_{σ} \frac{3 C^{2.5} d^{3.75}}{4 δ} \leq C^{'} c_{σ} d^{3.75}

(where C′ is a universal constant depending on the universal constants C,δ₁) such that

{‖ f_{2} - \tilde{g} ‖}_{L_{2} (μ)} \leq \frac{\sqrt{3}}{C d^{\frac{1}{4}}} + δ \leq \frac{δ_{1}}{2} .

Therefore, we have ${‖ f_{1} - f_{2} ‖}_{L_{2} (μ)} \geq \frac{δ_{1}}{2}$ The proof is completed.

Classification Example:

The above exponential difference between the conventional and quadratic networks is due to the dimension d. To demonstrate this, we constructed an example for separation of two concentric rings. In this example, there are 60 instances in each of the two rings representing two classes. With only one quadratic neuron in a single hidden layer, the rings were totally separated, while at least six conventional neurons are required to complete the same task, as shown in Figure 1.

Figure 1: — Classification of two concentric rings with conventional and quadratic networks (d=2). To succeed in the classification, a conventional network requires at least 6 neurons (Left), and a quadratic network only takes one neuron.

3.2. Theorem 2

Key Idea for Proving Theorem 2:

It was proved in [16] that an n-input function that is not constant along any direction cannot be well approximated by a conventional network with no more than n – 1 neurons in each layer. However, for such a radially defined function, it is feasible for a quadratic network to approximate the function with width=4, which breaks the lower width bound n+4 given in [16]. The trick for the approximation by a deep conventional network is adapted for analysis of a deep quadratic network so that the function can be approximated in a “quadratic” way via composition layer by layer. To compute a radial function, we need to find the norm and then evaluate the function approximation by the norm. With a quadratic neuron, the norm can be easily found. Heuristically speaking, a quadratic network with no more than n – 1 neurons in each layer could approximate a radial function very well, even if the function is not constant along any direction. Therefore our theorem positively shows that it is more natural and more effective to represent functional non-linearity with a quadratic network. Finally, since a quadratic network with width at most 4 can approximate any radial function and radial function family can approximate any continuous function [47], a quadratic network can serve as a width-bounded universal approximator.

Proof of Theorem 2:

As per structure outlined in Figure 2, a conventional ReLU network with four neurons in each layer can implement any function of the form:

[0, + \infty] \to ℝ : x \to θ + \sum_{l = 1}^{k} γ_{l} ReLU (α_{l} x + β_{l})

, which is actually any continuous piecewise linear function defined on non-negative real number domain. Because function $f : ℝ^{d} \to ℝ$ is radial and continuous, without loss of generality, Let h(∥x∥²) = f(x), while h is function: $[0, + \infty] \to ℝ$ . It implies that for any ϵ > 0, there will be a continuous piecewise linear function $g : [0, + \infty] \to ℝ$ , and such that

sup_{[0, + \infty]} | h (s) - g (s) | < ϵ .

g can be implemented by a conventional ReLU network as shown in Figure 2. Therefore, the function $x \to g (‖ x ‖_{2}^{2}) : ℝ^{d} \to ℝ$ can be implemented with a quadratic ReLU network of the same structure by calculating norm of input. Then we will obtain:

sup_{x \to ℝ^{d}} | f (x) - g (‖ x ‖_{2}^{2}) | = sup_{x \to ℝ^{d}} | h (‖ x ‖_{2}^{2}) - g (‖ x ‖_{2}^{2}) | = sup_{s \in [0, + \infty]} | h (s) - g (s) | < ϵ

Lemma 4:

[47] Let $K : ℝ^{r} \to ℝ$ be an integrable bounded function such that K is continuous almost everywhere and $\int_{ℝ^{r}} K (x) d x \neq 0$ . Then the functional family $S_{K} : \sum_{i = 1}^{M} w_{i} K (\frac{‖ x - z_{i} ‖}{σ})$ ,(M is a positive integer, σ > 0 and $w_{i} \in ℝ$ ) is dense in $L^{p} (ℝ^{r})$ .

Proof: Please see [Theorem 1, 47].

Corollary 2:

With ReLU activation function, residual quadratic networks with at most four neurons in one layer is universal approximator.

Proof: As Figure 3 shown, because lemma 4 makes sure that the quadratic ReLU network can approximate function $K (\frac{‖ x - z_{i} ‖}{σ}), i = 1, 2 \dots$ . We use residual connections (red lines) to make the network modules that represent different $K (\frac{‖ x - z_{i} ‖}{σ}), i = 1, 2 \dots$ and we use another groups of residual connections (green lines) to aggregate the outputs of these modules, such that quadratic ReLU network can represent any function from the family S_K at any accuracy. By triangle relationship, we conclude that residual quadratic ReLU network with no more than four neurons in each layer is universal approximator.

Compared with the results in [16] and [22], our Corollary 2 presents another special width-bounded universal approximator, which is slimmer than that of [16] and has sparser shortcuts than that of [22], The upper limit of the width required by our width-bounded universal approximator is 4, while the counterpart in [16] is n + 4 given an n–input. Although only one neuron is demanded in each layer in [22], much denser residual shortcuts are employed as the trade-off. In our work, only sparser residual connections are needed to integrate different blocks.

3.3. Theorem 3

Key Idea for Proving Theorem 3:

For universal approximation by neural networks, the current mainstream strategies are all based on piecewise approximation in terms of L₁ L_∞, or other distance measures. For the purpose of piecewise approximation, the functional space is divided into numerous hypercubes, which are intervals in the one-dimensional case, according to a specified accuracy to approximate a target function in every hypercube. With quadratic neurons and the trick that two ReLU neurons can organically execute a linear operation, we can instead use a global approximating method that is much more efficient. At the same time, the quadratic network structure is neither too wide nor too deep, which can be regarded as a size-bounded universal approximator, in contrast to what we have introduced before. What’s more, aided by Algebraic Fundamental theorem, our novel proof reveals the uniqueness and facility of our proposed quadratic networks that cannot be elegantly made by networks with quadratic activation. More importantly, Theorem 3 gives a first-of-the-kind global universal approximator, facilitating feature representations in deep learning.

First, we show any univariate polynomial of degree N can be exactly expressed by a quadratic network with a complexity of $O (l o g_{2} (N))$ in depth. Next we refer the result [34] regarding Hilbert’s thirteen problem that multivariate functions can be represented with a group of separable functions, and then finalize the proof.

Lemma 5:

With ReLU as activation function, any univariate polynomial of degree N can be perfectly computed by a quadratic network with depth of $O (l o g_{2} (N))$ and width of no more than N.

Proof: According to the Algebraic Fundamental Theorem [35], a general univariate polynomial of degree N can be expressed as $P_{N} (x) = C \prod_{i}^{l}^{1} (x - x_{i}) \prod_{j}^{l}^{2} (x^{2} + a_{j} x + b_{j})$ , where l₁ + 2l₂ = N. The ReLU activation has an important property: f(x) = ReLU(f(x)) − ReLU(−f(x)), which is often resorted in the literature, such as [48]. The network we construct is shown in Fig. 4 (Each neuron actually represents two parallel neurons to implement a linear operation). Every two neurons are grouped in the first layer to compute (x–x_i), (x–x_i)(x–x_i+1) or x²+a_ix+b_i, then the second layer merely uses half the number of neurons in the first layer to combine the outputs of the first layer. By repeating such a process, the quadratic network with the depth of O(log₂(N)) can exactly represent P_N(x).

Figure 4: — Quadratic network approximating a univariate polynomial according to the Algebraic Fundamental Theorem.

The following lemma shows that any univariate function f(x) continuous in [0, 1] can be approximated with a Bernstein polynomial up to any accuracy, which is the Bernstein version of the proof for the Weierstrass theorem.

Lemma 6:

Let f(x) is a continuous function over [0, 1], we have

lim_{n \to + \infty} B_{n} (x) = f (x)

Proof: It is well known; please refer to [33].

Corollary 2:

Any continuous univariate function supported on [0, 1] can be approximated by a quadratic network with ReLUs up to any accuracy.

Lemma 7:

Every continuous n-variable function on [0, 1]ⁿ can be represented in the form:

f (x_{1}, \dots, x_{n}) = \sum_{i = 1}^{2 n + 1} g_{i} (\sum_{j = 1}^{n} h_{i j} (x_{j}))

Proof: This is a classical theorem made by Kolmogorov and his student Arnold. Please refer to [34].

Proof of Theorem 3:

Combining Lemmas 5–7, it can be shown that the ReLU-embedded quadratic network is a universal approximator, and importantly it realizes a universal approximation in a global manner.

Let us look at the following numerical example to illustrate our novel quadratic network based factorization method. First, 100 instances were sampled of a function, g(x) = (x² + l)(x–l)(x² + 1.7x+1.2)in [−1,0]. Instead of taking opposite weights and biases to create linearity as used in proof for clarity, here we have incorporated shortcuts to make our factorization method trainable in terms of adaptive offsets. Using the ReLU activation function, we trained a four-layer network 4-3-1-1 with shortcuts to factorize this polynomial, as shown in Fig. 5. The parameters in connections marked by green symbols are fixed to perform multiplication. The shortcut connections and vanilla connection are denoted by green and red lines respectively. The neurons in the first layer will learn the shifted factors T₁, T₂, T₃, with unknown constant offsets C₁,C₂,C₃. The whole network will be combined in pairs to form in the next layer. Then, the neurons in the third layer will multiply T₁ and T₂T₃ to obtain T₁T₂T₃. Finally, the neuron in the output layer is a linear one that will be trained to undo the effect of the constant offsets aided by the shortcuts. We trained the network to learn the factorization by initializing the parameters multiple times, with the number of iterations 600 and the learning rate 2.0e-3. The final average error is less than 0.0051. In this way, the function was learned to be $g^{'} (x) = (1.0117592 x^{2} + 0.02033176 x + 1.0040002) (0.00819827 x^{2} + 0.9977611 x - 1.0023365) (1.0113174 x^{2} + 1.7042247 x + 1.1885077)$ , which agrees well with the target function g(x) = (x² + 2)(x − l)(x² + 1.7x + 1.2).

Figure 5: — Quadratic network with shortcuts to learn the soft factorization of an exemplary univariate polynomial.

Mathematically, we can handle the general factorization representation problem as follows. Let us denote A_i = m_i, A_ij = m_im_j, A_ijk = m_im_jm_k, …, $A_{123 \dots n} = \prod_{i}^{n} m_{i}$ , which are generic factors for i = 1, 2, …, n, the question becomes if $Π_{i}^{n} (m_{i} + C_{i}) - Π_{i}^{n} m_{i}$ can be represented as the linear combination of A₀, A_i, A_ij, A_ijk, …, A₁₂₃…(n−1) and a constant bias? If the answer is positive, then our above-illustrated factorization method can be extended for factorization of any univariate polynomial. Here we offer a proof by mathematical induction.

For n = 1, we have m₁ + C₁ – m₁ = C_1. Assume that we can represent $\prod_{i}^{p} (m_{i} + C_{i}) \prod_{i}^{p} m_{i}$ with a linear combination of A_i,A_ij,A_ijk, …, A_{123…(p–1)} and a constant, denoted as f(A_i,A_ij,A_ijk, …, A_{123…(p–1)}). Then, $\prod_{i}^{p + 1} (m_{i} + C_{i} - \prod_{i}^{p + 1} m_{i} = (m_{p + 1} + C_{p + 1}) (\prod_{i}^{p} (m_{i} + C_{i}) - \prod_{i}^{p} m_{i}) + C_{p + 1} \prod_{i}^{p} m_{p} = (m_{p + 1} + C_{p + 1}) f (A_{i}, A_{i j}, A_{i j k}, \dots, A_{123, (p - 1)}) + C_{p + 1} \prod_{i}^{p} m_{p}$ . Clearly, the above terms m_p+1f(A_i,A_ij,A_ijk, …, A_{123…(p–1)}, $C_{p + 1} \prod_{i}^{p} m_{p}$ are both linear terms of A_i,A_ij,A_ijk, …, A_123…p and C_p+1f(A_i,A_ij,A_ijk, …, A_{123…(p–1)}) is a linear representation of A_i,A_ij,A_ijk, …, A_{123…(p–1)} plus constant bias as well. Combining these together, we immediately have a desirable linear representation of $\prod_{i}^{p + 1} (m_{i} + C_{i}) - \prod_{i}^{p + 1} m_{i}$ that concludes our proof.

3.4. Theorem 4

Key Idea for Proving Theorem 4:

It is notable that a quadratic neuron internally realizes addition and multiplication operations in a unique way, which enables quantized quadratic neurons as building blocks for construction of an arbitrary function in a top-down way. That is the key mechanism by which the saving of $O (λ ({l o g}^{\frac{1}{λ - 1} + 1} (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ , λ ≥ 2 is achieved in terms of the number of weights. Figure 6 clearly demonstrates our heuristics.

Figure 6: — Quantized quadratic ReLU networks with weight 2⁻¹ and −2⁻¹ to approximate any number with an error bound of 2^−t

Lemma 8:

A connection with a weight w ∈ [−1,1] can be approximated by a ReLU network with λ ≥ 2 distinct weights, satisfying that (i) the ReLU network is equivalent to weight w′ with error no more than 2^−t specifically, |w – w′| < 2^−t; (ii) the depth is $O [l o g (t)];$ (iii) the width is $O (t);$ and (iv) the number of weights is at most $O (t)$ .

Proof: We first consider λ = 2 (we select two different weights as 2⁻¹, −2⁻¹), w ≥ 0 and x ≥ 0, and will relax the constraints later. Setting w = 0 directly leads to an empty operation. There are three other necessary operations perfectly represented by a ReLU quadratic network: E_u (unity operation) E_p (power operation) and E_m (multiplication), which take 4, 8, 8 neurons respectively, as shown in Figure 6. Our goal is to approximate w in accuracy of 2^−t. Let us define

K_{c} \equiv {2^{- 1}, 2^{- 2}, \dots, 2^{- (t - 1)}, 2^{- t}} .

With a homogeneous weight 2⁻¹, all the members in K_c can be represented by the composition of E_u, E_p and E_m operations. Because a binary expansion of W_c can denote any integral multiple of 2^−t like a numeral system with the radix of 2^−t, which is equivalent to represent any number with an error bound of 2^−t. To implement a binary expansion, one more layer with unity and empty operations is needed. Therefore, the overall quantized quadratic network has log(t) + 1 layers, with no more than 24t weights.

Regarding the situation of w < 0, we are able to construct −1 with −2⁻¹ as the final layer. To relax the assumption x ≥ 0, we can use two parallel sub-networks that take x and −x respectively. Due to the gate property of ReLU, the output sign of a sub-network can be flipped in the last layer. In the case of λ > 2, because our construction is top-down, it is interesting that $\frac{1}{2}$ and $- \frac{1}{2}$ are enough for the construction. We can assign two weight are $\frac{1}{2}$ and $- \frac{1}{2}$ respectively among λ distinct weights. Therefore λ does not affect the final complexity.

Proof of Theorem 4:

The proof consists of the following two steps. First, we utilize f′ to approximate $f \in F_{d}^{n}$ , and then we construct f′ precisely with our quantized ReLU quadratic networks. For m = (m₁, …, m_d) ∈ {0, 1, …, N}^d, ψ_m(x) is defined as:

ψ_{m} (x) = Π_{k = 1}^{d} h (3 N x_{k} - 3 m_{k}),

where N is a constant and

h (x) = {\begin{array}{l} 1 | x | \leq 1 \\ 2 - | x | 1 \leq | x | \leq 2 \\ 0 | x | \geq 2 \end{array}

ψ_m(x) forms a partition of the unity on [0, 1]^d: $\sum_{m} ψ_{m} (x) \equiv 1$ , where x is supported on [0, 1]^d. Note that the support of ψ_m(x) is ${x : | x_{k} - \frac{m_{k}}{N} | < \frac{1}{N}, \forall k}$ . For any m, there is a corresponding order n – 1 Taylor expansion of the function f at $x = \frac{m}{N} :$

Q_{m} (x) = {\sum_{n : | n | < n} \frac{D^{n} f}{n!} |}_{x = \frac{m}{N}} {(x - \frac{m}{N})}^{n},

where $n! = Π_{k = 1}^{d} n_{k}!$ , and ${(x - \frac{m}{N})}^{n} = Π_{k = 0}^{d} {(x_{k} - \frac{m_{k}}{N})}^{n_{k}}$ . Furthermore, we define $Q_{m}^{'} (x) = \sum_{n : | n | < n} γ_{m, n} {(x - \frac{m}{N})}^{n}$ , where γ_m,n denotes some integral multiple of $\frac{1}{n} {(\frac{d}{N})}^{n - | n |}$ that is the closest to ${\frac{D^{n} f}{n!} |}_{x = \frac{m}{N}}$ . Thus, the approximation of f is constructed as:

f^{'} (x) = \sum_{m} ψ_{m} Q_{m}^{'} (x)

Hence, we have

sup_{x \in {[0, 1]}^{d}} | f (x) - f^{'} (x) | = sup_{x \in {[0, 1]}^{d}} | \sum_{m} ψ_{m} (f (x) - Q_{m}^{'} (x)) | \leq \sum_{m} sup_{x \in {| x_{k} - \frac{m_{k}}{N} | < \frac{1}{N}, \forall k}} | f (x) - Q_{m}^{'} (x) | \leq 2^{d} max_{x \in {| x_{k} - \frac{m_{k}}{N} | < \frac{1}{N}, \forall k}} | f (x) - Q_{m} (x) | + 2^{d} max_{x \in {| x_{k} - \frac{m_{k}}{N} | < \frac{1}{N}, \forall k}} | Q_{m}^{'} (x) - Q_{m} (x) |

\leq \frac{2^{d} d^{n}}{n!} {(\frac{1}{N})}^{n} max_{n : | n | = n} e s s {sup}_{x \in {[0, 1]}^{d}} | D^{n} f (x) | + 2^{d} \sum_{n : | n | < n} max (| \frac{D^{n} f}{n!} - γ_{m, n} |) {(x - \frac{m}{N})}^{n} \leq \frac{2^{d} d^{n}}{n!} {(\frac{1}{N})}^{n} {(\frac{1}{N})}^{n} + \frac{2^{d}}{n} ({(\frac{d}{N})}^{n} + \dots + {(\frac{d}{N})}^{1} {(\frac{d}{N})}^{n - 1}) \leq 2^{d} {(\frac{d}{N})}^{n} (1 + \frac{1}{n!}) \leq 2^{d + 1} {(\frac{d}{N})}^{n}

Therefore, if we choose $N \geq {(\frac{2^{d + 1} d^{n}}{ϵ})}^{\frac{1}{n}}$ , for ϵ ∈ (0, 1), any $f \in F_{d, n}$ can be approximated $f^{'}$ with an error bound ϵ. We rewrite

f^{'} (x) = \sum_{m \in {0, 1, \dots, N}^{d}} \sum_{n : | n | < n} γ_{m, n} ψ_{m} {(x - \frac{m}{N})}^{n} = \sum_{m \in {0, 1, \dots, N}^{d}} \sum_{n : | n | < n} γ_{m, n} f_{m, n}^{'} (x),

where $f_{m, n}^{'} (x) = ψ_{m} {(x - \frac{m}{N})}^{n}$ is the linear combination of no more than dⁿ(N + 1)^d terms of $f_{m, n}^{'} (x)$ . By Lemma 8, with $t = l o g (\frac{n N^{n}}{d^{n}})$ we can use a sub-network of $t = l o g (\frac{n N^{n}}{d^{n}})$ to represent the constant weights γ_m,n.Then, the remaining work is to construct $f_{m, n}^{'} (x)$ . N Is chosen as 2^l, where l is some large integer, such that all $\frac{m_{i}}{N}$ can be obtained precisely in the way of Lemma 8.

The strategy of using a quadratic network to construct $f^{'}$ is illustrated in Figure 7. Table 1 presents the complexities of individual blocks of interest, which helps us to analyze the complexity of the whole network. Although the whole complexity can be expressed in an explicit way, here we use the $O$ notation for clarity and simplicity. Let N_d and N_w be the network depth and the number of weights. According to Table 1, we have $N_{d} = O (l o g (t) + l o g (N))$ and $N_{w} = O (N^{d} l o g (t) + N^{d} l o g (N))$ . Because $t = l o g (\frac{n N^{n}}{d^{n}})$ and $N = {(\frac{2^{d + 1} d^{n}}{ϵ})}^{\frac{1}{n}} = O ({(\frac{1}{ϵ})}^{\frac{1}{n}})$ substituting them into the estimates results in

N_{d} = O (l o g (l o g (\frac{1}{ϵ})) + l o g (\frac{1}{ϵ}))

N_{w} = O (l o g (l o g (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}} + l o g (\frac{1}{ϵ}) {(\frac{1}{ϵ})}^{\frac{d}{n}})

Figure 7: — Strategy of using a quantized quadratic network to implement $f^{'} (x)$

Table I:

Descriptions of different building blocks.

Block	Operation	Width	Depth
B_N	$Construct \frac{1}{N}, \dots, \frac{N - 1}{N}$	N	log(N)
$B_{N}^{n}$	$Construct {(x_{k} - m_{k})}^{n_{k}}$	n	log(n)
B_h	Implement h(x)	12	1
B_m	Multiply d terms	2d	log(2d)
B_γ	Construct γ	t	log(t)

Open in a new tab

As we know, the quantization of the network [38] is an effective and robust way for model compression, even a binarized network performs well in some meaningful tasks. However, the reason why quantization is robust and useful was rarely studied. Ding et al. [39] reported that to obtain an approximation error ϵ, the upper bound for a quantized ReLU network is delineated by

O (λ (l o g^{\frac{1}{λ - 1} + 1} (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}} + l o g (\frac{1}{ϵ}) {(\frac{1}{ϵ})}^{\frac{d}{n}}),

(4)

where λ is the number of distinct weights. Because it is convenient for quadratic neurons to implement multiplication, our quantized ReLU quadratic network considerably reduces the upper bound to $O (l o g (l o g (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}} + l o g (\frac{1}{ϵ}) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ . It is observed that $O (l o g (l o g (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ is significantly lower than $O (λ (l o g^{\frac{1}{λ - 1} + 1} (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ therefore the saving made by quantized quadratic network is almost in an order of $O (λ (l o g^{\frac{1}{λ - 1} + 1} (\frac{1}{ϵ})) {(\frac{1}{ϵ})}^{\frac{d}{n}})$ . More interesting, the complexity bound achieved by quantized quadratic networks is independent of the value of λ, as explained in our proof.

4. Discussions and Conclusion

The rational for the first theorem is easily understandable. Since it is commonly known that a conventional network with one hidden layer is a universal approximator, the best thing we can do to justify the quadratic network is to find a class of functions that can be more efficiently approximated by a quadratic network than with a conventional network. By the first theorem, the quadratic network is indeed more efficient than the conventional counterpart.

The previous studies demonstrate that any function of n–variables that is not constant along any direction cannot be well represented by a fully-connected ReLU network with no more than n – 1 neurons in each layer [16]. However, breaking this network width bound, our second theorem states that when a radial function is not constant in any direction, the network width 4 is sufficient for a ReLU quadratic network to approximate the function accurately. Since it is more convenient to represent a nonlinear function with more nonlinear neurons, the incapability of conventional networks can be compensated by quadratic networks. Theorem 2 implies a slimmer and/or sparser width-bounded universal approximator.

Our third theorem is inspiring that a general polynomial function can be exactly expressed by a quadratic network via ReLU activation in a novel way of data-driven network-based algebraic factorization. Although third theorem assumes the ReLU activation function, considering ReLU is arguably the most important activation function, it makes a great sense in illustrating the capacity of quadratic networks. Most importantly, Theorem 3 shows that quadratic networks are advantageous in terms of expressibility and efficiency. By calculating a quadratic function of input variables, quadratic neurons create second-order terms. The second-order terms serve as nonlinear building blocks that are different from non-linearity generated only through nonlinear activation, and clearly preferred from the perspective of the Algebraic Fundamental Theorem. Accurately forming a polynomial allows global computation and high efficiency as compared to the popular piece-wise approximation with conventional neurons in a deep network. In our quadratic network construction, both the depth and width are limited, while all previous universal approximators based on ReLU are either too wide or too deep to achieve a high approximation accuracy. This fact suggests that quadratic networks could potentially minimize the number of neurons and free parameters for superior machine learning performance. It is surprising that although every quadratic neuron possesses a number of parameters three times as many as that of the corresponding conventional neuron, quadratic networks can use less parameters than the conventional counterparts in important cases, such as the presentation of radial functions and the approximation to multivariate polynomials.

Interestingly, our results (Corollary 1 and Theorem 3) imply that quadratic networks are highly adaptive. For example, while a quadratic network can do piece-wise approximation by degenerating into conventional networks, a generic quadratic network is capable of approximating any function in a global way by implementing basic algebraic factors. Also, quadratic networks can mimick RBF networks on the “wavelet” scale, as indicated by Corollary 1.

As we know, deep learning generalizes well in many applications. Some researchers questioned that quadratic networks may tend to over-fitting because the added neural complexity may not be necessary. With our above proved theorems, we refute that since quadratic networks de facto are more compact and more effective than the conventional counterpart in major circumstances, quadratic networks could improve the generalization ability, according to Occam’s razor principle. Since general features are either quadratic or can be expressed as combinations or compositions of quadratic features, quadratic deep networks will be universal, instead of being restricted to the circumstances where only quadratic features are relevant.

In conclusion, we have analyzed the approximation ability of the quadratic networks and proved four theorems, suggesting a great potential of quadratic networks in terms of expressive efficiency, unique representation, compact structure and computational capacity. Future research will be focused on evaluating the performance of deep quadratic networks in typical applications.

Biography

graphic file with name nihms-1569260-b0008.gif

Fenglei Fan received the Bachelor’s degree from the Harbin Institute of Technology before joining Dr. G. Wang’s Laboratory. He is currently pursuing the Ph.D. degree with the Department of Biomedical Engineering, Rensselaer Polytechnic Institute. His research interests include applied mathematics, machine learning, and medical imaging.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1].Krizhevsky A, Sutskever I, Hinton GE, “Imagenet classification with deep convolutional neural networks,” In NIPS, 2012. [Google Scholar]
[2].Szegedy C, et al. , “Going deeper with convolutions,” In CVPR, 2015. [Google Scholar]
[3].Simonyan K, Zisserman A, “Very deep convolutional networks for large-scale image recognition,”. arXiv preprint arXiv: 1409.1556, 2014 [Google Scholar]
[4].He K, et al. , “Deep residual learning for image recognition,” In CVPR, 2016. [Google Scholar]
[5].Huang G, et al. , “Densely Connected Convolutional Networks,” In CVPR, 2017. [Google Scholar]
[6].Goodfellow I, et al. , “Generative adversarial nets,” In NIPS, 2014. [Google Scholar]
[7].Szegedy C, et al. , “Rethinking the inception architecture for computer vision,” In CVPR, 2016. [Google Scholar]
[8].Dahl GE, Yu D, Deng L and Acero A, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing,” vol. 20, no. 1, pp. 30–42, 2012. [Google Scholar]
[9].Kumar A, et al. , “Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016. [Google Scholar]
[10].Silver D, et al. , “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354, 2017. [DOI] [PubMed] [Google Scholar]
[11].Wang G, “A Perspective on Deep Imaging,” IEEE Access, vol. 4, pp. 8914–8924, 2016. [Google Scholar]
[12].Zhang Y and Yu H, “Convolutional Neural Network based Metal Artifact Reduction in X-ray Computed Tomography,” IEEE Trans. Med. Imaging, vol. 37, no. 6, pp. 1370–1381, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Hong Y, Kim J, Chen G, Lin W, Yap PT and Shen D, “Longitudinal Prediction of Infant Diffusion MRI Data via Graph Convolutional Adversarial Networks,” IEEE transactions on medical imaging, doi: https://ieeexplore.ieee.org/document/8691605, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[14].Hornik K, Stinchcombe M, White H, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5 pp. 359–366, 1989. [Google Scholar]
[15].Eldan R, Shamir O, “The power of depth for feedforward neural networks,” In COLT, 2016. [Google Scholar]
[16].Lu Z, Pu H, Wang F, Hu Z and Wang L, “The expressive power of neural networks: A view from the width,” In NIPS, 2017. [Google Scholar]
[17].Szymanski L and McCane B, “Deep networks are effective encoders of periodicity,” IEEE Trans.Neural. Netw. Learn. Syst, vol. 25, no. 10, pp. 1816–1827, 2014. [DOI] [PubMed] [Google Scholar]
[18].Cohen N, Sharir O, and Shashua A, “On the expressive power of deep learning: A tensor analysis,” In COLT, 2016. [Google Scholar]
[19].Telgarsky M, “Benefits of depth in neural networks,” In COLT, 2016. [Google Scholar]
[20].Mhaskar HN and Poggio T, “Deep vs. shallow networks: An approximation theory perspective,” Analysis and Applications, vol. 14, no. 6, pp. 829–848, 2016. [Google Scholar]
[21].Liang S and Srikant R, “Why deep neural networks for function approximation?” In ICLR, 2017. [Google Scholar]
[22].Lin H and Jegelka S, “ResNet with one-neuron hidden layers is a Universal Approximator,” arXiv preprint arXiv: 1806.10909, 2018. [Google Scholar]
[23].Bianchini M and Scarselli F, “On the complexity of neural network classifiers: A comparison between shallow and deep architectures,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, pp. 1553–1565, 2014. [DOI] [PubMed] [Google Scholar]
[24].Kurkova V and Sanguineti M (2017). Probabilistic Lower Bounds for Approximation by Shallow Perceptron Networks. Neural Networks 91, pp. 34–41, 2017. [DOI] [PubMed] [Google Scholar]
[25].Fan F, Cong W, Wang G, “A new type of neurons for machine learning,” International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, February 2018. [DOI] [PubMed] [Google Scholar]
[26].Fan F, Cong W, Wang G, “Generalized Backpropagation Algorithm for Training Second-order Neural Networks,” International Journal for Numerical Methods in Biomedical Engineering, 10.1002/cnm.2956, 2017. [DOI] [PubMed] [Google Scholar]
[27].Fan F and Wang G, “Fuzzy Logic Interpretation of Artificial Neural Networks,” arXiv preprint arXiv: 1807.03215, 2018 [Google Scholar]
[28].Fan F, Shan H, Wang G. Quadratic Autoencoder for Low-Dose CT Denoising. arXiv preprint arXiv: 1901.05593. 2019. January 17. [Google Scholar]
[29].Minsky ML, Papert S, Perceptrons, Cambridge: MIT Press, 1969. [Google Scholar]
[30].Giles CL, Maxwell T, “Learning, invariance, and generalization in high-order neural networks,” Applied optics, vol. 26, pp. 4972–4978, 1987. [DOI] [PubMed] [Google Scholar]
[31].Tsapanos N, Tefas A, Nikolaidis N and Pitas I, “Neurons With Paraboloid Decision Boundaries for Improved Neural Network Classification Performance,” IEEE Trans. Neural. Netw. Learn. Syst, vol. 99, pp. 1–11. [DOI] [PubMed] [Google Scholar]
[32].Goodfellow I, Bengio Y, Courville A, Deep learning, Cambridge: MIT press, 2016 [Google Scholar]
[33].Jeffreys H and Jeffreys BS, “Weierstrass’s Theorem on Approximation by Polynomials” and “Extension of Weierstrass’s Approximation Theory” in Methods of Mathematical Physics, 3^rd ed., Cambridge, England, US: Cambridge University Press, 1988. [Google Scholar]
[34].Light WA and Cheney EW, Approximation Theory in Tensor Product Spaces, Springer-Verlag, 1985. [Google Scholar]
[35].Remmert R, “The fundamental theorem of algebra” in Numbers. New York, NY: Springer, pp. 97–122, 1991. [Google Scholar]
[36].Bengio Y, Courville A and Vincent P, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. [DOI] [PubMed] [Google Scholar]
[37].Lin HW, Tegmark M, Rolnick D, “Why does deep and cheap learning work so well?” Journal of Statistical Physics, vol. 168, no. 5 pp. 1223–1247, 2017. [Google Scholar]
[38].Han S, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” In ICLR, 2016. [Google Scholar]
[39].Ding Y, “On the Universal Approximability and Complexity Bounds of Quantized ReLU Neural Networks,” In ICLR, 2019. [Google Scholar]
[40].Livni R, et al. , “On the computational efficiency of training neural networks.” In NIPS, 2014. [Google Scholar]
[41].Krotov D and Hop field J, “Dense associative memory is robust to adversarial inputs,” Neural computation, 30(12), pp.3151–3167. 2018. [DOI] [PubMed] [Google Scholar]
[42].Du SS, Lee JD, “On the Power of Over-parametrization in Neural Networks with Quadratic Activation.” arXiv preprint arXiv: 1803.01206, 2018. [Google Scholar]
[43].Andoni A, et al. , “Learning polynomials with neural networks.” In ICML, 2014. [Google Scholar]
[44].Liang S and Srikant R., “Why deep neural networks for function approximation?.” In ICLR, 2017. [Google Scholar]
[45].Blondel M, et al. “Higher-order factorization machines.” In NIPS, 2016. [Google Scholar]
[46].Debao C, “Degree of approximation by superpositions of a sigmoidal function,” Approximation Theory and its Applications. 1993. September 1;9(3): 17–28. [Google Scholar]
[47].Park J, Sandberg IW, “Universal approximation using radial-basis-function networks,” Neural computation. 1991. June;3(2):246–57. [DOI] [PubMed] [Google Scholar]
[48].Ye JC, Han Y and Cha E, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM Journal on Imaging Sciences, 11(2), pp.991–1048. 2018. [Google Scholar]
[49].Barron AR, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information theory. 1993. May;39(3):930–45. [Google Scholar]
[50].Kainen PC, Kurkova VV, Sanguineti M, “Dependence of computational models on input dimension: Tractability of approximation and optimization tasks,” IEEE Transactions on Information Theory. 2012. February 6;58(2):1203–14. [Google Scholar]
[51].Kon MA and Plaskota L, “Information complexity of neural networks,” Neural Networks, 13(3), 365–375, 2000 [DOI] [PubMed] [Google Scholar]
[52].Schmitt M, “Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks,” In NIPS, 2000. [Google Scholar]

[R1] [1].Krizhevsky A, Sutskever I, Hinton GE, “Imagenet classification with deep convolutional neural networks,” In NIPS, 2012. [Google Scholar]

[R2] [2].Szegedy C, et al. , “Going deeper with convolutions,” In CVPR, 2015. [Google Scholar]

[R3] [3].Simonyan K, Zisserman A, “Very deep convolutional networks for large-scale image recognition,”. arXiv preprint arXiv: 1409.1556, 2014 [Google Scholar]

[R4] [4].He K, et al. , “Deep residual learning for image recognition,” In CVPR, 2016. [Google Scholar]

[R5] [5].Huang G, et al. , “Densely Connected Convolutional Networks,” In CVPR, 2017. [Google Scholar]

[R6] [6].Goodfellow I, et al. , “Generative adversarial nets,” In NIPS, 2014. [Google Scholar]

[R7] [7].Szegedy C, et al. , “Rethinking the inception architecture for computer vision,” In CVPR, 2016. [Google Scholar]

[R8] [8].Dahl GE, Yu D, Deng L and Acero A, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing,” vol. 20, no. 1, pp. 30–42, 2012. [Google Scholar]

[R9] [9].Kumar A, et al. , “Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016. [Google Scholar]

[R10] [10].Silver D, et al. , “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354, 2017. [DOI] [PubMed] [Google Scholar]

[R11] [11].Wang G, “A Perspective on Deep Imaging,” IEEE Access, vol. 4, pp. 8914–8924, 2016. [Google Scholar]

[R12] [12].Zhang Y and Yu H, “Convolutional Neural Network based Metal Artifact Reduction in X-ray Computed Tomography,” IEEE Trans. Med. Imaging, vol. 37, no. 6, pp. 1370–1381, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Hong Y, Kim J, Chen G, Lin W, Yap PT and Shen D, “Longitudinal Prediction of Infant Diffusion MRI Data via Graph Convolutional Adversarial Networks,” IEEE transactions on medical imaging, doi: https://ieeexplore.ieee.org/document/8691605, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].Hornik K, Stinchcombe M, White H, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5 pp. 359–366, 1989. [Google Scholar]

[R15] [15].Eldan R, Shamir O, “The power of depth for feedforward neural networks,” In COLT, 2016. [Google Scholar]

[R16] [16].Lu Z, Pu H, Wang F, Hu Z and Wang L, “The expressive power of neural networks: A view from the width,” In NIPS, 2017. [Google Scholar]

[R17] [17].Szymanski L and McCane B, “Deep networks are effective encoders of periodicity,” IEEE Trans.Neural. Netw. Learn. Syst, vol. 25, no. 10, pp. 1816–1827, 2014. [DOI] [PubMed] [Google Scholar]

[R18] [18].Cohen N, Sharir O, and Shashua A, “On the expressive power of deep learning: A tensor analysis,” In COLT, 2016. [Google Scholar]

[R19] [19].Telgarsky M, “Benefits of depth in neural networks,” In COLT, 2016. [Google Scholar]

[R20] [20].Mhaskar HN and Poggio T, “Deep vs. shallow networks: An approximation theory perspective,” Analysis and Applications, vol. 14, no. 6, pp. 829–848, 2016. [Google Scholar]

[R21] [21].Liang S and Srikant R, “Why deep neural networks for function approximation?” In ICLR, 2017. [Google Scholar]

[R22] [22].Lin H and Jegelka S, “ResNet with one-neuron hidden layers is a Universal Approximator,” arXiv preprint arXiv: 1806.10909, 2018. [Google Scholar]

[R23] [23].Bianchini M and Scarselli F, “On the complexity of neural network classifiers: A comparison between shallow and deep architectures,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, pp. 1553–1565, 2014. [DOI] [PubMed] [Google Scholar]

[R24] [24].Kurkova V and Sanguineti M (2017). Probabilistic Lower Bounds for Approximation by Shallow Perceptron Networks. Neural Networks 91, pp. 34–41, 2017. [DOI] [PubMed] [Google Scholar]

[R25] [25].Fan F, Cong W, Wang G, “A new type of neurons for machine learning,” International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, February 2018. [DOI] [PubMed] [Google Scholar]

[R26] [26].Fan F, Cong W, Wang G, “Generalized Backpropagation Algorithm for Training Second-order Neural Networks,” International Journal for Numerical Methods in Biomedical Engineering, 10.1002/cnm.2956, 2017. [DOI] [PubMed] [Google Scholar]

[R27] [27].Fan F and Wang G, “Fuzzy Logic Interpretation of Artificial Neural Networks,” arXiv preprint arXiv: 1807.03215, 2018 [Google Scholar]

[R28] [28].Fan F, Shan H, Wang G. Quadratic Autoencoder for Low-Dose CT Denoising. arXiv preprint arXiv: 1901.05593. 2019. January 17. [Google Scholar]

[R29] [29].Minsky ML, Papert S, Perceptrons, Cambridge: MIT Press, 1969. [Google Scholar]

[R30] [30].Giles CL, Maxwell T, “Learning, invariance, and generalization in high-order neural networks,” Applied optics, vol. 26, pp. 4972–4978, 1987. [DOI] [PubMed] [Google Scholar]

[R31] [31].Tsapanos N, Tefas A, Nikolaidis N and Pitas I, “Neurons With Paraboloid Decision Boundaries for Improved Neural Network Classification Performance,” IEEE Trans. Neural. Netw. Learn. Syst, vol. 99, pp. 1–11. [DOI] [PubMed] [Google Scholar]

[R32] [32].Goodfellow I, Bengio Y, Courville A, Deep learning, Cambridge: MIT press, 2016 [Google Scholar]

[R33] [33].Jeffreys H and Jeffreys BS, “Weierstrass’s Theorem on Approximation by Polynomials” and “Extension of Weierstrass’s Approximation Theory” in Methods of Mathematical Physics, 3^rd ed., Cambridge, England, US: Cambridge University Press, 1988. [Google Scholar]

[R34] [34].Light WA and Cheney EW, Approximation Theory in Tensor Product Spaces, Springer-Verlag, 1985. [Google Scholar]

[R35] [35].Remmert R, “The fundamental theorem of algebra” in Numbers. New York, NY: Springer, pp. 97–122, 1991. [Google Scholar]

[R36] [36].Bengio Y, Courville A and Vincent P, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. [DOI] [PubMed] [Google Scholar]

[R37] [37].Lin HW, Tegmark M, Rolnick D, “Why does deep and cheap learning work so well?” Journal of Statistical Physics, vol. 168, no. 5 pp. 1223–1247, 2017. [Google Scholar]

[R38] [38].Han S, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” In ICLR, 2016. [Google Scholar]

[R39] [39].Ding Y, “On the Universal Approximability and Complexity Bounds of Quantized ReLU Neural Networks,” In ICLR, 2019. [Google Scholar]

[R40] [40].Livni R, et al. , “On the computational efficiency of training neural networks.” In NIPS, 2014. [Google Scholar]

[R41] [41].Krotov D and Hop field J, “Dense associative memory is robust to adversarial inputs,” Neural computation, 30(12), pp.3151–3167. 2018. [DOI] [PubMed] [Google Scholar]

[R42] [42].Du SS, Lee JD, “On the Power of Over-parametrization in Neural Networks with Quadratic Activation.” arXiv preprint arXiv: 1803.01206, 2018. [Google Scholar]

[R43] [43].Andoni A, et al. , “Learning polynomials with neural networks.” In ICML, 2014. [Google Scholar]

[R44] [44].Liang S and Srikant R., “Why deep neural networks for function approximation?.” In ICLR, 2017. [Google Scholar]

[R45] [45].Blondel M, et al. “Higher-order factorization machines.” In NIPS, 2016. [Google Scholar]

[R46] [46].Debao C, “Degree of approximation by superpositions of a sigmoidal function,” Approximation Theory and its Applications. 1993. September 1;9(3): 17–28. [Google Scholar]

[R47] [47].Park J, Sandberg IW, “Universal approximation using radial-basis-function networks,” Neural computation. 1991. June;3(2):246–57. [DOI] [PubMed] [Google Scholar]

[R48] [48].Ye JC, Han Y and Cha E, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM Journal on Imaging Sciences, 11(2), pp.991–1048. 2018. [Google Scholar]

[R49] [49].Barron AR, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information theory. 1993. May;39(3):930–45. [Google Scholar]

[R50] [50].Kainen PC, Kurkova VV, Sanguineti M, “Dependence of computational models on input dimension: Tractability of approximation and optimization tasks,” IEEE Transactions on Information Theory. 2012. February 6;58(2):1203–14. [Google Scholar]

[R51] [51].Kon MA and Plaskota L, “Information complexity of neural networks,” Neural Networks, 13(3), 365–375, 2000 [DOI] [PubMed] [Google Scholar]

[R52] [52].Schmitt M, “Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks,” In NIPS, 2000. [Google Scholar]

PERMALINK

Universal Approximation with Quadratic Deep Networks

Fenglei Fan

Jinjun Xiong

Ge Wang

Abstract

1. Introduction

2. Preliminaries

Quadratic/Second-order Neuron:

One-hidden-layer Networks:

L-Lipschitz Function:

Radial Function:

Euclidean Unit-volume Ball:

Bernstein Polynomial:

Admissible function:

Function space Fnd:

3. Four Theorems

3.1. Theorem 1

Key Idea for Proving Theorem 1:

The contribution of our work:

Lemma 1:

Lemma 2:

Lemma 3:

Proof of Theorem 1:

Classification Example:

Figure 1:

3.2. Theorem 2

Key Idea for Proving Theorem 2:

Proof of Theorem 2:

Figure 2:

Lemma 4:

Corollary 2:

Figure 3:

3.3. Theorem 3

Key Idea for Proving Theorem 3:

Lemma 5:

Figure 4:

Lemma 6:

Corollary 2:

Lemma 7:

Proof of Theorem 3:

Figure 5:

3.4. Theorem 4

Key Idea for Proving Theorem 4:

Figure 6:

Lemma 8:

Proof of Theorem 4:

Figure 7:

Table I:

4. Discussions and Conclusion

Biography

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Function space $F_{n}^{d}$ :