Proceedings. Mathematical, Physical, and Engineering Sciences
2020 Jul 15; 476(2239): 20200334. doi: 10.1098/rspa.2020.0334

Locally adaptive activation functions with slope recovery for deep and physics-informed neural networks

Ameya D. Jagtap 1,†, Kenji Kawaguchi 2,†, George Em Karniadakis 1,3
PMCID: PMC7426042  PMID: 32831616

Abstract

We propose two approaches to locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions, which improve the performance of deep and physics-informed neural networks. The local adaptation of the activation function is achieved by introducing a scalable parameter in each layer (layer-wise) or for every neuron (neuron-wise) separately, and then optimizing it using a variant of the stochastic gradient descent algorithm. To further increase the training speed, an activation slope-based slope recovery term is added to the loss function, which accelerates convergence and thereby reduces the training cost. On the theoretical side, we prove that in the proposed method the gradient descent algorithms are not attracted to sub-optimal critical points or local minima under practical conditions on the initialization and learning rate, and that the gradient dynamics of the proposed method is not achievable by base methods with any (adaptive) learning rates. We further show that the adaptive activation methods accelerate the convergence by implicitly multiplying conditioning matrices to the gradient of the base method without any explicit computation of the conditioning matrix and the matrix–vector product. The different adaptive activation functions are shown to induce different implicit conditioning matrices. Furthermore, the proposed methods with the slope recovery are shown to accelerate the training process.

Keywords: physics-informed neural networks, machine learning, bad minima, stochastic gradients, accelerated training, deep learning benchmarks

1. Introduction

In recent years, research on neural networks (NNs) has intensified around the world owing to their successful applications in many diverse fields such as speech recognition [1], computer vision [2] and natural language translation [3]. NNs have also been used in scientific computing to solve partial differential equations (PDEs), owing to their ability to effectively approximate complex functions arising in diverse scientific disciplines (cf. physics-informed neural networks (PINNs) by Raissi et al. [4] and references therein). PINNs can accurately solve both direct problems, where the approximate solutions of governing equations are obtained, and highly ill-posed inverse problems, where parameters involved in the governing equation are inferred from the training data.

A major drawback of such NN-based models is the slow training/convergence speed, which can adversely affect their performance, especially in real-life applications that require an NN model to run in real time. Therefore, it is crucial to accelerate the convergence of such models without sacrificing performance. Highly efficient and adaptable algorithms are important for designing the most effective NN, which not only increases the accuracy of the solution but also reduces the training cost. Various NN architectures, like the Dropout NN [5], have been proposed in the literature and can improve the efficiency of the algorithm for specific applications. One of the important features of an NN is the activation function, which decides whether a particular neuron is activated during the training process. There is no rule of thumb for the choice of activation function; in fact, it solely depends on the problem at hand. Therefore, in this work, we focus in particular on adaptive activation functions, which adapt automatically so that the network can be trained faster. Various methods have been proposed in the literature for adaptive activation functions, such as the adaptive sigmoidal activation function proposed by Yu et al. [6] for multilayer feedforward NNs. Qian et al. [7] focused on learning activation functions in convolutional NNs by combining basic activation functions in a data-driven way. Multiple activation functions per neuron were proposed by Dushkoff & Ptucha [8], where individual neurons select between a multitude of activation functions. Li et al. [9] proposed a tunable activation function, where only a single hidden layer is used and the activation function is tuned. Shen et al. [10] used a similar idea of a tunable activation function but with multiple outputs. Recently, Kunc & Kléma proposed transformative adaptive activation functions for gene expression inference, see [11]. One such adaptive activation function was proposed by Jagtap et al. [12] by introducing a scalable parameter in the activation function, which can be optimized by any optimization method. Mathematically, it changes the slope of the activation function, thereby accelerating the learning process by altering the loss landscape of the NN, especially during the initial training period. Because a single scalar parameter is used, we call such adaptive activation functions globally adaptive activations, meaning that they give one optimized slope for the entire network. One can also perform such optimization at the local level, where the scalable parameters are introduced hidden-layer-wise or even for each neuron in the network. Owing to the different learning capacities of the hidden layers, such locally defined activation slopes can further improve the performance of the network. To further increase the training speed, an activation slope-based slope recovery term is added to the loss function, which further accelerates convergence.

The rest of the paper is organized as follows. Section 2 presents the methodology of the proposed layer-wise and neuron-wise locally adaptive activations in detail. This includes a discussion of the slope recovery term, of the expansion of the parametric space due to the layer-wise and neuron-wise introduction of additional parameters, and of its effect on the overall training cost. In this section, the PINN method is also introduced briefly for completeness. Section 3 gives theoretical results for gradient descent algorithms, where we analyse both the convergent points and the gradient dynamics per iteration. In §4, we perform computational experiments with the proposed methods, solving a function approximation problem using a deep NN and inverse PDE-based problems using PINNs. We also solve some standard deep learning benchmark problems using the proposed activation functions. Several comparisons are made between the existing and the proposed methods. Finally, in §5, we summarize the findings of our work.

2. Methodology

We use an NN of depth D corresponding to a network with an input layer, D − 1 hidden layers and an output layer. The kth hidden layer contains $N_k$ neurons. Each hidden layer of the network receives an output $z^{k-1} \in \mathbb{R}^{N_{k-1}}$ from the previous layer, where an affine transformation of the form

$$\mathcal{L}_k(z^{k-1}) := w^k z^{k-1} + b^k \tag{2.1}$$

is performed. The network weights $w^k \in \mathbb{R}^{N_k \times N_{k-1}}$ and bias term $b^k \in \mathbb{R}^{N_k}$ associated with the kth layer are chosen from independent and identically distributed sampling. The nonlinear activation function $\sigma(\cdot)$ is applied to each component of the transformed vector before sending it as an input to the next layer. The activation function is the identity after the output layer. Thus, the final NN representation is given by the composition

$$u_{\Theta}(z) = (\mathcal{L}_D \circ \sigma \circ \mathcal{L}_{D-1} \circ \cdots \circ \sigma \circ \mathcal{L}_1)(z),$$

where $\circ$ denotes the composition operator, $\Theta = \{w^k, b^k\}_{k=1}^{D} \in V$ represents the trainable parameters in the network, and V is the parameter space; u and $z^0 = z$ are the output and input of the network, respectively.

Jagtap et al. [12] proposed an adaptive activation function in which an additional scalable parameter $na$ is introduced, where n ≥ 1 is a pre-defined scaling factor and $a \in \mathbb{R}$ acts as the slope of the activation function. Since the parameter a is defined for the complete network, we call it a global adaptive activation function (GAAF). Optimizing this parameter dynamically alters the loss landscape, thereby accelerating the convergence of the NN, especially during the early training period. It is also possible to extend this strategy by defining the activation slope locally. In this regard, we propose the following two approaches to locally optimize the activation function.

  • Layer-wise locally adaptive activation functions (L-LAAF) Instead of globally defining the parameter a for the adaptive activation function, let us define this parameter for each hidden layer as
    $$\sigma\!\left(na^k\,\mathcal{L}_k(z^{k-1})\right), \quad k = 1, 2, \ldots, D-1.$$
    This gives additional D − 1 parameters to be optimized along with weights and biases. Here, every hidden-layer has its own slope for the activation function.
  • Neuron-wise locally adaptive activation functions (N-LAAF) One can also define such activation function at the neuron level as
    $$\sigma\!\left(na_i^k\,(\mathcal{L}_k(z^{k-1}))_i\right), \quad k = 1, 2, \ldots, D-1, \quad i = 1, 2, \ldots, N_k.$$
    This gives additional $\sum_{k=1}^{D-1} N_k$ parameters to be optimized. The neuron-wise activation function acts as a vector activation function in each hidden layer, where every neuron has its own slope for the activation function, as opposed to the scalar activation functions given by the L-LAAF and GAAF approaches.

In both cases, n ≥ 1 is a scaling factor. For every problem, there exists a critical scaling factor $n_{\text{crit}}$ above which the optimization algorithm becomes very sensitive. The resulting optimization problem leads to finding the minimum of a loss function by optimizing the activation slopes along with the weights and biases. The final layer-wise adaptive activation function-based NN representation of the solution is then given by

$$u_{\hat{\Theta}}(z) = \left(\mathcal{L}_D \circ \sigma \circ na^{D-1}\mathcal{L}_{D-1} \circ \sigma \circ na^{D-2}\mathcal{L}_{D-2} \circ \cdots \circ \sigma \circ na^{1}\mathcal{L}_1\right)(z).$$

Similarly, we can write the neuron-wise adaptive activation function-based NN representation of the solution. In this case, the set of trainable parameters $\hat{\Theta} \in \hat{V}$ consists of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a_i^k\}_{k=1}^{D-1}$, $i = 1, 2, \ldots, N_k$. In the proposed methods, the scalable parameters are initialized such that $na_i^k = 1$ for every n.
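For concreteness, the following is a minimal sketch of how the layer-wise and neuron-wise adaptive activations might be implemented, here assuming a PyTorch rendering (the paper does not prescribe a framework); the class name AdaptiveLinear and its arguments are illustrative, and the slope parameters are initialized so that n a = 1, as described above.

```python
import torch
import torch.nn as nn

class AdaptiveLinear(nn.Module):
    """Affine layer followed by sigma(n * a * L(z)), with trainable slope a.

    mode='layer'  -> L-LAAF: one scalar slope a^k per hidden layer.
    mode='neuron' -> N-LAAF: one slope a_i^k per neuron (vector of size out_features).
    """
    def __init__(self, in_features, out_features, n=1.0, mode='layer', act=torch.tanh):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.n = n
        self.act = act
        shape = (1,) if mode == 'layer' else (out_features,)
        # initialize so that n * a = 1, as in the text
        self.a = nn.Parameter(torch.full(shape, 1.0 / n))

    def forward(self, z):
        return self.act(self.n * self.a * self.linear(z))

# Example: 4 hidden layers of 50 neurons with N-LAAF and scaling factor n = 10,
# matching the function-approximation setup used later in the paper.
layers = [AdaptiveLinear(1, 50, n=10.0, mode='neuron')]
layers += [AdaptiveLinear(50, 50, n=10.0, mode='neuron') for _ in range(3)]
model = nn.Sequential(*layers, nn.Linear(50, 1))
```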

The advantage of introducing the activation-slope parameters locally rather than globally is that it gives additional degrees of freedom to every hidden layer, as well as to every neuron in all the hidden layers, which in turn increases the learning capacity of the network. Another advantage of LAAF is that different scaling factors can be assigned to every layer, as well as to every neuron, as opposed to the global scaling factor in GAAF.

Compared with the single additional parameter of GAAF, the locally adaptive activation function-based PINN has several additional scalable parameters to train. Thus, it is important to consider the additional computational cost required. The increase of the parametric space leads to a higher-dimensional optimization problem whose solution can be difficult to obtain. Of the two approaches discussed previously, i.e. L-LAAF and N-LAAF, N-LAAF introduces the largest number of additional parameters for optimization. Next, we give a qualitative picture of the increase in the number of parameters. Let ω and β be the total numbers of weights and biases in the NN. Then, the ratio P of the size of the parametric space of N-LAAF to that of the fixed activation-based NN is $P \approx (1+2\varrho)/(1+\varrho)$, where ϱ = β/ω. As an example, consider a fully connected NN with a single input and output involving three hidden layers with 20 neurons in each layer, which gives ω = 840 and β = 61. Thus, P = 1.0677, i.e. a 6.77% increase in the number of parameters. This increment is further reduced as the number of layers and the number of neurons per layer increase, eventually resulting in a negligible increase in the number of parameters. In such cases, the computational costs of the fixed activation function and of the neuron-wise locally adaptive activations are comparable.
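A small sanity check of the counting argument above, as a sketch; only the numbers quoted in the text are used.

```python
# Fully connected NN: 1 input, three hidden layers of 20 neurons, 1 output.
layer_sizes = [1, 20, 20, 20, 1]
omega = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))  # weights = 840
beta = sum(layer_sizes[1:])                                            # biases  = 61
rho = beta / omega
P = (1 + 2 * rho) / (1 + rho)
print(omega, beta, round(P, 4))  # 840 61 1.0677 -> about 6.77% more parameters
```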

(a). Physics-informed neural networks

In this section, we briefly introduce the PINN algorithm [4]. PINN is a very efficient method for solving forward and inverse differential and integro-differential equations involving noisy, sparse and multi-fidelity data. The main feature of PINN is that it can easily incorporate all the given information, such as the governing equation, experimental data and initial/boundary conditions, into the loss function, thereby recasting the original problem as an optimization problem. One of the main limitations of the PINN algorithm is its high computational cost for high-dimensional optimization problems, which is addressed in [13] by employing a domain decomposition approach. The PINN algorithm aims to learn a surrogate $u = u_{\hat{\Theta}}$ for predicting the solution u of the governing PDE. In the PINN algorithm, the loss function is defined as

$$J(\hat{\Theta}) = W_{\mathcal{F}}\,MSE_{\mathcal{F}} + W_u\,MSE_u, \tag{2.2}$$

where the mean squared error (MSE) is given as

$$MSE_{\mathcal{F}} = \frac{1}{N_f}\sum_{i=1}^{N_f}\left|\mathcal{F}_{\hat{\Theta}}(x_f^i)\right|^2, \qquad MSE_u = \frac{1}{N_u}\sum_{i=1}^{N_u}\left|u^i - u_{\hat{\Theta}}(x_u^i)\right|^2.$$

Here, $\{x_f^i\}_{i=1}^{N_f}$ represents the set of residual points, while $\{x_u^i\}_{i=1}^{N_u}$ represents the training data points. $W_{\mathcal{F}}$ and $W_u$ are the weights for the residual and training data points, respectively, which can be chosen dynamically [14]. The NN solution must satisfy the governing equation given by the residual $\mathcal{F}_{\hat{\Theta}} = \mathcal{F}(u_{\hat{\Theta}})$ evaluated at randomly chosen residual points in the domain. As an example, for the general differential equation of the form $\mathcal{L}u = f$, the residual term is given by $\mathcal{F}_{\hat{\Theta}} := \mathcal{L}u_{\hat{\Theta}} - f$, where $\mathcal{L}$ represents the linear/nonlinear differential operator. To construct the residuals in the loss function, derivatives of the solution with respect to the independent variables are required, which can be computed using automatic differentiation (AD) [15]. AD is an accurate way to calculate derivatives in a computational graph compared with numerical differentiation, since it does not suffer from truncation and round-off errors. Thus, the PINN method is a grid-free method, which does not require a mesh for solving the equations. This constitutes the physics-informed part of the NN, as given by the first term in equation (2.2). The second term in equation (2.2) includes the known boundary/initial conditions and experimental data, which must be satisfied by the NN solution.
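As an illustration of how such a residual can be assembled with automatic differentiation, the following is a minimal sketch assuming PyTorch autograd; the operator used here, $u_{xx} - f$ as a stand-in for a generic $\mathcal{L}u = f$, is only an example and is not a specific equation from the paper.

```python
import torch

def residual(model, x, f):
    """F_Theta(x) = u_xx(x) - f(x) for the illustrative operator L u = u_xx.

    x is a tensor of residual points of shape (N_f, 1); f maps x to the source term.
    """
    x = x.clone().requires_grad_(True)
    u = model(x)
    u_x = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, grad_outputs=torch.ones_like(u_x), create_graph=True)[0]
    return u_xx - f(x)

# MSE_F over randomly chosen residual points x_f:
# mse_F = (residual(model, x_f, f) ** 2).mean()
```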

The resulting optimization problem leads to finding the minimum of the loss function by optimizing the trainable parameters $\hat{\Theta}$. The solution to this minimization problem can be approximated iteratively by one of the forms of the gradient descent algorithm. The stochastic gradient descent (SGD) algorithm is widely used in the machine learning community; see [16] for a complete survey. In this work, the ADAM optimizer [17], which is a variant of the SGD method, is used.

(b). Loss function with slope recovery term

The main motivation for the adaptive activation function is to increase the slope of the activation function, resulting in non-vanishing gradients and fast training of the network. It is clear that one should increase the slope of the activation quickly in order to improve the performance of the NN. Thus, instead of relying only on the optimization methods, another way to achieve this is to include the slope recovery term S(a), defined as

$$S(a) := \begin{cases} \dfrac{1}{\dfrac{1}{D-1}\displaystyle\sum_{k=1}^{D-1}\exp(a^k)} & \text{for L-LAAF,}\\[3ex] \dfrac{1}{\dfrac{1}{D-1}\displaystyle\sum_{k=1}^{D-1}\exp\!\left(\dfrac{\sum_{i=1}^{N_k} a_i^k}{N_k}\right)} & \text{for N-LAAF.}\end{cases}$$

The main reason for this is that such a term contributes to the gradient of the loss function without vanishing. The overall effect of including this term is that it forces the network to increase the value of the activation slope quickly, thereby increasing the training speed.
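A minimal sketch of the slope recovery term, assuming the slope parameters are collected per hidden layer as in the AdaptiveLinear sketch above; the function name slope_recovery is illustrative.

```python
import torch

def slope_recovery(slopes):
    """S(a) = 1 / ( (1/(D-1)) * sum_k exp(m_k) ), where m_k = a^k for L-LAAF
    (scalar slope per layer) and m_k = sum_i a_i^k / N_k for N-LAAF (vector per layer)."""
    means = torch.stack([a.mean() for a in slopes])  # one value per hidden layer
    return 1.0 / means.exp().mean()

# Example: total loss with slope recovery,
#   J_tilde = W_F * MSE_F + W_u * MSE_u + W_a * S(a)
# slopes = [layer.a for layer in model if hasattr(layer, 'a')]
# loss = W_F * mse_F + W_u * mse_u + W_a * slope_recovery(slopes)
```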

Figure 1 shows a sketch of a neuron-wise locally adaptive activation function-based physics-informed neural network (LAAF-PINN), where both the NN part and the physics-informed part can be seen. The activation slopes from every neuron also contribute to the loss function in the form of the slope recovery term. The following algorithm summarizes the LAAF-PINN algorithm with the slope recovery term.

Figure 1. Schematic of LAAF-PINN for the Burgers equation. The left NN is the uninformed network, while the right one, induced by the governing differential equation, is the informed network. The inputs are the coordinates (x, t), while the output is the solution u(x, t), which has to satisfy the governing equation. The two NNs share parameters and both contribute to the loss function. (Online version in colour.)

Algorithm 1: LAAF-PINN algorithm with slope recovery term. —

  • Step 1: Specification of the training set in the computational domain:

    training data for the $u_{\hat{\Theta}}$ network: $\{x_u^i\}_{i=1}^{N_u}$; residual training points for the $\mathcal{F}_{\hat{\Theta}}$ network: $\{x_f^i\}_{i=1}^{N_f}$.

  • Step 2: Construct the neural network $u_{\hat{\Theta}}$ with random initialization of the parameters $\hat{\Theta}$.

  • Step 3: Construct the residual neural network $\mathcal{F}_{\hat{\Theta}}$ by substituting the surrogate $u_{\hat{\Theta}}$ into the governing equations using automatic differentiation and other arithmetic operations.

  • Step 4: Specification of the loss function that includes the slope recovery term:
    $$\tilde{J}(\hat{\Theta}) = \frac{W_{\mathcal{F}}}{N_f}\sum_{i=1}^{N_f}\left|\mathcal{F}_{\hat{\Theta}}(x_f^i)\right|^2 + \frac{W_u}{N_u}\sum_{i=1}^{N_u}\left|u^i - u_{\hat{\Theta}}(x_u^i)\right|^2 + W_a S(a), \tag{2.3}$$

    where $W_a$ is the weight for the slope recovery term.

  • Step 5: Find the best parameters $\hat{\Theta}^{*}$ using a suitable optimization method by minimizing the loss function $\tilde{J}(\hat{\Theta})$ as
    $$\hat{\Theta}^{*} = \arg\min_{\hat{\Theta}\in\hat{V}}\tilde{J}(\hat{\Theta}).$$
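A condensed sketch of how steps 4 and 5 might look in code, assuming PyTorch with the ADAM optimizer; the helpers residual and slope_recovery follow the illustrative sketches given earlier, and the function name train_laaf_pinn is only a placeholder.

```python
import torch

def train_laaf_pinn(model, slopes, x_f, f, x_u, u_data,
                    W_F=1.0, W_u=1.0, W_a=1.0, lr=2e-4, iters=15000):
    """Minimize J~ = W_F*MSE_F + W_u*MSE_u + W_a*S(a) with Adam (steps 4-5 above)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # slopes are part of model.parameters()
    for _ in range(iters):
        opt.zero_grad()
        mse_F = (residual(model, x_f, f) ** 2).mean()   # physics-informed term
        mse_u = ((u_data - model(x_u)) ** 2).mean()     # data mismatch term
        loss = W_F * mse_F + W_u * mse_u + W_a * slope_recovery(slopes)
        loss.backward()
        opt.step()
    return model
```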

3. Gradient dynamics with adaptive activation: convergent points and acceleration of convergence

When compared with the standard method, the adaptive activation method induces a new gradient dynamics, which results in different convergent points and acceleration of the convergence. The following theorem states that a gradient descent algorithm minimizing our objective function $\tilde{J}(\hat{\Theta})$ in (2.3) does not converge to a sub-optimal critical point or a sub-optimal local minimum, for either L-LAAF or N-LAAF, given appropriate initialization and learning rates. For simplicity, $W_{\mathcal{F}}$, $W_u$ and $W_a$ are assumed to be unity. In the following theorem, we treat $\hat{\Theta}$ as a real-valued vector. Let $\tilde{J}_c(0) = MSE_{\mathcal{F}} + MSE_u$ with the constant network $u_{\Theta}(z) = u_{\Theta}(z') = c \in \mathbb{R}^{N_D}$ for all z, z′, where c is a constant.

Theorem 3.1. —

Let $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ be a sequence generated by a gradient descent algorithm as $\hat{\Theta}_{m+1} = \hat{\Theta}_m - \eta_m \nabla\tilde{J}(\hat{\Theta}_m)$. Assume that $\tilde{J}(\hat{\Theta}_0) < \tilde{J}_c(0) + S(0)$ for any $c\in\mathbb{R}^{N_D}$, that $\tilde{J}$ is differentiable, and that for each i ∈ {1, …, Nf} there exist a differentiable function $\varphi_i$ and an input $\rho_i$ such that $|\mathcal{F}_{\hat{\Theta}}(x_f^i)|^2 = \varphi_i(u_{\hat{\Theta}}(\rho_i))$. Assume that at least one of the following three conditions holds.

  • (i)

    (constant learning rate) $\nabla\tilde{J}$ is Lipschitz continuous with Lipschitz constant C (i.e. $\|\nabla\tilde{J}(\hat{\Theta}) - \nabla\tilde{J}(\hat{\Theta}')\|_2 \le C\|\hat{\Theta} - \hat{\Theta}'\|_2$ for all $\hat{\Theta}, \hat{\Theta}'$ in its domain), and $\epsilon \le \eta_m \le (2-\epsilon)/C$, where ϵ is a fixed positive number.

  • (ii)

    (diminishing learning rate) $\nabla\tilde{J}$ is Lipschitz continuous, $\eta_m \to 0$ and $\sum_{m=0}^{\infty}\eta_m = \infty$.

  • (iii)

    (adaptive learning rate) the learning rate ηm is chosen by the minimization rule, the limited minimization rule, the Armijo rule, or the Goldstein rule [18].

Then, for both L-LAAF and N-LAAF, no limit point of $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ is a sub-optimal critical point or a sub-optimal local minimum.

The initial condition $\tilde{J}(\hat{\Theta}_0) < \tilde{J}_c(0) + S(0)$ means that the initial value $\tilde{J}(\hat{\Theta}_0)$ needs to be less than that of a constant network plus the highest value of the slope recovery term. Here, note that $S(1) < S(0)$. The proof of theorem 3.1 is included in appendix A.

We now study how the proposed method approaches the convergent points and why it can accelerate the convergence. To illustrate the principal mechanism behind the acceleration, we compare the gradient dynamics of the proposed method against that of the standard method in the domain of J. The gradient dynamics of the standard method in the domain of J is

$$\Theta_{m+1} = \Theta_m - \eta_m\nabla J(\Theta_m), \tag{3.1}$$

and generates the sequence $(J(\Theta_m))_{m\in\mathbb{N}}$ of the standard objective values. The gradient dynamics of the adaptive activation method in the domain of $\tilde{J}$ is $\hat{\Theta}_{m+1} = \hat{\Theta}_m - \eta_m\nabla\tilde{J}(\hat{\Theta}_m)$, and generates the sequence $(\tilde{J}(\hat{\Theta}_m))_{m\in\mathbb{N}}$ of the modified objective values. These dynamics live in two different spaces, the domains of J and $\tilde{J}$. To compare them, we translate the dynamics $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ in the domain of $\tilde{J}$ to the dynamics $(\tilde{\Theta}_m)_{m\in\mathbb{N}}$ in the domain of J.

More concretely, we show that the gradient dynamics $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ of the global adaptive activation method generates the sequence $(J(\tilde{\Theta}_m))_{m\in\mathbb{N}}$ of the standard objective values, where

$$\tilde{\Theta}_{m+1} = \tilde{\Theta}_m - \eta_m\hat{G}(\hat{\Theta}_{m+1})\nabla J(\tilde{\Theta}_m) + \eta_m^2\hat{H}_J(\tilde{\Theta}_m)\tilde{\Theta}_m, \tag{3.2}$$

where $\tilde{\Theta}_{m+1} \in \mathrm{dom}(J)$ is $\hat{\Theta}_{m+1} \in \mathrm{dom}(\tilde{J})$ translated into the space of $\Theta_{m+1}$, which is $\mathrm{dom}(J)$ (the domain of J),

$$\hat{G}(\hat{\Theta}_{m+1}) = (a_m)^2 I + W_m(W_m)^{\top}$$

and

$$\hat{H}_J(\tilde{\Theta}_m) = \nabla J(\tilde{\Theta}_m)\nabla J(\tilde{\Theta}_m)^{\top}.$$

Comparing equations (3.1) and (3.2), we can see that the gradient dynamics of the adaptive activation modifies the standard dynamics by multiplying the gradient by a conditioning matrix $\hat{G}(\hat{\Theta}_{m+1})$ and by adding the approximate second-order term $\eta_m^2\hat{H}_J(\tilde{\Theta}_m)\tilde{\Theta}_m$. This provides the mathematical intuition of why the global adaptive activation method can accelerate the convergence and why it is not equivalent to changing or adapting learning rates.

To understand the approximate second-order term $\eta_m^2\hat{H}_J(\tilde{\Theta}_m)\tilde{\Theta}_m$, note that the gradient dynamics of the standard method (3.1) can be viewed as the simplest discretization (Euler's method) of the differential equation of the gradient flow,

$$\dot{\Theta} = -\nabla J(\Theta), \tag{3.3}$$

where we approximate $\Theta_{t_1} = \Theta_{t_0} - \int_{t\in[t_0,t_1]}\nabla J(\Theta_t)\,\mathrm{d}t$ by setting $\nabla J(\Theta_t) \approx \nabla J(\Theta_{t_0})$ for $t\in[t_0,t_1]$. In this view, we can consider other discretization procedures of (3.3). For example, instead of setting $\nabla J(\Theta_t) \approx \nabla J(\Theta_{t_0})$, we can approximate

$$\nabla J(\Theta_t) \approx \nabla J(\Theta_{t_0}) + H_J(\Theta_{t_0})(\Theta_t - \Theta_{t_0}),$$

where $H_J$ is the Hessian of J. The term $\eta_m^2\hat{H}_J(\tilde{\Theta}_m)\tilde{\Theta}_m$ in (3.2) can be obtained by further approximating the second term by setting $H_J(\Theta_{t_0}) \approx \hat{H}_J(\tilde{\Theta}_{t_0})$ and $\Theta_t - \Theta_{t_0} \approx -\eta_m\tilde{\Theta}_{t_0}$.
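The identity (3.2) can be checked numerically on a toy objective; the following sketch uses a simple quadratic J only to exercise the algebra and is not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q = rng.standard_normal((d, d)); Q = Q @ Q.T + np.eye(d)   # toy J(theta) = 0.5 theta^T Q theta
grad_J = lambda theta: Q @ theta

a, W, eta = 1.3, rng.standard_normal(d), 1e-2
theta = a * W                                               # Theta~ = a W (GAAF)

# one GAAF step on (a, W): chain rule gives grad_a = W^T grad_J, grad_W = a grad_J
g = grad_J(theta)
a_new = a - eta * (W @ g)
W_new = W - eta * a * g
lhs = a_new * W_new                                         # Theta~_{m+1}

# right-hand side of (3.2): G_hat = a^2 I + W W^T, H_hat = g g^T
G_hat = a**2 * np.eye(d) + np.outer(W, W)
rhs = theta - eta * G_hat @ g + eta**2 * np.outer(g, g) @ theta
print(np.allclose(lhs, rhs))                                # True
```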

More generally, we show that the gradient dynamics $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ of any adaptive activation method generates the sequence $(J(\tilde{\Theta}_m))_{m\in\mathbb{N}}$ of the standard objective values, where

$$\tilde{\Theta}_{m+1} = \tilde{\Theta}_m - \eta_m G(\hat{\Theta}_{m+1})\nabla J(\tilde{\Theta}_m), \tag{3.4}$$

with

$$G(\hat{\Theta}_{m+1}) = \mathrm{diag}\!\left((Aa_m)^{2}\right) + \mathrm{diag}(W_m)AA^{\top}\mathrm{diag}(W_m) - \eta_m\,\mathrm{diag}\!\left(V(\hat{\Theta}_{m+1})\right)$$

and

$$V(\hat{\Theta}_{m+1}) = \mathrm{diag}(Aa_m)AA^{\top}\mathrm{diag}(W_m)\nabla J(\tilde{\Theta}_m).$$

Here, given a vector $v\in\mathbb{R}^d$, $\mathrm{diag}(v)\in\mathbb{R}^{d\times d}$ represents the diagonal matrix with $\mathrm{diag}(v)_{ii} = v_i$, and $v^2$ represents $v\circ v$, where $v\circ u$ is the element-wise product of the two vectors v and u. The matrix A differs for different methods of adaptive activation functions with different types of locality, and is a fixed matrix given a method of GAAF, L-LAAF or N-LAAF. For example, in the case of GAAF, $d' = 1$ and $A = (1, 1, \ldots, 1)^{\top}\in\mathbb{R}^{d}$. By plugging this A into (3.4) and noticing that $\mathrm{diag}(W_m)AA^{\top}\mathrm{diag}(W_m) = W_m(W_m)^{\top}$ with this A, we can obtain (3.2) from (3.4); i.e. (3.4) is a strictly more general version of (3.2). In the general case of any adaptive activation method with different types of locality, we can write $\tilde{\Theta}_{m+1} = Aa_{m+1}\circ W_{m+1}$, where $W_{m+1}\in\mathbb{R}^{d}$ and $a_{m+1}\in\mathbb{R}^{d'}$ with some matrix $A\in\mathbb{R}^{d\times d'}$. In the case of L-LAAF, d′ is the number of layers and $A\in\mathbb{R}^{d\times d'}$ is the matrix that satisfies $\tilde{\Theta}_{m+1} = Aa_{m+1}\circ W_{m+1}$ for L-LAAF. In the case of N-LAAF, d′ is the number of all neurons and $A\in\mathbb{R}^{d\times d'}$ is the matrix that satisfies $\tilde{\Theta}_{m+1} = Aa_{m+1}\circ W_{m+1}$ for N-LAAF.

Comparing equations (3.1) and (3.4), we can see that the gradient dynamics of different adaptive activation methods modify the standard dynamics by multiplying the gradient by different conditioning matrices $G(\hat{\Theta}_{m+1})$, induced by different matrices A. This provides the mathematical intuition of why various adaptive activation methods can accelerate the convergence in different ways with different matrices A, and why they are not equivalent to changing or adapting learning rates. Moreover, our analysis with (3.4) is applicable to any adaptive activation method beyond GAAF, L-LAAF and N-LAAF, and provides insights for designing new adaptive activation methods that correspond to new matrices A in (3.4) in order to further accelerate the convergence.

The conditioning matrix $G(\hat{\Theta}_{m+1})$ is positive definite for sufficiently small learning rate $\eta_m$, since $\mathrm{diag}((Aa_m)^2)$ is positive definite and $\mathrm{diag}(W_m)AA^{\top}\mathrm{diag}(W_m)$ is positive semi-definite (when every entry of $a_m$ is non-zero). Therefore, with sufficiently small learning rate $\eta_m$, the parameter update in equation (3.4) decreases the value of J, in the sense that $J(\tilde{\Theta}_{m+1}) < J(\tilde{\Theta}_m)$ at differentiable points whenever $\|\nabla J(\tilde{\Theta}_m)\|\neq 0$. This is because $J(\tilde{\Theta}_{m+1}) = J(\tilde{\Theta}_m) - \eta_m\nabla J(\tilde{\Theta}_m)^{\top}G(\hat{\Theta}_{m+1})\nabla J(\tilde{\Theta}_m) + \eta_m\varphi(\eta_m)$ at differentiable points, with some function φ such that $\lim_{\eta_m\to 0}\varphi(\eta_m) = 0$.

We now derive equations (3.2) and (3.4), and explain the definition of each symbol in more detail. Let us first focus on the GAAF method without the recovery term. Let $\hat{J}$ be $\tilde{J}$ without the recovery term. Let $\tilde{\Theta} = aW$, where W consists of all standard weight and bias parameters $\{w^k, b^k\}_{k=1}^{D}$. Then, we have that $\hat{J}(\hat{\Theta}) = J(\tilde{\Theta})$. While the standard method generates the sequence of standard objective values $(J(\Theta_{m+1}))_{m\in\mathbb{N}}$, the GAAF method generates the sequence of standard objective values $(J(\tilde{\Theta}_m))_{m\in\mathbb{N}}$. To compare the two methods in terms of the same standard objective values, we are thus interested in the following gradient dynamics of the proposed method:

$$\begin{aligned}\tilde{\Theta}_{m+1} &= a_{m+1}W_{m+1} = \left(a_m - \eta_m\nabla_a J(\tilde{\Theta}_m)\right)\left(W_m - \eta_m\nabla_W J(\tilde{\Theta}_m)\right)\\ &= \tilde{\Theta}_m - \eta_m a_m\nabla_W J(\tilde{\Theta}_m) - \eta_m\nabla_a J(\tilde{\Theta}_m)W_m + \eta_m^2\nabla_a J(\tilde{\Theta}_m)\nabla_W J(\tilde{\Theta}_m).\end{aligned} \tag{3.5}$$

Here, we have that $\nabla_W J(\tilde{\Theta}) = \left((\partial J(\tilde{\Theta})/\partial\tilde{\Theta})\,\partial\tilde{\Theta}/\partial W\right)^{\top}$ and $\nabla_a J(\tilde{\Theta}) = \left((\partial J(\tilde{\Theta})/\partial\tilde{\Theta})\,\partial\tilde{\Theta}/\partial a\right)^{\top}$ by the chain rule, where $\partial J(\tilde{\Theta})/\partial\tilde{\Theta} = \nabla J(\tilde{\Theta})^{\top}$, $\partial\tilde{\Theta}/\partial W = aI$ (I is the identity matrix) and $\partial\tilde{\Theta}/\partial a = W$. By plugging these into each term in the last line of (3.5), we obtain

$$a_m\nabla_W J(\tilde{\Theta}_m) = (a_m)^2\nabla J(\tilde{\Theta}_m)$$

and

$$\nabla_a J(\tilde{\Theta}_m)W_m = W_m(W_m)^{\top}\nabla J(\tilde{\Theta}_m)$$

and

$$\nabla_a J(\tilde{\Theta}_m)\nabla_W J(\tilde{\Theta}_m) = \nabla J(\tilde{\Theta}_m)\nabla J(\tilde{\Theta}_m)^{\top}\tilde{\Theta}_m,$$

which gives (3.2) as desired. Similarly to the case of GAAF, in the general case we have

$$\begin{aligned}\tilde{\Theta}_{m+1} &= Aa_{m+1}\circ W_{m+1} = \left(A\left(a_m - \eta_m\nabla_a J(\tilde{\Theta}_m)\right)\right)\circ\left(W_m - \eta_m\nabla_W J(\tilde{\Theta}_m)\right)\\ &= \tilde{\Theta}_m - \eta_m\left(Aa_m\circ\nabla_W J(\tilde{\Theta}_m)\right) - \eta_m\left(A\nabla_a J(\tilde{\Theta}_m)\circ W_m\right) + \eta_m^2\left(A\nabla_a J(\tilde{\Theta}_m)\circ\nabla_W J(\tilde{\Theta}_m)\right).\end{aligned} \tag{3.6}$$

By using $\nabla_W J(\tilde{\Theta}) = \left((\partial J(\tilde{\Theta})/\partial\tilde{\Theta})\,\partial\tilde{\Theta}/\partial W\right)^{\top}$ and $\nabla_a J(\tilde{\Theta}) = \left((\partial J(\tilde{\Theta})/\partial\tilde{\Theta})\,\partial\tilde{\Theta}/\partial a\right)^{\top}$, we obtain

$$Aa_m\circ\nabla_W J(\tilde{\Theta}_m) = \mathrm{diag}(Aa_m\circ Aa_m)\nabla J(\tilde{\Theta}_m), \tag{3.7}$$
$$A\nabla_a J(\tilde{\Theta}_m)\circ W_m = \mathrm{diag}(W_m)AA^{\top}\mathrm{diag}(W_m)\nabla J(\tilde{\Theta}_m) \tag{3.8}$$
and
$$A\nabla_a J(\tilde{\Theta}_m)\circ\nabla_W J(\tilde{\Theta}_m) = \mathrm{diag}\!\left(\mathrm{diag}(Aa_m)AA^{\top}\mathrm{diag}(W_m)\nabla J(\tilde{\Theta}_m)\right)\nabla J(\tilde{\Theta}_m). \tag{3.9}$$

By plugging these into (3.6), we obtain (3.4).

Figure 2 illustrates the effect of adaptive activation methods on the objective value and the normalized condition number. It shows that the adaptive activation methods accelerated the convergence of the objective value while decreasing the condition number. The improvement of the condition number roughly coincided with the improvement of the objective value. The condition number of interest is (the largest singular value of M)/(the smallest singular value of M), where $M = G(\hat{\Theta})^{1/2}\nabla^2 J(\tilde{\Theta})G(\hat{\Theta})^{1/2}$ for the adaptive activation function methods and $M = \nabla^2 J(\tilde{\Theta})$ for the standard method without adaptive activations. As this condition number gets smaller, the convergence rate of the objective value can improve when the matrix M is normal [18]. The normalized condition numbers in each subfigure are the minimum condition numbers over the previous epochs, and were normalized by dividing the condition numbers by the initial condition number of the standard method. For this experiment, we set n = 1, and we used a fixed dataset generated by sklearn.datasets.make_circles with n_samples = 1000, noise = 0.01, random_state = 0 and factor = 0.7. This dataset is not linearly separable. We adopted a fully connected NN with a single hidden layer. The standard cross entropy loss was used for training and for the plots.
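A sketch of this experimental setup follows; the dataset generation is exactly as stated in the text, while the single-hidden-layer classifier is an assumed PyTorch rendering (shown here with neuron-wise slopes; L-LAAF would use a single scalar slope).

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_circles

# fixed, not linearly separable dataset, as stated in the text
X, y = make_circles(n_samples=1000, noise=0.01, random_state=0, factor=0.7)
X, y = torch.tensor(X, dtype=torch.float32), torch.tensor(y)

width = 20                                   # number of units in the hidden layer
a = nn.Parameter(torch.ones(width))          # adaptive slopes, initialized so that n*a = 1 (n = 1)
lin1, lin2 = nn.Linear(2, width), nn.Linear(width, 2)

def forward(x):
    return lin2(torch.sigmoid(a * lin1(x)))  # adaptive-slope sigmoid hidden layer

loss = nn.CrossEntropyLoss()(forward(X), y)  # standard cross entropy, as used for training/plots
```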

Figure 2. Effect of adaptive activations on objective value and normalized condition number (i.e. the condition number divided by the initial condition number of the standard method). In each subfigure, width = the number of units in the hidden layer. (a) Sigmoid, width = 10, (b) Sigmoid, width = 20, (c) ReLU, width = 20, (d) ReLU, width = 100, (e) Softplus, width = 100. (Online version in colour.)

4. Computational results

In this section, we solve a function approximation problem using a deep NN and inverse PDE problems involving the two-dimensional Poisson and inviscid Burgers equations using the PINN algorithm with the proposed methods. Some standard deep learning benchmark problems are also solved. The performance of the proposed methods is evaluated in terms of convergence speed and accuracy.

(a). Neural network approximation of nonlinear discontinuous function

In this test case, a deep NN (without the physics-informed part) is used to approximate a discontinuous function. Here, the loss function consists of the data mismatch and the slope recovery term. The following discontinuous function, with a discontinuity at the location x = 0, is approximated by a deep NN:

$$u(x) = \begin{cases}0.2\sin(6x) & \text{if } x \le 0,\\ 1 + 0.1x\cos(18x) & \text{otherwise.}\end{cases}$$

This function contains both high- and low-frequency components along with the discontinuity. The domain is [ − 3, 3] and the number of training points used is 300, chosen randomly. The activation function is tanh, the learning rate is 2.0 × 10−4 and the number of hidden layers is four, with 50 neurons in each layer. The scaling factor is 10 and both Wu and Wa are unity. Figure 3 shows the solution (first column) and the point-wise absolute error in log scale (second column). The solution by the standard fixed activation function is given in the first row and the GAAF solution in the second row, whereas the third and fourth rows show the solutions given by L-LAAF and N-LAAF, respectively. We see that both L-LAAF and N-LAAF with the slope recovery term accelerate the training speed compared with the other methods. We also note that the GAAF solution with the slope recovery term (not shown) is comparable with the proposed methods in terms of training speed, and it can be considered a new contribution owing to the involvement of the slope recovery term.
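A sketch of the target function and training data for this test case follows; the function and the counts are as stated above, while the uniform sampling call is an assumed realization of "chosen randomly".

```python
import numpy as np

def u(x):
    """Discontinuous target: 0.2*sin(6x) for x <= 0, 1 + 0.1*x*cos(18x) otherwise."""
    return np.where(x <= 0, 0.2 * np.sin(6 * x), 1.0 + 0.1 * x * np.cos(18 * x))

rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=300)   # 300 random points in [-3, 3]
y_train = u(x_train)
# a 4x50 tanh network with scaling factor n = 10 and learning rate 2e-4 is then
# fitted to (x_train, y_train), with the slope recovery term added to the data-mismatch loss
```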

Figure 3. Discontinuous function: NN approximation of the function at 2000, 8000 and 15 000 iterations using the standard fixed activation (first row), GAAF (second row), L-LAAF (third row) and N-LAAF (fourth row) with the tanh activation. The first column shows the solution. The second column gives the point-wise absolute error in log scale for all the cases. (Online version in colour.)

(b). Inverse problem: 2D Poisson’s equation

In this example, we identify the unknown parameter in the diffusion coefficient. This example is taken from Pakravan et al. [19], where a variable diffusion coefficient parametrized as $D(x;\alpha) = 1 + \alpha x$ needs to be evaluated, with randomly chosen values of α ∈ [0.05, 0.95]. The computational domain is $\Omega = [-1/2, 1/2]^2$ and the governing Poisson equation is given by

$$\nabla\cdot\left([1+\alpha x]\nabla u\right) + x + y = 0, \quad (x,y)\in\Omega,$$

with boundary condition $u_b = \cos(\pi x)\cos(\pi y)$, (x, y) ∈ ∂Ω.

The feed-forward NN, using three hidden layers with 30 neurons in each layer, is trained over 500 generated solutions obtained by randomly choosing the parameter α from the given range. The hyperbolic tangent activation function is used with a learning rate of 0.0008. The value of the scaling factor is unity in all cases, and $W_{\mathcal{F}} = 1$, Wu = 10, Wa = 10. We then tested its performance on 50 independent solution fields to identify the parameter α using the fixed activation, GAAF, L-LAAF and N-LAAF, without and with 2.5% Gaussian noise. To show the convergence in the early training period, the results are plotted after 4000 iterations with the clean data and with the noisy data. Figure 4 shows the results of the standard activation, GAAF, L-LAAF and N-LAAF (top to bottom rows) without noise (first column) and with Gaussian noise (second column). Both L-LAAF and N-LAAF perform better than GAAF, and the learned parameters are in good agreement with the true values. Table 1 shows the relative L2 error in all the cases without noise; among all methods, L-LAAF gives the least error.

Figure 4. Inverse problem, 2D Poisson equation: the standard activation, GAAF, L-LAAF and N-LAAF (top to bottom rows) without noise (first column) and with Gaussian noise (second column). (Online version in colour.)

Table 1.

The relative L2 error in all the cases without noise.

standard Acti. GAAF L-LAAF N-LAAF
Rel. L2 error 1.012 × 10−1 1.654 × 10−2 7.328 × 10−3 1.482 × 10−2

(c). Inverse problem: 2D inviscid Burgers equation

Our next example is the inverse problem of identifying the viscosity coefficient in the Burgers equation. The solution of the inviscid Burgers equation can be discontinuous even for a sufficiently smooth initial condition. The two-dimensional inviscid Burgers equation is given as

$$u_t + uu_x + u_y = 0, \quad x\in[-0.2, 1],\ y\in[0, 1]\ \text{and}\ t > 0,$$

subject to boundary conditions

$$u(x, 0) = \begin{cases} a & \text{if } x < 0,\\ b & \text{otherwise,}\end{cases}$$

$u(-0.2, y) = a$ and $u(1, y) = b$ for all y. The exact solution for the case a = 2 and b = 0 is given in [20]; it has a steady oblique discontinuity.

In this work, we used the recently proposed conservative PINN (cPINN) method [13], which is essentially a domain decomposition approach in PINN for conservation laws. The computational domain is divided into 12 sub-domains and a separate PINN is employed in each sub-domain, all working in tandem. The solutions in the sub-domains are stitched together using the interface conditions, which include the enforcement of conservative fluxes and the average solution along the common interfaces. The division of the computational domain into 12 sub-domains is shown in figure 5. The interface locations on the x- and y-axes are [0.2, 0.6] and [0.25, 0.5, 0.75], respectively.

Figure 5. Numbering of the 12 sub-domains.

In the cPINN algorithm, instead of supplying the original equation, the following parametrized viscous Burgers equation is supplied:

$$u_t + uu_x + u_y = \nu\,(u_{xx} + u_{yy}),$$

and we aim to identify the value of the viscosity coefficient ν, which is zero for the inviscid Burgers equation. For this test case, we used the hyperbolic tangent activation function, a learning rate of 0.0006 and a scaling factor of 5 in all cases; table 2 gives the number of neurons, the number of hidden layers and the number of residual points in each sub-domain. In this case, $W_{\mathcal{F}} = 1$, Wu = 10 and Wa = 20 are used. The initial values of ν are chosen arbitrarily in the 12 sub-domains as [1, 2, 3, 4, −5, 6, −7, 8, 9, 10, 11, 12]/2, respectively. We note that cPINN is a robust method even in the case of negative values of viscosity, which no standard solver would tolerate.

Table 2.

NN architecture in each sub-domains for the two-dimensional inviscid Burgers equation.

domain no. 1 2 3 4 5 6 7 8 9 10 11 12
# layers 6 6 3 6 6 3 6 3 3 3 3 3
# neurons 20 20 20 20 20 20 20 20 20 20 20 20
# residual pts. 2200 2200 400 600 2200 800 400 2200 2200 400 800 2200
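Before turning to the results, here is a minimal sketch of how the parametrized residual with a trainable viscosity could be formed inside each sub-domain network, assuming PyTorch autograd; the cPINN interface terms are omitted and the function name burgers_residual is illustrative.

```python
import torch

nu = torch.nn.Parameter(torch.tensor(0.5))   # e.g. initial value 1/2 in sub-domain 1

def burgers_residual(model, x, y, t):
    """Residual of u_t + u u_x + u_y - nu*(u_xx + u_yy) for the parametrized viscous form."""
    x, y, t = (v.clone().requires_grad_(True) for v in (x, y, t))
    u = model(torch.stack([x, y, t], dim=-1)).squeeze(-1)
    grad = lambda out, var: torch.autograd.grad(out, var, grad_outputs=torch.ones_like(out),
                                                create_graph=True)[0]
    u_t, u_x, u_y = grad(u, t), grad(u, x), grad(u, y)
    u_xx, u_yy = grad(u_x, x), grad(u_y, y)
    return u_t + u * u_x + u_y - nu * (u_xx + u_yy)
```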

All the simulations are performed for up to 40 k iterations. The number and locations of training data points (300 pts) and residual points (8000 pts) are fixed in all cases. Figure 6 shows the variation of ν (first column) and the point-wise error (second column) for the fixed activation, GAAF, L-LAAF and N-LAAF (top to bottom rows). In all cases, ν converges to its actual value, which is zero. The cPINN algorithm using the fixed activation function takes 14 750 iterations to converge, whereas GAAF, L-LAAF and N-LAAF take 13 600, 11 090 and 11 170 iterations, respectively. Among all the cases, L-LAAF gives the smallest absolute point-wise error, as shown in the figure. In this test case, both N-LAAF and L-LAAF with the slope recovery term perform better than GAAF in terms of convergence speed and accuracy of the solution.

Figure 6. Inverse problem, 2D Burgers equation: variation of ν (first column) and point-wise error (second column) for the standard (fixed) activation, GAAF, L-LAAF and N-LAAF (top to bottom rows). (Online version in colour.)

(d). Standard deep learning benchmark problems

The previous sub-sections demonstrate the advantages of adaptive activation functions with PINN for physics-related problems. One of the remaining questions is whether or not the advantage of adaptive activations persists with standard deep NNs for other types of deep learning applications. To explore this question, this section presents numerical results for various standard benchmark problems in deep learning.

MNIST [21], Fashion-MNIST [22] and KMNIST [23] are datasets of handwritten digits, images of clothing and accessories, and Japanese characters, respectively. Apart from MNIST, Semeion [24] is another handwritten digit dataset, containing 1593 digits collected from 80 persons. SVHN [25] is a dataset of street view house numbers obtained from Google Street View images. CIFAR [26] is a popular dataset containing colour images. In particular, the CIFAR-10 dataset contains 50 000 training and 10 000 testing images in 10 classes with an image resolution of 32 × 32. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes with 600 images in each class.

Figures 7 and 8 show the mean values and the uncertainty intervals of the training losses for the fixed activation function (standard), GAAF, L-LAAF and N-LAAF, using the standard deep learning benchmarks. The solid and dashed lines are the mean values over three random trials with random seeds. The shaded regions represent intervals of 2 × (the sample standard deviation) for each method. Figures 7 and 8 consistently show that adaptive activation accelerates the minimization of the training loss values. Here, all of GAAF, L-LAAF and N-LAAF use the slope recovery term, which improved upon the corresponding methods without the recovery term.

Figure 7. Training loss in log scale versus epoch without data augmentation. (a) Semeion, (b) MNIST, (c) KMNIST, (d) Fashion-MNIST, (e) CIFAR-10, (f) CIFAR-100, (g) SVHN. (Online version in colour.)

Figure 8. Training loss in log scale versus epoch with data augmentation. (a) Semeion, (b) MNIST, (c) KMNIST, (d) Fashion-MNIST, (e) CIFAR-10, (f) CIFAR-100, (g) SVHN. (Online version in colour.)

The standard cross entropy loss was used for training and for the plots. We used a pre-activation ResNet with 18 layers [27] for the CIFAR-10, CIFAR-100 and SVHN datasets, whereas we used a standard variant of LeNet [21] with ReLU for the other datasets; i.e. the architecture of the variant of LeNet consists of the following five layers (with three hidden layers): (1) input layer, (2) convolutional layer with 64 5 × 5 filters, followed by max pooling of size 2 × 2 and ReLU, (3) convolutional layer with 64 5 × 5 filters, followed by max pooling of size 2 × 2 and ReLU, (4) fully connected layer with 1014 output units, followed by ReLU, and (5) fully connected layer with the number of output units equal to the number of target classes. All hyper-parameters were fixed a priori across all the different datasets and models. We fixed the mini-batch size to be 64, the initial learning rate to be 0.01 and the momentum coefficient to be 0.9. The learning rate was divided by 10 at the beginning of the 10th epoch for all experiments (with and without data augmentation), and at the beginning of the 100th epoch for those with data augmentation. For the convolutional layers, L-LAAF shares the same single parameter across all pixels and channels within each layer, whereas N-LAAF has as many extra parameters as the number of pixels in every channel. In this section, we used scaling factors n = 1 and 2, because an NN with ReLU activation represents a homogeneous function, and hence the behaviour for other values of n can be achieved by changing the learning rate in this case. Note that this is not true for the PINN experiments in the previous sections.
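A sketch of the LeNet variant described above with layer-wise adaptive slopes, assuming a PyTorch rendering; the class name, the zero-padding (chosen so the code runs on 1 × 28 × 28 MNIST-type inputs) and the ordering of pooling and activation are illustrative choices not prescribed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLAAFLeNet(nn.Module):
    """Variant of LeNet with layer-wise adaptive slopes on the ReLU activations (n = 1)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 5, padding=2)
        self.conv2 = nn.Conv2d(64, 64, 5, padding=2)
        self.fc1 = nn.Linear(64 * 7 * 7, 1014)
        self.fc2 = nn.Linear(1014, num_classes)
        self.a = nn.Parameter(torch.ones(3))   # one slope per hidden layer, shared within the layer

    def forward(self, x):
        x = F.relu(self.a[0] * F.max_pool2d(self.conv1(x), 2))
        x = F.relu(self.a[1] * F.max_pool2d(self.conv2(x), 2))
        x = F.relu(self.a[2] * self.fc1(x.flatten(1)))
        return self.fc2(x)

# model = LLAAFLeNet()
# opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # hyper-parameters from the text
```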

5. Conclusion

In summary, we presented two versions of locally adaptive activation functions, namely layer-wise and neuron-wise locally adaptive activation functions. Such local activation functions further improve the training speed of the NN compared with their global predecessor. To further accelerate the training process, an activation slope-based slope recovery term is added to the loss function for both layer-wise and neuron-wise activation functions, which is shown to enhance the performance of the NN. To verify our claims, a function approximation problem is solved using a deep NN, and two inverse PDE problems are solved using PINNs, demonstrating that the locally adaptive activations outperform both fixed and global adaptive activations in terms of training speed and accuracy. Moreover, although the proposed formulations increase the number of parameters compared with the fixed activation function, the overall computational cost is comparable. The proposed adaptive activation functions with the slope recovery term are also shown to accelerate the training process in standard deep learning benchmark problems. We theoretically prove that no sub-optimal critical point or local minimum attracts gradient descent algorithms in the proposed methods (L-LAAF and N-LAAF) with the slope recovery term under only mild assumptions. We also show that the gradient dynamics of the proposed methods are not equivalent to the dynamics of the base method with any (adaptive) learning rates. Instead, the proposed methods are equivalent to modifying the gradient dynamics of the base method by implicitly multiplying conditioning matrices to the gradients of the base method. The explicit computation of such matrix–vector products would be too expensive for NNs, whereas our adaptive activation functions avoid these explicit computations efficiently.

Appendix A. Proof of theorem 3.1

We first prove the statement by contradiction for L-LAAF. Suppose that a parameter vector $\hat{\Theta}$ consisting of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a^k\}_{k=1}^{D-1}$ is a limit point of $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ and a sub-optimal critical point or a sub-optimal local minimum.

Let $f_i := \varphi_i(u_{\hat{\Theta}}(\rho_i))$ and $u_i := |u^i - u_{\hat{\Theta}}(x_u^i)|^2$. Let $z_{f_i,k}$ and $z_{u_i,k}$ be the outputs of the kth layer for $\rho_i$ and $x_u^i$, respectively. Define

$$h_{f_i,k,j} := na^k\left(w_{k,j}\,z_{f_i,k-1} + b_{k,j}\right)\in\mathbb{R}$$

and

$$h_{u_i,k,j} := na^k\left(w_{k,j}\,z_{u_i,k-1} + b_{k,j}\right)\in\mathbb{R},$$

for all j ∈ {1, …, Nk}, where $w_{k,j}\in\mathbb{R}^{1\times N_{k-1}}$ and $b_{k,j}\in\mathbb{R}$.

Following the proofs in [18, Propositions 1.2.1–1.2.4], we have that $\nabla\tilde{J}(\hat{\Theta}) = 0$ and $\tilde{J}(\hat{\Theta}) < \tilde{J}_c(0) + S(0)$ for all three cases of the conditions corresponding to the different rules for the learning rate. Therefore, we have that, for all k ∈ {1, …, D − 1},

$$\begin{aligned}\frac{\partial\tilde{J}(\hat{\Theta})}{\partial a^k} &= \frac{\partial S(a)}{\partial a^k} + \frac{n}{N_f}\sum_{i=1}^{N_f}\sum_{j=1}^{N_k}\frac{\partial f_i}{\partial h_{f_i,k,j}}\left(w_{k,j}z_{f_i,k-1} + b_{k,j}\right) + \frac{n}{N_u}\sum_{i=1}^{N_u}\sum_{j=1}^{N_k}\frac{\partial u_i}{\partial h_{u_i,k,j}}\left(w_{k,j}z_{u_i,k-1} + b_{k,j}\right)\\ &= \frac{\partial S(a)}{\partial a^k} + \sum_{j=1}^{N_k}\left(\frac{n}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}}\left(w_{k,j}z_{f_i,k-1} + b_{k,j}\right) + \frac{n}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}}\left(w_{k,j}z_{u_i,k-1} + b_{k,j}\right)\right) = 0. \end{aligned} \tag{A 1}$$

Furthermore, we have that for all k ∈ {1, …, D − 1} and all j ∈ {1, …, Nk},

$$\frac{\partial\tilde{J}(\hat{\Theta})}{\partial w_{k,j}} = \frac{na^k}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}}(z_{f_i,k-1})^{\top} + \frac{na^k}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}}(z_{u_i,k-1})^{\top} = 0 \tag{A 2}$$

and

$$\frac{\partial\tilde{J}(\hat{\Theta})}{\partial b_{k,j}} = \frac{na^k}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}} + \frac{na^k}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}} = 0. \tag{A 3}$$

By combining (A 1)–(A 3), for all k ∈ {1, …, D − 1},

$$\begin{aligned}0 = a^k\frac{\partial\tilde{J}(\hat{\Theta})}{\partial a^k} &= a^k\frac{\partial S(a)}{\partial a^k} + \sum_{j=1}^{N_k}\left(\frac{na^k}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}}\left(w_{k,j}z_{f_i,k-1} + b_{k,j}\right) + \frac{na^k}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}}\left(w_{k,j}z_{u_i,k-1} + b_{k,j}\right)\right)\\ &= a^k\frac{\partial S(a)}{\partial a^k} + \sum_{j=1}^{N_k}\left(w_{k,j}\left(\frac{\partial\tilde{J}(\hat{\Theta})}{\partial w_{k,j}}\right)^{\top} + b_{k,j}\frac{\partial\tilde{J}(\hat{\Theta})}{\partial b_{k,j}}\right) = a^k\frac{\partial S(a)}{\partial a^k}.\end{aligned}$$

Therefore,

$$0 = a^k\frac{\partial S(a)}{\partial a^k} = -\,a^k(D-1)\left(\sum_{k'=1}^{D-1}\exp(a^{k'})\right)^{-2}\exp(a^k),$$

which implies that $a^k = 0$ for all k, since $(D-1)\left(\sum_{k'=1}^{D-1}\exp(a^{k'})\right)^{-2}\exp(a^k)\neq 0$. This implies that $\tilde{J}(\hat{\Theta}) = \tilde{J}_c(0) + S(0)$, which contradicts $\tilde{J}(\hat{\Theta}) < \tilde{J}_c(0) + S(0)$. This proves the desired statement for L-LAAF.

For N-LAAF, we again prove the statement by contradiction. Suppose that a parameter vector $\hat{\Theta}$ consisting of $\{w^k, b^k\}_{k=1}^{D}$ and $\{a_j^k\}_{k=1}^{D-1}$, $j = 1, 2, \ldots, N_k$, is a limit point of $(\hat{\Theta}_m)_{m\in\mathbb{N}}$ and a sub-optimal critical point or a sub-optimal local minimum. Redefine

$$h_{f_i,k,j} := na_j^k\left(w_{k,j}\,z_{f_i,k-1} + b_{k,j}\right)\in\mathbb{R}$$

and

$$h_{u_i,k,j} := na_j^k\left(w_{k,j}\,z_{u_i,k-1} + b_{k,j}\right)\in\mathbb{R},$$

for all j ∈ {1, …, Nk}, where $w_{k,j}\in\mathbb{R}^{1\times N_{k-1}}$ and $b_{k,j}\in\mathbb{R}$. Then, by the same proof steps, we have that $\nabla\tilde{J}(\hat{\Theta}) = 0$ and $\tilde{J}(\hat{\Theta}) < \tilde{J}_c(0) + S(0)$ for all three cases of the conditions corresponding to the different rules for the learning rate. Therefore, we have that, for all k ∈ {1, …, D − 1} and all j ∈ {1, …, Nk},

$$\frac{\partial\tilde{J}(\hat{\Theta})}{\partial a_j^k} = \frac{n}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}}\left(w_{k,j}z_{f_i,k-1} + b_{k,j}\right) + \frac{n}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}}\left(w_{k,j}z_{u_i,k-1} + b_{k,j}\right) + \frac{\partial S(a)}{\partial a_j^k} = 0. \tag{A 4}$$

By combining (A 2)–(A 4), for all k ∈ {1, …, D − 1} and all j ∈ {1, …, Nk},

$$\begin{aligned}0 = a_j^k\frac{\partial\tilde{J}(\hat{\Theta})}{\partial a_j^k} &= \frac{na_j^k}{N_f}\sum_{i=1}^{N_f}\frac{\partial f_i}{\partial h_{f_i,k,j}}\left(w_{k,j}z_{f_i,k-1} + b_{k,j}\right) + \frac{na_j^k}{N_u}\sum_{i=1}^{N_u}\frac{\partial u_i}{\partial h_{u_i,k,j}}\left(w_{k,j}z_{u_i,k-1} + b_{k,j}\right) + a_j^k\frac{\partial S(a)}{\partial a_j^k}\\ &= w_{k,j}\left(\frac{\partial\tilde{J}(\hat{\Theta})}{\partial w_{k,j}}\right)^{\top} + b_{k,j}\frac{\partial\tilde{J}(\hat{\Theta})}{\partial b_{k,j}} + a_j^k\frac{\partial S(a)}{\partial a_j^k} = a_j^k\frac{\partial S(a)}{\partial a_j^k}.\end{aligned}$$

Therefore,

$$0 = a_j^k\frac{\partial S(a)}{\partial a_j^k} = -\,\frac{a_j^k(D-1)}{N_k}\left(\sum_{k'=1}^{D-1}\exp\!\left(\frac{\sum_{i=1}^{N_{k'}}a_i^{k'}}{N_{k'}}\right)\right)^{-2}\exp\!\left(\frac{\sum_{i=1}^{N_k}a_i^k}{N_k}\right),$$

which implies that $a_j^k = 0$ for all j and k, since $(D-1)\left(\sum_{k'=1}^{D-1}\exp\!\left(\sum_{i=1}^{N_{k'}}a_i^{k'}/N_{k'}\right)\right)^{-2}\exp\!\left(\sum_{i=1}^{N_k}a_i^k/N_k\right)\neq 0$. This implies that $\tilde{J}(\hat{\Theta}) = \tilde{J}_c(0) + S(0)$, which contradicts $\tilde{J}(\hat{\Theta}) < \tilde{J}_c(0) + S(0)$. This proves the desired statement for N-LAAF.

Data accessibility

https://github.com/AmeyaJagtap/Locally-Adaptive-Activation-Functions-Neural-Networks-.

Authors' contributions

A.D.J.: conception and design, acquisition and interpretation of data, drafting and revising the article, corrections. K.K.: theoretical analysis, acquisition and interpretation of data, revising the article, corrections. G.E.K.: drafting and revising the article, supervision, corrections. All authors agree to be accountable for all aspects of the work.

Competing interests

We declare we have no competing interest.

Funding

This work was supported by the Department of Energy PhILMs grant DE-SC0019453, and by the DARPA-AIRA grant HR00111990025.

References

  • 1. Hinton G et al. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29, 82–97. (doi:10.1109/MSP.2012.2205597)
  • 2. Krizhevsky A, Sutskever I, Hinton G. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 26th Annual Conf. on Neural Information Processing Systems 2012, Lake Tahoe, NV, 1–6 December, vol. 1, pp. 1097–1105. Neural Information Processing Systems Foundation Inc.
  • 3. Wu Y et al. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. (http://arxiv.org/abs/1609.08144)
  • 4. Raissi M, Perdikaris P, Karniadakis GE. 2019. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. (doi:10.1016/j.jcp.2018.10.045)
  • 5. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
  • 6. Yu CC, Tang YC, Liu BD. 2002. An adaptive activation function for multilayer feedforward neural networks. In 2002 IEEE Region 10 Conf. on Computers, Communications, Control and Power Engineering (TENCON '02), Beijing, China, 28–31 October. Piscataway, NJ: IEEE.
  • 7. Qian S, Liu H, Liu C, Wu S, San Wong H. 2018. Adaptive activation functions in convolutional neural networks. Neurocomputing 272, 204–212. (doi:10.1016/j.neucom.2017.06.070)
  • 8. Dushkoff M, Ptucha R. 2016. Adaptive activation functions for deep networks. Electronic Imaging, Computational Imaging XIV, pp. 1–5. (doi:10.2352/ISSN.2470-1173.2016.19.COIMG-149)
  • 9. Li B, Li Y, Rong X. 2013. The extreme learning machine learning algorithm with tunable activation function. Neural Comput. Appl. 22, 531–539. (doi:10.1007/s00521-012-0858-9)
  • 10. Shen Y, Wang B, Chen F, Cheng L. 2004. A new multi-output neural model with tunable activation function and its applications. Neural Process. Lett. 20, 85–104. (doi:10.1007/s11063-004-0637-4)
  • 11. Kunc V, Kléma J. 2019. On transformative adaptive activation functions in neural networks for gene expression inference. bioRxiv 587287. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
  • 12. Jagtap AD, Kawaguchi K, Karniadakis GE. 2020. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. J. Comput. Phys. 404, 109136. (doi:10.1016/j.jcp.2019.109136)
  • 13. Jagtap AD, Kharazmi E, Karniadakis GE. 2020. Conservative physics-informed neural networks on discrete domains for conservation laws: applications to forward and inverse problems. Comput. Methods Appl. Mech. Eng. 365, 113028. (doi:10.1016/j.cma.2020.113028)
  • 14. Wang S, Teng Y, Perdikaris P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. (http://arxiv.org/abs/2001.04536)
  • 15. Baydin AG, Pearlmutter BA, Radul AA, Siskind JM. 2018. Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18, 1–43.
  • 16. Ruder S. 2017. An overview of gradient descent optimization algorithms. (http://arxiv.org/abs/1609.04747v2)
  • 17. Kingma DP, Ba JL. 2017. ADAM: a method for stochastic optimization. (http://arxiv.org/abs/1412.6980v9)
  • 18. Bertsekas DP. 1999. Nonlinear programming. Belmont, MA: Athena Scientific.
  • 19. Pakravan S, Mistani PA, Aragon-Calvo MA, Gibou F. 2020. Solving inverse-PDE problems with physics-aware neural networks. (http://arxiv.org/abs/2001.03608)
  • 20. Jagtap AD. 2018. Method of relaxed streamline upwinding for hyperbolic conservation laws. Wave Motion 78, 132–161. (doi:10.1016/j.wavemoti.2018.02.001)
  • 21. LeCun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. (doi:10.1109/5.726791)
  • 22. Xiao H, Rasul K, Vollgraf R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. (http://arxiv.org/abs/1708.07747)
  • 23. Clanuwat T, Bober-Irizar M, Kitamoto A, Lamb A, Yamamoto K, Ha D. 2018. Deep learning for classical Japanese literature. (http://arxiv.org/abs/1812.01718)
  • 24. Tactile Srl, Brescia, Italy. 1994. Semeion Handwritten Digit Data Set. Rome, Italy: Semeion Research Center of Sciences of Communication.
  • 25. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
  • 26. Krizhevsky A, Hinton G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • 27. He K, Zhang X, Ren S, Sun J. 2016. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016, 14th European Conf., Amsterdam, The Netherlands, 11–14 October, Proc. Part IV, pp. 630–645. Springer.
