Catastrophic forgetting is an undesired phenomenon which occurs when neural networks are trained on different tasks sequentially. Elastic weight consolidation (EWC; ref. 1), published in PNAS, is a novel algorithm designed to safeguard against this. Despite its satisfying simplicity, EWC is remarkably effective.
Motivated by Bayesian inference, EWC adds quadratic penalties to the loss function when learning a new task. The purpose of these penalties is to approximate the loss surfaces of previous tasks. The authors derive the penalty for the two-task case and then extrapolate to handling multiple tasks. I believe, however, that the penalties for multiple tasks are applied inconsistently.
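For concreteness, in the two-task case the loss minimized in ref. 1 when learning task $B$ after task $A$ is, up to notation,

\[ \mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_{A,i} \left(\theta_i - \theta^*_{A,i}\right)^2, \]

where $\mathcal{L}_B$ is the loss on task $B$, $F_A$ is the diagonal Fisher information estimated on task $A$, and $\theta^*_A$ is the value of the parameters found after training on task $A$.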
In ref. 1 a separate penalty is maintained for each task $t$, centered at $\theta^*_t$, the value of $\theta$ obtained after training on task $t$. When these penalties are combined (assuming $\lambda_t = 1$ for all tasks), the aggregate penalty is anchored at

\[ \bar{\theta} = \Big(\sum_t F_t\Big)^{-1} \sum_t F_t\,\theta^*_t . \]
From the third task onward this is inconsistent with Bayesian inference. In the Bayesian paradigm the posterior $p(\theta \mid \mathcal{D}_A, \mathcal{D}_B)$ encapsulates the agent’s experience in both tasks $A$ and $B$, thus rendering the previous posterior $p(\theta \mid \mathcal{D}_A)$ irrelevant. Analogously, as $\theta^*_B$ was obtained while incorporating the penalty around $\theta^*_A$, once we have $\theta^*_B$, $\theta^*_A$ is not needed anymore.
The correct form of the penalties can be obtained by recursive application of the two-task derivation (2). It turns out that a single penalty is sufficient; its center is the latest optimum $\theta^*_B$ and its weights are given by the sum of diagonal Fisher information matrices from previous tasks, $F_A + F_B$. This single-penalty version is akin to Bayesian online learning (3).
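In code, the single-penalty variant only needs a running sum of the diagonal Fisher matrices and the latest optimum. A minimal sketch (function and variable names are mine, not from refs. 1 or 2; diagonal Fisher matrices are assumed to be stored as 1D arrays):

```python
import numpy as np

def ewc_penalty(theta, theta_latest, fisher_sum, lam=1.0):
    """Single quadratic penalty centered at the latest optimum (elementwise, diagonal Fisher)."""
    return 0.5 * lam * np.sum(fisher_sum * (theta - theta_latest) ** 2)

# After finishing training on task t:
#   fisher_sum   += fisher_t       # diagonal Fisher information estimated on task t
#   theta_latest  = theta_t_star   # parameters found at the end of training on task t
```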
If the agent has to revisit data from past training episodes, multiple penalties should be maintained, similarly to expectation propagation (4, 5). The aggregate penalty should still be centered at the latest optimum rather than at $\bar{\theta}$. The anchor point for task $t$'s penalty should therefore be

\[ \tilde{\theta}_t = \theta^*_t + F_t^{-1} \sum_{s < t} F_s \left(\theta^*_t - \tilde{\theta}_s\right) \]

rather than $\theta^*_t$, where the tasks $s$ precede task $t$ and $\tilde{\theta}_s$ are the respective penalty centers for these tasks.
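As a quick sanity check (my own algebra, not part of ref. 1): in the two-task case the corrected anchor is $\tilde{\theta}_B = \theta^*_B + F_B^{-1} F_A (\theta^*_B - \theta^*_A)$, while $\tilde{\theta}_A = \theta^*_A$, and the combined penalty $\tfrac{1}{2}(\theta - \theta^*_A)^\top F_A (\theta - \theta^*_A) + \tfrac{1}{2}(\theta - \tilde{\theta}_B)^\top F_B (\theta - \tilde{\theta}_B)$ is minimized at

\[ (F_A + F_B)^{-1}\left(F_A \theta^*_A + F_B \tilde{\theta}_B\right) = (F_A + F_B)^{-1}\left(F_A \theta^*_A + F_B \theta^*_B + F_A(\theta^*_B - \theta^*_A)\right) = \theta^*_B , \]

the latest optimum, as required; the same argument applies recursively when further tasks are added.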
To illustrate the behavior and effect of the different penalties I applied EWC to a sequence of linear regression tasks. The tasks define quadratic losses, each with a diagonal Hessian (Fig. 1, Left). As all simplifying assumptions of EWC hold exactly, one should expect it to match exact Bayesian inference: Reach the global minimum of the combined loss and do so irrespective of the order in which tasks were presented. Fig. 1, Center shows that although a quadratic penalty could model task B's loss perfectly, the penalty is placed around the wrong anchor point, $\theta^*_B$. Table 1 shows that this leads to suboptimal performance and an unwanted sensitivity to task ordering. By contrast, EWC using the corrected penalties around $\theta^*_A$ and $\tilde{\theta}_B$ models the losses perfectly (Fig. 1, Right) and achieves optimal performance in an order-agnostic fashion (Table 1).
Fig. 1.
(Left) Elliptical level sets of quadratic loss functions for tasks A, B, and C also used in Table 1. (Center) When learning task C via EWC, losses for tasks A and B are replaced by quadratic penalties around $\theta^*_A$ and $\theta^*_B$. (Right) Losses are approximated perfectly by the correct quadratic penalties around $\theta^*_A$ and $\tilde{\theta}_B$.
Table 1.
Final performance of a model trained via different versions of EWC on a sequence of three linear regression tasks
| Algorithm | Task order | Task A | Task B | Task C | Combined |
| EWC, original penalties (1) | | **0.06880** | 0.09033 | 0.00042 | 0.15955 |
| | | 0.07410 | 0.08036 | 0.00036 | 0.15481 |
| | | 0.09033 | **0.06880** | 0.00042 | 0.15955 |
| | | 0.08036 | 0.07410 | 0.00036 | 0.15481 |
| | | 0.07981 | 0.07487 | **0.00014** | 0.15481 |
| | | 0.07487 | 0.07981 | **0.00014** | 0.15481 |
| EWC, corrected penalties (2) | Agnostic | 0.07693 | 0.07693 | 0.00022 | **0.15407** |
| Simultaneous training | Agnostic | 0.07693 | 0.07693 | 0.00022 | **0.15407** |
$\lambda_t$ is set to 1 for all tasks. Using the penalties of ref. 1, the final performance is sensitive to the order in which the tasks were presented: The model incurs lower loss on tasks presented first, while performance on later tasks may degrade. Irrespective of task ordering the combined loss is suboptimal. Using the corrected penalties (2), EWC becomes order-agnostic and finds the same global minimum as simultaneous training, shown in the last row. Lowest values in each column are highlighted in boldface type.
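The experiment is simple to reproduce in spirit. Below is a minimal sketch (my own code, not from ref. 1; the diagonal Hessians and minima are arbitrary stand-ins rather than the tasks of Fig. 1) comparing the original and corrected penalties on a sequence of quadratic tasks. Because every loss and penalty is quadratic, each training episode is solved in closed form:

```python
import numpy as np

# Each task is a quadratic loss L_t(theta) = 0.5 * sum_i h_t[i] * (theta[i] - mu_t[i])**2,
# given by a diagonal Hessian h_t (which here also plays the role of the Fisher information)
# and an unconstrained minimum mu_t. The numbers below are arbitrary stand-ins.
tasks = {
    "A": (np.array([1.0, 3.0]), np.array([1.0, 0.0])),
    "B": (np.array([2.0, 1.0]), np.array([0.0, 1.0])),
    "C": (np.array([0.5, 0.5]), np.array([-1.0, -1.0])),
}

def train_sequentially(order, corrected, lam=1.0):
    """Visit tasks in the given order; every step is a quadratic problem solved exactly."""
    penalties = []  # list of (weight, center) pairs, one per quadratic penalty term
    theta = None
    for name in order:
        h, mu = tasks[name]
        # argmin over theta of 0.5*h*(theta-mu)^2 + sum_k 0.5*lam*w_k*(theta-c_k)^2, elementwise
        num = h * mu + sum(lam * w * c for w, c in penalties)
        den = h + sum(lam * w for w, _ in penalties)
        theta = num / den
        if corrected:
            # single penalty: accumulated Fisher, centered at the latest optimum
            fisher_sum = h + (penalties[0][0] if penalties else 0.0)
            penalties = [(fisher_sum, theta)]
        else:
            # original EWC: one penalty per task, centered where training on that task ended
            penalties.append((h, theta))
    return theta

def combined_loss(theta):
    return sum(0.5 * np.sum(h * (theta - mu) ** 2) for h, mu in tasks.values())

for order in (["A", "B", "C"], ["C", "B", "A"]):
    for corrected in (False, True):
        theta = train_sequentially(order, corrected)
        label = "corrected" if corrected else "original"
        print(f"{'-'.join(order)}  {label:9s}  combined loss = {combined_loss(theta):.5f}")
# The corrected penalties reach the same combined loss for both orderings (the joint
# minimum of all three losses); the original penalties do not.
```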
I expect the negative impact of using incorrect penalties to be negligible until the network’s capacity begins to saturate. Furthermore, optimizing $\lambda_t$ can compensate for the effects to some degree. As a result, it is possible that using incorrect penalties would result in no observable performance degradation in practice. However, since the correct penalties are just as easy to compute, and are clearly superior in some cases, I see no reason against adopting them.
Footnotes
The author declares no conflict of interest.
References
- 1. Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA. 2017;114:3521–3526. doi: 10.1073/pnas.1611835114.
- 2. Huszár F. On quadratic penalties in elastic weight consolidation. arXiv:1712.03847. 2017.
- 3. Opper M. A Bayesian approach to on-line learning. In: Saad D, editor. On-Line Learning in Neural Networks. Cambridge Univ Press; Cambridge, UK: 1998. pp. 363–378.
- 4. Minka TP. Expectation propagation for approximate Bayesian inference. In: Breese J, Koller D, editors. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Assoc for Computing Machinery; New York: 2001. pp. 362–369.
- 5. Eskin E, Smola AJ, Vishwanathan S. Laplace propagation. In: Thrun S, Saul LK, Schölkopf B, editors. Advances in Neural Information Processing Systems 16. MIT Press; Cambridge, MA: 2004. pp. 441–448.

