Proceedings of the National Academy of Sciences of the United States of America
Letter. 2018 Feb 20;115(11):E2496–E2497. doi: 10.1073/pnas.1717042115

Note on the quadratic penalties in elastic weight consolidation

Ferenc Huszár a,1
PMCID: PMC5856534  PMID: 29463735

Catastrophic forgetting is an undesired phenomenon that occurs when neural networks are trained sequentially on different tasks. Elastic weight consolidation (EWC; ref. 1), published in PNAS, is a novel algorithm designed to safeguard against it. Despite its satisfying simplicity, EWC is remarkably effective.

Motivated by Bayesian inference, EWC adds quadratic penalties to the loss function when learning a new task. The purpose of these penalties is to approximate the loss surfaces of previous tasks. The authors derive the penalty for the two-task case and then extrapolate to the handling of multiple tasks. I believe, however, that the penalties for multiple tasks are applied inconsistently.
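As a concrete sketch of this idea (my own illustrative code, not from ref. 1; all names are mine), the penalized objective for a new task is the task loss plus one diagonal quadratic penalty per previous task:

```python
import numpy as np

def ewc_loss(theta, task_loss, anchors, fishers, lam=1.0):
    """Current-task loss plus one quadratic EWC penalty per previous task.

    anchors: list of past optima theta*_t; fishers: matching diagonal Fisher
    estimates (one 1-D array each). Illustrative sketch, not ref. 1's code.
    """
    penalty = sum(0.5 * lam * np.sum(F * (theta - a) ** 2)
                  for a, F in zip(anchors, fishers))
    return task_loss(theta) + penalty
```

With no previous tasks the penalty vanishes and the objective reduces to the plain task loss.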

In ref. 1 a separate penalty is maintained for each task T, centered at θ*_T, the value of θ obtained after training on task T. When these penalties are combined (assuming λ_t = 1 for all tasks), the aggregate penalty is anchored at

μ*_T = (F_A + F_B + ⋯ + F_T)⁻¹ (F_A θ*_A + F_B θ*_B + ⋯ + F_T θ*_T).
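A sum of diagonal quadratic penalties is itself a single quadratic anchored at this weighted mean; a quick numerical check (illustrative code with randomly drawn diagonal Fishers and anchors) confirms that the two forms differ only by a θ-independent constant:

```python
import numpy as np

rng = np.random.default_rng(0)
F = [rng.uniform(0.1, 2.0, size=4) for _ in range(3)]  # diagonal Fishers F_A..F_T
th = [rng.normal(size=4) for _ in range(3)]            # anchors theta*_A..theta*_T

F_sum = sum(F)
mu = sum(f * t for f, t in zip(F, th)) / F_sum         # aggregate anchor mu*_T

# evaluate both forms at two arbitrary points; their difference is constant
theta1, theta2 = rng.normal(size=4), rng.normal(size=4)
multi = sum(0.5 * np.sum(f * (theta1 - t) ** 2) for f, t in zip(F, th))
single = 0.5 * np.sum(F_sum * (theta1 - mu) ** 2)
multi2 = sum(0.5 * np.sum(f * (theta2 - t) ** 2) for f, t in zip(F, th))
single2 = 0.5 * np.sum(F_sum * (theta2 - mu) ** 2)
assert np.isclose(multi - single, multi2 - single2)
```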

From the third task onward this is inconsistent with Bayesian inference. In the Bayesian paradigm the posterior p(θ | D_A, D_B) encapsulates the agent’s experience in both tasks A and B, rendering the previous posterior p(θ | D_A) irrelevant. Analogously, since θ*_B was obtained while incorporating the penalty around θ*_A, once we have θ*_B, θ*_A is no longer needed.

The correct form of the penalties can be obtained by recursive application of the two-task derivation (2). It turns out that a single penalty suffices: its center is the latest optimum θ*_T and its weights are the sum of the diagonal Fisher information matrices of all tasks seen so far, F_A + F_B + ⋯ + F_T. This single-penalty version is akin to Bayesian online learning (3).
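For quadratic task losses this collapse can be checked directly: the task-B loss plus the penalty around θ*_A equals, up to an additive constant, a single quadratic with weights F_A + F_B centered at the penalized optimum θ*_B. A sketch with made-up diagonal Fishers (names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
F_A, F_B = rng.uniform(0.5, 2.0, d), rng.uniform(0.5, 2.0, d)
th_A, m_B = rng.normal(size=d), rng.normal(size=d)  # task-A anchor, task-B minimum

# penalised task-B objective: quadratic task loss + quadratic penalty around th_A
obj = lambda th: (0.5 * np.sum(F_B * (th - m_B) ** 2)
                  + 0.5 * np.sum(F_A * (th - th_A) ** 2))
th_B = (F_B * m_B + F_A * th_A) / (F_A + F_B)       # its closed-form minimiser

# one penalty with summed Fishers, centred at th_B, matches up to a constant
single = lambda th: 0.5 * np.sum((F_A + F_B) * (th - th_B) ** 2)
th1, th2 = rng.normal(size=d), rng.normal(size=d)
assert np.isclose(obj(th1) - single(th1), obj(th2) - single(th2))
```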

If the agent has to revisit data from past training episodes, multiple penalties should be maintained, similarly to expectation propagation (4, 5). The aggregate penalty should be centered at θ*_T rather than μ*_T. The anchor point for task T’s penalty should therefore be

θ̃_T = F_T⁻¹ ((F_A + F_B + ⋯ + F_T) θ*_T − F_A θ̃_A − ⋯ − F_S θ̃_S)

rather than θ*_T, where tasks A, …, S precede task T, and θ̃_A, …, θ̃_S are the respective penalty centers for these tasks.
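This choice of anchor makes the aggregate of all maintained penalties centered exactly at θ*_T, which a short numerical check confirms (illustrative code; diagonal Fishers and past anchors drawn at random):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
F_prev = [rng.uniform(0.1, 2.0, d) for _ in range(2)]  # F_A, F_B (diagonal)
tilde_prev = [rng.normal(size=d) for _ in range(2)]    # past penalty centres
F_T = rng.uniform(0.1, 2.0, d)
theta_star_T = rng.normal(size=d)                      # latest optimum theta*_T

# theta~_T = F_T^{-1}((F_A + ... + F_T) theta*_T - F_A theta~_A - ... )
F_sum = sum(F_prev) + F_T
tilde_T = (F_sum * theta_star_T
           - sum(f * t for f, t in zip(F_prev, tilde_prev))) / F_T

# the aggregate of all penalties is now anchored exactly at theta*_T
mu = (sum(f * t for f, t in zip(F_prev, tilde_prev)) + F_T * tilde_T) / F_sum
assert np.allclose(mu, theta_star_T)
```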

To illustrate the behavior and effect of the different penalties I applied EWC to a sequence of linear regression tasks. The tasks define quadratic losses, each with a diagonal Hessian (Fig. 1, Left). As all simplifying assumptions of EWC hold exactly, one should expect it to match exact Bayesian inference: reach the global minimum of the combined loss, and do so irrespective of the order in which the tasks were presented. Fig. 1, Center shows that although a quadratic penalty could model task B perfectly, the penalty is placed around the wrong anchor point. Table 1 shows that this leads to suboptimal performance and an unwanted sensitivity to task ordering. By contrast, EWC using the corrected penalties around θ̃_T models the losses perfectly (Fig. 1, Right) and achieves optimal performance in an order-agnostic fashion (Table 1).
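The experiment can be reproduced in miniature (a sketch under my own simplifications: diagonal quadratic losses whose Hessians equal the Fishers, with each training step solved in closed form; `run` and all other names are mine, not from ref. 1):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
H = [rng.uniform(0.2, 3.0, d) for _ in range(3)]  # diagonal Hessians; for these
m = [rng.normal(size=d) for _ in range(3)]        # losses the Fisher equals H

def run(order, corrected):
    """Train on tasks in `order`; each step minimises the quadratic task loss
    plus the EWC penalties in closed form."""
    theta = np.zeros(d)
    mem_F, mem_a = [], []                     # original: one penalty per past task
    F_run, anchor = np.zeros(d), np.zeros(d)  # corrected: single running penalty
    for t in order:
        if corrected:
            theta = (H[t] * m[t] + F_run * anchor) / (H[t] + F_run)
            F_run, anchor = F_run + H[t], theta
        else:
            pF = sum(mem_F, np.zeros(d))
            pa = sum((f * a for f, a in zip(mem_F, mem_a)), np.zeros(d))
            theta = (H[t] * m[t] + pa) / (H[t] + pF)
            mem_F.append(H[t]); mem_a.append(theta)
    return theta

joint = sum(h * mu for h, mu in zip(H, m)) / sum(H)   # minimum of combined loss
assert np.allclose(run([0, 1, 2], True), joint)       # corrected: optimal and
assert np.allclose(run([2, 0, 1], True), joint)       # order-agnostic
assert not np.allclose(run([0, 1, 2], False), joint)  # original falls short
```

The final assertion shows the effect described in Table 1: with the original penalties the solution after the third task no longer coincides with the joint minimum.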

Fig. 1.

(Left) Elliptical level sets of the quadratic loss functions for tasks A, B, and C, also used in Table 1. (Center) When learning task C via EWC, the losses for tasks A and B are replaced by quadratic penalties around θ*_A and θ*_B. (Right) The losses are approximated perfectly by the corrected quadratic penalties around θ̃_A ≠ θ*_A and θ̃_B.

Table 1.

Final performance of a model trained via different versions of EWC on a sequence of three linear regression tasks

Algorithm                     Task order   Task A    Task B    Task C    Combined
EWC, original penalties (1)   ABC          0.06880   0.09033   0.00042   0.15955
                              ACB          0.07410   0.08036   0.00036   0.15481
                              BAC          0.09033   0.06880   0.00042   0.15955
                              BCA          0.08036   0.07410   0.00036   0.15481
                              CAB          0.07981   0.07487   0.00014   0.15481
                              CBA          0.07487   0.07981   0.00014   0.15481
EWC, corrected penalties (2)  Agnostic     0.07693   0.07693   0.00022   0.15407
Simultaneous training         Agnostic     0.07693   0.07693   0.00022   0.15407

λ_t is set to 1 for all tasks. Using the penalties of ref. 1, the final performance is sensitive to the order in which the tasks were presented: the model incurs lower loss on tasks presented first, while performance on later tasks may degrade. Irrespective of task ordering the combined loss is suboptimal. Using the corrected penalties (2), EWC becomes order-agnostic and finds the same global minimum as simultaneous training, shown in the last row.

I expect the negative impact of using incorrect penalties to be negligible until the network’s capacity begins to saturate. Furthermore, optimizing λ can compensate for the effects to some degree. As a result, it is possible that using incorrect penalties would result in no observable performance degradation in practice. However, since the correct penalties are just as easy to compute, and are clearly superior in some cases, I see no reason against adopting them.

Footnotes

The author declares no conflict of interest.

References

1. Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA. 2017;114:3521–3526. doi: 10.1073/pnas.1611835114.
2. Huszár F. On quadratic penalties in elastic weight consolidation. 2017. arXiv:1712.03847.
3. Opper M. A Bayesian approach to on-line learning. In: Saad D, editor. On-Line Learning in Neural Networks. Cambridge Univ Press; Cambridge, UK: 1998. pp. 363–378.
4. Minka TP. Expectation propagation for approximate Bayesian inference. In: Breese J, Koller D, editors. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Assoc for Computing Machinery; New York: 2001. pp. 362–369.
5. Eskin E, Smola AJ, Vishwanathan S. Laplace propagation. In: Thrun S, Saul LK, Schölkopf B, editors. Advances in Neural Information Processing Systems 16. MIT Press; Cambridge, MA: 2004. pp. 441–448.
