In our recent work on elastic weight consolidation (EWC) (1), we showed that forgetting in neural networks can be alleviated with a quadratic penalty whose derivation was inspired by Bayesian evidence accumulation. In his letter (2), Dr. Huszár derives an alternative form for this penalty by following the standard treatment of expectation propagation with the Laplace approximation (3). He correctly argues that when more than two tasks are undertaken, the two forms of the penalty differ. Dr. Huszár also shows that, on a toy linear regression problem, his expression appears to perform better. We thank Dr. Huszár for pointing out the discrepancy between standard expectation propagation and EWC in the multitask case. While we were aware of this difference, our paper failed to discuss it explicitly, a point now remedied by this letter.
We would like to clarify that we used Bayesian evidence accumulation as an inspiration rather than a dogma, and that we explored a range of algorithms and penalties before settling on the formulation that worked best in our case. Our experimental setup comprises a set of challenging reinforcement learning and supervised learning tasks using deep neural networks, and under these conditions EWC performed well. There are several reasons why we think our penalty might work better than Dr. Huszár’s. Our penalty forces the network to remember older tasks more vividly, which may compensate for the fact that older tasks are harder to remember. As we point out in our work (1), the Laplace approximation is a local estimate of the true posterior variance. It can therefore severely underestimate the importance of certain parameters (see ref. 1, figure 4C), especially on older tasks, where the parameters may have drifted from the point at which the Fisher information was computed. For this reason it may be practically better to keep each regularization penalty anchored at the point where its Fisher information was computed.
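To make the distinction concrete, here is a minimal one-dimensional sketch of the two penalty forms. All function names, parameter values, and Fisher estimates below are hypothetical and chosen purely for illustration; they are not taken from either paper's experiments. The first function anchors one quadratic term per previous task at that task's own solution, as in EWC; the second sums the Fisher terms into a single quadratic anchored only at the most recent solution, as in the standard Laplace/expectation-propagation derivation.

```python
def ewc_penalty(theta, theta_stars, fishers, lam=1.0):
    """EWC-style penalty: one quadratic term per previous task,
    each anchored at that task's own solution theta_stars[i]."""
    return sum(0.5 * lam * f * (theta - t) ** 2
               for t, f in zip(theta_stars, fishers))

def recentered_penalty(theta, theta_stars, fishers, lam=1.0):
    """Penalty from the standard Laplace/EP derivation: Fisher terms
    are accumulated into a single quadratic anchored only at the
    most recent solution theta_stars[-1]."""
    f_total = sum(fishers)
    return 0.5 * lam * f_total * (theta - theta_stars[-1]) ** 2

# Hypothetical solutions after tasks A, B, C and per-task Fisher
# estimates (all numbers made up). With two tasks the penalties can
# coincide; with three or more they generally do not.
theta_stars = [0.0, 1.0, 1.5]
fishers = [2.0, 1.0, 0.5]

theta = 2.0
p_ewc = ewc_penalty(theta, theta_stars, fishers)          # 4.5625
p_rec = recentered_penalty(theta, theta_stars, fishers)   # 0.4375
```

In this toy setting the EWC form penalizes the same parameter value far more heavily, because it still measures deviation from the older tasks' anchors rather than only from the latest solution, which is one way to see the "remember older tasks more vividly" behavior described above.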
Our main reservation about Dr. Huszár’s letter is that it contains no empirical validation of the algorithm he proposes on a problem of comparable complexity. To obtain a practical algorithm, several approximations to the full Bayesian approach were required: (i) the posterior distribution is assumed to be a diagonal Gaussian, (ii) the variance is computed from a point estimate using the Laplace approximation, and (iii) tasks are assumed to switch only once convergence is reached. While these assumptions hold in the simple linear regression of Dr. Huszár’s letter, they do not in more complex tasks such as the ones we considered. One cannot, therefore, rely solely on the analytical argument; empirical studies are necessary to assess the different approximations.
We are delighted to see the interest our publication generated and appreciate the follow-up publications and scientific contributions paving the way to solving the continual learning problem.
Footnotes
The authors declare no conflict of interest.
References
- 1. Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci USA. 2017;114:3521–3526. doi:10.1073/pnas.1611835114.
- 2. Huszár F. Note on the quadratic penalties in elastic weight consolidation. Proc Natl Acad Sci USA. 2018;115:E2496–E2497. doi:10.1073/pnas.1717042115.
- 3. Eskin E, Smola AJ, Vishwanathan S. Laplace propagation. In: Thrun S, Saul LK, Schölkopf B, editors. Advances in Neural Information Processing Systems 16. MIT Press; Cambridge, MA: 2004. pp. 441–448.
