Author manuscript; available in PMC: 2023 May 1.
Published in final edited form as: Neural Netw. 2022 Feb 10;149:95–106. doi: 10.1016/j.neunet.2022.02.001

Deep Bayesian Unsupervised Lifelong Learning

Tingting Zhao a,*, Zifeng Wang b,*, Aria Masoomi b, Jennifer Dy b
PMCID: PMC8969892  NIHMSID: NIHMS1779413  PMID: 35219032

Abstract

Lifelong Learning (LL) refers to the ability to continually learn and solve new problems with incrementally available information over time while retaining previous knowledge. Much attention has been given lately to Supervised Lifelong Learning (SLL) with a stream of labelled data. In contrast, we focus on resolving challenges in Unsupervised Lifelong Learning (ULL) with streaming unlabelled data when the data distribution and the unknown class labels evolve over time. A Bayesian framework is natural for incorporating past knowledge and sequentially updating beliefs with new data. We develop a fully Bayesian inference framework for ULL with a novel end-to-end Deep Bayesian Unsupervised Lifelong Learning (DBULL) algorithm, which can progressively discover new clusters from unlabelled data without forgetting the past while learning latent representations. To efficiently maintain past knowledge, we develop a novel knowledge preservation mechanism via sufficient statistics of the latent representation of the raw data. To detect potential new clusters on the fly, we develop an automatic cluster discovery and redundancy removal strategy in our inference, inspired by nonparametric Bayesian statistics techniques. We demonstrate the effectiveness of our approach using image and text corpora benchmark datasets in both LL and batch settings.

Keywords: Unsupervised Lifelong Learning, Bayesian Learning, Deep Generative Models, Deep Neural Networks, Sufficient Statistics

1. Introduction

With exposure to a continuous stream of information, human beings are able to learn and discover novel clusters continually by incorporating past knowledge; however, traditional machine learning algorithms mainly focus on static data distributions. Training a model with new information often interferes with previously learned knowledge, which typically compromises performance on previous datasets [1]. In order to empower algorithms with the ability to adapt to emerging data while preserving performance on seen data, a new machine learning paradigm called Lifelong Learning (LL) has recently gained some attention.

LL, also known as continual learning, was first proposed in [2]. It provides a paradigm to exploit past knowledge and learn continually by transferring previously learned knowledge to solve similar but new problems in a dynamic environment, without performance degradation on old tasks. LL is still an emerging field and most existing research [3, 4, 5] has focused on Supervised Lifelong Learning (SLL), where the boundaries between different tasks are known and each task refers to a supervised learning problem with output labels provided. The term task can have different meanings in various contexts. For example, different tasks can represent different subsets of classes or labels in the same supervised learning problem [6], or supervised learning problems in different fields, where researchers aim to perform continual learning across different domains [7] or lifelong transfer learning [8, 9].

While most research on LL has focused on resolving challenges in SLL problems with class labels provided, we instead consider Unsupervised Lifelong Learning (ULL) problems, where the learning system interacts with a non-stationary stream of unlabelled data and the cluster labels are unknown. One objective of ULL is to discover new clusters by interacting with the environment dynamically and adapting to changes in the unlabeled data without external supervision or knowledge. To avoid confusion, it is worth pointing out that our work assumes that the non-stationary streaming unlabelled data come from a single domain, and we aim to develop a single dynamic model that performs well on all sequential data at the end of each training stage without forgetting previous knowledge. This setting can serve as a reasonable starting point for ULL. We leave the challenges of ULL across different problem domains for future work.

To retain knowledge from past data and learn new clusters from new data continually, good representations of the raw data make it easier to extract information and, in turn, support effective learning. Thus, it is more computationally appealing if we can discover new clusters in a low-dimensional latent space instead of the complex original data space while performing representation learning. To achieve this, we propose Deep Bayesian Unsupervised Lifelong Learning (DBULL), which is a flexible probabilistic generative model that can adapt to new data and expand with new clusters while seamlessly learning deep representations in a Bayesian framework.

A critical objective of LL is to achieve consistently good performance incrementally as new data arrive in a streaming fashion, without performance decrease on previous data, even if the data may have been completely overwritten. Compared with a traditional batch learning setting, there are additional challenges to resolve in a LL setting. One important question is how to design a knowledge preservation scheme to efficiently maintain previously learned information. To discover the new clusters automatically with streaming, unlabelled data, another challenge is how to design a dynamic model that can expand with incoming data to perform unsupervised learning. The last challenge is how to design an end-to-end inference algorithm to obtain good performance in an incremental learning way. To answer these questions, we make the following contributions:

  • To solve the challenges in ULL, we provide a fully Bayesian formulation that performs representation learning, clustering and automatic new cluster discovery simultaneously via our end-to-end novel variational inference strategy DBULL.

  • To efficiently extract and maintain knowledge seen in earlier data, we provide an innovation in our incremental inference strategy by using, for the first time, sufficient statistics of the latent space in an LL context.

  • To discover new clusters in the emerging data, we choose a nonparametric Bayesian prior to allow the model to grow dynamically. We develop a sequential Bayesian inference strategy that performs representation learning simultaneously with our proposed Cluster Expansion and Redundancy Removal (CERR) trick to discover new clusters on the fly without imposing bounds on the number of clusters, unlike most existing algorithms, which rely on a truncated Dirichlet Process (DP) [10].

  • To show the effectiveness of DBULL, we conduct experiments on image and text benchmarks. DBULL can achieve superior performance compared with state-of-the-art methods in both ULL and classical batch settings.

The remainder of the paper is organized as follows. Section 2 describes related work. Section 3 presents the problem formulation and the generative process of DBULL. Section 4 illustrates why a Bayesian framework is a natural choice for our ULL setting. Section 5 provides our end-to-end variational inference algorithm under the ULL setting, with a publicly available implementation. In Section 6, we compare our algorithm to state-of-the-art methods using the most common text and image benchmark datasets in LL to evaluate the performance of our method. In Section 7, we discuss some research gaps and potential future extensions of our work.

2. Related Work

2.1. Alleviating Catastrophic Forgetting in Lifelong Learning

Research on LL aims to learn knowledge in a continual fashion without performance degradation on previous tasks when training on new data. Reference [11] provides a comprehensive review of LL with neural networks. The main challenge of LL using Deep Neural Networks (DNNs) is that they often suffer from a phenomenon called catastrophic forgetting or catastrophic interference, where a model's performance on previous tasks may decrease abruptly due to the interference of training with new information [1, 12]. Recent work aims at adapting a learned model to new information while ensuring that the performance on previous data does not decrease. Currently, there are no universal past knowledge preservation schemes across different algorithms and settings [13] in LL. Regularization methods aim to reduce interference from new learning by minimizing changes to parameters that are important to previous learning tasks [14, 15]. Alternative approaches based on rehearsal have also been proposed to alleviate catastrophic forgetting while training DNNs sequentially. Rehearsal methods use either past data [16, 17], coreset data summarization [18] or a generative model [19] to capture the data distribution of previously seen data.

However, most of the existing LL methods focus on supervised learning tasks [14, 18, 7, 19], where each learning task performs supervised learning with output labels provided. In comparison, we propose a novel alternative knowledge preservation scheme in an unsupervised learning context via sufficient statistics. This is in contrast to existing work, which uses previous model parameters [5, 20], representative items extracted from previous models [21], past raw data, or coreset data summarization [16, 17, 18] as previous knowledge for new tasks. Our proposal to use sufficient statistics is novel and has the advantage of preserving past knowledge without the need to store previous data, while allowing incremental updates as new data arrive by taking advantage of the additive property of sufficient statistics.

2.2. Comparable Methods in Unsupervised Lifelong Learning

Recently, [22] proposed Continual Unsupervised Representation Learning (CURL) to deal with a fully ULL setting with unknown cluster labels. We have developed our idea independently in parallel with CURL but in a fully Bayesian framework. CURL is the most related and comparable method to ours in the literature. CURL focuses on learning representations and discovering new clusters using a threshold method. One major drawback of CURL is that it has over-clustering issues as shown in their real data experiment. We also show this empirically and demonstrate the improvement of our method over CURL in our experiment section. In contrast to CURL, we provide a novel probabilistic framework with a nonparametric Bayesian prior to allow the model to expand without bound automatically instead of using an ad hoc threshold method as in CURL. We develop a novel end-to-end variational inference strategy for learning deep representations and detecting novel clusters in ULL simultaneously.

2.3. Bayesian Lifelong Learning

A Bayesian formulation is a natural choice for LL since it provides a systematic way to incorporate previously learnt information in the prior distribution and obtain a posterior distribution that combines both the prior belief and the new information. The sequential nature of Bayes' theorem also paves the way to recursively update an approximation of the posterior distribution and then use it as a new prior to guide the learning for new data in LL.

In [18], the authors propose a Bayesian formulation of LL. Although both works are under a Bayesian framework, ours differs from [18] in its objectives, inference strategy and knowledge preservation technique. In [18], the authors provide a variational online inference framework for deep discriminative models and deep generative models, where they study the approximate posterior distribution of the parameters in DNNs in a continual fashion. However, their method does not have the capacity to find the latent clustering structure of the data or detect new clusters in emerging data. In contrast, we develop a novel Bayesian framework for representation learning and for discovering the latent clustering structure and new clusters on the fly, together with a novel end-to-end variational inference strategy in an ULL context.

2.4. Deep Generative Unsupervised Learning Methods in a Batch Setting

Recent research has focused on combining deep generative models to learn good representations of the original data and conduct clustering analysis in an unsupervised learning context [23, 24, 25, 26, 27]. However, the latest existing methods are designed for an independent and identically distributed (i.i.d.) batch training mode instead of a LL context. The majority of these methods operate in a static unsupervised learning setting, where the number of clusters is fixed in advance. Thus, these methods cannot detect potential new clusters when new data arrive or the data distribution changes, and they cannot adapt to a LL setting.

2.5. Uncertainty Quantification in Bayesian Deep Learning

Deep learning has achieved impressive prediction success on many practical tasks [28]. Recently, researchers have focused on providing predictive uncertainty quantification for deep learning and machine learning techniques, making them better suited for risk-sensitive applications. Comprehensive reviews on uncertainty quantification can be found in [29, 30]. Uncertainty quantification has also gained attention in the field of medical research [31, 32, 33, 34]. Bayesian neural networks provide a reasonable choice by including uncertainty through priors on the weights of the neural network and quantifying the uncertainty of the functional mean via posterior predictive distributions. Our inference algorithm is developed under the Bayesian framework, which could potentially pave the way to characterizing uncertainty in Bayesian LL. However, this is a challenging task due to the high-dimensional weight space of the neural network and the complex dependencies among the weights. Uncertainty quantification is not the focus of our paper but could be a potential future direction to explore in the field of Bayesian LL.

To summarize, our work fills the gap by providing a fully Bayesian framework for ULL, which has the unique capacity to use a deep generative model for representation learning while performing new cluster discovery on the fly with a nonparametric Bayesian prior and our proposed CERR technique. To alleviate the catastrophic forgetting challenge in LL, we propose to use sufficient statistics to maintain knowledge as a novel alternative to existing methods. We further develop an end-to-end Bayesian inference strategy, DBULL, to achieve our goal.

3. Model

3.1. Problem Formulation

In our ULL setting, a sequence of datasets D1, D2, . . . , DN arrive in a streaming order. When a new dataset DN arrives in memory, the previous dataset DN−1 is no longer available. Our goal is to automatically learn the clusters (unlabeled classes) in each dataset.

Let x ∈ X represent the unlabeled observation of the current dataset in memory, where X can be a high-dimensional data space. We assume that a low-dimensional latent representation z can be learned from x and in turn can be used to reconstruct x. We assume that the variation among observations x can be captured by its latent representation z. Thus, we let y represent the unknown cluster membership of z for observation x.

We aim to find: (1) a good low-dimensional latent representation z of x to efficiently extract knowledge from the original data; (2) the clustering structure within the new dataset, with the capacity to discover potentially novel clusters without forgetting the previously learned clusters of the seen datasets; and, (3) an incremental learning strategy to optimize the cluster learning performance for a new dataset without dramatically degrading the clustering performance on seen datasets.

We summarize our work in a flow chart in Fig. 1, provide its graphical model representation in Fig. 2, and describe the generative process of our graphical model for DBULL in the next section.

Figure 1:

Flow chart of DBULL, best viewed in color. Given observations x, DBULL learns their latent representation z via an encoder z = f_ψ(x), performs clustering under a Dirichlet Process Gaussian mixture model, and reconstructs the original observation via a decoder x̂ = g_θ(z), where ψ and θ denote the parameters of the encoder and decoder, respectively. To perform Lifelong Learning in an incremental fashion when dealing with streaming data, we introduce two novel components: Sufficient Statistics for knowledge preservation and Cluster Expansion and Redundancy Removal to create and merge clusters.

Figure 2:

Graphical model representation of DBULL. Nodes denote random variables, edges denote possible dependence, and plates denote replication. Solid lines denote the generative model; Dashed lines denote the variational approximation.

3.2. Generative Process of DBULL

The generative process for DBULL is as follows.

  (a) Draw a latent cluster membership y ∼ Cat(π(v)), where the vector v comes from the stick-breaking construction of a Dirichlet Process (DP).

  (b) Draw a latent representation vector z | y = k ∼ N(μ_k, σ_k² I), where k is the cluster membership sampled in (a).

  (c) Generate data x from x | z ∼ N(μ(z; θ), diag(σ²(z; θ))) in the original data space.

In (a), Cat(π(v)) is the categorical distribution parameterized by π(v), where we denote the kth element of π(v) as π_k(v), which is the probability of cluster k. The value of π(v) depends on a vector of scalars v coming from the stick-breaking construction of a Dirichlet Process (DP) [35], and we describe an iterative process to draw π_k(v) in Section 4.2. The DP is a nonparametric prior for clustering exchangeable observations. Most popular clustering algorithms require the number of clusters to be fixed and known in advance. By contrast, the DP mixture model provides a nonparametric Bayesian framework to describe distributions over mixture models. The DP mixture model is flexible enough to allow the number of mixture components (clusters) to be random and to grow without bound as more data arrive. This property makes it a natural fit for modelling our LL setting.

CURL uses a latent mixture of Gaussian components to capture the clustering structure in an unsupervised learning context. In comparison, we adopt the DP mixture model in the latent space, with the advantage that the number of mixture components can be random and grow without bound as new data arrive, an appealing property for LL. We further explain in Section 4.2 why the DP is an appropriate prior for our problem in detail. In (b), z is considered a low-dimensional latent representation of the original data x. We describe in Section 4.2 how a DP Gaussian mixture is used for modelling z, since it is often assumed that the variation in z is able to reflect the variation within x. The current representation in (b) is for easy understanding. In (c), we assume that the generative model p_θ(x|z) is parameterized by g_θ: Z → X with g_θ(z) = (μ(z; θ), σ²(z; θ)), where g_θ is chosen to be a DNN due to its powerful function approximation and feature learning capabilities [36, 37, 38].

Under this generative process, the joint probability density function can be factorized as

p(x, y, z, \phi, v) = p_\theta(x \mid z)\, p(z \mid y)\, p(y \mid \pi(v))\, p(v)\, p(\phi), \qquad (1)

where φ = (φ_1, φ_2, …, φ_k), φ_k = (μ_k, σ_k²) represents the parameters of the kth mixture component (or cluster), and p(v) and p(φ) represent the prior distributions for v and φ.

In Section 4.2, we discuss how to choose appropriate p(v) and p(ϕ) to endow our model with the flexibility to grow the number of mixture components without bound with new data in a LL setting.

4. Why Bayesian for DBULL

In this section, we illustrate why a Bayesian framework is a natural choice for our ULL setting. Recall that we have a sequence of datasets D1, D2, . . . , DN from a single domain arriving in a streaming order. To mimic a LL setting, we assume that only one dataset can fit in memory at a time. One key question in LL is how to efficiently maintain past knowledge to guide future learning.

4.1. Bayesian Reasoning for Lifelong Learning

A Bayesian framework is a suitable solution to this type of learning since it learns a posterior distribution, or an approximation of a posterior distribution, that takes advantage of both the prior belief and the additional information in the new dataset. The sequential nature of Bayes' theorem ensures valid recursive updates of an approximation of the posterior distribution given the observations. This approximation of the posterior then serves as the new prior to guide future learning for new data in LL. Before describing our inference strategy, we first explain why utilizing the Bayesian updating rule is valid for our problem.

Given (N − 1) datasets Di, where i = 1, 2, . . . , (N − 1), the posterior after considering the Nth dataset is

P(y, z, \phi, v \mid D_1, D_2, \ldots, D_N) \propto P(D_N \mid y, z, \phi, v)\, P(y, z, \phi, v \mid D_1, \ldots, D_{N-1}), \qquad (2)

which reflects that the posterior after (N − 1) tasks and datasets can be considered the prior for the next task and dataset. If we knew the normalizing constants of P(y, z, φ, v | D_1, D_2, …, D_N) and P(y, z, φ, v | D_1, D_2, …, D_{N−1}) exactly, repeatedly updating (2) would be a streaming procedure with no need to reuse past data. However, it is often intractable to compute the normalizing constant exactly. Thus, an approximation of the posterior distribution is necessary to update (2), since the exact posterior is infeasible to obtain.
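To make the recursion in (2) concrete, the following toy sketch (a conjugate Beta-Bernoulli model, chosen purely for illustration and not part of DBULL) shows how the posterior after each dataset becomes the prior for the next, so that streaming updates never revisit past data when the posterior is available in closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 1.0, 1.0                      # Beta(1, 1) prior on a Bernoulli success probability
    datasets = [rng.binomial(1, 0.7, size=50) for _ in range(5)]   # five streaming datasets

    for D in datasets:
        # posterior <- prior x likelihood; the closed-form posterior becomes the next prior
        a, b = a + D.sum(), b + len(D) - D.sum()

    print("posterior mean:", a / (a + b))   # close to 0.7, without ever storing past datasets

In our setting the exact posterior in (2) is not available in closed form, which is why Section 5 develops a variational approximation that plays the role of the prior for the next dataset.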

4.2. Dirichlet Process Prior

The DP is often used as a nonparametric prior for partitioning exchangeable observations into discrete clusters. The DP mixture is a flexible mixture model in which the number of mixture components can be random and grow without bound as more data arrive. These properties make it a natural choice for our LL setting. In practice, we show in Section 5.6 how our inference expands and merges mixture components as new data arrive, starting from only one cluster. Next, we briefly review the DP and introduce our DP Gaussian mixture model to derive the joint probability density p(x, y, z, φ, v) defined in (1).

A DP is characterized by a base distribution G_0 and a concentration parameter α, denoted DP(G_0, α). A constructive definition of the DP via a stick-breaking process is of the form G = Σ_{k=1}^{∞} π_k δ_{φ_k}, where δ_{φ_k} is a discrete measure concentrated at φ_k ∼ G_0, a random sample from the base distribution G_0, with mixing proportion π_k [39]. In the DP, the π_k's are random weights independent of G_0 that satisfy 0 ≤ π_k ≤ 1 and Σ_{k=1}^{∞} π_k = 1. The weights π_k can be drawn through an iterative process:

\pi_k = \begin{cases} v_1, & \text{if } k = 1, \\ v_k \prod_{j=1}^{k-1} (1 - v_j), & \text{for } k > 1, \end{cases}

where vk ∼ Beta(1, α).

Under the generative process of DBULL in Section 3.2, these π_k's represent the probabilities of each cluster (mixture component) used in step (a), and φ_k can be seen as the parameters of the Gaussian mixture for z in step (b). Thus, given our generative process, the corresponding joint probability density for our model is

p(x, y, z, \phi, v) = p(x \mid z; \theta)\, p(z \mid y)\, p(y \mid v)\, p(v)\, G_0(\phi \mid \lambda_0)
= \prod_{n=1}^{N} \mathcal{N}\big(x_n \mid \mu(z_n; \theta), \mathrm{diag}(\sigma^2(z_n; \theta))\big) \prod_{k} \mathcal{N}(z_n \mid \mu_k, \sigma_k^2 I)\, P(y_n = k \mid \pi(v))\, \mathrm{Beta}(v_k \mid 1, \alpha_0)\, G_0(\phi_k \mid \lambda_0). \qquad (3)

For a Gaussian mixture model, the base distribution is often chosen as the Normal-Wishart (NW), denoted G_0 = NW(λ_0), which generates the mixture parameters φ_k = (μ_k, σ_k) ∼ NW(λ_0), where λ_0 = (m_0, β_0, ν_0, W_0) = (0, 0.2, D + 2, I_{D×D}) and D is the dimension of the latent vector z. These hyper-parameter values are conventional choices in the Bayesian nonparametric literature for Gaussian mixtures. Moreover, the performance of our method is robust to the hyper-parameter values.
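The following sketch (NumPy, with simplifying assumptions: the DP is truncated at T components for simulation, a plain Gaussian draw with fixed variance replaces the Normal-Wishart base distribution, and a random linear map stands in for the decoder DNN g_θ) simulates the generative process of Section 3.2 using the stick-breaking weights above.

    import numpy as np

    rng = np.random.default_rng(0)
    T, D_z, D_x, alpha = 20, 10, 784, 1.0

    # stick-breaking weights pi(v) with v_k ~ Beta(1, alpha), truncated at T components
    v = rng.beta(1.0, alpha, size=T)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi /= pi.sum()                                   # renormalize because of the truncation

    # component parameters phi_k = (mu_k, sigma_k^2); a simple Gaussian draw with a fixed
    # variance replaces the Normal-Wishart base distribution G_0 for brevity
    mu_k = rng.normal(0.0, 3.0, size=(T, D_z))
    sigma2_k = np.full(T, 0.5)

    W = rng.normal(0.0, 0.1, size=(D_z, D_x))        # stand-in for the decoder g_theta

    for _ in range(5):
        y = rng.choice(T, p=pi)                               # (a) y ~ Cat(pi(v))
        z = rng.normal(mu_k[y], np.sqrt(sigma2_k[y]))         # (b) z | y ~ N(mu_y, sigma_y^2 I)
        x = rng.normal(z @ W, 0.1)                            # (c) x | z ~ N(g_theta(z), diag(.))
        print(y, x.shape)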

5. Inference for DBULL

There are several new challenges in developing an end-to-end inference algorithm for our problem under the ULL setting compared with the batch setting: one has to deal with catastrophic forgetting, design mechanisms for past knowledge preservation, and provide dynamic model expansion capacity for novel cluster discovery. For pedagogical reasons, we first describe our general parameter learning strategy via variational inference for DBULL in a standard batch setting. We then describe how we resolve the additional challenges in the lifelong (streaming) learning setting. We describe the novel components of our inference algorithm, namely a new knowledge preservation scheme via sufficient statistics in Section 5.4 and an automatic CERR strategy in Section 5.6. A summary of our algorithm in the LL setting is provided in Algorithm 1. Our implementation is available at https://github.com/KingSpencer/DBULL, and the implementation details are provided in Appendix C. We explain the contribution of the sufficient statistics to the probabilistic density function of our problem and to knowledge preservation in Section 5.5.

5.1. Variational Inference and ELBO Derivation

In practice, it is often infeasible to obtain the exact posterior distribution since its normalizing constant is intractable. Markov Chain Monte Carlo (MCMC) methods are a family of algorithms that provide a systematic way to sample from the posterior distribution but are often slow in high-dimensional parameter spaces. Thus, effective alternative methods are needed. Variational inference is a promising alternative, which approximates the posterior distribution by casting inference as an optimization problem: it seeks, within a class of tractable distributions, the surrogate distribution that minimizes the Kullback-Leibler (KL) divergence to the exact posterior. Minimizing the KL divergence between q(y, z, φ, v | x) and p(y, z, φ, v | x) in our setting is equivalent to maximizing the Evidence Lower Bound (ELBO), where q(y, z, φ, v | x) is the variational posterior distribution used to approximate the true posterior. To make the core idea easier to follow, we provide a high-level explanation of variational inference here; mathematical details can be found in Appendix A.
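For completeness, writing h = (y, z, φ, v) for all latent variables, the standard identity behind this equivalence is

\log p(x) = \mathbb{E}_{q(h \mid x)}\!\left[\log \frac{p(x, h)}{q(h \mid x)}\right] + \mathrm{KL}\big(q(h \mid x)\,\|\,p(h \mid x)\big),

and since log p(x) does not depend on q, minimizing the KL term over q is the same as maximizing the first term, which is the ELBO in (4) below.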

Given the generative process in Section 3.2 and using Jensen’s inequality,

\log p(x) \ge \mathbb{E}_{q(y,z,\phi,v \mid x)}\!\left[\log \frac{p(x, y, z, \phi, v)}{q(y, z, \phi, v \mid x)}\right] = \mathcal{L}_{\mathrm{ELBO}}(x). \qquad (4)

For simplicity, we assume that q(y, z, φ, v | x) = q_ψ(z|x) q(y) q(v) q(φ). Thus, the ELBO is

\mathbb{E}_{q(y,z,\phi,v \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z \mid y, \phi)\, p(y \mid v)\, p(v)\, p(\phi)}{q_\psi(z \mid x)\, q(y)\, q(v)\, q(\phi)}\right]
= \mathbb{E}_{q}[\log p_\theta(x \mid z)] + \mathbb{E}_{q}[\log p(z \mid y, \phi)] - \mathbb{E}_{q}[\log q_\psi(z \mid x)] + \mathbb{E}_{q}[\log p(y \mid v)] + \mathbb{E}_{q}[\log p(v)] - \mathbb{E}_{q}[\log q(y)] - \mathbb{E}_{q}[\log q(v)] - \mathbb{E}_{q}[\log q(\phi)] + \mathbb{E}_{q}[\log p(\phi)], \qquad (5)

where, for brevity, \mathbb{E}_{q} denotes \mathbb{E}_{q(y,z,\phi,v \mid x)}.

We assume our variational distribution takes the form of

q(y, z, \phi, v \mid x) = q_\psi(z \mid x)\, q(\phi)\, q(v)\, q(y)
= \mathcal{N}\big(\mu(x; \psi), \mathrm{diag}(\sigma^2(x; \psi))\big) \prod_{t=1}^{T-1} q_{\eta_t}(v_t) \prod_{t=1}^{T} q_{\zeta_t}(\phi_t) \prod_{n=1}^{N} q_{\rho_n}(y_n)
= \mathcal{N}\big(\mu(x; \psi), \mathrm{diag}(\sigma^2(x; \psi))\big) \prod_{t=1}^{T-1} \mathrm{Beta}(\eta_{t1}, \eta_{t2}) \prod_{t=1}^{T} \mathcal{N}(\mu_t \mid m_t, (\beta_t \Lambda_t)^{-1})\, \mathcal{W}(\Lambda_t \mid W_t, \nu_t) \prod_{n=1}^{N} \mathrm{Mult}(T, \rho_n), \qquad (6)

where we denote f_ψ(x) = (μ(x; ψ), σ²(x; ψ)), which is a neural network parameterized by ψ, T is the number of mixture components in the DP of the variational distribution, z_n ∼ N(z_n | μ_t, Λ_t^{−1}), φ_t = (μ_t, Λ_t), and Mult(T, ρ_n) is a multinomial distribution. The notation used in equations (5), (6) and (7) is defined in Table 1. Our inference strategy starts with only one mixture component and uses the CERR technique described in Section 5.6 to either increase or merge the number of clusters.

Table 1:

Notations in L_ELBO-VAE(x).

Notations in the ELBO
θ: parameters in the decoder.
ψ: parameters in the encoder.
N: the total number of observations.
T: the total number of clusters.
D: the dimension of the latent representation z.
Σ: diag(σ²(x; ψ)).
x_ij: the jth dimension of the ith observation.
y_n: cluster membership for the nth observation.
p(y_n = k) = γ_nk; N_k = Σ_{n=1}^{N} γ_nk.
L: the number of Monte Carlo samples in Stochastic Gradient Variational Bayes (SGVB).
ẑ_n = (1/L) Σ_{l=1}^{L} z_n^{(l)}; z̄_k = (1/N_k) Σ_{n=1}^{N} γ_nk ẑ_n.
U_k = (1/N_k) Σ_{n=1}^{N} γ_nk (ẑ_n − z̄_k)(ẑ_n − z̄_k)^T.
β_k = β_0 + N_k: the posterior scalar precision in the NW distribution.
m_k = (1/β_k)(β_0 m_0 + N_k z̄_k): the posterior mean of cluster k.
W_k^{−1} = W_0^{−1} + N_k S_k + (β_0 N_k)/(β_0 + N_k) (z̄_k − m_0)(z̄_k − m_0)^T.
ν_k = ν_0 + N_k: the kth posterior degrees of freedom of the NW.

5.2. General Parameter Learning Strategy

In equation (5), there are mainly two types of parameters which we need to optimize. The first type includes parameters θ and ψ in the neural network. The other type involves the latent cluster membership y and the parameters for the DP Gaussian mixture model.

In order to perform joint inference for both types of parameters, we adopt an alternating optimization strategy. First, we update the neural network parameters (θ and ψ) to learn the latent representation z given the DP Gaussian mixture parameters. This is achieved by optimizing L_ELBO-VAE, which involves only the first three terms of equation (5), the ones that contribute to optimizing θ, ψ and z. Under our variational distribution assumptions in (6), taking advantage of the reparameterization trick [23] and Monte Carlo estimates of the expectations, we obtain

\mathcal{L}_{\mathrm{ELBO\text{-}VAE}}(x) = -\frac{1}{2} \sum_{k=1}^{T} N_k \nu_k\, \mathrm{Trace}(U_k W_k) - \frac{1}{2} \sum_{k=1}^{T} N_k \nu_k (\bar{z}_k - m_k)^T W_k (\bar{z}_k - m_k) - \frac{1}{2} \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N} \sum_{j=1}^{D} \left[\log \sigma^2(z; \theta)_{jl} + \frac{(x_{ij} - \mu(z; \theta)_{jl})^2}{\sigma^2(z; \theta)_{jl}}\right] + \frac{1}{2} \log \mathrm{Det}(2 \pi e \Sigma). \qquad (7)

We provide the notations in Table 1. The derivation details are provided in Appendix A. Then, we update the DP Gaussian mixture parameters and the cluster memberships given the current neural network parameters θ, ψ and the latent representation z. This allows us to use the improved latent representation to infer latent cluster memberships, and the updated clustering in turn facilitates learning the latent knowledge representation. The update equations for the DP mixture model parameters can be found in [10]. We describe the core idea of automatic CERR in our inference in Section 5.6 to explain how we start with only one cluster and achieve dynamic model expansion by creating new mixture components (clusters) given new data in LL.
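The PyTorch sketch below illustrates the shape of this alternating scheme on a single mini-batch. It is not the authors' implementation: the architecture, optimizer and simplified VaDE-style objective are illustrative assumptions, and a maximum-likelihood, EM-style update of a finite Gaussian mixture stands in for the full variational DP/Normal-Wishart update of [10].

    # Minimal sketch of the alternating update of Section 5.2 (illustrative assumptions only):
    # (1) gradient steps on the encoder/decoder with the mixture parameters held fixed, then
    # (2) closed-form mixture updates on the detached latent codes.
    import torch
    import torch.nn as nn

    D_x, D_z, K = 784, 10, 3
    enc = nn.Sequential(nn.Linear(D_x, 200), nn.ReLU(), nn.Linear(200, 2 * D_z))
    dec = nn.Sequential(nn.Linear(D_z, 200), nn.ReLU(), nn.Linear(200, D_x))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

    mu_k = torch.randn(K, D_z)            # mixture means (would come from the DP posterior)
    var_k = torch.ones(K, D_z)            # mixture variances
    log_pi = torch.log(torch.ones(K) / K) # mixing proportions
    x = torch.rand(128, D_x)              # stand-in mini-batch

    for step in range(100):
        # --- (1) encoder/decoder update given the current mixture parameters ---
        h = enc(x)
        mu_z, log_var_z = h[:, :D_z], h[:, D_z:]
        z = mu_z + torch.randn_like(mu_z) * torch.exp(0.5 * log_var_z)   # reparameterization
        x_hat = dec(z)
        recon = ((x - x_hat) ** 2).sum(dim=1).mean()                     # Gaussian recon term (up to const.)
        log_resp = log_pi - 0.5 * (((z.unsqueeze(1) - mu_k) ** 2 / var_k)
                                   + torch.log(var_k)).sum(-1)
        resp = torch.softmax(log_resp, dim=1).detach()                   # soft assignments q(y)
        kl_per_comp = 0.5 * ((torch.exp(log_var_z).unsqueeze(1)
                              + (mu_z.unsqueeze(1) - mu_k) ** 2) / var_k
                             + torch.log(var_k)).sum(-1)
        kl = (resp * kl_per_comp).sum(1).mean() - 0.5 * (1.0 + log_var_z).sum(1).mean()
        loss = recon + kl
        opt.zero_grad(); loss.backward(); opt.step()

        # --- (2) mixture update given the detached latent codes (EM-style stand-in) ---
        with torch.no_grad():
            z_d = z.detach()
            Nk = resp.sum(0) + 1e-8
            mu_k = (resp.t() @ z_d) / Nk.unsqueeze(1)
            var_k = (resp.t() @ (z_d ** 2)) / Nk.unsqueeze(1) - mu_k ** 2 + 1e-4
            log_pi = torch.log(Nk / Nk.sum())

In DBULL the second step is the full variational DP update combined with the CERR moves of Section 5.6, rather than the fixed-K EM step above.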

Our general parameter learning strategy via variational inference may seem straightforward for a batch setting at first glance. However, both the derivation and the implementation are nontrivial, especially when incorporating our new components into the end-to-end inference procedure to address the additional challenges of a LL setting. For illustration purposes, we choose to describe the high-level core idea of our inference procedure. The main difficulty lies in how to adapt our inference algorithm from a batch setting to a LL setting, which requires us to overcome catastrophic forgetting, maintain past knowledge and develop a dynamic model with automatic cluster discovery and redundancy removal capacity. Next, we describe our novel solutions.

5.3. Our Ingredients for Alleviating Catastrophic Forgetting

Catastrophic forgetting or catastrophic interference is a dramatic issue for DNNs, as witnessed in SLL [14, 19]. In our ULL setting, the issue is even more challenging since we have more sources than SLL that may lead to an abrupt decrease in model performance due to the interference of training with new data. The first source is the same as in SLL: the DNNs forget previously learned information upon learning new information. Additionally, in an unsupervised setting, the model is not able to recover the learned cluster memberships and clustering-related parameters of the DP mixture model when the previous data are no longer available, or when the previously learned information of the DNNs has been wiped out upon learning new information, since the learned clustering structure depends on the latent representation of the raw data, which is determined by the DNNs' parameters and the data distributions.

To resolve these issues, we develop our own novel solution via a combination of two ingredients: (1) generating and replaying a fixed small number of samples based on our generative process in Section 3.2 given the current DNNs and DP Gaussian mixture parameter estimates, which is a computationally effective byproduct of our algorithm; and, (2) developing a novel hierarchical sufficient statistics knowledge preservation strategy to remember the clustering information in an unsupervised setting.

We choose to replay a number of generative samples to preserve the previous data distribution instead of using a subset of past real data, since storing past data may require large memory and such data storage and replay may not be feasible in real big data applications. More details of replaying deep generative samples over real data in LL have been discussed in [19]. Moreover, our proposal to use sufficient statistics is novel and has the advantage of allowing incremental updates of the clustering information as new data arrive without the need of access to previous data because of the additive property of sufficient statistics. We introduce this novel strategy in the next section.

5.4. Sufficient Statistics for Knowledge Preservation

As LL is an emerging field, there is no well-accepted definition of knowledge or an appropriate representation scheme to efficiently maintain past knowledge from seen data. Researchers have adopted prior distributions [18] or model parameters [40] to represent past knowledge in most SLL problems, where achieving high prediction accuracy incrementally is the main objective. However, there is no guidance on preserving past knowledge in an unsupervised learning setup.

We propose a novel knowledge preservation strategy in DBULL. In our problem, there are two types of knowledge to maintain. The first contains previously learned DNNs’ parameters needed to encode the latent knowledge representation z of the raw data and the reconstruction of the real data from z. The other involves the DP Gaussian mixture parameters to represent different cluster characteristics and different cluster mixing proportions. Our novel knowledge representation scheme uses hierarchical sufficient statistics to preserve the information related to the DP Gaussian mixture. We develop a sequential updating rule to update our knowledge.

Assume that we have encountered N datasets {D_j}_{j=1}^{N} and each time only one dataset D_j can be in memory. While in memory, each dataset can be divided into M mini-batches {B_i}_{i=1}^{M}. To define the sufficient statistics, we first define the global parameters of the DP Gaussian mixture as the probabilities of each mixture component (cluster), the π_k's, and the mixture parameters (μ_k, σ_k) for each cluster k. We define the local parameters as the cluster membership of each observation in memory. To remember the characteristics of all encountered data and the local information of the current dataset, we memorize three levels of sufficient statistics. The ith mini-batch sufficient statistics of the current dataset D_j are S_k^{j,i} = (N_k(B_i), s_k(B_i)), where s_k(B_i) = Σ_{n∈B_i} γ̂_nk t(z_n), t(z_n) is the sufficient statistic representing a distribution within the exponential family (the Gaussian distribution is within the exponential family and t(z_n) = (z_n, z_n^T z_n) in our case), and γ̂_nk represents the estimated probability that the nth observation in mini-batch B_i belongs to cluster k. We also define the stream sufficient statistics S_k^j = Σ_{i=1}^{M} S_k^{j,i} of dataset D_j and the overall sufficient statistics S_k^0 = (N_k, s_k(z)) of all encountered datasets {D_j}_{j=1}^{N}.

To efficiently maintain and update our knowledge, we develop the following updating algorithm: (1) subtract the old summary of each mini-batch and update the local parameters; (2) compute a new summary for each mini-batch; and, (3) update the stream sufficient statistics for each cluster learned in the current dataset:

S_k^j \leftarrow S_k^j - S_k^{j,i}, \qquad (8)
S_k^{j,i} \leftarrow \Big(\textstyle\sum_{n \in B_i} \hat{\gamma}_{nk},\ \sum_{n \in B_i} \hat{\gamma}_{nk}\, t(z_n)\Big), \qquad (9)
S_k^j \leftarrow S_k^j + S_k^{j,i}. \qquad (10)

For the dataset D_j in the learning phase, we repeat the updating process for multiple iterations to refine our training while learning the DP Gaussian mixture parameters and the cluster memberships. Finally, we update the overall sufficient statistics by S_k^0 ← S_k^0 + S_k^j. The correctness of the algorithm is guaranteed by the additive property of the sufficient statistics.
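A small NumPy sketch of the bookkeeping in (8)-(10) is given below; the variable names and the random stand-ins for latent codes and responsibilities are hypothetical. The point is that subtracting the old mini-batch summary, recomputing it, and adding it back keeps the stream statistics consistent across multiple passes without revisiting past data.

    import numpy as np

    D_z, K = 10, 3

    def batch_stats(z, gamma):
        """Mini-batch summary (N_k(B_i), sum_n gamma_nk z_n, sum_n gamma_nk z_n^T z_n)."""
        Nk = gamma.sum(axis=0)                          # (K,)
        s = gamma.T @ z                                 # (K, D_z)
        t2 = gamma.T @ (z ** 2).sum(axis=1)             # (K,)
        return Nk, s, t2

    rng = np.random.default_rng(0)
    stream = [np.zeros(K), np.zeros((K, D_z)), np.zeros(K)]   # stream statistics S_k^j
    cache = {}                                                # old per-mini-batch summaries

    for it in range(3):                                 # multiple passes over the mini-batches
        for i in range(5):                              # 5 mini-batches of the current dataset
            z = rng.normal(size=(64, D_z))              # latent codes of mini-batch i (stand-in)
            gamma = rng.dirichlet(np.ones(K), size=64)  # responsibilities (stand-in for q(y_n))
            if i in cache:                              # (8) subtract the old summary of this mini-batch
                stream = [a - b for a, b in zip(stream, cache[i])]
            cache[i] = batch_stats(z, gamma)            # (9) recompute the mini-batch summary
            stream = [a + b for a, b in zip(stream, cache[i])]   # (10) add it back to the stream stats

    # after the dataset is finished, fold the stream statistics into the overall statistics S_k^0
    overall = [np.zeros(K), np.zeros((K, D_z)), np.zeros(K)]
    overall = [a + b for a, b in zip(overall, stream)]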

5.5. Contribution of Sufficient Statistics to Alleviate Forgetting

The sufficient statistics alleviate forgetting by preserving data characteristics and allowing sequential updates in LL without the need to save real data. To be precise, the sufficient statistics allow us to update the log-likelihood and the ELBO sequentially, since both terms are linear functions of the expectation of the sufficient statistics. Given the expected sufficient statistics, we are able to evaluate the first two terms of L_ELBO-VAE(x) in equation (7) and p(z|y)p(y|v) in the joint probability density function of our model in equation (3). Next, we provide mathematical derivations to illustrate this.

Define the sufficient statistics of all data as S^0 = (S_1^0, S_2^0, …, S_K^0), where K is the number of clusters. Define S_k^0 = (N_k, Σ_{n=1}^{N} r_nk t(z_n)), where t(z_n) = (z_n, z_n^T z_n), r_nk denotes the probability of the nth observation belonging to cluster k, N_k = Σ_{n=1}^{N} r_nk, and N is the total number of observations. Given the sufficient statistics and the current mixture parameters (μ_k, σ_k) and π(v), we can evaluate ∏_k N(z_n | μ_k, σ_k² I) P(y_n = k | π(v)) in the joint probability density function in equation (3) without storing each latent representation z_n for all data. Similarly, we can also evaluate the first two terms of L_ELBO-VAE(x) in equation (7) with the notations defined in Table 1.
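As a concrete check, the NumPy snippet below (spherical Gaussian, hypothetical values) verifies that the responsibility-weighted log-likelihood Σ_n r_nk log N(z_n | μ_k, σ_k² I) computed from the raw latent codes matches the value computed from (N_k, Σ_n r_nk z_n, Σ_n r_nk z_nᵀz_n) alone.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, sigma2 = 500, 8, 0.7
    z = rng.normal(size=(N, D))                       # latent codes
    r = rng.dirichlet(np.ones(4), size=N)[:, 0]       # responsibilities r_nk for one cluster k
    mu = rng.normal(size=D)                           # current mixture mean mu_k

    # direct evaluation from the raw latent codes
    direct = np.sum(r * (-0.5 * D * np.log(2 * np.pi * sigma2)
                         - 0.5 * np.sum((z - mu) ** 2, axis=1) / sigma2))

    # evaluation from the sufficient statistics only
    Nk, s, t2 = r.sum(), r @ z, r @ np.sum(z ** 2, axis=1)
    from_stats = (-0.5 * Nk * D * np.log(2 * np.pi * sigma2)
                  - 0.5 * (t2 - 2 * mu @ s + Nk * mu @ mu) / sigma2)

    print(np.allclose(direct, from_stats))            # True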

5.6. Cluster Expansion and Redundancy Removal Strategy

Our model starts with one cluster and, as new data with different characteristics arrive, we expect the model to either dynamically grow the number of clusters or merge clusters that have similar characteristics. To achieve this, we perform birth and merge moves in a similar fashion to the Nonparametric Bayesian literature [41] to allow automatic CERR. However, we would like to emphasize that our work is different from [41] since our merge moves have extra constraints. To avoid losing information about clusters learned earlier, we only allow merge moves between two novel clusters from the birth move, or between one existing cluster and a newborn cluster. Two previously existing clusters cannot be merged. Reference [41] is designed for batch learning and thus does not require this constraint (but in LL this constraint is important for avoiding information loss).

It is challenging to give birth to new clusters with streaming data since the number of observations may not be sufficient to inform good proposals. To resolve this issue, we follow [41] by collecting a subsample of data for each learned cluster k. We cache a sample in the subsample if the probability that the nth observation is assigned to cluster k exceeds a threshold of 0.1, a value suggested by [41]. In this paper, we try to choose commonly used parameter values from the literature and avoid dataset-specific tuning as much as possible. We fit the DP Gaussian mixture to the cached samples with one cluster and expand the model with 10 novel clusters. However, adopting only the birth moves may overcluster the observations. After the birth move, we merge clusters by (1) selecting candidate clusters to merge and (2) merging two selected clusters if the ELBO improves. A candidate pair is selected if the marginal likelihood of the merged cluster is larger than the marginal likelihood of keeping the two clusters separate.
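The sketch below illustrates only the constrained merge step (the birth step needs cached samples and a DP fit, which we omit), and it is not the paper's code: clusters are summarized by additive sufficient statistics for spherical Gaussians, a BIC-style score stands in for the marginal-likelihood/ELBO comparison, and the LL constraint that two previously existing clusters are never merged is enforced explicitly.

    import numpy as np
    from itertools import combinations

    def bic_score(N, s, t2, D):
        """BIC-style stand-in for a cluster's log marginal likelihood, computed from the
        additive sufficient statistics (N, s = sum_n r_n z_n, t2 = sum_n r_n z_n^T z_n)."""
        mu = s / N
        var = max(t2 / N - mu @ mu, 1e-6) / D          # pooled spherical variance (assumption)
        loglik = -0.5 * N * D * np.log(2 * np.pi * var) - 0.5 * N * D
        return loglik - 0.5 * (D + 1) * np.log(N)      # penalize the (D + 1) free parameters

    def merge_pass(clusters, D):
        """clusters: dict id -> {'N', 's', 't2', 'existing'}. Merge a pair only if at least
        one member is newborn and the merged score does not decrease."""
        merged = True
        while merged:
            merged = False
            for a, b in combinations(list(clusters), 2):
                ca, cb = clusters[a], clusters[b]
                if ca['existing'] and cb['existing']:
                    continue                            # LL constraint: never merge two old clusters
                separate = (bic_score(ca['N'], ca['s'], ca['t2'], D)
                            + bic_score(cb['N'], cb['s'], cb['t2'], D))
                joint = bic_score(ca['N'] + cb['N'], ca['s'] + cb['s'], ca['t2'] + cb['t2'], D)
                if joint >= separate:                   # accept the merge (redundancy removal)
                    clusters[a] = {'N': ca['N'] + cb['N'], 's': ca['s'] + cb['s'],
                                   't2': ca['t2'] + cb['t2'],
                                   'existing': ca['existing'] or cb['existing']}
                    del clusters[b]
                    merged = True
                    break                               # restart the scan after every accepted merge
        return clusters

    # toy usage: one existing cluster and two newborn clusters proposed by a birth move
    rng = np.random.default_rng(0)
    def summarize(z, existing):
        return {'N': len(z), 's': z.sum(0), 't2': float((z ** 2).sum()), 'existing': existing}
    z_old = rng.normal(0.0, 1.0, size=(200, 5))
    z_new = rng.normal(4.0, 1.0, size=(200, 5))
    clusters = {0: summarize(z_old, True),
                1: summarize(z_new[:100], False), 2: summarize(z_new[100:], False)}
    print(len(merge_pass(clusters, D=5)))              # the two newborn clusters should merge, leaving 2

In DBULL the comparison uses the DP mixture's marginal likelihood and the ELBO rather than the BIC stand-in, and the same additive statistics of Section 5.4 are what make the candidate scores cheap to evaluate.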

Algorithm 1 Variational Inference for DBULL
1: Initialization: Initialize the DNNs, the variational distributions and the hyperparameters for the Dirichlet Process Mixture Model (DPMM).
2: for j = 1, 2, …, N datasets in memory do
3:   for epoch = 1, 2, … do
4:     for h = 1, 2, …, H iterations do
5:       Update the weights ω = (ψ, θ) of the encoder and decoder via ω^{h+1} = ω^{h} + η ∂L_ELBO-VAE(x)/∂ω^{h} to maximize L_ELBO-VAE(x) in equation (7) given the current DPMM parameters, where η is the learning rate when using stochastic gradient descent.
6:     end for
7:     Compute the deep representation z of observations x using the encoder z = f_ψ(x) = (μ(x; ψ), σ²(x; ψ)).
8:     for i = 1, 2, …, M_j mini-batches of dataset j do
9:       while the ELBO of the DPMM has not converged do
10:        Visit the ith mini-batch of z in a full pass over the jth dataset.
11:        Update the mini-batch sufficient statistics and stream sufficient statistics of Section 5.4 via equations (8) and (9).
12:        Update the local and global parameters of the DPMM using mini-batch i of dataset j via standard variational inference for the DPMM [10], provided in Appendix A.
13:        Perform cluster expansion described in Section 5.6 to propose new clusters.
14:        Perform redundancy removal described in Section 5.6 to merge clusters if the ELBO improves.
15:      end while
16:    end for
17:    Update the overall sufficient statistics defined in Section 5.4 via equation (10).
18:  end for
19: end for
20: Output: Learned DNNs, variational approximation to the posterior, deep latent representations, cluster assignment for each observation, and cluster representatives in the latent space.

6. Experiments

Datasets.

We adopt the most common text and image benchmark datasets in LL to evaluate the performance of our method. The MNIST database of 70,000 handwritten digit images [42] is widely used to evaluate deep generative models [23, 25, 24, 27, 26] for representation learning and LL models in both supervised [14, 18, 19] and unsupervised learning contexts [22]. To provide a fair comparison with state-of-the-art competing methods and easy interpretation, we mainly use MNIST to evaluate the performance of our method and interpret our results with intuitive visualization patterns. To examine our method on more complex datasets, we use text Reuters10k [43] and image STL-10 [44] databases. STL-10 is at least as hard as a well-known image database CIFAR-10 [45] since STL-10 has fewer labeled training examples within each class. The summary statistics for the datasets are provided in Table 2.

Table 2:

Summary statistics for benchmark datasets.

Dataset # Samples Dimension # Classes
MNIST 70000 784 10
Reuters10k 10000 2000 4
STL-10 13000 2048 10

We adopt the same neural network architecture as in [26]. All values of the tuning parameters and the implementation details in the DNNs are provided in Appendix C. Our implementation is publicly available at https://github.com/KingSpencer/DBULL.

Competing Methods in Unsupervised Lifelong Learning.

CURL is currently the only unsupervised lifelong learning method with both representation learning and new cluster discovery capacity, which makes it the latest, most related and most comparable method to ours. We use CURL-D to denote CURL run without the true number of clusters provided, which it instead detects Dynamically given unlabelled streaming data.

Competing Methods in a Classic Batch Setting.

Although designed for LL, to show the generality of DBULL in a batch training mode, we compare it against recent deep (generative) methods with representation learning and clustering capacity designed for batch settings, including DEC [25], VaDE [26], CURL-F [22] and VAE+DP. CURL-F represents CURL with the true number of clusters provided as a Fixed value. VAE+DP first fits a Variational Autoencoder (VAE) [23] to learn latent representations and then uses a DP to learn the clustering, in two separate steps. We list the capabilities of the different methods in Table 3.

Table 3:

DBULL and competing methods capacity comparison.

                         Lifelong Learning    Batch Setting
                         DBULL   CURL-D       DEC   VaDE   VAE+DP   CURL-F
Representation Learning  yes     yes          yes   yes    yes      yes
Learns # of Clusters     yes     yes          no    no     yes      no
Dynamic Expansion        yes     yes          no    no     yes      no
Overcome Forgetting      yes     yes          no    no     no       yes

Evaluation Metrics.

One of the main objectives of our method is to perform new cluster discovery with streaming non-stationary data. Thus, it is desirable for our method to achieve superior clustering quality in both ULL and batch settings. We adopt clustering quality metrics including Normalized Mutual Information (NMI), Adjusted Rand Score (ARS), Homogeneity Score (HS), Completeness Score (CS) and V-measure Score (VM). These are all normalized metrics ranging from zero to one, and larger values indicate better clustering quality. Values of one for NMI, ARS and VM represent perfect agreement with the ground-truth clustering. CS is a symmetrical metric to HS. Detailed definitions of these metrics can be found in [46].
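For reference, all five metrics are available in scikit-learn; the toy labels below are only for illustration.

    from sklearn import metrics

    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [1, 1, 0, 0, 2, 3]       # cluster ids need not match label ids

    print("NMI:", metrics.normalized_mutual_info_score(y_true, y_pred))
    print("ARS:", metrics.adjusted_rand_score(y_true, y_pred))
    print("HS: ", metrics.homogeneity_score(y_true, y_pred))
    print("CS: ", metrics.completeness_score(y_true, y_pred))
    print("VM: ", metrics.v_measure_score(y_true, y_pred))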

6.1. Lifelong Learning Performance Comparison

Experiment Objective.

It is desirable for LL methods to adapt a learned model to new data while retaining the information learned earlier. The objective of this experiment is to demonstrate that DBULL has this capacity and is effective compared with state-of-the-art LL methods, i.e., that there is no dramatic performance decrease on past data even after the model has been updated with new information.

Experiment Setup.

To evaluate the performance of DBULL, we adopt the most common experiment setup in LL, called Split MNIST, which uses images from MNIST [15, 18]. We divide MNIST into 5 disjoint subsets, each containing 10,000 random samples of two digit classes, in the order of digits 0–1, 2–3, 4–5, 6–7 and 8–9, denoted as DS1, DS2, DS3, DS4, and DS5. Each dataset is divided into 20 subsets that arrive in a sequential order to mimic a LL setting. We denote DSi:j as all data from DSi to DSj, where i, j = 1, 2, . . . , 5.
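A sketch of this protocol is shown below; the placeholder arrays stand in for the actual MNIST images and labels, and the helper names are ours.

    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.random((70000, 784))                 # placeholder for the MNIST pixels
    labels = rng.integers(0, 10, size=70000)          # placeholder for the MNIST digit labels

    tasks = []
    for digits in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
        idx = np.where(np.isin(labels, digits))[0]
        idx = rng.permutation(idx)[:10000]            # 10,000 random samples of the two digits
        chunks = np.array_split(idx, 20)              # 20 subsets arriving in sequential order
        tasks.append([(images[c], labels[c]) for c in chunks])

    # tasks[i][j] is the j-th arriving subset (images, labels) of DS_{i+1}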

Discussion on Performance.

To check whether our method suffers dramatic performance loss due to catastrophic forgetting, we sequentially train our method DBULL and its LL competitor CURL-D on DS1, DS2, . . . , DS5. We define TASKi as training on DSi, where i = 1, 2, . . . , 5. We measure the performance of TASK1 after training TASK1, TASK2, TASK3, TASK4, TASK5 with datasets DS1, DS1:2, DS1:3, DS1:4, and DS1:5, the performance of TASK2 after training TASK2, TASK3, TASK4, TASK5 with datasets DS2, DS2:3, DS2:4, DS2:5, etc. We report the LL clustering quality of each task after sequentially training the five tasks in Fig. 3 and Fig. 4.

Figure 3:

Clustering quality in terms of ARS and CS, measured on each task every 500 iterations during sequential training from TASK1 to TASK5.

Figure 4:

Clustering quality in terms of HS, NMI and VM, measured on each task every 500 iterations during sequential training from TASK1 to TASK5.

Fig. 3 and Fig. 4 reflect that DBULL has better performance in handling catastrophic forgetting than CURL-D since DBULL has slightly less performance drop than CURL-D for previous tasks in almost all scenarios in terms of nearly all clustering metrics.

Fig. 5 reflects that DBULL has advantages over CURL-D in handling over-clustering issues. Since each task has two digits, the true number of clusters seen after training each task sequentially is 2, 4, 6, 8, 10. The number of clusters automatically detected by DBULL after training TASK1, . . ., TASK5 is 4, 6, 8, 10, 12. DBULL clusters digit 1 into three clusters of different handwritten patterns in TASK1. For the other digits, DBULL assigns each new digit to exactly one cluster, matching the ground truth. In contrast, CURL-D clustered digits 0–1 into 14–16 clusters in TASK1 and obtained 23–25 clusters for the 10 digits after training the five tasks sequentially. We provide a visualization of the reconstructed cluster means from the DP mixture model via the trained decoder of DBULL in Fig. 6.

Figure 5:

Number of clusters detected by CURL-D and DBULL, measured every 500 iterations during sequential training from TASK1 to TASK5, compared with the ground truth.

Figure 6:

Decoded images using the DP Gaussian mixture posterior means after sequential training from TASK1 to TASK5 using DBULL.

Besides the overall clustering quality reported, we also provide the precision and recall of DBULL for each digit after sequentially training all tasks. CURL-D overclusters the 10 digits into 25 clusters, making it hard to report the precision and recall of each digit. To visualize the results, the three sub-clusters of digit one found by DBULL have been merged into one cluster. Overall, there is no significant performance loss on previous tasks after sequentially training multiple tasks for digits 0, 1, 3, 4, 6, 8, 9. Digit 2 experiences a precision decrease after training TASK4 (digits 6 and 7), since DBULL has trouble differentiating some samples of digits 2 and 7.

6.2. Batch Setting Clustering Performance Comparison

Experiment Objective.

The goal of this experiment is to demonstrate the generality of our LL method DBULL, which can achieve comparable performance to competing methods in an unsupervised batch setting.

Experiment Setup.

To examine the performance of our method in a batch setting, we test it on more complex datasets, including Reuters10k, obtained from the original Reuters corpus [43], and the image dataset STL-10 [44]. We use the same Reuters10k and STL-10 datasets as [25, 26]. The details of Reuters10k and STL-10 are provided in Appendix B. For all datasets, we randomly select 80% of the samples for training and evaluate the performance on the remaining 20% of the samples for all methods.

Discussion on Performance.

The true number of clusters is provided to the competing methods DEC, VaDE and CURL-F in advance since they require the total number of clusters. DBULL, CURL-D and VAE+DP have less information than DEC, VaDE and CURL-F since they have no knowledge of the true number of clusters; they all start with one cluster and detect the number of clusters on the fly. Thus, if DBULL can achieve similar performance to DEC, VaDE and CURL-F and outperform its LL counterpart CURL-D, this demonstrates DBULL's effectiveness. Table 4 shows that DBULL performs the best in NMI and VM for MNIST and in NMI, ARS and VM for STL10, and outperforms CURL-D on MNIST and STL10. Moreover, DBULL and DEC are more stable in terms of all evaluation metrics because of their smaller standard errors compared with the other methods.

Table 4:

Clustering quality (%) comparison averaged over five replications with both the average value and the standard error (in the parenthesis) provided.

Dataset Method NMI ARI
MNIST DEC 84.67 (2.25) 83.67 (4.53)
VaDE 80.35 (4.68) 74.06 (9.11)
VAE+DP 81.70 (0.825) 70.49 (1.654)
CURL-F 69.76 (2.51) 56.47 (4.11)
CURL-D 63.51 (1.32) 36.84 (1.98)
DBULL 85.72 (1.02) 83.53 (2.35)

Reuters10k DEC 46.56 (5.36) 46.86 (7.98)
VaDE 41.64 (4.73) 38.49 (5.44)
VAE + DP 41.62 (2.99) 37.93 (4.57)
CURL-F 51.92 (3.22) 47.72 (4.00)
CURL-D 46.31 (1.83) 22.00 (3.60)
DBULL 45.32 (1.79) 42.66 (5.73)

STL10 DEC 71.92 (2.66) 58.73 (5.09)
VaDE 68.35 (3.85) 59.42 (6.84)
VAE+DP 43.18 (1.41) 26.58 (1.32)
CURL-F 66.98 (3.38) 51.24 (4.06)
CURL-D 65.71 (1.33) 37.96 (4.69)
DBULL 75.26 (0.53) 70.72 (0.81)

Dataset Method HS VM

MNIST DEC 84.67 (2.25) 80.57 (1.96)
VaDE 79.86 (4.93) 80.36 (4.69)
VAE+DP 91.27 (0.215) 81.19 (0.904)
CURL-F 68.60 (2.56) 69.75 (2.51)
CURL-D 76.35 (1.53) 62.45 (1.32)
DBULL 89.34 (0.25) 85.65 (0.51)

Reuters10k DEC 48.44 (5.44) 46.52 (5.36)
VaDE 43.64 (4.88) 41.60 (4.73)
VAE + DP 46.64 (3.85) 41.34 (2.94)
CURL-D 66.90 (2.09) 43.34 (2.00)
CURL-F 54.38 (3.49) 51.86 (3.21)
DBULL 48.88 (1.86) 45.40 (2.04)

STL10 DEC 68.47 (3.48) 71.83 (2.72)
VaDE 67.24 (4.23) 68.37 (3.92)
VAE+DP 42.28 (1.03) 43.16 (1.39)
CURL-F 65.46 (3.27) 66.96 (3.37)
CURL-D 80.86 (2.94) 64.31 (1.24)
DBULL 77.61 (1.29) 75.22 (0.52)

To assess whether there is any statistically significant difference in the clustering measures NMI, ARS, HS and VM among DBULL and all competing methods, we perform the Friedman test. The Friedman test is a nonparametric alternative to the one-way repeated measures ANOVA when the normality assumption of ANOVA is not satisfied; it also extends the sign test to situations with more than two groups in comparison. If the P-value of the Friedman test is smaller than the significance level α = 0.05, we reject the null hypothesis, indicating a significant difference among the methods in comparison.

To further evaluate whether the clustering performance of DBULL is significantly better than each competing method, we perform the paired Wilcoxon signed-rank test. P-values are adjusted using the Bonferroni multiple testing correction. We present the Friedman test and paired Wilcoxon signed-rank test results for NMI, ARS, HS and VM in Tables D.8–D.14 in Appendix D. We summarize the following key conclusions based on these results.

  • There is a statistically significant difference in the clustering measures NMI, ARS, HS and VM between at least two of the methods in comparison, according to the Friedman tests shown in Tables D.8, D.10, D.12 and D.14.

  • In terms of NMI, DBULL is significantly better than CURL-D, VaDE and VAE+DP based on the paired Wilcoxon signed-rank test, according to Table D.9.

  • In terms of ARS, DBULL is significantly better than VaDE, VAE+DP, CURL-F and CURL-D based on paired Wilcoxon signed-rank test according to Table D.11.

  • In terms of HS, DBULL is significantly better than DEC, VaDE and CURL-F based on paired Wilcoxon signed-rank test according to Table D.13.

  • In terms of VM, DBULL is significantly better than VaDE, VAE+DP and CURL-D based on paired Wilcoxon signed-rank test according to Table D.15.

We also report the number of clusters found by DBULL and CURL-D for MNIST, Reuters10k and STL-10 over five replications in Table 5. Table 5 shows that DBULL handles overclustering issues better than CURL-D. In summary, Tables 4 and 5 demonstrate DBULL's effectiveness in a batch setting. In this paper, we also follow the standard clustering evaluation metric used by DEC, unsupervised clustering accuracy (ACC), defined by [47], to compare our method with the competing methods. ACC is defined as

\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{N} \mathbf{1}\{l_i = m(c_i)\}}{N}, \qquad (11)

where N is the total number of observations, l_i is the ground-truth label, c_i is the cluster assignment produced by the algorithm, and m ranges over all possible one-to-one mappings between clusters and labels. When we fix the number of clusters to the number of ground-truth categories in DEC, VaDE, and CURL-F, ACC under this definition is the same as the classification accuracy. We are not able to compute the original ACC for DBULL since m is a one-to-one mapping, whereas DBULL can detect more clusters than the number of ground-truth categories. To resolve this issue, we extend the definition of ACC by allowing m to range over all possible many-to-one mappings between clusters and the true labels. When there are more clusters than ground-truth categories, the classification accuracy is the same as this extended ACC. As shown in Table 5, out of five replications, the number of clusters found by DBULL ranges from 11 to 15. We report the accuracy comparison in Table 6.
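The extended ACC can be computed with the simple routine below (our own helper, not from the released code): the optimal many-to-one mapping m assigns each detected cluster to the ground-truth label it co-occurs with most often.

    import numpy as np

    def extended_acc(y_true, y_pred):
        """Extended ACC of Eq. (11) with m ranging over many-to-one mappings: each predicted
        cluster is mapped to its majority ground-truth label, then accuracy is computed."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        correct = 0
        for c in np.unique(y_pred):
            _, counts = np.unique(y_true[y_pred == c], return_counts=True)
            correct += counts.max()                   # best label for this cluster
        return correct / len(y_true)

    # example: 3 detected clusters covering 2 ground-truth classes
    print(extended_acc([0, 0, 0, 1, 1, 1], [2, 2, 0, 0, 1, 1]))   # 5/6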

Table 5:

Number of clusters found by DBULL and CURL-D over five replications, where the upper bounds on the number of clusters for Reuters10k and STL-10 are set at 40 and 50.

Datasets True # of Clusters DBULL CURL-D
MNIST 10 11–15 34
Reuters10k 4 5–10 40
STL-10 10 12–15 50
Table 6:

Classification accuracy (same as (extended) ACC defined in Equation 11) for VaDE, DEC, CURL-F and DBULL. We report the best accuracy for DBULL since DEC [25] and VaDE [26] only report their best accuracy. CURL-F [22] reports both the best accuracy and the average accuracy with the standard error.

Datasets DEC VaDE CURL-F (best) CURL-F (average) DBULL
MNIST 84.30% 94.46% 84% 79.38% (4.26%) 92.27%

7. Conclusion

In this work, we introduce our approach DBULL for unsupervised LL problems. DBULL is a novel end-to-end approximate Bayesian inference algorithm, which is able to perform automatic new task discovery via our proposed dynamic model expansion strategy, adapt to changes in the evolving data distributions, and overcome forgetting using our proposed information extraction mechanism via summary sufficient statistics, all while learning the underlying representation simultaneously. Experiments on MNIST, Reuters10k and STL-10 demonstrate that DBULL has competitive performance compared with state-of-the-art methods in both a batch setting and an unsupervised LL setting. Currently, we do not explore high-resolution tasks for LL; in the future, we plan to investigate more challenging tasks on ImageNet [48]. In the field of LL, little attention has been paid to uncertainty quantification. Bayesian neural networks provide a natural choice by including uncertainty through priors on the weights of the neural network and quantifying the uncertainty of the functional mean via posterior predictive distributions. Our work is developed under the Bayesian framework and we hope to extend it to characterize uncertainty in Bayesian LL. In follow-up work, we also intend to explore additional techniques to alleviate forgetting and to develop more advanced methods for continually learning unsupervised representations under various setups, such as the reinforcement learning domain.

Supplementary Material


Figure 7:

Precision and recall for each digit of DBULL, evaluated every 500 iterations during sequential training from TASK1 to TASK5.

Acknowledgment

The work described was supported in part by Award Numbers U01 HL089856 from the National Heart, Lung, Blood Institute and NIH/NCI R01 CA199673, and NSF 1934846.

Footnotes

Declaration of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • [1]. McCloskey M, Cohen NJ, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of Learning and Motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
  • [2]. Thrun S, Mitchell TM, Lifelong robot learning, Robotics and Autonomous Systems 15 (1–2) (1995) 25–46.
  • [3]. Thrun S, Is learning the n-th thing any easier than learning the first?, in: Advances in Neural Information Processing Systems, 1996, pp. 640–646.
  • [4]. Ruvolo P, Eaton E, ELLA: An efficient lifelong learning algorithm, in: International Conference on Machine Learning, 2013, pp. 507–515.
  • [5]. Chen Z, Ma N, Liu B, Lifelong learning for sentiment classification, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 750–756.
  • [6]. Sarwar SS, Ankit A, Roy K, Incremental learning in deep convolutional neural networks using partial network sharing, IEEE Access.
  • [7]. Hou S, Pan X, Change Loy C, Wang Z, Lin D, Lifelong learning via progressive distillation and retrospection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 437–452.
  • [8]. Ruvolo P, Eaton E, Active task selection for lifelong machine learning, in: Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • [9]. Isele D, Rostami M, Eaton E, Using task features for zero-shot knowledge transfer in lifelong learning, in: IJCAI, 2016, pp. 1620–1626.
  • [10]. Blei DM, Jordan MI, et al., Variational inference for Dirichlet process mixtures, Bayesian Analysis 1 (1) (2006) 121–143.
  • [11]. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S, Continual lifelong learning with neural networks: A review, Neural Networks.
  • [12]. McClelland JL, McNaughton BL, O'Reilly RC, Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory, Psychological Review 102 (3) (1995) 419.
  • [13]. Chen Z, Liu B, Lifelong Machine Learning, Morgan & Claypool Publishers, 2018.
  • [14]. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (13) (2017) 3521–3526.
  • [15]. Zenke F, Poole B, Ganguli S, Continual learning through synaptic intelligence, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 3987–3995.
  • [16]. Robins A, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7 (2) (1995) 123–146.
  • [17]. Xu H, Liu B, Shu L, Yu PS, Lifelong domain word embedding via meta-learning, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4510–4516.
  • [18]. Nguyen CV, Li Y, Bui TD, Turner RE, Variational continual learning, in: International Conference on Learning Representations (ICLR), 2018.
  • [19]. Shin H, Lee JK, Kim J, Kim J, Continual learning with deep generative replay, in: Advances in Neural Information Processing Systems, 2017, pp. 2990–2999.
  • [20]. Shu L, Liu B, Xu H, Kim A, Lifelong-RL: Lifelong relaxation labeling for separating entities and aspects in opinion targets, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, p. 225.
  • [21]. Shu L, Xu H, Liu B, Lifelong learning CRF for supervised aspect extraction, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.
  • [22]. Rao D, Visin F, Rusu A, Pascanu R, Teh YW, Hadsell R, Continual unsupervised representation learning, in: Advances in Neural Information Processing Systems, 2019, pp. 7645–7655.
  • [23]. Kingma DP, Welling M, Auto-encoding variational Bayes, in: International Conference on Learning Representations (ICLR), 2014.
  • [24]. Johnson M, Duvenaud DK, Wiltschko A, Adams RP, Datta SR, Composing graphical models with neural networks for structured representations and fast inference, in: Advances in Neural Information Processing Systems, 2016, pp. 2946–2954.
  • [25]. Xie J, Girshick R, Farhadi A, Unsupervised deep embedding for clustering analysis, in: International Conference on Machine Learning, 2016, pp. 478–487.
  • [26]. Jiang Z, Zheng Y, Tan H, Tang B, Zhou H, Variational deep embedding: an unsupervised and generative approach to clustering, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press, 2017, pp. 1965–1972.
  • [27]. Goyal P, Hu Z, Liang X, Wang C, Xing EP, Nonparametric variational auto-encoders for hierarchical representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5094–5102.
  • [28]. LeCun Y, Bengio Y, Hinton G, Deep learning, Nature 521 (7553) (2015) 436–444.
  • [29]. Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X, Khosravi A, Acharya UR, et al., A review of uncertainty quantification in deep learning: Techniques, applications and challenges, Information Fusion.
  • [30]. Kompa B, Snoek J, Beam AL, Second opinion needed: communicating uncertainty in medical machine learning, NPJ Digital Medicine 4 (1) (2021) 1–6.
  • [31]. Abdar M, Fahami MA, Chakrabarti S, Khosravi A, Pławiak P, Acharya UR, Tadeusiewicz R, Nahavandi S, BARF: A new direct and cross-based binary residual feature fusion with uncertainty-aware module for medical image classification, Information Sciences 577 (2021) 353–378.
  • [32]. Abdar M, Salari S, Qahremani S, Lam H-K, Karray F, Hussain S, Khosravi A, Acharya UR, Nahavandi S, UncertaintyFuseNet: Robust uncertainty-aware hierarchical feature fusion with ensemble Monte Carlo dropout for COVID-19 detection, arXiv preprint arXiv:2105.08590.
  • [33]. Kamarthi H, Kong L, Rodríguez A, Zhang C, Prakash BA, When in doubt: Neural non-parametric uncertainty quantification for epidemic forecasting, arXiv preprint arXiv:2106.03904.
  • [34]. Abdar M, Samami M, Mahmoodabad SD, Doan T, Mazoure B, Hashemifesharaki R, Liu L, Khosravi A, Acharya UR, Makarenkov V, et al., Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning, Computers in Biology and Medicine (2021) 104418.
  • [35]. Sethuraman J, Tiwari RC, Convergence of Dirichlet measures and the interpretation of their parameter, in: Statistical Decision Theory and Related Topics III, Elsevier, 1982, pp. 305–315.
  • [36]. Hornik K, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (2) (1991) 251–257.
  • [37]. Kingma DP, Mohamed S, Rezende DJ, Welling M, Semi-supervised learning with deep generative models, in: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
  • [38]. Nalisnick E, Hertel L, Smyth P, Approximate inference for deep latent Gaussian mixtures, in: NIPS Workshop on Bayesian Deep Learning, Vol. 2, 2016.
  • [39]. Ishwaran H, James LF, Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96 (453) (2001) 161–173.
  • [40]. Lee S, Stokes J, Eaton E, Learning shared knowledge for deep lifelong learning using deconvolutional networks, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2837–2844.
  • [41]. Hughes MC, Sudderth E, Memoized online variational inference for Dirichlet process mixture models, in: Advances in Neural Information Processing Systems, 2013, pp. 1133–1141.
  • [42]. LeCun Y, Bottou L, Bengio Y, Haffner P, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • [43]. Lewis DD, Yang Y, Rose TG, Li F, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research 5 (Apr) (2004) 361–397.
  • [44]. Coates A, Ng A, Lee H, An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
  • [45]. Krizhevsky A, Hinton G, et al., Learning multiple layers of features from tiny images.
  • [46]. Rosenberg A, Hirschberg J, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
  • [47]. Yang Y, Xu D, Nie F, Yan S, Zhuang Y, Image clustering using local discriminant models and global integration, IEEE Transactions on Image Processing 19 (10) (2010) 2761–2773.
  • [48]. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
  • [49]. He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [50]. Kingma DP, Ba J, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
