Multi-view Deep Gaussian Process with a Pre-training Acceleration Technique

Han Zhu; Jing Zhao; Shiliang Sun

2020 Apr 17;12085:299–311. doi: 10.1007/978-3-030-47436-2_23

Editors: Hady W Lauw, Raymond Chi-Wing Wong, Alexandros Ntoulas, Ee-Peng Lim, See-Kiong Ng, Sinno Jialin Pan

PMCID: PMC7206310

Abstract

Deep Gaussian process (DGP) is a popular probabilistic modeling method that is powerful and widely used for function approximation and uncertainty estimation. However, the traditional DGP does not consider multi-view cases, in which data may come from different sources or be constructed from different types of features. In this paper, we propose a generalized multi-view DGP (MvDGP) that captures the characteristics of different views and models the data of each view separately. To make the proposed model more efficient to train, we introduce a pre-training network in MvDGP and incorporate stochastic variational inference for fine-tuning. Experimental results on real-world data sets demonstrate that pre-trained MvDGP outperforms state-of-the-art DGP models and deep neural networks, and achieves higher computational efficiency than other DGP models.

Keywords: Deep Gaussian process, Multi-view learning, Variational inference, Stochastic optimization, Pre-training technique

Introduction

Gaussian processes (GPs) have strong representational power and can estimate the uncertainty of predictions effectively [5, 11, 16]. A deep Gaussian process (DGP) is a stack of multiple GP layers [1, 2, 13]. Benefiting from its hierarchical structure, DGP not only retains the strengths of GP but also overcomes some of its limitations and obtains a stronger mapping capability. However, the main difficulty of DGP lies in the intractable computations that arise during training. The Bayesian training framework based on variational inference for DGP is a classical method, but it is limited by the scale of the data [2]. Doubly stochastic variational inference is a state-of-the-art and widely used inference technique, which adopts stochastic optimization and makes it possible to apply DGP to large-scale data [13]. Recently, several works have focused on non-Gaussian posteriors in real-world data to develop DGP further [3, 14].

Traditional GP models focus only on modeling data from a single source. As the amounts and sources of data grow, data that integrate multiple feature sets are referred to as multi-view data [17, 20]. It is improper to treat data from different views equally, and thus multi-view learning has flourished. GP-based models have been extended to multi-view scenarios: the multi-view regularized GP [8] and the sparse multimodal GP [9] generalize the shallow GP model. A DGP-based model has also been developed [18], but it is limited to multi-view unsupervised representation learning, so additional classifiers are needed for classification tasks. Besides, the inference of the unsupervised DGP [18] is based on the Bayesian training framework with strong mean-field and Gaussian assumptions, which underestimates the variance and prevents the model from being applied to large-scale data. Our goal is to propose a general end-to-end multi-view DGP (MvDGP). We build a scalable model without forcing independence between layers, and apply stochastic variational inference and re-parameterization techniques to improve modeling on large-scale data.

In addition, we want the MvDGP model to train quickly. In the multi-view scenario, we tune the model according to the characteristics of each view, which inevitably introduces more model parameters and lengthens the training time. Pre-training is a widely used technique [4, 19], in which a large amount of data is used to train a network across multiple GPUs. The weights of the pre-trained network are used as the initial weights for new tasks, and then only a few steps of fine-tuning are needed to obtain predictions. To make the proposed model more competitive in training speed, we introduce a novel pre-training model for MvDGP. Instead of training the same model on other data sets, we train other models on the same data set. Because a neural network with infinite width has been proved to be exactly equivalent to a GP, and the training cost of a deep neural network (DNN) is much lower than that of a DGP [6, 7, 10], we pre-train a DNN with a structure similar to MvDGP to mimic the initial training process of MvDGP. Through the DNN pre-training, we aim to obtain a set of appropriate initial parameters for MvDGP. Since the parameter domains of the DNN and the MvDGP differ, the initial parameters of each layer in the MvDGP are obtained by auxiliary optimization of a single GP. Pre-training significantly improves the optimization efficiency of MvDGP.

There are three main contributions in our work:

Generalized Multi-view Deep Gaussian Process (MvDGP): We propose a generalized and flexible MvDGP, which considers characteristics of different views. Deep structure leads to more powerful abilities of uncertainty estimation and mapping representation compared with shallow models [8, 9]. Furthermore, MvDGP is an end-to-end supervised model, which can take advantage of labels to learn models, and provides stronger robustness and generalization performance than unsupervised multi-view DGP [18].
Scalability: We infer the MvDGP without imposing strong mean-field constraints and derive stochastic variational inference. Whereas the model in [18] can hardly be applied in large-scale scenarios, our model can. Meanwhile, our model can be extended to more views easily, and the depth of each view can be customized according to the characteristics of that view.
Efficiency: We obtain appropriate initial parameters for MvDGP by DNN pre-training, which reduces oscillation and speeds up training. Experiments demonstrate that the pre-trained MvDGP achieves higher performance and runs several times faster than methods without pre-training.

Deep Gaussian Process

Deep Gaussian process (DGP) is a stack of multiple GPs and has a more powerful modeling capability than a single GP [2]. We review the supervised version of a standard DGP as an example. We are given a training set consisting of observed inputs $X \in \mathbb{R}^{N \times Q}$ and observed outputs $Y \in \mathbb{R}^{N \times D}$, where N is the number of samples, and Q and D are the dimensionalities of the input and output vectors, respectively.

For a DGP with L layers of hidden units, we define $\{F^l\}$ as the set of latent variables, where $F^l$ is the output of layer $l$ and the input of layer $l+1$. Furthermore, we add sets of inducing inputs $\{Z^l\}$ and inducing points $\{U^l\}$ to employ variational inference [15]. The assumption of the model prior is as follows,

where $m(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the kernel function of each layer. The conditional distribution of each layer given its inducing points, together with the corresponding mean and variance, is denoted as follows,

The likelihood of the model is generally set to a Gaussian distribution,

where $\sigma^2$ is the variance of the observation $Y$. The joint density of the observed outputs $Y$, the latent variables $\{F^l\}$ and the inducing points $\{U^l\}$ is written as
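The conditional distribution of one GP layer given its inducing points can be sketched numerically as follows. This is a minimal illustration of the standard sparse GP conditional, not the paper's implementation: the squared-exponential kernel, the zero mean function, the jitter term, and the names `rbf_kernel` and `sparse_gp_conditional` are all assumptions made here for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sparse_gp_conditional(x, z, u, jitter=1e-6):
    """Mean and covariance of p(f | u) at inputs x.

    z holds the inducing inputs and u the inducing points (function values at z).
    A zero mean function is assumed for simplicity.
    """
    kzz = rbf_kernel(z, z) + jitter * np.eye(len(z))  # jitter for numerical stability
    kxz = rbf_kernel(x, z)
    kxx = rbf_kernel(x, x)
    a = np.linalg.solve(kzz, kxz.T).T                 # K_xz K_zz^{-1}
    mean = a @ u
    cov = kxx - a @ kxz.T
    return mean, cov

# Toy usage: three inducing inputs in one dimension.
z = np.array([[0.0], [2.0], [4.0]])
u = np.array([[1.0], [-1.0], [0.5]])
mean_at_z, cov_at_z = sparse_gp_conditional(z, z, u)
```

Evaluated at the inducing inputs themselves, the conditional mean reproduces the inducing points and the covariance collapses to nearly zero; far from every inducing input it reverts to the prior.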

Multi-view Deep Gaussian Process

Due to the characteristics of multi-view data, the general DGP cannot utilize the rich information in multiple views reasonably. In this section, we propose a new model named multi-view deep Gaussian process (MvDGP), and introduce stochastic variational inference for optimization.

Multi-view Deep Gaussian Process

We propose an end-to-end multi-view model and take two views of data as an example. In the given data, $X^1$ and $X^2$ are the observed inputs of the first and the second view, respectively, and $Y$ is the observed outputs. The data of each view are modeled by a deep structure. The latent variables of the intermediate layers are indexed by the view v and the layer l, and each view v has its own depth. The depths of the networks in different views can be determined according to the data characteristics of each view for better mapping. Inducing inputs and inducing points are introduced for each latent variable as in Sect. 2. In addition to the separated GP layers for each view, there are also common layers that share information across both views, with their own latent variables and model parameters. The graphical model of MvDGP is illustrated in Fig. 1, in which the depth of each view is marked.

Fig. 1. The graphical model for multi-view deep Gaussian process.

We introduce a transition layer from the separated views to the merged view, and the joint density of MvDGP is written as

where the input of the transition layer is the concatenation of the last layers of the two views, and the likelihood carries the corresponding unit variance. The joint distribution of the latent variables in view v is specifically

Each view has its own depth, and the layer-0 variables of the two views denote the observed inputs $X^1$ and $X^2$, respectively.

Variational Inference

Directly inferring MvDGP is computationally intractable, so we adopt stochastic variational inference for optimization. The main idea of variational inference is to find an approximate posterior distribution $q$ that is as close as possible to the true posterior $p$.

We adopt a factorized form for joint posterior distribution as

where each factor is the variational distribution of one view v or of the shared layers. The depths of the views are as defined in Sect. 3.1. We take a Gaussian form for the variational distribution of the inducing points of each layer l and view v, parameterized by its mean and variance. Under this setting, the variational posterior can be obtained analytically as

In order to maintain gradients and update layer-wise parameters during optimization, we introduce the re-parameterization trick and use the Monte Carlo method to estimate the variational posterior [12]. First, draw a noise term $\epsilon$ from a standard Gaussian distribution for each view v and layer l. Then, iteratively sample the latent variable of each layer, which can be written as

where the mean and covariance functions are those defined in (10) and (11).
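The two sampling steps described above can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes each layer exposes its per-sample marginal mean and variance through a callable, and the names `sample_layer` and `sample_through_layers` are hypothetical.

```python
import numpy as np

def sample_layer(mean, var, rng):
    """Re-parameterized draw: mean + eps * sqrt(var), with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + eps * np.sqrt(var)

def sample_through_layers(x, layer_moments, rng):
    """Propagate one Monte Carlo sample through a stack of layers.

    Each entry of layer_moments is a callable mapping the previous layer's
    sample to a (mean, variance) pair; it stands in for the variational
    posterior mean and covariance of a GP layer.
    """
    f = x
    for moments in layer_moments:
        mean, var = moments(f)
        f = sample_layer(mean, var, rng)
    return f

rng = np.random.default_rng(0)
# Toy layer: mean 2*f with unit variance (a stand-in, not a real GP layer).
toy_layer = lambda f: (2.0 * f, np.ones_like(f))
sample = sample_through_layers(np.ones((5, 1)), [toy_layer, toy_layer], rng)
```

Because the noise enters only through `eps`, the sample is a deterministic function of the layer parameters, which is what allows gradients to flow through every layer.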

Stochastic Optimization and Predictions

To minimize the KL divergence between q and p, we maximize the lower bound $\mathcal{L}$ of the log marginal likelihood $\log p(Y)$, which is formulated as

By substituting the joint density (7) and the posterior distribution (9) into the lower bound expression (13), the conditional prior terms that appear in both the numerator and the denominator cancel. The variational lower bound of the model evidence in MvDGP can be rearranged to

where Inline graphic represents , .
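For orientation, a bound obtained by this kind of cancellation typically has the shape used in doubly stochastic variational inference [13]: an expected log-likelihood term minus one KL term per set of inducing points. The expression below is a sketch in assumed notation, not the paper's own equation; $F^{\mathrm{out}}$ (output of the last shared layer) and $U^{v,l}$ (inducing points of layer $l$ in branch $v$, where $v$ ranges over the views and the shared layers) are labels chosen here.

```latex
% Assumed notation: F^{out} is the final-layer output, U^{v,l} the inducing
% points of layer l in branch v (each view and the shared layers).
\mathcal{L}
  = \mathbb{E}_{q(F^{\mathrm{out}})}\!\left[\log p\!\left(Y \mid F^{\mathrm{out}}\right)\right]
  - \sum_{v}\sum_{l} \mathrm{KL}\!\left(q\!\left(U^{v,l}\right) \,\middle\|\, p\!\left(U^{v,l}\right)\right)
```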

The expectation term in the variational lower bound can be written as a sum over samples as follows,

where $y_i$ is the observed output and the remaining arguments are the corresponding latent variables for sample i, $i = 1, \dots, N$. This additive form of the lower bound allows stochastic optimization to be employed in inference. The sum over a minibatch, suitably rescaled, is an unbiased estimate of the sum over all samples.

Model parameters are optimized with the Adam optimizer during training. They include the inducing inputs, the variational parameters of the inducing points, and the kernel parameters. Stochastic optimization and unbiased minibatch estimates ensure the scalability of MvDGP, so our model can be easily applied to large-scale data.

For predictions, we draw K samples of the output for the test inputs, starting from the test inputs at the first layer, and take the mean of these samples as the predicted outputs, where K is the number of samples. The samples are obtained by iterating the re-parameterized Monte Carlo sampling steps (12).
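The unbiased minibatch estimate can be sketched as follows. The function name `elbo_minibatch_estimate`, the placeholder per-sample values, and the scalar `kl_total` are assumptions for illustration; the batch size of 20 matches the experimental settings.

```python
import numpy as np

def elbo_minibatch_estimate(expected_log_lik, batch_idx, kl_total):
    """Estimate the lower bound from a minibatch.

    expected_log_lik[i] is the expected log-likelihood term of sample i.
    The minibatch sum is rescaled by N / |batch| so that its expectation over
    uniformly drawn batches equals the full sum; the KL term does not depend
    on the data and is subtracted once.
    """
    n = expected_log_lik.shape[0]
    scale = n / len(batch_idx)
    return scale * expected_log_lik[batch_idx].sum() - kl_total

# Toy check: averaging the estimate over a partition into batches of 20
# recovers the full-data value exactly.
terms = np.arange(100.0)                    # placeholder per-sample terms
batches = np.arange(100).reshape(5, 20)
estimates = [elbo_minibatch_estimate(terms, b, kl_total=3.0) for b in batches]
full_value = terms.sum() - 3.0
```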
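Averaging over samples can be sketched as follows, under the assumption that each of the K samples yields a Gaussian predictive mean and variance at the final layer, so that the prediction is an equally weighted mixture of K Gaussians. The name `mixture_moments` is hypothetical.

```python
import numpy as np

def mixture_moments(means, variances):
    """Mean and variance of an equally weighted mixture of K Gaussians.

    means and variances have shape (K, ...): one row per Monte Carlo sample.
    The variance follows from the law of total variance.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = means.mean(axis=0)
    var = (variances + means**2).mean(axis=0) - mean**2
    return mean, var

# Two samples for a single test point.
pred_mean, pred_var = mixture_moments([[0.0], [2.0]], [[1.0], [1.0]])
```

The mixture variance exceeds the average component variance whenever the sampled means disagree, which is how the spread across samples enters the predictive uncertainty.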

If the data contain more than two views, MvDGP can easily be generalized by adding a separate multi-layer GP structure for each new view.

Pre-training Technique for MvDGP

To better model the function approximation of each view, MvDGP introduces more latent variables and model parameters than single-view DGP. Training a model with so many parameters is slow even with doubly stochastic optimization. Because the initial parameters have a significant impact on training efficiency, a model with proper initial parameters trains faster than a randomly initialized one. We therefore introduce a novel pre-training technique for MvDGP: we train a computationally cheaper model to obtain a suitable set of initial parameters for MvDGP.

Deep neural network (DNN) is a type of powerful model for representation learning and model mapping. Inspired by the similar characteristics of DNN and DGP [7, 10], we adopt the DNN with a similar structure to MvDGP to simulate the initial training process of MvDGP. We model the DNN separately for two views and build common network layers whose inputs are the concatenation of the outputs of the separated networks. The number of parameters is related to the number and dimension of hidden units. The number of model parameters in the DNN we used is much smaller than MvDGP, which leads to faster training speed.

Since the parameters of the DNN, such as the network weights, cannot be used directly in MvDGP, we use several single-layer GPs as auxiliary pre-training models. We take the values of two adjacent layers in the DNN as the input and output of a single GP to obtain a set of initial parameters suitable for the corresponding layer in MvDGP. Since DNNs and single GPs are much easier to train than MvDGP, the pre-training step is fast and is a reasonable way to roughly select the initial parameters of MvDGP. Then, taking advantage of the powerful uncertainty estimation and robustness of MvDGP, we can perform more precise probabilistic learning on multi-view data. Stochastic optimization is adopted in training the DNN, the single GPs, and MvDGP, so that every stage scales to massive data.

The schematic diagram of pre-trained MvDGP (PreMvDGP) is depicted in Fig. 2. The basic MvDGP model is framed in orange lines. The gray node in the outermost circle represents the DNN with a structure similar to MvDGP, which forms the first stage of pre-training. The values of adjacent middle layers of the DNN are used as the observed inputs and observed outputs to train the parameters of each single GP. The yellow blocks in the second column from the left and the second column from the right are the single GPs, which form the second stage of pre-training. The training results of each GP are taken as the initial parameters of the corresponding layer in MvDGP. Finally, a precise mapping is learned through MvDGP.
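The hand-off between the two pre-training stages can be sketched as follows: activations of adjacent DNN layers become the (input, output) training pairs of the single GPs. This is an illustrative sketch only. The tanh activations, the layer sizes, the view depths, and the random matrices standing in for trained DNN weights are all assumptions, and the GP fitting itself is not shown.

```python
import numpy as np

def mlp_activations(x, weights):
    """Return [x, h1, h2, ...]: the input followed by every layer's output."""
    acts = [x]
    for w in weights:
        acts.append(np.tanh(acts[-1] @ w))
    return acts

def adjacent_pairs(acts):
    """Pair each layer's input with its output; one pair per single GP."""
    return list(zip(acts[:-1], acts[1:]))

rng = np.random.default_rng(0)
n, hidden = 50, 10
x1 = rng.standard_normal((n, 6))   # view 1 inputs (toy dimensions)
x2 = rng.standard_normal((n, 4))   # view 2 inputs (toy dimensions)

# Random stand-ins for the weights a pre-trained DNN would provide.
w_view1 = [rng.standard_normal((6, hidden)), rng.standard_normal((hidden, hidden))]
w_view2 = [rng.standard_normal((4, hidden))]
w_shared = [rng.standard_normal((2 * hidden, 1))]

acts1 = mlp_activations(x1, w_view1)
acts2 = mlp_activations(x2, w_view2)
merged = np.concatenate([acts1[-1], acts2[-1]], axis=1)  # shared-layer input
acts_shared = mlp_activations(merged, w_shared)

# One (input, output) pair per layer; each would train one auxiliary GP.
gp_training_pairs = (adjacent_pairs(acts1) + adjacent_pairs(acts2)
                     + adjacent_pairs(acts_shared))
```

In this toy configuration view 1 has two layers and view 2 has one, which mirrors the point that the depth of each view can differ.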

Experiments

In this section, we evaluate the performance of the proposed model on four real-world data sets. We assess both accuracy and training speed, and compare the results with state-of-the-art DGP models and a deep neural network.

Data Sets

WebKB University Data Set (WebKB). The WebKB data set¹ covers four universities, Cornell, Texas, Washington, and Wisconsin, with data captured from two views: words in web pages and hyperlinks. The web pages are divided into five categories; we take the category with the largest number of samples as the positive class and the rest as the negative class.
Multiple Feature Data Set (MFeat). The MFeat data set² contains 200 samples and six features for each handwritten digit (‘0’–‘9’). We adopt these features as six views. The data are divided into ten partitions denoted M-0 to M-9, in which partition M-i takes the samples labeled ‘i’ as the positive class and all others as the negative class.
Internet Advertisements Data Set (Ads). The Ads data set³ is composed of features extracted from five aspects. We treat the five features as five views of the data. A single label marks whether the sample is an ad.
Forest CoverType Data Set (CoverType). The data⁴ are composed of quantitative real variables and binary one-hot variables, which we model as two views. We use samples labeled Spruce-Fir or Lodgepole as positive samples to form two data sets, respectively (marked as partitions C-1 and C-2).

The total number of samples, the dimension of each view, and the number of samples in each class for the four data sets and their partitions are detailed in Table 1.
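The one-vs-rest partitions of MFeat can be sketched as follows. The ±1 label encoding and the helper name `one_vs_rest_labels` are assumptions made for illustration; only the class sizes (200 samples per digit) come from the description above.

```python
import numpy as np

def one_vs_rest_labels(digits, positive_digit):
    """Return +1 for samples of positive_digit and -1 for all other samples."""
    digits = np.asarray(digits)
    return np.where(digits == positive_digit, 1, -1)

# MFeat has 200 samples for each of the ten digits.
digits = np.repeat(np.arange(10), 200)
labels_m3 = one_vs_rest_labels(digits, 3)   # partition M-3
```

Each partition therefore has 200 positive and 1800 negative samples.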

Table 1.

Detailed data set information.

Experimental Settings

We conduct a series of experiments on four data sets to verify the performance of our PreMvDGP model. For each experiment, we use 5-fold cross-validation, so that 80% of the samples form the training set and 20% form the test set. We repeat the experiment ten times for each sample partition and report the average as the final result. We use minibatches of 20 samples and 128 inducing points for every layer. The number of hidden layers of each view and of the shared layers can be customized according to the characteristics of each view. To illustrate the general characteristics of PreMvDGP, we report results for a single fixed configuration of these depths.

To demonstrate the performance of our model, we compare it with two state-of-the-art DGP methods, doubly stochastic variational inference DGP (DSVI-DGP) [13] and stochastic gradient Hamiltonian Monte Carlo DGP (SGHMC-DGP) [3], and with a deep neural network (DNN) designed to handle multi-view data in these experiments. Since the single-view DGP methods cannot use multi-view data directly, we separately take the data of view 1 (V1), the data of view 2 (V2), and the concatenation of the two views (Con) as three types of inputs for the WebKB data set to verify the necessity of multi-view modeling.

Experiments on each single view alone are redundant and incomplete, so for the other three data sets we concatenate the data from all views as the inputs to make the most of the data. For the single-view DGP methods, we abbreviate these variants as DSVI-DGP-Con and SGHMC-DGP-Con. To ensure adequate training and convergence, we train both DSVI-DGP and SGHMC-DGP for 500 epochs. In the pre-training phase of PreMvDGP, we set the number of hidden units to 64 and the dimension of hidden units to 10 to obtain a rough set of parameters as quickly as possible. Meanwhile, we use 300 epochs for DNN pre-training, 100 epochs for training the single GPs, and 100 epochs for training MvDGP. In practice, these settings ensure that each stage is fully trained. All parameter settings remain fixed across data sets and comparison methods.
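The 80%/20% split produced by 5-fold cross-validation can be sketched as follows. The shuffling seed and the function name `five_fold_splits` are assumptions; the paper does not state how folds were drawn.

```python
import numpy as np

def five_fold_splits(n_samples, seed=0, n_folds=5):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set once."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

splits = list(five_fold_splits(100))
```

With 100 samples, every split holds 80 training indices and 20 test indices, and the five test sets together cover each sample exactly once.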

Results and Analysis

The experimental results on the four WebKB data sets, including average classification accuracies, standard deviations, and computational costs, are presented in Table 2. The results show that the representation using only view 1 is significantly better than the representation using only view 2 on this data set. Concatenating the data from the two views (Con) does not significantly improve accuracy over using view 1 alone (V1): in Table 2, (Con) performs better than (V1) for DSVI-DGP, while (V1) has a slight advantage over (Con) for SGHMC-DGP. Concatenation also increases the input dimension, making training more expensive. The experiments show that PreMvDGP achieves better classification performance than the comparison methods, indicating that single-view methods cannot properly model the data characteristics of different views.

Table 2.

The average classification accuracies (%), standard deviations, and computational time (s) of comparison methods and PreMvDGP on the WebKB data sets.

Model	Dataset
	Cornell		Texas		Washington		Wisconsin
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Accuracy	Time
DSVI-DGP-V1		306		297		345		414
DSVI-DGP-V2		46		40		55		68
DSVI-DGP-Con		455		434		523		563
SGHMC-DGP-V1		1597		1572		1655		1686
SGHMC-DGP-V2		206		197		228		273
SGHMC-DGP-Con		1846		1821		1940		1948
DNN		61		47		67		81
PreMvDGP		141		131		172		200

Since the DNN is used as the initializer in our model, we also list the average time required for 300 iterations of the DNN and the average classification accuracy obtained using only the DNN. The pre-training accounts for a small part of the total computational time, and the training results of the DNN are suitable for initializing MvDGP. With appropriate initial parameters, PreMvDGP trains faster and learns the function approximation more finely than the DNN alone, resulting in more competitive results.

We model the data from six and five views separately for the MFeat and Ads data sets, which shows that our approach generalizes easily to more views instead of relying on combinations of any two views. The experimental results, including accuracies and computational time, on the other three data sets are shown in Table 3. Our method achieves nearly the best accuracy and the shortest running time among the DGP methods on all data sets and partitions, which indicates that modeling the data of different views separately is necessary and that the pre-training technique plays an important role in choosing the initial parameters. Notably, PreMvDGP also works well on the large forest CoverType data set. Stochastic optimization and inducing points reduce the computational overhead of our model. The experiments show that our method is appropriate for multi-view scenarios with large-scale data.

Table 3.

The average classification accuracies (%), standard deviations, and computational time (s) on multiple data sets and partitions, i.e., MFeat (M-0 to M-9), Ads, CoverType (C-1, C-2).

Data set	DSVI-DGP-Con		SGHMC-DGP-Con		DNN		PreMvDGP
Data set	Accuracy	Time	Accuracy	Time	Accuracy	Time	Accuracy	Time
M-0	99.43 ± 0.17	3541	90.23 ± 3.33	4091	99.36 ± 0.19	513		1273
M-1	99.05 ± 0.39	3527	83.43 ± 9.10	4120	98.81 ± 0.23	489		1164
M-2	98.75 ± 1.92	3607	89.01 ± 4.05	4230	99.52 ± 0.08	584		1261
M-3	98.05 ± 1.71	3606	86.05 ± 2.45	4326		552	99.20 ± 0.35	1240
M-4	99.37 ± 0.14	3476	90.82 ± 5.70	4560	99.81 ± 0.08	584		1238
M-5	98.25 ± 1.75	3627	87.45 ± 1.78	4765	98.62 ± 0.21	586		1256
M-6	99.20 ± 0.25	3511	86.68 ± 2.27	4764	99.47 ± 0.11	590		1328
M-7	99.97 ± 0.03	3499	85.80 ± 4.94	4771	99.60 ± 0.23	584		1238
M-8	99.55 ± 0.29	3570	88.00 ± 1.64	4292	99.40 ± 0.28	538		1272
M-9	99.37 ± 0.24	3481	87.20 ± 3.09	4385	99.35 ± 0.15	529		1264
Ads	95.15 ± 0.33	3352	94.75 ± 0.26	3154	95.87 ± 0.23	458		1290
C-1	63.95 ± 1.01	21474	63.93 ± 1.26	9764	78.57 ± 0.41	3071		7360
C-2	59.88 ± 3.49	21672	57.51 ± 7.79	9771	76.61 ± 0.97	3164		7411

Conclusions

In this paper, we propose an end-to-end multi-view deep Gaussian process (MvDGP) model, which is suitable for modeling multi-view data. The inference is based on doubly stochastic optimization and can be applied in large-scale data scenarios. To speed up training, we introduce a pre-training deep neural network in MvDGP. The initial parameters obtained by pre-training are well suited to MvDGP, which then performs more precise learning. Experimental results demonstrate that pre-trained MvDGP (PreMvDGP) outperforms state-of-the-art DGP methods in multi-view data modeling and trains faster. Our work generalizes DGP to multi-view scenarios, and its computational performance helps to extend MvDGP to large-scale data.

Acknowledgments

The corresponding author Jing Zhao acknowledges support from the National Natural Science Foundation of China under Project 61673179, the Shanghai Knowledge Service Platform Project (No. ZF1213) and the Shanghai Sailing Program 17YF1404600.

Footnotes

1. WebKB data set is available at http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/.

2. Multiple feature data set is available at https://archive.ics.uci.edu/ml/datasets.php.

3. Ads data set is available at http://archive.ics.uci.edu/ml/datasets.php.

4. CoverType data set is available at http://archive.ics.uci.edu/ml/datasets.php.

Contributor Information

Hady W. Lauw, Email: hadywlauw@smu.edu.sg

Raymond Chi-Wing Wong, Email: raywong@cse.ust.hk.

Alexandros Ntoulas, Email: antoulas@di.uoa.gr.

Ee-Peng Lim, Email: eplim@smu.edu.sg.

See-Kiong Ng, Email: seekiong@nus.edu.sg.

Sinno Jialin Pan, Email: sinnopan@ntu.edu.sg.

Han Zhu, Email: zhuhanchn@gmail.com.

Jing Zhao, Email: jzhao@cs.ecnu.edu.cn.

Shiliang Sun, Email: slsun@cs.ecnu.edu.cn.

References

1. Dai, Z., Damianou, A., González, J., Lawrence, N.: Variational auto-encoded deep Gaussian processes. arXiv preprint arXiv:1511.06455 (2015)
2. Damianou, A., Lawrence, N.: Deep Gaussian processes. In: Artificial Intelligence and Statistics, pp. 207–215 (2013)
3. Havasi, M., Hernández-Lobato, J.M., Murillo-Fuentes, J.J.: Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. In: Advances in Neural Information Processing Systems, pp. 7506–7516 (2018)
4. Hinton, G.E., Salakhutdinov, R.R.: A better way to pretrain deep Boltzmann machines. In: Advances in Neural Information Processing Systems, pp. 2447–2455 (2012)
5. Ko, J., Fox, D.: GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Auton. Robots 27, 75–90 (2009). doi: 10.1007/s10514-009-9119-x
6. Koriyama, T., Kobayashi, T.: A training method using DNN-guided layerwise pretraining for deep Gaussian processes. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2787–2791 (2019)
7. Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., Sohl-Dickstein, J.: Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165 (2017)
8. Liu, Q., Sun, S.: Multi-view regularized Gaussian processes. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 655–667 (2017)
9. Liu, Q., Sun, S.: Sparse multimodal Gaussian processes. In: International Conference on Intelligent Science and Big Data Engineering, pp. 28–40 (2017)
10. Matthews, A.G.d.G., Rowland, M., Hron, J., Turner, R.E., Ghahramani, Z.: Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271 (2018)
11. Rasmussen, C.E., Nickisch, H.: Gaussian processes for machine learning toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010)
12. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 (2014)
13. Salimbeni, H., Deisenroth, M.: Doubly stochastic variational inference for deep Gaussian processes. In: Advances in Neural Information Processing Systems, pp. 4588–4599 (2017)
14. Salimbeni, H., Dutordoir, V., Hensman, J., Deisenroth, M.P.: Deep Gaussian processes with importance-weighted variational inference. arXiv preprint arXiv:1905.05435 (2019)
15. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems, pp. 1257–1264 (2006)
16. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
17. Sun, S.: A survey of multi-view machine learning. Neural Comput. Appl. 23, 2031–2038 (2013). doi: 10.1007/s00521-013-1362-6
18. Sun, S., Liu, Q.: Multi-view deep Gaussian processes. In: International Conference on Neural Information Processing, pp. 130–139 (2018)
19. Yu, D., Deng, L., Seide, F.T.B., Li, G.: Discriminative pretraining of deep neural networks (2016)
20. Zhao, J., Xie, X., Xu, X., Sun, S.: Multi-view learning overview: recent progress and new challenges. Inf. Fusion 38, 43–54 (2017). doi: 10.1016/j.inffus.2017.02.007
