Table 1.
Main differences among deep learning models
| Model | Features | Advantages | Disadvantages | Main variants |
|---|---|---|---|---|
| DNN | a. Can be regarded as a neural network with many fully connected hidden layers; b. Uses forward propagation and back propagation to tune parameters. | a. Applicable to a wide range of tasks; b. Simple network structure. | a. The number of parameters inflates easily; b. Requires a large amount of computation. | DNN-HMM, DNN-CTC |
| RBM | a. A stochastic generative neural network that learns a probability distribution over its input data; b. The building block of the DBN. | a. Flexible and efficient computation; b. Inference is easy. | a. Only suited to binary-valued data; b. The model is relatively simple, so its expressive power is limited. | Conditional RBM, Point-wise Gated RBM, Temporal RBM |
| DBN | a. Can serve as either a generative or a discriminative model; b. Supports both unsupervised and supervised learning. | a. As a generative model, it learns the joint probability distribution, so it can represent the data distribution statistically and reflect the similarity of similar data; b. The generative model can recover the conditional probability distribution, making it equivalent to a discriminative model. | a. For classification, the generative model's accuracy is lower than that of a discriminative model; b. Because it learns the joint distribution of the data, the learning problem is more complex; c. Labeled input data are required for training. | Convolutional DBN |
| CNN | a. Effectively reduces large images to lower-dimensional data without affecting the results; b. Retains the characteristics of the image, similar to the principle of human vision. | a. The weight-sharing strategy reduces the number of parameters to be trained, giving the trained model stronger generalization ability; b. Pooling reduces the spatial resolution of the network, providing a degree of translation invariance to the input data. | Deep models are prone to vanishing gradients. | GoogLeNet, VGG, Deep Residual Learning |
| RNN | a. Long-term information can be retained effectively; b. Important information is selected to be kept, while less important information is "forgotten". | The model is deep in the time dimension and can model sequential content. | a. Many parameters must be trained, making the model prone to vanishing or exploding gradients; b. No feature-learning ability. | LSTM, GRU |
| AE | a. Data-dependent; b. Learns automatically from data samples. | a. Strong generalization; b. Can be used for dimensionality reduction; c. Can be used as a feature detector; d. Can be used as a generative model. | a. Some information is lost during encoding and decoding; b. The compression ability applies only to samples similar to the training samples. | Denoising AE, Stacked AE, Undercomplete AE, Regularized AE |
| Transformer | a. Self-attention mechanism (see the sketch following this table); b. Attends to global information. | a. Can model longer-distance dependencies; b. Supports parallel computation. | a. High implementation complexity; b. Not Turing complete; c. Computational resource utilization is only average. | Linear Transformer, Sparse Transformer, Reformer, Set Transformer, Transformer-XL |
| DRL | a. Combines deep learning with reinforcement learning; b. End-to-end training. | a. Learns control policies directly from high-dimensional raw data; b. Can generate large numbers of samples for supervised learning. | a. Continuous motion control is difficult to achieve; b. Overestimation, i.e., the estimated value function is larger than the true value function. | QR-DQN, Rainbow DQN |
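
To make the Transformer entry in the table more concrete, the following minimal sketch (assuming PyTorch; the class name, dimensions, and variable names are illustrative and not taken from any of the surveyed works) shows a single-head scaled dot-product self-attention layer. Because every position attends to every other position in one batched matrix product, the layer captures global, long-distance dependencies and parallelizes over the sequence, which corresponds to the advantages listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections producing queries, keys, and values from the same input.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Every position attends to every other position, so the layer can
        # model global, long-distance dependencies in a single step.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        weights = F.softmax(scores, dim=-1)
        # The whole sequence is processed in one batched matrix product,
        # which is why the computation parallelizes over positions.
        return torch.matmul(weights, v)


# Usage (hypothetical sizes): a batch of 2 sequences, 10 tokens each, embedding size 16.
x = torch.randn(2, 10, 16)
attn = SelfAttention(d_model=16)
out = attn(x)
print(out.shape)  # torch.Size([2, 10, 16])
```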