Table 1.
Main differences among deep learning models
| Model | Features | Advantages | Disadvantages | Main variants |
|---|---|---|---|---|
| DNN | a. Can be regarded as a neural network with many fully connected hidden layers; b. Uses forward propagation and back propagation to tune parameters. | a. Applicable to a wide range of tasks; b. Simple network structure. | a. The number of parameters inflates easily; b. Requires a large amount of computation. | DNN-HMM, DNN-CTC |
| RBM | a. A stochastic generative neural network that learns a probability distribution over its input data; b. The building block of the DBN. | a. Flexible and efficient computation; b. Inference is easy. | a. Only suited to binary-valued data; b. The model is relatively simple, so its expressive power is limited. | Conditional RBM, Point-wise Gated RBM, Temporal RBM |
| DBN | a. Can serve as either a generative or a discriminative model; b. Supports both unsupervised and supervised learning. | a. As a generative model, it learns the joint probability distribution, so it can represent the data distribution statistically and reflect the similarity of similar data; b. The generative model can recover the conditional probability distribution, making it equivalent to a discriminative model. | a. For classification, the generative model's accuracy is lower than that of a discriminative model; b. Because it learns the joint distribution of the data, the learning problem is more complex; c. Labeled input data are required for training. | Convolutional DBN |
| CNN | a. Effectively reduces large images to lower-dimensional data without affecting the results; b. Retains the characteristics of the image, similar to the principle of human vision. | a. The weight-sharing strategy reduces the number of parameters to be trained, giving the trained model stronger generalization ability; b. Pooling reduces the spatial resolution of the network, providing a degree of translation invariance to the input data. | Deep models are prone to vanishing gradients. | GoogLeNet, VGG, Deep Residual Learning |
| RNN | a. Long-term information can be retained effectively; b. Important information is selected to be kept, while less important information is "forgotten". | The model is deep in the time dimension and can model sequential content. | a. Many parameters must be trained, making the model prone to vanishing or exploding gradients; b. No feature-learning ability. | LSTM, GRU |
| AE | a. Data-dependent; b. Learns automatically from data samples. | a. Strong generalization; b. Can be used for dimensionality reduction; c. Can be used as a feature detector; d. Can be used as a generative model. | a. Some information is lost during encoding and decoding; b. The compression ability applies only to samples similar to the training samples. | Denoising AE, Stacked AE, Undercomplete AE, Regularized AE |
| Transformer | a. Self-attention mechanism (see the sketch following this table); b. Attends to global information. | a. Can model longer-distance dependencies; b. Supports parallel computation. | a. High implementation complexity; b. Not Turing complete; c. Computational resource utilization is only average. | Linear Transformer, Sparse Transformer, Reformer, Set Transformer, Transformer-XL |
| DRL | a. Combines deep learning with reinforcement learning; b. End-to-end training. | a. Learns control policies directly from high-dimensional raw data; b. Can generate large numbers of samples for supervised learning. | a. Continuous motion control is difficult to achieve; b. Overestimation, i.e., the estimated value function is larger than the true value function. | QR-DQN, Rainbow DQN |
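
To make the Transformer entry in the table more concrete, the following minimal sketch (assuming PyTorch; the class name, dimensions, and variable names are illustrative and not taken from any of the surveyed works) shows a single-head scaled dot-product self-attention layer. Because every position attends to every other position in one batched matrix product, the layer captures global, long-distance dependencies and parallelizes over the sequence, which corresponds to the advantages listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections producing queries, keys, and values from the same input.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Every position attends to every other position, so the layer can
        # model global, long-distance dependencies in a single step.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        weights = F.softmax(scores, dim=-1)
        # The whole sequence is processed in one batched matrix product,
        # which is why the computation parallelizes over positions.
        return torch.matmul(weights, v)


# Usage (hypothetical sizes): a batch of 2 sequences, 10 tokens each, embedding size 16.
x = torch.randn(2, 10, 16)
attn = SelfAttention(d_model=16)
out = attn(x)
print(out.shape)  # torch.Size([2, 10, 16])
```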