Table 1.
Comparison of attention mechanism modeling methods.
| Ref. | Attention name | Method | Comment |
|---|---|---|---|
| [69] | Soft attention | Assigns a probability to every position of the input sentence when computing the attention distribution used to form the context vector | Parameterized; differentiable; deterministic |
| [69] | Hard attention | Attends to a single, stochastically chosen location; the gradient is estimated with Monte Carlo sampling | Stochastic; samples according to the attention probabilities; simple |
| [70] | Multi-head attention | Linearly projects the queries, keys, and values multiple times and computes attention over each projection in parallel | Linear projections; parallel computation; attends to information from different representation subspaces at different positions; multiple attention heads |
| [70] | Scaled dot-product attention | Computes a single attention function from query, key, and value matrices, scaling the dot products before the softmax | Fast; space-efficient (see the code sketch after the table) |
| [71] | Global attention | Considers the hidden states of all encoder steps; the attention weight distribution is obtained by comparing the current decoder hidden state with each encoder hidden state | Comprehensive; time-consuming; computationally expensive |
| [71] | Local attention | First predicts an aligned position, then computes attention weights within a window around that position, and finally forms the weighted context vector | Reduces computational cost |
| [75] | Adaptive attention | Defines a new adaptive context vector modeled as a mixture of the spatially attended image features and a visual sentinel vector, trading off how much new information the network takes from the image against what the decoder memory already knows | Decides when and where to attend so that meaningful information is extracted for each generated word |
| [76] | Semantic attention | Selects semantic concepts and incorporates them into the hidden state and output of the LSTM | Selective; merges top-down and bottom-up information |
| [77] | Spatial and channel-wise attention | Selects semantic attributes according to the needs of the sentence context | Multiple semantics; intended to overcome the limitations of general attention |
| [4] | Areas of attention | Models the dependencies between image regions, caption words, and the state of the RNN language model | Models interactions; comprehensive |
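
To make the scaled dot-product formulation of [70] concrete, a minimal NumPy sketch for a single head is given below. The function name, array shapes, and toy data are illustrative assumptions, not drawn from any particular implementation in the cited works.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values with shape (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled so the softmax
    # stays in a well-conditioned range for large d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax gives the attention weight distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values is the context for each query.
    return weights @ V

# Toy usage (hypothetical sizes): 4 queries over 6 key/value pairs of width 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Multi-head attention, as compared in the table, repeats this computation over several independently projected query, key, and value matrices and concatenates the per-head outputs.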