Table 4.
Block | Unit | Layer | Output Size
---|---|---|---
Input | 0 | Facial Image Feature | 512 × 48 (frames)
Input | 1 | Facial Landmark Feature | 256 × 48 (frames)
Temporal Attention Module | 2 | Concatenate (0 + 1) | 768 × 48 (frames)
Temporal Attention Module | 3 | Average (48 frames) | 768
Temporal Attention Module | 4 | Concatenate (2 + 3) | 1536 × 48 (frames)
Temporal Attention Module | 5 | Fully Connected | 1536 × 48 (frames)
Temporal Attention Module | | Fully Connected | 1536 × 48 (frames)
Temporal Attention Module | | Fully Connected | 1 × 48 (frames)
Temporal Attention Module | 6 | Multiplication (2 · 5) | 768 × 48 (frames)
Temporal Attention Module | 7 | Average (48 frames) | 768
Output | 8 | Fully Connected | 3 or 4
In Unit 4, the outputs of Units 2 and 3 are concatenated for each frame. In Unit 6, the outputs of Units 2 and 5 are multiplied for each frame.
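
The data flow in Table 4 can be sketched in NumPy as a shape check. This is a minimal illustration under stated assumptions, not the authors' implementation: the table does not specify activation functions, so the ReLU on the first two fully connected layers and the sigmoid on the per-frame weight are assumptions, as are the random weight initialization and the hypothetical names `init_params` and `temporal_attention`. Arrays are stored frames-first, i.e. (48, features), which is the transpose of the "features × 48" sizes listed in the table.

```python
import numpy as np

N_FRAMES, D_IMG, D_LMK = 48, 512, 256
D_CAT = D_IMG + D_LMK  # 768


def init_params(rng, n_classes=4, scale=0.01):
    """Random weights with the shapes implied by Table 4 (illustrative only)."""
    return {
        "W1": rng.normal(0, scale, (2 * D_CAT, 2 * D_CAT)), "b1": np.zeros(2 * D_CAT),
        "W2": rng.normal(0, scale, (2 * D_CAT, 2 * D_CAT)), "b2": np.zeros(2 * D_CAT),
        "W3": rng.normal(0, scale, (2 * D_CAT, 1)), "b3": np.zeros(1),
        "W_out": rng.normal(0, scale, (D_CAT, n_classes)), "b_out": np.zeros(n_classes),
    }


def temporal_attention(img_feat, lmk_feat, params):
    """img_feat: (48, 512), lmk_feat: (48, 256). Returns (logits, frame weights)."""
    x = np.concatenate([img_feat, lmk_feat], axis=1)      # Unit 2: (48, 768)
    mean = x.mean(axis=0)                                 # Unit 3: (768,)
    tiled = np.broadcast_to(mean, x.shape)                # same average for every frame
    h = np.concatenate([x, tiled], axis=1)                # Unit 4: (48, 1536)

    # Unit 5: three fully connected layers applied to each frame.
    h = np.maximum(h @ params["W1"] + params["b1"], 0.0)  # (48, 1536), ReLU assumed
    h = np.maximum(h @ params["W2"] + params["b2"], 0.0)  # (48, 1536), ReLU assumed
    score = h @ params["W3"] + params["b3"]               # (48, 1)
    weight = 1.0 / (1.0 + np.exp(-score))                 # sigmoid assumed

    weighted = x * weight                                 # Unit 6: (48, 768), broadcast per frame
    pooled = weighted.mean(axis=0)                        # Unit 7: (768,)
    logits = pooled @ params["W_out"] + params["b_out"]   # Unit 8: (n_classes,)
    return logits, weight
```

Calling `temporal_attention` with random (48, 512) and (48, 256) inputs returns a logit vector of length 3 or 4, depending on `n_classes`, together with one scalar weight per frame. That matches the 1 × 48 output of Unit 5 and the per-frame multiplication in Unit 6.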