2023 Nov 15;9(11):248. doi: 10.3390/jimaging9110248

Figure 2.

The proposed marginalization-based method. A 2D feature sequence, $F = (F_{1,1}, \ldots, F_{H,W})$, is produced by a 2D feature extractor such as a ViT backbone. $F$ is fed to a linear layer to produce $S = (S_{1,1}, \ldots, S_{H,W})$, which is then softmax-normalized jointly over the $H$ and $C$ dimensions. The normalized $U = (U_{1,1}, \ldots, U_{H,W})$ is marginalized over the $H$ dimension to produce $P = (P_1, \ldots, P_W)$, which is fed to a CTC decoder. $D$ and $C$ are the feature and class dimensions, respectively.
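The pipeline in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `marginalize_over_height` is hypothetical, random arrays stand in for the backbone features and the trained linear layer, and the joint softmax is assumed to be taken separately for each column position w over all (h, c) pairs. The CTC decoder itself is not included.

```python
import numpy as np

def marginalize_over_height(F, weight, bias):
    """Hypothetical helper sketching the caption's steps.

    F:      (H, W, D) 2D feature sequence from a feature extractor
    weight: (D, C)    linear-layer weights (assumed shape)
    bias:   (C,)      linear-layer bias
    Returns P with shape (W, C): one class distribution per column.
    """
    # Linear layer applied at every (h, w) position: S has shape (H, W, C).
    S = F @ weight + bias

    # Joint softmax over the H and C axes for each column w
    # (assumption: one normalization per column).
    S = S - S.max(axis=(0, 2), keepdims=True)  # numerical stability
    exp_S = np.exp(S)
    U = exp_S / exp_S.sum(axis=(0, 2), keepdims=True)

    # Marginalize over the H axis: P[w, c] = sum_h U[h, w, c].
    P = U.sum(axis=0)
    return P

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
H, W, D, C = 4, 6, 8, 5
F = rng.normal(size=(H, W, D))
weight = rng.normal(size=(D, C))
bias = np.zeros(C)

P = marginalize_over_height(F, weight, bias)
```

Because the softmax is normalized over both H and C, summing over H leaves each row of `P` as a valid probability distribution over the C classes, which is the per-frame input format a CTC decoder expects.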