Figure 2.
The proposed marginalization-based method: A 2D feature sequence, , is produced by a 2D feature extractor such as a ViT backbone. is fed to a linear layer to produce from which a softmax normalization is performed over both and C dimensions. Next, the normalized is marginalized over the dimension to produce that is fed to a CTC decoder. D and C are the feature and class dimensions, respectively.