J Imaging. 2023 Nov 15;9(11):248. doi: 10.3390/jimaging9110248

Table 6.

Word recognition accuracy (%) comparison with the baseline Transformer-decoder-based models. FT: fine-tuning on real data. Size: parameters in millions. Tr. Dec.: Transformer decoder. M: the proposed method. Bold: highest.

(a) Methods Trained on Synthetic Training Data (S).

| Method | Size | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| DeiT-S + Tr. Dec. | 26.1 | 93.7 | 88.9 | 92.4 | 80.0 | 80.6 | 86.8 | 88.3 |
| DeiT-M + Tr. Dec. | 46.2 | 94.1 | 89.6 | 92.6 | 81.5 | 82.8 | 83.6 | 89.0 |
| DeiT-B + Tr. Dec. | 103.4 | 94.8 | 90.3 | 92.9 | 81.0 | 85.1 | 87.5 | 89.6 |
| CaiT-S + Tr. Dec. | 50.9 | 94.9 | 90.3 | 94.2 | 81.3 | 83.4 | 89.9 | 89.9 |
| DeiT-S + M (Ours) | 21.6 | 91.4 | 85.5 | 91.3 | 75.3 | 76.7 | 82.2 | 85.3 |
| DeiT-M + M (Ours) | 38.9 | 92.5 | 87.8 | 92.2 | 76.6 | 79.5 | 81.9 | 86.6 |
| DeiT-B + M (Ours) | 85.7 | 93.0 | 86.9 | 92.2 | 78.6 | 79.1 | 84.0 | 87.3 |
| CaiT-S + M (Ours) | 46.5 | 93.5 | 86.9 | 91.9 | 77.6 | 77.8 | 85.4 | 87.2 |

(b) Methods Trained on Real Labeled Training Data (R).

| Method | Size | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| DeiT-S + Tr. Dec. + FT | 26.1 | 96.8 | 93.0 | 96.7 | 86.3 | 87.8 | 94.8 | 93.0 |
| DeiT-M + Tr. Dec. + FT | 46.2 | 97.0 | 94.0 | 97.1 | 86.3 | 89.3 | 95.1 | 93.4 |
| DeiT-B + Tr. Dec. + FT | 103.4 | 98.0 | 94.6 | 97.5 | 86.9 | 90.5 | 95.1 | 94.2 |
| CaiT-S + Tr. Dec. + FT | 50.9 | 97.4 | 94.9 | 97.1 | 86.5 | 89.5 | 95.8 | 93.7 |
| DeiT-S + M + FT (Ours) | 21.6 | 94.6 | 89.2 | 95.4 | 81.5 | 83.1 | 91.3 | 89.9 |
| DeiT-M + M + FT (Ours) | 38.9 | 95.0 | 92.3 | 95.2 | 83.5 | 84.0 | 90.9 | 90.9 |
| DeiT-B + M + FT (Ours) | 85.7 | 95.9 | 92.6 | 96.1 | 84.4 | 84.3 | 92.7 | 91.7 |
| CaiT-S + M + FT (Ours) | 46.5 | 96.1 | 90.6 | 95.4 | 84.9 | 85.4 | 92.7 | 91.7 |
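
The size and accuracy trade-off in Table 6(b) can be read off with a little arithmetic. The sketch below is illustrative only: the dictionaries copy the Size and Total columns of the fine-tuned rows, and the names `baseline`, `proposed`, and `tradeoff` are hypothetical helpers introduced here, not part of the paper.

```python
# (Size in millions of parameters, Total accuracy in %) copied from
# Table 6(b), fine-tuned rows. Names are illustrative, not from the paper.
baseline = {  # backbone + Transformer decoder + FT
    "DeiT-S": (26.1, 93.0),
    "DeiT-M": (46.2, 93.4),
    "DeiT-B": (103.4, 94.2),
    "CaiT-S": (50.9, 93.7),
}
proposed = {  # backbone + M + FT (Ours)
    "DeiT-S": (21.6, 89.9),
    "DeiT-M": (38.9, 90.9),
    "DeiT-B": (85.7, 91.7),
    "CaiT-S": (46.5, 91.7),
}


def tradeoff(backbone):
    """Return (parameter reduction in %, Total accuracy drop in points)."""
    base_size, base_acc = baseline[backbone]
    prop_size, prop_acc = proposed[backbone]
    size_reduction = 100.0 * (base_size - prop_size) / base_size
    acc_drop = base_acc - prop_acc
    return round(size_reduction, 1), round(acc_drop, 1)


for name in baseline:
    reduction, drop = tradeoff(name)
    print(f"{name}: {reduction}% fewer parameters, "
          f"{drop} points lower Total accuracy")
```

For DeiT-S, for instance, this gives 17.2% fewer parameters at a cost of 3.1 points of Total accuracy. The same function can be pointed at the Table 6(a) rows by swapping in the synthetic-data values.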