Table 6.
Word recognition accuracy (%) of the proposed method compared with the baseline Transformer-decoder-based models. FT: fine-tuning on real data. Size: number of parameters in millions. Tr. Dec.: Transformer decoder. M: the proposed method. Bold: highest.
(a) Methods Trained on Synthetic Training Data (S).

Method | Size | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Total |
---|---|---|---|---|---|---|---|---|
DeiT-S + Tr. Dec. | 26.1 | 93.7 | 88.9 | 92.4 | 80.0 | 80.6 | 86.8 | 88.3 |
DeiT-M + Tr. Dec. | 46.2 | 94.1 | 89.6 | 92.6 | 81.5 | 82.8 | 83.6 | 89.0 |
DeiT-B + Tr. Dec. | 103.4 | 94.8 | 90.3 | 92.9 | 81.0 | 85.1 | 87.5 | 89.6 |
CaiT-S + Tr. Dec. | 50.9 | 94.9 | 90.3 | 94.2 | 81.3 | 83.4 | 89.9 | 89.9 |
DeiT-S + M (Ours) | 21.6 | 91.4 | 85.5 | 91.3 | 75.3 | 76.7 | 82.2 | 85.3 |
DeiT-M + M (Ours) | 38.9 | 92.5 | 87.8 | 92.2 | 76.6 | 79.5 | 81.9 | 86.6 |
DeiT-B + M (Ours) | 85.7 | 93.0 | 86.9 | 92.2 | 78.6 | 79.1 | 84.0 | 87.3 |
CaiT-S + M (Ours) | 46.5 | 93.5 | 86.9 | 91.9 | 77.6 | 77.8 | 85.4 | 87.2 |

(b) Methods Trained on Real Labeled Training Data (R).

Method | Size | IIIT | SVT | IC13 | IC15 | SVTP | CUTE | Total |
---|---|---|---|---|---|---|---|---|
DeiT-S + Tr. Dec. + FT | 26.1 | 96.8 | 93.0 | 96.7 | 86.3 | 87.8 | 94.8 | 93.0 |
DeiT-M + Tr. Dec. + FT | 46.2 | 97.0 | 94.0 | 97.1 | 86.3 | 89.3 | 95.1 | 93.4 |
DeiT-B + Tr. Dec. + FT | 103.4 | 98.0 | 94.6 | 97.5 | 86.9 | 90.5 | 95.1 | 94.2 |
CaiT-S + Tr. Dec. + FT | 50.9 | 97.4 | 94.9 | 97.1 | 86.5 | 89.5 | 95.8 | 93.7 |
DeiT-S + M + FT (Ours) | 21.6 | 94.6 | 89.2 | 95.4 | 81.5 | 83.1 | 91.3 | 89.9 |
DeiT-M + M + FT (Ours) | 38.9 | 95.0 | 92.3 | 95.2 | 83.5 | 84.0 | 90.9 | 90.9 |
DeiT-B + M + FT (Ours) | 85.7 | 95.9 | 92.6 | 96.1 | 84.4 | 84.3 | 92.7 | 91.7 |
CaiT-S + M + FT (Ours) | 46.5 | 96.1 | 90.6 | 95.4 | 84.9 | 85.4 | 92.7 | 91.7 |