Table 2.
Distribution of training and testing data for fine-tuning TSR and OCR models, highlighting the unique characteristics of each dataset
| Dataset | TSR: #Training images | TSR: #Testing images | OCR: #Train text lines | OCR: #Test text lines | Average cells per image in test set |
|---|---|---|---|---|---|
| UoS_Data_Rescue | 1113 | 112 | 497045 | 97150 | 867.41 |
| SROIE | 1426 | 273 | 33626 | 18704 | 68.51 |
| CORD | 800 | 100 | 19367 | 2355 | 23.55 |
| PubTabNet | 6000 | 15115 | 26000 | 606719 | 40.14 |
| ICDAR15 | – | – | 4468 | 2077 | – |
* The original PubTabNet dataset was released with 510K training samples
* For PubTabNet, 26000 text lines were randomly selected from the 6000 training samples
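
The last column of Table 2 appears consistent with dividing the number of test text lines by the number of testing images, although the table itself does not state how it was computed. The following is a minimal sketch that checks this assumed relation against the reported values. The dictionary `TABLE_2` and the helper `average_cells_per_image` are hypothetical names introduced only for illustration. ICDAR15 is omitted because it has no image counts.

```python
# Counts copied from Table 2:
# (testing images, test text lines, reported average cells per image).
TABLE_2 = {
    "UoS_Data_Rescue": (112, 97150, 867.41),
    "SROIE": (273, 18704, 68.51),
    "CORD": (100, 2355, 23.55),
    "PubTabNet": (15115, 606719, 40.14),
}


def average_cells_per_image(test_images: int, test_text_lines: int) -> float:
    """Assumed relation (not stated in the table): text lines / images."""
    return test_text_lines / test_images


for name, (images, lines, reported) in TABLE_2.items():
    computed = average_cells_per_image(images, lines)
    # Print both values side by side so any mismatch is easy to spot.
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}")
```

Under this assumption, each text line in the test set corresponds to one table cell, which would explain why UoS_Data_Rescue, with far fewer test images than PubTabNet, still has a much higher per-image cell count.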