Table 3.
Basic statistics of the 2010 i2b2/VA Challenge training corpus
Set name | Document type | Documents | Lines | Tokens | Problem concepts | Test concepts | Treatment concepts |
BETH | Discharge summaries | 73 | 8727 | 88 722 | 4187 | 3036 | 3073 |
PARTNERS | Discharge summaries | 97 | 7515 | 60 819 | 2886 | 1572 | 1771 |
UPMCD | Discharge summaries | 98 | 7328 | 62 727 | 2728 | 1217 | 2308 |
UPMCP | Progress notes | 81 | 7025 | 48 302 | 2167 | 1544 | 1348 |
Total | | 349 | 30 597 | 260 570 | 11 968 | 7369 | 8500 |
The corpus consists of a set of document files and corresponding annotation files. The text in each file is already tokenized, and each line corresponds roughly to a sentence or a list item.
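Because the text is pre-tokenized and stored one sentence or list item per line, counts like the Documents, Lines, and Tokens columns of Table 3 can be approximated with very little code. The following is a minimal sketch, not the procedure used to produce the table. It assumes that tokens are separated by whitespace and that blank lines are not counted; the function name `corpus_stats` and the sample texts are hypothetical.

```python
def corpus_stats(documents):
    """Count documents, non-empty lines, and tokens.

    Assumptions (not stated in the corpus description): tokens are
    separated by whitespace, and blank lines are skipped.
    """
    n_docs = n_lines = n_tokens = 0
    for text in documents:
        n_docs += 1
        for line in text.splitlines():
            tokens = line.split()  # split on any run of whitespace
            if tokens:             # ignore blank lines
                n_lines += 1
                n_tokens += len(tokens)
    return {"documents": n_docs, "lines": n_lines, "tokens": n_tokens}


# Two made-up, already-tokenized documents for illustration only.
sample = [
    "Admission Date :\n2010-01-01\nThe patient was given aspirin .",
    "1. Chest pain\n2. Hypertension",
]
stats = corpus_stats(sample)
print(stats)
```

In practice, each string in `documents` would be the contents of one document file read from disk. The concept counts would come from the annotation files, whose format is not described here.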