2011 Jun 27;18(5):580–587. doi: 10.1136/amiajnl-2011-000155

Table 3.

Basic statistics of the 2010 i2b2/VA Challenge training corpus

Set name  Document type        Documents  Lines   Tokens   Total concepts
                                                           Problem  Test  Treatment
BETH      Discharge summaries  73         8727    88 722   4187     3036  3073
PARTNERS  Discharge summaries  97         7515    60 819   2886     1572  1771
UPMCD     Discharge summaries  98         7328    62 727   2728     1217  2308
UPMCP     Progress notes       81         7025    48 302   2167     1544  1348
Total                          349        30 597  260 570  11 968   7369  8500
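
As a quick consistency check, the per-set rows of Table 3 can be summed and compared against the Total row. The sketch below is only illustrative: the names `ROWS`, `REPORTED_TOTAL`, and `COLUMNS` are made up for this example, and the numbers are copied from the table with the thousands spacing removed.

```python
# Values transcribed from Table 3; variable names are illustrative only.
COLUMNS = ("documents", "lines", "tokens", "problem", "test", "treatment")

ROWS = {
    "BETH":     (73, 8727, 88722, 4187, 3036, 3073),
    "PARTNERS": (97, 7515, 60819, 2886, 1572, 1771),
    "UPMCD":    (98, 7328, 62727, 2728, 1217, 2308),
    "UPMCP":    (81, 7025, 48302, 2167, 1544, 1348),
}

# The Total row exactly as printed in the table.
REPORTED_TOTAL = (349, 30597, 260570, 11968, 7369, 8500)

# Sum each column across the four sets.
computed = tuple(sum(column) for column in zip(*ROWS.values()))

# Collect columns where the computed sum differs from the printed total.
mismatches = {
    name: (calc, reported)
    for name, calc, reported in zip(COLUMNS, computed, REPORTED_TOTAL)
    if calc != reported
}

print(computed)    # (349, 30595, 260570, 11968, 7369, 8500)
print(mismatches)  # {'lines': (30595, 30597)}
```

Every column reproduces the printed total except Lines, where the four rows add up to 30 595 rather than 30 597. The table does not say where the difference of two lines comes from.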

The corpus consists of a set of document files and corresponding annotation files. The text in each file is already tokenized, and each line corresponds roughly to a sentence or a list item.
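
Because the text is pre-tokenized with one sentence or list item per line, line and token counts like those in Table 3 can be obtained with plain string operations. The following is a minimal sketch, assuming tokens are separated by whitespace (the separator is not stated here). The function name `line_and_token_counts` and the sample text are hypothetical and not taken from the corpus.

```python
from typing import Tuple


def line_and_token_counts(text: str) -> Tuple[int, int]:
    """Return (number of lines, number of tokens) for one document.

    Assumes the document is already tokenized and that tokens are
    separated by whitespace, with one sentence or list item per line.
    """
    lines = text.splitlines()
    n_tokens = sum(len(line.split()) for line in lines)
    return len(lines), n_tokens


# Made-up sample in the assumed format (not real corpus text).
sample = "The patient was given aspirin .\nNo chest pain .\n1. Hypertension"

print(line_and_token_counts(sample))  # (3, 12)
```

Summing these per-document counts over all files in a set would give that set's Lines and Tokens figures, provided the counting conventions match those used for the table.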