2011 Jun 27;18(5):580–587. doi: 10.1136/amiajnl-2011-000155

Table 3.

Basic statistics of the 2010 i2b2/VA Challenge training corpus

Set name	Document type	Documents	Lines	Tokens	Problem concepts	Test concepts	Treatment concepts
BETH	Discharge summaries	73	8727	88 722	4187	3036	3073
PARTNERS	Discharge summaries	97	7515	60 819	2886	1572	1771
UPMCD	Discharge summaries	98	7328	62 727	2728	1217	2308
UPMCP	Progress notes	81	7025	48 302	2167	1544	1348
Total		349	30 597	260 570	11 968	7369	8500

The corpus consists of document files and their corresponding annotation files. The text in each file is already tokenized, and each line corresponds roughly to a sentence or a list item.
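Because the text is already tokenized and split into lines, counts such as the Documents, Lines, and Tokens columns of Table 3 can in principle be obtained with very little code. The following is a minimal sketch, not the procedure used to produce the table: it assumes that tokens are separated by whitespace and that blank lines are not counted, and the names `count_document` and `corpus_stats` are hypothetical. The annotation file format is not described here, so concept counts are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class CorpusStats:
    documents: int = 0
    lines: int = 0
    tokens: int = 0


def count_document(text: str) -> Tuple[int, int]:
    """Return (line count, token count) for one pre-tokenized document.

    Assumption: tokens are whitespace-separated and blank lines are skipped.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tokens = sum(len(ln.split()) for ln in lines)
    return len(lines), tokens


def corpus_stats(documents: Iterable[str]) -> CorpusStats:
    """Accumulate document, line, and token counts over document texts."""
    stats = CorpusStats()
    for text in documents:
        n_lines, n_tokens = count_document(text)
        stats.documents += 1
        stats.lines += n_lines
        stats.tokens += n_tokens
    return stats


if __name__ == "__main__":
    # Two invented toy documents, one line per sentence or list item.
    sample = [
        "The patient was admitted .\n1. Aspirin 81 mg daily\n",
        "No acute distress .\n",
    ]
    print(corpus_stats(sample))
```

On a real corpus, each string would be the contents of one document file; how blank lines and punctuation are counted may differ from the conventions behind the published figures.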