Author manuscript; available in PMC: 2009 Jan 1.
Published in final edited form as: Artif Intell Med. 2007 Nov 28;42(1):13–35. doi: 10.1016/j.artmed.2007.10.001

Table 1.

Number of words in each PHI category in the corpora. Word counts depend on the number and format of inserted surrogates.

          Number of tokens
Category  Random corpus  Ambiguous corpus  Out-of-vocabulary corpus  Authentic corpus  Challenge corpus
Non-PHI   17,874         19,275            17,875                    112,669           444,127
Patient   1,048          1,047             1,037                     294               1,737
Doctor    311            311               302                       738               7,697
Location  24             24                24                        88                518
Hospital  600            600               404                       656               5,204
Date      735            736               735                       1,953             7,651
ID        36             36                36                        482               5,110
Phone     39             39                39                        32                271