2009 Aug 27;10(Suppl 8):S4. doi: 10.1186/1471-2105-10-S8-S4

Table 4.

Test corpora for information extraction evaluation. Based on the citation references from UniProtKB, a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora were derived from this corpus: the gold standard corpus (GC), which represents a manually annotated test set, and the cross-validation corpus (XC), which contains automatically assigned annotations based on information from UniProtKB.

Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2)
Abstracts count | 100 | 55,998 | 5,253
Method of annotation | manual | automatic | automatic
Total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A
Total/unique proteins | 990/511 | N/A | N/A
Total/unique organisms | 323/123 | N/A | N/A
Total/unique associations | 240/172 residue-protein-organism associations | N/A/70,401 protein-organism as UTP | N/A/68,008 protein-residue as URP
Application | Test the type, amount and reliability of the extracted information (reproduction of manually annotated information). | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database.