Table 5.
Scoring method | Novel MEDLINE validation AUC (02/2007-01/2009) | Novel MEDLINE validation AUC (02/2007-04/2010) | Pre-existing CTD validation AUC (11/2008) | Novel CTD validation AUC (11/2008-04/2010) | Pre-existing MEDLINE validation AUC (02/2007) | Mean AUC | Rank |
---|---|---|---|---|---|---|---|
Cosine distance of term frequency-inverse document frequency | 0.92 | 0.91 | 0.95 | 0.93 | 0.98 | 0.94 | 2 |
Cosine distance of P-values | 0.53 | 0.51 | 0.65 | 0.63 | 0.53 | 0.57 | 16 |
Cosine distance of term fractions | 0.90 | 0.89 | 0.93 | 0.91 | 0.96 | 0.92 | 5 |
Sum of the log of combined P-values | 0.91 | 0.89 | 0.94 | 0.94 | 0.94 | 0.92 | 3 |
Sum of the differences of log P-values | 0.91 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 7 |
L2 of log P-values of overlapping terms only | 0.96 | 0.95 | 0.92 | 0.94 | 0.99 | 0.95 | 1 |
L2 of term fractions of overlapping terms only | 0.64 | 0.62 | 0.57 | 0.60 | 0.53 | 0.59 | 15 |
L2 of log of P-values | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 10 |
L2 of P-values | 0.89 | 0.89 | 0.75 | 0.81 | 0.92 | 0.86 | 12 |
L2 of term fractions | 0.92 | 0.90 | 0.91 | 0.92 | 0.95 | 0.92 | 4 |
L2 of term frequency | 0.90 | 0.90 | 0.76 | 0.82 | 0.93 | 0.86 | 11 |
Term coverage | 0.90 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 8 |
Term overlap | 0.91 | 0.89 | 0.90 | 0.92 | 0.90 | 0.90 | 6 |
Number of gene MeSH terms | 0.85 | 0.82 | 0.85 | 0.88 | 0.83 | 0.85 | 13 |
Number of disease MeSH terms | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 9 |
Gene ID | 0.75 | 0.73 | 0.78 | 0.79 | 0.74 | 0.76 | 14 |
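The AUC of each described scoring method was computed on each validation set; methods are ranked by mean AUC across the five sets. AUC, area under the receiver operating characteristic curve; CTD, Comparative Toxicogenomics Database; MeSH, Medical Subject Headings.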
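The Mean AUC and Rank columns can be reproduced from the five per-set values. The following is a minimal sketch, assuming the mean is an unweighted average over the five validation sets and that methods are ordered by that mean. The variable and function names (`aucs`, `mean_auc`, `ranked`) are illustrative and not from the source, and only three rows of the table are included, so the printed positions are relative to this subset rather than to all 16 methods.

```python
# Per-validation-set AUCs copied from three rows of Table 5, in column order.
# Assumption: Mean AUC is the unweighted average of these five values.
aucs = {
    "Cosine distance of term frequency-inverse document frequency":
        [0.92, 0.91, 0.95, 0.93, 0.98],
    "L2 of log-p of overlapping terms only":
        [0.96, 0.95, 0.92, 0.94, 0.99],
    "Gene ID":
        [0.75, 0.73, 0.78, 0.79, 0.74],
}


def mean_auc(values):
    """Return the unweighted mean of a list of AUC values."""
    return sum(values) / len(values)


# Mean AUC per method, then methods ordered from highest to lowest mean.
means = {method: mean_auc(values) for method, values in aucs.items()}
ranked = sorted(means, key=means.get, reverse=True)

for position, method in enumerate(ranked, start=1):
    print(f"{position}. {method}: {means[method]:.2f}")
```

For these three rows the recomputed means round to 0.95, 0.94, and 0.76, matching the table. Because the table reports values rounded to two decimals, a mean recomputed this way can differ from the published Mean AUC by 0.01 for some other rows.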