Table 9. Results on medical evaluation set

| Evaluation configuration | Abbr→Exp (RI_4+RP_4_sw): P | R | Exp→Abbr (RI_8+RP_8_sw): P | R | Syn (RI_20+RP_2_sw): P | R |
|---|---|---|---|---|---|---|
| RI baseline | 0.02 | 0.09 | 0.01 | 0.08 | 0.03 | 0.18 |
| RP baseline | 0.01 | 0.06 | 0.01 | 0.05 | 0.05 | 0.26 |
| Medical ensemble | 0.03 | 0.17 | 0.01 | 0.11 | 0.06 | 0.34 |
| +Post-processing (top 10) | 0.03 | 0.17 | 0.02 | 0.11 | 0.06 | 0.34 |
| +Dynamic cut-off (top ≤ 10) | 0.17 | 0.17 | 0.10 | 0.11 | 0.06 | 0.34 |
Results (P = weighted precision, R = recall, top ten) of the best semantic spaces with and without post-processing on the three tasks. The dynamic number of suggestions allows the model to suggest fewer than ten terms in order to improve precision. The results are based on applying the model combinations to the evaluation data. The difference in recall between the ensemble method and the best baseline is statistically significant (p < 0.05) only for the synonym task (p-value = 0.000).
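To illustrate the dynamic cut-off idea described above, the following is a minimal sketch, not the paper's actual implementation: it assumes a ranked candidate list of (term, similarity) pairs produced by the ensemble and drops candidates below a similarity threshold, so fewer than ten suggestions may be returned. The function name, the threshold value, and the example terms are assumptions for demonstration only.

```python
def dynamic_cutoff(ranked_candidates, max_suggestions=10, min_similarity=0.4):
    """Return at most `max_suggestions` candidates whose similarity to the
    query term meets `min_similarity`.

    ranked_candidates: list of (term, similarity) pairs, sorted by
    descending similarity (e.g. output of an ensemble of semantic spaces).
    The threshold is a hypothetical value, not one reported in the paper.
    """
    kept = []
    for term, similarity in ranked_candidates[:max_suggestions]:
        if similarity < min_similarity:
            break  # list is sorted, so all remaining candidates also fail
        kept.append((term, similarity))
    return kept


if __name__ == "__main__":
    # Hypothetical ranked suggestions for an abbreviation-expansion query.
    candidates = [
        ("myocardial infarction", 0.71),
        ("heart attack", 0.65),
        ("cardiac arrest", 0.52),
        ("angina", 0.31),
    ]
    print(dynamic_cutoff(candidates))
    # Only the first three pairs are kept; the last falls below the threshold.
```

Under this scheme, precision can improve (as in the last row of Table 9) because low-confidence suggestions are withheld, while recall within the top ten is largely unaffected.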