Table 4.
| Data setup and model | GRADE^a downgrading criteria | | | | |
| --- | --- | --- | --- | --- | --- |
| | RoB^b, F1 | Imprecision, F1 | Inconsistency, F1 | Indirectness, F1 | Publication bias, F1 |
| CDSR-QoE^c | | | | | |
| Random | 0.50 (0.02) | 0.48 (0.02) | 0.23 (0.02) | 0.17 (0.02) | 0.08 (0.02) |
| Majority | 0.74 (0.02) | 0.71 (0.02) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Classifier^d | 0.75 (0.02) | 0.72 (0.02) | 0.09 (0.02) | 0.11 (0.02) | 0.02 (0.00) |
| Multilabeller^e | 0.74 (0.02) | 0.64 (0.02) | 0.25 (0.02) | 0.25 (0.13) | 0.14 (0.13) |
| CDSR-QoE-supp | | | | | |
| Classifier | 0.78 (0.02) | 0.75 (0.02) | 0.31 (0.02) | 0.41 (0.02) | 0.39 (0.12) |
| Multilabeller | 0.64 (0.02) | 0.59 (0.02) | 0.29 (0.02) | 0.47 (0.02) | 0.44 (0.21) |
| Classifier, −txt^f | 0.72 (0.02) | 0.74 (0.02) | 0.26 (0.02) | 0.24 (0.02) | 0.19 (0.13) |
| CDSR-QoE-aligned^g | | | | | |
| Classifier | 0.71 (0.02) | 0.69 (0.02) | 0.47 (0.02) | 0.28 (0.26) | 0.05 (0.00) |
| Classifier, +PS^h | 0.74 (0.02) | 0.66 (0.02) | 0.51 (0.02) | 0.30 (0.10) | 0.13 (0.00) |
^a GRADE: Grading of Recommendation, Assessment, Development, and Evaluation.
^b RoB: risk of bias.
^c CDSR-QoE: Cochrane Database of Systematic Reviews Quality of Evidence.
^d For the classifier (T1 [a]), a separate model is trained and evaluated for each reason type in a binary setting (eg, “risk of bias” vs “other”).
^e For the multilabeller, a single model is tasked with predicting multiple reasons (T1 [b]).
^f −txt: with textual inputs removed.
^g Because CDSR-QoE-aligned is smaller than CDSR-QoE and uses different test sets, the results on the 2 data sets are not directly comparable.
^h +PS: with primary study–related features added.
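
The per-criterion scores above can be read as one-vs-other F1 values. The following is a minimal sketch of how such scores might be computed from sets of downgrading reasons. It assumes that F1 is measured on the positive class of each criterion, which the table does not state, and the function names and toy labels are purely illustrative, not the study's code or data.

```python
from typing import Dict, List, Set

CRITERIA = [
    "risk of bias",
    "imprecision",
    "inconsistency",
    "indirectness",
    "publication bias",
]


def binary_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 of the positive class; returns 0.0 when there are no true positives."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_criterion_f1(
    gold_sets: List[Set[str]], pred_sets: List[Set[str]]
) -> Dict[str, float]:
    """Score each criterion separately as 'this reason' vs 'other'."""
    scores = {}
    for criterion in CRITERIA:
        gold = [criterion in reasons for reasons in gold_sets]
        pred = [criterion in reasons for reasons in pred_sets]
        scores[criterion] = binary_f1(gold, pred)
    return scores


# Toy example (hypothetical labels, not taken from CDSR-QoE).
gold_sets = [
    {"risk of bias", "imprecision"},
    {"risk of bias"},
    {"imprecision", "inconsistency"},
    set(),
]
pred_sets = [
    {"risk of bias"},
    {"risk of bias", "imprecision"},
    {"imprecision"},
    set(),
]
scores = per_criterion_f1(gold_sets, pred_sets)
for criterion, f1 in scores.items():
    print(f"{criterion}: {f1:.2f}")
```

Under this assumption, a predictor that never outputs a given reason scores an F1 of 0 for that criterion, as "inconsistency" does in the toy data.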