2023 Mar 13;25:e35568. doi: 10.2196/35568

Table 4.

Test set results for classification of reasons for downgrading the quality of evidence (T1). All score columns are GRADE^a downgrading criteria; cells report macroaveraged F1 with the median absolute deviation (in parentheses) across all trials of 10-fold cross-validation.

| Data setup and model   | RoB^b, F1   | Imprecision, F1 | Inconsistency, F1 | Indirectness, F1 | Publication bias, F1 |
|------------------------|-------------|-----------------|-------------------|------------------|----------------------|
| CDSR-QoE^c             |             |                 |                   |                  |                      |
| Random                 | 0.50 (0.02) | 0.48 (0.02)     | 0.23 (0.02)       | 0.17 (0.02)      | 0.08 (0.02)          |
| Majority               | 0.74 (0.02) | 0.71 (0.02)     | 0.00 (0.00)       | 0.00 (0.00)      | 0.00 (0.00)          |
| Classifier^d           | 0.75 (0.02) | 0.72 (0.02)     | 0.09 (0.02)       | 0.11 (0.02)      | 0.02 (0.00)          |
| Multilabeller^e        | 0.74 (0.02) | 0.64 (0.02)     | 0.25 (0.02)       | 0.25 (0.13)      | 0.14 (0.13)          |
| CDSR-QoE-supp          |             |                 |                   |                  |                      |
| Classifier             | 0.78 (0.02) | 0.75 (0.02)     | 0.31 (0.02)       | 0.41 (0.02)      | 0.39 (0.12)          |
| Multilabeller          | 0.64 (0.02) | 0.59 (0.02)     | 0.29 (0.02)       | 0.47 (0.02)      | 0.44 (0.21)          |
| Classifier, −txt^f     | 0.72 (0.02) | 0.74 (0.02)     | 0.26 (0.02)       | 0.24 (0.02)      | 0.19 (0.13)          |
| CDSR-QoE-aligned^g     |             |                 |                   |                  |                      |
| Classifier             | 0.71 (0.02) | 0.69 (0.02)     | 0.47 (0.02)       | 0.28 (0.26)      | 0.05 (0.00)          |
| Classifier, +PS^h      | 0.74 (0.02) | 0.66 (0.02)     | 0.51 (0.02)       | 0.30 (0.10)      | 0.13 (0.00)          |

^a GRADE: Grading of Recommendations, Assessment, Development, and Evaluation.

^b RoB: risk of bias.

^c CDSR-QoE: Cochrane Database of Systematic Reviews Quality of Evidence.

^d Classifier (T1 [a]): each reason type is trained and evaluated independently in a binary setting (eg, “risk of bias” vs “other”).

^e Multilabeller (T1 [b]): a single model predicts multiple reasons jointly.

^f −txt: textual inputs removed.

^g Because CDSR-QoE-aligned is smaller and uses different test sets than CDSR-QoE, its results are not directly comparable with those for CDSR-QoE.

^h +PS: primary study–related features added.
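As a rough illustration of how the reported numbers are aggregated, the sketch below computes macroaveraged F1 for a single fold of a binary setting (as in the classifier setup) and the median absolute deviation across per-fold scores. The toy labels and fold scores are invented for illustration and are not data from the paper.

```python
from statistics import median

def f1(tp, fp, fn):
    # F1 = harmonic mean of precision and recall; 0.0 when undefined.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred, labels):
    # Macro average: unweighted mean of per-class F1 scores.
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def median_abs_deviation(xs):
    # MAD: median of absolute deviations from the median.
    m = median(xs)
    return median(abs(x - m) for x in xs)

# Toy fold in the binary setting ("risk of bias" = 1 vs "other" = 0):
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
fold_score = macro_f1(y_true, y_pred, labels=[0, 1])

# Aggregate hypothetical scores from 10 cross-validation folds,
# as reported in the table (median score with subscripted MAD):
fold_scores = [0.74, 0.75, 0.73, 0.76, 0.74, 0.75, 0.72, 0.74, 0.75, 0.73]
summary = (median(fold_scores), median_abs_deviation(fold_scores))
```

In practice a library implementation (eg, scikit-learn's `f1_score` with `average="macro"`) would replace the hand-rolled functions; the point here is only the aggregation scheme behind the table's cells.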