2023 Mar 13;25:e35568. doi: 10.2196/35568

Table 4.

Test set results for classification of reasons for downgrading the quality of evidence (T1). All score columns are GRADE^a downgrading criteria; cells report macroaveraged F1 with the median absolute deviation (in parentheses) across all trials of 10-fold cross-validation.

| Data setup and model   | RoB^b, F1   | Imprecision, F1 | Inconsistency, F1 | Indirectness, F1 | Publication bias, F1 |
|------------------------|-------------|-----------------|-------------------|------------------|----------------------|
| CDSR-QoE^c             |             |                 |                   |                  |                      |
| Random                 | 0.50 (0.02) | 0.48 (0.02)     | 0.23 (0.02)       | 0.17 (0.02)      | 0.08 (0.02)          |
| Majority               | 0.74 (0.02) | 0.71 (0.02)     | 0.00 (0.00)       | 0.00 (0.00)      | 0.00 (0.00)          |
| Classifier^d           | 0.75 (0.02) | 0.72 (0.02)     | 0.09 (0.02)       | 0.11 (0.02)      | 0.02 (0.00)          |
| Multilabeller^e        | 0.74 (0.02) | 0.64 (0.02)     | 0.25 (0.02)       | 0.25 (0.13)      | 0.14 (0.13)          |
| CDSR-QoE-supp          |             |                 |                   |                  |                      |
| Classifier             | 0.78 (0.02) | 0.75 (0.02)     | 0.31 (0.02)       | 0.41 (0.02)      | 0.39 (0.12)          |
| Multilabeller          | 0.64 (0.02) | 0.59 (0.02)     | 0.29 (0.02)       | 0.47 (0.02)      | 0.44 (0.21)          |
| Classifier, −txt^f     | 0.72 (0.02) | 0.74 (0.02)     | 0.26 (0.02)       | 0.24 (0.02)      | 0.19 (0.13)          |
| CDSR-QoE-aligned^g     |             |                 |                   |                  |                      |
| Classifier             | 0.71 (0.02) | 0.69 (0.02)     | 0.47 (0.02)       | 0.28 (0.26)      | 0.05 (0.00)          |
| Classifier, +PS^h      | 0.74 (0.02) | 0.66 (0.02)     | 0.51 (0.02)       | 0.30 (0.10)      | 0.13 (0.00)          |

^a GRADE: Grading of Recommendations, Assessment, Development, and Evaluation.

^b RoB: risk of bias.

^c CDSR-QoE: Cochrane Database of Systematic Reviews Quality of Evidence.

^d Classifier (T1 [a]): each reason type is trained and evaluated independently in a binary setting (eg, “risk of bias” vs “other”).

^e Multilabeller (T1 [b]): a single model predicts multiple reasons jointly.

^f −txt: textual inputs removed.

^g Because CDSR-QoE-aligned is smaller and uses different test sets than CDSR-QoE, its results are not directly comparable with those for CDSR-QoE.

^h +PS: primary study–related features added.
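As a rough illustration of how the reported numbers are aggregated, the sketch below computes macroaveraged F1 for a single fold of a binary setting (as in the classifier setup) and the median absolute deviation across per-fold scores. The toy labels and fold scores are invented for illustration and are not data from the paper.

```python
from statistics import median

def f1(tp, fp, fn):
    # F1 = harmonic mean of precision and recall; 0.0 when undefined.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred, labels):
    # Macro average: unweighted mean of per-class F1 scores.
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def median_abs_deviation(xs):
    # MAD: median of absolute deviations from the median.
    m = median(xs)
    return median(abs(x - m) for x in xs)

# Toy fold in the binary setting ("risk of bias" = 1 vs "other" = 0):
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
fold_score = macro_f1(y_true, y_pred, labels=[0, 1])

# Aggregate hypothetical scores from 10 cross-validation folds,
# as reported in the table (median score with subscripted MAD):
fold_scores = [0.74, 0.75, 0.73, 0.76, 0.74, 0.75, 0.72, 0.74, 0.75, 0.73]
summary = (median(fold_scores), median_abs_deviation(fold_scores))
```

In practice a library implementation (eg, scikit-learn's `f1_score` with `average="macro"`) would replace the hand-rolled functions; the point here is only the aggregation scheme behind the table's cells.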