2020 Dec 3;20:293. doi: 10.1186/s12874-020-01179-5

Table 6.

Consensus reached on standards for preferred statistical methods for reliability

Statistical methods	very good	adequate	doubtful	inadequate
7	For continuous scores: was an Intraclass Correlation Coefficient (ICC)^a calculated? 28/35 (80%)(R2^b)	ICC calculated; the model or formula was described, and matches the study design^c and the data 30/35 (86%)(R2)	ICC calculated but model or formula was not described or does not optimally match the study design^c OR Pearson or Spearman correlation coefficient calculated WITH evidence provided that no systematic difference between measurements has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic difference between measurements has occurred 25/35 (71%)(R2) OR WITH evidence provided that systematic difference between measurements has occurred 25/34 (74%)(R2)
8	For ordinal scores: was a (weighted) Kappa calculated? 26/36 (72%)(R2)	Kappa calculated; the weighting scheme was described, and matches the study design and the data 27/36 (75%)(R3^d)	Kappa calculated, but weighting scheme not described or does not optimally match the study design 19/36 (53%)(R3)
9	For dichotomous/nominal scores: was Kappa calculated for each category against the other categories combined? 23/33 (70%)(R3)	Kappa calculated for each category against the other categories combined

^a Generalizability and Decision coefficients are ICCs.
^b R2: consensus reached in round 2.
^c Based on panelists’ suggestions, the steering committee decided after round 3 to use the term ‘study design’ instead of ‘reviewer constructed research question’.
^d R3: consensus reached in round 3.
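
The standards in Table 6 can be illustrated with a minimal Python sketch. It is not part of the consensus study; all function names and data below are hypothetical. For standard 7 it assumes one particular ICC form (two-way model, absolute agreement, single measurement), whereas the table requires the model to match the study design, so other designs may call for a different formula. The first example shows why a Pearson correlation alone is rated lower: a constant shift between two measurements leaves Pearson's r at 1 while the agreement ICC drops. The second example computes Kappa for each category against the other categories combined, as described in standard 9.

```python
from statistics import mean


def icc_agreement(scores):
    """ICC for a two-way model, absolute agreement, single measurement.

    scores: one row per subject, each row holding k repeated measurements.
    Assumption: this is only one of several ICC models.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-measurement mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def pearson(x, y):
    """Pearson correlation; insensitive to a constant shift between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cohen_kappa(r1, r2):
    """Unweighted Cohen's Kappa for two raters (undefined if chance agreement is 1)."""
    n = len(r1)
    categories = set(r1) | set(r2)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    p_chance = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)


def kappa_per_category(r1, r2):
    """Kappa for each category against all other categories combined."""
    categories = sorted(set(r1) | set(r2))
    return {
        c: cohen_kappa([a == c for a in r1], [b == c for b in r2])
        for c in categories
    }


# Hypothetical continuous scores: the second measurement is shifted by +5.
first = [10, 12, 14, 16, 18]
second = [x + 5 for x in first]
r = pearson(first, second)                               # perfect correlation
icc = icc_agreement([list(p) for p in zip(first, second)])  # penalised by the shift

# Hypothetical nominal ratings from two raters.
rater1 = ["A", "A", "B", "B", "C", "C"]
rater2 = ["A", "A", "B", "C", "C", "C"]
kappas = kappa_per_category(rater1, rater2)
```

With these made-up data, `r` is 1 while `icc` is about 0.44, which is the situation the table flags when a correlation coefficient is reported without evidence that no systematic difference occurred. For ordinal scores (standard 8), a weighted Kappa would additionally need a weighting scheme, which this sketch does not cover.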