2017 May 12;8:777. doi: 10.3389/fpsyg.2017.00777

Table 1.

Summary of interrater agreement statistics for Likert-type response scales.

Statistic (citations) | Formula | Interpretation | Strengths | Limitations
rwg
(James et al., 1984; see also Finn, 1970)
$r_{wg} = 1 - (S_x^2 / \sigma_{eu}^2)$
$S_x^2$ = observed variance in judges' ratings on the single item; and $\sigma_{eu}^2$ = variance of the rectangular, uniform null distribution, $(A^2 - 1)/12$, where $A$ is the number of discrete Likert-type response options.
  • A value of 1.0 indicates complete agreement.

  • A value of 0 indicates agreement equal to the null distribution (i.e., one index of completely random responding).

  • Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., 1984).

  • Commonly used in the literature and generally known to researchers and reviewers.

  • Likely the most researched agreement statistic.

  • Linear function facilitates interpretation.

  • Uniform distribution may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter, 2008).

  • May not be directly comparable (i.e., equivalent) across different means of group ratings, number of raters, or sample sizes.

  • It is not uncommon for values to exceed +1.0 or fall below 0. These inadmissible values might not be the result of sampling error. Resetting the values to 0 may therefore be inappropriate and result in loss of information (Brown and Hauenstein, 2005).
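
As a rough illustration of the single-item formula, the sketch below computes rwg for one hypothetical group. The function name `rwg`, the example ratings, and the use of the sample variance (n − 1 denominator) for Sx2 are assumptions made for illustration; the table does not fix them. Out-of-range values are returned unchanged rather than reset to 0, so the caller decides how to treat them.

```python
from statistics import variance


def rwg(ratings, num_options):
    """Single-item r_wg = 1 - S_x^2 / sigma_eu^2, following the table's formula."""
    s2_x = variance(ratings)                  # sample variance (assumption: n - 1 denominator)
    sigma2_eu = (num_options ** 2 - 1) / 12   # variance of the uniform null distribution
    return 1 - s2_x / sigma2_eu               # not clipped: may fall below 0


# Hypothetical data: five judges rating one item on a 5-point scale.
ratings = [4, 4, 5, 4, 3]
print(rwg(ratings, 5))  # 0.75
```

With these ratings the observed variance is 0.5 and the uniform-null variance is 2.0, so the index is 1 − 0.25.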

rwg(j)
(James et al., 1984)
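$r_{wg(j)} = \dfrac{J\,(1 - \bar{S}_x^2/\sigma_{eu}^2)}{J\,(1 - \bar{S}_x^2/\sigma_{eu}^2) + (\bar{S}_x^2/\sigma_{eu}^2)}$
$J$ = number of items;
$\bar{S}_x^2$ = mean of the observed variances in judges' ratings on each scale item; and $\sigma_{eu}^2$ = see above.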
  • A value of 1.0 indicates complete agreement.

  • A value of 0 indicates agreement equal to the null distribution.

  • Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., 1984).

  • Commonly used in the literature and generally known to researchers and reviewers.

  • Likely the most researched agreement statistic.

  • Same as rwg, above.

  • May not be directly comparable (i.e., equivalent) across different means of group ratings or the number of raters.

  • It is upwardly influenced by the number of discrete Likert scale response options.

  • Values in between 1.0 and 0 are difficult to interpret because the function is non-linear.
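
The multi-item formula can be sketched in the same way. The function name `rwg_j`, the layout of the input (one list of judges' ratings per item), and the example data are hypothetical, and the sample variance is again an assumption.

```python
from statistics import variance


def rwg_j(item_ratings, num_options):
    """Multi-item r_wg(j); item_ratings holds one list of judges' ratings per item."""
    j = len(item_ratings)
    sigma2_eu = (num_options ** 2 - 1) / 12
    variances = [variance(item) for item in item_ratings]
    ratio = (sum(variances) / j) / sigma2_eu   # mean item variance over the null variance
    return j * (1 - ratio) / (j * (1 - ratio) + ratio)


# Hypothetical data: three items, five judges each, 5-point scale.
items = [
    [4, 4, 5, 4, 3],
    [5, 4, 4, 4, 3],
    [4, 5, 4, 3, 4],
]
print(round(rwg_j(items, 5), 3))  # 0.9
```

Each item here has the same variance (0.5) that yields 0.75 in the single-item formula, yet three items give 0.9, which illustrates the non-linearity noted in the limitations.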

r*wg
(Lindell and Brandt, 1997)
$r^*_{wg} = 1 - (S_x^2 / \sigma_{eu}^2)$
or
$r^*_{wg} = 1 - (S_x^2 / \sigma_{mv}^2)$
$S_x^2$ = see above; $\sigma_{eu}^2$ = see above; and $\sigma_{mv}^2$ = variance of the maximum dissensus distribution, $0.5(X_U^2 + X_L^2) - [0.5(X_U + X_L)]^2$
  • If using σeu2, the interpretation is the same as rwg, described above.

  • If using σmv2, a value of 1.0 indicates complete agreement; 0.5 indicates agreement equal to the uniform null distribution; and 0 indicates theoretical maximum dissensus.

  • r*wg using σmv2 will tend to be greater than is r*wg using σeu2 and rwg will always be less than is r*wg.

  • Values below 0 (using σeu2) and below 0.5 (using σmv2) are possible when agreement is low (i.e., it suggests bimodal distributions).

  • Presents a compelling alternative to the uniform null distribution (σeu2) by positing the theoretical maximum dissensus (σmv2) for use as a random error term.

  • Circumvents problems of inadmissible values by allowing for meaningful interpretations when Sx2 exceeds σeu2.

  • May not be directly comparable (i.e., equivalent) across different means of group ratings.

  • Maximum dissensus may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter, 2008).

  • May be positively correlated with group mean extremity.
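
The two null terms can be compared side by side in a short sketch. It takes XU and XL to be the highest and lowest response options and assumes consecutive integer options when counting categories; both are assumptions, as are the function name `rwg_star`, its `null` argument, and the example ratings.

```python
from statistics import variance


def rwg_star(ratings, low, high, null="uniform"):
    """r*_wg with either the uniform null or the maximum dissensus variance."""
    s2_x = variance(ratings)                   # sample variance (assumption)
    if null == "uniform":
        a = high - low + 1                     # number of options, assuming consecutive integers
        sigma2 = (a ** 2 - 1) / 12
    elif null == "max_dissensus":
        sigma2 = 0.5 * (high ** 2 + low ** 2) - (0.5 * (high + low)) ** 2
    else:
        raise ValueError("null must be 'uniform' or 'max_dissensus'")
    return 1 - s2_x / sigma2                   # negative values are kept, not reset


# Hypothetical polarized group on a 1-5 scale.
ratings = [1, 2, 4, 5, 1, 5]
print(round(rwg_star(ratings, 1, 5, "uniform"), 3))        # -0.8
print(round(rwg_star(ratings, 1, 5, "max_dissensus"), 3))  # 0.1
```

For a 1-5 scale the uniform-null variance is 2 and the maximum dissensus variance is 4, so the same observed variance (3.6 here) falls below 0 under the first and below 0.5 under the second.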

r*wg(j)
(Lindell et al., 1999)
$r^*_{wg(j)} = 1 - (\bar{S}_x^2 / \sigma_{eu}^2)$
or
$r^*_{wg(j)} = 1 - (\bar{S}_x^2 / \sigma_{mv}^2)$
$\bar{S}_x^2$ = see above; $\sigma_{eu}^2$ = see above; and $\sigma_{mv}^2$ = see above.
  • Same as r*wg, above.

  • Same as r*wg, above.

  • With increasing items the function remains linear, unlike rwg(j).

  • Same as r*wg, above.

r'wg(j)
(Lindell, 2001)
$r'_{wg(j)} = 1 - (S_y^2 / \sigma_{eu}^2)$
$S_y^2$ = variance of individual judges' scale means; and $\sigma_{eu}^2$ = see above.
  • Less attenuated than is r*wg(j) with σeu2 relative to rwg(j).

  • Interpretation is otherwise similar to r*wg(j).

  • Less attenuated than is r*wg; otherwise the strengths are the same as those of r*wg(j).

  • Shares many of the same limitations as does r*wg(j), except r'wg(j) will often be less attenuated.

  • Application has been rare in the literature and, accordingly, researchers and reviewers may be unaware of the underlying logic.
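
A minimal sketch of r'wg(j) as the table gives it, where the variance is taken over each judge's scale mean rather than over item-level ratings. The function name `rwg_prime_j`, the judges-by-items input layout, and the data are hypothetical; the null term is used exactly as listed above, without any further adjustment.

```python
from statistics import mean, variance


def rwg_prime_j(judge_rows, num_options):
    """r'_wg(j) = 1 - S_y^2 / sigma_eu^2, with S_y^2 the variance of judges' scale means."""
    scale_means = [mean(row) for row in judge_rows]   # one scale mean per judge
    sigma2_eu = (num_options ** 2 - 1) / 12
    return 1 - variance(scale_means) / sigma2_eu      # sample variance (assumption)


# Hypothetical data: five judges (rows) by three items (columns), 5-point scale.
judge_rows = [
    [4, 4, 4],
    [5, 4, 3],
    [3, 3, 3],
    [5, 5, 5],
    [4, 3, 5],
]
print(rwg_prime_j(judge_rows, 5))  # 0.75
```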

rwg(p) (LeBreton et al., 2005; LeBreton and Senter, 2008)
  • Identify subgroups, calculate each subgroup's agreement score, check homogeneity of variances and, if supported, substitute the sample-weighted average group variance (denoted S2x) value into the rwg or rwg(j) equation.

  • Homogeneity of variances can be tested using Fisher's F-test by dividing the larger subgroup variance by the smaller subgroup variance, which is approximately distributed as the F distribution with degrees of freedom for subgroup 1/degrees of freedom for subgroup 2 (see Crawley, 2007, p. 289 for application in R).

  • Has the same interpretation as does the previous rwg conceptualization, except that it considers subgroup agreement differences by averaging them.

  • Allows for consideration of theoretically meaningful subgroups.

  • Addresses limitation of inadmissible values that can be problematic for rwg and rwg(j).

  • Has many of the same interpretational problems as do the previous rwg statistics reviewed (e.g., difficulties in choosing an appropriate null distribution).

  • Can be difficult to generate theoretical predictions a priori about the existence of subgroups.

  • Assumes homogeneity of subgroup variances. If homogeneity assumptions cannot be supported, separate rwg values based on subgroups could be another option.
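
The steps described for rwg(p) can be sketched for a single item with two subgroups. Weighting each subgroup variance by its number of judges is one reading of "sample-weighted" and is an assumption here, as are the function names and the example data. The sketch returns only the variance ratio and its degrees of freedom; a p-value would require an F distribution routine, which is left out to keep the example dependency-free.

```python
from statistics import variance


def variance_ratio(group_a, group_b):
    """Return (F, df_numerator, df_denominator): larger variance over smaller."""
    var_a, var_b = variance(group_a), variance(group_b)
    if var_a >= var_b:
        return var_a / var_b, len(group_a) - 1, len(group_b) - 1
    return var_b / var_a, len(group_b) - 1, len(group_a) - 1


def rwg_p(subgroups, num_options):
    """Single-item r_wg(p) using the sample-weighted average subgroup variance."""
    n_total = sum(len(g) for g in subgroups)
    pooled = sum(len(g) * variance(g) for g in subgroups) / n_total  # weights = subgroup sizes (assumption)
    sigma2_eu = (num_options ** 2 - 1) / 12
    return 1 - pooled / sigma2_eu


# Hypothetical data: two subgroups that agree internally but not with each other.
low_raters = [1, 2, 1, 2]
high_raters = [4, 5, 5, 4]
print(variance_ratio(low_raters, high_raters))            # (1.0, 3, 3)
print(round(rwg_p([low_raters, high_raters], 5), 3))      # 0.833
```

Treating all eight judges as one group would give an observed variance of about 2.86, above the uniform-null variance of 2.0, and therefore a negative rwg. Averaging the subgroup variances avoids that inadmissible value. Note that `variance_ratio` divides by the smaller variance, so it fails if a subgroup has zero variance.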

ADM(j)
(Burke et al., 1999; Burke and Dunlap, 2002)
$AD_{M(j)} = \sum |x_i - \bar{x}| / k$
$x_i$ = a judge's rating on the item; $\bar{x}$ = the group mean rating on the item; and $k$ = the number of judges.
  • Indexes the average distance of judges' ratings from the group's scale mean.

  • Considerable justification for practical cutoff criteria has been proposed, but the criteria are not without assumptions (see Section Standards for Agreement).

  • Interpretation is not complicated by changes (e.g., non-linearity) in the number of Likert categories (bearing in mind greater deviations are expected given category increases).

  • Circumvents problems associated with choosing an appropriate null distribution.

  • May be negatively correlated with group mean extremity.

  • Does not permit explicit modeling of random responding (i.e., has no null distribution term).

  • AD values are highly dependent on the number of scale categories employed. This makes it very difficult to compare AD values of scales differing in length.

ADM(J)
(Burke et al., 1999; Burke and Dunlap, 2002)
$AD_{M(J)} = \sum AD_{M(j)} / J$
J = see above.
  • Shares interpretations of ADM(j) except generalizes to multi-item scales.

  • Same advantages as ADM(j).

  • Takes the average of each ADM(j) and, therefore, does not unnecessarily complicate the multi-item interpretation.

  • Same limitations of ADM(j).
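
Both average deviation indices are simple to compute; the sketch below shows the item-level value and its average across items. The function names `ad_m_item` and `ad_m_scale` and the ratings are hypothetical.

```python
from statistics import mean


def ad_m_item(ratings):
    """AD_M(j): mean absolute deviation of judges' ratings from the item mean."""
    item_mean = mean(ratings)
    return sum(abs(x - item_mean) for x in ratings) / len(ratings)


def ad_m_scale(item_ratings):
    """AD_M(J): average of the item-level AD_M(j) values across J items."""
    return sum(ad_m_item(item) for item in item_ratings) / len(item_ratings)


# Hypothetical data: three items, five judges each.
items = [
    [4, 4, 5, 4, 3],
    [5, 5, 5, 5, 5],
    [2, 4, 4, 4, 4],
]
print(ad_m_item(items[0]))            # 0.4
print(round(ad_m_scale(items), 3))    # 0.347
```

The result is expressed in scale units, which is why values from scales with different numbers of categories are hard to compare.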

awg(1)
(Brown and Hauenstein, 2005)
$a_{wg(1)} = 1 - [(2 S_x^2) / S_{mpv/m}^2]$
$S_x^2$ = see above; and $S_{mpv/m}^2 = [(H + L)M - M^2 - (H \times L)] \times [k/(k - 1)]$, where $H$ = maximum discrete scale value; $L$ = minimum discrete scale value; $M$ = observed mean rating; and $k$ = number of raters.
  • A value of +1.0 indicates perfect agreement, given the group mean.

  • A value of 0 indicates the observed variance is 50% of the maximum variance, given the group mean.

  • A value of −1.0 indicates maximum disagreement given the group mean. Will equal single-item rwg when the group mean is at the scale mid-point and the variance equations (sample vs. population) are not mismatched for rwg.

  • Will equal single and multi-item r*wg using σeu2 when the group mean is at the midpoint and the variances are not mismatched.

  • Controls for the extremeness of the group mean by not relying on a single specification of the null distribution.

  • Uses the unbiased, sample variance to calculate observed and theoretical random variance terms, whereas the rwg family of statistics confound these.

  • Circumvents problems of inadmissible values.

  • Will not be affected by sample size because it employs matched variances.

  • Requires at least A-1 raters for calculating interpretable awg values, where A is equal to the number of Likert response categories (see Brown and Hauenstein, 2005).

  • Is not interpretable at face value beyond certain extreme group means. That is, the minimum mean with interpretable awg = [L(k-1) + H]/k; and the maximum mean with interpretable awg = [H(k-1) + L]/k.
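
A sketch of awg(1) and of the interpretable range of group means, following the formulas in this row. The function names, the decision to raise an error when the maximum possible variance is zero, and the example ratings are assumptions made for illustration.

```python
from statistics import mean, variance


def awg_1(ratings, low, high):
    """a_wg(1) = 1 - 2 * S_x^2 / S_mpv/m^2, where S_mpv/m^2 depends on the group mean."""
    k = len(ratings)
    m = mean(ratings)
    # Maximum possible variance given the mean, with the k / (k - 1) sample correction.
    s2_mpv = ((high + low) * m - m ** 2 - high * low) * k / (k - 1)
    if s2_mpv == 0:
        # Assumption: treat a mean at a scale endpoint as undefined rather than guess a value.
        raise ValueError("a_wg(1) is undefined when the group mean sits at a scale endpoint")
    return 1 - 2 * variance(ratings) / s2_mpv


def interpretable_mean_range(k, low, high):
    """Smallest and largest group means for which a_wg is interpretable."""
    return (low * (k - 1) + high) / k, (high * (k - 1) + low) / k


ratings = [4, 4, 5, 4, 3]                    # hypothetical: five judges, 1-5 scale
print(round(awg_1(ratings, 1, 5), 3))        # 0.733
print(interpretable_mean_range(5, 1, 5))     # (1.8, 4.2)
print(awg_1([1, 1, 5, 5], 1, 5))             # -1.0, maximum disagreement at the midpoint
```

The group mean of 4 lies inside the interpretable range of 1.8 to 4.2 for five raters on this scale, so the value of about 0.73 can be read at face value.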

awg(j)
(Brown and Hauenstein, 2005)
$a_{wg(j)} = \sum a_{wg(1)} / J$
J = see above.
  • Shares interpretations of awg(1), except generalizes to multi-item scales.

  • Same advantages as awg(1).

  • Takes the average of each awg(1) and, therefore, does not unnecessarily complicate the multi-item interpretation.

  • Same limitations as awg(1).

Swg
(Schmidt and Hunter, 1989)
$S_{wg} = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$
$x_i$ = a judge's rating on the item; $\bar{x}$ = the group mean rating on the item; and $n$ = the number of group members.
  • The root of the average squared judge deviation from the mean.

  • Provides a straightforward and direct index of agreement.

  • Will be scale dependent such that a greater number of response options will tend to produce greater Swg.

  • Does not permit explicit modeling of random responding (i.e., has no null distribution term).
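
Swg is the within-group sample standard deviation, so a sketch needs only a few lines. The function name `swg` and the ratings are hypothetical; the cross-check against `statistics.stdev` is included only to confirm that the hand-written formula matches the standard definition.

```python
import math
from statistics import stdev


def swg(ratings):
    """S_wg: square root of the summed squared deviations divided by n - 1."""
    n = len(ratings)
    group_mean = sum(ratings) / n
    return math.sqrt(sum((x - group_mean) ** 2 for x in ratings) / (n - 1))


ratings = [4, 4, 5, 4, 3]   # hypothetical: five judges, one item
print(round(swg(ratings), 3))                    # 0.707
print(math.isclose(swg(ratings), stdev(ratings)))  # True
```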

Swg(j)
$S_{wg(j)} = \sum S_{wg} / J$
J = see above.
  • Shares interpretations of Swg, except generalizes to multi-item scales.

  • Same advantages as Swg.

  • Takes the average of each Swg and, therefore, does not unnecessarily complicate the multi-item interpretation.

  • Same limitations as Swg.

CVwg
(Allison, 1978; Bedeian and Mossholder, 2000)
$CV_{wg} = S_{wg} / \bar{x}$
$S_{wg}$ = see above; and $\bar{x}$ = see above.
  • Rescales the standard deviation by taking into account the mean. Large values suggest large variance relative to the mean (and scale).

  • Samples with larger means may be expected to have greater standard deviations than samples with smaller means. The CVwg will not be affected by the scale mean, thereby facilitating comparisons across samples (i.e., groups) with different means (and scaling).

  • It is difficult to decide what constitutes high and low consensus based on CVwg values; therefore, application and interpretation of CVwg may be difficult.

  • The assumption of a non-negative ratio scale may not always be tenable.

  • The CVwg is intended for situations in which means vary widely. If groups tend not to differ much on sample means there is little reason to adopt CVwg.

  • Does not permit explicit modeling of random responding (i.e., has no null distribution term).
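
To illustrate how CVwg rescales the standard deviation by the mean, the sketch below compares two hypothetical groups with identical spread but different means. The function name `cv_wg` and the data are assumptions, and the ratings are taken to lie on a non-negative scale with a non-zero mean.

```python
from statistics import mean, stdev


def cv_wg(ratings):
    """CV_wg = S_wg / group mean; assumes a non-negative scale and a non-zero mean."""
    return stdev(ratings) / mean(ratings)


high_mean_group = [4, 4, 5, 4, 3]   # hypothetical: mean 4, S_wg about 0.707
low_mean_group = [2, 2, 3, 2, 1]    # hypothetical: mean 2, same S_wg
print(round(cv_wg(high_mean_group), 3))  # 0.177
print(round(cv_wg(low_mean_group), 3))   # 0.354
```

The two groups have the same Swg, yet the group with the lower mean has a CVwg twice as large, which shows why the index is mainly of interest when group means differ widely.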

CVwg(j)
$CV_{wg(j)} = \sum CV_{wg} / J$
J = see above.
  • Shares interpretations of CVwg, except generalizes to multi-item scales.

  • Same advantages as CVwg.

  • Same disadvantages as CVwg.