Scientific Reports
2025 Feb 24; 15:6568. doi: 10.1038/s41598-025-90873-9

Bayesian Gower agreement for categorical data

John Hughes
PMCID: PMC11850839  PMID: 39994317

Abstract

In this work I present two methods for measuring agreement in nominal and ordinal data. The measures, which employ Gower-type distances, are simple, intuitive, and easy to compute for any number of units and any number of coders. Influential units and/or coders are easily identified. I consider both one-way and two-way random sampling designs, and develop an approach to Bayesian inference for each. I apply the methods to simulated data and to two real datasets, the first from a one-way radiological study of congenital diaphragmatic hernia, and the second from a two-way study of psychiatric diagnosis. Finally, I consider agreement scales and suggest that Gaussian mutual information can perhaps provide a scale that is more useful than the scale most commonly used. The methods I propose are supported by my open source R package, goweragreement, which is available on the Comprehensive R Archive Network.

Subject terms: Statistics, Software

Introduction

An inter-coder or intra-coder agreement coefficient, which takes a value in the unit interval, is a statistical measure of the extent to which two or more coders agree regarding the same units of analysis. The agreement problem has a long history and is important in many fields of inquiry, and numerous agreement statistics have been proposed.

The first agreement coefficients were S[3], π[28], and κ[7]. Bennett et al.[3] proposed the S score as a measure of the extent to which two methods of communication provide identical information. Scott[28] proposed the π coefficient for measuring agreement between two coders. Cohen[7] criticized π and proposed the κ coefficient as an alternative to π, although Smeeton[30] noted that Francis Galton mentioned a κ-like statistic in his 1892 book, Finger Prints. Fleiss[11] proposed multi-π, a generalization of Scott's π for measuring agreement among three or more coders. Conger[8] and Davies and Fleiss[9] likewise generalized κ to the multi-coder setting. Other generalizations of κ, e.g., weighted κ[6], have also been proposed. The κ coefficient and its generalizations can fairly be said to dominate the field and are still widely used despite their well-known shortcomings[5,10]. Other frequently used measures of agreement are Gwet's AC1 and AC2[14] and Krippendorff's α[16]. For more comprehensive reviews of the literature on agreement, I refer the interested reader to the article by Banerjee et al.[2], the article by Artstein and Poesio[1], and the book by Gwet[15].

In this article I present new means of measuring agreement for nominal and ordinal data, and develop corresponding methods for Bayesian inference for both one-way random designs (units are random, coders are fixed) and two-way random designs (both units and coders are random). In Section "Gower-type agreement measures for nominal and ordinal data" I describe the agreement measures. In Section "Bayesian inference" I propose algorithms for sampling from the posterior distribution of the parameter of interest. In Section "Application to simulated data" I evaluate the two methodologies by applying them to simulated data. In Section "Application to real data" I apply the methods to two real datasets. In Section "Influence diagnostics" I show how to measure the influence of individual units and/or coders. In Section "Agreement scale calibration" I propose a method for obtaining a calibrated agreement scale for a given dataset, distance function, and sampling model. In Section "A footnote regarding agreement scales" I briefly discuss the possibility of constructing an agreement scale based on Gaussian mutual information. I make concluding remarks in Section "Discussion".

Gower-type agreement measures for nominal and ordinal data

Suppose the data X_ij are arranged in an n × m matrix X, where n is the number of units and m is the number of coders. Then X_ij is the score assigned by coder j to unit i.

As an example, consider the nominal data that will be analyzed below in Section "Nominal data from a two-way study of psychiatric diagnosis". The data from this study are psychiatric diagnoses (depression, personality disorder, schizophrenia, neurosis, and other) assigned to 30 patients by six raters[11]. That is, each row of X corresponds to a psychiatric patient, and each of the six elements in a given row contains a diagnosis assigned by one of the raters: X_ij is the diagnosis assigned by rater j to patient i. Since both the patients and the raters were presumably sampled from superpopulations, these data have a two-way design. See Section "Nominal data from a two-way study of psychiatric diagnosis" for more information regarding this dataset.

The building blocks of the proposed agreement measure are the row statistics

$$G_i = 1 - \binom{m}{2}^{-1} \sum_{j<k} d(x_{ij}, x_{ik}) \qquad (i = 1, \dots, n),$$

where d is an appropriate distance function. We see that the second term above is the sample mean of the distances for the $\binom{m}{2}$ distinct pairs of observations in row i of X. And row statistic G_i is a Gower-type[13] measure of agreement for row i. If the row in question is constant, the average distance will be 0, in which case the row statistic will equal 1 (perfect agreement). The more heterogeneous the row, the larger the average distance, in which case the row statistic will be closer to 0 (poorer agreement).

For nominal data I recommend the discrete metric d(x, y) = I(x ≠ y), where I denotes the indicator function. For ordinal data I recommend the L₁ distance function given by

$$d(x, y) = \frac{\lvert x - y \rvert}{r},$$

where r is the range of the scores, e.g., r = 4 for scores in {1, …, 5}. These distance functions were considered by Gower[13] and are also commonly used in other agreement settings[19,21] and more generally[4].
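To make the two recommended distance functions concrete, here is a minimal sketch in Python (illustrative only; the goweragreement package is written in R, and these function names are my own):

```python
def discrete_metric(x, y):
    """Discrete metric for nominal scores: 0 if the scores agree, 1 otherwise."""
    return 0.0 if x == y else 1.0

def l1_distance(x, y, r):
    """L1 distance for ordinal scores, normalized by the range r of the scale."""
    return abs(x - y) / r

# Scores in {1, ..., 5} have range r = 4, so the distance lies in [0, 1].
print(discrete_metric("neurosis", "neurosis"))    # 0.0
print(discrete_metric("neurosis", "depression"))  # 1.0
print(l1_distance(2, 3, r=4))                     # 0.25
print(l1_distance(1, 5, r=4))                     # 1.0
```

Both functions return values in [0, 1], which is what makes the row statistics of the previous display interpretable as agreement scores.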

When d is the discrete metric the distances are of course binary, and their sum is not even approximately binomial unless the intra-row dependence is very weak. This is not surprising given theoretical work regarding sums of dependent Bernoulli variables[12]. In any case, the row statistics are an identically distributed sample from a discrete distribution having its points of support in the unit interval. This distribution is determined by the marginal distribution of the scores, the dependence structure, the number of coders, and the distance function. The mean of this distribution, θ, say, is the proposed measure of agreement for the study. (Although Gower distance was discovered long ago, these measures do not, to my knowledge, appear in the agreement literature, nor has Bayesian inference been considered in this context.)

To put it another way, the parameter of interest, θ, is the expected value of the random variable

$$G = 1 - \binom{m}{2}^{-1} \sum_{j<k} d(X_{ij}, X_{ik}),$$

where m is the number of coders and i can index any unit. That is, θ = E(G). For a hypothetical dataset comprising n units, the row statistics G_1, …, G_n are a sample of G.

One might then estimate θ as the sample mean of the G_i:

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} G_i = 1 - \frac{1}{n} \binom{m}{2}^{-1} \sum_{i=1}^{n} \sum_{j<k} I(x_{ij} \ne x_{ik})$$

for nominal data, and

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} G_i = 1 - \frac{1}{n} \binom{m}{2}^{-1} \sum_{i=1}^{n} \sum_{j<k} \frac{\lvert x_{ij} - x_{ik} \rvert}{r}$$

for ordinal data. For a one-way design, wherein the units are random but the coders are fixed, the row-wise agreement statistics are iid, and so the ordinary central limit theorem applies: $\sqrt{n}\,(\hat{\theta} - \theta) \Rightarrow \mathcal{N}(0, \sigma^2)$, where σ² is the variance of G.

But I recommend Bayesian inference for θ, i.e., basing inference on the posterior distribution of θ conditional on data X. In the next section I develop a Bayesian bootstrap[27] for both the one-way design and the two-way design. These algorithms produce a sample from the posterior distribution π(θ | X), so that the posterior expectation E(θ | X) is the agreement measure, which can be estimated as the mean of the posterior sample. This approach has a number of advantages, the most important of which is that Bayesian inference allows one to answer the right question, that is, to base inference only on the data at hand rather than on hypothetical unobserved data.

Note that this approach yields a measure of agreement for each unit (the G_i) as well as a measure of agreement for the study (θ). It is easy to accommodate any number of units and any number of coders, and missing scores can be handled by simply skipping them when computing the row statistics. Any row having just a single score is removed prior to analysis, since such a row carries no information about agreement.

To clarify further, I will conclude this section by applying these ideas to a small nominal dataset previously analyzed by Krippendorff[22]. The dataset, which comprises 41 nominal codes assigned to a dozen units of analysis by four coders, is shown below in Figure 1. The dots represent missing values.

Fig. 1. Nominal scores previously analyzed by Krippendorff, for twelve units and four coders. The dots represent missing values.

Because this dataset is small it is easy to see by inspection that agreement is high. Indeed, eight of the units exhibit perfect agreement, and two of the remaining units exhibit near-perfect agreement. Only the sixth unit appears to have been problematic for the raters.

Due to perfect agreement the average distance for each of rows 1, 3, 4, 5, 7, 9, 10, and 11 is equal to 0, and so each of the row statistics is equal to 1. That leaves rows 2, 6, and 8 since row 12 must be discarded. For each of rows 2 and 8 the average distance is 0.5, and so the row statistics equal 0.5. For row 6 no two codes are equal, and so the average distance is equal to 1, which implies that the row statistic is equal to 0.

Now, the sample mean of the row statistics is approximately 0.82. Because this value is not too far from 1, we can tentatively conclude that these data exhibit high agreement, as expected. This is, of course, an incomplete frequentist analysis. In the next sections I will describe how to do Bayesian analyses for Gower agreement, and apply the Bayesian procedures to both simulated and real datasets.
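The row-statistic computation just described is easy to script. The sketch below (in Python, with a hypothetical data matrix, since Figure 1's codes are not reproduced here) skips missing scores and drops single-score rows, exactly as prescribed above:

```python
from itertools import combinations

def row_statistic(row, d):
    """Gower-type agreement for one unit: 1 minus the mean pairwise distance.
    Missing scores (None) are skipped; a row with < 2 scores is uninformative."""
    scores = [x for x in row if x is not None]
    if len(scores) < 2:
        return None
    pairs = list(combinations(scores, 2))
    return 1.0 - sum(d(x, y) for x, y in pairs) / len(pairs)

discrete = lambda x, y: 0.0 if x == y else 1.0  # discrete metric for nominal codes

X = [
    ["a", "a", "a", "a"],    # constant row        -> G = 1
    ["a", "a", "b", None],   # one code disagrees  -> G = 1/3
    ["a", "b", "c", "d"],    # no two codes agree  -> G = 0
    ["a", None, None, None]  # single score: removed prior to analysis
]
G = [g for g in (row_statistic(row, discrete) for row in X) if g is not None]
print(G)  # [1.0, 0.333..., 0.0]
```

The sample mean of the surviving row statistics is then the frequentist point estimate of θ for this toy matrix.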

Bayesian inference

For a one-way study it is straightforward to adapt the Bayesian bootstrap to this setting. For two-way studies I develop a new Bayesian bootstrap based on pigeonhole resampling. These algorithms allow one to sample directly from the posterior distribution of θ. Here I specify the algorithms. In the next section I evaluate their performance in a Monte Carlo study.

Bayesian bootstrap for a one-way random design

For a study in which the units are random and the coders are fixed, one can employ a Bayesian bootstrap[27] to draw samples from π(θ | X) in the following way.

  1. Compute the row statistics G_1, …, G_n.

  2. Repeat for b = 1, …, B:
    1. Draw u_1, …, u_(n−1) iid U(0, 1) and independent of X.
    2. Sort u_1, …, u_(n−1) and form the gap sequence w = (w_1, …, w_n)′, whose elements are the successive differences among 0, u_(1), …, u_(n−1), 1.
    3. Compute θ^(b) = w · G as the bth sample from π(θ | X), where · denotes the dot product and G = (G_1, …, G_n)′.
  3. Use the posterior sample of size B to do inference for θ.

This procedure can be carried out efficiently even for a large number of units, and typically only a small posterior sample is required to do reliable inference. In the next section I evaluate the frequentist performance of this approach.
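The steps above amount to drawing flat-Dirichlet weights via uniform gaps and reweighting the row statistics. A minimal Python sketch of the one-way algorithm (mine, not the goweragreement implementation):

```python
import random

def bayes_boot_one_way(G, B=1000, seed=1):
    """Bayesian bootstrap for the one-way design: each posterior draw is a
    reweighting of the row statistics by uniform-gap (flat Dirichlet) weights."""
    rng = random.Random(seed)
    n = len(G)
    draws = []
    for _ in range(B):
        u = sorted(rng.random() for _ in range(n - 1))
        w = [b - a for a, b in zip([0.0] + u, u + [1.0])]  # n weights summing to 1
        draws.append(sum(wi * gi for wi, gi in zip(w, G)))
    return draws

# Row statistics from the worked example above (eight 1's, two 0.5's, one 0):
G = [1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0]
post = sorted(bayes_boot_one_way(G, B=4000))
print(sum(post) / len(post))  # posterior mean, close to the sample mean 9/11
print(post[99], post[3899])   # approximate 95% credible limits
```

Because the weights are exchangeable and sum to one, the posterior mean agrees closely with the frequentist sample mean while the spread of the draws quantifies uncertainty.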

Bayesian bootstrap for a two-way random design

Doing posterior inference for a two-way design is more delicate. The ordinary Bayesian bootstrap is deficient for this purpose because it accommodates only one source of random variation, the variation across units. To reflect the randomness of coders as well, one can marry the pigeonhole bootstrap[26] with the Bayesian bootstrap in the following way. To my knowledge this is a new form of Bayesian bootstrap, or, more precisely, a hybrid method, since the pigeonhole resampling has a frequentist flavor.

  1. Repeat for b = 1, …, B:
    1. Resample the rows of X with replacement.
    2. Given the resampled rows, resample the columns of X with replacement. These first two steps produce X^(b).
    3. Compute the row statistics G_1^(b), …, G_n^(b) from X^(b).
    4. Draw u_1, …, u_(n−1) iid U(0, 1) and independent of X.
    5. Sort u_1, …, u_(n−1) and form the gap sequence w = (w_1, …, w_n)′.
    6. Compute θ^(b) = w · G^(b) as the bth sample from π(θ | X), where G^(b) = (G_1^(b), …, G_n^(b))′.
  2. Use the posterior sample of size B to do inference for θ.

This approach also permits efficient computation and captures both sources of randomness well. In the next section I evaluate the frequentist performance of this method and compare it to that of a frequentist pigeonhole bootstrap.
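A compact Python sketch of the two-way algorithm (again illustrative, with my own function names; the goweragreement package provides the real implementation in R):

```python
import random
from itertools import combinations

def row_statistic(row, d):
    """Gower-type agreement for one row: 1 minus the mean pairwise distance."""
    pairs = list(combinations(row, 2))
    return 1.0 - sum(d(x, y) for x, y in pairs) / len(pairs)

def bayes_boot_two_way(X, d, B=1000, seed=1):
    """Pigeonhole Bayesian bootstrap: resample rows (units) and columns (coders)
    with replacement, then reweight the resampled row statistics by uniform gaps."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    draws = []
    for _ in range(B):
        rows = [X[rng.randrange(n)] for _ in range(n)]  # resample units
        cols = [rng.randrange(m) for _ in range(m)]     # resample coders
        Xb = [[row[j] for j in cols] for row in rows]
        G = [row_statistic(row, d) for row in Xb]
        u = sorted(rng.random() for _ in range(n - 1))
        w = [b - a for a, b in zip([0.0] + u, u + [1.0])]
        draws.append(sum(wi * gi for wi, gi in zip(w, G)))
    return draws

discrete = lambda x, y: 0.0 if x == y else 1.0
X = [["a"] * 4, ["b"] * 4, ["a"] * 4, ["c"] * 4]  # perfect agreement on every unit
post = bayes_boot_two_way(X, discrete, B=200)
print(min(post), max(post))  # every draw is 1.0 up to floating-point rounding
```

Resampling rows first and then columns is what distinguishes the pigeonhole scheme from the ordinary bootstrap: it injects coder-level as well as unit-level variation into the posterior draws.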

Application to simulated data

For the simulation studies presented in this section I simulated data from direct Gaussian copula models with categorical marginal distributions. These are sensible proxy models since they permit one to specify appropriate correlation matrices for both the one-way design and the two-way design, and then apply those latent dependence structures to categorical outcomes. The generative form of the direct Gaussian copula model is

$$\boldsymbol{Z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Omega}), \qquad \boldsymbol{U} = \Phi(\boldsymbol{Z}), \qquad \boldsymbol{X} = F^{-1}(\boldsymbol{U}),$$

where Ω is the copula correlation matrix, Φ is the standard Gaussian cdf (applied element-wise), and F⁻¹ is the quantile function of the desired response distribution (also applied element-wise). The random vector U is a realization of the copula, and X is obtained by applying the quantile (inverse probability integral) transform to the marginally standard uniform U.
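For intuition, a compound-symmetry Gaussian copula is easy to simulate without special libraries, because a shared latent factor induces the common correlation: Z_j = √ρ W + √(1 − ρ) E_j gives corr(Z_j, Z_k) = ρ. The following Python sketch (with hypothetical category labels and probabilities of my own choosing) generates one-way nominal data this way:

```python
import math, random

def phi(z):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_unit(m, rho, cum_probs, labels, rng):
    """One row (unit) of m nominal scores from a Gaussian copula whose latent
    correlation matrix has compound symmetry with intraclass correlation rho."""
    shared = rng.gauss(0.0, 1.0)  # latent factor shared by the unit's coders
    row = []
    for _ in range(m):
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        u = phi(z)  # marginally standard uniform
        row.append(next(lab for lab, c in zip(labels, cum_probs) if u <= c))  # quantile transform
    return row

rng = random.Random(7)
labels = ["dep", "pd", "schiz", "neur", "other"]  # hypothetical categories
cum_probs = [0.2, 0.4, 0.6, 0.8, 1.0]             # equal marginal probabilities
X = [simulate_unit(4, 0.8, cum_probs, labels, rng) for _ in range(16)]
```

With ρ = 0.8 most rows will be nearly constant, so the resulting Gower agreement is high; lowering ρ degrades agreement, which is exactly the knob the simulation studies below turn.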

For the one-way study Ω is block-diagonal, with each of its n m × m blocks having the same compound symmetry structure:

$$\boldsymbol{\Omega} = \operatorname{diag}(\boldsymbol{\Omega}_0, \dots, \boldsymbol{\Omega}_0),$$

where

$$\boldsymbol{\Omega}_0 = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}$$

for ρ ∈ [0, 1). I varied the intraclass correlation ρ over a grid of values, and simulated 4,000 datasets for each value of ρ. For each simulated dataset I computed a credible interval based on a posterior sample of size 1,000.

For the two-way study the copula correlation matrix is given by the Kronecker product

$$\boldsymbol{\Omega} = \boldsymbol{\Omega}_u \otimes \boldsymbol{\Omega}_c,$$

where Ω_c is a compound symmetry structure for the coders and Ω_u is a compound symmetry structure for the units. I varied the intra-row correlation ρ over a grid of values, and for each value of ρ I used inter-row correlation ρ̃ = ρ/2. That is, for each scenario the inter-row correlation was half the intra-row correlation. This seems like a sensible study design since this methodology is most useful when the coders do not exhibit large biases. When intra-coder dependence is stronger than intra-unit dependence, agreement is low, in which case doing inference for θ may be of little interest. If, for a given dataset, it seems clear that intra-coder dependence is strong, one might transpose X and repeat the analysis to get a sense of the strength of intra-coder agreement.

For each of 4,000 simulated datasets I computed a credible interval and a frequentist bootstrap interval, both based on a sample of size 1,000. The frequentist bootstrap used pigeonhole resampling. It is important to note that no frequentist bootstrap for the two-way design can be exact[25], but the pigeonhole bootstrap is useful for comparison with the Bayesian bootstrap outlined above.

For all scenarios I used the same fixed vector of categorical probabilities for the marginal distribution.

Nominal data

The coverage profile for the one-way design with 16 units and four coders is shown in Figure 2. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a one-way random design". We see that the 95% credible interval offers nearly nominal frequentist coverage across the range of latent correlation ρ. This is a small-sample setting; the coverage profile improves as the number of units and/or coders increases.

Fig. 2. The coverage profile for the 95% credible interval (solid line) for nominal data with 16 units and four coders, one-way sampling design. The dashed line marks 0.95.

The coverage profile for the two-way design with 16 units and four coders is shown in Figure 3. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a two-way random design". We see that the 95% credible interval (solid line) offers slightly better than nominal frequentist coverage across the range of latent correlation ρ, while the coverage profile for the pigeonhole bootstrap (dotted line) falls well short of nominal. Doing Bayesian inference clearly offers a substantial advantage here.

Fig. 3. The coverage profile for the 95% credible interval (solid line) for nominal data with 16 units and four coders, two-way sampling design. Frequentist coverage rates for the pigeonhole bootstrap are shown by the dotted line.

Ordinal data

The coverage profile for the one-way design with 16 units and four coders is shown in Figure 4. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a one-way random design". We see that the 95% credible interval offers nearly nominal frequentist coverage across the range of latent correlation ρ. The coverage profile improves as the number of units and/or coders increases.

Fig. 4. The coverage profile for the 95% credible interval for ordinal data with 16 units and four coders, one-way sampling design.

The coverage profile for the two-way design with 16 units and four coders is shown in Figure 5. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a two-way random design". We see that the two-way design presents more of a challenge to the methodology, with the coverage rate dipping as low as 90% for some combinations of ρ and ρ̃. But the performance of the credible interval is still much better than that of the frequentist pigeonhole bootstrap.

Fig. 5. The coverage profile for the 95% credible interval for ordinal data with 16 units and four coders, two-way sampling design.

Application to real data

In this section I apply the proposed methods to two real datasets. The first is from a one-way magnetic resonance imaging study of congenital diaphragmatic hernia. The scores are ordinal. The second dataset is from a two-way study of psychiatric diagnosis. The scores are nominal.

I will interpret results according to the agreement scale given in Table 1[23]. Although this scale is well established, agreement scales remain a subject of debate[31], and so this scale (or any agreement scale) should be applied with caution. I discuss agreement scales further in Sections "Agreement scale calibration" and "A footnote regarding agreement scales".

Table 1.

Guidelines for interpreting values of an agreement coefficient.

Range of Agreement	Interpretation
(0.0, 0.2]	Slight Agreement
(0.2, 0.4]	Fair Agreement
(0.4, 0.6]	Moderate Agreement
(0.6, 0.8]	Substantial Agreement
(0.8, 1.0]	Near-Perfect Agreement

Ordinal data from a one-way radiological study of congenital diaphragmatic hernia

The data for this example are liver-herniation scores (in {1, …, 5}) assigned by two coders (radiologists) to magnetic resonance images of the liver in a study pertaining to congenital diaphragmatic hernia (CDH)[24], a condition in which a hole in the diaphragm permits abdominal organs to enter the chest. The five grades are described in Table 2.

Table 2.

Liver herniation grades for the CDH study.

Grade Description
1 No herniation of liver into the fetal chest
2 Less than half of the ipsilateral thorax is occupied by the fetal liver
3 Greater than half of the thorax is occupied by the fetal liver
4 The liver dome reaches the thoracic apex
5 The liver dome not only reaches the thoracic apex but also extends across the thoracic midline

Each radiologist scored each of the 47 images twice, and so we are interested in assessing both intra-coder and inter-coder agreement. This is a one-way study, which is to say that we are interested in measuring agreement for these two radiologists specifically, as opposed to treating the radiologists as having been drawn from a larger population. The results are shown in Table 3. We see that both intra-coder and inter-coder agreement are very nearly perfect. Note that each of the execution times was shorter than one second despite my having drawn 10,000 posterior samples for each analysis. The posterior sample for the second radiologist is shown in Figure 6. Superimposed are a kernel density estimate and the limits of the 95% credible interval (sample quantiles). Since the distribution is markedly skewed to the left, I report the estimated posterior median: 0.991.

Table 3.

Results from applying the proposed methodology to the liver data. For Radiologist 1, the estimated quantity is E(θ₁ | X₁), where θ₁ denotes the parameter of interest for Radiologist 1, and X₁ denotes the columns of X containing Radiologist 1's scores only. Likewise, the estimated quantity for Radiologist 2 is E(θ₂ | X₂). And for overall agreement we estimate E(θ | X).

Estimated Posterior Mean 95% Credible Interval
Radiologist 1 0.984 (0.962, 0.997)
Radiologist 2 0.989 (0.970, 0.999)
Overall 0.972 (0.954, 0.987)

Fig. 6. A histogram of the posterior sample for the second radiologist in the CDH study. This is a sample from π(θ₂ | X₂), where θ₂ and X₂ are defined in the caption for Table 3.

My R package can be used to analyze the CDH data, which are included in the package. The package's flagship function is gower.agree. Please see the package documentation for the function's signature and further details.

I will conclude this section by applying Krippendorff's α[19,21] to the CDH data in the interest of comparing the two approaches. I used my R package krippendorffsalpha to obtain the results shown in Table 4. We see that the α point estimates are smaller than the estimated posterior means from the Gower procedure, but not by much. The 95% confidence intervals, however, are much wider than the 95% credible intervals: the three confidence intervals are 203%, 283%, and 152% wider, respectively, than the corresponding credible intervals. And of course the two sets of results have different philosophical interpretations since Krippendorff's α is a frequentist approach.

Table 4.

Results from applying Krippendorff's α to the liver data.

Estimate	95% Confidence Interval
Radiologist 1	—	—
Radiologist 2	—	—
Overall	—	—

Nominal data from a two-way study of psychiatric diagnosis

The data from this study are psychiatric diagnoses (depression, personality disorder, schizophrenia, neurosis, and other) assigned to 30 patients by six raters[11]. I apply the two-way nominal methodology to these data, wherein both patients and raters are assumed to have been sampled from larger populations. The estimated posterior mean is 0.556, and the 95% credible interval is (0.474, 0.650). This is perhaps alarmingly poor agreement (only moderate according to the agreement scale given above) considering the stakes, but, to be fair, these are old data and so do not reflect recent advances in psychiatric diagnosis.

Note that π(θ | X) is approximately Gaussian for these data (Figure 7), although we see slight asymmetry: the Shapiro–Wilk test[29] rejects the null hypothesis of normality. Also note that Fleiss reported a κ value of 0.430, which is not contained in the credible interval for θ. This is close to the Krippendorff's α value of 0.440, but α is inappropriate for these data because that methodology is for one-way designs[19,21].

Fig. 7. A histogram of the posterior sample for the psychiatric diagnosis study. A Gaussian density is shown dashed.

R package goweragreement can be used to analyze these data, which are included in the irr package.

Influence diagnostics

In this section I describe how one might measure the influence of one or more units/coders on the agreement measure for a given dataset. I will use the nominal data from Figure 1 because that dataset is small enough that we can identify by inspection an influential unit and an influential coder. Specifically, unit 6 and coder 3 should be influential since all four coders disagreed regarding unit 6, and coder 3 occasionally disagreed with the other three coders.

R package goweragreement permits the user to measure the influence of any given unit or coder by simply removing the corresponding row or column from the data matrix and applying the Gower methodology to the new dataset. I did this for unit 6 and for coder 3.

The estimated posterior mean for the whole dataset is 0.818, and the 95% credible interval is (0.550, 0.970). Omitting unit 6 gave a posterior estimate of 0.900 and a substantially narrower credible interval of (0.727, 0.993). And omitting coder 3 yielded a posterior estimate of 0.901 and interval (0.598, 0.999). Thus unit 6 and coder 3 were approximately equally influential, and removing either resulted in an increase in agreement of approximately 10% (DFBETAs[33] approximately equal to −0.08). Please note that in this context a DFBETA is defined as

$$\mathrm{DFBETA}_{(i \mid j)} = \hat{\theta} - \hat{\theta}_{(i \mid j)},$$

where i and j denote a unit or a coder, respectively; θ̂ denotes the estimated posterior mean for the whole dataset; and θ̂_(i | j) denotes the estimated posterior mean when the specified unit or coder has been removed from the dataset. Thus DFBETA_(6) = 0.818 − 0.900 = −0.082 and DFBETA_(3) = 0.818 − 0.901 = −0.083 are computed in this example.
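As a quick check of the arithmetic, the DFBETA definition is a one-liner; this small Python helper (hypothetical, not part of goweragreement) applies it to the estimates reported above:

```python
def dfbeta(theta_hat, theta_hat_without):
    """DFBETA: full-data estimated posterior mean minus the estimate obtained
    with one unit or coder removed from the dataset."""
    return theta_hat - theta_hat_without

print(round(dfbeta(0.818, 0.900), 3))  # unit 6:  -0.082
print(round(dfbeta(0.818, 0.901), 3))  # coder 3: -0.083
```

Large negative values indicate units or coders whose removal substantially increases the estimated agreement.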

The three posterior samples are shown in Figure 8. Unit 6 and coder 3 clearly influence the posterior distribution dramatically.

Fig. 8. The posterior samples for the influence investigation of the nominal data from Figure 1.

Agreement scale calibration

As I carried out the simulation studies for this paper, I was able to see how, exactly, the range of possible values of θ is constrained by the categorical marginal distribution and the distance function. Knowledge of this range can help one choose an appropriate scale for a given study, if one is willing to posit a direct Gaussian copula model with categorical margins as the data-generating mechanism.

For example, consider the plot in Figure 9, which shows θ as a function of latent correlation ρ for the one-way nominal study with four coders. We see that θ is constrained to the range [0.29, 0.89]. One might use this relationship to devise a linear agreement scale by partitioning [0.29, 0.89] into five equal subintervals, so that [0.29, 0.41) represents slight agreement, [0.41, 0.53) represents fair agreement, [0.53, 0.65) represents moderate agreement, [0.65, 0.77) represents substantial agreement, and [0.77, 0.89] represents near-perfect agreement. Or one might consider θ against Gaussian mutual information, I(ρ) = −½ log(1 − ρ²) (see Figure 10). This provides what is perhaps the most sensible scale since mutual information is arguably superior to ρ as a measure of redundancy[32]. In any case, the function ρ ↦ θ(ρ) can be revealed by doing a simulation study wherein the empirical categorical probabilities of the sample are used to generate the outcomes in the Gaussian copula model described earlier.
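To illustrate the calibration idea, here is a minimal Python helper (my own sketch, assuming the attainable range [0.29, 0.89] read from Figure 9) that linearly maps a raw θ onto [0, 1] so that conventional cutoffs such as those in Table 1 can be applied:

```python
def calibrate(theta, lo=0.29, hi=0.89):
    """Linearly map theta from its attainable range [lo, hi] onto [0, 1]."""
    return (theta - lo) / (hi - lo)

print(round(calibrate(0.29), 3))  # 0.0: least attainable agreement
print(round(calibrate(0.59), 3))  # 0.5: middle of the attainable range
print(round(calibrate(0.89), 3))  # 1.0: greatest attainable agreement
```

The endpoints lo and hi would, in practice, come from the simulation study described above, with the empirical categorical probabilities of the sample plugged into the copula model.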

Fig. 9. θ as a function of ρ for the one-way nominal simulation study with a 16 × 4 data matrix.

Fig. 10. θ as a function of Gaussian mutual information I(ρ) for the one-way nominal simulation study with a 16 × 4 data matrix.

A footnote regarding agreement scales

In their seminal 1977 paper, Landis and Koch[23] proposed the linear agreement scale shown in Table 1. This scale seems sensible for any agreement coefficient that admits a linear interpretation. Unfortunately, some popular agreement coefficients, being based on within-unit Pearson correlation, defy interpretation according to a linear scale because Pearson correlation, although itself a measure of linear association, has a highly nonlinear effect on redundancy (as measured by Gaussian mutual information). Specifically, the relationship between Pearson correlation ρ and mutual information I for a bivariate Gaussian random vector is given by I(ρ) = −½ log(1 − ρ²). This function of ρ is highly nonlinear and approaches ∞ as |ρ| approaches 1, rendering a linear interpretation problematic. Taleb[32] pointed out this danger, among others, in a recent, wide-ranging article on Pearson correlation. A fuller exploration of Gaussian mutual information in the context of agreement is left to a future study.
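The nonlinearity is easy to verify numerically; this short Python sketch evaluates I(ρ) = −½ log(1 − ρ²) on a few values of ρ:

```python
import math

def gaussian_mi(rho):
    """Mutual information of a bivariate Gaussian vector with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

# Equal steps in rho produce wildly unequal steps in mutual information:
for rho in (0.2, 0.6, 0.9, 0.99):
    print(rho, round(gaussian_mi(rho), 3))
```

Moving ρ from 0.9 to 0.99, a change of less than a tenth on the correlation scale, more than doubles the mutual information, which is precisely why a linear reading of correlation-based coefficients can mislead.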

Discussion

Although the discrete metric appears to be an obvious choice of distance function for nominal data, the L₁ distance function is perhaps a less obvious choice for ordinal data. Hence other distance functions might be used for measuring agreement for ordinal scores. One might use the discrete metric for ordinal outcomes, or one might use, for example, the max-norm row statistic

$$G_i = 1 - \frac{1}{r} \max_{j<k} \lvert x_{ij} - x_{ik} \rvert$$

as the measure of agreement for a given row of X. Applying this latter distance measure to the full dataset from the CDH study yields the posterior distribution shown in Figure 11. This distribution has a smaller center and is more symmetric and more dispersed than the posterior obtained by applying the L₁ distance function.

Fig. 11. A histogram of the posterior sample for the overall measure of agreement in the CDH study, using the max-norm distance function.

Some readers may wonder, considering that I used the direct Gaussian copula model with discrete margins as the data-generating mechanism, why I developed the methodology presented in this article rather than basing inference on the copula model itself. The problem with the Gaussian copula model is that its likelihood is intractable for more than a few coders. This makes fully Bayesian analysis impractical for the copula model, whereas fully Bayesian analysis for Gower agreement is straightforward and computationally efficient. Methods for approximate Bayesian analysis of Gaussian copula models with discrete marginals have been developed (see, e.g., Hoff[18] and Henn[17]), but the methodology presented here is, in my opinion, at least as compelling. Also, the methods in this article clearly do not assume any particular data-generating mechanism, only a one-way or two-way study design.

The methodology developed in this paper is supported by R package goweragreement, which is freely available on the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/goweragreement.

Author contributions

John Hughes conceived the project, performed all simulations and data analyses, and wrote the manuscript.

Data availability

The datasets analyzed during the current study are available from the corresponding author upon reasonable request.

Declarations

Competing interests

The author declares no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Artstein, Ron & Poesio, Massimo. Inter-coder agreement for computational linguistics. Computational Linguistics 34(4), 555–596 (2008).
  • 2. Banerjee, Mousumi, Capozzoli, Michelle, McSweeney, Laura & Sinha, Debajyoti. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics 27(1), 3–23 (1999).
  • 3. Bennett, E. M., Alpert, R. & Goldstein, A. C. Communications through limited-response questioning. Public Opinion Quarterly 18(3), 303–308 (1954).
  • 4. Cha, Sung-Hyuk. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences 1(4), 300–307 (2007).
  • 5. Cicchetti, Domenic V. & Feinstein, Alvan R. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology 43(6), 551–558 (1990).
  • 6. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70(4), 213–220 (1968).
  • 7. Cohen, Jacob. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960).
  • 8. Conger, Anthony J. Integration and generalization of kappas for multiple raters. Psychological Bulletin 88(2), 322 (1980).
  • 9. Davies, Mark & Fleiss, Joseph L. Measuring agreement for multinomial data. Biometrics, 1047–1051 (1982).
  • 10. Feinstein, Alvan R. & Cicchetti, Domenic V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43(6), 543–549 (1990).
  • 11. Fleiss, Joseph L. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378 (1971).
  • 12. Gonzalez-Barrios, Jose M. Sums of nonindependent Bernoulli random variables. Brazilian Journal of Probability and Statistics, 55–64 (1998).
  • 13. Gower, J. C. A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871 (1971).
  • 14. Gwet, Kilem Li. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1), 29–48 (2008).
  • 15. Gwet, Kilem Li. Handbook of Inter-Rater Reliability, 5th edn. (AgreeStat Analytics, Gaithersburg, MD, 2021).
  • 16. Hayes, Andrew F. & Krippendorff, Klaus. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1(1), 77–89 (2007).
  • 17. Henn, L. L. Limitations and performance of three approaches to Bayesian inference for Gaussian copula regression models of discrete data. Computational Statistics, 1–38 (2021).
  • 18. Hoff, Peter D. Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics 1(1), 265–283 (2007).
  • 19. Hughes, John. krippendorffsalpha: An R package for measuring agreement using Krippendorff's Alpha coefficient. The R Journal 13(1), 413–425 (2021).
  • 20. Hughes, John. Toward improved inference for Krippendorff's Alpha agreement coefficient. Journal of Statistical Planning and Inference 233, 106170 (2024).
  • 21. Krippendorff, Klaus. Content Analysis: An Introduction to Its Methodology (Sage, 2012).
  • 22. Krippendorff, Klaus. Computing Krippendorff's alpha-reliability. Technical report, University of Pennsylvania (2013).
  • 23. Landis, J. Richard & Koch, Gary G. The measurement of observer agreement for categorical data. Biometrics, 159–174 (1977).
  • 24. Longoni, Mauro, Pober, Barbara R. & High, Frances A. Congenital diaphragmatic hernia overview. GeneReviews® [Internet] (2020).
  • 25. McCullagh, Peter. Resampling and exchangeable arrays. Bernoulli, 285–301 (2000).
  • 26. Owen, Art B. The pigeonhole bootstrap. The Annals of Applied Statistics 1(2), 386–411 (2007).
  • 27. Rubin, Donald B. The Bayesian bootstrap. The Annals of Statistics, 130–134 (1981).
  • 28. Scott, William A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19, 321–325 (1955).
  • 29. Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965).
  • 30. Smeeton, Nigel C. Early history of the kappa statistic. Biometrics 41(3), 795 (1985).
  • 31. Taber, Keith S. The use of Cronbach's alpha when developing and reporting research instruments in science education. Research in Science Education 48(6), 1273–1296 (2018).
  • 32. Taleb, Nassim N. Fooled by correlation: Common misinterpretations in social science. Academia Online (2019).
  • 33. Young, Derek S. Handbook of Regression Methods (CRC Press, 2017).
