Abstract
In this work I present two methods for measuring agreement in nominal and ordinal data. The measures, which employ Gower-type distances, are simple, intuitive, and easy to compute for any number of units and any number of coders. Influential units and/or coders are easily identified. I consider both one-way and two-way random sampling designs, and develop an approach to Bayesian inference for each. I apply the methods to simulated data and to two real datasets, the first from a one-way radiological study of congenital diaphragmatic hernia, and the second from a two-way study of psychiatric diagnosis. Finally, I consider agreement scales and suggest that Gaussian mutual information can perhaps provide a scale that is more useful than the scale most commonly used. The methods I propose are supported by my open source R package, goweragreement, which is available on the Comprehensive R Archive Network.
Subject terms: Statistics, Software
Introduction
An inter-coder or intra-coder agreement coefficient, which takes a value in the unit interval, is a statistical measure of the extent to which two or more coders agree regarding the same units of analysis. The agreement problem has a long history and is important in many fields of inquiry, and numerous agreement statistics have been proposed.
The first agreement coefficients were $S$ [3], $\pi$ [28], and $\kappa$ [7]. Bennett et al. [3] proposed the $S$ score as a measure of the extent to which two methods of communication provide identical information. Scott [28] proposed the $\pi$ coefficient for measuring agreement between two coders. Cohen [7] criticized $\pi$ and proposed the $\kappa$ coefficient as an alternative to $\pi$, although Smeeton [30] noted that Francis Galton mentioned a $\kappa$-like statistic in his 1892 book, Finger Prints. Fleiss [11] proposed multi-$\pi$, a generalization of Scott's $\pi$ for measuring agreement among three or more coders. Conger [8] and Davies and Fleiss [9] likewise generalized $\kappa$ to the multi-coder setting. Other generalizations of $\kappa$, e.g., weighted $\kappa$ [6], have also been proposed. The $\kappa$ coefficient and its generalizations can fairly be said to dominate the field and are still widely used despite their well-known shortcomings [5,10]. Other frequently used measures of agreement are Gwet's $AC_1$ and $AC_2$ [14] and Krippendorff's $\alpha$ [16]. For more comprehensive reviews of the literature on agreement, I refer the interested reader to the article by Banerjee et al. [2], the article by Artstein and Poesio [1], and the book by Gwet [15].
In this article I present new means of measuring agreement for nominal and ordinal data, and develop corresponding methods for Bayesian inference for both one-way random designs (units are random, coders are fixed) and two-way random designs (both units and coders are random). In Section "Gower-type agreement measures for nominal and ordinal data" I describe the agreement measures. In Section "Bayesian inference" I propose algorithms for sampling from the posterior distribution of the parameter of interest. In Section "Application to simulated data" I evaluate the two methodologies by applying them to simulated data. In Section "Application to real data" I apply the methods to two real datasets. In Section "Influence diagnostics" I show how to measure the influence of individual units and/or coders. In Section "Agreement scale calibration" I propose a method for obtaining a calibrated agreement scale for a given dataset, distance function, and sampling model. In Section "A footnote regarding agreement scales" I briefly discuss the possibility of constructing an agreement scale based on Gaussian mutual information. I make concluding remarks in Section "Discussion".
Gower-type agreement measures for nominal and ordinal data
Suppose the data are arranged in an $n \times m$ matrix $Y$, where $n$ is the number of units and $m$ is the number of coders. Then $y_{ij}$ is the score assigned by coder $j$ to unit $i$.
As an example, consider the nominal data that will be analyzed below in Section "Nominal data from a two-way study of psychiatric diagnosis". The data from this study are psychiatric diagnoses (depression, personality disorder, schizophrenia, neurosis, and other) assigned to 30 patients by six raters [11]. That is, each row of $Y$ corresponds to a psychiatric patient, and each of the six elements in a given row contains a diagnosis assigned by one of the raters: $y_{ij}$ is the diagnosis assigned by rater $j$ to patient $i$. Since both the patients and the raters were presumably sampled from superpopulations, these data have a two-way design. See Section "Nominal data from a two-way study of psychiatric diagnosis" for more information regarding this dataset.
The building blocks of the proposed agreement measure are the row statistics

$$G_i = 1 - \binom{m}{2}^{-1} \sum_{j<k} d(y_{ij}, y_{ik}), \qquad i = 1, \dots, n,$$

where $d$ is an appropriate distance function. The second term above is the sample mean of the distances for the $\binom{m}{2}$ distinct pairs of observations in row $i$ of $Y$. The row statistic $G_i$ is a Gower-type [13] measure of agreement for row $i$. If the row in question is constant, the average distance will be 0, in which case the row statistic will equal 1 (perfect agreement). The more heterogeneous the row, the larger the average distance, in which case the row statistic will be closer to 0 (poorer agreement).
For nominal data I recommend the discrete metric $d(x, y) = \mathbb{1}\{x \neq y\}$, where $\mathbb{1}$ denotes the indicator function. For ordinal data I recommend the $L_1$ distance function given by

$$d(x, y) = \frac{|x - y|}{r},$$

where $r$ is the range of the scores, e.g., $r = 4$ for scores in $\{1, 2, 3, 4, 5\}$. These distance functions were considered by Gower [13] and are also commonly used in other agreement settings [19,21] and more generally [4].
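To make these definitions concrete, the distance functions and the row statistic can be sketched in Python (the supporting package is written in R, so this is an illustration rather than package code; the function names are mine, and missing scores are encoded as None):

```python
from itertools import combinations

def d_nominal(x, y):
    """Discrete metric: 0 if the two codes match, 1 otherwise."""
    return 0.0 if x == y else 1.0

def d_ordinal(x, y, r):
    """L1 distance scaled by the range r of the scores, so that d lies in [0, 1]."""
    return abs(x - y) / r

def row_statistic(row, dist):
    """Gower-type agreement for one unit: 1 minus the mean distance over all
    distinct pairs of scores in the row. Missing scores (None) are skipped."""
    scores = [s for s in row if s is not None]
    pairs = list(combinations(scores, 2))
    return 1.0 - sum(dist(a, b) for a, b in pairs) / len(pairs)
```

For example, `row_statistic([1, 1, 2, 2], d_nominal)` averages six pairwise distances, four of which equal 1, giving $1 - 4/6 = 1/3$.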
When $d$ is the discrete metric the distances are of course binary, and their sum is not even approximately binomial unless the intra-row dependence is very weak. This is not surprising given theoretical work regarding sums of dependent Bernoulli variables [12]. In any case, the row statistics are an identically distributed sample from a discrete distribution having its points of support in the unit interval. This distribution is determined by the marginal distribution of the scores, the dependence structure, the number of coders, and the distance function. The mean of this distribution, $\mu$, say, is the proposed measure of agreement for the study. (Although Gower distance was discovered long ago, these measures do not, to my knowledge, appear in the agreement literature, nor has Bayesian inference been considered in this context.)
To put it another way, the parameter of interest, $\mu$, is the expected value of the random variable

$$G = 1 - \binom{m}{2}^{-1} \sum_{j<k} d(y_{ij}, y_{ik}),$$

where $m$ is the number of coders and $i$ can index any unit. That is, $\mu = \mathbb{E}(G)$. For a hypothetical dataset comprising $n$ units, the row statistics $G_1, \dots, G_n$ are a sample from the distribution of $G$. One might then estimate $\mu$ as the sample mean of the $G_i$:

$$\hat{\mu} = \bar{G} = \frac{1}{n} \sum_{i=1}^n G_i,$$

using the discrete metric for nominal data and the $L_1$ distance for ordinal data. For a one-way design, wherein the units are random but the coders are fixed, the row-wise agreement statistics are iid, and so the ordinary central limit theorem applies: $\sqrt{n}(\hat{\mu} - \mu) \Rightarrow \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance of $G$.
But I recommend Bayesian inference for $\mu$, i.e., basing inference on the posterior distribution of $\mu$ conditional on data $Y$. In the next section I develop a Bayesian bootstrap [27] for both the one-way design and the two-way design. These algorithms produce a sample from the posterior distribution $p(\mu \mid Y)$, so that the posterior expectation $\mathbb{E}(\mu \mid Y)$ is the agreement measure, which can be estimated as the mean of the posterior sample. This approach has a number of advantages, the most important of which is that doing Bayesian inference allows one to answer the right question, that is, to base inference only on the data at hand, rather than on hypothetical unobserved data.
Note that this approach yields a measure of agreement for each unit (the $G_i$) as well as a measure of agreement for the study ($\mu$). It is easy to accommodate any number of units and any number of coders, and missing scores can be handled by simply skipping them when computing the row statistics. Any row having just a single score is removed prior to analysis since such a row carries no information about agreement.
To clarify further I will conclude this section by applying these ideas to a small nominal dataset that was previously analyzed by Krippendorff [22]. The dataset, which comprises 41 nominal codes assigned to a dozen units of analysis by four coders, is shown below in Figure 1. The dots represent missing values.
Fig. 1. Nominal scores previously analyzed by Krippendorff, for twelve units and four coders. The dots represent missing values.
Because this dataset is small it is easy to see by inspection that agreement is high. Indeed, eight of the units exhibit perfect agreement, and two of the remaining units exhibit near-perfect agreement. Only the sixth unit appears to have been problematic for the raters.
Due to perfect agreement the average distance for each of rows 1, 3, 4, 5, 7, 9, 10, and 11 is equal to 0, and so each of the row statistics is equal to 1. That leaves rows 2, 6, and 8 since row 12 must be discarded. For each of rows 2 and 8 the average distance is 0.5, and so the row statistics equal 0.5. For row 6 no two codes are equal, and so the average distance is equal to 1, which implies that the row statistic is equal to 0.
Now, the sample mean of the row statistics is approximately 0.82. Because this value is not far from 1, we can tentatively conclude that these data exhibit high agreement, as expected. This is of course an (incomplete) frequentist analysis. In the next sections I will describe how to do Bayesian analyses for Gower agreement, and apply the Bayesian procedures to both simulated and real datasets.
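The worked example above can be reproduced with a short Python sketch. The matrix below is hypothetical: it mimics only the agreement pattern just described (eight unanimous rows, two rows with a single dissenting code, one fully discordant row, and one row with a lone score), not the actual codes of Figure 1.

```python
from itertools import combinations

def row_statistic(row):
    """Gower-type row statistic under the discrete metric; None is missing."""
    scores = [s for s in row if s is not None]
    pairs = list(combinations(scores, 2))
    return 1.0 - sum(a != b for a, b in pairs) / len(pairs)

data = ([["a", "a", "a", "a"]] * 8       # perfect agreement: row statistic 1
        + [["a", "a", "a", "b"]] * 2     # mean pairwise distance 0.5
        + [["a", "b", "c", "d"]]         # no two codes agree: row statistic 0
        + [["a", None, None, None]])     # single score: must be discarded

usable = [row for row in data if sum(s is not None for s in row) > 1]
stats = [row_statistic(row) for row in usable]
print(round(sum(stats) / len(stats), 3))  # → 0.818
```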
Bayesian inference
For a one-way study it is straightforward to adapt the ordinary Bayesian bootstrap to this setting. For two-way studies I develop a new Bayesian bootstrap based on pigeonhole resampling. These algorithms allow one to sample directly from the posterior distribution of $\mu$. Here I specify the algorithms. In the next section I evaluate their performance in a Monte Carlo study.
Bayesian bootstrap for a one-way random design
For a study in which the units are random and the coders are fixed, one can employ a Bayesian bootstrap [27] to draw samples from $p(\mu \mid Y)$ in the following way.

- Compute the row statistics $G = (G_1, \dots, G_n)'$.
- Repeat for $b = 1, \dots, B$:
  - Draw $U_1, \dots, U_{n-1}$ iid Uniform(0, 1) and independent of $Y$.
  - Sort $U_1, \dots, U_{n-1}$ and form the gap sequence $W = (W_1, \dots, W_n)'$, where $W_1 = U_{(1)}$, $W_i = U_{(i)} - U_{(i-1)}$ for $1 < i < n$, and $W_n = 1 - U_{(n-1)}$.
  - Compute $\mu^{(b)} = W \cdot G$ as the $b$th sample from $p(\mu \mid Y)$, where $\cdot$ denotes the dot product.
- Use the posterior sample of size $B$ to do inference for $\mu$.
This procedure can be carried out efficiently even for a large number of units, and typically only a small posterior sample is required to do reliable inference. In the next section I evaluate the frequentist performance of this approach.
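The gap-sequence construction above is Rubin's Bayesian bootstrap: the gaps of $n - 1$ sorted uniforms are Dirichlet(1, ..., 1) weights. A Python sketch follows (the supporting package is written in R, so the names here are mine and the code is illustrative only):

```python
import random

def bayes_boot(row_stats, B=1000, seed=42):
    """Draw B samples from the posterior of mu for a one-way design by
    reweighting the observed row statistics with Dirichlet(1,...,1) weights."""
    rng = random.Random(seed)
    n = len(row_stats)
    posterior = []
    for _ in range(B):
        u = sorted(rng.random() for _ in range(n - 1))
        # The gaps between consecutive sorted uniforms (with 0 and 1 appended)
        # are n nonnegative weights summing to 1.
        gaps = [b - a for a, b in zip([0.0] + u, u + [1.0])]
        posterior.append(sum(w * g for w, g in zip(gaps, row_stats)))
    return posterior

# Row statistics from the small example of the previous section:
post = bayes_boot([1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0])
```

The posterior mean of `post` lands near the sample mean of the row statistics, while the spread of `post` quantifies uncertainty about $\mu$.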
Bayesian bootstrap for a two-way random design
Doing posterior inference for a two-way design is more delicate. The ordinary Bayesian bootstrap is deficient for this purpose because it accommodates only one source of random variation, the variation across units. To reflect the randomness of the coders as well, one can marry the pigeonhole bootstrap [26] with the Bayesian bootstrap in the following way. To my knowledge this is a new form of Bayesian bootstrap (or, more precisely, a hybrid method, since the pigeonhole resampling has a frequentist flavor).
- Repeat for $b = 1, \dots, B$:
  - Resample the rows of $Y$ with replacement.
  - Given the resampled rows, resample the columns of $Y$ with replacement. These first two steps produce $Y^{(b)}$.
  - Compute the row statistics $G^{(b)} = (G_1^{(b)}, \dots, G_n^{(b)})'$ from $Y^{(b)}$.
  - Draw $U_1, \dots, U_{n-1}$ iid Uniform(0, 1) and independent of $Y$.
  - Sort $U_1, \dots, U_{n-1}$ and form the gap sequence $W = (W_1, \dots, W_n)'$.
  - Compute $\mu^{(b)} = W \cdot G^{(b)}$ as the $b$th sample from $p(\mu \mid Y)$.
- Use the posterior sample of size $B$ to do inference for $\mu$.
This approach also permits efficient computation and captures well both sources of randomness. In the next section I evaluate the frequentist performance of this method, and compare to the performance of a frequentist pigeonhole bootstrap.
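A Python sketch of this hybrid algorithm, with the same caveats as before (illustrative function names, pure-Python resampling, not the package's actual implementation):

```python
import random
from itertools import combinations

def row_stat(row):
    """Gower-type row statistic under the discrete metric."""
    pairs = list(combinations(row, 2))
    return 1.0 - sum(a != b for a, b in pairs) / len(pairs)

def pigeonhole_bayes_boot(data, B=1000, seed=42):
    """Two-way design: pigeonhole-resample units and coders, then reweight
    the resulting row statistics with Dirichlet(1,...,1) weights."""
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    posterior = []
    for _ in range(B):
        rows = [data[rng.randrange(n)] for _ in range(n)]  # resample units
        cols = [rng.randrange(m) for _ in range(m)]        # resample coders
        stats = [row_stat([row[j] for j in cols]) for row in rows]
        u = sorted(rng.random() for _ in range(n - 1))
        gaps = [b - a for a, b in zip([0.0] + u, u + [1.0])]
        posterior.append(sum(w * g for w, g in zip(gaps, stats)))
    return posterior
```

Resampling the columns inside the row-resampling step mirrors the ordering of the algorithm above: the coder draw is made given the unit draw.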
Application to simulated data
For the simulation studies presented in this section I simulated data from direct Gaussian copula models with categorical marginal distributions. These are sensible proxy models since they permit one to specify appropriate correlation matrices for both the one-way design and the two-way design, and then apply those latent dependence structures to categorical outcomes. The generative form of the direct Gaussian copula model is

$$\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Omega}), \qquad \mathbf{U} = \Phi(\mathbf{Z}), \qquad \mathbf{Y} = F^{-1}(\mathbf{U}),$$

where $\boldsymbol{\Omega}$ is the copula correlation matrix, $\Phi$ is the standard Gaussian cdf (applied elementwise), and $F^{-1}$ is the quantile function of the desired response distribution (applied elementwise). The random vector $\mathbf{U}$ is a realization of the copula, and $\mathbf{Y}$ is obtained by applying the inverse probability integral transform to the marginally standard uniform $\mathbf{U}$.
For the one-way study $\boldsymbol{\Omega}$ is block-diagonal, with each of its $n$ blocks of size $m \times m$ having the same compound symmetry structure:

$$\boldsymbol{\Omega}_i = (1 - \rho)\,\mathbf{I}_m + \rho\,\mathbf{J}_m = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}$$

for $\rho \in [0, 1)$, where $\mathbf{I}_m$ denotes the $m \times m$ identity matrix and $\mathbf{J}_m$ the $m \times m$ matrix of ones. I varied the intraclass correlation $\rho$ over an evenly spaced grid of values in $[0, 1)$, and simulated 4,000 datasets for each value of $\rho$. For each simulated dataset I computed a credible interval based on a posterior sample of size 1,000.
For the two-way study the copula correlation matrix is given by the Kronecker product

$$\boldsymbol{\Omega} = \boldsymbol{\Omega}_u \otimes \boldsymbol{\Omega}_c,$$

where $\boldsymbol{\Omega}_c$ is a compound symmetry structure for the coders and $\boldsymbol{\Omega}_u$ is a compound symmetry structure for the units. I varied the intra-row correlation over an evenly spaced grid, and for each value of the intra-row correlation I set the inter-row correlation to half that value. That is, for each scenario the inter-row correlation was half the intra-row correlation. This seems like a sensible study design since this methodology is most useful when the coders do not exhibit large biases. When intra-coder dependence is stronger than intra-unit dependence, agreement is low, in which case doing inference for $\mu$ may be of little interest. If, for a given dataset, it seems clear that intra-coder dependence is strong, one might transpose $Y$ and repeat the analysis to get a sense of the strength of intra-coder agreement.
For each of 4,000 simulated datasets I computed a credible interval and a frequentist bootstrap interval, both based on a sample of size 1,000. The frequentist bootstrap used pigeonhole resampling. It is important to note that no frequentist bootstrap for the two-way design can be exact [25], but the pigeonhole bootstrap is useful for comparison with the Bayesian bootstrap outlined above.
For all scenarios I used the same vector of categorical probabilities to generate the scores.
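The data-generating mechanism for the one-way design can be sketched as follows. This is an illustration, not the simulation code used for the paper; it exploits the shared-factor representation of a compound symmetry correlation matrix (valid for nonnegative $\rho$) so that no general multivariate Gaussian sampler is needed, and it uses five equiprobable categories purely as an example.

```python
import math
import random

def phi(z):
    """Standard Gaussian cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def categorical_quantile(u, probs):
    """Inverse cdf of a categorical distribution on {1, ..., K}."""
    cum = 0.0
    for k, p in enumerate(probs, start=1):
        cum += p
        if u <= cum:
            return k
    return len(probs)  # guard against floating-point undershoot

def simulate_one_way(n, m, rho, probs, seed=42):
    """Each row is one unit; within a row the latent Gaussians have compound
    symmetry correlation rho via a shared factor plus idiosyncratic noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        z = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(m)]
        data.append([categorical_quantile(phi(zj), probs) for zj in z])
    return data

# e.g., a 16 x 4 nominal dataset with five equiprobable categories:
data = simulate_one_way(16, 4, 0.5, [0.2] * 5)
```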
Nominal data
The coverage profile for the one-way design with 16 units and four coders is shown in Figure 2. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a one-way random design". We see that the 95% credible interval offers nearly nominal frequentist coverage across the range of latent correlation $\rho$. This is a small-sample setting; the coverage profile improves as the number of units and/or coders increases.
Fig. 2. The coverage profile for the 95% credible interval (solid line) for nominal data with 16 units and four coders, one-way sampling design. The dashed line marks 0.95.
The coverage profile for the two-way design with 16 units and four coders is shown in Figure 3. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a two-way random design". We see that the 95% credible interval (solid line) offers slightly better than nominal frequentist coverage across the range of latent correlation $\rho$, while the pigeonhole bootstrap (dotted line) falls well short of nominal coverage. Doing Bayesian inference clearly offers a substantial advantage here.
Fig. 3. The coverage profile for the 95% credible interval (solid line) for nominal data with 16 units and four coders, two-way sampling design. Frequentist coverage rates for the pigeonhole bootstrap are shown by the dotted line.
Ordinal data
The coverage profile for the one-way design with 16 units and four coders is shown in Figure 4. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a one-way random design". We see that the 95% credible interval offers nearly nominal frequentist coverage across the range of latent correlation $\rho$. The coverage profile improves as the number of units and/or coders increases.
Fig. 4. The coverage profile for the 95% credible interval for ordinal data with 16 units and four coders, one-way sampling design.
The coverage profile for the two-way design with 16 units and four coders is shown in Figure 5. These results were obtained by applying the algorithm described in Section "Bayesian bootstrap for a two-way random design". We see that the two-way design presents more of a challenge to the methodology, with the coverage rate dipping as low as 90% for some combinations of the intra-row and inter-row correlations. But the performance of the credible interval is still very much better than that of the frequentist pigeonhole bootstrap.
Fig. 5. The coverage profile for the 95% credible interval for ordinal data with 16 units and four coders, two-way sampling design.
Application to real data
In this section I apply the proposed methods to two real datasets. The first is from a one-way magnetic resonance imaging study of congenital diaphragmatic hernia. The scores are ordinal. The second dataset is from a two-way study of psychiatric diagnosis. The scores are nominal.
I will interpret results according to the agreement scale given in Table 1 [23]. Although this scale is well established, agreement scales remain a subject of debate [31], and so this scale (or any agreement scale) should be applied with caution. I discuss agreement scales further in Sections "Agreement scale calibration" and "A footnote regarding agreement scales".
Table 1.
Guidelines for interpreting values of an agreement coefficient.

| Range of Agreement | Interpretation |
|---|---|
| 0.00 - 0.20 | Slight Agreement |
| 0.21 - 0.40 | Fair Agreement |
| 0.41 - 0.60 | Moderate Agreement |
| 0.61 - 0.80 | Substantial Agreement |
| 0.81 - 1.00 | Near-Perfect Agreement |
Ordinal data from a one-way radiological study of congenital diaphragmatic hernia
The data for this example are liver-herniation scores (in $\{1, \dots, 5\}$) assigned by two coders (radiologists) to magnetic resonance images of the liver in a study pertaining to congenital diaphragmatic hernia (CDH) [24], in which a hole in the diaphragm permits abdominal organs to enter the chest. The five grades are described in Table 2.
Table 2.
Liver herniation grades for the CDH study.
| Grade | Description |
|---|---|
| 1 | No herniation of liver into the fetal chest |
| 2 | Less than half of the ipsilateral thorax is occupied by the fetal liver |
| 3 | Greater than half of the thorax is occupied by the fetal liver |
| 4 | The liver dome reaches the thoracic apex |
| 5 | The liver dome not only reaches the thoracic apex but also extends across the thoracic midline |
Each radiologist scored each of the 47 images twice, and so we are interested in assessing both intra-coder and inter-coder agreement. This is a one-way study, which is to say we are interested in measuring agreement for these two radiologists, as opposed to considering the radiologists as having been drawn from a larger population. The results are shown in Table 3. We see that both intra-coder and inter-coder agreement are very nearly perfect. Note that each of the execution times was shorter than one second despite my having drawn 10,000 posterior samples for each. The posterior sample for the second radiologist is shown in Figure 6. Superimposed are a kernel density estimate and the limits of the 95% credible interval (sample quantiles). Since the distribution is markedly skewed to the left, I should report the estimated posterior median: 0.991.
Table 3.
Results from applying the proposed methodology to the liver data. For Radiologist 1, the estimated quantity is $\mathbb{E}(\mu_1 \mid Y_1)$, where $\mu_1$ denotes the parameter of interest for Radiologist 1, and $Y_1$ denotes the columns of $Y$ containing Radiologist 1's scores only. Likewise, the estimated quantity for Radiologist 2 is $\mathbb{E}(\mu_2 \mid Y_2)$. And for overall agreement we estimate $\mathbb{E}(\mu \mid Y)$.
| | Estimated Posterior Mean | 95% Credible Interval |
|---|---|---|
| Radiologist 1 | 0.984 | (0.962, 0.997) |
| Radiologist 2 | 0.989 | (0.970, 0.999) |
| Overall | 0.972 | (0.954, 0.987) |
Fig. 6. A histogram of the posterior sample for the second radiologist in the CDH study. This is a sample from the posterior distribution for Radiologist 2, which was defined in the caption for Table 3.
My R package can be used to analyze the CDH data, which are included in the package. The package's flagship function is gower.agree; please see the package documentation for its signature and further details.
I will conclude this section by applying Krippendorff's $\alpha$ [19–21] to the CDH data in the interest of comparing the two approaches. I used my R package, krippendorffsalpha, to obtain the results shown in Table 4. We see that the $\alpha$ point estimates are smaller than the estimated posterior means from the Gower procedure, but not very much so. The 95% confidence intervals, however, are much wider than the 95% credible intervals. Specifically, the three confidence intervals are 203%, 283%, and 152% wider, respectively, than the three credible intervals. And of course the two sets of results have different interpretations philosophically since Krippendorff's $\alpha$ is a frequentist approach.
Table 4.
Results from applying Krippendorff's $\alpha$ to the liver data.

| | Estimate | 95% Confidence Interval |
|---|---|---|
| Radiologist 1 | | |
| Radiologist 2 | | |
| Overall | | |
Nominal data from a two-way study of psychiatric diagnosis
The data from this study are psychiatric diagnoses (depression, personality disorder, schizophrenia, neurosis, and other) assigned to 30 patients by six raters [11]. I apply the two-way nominal methodology to these data, wherein both patients and raters are assumed to have been sampled from larger populations. The estimated posterior mean is 0.556, and the 95% credible interval is (0.474, 0.650). This is perhaps alarmingly poor agreement (only moderate according to the agreement scale given above) considering the stakes, but, to be fair, these are old data and so do not reflect recent advances in psychiatric diagnosis.
Note that the posterior is approximately Gaussian for these data (Figure 7), although we see slight asymmetry; indeed, the Shapiro–Wilk test [29] rejects the null hypothesis of normality. Also note that Fleiss reported a $\kappa$ value of 0.430, which is not contained in the credible interval for $\mu$. This is close to the Krippendorff's $\alpha$ value of 0.440, but $\alpha$ is inappropriate for these data because that methodology is for one-way designs [19,21].
Fig. 7. A histogram of the posterior sample for the psychiatric diagnosis study. A Gaussian density is shown dashed.
R package goweragreement can also be used to analyze these data, which are included in the irr package.
Influence diagnostics
In this section I describe how one might measure the influence of one or more units/coders on the agreement measure for a given dataset. I will use the nominal data from Figure 1 because that dataset is small enough that we can identify by inspection an influential unit and an influential coder. Specifically, unit 6 and coder 3 should be influential since all four coders disagreed regarding unit 6, and coder 3 occasionally disagreed with the other three coders.
R package goweragreement permits the user to measure the influence of any given unit or coder by simply removing the corresponding row or column from the data matrix and applying the Gower methodology to the new dataset. I did this for unit 6 and for coder 3.
The estimated posterior mean for the whole dataset is 0.818, and the 95% credible interval is (0.550, 0.970). Omitting unit 6 gave a posterior estimate of 0.900 and a substantially narrower credible interval of (0.727, 0.993). And omitting coder 3 yielded a posterior estimate of 0.901 and interval (0.598, 0.999). Thus unit 6 and coder 3 were approximately equally influential, and removing either resulted in an increase in agreement of approximately 10% (DFBETAs [33] approximately equal to 0.08). Please note that in this context a DFBETA is defined as

$$\mathrm{DFBETA}_{(i)} = \hat{\mu}_{(i)} - \hat{\mu},$$

where $(i)$ denotes removal of the $i$th unit or coder; $\hat{\mu}$ denotes the estimated posterior mean for the whole dataset; and $\hat{\mu}_{(i)}$ denotes the estimated posterior mean when the specified unit or coder has been removed from the dataset. Thus $\mathrm{DFBETA}_{(\text{unit } 6)}$ and $\mathrm{DFBETA}_{(\text{coder } 3)}$ are computed in this example.
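The leave-one-out computation can be sketched in Python. Here the point estimates are plain sample means of the row statistics rather than estimated posterior means, and the sign convention follows the DFBETA definition above (reduced-data estimate minus full-data estimate); the function names are mine.

```python
from itertools import combinations

def row_statistic(row):
    """Gower-type row statistic under the discrete metric; rows with fewer
    than two scores are uninformative and yield None."""
    scores = [s for s in row if s is not None]
    pairs = list(combinations(scores, 2))
    if not pairs:
        return None
    return 1.0 - sum(a != b for a, b in pairs) / len(pairs)

def mu_hat(data):
    """Sample mean of the informative row statistics."""
    stats = [s for s in (row_statistic(r) for r in data) if s is not None]
    return sum(stats) / len(stats)

def dfbeta_unit(data, i):
    """Influence of unit i: estimate without the unit minus estimate with it."""
    return mu_hat([r for k, r in enumerate(data) if k != i]) - mu_hat(data)

def dfbeta_coder(data, j):
    """Influence of coder j: drop column j, then re-estimate."""
    return mu_hat([[s for k, s in enumerate(r) if k != j] for r in data]) - mu_hat(data)
```

A positive DFBETA indicates that removing the unit or coder increases the estimated agreement.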
The three posterior samples are shown in Figure 8. Unit 6 and coder 3 clearly influence the posterior distribution dramatically.
Fig. 8.
The posterior samples for the influence investigation of the nominal data from Figure 1.
Agreement scale calibration
As I carried out the simulation studies for this paper I was able to see how, exactly, the range of possible values of $\mu$ is constrained by the categorical marginal distribution and the distance function. Knowledge of said range can help one choose an appropriate scale for a given study, if one is willing to posit a direct Gaussian copula model with categorical margins as the data-generating mechanism.
For example, consider the plot in Figure 9, which shows $\mu$ as a function of the latent correlation $\rho$ for the one-way nominal study with four coders. We see that $\mu$ is constrained to the range [0.29, 0.89]. One might use this relationship to devise a linear agreement scale on [0.29, 0.89] whose successive subintervals represent slight, fair, moderate, substantial, and near-perfect agreement. Or one might consider $\mu$ against the Gaussian mutual information, $I(\rho) = -\frac{1}{2}\ln(1 - \rho^2)$ (see Figure 10). This provides what is perhaps the most sensible scale since mutual information is arguably superior to correlation as a measure of redundancy [32]. In any case, the function $\mu(\rho)$ can be revealed by doing a simulation study wherein the empirical categorical probabilities of the sample are used to generate the outcomes in the Gaussian copula model described earlier.
Fig. 9. $\mu$ as a function of $\rho$ for the one-way nominal simulation study with a $16 \times 4$ data matrix.
Fig. 10. $\mu$ as a function of Gaussian mutual information $I(\rho)$ for the one-way nominal simulation study with a $16 \times 4$ data matrix.
A footnote regarding agreement scales
In their seminal 1977 paper, Landis and Koch [23] proposed the linear agreement scale shown in Table 1. This scale seems sensible for any agreement coefficient that admits a linear interpretation. Unfortunately, some popular agreement coefficients, being based on within-unit Pearson correlation, defy interpretation according to a linear scale because Pearson correlation, although itself a measure of linear association, has a highly nonlinear effect on redundancy (as measured by Gaussian mutual information). Specifically, the relationship between the Pearson correlation $\rho$ and the mutual information $I$ for a bivariate Gaussian random vector is given by $I(\rho) = -\frac{1}{2}\ln(1 - \rho^2)$. This function of $\rho$ is highly nonlinear and approaches $\infty$ as $|\rho|$ approaches 1, rendering a linear interpretation problematic. Taleb [32] pointed out this danger, among others, in a recent, wide-ranging article on Pearson correlation. A fuller exploration of Gaussian mutual information in the context of agreement is left to a future study.
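The nonlinearity is easy to check numerically with the formula above:

```python
import math

def gaussian_mi(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

# Equal-looking increments of correlation buy very different amounts of redundancy:
print(round(gaussian_mi(0.5), 3))   # → 0.144
print(round(gaussian_mi(0.9), 3))   # → 0.83
print(round(gaussian_mi(0.99), 3))  # → 1.959
```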
Discussion
Although the discrete metric appears to be an obvious choice of distance function for nominal data, the $L_1$ distance function is perhaps a less obvious choice for ordinal data. Hence other distance functions might be used for measuring agreement for ordinal scores. One might use the discrete metric for ordinal outcomes, or one might use, for example,

$$G_i = 1 - \max_{j<k}\, d(y_{ij}, y_{ik})$$

as the measure of agreement for a given row of $Y$. Applying this latter distance measure to the full dataset from the CDH study yields the posterior distribution shown in Figure 11. This distribution has a smaller center and is more symmetric and more dispersed than the posterior obtained by applying the $L_1$ distance function.
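Assuming the max-norm row statistic takes the form one minus the largest scaled pairwise distance in the row, a brief Python sketch:

```python
from itertools import combinations

def row_statistic_max(row, r):
    """Max-norm variant for ordinal scores: a single large disagreement
    determines the agreement score for the whole unit."""
    return 1.0 - max(abs(a - b) / r for a, b in combinations(row, 2))

print(row_statistic_max([2, 2, 3, 2], r=4))  # → 0.75
```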
Fig. 11. A histogram of the posterior sample for the overall measure of agreement in the CDH study, using the max norm.
Some readers may wonder, considering that I used the direct Gaussian copula model with discrete margins as a data-generating mechanism, why I developed the methodology presented in this article. The problem with the Gaussian copula model is that its likelihood is intractable for more than a few coders. This makes fully Bayesian analysis impractical for the copula model, whereas fully Bayesian analysis for Gower agreement is straightforward and computationally efficient. Methods for approximate Bayesian analysis of Gaussian copula models with discrete marginals have been developed (see, e.g., Hoff [18], Henn [17]), but the methodology presented here is, in my opinion, at least as compelling. Also, the methods in this article clearly do not assume any particular data-generating mechanism, only a one-way or two-way study design.
The methodology developed in this paper is supported by R package goweragreement, which is freely available on the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/goweragreement.
Author contributions
John Hughes conceived the project, performed all simulations and data analyses, and wrote the manuscript.
Data availability
The datasets analyzed during the current study are available from the corresponding author upon reasonable request.
Declarations
Competing interests
The author declares no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Artstein, R. & Poesio, M. Inter-coder agreement for computational linguistics. Computational Linguistics 34(4), 555–596 (2008).
- 2. Banerjee, M., Capozzoli, M., McSweeney, L. & Sinha, D. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics 27(1), 3–23 (1999).
- 3. Bennett, E. M., Alpert, R. & Goldstein, A. C. Communications through limited-response questioning. Public Opinion Quarterly 18(3), 303–308 (1954).
- 4. Cha, S.-H. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences 1(4), 300–307 (2007).
- 5. Cicchetti, D. V. & Feinstein, A. R. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology 43(6), 551–558 (1990).
- 6. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70(4), 213–220 (1968).
- 7. Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960).
- 8. Conger, A. J. Integration and generalization of kappas for multiple raters. Psychological Bulletin 88(2), 322 (1980).
- 9. Davies, M. & Fleiss, J. L. Measuring agreement for multinomial data. Biometrics, 1047–1051 (1982).
- 10. Feinstein, A. R. & Cicchetti, D. V. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43(6), 543–549 (1990).
- 11. Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378 (1971).
- 12. González-Barrios, J. M. Sums of nonindependent Bernoulli random variables. Brazilian Journal of Probability and Statistics, 55–64 (1998).
- 13. Gower, J. C. A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871 (1971).
- 14. Gwet, K. L. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology 61(1), 29–48 (2008).
- 15. Gwet, K. L. Handbook of Inter-Rater Reliability 5th edn. (AgreeStat Analytics, Gaithersburg, MD, 2021).
- 16. Hayes, A. F. & Krippendorff, K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1(1), 77–89 (2007).
- 17. Henn, L. L. Limitations and performance of three approaches to Bayesian inference for Gaussian copula regression models of discrete data. Computational Statistics, 1–38 (2021).
- 18. Hoff, P. D. Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics 1(1), 265–283 (2007).
- 19. Hughes, J. krippendorffsalpha: An R package for measuring agreement using Krippendorff's Alpha coefficient. The R Journal 13(1), 413–425 (2021).
- 20. Hughes, J. Toward improved inference for Krippendorff's Alpha agreement coefficient. Journal of Statistical Planning and Inference 233, 106170 (2024).
- 21. Krippendorff, K. Content Analysis: An Introduction to Its Methodology (Sage, 2012).
- 22. Krippendorff, K. Computing Krippendorff's alpha-reliability. Technical report, University of Pennsylvania (2013).
- 23. Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).
- 24. Longoni, M., Pober, B. R. & High, F. A. Congenital diaphragmatic hernia overview. GeneReviews® [Internet] (2020).
- 25. McCullagh, P. Resampling and exchangeable arrays. Bernoulli, 285–301 (2000).
- 26. Owen, A. B. The pigeonhole bootstrap. The Annals of Applied Statistics 1(2), 386–411 (2007).
- 27. Rubin, D. B. The Bayesian bootstrap. The Annals of Statistics 9(1), 130–134 (1981).
- 28. Scott, W. A. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly 19, 321–325 (1955).
- 29. Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965).
- 30. Smeeton, N. C. Early history of the kappa statistic. Biometrics 41(3), 795 (1985).
- 31. Taber, K. S. The use of Cronbach's alpha when developing and reporting research instruments in science education. Research in Science Education 48(6), 1273–1296 (2018).
- 32. Taleb, N. N. Fooled by correlation: Common misinterpretations in social science. Academia Online (2019).
- 33. Young, D. S. Handbook of Regression Methods (CRC Press, 2017).