Annals of Occupational Hygiene. 2015 Mar 6;59(6):764–774. doi: 10.1093/annhyg/mev011

Log-Linear Modeling of Agreement among Expert Exposure Assessors

Phillip R Hunt 1, *, Melissa C Friesen 2, Susan Sama 3, Louise Ryan 4, Donald Milton 5
PMCID: PMC4506313  PMID: 25748517

Abstract

Background:

Evaluation of expert assessment of exposure depends, in the absence of a validation measurement, upon measures of agreement among the expert raters. Agreement is typically measured using Cohen’s Kappa statistic; however, this approach has some well-known limitations. We demonstrate an alternative method that uses log-linear models designed to model agreement. These models contain parameters that distinguish between exact agreement (the diagonal of the agreement matrix) and non-exact associations (the off-diagonals). In addition, they can incorporate covariates to examine whether agreement differs across strata.

Methods:

We applied these models to evaluate agreement among expert ratings of exposure to sensitizers (none, likely, high) in a study of occupational asthma.

Results:

Traditional analyses using weighted kappa suggested potential differences in agreement by blue/white collar jobs and office/non-office jobs, but not case/control status. However, the evaluation of the covariates and their interaction terms in log-linear models found no differences in agreement with these covariates and provided evidence that the differences observed using kappa were the result of marginal differences in the distribution of ratings rather than differences in agreement. Differences in agreement were predicted across the exposure scale, with the likely moderately exposed category more difficult for the experts to differentiate from the highly exposed category than from the unexposed category.

Conclusions:

The log-linear models provided valuable information about patterns of agreement and the structure of the data that were not revealed in analyses using kappa. The models’ lack of dependence on marginal distributions and the ease of evaluating covariates allow reliable detection of observational bias in exposure data.

KEYWORDS: asthma, epidemiology, expert judgement, exposure assessment methodology, sensitization

INTRODUCTION

In community-based occupational health studies, the evaluation of occupational risk factors usually relies on expert judgment of exposure scientists, such as industrial hygienists, because of the impracticality of obtaining exposure measurements related to the thousands of jobs and employers in a typical study (Goldberg et al., 1986; Dewar et al., 1991; Stewart and Stewart, 1994; Fritschi et al., 1996; Siemiatycki et al., 1997). The large number of jobs, and often long follow-up periods, also makes it difficult to validate the experts’ ratings against measurements (Benke et al., 1997; Tielemans et al., 1999; Fritschi et al., 2003; t’ Mannetje et al., 2003; Friesen, et al., 2011; Friesen et al., 2013). As a result, the reliability of the expert assessment is usually evaluated by assessing the agreement among the expert raters.

Agreement is typically measured using Cohen’s Kappa statistic, which measures agreement beyond chance, the latter being defined as the agreement that would occur if the raters’ assessments were completely independent and calculated as the sum of the diagonal products of the marginal totals (Cohen, 1960). However, kappa statistics have well-known limitations (Maclure and Willett, 1987; Cicchetti and Feinstein, 1990; Posner et al., 1990; Nelson and Pepe, 2000). Kappa is dependent on the underlying distribution of the condition being rated, as well as on the amount of agreement among raters, and is not amenable to statistical control based on covariates (Tanner and Young, 1985; Goldberg et al., 1986; Maclure and Willett, 1987; Agresti, 1988; Cicchetti and Feinstein, 1990; Feinstein and Cicchetti, 1990; Carlin et al., 2000; Nelson and Pepe, 2000). As a result, kappa values for different populations, or for different subgroups within the same population that have different distributions, are difficult to compare with statistical rigor. Confidence intervals and P values for kappa statistics are designed to test the null hypothesis of chance agreement, not to compare kappa values across subgroups.

Assessing agreement in subgroups can provide important insight into where agreement is poor or good (Friesen et al., 2011). For instance, there frequently is a need to evaluate whether exposure ratings differ by disease status: if cases are aware of potential exposure-disease associations, they may provide more detailed or more carefully considered responses than controls to the occupational questions on which the experts base their exposure decisions, which may bias the exposure estimates assigned. Kappa values can be stratified by disease status; however, statistical equivalence of kappa across the groups does not indicate a lack of differential misclassification (Armstrong et al., 1992). Similarly, identifying subgroups where the experts’ ratings are more or less reliable, such as white-collar versus blue-collar jobs, would provide important insights into the experts’ ratings, but such subgroups cannot be directly compared using kappa. Bias can be evaluated using analogs of sensitivity and specificity (Cicchetti and Feinstein, 1990); however, these methods require dichotomous ratings, information can be lost when polytomous rating scales are collapsed, and the approach requires assuming that one set of estimates is a gold standard. Despite these drawbacks, kappa continues to be used to assess exposure data reliability because of the lack of informative alternatives (e.g. t’ Mannetje et al., 2003; Correa et al., 2006; Friesen et al., 2011; Solovieva et al., 2012).

This article describes the application of log-linear methods for modeling agreement that overcome some of the limitations of Cohen’s kappa. Log-linear models of agreement have been developed by Agresti (1988) and Tanner and Young (1985) based on the work of Goodman (1979) and have been used in behavioral and clinical research, but have not yet been applied in the context of expert-based exposure assessment in occupational studies. This modeling approach permits quantification of agreement that is less dependent on the underlying distribution of the rated condition, provides the flexibility to evaluate patterns of agreement within the data, and allows the estimation of differences in agreement across covariates (Tanner and Young, 1985; Carlin et al., 2000; Agresti, 2002).

METHODS

Study population and exposure assessment

Data used in this application were collected as a part of a community-based study of occupational asthma, described in detail elsewhere (Sama et al., 2003). Briefly, 611 job descriptions were obtained by telephone interview of asthma cases and controls using six open-ended questions to frame the collection of the job descriptions. Each job description was independently evaluated for exposure to sensitizers by two members (randomly designated Expert 1 and 2) of a six-member expert exposure assessment panel who were blinded to other information about the subjects. Experts assigned ratings for exposure to sensitizers on the following scale: ‘0 = low/no exposure’, ‘1 = likely/moderate exposure’, or ‘2 = highly likely/significant exposure’. Expert raters were treated as indistinguishable in assessing agreement: all possible pairings of experts were used, and each rater had an equal chance of being paired with any other rater. Each job description was classified as an office or non-office environment by the raters and as blue collar or white collar by one of the authors (P.R.H.).

Inter-rater agreement

Table 1 shows the data structure for agreement analyses. Rows, indexed by i = 1, 2, 3, and columns, indexed by j = 1, 2, 3, contain the exposure scores (exposure scale = 0, 1, 2) for Experts 1 and 2, respectively. Cross-classified counts are given in the cells of the table. To assess agreement across levels of the covariate, this table is stratified by the levels of a covariate to produce a three-dimensional table (e.g., Expert 1 × Expert 2 × disease status), as shown in Supplementary Appendix Table 1 at Annals of Occupational Hygiene online.

Table 1.

Cross-tabulation of expert exposure scores for sensitizers.

Cells show frequency (%).

| Exposure score, Rater 1 | Rater 2: 0 | Rater 2: 1 | Rater 2: 2 | Total |
|---|---|---|---|---|
| 0 | 347 (56.8) | 44 (7.2) | 19 (3.1) | 410 (67.1) |
| 1 | 43 (7.0) | 50 (8.2) | 30 (4.9) | 123 (20.1) |
| 2 | 8 (1.3) | 29 (4.8) | 41 (6.7) | 78 (12.8) |
| Total | 398 (65.1) | 123 (20.1) | 90 (14.7) | 611 (100.0) |

Log-linear models of agreement

Log-linear models are a class of generalized linear models that describe the means of cell counts in a multidimensional table (Fienberg, 1980). The i × j cell counts are treated as independent observations of a Poisson random component. Under independence, the maximum likelihood fitted values for the cells are the expected frequencies for chi-square tests of independence. Association and interaction terms are used to describe departures from independence (Agresti, 2002). These models are useful when several factors interact in a multivariate manner and the cause and effect relationship is unclear (Zelterman, 1999), such as in the current application, where the two expert ratings of exposure, and possibly covariates, interact in producing agreement. Typically, the log of the expected cell count, log(mij), where i and j are the row and column numbers, respectively, is modeled as a linear function of covariates thought to influence the degree of agreement. Model fit is based on maximum likelihood (Fienberg, 1980; Zelterman, 1999; Agresti, 2002) and tests of significance are based on likelihood ratio tests.

The specification of the agreement models depends upon the design matrix of covariates assigned in the data set. First, one creates new variables for each subject based on the raters’ scores. For the data represented in Table 1, which has three exposure categories, five variables (x0, x1, x2, x3, x4) were created. The intercept covariate, x0, always takes a value of one. The covariate x1 takes a value of zero when neither rater assigns a score of zero, a value of one when either rater, but not both, assigns a score of zero, and a value of two when both raters assign a score of zero to a job description. Similarly, x2 takes a value of zero when neither rater assigns a score of one, a value of one when only one rater assigns a score of one, and a value of two when both assign a score of one to a job description. The covariate x3 equals one when the two raters assign equal scores to a job, that is, when they agree exactly, and zero otherwise. The covariate x4 equals the product of the row number (i) and column number (j) and ranges from 1 to 9 for the three by three table considered here. Additional variables are added when more exposure categories are present. See Supplementary Appendix A at Annals of Occupational Hygiene online for an example of a design matrix and SAS code to create the needed variables and to run the models described below.
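As a concrete illustration of this coding, the sketch below builds the covariates x0 to x4 for each cell of a 3 × 3 table. The paper’s supplement provides SAS code; this Python version is an illustrative equivalent, and the function name is ours:

```python
# Sketch of the design-matrix construction described above.
# Scores are 0, 1, 2; i and j index the rows and columns of the 3x3 table.
def agreement_covariates(i, j):
    """Return [x0, x1, x2, x3, x4] for the cell in row i, column j (scores 0-2)."""
    x0 = 1                      # intercept
    x1 = (i == 0) + (j == 0)    # how many raters assigned score 0 (0, 1, or 2)
    x2 = (i == 1) + (j == 1)    # how many raters assigned score 1
    x3 = int(i == j)            # exact agreement indicator
    x4 = (i + 1) * (j + 1)      # row number x column number (1..9)
    return [x0, x1, x2, x3, x4]

design = [agreement_covariates(i, j) for i in range(3) for j in range(3)]
for row in design:
    print(row)
```

For example, the cell where both raters assign a score of zero gets x1 = 2, x3 = 1, and x4 = 1, matching the verbal description above.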

To model inter-rater agreement, Agresti (1988) proposed the following independence with marginal homogeneity plus agreement and uniform association model, shown in equation (1) for our motivating example in Table 1 and Supplementary Appendix Table 1 at Annals of Occupational Hygiene online. This model estimates both the exact agreement and the agreement beyond chance between the two sets of expert ratings, assuming equal row and column marginal distributions (i.e., marginal homogeneity), which is encoded through the symmetric covariates x1 and x2. The marginal homogeneity assumption treats the experts as indistinguishable, which is appropriate for the current data, in which for each job experts were arbitrarily assigned as Expert 1 or 2 so that each of the six experts contributes to both row and column marginal distributions.

\log(m_{ij}) = \lambda_0 x_0 + \lambda_1 x_1 + \lambda_2 x_2 + \delta x_3 + \beta x_4 \qquad (1)

x0, x1, x2, x3, and x4 were defined as described above. λ0 is an intercept parameter, and λ1 and λ2 are the estimated parameters for the variables x1 and x2, respectively. The parameters δ and β provide the information needed to estimate agreement. The parameter δ provides an estimate of the exact agreement, taken as uniform along the main diagonal (where i = j), beyond that expected by chance (hereafter, the ‘uniform exact agreement parameter’) (Agresti, 1988). The parameter β provides an estimate of the association beyond chance in the off-diagonals and indicates whether high (or low) ratings from one expert tend to be associated with high (or low) ratings from the other expert, excluding agreement by chance and exact agreement (hereafter, the ‘uniform association parameter’). Positive values of β indicate that off-diagonal association across the table is positive. The interpretation of these parameters is described in the next section.

Where the estimated uniform exact agreement parameter (δ) is not significantly different from zero, the model structure simplifies to include the covariates and the uniform association parameter (β) only and is referred to as the independence with marginal homogeneity and uniform association model.
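To illustrate, the model in equation (1) can be fit to the Table 1 counts with a generic Poisson log-linear fitter. This is a sketch, not the authors’ SAS implementation: x1 and x2 enter linearly, as equation (1) is written, whereas the original analysis may have treated them as class variables, so the estimates should be read as approximate:

```python
import numpy as np

# Observed counts from Table 1 (rows: Expert 1 scores 0-2, columns: Expert 2).
counts = np.array([[347, 44, 19],
                   [43, 50, 30],
                   [8, 29, 41]], dtype=float)

# Design matrix for equation (1).
X, y = [], []
for i in range(3):
    for j in range(3):
        X.append([1.0,                      # x0: intercept
                  (i == 0) + (j == 0),      # x1: raters assigning score 0
                  (i == 1) + (j == 1),      # x2: raters assigning score 1
                  float(i == j),            # x3: exact agreement indicator
                  (i + 1) * (j + 1)])       # x4: row number x column number
        y.append(counts[i, j])
X, y = np.array(X, dtype=float), np.array(y)

# Poisson log-linear fit via Newton-Raphson, started from a least-squares
# fit to log(y) so the iterations stay well-behaved.
theta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
for _ in range(50):
    mu = np.exp(X @ theta)
    step = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    theta += step
    if np.max(np.abs(step)) < 1e-10:
        break

delta, beta = theta[3], theta[4]
print("OR uniform exact agreement:", round(np.exp(2 * delta + beta), 1))
print("OR off-diagonal association:", round(np.exp(beta - delta), 1))
```

With this coding, the resulting odds ratios should be broadly in line with those reported in Table 3 for the final disease-status model.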

Both models can incorporate additional covariates and interaction terms to test hypotheses about differences in agreement across levels of the covariate. For example, adding additional covariates (c) and interactions to the independence with marginal homogeneity plus agreement and uniform association model from equation (1) takes the form shown in equation (2).

\log(m_{ij}) = \lambda_0 x_0 + \lambda_1 x_1 + \lambda_2 x_2 + \delta x_3 + \beta x_4 + \lambda_{0,c}\, c\, x_0 + \lambda_{1,c}\, c\, x_1 + \lambda_{2,c}\, c\, x_2 + \delta_c\, c\, x_3 + \beta_c\, c\, x_4 \qquad (2)

An extension proposed by Tanner and Young (1985) models the patterns of exact agreement across values of the rating score by adding exact agreement terms for each of the exposure scores to the independence with marginal homogeneity model (hereafter, ‘exact agreement model’):

\log(m_{ij}) = \lambda_0 x_0 + \lambda_1 x_1 + \lambda_2 x_2 + \delta_1 x_5 + \delta_2 x_6 + \delta_3 x_7 \qquad (3)

where x5 = 1 when both raters assign a score of zero (i.e., i = j = 0), 0 otherwise; x6 = 1 when i = j = 1, 0 otherwise; and x7 = 1 when i = j = 2, 0 otherwise. The δi parameters measure the agreement between experts at the individual scores, allowing agreement to vary according to the value of the score. This differs from the single uniform exact agreement parameter, δ in equation (1), which imposes a uniform level of agreement along the main diagonal regardless of the score (Tanner and Young, 1985; Agresti, 1988).
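A useful property of this model for a 3 × 3 table is that, under the symmetric coding of x1 and x2, it reproduces the table with symmetrized off-diagonals exactly, so the pairwise odds ratios of exact agreement can be computed directly from the counts. The sketch below (our own illustration, not the authors’ code) applies this to Table 1 and recovers values matching those reported in the Results (approximately 9.2, 78, and 2.4):

```python
import numpy as np

counts = np.array([[347, 44, 19],
                   [43, 50, 30],
                   [8, 29, 41]], dtype=float)

# With the symmetric (marginal-homogeneity) coding, the exact agreement model
# of equation (3) fits a 3x3 table exactly, so the fitted cells are just the
# observed counts with the off-diagonal pairs averaged.
m = (counts + counts.T) / 2.0

def exact_agreement_or(a, b):
    """Odds ratio of exact agreement for the pair of scores a and b."""
    return m[a, a] * m[b, b] / (m[a, b] * m[b, a])

print(round(exact_agreement_or(0, 1), 1))  # scores 0 vs 1
print(round(exact_agreement_or(0, 2), 1))  # scores 0 vs 2
print(round(exact_agreement_or(1, 2), 1))  # scores 1 vs 2
```

Because this fit is saturated, each odds ratio equals exp of the sum of the two corresponding δ parameters, as in equation (6).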

Interpretation of log-linear models of agreement

Agreement in log-linear models is expressed as the local odds ratio of agreement. For a given 2 × 2 sub-table within the main table, the local odds ratio of agreement θij indicates that the odds of Expert 2 assigning a rating of j rather than j + 1 are θij times greater when Expert 1 assigns a rating of i than when Expert 1 assigns a rating of i + 1. The parameter estimates for the agreement terms represent the log odds ratios attributable to those parameters.

For models in a three by three table with row and column indices i and j, respectively, the odds ratio of agreement (θ) is calculated as:

\theta_{ij} = \begin{cases} \exp(2\delta + \beta) & \text{where } i = j \\ \exp(\beta - \delta) & \text{where } |i - j| = 1 \end{cases} \qquad (4)

As above, when δ is zero, the formulas simplify to those of the uniform association model, and when scores represent equal intervals, as in the present data, β represents a uniform association across the table (Goodman, 1979). When the interaction between a covariate and an agreement parameter is significant, the interaction term is included with the corresponding agreement parameter. Thus, if the interaction between the covariate and the exact agreement term (the δc·c·x3 term in equation 2) is significant, θij is calculated as:

\theta_{ij} = \begin{cases} \exp(2(\delta + \delta_c c) + \beta) & \text{where } i = j;\ c = 0, 1 \\ \exp(\beta - (\delta + \delta_c c)) & \text{where } |i - j| = 1;\ c = 0, 1 \end{cases} \qquad (5)

For the exact agreement model (equation 3) the values of δ i can be used to calculate the odds ratio of agreement across individual ratings, thus the odds ratio of exact agreement for a score of zero and a score of one is calculated as:

\theta_{1,2} = \exp(\delta_1 + \delta_2) \qquad (6)

and similarly for the agreement between other pairs of scores. The odds ratio of agreement (θij) can vary between 0 and positive infinity. The practical limit depends on the total count (N) and will be less than (N/2)².
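As a worked example of equation (4), plugging in the parameter estimates reported in Table 3 for the disease-status model (δ = 0.449, β = 0.909) recovers the published odds ratios:

```python
import math

delta, beta = 0.449, 0.909   # parameter estimates from Table 3, disease-status model

or_exact = math.exp(2 * delta + beta)   # diagonal cells (i = j)
or_offdiag = math.exp(beta - delta)     # off-diagonal cells (|i - j| = 1)

print(round(or_exact, 1))    # 6.1, as reported in Table 3
print(round(or_offdiag, 1))  # 1.6, as reported in Table 3
```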

Application to sensitizer ratings

The log-linear models described above were applied to the experts’ sensitizer ratings. We selected final reduced models from the full models described above by backward elimination. Covariates were removed in turn from the full model. When removal of a covariate resulted in a significant difference (P < 0.05) from the larger model, the covariate was retained. When the smaller model was not significantly different from the larger model, the covariate was eliminated from further models. Statistical significance of differences between models was determined using the difference in the deviance (G2) between models compared to a chi-square distribution with degrees of freedom equal to the difference in the degrees of freedom of the two models being compared. The Akaike Information Criterion (AIC), AIC = −2 × (log likelihood − number of parameters), a model fit statistic that includes a penalty for a greater number of terms in the model (Burnham and Anderson, 1998), was used to compare non-nested models. A smaller AIC indicates a model with a better, more parsimonious fit. The models were examined with and without covariates for disease status, blue collar/white collar jobs, and office/non-office jobs.
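The likelihood ratio comparisons in Table 2 can be checked by hand for the 2-df cases, since the chi-square survival function has the closed form exp(−x/2) when df = 2 (for other df one would use a library routine such as scipy.stats.chi2.sf):

```python
import math

def chi2_sf_df2(x):
    """Chi-square survival function for 2 degrees of freedom: P(X > x) = exp(-x/2)."""
    return math.exp(-x / 2.0)

# Model 6 vs full Model 1 (Table 2): delta G2 = 0.74 on 2 df
print(round(chi2_sf_df2(0.74), 3))   # 0.691, matching Table 2
# Model 5 vs Model 1: delta G2 = 7.78 on 2 df
print(round(chi2_sf_df2(7.78), 3))   # 0.020, matching Table 2
# Model 4 vs Model 1: delta G2 = 45.25 on 2 df
print(chi2_sf_df2(45.25) < 0.001)    # True, reported as P < 0.001
```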

In addition, we calculated weighted kappa values (Cohen, 1960; Armstrong et al., 1992), based on quadratic weights, for all jobs and stratified by disease status, blue collar/white collar, and office/non-office environment.
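For reference, weighted kappa can be computed directly from the Table 1 counts. The sketch below (our own illustration) implements both common weighting schemes, Cicchetti-Allison (linear), which is the default in SAS PROC FREQ, and Fleiss-Cohen (quadratic); with linear weights the overall table gives approximately the 0.52 reported in the Results:

```python
import numpy as np

counts = np.array([[347, 44, 19],
                   [43, 50, 30],
                   [8, 29, 41]], dtype=float)

def weighted_kappa(counts, quadratic=False):
    """Weighted kappa for a square agreement table of counts."""
    k = counts.shape[0]
    n = counts.sum()
    i, j = np.indices((k, k))
    d = np.abs(i - j) / (k - 1)
    w = 1 - d**2 if quadratic else 1 - d     # Fleiss-Cohen vs Cicchetti-Allison
    p_obs = (w * counts).sum() / n
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n**2
    p_exp = (w * expected).sum()
    return (p_obs - p_exp) / (1 - p_exp)

print(round(weighted_kappa(counts), 2))                  # linear weights
print(round(weighted_kappa(counts, quadratic=True), 2))  # quadratic weights
```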

A simulation was carried out to determine approximate equivalent values for measures of percent agreement, weighted kappa, and the odds ratio of agreement from the log-linear models for a theoretical exposure rated on a three category scale. Three-by-three tables with balanced margins (i.e. marginal counts are distributed about equally across the measurement scale) having levels of agreement from none (all cells have equal number of counts) to near full agreement (nearly all counts on the diagonal) were generated and the odds ratios of agreement, percent agreement, and kappa values for each table were calculated.
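The simulation can be sketched as follows; the mixing scheme and parameter values here are our own illustrative assumptions, not the authors’ exact procedure. A uniform 3 × 3 table (no agreement) is blended with an all-diagonal table (perfect agreement), and percent agreement and the diagonal local odds ratio are computed at each mixing level:

```python
import numpy as np

def mixed_table(a, n=900):
    """3x3 table with balanced margins: a mixture of a uniform table (no
    agreement, a = 0) and an all-diagonal table (perfect agreement, a = 1)."""
    uniform = np.full((3, 3), n / 9.0)
    diagonal = np.diag([n / 3.0] * 3)
    return (1 - a) * uniform + a * diagonal

def percent_agreement(t):
    return np.trace(t) / t.sum()

def local_or(t):
    # All diagonal cells equal d and all off-diagonal cells equal o in this
    # construction, so every local odds ratio along the diagonal is (d/o)**2.
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

for a in (0.0, 0.5, 0.9):
    t = mixed_table(a)
    print(a, round(percent_agreement(t), 2), round(local_or(t), 1))
```

At a mixing weight of 0.5, for example, the local odds ratio is 16, inside the ‘moderate’ band (5–45) of Table 4.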

Statistical analyses were conducted using SAS 8.02 for Windows (SAS Institute, Cary, NC, USA). See Supplementary Appendix A at Annals of Occupational Hygiene online for details.

RESULTS

The cross-tabulation of the expert panel exposure scores for sensitizers is shown in Table 1. The exposure ratings are skewed toward low/no exposure, with approximately two-thirds of jobs receiving a rating of zero from one or both experts. The marginal distributions of exposure scores for Experts 1 and 2 are approximately equal, as expected given the method of randomly assigning two experts per job. Similarly, the marginal distributions were similar across disease status (Supplementary Appendix Table 1 at Annals of Occupational Hygiene online).

Supplementary Appendix Tables 2 and 3 at Annals of Occupational Hygiene online show the expert ratings stratified by blue-collar (BC) and by office workplace (OWP) status, respectively. The marginal distributions were similar within each stratum, but differed across strata.

Table 2 shows the deviance, degrees of freedom, and fit statistics for the full log-linear model (equation 2) and sequentially reduced models for sensitizer ratings controlled for disease status. Removal of the interaction terms between the agreement parameters and disease status (cc·δ and cc·β) resulted in no significant difference from the full model (Models 2 and 3 in Table 2), nor did removal of disease status (Model 7 in Table 2), indicating no differences in agreement across disease status. Further, the parameter estimates for these terms were close to zero (not shown). Models 4 and 5 show that removal of either of the agreement terms results in a significantly larger deviance. Removal of the uniform association (β) term from Model 7 results in a significantly higher deviance (Model 8). Removal of the uniform exact agreement (δ) term from Model 7 yields a more parsimonious model (Model 9) whose deviance is not significantly greater than that of the full model. However, Model 7 had a lower AIC and was selected as the best-fitting model, containing statistically significant parameters for uniform exact agreement (δ) and uniform association (β).

Table 2.

Sequence of log-linear models calculated for sensitizer scores adjusted for disease status.

| Model no. | Model description | Parameters^a | Model df | Deviance | Comparison df | Comparison G² | P value^b | AIC^c |
|---|---|---|---|---|---|---|---|---|
| 1 | Exact + association + case/control + interactions | cc + λ1 + λ2 + δ + β + cc·δ + cc·β | 5491 | 1899.7 | | | | 3139.7 |
| 2^d | Exact + association + case/control + association interaction | cc + λ1 + λ2 + δ + β + cc·β | 5492 | 1899.8 | 1 | 0.04 | 0.840 | 3137.8 |
| 3^d | Exact + association + case/control + exact interaction | cc + λ1 + λ2 + δ + β + cc·δ | 5492 | 1900.5 | 1 | 0.73 | 0.392 | 3138.5 |
| 4^e | Exact + case/control + exact interaction | cc + λ1 + λ2 + δ + cc·δ | 5493 | 1945.0 | 2 | 45.25 | <0.001 | 3181.0 |
| 5^e | Association + case/control + association interaction | cc + λ1 + λ2 + β + cc·β | 5493 | 1907.5 | 2 | 7.78 | 0.020 | 3143.5 |
| 6^f | Exact + association + case/control | cc + λ1 + λ2 + δ + β | 5493 | 1900.5 | 2 | 0.74 | 0.691 | 3136.5 |
| 7^g | Exact + association | λ1 + λ2 + δ + β | 5494 | 1900.5 | 3 | 0.74 | 0.864 | 3134.5 |
| 8^h | Exact | λ1 + λ2 + δ | 5495 | 1944.9 | 4 | 45.17 | <0.001 | 3177.0 |
| 9^h | Association | λ1 + λ2 + β | 5495 | 1908.2 | 4 | 8.47 | 0.076 | 3140.2 |

^a All models contain an intercept term λ0 (not shown).

^b P value based on the χ² distribution, where χ² = ΔG² (the difference in deviance between the reduced and full model) with degrees of freedom equal to the difference in df between models. All models are compared to Model 1.

^c Akaike information criterion (AIC) = −2 × (log likelihood − number of parameters).

^d Models 2 and 3 show the alternate removal of the cc × agreement (δ or β) interaction term; neither model shows a significant difference from the full model.

^e Models 4 and 5 show the alternate removal of the agreement terms (δ or β) and their corresponding interaction terms. Both models show a significant difference from the full model, indicating a need for both δ and β in the model to achieve proper fit.

^f Model 6 shows the removal of all interaction terms, with no significant difference from the full model.

^g Model 7 shows the removal of case/control status from Model 6, again showing no difference from the full model.

^h Models 8 and 9 contain only one or the other agreement term. Model 8, containing δ only, shows a significant difference from the full model. Model 9, containing β only, shows a marginally significant difference from the full model. Among the reduced models, AIC is at a minimum for Model 7, which contains both δ and β.

Table 3 shows the final model parameters and odds ratios of agreement for the best-fitting models for each of the examined covariates. The model with disease status yielded an odds ratio of uniform exact agreement of 6.1. This indicates that for a job assigned an exposure score of i by Expert 1, the odds of Expert 2 assigning an exposure score of i were 6.1 times greater than the odds of Expert 2 assigning an exposure score of i + 1. The odds ratio for the uniform association parameter (a measure of the agreement in the off-diagonals) was much lower than that for uniform exact agreement; its 95% confidence interval included the null value. Similar patterns and magnitudes were observed for the uniform exact and association parameters in the models that included covariates for blue/white collar jobs and office/non-office jobs.

Table 3.

Final models, agreement parameters, and odds ratios of agreement for sensitizers exposure ratings.

| Adjustment variable | Model description | Parameters^a | δ^b | β^b | Uniform exact OR^c (95% CI) | Off-diagonal OR^d (95% CI) |
|---|---|---|---|---|---|---|
| Disease status | Exact + association | λ1 + λ2 + δ + β | 0.449 | 0.909 | 6.1 (3.9, 9.4) | 1.6 (0.9, 2.8) |
| Blue collar/white collar (BC) | Exact + association + interactions | λ1 + λ2 + δ + β + BC·λ1 | 0.363 | 0.874 | 5.1 (3.2, 7.7) | 1.7 (0.9, 3.0) |
| Office workplace (OWP) | Exact + association + interactions | λ1 + λ2 + δ + β + OWP·λ1 + OWP·λ2 | 0.400 | 0.826 | 5.0 (3.2, 7.9) | 1.5 (0.9, 2.7) |

CI, confidence interval; OR, odds ratio.

^a All models contain an intercept term λ0 (not shown).

^b Estimates from final models shown only for parameters required for calculation of odds ratios of agreement.

^c OR of uniform exact agreement = exp(2δ + β).

^d OR of off-diagonal agreement = exp(β − δ).

In the models controlled for blue collar/white collar and office/non-office, the interactions of the additional covariates with one or both λ parameters were significant (not shown). This indicated that the row/column margins were different for the two levels of these variables (as also shown in Supplementary Appendix Tables 2 and 3 at Annals of Occupational Hygiene online). However, interactions of the agreement parameters with blue collar/white collar and office/non-office were not significant, indicating agreement was not significantly different across levels of these variables despite the differences in rating distributions.

Using equation (3) for the exact agreement model without covariates, which modeled separate agreement parameters for each level of expert ratings, the odds ratio (95% confidence interval) of exact agreement between pairs of scores was 9.2 (5.5, 15.3) for distinguishing between scores of zero and one, 78 (34, 177) for distinguishing between scores of zero and two, and 2.4 (1.2, 4.5) for distinguishing between scores of one and two (not shown in tables). The higher odds ratio for distinguishing between a score of zero and a score of two conforms to the general expectation that it should be easier to distinguish ratings that are at the extremes of the scale. Similarly, the odds ratio for distinguishing a score of zero from a score of one (9.2) indicates experts distinguish well between jobs with no (or low) exposure relative to those with some (or moderate) exposure. In contrast, the odds ratio for distinguishing between a score of one and a score of two was lower than that for distinguishing between zero and one, indicating experts have a harder time distinguishing jobs with some (or moderate) exposure from those with high exposure.

Weighted kappa values are shown in Table 4. The weighted kappa of 0.52 for all jobs indicated a moderate degree of agreement. The magnitudes and confidence intervals for weighted kappa by strata for blue collar/white collar and office/non-office jobs suggested possible differences in agreement between raters. However, comparison of these results to the log-linear models, where marginal differences could be properly accounted for in the model, suggests that the differences in kappa values resulted from differing marginal distributions in the two job type strata rather than from dependence of agreement on job type.

Table 4.

Approximate equivalent values of odds ratios of agreement for values of weighted kappa for 3 × 3 tables with balanced margins.^a

| Qualitative level of agreement | Percent agreement | Weighted kappa | OR of agreement |
|---|---|---|---|
| Poor | 50–70 | <0.4 | <5 |
| Moderate | 70–85 | 0.4–0.7 | 5–45 |
| Strong | >85 | >0.7 | >45^b |

OR, odds ratio.

^a Comparisons are made using tables with balanced margins (i.e. marginal counts are distributed about equally across the measurement scale), because kappa values are biased when margins are unbalanced (see text).

^b The OR of agreement has no theoretical upper limit. The practical limit depends on the total count (N) and will be less than (N/2)².

Table 4 shows the approximate equivalent values of percent agreement, weighted kappa, and odds ratio of agreement determined from the simulations, grouped as indicating poor, moderate, and strong agreement using commonly applied kappa categories.

DISCUSSION

Assessing the reliability of expert agreement, including identifying factors and ratings for which experts agree well (or poorly), provides important insights into expert-based exposure assessment methods in community based epidemiologic studies (Goldberg et al., 1986; Siemiatycki et al., 1989, 1997; Stewart and Stewart, 1994). The log-linear models of agreement demonstrated here provide an effective means for analyzing agreement among raters and assessing data reliability for these types of studies.

The log-linear modeling approach has a number of clear technical advantages over the commonly used kappa statistic. First, the log-linear models provide quantification of agreement in a way that is less dependent on the underlying distribution of the rated condition. Marginal dependence can be a significant drawback with the use of the kappa statistic for occupational or environmental data, in which the prevalence of the conditions being rated (e.g. exposure or disease) may be very low. Under these conditions kappa tends to underestimate agreement, leading to a lower assessment of the reliability of the rating scores than may be warranted (Agresti, 1988; Feinstein and Cicchetti, 1990; Carlin et al., 2000).

In these data, the dependence of kappa on the marginal distributions was evident in the kappa values stratified by blue collar/white collar and office/non-office environment (Table 3). The groups that have substantially lower kappa values, white-collar jobs and office jobs, were the strata that have a greater proportion of zero ratings (no/low exposure). The log-linear models showed no such discrepancy because the measure of agreement depends on explicit agreement parameters that are not confounded by differences in marginal distributions. Differences in marginal distributions were reflected in the significant interaction terms between the covariate and the margin terms in the final log-linear models controlled for blue collar and office environment (Table 3).

In addition, the log-linear models allowed the estimation of differences in agreement across covariates, which is not possible with analyses using kappa (Tanner and Young, 1985; Agresti, 1988; Carlin et al., 2000). The log-linear models easily accommodated the inclusion of disease status and other variables as covariates and the interpretation of the covariate parameters was clearer than for kappa values. When the interactions of the covariates with the agreement parameters in log-linear models were not significant, we could safely conclude that agreement, distinct from possible differences in marginal distributions, was not different across levels of the covariate, thus providing evidence that the collection and classification of exposure information in this population was non-differential. While kappa values can be stratified by covariates, interpretation is confounded by the influence of marginal distributions.

Log-linear methods also allowed modeling of patterns of agreement, providing information about the structure of the data, rather than being restricted to the independence model. In these data, the exact agreement model (equation 3) suggests cut points for dichotomizing the exposure ratings, if necessary for estimating exposure-response relationships (e.g. attributable risks, relative risks, odds ratios). That is, it identified whether ratings of 1, the middle category, were more similar to 0 (no exposure) or 2 (high exposure). In this study, the expert raters had an easier time distinguishing a score of zero from a score of one than distinguishing a score of one from a score of two. This indicates that in analyses requiring dichotomizing of the exposure metric for sensitizers in this study, grouping the ratings of 1 and 2 together would result in less misclassification than grouping the ratings of 0 and 1.

Alternative methods for modeling agreement that share some of the advantages of log-linear models over kappa have been proposed. Mixture models are variants of the log-linear models presented in which ‘agreement by chance’ is defined differently. The data are viewed as a mixture of two subpopulations: one containing ‘obvious’ cases for which raters show perfect agreement; the second containing ‘doubtful’ cases for which raters may or may not show agreement (Aickin, 1990; Schuster, 2002). The summary agreement parameter from these models ignores non-exact agreement (association). The agreement parameter, although limited to a 0–1 scale, cannot be evaluated on the same scale as kappa. Latent class models provide useful extensions of log-linear models of agreement, but these models are not estimable with common statistical software, and they use indices of agreement that refer to the latent classes, a conceptual extension that can make the notion of agreement less concrete (Espeland and Handelman, 1989; Qu et al., 1996; Guggenmoos-Holzmann and Vonk, 1998; Nelson and Pepe, 2000).

Kappa values have an advantage in that, when expressed as a function of sensitivity and specificity, they can be related to measures of the attenuation of exposure-disease associations due to misclassification (Thompson, 1990). No comparable methods for assessing attenuation of associations are currently available for odds ratios of agreement.
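The kind of attenuation calculation this makes possible can be sketched with invented numbers. Under nondifferential exposure misclassification with sensitivity se and specificity sp, the observed exposure prevalence in each group is se·p + (1 − sp)·(1 − p), and the odds ratio computed from the observed prevalences is pulled toward the null (the prevalences and error rates below are illustrative assumptions, not study values):

```python
def observed_or(p_cases, p_controls, se, sp):
    """Odds ratio observed after nondifferential exposure misclassification
    with sensitivity `se` and specificity `sp` applied to true prevalences."""
    def misclassified(p):
        return se * p + (1 - sp) * (1 - p)

    def odds(p):
        return p / (1 - p)

    return odds(misclassified(p_cases)) / odds(misclassified(p_controls))

# True exposure prevalence 0.4 in cases, 0.2 in controls: true OR ~ 2.67.
true_or = (0.4 / 0.6) / (0.2 / 0.8)

# With 80% sensitivity and 90% specificity the observed OR shrinks to ~1.94.
attenuated = observed_or(0.4, 0.2, se=0.8, sp=0.9)
print(true_or, attenuated)
```

Because kappa can be written in terms of the same sensitivity and specificity, an observed kappa between raters can be translated into an expected attenuation of this sort, which is the link Thompson (1990) exploits.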

Log-linear models of agreement have some limitations in application. The models are designed to compare two raters and, while multiple raters can be compared within a single model, counts in the cross-classified cells can become sparse, making the models difficult to fit. Further, interpretation of the conditional and partial-table odds ratios of agreement may not be straightforward (Nelson and Pepe, 2000), although models of diagonal agreement and association that yield single odds ratios applicable to full cross-classified tables mitigate these interpretive issues somewhat. The concept of an odds ratio as an index of association is familiar in epidemiology, although its application to agreement is not. The crucial point in interpreting the odds ratio of agreement is that it is, just like more familiar applications of odds ratios, a measure of association between cross-classified data, amenable to statistical modeling, testing, and control. The notion that the odds ratio of agreement is uniform along a diagonal or across a whole table is analogous to an odds ratio being uniform along the scale of a continuous variable. However, the scale of the odds ratio of agreement is quite different from what we expect when considering environmental exposure-disease associations. As Table 4 indicates, an odds ratio of 5 is equivalent to moderate agreement as judged by the kappa statistic, and an odds ratio of 45 indicates strong agreement; odds ratios this large are rarely encountered in exposure-disease relationships. This difference is simply a matter of scale. Just as the commonly used thresholds for characterizing agreement using the kappa statistic have developed with use, so may our sense of scale for the odds ratio of agreement develop with use of this tool.
The effort expended in understanding this new measurement tool is outweighed by the advantages it provides in assessing agreement and determining the reliability of repeated measures of categorical data.
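The scale difference between the two indices can be seen by computing both from the same agreement table. The 2x2 counts below are invented for illustration; they are chosen only to show how a moderate kappa corresponds to an odds ratio of agreement near 10, and a strong kappa to one near 50:

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table [[a, b], [c, d]]."""
    n = a + b + c + d
    p_obs = (a + d) / n  # observed proportion of exact agreement
    # chance agreement expected from the two raters' marginal distributions
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def agreement_or(a, b, c, d):
    """Odds ratio of agreement for the same 2x2 table."""
    return (a * d) / (b * c)

# Moderate agreement: kappa = 0.5 corresponds to an agreement OR of 9.
print(kappa_2x2(30, 10, 10, 30), agreement_or(30, 10, 10, 30))

# Strong agreement: kappa = 0.75 corresponds to an agreement OR of 49.
print(kappa_2x2(35, 5, 5, 35), agreement_or(35, 5, 5, 35))
```

A modest change on the familiar 0–1 kappa scale thus maps to a large multiplicative change on the odds-ratio scale, consistent with the orders of magnitude cited from Table 4.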

SUPPLEMENTARY DATA

Supplementary data can be found at http://annhyg.oxfordjournals.org/.

FUNDING

Supported by research grants HL61302 from the National Heart, Lung, and Blood Institute of the National Institutes of Health and 2P30 ES 00002 from the National Institute of Environmental Health Sciences Occupational and Environmental Health Center. MCF was supported by the Intramural Research Program of the National Cancer Institute.

AUTHORS’ CONTRIBUTIONS

PRH participated in the design of the study and analysis, coordinated collection of exposure data, performed the statistical analysis, and drafted the manuscript. MCF provided contextual input into the description and interpretation of the models. LR participated in the design of the statistical analysis. SS participated in the design of the study, coordinated data collection, and helped to draft the manuscript. DKM conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGEMENTS

Thanks to Alan Agresti for guidance during model development; Anthony Hamlett and E. Andres Houseman for useful comments and critiques as the work progressed; and to interviewers Maureen Bennett, Gail Medine, Barbara Cedarlund, and Jean Grace for their long hours collecting the information essential to this study.

REFERENCES

  1. Agresti A. (1988) A model for agreement between ratings on an ordinal scale. Biometrics; 44: 539–48.
  2. Agresti A. (2002) Categorical data analysis. Hoboken, NJ: John Wiley & Sons, Inc; ISBN 0 470 46363 5.
  3. Aickin M. (1990) Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics; 46: 293–302.
  4. Armstrong B, White E, Saracci R. (1992) Principles of exposure measurement in epidemiology. New York, NY: Oxford University Press; ISBN 0 19 262020 7.
  5. Benke G, Sim M, Forbes A, et al. (1997) Retrospective assessment of occupational exposure to chemicals in community-based studies: validity and repeatability of industrial hygiene panel ratings. Int J Epidemiol; 26: 635–42.
  6. Burnham K, Anderson D. (1998) Model selection and inference: a practical information-theoretic approach. New York, NY: Springer-Verlag; ISBN 0 38 795364 7.
  7. Carlin JB, Ryan LM, Harvey EA, et al. (2000) Anticonvulsant teratogenesis 4: inter-rater agreement in assessing minor physical features related to anticonvulsant therapy. Teratology; 62: 406–12.
  8. Cicchetti DV, Feinstein AR. (1990) High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol; 43: 551–8.
  9. Cohen J. (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas; 20: 37–46.
  10. Dewar R, Siemiatycki J, Gerin M. (1991) Loss of statistical power associated with the use of a job-exposure matrix in occupational case–control studies. Appl Occup Environ Hyg; 6: 508–15.
  11. Correa A, Min YI, Stewart PA, et al. (2006) Inter-rater agreement of assessed prenatal maternal occupational exposures to lead. Birth Defects Res A Clin Mol Teratol; 76: 811–24.
  12. Espeland MA, Handelman SL. (1989) Using latent class models to characterize and assess relative error in discrete measurements. Biometrics; 45: 587–99.
  13. Feinstein AR, Cicchetti DV. (1990) High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol; 43: 543–9.
  14. Fienberg S. (1980) The analysis of cross-classified data. Cambridge, MA: The MIT Press; ISBN 0 38 772824 4.
  15. Friesen MC, Coble JB, Katki HA, et al. (2011) Validity and reliability of exposure assessors’ ratings of exposure intensity by type of occupational questionnaire and type of rater. Ann Occup Hyg; 55: 601–11.
  16. Friesen MC, Pronk A, Wheeler DC, et al. (2013) Comparison of algorithm-based estimates of occupational diesel exhaust exposure to those of multiple independent raters in a population-based case–control study. Ann Occup Hyg; 57: 470–81.
  17. Fritschi L, Nadon L, Benke G, et al. (2003) Validation of expert assessment of occupational exposures. Am J Ind Med; 43: 519–22.
  18. Fritschi L, Siemiatycki J, Richardson L. (1996) Self-assessed versus expert-assessed occupational exposures. Am J Epidemiol; 144: 521–7.
  19. Goldberg MS, Siemiatycki J, Gérin M. (1986) Inter-rater agreement in assessing occupational exposure in a case–control study. Br J Ind Med; 43: 667–76.
  20. Goodman LA. (1979) Simple models for the analysis of association in cross-classifications having ordered categories. J Am Stat Assoc; 74: 537–52.
  21. Guggenmoos-Holzmann I, Vonk R. (1998) Kappa-like indices of observer agreement viewed from a latent class perspective. Stat Med; 17: 797–812.
  22. Maclure M, Willett WC. (1987) Misinterpretation and misuse of the kappa statistic. Am J Epidemiol; 126: 161–9.
  23. ’t Mannetje A, Fevotte J, Fletcher T, et al. (2003) Assessing exposure misclassification by expert assessment in multicenter occupational studies. Epidemiology; 14: 585–92.
  24. Nelson JC, Pepe MS. (2000) Statistical description of interrater variability in ordinal ratings. Stat Methods Med Res; 9: 475–96.
  25. Posner KL, Sampson PD, Caplan RA, et al. (1990) Measuring interrater reliability among multiple raters: an example of methods for nominal data. Stat Med; 9: 1103–15.
  26. Qu Y, Tan M, Kutner MH. (1996) Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics; 52: 797–810.
  27. Sama SR, Hunt PR, Cirillo CI, et al. (2003) A longitudinal study of adult-onset asthma incidence among HMO members. Environ Health; 2: 10.
  28. Schuster C. (2002) A mixture model approach to indexing rater agreement. Br J Math Stat Psychol; 55(Pt 2): 289–303.
  29. Siemiatycki J, Dewar R, Richardson L. (1989) Costs and statistical power associated with five methods of collecting occupation exposure information for population-based case–control studies. Am J Epidemiol; 130: 1236–46.
  30. Siemiatycki J, Fritschi L, Nadon L, et al. (1997) Reliability of an expert rating procedure for retrospective assessment of occupational exposures in community-based case–control studies. Am J Ind Med; 31: 280–6.
  31. Solovieva S, Pehkonen I, Kausto J, et al. (2012) Development and validation of a job exposure matrix for physical risk factors in low back pain. PLoS One; 7: e48680.
  32. Stewart PA, Stewart WF. (1994) Occupational case–control studies: II. Recommendations for exposure assessment. Am J Ind Med; 26: 313–26.
  33. Stewart WF, Stewart PA. (1994) Occupational case–control studies: I. Collecting information on work histories and work-related exposures. Am J Ind Med; 26: 297–312.
  34. Tanner M, Young M. (1985) Modeling agreement among raters. J Am Stat Assoc; 80: 175–80.
  35. Tanner MA, Young MA. (1985) Modeling ordinal scale disagreement. Psychol Bull; 98: 408–15.
  36. Thompson WD. (1990) Kappa and attenuation of the odds ratio. Epidemiology; 1: 357–69.
  37. Tielemans E, Heederik D, Burdorf A, et al. (1999) Assessment of occupational exposures in a general population: comparison of different methods. Occup Environ Med; 56: 145–51.
  38. Zelterman D. (1999) Models for discrete data. Oxford, NY: Clarendon Press; ISBN 0 19 856701 4.


Articles from Annals of Occupational Hygiene are provided here courtesy of Oxford University Press