Published in final edited form as: Qual Quant. 2022 May 7;57(2):1231–1245. doi: 10.1007/s11135-022-01397-7

Evaluating and Tracking Qualitative Content Coder Performance Using Item Response Theory

Michael Hennessy 1, Amy Bleakley 1, Morgan E Ellithorpe 1
PMCID: PMC10691860  NIHMSID: NIHMS1895079  PMID: 38046942

Abstract

Content analysis of traditional and social media has a central role in investigating features of media content, measuring media exposure, and calculating media effects. The reliability of content coding is usually evaluated using “Kappa-like” agreement measures, but these measures produce results that aggregate individual coder decisions, which obscures the performance of individual coders. Using a data set of 105 sports and energy drink advertisements coded by five coders, we demonstrate that Item Response Theory can track coder performance over time and give coder-specific information on the consistency of decisions over qualitatively coded objects. We conclude that IRT should be added to content analysts’ tool kit of useful methodologies to track and measure content coders’ performance.

Keywords: content analysis, coder agreement, item response theory


Content analysis is a data analysis method (Jordan, Kunkel, Manganello, & Fishbein, 2010; Neuendorf, 2017) and plays an important role in the analysis of strategic communication messaging (Krippendorff, 2018; Riff, Lacy, Watson, & Fico, 2019). For example, meaningful estimation of “media effects” (Emmers-Sommer & Allen, 1999; Potter & Riddle, 2007) on attitudes and behavior depends on correct theoretical classifications and accurate descriptions of media message content. Examples of content analysis of messages include studies of public service announcements (Coleman & Hatley Major, 2014; DeJong, 2001); drugs and alcohol in movies (El-Khoury et al., 2019; Stern & Morr, 2013), television (Barker, Whittamore, Britton, Murray, & Cranwell, 2018; Russell, Russell, & Grube, 2009), and music (Peteet et al., 2020; Primack, Dalton, Carroll, Agarwal, & Fine, 2008); and sex content in movies and television (Bleakley et al., 2017). Content analyses of digital media (Skalski, Neuendorf, & Cajigas, 2017) include studies on the depiction of alcohol on Facebook (Beullens & Schepers, 2013), drugs and alcohol on Twitter (Cavazos-Rehg, Krauss, Fisher, et al., 2015; Cavazos-Rehg, Krauss, Sowles, & Bierut, 2015; Krauss, Grucza, Bierut, & Cavazos-Rehg, 2017), and commercial products such as “junk food” and energy drinks (Brownbill, Miller, Smithers, & Braunack-Mayer, 2020; Buchanan, Yeatman, Kelly, & Kariippanon, 2018; Coates, Hardman, Halford, Christiansen, & Boyland, 2019; Mus, Rozas, Barnoya, & Busse, 2021; Vassallo et al., 2018).

Content analysis of this kind is usually conducted using human coders who are trained to identify the content of the message (e.g., implicit theoretical constructs, character portrayals and their context, the presence of factual information) as well as specific message features (e.g., visual/auditory cues, production techniques, the type of persuasive strategy). Reliability in content coding implies agreement between coders and is a pre-condition for the validity of coded data, as summarized by Artstein and Poesio:

[coded] data are reliable if coders can be shown to agree on the categories assigned to units to an extent determined by the purposes of the study…If different coders produce consistently similar results, then we can infer that they have internalized a similar understanding of the annotation guidelines, and we can expect them to perform consistently under this understanding….Reliability is thus a prerequisite for demonstrating the validity of the coding scheme—that is, to show that the coding scheme captures the “truth” of the phenomenon being studied, in case this matters: If the annotators are not consistent then either some of them are wrong or else the annotation scheme is inappropriate for the data.

(Artstein & Poesio, 2008, p. 557, emphasis in the original).

Accurate coding is important because, when valid and reliable, it provides an opportunity to describe the nature of media content and document prevailing patterns and trends. Additionally, the codes are often a component of constructed exposure measures (e.g., Bleakley et al., 2008), which are used to evaluate the effects of “media exposure” on cognitive and behavioral outcomes.

Kappa-centric Agreement Measures

Although coder agreement measures differentiate between categorical (i.e., nominal) measures and continuous ones like the Pearson or intra-class correlation (Barnhart, Haber, & Lin, 2007; Gwet, 2014a; Shrout & Fleiss, 1979), here we focus on “kappa-centric” measures for nominal judgements which are often required for coding advertisements and other types of media content. Kappa-centric measures assess the agreement between two or more coders after agreement due to chance is removed (Streiner, 1995). The most common is Cohen’s kappa (Cκ), but there are others (Carletta, 1996; Hallgren, 2012). All of these variants agree on the measure of “simple” agreement (i.e., the sum of the frequencies in the main diagonal of a 2*2 or larger square crosstabulation of an assessment of a media trait or feature produced by two or more coders). However, the alternatives disagree on the correct definition of “chance agreement” and its calculation, thus different kappa-like measures rarely give identical answers when applied to identical coder data (Artstein & Poesio, 2008, p. 560; Gwet, 2008; Krippendorff, 2004; Oleinik, Popova, Kirdina, & Shatalova, 2014).
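In symbols, every measure in this family is chance-corrected in the same way; the minimal sketch below uses standard notation, and the measures differ only in how the expected-agreement term is defined:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where P_o is the observed proportion of agreement (the main-diagonal sum divided by the number of coded objects) and P_e is the agreement expected by chance under the particular measure’s chance model. Krippendorff’s α is usually written in the equivalent disagreement form, α = 1 − D_o/D_e, with D_o and D_e the observed and expected disagreement.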

Cκ performs poorly when the coded outcomes are either extremely rare or extremely common: this is “prevalence bias”. When the marginal distributions of the codes are not balanced across coders, Cκ cannot attain a unity value of chance-adjusted “perfect agreement” (Brennan & Prediger, 1981; Streiner, 1995). Kappa also performs poorly when the independent coders display different marginal distributions in their use of the available codes, which is referred to as “coder bias” (Banerjee, Capozzoli, McSweeney, & Sinha, 1999; Byrt, Bishop, & Carlin, 1993; Gwet, 2002, 2008); as coders disagree about the prevalence of the codes, Cκ increases (Artstein & Poesio, 2008, p. 573; Krippendorff, 2004, pp. 422, 429). Thus, prevalence bias results in reduced Cκ, while coder bias results in inflated Cκ. In summary, when prevalence rates are very high or very low and when coders vary among themselves in their coding, even small discrepancies between coders distort estimated agreement using Cκ (Feinstein & Cicchetti, 1990). A widely used alternative to Cκ is Krippendorff’s α (Kα) (Hayes & Krippendorff, 2007), which can be applied to both categorical and ordinal outcomes; weighted Kα is identical to the intra-class correlation in the case of ordinally scaled coder data (Fleiss & Cohen, 1973; Krippendorff, 1970). Gwet’s AC measure is less common, but because it was designed explicitly to correct Cκ’s sensitivity to coder and prevalence bias, we also report this measure in the results below (Gwet, 2014b).
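In Stata, the software used for the analyses reported below, these agreement measures can be obtained along the following lines. This is a minimal sketch, not our exact code: the variable names are hypothetical (one 0/1 variable per coder for a given code, one row per ad), and kappaetc is a user-written command that must be installed separately.

* One row per advertisement; humor_c1-humor_c5 are hypothetical 0/1 variables
* holding each coder's decision for the Humor code.
ssc install kappaetc            // one-time installation of the user-written command

* Percent agreement, Cohen/Conger kappa, Gwet's AC, and Krippendorff's alpha
kappaetc humor_c1 humor_c2 humor_c3 humor_c4 humor_c5

* Stata's built-in kappa for the same five coders, for comparison
kap humor_c1 humor_c2 humor_c3 humor_c4 humor_c5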

Identifying and Resolving Coder Disagreements

During coder training it would be useful to quantitatively assess individual coder performance relative to other coders on the team. For example, Belur et al. reviewed 49 meta-analytic studies on crime reduction in terms of how the research reported on dimensions of coder reliability. Sixty-three percent of the studies reported no quantitative measure of coder agreement and 35% of the studies did not discuss how coder disagreements (however identified) were resolved (Belur, Tompson, Thornton, & Simon, 2018, Table 1). Garrison et al. proposed an iterative coder training process they call “negotiated agreement,” applied to coders of educational transcripts and demonstrated in a case study of coded transcriptions, another kind of content analysis (Garrison, Cleveland-Innes, Koole, & Kappelman, 2006). Lacy et al. (2015) identify “best practices” in content analysis. They suggest always using multiple independent coders (“three or more is better”) who practice on content not included in the final sample, evaluating disagreements through random assignment of codes during training, and resolving disagreements through re-training, establishing coder consensus, or dropping the code. They also offer their standard for coder training:

Table 1:

Percentage of Content Codes Applied by Coders and Agreement Measures

Content Code (bias type)                       Coder 1   Coder 2   Coder 3   Coder 4   Coder 5   Agreement Measure*
Humor (no obvious bias)                           38        43        40        36        40     Cκ: .80   Kα: .82   AC: .81
Narrative (coder bias)                            75        30        40        50        40     Cκ: .40   Kα: .39   AC: .39
Extraordinary Athlete (prevalence bias)           46        41        38        32        25     Cκ: .69   Kα: .69   AC: .73
Make Me Happy (coder and prevalence bias)         39        17        36        13        15     Cκ: .47   Kα: .46   AC: .69
* Rounded to two decimal places. N = 105 advertisements.

Cκ: Cohen’s Kappa. Kα: Krippendorff’s Alpha. AC: Gwet’s AC.

The intercoder reliability check should occur with study content as it is coded. If reliability is not achieved, the coders have to be replaced in the recoding process because recoding content violates independence across coders and time. As a result, coders need to practice with content that will not be in the study but is similar to the study content in complexity. Practice with non-study content should continue until the protocol and coders can produce reliable data. At that point, the coding of study content can begin. (Lacy, Watson, Riffe, & Lovejoy, 2015, p. 804)

But it is not clear how such individual assessments can be made with Kappa-centric statistics because they produce global agreement indices between either pairs or groups of coders evaluating the same content. In contrast, item response theory (IRT) is an alternative approach that can be used to focus precisely on between-coder performance cross-sectionally or over time.

Using IRT for Evaluating Coder Performance

IRT inhabits a different statistical world than the kappa-centric measures discussed above: structural equation modeling (Ullman & Bentler, 2012) and confirmatory factor analysis of dichotomous and ordinal items (Brown, 2015). Like classical test theory (Glockner-Rist & Hoijtink, 2003), IRT assumes a latent underlying trait that determines the respondent’s choices (Singh, 2004). The term “trait” comes from the use of factor analysis to construct achievement and intelligence tests when the outcomes are true/false responses; in these cases the trait is IQ or some other unobserved ability construct. Content analysis using IRT does the same: it assumes a latent variable (Θ or “theta” in IRT vocabulary) that underlies the decision to assign a specific code to an evaluated object (such as an ad, movie scene, or Facebook image). In other words, content analysts treat a specific coding decision as a true/false choice rather than a true/false answer (as in the case of educational tests with a priori correct values). Thus, content coding decisions can be modeled in terms of the two IRT parameters: “discrimination” and “difficulty” (Reise, Ainsworth, & Haviland, 2005).

Consider coding an advertisement for an energy drink: one decision to be made might be “Is humor used in the ad?” Discrimination is the slope of the line for each coder as they distinguish between the ads that definitely do not use humor (starting at the zero point on the Y axis) and those that definitely do (ending at the unity point on the Y axis). IRT assumes a nonlinear (logistic) relationship between the choice and Θ, which produces an “S”-shaped function relating the X axis trait (Θ) to the Y axis probability of a humor component in the ad. If the line is steep, then small changes in the latent trait have large consequences in terms of the probability of the “humor content is present” decision. If the slope is flatter, then it takes a large change in Θ to affect the probability of the “humor is present” decision. In the first case coder discrimination is high and in the second case coder discrimination is low (DeMars, 2010, p. 11). In IRT, the discrimination parameter is the logistically scaled “factor loading”: the change in the item given a one unit change in the (unobserved) Θ estimate.

Difficulty is the location on the Θ axis where the discrimination slope predicts a probability of 50% or more on the coding choice (here, coding for “humor is present”). In other words, the difficulty parameter is the item’s location on the Θ latent trait where the regression curve crosses “50%” (De Ayala, 2013). “Easy” content items are located on the left (i.e., the low side of the Θ axis) and “difficult” items on the right (i.e., the high side of the Θ axis) because in the first case lower values of the latent trait achieve the 50% or more standard for determining that “humor content is present” than the higher latent trait values needed in the second case. In IRT graphs of this type (an “Item Characteristic Curve”), a vertical line dropped to the Θ axis from each item’s “50%” probability cut-off locates the item on the Θ axis and therefore the relative difficulty level (i.e., the value of Θ) of the coding decision for each coder. Thus, coders are naturally ranked on the Θ axis in terms of how much of the latent trait was required for them to reach the 50% or more probability level. In IRT, the “difficulty” parameter is the specific variable’s intercept divided by the common slope, multiplied by −1 (Glockner-Rist & Hoijtink, 2003).[1]
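Formally, for coder j assigning a given code to an object located at Θ on the latent trait, the item characteristic curve described above can be written in the standard two-parameter logistic form (a minimal sketch; the one-parameter model is the special case in which all coders share a common slope a):

$$P(Y_j = 1 \mid \Theta) = \frac{1}{1 + \exp[-a_j(\Theta - b_j)]}, \qquad b_j = -\frac{\alpha_j}{a_j}$$

where a_j is coder j’s discrimination (slope), b_j is the difficulty (the value of Θ at which the curve crosses .50), and α_j is the intercept in the slope-intercept parameterization referred to in footnote 1.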

The Three IRT Models for Dichotomous Items

Because IRT represents a type of measurement model, there are options for any analysis. “One parameter” IRT models allow different item difficulties (Θ location values) across coders but assume the same discrimination (slope) for all coders. “Two parameter” models allow both coder-specific difficulties and slopes (see Zickar & Highhouse, 1998 for an application in risk aversion and message framing). “Three parameter” models allow for random starting points above the zero point on the Y axis (Raykov & Marcoulides, 2018, Chapter 5).[2] Likelihood ratio tests can statistically differentiate between the models because the simpler models are nested within the more complex ones (DeMars, 2010, p. 57).
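A minimal Stata sketch of this model comparison for one code is shown below; the variable names (narr_c1 through narr_c5, each coder’s 0/1 Narrative decision per ad) are hypothetical, and the exact options used in our analyses may differ.

* One-parameter model: coder-specific difficulties, common discrimination
irt 1pl narr_c1 narr_c2 narr_c3 narr_c4 narr_c5
estat report                   // estimated difficulty and discrimination parameters
estat ic                       // AIC and BIC for the one-parameter model
estimates store onepl

* Two-parameter model: coder-specific difficulties and discriminations
irt 2pl narr_c1 narr_c2 narr_c3 narr_c4 narr_c5
estat report
estat ic
estimates store twopl

* Likelihood ratio test of the nested models
lrtest onepl twopl

* Item characteristic curves with drop lines at the difficulty locations
irtgraph icc, blocation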

A Hypothetical Example

Figure 1 shows a one parameter item characteristic curve for three coders assigning a single code to multiple ads (as in the example of humor above). The Y axis is the probability of assigning the code and the X axis is the latent trait Θ determining the assignment probability. The “S” curve shows the change in the probability of assignment given a one unit (standard deviation) increase in Θ. The location where the curve for each coder crosses the “50%” value of the Y axis (indicated by the drop-down lines for each coder) is the “difficulty” parameter. Assigning this code was easiest for coder 1: the Θ value indicating 50% or more probability is .18 standard deviations below the mean value of zero. In contrast, coder 3 found assigning this code more difficult: the value of Θ that crosses the 50% probability line is .42 standard deviations above the average Θ. Coder 2 falls between the others in difficulty.
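A figure of this kind can be drawn directly from the parameter values. The sketch below assumes a common discrimination of 2 and a difficulty of .10 for coder 2; neither value is reported in the text, so both are purely illustrative assumptions.

* Hypothetical one-parameter ICCs for three coders. The common slope of 2 and
* coder 2's difficulty of .10 are assumptions made only for this illustration.
twoway (function y = invlogit(2*(x + 0.18)), range(-3 3)) ///
       (function y = invlogit(2*(x - 0.10)), range(-3 3)) ///
       (function y = invlogit(2*(x - 0.42)), range(-3 3)), ///
       yline(0.5) ytitle("Probability of assigning the code") ///
       xtitle("Latent trait (theta)") ///
       legend(order(1 "Coder 1" 2 "Coder 2" 3 "Coder 3"))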

Figure 1: Item Characteristic Curve (ICC) for Three Coders Assigning the Same Code to a Common Set of Coded Objects

Notes: Because this is a one parameter IRT model, the discrimination slopes are the same for all coders. See Figure 4 for the case where this constraint is removed using actual data for the Narrative code.

When content analysts look at IRT graphs of coder performance, they want to see two specific features. First, because coders should agree on the difficulty of assigning a specific code, the curves for the coders should lie close to one another on the Θ axis so that the drop-down difficulty values cluster within a narrow range. This hypothetical example probably does not meet that expectation. Second, content analysts want to see a steep discrimination slope, indicating that small changes in Θ produce large changes in the probability of code assignment. In contrast, what content analysts do not want to see are widely separated slopes that are shallow and stretch across a wide range of Θ. In that case, it takes large changes in Θ to increase the probability of a code being correctly assigned (i.e., discrimination is low) and the drop-down difficulty values may be far apart on the Θ axis (i.e., difficulty is variable across coders). From a coding reliability perspective, this is the worst case scenario.

Methods and Measures

The TeenADE Study

Our data come from TeenADE, a project to identify modifiable individual factors (like beverage preferences) and environmental factors (like sports and energy drink advertising) to enable the design of health messages to discourage energy and sports drink consumption. American adolescents consume more sugary drinks than any other age group (Marriott, Hunt, Malek, & Newman, 2019), with the highest consumption rates among lower-income youth as well as Black and Mexican-American children and teens (Harris, Fleming-Milici, Kibwana-Jaff, & Phaneuf, 2020). Fifty percent of high school students in the United States reported drinking a sports drink at least weekly, and one in ten reported drinking sports drinks every day (Underwood et al., 2020). In addition, energy drink consumption among youth is also increasing (Vercammen, Koma, & Bleich, 2019). This suggests the need for a focus on these types of sugary drinks, especially considering public uncertainty about the sugar and caloric content of sports and energy drinks (Hennessy, Bleakley, Piotrowski, Mallya, & Jordan, 2015) combined with the perception that sports and energy drinks provide health benefits (Burke & Hawley, 2018) like hydration, enhanced athletic performance, mental alertness, and more rapid recovery after exercise (Moran & Roberto, 2018; Munsell, Harris, Sarda, & Schwartz, 2016; Zytnick, Park, & Onufrak, 2016).

To understand the persuasive logic of sports and energy drink advertising, one component of TeenADE is a content analysis of advertisements for sports and energy drinks (with Coca-Cola included as a conventional sugar-sweetened comparison beverage) using a coding manual derived from the Elaboration Likelihood Model (including central and peripheral cue codes) and the Reasoned Action Approach (including outcome expectancy claim codes and injunctive and descriptive normative belief codes). Additional details on the TeenADE study can be found elsewhere (Hennessy et al., 2021).

Selecting the Sample for Content Coding

We used Nielsen ratings to identify the ads for “isotonic” drinks seen by the largest broadcast audiences for seven brands: sports drinks (Gatorade, Powerade, Vitamin Water), energy drinks (Monster Energy, Red Bull, 5-Hour Energy), and Coca-Cola (included to provide a sugar-sweetened comparison beverage). The brands were chosen because they are the top-selling brands in each drink category. We created a list of ads aired in 2015–2016, 2016–2017, and 2017–2018 based on the ad description provided by Nielsen (Coke n=3,991, Gatorade n=2,723, Powerade n=181, Vitamin Water n=148, Red Bull n=1,373, 5-Hour Energy n=1,248, Monster n=69). We sorted by the ad description, removed duplicates based on the description, and also removed entries described as “incomplete video” or “no title available”. We then searched for the actual ads matching the Nielsen descriptions using AdForum, iSpotTV, YouTube, brand websites, Google videos, Vimeo, and Dailymotion. Of the original de-duplicated list, nine ads could not be located: four for Coca-Cola, four for Red Bull, and one for Gatorade. The final sample of ads (N = 105) consisted of: Gatorade: 15, Powerade: 5, Vitamin Water: 1, 5-Hour Energy: 28, Red Bull: 24, Monster Energy: 10, Coca-Cola: 22.

Coder Training

The coding scheme was developed by team members based on the Reasoned Action Approach (Fishbein & Ajzen, 2010) to intention formation and the Elaboration Likelihood Model of information processing (O’Keefe, 2008; Petty & Cacioppo, 1986). The objective was to develop codes that would allow for a comprehensive analysis of the messaging strategies used in the advertisements. The codebook features codes designed to account for production features of the ads, messaging strategies (e.g., humor, use of celebrity), and outcome expectancies mentioned or implied (e.g., you will get/stay hydrated, you will stay focused and alert). An online coding instrument was developed to correspond with the codebook. After several iterations of the codebook, coders were hired and trained. Coders attended three training sessions on the codebook, and further adjustments to the codes and codebook were made as necessary. The training included practice coding of ads not included in the sample (but from the same brands in different years) followed by meetings with project members to identify and resolve coding differences. This process was repeated until the investigators were satisfied with the consistency among coders on two quantitative measures of categorical agreement (i.e., Kα and Gwet’s AC). Coders then evaluated the 105 ads along theory and content dimensions.

Data Analysis Plan

We discuss four examples from the coded ads to demonstrate the use of IRT. Two are of ad content (the use of Humor, the use of Narrative) and two are Reasoned Action expectancies (the product will make me an Extraordinary Athlete, the product will Make Me Happy). We selected these four because they highlight coder biases as noted in Table 1. To investigate the two types of coder bias, we look at central tendency of assigned codes for each example. We use Stata (StataCorp, 2019) to estimate the kappa-centric statistics, the IRT parameters, and a goodness of fit index, the BIC (Aho, Derryberry, & Peterson, 2014). Unfortunately, with these types of logistic models, more conventional SEM-related fit indices (e.g., Tucker-Lewis Index, the RMSEA) are not available.

Results

Summary Statistics, Agreement Measures, and IRT Results for the Four Codes

Because of the importance of coder and prevalence bias for Kappa-centric measures, Table 1 shows the performance of each coder as well as the three measures of agreement for each of the four outcomes. For Humor, coders were extremely consistent in applying the codes and there is no evidence of either kind of bias in the statistics on code use. Coders apparently made consistent decisions on the ads, because all agreement measures are high. For Narrative, there is coder bias, especially for coders 1 and 2, and agreement measures are universally poor. For Extraordinary Athlete, the coders were relatively consistent and agreement measures are adequate. For Make Me Happy, both prevalence and coder bias appear to be operating and this is consistent with the agreement measures: the bias-correcting AC is moderate while the other two are low.

Figure 2 shows the IRT plots for the four codes. In these one parameter models, the discrimination slopes are the same and only the difficulty parameter varies between coders. Humor was the most consistent on the kappa-centric agreement measures and the IRT results show a logistic slope of 4.29. We drop a vertical line to the theta axis to compare the two most extreme difficulty values. The BIC was 390 (note that lower values indicate better fit).

Narrative did poorly on all agreement measures and the IRT plot shows why. The slope is 2.85 and the difficulty locations range from .80 SD below the mean to .61 SD above the mean: coder 1 found it easy to identify narratives in these ads and coder 2 found it difficult. The BIC was 559.

Extraordinary Athlete performed adequately on the agreement measures and shows no extreme type of bias. The IRT plot is consistent with this: the common slope is 13.53 and the difficulty ranges from .55 (coder 1) to .85 (coder 5). The BIC was 415.

Finally, the Make Me Happy code displayed both kinds of bias; for this code, the common slope is 4.36 and the difficulty range is from .39 (coder 1) to 1.24 (coder 4). Also note that the coders fell into two difficulty-defined groups: coders 1 and 3 versus coders 2, 4, and 5. The BIC was 413.

Tracking Coder Progress: The Extraordinary Athlete and Narrative Code

Using IRT, content analysts can track coders’ experience with a specific code over time between batches of ads: all coders independently coded 35 ads in the first batch and then the remaining 70 in the second batch. Figure 3 shows the IRT results for the Extraordinary Athlete and Narrative codes for each batch. Well trained coders should show only slight variation between batch results, and this is what we find here. However, Figure 3 does show large differences in performance between the two codes: Extraordinary Athlete has high discrimination and a narrow range of difficulty parameters, while the Narrative code shows lower discrimination and a wider range of difficulty. The Narrative code obviously needs additional analysis.
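Tracking of this kind simply re-estimates the model within each batch. A hedged sketch is shown below; the batch indicator and coder variable names are hypothetical, and the dash notation assumes the coder variables are stored consecutively in the data set.

* One-parameter model for the Extraordinary Athlete code, estimated by batch
irt 1pl athlete_c1-athlete_c5 if batch == 1
estat report
irt 1pl athlete_c1-athlete_c5 if batch == 2
estat report

* Repeat for the Narrative code and compare difficulty ranges across batches
irt 1pl narr_c1-narr_c5 if batch == 1
estat report
irt 1pl narr_c1-narr_c5 if batch == 2
estat report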

Figure 2: IRT Results for Each of the Four Content Codes

Different Discrimination and Difficulties: The Narrative Code

As noted above, it is possible to differentiate coders in terms of discrimination slopes as well as difficulty. With these data, however, estimating coder-specific slopes often fails to converge when agreement is high and bias is low (as with Humor and Extraordinary Athlete), which suggests that data with high agreement are consistent with a common discrimination slope: when coder-specific slopes are very similar, estimation cannot converge on a solution, as was the case with the Humor code. The Narrative code’s initial results in Table 1 and Figures 2 and 3 were not encouraging, so perhaps the assumption of a common slope for all coders was unreasonable, and the poor initial results reflect coders who actually differ in discrimination ability for Narrative. The Narrative code was therefore also fit as a two parameter model, shown in Figure 4. Each coder has their own slope and difficulty parameter: Coder 1 (slope = 2.46, difficulty = −.84); Coder 2 (slope = 3.48, difficulty = .58); Coder 3 (slope = 4.61, difficulty = .25); Coder 4 (slope = 3.06, difficulty = .002); Coder 5 (slope = 1.97, difficulty = .35). While coder 3 shows the most discrimination and coder 2 the greatest difficulty, in general the difference between the one parameter model (Figure 2) and the two parameter model (Figure 4) for the Narrative code appears to be small. In fact, the likelihood ratio test shows that the addition of the coder-specific slopes in the two parameter model is not statistically significant (χ2 = 4.13, df = 4, p = .38; the upper-tail .05 critical value for this distribution is 9.49). In addition, the model fit is slightly worse than the one parameter model assuming equal discrimination: the BIC was 573. The initial results for Narrative therefore do not appear to be due to coder-specific idiosyncratic judgements, because the results remain poor even when the two parameter model permits coder-specific parameters. We conclude that the Narrative code should be discarded for this set of advertisements.

Figure 3: Tracking Longitudinal Coder Performance: Extraordinary Athlete and Narrative Code

Notes: For clarity, the difficulty drop down Θ values are not shown for the Extraordinary Athlete code.

What About Using a Kappa-centric Approach?

Could a traditional content analysis come to the same conclusions using kappa-centric results? Because some of the coders clearly display coder bias, Table 2 shows the Gwet AC measure for each coder pair and each of the four example codes. Scanning these forty measures, only one finding really stands out: the performance of Coder 1 on the Narrative code. For this code, Coder 1 has very low agreement (AC ranging from .09 to .29) with the other coders. Looking at both IRT plots for Narrative (Figures 2 and 4) shows that Coder 1 has the lowest difficulty score in both IRT models. For Coder 1, narratives were easier to identify than for the other coders, and this coder-specific feature may account for the poor results for the Narrative code. Given that possible explanation, it is not clear how a kappa-centric analysis should proceed, but because IRT is actually a structural equation model, we can evaluate this hypothesis by redoing the IRT analysis for Narrative without Coder 1. With a common discrimination slope the BIC is 467, and with coder-specific discrimination slopes it is 475 (the IRT plots look similar to the originals with Coder 1 and are not shown here). While this is an improvement, we would still conclude that the Narrative code should not be used.

Table 2:

Bivariate ACs for all Coder Pairs for Example Codes

Coder Pair   Humor   Narrative   Extraordinary Athlete   Make Me Happy
1, 2          .78       .09               .83                  .38
1, 3          .84       .20               .77                  .77
1, 4          .72       .29               .60                  .60
1, 5          .88       .19               .54                  .54
2, 3          .82       .52               .86                  .86
2, 4          .78       .39               .65                  .65
2, 5          .82       .58               .63                  .63
3, 4          .76       .65               .71                  .71
3, 5          .92       .43               .60                  .60
4, 5          .72       .33               .67                  .67
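The forty pairwise values in Table 2 can be generated mechanically. A minimal Stata sketch for one code is shown below, assuming the same hypothetical variable naming as above and the user-written kappaetc command.

* Pairwise Gwet AC (and other agreement coefficients) for the Narrative code
forvalues i = 1/4 {
    forvalues j = `=`i'+1'/5 {
        display "Coders `i' and `j'"
        kappaetc narr_c`i' narr_c`j'
    }
}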

Figure 4: Two Parameter IRT Model for Narrative Code

Finally, note that of the four codes displayed in Figure 2, the difficulty drop-down values for Coder 1 are the furthest to the left for three of the four coding decisions. In other words, Coder 1 also found it easier to assign the Extraordinary Athlete and Make Me Happy codes compared to the other coders. This highlights the visual advantage of IRT: Item Characteristic Curves display differential difficulty as a distance on the Θ axis, here revealing an aspect of Coder 1’s performance that is not evident in the kappa-centric results in Table 2. This brief exploration of the specific coding behavior of Coder 1 highlights the advantages of the IRT approach to evaluating coder performance and provides information on coder assessments that cannot be ascertained from other reliability statistics.[3]

Conclusion

Content analysts need to assess coder performance and track coding skill development. IRT is a method appropriate for this task and can be applied sequentially as content-coded data sets accumulate cases. IRT can be used during coder training as a tool to identify points of contention by both code category and coder, as well as to verify that the training is resulting in improved performance over time. It can also be used post hoc to demonstrate coder consistency (and inconsistency; see the Narrative results). Combined with the use of kappa-centric agreement indices, IRT adds an important dimension to content analysis and to the identification and use of content codes that are both reliable and theoretically sound.

Acknowledgments:

Funded by the US National Institute of Dental and Craniofacial Research (NIH/NIDCR, grant number R21DE028414-01). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIDCR. We thank our coders (Sean Hinton, Leah Yaker, Hallie Rubinstein, Julia Sciacca, and Charles Zoeller) for their dedication to and effort on this project.

Footnotes

Disclosure Statement: No conflicts of interest declared.

1

Because IRT results are factor analysis derived, extensions of the IRT model for measuring additional aspects of “coder agreement” are possible (Dayton, 2008; Porcu & Giambona, 2017; Uebersax, 1992). However, this would entail, at least in Stata, the use of its SEM procedures and a rather comprehensive knowledge of confirmatory factor analysis.

2

Three parameter models are used to model guessing in analysis of knowledge and cognitive ability test items but do not apply here because trained coders should never be guessing.

3

That the IRT approach is more efficient given many coding decisions is obvious. We use only four examples here, but the TeenADE data set has over 100 coded ad features, which with five coders would require at least 1,000 kappa-centric measures to evaluate all coding decisions, i.e., our Table 2 with 100 columns of ten kappa-centric entries each: no one can examine such a table sufficiently carefully.

References

  1. Aho K, Derryberry D, & Peterson T (2014). Model selection for ecologists: the worldviews of AIC and BIC. Ecology, 95(3), 631–636. [DOI] [PubMed] [Google Scholar]
  2. Artstein R, & Poesio M (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596. [Google Scholar]
  3. Banerjee M, Capozzoli M, McSweeney L, & Sinha D (1999). Beyond kappa: A review of interrater agreement measures. Canadian journal of statistics, 27(1), 3–23. [Google Scholar]
  4. Barker AB, Whittamore K, Britton J, Murray RL, & Cranwell J (2018). A content analysis of alcohol content in UK television. Journal of Public Health, fdy142–fdy142. doi: 10.1093/pubmed/fdy142 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Barnhart HX, Haber MJ, & Lin LI (2007). An overview on assessing agreement with continuous measurements. Journal of Biopharmaceutical Statistics, 17(4), 529–569. doi: 10.1080/10543400701376480 [DOI] [PubMed] [Google Scholar]
  6. Belur J, Tompson L, Thornton A, & Simon M (2018). Interrater reliability in systematic review methodology: exploring variation in coder decision-making. Sociological Methods & Research, 1–29. doi: 10.1177/0049124118799372 [DOI] [Google Scholar]
  7. Beullens K, & Schepers A (2013). Display of Alcohol Use on Facebook: A Content Analysis. CyberPsychology, Behavior & Social Networking, 16(7). doi: 10.1089/cyber.2013.0044 [DOI] [PubMed] [Google Scholar]
  8. Bleakley A, Ellithorpe ME, Hennessy M, Jamieson PE, Khurana A, & Weitz I (2017). Risky movies, risky behaviors, and ethnic identity among Black adolescents. Social Science & Medicine, 195, 131–137. doi: 10.1016/j.socscimed.2017.10.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Bleakley A, Fishbein M, Hennessy M, Jordan A, Chernin A, & Stevens R (2008). Developing respondent based multi-media measures of exposure to sexual content. Communications Methods and Measures, 2(1 & 2), 43–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Brennan RL, & Prediger DJ (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. [Google Scholar]
  11. Brown T (2015). Confirmatory Factor Analysis for Applied Research (2nd ed.). New York: Guilford. [Google Scholar]
  12. Brownbill AL, Miller CL, Smithers LG, & Braunack-Mayer AJ (2020). Selling function: the advertising of sugar-containing beverages on Australian television. Health Promotion International. doi: 10.1093/heapro/daaa052 [DOI] [PubMed] [Google Scholar]
  13. Buchanan L, Yeatman H, Kelly B, & Kariippanon K (2018). A thematic content analysis of how marketers promote energy drinks on digital platforms to young Australians. Australian and New Zealand Journal of Public Health, 42(6), 530–531. doi: 10.1111/1753-6405.12840 [DOI] [PubMed] [Google Scholar]
  14. Burke LM, & Hawley JA (2018). Swifter, higher, stronger: What’s on the menu? Science, 362(6416), 781–787. doi: 10.1126/science.aau2093 [DOI] [PubMed] [Google Scholar]
  15. Byrt T, Bishop J, & Carlin JB (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423–429. [DOI] [PubMed] [Google Scholar]
  16. Carletta J (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2), 249–254. [Google Scholar]
  17. Cavazos-Rehg PA, Krauss M, Fisher SL, Salyer P, Grucza RA, & Bierut LJ (2015). Twitter chatter about marijuana. Journal of Adolescent Health, 56(2), 139–145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Cavazos-Rehg PA, Krauss MJ, Sowles SJ, & Bierut LJ (2015). “Hey everyone, I’m drunk.” An evaluation of drinking-related Twitter chatter. Journal of Studies on Alcohol and Drugs, 76(4), 635–643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Coates AE, Hardman CA, Halford JCG, Christiansen P, & Boyland EJ (2019). Food and Beverage Cues Featured in YouTube Videos of Social Media Influencers Popular With Children: An Exploratory Study. Frontiers in Psychology, 10(2142). doi: 10.3389/fpsyg.2019.02142 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Coleman R, & Hatley Major L (2014). Ethical health communication: A content analysis of predominant frames and primes in public service announcements. Journal of Mass Media Ethics, 29(2), 91–107. doi: 10.1080/08900523.2014.893773 [DOI] [Google Scholar]
  21. Dayton CM (2008). An introduction to latent class analysis. In Menard S (Ed.), Handbook of longitudinal research: Design, measurement, and analysis (pp. 357–371): Academic Press. [Google Scholar]
  22. DeJong W, Wolf RC, & Austin SB (2001). US federally funded television public service announcements (PSAs) to prevent HIV/AIDS: A content analysis. Journal of Health Communication, 6(3), 249–263. doi: 10.1080/108107301752384433 [DOI] [PubMed] [Google Scholar]
  23. El-Khoury J, Bilani N, Abu-Mohammad A, Ghazzaoui R, Kassir G, Rachid E, & El Hayek S (2019). Drugs and Alcohol Themes in Recent Feature Films: A Content Analysis. Journal of Child & Adolescent Substance Abuse, 28(1), 8–14. [Google Scholar]
  24. Emmers-Sommer TM, & Allen M (1999). Surveying the effect of media effects: A meta-analytic summary of the media effects research in Human Communication Research. Human Communication Research, 25(4), 478–497. [Google Scholar]
  25. Feinstein AR, & Cicchetti DV (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549. [DOI] [PubMed] [Google Scholar]
  26. Fishbein M, & Ajzen I (2010). Predicting and changing behavior: The reasoned action approach: Taylor & Francis. [Google Scholar]
  27. Fleiss JL, & Cohen J (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613–619. [Google Scholar]
  28. Garrison DR, Cleveland-Innes M, Koole M, & Kappelman J (2006). Revisiting methodological issues in transcript analysis: Negotiated coding and reliability. The Internet and Higher Education, 9(1), 1–8. [Google Scholar]
  29. Glockner-Rist A, & Hoijtink H (2003). The best of both worlds: Factor analysis of dichotomous data using item response theory and structural equation modeling. Structural Equation Modeling, 10(4), 544–565. [Google Scholar]
  30. Gwet KL (2002). Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter-Rater Reliability Assessment Series, 2(1), 1–9. [Google Scholar]
  31. Gwet KL (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. [DOI] [PubMed] [Google Scholar]
  32. Gwet KL (2014a). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Vol. 2): Advanced Analytics, LLC. [Google Scholar]
  33. Gwet KL (2014b). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Vol. 1: Analysis of Categorical Ratings): Advanced Analytics, LLC. [Google Scholar]
  34. Hallgren KA (2012). Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in quantitative methods for psychology, 8(1), 23–34. doi: 10.20982/tqmp.08.1.p023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Harris JL, Fleming-Milici F, Kibwana-Jaff A, & Phaneuf L (2020). Sugary drink advertising to youth: Continued barrier to public health progress. Storrs, CT: University of Connecticut Rudd Center for Food Policy and Obesity. [Google Scholar]
  36. Hayes AF, & Krippendorff K (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89. [Google Scholar]
  37. Hennessy M, Bleakley A, Ellithorpe ME, Maloney E, Jordan AB, & Stevens R (2021). Reducing Unhealthy Normative Behavior: The Case of Sports and Energy Drinks. Health Education and Behavior, 1–12. doi: 10.1177/10901981211055468 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Hennessy M, Bleakley A, Piotrowski JT, Mallya G, & Jordan A (2015). Sugar-sweetened beverage consumption by adult caregivers and their children: the role of drink features and advertising exposure. Health Education & Behavior, 42(5), 677–686. [DOI] [PubMed] [Google Scholar]
  39. Jordan A, Kunkel D, Manganello J, & Fishbein M (Eds.). (2010). Media messages and public health: A decisions approach to content analysis: Routledge. [Google Scholar]
  40. Krauss M, Grucza R, Bierut L, & Cavazos-Rehg P (2017). “Get drunk. Smoke weed. Have fun.”: A content analysis of tweets about marijuana and alcohol. American Journal of Health Promotion, 31(3), 200–208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Krippendorff K (1970). Bivariate agreement coefficients for reliability of data. Sociological methodology, 2, 139–150. [Google Scholar]
  42. Krippendorff K (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411–433. [Google Scholar]
  43. Krippendorff K (2018). Content analysis: An introduction to its methodology: Sage publications. [Google Scholar]
  44. Lacy S, Watson BR, Riffe D, & Lovejoy J (2015). Issues and best practices in content analysis. Journalism & Mass Communication Quarterly, 92(4), 791–811. [Google Scholar]
  45. Marriott BP, Hunt KJ, Malek AM, & Newman JC (2019). Trends in intake of energy and total sugar from sugar-sweetened beverages in the United States among children and adults, NHANES 2003–2016. Nutrients, 11(9). [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Moran AJ, & Roberto CA (2018). Health warning labels correct parents’ misperceptions about sugary drink options. American Journal of Preventive Medicine, 55(2), e19–e27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Munsell CR, Harris JL, Sarda V, & Schwartz MB (2016). Parents’ beliefs about the healthfulness of sugary drink options: opportunities to address misperceptions. Public Health Nutrition, 19(1), 46–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Mus S, Rozas L, Barnoya J, & Busse P (2021). Gender representation in food and beverage print advertisements found in corner stores around schools in Peru and Guatemala. BMC Research Notes, 14(1), 402. doi: 10.1186/s13104-021-05812-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Neuendorf KA (2017). The content analysis guidebook. Thousand Oaks, CA: Sage. [Google Scholar]
  50. O’Keefe DJ (2008). Elaboration Likelihood Model. In Donsbach W (Ed.), The international encyclopedia of communication (Vol. IV, pp. 1475–1480). Oxford: Blackwell. [Google Scholar]
  51. Oleinik A, Popova I, Kirdina S, & Shatalova T (2014). On the choice of measures of reliability and validity in the content-analysis of texts. Quality & Quantity, 48(5), 2703–2718. [Google Scholar]
  52. Peteet B, Roundtree C, Dixon S, Mosley C, Miller-Roenigk B, White J, . . . McCuistian C (2020). ‘Codeine crazy:’a content analysis of prescription drug references in popular music. Journal of Youth Studies, 1–17. doi: 10.1080/13676261.2020.1801992 [DOI] [Google Scholar]
  53. Petty RE, & Cacioppo JT (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York: Springer-Verlag. [Google Scholar]
  54. Porcu M, & Giambona F (2017). Introduction to latent class analysis with applications. The Journal of Early Adolescence, 37(1), 129–158. doi: 10.1177/0272431616648452 [DOI] [Google Scholar]
  55. Potter WJ, & Riddle K (2007). A content analysis of the media effects literature. Journalism & Mass Communication Quarterly, 84(1), 90–104. [Google Scholar]
  56. Primack BA, Dalton MA, Carroll MV, Agarwal AA, & Fine MJ (2008). Content analysis of tobacco, alcohol, and other drugs in popular music. Archives of Pediatrics and Adolescent Medicine, 162(2), 169–175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Raykov T, & Marcoulides GA (2018). A Course in Item Response Theory and Modeling with Stata. College Station, TX: Stata Press. [Google Scholar]
  58. Reise S, Ainsworth A, & Haviland M (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current directions in psychological science, 14(2), 95–101. [Google Scholar]
  59. Riff D, Lacy S, Watson B, & Fico F (2019). Analyzing media messages: Using quantitative content analysis in research (Fourth ed.). New York: Routledge. [Google Scholar]
  60. Russell CA, Russell DW, & Grube JW (2009). Nature and impact of alcohol messages in a youth-oriented television series. Journal of Advertising, 38(3), 97–112. doi: 10.2753/JOA0091-3367380307 [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Shrout PE, & Fleiss JL (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. [DOI] [PubMed] [Google Scholar]
  62. Singh J (2004). Tackling measurement problems with Item Response Theory: Principles, characteristics, and assessment, with an illustrative example. Journal of Business Research, 57(2), 184–208. doi: 10.1016/S0148-2963(01)00302-2 [DOI] [Google Scholar]
  63. Skalski PD, Neuendorf KA, & Cajigas JA (2017). Content analysis in the interactive media age. In Neuendorf KA (Ed.), The content analysis guidebook (pp. 201–242). Thousand Oaks, CA: Sage. [Google Scholar]
  64. StataCorp. (2019). Stata: Release 16 Statistical Software. College Station, TX: StataCorp LP. [Google Scholar]
  65. Stern S, & Morr L (2013). Portrayals of teen smoking, drinking, and drug use in recent popular movies. Journal of Health Communication, 18(2), 179–191. doi: 10.1080/10810730.2012.688251 [DOI] [PubMed] [Google Scholar]
  66. Streiner DL (1995). Learning how to differ: Agreement and reliability statistics in psychiatry. The Canadian Journal of Psychiatry, 40(2), 60–66. [PubMed] [Google Scholar]
  67. Uebersax JS (1992). Modeling approaches for the analysis of observer agreement. Investigative Radiology, 27(9), 738–743. [DOI] [PubMed] [Google Scholar]
  68. Ullman JB, & Bentler PM (2012). Structural equation modeling. In Weiner IB (Ed.), Handbook of Psychology (Second ed., pp. 661–690): Wiley. [Google Scholar]
  69. Underwood JM, Brener N, Thornton J, Harris WA, Bryan LN, Shanklin SL, . . . Chyen D (2020). Overview and methods for the Youth Risk Behavior Surveillance System—United States, 2019. MMWR supplements, 69(1), 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Vassallo AJ, Kelly B, Zhang L, Wang Z, Young S, & Freeman B (2018). Junk Food Marketing on Instagram: Content Analysis. JMIR Public Health and Surveillance, 4(2). doi: 10.2196/publichealth.9594 [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Vercammen KA, Koma JW, & Bleich SN (2019). Trends in energy drink consumption among US adolescents and adults, 2003–2016. American Journal of Preventive Medicine, 56(6), 827–833. [DOI] [PubMed] [Google Scholar]
  72. Zickar M, & Highhouse S (1998). Looking closer at the effects of framing on risky choice: An item response theory analysis. Organizational Behavior and Human Decision Processes, 75(1), 75–91. [DOI] [PubMed] [Google Scholar]
  73. Zytnick D, Park S, & Onufrak SJ (2016). Child and caregiver attitudes about sports drinks and weekly sports drink intake among US youth. American Journal of Health Promotion, 30(3), e110–e119. [DOI] [PubMed] [Google Scholar]
