Abstract
Reliable and valid message evaluation plays a central role in effective health communication and message effects research. The authors have employed a message testing protocol to acquire valid and reliable message evaluation results efficiently: (a) use multiple messages, (b) recruit evaluators from the target population, (c) use valid and reliable effectiveness measures, (d) expose each evaluator to multiple messages, and (e) ensure enough evaluations per message. Two secondary analyses of anti-tobacco message evaluation studies provide evidence for the reliability and validity of points (d) and (e). Seven studies in which adult smokers evaluated the effectiveness of various anti-smoking campaign messages were examined. The first analysis shows that the position in which a message appears has little or no impact on its evaluation, supporting the validity of the multiple-exposure design. The second analysis suggests that 25 evaluations per message achieve a fair balance between accuracy and efficiency.
Keywords: Message evaluation, message effects, experiment design, survey efficiency, positional effects
Communication and message-based interventions are crucial in efforts to improve public health (Hornik, 2002; Noar, 2009; Wakefield, Loken, & Hornik, 2010). Theory- and evidence-based message design should be the first consideration for persuasive campaigns. However, designing a message almost always involves choosing among multiple instances of content or format features that can address largely the same beliefs. Ascertaining the effectiveness of public health messages is crucial to avoid expending campaign funds on, and jeopardizing robust theory testing with, weak messages that fail to change behaviors. Efficient procedures for evaluating the effectiveness of persuasive messages would be a useful tool for those designing public health campaigns, and certainly for researchers seeking to evaluate various types of message effects as they undertake theory testing.
Jackson (1992) argued for the importance of message effects research and recommended principles for study design and analyses. Interest in message effects research methods to inform evidence-based message design has recently revived (O’Keefe, 2015; Reeves, Yeykelis, & Cummings, 2015; Slater, Peter, & Valkenburg, 2015). Using secondary analyses, this manuscript addresses two questions directly pertinent to efficient, reliable, and valid message testing: (1) Will a multiple-exposure design distort the evaluation process by introducing bias for messages that appear in later positions (vs. the first)? And (2) how many evaluations are needed to achieve a stable estimate of the population value of a message’s persuasiveness?
A Proposed Message Evaluation Protocol
The most direct way to evaluate the effectiveness of a message is to conduct a test in the field with the appropriate target population and outcomes, comparing the messages on the outcome variables of interest against some appropriate control. This “gold standard” strategy requires running at least a small-scale campaign in the field and then evaluating its success in changing behavior, which can take months or longer, rendering it impractical and expensive. Researchers and campaign designers need to know about the effectiveness of messages in advance of testing the final campaign and before deploying campaign resources, which makes long-term observation of behavioral change infeasible. The gold standard therefore needs to be replaced by a simpler, more efficient procedure, even if that procedure is less than ideal (Cappella & Kim, 2017).
Extensive message testing in our research group has centered around quantitative ratings that follow an explicit but straightforward protocol (Kim & Cappella, 2019).
Multiple stimuli: Use a large pool of messages to be evaluated that meet the needs of the proposed campaign or research project and exhibit nontrivial variability. This will allow clearer differentiation between stronger and weaker variants than using a single stimulus, so that researchers can address the case-category confound in theory testing (Jackson, 1992) and provide campaign designers some selection among message variants of equivalent effectiveness (thereby addressing the wear-out problem).
Evaluators from the target population: Recruit evaluators who are members of the appropriate target population.
Efficient outcome measures: Use valid and reliable measures of the core outcomes that are also efficiently administered.
Multiple-exposure design: Each evaluator is exposed to multiple messages randomly selected from the pool of messages and presented in random order. This increases the efficiency of message testing by increasing the number of evaluations per message (a minimal sketch of such an assignment appears after this list).
Stable estimates: Assign enough evaluations per message to ensure the stability of outcomes, while avoiding the inefficiencies of oversampling.
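To make elements (d) and (e) concrete, the following minimal Python sketch randomly assigns each evaluator a set of messages in random order and tracks how many evaluations each message accrues. The function and identifiers are hypothetical; the actual studies may additionally have constrained the assignment so that the evaluations per message stayed within a target range (see Table 1).

```python
import random
from collections import Counter

def assign_messages(message_ids, n_evaluators, k_per_evaluator, seed=0):
    """Randomly assign k messages, in random presentation order, to each evaluator.

    Returns evaluator -> ordered list of message ids, plus a Counter of how many
    evaluations each message receives (element (e): check this count is large enough).
    """
    rng = random.Random(seed)
    assignments = {}
    counts = Counter()
    for evaluator in range(n_evaluators):
        # sample() draws k distinct messages; the returned order is already random
        chosen = rng.sample(list(message_ids), k_per_evaluator)
        assignments[evaluator] = chosen
        counts.update(chosen)
    return assignments, counts

# Illustration loosely mirroring one video study: 32 messages, 427 evaluators, 4 messages each
assignments, counts = assign_messages(range(32), n_evaluators=427, k_per_evaluator=4)
print(min(counts.values()), max(counts.values()))  # evaluations per message (427 * 4 / 32 ≈ 53 on average)
```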
The first three elements of this protocol have been addressed elsewhere (Cappella, 2018; Cappella & Kim, 2017; Kim & Cappella, 2019). For element (c), specific outcome measures of perceived message effectiveness (PME) and perceived argument strength (PAS) have been widely tested for validity and reliability, even against bio-behavioral outcomes (Bigsby, Cappella, & Seitz, 2013; Fishbein et al., 2002; Zhao & Cappella, 2016; Zhao, Strasser, Cappella, Lerman, & Fishbein, 2011). The measures have been criticized by some (O’Keefe, 2018) and defended by many others (Davis & Duke, 2018; Dillard & Ha, 2016; Dillard, Shen, & Vail, 2007; Nabi, 2018; Noar, Barker, Bell, & Yzer, 2018).
The focus of our research here is on elements (d) and (e). This efficient protocol assumes that evaluating a message in position 1, 2, 3, etc. in a presentation sequence is no different from evaluating a single message. And if the number of evaluations required per message is prohibitively large, then the whole idea behind the value of using multiple stimuli will be undermined.
In what follows, the issues of message position and stable sample size are addressed empirically:
RQ1: Does the position in which a message appears within a sequence of messages affect its perceived message effectiveness or perceived argument strength (as appropriate)?
RQ2: At what point does the size of the sample of message evaluators produce stable ratings of message effectiveness, consistent with the ratings provided by the larger population?
Data were drawn from seven studies evaluating various anti-smoking campaign messages. The variability of messages in our sample (some audio-visual PSAs, others brief text-only ones) is relatively large, and the evaluators are generally representative of the adult smoking population in the United States.
Analysis 1: Positional effects in message evaluation
Multiple-exposure design allows efficient data collection but raises questions about systematic biases in the results. If one person evaluates multiple messages, does the position in which the message appears affect its evaluation? Primacy or recency effects, much examined in decision making and memory research, suggest that the message position may affect the weight given to the message. Later messages may receive higher evaluation scores as the evaluators become more experienced (O’Connor & Cheema, 2018) or as the persuasiveness of the evaluation set accumulates. Significant positional effects have been observed in pop culture (e.g., Bruine de Bruin, 2005) and sports contests (e.g., Scheer & Ansorge, 1975), with scoring advantage for later performances over earlier ones.
The potential bias can be addressed by randomly selecting messages from a pool of related ones and presenting them in a random order unique for each respondent. The aggregate evaluation for each message would be averaged across positions and contexts in which it appeared. However, even though the design frees the aggregate evaluation from positional effects, it is unclear whether the process that the protocol dictates distorts the evaluation of the message. That is: If each person rated only one message, would those ratings differ from the same message rated in the context of other messages where the evaluation scores are averaged across presentational positions? Despite random assignment, it is possible through bad luck and random sampling that a particular message is unevenly distributed across positions leaning heavily toward one or another position. In such a scenario, a bias in evaluation favoring early or late presentations would undermine the aggregate scores for that test. Would such a (unlucky) distortion matter to the message evaluation process? We provide empirical evidence about the biasing of evaluation of a message when it appears early, middle, or late in the sequence of messages to be evaluated. This test is NOT a test of random order versus non-random order; it is a test of the biasing effect of message position in the sequence if such bad luck intervenes.
Because the evaluation of a first-position message is the one “uncontaminated” by surrounding messages, the key comparison is between a message appearing in the first vs. any other position. If scores differ distinctly across positions, then it is not reasonable to treat the scores from positions other than the first as representative of the message’s evaluation.
Method
Data.
This is a secondary data analysis of seven tobacco control studies conducted previously with IRB approval from the University of Pennsylvania. All seven studies tested the effectiveness of anti-smoking messages in video and textual form in persuading adult daily smokers, and all used the protocol described previously. Four studies tested video public service announcements (PSAs), and three tested text-only paragraphs containing anti-smoking arguments. The seven studies included 3,441 adult smokers who reported having smoked more than 100 cigarettes in their lifetime and currently smoking at least 5 cigarettes a day. See Table 1 for the descriptive statistics of the participants.
Table 1.
Descriptive statistics of the seven message evaluation studies
| | ARG1 | ARG2 | ARG3 | PSA1 | PSA2 | PSA3 | PSA4 |
|---|---|---|---|---|---|---|---|
| Study design | | | | | | | |
| No. of participants | 300 | 487 | 300 | 427 | 566 | 656 | 705 |
| No. of all messages in the study | 99 | 100 | 33 | 32 | 60 | 40 | 68 |
| No. of messages evaluated per participant | 12 | 8 | 6 | 4 | 4 | 4 | 4 |
| No. of evaluators per message: mean (SD) [minimum–maximum] | 36.4 (0.54) [36-38] | 39.0 (5.94) [25-51] | 54.5 (6.29) [44-69] | 53.4 (7.40) [38-68] | 37.7 (5.18) [24-50] | 65.5 (8.02) [41-78] | 41.5 (6.85) [22-55] |
| Message length¹: mean (SD) [minimum–maximum] | 23.24 (5.73) [12-38] | 19.85 (3.72) [6-25] | 16.00 (3.38) [8-22] | 29.72 (3.64) [10-32] | 30.60 (3.92) [28-60] | 29.78 (0.50) [28-31] | 29.97 (1.90) [15-32] |
| Demographics | | | | | | | |
| Age | 36.84 (12.71) | 46.28 (12.53) | 42.65 (12.88) | 56.08 (11.75) | 49.57 (11.01) | 45.82 (12.06) | 51.42 (13.65) |
| Gender: Female (%) | 49.7% | 46.4% | 53.7% | 45.9% | 53.2% | 48.6% | 44.0% |
| Non-Hispanic white (%) | 73.0% | 77.2% | 80% | 85.5% | 79.9% | 71.5% | 82.1% |
| Non-Hispanic black (%) | 18.3% | 8.2% | 5.7% | 6.3% | 8.1% | 12.2% | 8.1% |
| Other, including Hispanics (%) | 8.7% | 14.6% | 5.3% | 8.2% | 12.0% | 16.3% | 9.8% |
| Education (Median) | High-school/GED | High-school/GED | Some college | Some college | Some college | High-school/GED | Some college |
| Need for cognition | 2.70 (.56) | 2.81 (.66) | 3.44 (.69) | 3.20 (.64) | 3.60 (.75) | 3.41 (.77) | 3.50 (.76) |
| Smoking history | | | | | | | |
| Stage of change | 7.96² (2.92) | 4.53 (2.93) | 5.85 (2.85) | 4.91 (2.99) | 5.75 (2.78) | 5.31 (2.98) | 5.18 (2.99) |
| Fagerström Test of Nicotine Dependence | n/a | 4.40 (2.34) | 4.87 (2.15) | 4.10 (2.14) | 4.34 (2.38) | 4.15 (2.31) | 3.95 (2.15) |
Note. Numbers represent the mean and standard deviation (in parentheses), unless otherwise annotated.
¹ Unit of message length is number of words for textual arguments, and seconds in video files for PSAs.
² In ARG1, stage of change was measured using slightly different wording: “On a scale of 0 to 10, how interested are you in quitting smoking?” 0 = not at all interested, 10 = very interested.
The four video studies (PSA1, PSA2, PSA3, and PSA4) tested a total of 199 professionally made, 15- to 30-second PSAs.¹ The studies recruited a nationally representative sample of adult smokers through the KnowledgePanel (formerly Knowledge Networks) web-based panel. Each participant watched and evaluated four PSAs in terms of perceived message effectiveness (PME; Bigsby et al., 2013).
The three text studies (ARG1, ARG2, and ARG3) tested 232 arguments: 99, 100, and 33, respectively.² All arguments were extracted from existing anti-smoking PSAs using procedures developed in previous studies (Lee, Cappella, Lerman, & Strasser, 2011; Zhao et al., 2011); the source PSAs included the 199 used in the above-mentioned PSA studies. ARG1 used a kiosk installed in a mall, and the others used the KnowledgePanel. The participants in the three studies read 12, 8, and 6 arguments, respectively, and evaluated each argument using measures of perceived argument strength (PAS; Zhao et al., 2011).
Measures.
The independent variable was message position, that is, where in the sequence of messages (four to 12 messages long, depending on the study) a message was presented. Message position was treated as an ordinal variable, and messages shown at each later position were compared to those shown first.
The dependent variable was an evaluation of message persuasiveness measured by PME for video messages and PAS for textual arguments. PME measured how strongly the participants agreed with four statements including “This ad was convincing” and “The ad put thoughts in my mind about quitting smoking” (Bigsby et al., 2013).³ PAS measured how strongly participants agreed with nine statements such as “The statement is a reason for quitting smoking that is believable/convincing/a strong reason to quit smoking” or “The statement put thoughts in my mind about quitting smoking/wanting to continue smoking” (Zhao & Cappella, 2016).
Although the analysis focused on the effect of message position on the evaluation of message persuasiveness, individual characteristics of the participants may also affect how one evaluates anti-smoking campaign messages. These were used as covariates in the analyses.⁴
Analyses.
Each respondent provided multiple evaluations of message persuasiveness (range: 4-12), and each message was shown to multiple respondents (range: 22-78). The unit of analysis was the individual message evaluation (n = 18,425). Because the evaluations were not independent, being doubly nested within respondents as well as messages, cross-classified models were fitted using multilevel mixed-effects linear regression in Stata.
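The cross-classified structure (evaluations crossed within respondents and messages) can be approximated in Python with statsmodels, although the original analyses were run in Stata. The sketch below assumes a long-format data frame with hypothetical column names (score, position, respondent_id, message_id, and covariates); crossed random effects are specified as variance components under a single constant grouping variable.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per evaluation, with columns such as
# score (PME or PAS), position (1..k), respondent_id, message_id, and covariates.
df = pd.read_csv("evaluations.csv")  # placeholder file name
df["const_group"] = 1  # a single group lets both random effects enter as crossed variance components

vc = {"respondent": "0 + C(respondent_id)", "message": "0 + C(message_id)"}
model = smf.mixedlm(
    "score ~ C(position) + age + C(gender) + C(race) + C(education) + need_for_cognition",
    data=df,
    groups=df["const_group"],
    vc_formula=vc,
)
result = model.fit()  # note: can be slow with thousands of respondents and messages
print(result.summary())
```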
The analyses were completed for each of the seven studies; in addition, a meta-analytic approach was used to calculate and combine the effect sizes (Cohen’s d and f²) of message position on evaluation scores to reach a general conclusion. Another meta-analytic approach, using individual participant data rather than aggregate data (Cooper & Patall, 2009), was also employed. See the online supplement for details of this analysis.
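For the fixed-effect combination itself, a minimal sketch (with made-up inputs, not the studies’ reported values) might look like the following; it also includes the f²-to-r conversion used when translating a combined effect size into a correlation.

```python
import numpy as np

def fixed_effect_combine(effect_sizes, sample_sizes):
    """Fixed-effect combination of per-study effect sizes, weighting each study by n - 3."""
    es = np.asarray(effect_sizes, dtype=float)
    w = np.asarray(sample_sizes, dtype=float) - 3.0
    return float(np.sum(w * es) / np.sum(w))

def f2_to_r(f2):
    """Convert Cohen's f-squared to the equivalent correlation via r^2 = f^2 / (1 + f^2)."""
    return float(np.sqrt(f2 / (1.0 + f2)))

# Made-up inputs for illustration only (not the studies' reported values)
print(fixed_effect_combine([0.08, -0.02, 0.03], [300, 705, 487]))
print(round(f2_to_r(0.01), 2))  # 0.1
```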
Results
Unconditional models were examined first for all seven studies to calculate intra-class correlations. Message-level clustering explained 6.8% to 9.3% of the variance in evaluation scores, and individual-level clustering explained 42.3% to 54.7% of the variance.
Then full models including message position and individual differences were examined. Message position exerted a significant overall effect in two of the seven studies (ARG1, PSA2) and a marginally significant effect in one study (PSA1). In ARG1, the last (12th) message was rated 0.25 points higher on PAS (out of 5) than the first one (χ²(11) = 56.06, p < .001). In PSA2, messages shown in the fourth position received 0.11-point higher PME evaluations than those shown first (χ²(3) = 8.49, p = .04). In PSA1, messages in the fourth position received marginally higher PME evaluations, a 0.09-point difference (also out of 5) from the first position (χ²(3) = 6.97, p = .07). For the remaining four studies, the overall effects of message position were not significant (all χ²s < 9.50, all ps > .10). Figure 1 shows the effect of message position for the three studies mentioned above.
Figure 1.

Effects of message position on evaluation results. Only results with p < .10 are shown with 95% CIs. Control variables include age, gender, race, education, need for cognition, nicotine dependence and contemplation ladder.
In a meta-analytic approach, Cohen’s ds were calculated based on the standardized mean difference in predicted margins of message evaluation scores at the first and last positions; ds ranged from −.02 (PSA4) to .08 (ARG1). A fixed-effect approach, using n − 3 as the weight, yielded d = .02, which is quite small. Cohen’s f² based on the variance explained (Selya, Rose, Dierker, Hedeker, & Mermelstein, 2012) ranged from .001 (ARG3) to .016 (ARG1). A fixed-effect meta-analysis yielded f² = .004, which translates to r = .07.
Discussion
Analysis 1 suggests that a multiple-exposure design can achieve greater efficiency by reducing the total sample size without introducing substantial differences into the evaluation results compared to a protocol in which each respondent rates only one message. Across seven different studies, each participant evaluated multiple messages randomly placed in different positions. The same message shown in different positions yielded very similar results.
Although two studies showed significant positional effects, every study had more than 1,700 evaluations, which provides more than sufficient statistical power to detect a very small effect. Across the seven studies, the combined effect size was f² = .004, smaller than a “small effect” according to the guidelines suggested by Cohen (1988; f² = .01) – although some meta-analyses of mass media campaigns have yielded small but significant effect sizes (e.g., Snyder et al., 2004).
Even when chance creates a preponderance of evaluations from earlier or later in the sequence, such a condition will have little to no effect on the aggregated evaluations observed overall. Similarly, a single-exposure design (i.e., first-position ratings) yields ratings no different from a design in which the average rating is based on messages from all positions in the sequence. If first-position scores were quite different from ratings when the target appeared in other positions, then the aggregated absolute message score would be biased. No evidence of such a bias was found.
Analysis 2: Choosing an optimal sample size to evaluate messages
To achieve high-quality results in message testing, researchers often have multiple people evaluate the message and create aggregate scores. This approach inevitably faces a trade-off between accuracy and efficiency. As more people evaluate a message, their individual differences (i.e., the noise) will become less consequential, yielding an aggregate score that is closer to the true value. However, having more evaluators requires more financial and time resources, undermining efficiency.
Researchers have employed varied numbers of evaluators in health message evaluation – from 18 (Durkin, Biener, & Wakefield, 2009) to ~120 evaluators per message (Nonnemaker, Farrelly, Kamyab, Busey, & Mann, 2010; Parvanta et al., 2013). Some suggestions have been made for using expert raters (e.g., 10-15 to establish cutoff scores for standardized tests; Hurtz & Hertz, 1999). However, message effects research often involves a general population or a subpopulation that shares some characteristics (e.g., current smokers). The potential heterogeneity among the evaluators may require a larger sample.
The present analysis aims to determine the minimum number of evaluators required to achieve accurate assessments of message persuasiveness in the larger population. We employed bootstrap methods with varying samples of adult smokers who had rated a wide range of anti-smoking messages to ascertain estimation accuracy (in other words, whether smaller samples reproduce the evaluation results generated by a large group of evaluators).
Method
Data.
The data were drawn from the same seven tobacco control studies as in Analysis 1. The average number of evaluators per message varied across the seven studies because the numbers of participants, messages, and exposures per participant differed. The average number of evaluators per message ranged from 36.4 to 65.6 participants, with minimums ranging from 22 to 44 (see Table 1).
Measures and analyses.
Similar to Analysis 1, PAS for textual arguments and PME for video PSAs were examined. To examine the effect of the number of evaluators per message, message-level aggregated evaluation scores were used as the unit of analysis.
For each of the seven studies, multiple bootstrap samples were drawn to explore the effect of sample size. Random subsamples of the dataset were drawn with replacement, 2,000 times for each sample size, with sample sizes moving from five evaluators per message up to the minimum number of evaluators per message in the study, in increments of two. For example, in PSA1 the bootstrap sample size ranged from 5 to 37 in 17 steps, resulting in 2,000 × 17 = 34,000 bootstrap samples. Message evaluation scores derived from the bootstrap samples were then compared to the original data using several metrics.
First, Pearson correlation coefficients (r) between message-level aggregated evaluation scores derived from the bootstrap samples and the original data were calculated. The 2,000 rs for each sample size were Fisher r-to-z transformed, averaged, and transformed back to r. Second, mean differences between aggregate scores derived from the bootstrap samples and the original data were examined. For each bootstrap sample, the absolute difference between a message’s aggregated evaluation score in the original data and in the bootstrap sample was calculated and then averaged across all messages, yielding a difference score representing how far that bootstrap sample is from the original data. These difference scores were averaged over the 2,000 bootstrap samples with the same sample size.
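A minimal sketch of these two mean-based checks, assuming a long-format data frame with hypothetical columns message_id and score, could proceed as follows: for a given number of evaluators per message, resample that many evaluations with replacement within each message, aggregate to message means, and compare them with the full-data means via a Fisher-z-averaged Pearson r and a mean absolute difference.

```python
import numpy as np
import pandas as pd

def bootstrap_mean_metrics(df, n_per_message, n_boot=2000, seed=0):
    """For each replicate, resample n_per_message evaluations with replacement within
    every message, aggregate to message means, and compare with the full-data means.

    df is assumed to have columns message_id and score (PME or PAS).
    Returns the Fisher-z-averaged Pearson r and the mean absolute difference."""
    rng = np.random.default_rng(seed)
    full_means = df.groupby("message_id")["score"].mean()
    scores_by_message = {m: g.to_numpy() for m, g in df.groupby("message_id")["score"]}
    zs, abs_diffs = [], []
    for _ in range(n_boot):
        boot_means = pd.Series(
            {m: rng.choice(s, size=n_per_message, replace=True).mean()
             for m, s in scores_by_message.items()}
        )
        r = np.corrcoef(full_means.loc[boot_means.index], boot_means)[0, 1]
        zs.append(np.arctanh(r))                          # Fisher r-to-z
        abs_diffs.append((boot_means - full_means).abs().mean())
    return float(np.tanh(np.mean(zs))), float(np.mean(abs_diffs))

# Sweep sample sizes (e.g., 5 up to the study's minimum evaluators per message, in steps of 2):
# for n in range(5, 38, 2):
#     print(n, bootstrap_mean_metrics(df, n))
```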
These two methods are based on the mean evaluation scores; however, message evaluation often aims to determine the relative evaluation of messages – e.g., What is the strongest (or the weakest) message among a group of messages? Therefore, additional methods were used based on the rank order of evaluation results.
First, rank-order correlations (Kendall’s tau-a) between the aggregated evaluation scores derived from bootstrap samples and original data were calculated, which were then averaged across the 2,000 bootstrap samples with the same sample size.
Second, we examined how well the bootstrap samples allocated messages to the top- and bottom-evaluated groups (based on quartiles and thirds) without error. A misspecification error was defined as assigning (a) an originally top-scoring message to the bottom-scoring group, or (b) an originally bottom-scoring message to the top-scoring group. The proportion of misspecified messages and the proportion of bootstrap samples that produced no misspecification errors were examined.
Third, the bootstrap samples’ ability to identify whether messages belonged to the top-20 group in the original data was assessed, using Cohen’s kappa (κ) as a reliability measure. A reliable subgroup should produce an acceptably high value, e.g., κ > .60 (Landis & Koch, 1977). See the online supplement for more details on the analysis methods and results.
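The rank-based criteria could be computed along the following lines for a single bootstrap replicate. The inputs are assumed to be pandas Series of message means indexed by message id (e.g., produced by a bootstrap draw like the one sketched above); note that SciPy’s kendalltau returns tau-b, which equals tau-a when there are no ties.

```python
import pandas as pd
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

def rank_metrics(full_means, boot_means, top_n=20):
    """Rank-based checks for one bootstrap replicate: Kendall correlation,
    top/bottom-quartile misspecification counts, and kappa for top-N membership.

    full_means and boot_means are pandas Series of message means indexed by message id."""
    full = full_means.sort_index()
    boot = boot_means.sort_index()

    # Rank-order agreement (SciPy computes tau-b, which equals tau-a when there are no ties)
    tau, _ = kendalltau(full, boot)

    # Misspecification: an originally top-quartile message falling into the bootstrap
    # bottom quartile, or an originally bottom-quartile message rising to the top quartile
    q = len(full) // 4
    top_to_bottom = len(set(full.nlargest(q).index) & set(boot.nsmallest(q).index))
    bottom_to_top = len(set(full.nsmallest(q).index) & set(boot.nlargest(q).index))

    # Reliability of reproducing membership in the top-N messages
    in_top_full = full.index.isin(full.nlargest(top_n).index).astype(int)
    in_top_boot = full.index.isin(boot.nlargest(top_n).index).astype(int)
    kappa = cohen_kappa_score(in_top_full, in_top_boot)

    return tau, top_to_bottom, bottom_to_top, kappa
```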
Results
Analyses based on mean evaluation scores.
Figure 2 shows the Pearson correlation coefficients between the original and bootstrapped aggregated evaluation scores, and how they change when two more evaluators are added. The rs increased as the number of evaluators per message increased, but in increasingly smaller increments. With 21 evaluators per message, the average rs were over .80 for all seven studies. Beyond that point, the coefficients appear to reach a saturation point – the changes become quite small, at or below .01.
Figure 2.

Correlation between original data and bootstrap samples. Solid lines show averaged correlation coefficient across 2,000 bootstrap samples. Dashed lines show change in average correlation coefficients when compared to the bootstrap samples with two fewer evaluators (starting from n = 7).
With 17 or more evaluators, the subgroup of evaluators reproduced the original evaluation scores with quite small errors, with differences less than 20% of SDs observed in the original data. After 21 evaluators, the changes approached zero, suggesting that adding more evaluators would not increase precision further.
Analyses based on rank orders.
Similar patterns emerged for the rank-order correlation coefficients. At 23 evaluators, all seven studies yielded coefficients higher than .60; change scores fluctuated more than in the case of the Pearson correlation coefficients but stabilized below .02 after that point.
The misspecification approach focused on the ability of a smaller group of evaluators to correctly select the most effective messages, and avoid the least effective ones, in a set of messages. Figure 3 shows misspecification errors in which top-quartile messages were incorrectly assigned to the bottom quartile. From a message designer’s perspective, this error means that a truly effective message would fail to be considered as a final product. With 23 evaluators, the proportion of bootstrap samples without any errors reached 90%, and the proportion of originally top-quartile messages misspecified into the bottom quartile was below 0.5% in every study. Two studies (ARG1 and ARG2) showed noticeably lower accuracy than the others; for the other five studies, 17 evaluators yielded zero-error bootstrap sample rates above 90%. A similar analysis examined the misspecification of the lowest-quartile messages into the highest quartile. Both misspecification analyses suggest that beyond 25 evaluators, having more may not add much value.
Figure 3.

Top vs. bottom quartile specification: Misspecification to bottom quartile. Solid lines show proportion of originally top-quartile messages that are misspecified in bottom quartile; dashed lines show the proportion of bootstrap samples (out of total 2,000) that had no misspecification errors.
When a tougher criterion was employed – misspecification into top and bottom thirds, which are closer to each other – the proportion of misspecification was expectedly larger than in the quartile case. However, with 25 evaluators, the results suggest there would be less than one error when testing 100 messages. Another conservative analysis assessed the reliability of the rater subgroup by examining Cohen’s kappa for reproducing the top 20 messages. With 25 evaluators, the changes in kappa approached zero in six studies, and five of the seven studies yielded κ > .60. See the online supplement for more detailed results.
Discussion
Analysis 2 suggests that having 25 evaluators per message achieves a fair balance between accuracy and efficiency. At that point, many critical cut points were met (e.g., Pearson correlation > .80, rank-order correlation > .60), and adding more evaluators did not enhance accuracy much, as shown by the very low change scores in correlation coefficients and mean difference scores. When the focus was on reproducing the rank order of messages within a study, most studies also yielded > 90% zero-error trials and κ > .60 with 25 evaluators per message.
A wide variety of criteria for the fit between sample and population ratings could be used and we tried several. One subtle but important one is misspecification between the top- and bottom-ranked messages (e.g., top versus bottom quartile). In selecting messages for a campaign (or for research testing strong vs. weak versions), one would not want to mistakenly choose a message as a top scoring message when in fact it is a lowest scoring one, or vice versa. These confusions would be disastrous for the validity of the results. This criterion is one that must be met if smaller samples of evaluators are to be employed.
Two studies (ARG1 and ARG2) showed relatively lower accuracy than the other studies at 25 evaluators per message in correctly distinguishing bottom- from top-ranked messages. It should be noted that the messages from these studies had a much narrower distribution of evaluation scores, especially among highly evaluated messages. In the other studies, the median evaluation score was about two standard deviations below the score of the highest-ranked message; in ARG1 and ARG2, the mid-ranked messages were only about one standard deviation away from the highest-ranked ones. As a result, these studies were more likely to yield errors in reproducing rank orders, even with quite a large sample size. In a similar light, O’Keefe’s (2018) meta-analysis of perceived versus actual message effectiveness found that the predictive validity of perceived message effectiveness was weaker when the variance was lower.
The final decision on sample size in message evaluation studies should be based on the expected distribution of evaluations as well as how much error one can tolerate. If the messages are expected to be highly similar to each other in quality and one still wants to correctly determine the rank order among the messages, more evaluators should be used; we would recommend at least 30, though the context of application is important to consider. On the other hand, when one has good reason to believe that the messages differ from each other considerably, and/or one can tolerate slightly larger error, fewer evaluators will yield sufficiently good discrimination between the messages. When assigning messages to the top versus bottom third, less than 5% of the messages were misspecified with as few as 17 evaluators per message. This is equivalent to making about 1.5 errors when one attempts to specify 100 messages into three groups according to their evaluation scores. If one can tolerate this level of error, 17 evaluators per message will be an efficient and reasonably effective sample size.
The different criteria we used to assess the fit between the larger and smaller samples reflect different goals for message analysis. In selecting messages for a health campaign, it would be important not to misspecify a high performing message as low, or even worse, the opposite. So, the assignment to the highest and lowest performing groups is especially important in that context. Where more fine-grained theoretical tests are the target, then the message’s rating score is more consequential than its relative rank. The sample sizes we identified as the minimum necessary varied somewhat as a function of the criterion employed and therefore the testing context.
General Discussion and Conclusion
The larger the sample a researcher can recruit in a message evaluation study, the more valid the results will be. However, considering limited financial resources and time, it is important to find a good balance between validity and efficiency. This manuscript examined a message evaluation protocol that involves quantitative methods (Cappella & Kim, 2017; Kim & Cappella, 2019) and adduced empirical evidence for the reliability and validity of the suggested protocol regarding two design points: (1) exposing an evaluator to multiple messages, and (2) ensuring enough evaluations per message for stability and precision.
Analysis 1 showed that there is a minimal effect of message position when the evaluators are exposed to multiple messages; having the same message appear in different positions did not yield substantial differences in the observed evaluation score. This alleviates the concern that multiple-exposure design might introduce a systematic bias. One can significantly reduce the number of required evaluators because one evaluator can evaluate multiple messages randomly presented. It is not necessary to require only one message per respondent or to ignore evaluations of messages occurring after the initial one.
Analysis 2 showed that having 25 evaluators per message produces efficient yet accurate evaluation results compared to those produced by a larger number of evaluators, although the optimal number may vary depending on the nature of the messages used in the study (e.g., the expected distribution of ratings). The proposed protocol allows researchers to achieve substantially comparable message evaluations even when recruiting a smaller group of evaluators.
Reliable and valid message evaluation is a crucial issue in the study of message effects, whether from the point of view of theory testing or campaign design. By employing a multiple-exposure design and setting an optimal sample size for evaluators, the message evaluation protocol examined here can maximize efficiency without substantially undermining accuracy.
The results presented are subject to a variety of limitations. Although a number of different datasets contributed to the overall conclusions, all messages were about tobacco control and were relatively brief in length. Whether the results are generalizable to other topics or to longer textual or audio-visual materials awaits additional research. All the conclusions are tied to specific measures of perceived message effectiveness and argument strength. We offer no conclusion for other message judgments such as engagement, coherence, comprehensibility, and so on. Despite these limitations, we expect that messages that are relatively brief and engineered to be persuasive would show results in line with those presented here.
Messages and their effects are the sine qua non of much of media effects research and studies of persuasive influence. The ability to evaluate the effectiveness of a large set of messages allows a range of research on the components of message persuasiveness, the design of effective messages, and the evaluation of messages that should and should not appear in real-world communication campaigns. Advancing theory and empirical research about messages requires the ability to evaluate many cases indicative of underlying categories and not simply finding one or two examples that work for the category. The result is more robust conclusions about real-world messages with a stronger basis for generalization at least to the domain from which the messages have been selected.
Supplementary Material
Acknowledgments
Funding
This work was supported by the National Cancer Institute at the National Institutes of Health [P20 CA095856, R01 CA160226].
Biographies
Author biography
Minji Kim (Ph.D., University of Pennsylvania) is a postdoctoral fellow at the Center for Tobacco Control Research and Education at the University of California, San Francisco. Her research focuses on targeted and tailored health communication, message effects, and message testing methods.
Joseph N. Cappella (Ph.D., Michigan State University) is the Gerald R. Miller professor of communication at the Annenberg School for Communication at the University of Pennsylvania. His research has focused on the effects of messages on the public in the political and health domains. He is a Fellow of the International Communication Association (ICA), the National Communication Association, and past president of ICA.
Footnotes
1. The four studies tested 32, 60, 40, and 68 PSAs, respectively. One PSA (“critics-cinema” by Truth Initiative, formerly American Legacy Foundation) was included in two studies (PSA2 and PSA4); the two instances were treated separately, resulting in 200 messages in total.
2. Textual arguments were short paragraphs averaging 27.08 words (SD = 10.51). One example based on the Truth campaign reads, “Sodium hydroxide, a caustic chemical found in cigarettes, is found in many hair removal products. Tell others the facts about smoking.”
3. Although PSA1 used a shorter version of the PME measure that included only two statements, this shortened measure correlated very highly with the full measure in other studies (rs > .80).
4. Covariates included age, gender, race, education, need for cognition, nicotine dependence, and stage of change. Because White and Black participants composed the majority of the total sample, race was categorized into three groups: White, Black, and Other. Level of education was categorized into four groups (1 = ~grade 8, 2 = grade 9~12/high school/GED, 3 = some college, 4 = college or higher). For most studies, stage of change was assessed with the contemplation ladder for smoking cessation (Prochaska & DiClemente, 1982), where the response options ranged from 0 (“I have not had thoughts about quitting smoking”) to 10 (“I am taking action to quit smoking”). In ARG1, however, the wording differed slightly: participants were asked “On a scale of 0 to 10, how interested are you in quitting smoking?” with responses ranging from 0 (“Not at all interested”) to 10 (“Very interested”). Nicotine dependence was measured using the Fagerström Test for Nicotine Dependence (Heatherton, Kozlowski, Frecker, & Fagerström, 1991). ARG1 did not include these items, so the analysis of ARG1 did not include nicotine dependence as a control variable. Inclusion or exclusion of these covariates did not change the overall results. See Table 1 for descriptive statistics of the covariates.
Contributor Information
Minji Kim, Center for Tobacco Control Research and Education, University of California, San Francisco, 530 Parnassus Ave. Suite 366, San Francisco, CA 94143.
Joseph N. Cappella, Annenberg School for Communication, University of Pennsylvania, 3620 Walnut St. Philadelphia, PA 19104.
References
- Bigsby E, Cappella JN, & Seitz HH (2013). Efficiently and effectively evaluating public service announcements: Additional evidence for the utility of perceived effectiveness. Communication Monographs, 80, 1–23. doi: 10.1080/03637751.2012.739706
- Bruine de Bruin W (2005). Save the last dance for me: Unwanted serial position effects in jury evaluations. Acta Psychologica, 118, 245–260. doi: 10.1016/j.actpsy.2004.08.005
- Cappella JN (2018). Perceived message effectiveness meets the requirements of a reliable, valid, and efficient measure of persuasiveness. Journal of Communication, 68, 994–997. doi: 10.1093/joc/jqy044
- Cappella JN, & Kim M (2017). Media evaluation. In Rössler P (Ed.), The international encyclopedia of media effects. doi: 10.1002/9781118783764.wbieme0020
- Cooper H, & Patall EA (2009). The relative benefits of meta-analysis conducted with individual participant data versus aggregated data. Psychological Methods, 14, 165–176. doi: 10.1037/a0015565
- Davis KC, & Duke JC (2018). Evidence of the real-world effectiveness of public health media campaigns reinforces the value of perceived message effectiveness in campaign planning. Journal of Communication, 68, 998–1000. doi: 10.1093/joc/jqy045
- Dillard JP, & Ha Y (2016). Interpreting perceived effectiveness: Understanding and addressing the problem of mean validity. Journal of Health Communication, 21, 1016–1022. doi: 10.1080/10810730.2016.1204379
- Dillard JP, Shen L, & Vail RG (2007). Does perceived message effectiveness cause persuasion or vice versa? 17 consistent answers. Human Communication Research, 33, 467–488. doi: 10.1111/j.1468-2958.2007.00308.x
- Durkin SJ, Biener L, & Wakefield MA (2009). Effects of different types of antismoking ads on reducing disparities in smoking cessation among socioeconomic subgroups. American Journal of Public Health, 99, 2217–2223. doi: 10.2105/ajph.2009.161638
- Fishbein M, Cappella J, Hornik R, Sayeed S, Yzer M, & Ahern R (2002). The role of theory in developing effective anti-drug public service announcements. In Crano WD & Burgoon M (Eds.), Mass media and drug prevention: Classic and contemporary theories and research (pp. 89–117). Mahwah, NJ: Lawrence Erlbaum Associates.
- Heatherton TF, Kozlowski LT, Frecker RC, & Fagerström K-O (1991). The Fagerström Test for Nicotine Dependence: A revision of the Fagerström Tolerance Questionnaire. British Journal of Addiction, 86, 1119–1127. doi: 10.1111/j.1360-0443.1991.tb01879.x
- Hornik R (2002). Public health communication: Evidence for behavior change. Mahwah, NJ: Lawrence Erlbaum Associates.
- Hurtz GM, & Hertz NR (1999). How many raters should be used for establishing cutoff scores with the Angoff method? A generalizability theory study. Educational and Psychological Measurement, 59, 885–897. doi: 10.1177/00131649921970233
- Jackson S (1992). Message effects research: Principles of design and analysis. New York, NY: The Guilford Press.
- Kim M, & Cappella JN (2019). Reliable, valid and efficient evaluation of media messages: Developing a message testing protocol. Journal of Communication Management. doi: 10.1108/JCOM-12-2018-0132
- Landis JR, & Koch GG (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. doi: 10.2307/2529310
- Lee S, Cappella JN, Lerman C, & Strasser AA (2011). Smoking cues, argument strength, and perceived effectiveness of antismoking PSAs. Nicotine & Tobacco Research, 13, 282–290. doi: 10.1093/ntr/ntq255
- Nabi RL (2018). On the value of perceived message effectiveness as a predictor of actual message effectiveness: An introduction. Journal of Communication, 68, 988–989. doi: 10.1093/joc/jqy048
- Noar SM (2009). Challenges in evaluating health communication campaigns: Defining the issues. Communication Methods and Measures, 3, 1–11. doi: 10.1080/19312450902809367
- Noar SM, Barker J, Bell T, & Yzer M (2018). Does perceived message effectiveness predict the actual effectiveness of tobacco education messages? A systematic review and meta-analysis. Health Communication. Advance online publication. doi: 10.1080/10410236.2018.1547675
- Nonnemaker J, Farrelly MC, Kamyab K, Busey A, & Mann N (2010). Experimental study of graphic cigarette warning labels. Research Triangle Park, NC: RTI International.
- O’Keefe DJ (2015). Message generalizations that support evidence-based persuasive message design: Specifying the evidentiary requirements. Health Communication, 30, 106–113. doi: 10.1080/10410236.2014.974123
- O’Keefe DJ (2018). Message pretesting using assessments of expected or perceived persuasiveness: Evidence about diagnosticity of relative actual persuasiveness. Journal of Communication, 68, 120–142. doi: 10.1093/joc/jqx009
- O’Connor K, & Cheema A (2018). Do evaluations rise with experience? Psychological Science, 29, 779–790. doi: 10.1177/0956797617744517
- Parvanta S, Gibson L, Forquer H, Shapiro-Luft D, Dean L, Freres D, … Hornik R (2013). Applying quantitative approaches to the formative evaluation of antismoking campaign messages. Social Marketing Quarterly, 19, 242–264. doi: 10.1177/1524500413506004
- Prochaska JO, & DiClemente CC (1982). Transtheoretical therapy: Toward a more integrative model of change. Psychotherapy: Theory, Research & Practice, 19, 276–288. doi: 10.1037/h0088437
- Reeves B, Yeykelis L, & Cummings JJ (2015). The use of media in media psychology. Media Psychology, 19, 49–71. doi: 10.1080/15213269.2015.1030083
- Scheer J, & Ansorge C (1975). Effects of naturally induced judges’ expectations on the ratings of physical performances. Research Quarterly, 46, 463–470. doi: 10.1080/10671315.1975.10616704
- Selya AS, Rose JS, Dierker LC, Hedeker D, & Mermelstein RJ (2012). A practical guide to calculating Cohen’s f², a measure of local effect size, from PROC MIXED. Frontiers in Psychology, 3, 111. doi: 10.3389/fpsyg.2012.00111
- Slater MD, Peter J, & Valkenburg PM (2015). Message variability and heterogeneity. Annals of the International Communication Association, 39, 3–31. doi: 10.1080/23808985.2015.11679170
- Snyder LB, Hamilton MA, Mitchell EW, Kiwanuka-Tondo J, Fleming-Milici F, & Proctor D (2004). A meta-analysis of the effect of mediated health communication campaigns on behavior change in the United States. Journal of Health Communication, 9, 71–96. doi: 10.1080/10810730490271548
- Wakefield MA, Loken B, & Hornik RC (2010). Use of mass media campaigns to change health behaviour. The Lancet, 376, 1261–1271. doi: 10.1016/S0140-6736(10)60809-4
- Zhao X, & Cappella JN (2016). Perceived argument strength. In Kim DK & Dearing JW (Eds.), Health communication research measures (2nd ed., pp. 119–126). New York, NY: Peter Lang Publishing Inc.
- Zhao X, Strasser A, Cappella JN, Lerman C, & Fishbein M (2011). A measure of perceived argument strength: Reliability and validity. Communication Methods & Measures, 5, 48–75. doi: 10.1080/19312458.2010.547822