Journal of Survey Statistics and Methodology. 2020 Oct 21;9(5):961–991. doi: 10.1093/jssam/smaa021

Survey Reliability: Models, Methods, and Findings

Roger Tourangeau
PMCID: PMC8665769  PMID: 34912940

Abstract

Although most survey researchers agree that reliability is a critical requirement for survey data, there have not been many efforts to assess the reliability of responses in national surveys. In addition, there are quite different approaches to studying the reliability of survey responses. In the first section of the Lecture, I contrast a psychological theory of over-time consistency with three statistical models that use reinterview data, multi-trait multi-method experiments, and three-wave panel data to estimate reliability. The more sophisticated statistical models reflect concerns about memory effects and the impact of method factors in reinterview studies. In the following section of the Lecture, I examine some of the major findings from the literature on reliability. Despite the differences across methods for exploring reliability, the findings mostly converge, identifying similar respondent and question characteristics as major determinants of reliability. The next section of the paper looks at the correlations among estimates of reliability derived from the different methods; it finds some support for the validity of the measures from traditional reinterview studies. The empirical claims motivating the more sophisticated methods for estimating reliability are not strongly supported in the literature. Reliability is, in my judgment, a neglected topic among survey researchers, and I hope the Lecture spurs further studies of the reliability of survey questions.

Keywords: Belief-sampling, MTMM experiments, Quasi-simplex model, Reliability, Simple response variance

1. INTRODUCTION

Let me begin with a digression. When I was an undergraduate majoring in psychology, I came across something called the Mental Measurements Yearbook, which was then edited by Oscar K. Buros. As I recall, this fat and very useful book consisted of short articles discussing the psychometric properties of various scales used in psychological research, especially evidence regarding their reliability and validity. The scales themselves were multi-item batteries designed to assess some psychological construct. For example, my first foray into social science research (Alker, Tourangeau, and Staines 1976) used something called the Tomkins Polarity Scale, which consisted of fifty-nine items, and contrasted people with humanistic or normative orientations. (I no longer have any idea what this meant.) One could find out what was known about such scales by looking them up in the Mental Measurements Yearbook. There was a strong presumption in psychology that researchers should use scales that were known to be reliable and valid and that multiple items were necessary to achieve high levels of reliability and validity.

Imagine my surprise when I joined National Opinion Research Center’s (NORC) Sampling Department in 1980 and learned that most surveys used items of unknown reliability and validity and that key variables were often measured with a single survey question. It is not that survey researchers were not worried about error. To the contrary, to me, they seemed absolutely obsessed with sampling and coverage errors. One of my first assignments at NORC was to help select its master sampling frame for the 1980s. This was a national area probability sample of addresses that was to be used for survey samples over the next decade; it was a large investment for NORC and its chief statistician, Marty Frankel, was extremely meticulous in designing and implementing the procedures to develop this frame. The contrast with psychological research could not have been starker. Psychologists did not care at all whether their samples were any good, and, in fact, most psychological research used samples of convenience (or, for the subjects, samples of inconvenience), consisting of freshmen and sophomores taking introductory psychology courses that required them to “volunteer” for psychological experiments as subjects. Of course, other fields have their obsessions as well, but the complementary concerns of psychology and survey research were quite a revelation to me.

2. PSYCHOLOGICAL VERSUS STATISTICAL MODELS OF RELIABILITY

My dual citizenship in the worlds of psychology and survey sampling unexpectedly became extremely useful in the early 1980s, as survey researchers began to see the relevance of psychology, especially cognitive psychology, to survey methodological issues. Some leading survey researchers, including George Gallup and Norman Bradburn, had been trained as psychologists, but for the first time the relevance of cognitive psychology, a relatively new discipline, was becoming clear to the researchers responsible for official statistics. Conferences in the United Kingdom (Moss and Goldstein 1979) and the United States (Biderman 1980) brought survey researchers together with memory researchers in an effort to improve the quality of survey data in studies like the National Crime Survey and the National Health Interview Survey. I was asked to participate in the Advanced Seminar on Cognitive Aspects of Survey Methodology, which was held in 1983, and was commissioned to write a paper discussing how people answered survey questions (Tourangeau 1984). I argued that there were four main components to the survey response process—comprehension, retrieval, judgment, and reporting. Shortly thereafter, Ken Rasinski and I applied this model in analyzing context effects, the impact of earlier questions on answers to later questions, in attitude surveys (Tourangeau and Rasinski 1988).

2.1 The Belief-Sampling Model

In subsequent work, I tried to address the question of when respondents gave consistent answers to attitude questions over time. Tourangeau, Rips, and Rasinski (2000) argued that respondents could draw on four types of information in answering attitude questions: (1) existing attitudes; (2) general values; (3) impressions or stereotypes; or (4) specific beliefs or feelings about the issue. Although this suggests that survey respondents could draw from a broad base of materials to construct answers to attitude questions, in practice, they seem to sample this material only superficially. Their explanations for their answers typically cited only two or three considerations, and reaction time studies showed that respondents answer most attitude questions in about three seconds (Bassili and Fletcher 1991; Tourangeau, Rasinski, and D’Andrade 1991; Bassili and Scott 1996). The wide range of beliefs and values that respondents might consider in formulating their answers and the small number they actually considered suggests that variability in their answers over time could reflect unreliability in the processes of retrieving considerations and combining them into the relevant judgment. To the extent that respondents retrieve different considerations on different occasions, evaluate them differently, or use different methods to combine them, their answers to the same attitude question will vary over time.

We proposed a quantitative model—the belief-sampling model—to explain variation in consistency in answers to attitude questions over time (Tourangeau, Rips, and Rasinski 2000). The model assumes that the retrieval process for an attitude question yields a haphazard assortment of prior judgments, impressions, general values, and specific beliefs about an issue. We referred to these as considerations. Which considerations a respondent retrieves will depend on the momentary accessibility of the considerations. Accessibility, in turn, depends on a host of variables, including the wording of the question, the nature of the judgment to be made, the instructions to the respondent, chronic accessibility or strength, and what was brought to mind by the earlier questions. Often, several considerations will come to mind, and the respondent will have to combine them in some way to produce a final verdict. For simplicity, we assumed that the output from the judgment component was a simple average of the considerations that are the input to it. We did not assume that respondents actually compute an average. Instead, we argued that averaging is the algebraic result of an underlying process of successive adjustments: The respondent retrieves (or generates) a consideration and derives its implications for the question at hand; this serves as the initial judgment, which is adjusted in light of the next consideration that comes to mind, and so on. The formation of an attitude judgment is, according to Tourangeau, Rips, and Rasinski, similar to the accretion of details and inferences that produces other sorts of survey responses, such as frequency estimates.

The final judgment can be represented by the average of the considerations the respondent takes into account:

$J = \sum_{i=1}^{n} s_i / n$. (1)

In (1), J denotes the output of the integration process, si is the scale value assigned to consideration i, and n is the number of considerations taken into account. The scale values represent the implications of the consideration—that is, the answer it points to—for the particular question. Equation (1) applies equally well when an existing evaluation is the only consideration taken into account, when an existing evaluation is adjusted in the light of other considerations about the issue, or when a new judgment is derived from several considerations retrieved at the time a question is asked. Although there is not necessarily a simple relationship between the outcome of the judgment stage and the answer the respondent reports, the quantitative model we proposed assumes that the reporting process yields an output that is a linear function of J. This is likely to be a reasonable approximation, since the real relationship between covert judgment and overt response is generally monotonic.

According to the belief-sampling model, then, responses to attitude questions are inherently unstable because they are based on a sample of the relevant considerations and there is no guarantee that the same ones will be sampled each time. A second source of unreliability is in the values that respondents assign to the considerations. This valuation process is not necessarily easy, and it will not necessarily yield the same results every time, even if the judgment is based solely on an existing evaluation. For example, the scale value assigned to a consideration might depend on whatever points of comparison are salient to the respondent. With a few additional assumptions, (1) leads to specific predictions about the correlation between responses to the same question administered at two different times (or to two different items administered at the same time).

Three parameters determine the level of the correlation between responses. The first one is the reliability of the scaling process, measured by the correlation between the scale values assigned to the same consideration on two different occasions. Even if a respondent retrieves the same considerations each time the question is asked, the responses will be correlated only to the extent that the scaling process assigns similar values to the considerations each time. The expected correlation between the scale values assigned to the same consideration on different occasions is represented in the model by ρ1. The second parameter represents the degree that any two considerations retrieved by the same respondent are correlated. The considerations in memory (and any that are generated on the spot in response to a question) are likely to show some degree of internal consistency. Thus, even if answers on two occasions are based on different considerations, they are still likely to be correlated to some degree since even nonoverlapping sets of considerations will imply similar answers. We use ρ2 to denote the degree of internal consistency among the pool of potentially relevant considerations; it represents the expected value of the correlation between the scale values assigned to any pair of different considerations retrieved or generated by a respondent on a single occasion. The final parameter affecting the correlation between responses to two items is the degree of overlap in the considerations taken into account in each response. The parameter q represents the proportion of considerations taken into account in the second response that were also incorporated in the first. Other things being equal, the more the considerations overlap, the more the answers will correlate (see also Zaller 1992; Zaller and Feldman 1992).

Equation (2) shows the expected level of the correlation between responses on two occasions as a function of these three parameters:

$r_{12} = \dfrac{n_1 n_2 \rho_1 \rho_2 + n_2 q \rho_1 (1-\rho_2)}{\left([\,n_1^2 \rho_2 + n_1(1-\rho_2)\,][\,n_2^2 \rho_2 + n_2(1-\rho_2)\,]\right)^{1/2}}$. (2)

In the equation, n1 represents the total number of considerations taken into account by the respondent on the first occasion and n2 the number taken into account on the second. The product n2q, which figures in the second term in the numerator, is just the number of overlapping considerations (the number taken into account both times).

Equation (2) underscores the importance of several relatively neglected variables that affect response consistency over time and across items. For example, it points to the importance of the homogeneity of the considerations on which the attitude responses rest (ρ2). The more that considerations are evaluatively consistent within persons, the higher the correlations. More generally, the model predicts that the correlation between responses will typically increase with ρ1 (the reliability of the scaling process), ρ2, and q (the degree of overlap in the samples of considerations taken into account on the different occasions). The model also makes some predictions about interactions between these variables. One of these involves the interaction between the overlap and homogeneity parameters: as ρ2 approaches 1.0, the impact of q nears 0; if all the potentially relevant considerations point to the same answer, it does not matter so much which ones the respondent actually retrieves. I present evidence regarding the validity of the belief-sampling model in section 3.1 below. For now, I want to contrast it to the reigning statistical models for reliability.
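
To make (2) concrete, here is a minimal sketch in Python; the function name and the illustrative inputs are mine, but the formula is simply equation (2), with n_overlap standing in for the product n2q:

```python
import numpy as np

def predicted_r(n1, n2, n_overlap, rho1, rho2):
    """Predicted correlation between two responses under the belief-sampling
    model, equation (2). n_overlap is the number of considerations taken into
    account on both occasions (n2 * q in the notation of the text)."""
    numerator = n1 * n2 * rho1 * rho2 + n_overlap * rho1 * (1 - rho2)
    denominator = np.sqrt((n1**2 * rho2 + n1 * (1 - rho2)) *
                          (n2**2 * rho2 + n2 * (1 - rho2)))
    return numerator / denominator

# Two considerations on each occasion, one of them shared, with illustrative
# parameter values close to those estimated for abortion in section 3.1.1.
print(predicted_r(n1=2, n2=2, n_overlap=1, rho1=0.86, rho2=0.71))  # about 0.79

# With rho2 = 1.0, the prediction equals rho1 whatever the overlap, which is
# the interaction between homogeneity and overlap described above.
print(predicted_r(n1=2, n2=2, n_overlap=0, rho1=0.86, rho2=1.0))   # 0.86
```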

2.2 Statistical Models for Reliability

2.2.1. Simple response variance and related measures

The most common method for assessing the reliability of survey responses has been to conduct reinterviews with respondents a short interval (one to two weeks) after an initial interview and to estimate relatively simple statistics from these data, such as the gross difference rate (GDR). The GDR is the proportion of respondents giving answers to an item in the reinterview that differ from the answers in the initial interview:

$\mathrm{GDR} = 1 - p_a$,

in which pa is the proportion of respondents giving the same answer in both interviews. Cohen’s kappa (Cohen 1960) corrects the agreement rate for chance agreement:

$\kappa = \dfrac{p_a - p_e}{1 - p_e}$,

in which pe is the level of agreement that would be expected if the two responses were independent. The correlation between responses to the same question asked at two different time points has also been used to estimate reliability.

The GDR and related statistics rest on a relatively simple statistical model—the survey response reflects some underlying true score plus an error:

$y_{ij} = t_i + e_{ij}$, (3)

in which yij is the actual survey response for respondent i and occasion j; ti is the true score for that respondent; and eij is the error for that respondent on that occasion. Equation (3) makes no assumptions per se—an observed response can always be rewritten as the sum of what the respondent should have said (his or her true score) and the difference between that and what he or she actually said. However, assumptions are usually added to aid the interpretation of reinterview data, and these assumptions can be wrong. It is generally assumed that the true score is the same on both measurement occasions, that the errors have an expected value of zero, and that the errors on the two occasions are independent of each other. Under these assumptions, the error variance (or simple response variance) in the responses is a straightforward function of the discrepancies between the answers on the two occasions:

$\hat{\sigma}_e^2 = \dfrac{1}{2n}\sum_{i=1}^{n} (Y_{1i} - Y_{2i})^2$,

where Y1i and Y2i are the answers in the initial interview and reinterview, respectively, and n is the number of respondents. If the item is dichotomous, then the GDR is just two times the simple response variance (see O’Muircheartaigh 1991; Biemer 2004, for more sophisticated discussions of the concept of simple response variance).
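
These quantities are simple to compute from matched interview–reinterview responses. The sketch below (hypothetical data and function names, not taken from any of the studies discussed here) computes the GDR, kappa, and the simple response variance, and illustrates that for a dichotomous item the GDR is twice the simple response variance:

```python
import numpy as np

def reinterview_stats(y1, y2):
    """GDR, Cohen's kappa, and simple response variance from paired answers
    (y1 from the initial interview, y2 from the reinterview)."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    p_agree = np.mean(y1 == y2)
    gdr = 1.0 - p_agree                                 # gross difference rate

    # Chance agreement: sum over categories of the product of the marginals.
    categories = np.union1d(y1, y2)
    p_chance = sum(np.mean(y1 == c) * np.mean(y2 == c) for c in categories)
    kappa = (p_agree - p_chance) / (1.0 - p_chance)

    srv = np.mean((y1 - y2) ** 2) / 2.0                 # simple response variance
    return gdr, kappa, srv

# Hypothetical answers to a dichotomous item from eight respondents.
y1 = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y2 = np.array([0, 1, 0, 0, 1, 1, 1, 1])
gdr, kappa, srv = reinterview_stats(y1, y2)
print(gdr, kappa, 2 * srv)   # GDR = 0.25 = twice the simple response variance
```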

2.2.2. Multi-trait multi-method models

Other more complex approaches have also been used to estimate the reliability of survey items. Two of the most prominent alternatives are multi-trait multi-method (MTMM) experiments and the quasi-simplex model applied to data from three (or more) waves of a longitudinal survey. Each of these approaches offers a somewhat different definition of reliability, and each requires that certain assumptions be met for the models to produce valid results.

Andrews (1984) introduced the use of structural equation modeling to analyze the results of MTMM experiments. The experiments typically involve nine (or more) survey items measuring three different traits (or constructs) using three different methods. Three traits and three methods are the minimum necessary to achieve an identifiable model. In some cases, individual respondents get only two of the three measures of a given construct in a split-ballot design (Saris, Satorra, and Coenders 2004). Although other models are possible, the MTMM model that is usually applied (Andrews 1984; Saris and Gallhofer 2007a, 2007b) assumes that the observed response (yij) reflects a “true” score (tij) plus a random error (eij):

$y_{ij} = r_{ij} t_{ij} + e_{ij}$, (4)

in which rij is the reliability of item i—that is, the relationship between the true score and the observed response. The true score, in turn, reflects the construct of interest (fi) and any method effect (Mj):

$t_{ij} = v_{ij} f_i + m_{ij} M_j$, (5)

in which vij is the validity coefficient, representing the relationship between the true score and the underlying construct, and mij represents the impact of method j on responses to item i. This model, the “true score” model, assumes that the random errors in the observed scores are independent of each other and that the true scores reflect two factors—the construct of interest and a method effect. Equations (4) and (5) do not differ markedly from (3)—they both assume that an observed response reflects both the truth and random error—but the MTMM true score model allows for method effects and posits a more complex relationship between the underlying variable of interest and the observed response.
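
A small simulation may help make (4) and (5) concrete; the sketch below is mine, with arbitrary illustrative values for the reliability, validity, and method loadings. Two items measured with the same method correlate through both the construct and the shared method factor, while items measured with different methods correlate through the construct alone; this is the pattern the MTMM design is meant to disentangle:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# One construct of interest (f) and two method factors (M1, M2), all standardized.
f, M1, M2 = rng.standard_normal((3, N))

def item(construct, method, r, v):
    """Observed response under (4)-(5) with standardized variables: the true
    score is t = v*construct + m*method with m = sqrt(1 - v**2), and the
    observed score is y = r*t + e with error variance 1 - r**2."""
    m = np.sqrt(1 - v**2)
    t = v * construct + m * method
    e = np.sqrt(1 - r**2) * rng.standard_normal(len(construct))
    return r * t + e

y1 = item(f, M1, r=0.9, v=0.8)   # method 1
y2 = item(f, M1, r=0.9, v=0.8)   # method 1 again
y3 = item(f, M2, r=0.9, v=0.8)   # a different method

print(np.corrcoef(y1, y2)[0, 1])   # about r**2 * (v**2 + m**2) = 0.81
print(np.corrcoef(y1, y3)[0, 1])   # about r**2 * v**2          = 0.52
```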

2.2.3. The quasi-simplex model

Alwin (2007) has advocated the use of longitudinal data for reliability estimates. The “quasi-simplex” model for analyzing such data assumes that the true score at a given wave (tik) reflects the true score at the previous wave (ti,k−1) plus change over time (zik):

$t_{ik} = \beta_{ik,k-1}\, t_{i,k-1} + z_{ik}$, (6)

in which $\beta_{ik,k-1}$ reflects the relationship between the true score at wave k and the true score at the prior wave. The observed score for a given wave (yik) is just the true score plus a random error (eik)—that is, the same model as (3). A basic assumption of the quasi-simplex model is that there is no lagged effect of the true score from two waves prior to the current wave. In addition, to make the model parameters identifiable, either the reliabilities (Heise 1969) or the error variances (Wiley and Wiley 1970) must be assumed to be constant across waves of the survey.
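
For the three-wave case these assumptions yield a simple closed-form estimate: under the model, the product of the adjacent-wave correlations divided by the wave 1 to wave 3 correlation equals the reliability of the middle wave, and Heise's (1969) equal-reliability assumption treats this as the common reliability for all waves. A minimal sketch with hypothetical correlations (the values are mine):

```python
def heise_reliability(r12, r23, r13):
    """Three-wave quasi-simplex reliability estimate (Heise 1969).
    Under the model, r12 = rel1*b21*rel2, r23 = rel2*b32*rel3, and
    r13 = rel1*b21*b32*rel3 (rel = reliability index, b = standardized
    stability), so r12*r23/r13 = rel2**2, the proportion of true-score
    variance in the middle wave; with equal reliabilities across waves
    this is the common reliability coefficient."""
    return r12 * r23 / r13

# Hypothetical over-time correlations for a single item across three waves.
print(heise_reliability(r12=0.60, r23=0.58, r13=0.50))   # about 0.70
```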

2.2.4. Summary

Although the statistical models have gotten more complex, there is a strong family resemblance among them. They all are based on the idea of a true score, and the increasing complexity of the models reflects their treatment of the relation between the true score and the observed score. In psychology, test–retest studies are sometimes done to estimate the reliability of a scale; the correlation between scale scores in the initial administration (the “test”) and the second administration (the “retest”) is used as the reliability estimate. However, the over-time correlation is the product of the reliability at time 1, the reliability at time 2, and the correlation between the true scores on the two occasions. If one assumes no change in the true scores and equal reliabilities on both occasions, the over-time correlation is the square of the reliability of the item or scale.
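
Written out in the notation of (4), with $r_1$ and $r_2$ denoting the reliabilities on the two occasions and the errors assumed independent, the relationship is

$\mathrm{corr}(y_1, y_2) = r_1\, r_2\, \mathrm{corr}(t_1, t_2),$

so that when $r_1 = r_2 = r$ and $\mathrm{corr}(t_1, t_2) = 1$, the test–retest correlation reduces to $r^2$.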

The MTMM model reflects the concern that some of the over-time stability in responses in a reinterview study partly results from method effects—that is, respondents giving correlated answers to any items that share the same format, regardless of their content. Of course, the danger with the MTMM design is that respondents are asked at least two questions designed to tap the same construct within a single interview and they may recall their earlier answers, but none of the models summarized by (3)–(6) are based on claims about how respondents come up with answers to the questions.

Alwin’s (2007) use of the quasi-simplex model applied to longitudinal data reflects the view that respondents may remember their answers from the initial interview when they are interviewed a second time, inflating the estimate of reliability. Still, the assumptions that allow reliability to be estimated under this model may not be plausible. A longitudinal panel may minimize memory effects, but panel designs can create their own measurement artifacts (see Cantor 2008; Warren and Halpern-Manners 2012, for reviews). For example, respondents may get better answering the questions after repeated administrations, so that error variances are lower (or the reliabilities higher) in later rounds of the panel. This would violate the assumptions made to achieve an identifiable model.

3. FINDINGS ON RELIABILITY

In this section, I briefly review what we have learned about reliability. I start with the findings inspired by the belief-sampling model, then move to the Population Assessment of Tobacco and Health Reliability and Validity study (the PATH-RV study), which used the interview–reinterview method to estimate reliabilities.

3.1 The Belief-Sampling Model

3.1.1. Initial study

I conducted two studies to test the belief-sampling model. In the first, a random sample of households in the city of Chicago was contacted by telephone during the fall of 1991. We selected the sample by choosing random telephone numbers from the Chicago directory and replacing the final digit of each number with a random digit. Interviewers at NORC called this initial sample of telephone numbers and contacted a total of 1,481 households; calls to another 570 numbers went unanswered after multiple tries. Within these households, interviewers asked to speak to any available adult male or, if no male was available, to any adult female. Those who completed the first interview were recontacted and interviewed by telephone a second time. A total of 599 respondents completed the initial interview and 499 of them completed a reinterview about three weeks later.

In each interview, respondents answered an open-ended probe and five closed questions each about two familiar issues—abortion and welfare. All of the closed items used 5-point response scales, with answer categories ranging from agree strongly to disagree strongly. The first closed item about each issue was repeated verbatim in both interviews. For abortion, this target item read “A woman who wants an abortion should be allowed to obtain one.” For welfare, the item read “The federal government should increase spending for welfare programs.” The data from these two items allowed us to assess the correlation between responses to the same item across the two interviews. The open-ended item asked respondents to “Please tell me your thoughts and feelings about whether or not [a woman who wants an abortion should be allowed to obtain one/the federal government should increase spending on welfare programs].” Some respondents got the open-ended probes just before the target items and some got them immediately afterward; the order of the open and closed items had no effect on the results, and I combined the data from the two orders in the analysis.

The initial interview also included items designed to assess attitude homogeneity. For each issue, we asked respondents whether their beliefs about the issue were mixed: “Would you say that you are strongly on one side or the other on the abortion (welfare) issue or would you say your feelings are mixed?” These self-report measures came at the end of the initial interview. Altogether, the initial interview included thirty-three questions: the target items, open-ended probes, and four additional closed items each about abortion and welfare; several questions about some unrelated political issues (to disguise the point of the study); and some questions eliciting basic demographic information. The follow-up questionnaire was briefer, including only the twelve items on welfare and abortion and three unrelated items.

Two coders counted the number of considerations cited in each open-ended answer and the number mentioned on both occasions. These counts served as estimates of n1 and n2, the number of considerations taken into account on the two occasions; the number of considerations judged to be the same in both open-ended responses served as the estimate of n2q, the number of overlapping considerations.

Table 1 presents some key findings from the study. The top two panels of the table display the correlations over time by measures of attitude homogeneity. For both issues, the correlation between target responses in the two interviews was significantly higher for respondents who said they were strongly on one side of the issue than for those who said their views were mixed (for abortion, z = 2.52; for welfare, z = 2.00). The results for the measure of homogeneity based on the open-ended answers in the first interview showed the same pattern—the more considerations favoring one side of the issue, the higher the correlation between responses to the closed target items (for a similar measure, see Zaller and Feldman 1992). For both issues, the differences between groups formed based on their open-ended responses were significant. The F-values for the linear trend in the correlations across groups were 3.99 for abortion and 5.39 for welfare (both ps < 0.05).

Table 1.

Correlation of Answers across Interviews, by Consistency of Considerations and Overlap

Issue
Abortion Welfare
Balance of considerations in initial interview
 Equally pro and con 0.614 (40) 0.580 (51)
 Majority of one consideration 0.755 (416) 0.618 (416)
 Majority of two or more 0.806 (34) 0.825 (22)
Self-report
 Mixed views 0.575 (175) 0.535 (306)
 Views all on one side 0.814 (314) 0.723 (184)
Overlap in open-ended responses
 No shared statements 0.348 (68) 0.039 (95)
 One shared statement 0.794 (386) 0.722 (374)
 Two or more shared statements 0.840 (27) 0.630 (12)

Note.— Parenthetical entries are sample sizes.

As respondents take into account more of the same considerations each time they answer a question, their answers should become more consistent. [In (2), this effect is captured by n2q, which represents the number of considerations taken into account both times.] Our coding scheme noted whether each consideration in the open-ended answer in the second interview was also mentioned in the open-ended answer in the first. The bottom panel of table 1 displays the results by the amount of overlap in the open-ended responses. For both issues, respondents with at least one overlapping statement in their open-ended answers in the two interviews showed higher correlations between target responses than those with no overlapping statements; all of the comparisons between the no overlap group and the other two groups are statistically significant. For example, the correlations between abortion target responses were 0.35 among respondents whose open-ended answers about abortion showed no overlap across interviews, 0.79 among those whose open-ended answers included one shared consideration, and 0.84 among those whose open-ended answers shared two considerations across interviews. Both 0.84 and 0.79 differ significantly from 0.35 (z = 3.59 and 5.36).

Equation (2) allows for more specific quantitative predictions about the correlations between target responses once values for the two parameters, ρ1 and ρ2, are estimated. We fit the model for each issue using nonlinear regression procedures to estimate values for ρ1 and ρ2. Figure 1 displays the predicted and observed correlations as a function of the number of considerations taken into account on the two occasions and the overlap between these considerations. (One of the observed welfare correlations was actually nonsignificantly negative; prior to fitting the model, we replaced the actual value with 0.00.) The values on the horizontal axis indicate the number of considerations mentioned in the initial interview (n1, the first digit of the group label), the number mentioned in the second interview (n2, the second digit in the label), and the number of overlapping considerations (n2q, the final digit). For example, in the group labeled 222, the same two considerations were cited in the open-ended answers given in both interviews. For both issues, the model seems to fit the data reasonably well, accounting for 81 percent of the variance in the abortion correlations and for 74 percent of the variance in the welfare correlations. The estimated value of ρ1 (the reliability of the scaling process) was similar for the two issues (0.86 for abortion and 0.90 for welfare). By contrast, the estimates of the homogeneity parameter ρ2 differ quite sharply (0.71 for abortion versus 0.22 for welfare). Apparently, the considerations underlying welfare attitudes were more diverse than those underlying abortion attitudes. This result is consistent with the finding that far more respondents described their views on welfare as mixed (62.4 percent of the sample) than described their abortion views that way (35.8 percent).
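
In outline, this fitting step can be reproduced with standard nonlinear least squares. The sketch below is a minimal illustration: the functional form is equation (2), but the group definitions and group-level correlations are hypothetical placeholders rather than the study's values:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(X, rho1, rho2):
    """Equation (2): predicted correlation for a group whose members cited n1
    and n2 considerations on the two occasions, n_overlap of them shared."""
    n1, n2, n_overlap = X
    num = n1 * n2 * rho1 * rho2 + n_overlap * rho1 * (1 - rho2)
    den = np.sqrt((n1**2 * rho2 + n1 * (1 - rho2)) *
                  (n2**2 * rho2 + n2 * (1 - rho2)))
    return num / den

# Hypothetical groups: rows of (n1, n2, n_overlap) and the observed correlation
# between target responses within each group.
groups = np.array([[1, 1, 0], [1, 1, 1], [2, 1, 1], [2, 2, 1], [2, 2, 2]]).T
observed_r = np.array([0.45, 0.75, 0.70, 0.78, 0.85])

(rho1_hat, rho2_hat), _ = curve_fit(model, groups, observed_r,
                                    p0=[0.8, 0.5], bounds=(0.0, 1.0))
print(rho1_hat, rho2_hat)
```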

Figure 1.

Predicted (Dashed Lines) and Observed Correlations (Solid Lines) between Target Responses in the Two Interviews. The labels on the horizontal axis indicate the number of considerations cited in the first interview (the first digit), the number cited in the second (the second digit), and the number judged to be the same in both interviews (the last digit).

3.1.2. Highway study

Both abortion and welfare are familiar issues, and we were not able to manipulate the key variables in the belief-sampling model. In a second study, we used an issue that was unfamiliar to the respondents and we tried to affect the underlying mix of considerations. We recruited a convenience sample of 192 adults, offering participants $20 to take part in a study of attitudes in the spring of 1994. NORC interviewers attempted to recontact those who completed the initial session by telephone some two weeks later; 177 of the initial participants completed the telephone interview.

In the initial session, participants came to NORC’s offices and read a brief description of a proposed highway project. The description summarized the arguments in favor of and opposed to the project, citing the highway’s likely effects on commuting times, traffic, the local economy, and the environment. Half of the participants received eight arguments concerning the highway; the other half, only four. The material also varied in the balance between pro and con arguments. The description given to one group of participants included three arguments in favor of the highway for each one that opposed it; the description given to a second group included equal numbers of pro and con arguments; the description given to the final group included three arguments opposed to the highway for each one that favored it. Overall, then, there were six experimental groups. The arguments had been pretested, and the ones we gave to the participants were deemed to be roughly equally convincing by the pretest respondents.

After reading the assigned material on the highway project and answering some questions to test their comprehension of it, respondents received a questionnaire that included several questions about the proposed highway. Among these were two attitude items, both on 5-point response scales. The first of the two attitude items asked for a general evaluation of the highway project ("In general, how strongly do you favor or oppose the highway project?"); the second asked specifically about the project’s impact on the local economy ("In general, what effect do you think the highway project would have on the local economy?"). Immediately after each of the attitude questions, respondents were asked to list the considerations that had affected their answers. All but fifteen of the participants were interviewed by telephone approximately two weeks after their initial session. The questionnaire for the telephone interview was quite brief and included the item asking for an overall evaluation of the highway and the open-ended follow-up to that item. A coder counted the number of considerations cited in the open-ended answers in both the initial questionnaire and the follow-up interview, as well as the number of overlapping considerations mentioned in each pair of answers.

Table 2 shows some of the key results from this study. As in the initial study, respondents were more consistent over time when they retrieved considerations that were mostly on one side of the issue (second panel of the table) and when they retrieved more of the same considerations on the two occasions (bottom panel). The effects of both variables are highly significant for the same item administered twice (the rightmost column) and are in the same direction (though nonsignificant) for two different items administered during the initial session. For the linear trend in the correlations between responses to the general evaluation item on the two occasions, the F value is 9.62 for the size of the majority and 17.51 for the degree of overlap.

Table 2.

Correlation of Target Answers by Homogeneity, Overlap, and Attitude Issue

General with economy (both at time 1) General (time 1 to time 2)
Experimental group
 One pro/three con arguments 0.485 (32) 0.785 (30)
 Two pro/six con arguments 0.382 (32) 0.880 (30)
 Two pro/two con arguments 0.536 (31) 0.912 (29)
 Four pro/four con arguments 0.431 (31) 0.814 (30)
 Three pro/one con arguments 0.405 (31) 0.743 (30)
 Six pro/two con arguments 0.598 (31) 0.917 (28)
Balance of considerations in initial session
 Equally pro and con 0.270 (30) 0.686 (20)
 Majority of one 0.550 (53) 0.824 (48)
 Majority of two 0.540 (37) 0.818 (38)
 Majority of three or more 0.587 (64) 0.919 (61)
Overlap in open-ended responses
 No shared statements 0.515 (136) 0.767 (77)
 One shared statement 0.626 (41) 0.910 (71)
 Two or more shared statements 0.648 (11) 0.956 (28)

Note.— Parenthetical entries are sample sizes. The “general” item elicited an overall evaluation of the proposed highway project; the “economy” item asked about its effects on the local economy.

The sample sizes were smaller in the second study than in our initial study and we were able to fit the model in (2) only to the data for the general item. Again, we used nonlinear regression procedures to estimate values for ρ1 and ρ2. The parameter estimates (0.66 for ρ1 and 1.0 for ρ2) were similar to those for abortion in the earlier study. As figure 2 shows, the model gives a reasonable fit to the observed correlations, though not as good as in the initial study. In our second study, the predicted values account for nearly 60 percent of the variance in the correlations between responses to the overall evaluation item in the initial session and the subsequent telephone interview. The poorer fit in our second study may simply reflect the smaller sample sizes (about twelve per cell) on which the correlations were based.

Figure 2.

Predicted (Dashed Line) and Observed Correlations (Solid Line) between Target Responses in the Highway Study. The labels on the horizontal axis indicate the number of considerations cited in the initial session (the first digit), the telephone interview (the second digit), and the number judged to be the same in both (the last digit).

3.1.3. Discussion

Why aren't attitude responses more consistent over time? According to the belief-sampling model, the reason is not that people do not have attitudes or that they answer the questions sloppily, although these both doubtless play a role. The problem is partly that respondents have a wide range of considerations they can draw on and they do not necessarily draw on the same ones each time. When they don't consider the same things on different occasions, it matters a great deal whether they are sampling from a consistent or mixed underlying pool of considerations about the issue. Those with a mixed set of considerations to draw from are especially prone to give inconsistent answers over time. If the underlying pool is homogeneous enough, it does not matter much whether respondents retrieve the same ones every time. Additional analyses of the data from our second study show that overlapping considerations had a larger, more consistent effect among the respondents who got mixed arguments about the highway project initially than among those who got arguments clearly favoring one side. Again, overlap and internal consistency are two of the key factors in determining reliability over time and they can interact.

3.2 Self–Proxy Differences

Another important variable in the literature on reliability is whether the data come from the same respondent in the reinterview as in the initial interview. In a classic study that examined data from the Current Population Survey (CPS) reinterview program, O’Muircheartaigh (1991) looked at the variables affecting the GDRs in two key dichotomous variables from the CPS—the classification of sample members as labor force participants and as employed or unemployed. The CPS is a household interview in which one person typically reports about himself or herself and also about everyone else in the household over fourteen. The respondent to the main interview may or may not be the person who provides the reinterview data. Table 3 below presents GDRs for the employment status variables as a function of who provided the data in each interview. Although the GDR is lowest for the self/self combination (the same person reporting about themselves in both the main interview and the reinterview), when other variables (such as the age and education of the respondent) are taken into account, the odds of a difference across interviews are lowest when the same proxy reporter provides the data in both interviews.

Table 3.

GDRs and Logistic Regression Estimates for Employment Variable, by Self/Proxy Combination

Combination GDR Logistic regression: univariate effect Logistic regression: effect in multivariate model
Self/self 0.023 0.00 0.00
Proxy/self 0.041 0.52 0.32
Self/proxy 0.031 0.25 0.07
Proxy/same proxy 0.031 0.22 −0.39
Proxy/different proxy 0.064 0.88 −0.20

Note.— Adapted from O’Muircheartaigh (1991). The GDR is the simple gross difference rate; the marginal effect expresses the impact of the different reporter combinations on the log odds of a difference; the final column shows the effect of the different reporter combinations controlling for other variables.

A later study (Lee, Mathiowetz, and Tourangeau 2007) examined the issue of self–proxy differences experimentally in a study on items intended to measure disability. In that study, a national telephone study, two adults (over age forty) were selected as sample persons from each cooperating household. By design, about half of the time, the same person was interviewed in the initial interview and in the reinterview; in the remainder, the reinterviews were done with the adult who was not the respondent in the first interview. The GDR was lower when the same respondent (whether it was a self-respondent or a proxy) completed both interviews. For example, the GDR for difficulties in seeing or hearing was 0.057 when the data came from a single respondent but was 0.080 when it came from two different respondents. The proxy–proxy combination led to somewhat lower GDRs than the self–self combination. Self-reporters may be more aware of minor fluctuations in their disability status, leading them to give reports that are less consistent over time. By contrast, proxy reporters may base their answers on relatively stable characteristics of the sample person, such as his or her usual status.

3.3 Findings from Statistical Models

3.3.1. The MTMM and the quasi-simplex model

There have been at least two systematic attempts to investigate the characteristics of respondents and questions that produce reliable answers. The most comprehensive of these efforts are by Saris and Gallhofer (2007a, 2007b) and their colleagues (Rodgers, Andrews, and Herzog 1992; Saris, Revilla, Krosnick, and Schaeffer 2010; Revilla, Saris, and Krosnick 2014) and by Alwin and his colleagues (Alwin 2007; Alwin, Baumgartner, and Beattie 2018; see also Hout and Hastings 2016). Both Saris and Gallhofer and Alwin are critical of the reinterview methods that have traditionally been used to estimate the reliability of survey items and advocate alternative methods instead. Saris advocates the use of MTMM experiments in conjunction with the true score model [see (4) and (5)]; Alwin advocates the use of longitudinal data and the quasi-simplex model summarized in (6).

Saris and Gallhofer (2007a, 2007b) note a potential issue with the reliability estimates from reinterview studies. They argue that respondents may give the same answer to a survey question because of a “method” effect. For example, if the question uses an agree–disagree format, respondents may select “agree strongly” as their answer in both interviews partly because they are prone to select that response option regardless of the content of the question. The presence of such method effects inflates the estimated reliability of the answers. In an MTMM experiment, multiple “traits” (i.e., constructs) are measured, each by several methods (such as agree–disagree items and items with item-specific response options). This makes it possible to disentangle the method variance from the valid variance in a specific question via structural equation modeling. In a few of Alwin’s (2007) analyses, he adopts the MTMM approach as well. It is not always clear how important the method effects are; they often seem to contribute little to the reliable variation in answers.

Alwin (2007) argues that memory effects inflate the reliability estimates from the typical reinterview study—that is, according to Alwin, respondents remember their answers from the initial interview and repeat them in the second interview. Instead, he advocates the use of panel data in which interviews are conducted at least two years apart to minimize any memory effects. A three-wave panel survey makes it possible to fit a model that separates random error variance from variance due to true change over time. Additional assumptions are needed to make the parameters of the model identifiable.

Despite these methodological differences, there is at least some convergence in the conclusions reached by the investigators. Table 4 below is our attempt to summarize the key findings from studies using the two approaches. The Saris and Gallhofer (2007b) findings are based on results from eighty-seven MTMM experiments, involving more than 1,000 survey items.

Table 4.

Key Results from MTMM and Quasi-Simplex Reliability Studies

Variable Studies Finding
Question characteristics
 Type of question (factual versus nonfactual) Alwin (2007), Hout and Hastings (2016) Factual questions produce more reliable answers
 Position in questionnaire Saris and Gallhofer (2007a, 2007b) Later items in questionnaire produce more reliable answers
 Question length Alwin (2007) Shorter questions produce more reliable answers
 Syntactic complexity (number of subordinate clauses) Saris and Gallhofer (2007a, 2007b) Questions with fewer subordinate clauses produce more reliable answers
 Polarity Alwin (2007), Alwin et al. (2018), Saris and Gallhofer (2007a, 2007b) Unipolar questions produce more reliable answers than bipolar
 Response format (open versus closed for factual items) Alwin (2007) Factual questions with numeric open-ended responses produce more reliable answers than those that use closed categories with vague quantifiers
 Number of response categories Alwin, Baumgartner, and Beattie (2018), Revilla, Saris, and Krosnick (2014) Fewer response options produce more reliable answers
 Middle categories Alwin et al. (2018), Saris and Gallhofer (2007a, 2007b) Questions without a middle response option produce more reliable answers
 Verbal labeling Alwin (2007), Alwin and Krosnick (1991), Saris and Gallhofer (2007a, 2007b) Questions in which every option is labeled verbally produce more reliable answers
 Type of scale (agree–disagree versus item specific) Saris et al. (2010) Item-specific response scales produce more reliable answers
 Reference period Saris and Gallhofer (2007a, 2007b) Questions asking about the past produce more reliable answers than those asking about the present and future
 Salience of topic Saris and Gallhofer (2007a, 2007b) Questions about salient/central topics produce more reliable answers than questions about less salient topics
 Instructions for respondents Saris and Gallhofer (2007a, 2007b) Respondent instructions produce less reliable answers
Respondent characteristics
 Age Alwin (1989); Alwin and Krosnick (1991), Rodgers, Andrews, and Herzog (1992) Younger respondents provide more reliable answers
 Education Alwin (1989, 2007), Alwin and Krosnick (1991), Saris and Gallhofer (2007a, 2007b) More educated respondents provide more reliable answers

Alwin (2007; see also Hout and Hastings 2016; Alwin et al. 2018) finds that items eliciting facts produce more reliable answers than those measuring subjective constructs (such as attitudes or beliefs); open-ended questions produce more reliable answers than closed-ended; unipolar questions produce more reliable answers than bipolar questions (although this difference seems small); two-category scales produce more reliable answers than scales with more response categories; fully labeled scales produce more reliable answers than scales in which only the endpoints are given verbal labels; and shorter questions produce more reliable answers than longer questions (but only for stand-alone questions as opposed to questions in batteries). Finally, questions with long introductions seem to produce less reliable answers than those with short or no introductions. This last finding illustrates a potential limitation of trying to identify question features associated with unreliability—the underlying causal relationships may be unclear. Questions with long introductions often involve complicated or unfamiliar tasks, which is why the investigators provide the lengthy introductions. The difficulty of the task rather than the length of the introduction may account for the low reliability of answers to such items.

3.3.2. PATH-RV study

How do these results compare with findings from traditional interview–reinterview data? We (Tourangeau, Yan, and Sun 2020) examined 447 questions from the PATH study Adult questionnaire for which we were able to obtain both initial and reinterview responses from at least one hundred respondents. The PATH-RV study was designed to assess the reliability of the questions in the PATH study, a major longitudinal study then in its fourth round. We also examined responses to the PATH Study Youth questionnaire, but restrict our attention here to the adult data, where the sample size is much larger. We coded question characteristics that we thought might be related to difficulties in question comprehension, retrieval, judgment, or mapping and also coded whether answers were likely to be “edited” due to social desirability concerns. Tourangeau, Rips, and Rasinski (2000) argue that measurement error in surveys generally reflects difficulties with one or more components in the response process (e.g., misunderstanding of the questions or retrieval failure). Our general hypothesis was that cognitive difficulties would manifest themselves in reduced reliability.

We fit models that examined the reliability of an item, as measured by the GDR and kappa, as a function of the item’s characteristics. Table 5 presents the main findings from that analysis. We conducted unweighted analyses, weighted analyses that took into account the variance of the GDR or kappa estimate, and weighted analyses that dropped a few highly influential observations.

Table 5.

Linear Regression Coefficients for Item-Level Models of GDR and Kappa

Question characteristic GDR: unweighted GDR: weighted GDR: weighted, problem items dropped Kappa: unweighted Kappa: weighted Kappa: weighted, problem items dropped
Attitudinal (versus factual) 0.13*** 0.08*** 0.10*** −0.05 −0.06 −0.03
Demographic (versus other) −0.13*** −0.08*** −0.11*** 0.25*** 0.26*** 0.16***
Number of sentences 0.00 0.01 0.01 −0.01 0.00 −0.02*
Number of words per sentence 0.002* 0.001 0.001 −0.003 −0.01* −0.004***
Number of response options 0.04*** 0.03*** 0.02*** −0.01 −0.01 −0.01
Position in the questionnaire 0.01* 0.01 0.00 −0.02 −0.03** −0.02***
Not a scale (versus a response scale) −0.06** −0.07*** −0.09*** 0.16** 0.12* 0.26***
Frequency scale (versus other scales) 0.06* 0.02 −0.01 −0.05 0.03 −0.04
Extent of social desirability concerns −0.01 −0.01** −0.01** 0.09*** 0.06* 0.04***
Central topic 0.05*** 0.02*** 0.03*** −0.02 0.02 −0.01
Present/future reference period (versus past) 0.01 0.01* 0.01 0.00 0.03 0.00
Flesch score 0.00 0.00 0.00 0.01 0.01 0.01
Number of items 426 397 389 419 389 375
R-squared 0.69 0.60 0.72 0.32 0.34 0.72

Note.— In the weighted models, the weights were the inverse of the variance of the GDR or kappa.

***p < 0.0001; **p < 0.01; *p < 0.05; †p < 0.10.
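
In outline, the weighted specifications correspond to inverse-variance weighted least squares at the item level. The sketch below uses a hypothetical item-level data frame (the variable names and values are illustrative and are not the PATH-RV data):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical item-level data: one row per question, with the estimated GDR,
# the variance of that estimate, and a few coded question characteristics.
items = pd.DataFrame({
    "gdr":         [0.31, 0.09, 0.01, 0.22, 0.15, 0.27],
    "gdr_var":     [4e-4, 3e-4, 1e-4, 5e-4, 2e-4, 6e-4],
    "attitudinal": [1, 0, 0, 1, 0, 1],
    "n_options":   [5, 3, 2, 7, 4, 5],
})

X = sm.add_constant(items[["attitudinal", "n_options"]])

ols = sm.OLS(items["gdr"], X).fit()                                  # unweighted
wls = sm.WLS(items["gdr"], X, weights=1.0 / items["gdr_var"]).fit()  # weighted
print(wls.params)
```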

The type of item clearly makes a difference to its reliability. Attitudinal items elicit less reliable answers than factual items, and demographic items receive more reliable answers than other types of questions. If we separate the items into three categories—attitudinal, behavioral, and demographic—the average GDRs were 0.31, 0.09, and 0.01 for the three types of item; similarly, the average kappas were 0.46, 0.63, and 0.88. One of the few items in either questionnaire where the answers were perfectly reliable was the question asking the respondent’s sex. Question length—both the number of sentences and the number of words per sentence—reduced reliability. It is almost inevitable that the more response options, the higher the GDR, and that effect is apparent with both adults and youth. With more options, it is more likely that respondents will choose a different answer in the second interview. The effect of the number of options was not significant in the models for kappa. The other item characteristics were significant predictors of reliability in at least some of the models: whether the item featured a response scale (which lowered reliability), the position of the item in the questionnaire, readability as measured by the Flesch reading-ease score, whether it touched on a central topic, and whether it concerned the present or future rather than the past. Later items were found to have lower reliability than earlier items, but this effect was only statistically significant for the model of kappas. The effect of topic centrality was statistically significant only for the GDR statistic; it lowered reliability. Retrospective questions produce more reliable answers than questions about the present or future (although there are relatively few of the latter in the questionnaire). In general, these findings are quite consistent with the findings summarized in table 4, even though they reflect a different method for measuring reliability.

Using the same variables included in our item-level model, we coded ninety-one items from the National Survey of Drug Use and Health (NSDUH) Reliability Study for which estimates of kappa had been published. We used the final model estimates (combining the PATH-RV data for adults and youths since these were not separated in the NSDUH Reliability Study) to predict kappas for the NSDUH items. The overall correlation between our predictions and the published NSDUH kappa values was 0.58. Our item-level model consistently underpredicted the kappas for the NSDUH items—the average deviation between the predicted and actual values was −0.05.1

We also tried to identify respondent characteristics associated with giving reliable survey answers. These respondent-level analyses examined the percentage of items for which the respondent provided the same answers in both interviews as a function of various respondent characteristics. Table 6 displays both weighted and unweighted results (the weight was the final sampling weight for the respondent, which adjusted for unequal selection probabilities and differential nonresponse and was raked to population totals), and the analyses take into account the clustering of the sample by geographic area.

Table 6.

Regression Coefficients for Respondent-Level Models

Respondent characteristic Unweighted Weighted Weighted, problem cases dropped
60 or older (versus younger than 60) for adults 0.00 0.00 0.00
High school or less (versus more than HS) −0.02** −0.03** −0.02**
Conscientiousness 0.005* 0.008* 0.006*
Need for cognition 0.00 0.00 0.00
Every day tobacco user (versus nonuser) −0.03*** −0.03* −0.04***
Some day tobacco user (versus nonuser) −0.02*** −0.02 −0.02**
Male (versus female) −0.02*** −0.03** −0.02***
Hispanic (versus all others) −0.01 −0.01 −0.01
Non-Hispanic Black (versus all others) −0.01 −0.01 0.00
Non-Hispanic White (versus all others) 0.01 0.02 0.02
Households receiving income assistance −0.02** −0.01 −0.01
Households <$50k −0.02*** −0.02 −0.02***
Number of elapsed days −0.001 −0.002 −0.002***
Number of respondents  407  407  383
R-squared 0.35 0.35 0.51

Note.— Except for the number of elapsed days, which was derived from paradata, the respondent characteristics reflect respondent self-reports in the initial interview. The weights used in the weighted models are the final sampling weights for the respondents.

***p < 0.001; **p < 0.01; *p < 0.05; †p < 0.10.

Six respondent characteristics were significantly related to the percentage of identical answers. As expected, the Big Five factor of conscientiousness was positively related to this variable; respondents with a higher level of conscientiousness provided a higher percentage of identical answers than those with a lower level of conscientiousness (conscientiousness is the tendency to be organized and to exercise self-discipline; Goldberg, 1992). Respondents with high school or less education, respondents who report using tobacco (either every day or some days), males, and respondents from households with incomes below $50,000 provided fewer identical answers across the two interviews. In addition, there was a significant effect for the number of days between the interview and reinterview—respondents gave somewhat fewer identical answers as more time elapsed between the two interviews.

We also ran multi-level models that simultaneously examined item and respondent characteristics. These models confirmed the findings I have already summarized.

4. COMPARING THE STATISTICAL MODELS

One of the goals of the PATH-RV study was to compare various reliability estimates as well as to compare them to other tools for evaluating survey questions.

We used the PATH-RV data to calculate three traditional reliability measures—GDRs, kappas, and overtime correlations. We calculated these statistics for every item with at least one hundred observations, dropping items with marginal proportions of 0.95 or above to avoid problems with the kappa statistic. A total of 409 survey items from the Adult Questionnaire met these criteria. Because overtime correlations do not make sense for categorical items with three or more categories, these statistics are available for only 391 of the items. We applied the quasi-simplex model to data from the first three waves of the main PATH study, using the Wiley and Wiley (1970) assumption of equal error variances in each wave to estimate the parameters. Sixty items were analyzed in this way. The MTMM reliability estimates draw on MTMM experiments included in the reinterview questionnaires; respondents received all three items tapping a single construct, and we applied the true score model [see (4) and (5) above].
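
For a single item, the covariance structure of the three-wave quasi-simplex model identifies the wave-2 quantities directly; the Wiley and Wiley (1970) equal-error-variance assumption is what extends the solution to waves 1 and 3. The sketch below (simulated data with parameter values of my choosing) recovers the wave-2 reliability from the three pairwise covariances:

```python
import numpy as np

def wave2_reliability(y1, y2, y3):
    """Quasi-simplex estimate of wave-2 reliability: under the model,
    cov(y1,y2) * cov(y2,y3) / cov(y1,y3) equals the wave-2 true-score
    variance, so dividing by var(y2) gives the reliability."""
    c = np.cov(np.vstack([y1, y2, y3]))
    return (c[0, 1] * c[1, 2] / c[0, 2]) / c[1, 1]

# Simulated three-wave panel: true-score stability 0.8, error SD 0.6.
rng = np.random.default_rng(2)
N = 100_000
t1 = rng.standard_normal(N)
t2 = 0.8 * t1 + 0.6 * rng.standard_normal(N)
t3 = 0.8 * t2 + 0.6 * rng.standard_normal(N)
y1, y2, y3 = (t + 0.6 * rng.standard_normal(N) for t in (t1, t2, t3))

print(wave2_reliability(y1, y2, y3))   # close to var(t2)/var(y2) = 1.00/1.36
```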

We also used two computer-based systems to assess item reliability—SQP and the Question Understanding Aid (QUAID). QUAID is an automated system designed to identify potential comprehension problems that respondents may have, based on linguistic features of the question; it detects and diagnoses five classes of comprehension difficulties (Graesser, Cai, Louwerse, and Daniel 2006). For this analysis, we used the number of issues detected by QUAID, regardless of the type of issue, as a proxy measure of reliability; our assumption was that the more issues detected by QUAID, the lower the reliability.

4.1. Convergence across Methods

How well do the various measures of reliability cohere? Table 7 shows the correlations among the measures. Ten of the correlations were significant, all of them in the expected direction. A few of these associations are worth highlighting. First, kappa was highly correlated with both the GDR and the over-time correlation (−0.61 and 0.84, respectively), but the correlation between the latter two was surprisingly weak (only −0.21). Second, reliability predicted by SQP was significantly related to reliability calculated from all three traditional approaches, but the correlations were relatively low (−0.18 to 0.21). Third, the number of problems found by QUAID was significantly correlated with the GDR and the over-time correlations, but both correlations were rather low (−0.16 in each case). Fourth, reliability calculated through the quasi-simplex model was significantly correlated with kappa and the over-time correlations (0.36 and 0.49), but not with the GDR. Finally, the ex ante methods and the quasi-simplex model were not strongly related to each other. The pattern does not change appreciably if we use the square root of the over-time correlation rather than the correlation itself (the correlation between the two was 0.99) or the quality estimate from SQP rather than the reliability estimate (the correlation between the two was 0.97).

Table 7.

Correlations among Methods (and Sample Sizes)

Type of method           Reliability measure                    GDR        Over-time correlation   Reliability from SQP   QUAID issues   Quasi-simplex model
Traditional approaches   Kappa                                  −0.61***   0.84***                 0.21***                −0.06          0.36**
                         n                                      409        391                     409                    409            60
                         GDR                                    –          −0.21***                −0.18***               −0.16**        0.14
                         n                                      –          391                     409                    409            60
                         Over-time correlation                  –          –                       0.20***                −0.16**        0.49***
                         n                                      –          –                       391                    391            57
Ex ante methods          Reliability from SQP                   –          –                       –                      0.05           0.20
                         n                                      –          –                       –                      409            60
                         Number of issues identified by QUAID   –          –                       –                      –              0.21
                         n                                      –          –                       –                      –              60
Sophisticated method     Quasi-simplex model                    –          –                       –                      –              –
                         n                                      –          –                       –                      –              –

Note.— The ns are the number of items on which the correlations are based; dashes mark the diagonal and redundant below-diagonal cells of the correlation matrix.

**p < 0.01; ***p < 0.001.

We also examined the associations between the measures based on the interview–reinterview data and the estimates from the models fitted to the MTMM experiments. The sample size for these analyses was much reduced because only a small number of items were involved in the MTMM experiments, and only nine of these items were also included in all three waves of the PATH study. As a result, we dropped the estimates from the quasi-simplex model from this analysis.

Table 8 shows the correlations between the different methods for the thirty-five items that figured in the MTMM experiments. (Again, the results do not change appreciably if we use the root over-time correlation instead of the over-time correlation or the quality estimate from the MTMM experiments instead of the reliability estimates.) Only three of the fifteen associations were significant; these correlations, which ranged from −0.93 to 0.71, all involve the three traditional approaches. The two ex ante methods and the MTMM estimates were not significantly related to the other measures of reliability—the SQP reliability estimate was marginally associated with the GDR, but this correlation was in the direction opposite to the one expected.

Table 8.

Correlations among Six Methods

Reliability measure                    GDR        Over-time correlation   Reliability from SQP   QUAID issues   Reliability from MTMM
Kappa                                  −0.93***   0.71***                 −0.23                  −0.15          −0.01
GDR                                    –          −0.68***                0.31                   0.05           0.25
Over-time correlation                  –          –                       −0.18                  −0.02          −0.23
Reliability from SQP                   –          –                       –                      −0.24          0.07
Number of issues identified by QUAID   –          –                       –                      –              −0.27
Reliability from MTMM                  –          –                       –                      –              –

Note.— All of the correlations are based on thirty-five items; dashes mark the diagonal and redundant below-diagonal cells of the correlation matrix.

***p < 0.001.

4.2. Discussion

To our knowledge, the PATH-RV study is the first to compare different methods for assessing reliability empirically. The three traditional survey approaches (the GDR, kappa, and over-time correlations) were significantly correlated with each other (see tables 7 and 8); kappa was highly correlated with both of the other traditional measures, but the correlation between the GDR and the over-time correlation was quite a bit lower. Reliability estimates from the quasi-simplex models were moderately correlated with two of the three traditional measures (kappa and the over-time correlations, but not the GDR). The reliability estimates from the MTMM experiments were not related to any of the other reliability measures. As for the two ex ante methods, reliability predicted by SQP was weakly related to the three traditional approaches in the adult sample, and the number of problems identified by QUAID was, at best, weakly related to the other reliability measures. A factor analysis suggested that QUAID taps a different latent factor from the three traditional measures.

Alwin (2007) has emphasized the biasing impact of memory in the traditional interview–reinterview approach to estimating reliability. We found evidence for some effect of memory—answers were more likely to change as more time elapsed between interviews—but the effect was small and doubtless partly reflected true changes between interviews. If memory for the earlier answers consistently biases reliability estimates upward, we would expect the over-time correlations to be higher, on average, than the reliability estimates from the quasi-simplex model. For the items we examined, the averages are almost the same: the mean over-time correlation was 0.71 versus a mean estimated reliability of 0.70 from the quasi-simplex models. Similarly, if the traditional reliability estimates are biased downward, as Saris and Gallhofer (2007a, 2007b) argue, one would expect the mean over-time correlations to be lower than the mean SQP estimates, but they are somewhat higher—0.69 versus 0.60.

To me, the results in tables 7 and 8 suggest that the traditional methods have good convergent validity. It seems unlikely that respondents’ answers in the reinterview are seriously distorted by memory effects. Why would respondents remember their answers to more than 400 questions from interviews lasting more than an hour? Perhaps if the questionnaire were much shorter or the interviews more closely spaced than in the PATH-RV study, memory might be a more important source of bias.

By contrast, the MTMM results may well be affected by memory of the earlier questions. Although van Meurs and Saris (1990) claim that respondents forget their answers in twenty minutes or so, other studies (e.g., Todorov 2000) have demonstrated context effects involving widely spaced questions in an interview, suggesting that the effects of earlier questions can linger even if the earlier answer cannot be explicitly recalled. A recent study (Rettig, Höhne, and Blom 2019) found that some 85 percent of respondents in an Internet panel reported that they recalled their answers to questions they had answered about twenty minutes earlier, and 64 percent could, in addition, correctly reproduce their earlier answers. Schwarz, Revilla, and Weber (2019) report similar results from a laboratory study. Many respondents can probably guess their earlier answers without recalling them. As a result, van Meurs and Saris argue for assessing the memory effect by the difference between the proportion of respondents who correctly reproduce their earlier answers among those who say they remember them and the corresponding proportion among those who say they cannot remember them (the latter group providing an estimate of the guessing rate). For example, if 80 percent of the respondents who claim to remember their earlier answers correctly reproduce the answer but 60 percent of the respondents who say they do not remember are nonetheless able to reproduce the answer, the “memory effect” is 20 percentage points. By this measure, some 17–34 percent of respondents recall their prior answers, which presumably would introduce substantial bias into the MTMM estimates. In addition, this measure almost certainly overcorrects for guessing the earlier answers because it assumes no memory effect among respondents who say they cannot remember. A large number of studies demonstrate effects of information presented earlier even among those who cannot consciously recall that information (see Schacter 1987).
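As a simple illustration of this measure, the sketch below computes the memory effect from the two reproduction rates in the hypothetical example just given; the proportions are the figures from the text, not estimates from any particular study.

```python
# The van Meurs-Saris memory effect: the reproduction rate among respondents
# who say they remember their earlier answer minus the rate among those who
# say they do not (the latter approximating the guessing rate).
def memory_effect(p_reproduce_rememberers, p_reproduce_nonrememberers):
    return p_reproduce_rememberers - p_reproduce_nonrememberers

# Hypothetical figures from the example in the text: 80 percent versus 60 percent
print(f"{memory_effect(0.80, 0.60):.2f}")  # prints 0.20, a 20 percentage point effect
```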

5. CONCLUSIONS

Oscar K. Buros, you had it right! There is great value in doing studies to assess the reliability and validity of our data, and I wish many more studies did so. We have accumulated much knowledge about the factors that make it difficult for respondents to answer survey questions reliably (such as the work summarized in table 4) and about the types of respondents who are likely to give reliable answers. Almost twenty years ago, Schaeffer and Presser (2003) wrote an excellent review article titled “The Science of Asking Questions.” It is studies like the ones reviewed here that provide the empirical building blocks for that science.

We also need to keep examining our tools for assessing survey questions. We should not simply assume that our methods yield valid results. As tables 7 and 8 demonstrate, the different methods converge, but often only weakly. We need to understand when a given method yields valid results and whether the various methods provide complementary evidence or merely conflicting evidence.

Footnotes

1. We also obtained predicted reliabilities from the Survey Quality Predictor (SQP) for these ninety-one items and compared SQP predictions with the published NSDUH kappas. SQP is an automated system that predicts reliability, validity, method effects, and total quality of a question (Saris and Gallhofer 2007b). It is based on the results of a large number of MTMM experiments. The overall correlation between the predictions and the kappas was 0.34. SQP predictions also underestimated the reliability of the NSDUH items; the average deviation between SQP predictions and the published NSDUH kappa values was −0.16.

This article is the write-up of the sixth Monroe Sirken Lecture, delivered at the Annual Conference of the American Association for Public Opinion Research on June 11, 2020. I am grateful to AAPOR and the American Statistical Association for giving me this honor and to Monroe Sirken for establishing this award. Much of the work discussed here was done with Ting Yan and Hanyu Sun; I am deeply indebted to them both. Some of the work reported here was funded by a grant from the National Institute on Drug Abuse, National Institutes of Health [5R01DA040736-02 to R.T.]. The views and opinions expressed in this manuscript are those of the author only and do not necessarily represent the views, official policy, or position of the US Department of Health and Human Services or any of its affiliated institutions or agencies.

REFERENCES

1. Alker H., Tourangeau R., Staines C. B. (1976), “Facilitating Personality Change with Audiovisual Self-Confrontation and Interviews,” Journal of Consulting and Clinical Psychology, 44, 720–728.
2. Alwin D. F. (1989), “Problems in the Estimation and Interpretation of the Reliability of Survey Data,” Quality & Quantity, 23, 277–331.
3. Alwin D. F. (2007), Margins of Error: A Study of Reliability in Survey Measurement, Hoboken, NJ: John Wiley.
4. Alwin D. F., Baumgartner E. M., Beattie B. A. (2018), “Number of Response Categories and Reliability in Attitude Measurement,” Journal of Survey Statistics and Methodology, 6, 212–239.
5. Alwin D. F., Krosnick J. A. (1991), “The Reliability of Survey Attitude Measurement: The Influence of Question and Respondent Attributes,” Sociological Methods & Research, 20, 139–181.
6. Andrews F. M. (1984), “Construct Validity and Error Components of Survey Measures: A Structural Equation Approach,” Public Opinion Quarterly, 48, 409–442.
7. Bassili J. N., Fletcher J. (1991), “Response-Time Measurement in Survey Research: A Method for CATI and a New Look at Non-Attitudes,” Public Opinion Quarterly, 55, 331–346.
8. Bassili J. N., Scott B. S. (1996), “Response Latency as a Signal to Question Problems in Survey Research,” Public Opinion Quarterly, 60, 390–399.
9. Biderman A. (1980), Report of a Workshop on Applying Cognitive Psychology to Recall Problems of the National Crime Survey, Washington, DC: Bureau of Social Science Research.
10. Biemer P. P. (2004), “Simple Response Variance: Then and Now,” Journal of Official Statistics, 20, 417–439.
11. Cantor D. (2008), “A Review and Summary of Studies on Panel Conditioning,” in Handbook of Longitudinal Research: Design, Measurement, and Analysis, ed. Menard S., pp. 123–138, Burlington, MA: Academic Press.
12. Cohen J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.
13. Graesser A. C., Cai Z., Louwerse M. M., Daniel F. (2006), “Question Understanding Aid (QUAID): A Web Facility That Tests Question Comprehensibility,” Public Opinion Quarterly, 70, 3–22.
14. Heise D. R. (1969), “Separating Reliability and Stability in Test-Retest Correlation,” American Sociological Review, 34, 93–101.
15. Hout M., Hastings O. P. (2016), “Reliability of the Core Items in the General Social Survey: Estimates from the Three-Wave Panels, 2006–2014,” Sociological Science, 3, 971–1002.
16. Lee S., Mathiowetz N. A., Tourangeau R. (2007), “Measuring Disability in Surveys: Consistency over Time and across Respondents,” Journal of Official Statistics, 23, 163–184.
17. Moss L., Goldstein H. (1979), The Recall Method in Social Surveys, London: The University of London Institute of Education.
18. O’Muircheartaigh C. (1991), “Simple Response Variance: Estimation and Determinants,” in Measurement Error in Surveys, eds. Biemer P., Groves R., Lyberg L., Mathiowetz N., Sudman S., pp. 551–574, New York: John Wiley.
19. Rettig T., Höhne J. K., Blom A. G. (2019), “Recalling Survey Answers: A Comparison across Question Types and Different Levels of Online Panel Experience.” Manuscript under review.
20. Revilla M., Saris W. E., Krosnick J. A. (2014), “Choosing the Number of Categories in Agree/Disagree Scales,” Sociological Methods & Research, 43, 73–97.
21. Rodgers W. L., Andrews F. M., Herzog A. R. (1992), “Quality of Survey Measures: A Structural Equation Modeling Approach,” Journal of Official Statistics, 3, 251–275.
22. Saris W., Gallhofer I. (2007a), “Estimation of the Effects of Measurement Characteristics on the Quality of Survey Questions,” Survey Research Methods, 1, 29–43.
23. Saris W., Gallhofer I. (2007b), Design, Evaluation, and Analysis of Questionnaires for Survey Research, Hoboken, NJ: John Wiley.
24. Saris W. E., Revilla M., Krosnick J. A., Schaeffer E. M. (2010), “Comparing Questions with Agree/Disagree Response Options to Questions with Item-Specific Response Options,” Survey Research Methods, 4, 61–79.
25. Saris W., Satorra A., Coenders G. (2004), “A New Approach to Evaluating the Quality of Measurement Instruments: The Split-Ballot MTMM Design,” Sociological Methodology, 34, 311–347.
26. Schacter D. L. (1987), “Implicit Memory: History and Current Status,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 501–518.
27. Schaeffer N. C., Presser S. (2003), “The Science of Asking Questions,” Annual Review of Sociology, 29, 65–88.
28. Schwarz H., Revilla M., Weber W. (2019), “Memory Effects in Repeated Survey Questions—Reviving the Empirical Investigation of the Independent Measurements Assumption.” Manuscript under review.
29. Todorov A. (2000), “Context Effects in National Health Surveys: Effects of Preceding Questions on Reporting Serious Difficulty Seeing and Legal Blindness,” Public Opinion Quarterly, 64, 65–76.
30. Tourangeau R. (1984), “Cognitive Science and Survey Methods,” in Cognitive Aspects of Survey Design: Building a Bridge between Disciplines, eds. Jabine T., Straf M., Tanur J., Tourangeau R., pp. 73–100, Washington, DC: National Academy Press.
31. Tourangeau R., Rasinski K. (1988), “Cognitive Processes Underlying Context Effects in Attitude Measurement,” Psychological Bulletin, 103, 299–314.
32. Tourangeau R., Rasinski K., D’Andrade R. (1991), “Attitude Structure and Belief Accessibility,” Journal of Experimental Social Psychology, 27, 48–75.
33. Tourangeau R., Rips L. J., Rasinski K. (2000), The Psychology of Survey Response, New York: Cambridge University Press.
34. Tourangeau R., Yan T., Sun H. (2020), “Who Can You Count on? Understanding the Determinants of Reliability,” Journal of Survey Statistics and Methodology, doi: 10.1093/jssam/smaa018.
35. van Meurs A., Saris W. E. (1990), “Memory Effects in MTMM Studies,” in Evaluation of Measurement Instruments by Meta-Analysis of Multitrait–Multimethod Studies, eds. van Meurs A., Saris W. E., pp. 134–147, Amsterdam: North-Holland.
36. Warren J. R., Halpern-Manners A. (2012), “Panel Conditioning in Longitudinal Social Science Surveys,” Sociological Methods & Research, 41, 491–534.
37. Wiley D. E., Wiley J. A. (1970), “The Estimation of Measurement Error in Panel Data,” American Sociological Review, 35, 112–117.
38. Zaller J., Feldman S. (1992), “A Simple Theory of the Survey Response: Answering Questions versus Revealing Preferences,” American Journal of Political Science, 36, 579–616.
39. Zaller J. R. (1992), The Nature and Origins of Mass Opinion, Cambridge: Cambridge University Press.
