Abstract
The usual method for assessing the reliability of survey data has been to conduct reinterviews a short interval (such as one to two weeks) after an initial interview and to use these data to estimate relatively simple statistics, such as gross difference rates (GDRs). More sophisticated approaches have also been used to estimate reliability. These include estimates from multi-trait, multi-method experiments, models applied to longitudinal data, and latent class analyses. To our knowledge, no prior study has systematically compared these different methods for assessing reliability. The Population Assessment of Tobacco and Health Reliability and Validity (PATH-RV) Study, done on a national probability sample, assessed the reliability of answers to the Wave 4 questionnaire from the PATH Study. Respondents in the PATH-RV were interviewed twice about two weeks apart. We examined whether the classic survey approach yielded different conclusions from the more sophisticated methods. We also examined two ex ante methods for assessing problems with survey questions and item nonresponse rates and response times to see how strongly these related to the different reliability estimates. We found that kappa was highly correlated with both GDRs and over-time correlations, but the latter two statistics were less highly correlated, particularly for adult respondents; estimates from longitudinal analyses of the same items in the main PATH study were also highly correlated with the traditional reliability estimates. The latent class analysis results, based on fewer items, also showed a high level of agreement with the traditional measures. The other methods and indicators had at best weak relationships with the reliability estimates derived from the reinterview data. Although the Question Understanding Aid seems to tap a different factor from the other measures, for adult respondents, it did predict item nonresponse and response latencies and thus may be a useful adjunct to the traditional measures.
Keywords: Latent class analyses, MTMM experiments, Quasi-simplex model, Reinterview data, Survey reliability
1. INTRODUCTION
The ultimate goal of question evaluation and testing is to produce survey items that yield reliable and valid answers. Although there are a wide variety of question evaluation methods that researchers can use, the methods do not necessarily converge in their conclusions (Yan, Kreuter, and Tourangeau 2012; Maitland and Presser 2016, 2018). In addition, few studies have compared the outcomes of these evaluation methods to traditional psychometric measures of validity and reliability, a problem noted by Yan et al. (2012).
The most common method for assessing the reliability of survey responses has been to conduct reinterviews with respondents a short interval (one to two weeks) after an initial interview and to estimate relatively simple statistics from these data, such as the gross difference rate (GDR). The GDR is the proportion of respondents giving answers to an item in the reinterview that differ from the answers in the initial interview:

$$\mathrm{GDR} = 1 - p_a,$$

in which $p_a$ is the proportion of respondents giving the same answer in both interviews. Cohen’s kappa (Cohen 1960, 1968) corrects the agreement rate for chance agreement:

$$\kappa = \frac{p_a - p_e}{1 - p_e},$$

in which $p_e$ is the level of agreement that would be expected if the two responses were independent. The correlation between responses to the same question asked at two different time points has also been used to estimate reliability (e.g., O’Muircheartaigh 1991).
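As a concrete illustration, the sketch below (a minimal example with made-up data, not the code used in the study) computes all three traditional statistics for a single dichotomous item from paired interview and reinterview responses:

```python
import numpy as np

def traditional_reliability(t1, t2):
    """GDR, Cohen's kappa, and over-time (Pearson) correlation
    for one item measured in an interview (t1) and reinterview (t2)."""
    t1, t2 = np.asarray(t1), np.asarray(t2)
    # Gross difference rate: proportion of answers that changed (1 - p_a)
    p_a = np.mean(t1 == t2)
    gdr = 1.0 - p_a
    # Expected agreement under independence, from the two sets of marginals
    cats = np.union1d(t1, t2)
    p_e = sum(np.mean(t1 == c) * np.mean(t2 == c) for c in cats)
    kappa = (p_a - p_e) / (1.0 - p_e)
    # Over-time correlation (sensible for dichotomous or ordered items)
    r = np.corrcoef(t1, t2)[0, 1]
    return gdr, kappa, r

# Toy example with a dichotomous (0/1) item
t1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
t2 = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(traditional_reliability(t1, t2))  # GDR = 0.2, kappa ~ 0.58, r ~ 0.58
```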
More sophisticated approaches have also been used to estimate the reliability of survey items. These include multi-trait, multi-method (MTMM) experiments, the quasi-simplex model applied to data from three (or more) waves of a longitudinal survey, and latent class analysis (LCA). Each of these models offers somewhat different definitions of reliability, and the more sophisticated methods require that certain assumptions be met for the models to produce valid results. For instance, with LCA, a basic assumption is that the errors in the “indicators” (i.e., the responses to the survey questions) are independent conditional on the latent class membership; this is known as the “local independence” assumption. When this assumption is not met, LCA produces biased error rates (e.g., Kreuter, Yan, and Tourangeau 2008; Yan et al. 2012). Furthermore, all three of these methods require multiple measurements of the same construct.
Andrews (1984) introduced the use of structural equation modeling for analyzing MTMM experiments. These typically involve nine (or more) survey items measuring three different traits (or constructs) using three different methods. Three traits and three methods are the minimum necessary to achieve an identifiable model. In some cases, individual respondents get only two of the three measures of a given construct in a split-ballot design (Saris, Satorra, and Coenders 2004). Although other models are possible, the MTMM model that is usually applied (e.g., Saris and Gallhofer 2007a, p. 32, 2007b) assumes that the observed response (yij) reflects a “true” score (tij) plus a random error (eij):
$$y_{ij} = r_{ij}\, t_{ij} + e_{ij} \qquad (1)$$
in which rij is the reliability of item i—that is, the relationship between the true score and the observed response. The true score, in turn, reflects the construct of interest (fi) and any method effect (Mj):
$$t_{ij} = v_{ij}\, f_i + m_{ij}\, M_j \qquad (2)$$
in which vij is the validity coefficient, representing the relationship between the true score and the underlying construct, and mij represents the impact of method j on responses to item i. This model, the “true score” model, assumes that the errors in the observed scores are independent of each other and that the true scores reflect a single construct and a single methods effect. Saris and Gallhofer (2007a, 2007b) have built a computerized tool—the survey quality predictor (SQP)—based on a meta-analysis of a large number of MTMM estimates (based on the true score model). SQP predicts the reliability and validity of the answers based on a set of question characteristics.
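Substituting (2) into (1) makes the roles of the two coefficients explicit; the product rij vij is the coefficient linking the observed answer to the underlying construct, which is the sense in which the quality estimates discussed later equal the product of the reliability and validity estimates:

$$y_{ij} = r_{ij}\bigl(v_{ij} f_i + m_{ij} M_j\bigr) + e_{ij} = r_{ij} v_{ij}\, f_i + r_{ij} m_{ij}\, M_j + e_{ij}.$$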
Alwin (2007) has advocated the use of longitudinal data for reliability estimates. The “quasi-simplex” model used in analyzing such data assumes that the true score at a given wave (tik) reflects the true score at the previous wave (ti,k−1) plus change over time (zik):
$$t_{ik} = \beta_{k,k-1}\, t_{i,k-1} + z_{ik},$$

in which $\beta_{k,k-1}$ reflects the relationship between the true score at wave k and the true score at the prior wave. The observed score for a given wave (yik) is just the true score plus a random error (eik). In this framework, the reliability of the observed score is the ratio of the true score variance to the observed score variance. A basic assumption of the quasi-simplex model is that there is no lagged effect of the true score from two waves prior to the current wave. In addition, to make the model parameters identifiable, either the reliabilities (Heise 1969) or the error variances (Wiley and Wiley 1970) must be assumed to be constant across waves of the survey.
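With exactly three waves, the Wiley–Wiley version of the model can be solved in closed form from the observed variances and covariances, since Cov(y1, y2)Cov(y2, y3)/Cov(y1, y3) equals the true-score variance at wave 2. The sketch below is a minimal illustration of that identity with simulated data (the variable names are invented; it is not the estimation code used for the PATH items):

```python
import numpy as np

def wiley_wiley_reliability(y1, y2, y3):
    """Three-wave quasi-simplex reliabilities under the Wiley-Wiley
    assumption of equal error variances across waves.

    With y_k = t_k + e_k and t_k = b_k * t_{k-1} + z_k, the covariances give
    Var(t_2) = Cov(y1, y2) * Cov(y2, y3) / Cov(y1, y3)."""
    cov = np.cov(np.vstack([y1, y2, y3]))   # 3 x 3 covariance matrix
    var_t2 = cov[0, 1] * cov[1, 2] / cov[0, 2]
    err_var = cov[1, 1] - var_t2            # equal across waves by assumption
    # Reliability = true-score variance / observed variance at each wave
    return [1.0 - err_var / cov[k, k] for k in range(3)]

# Toy example: a stable trait measured with error at three waves
rng = np.random.default_rng(1)
t = rng.normal(size=2000)
waves = [0.9 * t + rng.normal(scale=0.6, size=2000) for _ in range(3)]
print(wiley_wiley_reliability(*waves))      # all roughly 0.69
```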
The latent class model assumes that respondents fall into a small number of categories (or latent classes). The probability of an observed response (yij) on item j depends on the conditional probability of observing that response given that the respondent is in latent class k, summed across all K of the latent classes. It is assumed that the responses are independent of each other within each latent class, so the probability of the vector of responses is:
$$P(y_{i1}, y_{i2}, \ldots, y_{iJ}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} P(y_{ij} \mid X = k) \qquad (3)$$
in which there are K latent classes, each with a “prevalence” (or unconditional probability) of $\pi_k$. The model produces estimates of these unconditional probabilities—representing the relative sizes of each latent class—as well as of the conditional probabilities of each response within each latent class ($P(y_{ij} \mid X = k)$). Equation (3) assumes conditional independence—that is, within a latent class, the observed variables are independently distributed. Clogg and Manning (1996) propose a measure of reliability based on a latent class model with two latent classes; the measure reflects the degree to which item y places respondents in the same category as the latent class variable. Clogg and Manning also propose a reliability statistic, defined in equation (4) below, that is more analogous to a correlation:
(4)
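The sketch below is a bare-bones illustration of how such a model can be fit by the EM algorithm for dichotomous (0/1) indicators and two latent classes, together with one way to form a per-item error rate by summing the implied false negative and false positive rates; the function and variable names are invented, and this is not the software used in the study:

```python
import numpy as np

def two_class_lca(y, n_iter=500, seed=0):
    """EM estimation of a two-class latent class model for binary (0/1) items.

    Returns the class prevalences pi (the unconditional probabilities in
    equation (3)) and p[j, k] = P(y_j = 1 | class k)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)             # shape (n_respondents, n_items)
    n, J = y.shape
    pi = np.array([0.5, 0.5])
    p = rng.uniform(0.25, 0.75, size=(J, 2))   # random start to break symmetry
    for _ in range(n_iter):
        # E-step: posterior class membership, using local independence
        loglik = np.log(pi) + y @ np.log(p) + (1 - y) @ np.log(1 - p)
        post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update prevalences and conditional response probabilities
        pi = post.mean(axis=0)
        p = np.clip((y.T @ post) / post.sum(axis=0), 1e-6, 1 - 1e-6)
    return pi, p

def item_error_rate(p, yes_class):
    """False negative rate in the 'yes' class plus false positive rate in the
    other class, summed per item (one way to form an overall item error rate)."""
    return (1 - p[:, yes_class]) + p[:, 1 - yes_class]

# Toy usage: three noisy binary indicators of one underlying dichotomy
rng = np.random.default_rng(1)
truth = rng.random(1000) < 0.3
y = np.column_stack([np.where(rng.random(1000) < 0.1, ~truth, truth)
                     for _ in range(3)]).astype(float)
pi_hat, p_hat = two_class_lca(y)
yes = int(np.argmax(p_hat.mean(axis=0)))       # label the 'yes' class
print(pi_hat, item_error_rate(p_hat, yes))     # error rates near 0.2 per item
```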
1.1. Problems in Estimating Reliability
There are potential problems with all of these methods. Kappa is relatively easy to calculate and can be applied to different study designs, but it is sensitive to the prevalence rate (Spitznagel and Helzer 1985). As a result, sometimes the level of agreement is very high but kappa is low, typically reflecting extreme marginals (Cicchetti and Feinstein 1990)—that is, when there is little variation in the answers, kappa values are almost always quite low because the expected levels of agreement are very high.
All of the measures based on reinterview data assume that there is no change in the true scores between the interview and reinterview. This assumption may often be violated. Tourangeau, Yan, Sun, Hyland, and Stanton (2019) found that respondents in the Population Assessment of Tobacco and Health Reliability and Validity (PATH-RV) Study explained many of their discrepant answers as resulting from true change. In addition, as Saris and Gallhofer (2007a, 2007b) have noted, the correlation between the same item administered at two different time points is not a pure measure of reliability, but the product of the reliabilities of the item at time 1 and time 2 and the correlation between the true scores on the two occasions:
$$r_{12} = r_1\, r_2\, \rho,$$

in which $r_{12}$ is the observed over-time correlation, $r_1$ and $r_2$ are the two reliabilities, and $\rho$ is the correlation between the true scores across the interviews. The assumption of no change between time 1 and time 2 implies that $\rho$ equals 1.0; under this assumption, the over-time correlation is the product of the two reliabilities.
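As a worked illustration (the numbers are hypothetical): if the item is equally reliable on the two occasions, so that r1 = r2 = r, and the true scores do not change, then the observed over-time correlation is r squared, and

$$r = \sqrt{r_{12}}\,; \qquad \text{e.g., } r_{12} = 0.69 \;\Rightarrow\; r \approx 0.83.$$

This is the rationale for also examining the square root of the over-time correlation later in the article as an estimate of the (geometric mean) reliability of the two administrations.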
On the other hand, a number of authors (including Alwin 2007) have also argued that the errors in interview and reinterview may be correlated due to memory effects. This concern dates back at least to Moser and Kalton (1972), who wrote:
… even if persons were to submit themselves to repeat questioning, a comparison of the two sets of results would hardly serve as an exact test of reliability, since they could not be regarded as independent. At the retest, respondents may remember their first answers and give consistent retest answers, an action that would make the test appear more reliable than is truly the case. Alternatively, the first questioning may make them think more about the survey subject … or events occurring between the two tests may cause them to change their views on the subject. (pp. 353–4).
There does not seem to be much empirical evidence for such memory effects in reinterview studies, and it is not clear how well respondents would be able to remember their answers from a lengthy interview conducted a week or two earlier. Studies show that respondents often have considerable difficulty recalling the answers they gave earlier and extrapolate backward from their current answers to reconstruct their earlier ones (Bem and McConnell 1970; Smith 1984; Ross 1989). Tourangeau, Yan, and Sun (2020) examined the effect of the elapsed time between the initial interviews and reinterviews in the PATH-RV Study and found a very small effect; each additional day between interviews lowered the proportion of identical answers by 0.2 percent. As they note, this effect of elapsed time may, in part, reflect the greater chance of true change as more time passes between interviews.
The problem of memory effects would appear to be much more serious in MTMM studies, in which respondents answer at least two questions designed to tap the same construct within a single interview. In addition, it is conceivable, even likely, that answers to the first item affect answers to the later item(s) (see Tourangeau and Rasinski 1988 on “carryover” effects). Although van Meurs and Saris (1990) claim that respondents forget their answers in 20 minutes or so, other studies (e.g., Todorov 2000) have demonstrated context effects involving widely spaced questions in an interview, suggesting that the effects of earlier questions can linger, even if the earlier answer cannot be explicitly recalled. A recent study (Rettig, Höhne, and Blom 2019) found that some 85 percent of respondents in an Internet panel reported they recalled their answers to questions they had answered about 20 minutes earlier and 64 percent could, in addition, correctly reproduce the earlier answers. Schwarz, Revilla, and Weber (2019) report similar results from a laboratory study.1
Finally, Alwin (2007) has argued that survey items contain both specific components of variance and random error variance. The MTMM design does not allow the separation of reliable specific variance from random error variance and, as a result, the reliability of the question is likely to be underestimated (Alwin and Jackson 1979; Alwin 2011).
Quasi-simplex models employ a panel design to collect three or more waves of measures over lengthy periods of time. Such designs minimize or eliminate memory effects by spreading the reinterviews over months or years, and they allow unreliability to be separated from change in the true scores (Alwin 2007). However, Coenders, Saris, Batista‐Foguet, and Andreenkova (1999) showed that the variance in the reliability estimates can be very large when the stability of the true score over time is low. Very long time intervals between waves are likely to reduce stability and thus increase the variability in the estimates. In addition, Coenders et al. (1999) demonstrate that minor violations of the model assumptions can lead to large biases in the reliability estimates.
Although LCA does not require the multiple observed indicators to be error free, it does assume that the observed response variables are independent of each other within latent classes. This local independence assumption is often violated in practice, producing biased estimates of error rates (e.g., Kreuter et al. 2008; Yan et al. 2012). In addition, Spencer (2008) has argued that LCA models will consistently underestimate the actual error rates when the actual error rates are low (summing to less than 1) and the model misclassifies some individuals; these conditions are likely to be met often in practice.
As this summary indicates, all of the methods for estimating reliability have their potential flaws that could lead them to produce biased results. It is, thus, important to see whether the methods (with their different weaknesses) produce similar or diverging results. The mere potential for bias does not mean that most estimates are biased or, even if they are, that the biases are large.
1.2. Reliability and Question Evaluation
At least three studies have attempted to relate the outcomes of question evaluation methods to the reliability of the answers. Yan et al. (2012) used reliability (measured through over-time correlations and error rates from LCA models) to rank two sets of three items each. The rank orders according to the over-time correlations did not always agree with those produced by the LCA error rates. For one triplet, the item with the largest error rate (i.e., the one ranked the worst of the three) also had the highest over-time reliability (see table 3 in Yan et al. 2012). For the second triplet, the item with the largest error rate also had the lowest reliability. Furthermore, the correlation between LCA error rates and quality predictions from SQP was only 0.37 across the fifteen survey items Yan and her colleagues examined and was in the direction opposite to the one expected.
Table 3.
Correlations among Eight Methods, Adult Sample
| Type of method | Reliability measure | GDR | Over-time correlation | Reliability from SQP | QUAID issues | Reliability from MTMM | Reliability from LCA | LCA error rate |
|---|---|---|---|---|---|---|---|---|
| Traditional approach | Kappa | −0.93*** | 0.71*** | −0.23 | −0.15 | −0.01 | 0.50** | −0.57*** |
| | GDR | | −0.68*** | 0.31 | 0.05 | 0.25 | −0.50** | 0.69*** |
| | Over-time correlation | | | −0.18 | −0.02 | −0.23 | 0.68*** | −0.42* |
| Ex-ante Method | Reliability from SQP | | | | −0.24 | 0.07 | −0.14 | 0.22 |
| | # of Issues identified by QUAID | | | | | −0.27 | 0.25 | 0.06 |
| Sophisticated Method | Reliability from MTMM | | | | | | −0.03 | 0.16 |
| | Reliability from LCA | | | | | | | −0.40* |
| | LCA error rate | | | | | | | |
*p < 0.05.
**p < 0.01.
***p < 0.001. All of the correlations are based on thirty-five items.
Maitland and Presser (2016) compared seven question evaluation methods on their accuracy in predicting the reliability of answers to a set of questionnaire items. The reliability estimates were derived from reinterview data. The seven methods were SQP, Question Understanding Aid (QUAID) (Graesser, Cai, Louwerse, and Daniel 2006), the question appraisal system (QAS; Lessler and Forsyth 1996), expert review, cognitive interviews, behavior coding of interviews, and response latency. Although all individual question evaluation methods, except SQP, significantly predicted the reliability of the answers, none of the models including only one evaluation method provided an adequate fit to the data as compared to the model that included five of the methods—SQP, QAS, QUAID, cognitive interviewing, and expert review. This finding suggests that multiple methods are better than any individual method in predicting reliability, in line with what Maitland and Presser call the “complementary methods hypothesis”—the idea that different question evaluation methods produce different but valid information about an item.
Finally, an earlier study, by Hess, Singer, and Bushery (1999), found that certain behavior codes (such as qualified answers) are related to reliability (as measured by simple response variance).
1.3. The Current Study
To our knowledge, no prior study has yet undertaken a systematic empirical comparison of different methods for estimating reliability (although van der Ark, van der Palm, and Sijtsma [2011] conducted a simulation study comparing various reliability measures). Thus, the primary goal of this article is to compare different methods of estimating reliability to see whether the different measures produce similar conclusions about an item. Specifically, we evaluate (1) standard survey methods based on reinterview data (comparing Cohen’s kappa, GDRs, and over-time correlations), (2) more sophisticated model-dependent methods (from MTMM experiments, quasi-simplex modeling of longitudinal data, and LCA), and (3) ex ante methods that do not require new data collection (SQP and QUAID). These methods all produce estimates of an item’s reliability or other measures of item quality. However, it is not clear whether the estimates produced by these methods agree with each other.
The second goal of the article is to explore the relationships between reliability and two other common indicators of data quality: response latency and item nonresponse rates. A series of recent papers has examined the determinants of response latency (Yan and Tourangeau 2008; Couper and Kreuter 2013; Olson and Smyth 2015). These papers are based on the hypothesis, first proposed by Yan and Tourangeau (2008), that cognitive difficulties in answering the questions are a major source of long response times. All three studies find evidence supporting this conjecture. For example, Yan and Tourangeau conclude that the complexity of the question (as measured by the number of clauses in the question) and the complexity of the response options affect response times; more recently, Olson and Smyth identified respondent confusion as a primary predictor of response time. We propose that item nonresponse has similar roots in respondents’ cognitive difficulties. Respondents are less likely to answer a question when they do not understand it or cannot remember the information needed to answer it. Because similar difficulties are responsible for unreliable answers, slow response times, and item nonresponse, we expect the three to be related. Thus, we test the hypothesis that reliability (as measured in the different ways) is negatively related to response latency and item nonresponse and determine which measures of reliability best predict these additional data quality indicators.
2. METHODS
2.1 Data
The data for this study came from the first three waves of the PATH Study and from the PATH-RV Study, which collected interview and reinterview data from a national sample of respondents.
The PATH Study is a major national longitudinal study of tobacco use and health. It is following more than 40,000 members of the US household population aged twelve years and older and includes both tobacco users and nonusers. The sample is a multi-stage area probability sample. The first wave of interviewing was completed in 2014, the second wave in 2015, and the third wave in 2016. The study uses audio computer-administered self-interviewing (ACASI) to collect information on a wide range of topics, including use of tobacco; attitudes and perceptions toward different tobacco products; knowledge of tobacco products and their health consequences; tobacco-use cessation attempts; uptake of new products, switching of products or brands, and use of more than one tobacco product; and health conditions, including ones potentially related to tobacco use. The questionnaire encompasses behavioral, knowledge, and attitude items. We used data from the first three waves of the main PATH Study to estimate the reliability of the survey items available on the public use data set via the quasi-simplex model.
The PATH-RV Study collected interview and reinterview data from 524 respondents. The sample was selected in a nationally representative subset of the main PATH primary sampling areas; within those sample areas, we selected second-stage units (one or more blocks) and then individual addresses. In the final stage of selection, we selected sample individuals based on answers to screening questionnaires.
To the extent possible, the PATH-RV Study replicated the systems and procedures of the main PATH study. It used the same instruments (taken from the fourth wave of the main study) and the same software to administer the questions. Data collection for the PATH-RV Study took place in two phases. During the first phase, sample addresses were mailed a short screening questionnaire to identify members of three population groups—adult (18 years old and older) tobacco users, adult nonusers, and youth (12–17 years old). The sample included both tobacco users and nonusers to ensure that there would be observations for every section of the Adult Questionnaire. We received screening questionnaires from 2,296 households. The overall response rate to the screening component of the study was 25.1 percent (AAPOR RR3). We selected a total of 865 adults and 266 youth for the rest of the data collection, which was carried out by Westat field interviewers (all of whom worked on the main PATH Study as well). Respondents were interviewed twice, with the reinterview done five to twenty-four days after the initial interview; the median elapsed time was twelve days. Both interviews were done using ACASI, as in the main PATH Study.
Both the initial interview and the reinterview used the PATH Study Wave 4 questionnaires. The reinterview questionnaire included some additional items that followed the regular PATH Study questions. These items included six MTMM experiments added to the Adult Questionnaire and four to the Youth Questionnaire. In total, 47.1 percent of the sample adults (n = 407) and 44.0 percent of the sample youths (n = 117) completed both interviews. The PATH-RV interviews were done from March 2017 to February 2018. More details on the PATH-RV study can be found in Tourangeau et al. (2019).
2.2 Reliability Measures
2.2.1. Traditional reliability measures
We used the PATH-RV data to calculate three traditional reliability measures—GDRs, kappas, and over-time correlations. We calculated these statistics for every item with at least one hundred observations; we dropped any items with marginal proportions of 0.95 or above to avoid problems with the kappa statistic. A total of 409 survey items from the Adult Questionnaire and 212 survey items from the Youth Questionnaire met these criteria. Because over-time correlations do not make sense for categorical items with three or more categories, these statistics are available for only 391 of the adult items and 211 of the youth items.
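A minimal sketch of this screening step (assuming the paired responses sit in two pandas DataFrames, one per administration, with one column per item; the names, and pooling the two administrations for the marginal check, are assumptions rather than the study's exact rule):

```python
import pandas as pd

def eligible_items(wave1: pd.DataFrame, wave2: pd.DataFrame,
                   min_n: int = 100, max_marginal: float = 0.95):
    """Keep items with at least `min_n` paired observations and no
    response category holding 95 percent or more of the answers."""
    keep = []
    for item in wave1.columns:
        pair = pd.concat([wave1[item], wave2[item]],
                         axis=1, keys=["t1", "t2"]).dropna()
        if len(pair) < min_n:
            continue
        # Marginal proportions, pooled over the two administrations
        marginals = pd.concat([pair["t1"], pair["t2"]]).value_counts(normalize=True)
        if marginals.max() >= max_marginal:
            continue
        keep.append(item)
    return keep
```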
2.2.2. Reliability estimates from more sophisticated models
Data from the first three waves of the main PATH Study were analyzed using the quasi-simplex model. We applied the Wiley and Wiley (1970) assumption of equal error variances to allow us to estimate the parameters. In addition, we used Pearson correlations rather than polychoric or tetrachoric correlations in fitting the models. Sixty items from the Adult Questionnaire and twenty-one from the Youth Questionnaire were analyzed in this way.
The reliability estimates from MTMM and LCA draw on the MTMM experiments included in the adult and the youth reinterview questionnaires. The respondents received all three items tapping a single construct. We applied the true score model, summarized in (1) and (2) above. For the LCA estimates, we dichotomized all the items and assumed two latent classes.2 We used the method recommended by Clogg and Manning (1996) to estimate reliability from the LCA models (see (4) above); we also calculated the overall error rate, summing the false positive and false negative rates for each item.
2.2.3. Ex ante methods
We also used two ex ante computer-based systems to assess item reliability. The survey quality predictor (SQP) is an automated system that predicts reliability, validity, methods effects, and total quality of a question. The program is based on a meta-analysis of the relationship between reliability and validity estimates obtained from a large number of MTMM experiments and the features of the questions in those experiments (Saris and Gallhofer 2007a, 2007b).
QUAID is another automated system designed to identify potential comprehension problems that respondents may have based on linguistic features of the question. It detects and diagnoses five classes of comprehension difficulties including unfamiliar technical terms, vague or imprecise relative terms, vague or ambiguous noun phrases, complex syntax, and working memory overload (Graesser, Cai, Louwerse, and Daniel 2006). For this analysis, we use the number of issues detected by QUAID regardless of issue type as a proxy measure of reliability; our assumption was that the more issues detected by QUAID for a survey item, the lower the reliability of that item.
2.2.4. Additional measures
We computed item nonresponse rates for the initial interview and the reinterview for all 409 items from the Adult Questionnaire and 212 items from the Youth Questionnaire for which there were at least one hundred responses in both waves. Similarly, we obtained median response latencies for items that were individually displayed on a screen, resulting in response time measures for 229 items from the Adult Questionnaire and 152 items from the Youth Questionnaire. Items displayed in grids were dropped from the analysis of response latencies since we did not have individual response times for those items.
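These two indicators can be computed directly from the response data and the timing paradata; the sketch below is a minimal illustration (the DataFrame names, the NaN coding of missing answers, and the list of grid items are assumptions):

```python
import pandas as pd

def item_nonresponse_rate(answers: pd.DataFrame) -> pd.Series:
    """Proportion of missing answers per item (missing coded as NaN)."""
    return answers.isna().mean()

def median_latency(times: pd.DataFrame, grid_items: list) -> pd.Series:
    """Median response time per item, excluding items shown in grids."""
    return times.drop(columns=grid_items).median()
```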
3. RESULTS
3.1 Descriptive Statistics
Table 1 shows descriptive statistics for the adult and youth samples for the various reliability measures as well as for the two other question performance measures. As table 1 shows, the means and variances of the measures are relatively similar for the Adult and Youth Questionnaires. The Youth Questionnaire included too few items for us to include the MTMM, LCA, and quasi-simplex estimates. We note that the reliability estimates from the MTMM experiments are based on a different model from those from the quasi-simplex model, and we also include the quality estimates from the MTMM experiments in table 1. The quality estimate is the product of the reliability and validity estimates (see equations (1) and (2) above) and may be a closer analog to the reliability derived from the quasi-simplex model. For similar reasons, we also examine the quality estimates from SQP. We also include the square root of the over-time correlation in table 1. Under the assumption of no change in the true scores, the square root of the over-time correlation represents the geometric mean of the time 1 and time 2 reliabilities.
Table 1.
Means and Standard Deviations for All Variables
| Data | Type of method | Reliability measures | N | Mean | SD |
|---|---|---|---|---|---|
| Adults | Traditional approaches | Kappa | 409 | 0.61 | 0.21 |
| | | GDR | 409 | 0.15 | 0.15 |
| | | Over-time correlation | 391 | 0.69 | 0.18 |
| | | Square root of over-time correlation | 391 | 0.82 | 0.12 |
| | Sophisticated model-based approaches | Reliability from quasi-simplex models | 60 | 0.70 | 0.06 |
| | | Reliability from MTMM | 35 | 0.90 | 0.06 |
| | | Quality from MTMM | 35 | 0.87 | 0.13 |
| | | Reliability from LCA | 35 | 0.99 | 0.02 |
| | | LCA error rate | 35 | 0.15 | 0.18 |
| | Ex-ante approaches | Reliability from SQP | 409 | 0.60 | 0.05 |
| | | Quality from SQP | 409 | 0.56 | 0.05 |
| | | Number of issues identified by QUAID | 409 | 3.56 | 2.07 |
| | Other data quality indicators | Time 1 item nonresponse | 409 | 0.01 | 0.01 |
| | | Time 2 item nonresponse | 409 | 0.01 | 0.01 |
| | | Time 1 response latency | 229 | 7.11 | 2.80 |
| | | Time 2 response latency | 229 | 6.79 | 2.90 |
| Youth | Traditional approaches | Kappa | 212 | 0.54 | 0.24 |
| | | GDR | 212 | 0.20 | 0.16 |
| | | Over-time correlation | 211 | 0.62 | 0.21 |
| | | Square root of over-time correlation | 211 | 0.78 | 0.14 |
| | Ex-ante approaches | Reliability from SQP | 212 | 0.60 | 0.07 |
| | | Quality from SQP | 212 | 0.56 | 0.06 |
| | | Number of issues identified by QUAID | 212 | 3.73 | 1.88 |
| | Other data quality indicators | Time 1 item nonresponse | 212 | 0.02 | 0.02 |
| | | Time 2 item nonresponse | 212 | 0.01 | 0.02 |
| | | Time 1 response latency | 152 | 6.23 | 2.51 |
| | | Time 2 response latency | 152 | 5.12 | 2.46 |
3.2 Associations among Reliability Measures
We first examined the correlations among the methods for assessing reliability, as shown in table 2, for adults and youths separately. We analyze the two samples separately since they received different questionnaires. The table includes the three traditional survey measures (kappa, GDR, over-time correlation), reliabilities from the quasi-simplex model, and the measures from the two ex ante methods (SQP and QUAID). The MTMM and LCA results are based on much smaller numbers of items, and we examine them separately below (see table 3). We examine these latter results only for the adults, since the Youth Questionnaire contained too few items to which we could apply the more sophisticated models.
Table 2.
Correlations and Sample Sizes among Methods, by Sample
| Sample | Type of method | Reliability measure | GDR | Over-time correlation | Reliability from SQP | QUAID issues | Quasi-simplex model |
|---|---|---|---|---|---|---|---|
| Adults | Traditional approaches | Kappa | −0.61*** | 0.84*** | 0.21*** | −0.06 | 0.36** |
| | | n | 409 | 391 | 409 | 409 | 60 |
| | | GDR | | −0.21*** | −0.18*** | −0.16** | 0.14 |
| | | n | | 391 | 409 | 409 | 60 |
| | | Over-time correlation | | | 0.20*** | −0.16** | 0.49*** |
| | | n | | | 391 | 391 | 57 |
| | Ex-ante methods | Reliability from SQP | | | | 0.05 | 0.20 |
| | | n | | | | 409 | 60 |
| | | Number of issues identified by QUAID | | | | | 0.21 |
| | | n | | | | | 60 |
| | Sophisticated method | Quasi-simplex model | | | | | |
| | | n | | | | | |
| Youth | Traditional approaches | Kappa | −0.67*** | 0.90*** | 0.47*** | −0.05 | |
| | | n | 212 | 211 | 212 | 212 | |
| | | GDR | | −0.43*** | −0.52*** | −0.16* | |
| | | n | | 211 | 212 | 212 | |
| | | Over-time correlation | | | 0.34** | −0.11 | |
| | | n | | | 211 | 211 | |
| | Ex-ante methods | Reliability from SQP | | | | 0.09 | |
| | | n | | | | 212 | |
| | | Number of issues identified by QUAID | | | | | |
| | | n | | | | | |
†p < 0.10.
*p < 0.05.
**p < 0.01.
***p < 0.001. The ns are the number of items on which the correlations are based.
For the Adult Questionnaire, there were ten significant correlations, all of them in the expected direction. A few of these associations are worth noting. First, kappa was highly correlated with both the GDR and the over-time correlation (−0.61 and 0.84, respectively), but the relationship between the GDR and the over-time correlation, though significant, was surprisingly weak (−0.21). Second, reliability predicted by SQP was significantly related to reliability calculated from all three traditional approaches, but the correlations were relatively low (−0.18 to 0.21). Third, the number of problems found by QUAID was significantly correlated with the GDR and the over-time correlation, but the correlations were rather low (both −0.16). Fourth, reliability calculated through the quasi-simplex model was significantly correlated with kappa and the over-time correlation (0.36 and 0.49), but not with the GDR. Finally, the ex ante methods and the quasi-simplex model were not strongly related to each other. The pattern does not change appreciably if we use the square root of the over-time correlation rather than the over-time correlation itself (the correlation between the two was 0.99) or the quality estimate from SQP rather than the reliability estimate (the correlation between the two was 0.97).
For the Youth Questionnaire, there were eight significant associations. For the youths, the three traditional reliability measures were strongly correlated with each other (−0.43 to 0.90) and, as in the adult sample, were moderately correlated with reliability from SQP (0.34 to −0.52). The QUAID results were again weakly (but significantly) correlated with the GDR (−0.16).
We observed similar association patterns in the adult and youth data, although, for the youth, the GDR correlated more highly with the other traditional measures than it did for the adults.
We also examined the relationships among the reliability measures in table 2 for various subsets of the items. Tourangeau et al. (2020) had found that the number of response options and whether the item was factual or attitudinal had an effect on the item’s reliability. However, we found no consistent differences in the pattern of findings in table 2 for dichotomous items versus items with more than two response options or for factual versus attitudinal items (see supplementary data online tables 1 and 2).
3.2.1. Factor analyses
On the assumption that the measures in table 2 reflect similar underlying cognitive difficulties, we carried out confirmatory factor analyses of the five measures for which we had relatively large samples of items, assuming a single underlying factor on which all the measures would load. For both the adults and the youth, a single-factor model yielded a poor fit to the data. For the adults, the comparative fit index was 0.949, but both the standardized root mean square residual (0.110) and the root mean square error of approximation (0.158) were unacceptably high. The corresponding figures for the youth were even worse. Exploratory factor analyses helped to explain why the one-factor model did not fit. In both samples, the five indices seem to reflect two factors. For the adults, the three traditional measures loaded highly on the first factor (loadings of 0.97, −0.63, and 0.87 for kappa, GDR, and the over-time correlation, respectively), QUAID loaded highly on the second (loading of 0.88), and SQP loaded weakly on both (loadings of 0.35 and 0.27). Similarly, for the youth data, the loadings on the first factor were quite high for kappa, the GDR, and the over-time correlations (0.95, −0.79, and 0.85), and QUAID loaded highly on the second factor (0.91). For the youth, SQP loaded highly on the first factor (0.66) and less highly on the second (0.30). In both samples, then, QUAID reflects a different latent variable from the three traditional reliability measures.
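For readers who want to reproduce this kind of exploratory step, the sketch below runs a two-factor exploratory factor analysis with scikit-learn (version 0.24 or later for the rotation option) on a simulated item-by-measure matrix; the simulated data and the resulting loadings are purely illustrative and are not the PATH-RV values, and the published analysis may have used different software:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Toy item-level data: one row per survey item, one column per measure.
# In the real analysis these would be the five measures from table 2.
rng = np.random.default_rng(0)
quality = rng.normal(size=200)      # latent "item quality" dimension
wording = rng.normal(size=200)      # latent "wording problems" dimension
measures = pd.DataFrame({
    "kappa": 0.9 * quality + 0.3 * rng.normal(size=200),
    "gdr": -0.7 * quality + 0.5 * rng.normal(size=200),
    "otc": 0.8 * quality + 0.4 * rng.normal(size=200),
    "sqp": 0.3 * quality + 0.3 * wording + 0.8 * rng.normal(size=200),
    "quaid": 0.9 * wording + 0.4 * rng.normal(size=200),
})

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(measures)
loadings = pd.DataFrame(fa.components_.T, index=measures.columns,
                        columns=["factor 1", "factor 2"])
print(loadings.round(2))
```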
3.2.2. Results based on MTMM experiments
We also examined the associations between measures based on the interview–reinterview data and the estimates from the more sophisticated models fitted to the MTMM experiments. The sample size for these analyses was much reduced because only a small number of items were involved in these experiments. Only nine of these items were also included in three waves of the PATH study. As a result, we drop the estimates from the quasi-simplex model from this analysis. In addition, there were too few youth items to present those results.
Table 3 shows the correlations among the different methods for the thirty-five adult items. Again, the results do not change appreciably if we use the square root of the over-time correlation instead of the over-time correlation itself or the quality estimate from the MTMM experiments instead of the reliability estimates. Ten of the twenty-eight associations were significant; the correlations range from −0.93 to 0.71. In addition to being highly correlated with each other, the three traditional approaches were moderately (and significantly) correlated with reliability as estimated from the LCA and with the LCA error rates; these correlations range from −0.42 to 0.69. The two ex ante methods and the MTMM estimates were not significantly related to the other measures of reliability—the SQP reliability estimate was marginally associated with the GDR, but this correlation was in the direction opposite to the one expected.
3.3 Associations between the Reliability Measures and Measures of Data Quality
Which of the reliability measures were most strongly related to the other measures of data quality? To answer this question, we regressed the missing data rates and median response times from both interviews on the five reliability measures from table 2. Table 4 displays the regression coefficients from the four multiple regression models.
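A sketch of one of these regressions using statsmodels (the item-level DataFrame and its column names are invented stand-ins; in the article the outcomes are the item nonresponse rates and median response times and the predictors are the five measures from table 2):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy item-level file with one row per survey item
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.normal(size=(300, 6)),
                     columns=["nonresp_t1", "kappa", "gdr", "otc", "sqp", "quaid"])

# Regress the time 1 item nonresponse rate on the five reliability measures
X = sm.add_constant(items[["kappa", "gdr", "otc", "sqp", "quaid"]])
model = sm.OLS(items["nonresp_t1"], X).fit()
print(model.params)      # coefficients, as reported in table 4
print(model.rsquared)    # R-square for the model
```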
Table 4.
Multiple Regression Coefficients from Models of Item Nonresponse and Response Times, by Sample
| Sample | Reliability measure | Time 1 item nonresponse | Time 2 item nonresponse | Time 1 median response time | Time 2 median response time |
|---|---|---|---|---|---|
| Adult | Intercept | −0.012* | 0.003 | 8.07*** | 4.78 |
| | Kappa | 0.015* | 0.010* | 4.01 | 5.02 |
| | GDR | 0.011* | 0.008* | 9.76*** | 8.10*** |
| | Over-time correlation | −0.013* | −0.014*** | −3.55 | −3.25 |
| | Reliability from SQP | 0.017* | 0.005 | −6.84* | −1.74 |
| | Number of issues identified by QUAID | 0.002*** | 0.001*** | 0.39*** | 0.24* |
| | n | 391 | 391 | 214 | 214 |
| | R-square | 0.161 | 0.129 | 0.274 | 0.103 |
| Youth | Intercept | 0.031* | 0.035*** | 3.43 | 0.17 |
| | Kappa | 0.039* | 0.019 | 3.03 | 3.61 |
| | GDR | 0.055*** | 0.039*** | 7.69*** | 5.74** |
| | Over-time correlation | −0.023 | −0.010 | −2.38 | −1.63 |
| | Reliability from SQP | −0.052 | −0.052** | −0.06 | 4.79 |
| | Number of issues identified by QUAID | −0.000 | −0.001 | 0.25 | −0.00 |
| | n | 211 | 211 | 151 | 151 |
| | R-square | 0.152 | 0.198 | 0.154 | 0.066 |
†p < 0.10.
*p < 0.05.
**p < 0.01.
***p < 0.001.
As the table shows, in both samples and both waves of data collection, the GDR significantly predicted item nonresponse and response times. In the adult sample, the over-time correlation predicted item nonresponse in both waves, and QUAID predicted both item nonresponse and response times in both waves. For the youths, QUAID was not a significant predictor of either item nonresponse or response times in either wave. The other noteworthy finding was that kappa significantly predicted item nonresponse in both waves of the adult sample, but the regression coefficients were in the direction opposite to the one expected—more consistent responses as measured by kappa predicted higher rather than lower item nonresponse for the adults.
3.4 Agreement about “Unreliable” Items
As a practical matter, it would be useful if the different measures agreed on which items needed to be fixed. We examined how well the different measures agreed on which items were the worst according to each measure. For the adults, we identified the fifteen items that were the worst by each measure; for the youths, we identified the worst ten items. Because of ties on some of the measures, for the adults, we actually included the sixteen worst items according to SQP and the thirty-one worst according to QUAID; for the youths, we included the nine worst items according to QUAID. If the different measures agreed perfectly with each other, we would expect the same items to be identified by all or most of the measures. However, a total of seventy-seven unique items were identified for the adult sample and forty unique items were identified for the youths, indicating that the different measures do not generally agree about the least reliable items. Table 5 shows the results. The top row of the table gives the number of items identified as among the worst according to the kappa statistic alone. The second row gives the number of items among the worst according to either the kappa statistic or the over-time correlation; of the nineteen items identified as among the worst by either method, eleven items were identified by both. The third row of the table gives the number identified by any of the three of the traditional statistics, and so on. Altogether some seventy-seven items from the Adult Questionnaire were on at least one of the worst item lists, but none appears on more than three of the lists. Similarly, a total of forty items appeared on at least one of the “ten worst” lists for the Youth Questionnaire, but no item appeared on more than three of the lists.
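A sketch of this overlap count (the DataFrame `scores` is a hypothetical item-by-measure table rescaled so that larger values always mean a worse item, so measures such as kappa and the over-time correlation would be sign-reversed first):

```python
import pandas as pd
from collections import Counter

def worst_item_overlap(scores: pd.DataFrame, k: int = 15):
    """For each measure, take the k worst items (larger = worse, ties kept),
    then count how many measures flag each item."""
    flagged = {col: set(scores[col].nlargest(k, keep="all").index)
               for col in scores.columns}
    counts = Counter(item for items in flagged.values() for item in items)
    # Number of unique flagged items, and the distribution of how many
    # measures flagged each item, e.g. {1: 64, 2: 11, 3: 2} for the adults
    return len(counts), Counter(counts.values())
```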
Table 5.
Agreement across Methods about “Worst” Items, by Sample
| | One method | Two methods | Three methods | Four methods | Five methods | Total |
|---|---|---|---|---|---|---|
| Adults | | | | | | |
| Kappa only | 15 | | | | | 15 |
| Kappa + OTC | 8 | 11 | | | | 19 |
| Kappa + OTC + GDR | 19 | 13 | 0 | | | 32 |
| Kappa + OTC + GDR + SQP | 35 | 13 | 0 | 0 | | 48 |
| Kappa + OTC + GDR + SQP + QUAID | 64 | 11 | 2 | 0 | 0 | 77 |
| Youths | | | | | | |
| Kappa only | 10 | | | | | 10 |
| Kappa + OTC | 10 | 5 | | | | 15 |
| Kappa + OTC + GDR | 18 | 6 | 0 | | | 24 |
| Kappa + OTC + GDR + SQP | 27 | 5 | 1 | 0 | | 33 |
| Kappa + OTC + GDR + SQP + QUAID | 34 | 3 | 4 | 0 | 0 | 40 |
4. DISCUSSION
This article is the first attempt to compare different methods for assessing reliability empirically and to determine which ones yield the most useful estimates. We looked at traditional survey approaches relying on reinterview data, sophisticated model-based methods, and ex ante methods that do not require data collection. We found that the three traditional survey approaches (GDR, kappa, and over-time correlations) were significantly correlated with each other (see table 2); kappa was highly correlated with both of the other traditional measures, but the correlation between the GDR and the over-time correlation was somewhat lower (−0.21 for the adults and −0.43 for the youth). Among the more sophisticated methods, three (reliability estimated via LCA, LCA error rates, and reliability estimated from quasi-simplex models) were moderately correlated with the three traditional approaches. The quasi-simplex estimates based on three waves converged most with the over-time correlations based on the interview–reinterview data. The reliability estimates from the MTMM experiments were not related to any of the other reliability measures, though the latent class estimates (derived from the same data) did correlate quite highly with the traditional measures. The absolute values of the correlations of the LCA reliability estimates with the traditional measures range from 0.50 to 0.68 (see table 3); the correlations of the LCA error rates with the traditional measures are similarly high. As for the two ex ante methods, reliability predicted by SQP was weakly related to the three traditional approaches in the adult sample but more strongly related in the youth sample. The number of problems identified by QUAID was not related to any of the reliability measures in either sample. A factor analysis suggested that QUAID taps a different latent factor from the three traditional measures.
We further attempted to evaluate which reliability measures more strongly predicted other measures of item performance, including item nonresponse and response latency. The GDR seems to be the best predictor overall, though each of the measures seems to add value for at least one measure of item performance. This finding (see table 4) is in line with the complementary method hypothesis (Maitland and Presser 2016, 2018). It makes sense that the comprehension problems identified by QUAID should lead (at least within the adult sample) to longer response times and high rates of missing data; respondents presumably take more time as they try to understand difficult items and are less likely to answer questions they find hard to understand.
Alwin (2007) has emphasized the biasing impact of memory in the traditional interview–reinterview approach to estimating reliability. In an earlier effort, we found evidence for some effect of memory (see Tourangeau et al. 2020)—answers were more likely to change with more elapsed time between interviews—but the effect was small and doubtless partly reflected true changes between interviews. If memory for the earlier answers consistently biases reliability estimates upward, we would expect the over-time correlations to be higher, on average, than the reliability estimates from the quasi-simplex model. For the items from the Adult Questionnaire where we have both estimates, the averages are almost the same: the mean over-time correlation was 0.71 versus a mean estimated reliability of 0.70 from the quasi-simplex models. Similarly, if the reliability estimates are biased downward, as Saris and Gallhofer (2007a, 2007b) argue, one would expect the mean over-time correlations to be lower than the mean SQP estimates, but they are somewhat higher—0.69 versus 0.60 for the adults and 0.62 versus 0.60 for the youth.
The SQP estimates are based on a model that predicts item reliabilities from various question characteristics. Tourangeau et al. (2020) compared SQP’s predictions with predictions from their own model of item reliability. They used ninety-one kappa values from the 2009 National Survey of Drug Use and Health reliability study (Substance Abuse and Mental Health Services Administration 2010). The model based on interview–reinterview data predicted the kappa values better than SQP did (0.58 versus 0.34).
Our results are based on a single study that examines items from a single questionnaire. Although the questionnaire included a range of questions asking about beliefs, knowledge, attitudes, and behaviors, the vast majority of the questions concern a single topic—tobacco use. Several other features of our procedures may have affected the results. For example, the median spacing between the interviews and the reinterviews in the PATH-RV Study was 12 days. A longer interval between the two might have affected the GDRs, kappas, and over-time correlations, increasing the amount of true change; similarly, a shorter interval might have introduced memory effects. We believe that the key variables affecting the relationships among the various measures of reliability (as opposed to the overall levels of reliability) may be those affecting whether the assumptions of each method are met. For example, for the traditional survey measures, the important variables may be those related to the assumption of independent errors. A much shorter questionnaire than the one we used, or one that was readministered after a much shorter time interval, would likely produce inflated estimates of reliability, estimates less likely to correlate with other measures of data quality. Similarly, for the quasi-simplex model, the spacing between interviews may be a key variable. The main PATH study data we used are based on interviews separated by about a year; a longer time interval might lead to less stable reliability estimates (as Coenders et al. 1999 argue), reducing the relation of these estimates to other estimates. As one of the reviewers pointed out, our analyses are based on a particular version of the quasi-simplex model: To achieve an identifiable model, we applied the Wiley and Wiley (1970) assumption instead of the alternative Heise (1969) assumption; in addition, we used Pearson correlations instead of polychoric and tetrachoric correlations. However, Alwin’s (2007) extensive explorations of these issues suggest that these decisions probably had little impact on our conclusions. Similarly, our implementation of the MTMM approach involved administering to each respondent three items designed to tap a construct (rather than two items, as is sometimes done), and the various MTMM experiments followed all the items from the PATH questionnaire, so that the spacing between items may have been shorter than is generally optimal. Finally, the low levels of item nonresponse may have attenuated the relationship between this variable and the various measures of reliability we examined.
Still, we believe that our findings have three practical implications for survey researchers. First, we showed that traditional survey approaches for assessing reliability are still sound. They are easier to compute than the more sophisticated methods. The GDR, in particular, predicts a range of problems, including item nonresponse and response times (see table 4), and is strongly related to most of the other reliability measures. Although the MTMM and LCA estimates do not necessarily require two rounds of data collection, these techniques do require multiple survey items measuring the same concept within a questionnaire. In addition, they require certain assumptions to be satisfied to achieve an identifiable model and unbiased estimates. We suggest that survey researchers at least consider a reinterview effort when assessing the reliability of answers to survey items.
Second, our findings seem to indicate that at least two to three methods for evaluating survey items are probably enough. Of course, more research is needed to determine the ideal number of methods to be used. Still, there was good evidence (see especially table 4) that every method turns up useful information.
Third, ex ante methods can be conducted with minimal cost and quick turnaround since they require no data collection. Still, it is not clear how good they are at evaluating the quality of survey items. Reliabilities predicted by SQP were weakly related to reliabilities estimated through the traditional approaches, and the number of problems found by QUAID was unrelated to the reliability measures. Still, for the adults at least, QUAID was a significant predictor of both item nonresponse and median response times. QUAID may tap a different dimension from the traditional reliability measures but appears to provide useful information for identifying problematic questions.
Supplementary Materials
Supplementary materials are available online at academic.oup.com/jssam.
Footnotes
1. Many respondents can probably guess their earlier answers without recalling them. As a result, van Meurs and Saris advocate assessing the memory effect by the difference between the proportions of respondents who correctly reproduce their earlier answers among those who say they remember them versus among those who say they cannot remember them. For example, if 80 percent of the respondents who claim to remember their earlier answers correctly reproduce the answer but 60 percent of the respondents who say they do not remember are nonetheless able to reproduce the answer, the “memory effect” is 20 percent. By this measure, some 17–34 percent of respondents recall their prior answers, depending on the study. However, this measure almost certainly overcorrects for guessing the earlier answers, because it assumes no memory effect among respondents who say they cannot remember. A large number of studies demonstrate the effects of information presented earlier even among those who cannot consciously recall that information (for a review of such “implicit memory” effects, see Schacter 1987).
2. We dichotomized the items because one item in each MTMM experiment was dichotomous to begin with (as yes/no or agree/disagree) and the remaining items either had different numbers of response options or asked for numbers. This seemed to us the simplest approach for dealing with the differences in response formats. We dichotomized items with more than two response options to be consistent with the yes/no or agree/disagree items. In addition, the reliability measure proposed by Clogg and Manning (1996) applies only to dichotomous items.
Notes
Roger Tourangeau is with Westat Inc., Methodology, Westat, 1600 Research Boulevard, Rockville, MD 20850, USA. Hanyu Sun is with Westat Inc., Statistical Department, Westat, 1600 Research Boulevard, Rockville, MD 20850, USA. Ting Yan is with Westat Inc., Survey Methodology, Westat, 1600 Research Boulevard, Rockville, MD 20850, USA. This work was supported by the National Institute on Drug Abuse, National Institutes of Health [5R01DA040736-02 to R.T.]. The views and opinions expressed in this manuscript are those of the authors only and do not necessarily represent the views, official policy, or position of the US Department of Health and Human Services or any of its affiliated institutions or agencies.
REFERENCES
- Alwin D. F. (2007), Margins of Error: A Study of Reliability in Survey Measurement, Hoboken, NJ: John Wiley.
- Alwin D. F. (2011), “Evaluating the Reliability and Validity of Survey Interview Data Using the MTMM Approach,” in Question Evaluation Methods: Contributing to the Science of Data Quality, eds. Madans J., Miller K., Maitland A., Willis G., pp. 265–293. Hoboken, NJ: John Wiley.
- Alwin D. F., Jackson D. J. (1979), “Measurement Models for Response Errors in Surveys: Issues and Applications,” in Sociological Methodology 1980, ed. Schuessler K. F., pp. 68–119. San Francisco, CA: Jossey-Bass.
- Andrews F. (1984), “Construct Validity and Error Components of Survey Measures: A Structural Modeling Approach,” Public Opinion Quarterly, 48, 409–442.
- Bem D. J., McConnell H. K. (1970), “Testing the Self-Perception Explanation of Dissonance Phenomena: On the Salience of Premanipulation Attitudes,” Journal of Personality and Social Psychology, 14, 23–31.
- Biemer P. P. (2004), “Modeling Measurement Error to Identify Flawed Questions,” in Methods for Testing and Evaluating Survey Questionnaires, eds. Presser S., Rothgeb J., Couper M., Lessler J., Martin E., Martin J., Singer E., pp. 225–246. New York: John Wiley.
- Cicchetti D. V., Feinstein A. R. (1990), “High Agreement but Low Kappa: II. Resolving the Paradoxes,” Journal of Clinical Epidemiology, 43, 551–558.
- Coenders G., Saris W. E., Batista‐Foguet J. M., Andreenkova A. (1999), “Stability of Three‐Wave Simplex Estimates of Reliability,” Structural Equation Modeling: A Multidisciplinary Journal, 6, 135–157.
- Cohen J. (1960), “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.
- Cohen J. (1968), “Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit,” Psychological Bulletin, 70, 213–220.
- Clogg C. C., Manning W. D. (1996), “Assessing Reliability of Categorical Measurements Using Latent Class Models,” in Categorical Variables in Developmental Research: Methods of Analysis, eds. van Eye A. and Clogg C. C., pp. 169–182. New York: Academic Press.
- Couper M. P., Kreuter F. (2013), “Using Paradata to Explore Item Level Response Times in Surveys,” Journal of the Royal Statistical Society, Series A, 176, 271–286.
- Graesser A. C., Cai Z., Louwerse M. M., Daniel F. (2006), “Question Understanding Aid (QUAID): A Web Facility That Tests Question Comprehensibility,” Public Opinion Quarterly, 70, 3–22.
- Heise D. R. (1969), “Separating Reliability and Stability in Test-Retest Correlation,” American Sociological Review, 34, 93–101.
- Hess J., Singer E., Bushery J. (1999), “Predicting Test-Retest Reliability from Behavior Coding,” International Journal of Public Opinion Research, 11, 346–360.
- Kreuter F., Yan T., Tourangeau R. (2008), “Good Item or Bad—Can Latent Class Analysis Tell?: The Utility of Latent Class Analysis for the Evaluation of Survey Questions,” Journal of the Royal Statistical Society, Series A, 171, 723–738.
- Lessler J. T., Forsyth B. H. (1996), “A Coding System for Appraising Questionnaires,” in Answering Questions: Methodology for Determining Cognitive and Communicative Processes in Survey Research, eds. Schwarz N., Sudman S., pp. 259–291. San Francisco: Jossey-Bass.
- Maitland A., Presser S. (2016), “How Accurately Do Different Evaluation Methods Predict the Reliability of Survey Questions?,” Journal of Survey Statistics and Methodology, 4, 362–381.
- Maitland A., Presser S. (2018), “How Do Question Evaluation Methods Compare in Predicting Problems Observed in Typical Survey Conditions?,” Journal of Survey Statistics and Methodology, 6, 465–490.
- van Meurs A., Saris W. E. (1990), “Memory Effects in MTMM Studies,” in Evaluation of Measurement Instruments by Meta-Analysis of Multitrait–Multimethod Studies, eds. van Meurs A., Saris W. E., pp. 134–147. Amsterdam: North-Holland.
- Moser C. A., Kalton G. (1972), Survey Methods in Social Investigation (2nd ed.), New York: Basic Books.
- O’Muircheartaigh C. (1991), “Simple Response Variance: Estimation and Determinants,” in Measurement Error in Surveys, eds. Biemer P., Groves R., Lyberg L., Mathiowetz N., Sudman S., pp. 551–574. New York: John Wiley.
- Olson K., Smyth J. (2015), “The Effect of CATI Questions, Respondents, and Interviewers on Response Time,” Journal of Survey Statistics and Methodology, 3, 361–396.
- Rettig T., Höhne J. K., Blom A. G. (2019), “Recalling Survey Answers: A Comparison across Question Types and Different Levels of Online Panel Experience.” Manuscript under review.
- Ross M. (1989), “The Relation of Implicit Theories to the Construction of Personal Histories,” Psychological Review, 96, 341–357.
- Saris W. (2012), “Discussion: Evaluation Procedures for Survey Questions,” Journal of Official Statistics, 28, 537–551.
- Saris W., Gallhofer I. (2007a), “Estimation of the Effects of Measurement Characteristics on the Quality of Survey Questions,” Survey Research Methods, 1, 29–43.
- Saris W., Gallhofer I. (2007b), Design, Evaluation, and Analysis of Questionnaires for Survey Research, Hoboken, NJ: John Wiley.
- Saris W., Satorra A., Coenders G. (2004), “A New Approach to Evaluating the Quality of Measurement Instruments: The Split‐Ballot MTMM Design,” Sociological Methodology, 34, 311–347.
- Schacter D. L. (1987), “Implicit Memory: History and Current Status,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 501–518.
- Schwarz H., Revilla M., Weber W. (2019), “Memory Effects in Repeated Survey Questions—Reviving the Empirical Investigation of the Independent Measurements Assumption.” Manuscript under review.
- Smith T. W. (1984), “Recalling Attitudes: An Analysis of Retrospective Questions on the 1982 General Social Survey,” Public Opinion Quarterly, 48, 639–649.
- Spencer B. D. (2008), “When Do Latent Class Models Overstate Accuracy for Binary Classifiers?” Unpublished manuscript.
- Spitznagel E. L., Helzer J. E. (1985), “A Proposed Solution to the Base Rate Problem in the Kappa Statistic,” Archives of General Psychiatry, 42, 725–728.
- Substance Abuse and Mental Health Services Administration (2010), Reliability of Key Measures in the National Survey on Drug Use and Health (Office of Applied Studies, Methodology Series M-8, HHS Publication No. SMA 09-4425), Rockville, MD.
- Todorov A. (2000), “Context Effects in National Health Surveys: Effects of Preceding Questions on Reporting Serious Difficulty Seeing and Legal Blindness,” Public Opinion Quarterly, 64, 65–76.
- Tourangeau R., Rasinski K. (1988), “Cognitive Processes Underlying Context Effects in Attitude Measurement,” Psychological Bulletin, 103, 299–314.
- Tourangeau R., Yan T., Sun H. (2020), “Who Can You Count on? Understanding the Determinants of Reliability,” Journal of Survey Statistics and Methodology, 10.1093/jssam/smz034.
- Tourangeau R., Yan T., Sun H., Hyland A., Stanton C. A. (2019), “Population Assessment of Tobacco and Health (PATH) Reliability and Validity Study: Selected Reliability and Validity Estimates,” Tobacco Control, 28, 663–668.
- van der Ark L. A., van der Palm V. W., Sijtsma K. (2011), “A Latent Class Approach to Estimating Test-Score Reliability,” Applied Psychological Measurement, 35, 380–392.
- Wiley D. E., Wiley J. A. (1970), “The Estimation of Measurement Error in Panel Data,” American Sociological Review, 35, 112–117.
- Yan T., Kreuter F., Tourangeau R. (2012), “Evaluating Survey Questions: A Comparison of Methods,” Journal of Official Statistics, 28, 503–529.
- Yan T., Tourangeau R. (2008), “Fast Times and Easy Questions: The Effects of Age, Experience and Question Complexity on Web Survey Response Times,” Applied Cognitive Psychology, 22, 51–68.