Author manuscript; available in PMC: 2020 Aug 1.
Published in final edited form as: Sociol Methods Res. 2017 Sep 25;48(3):485–533. doi: 10.1177/0049124117729698

Sources of Variance in the Accuracy of Interviewer Observations

Brady T West 1, Dan Li 2
PMCID: PMC6905517  NIHMSID: NIHMS911794  PMID: 31827308

Abstract

In face-to-face surveys, interviewer observations are a cost-effective source of paradata for nonresponse adjustment of survey estimates and responsive survey designs. Unfortunately, recent studies have suggested that the accuracy of these observations can vary substantially among interviewers, even after controlling for household-, area-, and interviewer-level characteristics, limiting their utility. No study has identified sources of this unexplained variance in observation accuracy. Motivated by theoretical expectations from the observer bias literature, this study analyzed more than 45,000 open-ended justifications provided by interviewers in the U.S. National Survey of Family Growth (NSFG) for their observations on two key features of all sampled NSFG households: presence of children and expected probability of household response. The study finds that variability among interviewers in the cues used to record these observations (evident from the open-ended justifications) explains much of the previously unexplained variance in observation accuracy.

Keywords: interviewer observations, survey paradata, observer bias, multilevel modeling, interviewer variance


Survey organizations often request that field interviewers conducting face-to-face interviews record notes and observations about the housing units sampled by the organizations. Examples of major survey programs in the United States and Europe currently employing this practice include the National Health Interview Survey, the National Survey of Family Growth (NSFG), the National Survey on Drug Use and Health, and the European Social Survey (ESS). By recording information about selected features of all units in a sample that is ancillary to the survey data collection itself (or paradata; Kreuter 2013), interviewers can play a critical role in the reduction of nonresponse bias (Sinibaldi, Trappmann, and Kreuter 2014) and the implementation of responsive survey design (RSD), which uses paradata to inform field operations and modifications of survey design decisions in a real-time fashion (Groves and Heeringa 2006; Sinibaldi and Eckman 2015; Wagner et al. 2012). These observations can also provide important environmental context (Casas-Cordero et al. 2013; Jones, Pebley, and Sastry 2011) in a cost-efficient manner (Bader et al. 2010) and guide the work of new interviewers charged with taking over the workloads of previous interviewers (West and Sinibaldi 2013). Interviewers can see and hear what survey managers cannot, picking up valuable information about their assigned housing units as a data collection proceeds. These observations therefore have great potential for improving survey research operations in interviewer-administered surveys.

This potential, unfortunately, may be offset by problems with the observations, which primarily manifest themselves in the form of variability in observation accuracy among interviewers. If the observations are to be used for nonresponse adjustment of survey estimates or RSD, they need to have sufficiently high accuracy, and the literature suggests that this is not always the case (Sinibaldi et al. 2013; West 2013a; see West 2011 for several additional examples). Excessive levels of error in the observations may lead to nonresponse adjustments that increase the bias in estimates (West 2013a, 2013b), introducing a need for more complex adjustments that rely on assumptions (West and Little 2013). The literature also suggests that shifts in survey estimates based on nonresponse adjustments incorporating these observations are generally slight (Kreuter et al. 2010; West 2013a; West, Kreuter, and Trappmann 2014), despite the fact that the observations have been shown to have associations with both response propensity and key survey variables. If some interviewers collect very error-prone observations and training on the observation process is not standardized, relying on interviewer observations for these purposes may not be fruitful.

This study focuses on the open question of why some interviewers collect more accurate observations than others. In other words, what are the reasons for interviewer variance in the accuracy of the observations? This variability among interviewers in observation accuracy, even after controlling for a variety of respondent-, area-, household-, and interviewer-level covariates that may affect accuracy, has been clearly demonstrated in a variety of different interviewer-administered surveys (Sinibaldi, Durrant, and Kreuter 2013; West and Kreuter 2013, 2015; West, Kreuter, and Trappmann 2014). So what are the accurate interviewers doing differently from the interviewers who record error-prone observations, when faced with the same essential survey conditions? Does a lack of training in appropriate standards for recording these observations result in different interviewers looking for different cues in the field?

This study attempts to address these gaps in knowledge related to interviewer observations. We aim to answer two primary research questions:

  1. In a large national face-to-face survey with only minimal training dedicated to recording interviewer observations, do different interviewers tend to look for different cues when making their observations?

  2. Do the different cues used ultimately explain variability among interviewers in the accuracy of their observations, after adjusting for household, interviewer, and area characteristics that may affect accuracy?

Theoretical Expectations

Survey interviewers who do not receive standardized training on the observation process may vary in terms of what they believe are “relevant” cues when tasked with making a particular observation. That is, they may think that certain cues are more or less relevant as indicators of a household having particular features, based on varying prior expectations of how the social world around them is organized (e.g., “a household with young children present will have features X, Y, and Z, so I will only look for features X, Y, and Z …”; Funder 1987; Manderson and Aaby 1992a, 1992b; McCall 1984; Tversky and Kahneman 1974). Unfortunately, empirical evidence of this behavior is largely lacking in the literature (e.g., West and Kreuter 2011), and no studies have examined the relationships between the types of cues used and observation accuracy. Knowledge of these relationships could be used to standardize interviewer training on the observation process.

A key theoretical determinant of the cues that interviewers look for could be the types of geographic areas to which the interviewers have been assigned. Interviewers working in different types of areas generally operate under very different conditions, and the information structure (i.e., the types of housing unit features that could be observed in theory) could vary greatly across geographic areas, potentially affecting the cues available. Furthermore, looking for particular types of cues that could indicate selected features of housing units may be more or less effective in certain types of areas (e.g., always looking for toys as indicators of young children being present in a housing unit may not be as effective in high-density urban areas), motivating a need to consider moderating effects of the type of area being worked on the relationship between use of a particular cue and observation accuracy.

We therefore consider three broad classes of interviewer assignments, in terms of the types of sampled geographic areas to which they might be assigned: (1) the largest metropolitan statistical areas (MSAs) only (e.g., New York, Chicago, and Los Angeles), having the highest population density; (2) a mixture of high-density urban and low-density rural primary sampling units (PSUs), which is a common assignment for “traveling” interviewers; and (3) low-density rural PSUs only. We consider these three fairly broad classes of geographic areas to ensure that we will have reasonable counts of interviewers within each class for our forthcoming empirical investigation. These three classes will also generally vary in terms of (1) crime levels, where possessions may or may not be left out in the open for observation and interviewers may not feel comfortable investigating properties in detail; (2) types of housing units, which may or may not lend themselves to easy access and observation (e.g., apartments in multiunit buildings); and (3) heterogeneity of the neighborhoods, where more homogeneous neighborhoods may cause interviewers to infer that clusters of housing units have similar features (Tversky and Kahneman 1974). We later control for observable features of individual housing units that could vary within these larger classes (e.g., location in a multiunit building) when assessing the relationships of using particular cues with observation accuracy.

So what types of cues might be associated with increased observation accuracy? The literature on observer bias from various fields of social science (e.g., anthropology, psychology) provides some hints. Some interviewers (e.g., those working in high-density urban areas) may resort to considering features of the areas in which they are working in the absence of any household-specific cues. This could be helpful if the interviewers are familiar with the areas, and the areas are fairly homogeneous in terms of the feature(s) being observed. But if areas tend to be more heterogeneous, interviewers may incorrectly apply expectations that all households in that area will have the same features, or the assumption that if several households have been similar, the next will also have similar features (Babbie 2001; Das 1983; Harris, Jerome, and Fawcett 1997; Manderson and Aaby 1992a, 1992b; Millen 2000; Repp et al. 1998; Seidler 1974; Tversky and Kahneman 1974). This could lead to reduced observation accuracy.

For other interviewers, the observational task may be quite difficult (e.g., inability to access a locked building of apartments, working in crowded urban areas), leading to a failure to pick up on important external visual cues and subsequent guessing or “going on hunches.” In these situations, observations would be expected to have reduced accuracy (Feldman et al. 1951; Funder 1987; Graham 1984; Jones et al. 1979; Kazdin 1977; Most et al. 2005; Simons and Jensen 2009), and interested readers can consult these studies for details on how accuracy was measured. When faced with similar environmental context (e.g., working in an urban area), some interviewers may rely on their aforementioned social expectations and only try to detect those cues that they deem relevant. Other interviewers may attempt to pick up on whatever cues are available for detection, rather than exclusively relying on cues that they believe to be “relevant.” This diversity in the cues used (depending on what is available for observation at a given housing unit), combined with an ability to detect specific features of a given housing unit, would be expected to result in increased accuracy (Funder 1995; Kazdin 1977; West et al. 2014; West and Kreuter 2015). The social psychology literature also suggests that observations based on first impressions in the presence of limited information will tend to have increased accuracy (Ambady, Hallahan, and Conner 1999; Patterson and Stockbridge 1998). Whether or not these theoretical expectations related to observation accuracy are borne out in the survey interviewing context remains an open research question.

Efforts to detect whatever cues are available could also lead to inaccurate observations in a particular survey context. If an interviewer working in an urban area frequently changes the cues that she or he is looking for and consistently picks up on different cues that are not necessarily related to the feature(s) being observed, observation accuracy may be reduced, and a narrower focus on specific and relevant cues (if available) may prove more effective (Funder 1995). Interviewers may also vary in terms of their perceptive ability (Ambady, Hallahan, and Rosenthal 1995; Patterson, Foster, and Bellmer 2001; Patterson and Stockbridge 1998; Smith, Archer, and Costanzo 1991). However, observation accuracy has been shown to be more of a function of the difficulty of the observational task rather than individual ability (Feldman, Hyman, and Hart 1951; Kazdin 1977; Seidler 1974; Simons and Jensen 2009). We hypothesize that interviewers will tend to look for different cues when tasked with recording potentially difficult observations (Feldman, Hyman, and Hart 1951; Kazdin 1977; Seidler 1974; Simons and Jensen 2009), and we aim to test the associations of use of different cues with observation accuracy. By identifying those cues that have the strongest relationships with observation accuracy, we hope to motivate future experimental assessments of alternative interviewer training strategies designed to produce more consistent observation accuracy among interviewers.

In summary, although very little empirical work in this area has focused on survey interviewers, we do have the following theoretical expectations based on the literature reviewed above:

  1. Survey interviewers will likely vary in terms of the cues that they believe are relevant for a particular observation with which they have been tasked, given their varying backgrounds and expectations about how the social world around them is organized;

  2. The relationship of using area-specific cues with observation accuracy will likely vary depending on the type of area being worked;

  3. Observation accuracy will likely suffer in areas where observations are more difficult (e.g., crowded urban areas with multiunit buildings) and interviewers are forced to make guesses about the features being observed;

  4. Interviewers picking up on whatever cues are available for detection (i.e., showing more diversity in the cues used) and not only focusing on a particular set of cues that they deem “relevant” will tend to produce more accurate observations; and

  5. Interviewers noting specific and relevant features of a housing unit and going on first (or general) impressions will tend to produce more accurate observations.

Method

The NSFG

The NSFG is an ongoing demographic survey that collects fertility data from four independent national samples (or “quarter samples”) of persons aged 15–49 in the United States each year. The NSFG is based on a multistage area probability sample of U.S. households, and the survey is completed in two stages. In the first stage, interviewers (all of whom are female) complete a screening interview by using computer-assisted personal interviewing (CAPI) to collect a list of all members of a sampled household, including their age, sex, race, and ethnicity; the response rate to this screening interview is 92 percent on average. One person aged 15–49 is then selected at random from the age-eligible persons within the household. In the second stage, the interviewer seeks a 60- to 80-minute “main” interview from the selected person using CAPI; the response rate to this main interview is 70 percent on average. This relatively high response rate is enabled by the use of RSD strategies (Wagner et al. 2012), “tokens of appreciation” for respondent participation that can be as high as US$80 (Wagner et al. 2017), and general completion of the “main” interview by the same interviewer who conducted the screening interview.

Some NSFG interviewers are only assigned to work in smaller PSUs, which are typically rural counties outside of larger MSAs, while other interviewers are assigned to work in either larger MSAs exclusively (which tend to be more urban) or a mix of larger MSAs and smaller PSUs. Given our theoretical expectations outlined earlier, we divided 73 NSFG interviewers who collected data from any of the eight quarter samples fielded between September 2011 and August 2013 into three groups, according to the types of PSUs to which they were assigned: the largest three MSAs in the United States only (the MSA group, with 8 interviewers); a mixture of MSAs and rural PSUs (the MIX group, with 37 interviewers); or rural PSUs only (the RURAL group, with 28 interviewers).

Observational Tasks

NSFG interviewers are asked to make the following observation immediately prior to their first in-person contact attempt with a sampled housing unit: “Do you think there are children under 15 years of age living in the unit? (Yes/No).” Interviewers are not allowed to proceed with attempting a screening interview until this observation has been entered, so these observations are never entered after making initial contact with a household, there are no missing data for the observations, and interviewers cannot change their observations after learning more about the household. Previous research (West 2013a) has shown that this observation is predictive of multiple key NSFG variables and household response propensity, making it a promising auxiliary variable for nonresponse adjustments and RSD strategies (Wagner et al. 2012). After recording this observation using the CAPI application, each of the 73 interviewers was required to type an open-ended justification for this observation, with no word limit (e.g., “There were kids’ toys and bicycles in the front yard”).

In addition, after seven weeks of data collection for a given quarter sample, interviewers were required to answer the following question based on all effort applied to a given sampled housing unit, whether contact was established or not: “What is the probability of getting the main interview? (1 = high, 2 = medium, 3 = low).” Similar estimates have been recorded in other surveys (e.g., Eckman, Sinibaldi, and Montmann-Hertz 2013; Sinibaldi and Eckman 2015), and this estimate is currently used to guide the NSFG’s RSD strategies. After entering this estimate (which also could not be changed after the fact), the interviewers were also required to type an open-ended justification for the estimate, based on the work that they had performed on the household to date (e.g., “They have a sign on the door stating that they do not open the door for anyone”). Additional probing of the justifications provided for each of these two observations was not considered, given the large additional costs that this would introduce for a national field study like the NSFG.

We focused on these two observations because their accuracy can be evaluated and because of their potential roles in RSD for many different types of face-to-face surveys. For instance, families with and without young children may have different patterns of being at home, which affects the likelihood of contact at different times of day, and may also vary in terms of key survey measures of interest (e.g., time use, community volunteering, employment, self-rated health).

Training Instructions

In training, NSFG interviewers are not presently provided with additional guidance on optimal approaches to recording these two observations based on past empirical studies. The interviewers are simply instructed to “answer a few brief questions about the housing unit and your assumptions about it” and “give us your reaction based on anything you may or may not have observed” when recording these observations. Some organizations (e.g., the ESS and the U.S. Census Bureau) are currently trying to increase the amount of training dedicated to the practice of recording these observations (e.g., Dahlhamer 2012), given how often they are used for survey operations and estimation, but no standardizing guidance is currently provided to NSFG interviewers.

Coding Operations and Reliability

There were a total of 39,048 open-ended justifications provided by the 73 interviewers for their observations regarding the presence of young children in sampled NSFG households. Of these observations, 19.7 percent were “yes” and 80.3 percent were “no.” In addition, there were a total of 8,187 open-ended justifications provided by 68 interviewers for their estimated response probabilities recorded for cases still active after the seventh week of each data collection quarter (a total of 5 interviewers provided fewer than 10 response propensity estimates and were not analyzed further). Of these estimates, 7.9 percent were high, 42.8 percent were medium, and 49.3 percent were low. A team of five coders, including the lead author of this study, two master’s-level graduate students in survey methodology, and two undergraduate students (where all students involved had no affiliation with the NSFG), identified 13 general cues that were (1) theoretically observable for all possible types of housing units and (2) often referenced in the justifications provided for observations on the presence of young children (e.g., mention of toys being visible [TOYS]). The coding team also identified 17 generally observable cues that were often referenced in the justifications provided for estimated response probabilities (e.g., any reference to resistance or refusal, including a lack of resistance [REF_RES]). Unique cues that were only mentioned rarely (e.g., garages) but could still be thought of as belonging to a more general class of cues (e.g., observable features of a housing unit) were discussed amongst the team and grouped into the more general categories.

Table 1 provides details for each of these 30 cues, along with examples of words or phrases commonly mentioned in the justifications that would result in a value of “1” being coded for a binary indicator of the cue being mentioned (as opposed to a “0”) for a given open-ended justification.

Table 1.

Details Regarding Indicators of the Cues Mentioned (1 = Yes, 0 = No) That Were Coded for Each Justification, Including Ranges of Intercoder Reliabilities (κ Statistics) for the Indicators Across the NSFG Quarters.

| Interviewer Observation | Indicator | Example Mentions in Justifications Resulting in the Indicator Being Coded 1 | Range of κ Statistics, Q1–Q8 |
| --- | --- | --- | --- |
| Presence of young children under the age of 15 | HOUSE_OBS | Reference to observable features of the actual housing unit structure, including vacant house/no one living in house/housing unit actually a business, windows, garages, flags, wheelchair ramps, doors or items hanging on doors, garbage cans, blinds or curtains, stairs, signs, pets, gates (possibly locked), furniture, general items indicating children (such as strollers), mailbox comments, maintenance or age of the housing unit, and decks or porches | 0.84–0.98 |
| | AREA_OBS | Reference to features of the area, neighborhood, or community (including multiunit buildings/locked buildings and their balconies), playground nearby, references to neighboring houses, children playing in the area or around the area of the house, and courtyards | 0.95–0.99 |
| | TOYS | Reference to the presence of toys in or around the house (e.g., balls, dolls, toy cars, wagons) | 0.87–0.99 |
| | NO_TOYS | Reference to the absence of toys in or around the house | 0.95–0.99 |
| | DECORATIONS | Reference to decorations in the house or yard, kids paintings, holiday or seasonal decorations (e.g., pumpkins), decals on windows, ornaments | 0.62–0.98 |
| | NO_DECORATIONS | Reference to no decorations on or in the house | 0.82–1.00 |
| | HH_MEMBER | Reference to demographics of people actually seen in the household: the absence of elderly people or elderly household members, actually seeing children in or outside the house, names on mailboxes or signs, establishment of telephone contact for the screener, explicit descriptions of people in household from neighbors, formal statement of no kids being present | 0.70–0.98 |
| | CARS | Reference to whether there are cars/motorcycles/boats around the household or not, bumper stickers, baby car seats, motor homes/campers/recreational vehicles, and explicit mention of brands of cars (e.g., Ford Edge) | 0.88–0.98 |
| | BIKES | Reference to the presence of bikes for young kids (yes or no), including tricycles and bike helmets | 0.94–0.99 |
| | YARD | Reference to observable features of the yard or property: plants, gardens, farm equipment, landscaping, pool, patio, fence, basketball hoop, swing set/jungle gym, play equipment, sports equipment, toys in yard, etc.; does not include references to “outside” visual cues or driveways | 0.87–0.99 |
| | GUESSING | Vague reference to guessing, not being sure, “can’t tell,” don’t know/unknown, “just feels” a certain way, hunches, indication that they cannot see, observing without an actual visit | 0.83–0.99 |
| | NEIGHBOR | Reference to speaking with neighbors who live near the housing unit | 0.33–0.98 |
| | NO_EVIDENCE | General reference to no evidence or signs of children; no presence of children, no children’s stuff or items, and “nothing suggests children” | 0.95–0.99 |
| Estimated response probability | NO_CONTACT | Interviewer unable to make contact with anyone (or multiple visits not talking to anyone), informant or HH members evasive/will not open the door (“silent refusals”), no one ever home, traveling interviewer coming in, explicit lack of direct contact that is stated directly | 0.81–0.99 |
| | INTEREST | Explicit reference to respondent interest in the survey (high or low) | 0.75–1.00 |
| | REF_RES (REFUSAL/RESISTANCE) | Given establishment of contact: hard refusal, very uncooperative, resistance (high or low), reference to respondents or informants being “negative” or “not negative,” interviewer has made contact but does not believe they will open the door, respondent/informant closes or slams door despite contact (acting like gatekeeper) | 0.85–0.99 |
| | NO_INFO | Informant or respondent unwilling to share information about the household or does not finish the screening interview | 0.14–1.00 |
| | INTRODUCTION | Evidence that interviewer has made contact in general (maybe by telephone) and expects future outcome based on recognition, information provided by interviewer, establishment of an appointment, initial contact made but respondent generally busy, and provided minimal resistance | 0.49–0.94 |
| | CONCERNS | Respondent/informant expressed concerns over some aspects of the survey (including not wanting to do surveys), such as privacy, confidentiality, or government sponsorship | 0.59–1.00 |
| | OBS_HOUSE | Reference to observable features of the housing unit/structure (e.g., under construction, gated, pets, vacant, poorly maintained, babysitter/housekeeper who is not HH member, pets, signs, garages, driveway, evidence of children/teens), address or unit does not exist | 0.64–0.99 |
| | CARS | Reference to cars in the driveway or seeing people driving away | 0.75–1.00 |
| | YARD | Reference to observable features of the lawn or yard more generally | 0.22–1.00 |
| | PERSONALITY | Reference to personality of potential respondent/informant (e.g., mean, hostile, yelling, nice/friendly, happy, cooperative, stressed, busy, adamant about something) | 0.33–0.99 |
| | DEMOGRAPHICS | Reference to sociodemographics of potential respondent/informant, including gender, age (someone may be eligible), race/ethnicity, employment status, incarceration/jail, travel habits, mobility, or reference to a sole occupant (single-person household) | 0.85–0.97 |
| | ACCESS | Reference to problems accessing the property (e.g., locked building/locked gate/security door), establishing contact via telephone, access problems due to safety concerns | 0.75–0.99 |
| | COMMUNICATION | Reference to communication problems due to language (Spanish, Korean, etc.) or health problems (e.g., handicapped, disabled) | 0.54–1.00 |
| | TIMING | Reference to problems with timing (not sure when respondent is home, hard to reach, works odd hours, “if I can find respondent at home,” respondent on phone, respondent unavailable at that moment, current problem in household such as baby sleeping), trying to catch respondent when not busy, or effort to work out right time (e.g., needs visit at particular time or respondent gave me a best time known [BTK]) | 0.46–0.98 |
| | HH_MEMBER | General reference to observable features of other persons in the household aside from the potential informant/respondent (e.g., family, kids, babies, partners, spouses, cohabitants, “we,” older people, ethnicity of couple/family), possibly in addition to specific references to individuals (does not include respondents living alone), requires mention of specific features | 0.31–0.99 |
| | NEIGHBOR | Evidence of discussions with neighbors about the people in the housing unit | 0.60–0.98 |
| | OBS_AREA | Reference to observable features of the community in general, features of a neighborhood or street, features of a multi-unit building, etc., also interviewer safety concerns about an area | 0.79–0.99 |

Note. NSFG = National Survey of Family Growth.

After reviewing the types of justifications that would result in each of the cue indicators being coded in detail (Table 1), two members of the coding team independently examined each of the justifications for a given observation in a given quarter and coded the justifications provided in that quarter on the identified indicators (1 = the cue was mentioned in the justification, 0 = otherwise). All justifications were thus coded twice by a given pair of coders, and data sets containing the codes for each justification were compared using SAS (Version 9.4) PROC COMPARE to identify discrepancies in coding between the two coders and measure overall levels of exact agreement (i.e., where all coded indicators exactly agreed for a given justification). SAS PROC FREQ was used to compute κ statistics measuring levels of intercoder agreement for each indicator.

Overall agreement rates varied between 86.1 percent and 98.5 percent for the young children justifications and between 91.0 percent and 98.6 percent for the response propensity estimate justifications across the eight NSFG quarters. The ranges of κ measures for each individual indicator across the eight NSFG quarters are also provided in Table 1. The ranges of the κ statistics across the quarters all suggest moderate to almost perfect agreement for each of the coded indicators (Landis and Koch 1977), and the κ values tended to increase across the eight quarters, as the team members became more proficient with the coding task (results not shown). Near perfect (or perfect) agreement was found in some quarters for cues that were infrequently cited (e.g., justifications for response propensity estimates that mentioned features of the yard). Every individual discrepancy in the coded indicators was discussed among the team members and subsequently resolved to produce a final set of codes for each justification.
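The two agreement measures described above were computed in SAS (PROC COMPARE and PROC FREQ). As a rough illustration only, and not the authors' code, the following Python sketch shows how an exact-agreement rate and a Cohen's κ statistic could be computed for two coders' binary indicators. The function names and the example codes are hypothetical.

```python
from typing import Sequence


def cohen_kappa(codes_a: Sequence[int], codes_b: Sequence[int]) -> float:
    """Cohen's kappa for two coders' binary (0/1) codes on the same items."""
    if len(codes_a) != len(codes_b) or len(codes_a) == 0:
        raise ValueError("Both coders must code the same non-empty set of items.")
    n = len(codes_a)
    # Observed agreement: share of items on which the two coders match.
    p_obs = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement, based on each coder's marginal rate of coding a 1.
    rate_a = sum(codes_a) / n
    rate_b = sum(codes_b) / n
    p_exp = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_exp == 1.0:
        # Both coders used one identical category throughout; kappa is 0/0.
        return float("nan")
    return (p_obs - p_exp) / (1 - p_exp)


def exact_agreement_rate(rows_a, rows_b) -> float:
    """Share of justifications on which ALL coded indicators agree."""
    if len(rows_a) != len(rows_b) or len(rows_a) == 0:
        raise ValueError("Both coders must code the same non-empty set of items.")
    matches = sum(tuple(a) == tuple(b) for a, b in zip(rows_a, rows_b))
    return matches / len(rows_a)


# Hypothetical codes for one indicator (e.g., TOYS) from two coders.
toys_coder_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
toys_coder_2 = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
kappa_toys = cohen_kappa(toys_coder_1, toys_coder_2)
```

Exact agreement is the stricter criterion, because a justification counts as a match only when every indicator agrees, whereas κ is computed one indicator at a time and corrects for chance agreement.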

Calculation of Shannon’s Entropy (ShE)

Given the final set of values on the indicators for each justification, the different “types” of justifications, defined by combinations of values on all of the coded indicators, were identified for each observation. For example, one “type” of justification for a recorded observation about the presence of young children might be mentioning observable features of the house (HOUSE_OBS = 1) and presence of toys (TOYS = 1) and no other cues (a 0 for the other 11 cue indicators). The total number of unique “types” of justifications used by the interviewers for each observation was enumerated, resulting in 222 unique types of justifications for the observations on young children and 403 unique types of justifications for estimated response propensities. Considering our first research question, these counts alone provide fairly clear initial evidence of substantial variability in terms of the variety of the justifications used by the interviewers when actually visiting the households.
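To make the notion of a justification “type” concrete, the sketch below treats each type as the tuple of 0/1 values across the coded indicators and counts the distinct tuples. It is a minimal illustration under stated assumptions: the records are invented, and only three of the 13 young-children indicators are shown for brevity.

```python
from collections import Counter

# Assumption: a shortened indicator list; the study coded 13 indicators
# for the young-children observation.
INDICATORS = ("HOUSE_OBS", "TOYS", "GUESSING")

# Hypothetical coded justifications, one dict of 0/1 indicators each.
coded = [
    {"HOUSE_OBS": 1, "TOYS": 1, "GUESSING": 0},
    {"HOUSE_OBS": 1, "TOYS": 0, "GUESSING": 0},
    {"HOUSE_OBS": 1, "TOYS": 1, "GUESSING": 0},
    {"HOUSE_OBS": 0, "TOYS": 0, "GUESSING": 1},
]


def justification_type(row):
    """Return the tuple of indicator values that defines a justification's type."""
    return tuple(row[name] for name in INDICATORS)


# Count how often each distinct combination of indicator values occurs.
type_counts = Counter(justification_type(row) for row in coded)
n_types = len(type_counts)
```

With binary indicators, the number of possible types grows as a power of two in the number of indicators, so the observed counts of unique types are far smaller than the theoretical maximum.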

Given these counts of unique possible types of justifications for each observation, the proportions of all justifications provided by each interviewer for each observation (young children and response propensity) that were classified into each type were computed. The availability of these proportions for each interviewer enabled computation of an ShE measure (see, e.g., Hill 1973) for each interviewer and each observation. Given the $K$ possible proportions $p_{ik}$, $k = 1, \ldots, K$, for each interviewer $i$, where $K$ is the total number of possible types of justifications (e.g., $K = 403$ for the response propensity estimates), the ShE measure for each interviewer was computed as $\mathrm{ShE}_i = -\sum_{k=1}^{K} p_{ik} \log(p_{ik})$. Given that the minimum ShE measure is 0 and the maximum ShE measure is $\log(K)$, the computed measure of ShE was divided by $\log(K)$ to rescale it to (0,1) and then multiplied by 100. An interviewer using more diversity in her justifications, suggesting that she looks for different cues depending on the household worked, would thus have an ShE measure closer to 100. An interviewer using the same type of justification for most or all of her observations (e.g., an interviewer who refers only to evidence of refusal for 99 percent of her justifications related to the response propensity estimates) would have an ShE measure closer to 0.
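The rescaled entropy measure described above can be sketched in a few lines of Python. This is an illustrative implementation rather than the authors' code: the function name and inputs are hypothetical, and it assumes the usual convention that types an interviewer never used (proportion 0) contribute nothing to the sum.

```python
import math
from collections import Counter


def normalized_entropy(types, n_possible_types):
    """Shannon entropy of one interviewer's justification types, rescaled to 0-100.

    types: one type label per justification given by the interviewer.
    n_possible_types: K, the number of types observed across all interviewers.
    """
    if n_possible_types < 2:
        raise ValueError("Need at least two possible types to rescale by log(K).")
    counts = Counter(types)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("The interviewer provided no justifications.")
    # -p*log(p) is written as p*log(1/p). Unused types have p = 0 and add
    # nothing, so summing over the observed types equals summing over all K.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return 100 * entropy / math.log(n_possible_types)


# An interviewer who always gives the same type of justification.
same_type_score = normalized_entropy(["refusal"] * 5, n_possible_types=4)
# An interviewer who spreads her justifications evenly over all four types.
even_spread_score = normalized_entropy(["a", "b", "c", "d"], n_possible_types=4)
```

Because K is the number of types observed across all interviewers, an individual interviewer reaches a score near 100 only by spreading her justifications evenly over nearly every type that anyone used.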

The two entropy measures computed for each interviewer therefore reflect the amount of variability in the types of justifications that a given interviewer used when recording each observation for their assigned housing units. This measure of the variety in the justifications used by a given interviewer quantifies whether interviewers tend to change the cues that they use depending on context. It could not be captured in a design that only asks interviewers about their general strategies for recording these observations outside of the context of their assigned housing units.

Analyses of Interviewer Variance in Cues Used Over Time

Next, to address our first research question, the data files of coded indicators for each justification were aggregated to the interviewer level, using masked interviewer ID codes. Specifically, the number of justifications mentioning a specific cue and the total number of justifications provided overall were computed for each interviewer and merged with the computed ShE measures. For example, interviewer A might justify 60 percent of her recorded observations on presence of young children by mentioning cues representing observable features of the housing unit (e.g., children’s artwork in the windows or small shoes on the porch), 20 percent of her observations by mentioning toys, and have a computed ShE value of 30, given the frequencies with which she referenced other cues. We also computed the same interviewer-level counts for each of the NSFG quarters in which they worked, producing a data set with aggregate counts for all interviewer–quarter combinations.
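The aggregation to interviewer–quarter combinations can be sketched as follows. The record layout, interviewer IDs, and values are hypothetical; the output pairs the number of justifications mentioning each cue ("events") with the total number of justifications ("trials").

```python
from collections import defaultdict

# Hypothetical coded records: (masked interviewer ID, quarter, cue indicators).
records = [
    ("A", 1, {"TOYS": 1, "YARD": 0}),
    ("A", 1, {"TOYS": 0, "YARD": 1}),
    ("A", 2, {"TOYS": 1, "YARD": 1}),
    ("B", 1, {"TOYS": 0, "YARD": 0}),
]


def aggregate_by_interviewer_quarter(rows):
    """Count cue mentions and total justifications per (interviewer, quarter)."""
    trials = defaultdict(int)
    events = defaultdict(lambda: defaultdict(int))
    for interviewer, quarter, cues in rows:
        key = (interviewer, quarter)
        trials[key] += 1
        for cue, flag in cues.items():
            events[key][cue] += flag
    return {
        key: {"trials": trials[key], "events": dict(events[key])}
        for key in trials
    }


agg = aggregate_by_interviewer_quarter(records)
```

Summing the same counts over quarters would give the overall interviewer-level percentages described in the example of interviewer A.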

Given these quarter-specific interviewer-level measures of justification tendencies, we then employed multilevel logistic regression models to examine whether interviewers varied in general in terms of the probability of using a particular cue and also whether the interviewers changed their tendencies in a systematic fashion across the quarters. This may have happened, for example, if the interviewers could detect that the cues that they were using to record an observation were in fact associated with higher (or lower) observation accuracy when completing screening interviews, and in turn, they started to look for particular cues (or ignore other cues) more regularly. Separately for each of the three groups of interviewers defined by the types of areas to which they had been assigned (MSA, MIX, and RURAL), we modeled overall average trends in the probability of using each cue (for each of the two observations), in addition to between-interviewer variance in the means and slopes of these probabilities across the eight NSFG quarters. More specifically, we used adaptive quadrature (see Kim, Choi, and Emery 2013) to fit the multilevel logistic regression model in equation (1) described below to the quarterly interviewer-level data from each type of area, using the number of times that a given cue was used (a binomial number of “events”) and the overall number of justifications provided by each interviewer (the total number of “trials”) in each quarter:

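$$\ln\left(\frac{\phi_{it}}{1-\phi_{it}}\right) = \beta_0 + u_{0i} + \beta_1 q_t + u_{1i} q_t,$$

where $\phi_{it}$ = probability of using a particular cue for interviewer (IWER) $i$ at time $t$; $q_t$ = value of quarter (1 through 8) at time $t$; and

$$\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{pmatrix}\right). \qquad (1)$$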
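To make the roles of the terms in equation (1) concrete, the sketch below converts the linear predictor into a probability for one interviewer and quarter. All parameter values are hypothetical and are not estimates from the study.

```python
import math


def cue_probability(quarter, beta0, beta1, u0=0.0, u1=0.0):
    """Probability of using a cue in a given quarter under equation (1).

    beta0, beta1: fixed intercept and fixed quarter slope.
    u0, u1: this interviewer's random intercept and random slope deviations.
    """
    linear_predictor = beta0 + u0 + (beta1 + u1) * quarter
    # Inverse logit turns log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Two hypothetical interviewers with different intercepts and no slope variance:
# their probabilities differ but stay parallel on the logit scale across quarters.
low_user = [cue_probability(q, beta0=-1.0, beta1=0.0, u0=-0.8) for q in range(1, 9)]
high_user = [cue_probability(q, beta0=-1.0, beta1=0.0, u0=0.8) for q in range(1, 9)]
```

With the slope terms set to zero, each interviewer's probability is flat across the eight quarters, which is the pattern of intercept variance without slope variance.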

Our theoretical expectations suggest that different interviewers assigned to work in the same types of areas would tend to use different cues in the absence of standardized training on this process and that the rates at which these cues were used would remain stable for a given interviewer across the quarters. This pattern of results would arise if interviewers consistently use the cues that they believe are relevant for a particular observation in the absence of standardized training. In terms of the statistical model described above in equation (1), we therefore expected to find evidence of significant interviewer variance in the intercepts (i.e., $\sigma_0^2 > 0$) but not the slopes (i.e., $\sigma_1^2 = 0$), using appropriate likelihood ratio tests (Zhang and Lin 2008); that is, different interviewers will remain stable over time in terms of the probability of using a particular cue but vary in terms of how often that cue is used in general. Evidence of a significant overall fixed effect of quarter (i.e., $\beta_1 \neq 0$) would suggest that interviewers were systematically changing the types of cues used over time, and evidence of significant variance in the random slopes (i.e., $\sigma_1^2 > 0$) would suggest that different interviewers tended to have different trends (possibly increasing or decreasing) in terms of the probability of using a particular cue. We did not expect either of these results, consistent with a hypothesis that even when working in the same type of area, different interviewers tend to look for different cues in the absence of standardized training on the observational process.

Multilevel Models of Observation Accuracy

Finally, to address our second research question, we considered a sequence of five multilevel models of observation accuracy to see if the types of cues used in the justifications (either at the household level or in aggregate at the interviewer level) could explain significant portions of the unexplained variability in accuracy among interviewers that has been repeatedly noted in prior studies. We once again employed adaptive quadrature when fitting these multilevel models.

For the young children observation, we considered as a dependent variable a simple binary indicator of accuracy at the housing unit level (1 = yes, meaning that the interviewer observation about presence of young children was consistent with the completed household roster, and 0 = no). Some interviewers were only found to make either false positive errors (indicating that a young child was present when one was not) or false negative errors (indicating that a young child was not present when one was). These “empty cell” problems would pose estimation problems when including random interviewer effects in a multilevel multinomial model, as not all interviewers would have nonzero counts for all three categories of a dependent variable indicating observation accuracy (i.e., accurate observation, false positive, or false negative). The binary accuracy indicator was determined based on completed household rosters, from which an indicator of a child under the age of 15 being present in the household was determined. Not all housing units with observations and coded justifications recorded had this accuracy indicator available; for about 6 to 10 percent of sampled households, depending on the quarter, a screening interview is never completed.
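The coding of the binary accuracy indicator, and the error types that were collapsed into its 0 category, can be sketched as below. The function names and string labels are hypothetical.

```python
def classify_child_observation(observed_child, roster_child):
    """Label one observation as accurate, a false positive, or a false negative.

    observed_child: True if the interviewer judged a young child to be present.
    roster_child: True if the completed roster lists a child under age 15.
    """
    if observed_child == roster_child:
        return "accurate"
    return "false_positive" if observed_child else "false_negative"


def binary_accuracy(observed_child, roster_child):
    """Binary outcome used in the models: 1 = accurate, 0 = either type of error."""
    return 1 if observed_child == roster_child else 0
```

Collapsing the two error types into a single 0 category sidesteps the empty-cell problem for interviewers who only ever made one kind of error.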

For the estimates of response propensity after seven weeks, we considered an ordinal accuracy outcome at the housing unit level: (1) accurate (high estimate and the selected respondent from the household actually responds or low estimate and eventual nonrespondent), (2) neither accurate nor inaccurate (medium estimate), and (3) inaccurate (high estimate and eventual nonrespondent or low estimate and eventual respondent). This ordinal accuracy variable was computed based on final case dispositions for the completed cases in these eight quarters; once again, not all households had these final accuracy measures computed (e.g., households may be discovered to be ineligible or nonsample households). For each dependent accuracy measure, we considered a sequence of five multilevel binary (for young children) or ordinal (for estimated response propensity) logistic regression models. These models are described in detail below.
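A minimal sketch of the three-category accuracy coding for the response propensity estimates. The string labels for the estimate levels are an assumed encoding, not the variable values used in the NSFG files.

```python
def propensity_accuracy(estimate, responded):
    """Ordinal accuracy code: 1 = accurate, 2 = neither, 3 = inaccurate.

    estimate: "high", "medium", or "low" (assumed labels for the recorded estimate).
    responded: True if the selected respondent eventually responded.
    """
    if estimate == "medium":
        return 2
    if estimate not in ("high", "low"):
        raise ValueError(f"unexpected estimate level: {estimate!r}")
    predicted_response = estimate == "high"
    return 1 if predicted_response == responded else 3
```

Households without a final disposition would simply be excluded before applying this coding.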

Model 1

The first model was an “empty” model that only included a fixed intercept (or two intercepts in the case of the ordinal model) and random interviewer effects. The random interviewer effects were assumed to be normally distributed with mean zero and some unknown variance. This first model enabled us to estimate the unconditional between-interviewer variance in the accuracy of a particular observation. The significance of this variance component was assessed for each model fitted using likelihood ratio tests based on mixtures of $\chi^2$ distributions, as described by Zhang and Lin (2008). The inclusion of the fixed effects of additional covariates in each subsequent model allowed us to compute the percentage of unexplained between-interviewer variance in observation accuracy that was explained by the fixed effects of the covariates that were added to a given model.
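The sketch below shows one common way to obtain a p-value from a likelihood ratio test of a variance component. The text does not list the mixture weights, so the 50:50 mixtures used here are an assumption based on the standard boundary-value results: chi-square with 0 and 1 degrees of freedom for a single random-intercept variance, and 1 and 2 degrees of freedom when a random slope (variance plus covariance) is added to a random-intercept model. The log-likelihood values in the example are hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-square_df >= x) for df in {0, 1, 2}, using closed forms."""
    if df == 0:
        return 1.0 if x <= 0 else 0.0  # point mass at zero
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 0, 1, or 2 are supported in this sketch")


def mixture_lrt_pvalue(loglik_null, loglik_alt, q_null):
    """p-value from a 50:50 mixture of chi-square(q_null) and chi-square(q_null + 1).

    q_null: number of random effects already in the null model
            (0 when testing a random intercept, 1 when adding a random slope).
    """
    lrt = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return 0.5 * chi2_upper_tail(lrt, q_null) + 0.5 * chi2_upper_tail(lrt, q_null + 1)


# Hypothetical log-likelihoods for models without and with a random intercept.
p_intercept = mixture_lrt_pvalue(loglik_null=-1000.0, loglik_alt=-995.0, q_null=0)
```

Relative to a naive chi-square reference distribution, the mixture yields smaller p-values, reflecting that a variance cannot be negative under the null hypothesis.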

Model 2

The second model added the fixed effects of several covariates that either described features of the housing units that may impact observation accuracy or features of the areas in which the housing units were located. These covariates included the year in which a household was measured (2011, 2012, 2013); the Census region(s) being worked by a given interviewer (Midwest Only, Northeast Only, South Only, West Only, or Multiple); the base sampling weight specific to the selected respondent in the housing unit, which reflects (a) the demographic features of the area in which the housing unit is located, (b) the number of individuals in the housing unit, and (c) the demographic features of the individuals in the housing unit (Lepkowski et al. 2013); an indicator of whether the housing unit was located within a gated community; an interviewer estimate of the number of housing units in the area segment; an indicator of whether the housing unit was located in a primarily residential (vs. commercial or mixed-use) neighborhood; an indicator of whether the interviewer noted evidence of non-English speakers in the area; an indicator of whether the interviewer had safety concerns when visiting the area; an indicator of whether the housing unit was located in a multiunit building; an indicator of evidence of physical impediments to the housing unit (e.g., security system, gated driveway); the estimated percentage of households with eligible respondents in the area per NSFG models; and the percentage of occupied housing units in the area per the 2010 Census.

Model 3

The third model added the fixed effects of several interviewer characteristics that have been shown to affect observation accuracy in previous studies (West and Kreuter 2013; Sinibaldi, Durrant, and Kreuter 2013). These included age in years (50+ vs. <50), education (less than college vs. college), race/ethnicity (white vs. other), marital status (married vs. other), having additional jobs (yes vs. no), having prior NSFG experience (yes vs. no), having one or more children, years of interviewing experience (>4 vs. 4 or less), types of PSUs assigned (MSA, MIX, or RURAL), and overall workload (in terms of the total number of observations recorded that could also be validated, as a measure of experience with making the observations).

Model 4

The fourth model was critical to our second research question, as the prior literature has only considered the first three models described above. The fourth model added fixed effects of the indicators of the cues used in the justifications provided for the observations on each housing unit. In addition, the aforementioned aggregated percentages computed for each interviewer, representing the percentage of their justifications that used a particular cue, were considered. When analyzing these aggregate percentages, some large pairwise correlations emerged, and initial analyses suggested that estimates of fixed effects were being affected by large variance inflation factors (VIFs) when including all of the interviewer-level aggregates as predictors in the same model. Given these concerns about multicollinearity of the interviewer-level aggregates, a principal components analysis (PCA) was conducted, with principal components having an eigenvalue greater than 1.0 extracted. An orthogonal varimax rotation was then employed to improve interpretability of the components (see the Supplementary Materials for the rotated loadings and the interpretations of the resulting principal components), and predicted values for the components (based on regression scoring) were computed for inclusion as interviewer-level predictors. Adding the fixed effects of these principal components to the model resulted in substantially reduced average VIFs (from 2.76 to 1.90 for the model of response propensity estimate accuracy and from 2.54 to 1.84 for the model of accuracy of observations of young children). Finally, we also added a fixed effect of the interviewer-level measure of ShE, again reflecting diversity in the justifications used depending on context.

Model 5

The fifth and final model included significant interactions between the types of PSUs assigned and the individual indicators of cues used for each housing unit, the aforementioned principal components, and ShE. These interactions were identified using a procedure described by Hosmer, Lemeshow, and Sturdivant (2013), which involved adding each interaction to the fourth model described above (one at a time) and testing it for significance using a Wald test, fitting a new model that added all interactions found to be significant at the 5 percent level, and then only retaining those interactions that continued to be significant in this final model. While this fifth model was not necessarily intended to explain variance among interviewers in observation accuracy, it enabled us to understand whether the relationships of using particular cues with accuracy varied depending on the types of areas being worked, which has important implications for interviewer training.

We fitted this sequence of multilevel models using the melogit and meologit commands in the Stata software (Version 14.2). All continuous predictors were centered at their means, and potential nonlinear relationships of the continuous predictors with each accuracy outcome were assessed by recoding the predictor into quartiles and fitting a model including fixed effects of the quartiles. All Stata code and data used for fitting these models are available in the Supplementary Materials. A general overview of the rationale for the entire observational research design described in this section, including potential alternative designs and their benefits/limitations, can also be found in the Supplementary Materials.
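The authors carried out these steps in Stata; purely as an illustration, mean centering and quartile recoding of a continuous predictor might look like this in Python. The workload values are invented, and the quartile cut points use the default method of `statistics.quantiles`, which may differ slightly from Stata's.

```python
import statistics


def mean_center(values):
    """Subtract the sample mean from every value."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def quartile_codes(values):
    """Recode a continuous predictor into quartile groups 1-4."""
    q1, q2, q3 = statistics.quantiles(values, n=4)

    def code(v):
        if v <= q1:
            return 1
        if v <= q2:
            return 2
        if v <= q3:
            return 3
        return 4

    return [code(v) for v in values]


# Hypothetical interviewer workloads (number of validated observations).
workloads = [100, 200, 300, 400, 500, 600, 700, 800]
centered = mean_center(workloads)
groups = quartile_codes(workloads)
```

Entering the quartile groups as categorical fixed effects, instead of the centered predictor, gives a simple check on whether a linear term is adequate.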

Results

Descriptive Statistics

Table 2 presents the overall percentages of justifications provided that mentioned each of the specific cues across the eight NSFG quarters. For the response propensity estimates, the most common cues included noncontact, refusal/resistance, demographics of people in the household, and poor timing. Notably, there was a tremendous amount of variability among the 68 interviewers in terms of the percentages of their justifications (across quarters) that mentioned specific cues; for example, at least one interviewer never mentioned noncontact, while one interviewer mentioned this in nearly 92 percent of her justifications across the quarters in which she worked. The mean interviewer-specific entropy for the estimates of response propensity, which, as described earlier, captures the variance in the types of justifications used by a given interviewer, was 39.6, with a range from 5.8 (very limited variance in the justifications used) to 62.8 (larger variance in the justifications used) across the 68 interviewers. These results are entirely consistent with our theoretical expectations that different interviewers would tend to focus on different cues.

Table 2.

Frequencies With Which Each Cue Indicator Was Coded With a “1” Based on the Justifications for Each of the Two Observations, in Addition to Evidence of Significant Interviewer Variance in the Percentages Across the Eight NSFG Quarters, by Types of Areas Assigned.

The RURAL, MIX, and MSA columns report Evidence of Significant Interviewer Variance in Intercepts/Slopes for the Probability of Mentioning Each Cue Across the 8 NSFG Quarters.^a

| Interviewer Observation | Indicator | Percentage of All Justifications With This Indicator Coded as 1 (%) | Range of Percentages Across Interviewers (All NSFG Quarters; %) | Interviewers in Rural Areas Only (RURAL) | Interviewers in Both Rural and Urban Areas (MIX) | Interviewers in Top 3 MSAs Only (MSA) |
|---|---|---|---|---|---|---|
| Presence of young children under the age of 15 (39,048 justifications by 73 interviewers) | HOUSE_OBS | 13.53 | 0.00–48.44 | Y/Y | Y/Y | Y/N |
| | AREA_OBS | 14.40 | 0.00–46.25 | Y/Y | Y/Y | Y/N |
| | TOYS | 5.26 | 0.00–38.98 | Y/N/− | Y/N/− | Y/N |
| | NO_TOYS | 22.49 | 0.00–90.28 | Y/Y | Y/Y/− | Y/N/− |
| | DECORATIONS | 1.87 | 0.00–13.25 | Y/N/− | Y/Y/− | Y/N |
| | NO_DECORATIONS | 0.48 | 0.00–12.25 | Y/N/− | Y/N/− | Y/N |
| | HH_MEMBER | 3.19 | 0.00–22.73 | Y/N/− | Y/Y/− | Y/N |
| | CARS | 3.13 | 0.00–14.62 | Y/Y | Y/Y | Y/N |
| | BIKES | 4.50 | 0.00–52.05 | Y/N/− | Y/Y | Y/N/− |
| | YARD | 16.09 | 0.00–83.15 | Y/Y | Y/Y | Y/N |
| | GUESSING | 6.68 | 0.00–88.65 | Y/Y | Y/Y | Y/N/+ |
| | NEIGHBOR | 0.32 | 0.00–2.08 | Y/N | Y/N/− | N/N |
| | NO_EVIDENCE | 36.07 | 0.01–91.67 | Y/Y/+ | Y/Y | Y/N/+ |
| Estimated response probability (8,187 justifications by 68 interviewers) | NO_CONTACT | 42.26 | 0.00–91.79 | Y/Y | Y/Y | Y/N/− |
| | INTEREST | 3.87 | 0.00–21.67 | Y/N | Y/N/− | Y/N |
| | REF_RES (REFUSAL/RESISTANCE) | 20.19 | 0.00–62.00 | Y/N | Y/N/− | N/N |
| | NO_INFO | 0.93 | 0.00–16.44 | Y/N | Y/N | Y/N |
| | INTRODUCTION | 2.25 | 0.00–12.00 | Y/N | Y/N | Y/N |
| | CONCERNS | 1.56 | 0.00–31.76 | Y/N | Y/Y/− | N/N |
| | OBS_HOUSE | 11.87 | 0.00–50.65 | Y/N/− | Y/Y | Y/N/− |
| | CARS | 1.34 | 0.00–10.96 | Y/N | Y/N/− | N/N |
| | YARD | 0.21 | 0.00–5.13 | N/N | N/N | N/N |
| | PERSONALITY | 5.63 | 0.00–29.05 | Y/N | Y/Y | Y/N |
| | DEMOGRAPHICS | 18.61 | 0.00–60.38 | Y/N | Y/Y/− | Y/N |
| | ACCESS | 12.50 | 0.00–81.73 | Y/N/+ | Y/Y | Y/N/− |
| | COMMUNICATION | 2.70 | 0.00–12.50 | N/N | Y/Y | Y/N |
| | TIMING | 14.06 | 0.00–45.31 | Y/Y | Y/Y | Y/N |
| | HH_MEMBER | 6.33 | 0.00–28.21 | Y/N/− | Y/Y/− | Y/N |
| | NEIGHBOR | 2.96 | 0.00–18.97 | Y/N | Y/Y | N/N |
| | OBS_AREA | 10.05 | 0.00–66.35 | Y/Y | Y/Y | Y/N |

Note. NSFG = National Survey of Family Growth.

^a Y/Y = when fitting a multilevel logistic regression model (including random intercepts and random quarter slopes) to the quarterly aggregate counts of justifications mentioning a particular cue and overall justifications that were computed for each interviewer (see equation 1), there was evidence of significant (p < .05) variance among interviewers in both intercepts and slopes, based on likelihood ratio tests; Y/N = evidence of significant variance among interviewers in intercepts but not slopes (i.e., variation in mean probabilities only); N/Y = evidence of variance in slopes but not intercepts (i.e., interviewers had the same mean but different slopes across the quarters); and N/N = no evidence of variance in intercepts or slopes among interviewers (all interviewers had the same mean and randomly varied around that overall mean percentage across quarters). +/− indicates whether the fixed effect of quarter was significant in the model (p < .05) and the direction of the relationship of quarter with the probability of using a particular cue.

For the observations on presence of young children, the most common cues included no general evidence of children, absence of toys, features of the yard, features of the area, and features of the housing unit. There was also tremendous variability among the 73 interviewers in terms of the overall percentages (across quarters) of justifications mentioning specific cues for these observations regarding presence of young children: For instance, at least one interviewer never mentioned the absence of toys, while another interviewer mentioned this cue in more than 90 percent of her justifications. The mean interviewer-specific entropy for the observations on presence of young children was 36.3, with range 11.5 (limited variance in justifications used) to 71.1 (large variance in justifications used) across the 73 interviewers. These results were also consistent with our theoretical expectations regarding interviewer variance in the cues used.

Trends in the Use of Particular Cues

The RURAL Group

When fitting the multilevel logistic regression models to the quarterly aggregates indicating how frequently particular cues were used by the 28 interviewers in the RURAL group (25 of which recorded 10 or more response propensity estimates), we found evidence of significant variance in the intercepts for all 13 of the coded indicators for observations on presence of young children and 15 of the 17 coded indicators for the response propensity estimates (see Table 2 for a summary). We found some evidence of variance in the slopes among interviewers for the frequency of using a particular cue (7 of the 13 indicators for the observations on presence of young children and 3 of the 17 indicators for the response propensity estimates), suggesting that interviewers working in the same type of (rural) area vary substantially not only in terms of their general tendencies but also in terms of the frequency with which they use particular cues over time. Significant (p < .05) negative fixed effects of quarter were found for the probabilities of mentioning toys, decorations, no decorations, bikes, and household members (for observations on young children) and housing unit features and household members (for response propensity estimates), suggesting general decreases in the probability of using these cues over time. Significant positive fixed effects of quarter were found for no general evidence of children and access problems (suggesting general increases in the use of these cues over time).

The MIX Group

For the 37 interviewers in the MIX group (36 of which recorded 10 or more response propensity estimates), we found evidence of significant variance in the intercepts for all 13 indicators for presence of young children and 16 of the 17 indicators for estimates of response propensity (see Table 2 for a summary). We also found evidence of significant variance among interviewers in the slopes for 10 of the 13 indicators for presence of young children and 11 of the 17 response propensity indicators. Furthermore, we found evidence of significant negative fixed quarter effects for toys, no toys, decorations, no decorations, household members, and speaking with neighbors (for observations on presence of young children) and interest, refusal/resistance, concerns, cars, demographics, and household members (for response propensity estimates). These results suggest that there were systematic decreases in the use of these cues over time, and that while interviewers working in a mix of different areas once again varied in terms of their general tendencies, they also varied somewhat in terms of changes in these probabilities over time (e.g., the probability of guessing increased for some interviewers over time but decreased for others).

The MSA Group

Finally, for the 8 interviewers working in the three largest MSAs exclusively (7 of which recorded 10 or more response propensity estimates), significant interviewer variance in the intercepts was found for 12 of the 13 indicators for presence of young children (with the exception being speaking with neighbors; in the highest-density urban areas, this cue may not be entirely useful) and 12 of the 17 indicators for response propensity estimates (see Table 2 for a summary). We view these results as especially meaningful, given the small number of interviewers working in these types of areas only (which could affect the reliability of our estimated variance components; Cohen 1998) and the fact that these areas tend to be densely populated and highly urban, which may minimize the household-specific features available for observation. We found little evidence of between-interviewer variance in the slopes; significant positive fixed effects of quarter were found for refusal/resistance and speaking with neighbors (for response propensity estimates) and for guessing and a general lack of evidence of children (for the young children observations), suggesting that the use of these cues was increasing over time in these more urban areas in general. We also found negative fixed effects of quarter for bikes and the absence of toys (for the young children observations) and no contact, housing unit observations, and access problems (for the response propensity estimates), suggesting that the use of these cues was generally decreasing over time in these urban areas. Figure 1 presents an example of the type of between-interviewer variance that was observed across the eight quarters, highlighting two interviewers in the MSA group that varied substantially in terms of how often they noted a general lack of evidence of children (and did not change their tendencies across the quarters; i.e., their intercepts varied and their slopes did not). 
This type of pattern was noted for many of the different cues: Despite the fact that interviewers were working in the same types of areas, they still varied substantially in terms of how often they cited particular cues.

Figure 1.

Evidence of interviewer variance in the percentage of justifications for observations on presence of young children that referred to a general lack of evidence of children, for two of the interviewers assigned to work in the largest U.S. metropolitan statistical areas.

Collectively, these results suggest that different interviewers assigned to work in the same types of areas (with similar environmental features) do in fact tend to note different cues when observing the presence of young children or estimating response propensity. These results provide additional support for our theoretical expectation that different interviewers will tend to rely on different cues. They also suggest that summarizing interviewer tendencies by aggregating these percentages across quarters may miss important features of the data, given that some interviewers had percentages that changed in different directions across the quarters. This warrants examining the individual cues used in each quarter and their relationships with observation accuracy.

Observation Accuracy by Types of Areas Worked

In terms of the accuracy of the observations regarding presence of young children, 74.1 percent of the observations made by interviewers in the RURAL group were found to be accurate, as compared to 72.5 percent of the observations made by interviewers working in the MIX group and 69.2 percent of the observations made by interviewers working in the MSA group. In terms of the accuracy of the response propensity estimates, 38.2 percent of the estimates made by interviewers in the RURAL group were found to be accurate, as compared to 49.0 percent of the estimates made by interviewers in the MIX group and 52.1 percent of the estimates made by interviewers in the MSA group. These results are consistent with our theoretical expectation that certain types of observations may be more or less difficult in general depending on the features of the area being worked. The accuracy rates of individual interviewers were found to vary widely around these overall rates, as expected (see the multilevel modeling results below).

Multilevel Models: Accuracy of Observations on Presence of Young Children

The results in Table 3 indicate that the different cues used by the interviewers for observing presence of young children explained a large percentage of the variance in observation accuracy that was not accounted for by fixed effects of the household-, area-, and interviewer-level covariates. The fixed effects in Models 2 and 3, which explained nearly 38 percent of the unexplained variance in the accuracy of the observations regarding presence of young children, suggest that lower accuracy of the observations was found for households having persons with larger sampling weights, areas with non-English speakers, and areas with higher occupancy rates, while higher accuracy was found for housing units with physical impediments. Unique predictors of reduced accuracy for the observations on young children included interviewer-expressed safety concerns for the sampled area segment and increased household eligibility rates in a sampled area, while unique predictors having a positive relationship with accuracy included being located in a gated community and the interviewer having another job. Notably, overall workload was not related to accuracy in any way, and additional analyses looking at a possible nonlinear relationship of workload with accuracy (not shown) did not change this conclusion. These results are generally consistent with expectations based on the existing literature in this area (e.g., West and Kreuter, 2013).

Table 3.

Estimated Parameters in the Sequence of Five Multilevel Binary Logistic Regression Models for the Accuracy of the Young Children Observations.

| Fixed Effects | Model 1 Estimate (SE) | Model 2 Estimate (SE) | Model 3 Estimate (SE) | Model 4 Estimate (SE) | Model 5 Estimate (SE) |
|---|---|---|---|---|---|
| Intercept | 0.99 (0.05)*** | 1.11 (0.13)*** | 1.32 (0.22)*** | 1.26 (0.28)*** | 1.25 (0.28)*** |
| Housing unit features | | | | | |
| Year measured | | | | | |
| 2012 | | 0.01 (0.04) | 0.01 (0.04) | −0.01 (0.04) | −0.01 (0.04) |
| 2013 | | 0.07 (0.05) | 0.06 (0.05) | 0.03 (0.05) | 0.04 (0.05) |
| Census regions of IWER | | | | | |
| Multiple | | 0.02 (0.14) | 0.01 (0.16) | −0.01 (0.13) | 0.02 (0.13) |
| Northeast only | | −0.27 (0.19) | −0.27 (0.19) | −0.06 (0.18) | −0.03 (0.17) |
| South only | | −0.04 (0.14) | −0.12 (0.14) | −0.20 (0.12)* | −0.20 (0.12)* |
| West only | | −0.23 (0.16) | −0.30 (0.16)* | −0.36 (0.13)*** | −0.36 (0.13)*** |
| Sampling weight^a | | −0.01 (<0.01)*** | −0.01 (<0.01)*** | −0.01 (<0.01)*** | −0.01 (<0.01)*** |
| In gated community | | 0.15 (0.04)*** | 0.15 (0.04)*** | 0.15 (0.04)*** | 0.14 (0.04)*** |
| Estimated # of households in area segment^a | | −0.01 (0.01) | −0.01 (0.01) | −0.01 (0.01) | −0.01 (0.01) |
| In primarily resid. area | | −0.02 (0.03) | −0.02 (0.03) | −0.02 (0.03) | −0.02 (0.03) |
| Evidence of non-English speakers | | −0.24 (0.04)*** | −0.23 (0.04)*** | −0.23 (0.04)*** | −0.22 (0.04)*** |
| Interviewer safety concerns | | −0.15 (0.03)*** | −0.15 (0.03)*** | −0.14 (0.03)*** | −0.15 (0.03)*** |
| In many unit building | | 0.02 (0.03) | 0.02 (0.03) | −0.02 (0.03) | −0.03 (0.03) |
| Physical impediments | | 0.25 (0.05)*** | 0.25 (0.05)*** | 0.25 (0.05)*** | 0.27 (0.05)*** |
| Estimated eligibility rate of area^a | | −1.42 (0.25)*** | −1.39 (0.25)*** | −1.38 (0.25)*** | −1.34 (0.25)*** |
| Occupancy rate of area^a | | −1.61 (0.18)*** | −1.58 (0.18)*** | −1.53 (0.18)*** | −1.52 (0.18)*** |
| Interviewer features | | | | | |
| Interviewer age ≤ 50 | | | −0.19 (0.09)** | −0.10 (0.08) | −0.10 (0.08) |
| Interviewer < college edu. | | | 0.11 (0.09) | 0.20 (0.08)** | 0.18 (0.08)** |
| Interviewer white | | | −0.14 (0.11) | −0.04 (0.10) | −0.04 (0.09) |
| Interviewer married | | | −0.07 (0.09) | −0.02 (0.08) | −0.01 (0.08) |
| Interviewer other job | | | 0.25 (0.09)*** | 0.25 (0.08)*** | 0.25 (0.08)*** |
| Interviewer previous NSFG experience | | | 0.07 (0.11) | 0.04 (0.10) | 0.04 (0.10) |
| Interviewer 1+ children | | | −0.01 (0.11) | −0.02 (0.09) | −0.05 (0.09) |
| Interviewer > 4 years exp. | | | −0.01 (0.11) | 0.01 (0.09) | 0.02 (0.09) |
| PSU assign. | | | | | |
| MIX | | | −0.10 (0.10) | −0.13 (0.09) | −0.11 (0.09) |
| MSA | | | −0.16 (0.16) | −0.04 (0.14) | −0.01 (0.14) |
| Workload^a | | | <0.01 (<0.01) | <0.01 (<0.01) | <0.01 (<0.01) |
| Interviewer cues used | | | | | |
| Housing unit obs. | | | | 0.10 (0.05)* | 0.09 (0.05)* |
| Area obs. | | | | 0.29 (0.05)*** | 0.54 (0.10)*** |
| Toys | | | | −0.30 (0.06)*** | −0.34 (0.10)*** |
| No toys | | | | 0.27 (0.05)*** | 0.29 (0.05)*** |
| Decorations | | | | −0.10 (0.09) | −0.09 (0.09) |
| No decorations | | | | −0.23 (0.19) | −0.85 (0.28)*** |
| Household members | | | | 0.18 (0.08)** | 0.19 (0.08)** |
| Cars | | | | −0.11 (0.07) | −0.11 (0.07) |
| Bikes | | | | −0.20 (0.07)*** | −0.21 (0.07)*** |
| Yard | | | | −0.16 (0.04)*** | −0.15 (0.04)*** |
| Guessing | | | | −0.20 (0.07)*** | −0.24 (0.07)*** |
| Speaking with neighbors | | | | 0.13 (0.27) | 0.16 (0.27) |
| No evidence of kids | | | | 0.37 (0.05)*** | 0.38 (0.05)*** |
| Principal components based on aggregate tendencies^b | | | | | |
| Focus on housing unit features | | | | −0.01 (0.03) | −0.01 (0.03) |
| Specific evidence vs. no evidence | | | | 0.01 (0.03) | 0.01 (0.03) |
| Less guessing | | | | 0.10 (0.04)** | 0.09 (0.04)** |
| Area features | | | | 0.08 (0.03)** | 0.07 (0.03)** |
| Household members and neighbors | | | | 0.03 (0.03) | 0.03 (0.03) |
| Toys and no decorations | | | | −0.03 (0.04) | −0.01 (0.04) |
| Entropy | | | | 0.01 (0.01) | 0.01 (0.01) |
| Interactions | | | | | |
| PSU Assign. × Area Obs. | | | | | |
| MIX × Area Obs. | | | | | −0.35 (0.10)*** |
| MSA × Area Obs. | | | | | −0.08 (0.16) |
| PSU Assign. × Toys | | | | | |
| MIX × Toys | | | | | 0.15 (0.12) |
| MSA × Toys | | | | | −0.63 (0.21)*** |
| PSU Assign. × No Decorations | | | | | |
| MIX × No Decorations | | | | | 1.32 (0.44)*** |
| MSA × No Decorations | | | | | 0.73 (0.59) |
| Variance of random IWER intercepts (% reduction) | 0.144*** | 0.114*** (20.8%) | 0.090*** (21.1%) | 0.052*** (42.2%) | 0.050*** (3.8%) |
| # Obs. | 29,221 | 29,221 | 29,221 | 29,221 | 29,221 |
| # IWERs | 73 | 73 | 73 | 73 | 73 |
| Log likelihood | −16,889.61 | −16,658.61 | −16,651.47 | −16,483.94 | −16,464.06 |
| Change in log likelihood | | 231.00 | 7.14 | 167.53 | 19.88 |

Note. Reference categories: year measured = 2011; Census region worked by interviewer = Midwest only; PSU assignments: rural only (RURAL). IWER = interviewer; PSU = primary sampling unit; MSA = metropolitan statistical area; obs. = observation; assign. = assignment; resid. = residential; exp. = experience.

a Mean centered.

b See the Supplementary Materials for detailed results of the principal components analysis, including interpretation of the components.

*** p < .01. ** p < .05. * p < .10 (Wald tests for fixed effects; likelihood ratio tests for variance components).

The addition of the fixed effects of the cues used in the individual justifications, the principal components capturing general interviewer tendencies, and the entropy measures in Model 4 explained more than 42 percent of the remaining unexplained variance in observation accuracy among interviewers. Holding the predictors in Models 2 and 3 constant, positive relationships with accuracy were found for noting housing unit features, area features, the absence of toys, features of household members, and a general lack of evidence of children. Negative relationships with accuracy were found for indicating toys, bikes, and features of the yard in the justifications and also guessing (supporting our theoretical expectations). In particular, interviewers with lower percentages of justifications referring to guesses and higher percentages of justifications referring to area features tended to be more accurate.
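The percentage reductions in the last variance row of Table 3 can be reproduced from the variance components themselves. The sketch below is illustrative rather than the authors' own syntax: it assumes each percentage is the proportional drop in the variance of the random interviewer intercepts relative to the immediately preceding model, and the function name `pct_reduction` is hypothetical.

```python
def pct_reduction(previous: float, current: float) -> float:
    """Percent of the previous model's interviewer variance removed by the current model."""
    return 100.0 * (previous - current) / previous


# Variance of the random interviewer intercepts for Models 1 to 5, as printed in Table 3.
table3_variances = [0.144, 0.114, 0.090, 0.052, 0.050]

# Compare each model with the one fitted immediately before it.
reductions = [
    round(pct_reduction(prev, curr), 1)
    for prev, curr in zip(table3_variances, table3_variances[1:])
]

for model, value in zip(range(2, 6), reductions):
    print(f"Model {model}: {value}% reduction relative to Model {model - 1}")
```

Under this assumption, the Model 4 entry matches the reduction of just over 42 percent discussed above.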

Three interactions were identified as significant in Model 5. These results suggested that (1) the positive relationship of noting area features with observation accuracy for rural areas was substantially reduced for interviewers in the MIX group; (2) the negative relationship of noting toys with accuracy in rural areas became significantly more negative in the MSA group, where noting toys in the justification would be expected to reduce the odds of an accurate observation by a factor of exp(−0.34 − 0.63) ≈ 0.38, or about 62 percent (using the Model 5 estimates in Table 3); and (3) the negative relationship of noting an absence of decorations with accuracy in rural areas became positive for interviewers working in the MIX group.
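As a worked illustration of the odds calculation for the MSA group, the sketch below combines the Model 5 estimates as printed in Table 3 (−0.34 for the Toys cue and −0.63 for MSA × Toys) and converts the sum to an odds ratio. The variable names are hypothetical, and the calculation uses rounded point estimates only, ignoring their standard errors.

```python
import math

# Model 5 estimates as printed in Table 3 (logit scale; RURAL is the reference group).
toys_rural = -0.34    # main effect of noting toys
msa_by_toys = -0.63   # MSA x Toys interaction

# Implied log-odds effect of noting toys for interviewers in the MSA group.
log_odds_msa = toys_rural + msa_by_toys

# Exponentiating gives the multiplicative change in the odds of an accurate observation.
odds_ratio_msa = math.exp(log_odds_msa)
pct_change = 100.0 * (odds_ratio_msa - 1.0)

print(f"odds ratio: {odds_ratio_msa:.2f}; change in odds: {pct_change:.0f}%")
```

The same two-step pattern (sum the main effect and the interaction, then exponentiate) applies to the other interactions in Model 5.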

Multilevel Models: Accuracy of Response Propensity Estimates

The results in Table 4 indicate once again that the different cues used by the interviewers and their general tendencies explained a large percentage of the variance in the accuracy of the response propensity estimates that was not accounted for by fixed effects of the household-, area-, and interviewer-level covariates. The results in the first three models are once again largely consistent with the existing literature, suggesting that (1) interviewers improve significantly over time; (2) estimates are less accurate for households with sampled individuals having a higher sampling weight (generally White individuals in areas with lower population density and less diversity); (3) location in a many-unit building, evidence of non-English speakers, and high occupancy rates are generally associated with reduced accuracy, while evidence of physical impediments to the housing unit is associated with increased accuracy; and (4) interviewers with more experience tend to produce less accurate estimates. The variance in accuracy rates by type of PSU assignment is also apparent, with interviewers in the MSA group tending to have the highest accuracy. Overall workload was once again found to have no relationship with the accuracy of the response propensity estimates when adjusting for the other covariates.

Table 4.

Estimated Parameters in the Sequence of Five Multilevel Ordinal Logistic Regression Models for the Accuracy of the Response Propensity Estimates.

Fixed Effects   Model 1 Estimate (SE)   Model 2 Estimate (SE)   Model 3 Estimate (SE)   Model 4 Estimate (SE)   Model 5 Estimate (SE)
Intercepts
 Cut point 1 −2.24 (0.09)*** −1.96 (0.22)*** −1.97 (0.35)*** −1.96 (0.55)*** −1.89 (0.56)***
 Cut point 2   0.11 (0.08)   0.44 (0.22)**   0.43 (0.35)   0.67 (0.55)   0.74 (0.56)
Housing unit features
Year measured
 2012   0.47 (0.07)***   0.47 (0.07)***   0.44 (0.08)***   0.46 (0.08)***
 2013   0.38 (0.08)***   0.38 (0.08)***   0.52 (0.08)***   0.55 (0.08)***
Census regions of IWER
 Multiple   0.11 (0.24)   0.09 (0.27) −0.14 (0.22) −0.11 (0.22)
 Northeast only −0.41 (0.35) −0.54 (0.33) −0.68 (0.26)*** −0.65 (0.26)**
 South only −0.14 (0.25) −0.12 (0.24) −0.13 (0.20) −0.06 (0.20)
 West only −0.14 (0.28)   0.03 (0.27) −0.01 (0.22)   0.03 (0.22)
Sampling weighta −0.01 (<0.01)*** −0.01 (<0.01)*** −0.01 (<0.01)*** −0.01 (<0.01)***
In gated community   0.14 (0.06)**   0.14 (0.06)**   0.08 (0.06)   0.09 (0.06)
Estimated # of households in area segmenta <0.01 (0.01) <0.01 (0.01) <0.01 (0.01) <0.01 (0.01)
In primarily resid. area   0.01 (0.05)   0.01 (0.05)   0.01 (0.05)   0.02 (0.05)
Evidence of non-English speakers −0.19 (0.06)*** −0.20 (0.06)*** −0.21 (0.06)*** −0.21 (0.07)***
Interviewer safety concerns −0.09 (0.06) −0.09 (0.06) −0.15 (0.06)** −0.14 (0.06)**
In many unit building −0.18 (0.05)*** −0.18 (0.05)*** −0.09 (0.06)* −0.10 (0.06)*
Physical impediments   0.52 (0.07)***   0.52 (0.07)***   0.18 (0.08)**   0.17 (0.08)**
Estimated eligibility rate of areaa −0.21 (0.39) −0.21 (0.39)   0.39 (0.40)   0.20 (0.41)
Occupancy rate of areaa −0.74 (0.27)*** −0.72 (0.27)*** −0.68 (0.28)** −0.67 (0.28)**
Interviewer features
 Interviewer age ≤ 50 −0.08 (0.15)   0.10 (0.12)   0.10 (0.12)
 Interviewer < college edu.   0.14 (0.15)   0.14 (0.13)   0.13 (0.13)
 Interviewer white −0.16 (0.18) −0.06 (0.14) <0.01 (0.14)
 Interviewer married −0.24 (0.15) −0.18 (0.12) −0.18 (0.12)
 Interviewer other job −0.11 (0.17) −0.06 (0.13) −0.08 (0.13)
 Interviewer prev. NSFG experience −0.26 (0.21) −0.15 (0.16) −0.18 (0.17)
 Interviewer 1+ children   0.13 (0.18)   0.48 (0.15)***   0.49 (0.16)***
 Interviewer > 4 years exp. −0.10 (0.19) −0.32 (0.16)* −0.32 (0.16)*
PSU assignments
 MIX   0.40 (0.19)**   0.37 (0.16)**   0.37 (0.16)**
 MSA   0.51 (0.27)*   0.55 (0.24)**   0.53 (0.25)**
Workloada −0.01 (<0.01) −0.01 (<0.01) −0.01 (<0.01)
Interviewer cues used
 No contact −0.04 (0.06) −0.02 (0.06)
 Interest of R   1.17 (0.15)***   1.19 (0.15)***
 Refusal/res.   1.26 (0.08)***   1.51 (0.13)***
 No information   0.90 (0.31)***   0.89 (0.31)***
 Introduction −0.94 (0.16)*** −0.93 (0.16)***
 R concerns   0.53 (0.22)**   0.54 (0.22)**
 Housing unit obs.   0.62 (0.08)***   0.68 (0.09)***
 Cars   0.05 (0.21)   0.02 (0.22)
 Lawn −0.03 (0.53)   2.46 (1.21)**
 R personality −0.02 (0.11) −0.04 (0.11)
 R demographic   0.31 (0.08)***   0.31 (0.08)***
 Access problems   0.91 (0.11)***   0.32 (0.17)*
 Communication problems −0.30 (0.15)** −0.27 (0.15)*
 Timing −0.84 (0.07)*** −0.84 (0.07)***
 Household members   0.01 (0.11) −0.14 (0.18)
 Speaking with neighbors   0.34 (0.15)**   0.30 (0.15)**
 Area obs.   0.48 (0.11)***   0.49 (0.11)***
Principal components based on aggregate tendenciesb
 Focus on area features/access vs. noncontact   0.13 (0.05)**   0.14 (0.05)***
 Focus on housing unit features −0.06 (0.05) −0.06 (0.06)
 Focus on sociodemographic info   0.05 (0.06)   0.04 (0.06)
 Focus on R statements −0.04 (0.04) −0.05 (0.04)
 Focus on R availability   0.18 (0.06)***   0.18 (0.06)***
 Focus on timing vs. noncontact −0.15 (0.06)** −0.14 (0.06)**
Entropy −0.01 (0.01) −0.01 (0.01)
Interactions
 PSU Assign. × Refusal/Res.
 MIX × Refusal/Res. −0.38 (0.15)**
 MSA × Refusal/Res.     0.01 (0.26)
 PSU Assign. × Lawn
 MIX × Lawn −3.83 (1.42)***
 MSA × Lawn −2.91 (1.63)*
 PSU Assign. × Access Prob.
 MIX × Access Prob.   0.62 (0.20)***
 MSA × Access Prob.   1.93 (0.36)***
 PSU Assign. × HH Members
 MIX × HH Members   0.42 (0.22)*
 MSA × HH Members −0.44 (0.32)
Var. of random IWER intercepts (% reduction)   0.401***   0.352*** (12.2%)   0.270*** (23.3%)   0.127*** (53.0%)   0.129*** (N/A)
# Obs.   8,186   8,186   8,186   8,186   8,186
# IWERs      68      68      68      68      68
Log likelihood −7,652.58 −7,548.98 −7,541.77 −6,969.07 −6,938.64
Change in log likelihood 103.60     7.21 572.70   30.43

Note. Reference categories: year measured = 2011; Census region worked by interviewer = Midwest only; PSU assignments: rural only (RURAL). IWER = interviewer; PSU = primary sampling unit; MSA = metropolitan statistical area; obs. = observation; assign. = assignment; res. = resistance; edu. = education; resid. = residential; exp. = experience; var. = variance; prob. = problem.

a Mean centered.

b See the Supplementary Materials for detailed results of the principal components analysis.

*** p < .01. ** p < .05. * p < .10 (Wald tests for fixed effects; likelihood ratio tests for variance components).

The fixed effects in Models 2 and 3 did explain a large percentage (about 31 percent) of the unexplained interviewer variance in the accuracy of the response propensity estimates, but a large interviewer variance component remained after fitting Model 3. The addition of fixed effects of the indicators of individual cues used in the justifications and the aggregate interviewer tendencies in Model 4 explained 53 percent of the remaining variance in estimate accuracy. Specifically, when controlling for all of the covariates included in the first three models, significant increases in estimate accuracy were found if justifications made reference to respondent interest, evidence of refusals/resistance, respondent unwillingness to provide screening information, respondent concerns, housing unit features, respondent demographics, access problems, speaking with neighbors, and area features. Significant reductions in the probability of an accurate estimate were associated with making an introduction (an interviewer may think that an introduction went well, but this doesn’t guarantee a completed interview), communication problems, making note of timing issues (i.e., basing an estimate on good or bad times for a given housing unit), and increased entropy or increased diversity in the justifications used (an unexpected finding, given our theoretical expectations). These results point to the importance of noting specific statements if making contact and also being aware of specific housing unit and area features (which is mostly consistent with our theoretical expectations).

Finally, four significant interactions were identified in the fifth model. First, the large and significant positive relationship of noting refusals or resistance with estimate accuracy in rural areas was significantly reduced (but still positive) for interviewers working in the MIX group. Second, the large positive relationship of noting features of the lawn in rural areas was significantly reduced for interviewers working in the MIX and MSA groups, suggesting that this is not an effective cue to use in these areas (supporting our theoretical expectations). Third, noting access problems has a much stronger positive relationship with accuracy for interviewers working in the MIX and MSA groups. Fourth, noting features of persons in the housing unit has a stronger positive relationship with accuracy for interviewers in the MIX group, but a much more negative relationship in the largest MSAs (suggesting that this would not be worthwhile in these areas, once again supporting our theoretical expectations).
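The group-specific relationships described in this paragraph follow from adding each interaction estimate to the corresponding main effect. The sketch below does this with the Model 5 point estimates printed in Table 4. The dictionary layout and the helper `cue_effect` are hypothetical, and the sums ignore the standard errors and covariances of the estimates, so they indicate direction and rough magnitude only.

```python
# Model 5 estimates from Table 4 (cumulative logit scale; RURAL is the reference group).
model5 = {
    "Refusal/res.":      {"main": 1.51,  "MIX": -0.38, "MSA": 0.01},
    "Lawn":              {"main": 2.46,  "MIX": -3.83, "MSA": -2.91},
    "Access problems":   {"main": 0.32,  "MIX": 0.62,  "MSA": 1.93},
    "Household members": {"main": -0.14, "MIX": 0.42,  "MSA": -0.44},
}


def cue_effect(cue: str, group: str) -> float:
    """Implied coefficient for a cue within a PSU assignment group."""
    terms = model5[cue]
    # RURAL has no interaction term, so only the main effect applies there.
    interaction = terms[group] if group in ("MIX", "MSA") else 0.0
    return round(terms["main"] + interaction, 2)


for cue in model5:
    implied = {group: cue_effect(cue, group) for group in ("RURAL", "MIX", "MSA")}
    print(cue, implied)
```

For example, the implied coefficient for noting refusals or resistance in the MIX group is 1.51 − 0.38 = 1.13, which is smaller than the rural estimate but still positive, as stated above.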

Discussion

Summary of Findings

Addressing our first research question, we found that interviewers in the NSFG who were assigned to similar types of areas (e.g., the RURAL group) do in fact vary substantially in terms of their general tendencies to look for particular cues when observing whether young children are present in a household and estimating the likelihood that a household will respond to the survey request. While there was some evidence of particular interviewers increasing or decreasing over time in terms of the percentage of their justifications referring to particular cues, most of the evidence pointed to consistent variability between interviewers working in the same type of geographic area in how often particular cues were used across the NSFG data collection quarters. Addressing our second research question, we found that the individual cues used by interviewers at particular housing units and aggregate measures of the interviewer tendencies explained substantial portions of the previously unexplained interviewer variance in the accuracy of these types of observations. For both observations analyzed in this study, the results suggested that there were several significant relationships between the use of particular cues and observation accuracy, and these results have direct implications for practice and future interviewer training on the observational process.

Implications for Practice and Interviewer Training

Interviewer training sessions for future surveys planning to record these types of observations could use the results from this study to introduce a more standardized approach to recording these types of observations, with the goal being to stabilize accuracy rates across the interviewers. First, if interviewers in a given survey are asked to estimate the probability that a given household will respond to a survey request using information that they have obtained on the housing units during the field period, the results of this study suggest that the following approaches may increase the accuracy of these estimates:

  • If making contact, try to detect evidence of an unwillingness to provide information, general refusals or resistance, interest in the survey topic, household member demographics, and concerns about the survey. Notably, interviewers working in both urban and rural areas were found to use interest in the survey and refusals as cues at a lower rate over time, which may not be ideal for maintaining the accuracy of these estimates. Interviewers working exclusively in the largest MSAs had increasing rates in the use of refusals as a cue over time, likely contributing to their increased accuracy overall.

  • If unable to make contact, try to detect features of the housing unit (e.g., signs saying “Beware of Dog”), features of the area (are people friendly, are lights on, are doors open, etc.), and evidence of housing unit access problems (especially in MSAs) and try speaking with neighbors. Interviewers working exclusively in the largest MSAs also increased the rate at which they spoke with neighbors over time, which appears to be beneficial; whether this was in response to noting more accurate estimates of response propensity when speaking with neighbors could not be determined, given the observational nature of this study.

  • Avoid thinking that a successful introduction of the survey will affect the probability of cooperation, or inferring that communication issues (e.g., foreign languages) will necessarily impact the likelihood of cooperation; most importantly, do not base the estimates on timing issues for a given housing unit.

  • Excessive diversity in the justifications used, somewhat contrary to theoretical expectations, may reduce the expected accuracy of response propensity estimates; as was suggested in the Theoretical Expectations section earlier, excessive diversity without a focus on the most relevant cues may be counterproductive.

Second, if interviewers are asked to observe whether young children are present in a given household, these approaches may prove beneficial for increasing accuracy:

  • Take note of housing unit features (e.g., a child-friendly environment), area features (e.g., is a playground nearby?), the absence of toys, and features of individuals within a housing unit;

  • If no cues in particular stand out, a general lack of evidence of children being present may also be an effective cue. Interviewers working exclusively in rural areas used this cue at an increasing rate over time, which may have contributed to their relatively higher accuracy;

  • Avoid guessing; guessing in individual justifications and the interviewer-level percentage of justifications indicating guessing were both found to have a negative relationship with observation accuracy; and

  • Do not focus exclusively on toys, bikes, or features of the yard (e.g., basketball hoops); while these would seem to be logical cues to look for, noticing these features had a negative relationship with observation accuracy.

Study Limitations

The findings in this study are specific to one national survey data collection in the United States (the NSFG). We determined the cues that interviewers were using based on very time-intensive coding of open-ended justifications that interviewers were required to provide for their observations. To understand whether interviewers in other surveys are also using different cues (and evaluate their accuracy), we believe that survey managers could use a “checklist” idea. This would involve working with interviewers to develop a list of possible cues that they might use when recording different types of observations and then having the interviewers check off the cues that they used to record a key observation while in the field. These data could easily be stored as binary indicators of picking up on certain cues and then analyzed using the same approaches employed in this study (without the need for time-intensive coding).
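A minimal sketch of how checklist data of this kind might be stored and summarized is given below. Everything in it is hypothetical: the cue names, the interviewer identifiers, and the helper functions are invented for illustration and do not come from the NSFG. It shows each observation stored as binary cue indicators and then aggregated to interviewer-level percentages of the kind used as general tendencies in this study.

```python
from collections import defaultdict

# Hypothetical checklist items agreed on with interviewers.
CUES = ["toys", "no_toys", "area_features", "guessing"]

# Hypothetical field records: (interviewer id, cues checked for one observation).
records = [
    ("iwer_01", {"toys"}),
    ("iwer_01", {"area_features", "no_toys"}),
    ("iwer_02", {"guessing"}),
    ("iwer_02", {"guessing", "toys"}),
]


def to_indicators(checked):
    """Convert a set of checked cues into one 0/1 indicator per checklist item."""
    return {cue: int(cue in checked) for cue in CUES}


def cue_percentages(rows):
    """Percentage of each interviewer's observations in which each cue was checked."""
    counts = defaultdict(lambda: dict.fromkeys(CUES, 0))
    totals = defaultdict(int)
    for iwer, indicators in rows:
        totals[iwer] += 1
        for cue, flag in indicators.items():
            counts[iwer][cue] += flag
    return {
        iwer: {cue: 100.0 * counts[iwer][cue] / totals[iwer] for cue in CUES}
        for iwer in totals
    }


rows = [(iwer, to_indicators(checked)) for iwer, checked in records]
percentages = cue_percentages(rows)
print(percentages["iwer_02"])
```

The observation-level indicators would serve as the cue covariates, and the interviewer-level percentages as inputs to aggregate measures, without any open-ended coding step.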

More generally, as we describe in our overview of the research design used in the Supplementary Materials, this study suffered from the same limitations as all other observational studies. Most importantly, the cues used by the interviewers may have been the only cues available at the particular housing units assigned to them and not the cues that they would have looked for in general in the absence of any type of standardized training. That is, interviewer variance in observation accuracy may have been arising from specific features of the housing units being worked, rather than the use of different cues by different interviewers. We attempted to control for as many housing unit features as possible in the NSFG data, given this potential confound of cues used and features of the housing units, in an effort to isolate the relationships of the cues used with observation accuracy. We also attempted to examine whether interviewers who used a variety of cues depending on housing unit context would ultimately produce more accurate observations, but our interviewer-level measure of ShE was not found to have a positive relationship with accuracy in the models.
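For readers unfamiliar with entropy as a diversity measure, the sketch below computes the standard Shannon entropy, H = −Σ p_i ln p_i, over the proportions of an interviewer's cue mentions. This is an assumption about the general form of the measure; the exact operationalization of ShE used in the models is defined earlier in the article and may differ (e.g., in the logarithm base or in how multiple cues per justification are counted). The function name and the example cue lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(cue_mentions):
    """Shannon entropy (natural log) of the distribution of cues an interviewer mentions."""
    counts = Counter(cue_mentions)
    total = sum(counts.values())
    # Sum p * ln(1/p) over observed cues; equivalent to -sum(p * ln p).
    return sum((n / total) * math.log(total / n) for n in counts.values())


# Hypothetical interviewers: one relies on a single cue, one spreads evenly over four.
single_cue = ["toys"] * 8
even_mix = ["toys", "yard", "bikes", "area_features"] * 2

print(shannon_entropy(single_cue))
print(shannon_entropy(even_mix))
```

An interviewer who always cites the same cue has zero entropy, while one who spreads justifications evenly across k cues reaches the maximum of ln(k).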

Given these limitations, a reasonable alternative study design would be to ask interviewers about the cues that they look for outside of context, in an in-depth in-person interview, and then compare the accuracy rates of interviewers claiming to use different cues in these interviews (e.g., West, Kreuter, and Trappmann 2014). As a supplement to the present study, we considered this approach with 21 volunteer NSFG interviewers working in November 2016. These interviewers were asked by their main supervisor over e-mail what cues they generally look for when attempting to determine whether young children were present (no additional probing of the responses provided was performed, given limited resources). These open-ended responses (which again would not entirely reflect the variety in cues used depending on context) were then linked with accuracy rates for observations on presence of young children for those same 21 interviewers, from all quarters in which they had worked since 2011. The resulting de-identified data can be found in the Supplementary Materials.

We once again found evidence of variability among the interviewers in terms of the types of cues used, and some cues were used more often than others. These data show that the more accurate interviewers tended to look for housing unit features, area features, decorations, an absence of toys, features of household members, and a general lack of visible evidence of children, which (outside of decorations) is largely consistent with the results of the larger observational study described above (see Table 3). Notably, only one interviewer mentioned guessing, and this interviewer was found to have one of the lower accuracy rates. Guessing was much more prevalent in the larger field study, and this finding suggests that what interviewers say that they do (in this type of “context-free” design) and what they actually do (given context) may not always be consistent (Jerolmack and Khan 2014).

Directions for Future Research

First, this study did find evidence of a small amount of remaining unexplained variance in observation accuracy among interviewers, even after accounting for a wide variety of covariates. This could be due to outlier interviewers or possibly other covariates that were not measured in the NSFG. Like all studies of interviewer effects, future studies in this area need to continue identifying explanations for these effects, and monitoring interviewers found to have extremely high or low accuracy rates.

Second, the results of this study could be used to motivate future experimental studies making use of vignettes during interviewer training. Once the cues having the strongest relationships with the accuracy of a particular observation are identified (possibly using the aforementioned checklist idea or the findings from this study), future interviewer training could emphasize these cues with a random subsample of interviewers in an effort to evaluate this approach to standardizing the field observation process and possibly increasing observation quality (e.g., Caughy, O’Campo, and Patterson 2001; Zenk et al. 2007). Interviewers could be randomized to either receive training in the most important cues to look for or use their best judgments (the “control” group, representing current practice). Managers could then present the two groups of interviewers (independently) with alternative vignettes, presenting hypothetical housing units/environments (possibly using photographic slideshows), and ask the interviewers in each group to record observations about the housing units using the approaches in which they had been instructed (e.g., Babbie 2001; Dahlhamer 2012). Given “correct” values of the observation for each vignette (e.g., children were present at that housing unit), the accuracy rates of the two groups of interviewers could be compared. Notably, this approach would hold constant the housing units being observed by the interviewers and allow multiple interviewers to indicate what cues they used for the same housing unit. This would enable assessment of interviewer variance in the cues used, and whether use of more relevant cues would increase observation accuracy. This type of approach (sending multiple interviewers to the same randomly sampled housing unit and asking them to record observations on that unit) would be cost-prohibitive in a real national face-to-face data collection such as the NSFG.
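The vignette experiment outlined above could be organized along the following lines. This is a sketch under stated assumptions, not a proposed analysis plan: the interviewer identifiers, vignette labels, recorded observations, and function names are all hypothetical, and a real study would add a formal test of the difference between groups.

```python
import random
from statistics import mean


def assign_groups(interviewer_ids, seed=2024):
    """Randomly split interviewers into a trained group and a control group."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(interviewer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"trained": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}


def accuracy_rate(recorded, truth):
    """Share of vignettes where the recorded observation matches the known value."""
    return sum(recorded[v] == truth[v] for v in truth) / len(truth)


# Hypothetical "correct" values: were children present at the pictured housing unit?
truth = {"v1": True, "v2": False, "v3": True, "v4": False}

# Hypothetical observations recorded by four interviewers for the same vignettes.
recorded = {
    "iwer_a": {"v1": True, "v2": False, "v3": True, "v4": False},
    "iwer_b": {"v1": True, "v2": True, "v3": True, "v4": False},
    "iwer_c": {"v1": False, "v2": True, "v3": True, "v4": False},
    "iwer_d": {"v1": True, "v2": False, "v3": False, "v4": False},
}

groups = assign_groups(recorded)
rates = {iwer: accuracy_rate(obs, truth) for iwer, obs in recorded.items()}
group_means = {g: mean(rates[i] for i in members) for g, members in groups.items()}
print(groups, group_means)
```

Because every interviewer rates the same vignettes, differences in the group means cannot be attributed to differences in the housing units observed, which is the key advantage of this design.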

Similar training methods have been used in the ESS (Stähli 2010) and the LA-FANS study (Casas-Cordero et al. 2013), but these methods have not been evaluated experimentally based on prior empirical evidence of the type presented in this study. Enhanced evidence-based training procedures that build on cues found to have the strongest associations with observation accuracy may represent an improvement over existing training procedures that have still resulted in error-prone observations (Casas-Cordero et al. 2013). Existing literature on sources of observer bias suggests that disciplined training, rigorous preparation, and continuous feedback on strategies for making effective observations in a variety of different scenarios may prove effective for eliminating the selective perception that is natural for human observers and standardizing the observational process, thus increasing the accuracy and objectivity of the observations (Bernard 1945; Harris et al. 1997; Johnson and Sackett 1998; Kazdin 1977; Manderson and Aaby 1992b; McCall 1984; Patton 1999; Petticrew et al. 2007; Repp et al. 1998; Seidler 1974; Simmons et al. 2002). We feel that this is a worthwhile direction for continued work in this area.

Acknowledgments

We acknowledge the outstanding assistance of Ka’Marr Coleman-Byrd, Yimeng Ma, Tyler Sloman, and Yi Wang with coding and analysis for this research. We also acknowledge Dr. Nora Cate Schaeffer for her helpful comments, discussions, and guidance related to this work. Finally, we would like to thank the National Center for Health Statistics and Mick Couper for allowing us to collect and analyze these data as part of the NSFG. The NSFG is conducted by the Centers for Disease Control and Prevention’s (CDC) National Center for Health Statistics (NCHS), under contract # 200-2010-33976 with University of Michigan’s Institute for Social Research with funding from several agencies of the U.S. Department of Health and Human Services, including CDC/NCHS, the National Institute of Child Health and Human Development (NICHD), the Office of Population Affairs (OPA), and others listed on the NSFG webpage (see http://www.cdc.gov/nchs/nsfg/).

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this research was provided by NIH Grant #1-R03-HD-075979-01-A1.

Biographies

Brady T. West is a research associate professor in the Survey Methodology Program, located within the Survey Research Center at the Institute for Social Research on the University of Michigan–Ann Arbor (U-M) campus. His current research interests include the implications of measurement error in auxiliary variables and survey paradata for survey estimation, survey nonresponse, interviewer variance, and multilevel regression models for clustered and longitudinal data. He is the lead author of a book comparing different statistical software packages in terms of their mixed-effects modeling procedures (Linear Mixed Models: A Practical Guide using Statistical Software, Second Edition, Chapman Hall/CRC Press, 2014), and he is a co-author of a second book entitled Applied Survey Data Analysis, Second Edition (with Steven Heeringa and Pat Berglund), which was published by Chapman Hall in June 2017. Recent publications include West, B. T. and A. G. Blom. 2017. Explaining Interviewer Effects: A Research Synthesis. Journal of Survey Statistics and Methodology 5:175–211; West, B. T., J. W. Sakshaug, and G. A. S. Aurelien. 2016. How Big of a Problem Is Analytic Error in Secondary Analyses of Survey Data? PLOS One 11: e0158120; West, B. T., L. Beer, W. Gremel, J. Weiser, C. Johnson, S. Garg, and J. Skarbinski. 2015. Weighted Multilevel Models: A Case Study. American Journal of Public Health 105:2214–5; West, B. T., D. Ghimire, and W. G. Axinn. 2015. Evaluating a Modular Design Approach to Collecting Survey Data using Text Messages. Survey Research Methods 9:111–23; and West, B. T., J. Wagner, H. Gu, and F. Hubbard. 2015. The Utility of Alternative Commercial Data Sources for Survey Operations and Estimation: Evidence from the National Survey of Family Growth. Journal of Survey Statistics and Methodology 3:240–64.

Dan Li is a data scientist with ALG in Los Angeles, California. She graduated from the University of Michigan Program in Survey Methodology with an MS in Survey Methodology in 2014 and primarily worked on this study with Dr. West during her time as a graduate student. She has research interests in machine learning and survey paradata and has no recent publications to report.

Footnotes

Author’s Note

All data and syntax files used for the analyses presented in this study are available online (in Excel, Stata, and SAS formats), enabling immediate replication of the study results. The syntax files include explicit comments indicating which data files should be opened for each analysis, and a summary file accompanies the data and syntax files. The views expressed here do not represent those of NCHS or the other funding agencies.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplementary material for this article is available online.

References

  1. Ambady Nalini, Hallahan Mark, Conner Brett. Accuracy of Judgments of Sexual Orientation from Thin Slices of Behavior. Journal of Personality and Social Psychology. 1999;77:538–47. doi: 10.1037//0022-3514.77.3.538. [DOI] [PubMed] [Google Scholar]
  2. Ambady Nalini, Hallahan Mark, Rosenthal Robert. On Judging and Being Judged Accurately in Zero-Acquaintance Situations. Journal of Personality and Social Psychology. 1995;69:518–29. [Google Scholar]
  3. Babbie Earl R. The Practice of Social Research. 9th. Belmont, CA: Wadsworth/Thomson Learning; 2001. [Google Scholar]
  4. Bader Michael DM, Ailshire Jennifer A, Morenoff Jeffrey D, House James S. Measurement of the Local Food Environment: A Comparison of Existing Data Sources. American Journal of Epidemiology. 2010;171:609–17. doi: 10.1093/aje/kwp419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bernard Jessie. Observation and Generalization in Cultural Anthropology. Journal of Sociology. 1945;50:284–91. [Google Scholar]
  6. Casas-Cordero Carolina, Kreuter Frauke, Wang Yueyan, Babey Susan. Assessing the Measurement Error Properties of Interviewer Observations of Neighbourhood Characteristics. Journal of the Royal Statistical Society (Series A) 2013;176:227–50. doi: 10.1111/j.1467-985X.2012.01065.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Caughy Margaret O, O’Campo Patricia J, Patterson Jacqueline. A Brief Observational Measure for Urban Neighborhoods. Health & Place. 2001;7:225–36. doi: 10.1016/s1353-8292(01)00012-0. [DOI] [PubMed] [Google Scholar]
  8. Cohen MP. Determining Sample Sizes for Surveys with Data Analyzed by Hierarchical Linear Models. Journal of Official Statistics. 1998;14:267–75. [Google Scholar]
  9. Dahlhamer James. New Observation Questions; Presented at the 2013 NHIS Centralized Refresher Training and Conference; December 3–6, 2012; Hyattsville, MD. 2012. [Google Scholar]
  10. Das T Hari. Qualitative Research in Organizational Behaviour. Journal of Management Studies. 1983;20:301–14. [Google Scholar]
  11. Eckman Stephanie, Sinibaldi Jennifer, Möntmann-Hertz Aleksa. Can Interviewers Effectively Rate the Likelihood of Cases to Cooperate? Public Opinion Quarterly. 2013;77:561–73. [Google Scholar]
  12. Feldman Jacob J, Hyman Herbert, Hart Cecil W. A Field Study of Interviewer Effects on the Quality of Survey Data. Public Opinion Quarterly. 1951;15:734–61. [Google Scholar]
  13. Funder David C. Errors and Mistakes: Evaluating the Accuracy of Social Judgment. Psychological Bulletin. 1987;101:75–90. [PubMed] [Google Scholar]
  14. Funder David C. On the Accuracy of Personality Judgment: A Realistic Approach. Psychological Review. 1995;102:652–70. doi: 10.1037/0033-295x.102.4.652. [DOI] [PubMed] [Google Scholar]
  15. Graham Robert J. Anthropology and O.R.: The Place of Observation in Management Science Process. The Journal of the Operational Research Society. 1984;35:527–36. [Google Scholar]
  16. Groves Robert M, Heeringa Steven G. Responsive Design for Household Surveys: Tools for Actively Controlling Survey Errors and Costs. Journal of the Royal Statistical Society (Series A) 2006;169:439–57. [Google Scholar]
  17. Harris Karij J, Jerome Norge W, Fawcett Stephen B. Rapid Assessment Procedures: A Review and Critique. Human Organization. 1997;56:375–8. [Google Scholar]
  18. Hill Mark O. Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology. 1973;54:427–32. [Google Scholar]
  19. Hosmer David W, Lemeshow Stanley, Sturdivant Rodney X. Applied Logistic Regression. 3rd. New York, NY: John Wiley; 2013. [Google Scholar]
  20. Jerolmack Colin, Khan Shamus. Talk Is Cheap: Ethnography and the Attitudinal Fallacy. Sociological Methods and Research. 2014;43:178–209. [Google Scholar]
  21. Johnson Allen, Sackett Ross. Direct Systematic Observation of Behavior. In: Bernard HR, editor. Handbook of Methods in Cultural Anthropology. Beverly Hills, CA: AltaMira Press; 1998. pp. 301–32. chapter 9. [Google Scholar]
  22. Jones Edward E, Riggs Janet M, Quattrone George. Observer Bias in the Attitude Attribution Paradigm: Effect of Time and Information Order. Journal of Personality and Social Psychology. 1979;37:1230–8. [Google Scholar]
  23. Jones Malia, Pebley Anne R, Sastry Narayan. Eyes on the Block: Measuring Urban Physical Disorder through In-Person Observation. Social Science Research. 2011;40:523–37. doi: 10.1016/j.ssresearch.2010.11.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Kazdin Alan E. Artifact, Bias, and Complexity of Assessment: The ABCs of Reliability. Journal of Applied Behavior Analysis. 1977;10:141–50. doi: 10.1901/jaba.1977.10-141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Kim Yoonsang, Choi Young-Ku, Emery Sherry. Logistic Regression with Multiple Random Effects: A Simulation Study of Estimation Methods and Statistical Packages. The American Statistician. 2013;67:171–82. doi: 10.1080/00031305.2013.817357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Kreuter Frauke, editor. Improving Surveys with Paradata: Analytic Uses of Process Information. Hoboken, NJ: Wiley; 2013. [Google Scholar]
  27. Kreuter Frauke, Olson Kristen M, Wagner James, Yan Ting, Ezzati-Rice Trena M, Casas-Cordero Carolina, Lemay Michael, Peytchev Andy, Groves Robert M, Raghunathan Trivellore E. Using Proxy Measures and Other Correlates of Survey Outcomes to Adjust for Nonresponse: Examples from Multiple Surveys. Journal of the Royal Statistical Society (Series A). 2010;173:389–407. [Google Scholar]
  28. Landis J Richard, Koch Gary G. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977;33:159–74. [PubMed] [Google Scholar]
  29. Lepkowski James M, Mosher William D, Groves Robert M, West Brady T, Wagner James, Gu Hayou. Responsive Design, Weighting, and Variance Estimation in the 2006–2010 National Survey of Family Growth. National Center for Health Statistics, Vital and Health Statistics 2. 2013. [PubMed] [Google Scholar]
  30. Manderson Lenore, Aaby Peter. An Epidemic in the Field? Rapid Assessment Procedures and Health Research. Social Science & Medicine. 1992a;35:839–50. doi: 10.1016/0277-9536(92)90098-b. [DOI] [PubMed] [Google Scholar]
  31. Manderson Lenore, Aaby Peter. Can Rapid Anthropological Procedures Be Applied to Tropical Diseases? Health Policy and Planning. 1992b;7:46–55. [Google Scholar]
  32. McCall George J. Systematic Field Observation. Annual Review of Sociology. 1984;10:263–82. [Google Scholar]
  33. Millen David R. Rapid Ethnography: Time Deepening Strategies for HCI Field Research; Proceedings on DIS00: Designing Interactive Systems: Processes, Practices, Methods, and Techniques; August 17–19, 2000; Brooklyn, NY. 2000. pp. 280–6. [Google Scholar]
  34. Most Steven B, Scholl Brian J, Clifford Erin R, Simons Daniel J. What You See Is What You Get: Sustained Inattentional Blindness and the Capture of Awareness. Psychological Review. 2005;112:217–42. doi: 10.1037/0033-295X.112.1.217. [DOI] [PubMed] [Google Scholar]
  35. Patterson Miles L, Foster Jeffrey L, Bellmer Craig D. Another Look at Accuracy and Confidence in Social Judgments. Journal of Nonverbal Behavior. 2001;25:207–19. [Google Scholar]
  36. Patterson Miles L, Stockbridge Erica. Effects of Cognitive Demand and Judgment Strategy on Person Perception Accuracy. Journal of Nonverbal Behavior. 1998;22:253–63. [Google Scholar]
  37. Patton Michael Q. Enhancing the Quality and Credibility of Qualitative Analysis. Health Services Research. 1999;34:1189–208. [PMC free article] [PubMed] [Google Scholar]
  38. Petticrew Mark, Semple Sean, Hilton Shona, Creely Karen S, Eadie Douglas, Ritchie Deborah, Ferrell Catherine, Christopher Yvette, Hurley Fintan. Covert Observation in Practice: Lessons from the Evaluation of the Prohibition of Smoking in Public Places in Scotland. BMC Public Health. 2007;7:204. doi: 10.1186/1471-2458-7-204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Repp Alan C, Nieminen Gayla S, Olinger Ellen, Brusca Rita. Direct Observation: Factors Affecting the Accuracy of Observers. Exceptional Children. 1988;55:29–36. [Google Scholar]
  40. Seidler John. On Using Informants: A Technique for Collecting Quantitative Data and Controlling Measurement Error in Organization Analysis. American Sociological Review. 1974;39:816–31. [Google Scholar]
  41. Simmons Sandra F, Babineau Sarah, Garcia Emily, Schnelle John F. Quality Assessment in Nursing Homes by Systematic Direct Observation: Feeding Assistance. Journal of Gerontology: Medical Sciences. 2002;57A:M665–71. doi: 10.1093/gerona/57.10.m665. [DOI] [PubMed] [Google Scholar]
  42. Simons Daniel J, Jensen Melinda S. The Effects of Individual Differences and Task Difficulty on Inattentional Blindness. Psychonomic Bulletin and Review. 2009;16:398–403. doi: 10.3758/PBR.16.2.398. [DOI] [PubMed] [Google Scholar]
  43. Sinibaldi Jennifer, Durrant Gabriele B, Kreuter Frauke. Evaluating the Measurement Error of Interviewer Observed Paradata. Public Opinion Quarterly. 2013;77:173–93. [Google Scholar]
  44. Sinibaldi Jennifer, Eckman Stephanie. Using Call-level Interviewer Observations to Improve Response Propensity Models. Public Opinion Quarterly. 2015;79:976–93. [Google Scholar]
  45. Sinibaldi Jennifer, Trappmann Mark, Kreuter Frauke. Which Is the Better Investment for Nonresponse Adjustment: Purchasing Commercial Auxiliary Data or Collecting Interviewer Observations. Public Opinion Quarterly. 2014;78:440–73. [Google Scholar]
  46. Smith Heather J, Archer Dane, Costanzo Mark. ‘Just a Hunch’: Accuracy and Awareness in Person Perception. Journal of Nonverbal Behavior. 1991;15:3–17. [Google Scholar]
  47. Stähli Michèle E. Examples and Experiences from the Swiss Interviewer Training on Observable Data (Neighborhood Characteristics) for ESS 2010 (R5); Paper presented at the NC Meeting; March 31 to April 1, 2011; Mannheim, Germany. 2010. [Google Scholar]
  48. Tversky Amos, Kahneman Daniel. Judgment under Uncertainty: Heuristics and Biases. Science. 1974;185:1124–31. doi: 10.1126/science.185.4157.1124. [DOI] [PubMed] [Google Scholar]
  49. Wagner James, West Brady T, Kirgis Nicole, Lepkowski James M, Axinn William G, Kruger-Ndiaye Shonda. Use of Paradata in a Responsive Design Framework to Manage a Field Data Collection. Journal of Official Statistics. 2012;28:477–99. [Google Scholar]
  50. Wagner James, West Brady T, Guyer Heidi, Burton Paul, Kelley Jennifer, Couper Mick P, Mosher William D. The Effects of a Mid-Data Collection Change in Financial Incentives on Total Survey Error in the National Survey of Family Growth. In: Biemer Paul P, de Leeuw Edith, Eckman Stephanie, Edwards Brad, Kreuter Frauke, Lyberg Lars E, Tucker N Clyde, West Brady T, editors. Total Survey Error in Practice. New York, NY: John Wiley; 2017. pp. 155–177. chapter 8. [Google Scholar]
  51. West Brady T. The Error Properties of Interviewer Observations and Their Implications for Nonresponse Adjustment of Survey Estimates. Doctoral dissertation. University of Michigan; Ann Arbor: 2011. [Google Scholar]
  52. West Brady T. An Examination of the Quality and Utility of Interviewer Observations in the National Survey of Family Growth (NSFG). Journal of the Royal Statistical Society (Series A). 2013a;176:211–25. [Google Scholar]
  53. West Brady T. The Effects of Error in Paradata on Weighting Class Adjustments: A Simulation Study. In: Kreuter F, editor. Improving Surveys with Paradata: Analytic Uses of Process Information. Hoboken, NJ: Wiley; 2013b. pp. 361–88. chapter 15. [Google Scholar]
  54. West Brady T, Kreuter Frauke. Observational Strategies Associated with Increased Accuracy of Interviewer Observations: Evidence from the National Survey of Family Growth. Proceedings of the Survey Research Methods Section of the American Statistical Association (AAPOR 2011). Alexandria, VA: American Statistical Association; 2011. [Google Scholar]
  55. West Brady T, Kreuter Frauke. Factors Impacting the Accuracy of Interviewer Observations: Evidence from the National Survey of Family Growth (NSFG). Public Opinion Quarterly. 2013;77:522–48. [Google Scholar]
  56. West Brady T, Kreuter Frauke. A Practical Technique for Improving the Accuracy of Interviewer Observations of Respondent Characteristics. Field Methods. 2015;27:144–62. [Google Scholar]
  57. West Brady T, Little Roderick JA. Nonresponse Adjustment of Survey Estimates Based on Auxiliary Variables Subject to Error. Journal of the Royal Statistical Society (Series C) (Applied Statistics). 2013;62:213–31. [Google Scholar]
  58. West Brady T, Kreuter Frauke, Trappmann Mark. Is the Collection of Interviewer Observations Worthwhile in an Economic Panel Survey? New Evidence from the German Labor Market and Social Security (PASS) Study. Journal of Survey Statistics and Methodology. 2014;2:159–81. [Google Scholar]
  59. West Brady T, Sinibaldi Jennifer. The Quality of Paradata: A Literature Review. In: Kreuter Frauke, editor. Improving Surveys with Paradata. New York, NY: John Wiley; 2013. pp. 339–359. chapter 14. [Google Scholar]
  60. Zenk Shannon N, Schulz Amy J, Mentz Graciela, House James S, Gravlee Clarence C, Miranda Patricia Y, Miller Patricia, Kannan Srimanthi. Inter-Rater and Test-Retest Reliability: Methods and Results for the Neighborhood Observational Checklist. Health & Place. 2007;13:452–65. doi: 10.1016/j.healthplace.2006.05.003. [DOI] [PubMed] [Google Scholar]
  61. Zhang Daowen, Lin Xihong. Variance Component Testing in Generalized Linear Mixed Models for Longitudinal/Clustered Data and Other Related Topics. In: Dunson David B, editor. Random Effect and Latent Variable Model Selection. New York, NY: Springer; 2008. pp. 19–36. [Google Scholar]
