What You Don't Know Can Hurt You: Missing Data and Partial Credit Model Estimates

Sarah L Thomas; Karen M Schmidt; Monica K Erbacher; Cindy S Bergeman

Author manuscript; available in PMC: 2017 Oct 11.

Published in final edited form as: J Appl Meas. 2016;17(1):14–34.

PMCID: PMC5636626 NIHMSID: NIHMS908869 PMID: 26784376

Abstract

The authors investigated the effect of Missing Completely at Random (MCAR) item responses on partial credit model (PCM) parameter estimates in a longitudinal study of Positive Affect. Participants were 307 adults from the older cohort of the Notre Dame Study of Health and Well-Being (Bergeman and Deboeck, 2014) who completed questionnaires including Positive Affect items for 56 days. Additional missing responses were introduced to the data, randomly replacing 20%, 50%, and 70% of the responses on each item and each day with missing values, in addition to the existing missing data. Results indicated that item locations and person trait level measures diverged from the original estimates as the level of degradation from induced missing data increased. In addition, standard errors of these estimates increased with the level of degradation. Thus, MCAR data does damage the quality and precision of PCM estimates.

The popular idiom “what you don't know can't hurt you” (Simpson and Speake, 2008) does not apply to scientific research. Missing data, the absence of a response on an item given to a participant, can lead to lower quality parameter estimates. Although the amount of quantitative research investigating the statistical effects of missing data has recently skyrocketed (Gottfredson, Bauer, and Baldwin, 2014; Manly and Wells, 2014; Padgett, Skillbeck, and Summers, 2014; Yang, Wang, and Maxwell, 2014), there is only a small body of literature on missing data in Rasch models, and it has focused primarily on identifying the estimation methods that work best with missing data (DeMars, 2002). Aside from the study of estimation methods, there is little to no research on the effects of missing data on the parameter estimates of Rasch-based models, so these effects remain largely unknown. This dearth may exist because common estimations of Rasch models use all available information (An and Yung, 2014), a fact that may satisfy researchers who might otherwise investigate potential problems caused by missing data. Howell (2007) notes that all missing data is problematic, regardless of type, and researchers should always concern themselves with improving parameter estimates obtained from data with missing responses.

The validity of obtained parameter estimates, and the conclusions drawn from those estimates, can be threatened by missing data. This is particularly harmful in cases that use obtained parameter estimates for high-stakes decisions such as admittance to a program, clinical diagnoses, and the hiring, promoting, or dismissal of employees. Researchers who assume the effects of missing data are negligible for their Rasch model analyses may be ignoring the fact that there appear to be no thorough, empirical investigations on the effects of missing data in Rasch models in the literature. The negative effects of missing data could include, but are not limited to, decreases in the variability of raw observations, bias in the magnitude of parameter estimates in one direction, reduced information on which to base parameter estimates, increased instability of parameter estimates, and a loss of precision in parameter estimates.

The purpose of the present study is to begin filling the existing gap in the literature by investigating how creating additional randomly selected missing responses influences partial credit model (PCM; Masters, 1982) parameter estimates, using a longitudinal data set containing Positive Affect responses. Additionally, we will introduce a hybrid approach that examines effects of missing data by pairing controlled, simulated degradation with realistic psychological data. We will begin with a description of common categories used to describe patterns of missing responses and why they are important, followed by a description of several approaches to investigating missing data, and finally the partial credit model.

Types of Missing Data

There are three common categories used to describe patterns of missing responses in data sets: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR; Rubin, 1976). These categories provide the structure for methods specifically designed to handle these types of missing data (Howell, 2007). In the work presented here, we focus on MCAR data; however, the hybrid approach employed, described in more detail later, permits other types of missing responses to exist in the data. Thus, all three classifications are reviewed.

Missing completely at random

To meet the qualifications for MCAR, missing responses on a variable must be independent of the scores on that variable and on other observed variables in the study (Rubin, 1976). In MCAR data, the probability of missing responses is equal across items (Pickles, 2005). As a very simple example, if a participant taking an online survey clicks the mouse button twice to go to the next item and accidentally leaves an item blank, that missing response is equally likely for any item and is classified as MCAR. This is not to say that MCAR data is trivial, but to show that the probability of MCAR data is equal across all items.

There are many properties of estimates that could be affected by MCAR data, but research in this area has focused primarily on parameter estimate magnitude. Researchers conducting simulation-based research have found MCAR data does not bias the magnitudes of parameter estimates. Enders and Bandalos (2001) investigated the effect of MCAR data on a factor analysis using structural equation modeling and Full Information Maximum Likelihood (FIML) when each observed variable had 2%, 5%, 10%, 15%, or 25% MCAR data. They also varied the magnitudes of the true values of the factor loading parameters (.40, .60, and .80) and the sample size (N = 100, N =250, N = 500, N = 750). They found that most factor loading parameters estimated from MCAR data showed small bias in magnitude (less than 2% of the original factor loading on average) from the true parameter value. However, some conditions with small sample size (N = 100) and small true factor loading values (.40) showed large bias in magnitude (over 10% of the original parameter value). The researchers found the same pattern in the simulated complete data and attributed the instances of high bias not to the estimation methods, but to unstable covariance matrix elements, supporting the notion that MCAR data led to negligible amounts of bias in the magnitude of parameter estimates, given sufficiently large sample sizes.

Although MCAR data is thought not to bias the magnitude of parameter estimates, it could cause other problems in terms of the variability of raw observations, information, and standard errors in Rasch models. In terms of raw observations, MCAR data might influence the variability of observed responses, which could affect the quality of parameter estimates as well as the precision associated with those estimates. Additionally, MCAR data decreases the total number of available observations, decreasing the amount of information on which parameters are based and increasing standard errors of parameter estimates. Thus, estimates may become less stable and less precise as the amount of MCAR data increases. The effect of MCAR data on Rasch-based model estimates has yet to be examined, but it is possible that a thorough investigation of this topic will reveal effects of MCAR data that have not been previously addressed.

Missing at random

If data is missing at random (MAR), the probability of any given response missing is unrelated to the value of that response after controlling for other observed variables in the analysis (Enders and Bandalos, 2001; Rubin, 1976). For example, if participants were asked to rate themselves on the adjective “ecstatic” but participants with low levels of education left the item blank because they did not know what ecstatic meant, their data would be MAR if education level was an observed variable. The probability of having missing data on “ecstatic” would be dependent on education level, but unrelated to how ecstatic the participants felt. Note in this example, education level cannot be related to the level of ecstatic reported by participants, otherwise the missing data on “ecstatic” would be related to the rating of “ecstatic” and the data would be MNAR (see next section for description of MNAR). Some researchers consider MAR the most misunderstood type of missing data (Schafer and Graham, 2002). This is in part because of differing operational definitions of the word “random”. However, the terminology introduced by Rubin (1976) is deeply entrenched in the study of missing data and is likely to persist.

Missing not at random

For missing not at random (MNAR) data, the missing data fails to meet the qualifications for MCAR and MAR, implying that there is some underlying model for the missing responses which involves an unmeasured variable. For example, if depressed participants did not answer Positive Affect items, that data is MNAR (i.e., the missing response is related to the score on the variable itself). It is impossible to determine if data are MNAR or to identify the underlying mechanism for the missing responses unless one conducts follow-up interviews with participants to probe for the reasons behind them. MNAR is sometimes called non-ignorable missing data and is considered by many to be the most problematic of the types of missing data (Enders, 2011). In MCAR or MAR data, the missing values on a variable are unrelated to the value of that variable (for MAR, after controlling for other observed variables); however, in MNAR data, this is not the case. Therefore, MNAR is potentially more problematic because conclusions may be drawn on data that does not fully represent an individual's trait or traits. Also, MNAR data can directly alter the distribution of the observed scores. In the Positive Affect example above, many low scores would be missing and thus the lower tail of the distribution of observed scores would be underrepresented, reducing observed variance. Available MNAR analyses can be problematic because they require strict assumptions to be made about the mechanism behind the missing responses, which is unknown in many cases (Enders, 2011).

MCAR, MAR, and MNAR data can occur within the same study or even on a single item. For example, when answering an item about race it is possible that one participant clicked the next button accidentally before answering the item (MCAR) but another participant left the item blank because they thought that answering the question might lead to bias against them within the study (MNAR). The type of missing data for each participant aggregates across persons to form the type of missing data at the variable level. If the majority of participants did not answer an item about depression because they were depressed, that depression variable might be judged as having MNAR data even though other types of missing data might also be present on that variable.
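The three mechanisms described above can be illustrated with a small simulation. The following is a minimal sketch, not part of the study's analyses: the function `make_missing`, the binary `education` covariate, the missingness rates, and the cutoff of 2 for "low" scores are all hypothetical choices made for illustration. Scores are generated independently of education so that the MAR condition matches the "ecstatic" example.

```python
import numpy as np

def make_missing(scores, education, mechanism, rate=0.3, seed=0):
    """Return a float copy of `scores` with NaN inserted under one mechanism."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    if mechanism == "MCAR":
        # Same probability for every response, unrelated to any variable.
        p = np.full(scores.shape, rate)
    elif mechanism == "MAR":
        # Probability depends only on an observed covariate (education).
        p = np.where(education == 0, 2 * rate, 0.0)
    elif mechanism == "MNAR":
        # Probability depends on the (unobserved) score itself.
        p = np.where(scores <= 2, 2 * rate, 0.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    out = scores.copy()
    out[rng.random(scores.shape) < p] = np.nan
    return out

rng = np.random.default_rng(1)
n = 10_000
education = rng.integers(0, 2, size=n)   # 0 = low, 1 = high (hypothetical)
scores = rng.integers(1, 6, size=n)      # ratings 1-5, independent of education

for mech in ("MCAR", "MAR", "MNAR"):
    observed = make_missing(scores, education, mech)
    print(mech, round(np.nanmean(observed), 2), round(np.nanvar(observed), 2))
```

In this sketch only the MNAR condition shifts the observed mean upward and shrinks the observed variance, because low scores are the ones removed.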

Approaches to Investigating Missing Data

Identifying the type of missing data present in a dataset is only one aspect of the many ways that missing data can be studied. An obvious next area of investigation is the effect of missing data on parameter estimates. This effect can be studied in terms of many different outcomes including bias, accuracy, and precision. Negative effects of missing data can endanger the validity of obtained estimates and their interpretation; this is particularly harmful in cases when estimates are used to make high-stakes decisions. In the following sections of this paper, we will discuss simulation-based research of the effects of missing data and introduce a new hybrid approach to examining the effects of missing data using controlled degradation with realistic psychological data.

Simulations

The complexity of the types of missing data present on a single item points to the important concern of correctly classifying missing responses into these categories (MCAR, MAR, and MNAR). Without careful investigation, the conclusions that the researchers draw regarding the category of missing data present in a study could very possibly be an oversimplification of the types of missing data present on that item. In simulation studies, this categorization decision of the type of missing data on an item is unnecessary because the data are created to have a specific type of missing data. As useful and informative as simulation studies are, their weaknesses are that they are often based on unrealistic data and therefore have limited generalizability. The strength of simulation studies is that they provide very clear cause and effect information through a controlled statistical experiment. However, the conditions that allow for a controlled statistical experiment are also much less complex than those encountered in real data.

A hybrid approach

In the current study, we utilized a more data-centered approach to the study of missing data. We began with an empirical longitudinal data set and investigated how creating MCAR data (through degradation) in addition to the pre-existing missing data (of undetermined type) influenced parameter estimates under the partial credit model (Masters, 1982; Wright and Masters, 1982). Degradation, defined as the process of systematically removing a certain amount of non-missing observations, is typically conducted on simulated data. However, in the current study, we used controlled degradation on realistic psychological data, creating a hybrid approach to investigating the implications of missing data. The hybrid approach fuses the strengths of simulation research with the advantages of using empirical data.

One advantage of this hybrid method is that it allows substantive researchers to discover the implications of missing item responses on their specific data set or on data typically obtained in a particular substantive field. The hybrid approach allows one to calculate the parameter estimates for a study under two conditions, the original data and the data with additional missing data introduced. This allows the magnitude of the effect of a certain type of missing data to be calculated for any data set. The hybrid approach is a powerful tool for addressing concerns about how missing data affects estimates in a substantive area of research.

Another advantage of this hybrid method is that the results obtained with empirical psychological data are potentially more generalizable than results obtained with simulated data. Psychological data is notoriously messy in ways that simulated data might not capture. As a result, we have more confidence that effects found in real psychological data sets apply to other psychological data sets. Indeed, effects found in simulated data may cease to exist in realistic data. For example, suppose a simulation study finds that under a specific set of data conditions, estimates double in magnitude. This may seem severely problematic, but if the specific data conditions that caused biased estimates in the simulation occur rarely, or even never, in real data, then this finding has minimal impact. Procedures that use real psychological data to answer statistical questions provide stronger evidence for the existence and practical importance of those effects than simulation studies. In the study of missing data, simulations are a great way to investigate the general effect of each type of missing data on estimates; however, simulation results should be cross-referenced with the results of the hybrid approach to investigate the relevance of the discovered effects to psychological data.

Note that when the data-centered method is used for data with previously existing missing data, we are restricted to comparing our estimates to those from the original data, which by definition, is incomplete. The underlying types of missing data and the mechanisms behind the missing data are often unknown and may influence results. If the data does not have missing responses, then these considerations are eliminated, but the advantages of using this hybrid method remain. Every method has strengths and weaknesses, but for substantive researchers who desire a data-centered approach, the hybrid approach presented here may be the most appropriate.

The Partial Credit Model

The partial credit model, which will be used to analyze the data in the current study, was developed by Masters (1982; Wright and Masters, 1982) and is part of the family of Rasch models (Rasch, 1960). In a broad sense, the partial credit model estimates the probability of a person responding to an item in a certain category given their latent trait level and the measure, also called location or difficulty, of the item. Item and person parameters are estimated on the same logit scale, which allows their locations to be compared easily. The partial credit model will be utilized in the current study because it is flexible and well-suited for rating scale data (Masters and Wright, 1996).

With a five-point rating scale, there are four transitions between categories (see Figure 1). In the PCM, a transition location is that point on the latent trait continuum where the probability of responding in a given category is equal to the probability of responding in the category above it. Visually, transition locations occur where the probability of responding in one category overlaps with the next category on Category Response Curves. For example, one of the questions from the Notre Dame Study of Health and Well-Being data (Bergeman and Deboeck, 2014), which will be analyzed in this paper, is “How well does the word ‘happy’ describe you?” Responses were recorded on the following scale: 1 = Not at all, 2 = A little, 3 = Moderately, 4 = Quite a bit, and 5 = Extremely. A person with a low trait level on Positive Affect has a high probability of selecting a low category on this item but a person with a high trait level on Positive Affect has a low probability of selecting a low category on this item. The differences between a person's estimated latent trait level and the transition location associated with advancing to the next category, for each category, are the main components of the partial credit model equation, which is represented as:

Figure 1. Category response curves (CRCs) showing the probability of response to a category given person measure level. The numbers located near the peak of each line represent the category number and the arrows indicate the item transition locations.

\pi_{xni} = \frac{\exp \sum_{j=0}^{x} (\beta_{n} - \delta_{ij})}{\sum_{k=0}^{m_{i}} \exp \sum_{j=0}^{k} (\beta_{n} - \delta_{ij})} .

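In the partial credit model equation, n indexes the person, i indexes the item, x is the response category of interest, k indexes the categories summed over in the denominator, j indexes the transitions up to a given category, and m_i is the highest category of item i, so that categories run from 0 to m_i. For the five-point items analyzed here, m_i = 4, corresponding to the four transitions between categories. The probability of a person n responding to item i with category x is an exponential function of the difference between the person's location on the latent dimension (β_n) and the set of transition locations between categories for the item (δ_ij), which indicate the item's location on the latent dimension. The numerator accumulates these differences up to the category of interest, and the denominator sums the same quantity over all categories so that the category probabilities sum to one. By convention, the j = 0 term is defined to be zero. The result of the partial credit model equation is a single probability for each category representing the probability of a person with a specific trait level responding to an item with that particular category as their response.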
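The equation can be evaluated directly for one person and one item. The sketch below is illustrative only: the function name `pcm_probabilities` and the transition locations are hypothetical, and categories are indexed from 0 as in the equation, so category 0 corresponds to a rating of 1 on the five-point scale.

```python
import numpy as np

def pcm_probabilities(beta, deltas):
    """Category probabilities for one person on one item under the PCM.

    beta   : person location in logits
    deltas : transition locations delta_i1 .. delta_im in logits
    Returns probabilities for categories 0 .. m (length m + 1).
    """
    deltas = np.asarray(deltas, dtype=float)
    # Cumulative sums of (beta - delta_ij); the category-0 term is defined as 0.
    exponents = np.concatenate(([0.0], np.cumsum(beta - deltas)))
    exponents -= exponents.max()   # numerical stability; ratios are unchanged
    weights = np.exp(exponents)
    return weights / weights.sum()

deltas = [-2.0, -0.5, 0.5, 2.0]    # hypothetical transitions, 5-category item
for beta in (-3.0, 0.0, 3.0):
    print(beta, np.round(pcm_probabilities(beta, deltas), 3))
```

When the person location equals a transition location (for example, beta = -2.0 here), the two adjacent categories are equally probable, which is the definition of a transition location given earlier.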

Goals of the Present Study

The primary goal of this study is to investigate whether creating additional MCAR data influences partial credit model parameter estimates. Previous literature has examined the impact of MCAR data on estimates from factor analyses and structural equation models, but the literature on how missing data affects Rasch model estimates is sorely lacking. The analyses of the current study are exploratory in nature, but we anticipated that creating additional MCAR data would cause more divergence from original PCM parameter estimates (i.e., trait level scores, item locations, and transition locations) and decreased precision due to a loss of observations as the level of degradation increased. We will also investigate the point at which degradation affects parameter estimates and the magnitude of those changes in parameter estimates.

A second goal of this study is to pioneer the data-centered approach to investigating missing data. This approach combines controlled, simulated degradation with realistic, psychological data to directly examine the effects of missing responses on parameter estimates. A major advantage of this method is that it allows substantive researchers to directly examine how an increase of a certain type of missing data would affect estimates in data typically obtained in their field of study. Another advantage of this method is that the results may be more likely to generalize to real psychological data than the results of simulations, which are based on perfect and unrealistic data conditions. Using the data-centered approach, we will investigate how PCM estimates change if additional MCAR data were created for the Notre Dame Study on Health and Well-Being.

Method

Participants

The Notre Dame Study of Health and Well-Being was funded by the National Institute on Aging (Bergeman and Deboeck, 2014). Data collection began in 2006 and was ongoing, in biennial bursts lasting 56 days, as of the date this manuscript was written. Participants were 307 adults from the first burst of measurement in the older cohort of this study who completed paper-and-pencil questionnaires every evening for 56 days. Participants were between 53 and 91 years of age, with a mean age of 68 years (SD = 5.33). The majority of the participants were female (58%). The sample was primarily Caucasian (81%), but also included African Americans (12%), Hispanics (4%), and Asians (3%). The sample had a wide range of income levels, with 19.8% earning less than $15,000, 20.8% earning between $15,000 and $25,000, 25.7% earning between $25,000 and $40,000, 22.6% earning between $40,000 and $75,000, 5.2% earning between $75,000 and $100,000, and 3.1% earning over $100,000 (income data were missing for some participants). Fourteen percent of the participants completed a post-college degree, 13% completed a college degree, 23% completed some college classes, 8% completed vocational education, 39% completed high school, and 3% did not complete high school.

Measures

Affect items

The Positive and Negative Affect Schedule (PANAS; Watson, Clark, and Tellegen, 1988) was created to measure Positive Affect (PA) and Negative Affect (NA). These two dimensions were created to be orthogonal, with 10 items measuring each on the PANAS. Participants responded to each item using the prompt “Today I felt______” on a 5-point Likert scale with 1 = Not at all, 2 = A little, 3 = Moderately, 4 = Quite a bit, and 5 = Extremely. Twenty-two additional affect items (12 PA; 10 NA) were administered to participants, with most of these items taken from the Circumplex Model of Emotion (Larsen and Diener, 1992) for a more comprehensive measurement of Positive and Negative Affect. Due to previous work demonstrating that the items Scared, Lonely, Passive, and Still were associated with problematic item infit and/or outfit across the majority of the 56 days, these items were removed from analyses (Erbacher, Schmidt, Boker, and Bergeman, 2012; Bond and Fox, 2007). Twenty PA items and 18 NA items composed the final PA and NA scales. The final PA items were Active, Calm, Alert, Attentive, Elated, Determined, Stimulated, Happy, Enthusiastic, Excited, Love, Proud, Joyful, Strong, Interested, Pleased, Content, Aroused, Inspired, and Euphoric (PANAS items italicized).

Demographic items

Demographic information regarding gender, race, income, and education was collected from participants.

Procedure

Participants responded to 38 affect items once a day, in the evening, for 56 consecutive days. The responses were recorded using paper and pencil.

Data Analysis

Partial Credit Model Analyses Procedure

The need for anchoring

Winsteps (Linacre, 2014) was used to obtain partial credit model parameter estimates. If each of the 56 occasions of item responses were analyzed separately, the PCM parameter estimates would not be comparable because the parameters from each occasion would be estimated on their own logit scale (Baker and Kim, 2004; Bond and Fox, 2007; Embretson and Reise, 2000). To compare parameters from different analyses, the items should be anchored on the same logit scale. Erbacher et al. (2012) compared PCM estimates of longitudinal data under four anchoring conditions: unanchored or floated analyses; using the first occasion parameters as anchors; using occasion-specific parameters for each item from a “long-format” analysis of all 56 days to anchor the day-by-day analyses (e.g., for the first occasion, Calm in the anchored analysis had item parameters that were fixed to be equal to the item parameters for Calm on the first occasion from the “long-format” analysis); and averaging values from three occasions to serve as anchor values. The standard errors of estimates and fit statistics of items under each anchoring method demonstrated that anchoring using the average values from three days or anchoring using occasion-specific parameters from an analysis of all occasions produced the most desirable results; however, only the former fixed item parameters to be stable across time. In the present study, the measurement characteristics of the PA items should not be changing across occasions. Thus, we anchored day-by-day analyses using the average item parameters (item locations and item transition locations); however, unlike Erbacher et al. (2012), we used all 56 days to compute the average item parameters. The method for doing so is described below.

The procedure for anchoring

The original data were analyzed in one “long format” analysis that included data from the 20 items for all 56 days, with each item-occasion pairing treated as an independent item (20 items × 56 days = 1120 items). The data were then degraded randomly by an additional percentage (20%, 50% or 70%) of 307 responses (the total N) on each item on each day. For example, the item Aroused on day one had 281 responses, with 26 responses already missing, but degrading Aroused by an additional 50% for that day caused 154 observations to be randomly replaced with missing values, leaving 127 of the original 281 responses. Each level of degraded data was analyzed in a “long format” analysis (20 items × 56 days = 1120 items) to obtain average item locations and average item transition locations for each item on each occasion under varying levels of degradation. Then, the data for each level of degraded data was analyzed separately for each occasion, using the average item parameters from the “long format” analysis of that data to anchor the results. The results of the original, anchored analyses (labeled ORIG, hereafter) and degraded, anchored analyses for each level of degradation (labeled DEG-20, DEG-50, and DEG-70, hereafter) were then compared.
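The degradation step for a single item on a single day can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function `degrade_item` is hypothetical, the responses are simulated, and the number of removals is taken to be the degradation percentage times the total N rounded half up, which reproduces the 154 removals in the Aroused example.

```python
import numpy as np

def degrade_item(responses, pct, n_total, rng):
    """Replace a fixed number of observed responses with NaN.

    responses : one item's responses on one day (NaN = already missing)
    pct       : degradation level as a proportion (0.20, 0.50, or 0.70)
    n_total   : total sample size used to compute the number of removals
    """
    out = np.asarray(responses, dtype=float).copy()
    observed = np.flatnonzero(~np.isnan(out))
    # Assumption: round half up, so 0.50 * 307 = 153.5 becomes 154.
    n_remove = int(np.floor(pct * n_total + 0.5))
    n_remove = min(n_remove, observed.size)   # cannot remove more than exist
    drop = rng.choice(observed, size=n_remove, replace=False)
    out[drop] = np.nan
    return out

rng = np.random.default_rng(2016)
n_total = 307
item = rng.integers(1, 6, size=n_total).astype(float)         # simulated ratings
item[rng.choice(n_total, size=26, replace=False)] = np.nan    # 26 already missing

degraded = degrade_item(item, 0.50, n_total, rng)
print(int(np.sum(~np.isnan(item))), int(np.sum(~np.isnan(degraded))))
# prints: 281 127
```

Removals are drawn only from observed responses, so pre-existing missing values stay missing and the retained responses are unchanged.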

Special considerations for anchoring

As data are degraded, it becomes more likely that certain categories will not be observed in item responses. When intermediate categories (i.e., 2, 3, and 4 on a 5-point scale) are not observed, Winsteps creates extreme numeric value transition locations (by adding +/–40 logits) between that category and the next, indicating that a category is not currently observed, but may be observed in future data (Linacre, 2014). We excluded these extreme transition locations to prevent them from influencing the average item transition locations for each item.
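Excluding the extreme values before averaging might look like the sketch below. The function name, the cutoff of 20 logits used to flag a placeholder value, and the example transition locations are all assumptions for illustration; the source states only that the extreme values arise from adding or subtracting 40 logits.

```python
import numpy as np

def mean_transition_locations(transitions, cutoff=20.0):
    """Average each transition over occasions, skipping extreme placeholders.

    transitions : array of shape (n_occasions, n_transitions), in logits
    cutoff      : absolute values above this are treated as placeholders
                  for unobserved categories (assumed threshold)
    """
    t = np.asarray(transitions, dtype=float)
    masked = np.where(np.abs(t) > cutoff, np.nan, t)
    return np.nanmean(masked, axis=0)

# Hypothetical transition locations for one item on three occasions.
transitions = np.array([
    [-1.2, -0.3, 0.4, 1.1],
    [-1.0, -0.5, 0.6, 0.9],
    [-1.1, -0.4, 40.5, -39.0],   # occasion with an unobserved category
])
print(np.round(mean_transition_locations(transitions), 2))
```

Without the mask, the third occasion would pull the averages of the last two transitions far from the values seen on the other occasions.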

Estimation procedure

Winsteps uses joint maximum likelihood estimation (JMLE; Linacre, 2014), which allows for missing data and uses all of the available information to produce estimates (An and Yung, 2014). Using this method we will investigate changes in the magnitudes of PCM parameter estimates and the precision of those parameter estimates.

Results

Description of the Pre-Existing Missing Data

Missing responses by day

To assess the overall trend of missing responses for the 56 days of the study, the number of missing responses per day over all people and all items was computed. Missing responses increased over the fifty-six days of the study, but the day-to-day variation from this trend was large (see Figure 2). On the first day, 4% of the total responses were missing and on the fifty-sixth day, 21% of the total responses were missing. However, much of this effect could be caused by attrition. To investigate if the increase in missing responses over time was present after controlling for attrition, we also computed the sum of missing responses excluding participants who appeared to drop out of the study.

Figure 2. Total missing response trends over the 56 days for each level of degradation.

In addition to participants who clearly dropped out of the study, some participants stopped responding just before the end of the study, and it was unclear if they were dropouts or if they simply stopped responding for a few days near the end of the study. Judging whether a participant was a dropout was particularly difficult when the participant had repeatedly stopped and resumed responding throughout the course of the study. Thus, participants were counted as dropouts only if their final streak of missing responses, defined as multiple days of non-response on the entire Positive Affect scale, lasted for more days than any of their previous streaks of missing responses. In total, 40 of 307 participants were classified as study dropouts. Missing responses in the non-dropout participants showed a similar trend to the missing responses for all participants and increased over time, indicating that attrition alone did not drive the increase of missing data over time observed in the data.
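The dropout rule can be expressed as a short function. This is a sketch under assumptions: the function names are hypothetical, the input is one participant's sequence of booleans marking days on which the entire Positive Affect scale was missing, "multiple days" is read as at least two, and the final streak is required to run through the last day of the study.

```python
def missing_streaks(missing_days):
    """Return (start, length) for each run of consecutive missing days."""
    streaks, start = [], None
    for day, is_missing in enumerate(missing_days):
        if is_missing and start is None:
            start = day
        elif not is_missing and start is not None:
            streaks.append((start, day - start))
            start = None
    if start is not None:
        streaks.append((start, len(missing_days) - start))
    return streaks

def is_dropout(missing_days, min_len=2):
    """True if the final streak reaches the last day and is the longest."""
    streaks = missing_streaks(missing_days)
    if not streaks:
        return False
    start, length = streaks[-1]
    ends_on_last_day = start + length == len(missing_days)
    earlier = [n for _, n in streaks[:-1]]
    return (ends_on_last_day and length >= min_len
            and all(length > n for n in earlier))

# Hypothetical 56-day response patterns.
clear_dropout = [False] * 40 + [True] * 16
stop_and_resume = [False] * 10 + [True] * 5 + [False] * 38 + [True] * 3
print(is_dropout(clear_dropout), is_dropout(stop_and_resume))
# prints: True False
```

The second participant ends the study with a three-day gap, but an earlier five-day gap was longer, so the rule does not count that participant as a dropout.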

Missing responses by item

It is important to assess the missing responses for each item individually to investigate how each item's missing responses contribute to the overall missing data and to determine whether some items were skipped due to particular reasons. To investigate if some items had more missing data than others, we computed the proportion of missing responses per item across days and participants. Indeed, some items had a higher proportion of missing data than others. For example, across all days of the study, “Euphoric” was missing about 22 percent of the time, but “Active” was missing about 14 percent of the time (see Table 1). The demographic items, measured at the first occasion, exhibited the lowest proportions of missing responses (0.00% – 0.02%).

Table 1. Proportion of Missing Responses and Mean Measure by Item in Original Data.

Item	Proportion of Missing Responses in ORIG Data	ORIG Mean Item Location for Anchoring
Active	0.14	−0.76
Happy	0.14	−0.68
Alert	0.14	−0.93
Calm	0.14	−0.35
Determined	0.15	−0.35
Attentive	0.15	−0.34
Interested	0.15	−0.19
Content	0.15	−0.13
Pleased	0.15	−0.17
Strong	0.15	−0.37
Enthusiastic	0.15	0.20
Excited	0.15	0.64
Love	0.15	−0.69
Joyful	0.15	0.04
Stimulated	0.15	0.26
Proud	0.16	−0.42
Inspired	0.16	0.51
Elated	0.17	0.82
Aroused	0.19	1.46
Euphoric	0.22	1.44

Note. The total number of observations used to compute the proportion of missing responses was 17192 (307 participants × 56 days). The mean item locations on the right are those used to anchor the day-by-day ORIG analyses.

We also computed the number of missing responses per item for each day to gauge the trends of missing data on each item over time. Every item showed a linear increase in missing responses across the 56 days, indicating that all of the items contributed to the overall missing data trend.
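The two summaries used in this subsection, the proportion of missing responses per item and the count of missing responses per item on each day, can be sketched as below. The nested-list layout (persons, then days, then items, with None marking a missing response) and the function names are assumptions chosen for illustration, not the format of the study's data files.

```python
from typing import List, Optional

# Assumed layout: data[person][day][item]; None marks a missing response.
Data = List[List[List[Optional[int]]]]


def proportion_missing_by_item(data: Data, n_items: int) -> List[float]:
    """Proportion of person-by-day observations that are missing, per item."""
    missing = [0] * n_items
    total = 0  # number of person-by-day observations
    for person in data:
        for day in person:
            total += 1
            for j in range(n_items):
                if day[j] is None:
                    missing[j] += 1
    return [m / total for m in missing]


def missing_count_by_item_and_day(data: Data, n_days: int, n_items: int) -> List[List[int]]:
    """counts[item][day] = number of persons missing that item on that day."""
    counts = [[0] * n_days for _ in range(n_items)]
    for person in data:
        for d, day in enumerate(person):
            for j, response in enumerate(day):
                if response is None:
                    counts[j][d] += 1
    return counts
```

With 307 participants and 56 days, the denominator in the first function would be the 17192 observations noted under Table 1.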

Missing responses by person

To assess the patterns of missing responses exhibited by each participant's data, we computed the total number of missing responses for each person on each day and examined these totals visually. There were several patterns of missing data exhibited by participants, and they seem to fit four main patterns. The first pattern of missing responses was missing a low number of items over all 56 days (“Ideal Responder,” see Figure 3A). The second pattern of missing responses was answering the majority of the items on most days, but missing the whole Positive Affect scale on several days (“Occasional Scale Non-Responder,” see Figure 3B). The third pattern of missing responses was responding to a low number of items on most days and missing the whole Positive Affect scale on several days (“High-Level Non-Responder,” see Figure 3C). The fourth and final missing responses pattern demonstrates attrition or dropping out of the study (“Dropout,” see Figure 3D). There were many unique patterns in the data, but these four patterns and their combinations describe most of the variability in the patterns of missing responses.

Figure 3. Exemplars of participants' patterns of missing responses over the 56 occasions. The x-axes represent the days of measurement and the y-axes represent the total number of missing Positive Affect items. A represents the Ideal Responder, B represents the Occasional Scale Non-Responder, C represents the High-Level Non-Responder, and D represents the Dropout.

Degradation Effects on Item Parameters

Item locations

Item locations, sometimes known as item difficulty, represent the location of an item on the logit scale. More specifically, item location in the PCM is the average of the transition locations in an item (Masters, 1982). However, in Winsteps, the transition locations sum to zero and item location is a separate numeric value. Recall that item parameters were anchored by taking the average item locations and average item transition locations for each item on each occasion under each level of degraded data in a “long format” analysis and using those average item parameters to anchor the analysis for each level of degradation separately for each occasion. The average item locations used for anchoring ranged from β = −0.93 to β = 1.46 in the ORIG analyses (see Table 1). See Figure 4 for a person-item map showing the item locations and estimated trait levels of participants for an example day (Positive Affect on day 28 of the ORIG analyses).

Figure 4. Person-item map of Positive Affect on day 28 for ORIG analysis.

To investigate changes in the average item locations used for anchoring between the data sets, absolute change scores from the average ORIG item locations to the average DEG-20, DEG-50, and DEG-70 item locations were computed. As seen in Table 2, the average absolute changes in item location increased as the level of degradation increased. Additionally, the maximum and standard deviation of these absolute changes increased as the level of degradation increased. These findings indicate that item location estimates were influenced by degradation, such that as degradation increased, the estimates diverged more from ORIG estimates and the magnitude of the changes from ORIG became more variable.

Table 2. Average, Maximum, and Standard Deviation of Changes in Item Location from ORIG to Degraded Analyses.

	Average Absolute Change from ORIG	Max Absolute Change from ORIG	SD of Absolute Change from ORIG
DEG-20	0.01	0.03	0.01
DEG-50	0.02	0.08	0.02
DEG-70	0.10	0.41	0.11

Note. The values in this table represent the average, maximum, and standard deviation of the absolute changes from the average item locations used to anchor the ORIG analyses to the average item locations used to anchor the DEG-20, DEG-50, and DEG-70 analyses.
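The change-score summary reported in Table 2 can be sketched as follows. The function name and the toy values in the usage note are hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, because the text does not state which form was used.

```python
import statistics
from typing import Dict, List


def summarize_absolute_changes(orig: List[float], degraded: List[float]) -> Dict[str, float]:
    """Average, maximum, and SD of |degraded - ORIG| across items."""
    changes = [abs(d - o) for o, d in zip(orig, degraded)]
    return {
        "mean": statistics.mean(changes),
        "max": max(changes),
        # Assumption: sample SD; use statistics.pstdev for the population form.
        "sd": statistics.stdev(changes),
    }
```

For example, calling the function with ORIG anchors [0.0, 1.0, -1.0] and degraded anchors [0.1, 0.7, -1.2] summarizes the absolute changes 0.1, 0.3, and 0.2.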

Standard errors of estimate for item location

The standard error of estimate for an item location represents the precision of that estimate. The precision of the estimate is the inverse of the item information, which can be expressed as:

I_{i} (θ) = \sum_{x = 0}^{m_{i}} \frac{{[P_{i x}^{'} (θ)]}^{2}}{P_{i x} (θ)},

where I represents the calculated information, i represents the item, x represents the current category, m represents the final category, P represents probability, and θ represents a given latent trait level. Thus, P_ix(θ) represents the probability that a person with latent trait level θ will receive a category score of x on item i, and $P_{i x}^{'} (θ)$ is the first derivative of P_ix(θ) with respect to θ. Hence, item information at a given trait level is the squared first derivative of each category probability divided by that category probability, summed over all of the categories. Standard errors of the item location estimates were expected to increase with the level of degradation because sample size decreases as the level of degradation increases, and thus estimates should become less precise.
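The information formula above can be evaluated numerically with a short sketch. It assumes the standard PCM category probabilities (Masters, 1982), approximates the derivative by a central difference, and uses hypothetical transition locations; the function names are illustrative only. Under the PCM, the derivative of each category probability equals P_ix(θ) times (x minus the expected score), so the information reduces to the variance of the item score at θ, which the last function computes as a cross-check.

```python
import math
from typing import List


def pcm_probs(theta: float, deltas: List[float]) -> List[float]:
    """PCM category probabilities P_i0..P_im; deltas are the item's transition locations."""
    cumulative = [0.0]  # category 0 has an empty sum
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    top = max(cumulative)  # subtract the max for numerical stability
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta: float, deltas: List[float], h: float = 1e-5) -> float:
    """Sum over categories of P'(theta)^2 / P(theta), with a central-difference P'."""
    probs = pcm_probs(theta, deltas)
    upper = pcm_probs(theta + h, deltas)
    lower = pcm_probs(theta - h, deltas)
    derivs = [(u - l) / (2 * h) for u, l in zip(upper, lower)]
    return sum(d * d / p for d, p in zip(derivs, probs))


def score_variance(theta: float, deltas: List[float]) -> float:
    """Variance of the item score at theta; equals the PCM item information."""
    probs = pcm_probs(theta, deltas)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))
```

Because fewer observed responses contribute less total information, this sketch is consistent with the expectation that standard errors grow as degradation increases.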

The standard errors for every item on each day of the ORIG, DEG-20, DEG-50, and DEG-70 analyses were grouped by degradation level. As shown in Table 3, the average standard error increased with increasing degradation, indicating that the item location estimates became less precise as the level of degradation increased. The range of the standard errors and the standard deviation of the standard errors also increased with increasing degradation. However, the standard deviation of the standard errors was similar, after rounding, for all analyses except DEG-70. These findings suggest that standard error increases in magnitude and range as the level of degradation increases, and this was most extreme at DEG-70.

Table 3. Average, Minimum, Maximum, and Standard Deviation of Standard Errors of Item Location Estimates for Each Analysis.

	Average SE	Min of SEs	Max of SEs	SD of SEs
ORIG	0.08	0.07	0.09	0.01
DEG-20	0.09	0.07	0.10	0.01
DEG-50	0.12	0.10	0.16	0.01
DEG-70	0.20	0.13	0.46	0.04

Note. The values in this table represent the average, minimum, maximum, and standard deviation of the standard errors of item locations for ORIG, DEG-20, DEG-50, and DEG-70.

Item transition locations

Recall that item transition locations are the points on the latent trait continuum where the probability of participants responding in one category is equal to the probability of responding in the next category given a particular trait level (see Figure 1).

To examine the effect of degradation on item transition locations, the item transition locations used to anchor the analyses were compared between degradation levels. For each item, the total change in all five item transition locations was computed. Like the results for item location, these results showed that as degradation increased, item transition locations changed more on average and the change in item transition locations became more variable (see Table 4).

Table 4. Average, Maximum, and Standard Deviation of Total Changes in Item Transition Locations from ORIG to Degraded Analyses.

	Average Total Absolute Change from ORIG	Max Total Absolute Change from ORIG	SD of Total Absolute Change from ORIG
DEG-20	0.07	0.17	0.04
DEG-50	0.15	0.59	0.12
DEG-70	0.62	1.59	0.47

Note. The values in this table represent the average, maximum, and standard deviation of the total changes per item in item transition locations between ORIG and DEG-20, DEG-50, and DEG-70.

Item infits and outfits

Item infit is a chi-square statistic, represented as a mean square, that indicates the extent to which individuals' responses deviate from their expected responses when the item location and the individual's estimated trait level are in similar locations on the logit scale. It is a function of the difference between the actual value and the expected value and can be expressed as:

InfitMNSQ = \frac{\sum_{i = 1}^{N} {(X_{i} - E (X_{i}))}^{2}}{\sum_{i = 1}^{N} \sum_{k = 0}^{m} {(k - E (X_{i}))}^{2} P_{i} (k)},

where X represents the category endorsed and E represents the expected category, for person i. N represents the number of examinees, k represents the current category, and m represents the final category (Wright and Masters, 1982, pp. 98-99). P represents the probability of a particular person endorsing a particular category, given by the PCM equation. The expected value of X, for person i, can be represented as follows:

E (X_{i}) = \sum_{k = 0}^{m} k P_{i} (k),

where k represents the current category, m represents the final category, and P represents the probability of a particular person endorsing a particular category. Thus, item infit is the sum of the squared differences between the actual and expected values divided by the sum of the model variances of the responses. Each model variance is the squared difference between a category and the expected value, weighted by the probability of responding in that category given the item and person parameters, and summed over all categories.

Traditional evaluation of item infit MNSQ regards infit of greater than one as underfit, infit of approximately one as ideal, and infit less than one as overfit (Linacre, 2015). Underfit indicates a larger than expected difference between expected and actual responses and overfit indicates a smaller than expected difference between expected and actual responses.

Item outfit pertains to the difference in expected and actual responses when an individual answers items far from the individual's trait level on the latent dimension measured. The traditional evaluation of item outfit is the same as item infit; an outfit of about one is ideal, outfit greater than one is considered underfit, and outfit less than one is considered overfit (Linacre, 2015). Item outfit can be represented as:

OutfitMNSQ = \frac{\sum_{i = 1}^{N} \frac{{(X_{i} - E (X_{i}))}^{2}}{\sum_{k = 0}^{m} {(k - E (X_{i}))}^{2} P_{i} (k)}}{N},

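which is the average, over the N persons, of each squared difference between the actual and expected values divided by its own model variance. Infit pools the squared differences and the variances across persons before dividing, whereas outfit weights every person's standardized squared difference equally. Item outfit and item infit share the same equation for expected values, which is: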

E (X_{i}) = \sum_{k = 0}^{m} k P_{i} (k).
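The infit and outfit mean squares defined above can be computed for a single item with the following sketch. It is an illustration of the equations as written here, not of Winsteps internals: the function names and the input format (one observed category and one list of model category probabilities per person, with None for a missing response) are assumptions.

```python
from typing import List, Optional, Tuple


def expected_and_variance(probs: List[float]) -> Tuple[float, float]:
    """E(X_i) and the model variance of X_i from category probabilities P_i(0..m)."""
    expected = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
    return expected, variance


def item_fit(observed: List[Optional[int]],
             category_probs: List[List[float]]) -> Tuple[float, float]:
    """Return (infit MNSQ, outfit MNSQ) for one item, skipping missing responses."""
    sq_resid_sum = 0.0   # numerator of infit
    variance_sum = 0.0   # denominator of infit
    z_sq_sum = 0.0       # sum of squared standardized residuals for outfit
    n = 0
    for x, probs in zip(observed, category_probs):
        if x is None:  # a missing response contributes no information
            continue
        expected, variance = expected_and_variance(probs)
        sq_resid = (x - expected) ** 2
        sq_resid_sum += sq_resid
        variance_sum += variance
        # PCM probabilities are strictly positive, so variance > 0 here.
        z_sq_sum += sq_resid / variance
        n += 1
    return sq_resid_sum / variance_sum, z_sq_sum / n
```

For two persons with category probabilities [0.5, 0.5] and [0.2, 0.8] who respond 1 and 0, the sketch gives an infit of about 2.17 and an outfit of 2.5, showing how the second, more unexpected response weighs more heavily on outfit.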

Rather than using the traditional cutoffs for infit and outfit, Smith, Schumacker, and Bush (1998) showed that infit and outfit statistic cutoffs should be computed for the specific sample size and number of items in a study to minimize type I errors. However, in the present study, the sample size varied by item and by day because of missing data, making the calculation of customized fit statistic cutoffs impractical. If customized fit statistic cutoffs had been calculated by item and day, there would be 4480 cutoff scores (20 items * 56 days * 4 degradation levels) for a single fit statistic. To identify extreme mean square item infits and outfits, values were compared to a commonly used cutoff score of 1.70 (Bond and Fox, 2007). Comparing mean square item infits and outfits to a fixed cutoff such as 1.70 ignores important factors like sample size and the number of items; thus, the results should be interpreted cautiously. We also assessed t-standardized item infits and outfits, using a cutoff score of ±1.96 (Linacre, 2007). Comparison to these particular cutoff scores is the same approach that Erbacher et al. (2012) used in their analysis of anchoring methods.

The percentages of extreme item infit and outfit statistics indicating underfit for ORIG, DEG-20, DEG-50, and DEG-70 are presented in Table 5. Results indicate a higher number of extreme standardized item infit and outfit statistics for ORIG than for DEG-20, DEG-50, and DEG-70. Thus, degradation decreased the number of extreme standardized item infit and outfit statistics present.

Table 5. Frequency and Percentage of Extreme Item Infit and Outfit Statistics for Each Analysis.

	Item Std Infit >1.96	Item Std Outfit >1.96	Item Msq Infit >1.70	Item Msq Outfit >1.70
ORIG	103 (9.20%)	99 (8.84%)	0 (0.00%)	14 (1.25%)
DEG-20	77 (6.88%)	74 (6.61%)	0 (0.00%)	9 (0.80%)
DEG-50	28 (2.50%)	43 (3.84%)	3 (0.27%)	8 (0.71%)
DEG-70	4 (0.36%)	3 (0.27%)	0 (0.00%)	0 (0.00%)

Note. This table contains the frequency and percentages of extreme item infit and outfit statistics indicating underfit for ORIG, DEG-20, DEG-50, and DEG-70. The percentages were calculated by dividing the frequency by 1,120 (20 items * 56 days). Cutoff scores for extreme mean square (Msq) item fit statistics are from Bond and Fox (2007). Cutoff scores for extreme t-standardized (Std) item fit statistics are from Linacre (2007).

Closer investigation revealed that nearly all of the extreme standardized item infit statistics were associated with two items, Aroused and Calm. For ORIG, only 12 of the 103 extreme standardized infit statistics were not associated with Aroused or Calm. These 12 extreme standardized item infit statistics were associated with Active (1), Love (2), and Euphoric (9).

To examine the extreme item outfit statistics for Aroused and Calm, we compiled a list of the persons who contributed most to the MNSQ outfit for both Aroused and Calm for each day in ORIG (Winsteps Table 10.4; Linacre, 2014). For Aroused, 61% of persons contributed to item outfit on zero days, 26% of persons contributed to item outfit on one to three days, 7% of persons contributed to item outfit on four to seven days, and 6% of persons contributed to item outfit on eight or more days. For Calm, 50% of persons contributed to item outfit on zero days, 32% of persons contributed to item outfit on one to three days, 9% of persons contributed to item outfit on four to seven days, and 8% of persons contributed to item outfit on eight or more days. These results suggest that most participants contributed to the item outfit on zero or a small number of days. However, a small group of participants contributed to the item outfit on a large number of days in the study. Surprisingly, a few participants contributed to the item outfit of Aroused or Calm on 54 or 55 of the 56 days of the study. This suggests to us that the number of extreme standardized outfit statistics may have been caused by a small group of participants who responded strangely to Aroused and Calm.

Outcome Measures for Persons

We examined trait levels, standard errors of estimates for trait levels, and person infit and outfit as the outcome measures for persons. Please note that all analyses of person measures were conducted excluding estimates for which there was no information on which to base those estimates (−2 status code; Linacre, 2014). Estimates obtained without information might be extreme, which could cause the results to appear to have a greater range and variability than actually exists. This could introduce a confounding factor into our study, particularly as degradation is associated with a decrease in information, which makes extreme estimates obtained without information more likely. Thus, we excluded estimates obtained without information to obtain a more accurate picture of how degradation influenced PCM person estimates (2202 individual theta values were excluded from ORIG, 2217 individual theta values were excluded from DEG-20, 2263 individual theta values were excluded from DEG-50, and 2879 individual theta values were excluded from DEG-70, of a possible 17192 individual theta values [307 persons * 56 days]).

Positive Affect trait level estimates

A person's trait level represents the location of a person on the logit scale. Positive Affect trait level estimates ranged from θ = −6.22 to θ = 6.19 for the ORIG analyses (M = .07, SD = 1.42). To investigate the change in Positive Affect trait level estimates between the data sets, the average trait levels for each participant were computed for ORIG, DEG-20, DEG-50, and DEG-70. Then absolute change scores from the average ORIG PA trait levels to the average DEG-20, DEG-50, and DEG-70 PA trait levels were computed for every participant.

Under degradation, Positive Affect trait level estimates showed a pattern of results similar to those of item location. The average change in trait level estimates from the ORIG estimates increased as the level of degradation increased (see Table 6). The maximum change from the average ORIG person trait level increased with the level of degradation to the extent that DEG-70 had a maximum trait level change of around two logits, which is quite large. Degradation also led to increased variability in the magnitude of the change from ORIG trait level estimates.

Table 6. Average, Maximum, and Standard Deviation of Changes in Person Trait Level from ORIG to Degraded Analyses.

	Average Absolute Change from ORIG	Max Absolute Change from ORIG	SD of Absolute Change from ORIG
DEG-20	0.03	0.24	0.04
DEG-50	0.07	0.90	0.10
DEG-70	0.12	2.04	0.21

Note. The values in this table represent the average, maximum, and standard deviation of the absolute changes from the average person trait level estimates of the ORIG analyses to the average person trait level estimates for DEG-20, DEG-50, and DEG-70 analyses.

It is possible that the Positive Affect trait level estimates of more extreme persons were affected by degradation differently than the Positive Affect trait level estimates of less extreme persons. This may occur because extreme responses are scarce in the original data, and degradation thins the entire distribution of raw responses, causing the majority of the more extreme observations to disappear. Thus, without their more extreme item responses, the trait levels for more extreme persons after degradation may change more than the trait levels for less extreme persons. To examine this, we correlated the absolute average ORIG trait level with the absolute change in trait level from ORIG to DEG-20, ORIG to DEG-50, and ORIG to DEG-70. The resulting correlations were strong, positive, and significant, r(305) = .61, p < .001, r(305) = .63, p < .001, and r(305) = .65, p < .001, respectively. This indicates that more extreme persons did experience more change in their estimated trait level due to degradation than less extreme persons.
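The extremity analysis above can be sketched as a Pearson correlation between each person's absolute average ORIG trait level and the absolute change in that average under degradation. The function names and the toy values in the usage note are hypothetical; the significance test is omitted.

```python
import math
from typing import List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def extremity_change_correlation(orig_means: List[float],
                                 degraded_means: List[float]) -> float:
    """Correlate |average ORIG trait level| with |change from ORIG| across persons."""
    extremity = [abs(theta) for theta in orig_means]
    change = [abs(d - o) for o, d in zip(orig_means, degraded_means)]
    return pearson_r(extremity, change)
```

In a toy case where degraded averages shrink toward zero in proportion to their distance from it, such as ORIG averages [-2, -1, 0, 1, 2] and degraded averages [-1.6, -0.8, 0, 0.8, 1.6], the correlation is 1. With 307 persons, the degrees of freedom are 307 − 2 = 305, matching the values reported above.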

Standard errors of the Positive Affect trait level estimates

The standard error of the Positive Affect trait level estimates reflects the precision associated with that estimate. The standard error of the Positive Affect trait level estimates was computed for each person for each of the 56 days for ORIG, DEG-20, DEG-50, and DEG-70.

The average standard errors of trait level estimates increased as the level of degradation increased, indicating that estimates became less precise as degradation increased (see Table 7). The minimum and maximum standard errors were similar for all degradation levels. The standard deviation indicated that the standard errors became more variable as degradation increased. Thus, increased degradation led to lower and more variable precision in trait level estimates.

Table 7. Average, Minimum, Maximum, and Standard Deviation of Standard Errors of Person Trait Level Estimates for Each Analysis.

	Average SE	Min of SEs	Max of SEs	SD of SEs
ORIG	0.32	0.26	1.97	0.20
DEG-20	0.37	0.26	1.96	0.21
DEG-50	0.51	0.28	1.98	0.35
DEG-70	0.82	0.35	2.00	0.38

Note. The values in this table represent the average, minimum, maximum, and standard deviation of the standard errors of person trait level estimates for ORIG, DEG-20, DEG-50, and DEG-70.

Person infit and outfit

Person infit and outfit statistics identify the extent to which patterns of item responses conform to expected responses for each person (Embretson and Reise, 2000). Unexpected item response patterns can be those that are too dissimilar to expected responses or too similar to expected responses. For example, if a person is expected to respond with low levels of Positive Affect on items but they consistently respond with high levels of Positive Affect on items, they have an unexpected response pattern. Person infit and outfit are calculated using the same equations as item infit and outfit and are traditionally interpreted using the same guidelines; infit or outfit of about one is ideal, less than one is considered overfit, indicating that the item response pattern is fitting the expected pattern too well, and greater than one is considered underfit, indicating that the item response pattern is not fitting the expected pattern as well as it should. The customized cutoffs suggested by Smith, Schumacker, and Bush (1998) for specific sample sizes and number of items could be used to assess infit and outfit while minimizing type I errors, but the calculation of customized fit statistic cutoffs was impractical in this study because of changing sample sizes by day due to missing data (307 persons * 56 days * 4 degradation levels = 68,768 customized cutoffs). To identify extreme mean square infits and outfits, values were compared to a cutoff score of 1.70 suggested by Bond and Fox (2007) and used in previous work by Erbacher et al. (2012). Recall that comparing mean square infits and outfits to a fixed cutoff such as 1.70 ignores important factors like sample size and the number of items and that the results should be interpreted cautiously. We also investigated t-standardized person infit and outfit, another measure of unexpected response patterns. The t-standardized person infit and outfit values were compared to a cutoff score of ±1.96 (Linacre, 2007), which was also used in the anchoring study by Erbacher et al. (2012).

The percentages of extreme person infit and outfit statistics indicating underfit for ORIG, DEG-20, DEG-50, and DEG-70 are presented in Table 8. As with item infit and outfit, we see that there are fewer extreme person infits and outfits as the level of degradation increases. This is because the process of degradation causes the majority of the more extreme observations to be omitted, leading to fewer extreme person infits and outfits.

Table 8. Frequency and Percentage of Extreme Person Infit and Outfit Statistics for Each Analysis.

	Person Std Infit >1.96	Person Std Outfit >1.96	Person Msq Infit >1.70	Person Msq Outfit >1.70
ORIG	1155 (6.72%)	931 (5.42%)	1203 (7.00%)	1056 (6.14%)
DEG-20	972 (5.65%)	786 (4.57%)	1163 (6.76%)	1057 (6.15%)
DEG-50	597 (3.47%)	487 (2.83%)	1103 (6.42%)	1054 (6.13%)
DEG-70	220 (1.28%)	213 (1.24%)	685 (3.98%)	694 (4.04%)

Note. This table contains the frequency and percentages of extreme person infit and outfit statistics indicating underfit for ORIG, DEG-20, DEG-50, and DEG-70. The percentages were calculated by dividing the frequency by 17,192 (307 persons × 56 days). Cutoff scores for extreme mean square (Msq) person fit statistics are from Bond and Fox (2007). Cutoff scores for extreme t-standardized (Std) person fit statistics are from Linacre (2007).

Discussion

The results of the current study show that creating additional missing completely at random data through degradation caused partial credit model item and person parameter estimates to diverge from the original estimates. The precision of these estimates decreased as the level of degradation increased. Although every level of degradation was associated with some degree of undesirable parameter characteristics, degradation of the data by 70 percent was by far the worst case. The results of this study showed that MCAR, a type of missing data that was previously thought to be ignorable (Howell, 2007) and to have negligible effects (Enders and Bandalos, 2001), may cause severe problems in obtaining stable, precise parameter estimates and drawing valid interpretations from those parameter estimates of Rasch models. Assuming the results of the present study may reoccur in other studies, researchers who have MCAR data should not assume that it can be ignored and should be aware that their estimates may suffer in terms of stability, precision, and validity. Consider the implications of this in practice: MCAR data on a high-stakes test that is used to make admission decisions at institutions of higher education could cause less precise person trait level measures, leading to the admission of students who would not otherwise be admitted or the rejection of students who would otherwise be admitted.

The hybrid approach introduced in this study pairs controlled degradation with realistic, psychological data, combining the benefits of simulation research with the advantages of using empirical data. It enables one to calculate the parameter estimates for a study under two conditions, the original data and the data with additional missing data introduced. This allows the magnitude of the effect of a certain type of missing data to be calculated for any data set. An advantage of this hybrid method is that the results of testing the effects of missing data in empirical psychological data are potentially more generalizable than results obtained with simulated data. Psychological data are problematic in ways that simulated data may not be able to capture. As a result, we have more confidence that effects found in real psychological data sets apply to other psychological data sets. The hybrid approach is easy to use and is a powerful tool for substantive researchers.
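The degradation step of the hybrid approach can be sketched as below for the responses to one item on one day. This is a minimal illustration under stated assumptions, not the code used in the study: the function name is hypothetical, None marks a missing response, and the positions to blank are drawn from all persons, so a response that was already missing may be selected again.

```python
import random
from typing import List, Optional


def degrade_mcar(responses: List[Optional[int]],
                 proportion: float,
                 rng: random.Random) -> List[Optional[int]]:
    """Return a copy with a random `proportion` of positions set to missing (None)."""
    degraded = list(responses)  # leave the original data untouched
    n_remove = round(proportion * len(degraded))
    # Assumption: positions are sampled without regard to the response value
    # or to whether the response was already missing.
    for index in rng.sample(range(len(degraded)), n_remove):
        degraded[index] = None
    return degraded
```

Applying the function separately to every item on every day with proportions of 0.20, 0.50, and 0.70 would yield data sets analogous to DEG-20, DEG-50, and DEG-70, which can then be estimated and compared with the original estimates.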

A point to consider is that some of the Positive Affect items measured in this study performed better than others. Although most of the Positive Affect items were from the Positive and Negative Affect Schedule (PANAS; Watson, Clark, and Tellegen, 1988), there were 12 additional Positive Affect (PA) items, mostly from the Circumplex Model of Emotion (Larsen and Diener, 1992). The additional items were added to this study in an attempt to more comprehensively measure the affect of older adults. These additional items included lower arousal affect items like Calm and Content and higher arousal affect items like Aroused and Euphoric. Some of the measured Positive Affect items were associated with more missing data or more extreme item infit statistics than other items. For the most part, the PANAS items and the other Positive Affect items performed similarly in terms of missing data and item infit statistics; however, the worst PA items in terms of missing data and the worst PA items in terms of item infit were all non-PANAS items. For example, Elated, Aroused, and Euphoric (all non-PANAS items) were the three items with the highest proportions of missing data, while Active, Alert, and Happy (two PANAS items and one non-PANAS item) were the three items with the lowest proportions of missing data. Additionally, Calm, Aroused, and Euphoric (all non-PANAS items) were the items that were associated with the most extreme standardized item infit statistics. Thus, it appears that the PANAS items may be more useful than some of the additional Positive Affect items used in this study.

One limitation of this study is that the results of the anchored analyses under each level of degradation (none, 20%, 50%, 70%) were not on the exact same logit scale, which some might argue makes them unable to be directly compared. Recall that the average item parameters used for anchoring were calculated separately for each data set. For example, the anchoring item parameters for the DEG-20 analyses were computed from the data that was degraded by an additional 20 percent. However, each of the degraded data sets was a degradation of the original data, so in some ways, each degraded analysis was nested within the analysis of the original data. Thus, although the logit scale was different for each data set, it was different only because of the degradation of the data. The alternative to separately anchoring each data set would be to use the anchored logit scale from the original data set for each degraded data set, which unrealistically assumes that degradation does not affect item parameters. We chose to compare the anchored estimates from each data set, despite the differing scales, to allow the average item parameters to be influenced by degradation and permit exploration of that influence.

A more general limitation of the present study is that we only examined MCAR missing data, which requires strict assumptions that are not often justifiable in psychology research (Muthén, Kaplan, and Hollis, 1987). Howell (2007) makes a point of clarifying that although MCAR data is often categorized as ignorable missing data, this does not mean we can ignore the problem of missing data. It simply means we should concern ourselves with improving parameter estimates without creating a model for the underlying missing data mechanisms. Imagine a situation in which some of a researcher's data is lost due to a randomly occurring technical error or an absentminded research assistant, resulting in a large amount of MCAR data. Although truly MCAR situations such as this might be rare, they do exist and it is important to understand the effect that MCAR data might have on the quality of parameter estimates. In addition, we hope that the current paper serves as a springboard for further research on how differing types of missing data affect Rasch model estimates.

Future researchers should examine the impact of sample size on PCM estimates. One method for doing so would be to conduct a study that degrades real data, cross-sectional or longitudinal, by a certain percentage of the sample size by removing some participants' data using the hybrid method presented in this paper. For example, a researcher could obtain PCM estimates using the total sample size and again after randomly removing 20 percent of the participants' data, rather than degrading individual responses to items as we did in the current study. This procedure could detect the degree to which PCM estimates become unstable as a function of sample size and identify whether some participants have more destabilizing influence on estimates than others.

Future researchers could also examine the impact that number of occasions has on PCM estimates. This would require the analysis of longitudinal data. Estimates would once again be compared, but this time after randomly selecting entire occasions to remove, rather than participants or responses. It is possible that some occasions are more or less informative than other occasions. These lines of research could provide information on the sample size and number of occasions necessary for precise and accurate estimates under varying conditions of missing data.

Another direction for future researchers might be to look at Rasch model estimates under MAR and MNAR data. This could be accomplished using a method similar to the one used in the present study. For MAR degradation, one could degrade observations on one variable in a way that was dependent on other observed variables. For example, MAR data could be created by selecting responses randomly but only removing them if the participant's score on Interested (which must be observed) was below the midpoint of the scale. To create MNAR degradation, observations could be randomly selected but removed only if the observation lies within a certain range of categories (e.g., only those with a score of 4 or 5 on a 5-point scale). With several levels of degradation, one could investigate how various levels of MNAR and MAR data could bias or influence PCM estimates. The results of MAR degradation might be similar to the results of MCAR degradation presented in this paper, but empirical research is necessary to investigate this possibility. Continued research on how missing data of various types affects Rasch model estimates is needed and would benefit this field of research.
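The MAR and MNAR degradation schemes proposed above can be sketched as follows. Both functions are hypothetical illustrations of the examples in the text, with assumed names and inputs: responses for one item on one day are held in a list, None marks a missing response, and candidates are first selected at random and then removed only if the stated condition holds.

```python
import random
from typing import Collection, List, Optional


def degrade_mar(target: List[Optional[int]],
                covariate: List[Optional[int]],
                proportion: float,
                midpoint: float,
                rng: random.Random) -> List[Optional[int]]:
    """Blank randomly selected target responses only when the observed
    covariate (e.g., the score on Interested) is below the scale midpoint."""
    degraded = list(target)
    n_select = round(proportion * len(degraded))
    for index in rng.sample(range(len(degraded)), n_select):
        score = covariate[index]
        if score is not None and score < midpoint:
            degraded[index] = None
    return degraded


def degrade_mnar(target: List[Optional[int]],
                 proportion: float,
                 categories: Collection[int],
                 rng: random.Random) -> List[Optional[int]]:
    """Blank randomly selected responses only when the response itself falls
    in `categories` (e.g., {4, 5} on a 5-point scale)."""
    degraded = list(target)
    n_select = round(proportion * len(degraded))
    for index in rng.sample(range(len(degraded)), n_select):
        if degraded[index] in categories:
            degraded[index] = None
    return degraded
```

In both sketches the realized proportion of removed responses is at most the selected proportion, because selected responses that fail the condition are kept; a study using such schemes would need to report the realized rates.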

In conclusion, MCAR data can affect partial credit model estimates, causing estimates to diverge with increasing levels of missing responses. This was most drastic at 70% additional degradation when compared to 20% and 50% additional degradation in this data. Future research should examine the sample size and number of occasions necessary for accurate estimates given data degradations and the effect of missing at random and missing not at random degradations on Rasch model estimates. The effects of missing data, a common issue in many fields of research, on Rasch model parameter estimates may include severe problems in obtaining accurate parameter estimates and drawing valid interpretations.

Contributor Information

Sarah L. Thomas, University of Virginia

Karen M. Schmidt, University of Virginia

Monica K. Erbacher, James Madison University

Cindy S. Bergeman, University of Notre Dame

References

An X, Yung Y. Item response theory: What it is and how you can use the IRT procedure to apply it. Presented at the annual meeting of the SAS® Global Forum; Washington, DC. 2014 Mar.
Baker FB, Kim SH. Item response theory: Parameter estimation techniques. 2nd. New York, NY: Dekker; 2004.
Bergeman CS, Deboeck PR. Trait stress resistance and dynamic stress dissipation on health and well-being: The reservoir model. Research in Human Development. 2014;11:108–125. doi: 10.1080/15427609.2014.906736.
Bond TG, Fox CM. Applying the Rasch model: Fundamental measurement in the human sciences. 2nd. Mahwah, NJ: Lawrence Erlbaum; 2007.
DeMars C. Missing data and IRT item parameter estimation. Presented at the annual meeting of the American Educational Research Association; New Orleans, LA. 2002 Apr.
Embretson SE, Reise S. Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum; 2000.
Enders CK. Analyzing longitudinal data with missing values. Rehabilitation Psychology. 2011;56:267–288. doi: 10.1037/a0025579.
Enders CK, Bandalos DL. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling. 2001;8:430–457.
Erbacher MK, Schmidt KM, Boker SM, Bergeman CS. Measuring positive and negative affect in older adults over 56 days: Comparing trait level scoring methods using the partial credit model. Journal of Applied Measurement. 2012;13:146–164.
Gottfredson NC, Bauer DJ, Baldwin SA. Modeling change in the presence of nonrandomly missing data: Evaluating a shared parameter mixture model. Structural Equation Modeling. 2014;21:196–209. doi: 10.1080/10705511.2014.882666.
Howell DC. The analysis of missing data. In: Outhwaite W, Turner S, editors. Handbook of social science methodology. London, UK: Sage; 2007.
Larsen RJ, Diener E. Promises and problems with the circumplex model of emotion. In: Clark MS, editor. Emotion. Vol. 13. Newbury Park, CA: Sage; 1992. pp. 25–59.
Linacre JM. Winsteps® Rasch measurement [Computer software]. Beaverton, OR: Winsteps.com; 2014.
Manly CA, Wells RS. Reporting the use of multiple imputation for missing data in higher education research. Research in Higher Education. 2014 Jul 20;2014:1–13.
Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;47:149–174.
Masters GN, Wright BD. The partial credit model. In: van der Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. New York, NY: Springer; 1996. pp. 101–122.
Muthén B, Kaplan D, Hollis M. On structural equation modeling with data that are not missing completely at random. Psychometrika. 1987;52:431–462.
Padgett CR, Skilbeck CE, Summers MJ. Missing data: The importance and impact of missing data from clinical research. Brain Impairment. 2014;15:1–9.
Pickles A. Missing data, problems and solutions. In: Kempf-Leonard K, editor. Encyclopedia of social measurement. Amsterdam, Holland: Elsevier; 2005. pp. 689–694.
Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research; 1960. Expanded edition, 1980. Chicago, IL: University of Chicago Press.
Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
Schafer JL, Graham JW. Missing data: Our view of the state of the art. Psychological Methods. 2002;7:147–177.
Simpson J, Speake J. Oxford dictionary of proverbs. 5th. Oxford, UK: Oxford University Press; 2008.
Smith RM, Schumacker RE, Bush MJ. Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement. 1998;2:66–78.
Watson D, Clark LA, Tellegen A. Development and validation of brief measures of Positive and Negative Affect: The PANAS scales. Journal of Personality and Social Psychology. 1988;54:1063–1070. doi: 10.1037//0022-3514.54.6.1063.
Wright BD, Masters GN. Rating scale analysis. Chicago, IL: MESA Press; 1982.
Yang M, Wang L, Maxwell SE. Bias in longitudinal data analysis with missing data using typical linear mixed-effects modelling and pattern-mixture approach: An analytical illustration. British Journal of Mathematical and Statistical Psychology. 2014;68:1–22. doi: 10.1111/bmsp.12043.

