Author manuscript; available in PMC: 2023 Jun 6.
Published in final edited form as: Arch Phys Med Rehabil. 2021 Jun 16;103(5 Suppl):S84–S107.e38. doi: 10.1016/j.apmr.2021.03.044

Examination of the Measurement Equivalence of the Functional Assessment in Acute Care MCAT (FAMCAT) Mobility Item Bank Using Differential Item Functioning Analyses

Jeanne A Teresi 1,2,3,4, Katja Ocepek-Welikson 2, Mildred Ramirez 2,3, Marjorie Kleinman 4, Chun Wang 5, David Weiss 6, Andrea Cheville 7
PMCID: PMC10243473  NIHMSID: NIHMS1895109  PMID: 34146534

Abstract

Objective:

To assess differential item functioning (DIF) in an item pool measuring the mobility of hospitalized patients across educational, age, and gender groups.

Design:

Measurement evaluation cohort study. Content experts generated DIF hypotheses to guide the interpretation. The graded response item response theory (IRT) model was used. Primary DIF tests were Wald statistics; sensitivity analyses were conducted using the IRT ordinal logistic regression procedure. Magnitude and impact were evaluated by examining group differences in expected item and scale score functions.

Setting:

Hospital-based rehabilitation

Participants:

2216 hospitalized patients

Main Outcome Measures:

111 self-reported mobility items

Results:

Two linking items among those used to set the metric across forms evidenced DIF for gender and age: ‘difficulty climbing stairs step-over-step without a handrail (alternating feet)’ and ‘difficulty climbing 3 to 5 steps without a handrail’. Conditional on the mobility state, the items were more difficult for women and older people (aged 65 and over). An additional 18 items were identified with DIF. Items with both high DIF magnitude and hypotheses related to age were difficulty: ‘crossing road at a 4-lane traffic light with curbs’; ‘jumping/landing on one leg’; ‘strenuous activities’; ‘descending 3–5 steps with no handrail’. Although DIF of higher magnitude was observed for several items, the impact was relatively small and the exposure rate for the most problematic items was low (0.35, 0.27 and 0.20).

Conclusions:

This was the first study to evaluate measurement equivalence of the hospital-based rehabilitation mobility item bank. Although 20 items evidenced high magnitude DIF, five related to stairs, the impact was minimal; however, it is recommended that such items be avoided in the development of short-form measures. No items with salient DIF were removed from calibrations, thus supporting the use of the item bank across groups differing in education, age, and gender. The bank may thus be useful to assist clinical assessment and decision-making regarding risk for specific mobility restrictions at discharge as well as identifying mobility-related functions targeted for post-discharge interventions. Additionally, with the goal of avoiding long and burdensome assessments for patients and clinical staff, these results could be informative for those using the item bank to construct short forms.

Keywords: mobility item bank, differential item functioning, item response theory, education, hospitalized patients, rehabilitation


Mobility becomes more difficult in advanced age,1 and the projected growth in the number of older people will likely result in an increased prevalence of people with mobility-associated disabilities. Accordingly, mobility assessment is critically important because lower extremity performance has been considered a robust indicator of health in aging.2

Disability is often experienced as a progressive and incremental process3 because with age, increase in health risks and disease can result in disability. Individuals confronting disabilities due to illness or physical frailty with age4 are at a higher risk of experiencing inability to remain independent, greater challenges for successful aging and diminished quality-of-life.5 Mobility is a key component of the operational definition of disability because of its central role in the performance of activities of daily living.6,7 Evidence from gerontological research points to mobility as a robust predictor of disability, institutional placement and mortality.1,4,8

Mobility assessment has implications clinically and conceptually, serving as an explanatory or a preventative mechanism of disability, and thus providing crucial information regarding medical and rehabilitation needs. Mobility status can inform a wide range of adjustments and adaptations necessary in order to preserve function, independence, autonomy, community participation, and involvement. Thus, community-based integrated interventions including rehabilitation services can benefit from an accurate and effective mobility item bank developed and evaluated using advanced psychometric methods.9 Moreover, mobility is an integral element of health, function, and quality of life, and holds particular relevance for clinical and public policy research, as well as for health outcomes evaluation (see 10–14). An important first step in the development of a mobility item bank is to ensure that the items are performing equivalently across studied groups. Measurement equivalence at the item level is a prerequisite to valid assessment.

Differential Item Functioning (DIF) in Mobility Item Banks and Short-form Measures

DIF is a method used to examine the performance of items in measures across groups varying in socio-demographic characteristics. Because items with DIF may result in imprecise measurement, studies of measurement equivalence are important prerequisites to item bank construction. Two kinds of DIF may be observed: uniform and non-uniform. Uniform DIF indicates that the DIF is in the same direction across the trait measured for the groups studied, e.g., gender, education, age, and involves examination of the “severity” parameters of an item, indicative of whether the item measures best at lower or higher points on the latent continuum measured by the scale. Non-uniform DIF indicates that the DIF is in different directions at varying points along the trait and involves the discrimination parameters, indicative of how well an item relates to the latent attribute measured by the scale, and separates people at discrete points along the attribute continuum. An example of uniform DIF would be if the probability of reporting difficulty walking several blocks is higher for one group (say older persons) than another (say younger persons) given equivalent levels of the mobility trait. An important concept is trait equivalence. Groups must be compared at the same level of impairment in mobility.
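The distinction between uniform and non-uniform DIF can be illustrated with a small sketch. It assumes a binary item following a two-parameter logistic curve; the function name `p_endorse` and all parameter values are hypothetical and are not taken from the study.

```python
import math

def p_endorse(theta, a, b):
    """Two-parameter logistic curve: probability of endorsing a binary item.
    a = discrimination, b = severity (location); both values are illustrative."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Uniform DIF: the groups share the discrimination but differ in severity,
# so the between-group gap has the same sign at every trait level.
uniform_gap = [p_endorse(t, 1.5, 0.0) - p_endorse(t, 1.5, 0.5) for t in thetas]

# Non-uniform DIF: the groups differ in discrimination, so the curves cross
# and the gap changes sign along the trait.
nonuniform_gap = [p_endorse(t, 2.0, 0.0) - p_endorse(t, 1.0, 0.0) for t in thetas]
```

In both cases the comparison is made at the same value of the trait, which is the trait-equivalence requirement described above.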

Studies of DIF in physical function and mobility item banks have identified DIF by age and language of interview. A validation study of a mobility item bank for older patients in primary care found no evidence of non-uniform DIF; however, uniform DIF was observed for items in domains such as changing and maintaining body position, carrying, lifting, and pushing.9 Paz, Spritzer, Morales, and Hays15 found language (Spanish versus English) DIF in 50 of 114 Patient Reported Outcomes Measurement Information System (PROMIS®) physical function items. This bank includes both upper and lower extremity (mobility) performance-related tasks (see 7). Similarly, Jones, Tommet, Ramírez, Jensen, and Teresi16 found that almost all of the 16 items in the PROMIS Physical Function short-form evidenced significant but negligible sex-, race-, ethnicity-, age-, and/or education-related DIF in a sample of cancer patients. The content included walking, bending, kneeling, stooping, and climbing stairs.

DIF across certain groups has been observed in single and/or sets of items assessing mobility that are included in quality of life and in general health measures. The EUROQoL17 mobility item was found to be less likely to be endorsed across nine European countries except for Denmark in the Schizophrenia Outpatient Health Outcomes study.18 DIF of large magnitude was observed in the mobility domain of the EQ-5D (The EUROQoL Group) for cancer type, sex, and age groups.19 Age DIF was evidenced on the Sickness Impact Profile20 items: “I get around only by using a walker, crutches”, “I do not walk up or down hills”, and “I do not get around in the dark or in unlit places without someone’s help”.21 On the other hand, no evidence of age DIF was found for the limitations of physical mobility item of the Nottingham Health Profile,22 a generic measure of health-related quality of life, in a sample of surgery patients.23 Additionally, no age-, sex- or side of lesion- DIF was evidenced in the Rivermead Mobility Index items when examined in a stroke patient sample.24 A more detailed review of DIF in functional measures can be found in Teresi, Ramirez, Lai, and Silver.25

Aims of the Analyses

The DIF assessment was performed separately within three domains: mobility, activity limitation, and applied cognition. This paper describes the analyses of one of the three domains: mobility.

The purpose of these analyses was to examine the item-level performance of the patient-reported outcomes Functional Assessment in Acute Care MCAT (FAMCAT) mobility item bank among different educational, gender, and age groups, with a focus on differential item functioning.

Methods

Sample Generation and Description

This study was approved by the Institutional Review Board of the Hebrew Home at Riverdale, project number 0215/PO104/05, and by that of the Mayo Clinic. Because administration of all items to one group would be burdensome, the total sample of 2216 in-patients was divided into four groups (“batches”). Groups ranged in size from 509 to 669 adults (Table 1). A total of 111 unique items was divided so that each group was assessed on 32 to 38 items, including the eight linking items. All batches contained the same linking items. DIF analyses were conducted within batches. While both groups were examined, the studied (focal) group in the analyses of gender was males and the reference group was females. In the analyses of education, the reference group was some college and above; the studied group was 0 years of education (no formal education) through high school or GED. The reference group for age was 65 to 90; the studied group was age < 65. The details of sample generation and description are in the Appendix.

Table 1.

Mobility Item Bank: Reliability Statistics – Cronbach’s Alpha, Ordinal Alpha, McDonald’s Omega Total, and Explained Common Variance (ECV) for the Total Sample and Demographic Subgroups by Interview Batch (Estimated from the R Psych Package)

| Sample | N | Cronbach’s Alpha | Ordinal Alpha | McDonald’s Omega Total | Explained Common Variance (ECV) |
|---|---|---|---|---|---|
| Batch 1 | | | | | |
| Total Sample | 669 | 0.971 | 0.985 | 0.985 | 78.186 |
| Male | 357 | 0.973 | 0.986 | 0.986 | 71.659 |
| Female | 312 | 0.969 | 0.983 | 0.983 | 78.677 |
| High School Diploma or GED | 247 | 0.968 | 0.982 | 0.982 | 71.446 |
| Some College or Greater | 403 | 0.973 | 0.986 | 0.986 | 78.513 |
| < 65 | 301 | 0.974 | 0.986 | 0.987 | 69.892 |
| 65 + | 368 | 0.968 | 0.982 | 0.983 | 78.018 |
| Batch 2 | | | | | |
| Total Sample | 509 | 0.967 | 0.983 | 0.983 | 79.067 |
| Male | 282 | 0.965 | 0.981 | 0.982 | 76.720 |
| Female | 227 | 0.967 | 0.982 | 0.983 | 75.626 |
| High School Diploma or GED | 176 | 0.967 | 0.981 | 0.982 | 75.687 |
| Some College or Greater | 333 | 0.967 | 0.983 | 0.983 | 76.973 |
| < 65 | 237 | 0.965 | 0.982 | 0.983 | 74.948 |
| 65 + | 272 | 0.965 | 0.979 | 0.980 | 75.890 |
| Batch 3 | | | | | |
| Total Sample | 515 | 0.966 | 0.980 | 0.980 | 64.278 |
| Male | 278 | 0.966 | 0.979 | 0.980 | 48.475 |
| Female | 237 | 0.965 | 0.977 | 0.978 | 71.016 |
| High School Diploma or GED | 165 | 0.964 | 0.976 | 0.977 | 63.204 |
| Some College or Greater | 350 | 0.966 | 0.979 | 0.980 | 58.689 |
| < 65 | 239 | 0.974 | 0.985 | 0.985 | 23.782 |
| 65 + | 276 | 0.956 | 0.970 | 0.972 | 59.054 |
| Batch 4 | | | | | |
| Total Sample | 523 | 0.971 | 0.985 | 0.985 | 83.585 |
| Male | 275 | 0.973 | 0.985 | 0.986 | 80.651 |
| Female | 248 | 0.969 | 0.983 | 0.984 | 77.333 |
| High School Diploma or GED | 169 | 0.972 | 0.984 | 0.984 | 78.093 |
| Some College or Greater | 344 | 0.970 | 0.984 | 0.985 | 82.476 |
| < 65 | 234 | 0.974 | 0.987 | 0.988 | 80.915 |
| 65 + | 279 | 0.968 | 0.981 | 0.982 | 79.262 |

Cronbach’s alpha was computed using Pearson correlations

Ordinal alpha, McDonald’s Omega Total, and ECV were computed using polychoric correlations

Measures

Items were administered using a four-point response scale. There were two types of mobility items. The stem for the first was: “How much DIFFICULTY do you currently have…” The response categories for these items were: 1 – Unable; 2 - A lot; 3 - A little; 4 – None. The stem for the second type was: “How much HELP from another person do you currently need…” The response categories were: 1 – Total; 2 - A lot; 3 - A little; 4 – None. Thus, the scale was scored in the direction of ability.

Only two or three response categories remained for some items after recoding for sparse data. Sparse data was defined as an uneven distribution of responses across categories, and occurred when the combined frequency of the first response category (‘Unable’ or ‘Total’) and the second (‘A lot’) was less than 5%. In Batch 1, frequencies in the first response category were ten or less for 11 items; in Batch 2, there were 2 such items; in Batch 3, there were 4 such items and one additional item with a low frequency in the second response category (‘A lot’); in Batch 4, there were three items with frequencies of ten or less in the first response category. For these items, the first response category was combined with the second or, in one case, the first two categories were combined with the third. There was little missing data. For the analyses, all respondents who provided answers to less than 50% of items were excluded (32 or 4.6% in Batch 1; 33 or 6.1% in Batch 2; 40 or 7.2% in Batch 3; 20 or 3.7% in Batch 4). The majority (80%) of the remaining cases had only 1 or 2 responses missing. The missing item responses for the rest of the cases were prorated, i.e., the mean for each individual case was calculated and used in place of a missing response for an item.
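The exclusion and prorating rules described above can be sketched as follows. This is a minimal illustration, assuming that missing responses are coded as `None` and that the person mean is used without rounding; the function name `prorate` is hypothetical, and the study's actual data-handling code is not shown in the text.

```python
def prorate(responses, min_fraction=0.5):
    """Fill missing item responses with the respondent's own mean.

    responses: list of item scores, with None marking a missing response.
    Returns None when the respondent answered fewer than `min_fraction`
    of the items (the exclusion rule), otherwise the completed list.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) < min_fraction * len(responses):
        return None  # excluded: answered less than 50% of items
    person_mean = sum(answered) / len(answered)
    return [person_mean if r is None else r for r in responses]

# Hypothetical respondents on a four-item form.
kept = prorate([4, None, 2, 3])           # one gap, filled with the mean of 4, 2, 3
excluded = prorate([None, None, None, 4])  # only 25% answered
```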

Procedures and Statistical Approach

Qualitative analyses to generate DIF hypotheses

DIF hypotheses were generated for these analyses by asking a set of 6 content experts with expertise in rehabilitation medicine to indicate whether or not they expected DIF to be present, and the direction of the DIF with respect to several comparison groups: gender, age, race/ethnicity, and education. Hypotheses with respect to length of stay at the hospital were also elicited. However, the range of hospital stays was relatively small (0 to 3 days for the majority of respondents) and did not permit subgroup analyses.

A definition of DIF, and the following instructions related to hypotheses generation were provided.

DIF means that individuals from different groups but who are at the same level of the trait (state) will have different probabilities of endorsing an item. Put another way, reporting difficulty with mobility should depend only on the level of the trait (state), e.g., mobility, and not on membership in a group, male or female. Very specifically, randomly selected persons from each of two groups (e.g., males and females) who are at the same level of mobility should have the same likelihood of reporting difficulty related to performing certain activities, e.g., difficulty using an escalator. If it is theorized that this might not be the case, it would be hypothesized that the item has gender DIF.

A grid containing a row for each of the items and separate columns for each of the referenced groups was developed and distributed to content experts for completion in order to facilitate the rating (see Appendix 1, Table A1).

Because the complexity of items may affect item performance, possibly resulting in DIF, reading level was also examined for the item set as a whole and for those items with DIF. The overall and the item-level readability of both the full set and the mobility items with DIF were assessed by the Flesch-Kincaid Reading Ease Index (FREI) and the Flesch-Kincaid grade level (F-K)26 methods available in Microsoft Word 2016.
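For reference, the two readability indices can be computed from word, sentence, and syllable counts using the standard published Flesch formulas. The sketch below assumes those counts are already available; the scores reported in this paper came from Microsoft Word, whose counting rules may differ, and the example counts are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical item stem: one sentence, 20 words, 30 syllables.
ease = flesch_reading_ease(20, 1, 30)
grade = flesch_kincaid_grade(20, 1, 30)
```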

Quantitative analyses

DIF assessment methods.

Latent variable methods for establishing measurement invariance include two approaches to parameterization: factor analyses (e.g., 27–30) and the methods of DIF.31,32 Because the mobility item bank was developed using a multidimensional item response theory (IRT) model (see 33), an IRT approach to DIF detection was used. Given that the items were on a 4-point ordinal scale, a graded response model34 with comparison of parameters from nested DIF models was the approach adopted. In IRT with binary items, the item characteristic curve (ICC) that relates the probability of an item response to the underlying state, e.g., mobility, measured by the item set is characterized by a discrimination parameter, proportional to the slope of the curve, and a location (severity or difficulty) parameter. In the context of the graded response model, groups are compared using the expected item score functions described below (see Figure 1). An item shows DIF if people from different subgroups but at the same level of the attribute (denoted θ) have unequal probabilities of endorsement. (See 35 for formulas.) Statistical tests of DIF were a variant of the Wald test based on Lord’s chi-square.36–39 A unidimensional IRT model was used for DIF detection for two reasons: small subgroup sample sizes militated against the use of more complex models, and model fit indices supported the essential unidimensionality of the item set (see below and Appendix 2, Table A3).
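For readers unfamiliar with the graded response model, its usual form is sketched below. The notation is an assumption made for illustration (item $i$, ordered categories $k = 1, \dots, m_i$, discrimination $a_i$, and category thresholds $b_{ik}$) and may differ from that of the cited sources.

```latex
% Cumulative probability of responding in category k or higher
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\left[-a_i\,(\theta - b_{ik})\right]},
\qquad k = 2, \dots, m_i,
\qquad P^{*}_{i1}(\theta) = 1, \quad P^{*}_{i,\,m_i+1}(\theta) = 0

% Probability of responding in exactly category k
P(X_i = k \mid \theta) = P^{*}_{ik}(\theta) - P^{*}_{i,\,k+1}(\theta)
```

Under this parameterization, a between-group difference in $a_i$ corresponds to non-uniform DIF and a difference in the $b_{ik}$ to uniform DIF.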

Figure 1.

Mobility Item Bank: Expected Scale and Item Scores

Model assumption of unidimensionality.

Item response theory models, upon which the DIF analyses rely, have associated assumptions including unidimensionality and local independence. The latter implies that the item responses for any pair of items are independent, conditional on the trait level. Model assumptions and fit were tested as described in Appendix 2.

Descriptive analyses.

Item frequencies were examined across batches and subgroups to detect possible problems with sparse data and skew.

DIF detection.

Age, gender, and education DIF were examined with IRTPRO.36 The primary method used for DIF detection was the Wald test for examination of group differences in IRT item parameters. Further details are described in Appendix 3. Best practices for DIF detection in patient-reported outcome measures were adopted for these analyses.40,41
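The logic of a Wald test for a group difference in a single item parameter can be sketched as follows. This is a simplified illustration that assumes independent group estimates and one parameter at a time; IRTPRO's actual procedure uses the estimated covariance matrices and can test several parameters jointly, and the function name and input values here are hypothetical.

```python
import math

def wald_dif_test(est_ref, se_ref, est_focal, se_focal):
    """Wald chi-square (1 df) for a between-group difference in one parameter."""
    diff = est_ref - est_focal
    chi_sq = diff ** 2 / (se_ref ** 2 + se_focal ** 2)
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value

# Hypothetical severity estimates and standard errors for two groups.
chi_sq, p_value = wald_dif_test(1.0, 0.3, 0.0, 0.4)
```

In practice the resulting p-values would then be adjusted for multiple comparisons before an item is flagged.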

Sensitivity analyses for DIF detection.

A second DIF-detection method used in sensitivity analyses was based on ordinal logistic regression (OLR).42,43 (See Appendix 3.)
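The pseudo-R² measures used as OLR effect-size criteria can be computed from model log-likelihoods. The sketch below uses the standard definitions of the McFadden, Cox-Snell, and Nagelkerke measures and assumes, as is common for this procedure, that the quantity of interest is the change between nested models with and without the group term. The function names and log-likelihood values are hypothetical.

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox-Snell, and Nagelkerke pseudo-R2 from log-likelihoods."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}

def r2_change(ll_null, ll_without_group, ll_with_group, n):
    """Change in each pseudo-R2 when the group term is added to the model."""
    base = pseudo_r2(ll_null, ll_without_group, n)
    full = pseudo_r2(ll_null, ll_with_group, n)
    return {name: full[name] - base[name] for name in base}

# Hypothetical log-likelihoods for one item in a sample of 100 respondents.
example = pseudo_r2(ll_null=-100.0, ll_model=-80.0, n=100)
change = r2_change(-100.0, -80.0, -78.0, 100)
```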

Evaluation of DIF magnitude and effect sizes.

Significance tests were accompanied by magnitude measures. Expected item scores were examined as measures of magnitude. (See Figure 1 for examples.) An expected item score is the sum of the weighted (by the response category value) probabilities of scoring in each of the possible categories for the item. The method used for quantification of the difference in the average expected item scores was the non-compensatory DIF (NCDIF) index.44–47
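The expected item score and the NCDIF index can be sketched as follows. The code assumes a graded response model with increasing thresholds and takes NCDIF to be the mean squared difference between the two groups' expected item scores, evaluated at the focal group's trait estimates. All names and parameter values are hypothetical, and the cutoff used to flag an item is not shown.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded response model: probability of each ordered category at theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def expected_item_score(theta, a, thresholds, first_category=1):
    """Sum of category values weighted by their probabilities."""
    probs = grm_category_probs(theta, a, thresholds)
    return sum((first_category + k) * p for k, p in enumerate(probs))

def ncdif(focal_thetas, ref_params, focal_params):
    """Mean squared difference in expected item scores over focal-group thetas."""
    diffs = [expected_item_score(t, *focal_params) - expected_item_score(t, *ref_params)
             for t in focal_thetas]
    return sum(d * d for d in diffs) / len(diffs)

# Hypothetical 4-category item whose thresholds are shifted in the focal group.
ref_params = (1.5, [-1.0, 0.0, 1.0])
focal_params = (1.5, [-0.5, 0.5, 1.5])
value = ncdif([-1.0, 0.0, 1.0], ref_params, focal_params)
```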

Evaluation of DIF impact.

Aggregate-level impact at the scale level was evaluated, examining expected scale score functions. Expected item scores were summed to produce an expected scale score which provides evidence regarding the effect of DIF on the total score. Group differences in scale response functions provide overall aggregated measures of DIF impact. (See Appendix 3 for the expected scale score and item score graphs.)
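A small sketch shows why item-level DIF does not necessarily translate into scale-level impact. It assumes graded response items scored from 1 upward and uses the identity that the expected score equals 1 plus the sum of the cumulative category probabilities, which is equivalent to the category-weighted sum. The items and parameter values are hypothetical and chosen so that DIF in opposite directions cancels.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected score for categories 1..m: 1 plus the cumulative probabilities."""
    return 1.0 + sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def expected_scale_score(theta, items):
    """Sum of expected item scores; items is a list of (a, thresholds) pairs."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)

def max_scale_gap(items_ref, items_focal, thetas):
    """Largest absolute group difference in expected scale score over thetas."""
    return max(abs(expected_scale_score(t, items_ref) - expected_scale_score(t, items_focal))
               for t in thetas)

# Two hypothetical items with DIF in opposite directions across groups.
items_ref = [(1.5, [0.0]), (1.5, [0.5])]
items_focal = [(1.5, [0.5]), (1.5, [0.0])]
gap = max_scale_gap(items_ref, items_focal, [-2.0, -1.0, 0.0, 1.0, 2.0])
```

Here each item differs between groups, yet the expected scale scores coincide, so the gap is zero.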

The following rules were applied to flag items with DIF for potential action such as removal, separate group calibration, or lowering the exposure rate in applications of computerized adaptive testing (CAT). DIF saliency was identified by a) DIF identified after adjustment for multiple comparisons; b) consistent DIF across methods; c) high magnitude of DIF; d) confirmatory hypotheses consistent with the DIF findings. Three levels of DIF salience were identified and indicated in a table with the following indicators: Level 1 (L1) is an indication that DIF was detected by IRTPRO adjusted for multiple comparisons or OLR R² (McFadden Pseudo R² criterion ≥ 0.02; Nagelkerke Pseudo R² criterion ≥ 0.02; or Cox-Snell Pseudo R² criterion ≥ 0.02) or NCDIF above threshold; Level 2 (L2) indicates that DIF was detected by IRTPRO or OLR R² and additionally by NCDIF above threshold; Level 3 (L3) indicates that all three tests indicated DIF. In addition, * indicates a hypothesis was in the same direction as the DIF result. Only items of at least L2 were flagged, indicating that DIF was identified by two methods and evidenced high magnitude. These items were subsequently flagged for one of the disposition actions described above.
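The three salience levels can be expressed as a simple decision rule. The sketch below assumes each of the three tests has already been reduced to a true/false flag; the function names are hypothetical.

```python
def dif_salience(irtpro_flag, olr_flag, ncdif_flag):
    """Return 'L3', 'L2', 'L1', or None from three boolean DIF indicators."""
    if irtpro_flag and olr_flag and ncdif_flag:
        return "L3"  # all three tests indicate DIF
    if (irtpro_flag or olr_flag) and ncdif_flag:
        return "L2"  # one significance-based method plus NCDIF above threshold
    if irtpro_flag or olr_flag or ncdif_flag:
        return "L1"  # any single indicator
    return None

def flagged_for_action(level):
    """Only items at L2 or above are flagged for a disposition action."""
    return level in ("L2", "L3")
```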

Evaluation of reliability.

Reliability was evaluated by decomposing the scale score into the sum of the item scores and the contribution of the common term or communality. McDonald’s48 Omega Total (ωt), a reliability estimate that is based on the proportion of total common variance explained, was also calculated. Both Cronbach’s alpha49 and ordinal alpha based on polychoric correlations50 were calculated. Additionally, IRT-based reliability measures were examined at selected points along the underlying latent continuum. The formulation is based on the definition of reliability as the ratio of true score variance to observed variance (which can be rewritten as one minus the ratio of error variance to total variance) from Lord and Novick.51 The following relationship was used to calculate the conditional reliability, with the latent trait standardized at mean of 0 and variance of 1: $\sigma^2_{\theta} = se(\theta)^2 = 1 / I(\theta)$ and $\pi(\text{reliability}) = I(\theta) / (1 + I(\theta))$ (see 52). The formula uses the standard error of theta (θ) at points from −2.8 to 2.8 at intervals of 0.4, divides the standard error squared by 1 + the standard error squared, and subtracts this value from 1 (see also 53, 38); marginal reliability estimates for the expected a posteriori estimates across subgroups from IRTPRO are also provided.
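The conditional reliability calculation can be sketched directly from the relationship above. The information values passed to the function are hypothetical; in the study they would come from the fitted IRT model at each point on the θ grid.

```python
def conditional_reliability(information):
    """Reliability at a trait level, for a latent trait with variance 1."""
    se_sq = 1.0 / information            # squared standard error of theta
    return 1.0 - se_sq / (1.0 + se_sq)   # equals I / (1 + I)

# Evaluation grid: theta from -2.8 to 2.8 in steps of 0.4 (15 points).
theta_grid = [round(-2.8 + 0.4 * i, 1) for i in range(15)]

# Hypothetical example: information of 9 gives a reliability of 0.9.
example = conditional_reliability(9.0)
```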

Results

Qualitative Results

Readability.

The FREI computed at the item-level for the full set of 111 mobility items and for the 24 mobility items with DIF ranged from 22.8 to 76.2 and from 29 to 76.2, respectively. The FREI computed for all mobility items as a set was 57.4 compared to 55.8 for the set of items with DIF. The theoretical range is 1–100; higher scores are indicative of items that are easier to understand. The item-level F-K grade level indices for the full set of mobility items and for the mobility items with DIF ranged from 4.7 to 16.5 and from 7.3 to 11.6, respectively. The F-K grade level index computed for all mobility items as a set was 9.8 compared to 9.6 for the set of items with DIF.

Hypothesis generation.

Forms were completed by six content experts: one applied psychologist, three doctoral level occupational/physical therapists, a physical medicine and rehabilitation doctor, and a gerontologist. The goal was to identify items that might have a different meaning or not be understood well and/or equivalently by individuals of any of the groups referenced. A summary of the DIF hypotheses is given in Appendix 1, Table A1.

Conditional on mobility, it was hypothesized that older individuals would express greater difficulty ‘crossing the road at a 4-lane traffic light with curbs’, ‘jumping/landing on one leg’, ‘taking part in strenuous activities’, and ‘climbing and descending 3 to 5 steps without a handrail with a walking aid’. No gender or education DIF was posited for the mobility items.

Quantitative Results

Tests of model assumptions

Unidimensionality.

As shown in Appendix 2, Table A2, there was strong support for essential unidimensionality across all comparison socio-demographic groups.

As a further test of dimensionality, model fit for confirmatory factor analyses specifying a unidimensional model was examined. Model fit indices supported the essential unidimensionality of the item set. (See Appendix 2, Table A3.)

Reliability estimates

The reliability estimates were high across batches for all subgroups (Table 1) with McDonald’s Omega total values ranging from 0.972 to 0.988, Cronbach’s alphas from 0.956 to 0.974, and the ordinal alphas based on polychoric correlations were 0.970 to 0.987. The reliability estimates (precision) at points along the latent trait (θ) reflective of where respondents were observed ranged from 0.70 at the highest points of θ (θ = 2.0 to 2.4) to 0.99 (θ = −1.6 to −0.4). The overall reliability estimates for the total sample ranged from 0.93 to 0.95 across batches, and the marginal reliabilities from 0.92 to 0.96 across subgroups (see Appendix 4, Table A4).

IRT Parameter Estimates, Tests of DIF, and Assessment of Magnitude and Impact

Appendix 4, Table A5 shows the discrimination (a) parameters across subgroup comparisons. As shown, the a parameters vary; some items were less discriminating (a below 1.5). Several items were highly discriminating (a > 3.5), and 21 items were most discriminating for respondents aged < 65. (The graded response item parameters and their standard errors for the total sample are shown in Appendix 4, Table A6.)

DIF Results

DIF results are summarized in two tables. Table 2 displays the eight mobility linking/anchor items that were included in all batch samples, with flags for items showing DIF. Table 3 presents only non-linking items with substantial DIF, identified by the two DIF methods described above, and with high magnitude.

Table 2.

Mobility Item Bank: Differential Item Function (DIF) Results for Linking Items Across Four Batches and Demographic Groups

| Variable Name | Question | Batch 1 (Gender, Education, Age) | Batch 2 (Gender, Education, Age) | Batch 3 (Gender, Education, Age) | Batch 4 (Gender, Education, Age) |
|---|---|---|---|---|---|
| Q72M | How much DIFFICULTY do you currently have walking quickly indoors to answer the telephone? | | | | |
| Q74M | How much DIFFICULTY do you currently have walking around inside a building (50 feet, or 16 meters) on the same level (e.g., hospital hallway, around a doctor’s office or supermarket)? | | | | |
| Q75M | How much DIFFICULTY do you currently have taking 10 steps without a walking aid? | | | | |
| Q76M | How much DIFFICULTY do you currently have walking around one floor of your home, taking into consideration thresholds, doors, furniture, and a variety of floor coverings? | | | | |
| Q79M | How much DIFFICULTY do you currently have going up and down a flight of stairs inside, using a handrail with your walking aid? | | | | |
| Q80M | How much HELP from another person do you currently need climbing 1 step with a railing? | | | | |
| Q81M | How much DIFFICULTY do you currently have climbing stairs step-over-step without a handrail? (alternating feet) | Age: L2; U; Older more | | Age: L3; U; Older more | Gender: L2; U; F more |
| Q82M | How much DIFFICULTY do you currently have climbing 3 to 5 steps without a handrail? | Gender: L3; U; F more | Age: L2 *; U; Older more | Age: L2 *; U; Older more | |

U = Uniform DIF; NU = Non-Uniform DIF

L1 (Level 1) DIF indicated by IRTPRO method adjusted for multiple comparisons or ordinal logistic regression R² (McFadden Pseudo R² criterion ≥ 0.02; or Nagelkerke Pseudo R² criterion ≥ 0.02; or Cox-Snell Pseudo R² criterion ≥ 0.02) or NCDIF above threshold; L2 (Level 2) DIF indicated by IRTPRO or ordinal logistic regression R² and additionally by NCDIF above threshold; L3 (Level 3) All three tests indicate DIF

* Hypothesis is in the same direction as the DIF result

Table 3.

Mobility Item Bank: Differential Item Function (DIF) Results (Only Items Showing DIF that are Not in Table 2 are Included)

| Variable Name | Question | Gender: DIF Hypotheses by Experts | Gender: Stronger DIF Flag | Education: DIF Hypotheses by Experts | Education: Stronger DIF Flag | Age: DIF Hypotheses by Experts | Age: Stronger DIF Flag |
|---|---|---|---|---|---|---|---|
| Batch 1 | | | | | | | |
| Q83M | How much DIFFICULTY do you currently have climbing 3 to 5 steps without a handrail with your walking aid? | | L2; U; F more | | | | |
| Q86M | How much DIFFICULTY do you currently have carrying something in both arms while climbing a flight of stairs (e.g., laundry basket)? | | L2; U; F more | | | | |
| Q110M | How much DIFFICULTY do you currently have getting into and out of truck, shuttle van, or sport utility vehicle? | | L3; U; F more | | | | |
| Batch 2 | | | | | | | |
| Q162M | How much DIFFICULTY do you currently have lifting and carrying a moderate size suitcase by the handle while walking from the house to the car? | | | | | | L2; U; Older less |
| Q168M | How much DIFFICULTY do you currently have standing for 3 to 4 minutes without support? | | | | L2; NU; High ed. more discr. | | |
| Q177M | How much DIFFICULTY do you currently have taking 10 steps with your walking aid? | | | | L2; U; Low educ. less | | |
| Batch 3 | | | | | | | |
| Q232M | How much DIFFICULTY do you currently have opening a high window above shoulder height, while standing? | | L3; U; F more | | | | |
| Q233M | How much DIFFICULTY do you currently have loading or unloading a car trunk or hatchback (e.g., packages or equipment)? | | L2; U; F more | | | | |
| Q236M | How much HELP from another person do you currently need to walk to different parts of the hospital (e.g., gift shop, cafeteria)? | | | | | | L2; U; Older more |
| Q239M | How much DIFFICULTY do you currently have crossing the road at a 4-lane traffic light with curbs? | | | | | 2; Older more difficult | L3 *; U; Older more§ |
| Q241M | How much DIFFICULTY do you currently have cleaning up spills on the floor (e.g., with a rag or mop)? | | L2; U; F less | | | | |
| Q247M | How much DIFFICULTY do you currently have jumping/landing on one leg? | | | | | 2; Older more difficult (1) | L2 *; U; Older more§ |
| Q248M | How much DIFFICULTY do you currently have taking part in strenuous activities (e.g., running 3 miles, swimming half mile, etc.)? | | | | | 2; Older more difficult (1) | L2 *; U; Older more |
| Q250M | How much DIFFICULTY do you currently have descending 3 to 5 steps without a handrail with your walking aid? | | | | | 2; Older more | L2 *; U; Older more§ |
| Q253M | How much DIFFICULTY do you currently have sitting down on and standing up from a chair with arms (e.g., wheelchair, or bedside chair)? | | L2; U; F less | | | | |
| Batch 4 | | | | | | | |
| Q303M | How much DIFFICULTY do you currently have pulling open a heavy door? | | L3; U; F more | | | | |
| Q307M | How much DIFFICULTY do you currently have doing heavy housework or repairs (e.g., painting, washing windows inside and out, shoveling snow, or assembling new furniture or appliances)? | | L2; U; F more | | | | |
| Q327M | How much DIFFICULTY do you currently have getting up from the floor (e.g., if you fell) with your walking aid? | | | | | | L2; U; Older more |

U = Uniform DIF; NU = Non-Uniform DIF

L1 (Level 1) DIF indicated by IRTPRO method adjusted for multiple comparisons or ordinal logistic regression R² (McFadden Pseudo R² criterion ≥ 0.02; or Nagelkerke Pseudo R² criterion ≥ 0.02; or Cox-Snell Pseudo R² criterion ≥ 0.02) or NCDIF above threshold; L2 (Level 2) DIF indicated by IRTPRO or ordinal logistic regression R² and additionally by NCDIF above threshold; L3 (Level 3) All three tests indicate DIF

* Hypothesis is in the same direction as the DIF result

§ Low exposure rate

Among the linking items used to set the metric across forms, two showed significant DIF. The item ‘difficulty climbing stairs step-over-step without a handrail (alternating feet)’ showed age DIF in the Batch 1 and Batch 3 samples: conditional on the mobility state, the item was more difficult for the older respondents. It also showed gender DIF in Batch 4, where the item was more difficult for women. The item ‘difficulty climbing 3 to 5 steps without a handrail’ exhibited age DIF in the Batch 2 and Batch 3 samples; the item was more difficult for older respondents, as was hypothesized. In the Batch 1 sample, the item was more difficult for women.

In addition to the linking items, the following items showed DIF for gender; conditional on mobility, the items were more difficult for women, and DIF had not been hypothesized a priori: ‘difficulty climbing 3 to 5 steps without a handrail with your walking aid’; ‘carrying something in both arms while climbing a flight of stairs (e.g., laundry basket)’; ‘getting into and out of a truck, shuttle van, or sport utility vehicle’; ‘opening a high window above shoulder height, while standing’; ‘loading or unloading a car trunk or hatchback (e.g., packages or equipment)’; ‘pulling open a heavy door’; ‘doing heavy housework or repairs (e.g., painting, washing windows inside and out, shoveling snow, or assembling new furniture or appliances)’. Two items, ‘difficulty cleaning up spills on the floor (e.g., with a rag or mop)’; ‘sitting down on and standing up from a chair with arms (e.g., wheelchair, or bedside chair)’ were more difficult for men and the DIF was also not hypothesized.

The item ‘standing for 3 to 4 minutes without support’ was more discriminating for respondents with more education and the item ‘taking 10 steps with your walking aid’ was more difficult for respondents with more education.

An additional seven items showed age DIF; all the items were more difficult for the older respondents, conditional on mobility level. The following four items that showed DIF were also hypothesized to be more difficult for the older respondents: ‘crossing the road at a 4-lane traffic light with curbs’; ‘jumping/landing on one leg’; ‘taking part in strenuous activities (e.g., running 3 miles, swimming half mile, etc.)’; ‘descending 3 to 5 steps without a handrail with your walking aid’. The items: ‘lifting and carrying a moderate size suitcase by the handle while walking from the house to the car’; ‘help needed to walk to different parts of the hospital (e.g., gift shop, cafeteria)’; and, ‘getting up from the floor (e.g., if you fell) with your walking aid’ showed DIF, but were not hypothesized to do so.

Because local dependencies (LDs; defined in Appendix 2) can result in over-identification of DIF, items with high LDs were examined to see if this could be a reason for DIF. A high LD between the item ‘climbing 3 to 5 steps without a handrail’ and the item ‘climbing 3 to 5 steps without a handrail with your walking aid’ was observed; both items showed gender DIF.

Aggregate impact

As shown in Figure 1, there was no evident scale level impact for gender, education, or age DIF in any of the batches. All group curves were overlapping for all comparisons.
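Scale-level impact of this kind is judged by comparing expected scale score functions between groups. The following Python sketch shows one way such curves could be computed under the graded response model; it is an illustration under stated assumptions, not the authors' procedure. The item parameters are invented, and the function names and the theta grid are hypothetical. It uses the identity that the expected score of an item scored 0..m equals the sum of the cumulative probabilities P(X ≥ k).

```python
import math


def grm_expected_item_score(theta, a, thresholds):
    """Expected score (categories 0..m) for one graded response model item.

    a is the discrimination; thresholds are the m ordered location parameters.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def expected_scale_score(theta, items):
    """Sum of expected item scores; items is a list of (a, thresholds) pairs."""
    return sum(grm_expected_item_score(theta, a, bs) for a, bs in items)


def max_scale_difference(ref_items, focal_items, grid=None):
    """Largest absolute gap between the two groups' expected scale score curves."""
    if grid is None:
        grid = [-4.0 + 0.1 * i for i in range(81)]  # theta from -4 to 4 (assumed range)
    return max(
        abs(expected_scale_score(t, ref_items) - expected_scale_score(t, focal_items))
        for t in grid
    )


# Invented parameters: one DIF-free item plus one item with shifted thresholds
reference = [(2.0, [-1.0, 0.0, 1.0]), (1.5, [-0.5, 0.5, 1.5])]
focal = [(2.0, [-1.0, 0.0, 1.0]), (1.5, [-0.2, 0.8, 1.8])]
gap = max_scale_difference(reference, focal)
```

Overlapping group curves correspond to a gap near zero across the theta range. In a long scale, DIF in a few items contributes little to the summed curve, which is consistent with item-level DIF accompanied by little scale-level impact.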

Sensitivity analyses

A detailed analysis of the missing data was performed. After eliminating cases with more than 50% missing item responses, the majority of cases had no missing responses (64% in Batch 2 to 84% in Batch 4). At most, 9% of cases had a moderate amount of missing data, i.e., more than 2 missing item responses (5% in Batch 1, 9% in Batch 2, 5% in Batch 3, and 3% in Batch 4).
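The screening described above can be sketched in a few lines of Python. This is a minimal illustration, assuming responses are stored as one list per respondent with `None` marking a missing item; the function name `missing_summary` and the toy data are hypothetical and are not taken from the study.

```python
def missing_summary(responses, max_missing_frac=0.5, moderate_cutoff=2):
    """Drop cases with more than max_missing_frac missing, then summarize the rest."""

    def n_missing(row):
        return sum(value is None for value in row)

    # Keep a case unless more than 50% of its item responses are missing
    kept = [row for row in responses if n_missing(row) <= max_missing_frac * len(row)]
    n_kept = len(kept)
    n_complete = sum(n_missing(row) == 0 for row in kept)
    # "Moderate" missingness: more than 2 missing item responses
    n_moderate = sum(n_missing(row) > moderate_cutoff for row in kept)
    return {
        "n_kept": n_kept,
        "pct_complete": 100.0 * n_complete / n_kept if n_kept else 0.0,
        "pct_moderate": 100.0 * n_moderate / n_kept if n_kept else 0.0,
    }


# Toy data: four respondents, six items each
toy = [
    [1, 2, 3, 4, 5, 1],                 # complete
    [1, None, None, None, None, 2],     # 4 of 6 missing: excluded
    [1, None, None, None, 3, 2],        # 3 of 6 missing: kept, moderate
    [1, 2, None, 4, 5, 1],              # 1 missing: kept
]
summary = missing_summary(toy)
```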

Additionally, we repeated the DIF analyses without imputing the missing data; the EM algorithm in the IRTPRO package facilitated the estimation of the parameters with inclusion of all respondents. The DIF results remained essentially unchanged. The only change of note was in Batch 1 for the age DIF of the item Q81M (climbing stairs step-over-step without a handrail), from U* to U; however, the Level 2 designation for saliency did not change.

Discussion

The overall FREI and F-K indices for the total set of mobility items were generally comparable to those of the set of items with DIF. The FREI of the full set (57.4) was similar to that of the items with DIF (55.8); both reflect relatively moderate levels of reading ease. In terms of the grade level indices, the full set of items (9.8) and those with DIF (9.6) are deemed to be readily comprehensible for individuals at a U.S. ninth-grade reading level.
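For readers unfamiliar with these indices, the standard Flesch Reading Ease and Flesch-Kincaid grade level formulas depend only on the average sentence length and the average number of syllables per word. The Python sketch below applies the standard published formulas to counts supplied by the user; the function names and the example counts are hypothetical, and the software used in the study to obtain the counts is not specified here.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts for illustration: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
```

Both indices use the same two ratios with different weights, which is why item sets with similar wording, such as the full bank and the DIF subset, yield similar values on both.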

In general, items were highly discriminating. Relatively low discrimination parameters were observed for 16 items. Items that were less discriminating tended to be severe indicators of disability such as ‘lying in bed on your side without support’; ‘reaching for the nurse call button’; ‘reaching for the phone’; ‘rearranging pillows while lying on your bed’; ‘reaching for toilet paper while seated on a toilet’.

DIF was observed for many items. However, given the large sample sizes, using more stringent criteria related to DIF consistency, magnitude, and impact, together with presence of hypotheses generated by content experts, relatively few mobility items from this bank of 111 items rose to levels surpassing pre-set flagging criteria.

Among the items identified with DIF for age, gender, or education, 20 were singled out with salient DIF among which 9 items were flagged with the most evidence. One item showed DIF by all the methods (Level 3) and the hypotheses in the same direction: ‘crossing the road at a 4-lane traffic light with curbs’ but only for age groups. The item ‘climbing 3 to 5 steps without a handrail’ showed consistent DIF of high magnitude and the hypotheses in the same direction. The item ‘climbing stairs step-over-step without a handrail (alternating feet)’ showed consistent DIF of high magnitude without confirmatory a priori hypotheses. Three items were identified with DIF by all the methods (Level 3) without confirmatory hypotheses: ‘getting into and out of truck, shuttle van, or sport utility vehicle’; ‘opening a high window above shoulder height, while standing’; ‘pulling open a heavy door’. An additional three items were identified with DIF by two methods (Level 2) with the hypotheses in the same direction: ‘jumping/landing on one leg’; ‘taking part in strenuous activities (e.g., running 3 miles, swimming half mile, etc.)’; ‘descending 3 to 5 steps without a handrail with your walking aid’.

Saliency was accentuated in the presence of a DIF confirmatory hypothesis. Among the 20 items with Level 2 or 3 DIF, five had confirmatory hypotheses. No action was taken in the item bank with respect to these items. However, the exposure rate based on extensive CAT simulations for the items with the most salient DIF: ‘climbing stairs step-over-step without a handrail (alternating feet)’; ‘climbing 3 to 5 steps without a handrail’; and, ‘crossing the road at a 4-lane traffic light with curbs’ was low (0.35, 0.27 and 0.20).

DIF for two linking items, ‘difficulty climbing stairs step-over-step without a handrail (alternating feet)’ (Q81M) and the item, ‘difficulty climbing 3 to 5 steps without a handrail’ (Q82M) was not present consistently across batches. Variation was due to the requirement of examining the linking items together with the unique items within each of four different samples in Batch 1 through Batch 4. However, DIF was observed more than once for the items Q81M and Q82M; they were thus considered problematic for future use.

DIF and CAT

Actions that might have been taken include item removal, lowering the exposure rate, or performing separate calibrations for items with DIF. Because items with DIF can also provide information about examinees, it is a potential waste of information to simply delete them. In the field test, the DIF items were administered like the DIF-free items. The item parameters of an item with DIF can therefore be estimated for different groups respectively, based on these previously collected data. Then the group-specific DIF item parameters can be used when the items are administered to the respective group, known as the DIF-CAT procedure.54 Using the properly (i.e., within group) calibrated DIF items in the scale would increase the item usage efficiency and potentially obtain better patient trait estimates. However, the DIF-CAT method requires that the item parameters be calibrated precisely within each group, implying the need for sample sizes of at least 500 per group, which was greater than the subgroup sample sizes used in these analyses.55 To reduce the potential impact of three items that showed significant DIF (but in simulations had exposure rates from 0.20 to 0.35), a probabilistic procedure was implemented that limited their exposure rate in live MD-CAT testing to 0.20.
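The details of the probabilistic exposure-limiting procedure are not given here. One common way to implement such a limit is a Sympson-Hetter-style admission step, in which a capped item, once selected, is actually administered only with some probability. The Python sketch below assumes that approach; it is not a description of the FAMCAT implementation, and the function names, item labels, and fallback rule are hypothetical.

```python
import random


def admission_probability(simulated_rate, target_rate=0.20):
    """Probability of administering a selected item so that exposure is about target_rate.

    Assumes the simulated exposure rate approximates how often the item is selected.
    """
    if simulated_rate <= 0:
        return 1.0
    return min(1.0, target_rate / simulated_rate)


def select_item(ranked_items, admit_prob, rng=random):
    """Walk down the information-ranked candidates and apply the admission step.

    Items without an entry in admit_prob are always admitted.
    """
    for item in ranked_items:
        if rng.random() < admit_prob.get(item, 1.0):
            return item
    return ranked_items[0]  # fallback if every candidate was rejected (assumption)


# Simulated exposure rates reported for the three most salient DIF items
simulated = {"item_A": 0.35, "item_B": 0.27, "item_C": 0.20}
admit = {name: admission_probability(rate) for name, rate in simulated.items()}
chosen = select_item(["item_A", "item_D", "item_B"], admit)
```

Under this sketch an item with a simulated rate of 0.35 would be admitted about 57% of the times it is selected, while an item already at 0.20 would always be admitted. The realized exposure also depends on how the remaining items are ranked, so the target is approximate.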

In the current item bank, the three items with the most salient DIF displayed a low exposure rate (0.20 to 0.35). This item bank can be recommended as relatively DIF-free with regard to education, age, and gender among similar groups of primarily White hospitalized patients. These data are informative for those constructing short-forms or CATs. Five of the twenty items with salient DIF, two of which were linking items, involved stair climbing. Overall, eight items involving stair climbing evidenced DIF; thus, the recommendation is to minimize inclusion of such items in short-form measures that might be developed.

Study Limitations

Assumption violations such as high local dependencies may result in inflated discrimination parameters. The high LDs involved the stair climbing items; these items were also more likely to show DIF. High LDs can lead to false DIF detection, and this explanation for the DIF observed cannot be ruled out. On the other hand, stair climbing items have been shown to evidence DIF in previous studies (e.g.,16,58), and some were hypothesized by content experts to be likely to show DIF. Overall, 50% (9/18) of the items with DIF involved stair climbing. These items were found to be more difficult for older people and women.

Another limitation was the inability to perform DIF analyses by race and ethnicity. In this study, the item related to strenuous activities evidenced DIF by age, with the item being more difficult (severe) for older people; however, the item involving vigorous activities was not found to evidence DIF of sufficient magnitude and consistency to surpass the threshold set for flagging items with DIF. This latter item, which is similar to the item related to strenuous activities, has been found in previous research to evidence DIF by race and ethnicity. Race-DIF of large magnitude has been found in the item measuring vigorous activities.56,57 Teresi and colleagues58 additionally reported race-DIF in walking and in lifting/carrying groceries, and sex-DIF of large magnitude but low impact for the item related to strenuous activities. Analogously, similar SF-36 items such as vigorous activities (age-, education-, race-DIF), bend/kneel/stoop (age-DIF), and walk more than a mile (race-DIF) were found to be problematic for application in comparative studies across various demographic subgroups.59 DIF associated with age was also documented for the SF-12 items moderate activities and climb several flights of stairs.56

A final limitation relates to adjustments for DIF. Beyond mean comparisons, DIF may impact the range and distribution of scores and their relationship to other variables. Various actions taken to mitigate DIF could affect the overall measure. For example, item removal could affect validity by decreasing coverage of selected aspects of the trait or of items with seeming face or clinical validity. However, in this relatively large item bank, there were many items from which to choose that measured the trait well at equivalent levels. Thus, lowering the exposure described above or item removal had less of an impact.

Clinical Implications

Patient reported outcomes are a central aspect of discharge planning.60 To the best of our knowledge, this is the first study that examines DIF in an item bank tapping mobility function among hospitalized patients. Given that the item bank was observed to be relatively DIF-free with regard to education, age, and gender, the bank can be used to assist clinical assessment and decision-making regarding patients who might be at risk for specific mobility restrictions at discharge, as well as aspects of their mobility-related function that need to be targeted for post-discharge interventions. More important, with the goal of avoiding long and burdensome assessments for patients and clinical staff, these study data are informative for the construction of short-forms, including selection of items most sensitive to different levels of mobility impairment.61 Relatedly, clinicians such as those focused on orthopedic medicine might wish to evaluate upper and lower extremity mobility separately. The item bank could be stratified by subdomain when selecting items to create short forms to estimate upper and lower extremity mobility status using a common metric. In addition, these data are applicable for administering a tailored assessment via a computerized adaptive test, potentially conferring increased precision62,63 in the assessment of mobility impairment, and thus potentially rendering a better needs-service referral fit for physical rehabilitative treatment after hospital discharge. Enhanced mobility has the potential to translate into improved physical and psychosocial functioning, impacting the quality of life of people facing disabilities.

Conclusions

DIF can have serious consequences, particularly in the context of CAT for which a limited number of items are administered. Particularly concerning is if the starting or linking items contain DIF. As a result, it is recommended that items with salient DIF as defined here receive separate group calibrations, depending on findings of DIF, or are assigned lower exposure rates, or are removed from the bank. Recommendations for item calibration and scoring in the presence of DIF have been offered recently,64 and methodology for DIF detection presented.65 In the current item bank, three items with most salient DIF evidenced a low exposure rate (0.20 to 0.35). There was no evident aggregate scale level DIF effect for sex, education, or age in any of the batches. Thus, this item bank can be recommended as relatively DIF-free with regard to education, age, and gender among similar groups of primarily White hospitalized patients. These data are informative for those constructing short-forms or CATs. Five out of twenty items with salient DIF (two of which were linking items) involved stair climbing. Overall, eight items involving stair climbing evidenced DIF; it is thus recommended that such items be avoided in the development of short-form measures. DIF analysis is an essential step in ensuring that item banks contain high quality items. Item banks are frequently used to construct short-form measures that may be more precise and efficient for use in clinical practice. The results presented here provide needed information for those developing and using such measures.

Supplementary Material

appendix

Funding:

This work was supported by the National Institute of Child Health and Human Development [grant number 1R01HD079439-01A1], and by the National Institute on Aging, Pepper Center [grant number 1P30AG028741].

The authors thank Joseph P. Eimicke, MS and Stephanie Silver, MPH for additional analyses and editorial assistance in the preparation of this manuscript.

Abbreviations:

CAT: computerized adaptive testing
DIF: differential item functioning
F-K: Flesch-Kincaid grade level
FAMCAT: functional assessment in acute care MCAT
FREI: Flesch-Kincaid Reading Ease Index
ICC: item characteristic curve
IRT: item response theory
LDs: local dependencies
NCDIF: non-compensatory DIF
OLR: ordinal logistic regression
PROMIS: Patient Reported Outcomes Measurement Information System

Footnotes

Conflicts of interest and Disclosures: None

References

  • 1. Leveille SG, Penninx BW, Melzer D, Izmirlian G, Guralnik JM. Sex differences in the prevalence of mobility disability in old age: The dynamics of incidence, recovery, and mortality. J Gerontol B Psychol Sci Soc Sci 2000;55:S41–50.
  • 2. Ferrucci L, Guralnik JM. Mobility in human aging: a multidisciplinary life span conceptual framework. Annu Rev Gerontol Geriatr 2013;33:171–92. doi: 10.1891/0198-8794.33.171
  • 3. Verbrugge LM, Brown DC, Zajacova A. Disability rises gradually for a cohort of older Americans. J Gerontol: Series B 2017;72(1):151–61. doi: 10.1093/geronb/gbw002
  • 4. Molton IR, Yorkston KM. Growing older with a physical disability: A special application of the successful aging paradigm. J Gerontol: Series B 2017;72(2):290–9. doi: 10.1093/geronb/gbw122
  • 5. Young Y, Frick KD, Phelan EA. Can successful aging and chronic illness coexist in the same individual? A multidimensional concept of successful aging. JAMDA 2009;10(2):87–92. doi: 10.1016/j.jamda.2008.11.003
  • 6. Gill TM. Assessment of function and disability in longitudinal studies. J Am Geriatr Soc 2010;58(Suppl 2):S308–12. doi: 10.1111/j.1532-5415.2010.02914.x
  • 7. Hays RD, Spritzer KL, Amtmann D, et al. Upper-extremity and mobility subdomains from the Patient-Reported Outcomes Measurement Information System (PROMIS) adult physical functioning item bank. Arch Phys Med Rehabil 2013;94:2291–6. doi: 10.1016/j.apmr.2013.05.014
  • 8. Guralnik JM, Ferrucci L. Assessing the building blocks of function: utilizing measures of functional limitation. Am J Prev Med 2003;25(Suppl 2):112–21. doi: 10.1016/s0749-3797(03)00174-0
  • 9. Cabrero-García J, Ramos-Pichardo JD, Muñoz-Mendoza CL, et al. Validation of a mobility item bank for older patients in primary care. Health Qual Life Outcomes 2012;10:147. doi: 10.1186/1477-7525-10-147
  • 10. Abellan van Kan G, Rolland Y, Andrieu Y, et al. Gait speed at usual pace as a predictor of adverse outcomes in community-dwelling older people an International Academy on Nutrition and Aging (IANA) Task Force. J Nutr Health Aging 2009;13:881–9. doi: 10.1007/s12603-009-0246-z
  • 11. Hardy SE, Perera S, Roumani YF, Chandler JM, Studenski SA. Improvement in usual gait speed predicts better survival in older adults. J Am Geriatr Soc 2007;55:1727–34. doi: 10.1111/j.1532-5415.2007.01413.x
  • 12. Hirvensalo M, Rantanen T, Heikkinen E. Mobility difficulties and physical activity as predictors of mortality and loss of independence in the community living older population. J Am Geriatr Soc 2000;48:493–8. doi: 10.1111/j.1532-5415.2000.tb04994.x
  • 13. Newman AB, Simonsick EM, Naydeck BL, et al. Association of long-distance corridor walk performance with mortality, cardiovascular disease, mobility limitation, and disability. JAMA 2006;295:2018–26. doi: 10.1001/jama.295.17.2018
  • 14. Studenski S, Perera S, Patel K, et al. Gait speed and survival in older adults. JAMA 2011;305:50–8. doi: 10.1001/jama.2010.1923
  • 15. Paz SH, Spritzer KL, Morales LS, Hays RD. Evaluation of the Patient-Reported Outcomes Information System (PROMIS) Spanish physical functioning items. Qual Life Res 2013;22:1819–30. doi: 10.1007/s11136-012-0292-6
  • 16. Jones RN, Tommet D, Ramirez M, Jensen R, Teresi JA. Differential item functioning in Patient Reported Outcomes Measurement Information System® (PROMIS®) Physical Functioning short forms: Analyses across ethnically diverse groups. Psychol Test Assess Model 2016;58(2):371–402.
  • 17. The EuroQol Group. EuroQol-a new facility for the measurement of health-related quality of life. Health Policy 1990;16:199–208.
  • 18. Prieto L, Novick D, Sacristan JA, Edgell ET, Alonso J, SOHO Study Group. A Rasch model analysis to test the cross-cultural validity of the EuroQoL-5D in the Schizophrenia Outpatient Health Outcomes Study. Acta Psychiatrica Scandinavica 2003;107(Suppl. 416):24–9. doi: 10.1034/j.1600-0447.107.s416.6.x
  • 19. Smith AB, Cocks K, Parry D, Taylor M. A differential item functioning analysis of the EQ-5D in cancer. Value Health 2016;19:1063–7. doi: 10.1016/j.jval.2016.06.005
  • 20. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The Sickness Impact Profile: development and final revision of a health status measure. Med Care 1981;19:787–805.
  • 21. Lindeboom R, Holman R, Dijkgraaf MGW, et al. Scaling the Sickness Impact Profile using item response theory: An exploration of linearity, adaptive use, and patient driven item weights. J Clin Epidemiol 2004;57:66–74. doi: 10.1016/S0895-4356(03)00212-9
  • 22. McEwen J, McKenna S. Nottingham Health Profile. In: Spilker B, ed. Quality of Life and Pharmacoeconomics in Clinical Trials. 3rd ed. Philadelphia: Lippincott-Raven Publishers;1996:281–286.
  • 23. Juhel J, Gaillot AC. Structural validity and age-based differential item functioning of the French Nottingham Health Profile in a sample of surgery patients. Adv Psychol Study 2012;1:14–21.
  • 24. Roorda LD, Green JR, Houwink A, et al. The Rivermead Mobility Index allows valid comparisons between subgroups of patients undergoing rehabilitation after stroke who differ with respect to age, sex, or side of lesion. Arch Phys Med Rehabil 2012;93:1086–90. doi: 10.1016/j.apmr.2011.12.015
  • 25. Teresi JA, Ramirez M, Lai JS, Silver S. Occurrences and sources of differential item functioning (DIF) in patient-reported outcome measures: description of DIF methods, and review of measures of depression, quality of life and general health. Psychol Sci Q 2008;50:538–612.
  • 26. Flesch R. A new readability yardstick. J Appl Psychol 1948;32:221–3.
  • 27. Meredith W. Measurement invariance, factor analysis and factorial invariance. Psychometrika 1993;58:525–43. doi: 10.1007/BF02294825
  • 28. Meredith W, Teresi JA. An essay on measurement and factorial invariance. Med Care 2006;44(Suppl 3):S69–77. doi: 10.1097/01.mlr.0000245438.73837.89
  • 29. Millsap RE, Meredith W. Inferential conditions in the statistical detection of measurement bias. Appl Psychol Meas 1992;16:389–402. doi: 10.1177/014662169201600411
  • 30. van de Vijver F, Leung K. Methods and data analyses for cross-cultural research. Thousand Oaks, California: Sage Publications;1997.
  • 31. Holland PW, Wainer H. Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum;1993.
  • 32. Lord FM. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum;1980.
  • 33. Wang & Weiss, this series
  • 34. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement 1969;34:100–14. doi: 10.1007/BF02290599
  • 35. Orlando-Edelen M, Thissen D, Teresi JA, Kleinman M, Ocepek-Welikson K. Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: applications to the Mini-Mental State Examination. Med Care 2006;44(11 Suppl 3):S134–42. doi: 10.1097/01.mlr.0000245251.83359.8c
  • 36. Cai L, Thissen D, du Toit SHC. IRTPRO: Flexible, multidimensional, multiple categorical IRT Modeling [Computer software]. Chicago, IL: Scientific Software International, Inc;2011.
  • 37. Langer MM. A re-examination of Lord’s Wald test for differential item functioning using item response theory and modern error estimation (Doctoral dissertation). University of North Carolina at Chapel Hill library;2008. http://search.lib.unc.edu/search?R=UNCb5878458.
  • 38. Teresi JA, Kleinman M, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med 2000;19:1651–83.
  • 39. Woods CM, Cai L, Wang M. The Langer-improved Wald test for DIF testing with multiple groups: evaluation and comparison to two-group IRT. Educ Psychol Meas 2013;73:532–47. doi: 10.1177/0013164412464875
  • 40. Reeve BB, Teresi JA. Overview to the two-part series: Measurement equivalence of the patient-reported outcomes measurement information system (PROMIS®) short-forms. Psychol Test Assess Model 2016;58(1):31–5.
  • 41. Teresi JA, Wang C, Kleinman M, Jones RN, Weiss DJ. (in press). Differential item functioning analyses of the Patient Reported Outcomes Measurement Information System (PROMIS®) measures: Methods, challenges, advances and future directions. Psychometrika.
  • 42. Swaminathan H, Rogers HJ. Detecting differential item functioning using logistic regression procedures. J Educ Meas 1990;27:361–70. doi: 10.1111/j.1745-3984.1990.tb00754.x
  • 43. Zumbo BD. A handbook on the theory and methods of differential item functioning (DIF): logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense;1999. Retrieved from http://www.educ.ubc.ca/faculty/zumbo/DIF/index.html.
  • 44. Raju NS. DFITP5: A Fortran program for calculating dichotomous DIF/DTF [Computer program]. Chicago: Illinois Institute of Technology;1999.
  • 45. Raju NS, van der Linden WJ, Fleer PF. IRT-based internal measures of differential functioning of items and tests. Appl Psychol Meas 1995;19:353–68. doi: 10.1177/014662169501900405
  • 46. Flowers CP, Oshima TC, Raju NS. A description and demonstration of the polytomous DFIT framework. Appl Psychol Meas 1999;23:309–26. doi: 10.1177/01466219922031437
  • 47. Oshima TC, Kushubar S, Scott JC, Raju NS. DFIT for Window User’s Manual: differential functioning of items and tests. St. Paul, MN: Assessment Systems Corporation;2009.
  • 48. McDonald RP. Test theory: a unified treatment. Mahwah, NJ: L. Erlbaum Associates;1999.
  • 49. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951;16:297–334.
  • 50. Zumbo BD, Gadermann AM, Zeisser C. Ordinal versions of coefficient alpha and theta for Likert rating scales. J Mod Appl Stat Methods 2007;6:21–9.
  • 51. Lord FM, Novick MR. Statistical theories of mental test scores. Reading, MA: Addison Wesley;1968.
  • 52. Cheng Y, Liu C, Behrens J. Standard error of ability estimates and the classification accuracy and consistency of binary decisions. Psychometrika 2015;80(3):645–64. doi: 10.1007/s11336-014-9407-z
  • 53. Cheng Y, Yuan K-H, Liu C. Comparison of reliability measures under factor analysis and item response theory. Ed Psych Meas 2012;72:52–67.
  • 54. Wang Z, Weiss D, Wang C. DIF-CAT: Doubly adaptive CAT using subgroup information to improve measurement precision. Paper presented at the 2017 IACAT (International Association on Computerized Adaptive Testing), Niigata, Japan.
  • 55. Jiang S, Wang C, Weiss DJ. Sample size requirements for estimation of item parameters in the multidimensional graded response model. Frontiers Psych 2016;7:1–10.
  • 56. Fleishman JA, Lawrence WF. Demographic variation in SF-12 scores: True differences or differential item functioning? Med Care 2003;41(7 Suppl):III75–86.
  • 57. Crane PK, Gibbons LE, Ocepek-Welikson K, et al. A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Qual Life Res 2007;16:69–84. doi: 10.1007/s11136-007-9185-5
  • 58. Teresi JA, Ocepek-Welikson K, Kleinman M, et al. Evaluating measurement equivalence using the item response theory log-likelihood ratio (IRTLR) method to assess differential item functioning (DIF): Applications (with illustrations) to measures of physical functioning ability and general distress. Qual Life Res 2007;16:43–68. doi: 10.1007/s11136-007-9186-4
  • 59. Perkins AJ, Stump TE, Monahan PO, McHorney CA. Assessment of differential item functioning for demographic comparisons in the MOS SF-36 health survey. Qual Life Res 2006;15:331–48. doi: 10.1007/s11136-005-1551-6
  • 60. Flynn KE, Dombeck CB, DeWitt EM, Schulman KA, Weinfurt KP. Using item banks to construct measures of patient reported outcomes in clinical trials: investigator perceptions. Clin Trials 2008;5(6):575–86. doi: 10.1177/1740774508098414
  • 61. Lai JS, Cella D, Chang CH, Bode RK, Heinemann AW. Item banking to improve, shorten and computerize self-reported fatigue: an illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Qual Life Res 2003;12:485–501. doi: 10.1023/a:1025014509626
  • 62. Fries JF, Bruce B, Bjorner JB, Rose M. More relevant, precise, and efficient items for assessment of physical function and disability: moving beyond the classic instruments. Ann Rheum Dis 2006;65(Suppl 3):iii16–21. doi: 10.1136/ard.2006.059279
  • 63. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care 2007;45(Suppl 1):S22–31. doi: 10.1097/01.mlr.0000250483.85507.04
  • 64. Cho S-J, Suh Y, Lee W. After differential item functioning is detected: IRT item calibration and scoring in the presence of DIF. Appl Psychol Meas 2016;40(8):573–91. doi: 10.117701-466216664304
  • 65. Van der Linden WJ (ed). Handbook of Item Response Theory: Volume 3: Applications. Chapman & Hall/CRC;2018.
