Abstract
A long-standing and critical problem in the study of aging and depression is the comparability of measurement across age groups. While psychological measures of depression typically show increased incidence of symptoms with increasing age, rates of depression diagnosis do not show the same age trend. This analysis presents tests of differential item functioning on the depression section of the CAMDEX interview schedule, using factor-analytically derived affective and somatic subscales (McGue & Christensen, 1997). Results for the affective subscale show significant age-related differences in item functioning for the majority of the affective items (“Happy Life”, “Lonely”, “Nervous”, “Worthless” and “Future”: χ² in [30.193, 255.971] across items, all p < .0001). Analyses for the somatic subscale show that differential item functioning is limited to a single item relating to coping (χ² = 180.754, p < .0001). These results indicate that differences in depression symptoms across age groups are not entirely consistent with a unidimensional depression trait, and that the measurement structure of depression varies over the lifespan.
Keywords: depression, aging, item response theory, differential item functioning, measurement invariance
Depression is among the most common and disabling psychiatric conditions in the general population (Pincus & Pettit, 2001) and the most common cause of diminished quality of life in elderly persons (Cole, Bellavance, & Mansour, 1999; Xavier et al., 2002). Depression tends to be chronic or relapsing, and the probability of remission declines with age (Cole, Bellavance, & Mansour, 1999; Kessler, Mickelson, Walters, Zhao, & Hamilton, 2009; Thielke, Diehr, & Unutzer, 2010). In older adults, depressive symptoms predict declines in physical performance (Penninx et al., 1998) and disability in activities of daily living (Bruce et al., 1994; Iwasa et al., 2009). Physical symptoms commonly reported by older patients interact reciprocally with depression, creating positive feedback loops (Drayer et al., 2005; Lyness et al., 1996). In patients with major depressive disorder (MDD), the risk of psychiatric comorbidity increases with age, including other mood disorders, anxiety and substance use disorders (Kessler et al., 2009). Depression is also associated with increased all-cause mortality (Blazer & Hybels, 2004; Saz & Dewey, 2001) and with mortality due to suicide (Pearson, 2002).
A critical and long-standing challenge in the study of age and depression is establishing prevalence estimates across age groups. Age-specific estimates vary widely, depending largely on the method of measurement, chiefly whether a categorical or a continuous approach is used (Baldwin & Shean, 2007; Jorm, 2000). Diagnostic classification based on the various editions of the Diagnostic and Statistical Manual of Mental Disorders is categorical. Clinically relevant depression is judged present or absent on the basis of two primary symptoms, depressed mood or loss of interest or pleasure. These are weighted more heavily than the other affective and somatic symptoms, of which a specified number must also be present, and symptoms must last at least two weeks.
Self-report questionnaires and structured interviews are often used for categorical classification via cutoff scores. For example, the Beck Depression Inventory, Second Edition (BDI-II; Beck, Steer, & Brown, 1996) provides total score ranges for four levels of severity in patients diagnosed with major depression. It also provides a single cut score that may be used to screen for major depression. The BDI and other instruments such as the CES-D (Radloff, 1977), the Zung Self-Rating Depression Scale (Zung, 1965), the Geriatric Depression Scale (GDS; Yesavage & Brink, 1983) and the CAMDEX (Roth, Tym, Mountjoy, Huppert, Verma, & Goddard, 1986) can also be scored in one of two ways. One is to sum a number of equally weighted items. The other is to fit a measurement model (factor analysis, item response models, etc.). Either approach yields individual scores on one or more continuous variables for subsequent analyses.
When depression is measured as a continuous variable, through sum scoring, factor analysis or an item response model, levels of depression generally increase with age (Glaesmer, Riedel-Heller, Braehler, Spangenberg, & Luppa, 2011; Kessler et al., 2003; Krause, 1999; Narrow, Rae, Robins, & Regier, 2002). However, other studies show decreases with age (Kim, Pilkonis, Frank, Thase, & Reynolds, 2002) or no effect (Feinson, 1985). Even among studies that show an increasing age trend, effects range from simple group differences to non-linear and curvilinear trends (Gatz & Hurwicz, 1990; Teachman, 2006). In contrast, rates of diagnosed major depression decline in adulthood (Blazer, 2003; Glaesmer et al., 2011; Jorm, 2000; Kessler et al., 2003; Meeks, Vahia, Lavretsky, Kulkarni, & Jeste, 2010). Newman (1989) reviewed 21 studies of depressive symptomatology, comparing studies that used a continuous measure of depression with those that used a dichotomous categorical measure. Continuous measures showed a U-shaped curve, with middle-aged adults showing lower levels of depression than either younger or older adults. Clinical diagnostic (categorical) data, however, yielded an inverted U-shaped curve. Newman concluded that the two measurement approaches may be tapping different forms of depression with different age distributions. He also concluded that age-related changes in the nature and expression of depression may bias estimates of late-life depression. Similarly, Beekman, Copeland, and Prince (1999) reviewed the prevalence of late-life major and minor depression in community samples from a number of industrialized countries. They found that the prevalence of major depression decreased with age, while the level of depressive symptoms or minor depression increased with age. However, the authors noted that definitive conclusions were precluded by inconsistency in how depression was defined and by the variety of item content in structured interviews.
Many explanations have been suggested for the varying estimates of the prevalence of depression across age groups. A simulation study by Giuffra and Risch (1994) showed that observed cohort differences could be explained by “differential recall”. Some have proposed variants of depression unique to old age that would not be detected as MDD by standard diagnostic criteria, such as “depression without sadness” (Gallo et al., 1997) and “executive dysfunction syndrome” (Alexopoulos, 2005). Other suggestions include:

- reluctance among elderly individuals to self-disclose;
- resistance to psychiatric labels (Chew-Graham et al., 2012);
- the assumption that depression is simply to be expected with aging (Law, Laidlaw, & Peck, 2010);
- censoring of older samples due to institutionalization or mortality.

Other authors have suggested that depressive disorder is under-diagnosed in the elderly because depressive somatic symptoms are confused with physical morbidities (Blazer, Burchett, Service, & George, 1991). However, studies that control for objective medical illness have shown that depression and somatic symptoms are unrelated to physical health burden (Drayer et al., 2005; Kessler et al., 2010).
An alternative explanation for the inconsistency of age effects is a lack of measurement invariance across age. When latent depression scores are controlled for, DSM items commonly used to identify clinical depression do not function equivalently across adulthood and late life (Jeste, Blazer, & First, 2007). Balsis and Cully (2008) showed that, compared with younger adults, older adults endorsed DSM-IV somatic items at a higher rate and were less likely to endorse cognitive items and the suicide item. Kim et al. (2002) tested for differential item functioning (DIF) in the Beck Depression Inventory (BDI; Beck & Steer, 1993), finding significant DIF across age groups in 17 of the 22 BDI items. Among individuals formally diagnosed with major depressive disorder, BDI trait depression scores were significantly lower for the late-life group (age 60 and older) than for the midlife group (younger than 60). The opposite effect was found with the Hamilton Rating Scale for Depression (HRSD; Hamilton, 1960). Aggen, Kendler, Kubarych, and Neale (2011) found more complex relationships between age and symptom endorsement in DSM-III criteria for major depression, including effects of sex and age-by-sex interactions. Forkmann et al. (2013) found similar sex effects and multidimensionality in an elderly sample. In a review of epidemiological studies of depression across the lifespan, Jorm (2000) found no consistency in age effects across studies using a variety of assessment methods, both diagnostic and psychometric. While the BDI and DSM-based measures have been tested for age-based differential item functioning, the CAMDEX and many other depression scales have not. This precludes their use in assessing age-related changes in depression structure.
Current Study
The Cambridge Mental Disorders in the Elderly (CAMDEX; Roth, Tym, Mountjoy, Huppert, Verma, & Goddard, 1986) is a structured interview schedule for the diagnosis of dementing disorders in the elderly. The CAMDEX is widely used internationally in epidemiological and clinical studies. These include studies of depression across the lifespan that draw on the Danish Twin Registry (e.g., Johnson et al., 2002; McGue & Christensen, 2003; McGue & Christensen, 2013). Depression items were included to help identify non-dementing disorders that can masquerade as dementia. Depressed mood is assessed with 20 self-report items administered in a structured interview, plus an additional informant question, “Do you think he/she is depressed?”. A subset of these items formed a “clinical diagnostic scale” based on DSM-III criteria. The Danish Twin Registry uses an adaptation of the depression section of the CAMDEX (Skytthe et al., 2002). This adaptation was translated into Danish and then back-translated into English to verify its accuracy (M. McGue, personal communication, June 2014). The CAMDEX depression items support a two-factor structure (McGue & Christensen, 1997). One factor represents depressed affect (the affective factor). The other represents neurovegetative symptoms related to psychomotor slowing (the somatic factor). These scales have not been subjected to invariance testing with respect to age. Such testing is an important step before age differences in scores can be attributed to differences in depression rather than to changes in how the measure functions. Both DSM criteria and the depression scales reviewed above have shown noninvariance over age. We therefore undertook the present analyses as a prelude to further use of the extensive longitudinal depression data in the two data sets described below.
The present analysis is aimed at answering two primary questions:
Do either of the CAMDEX subscales show evidence of differential item functioning (DIF) with respect to age?
If noninvariant items are found, in what ways do they differ across age?
Methods
Sample
Participants were drawn from two survey studies based on the Danish Twin Registry (DTR), one of the oldest and largest twin registries in the world (Kyvik et al., 1996; Skytthe et al., 2013). The DTR contains approximately 85,000 twin pairs, covering nearly all twins born in Denmark. Ascertainment is 100% for twins born since 1973 and greater than 90% for older cohorts (Skytthe et al., 2002). Previous research has shown no differences between twins and singletons on most behavioral outcomes (Evans & Martin, 2000; Kendler, Martin, Heath, & Eaves, 1995). Random samples from the DTR may therefore be considered representative of the Danish population.¹ We use two random samples drawn from the DTR: the Middle-Aged Danish Twin Study and the Longitudinal Study of Aging Danish Twins. The Middle-Aged Danish Twin Study (MADT; Gaist et al., 2000; Skytthe et al., 2002) began in 1998 with a random sample of twin pairs born 1931–1952. The Longitudinal Study of Aging Danish Twins (LSADT; Christensen, Holm, McGue, Corder, & Vaupel, 1999; Skytthe et al., 2002) is a cohort-sequential study of like-sex twins followed at 2-year intervals. Twins were eligible if they were aged 75 or older in 1995, a threshold lowered to age 70 in subsequent waves. Participation rates for contacted, eligible participants in these studies range between 70% and 83% (McGue & Christensen, 2003; Skytthe et al., 2002). Skytthe et al. (2013) provide information on other measures available in the MADT.
The MADT and LSADT assessments are conducted in participants’ homes by trained interviewers. The survey covers demographic background and measures of medical health, physical and cognitive functioning, and depression. Twins are assessed separately by different interviewers. Zygosity in same-sex pairs is determined by a self-report questionnaire concerning the similarity between twins. This procedure has been validated by genetic testing, with error rates below 5% (Christensen, Holm, McGue, Corder, & Vaupel, 2003).
The sample for the present study was drawn from the MADT (collected in 1998) and from the first measurement occasion for each LSADT participant (collected in 1995, 1997, 1999 or 2001). Combining separate studies can cause problems in many analyses, but several features of these studies make their combination appropriate here. Together, the MADT and LSADT represent approximately 80% of all Danish-born twins in the age ranges relevant to each study. Given the high response rates in both studies and the exhaustive nature of the DTR, the sample represents essentially the same population at different ages (MADT: 45–68; LSADT: 70–104). The samples are also complementary in procedure: the structure of the CAMDEX schedule, interviewer training and data collection processes are equivalent across the MADT and LSADT. The two studies thus use identical measurement procedures for adjoining, non-overlapping age ranges from the same population. The MADT and the first measurement occasion of the LSADT can therefore be treated as a single cross-sectional sample. Table 2 gives sample sizes and the percentage of females, both for the full sample and for the subsamples used in subsequent analyses. Other demographic variables are not reported. The DTR is relatively homogeneous with respect to ethnicity, socioeconomic status and access to health care, reflecting the native-born population in the birth years it covers (Christensen, Herskind, & Vaupel, 2006).
Table 2.
Ranges and Sample Sizes for Age Group Variables
| Age groups | Min age | Mean age | Max age | Total N | Twin 1 N | Twin 2 N | Female % |
|---|---|---|---|---|---|---|---|
| 2 | 45 | 56.86 | 70 | 4,314 | 2,153 | 2,161 | 49.05% |
|  | 71 | 77.36 | 102 | 4,731 | 2,339 | 2,392 | 58.93% |
| 3 | 45 | 53.48 | 61 | 2,981 | 1,494 | 1,487 | 49.41% |
|  | 62 | 68.62 | 74 | 2,909 | 1,437 | 1,472 | 50.70% |
|  | 75 | 79.98 | 102 | 3,155 | 1,561 | 1,594 | 62.00% |
| 4 | 45 | 52.47 | 59 | 2,599 | 1,296 | 1,303 | 49.13% |
|  | 60 | 63.49 | 69 | 1,715 | 857 | 858 | 48.92% |
|  | 70 | 74.35 | 79 | 3,292 | 1,651 | 1,641 | 56.41% |
|  | 80 | 84.60 | 102 | 1,439 | 688 | 751 | 64.70% |
| Full sample | 45 | 67.63 | 102 | 9,045 | 4,492 | 4,553 | 54.22% |
Note. Age ranges, descriptive statistics and sample sizes for the two-, three- and four-group age splits used in the DIF analyses. The two- and three-group splits are based on sample age quantiles; the four-group split is based on approximately ten-year age bins. The 25th, 50th and 75th percentiles of the age distribution are 57.2, 70.7 and 76.0, respectively. Female % is the percentage of participants in each age bin who are female.
As is common in twin research, one member of each pair was designated “twin 1” and the other “twin 2”. This allowed the sample to be split into two matched subsamples, twin 1 (n = 4,492) and twin 2 (n = 4,553). Each pair contributes one member to each subsample, so all individuals within a subsample are unrelated. We carried out bootstrap power analyses by resampling the singleton (twin 1) data with replacement and applying the differential item functioning tests described below. Sample sizes ranged from 100 to 10,000 in increments of 100. These analyses showed that 99% power was reached with 1,600 individuals (experiment-wide α = .05) or 1,900 individuals (α = .01). No failures to detect an effect occurred with 2,900 or more subjects.
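The bootstrap power analysis can be sketched as follows. The sketch assumes the full DIF procedure is wrapped in a function that returns True when any item is flagged at the experiment-wide criterion. That wrapper (`rejects` below), the number of bootstrap replicates and the seed are hypothetical, since the text does not report them.

```python
from __future__ import annotations

import random
from typing import Callable, Optional, Sequence, TypeVar

Row = TypeVar("Row")


def bootstrap_power(
    data: Sequence[Row],
    n: int,
    rejects: Callable[[list[Row]], bool],
    reps: int = 200,
    seed: int = 1,
) -> float:
    """Share of bootstrap resamples of size n in which the test rejects H0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw n rows with replacement from the observed (twin 1) data.
        resample = rng.choices(data, k=n)
        if rejects(resample):
            hits += 1
    return hits / reps


def power_curve(
    data: Sequence[Row],
    rejects: Callable[[list[Row]], bool],
    sizes: Sequence[int] = range(100, 10_001, 100),
    reps: int = 200,
) -> dict[int, float]:
    """Estimated power at each candidate sample size (default 100, 200, ..., 10,000)."""
    return {n: bootstrap_power(data, n, rejects, reps) for n in sizes}


def smallest_n(curve: dict[int, float], target: float = 0.99) -> Optional[int]:
    """Smallest sample size whose estimated power reaches the target, or None."""
    for n in sorted(curve):
        if curve[n] >= target:
            return n
    return None
```

As a usage example, `smallest_n(power_curve(twin1_rows, run_dif))` would return the first sample size reaching 99% power. Here `twin1_rows` and `run_dif` are placeholders for the data and the DIF routine.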
Measures
Trait depression was measured using a subset of the depression scale from the Cambridge Mental Disorders in the Elderly (CAMDEX; Roth et al., 1986). The CAMDEX was administered in Danish. To verify equivalence, translation and back-translation were carried out before the first wave of data collection in 1995, following the procedure described by Brislin (1970). The CAMDEX depression scale was originally intended to be scored as a symptom count (Roth et al., 1986), used either as a sum score or with cut-points to identify categorical diagnoses. McGue and Christensen (1997) factor analyzed the 21-item CAMDEX depression section to construct affective (nine items) and somatic (eight items) subscales. They dropped the remaining four items because these related only weakly to either a common depression trait or the extracted subscale factors. These four items dealt primarily with somatic complaints (e.g., loss of appetite, trouble sleeping). While such complaints are relevant to depression, they also occur with other physiological and psychological disorders, which may explain their weaker relationship with the latent constructs. This two-factor solution provided an empirically based and more psychometrically sound method for creating continuous depression scores than the extant sum scores and clinical cut points. It has since been used to score this measure in studies from the Danish Twin Registry.² The somatic scale primarily indexes psychomotor slowing and loss of energy, while the affective scale reflects negative affect and lack of well-being. Most items on the two scales are rated on a 3-point scale with response options 1 = no, 2 = sometimes and 3 = most of the time. Two items have a dichotomous response format, 1 = no, 2 = yes (items 5 and 9, Table 1), and item 1 has a 5-option response format (see Table 1 for scale items).
Figure 1 is discussed primarily in the Results section. Its bottom panels, however, also show smoothed endorsement probabilities at each age and thus give the general age trend for each item. Johnson et al. (2002) reported internal consistency (Cronbach’s alpha) of .78 and two-year stability of .63 for the affective scale, and .80 and .54, respectively, for the somatic scale. In the present data, internal consistency for the affective scale was .88 in each subsample. For the somatic scale it was .89 and .90 in the two subsamples. Scores on the two scales correlated at .534 in the twin 1 subsample and .523 in the twin 2 subsample. On the affective scale, 34.3% of respondents endorsed zero items, 55.6% endorsed one to four items, and 10.1% endorsed five or more of the nine items. On the somatic scale, 49.6% endorsed no items, 41.9% endorsed one to four items, and 8.5% endorsed five or more of the eight items.
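The internal consistency values above are Cronbach's alpha. The sketch below computes alpha, along with the zero / one-to-four / five-or-more endorsement summary. How endorsement was counted is our assumption: any response above 1 (= no) is treated as an endorsement.

```python
from __future__ import annotations

from statistics import pvariance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(responses[0])
    items = list(zip(*responses))  # transpose: one tuple of scores per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var / total_var)


def endorsement_bands(responses: Sequence[Sequence[int]]) -> tuple[float, float, float]:
    """Share of respondents endorsing 0, 1-4, and 5+ items.

    Assumption: responses are coded so that 1 = no; any higher score counts
    as an endorsement (item 1's 5-option format may need its own rule).
    """
    counts = [sum(1 for x in row if x > 1) for row in responses]
    n = len(counts)
    zero = sum(c == 0 for c in counts) / n
    one_to_four = sum(1 <= c <= 4 for c in counts) / n
    five_plus = sum(c >= 5 for c in counts) / n
    return zero, one_to_four, five_plus
```

Population and sample variances give the same alpha here, because the n versus n − 1 factor cancels in the ratio.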
Table 1.
Affect and somatic scale items (adapted from the CAMDEX depression section)
| Subscale | Code | Item |
|---|---|---|
| Affective | 1. | Are you happy and satisfied with your life at present? |
|  | 2. | How often do you feel happy? |
|  | 3. | Have you felt lonely lately? |
|  | 4. | Do you feel tense and do you worry more than usual about matters of minor importance? |
|  | 5. | Do you consider yourself a nervous person? |
|  | 6. | Do you at the moment feel sad, depressed, or miserable? |
|  | 7. | Do you feel worthless, or do you blame yourself for mistakes that you have made a long time ago? |
|  | 8. | How do you feel about your future? |
|  | 9. | Do you sometimes feel that life is not worth living? |
| Somatic | 10. | Do you have extraordinarily long sleep? |
|  | 11. | Do you find it more difficult to cope with things now than before? |
|  | 12. | Do you find it more difficult to make decisions than you used to? |
|  | 13. | Have you lost pleasure or interest in doing things you usually cared about or enjoyed? |
|  | 14. | Do you find you have lost energy recently and is it harder to get things done? |
|  | 15. | Do you find it more difficult to concentrate than usual? |
|  | 16. | Do you speak more slowly than usual? |
|  | 17. | Do you feel that you think more slowly than usual? |
| Omitted | 18. | Have you preferred to be more on your own recently? |
|  | 19. | Do you have less appetite or are you often more hungry… |
|  | 20. | Within the last 6 months, have you lost or gained substantial weight? |
|  | 21. | Do you wake up early in the morning unable to fall asleep again? |
Figure 1.
Smoothed Endorsement Curves for CAMDEX Affective and Somatic Subscales
Item Response Model Selection
The eigenvalues of the polychoric correlation matrices of each subscale and sample were used for scree tests to assess dimensionality. All four subscale samples (i.e., somatic and affective scales for twin 1 and twin 2 samples) showed strong evidence of unidimensionality by this measure, as all subscales and samples featured a single large eigenvalue (ranging from 4.83 to 5.14) that accounted for 55.3% – 60.8% of the variance in the polychoric correlation matrix. No other eigenvalue exceeded 1.0, which supports a unidimensional solution by the Kaiser rule, and the ratio of first eigenvalue to second ranged from 5.25 to 5.95, which is sufficiently high to surpass common unidimensionality criteria (Lumsden, 1959; Lord, 1980).
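The eigenvalue criteria in this paragraph reduce to a few simple ratios. The sketch below takes the eigenvalues as given, since estimating the polychoric correlations themselves is not shown. The example eigenvalues in the check are made up for illustration.

```python
from __future__ import annotations

from typing import Sequence


def unidimensionality_summary(eigenvalues: Sequence[float]) -> dict:
    """Scree-style summary of the eigenvalues of an item correlation matrix."""
    ev = sorted(eigenvalues, reverse=True)
    total = sum(ev)  # equals the number of items for a full correlation matrix
    return {
        "first": ev[0],
        "share_first": ev[0] / total,              # proportion of variance, first factor
        "n_above_one": sum(v > 1.0 for v in ev),   # Kaiser rule: count of eigenvalues > 1
        "ratio_first_second": ev[0] / ev[1],       # first-to-second eigenvalue ratio
    }
```

With the made-up values for eight items used in the check, the first eigenvalue carries about 61% of the variance. It is 5.4 times the second, and it is the only one above 1.0.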
Two polytomous item response models were considered for the somatic and affective subscales. The graded response model estimates a separate discrimination parameter for each item; the polytomous Rasch model assumes that item discrimination is constant across items. The graded response model fit significantly better than the polytomous Rasch model for both the somatic subscale (Twin 1: χ² = 256.37, p < .001; Twin 2: χ² = 249.49, p < .001) and the affective subscale (Twin 1: χ² = 315.33, p < .001; Twin 2: χ² = 370.94, p < .001). This indicates significant differences in discrimination across items, so the graded response model was used in subsequent analyses.
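The model comparison above is a likelihood ratio test between nested models. Let ℓ be the maximized log-likelihood of each model and J the number of items. We assume, although the text does not state it, that the polytomous Rasch model estimates one common discrimination while the graded response model estimates J. Under that assumption the test statistic can be written as:

```latex
G^2 = -2\left(\ell_{\mathrm{Rasch}} - \ell_{\mathrm{GRM}}\right),
\qquad
G^2 \overset{\text{approx.}}{\sim} \chi^2_{J-1}
```

The reference distribution would then have J − 1 = 8 degrees of freedom for the nine affective items and 7 for the eight somatic items. If the Rasch variant instead fixed the common discrimination at a constant, the models would differ by J parameters.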
Analysis of Differential Item Functioning
All data analyses were carried out using R 2.11.1 (R Core Team, 2010). Item response models were fit using the ‘ltm’ library (Rizopoulos, 2006). Tests of differential item functioning were carried out using the ‘lordif’ library (Choi, Gibbons, & Crane, 2011), which follows a logistic regression procedure (French & Miller, 1996; Miller & Spray, 1993; Swaminathan & Rogers, 1990; Zumbo, 1999). The ‘lordif’ library was customized for this analysis to allow repeated sets of random starting values. Choi, Gibbons, and Crane (2011) give an in-depth description of this logistic DIF procedure. The steps below summarize our analysis of differential item functioning with respect to age:
1. Fit the desired item response model (here, the graded response model) to the item-level data and generate trait estimates for each individual.
2. For each item, compare the following ordinal logistic regression models:
   - 2a. Item regressed on trait level.
   - 2b. Item regressed on trait level and age group.
   - 2c. Item regressed on trait level, age group and a trait-by-age-group interaction.
3. Assess model fit by likelihood ratio tests. Differences between models 2a and 2c indicate some form of DIF; differences between models 2a and 2b indicate uniform DIF; differences between models 2b and 2c indicate non-uniform DIF. Items that show DIF in any of the three comparisons (2a vs. 2b, 2a vs. 2c, 2b vs. 2c) are considered differentially functioning.
4. If any DIF is found in step 3, apply the sparse matrix method of Crane et al. (2006) to separate the affected item(s), so that group-specific difficulties and discriminations can be estimated. Regenerate trait scores with Stocking-Lord equating (Stocking & Lord, 1983), using the items that do not show DIF as anchors. Then return to step 2, keeping track of which items have already shown DIF for subsequent applications of steps 3 and 4. If no new DIF is found, end the procedure.
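Steps 2 and 3 reduce to three nested likelihood ratio tests per item. The sketch below assumes the maximized log-likelihoods of models 2a, 2b and 2c have already been obtained from an ordinal logistic regression fit. The fitting itself, and step 4's re-estimation and equating, are not shown. With G age groups, model 2b adds G − 1 group terms and model 2c adds G − 1 interaction terms, so the omnibus test has 2(G − 1) df. For four groups (6 df) this reproduces the reported p-values in Table 3; for example, χ² = 5.489 for “Sad” gives p ≈ .4828. The chi-square tail function is a small stdlib implementation for integer degrees of freedom, not a library call.

```python
from __future__ import annotations

import math

BONFERRONI_ALPHA = 0.05 / 17  # 17 CAMDEX items, about .0029


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    # Start from df = 1 (odd df) or df = 2 (even df), then step up two df at a time:
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    k = 1 if df % 2 else 2
    q = math.erfc(math.sqrt(x / 2)) if k == 1 else math.exp(-x / 2)
    while k < df:
        q += math.exp((k / 2) * math.log(x / 2) - x / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return q


def dif_tests(ll_2a: float, ll_2b: float, ll_2c: float, n_groups: int,
              alpha: float = BONFERRONI_ALPHA) -> dict:
    """Likelihood ratio DIF tests for one item from the log-likelihoods of models 2a-2c."""
    k = n_groups - 1  # group dummies added in 2b; the same number of interactions in 2c
    comparisons = {
        "uniform (2a vs 2b)": (2 * (ll_2b - ll_2a), k),
        "non-uniform (2b vs 2c)": (2 * (ll_2c - ll_2b), k),
        "omnibus (2a vs 2c)": (2 * (ll_2c - ll_2a), 2 * k),
    }
    out = {}
    for name, (g2, df) in comparisons.items():
        p = chi2_sf(g2, df)
        out[name] = {"chi2": g2, "df": df, "p": p, "dif": p < alpha}
    return out


def item_flagged(tests: dict) -> bool:
    """An item is treated as differentially functioning if any comparison is significant."""
    return any(t["dif"] for t in tests.values())
```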
Because DIF is defined by differences in item functioning across groups, age had to be converted into a categorical grouping variable. Rather than pick one arbitrary set of cut points, we repeated the analysis with three different numbers of age groups. This verified that the results were not specific to any one treatment of age. The sample was split into two, three and four groups, as described in Table 2. The two-group split was defined by the sample median and the three-group split by the 33rd and 67th percentiles, yielding groups of approximately equal size. The four-group split used the more theoretically motivated bins “Under 60”, “60–69”, “70–79” and “80 or older”. The procedure above was therefore carried out six times: on each twin subsample with two, three and four age groups. Items identified as differentially functioning have separate discrimination and difficulty parameters for each age group in subsequent iterations.³ For all iterations after the first, tests are equated across groups using the invariant items as anchors (Stocking & Lord, 1983).
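The three age groupings can be sketched as below. The quantile split uses Python's `statistics.quantiles`, whose default "exclusive" method may place cut points slightly differently from R's default quantile type. For that reason, treat the resulting group sizes as approximate and not as a reproduction of Table 2.

```python
from __future__ import annotations

import bisect
import statistics
from typing import Sequence


def quantile_groups(ages: Sequence[float], n_groups: int) -> list[int]:
    """Assign each age to one of n_groups bins cut at sample quantiles (0 = youngest)."""
    cuts = statistics.quantiles(ages, n=n_groups)  # n_groups - 1 cut points
    # bisect_left places an age equal to a cut point in the lower group.
    return [bisect.bisect_left(cuts, a) for a in ages]


def decade_group(age: float) -> int:
    """Four-group split: 0 = under 60, 1 = 60-69, 2 = 70-79, 3 = 80 or older."""
    return bisect.bisect_right([60, 70, 80], age)
```

The two- and three-group versions come from `quantile_groups(ages, 2)` and `quantile_groups(ages, 3)`. The four-group version comes from applying `decade_group` to each age.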
As the primary question pertains to the invariance of the 17 CAMDEX items, the Bonferroni-corrected p-value criterion for each replication of the DIF procedure was set at .05/17 = .0029. Results did not substantially differ with changes to the criterion p-value in the .01 to .001 range. Initial results use the omnibus test for DIF, comparing models 2a and 2c. Subsequent analyses discuss uniform and non-uniform DIF separately.
Tests of differential item functioning require measures of effect size as well as measures of significance (likelihood ratio tests and associated p-values). We used two effect size measures. The first is the group difference in log odds ratio, rescaled to the ETS Δ scale (Dorans, 1989; Holland & Thayer, 1988). The second is the change in variance explained (ΔR²) from model 2a to model 2c. The Δ measure reflects the difference in endorsement probabilities across groups. Absolute values below 1.0 indicate small or negligible DIF, values between 1.0 and 1.5 moderate DIF, and values more extreme than 1.5 extensive DIF (Zieky, 1993; Zwick & Ercikan, 1989). Similarly, ΔR² values below .035 indicate small or negligible DIF, values between .035 and .070 moderate DIF, and values above .070 a large amount of DIF (Jodoin & Gierl, 2001). Previous work has found ΔR² to be more sensitive to non-uniform DIF, though it has generally indicated lower levels of DIF than the Δ metric (Hidalgo & Lopez-Pena, 2004). Both measures are interpreted only for items that show significant DIF on a formal test. The final column of each DIF table gives the number of versions of the test (3 age groupings × 2 twin subsamples) in which the item showed significant DIF at the Bonferroni-corrected criterion.
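The two effect-size rules above can be expressed as simple classifiers. The −2.35 multiplier is the standard Mantel-Haenszel-to-ETS Δ conversion (Holland & Thayer, 1988). Applying it to a logistic regression group coefficient is an assumption of this sketch rather than a detail reported here, and so is the choice of reference group, which sets the sign. The checks use values from Table 3.

```python
from __future__ import annotations

from typing import Optional


def ets_delta(log_odds_ratio: float) -> float:
    """Rescale a group log odds ratio to the ETS delta metric (delta = -2.35 * ln OR)."""
    return -2.35 * log_odds_ratio


def classify_delta(delta: float) -> str:
    """Size label from |delta|: below 1.0, 1.0 to 1.5, above 1.5."""
    size = abs(delta)
    if size < 1.0:
        return "negligible"
    if size <= 1.5:
        return "moderate"
    return "large"


def classify_r2(delta_r2: float) -> str:
    """Size label for the change in R^2 from model 2a to 2c (.035 and .070 cut-offs)."""
    if delta_r2 < 0.035:
        return "negligible"
    if delta_r2 <= 0.070:
        return "moderate"
    return "large"


def summarize_item(significant: bool, delta: float, delta_r2: float) -> Optional[dict]:
    """Effect-size labels, reported only when the formal DIF test was significant."""
    if not significant:
        return None
    return {"delta": classify_delta(delta), "r2": classify_r2(delta_r2)}
```

For the “Cope” item in twin 1 (Δ = −4.374, ΔR² = .015), this gives a large Δ but a negligible ΔR². That is the same divergence between the two metrics described in the Results.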
All models and tests were fit to both subsamples and to both the affective and somatic subscales. Where only a single set of results is presented, it corresponds to the twin 1 subsample (the “first” twin in each pair).
Results
Differential Item Functioning
Tests of differential item functioning were carried out as described in the Methods section. Table 3 presents the results of the four-group analyses for both subscales. Tables A1, A2 and A5 of the appendix present the two-group and three-group results, as well as details of the pairwise comparisons in the four-group analysis. Table 3 and the appendix indicate relatively few problems with the somatic subscale. Only the “Cope” item (Item 11 in Table 1) showed consistent, significant effects across all groupings and versions of the analysis (χ² in [126.893, 180.754], all p < .0001). These effects are small by variance accounted for (ΔR² in [.015, .016]) but large by the Δ measure (Δ in [−4.684, −4.374]). The only other somatic item to show any effects was “Long Sleep”. It showed small effects in two of the twin 2 analyses, which did not replicate in the twin 1 subsample.
Table 3.
DIF Tests and Effect Sizes for the CAMDEX Somatic and Affective Subscales
| Item | Twin 1 χ² | Twin 1 p | Twin 1 ΔR² | Twin 1 Δ | Twin 2 χ² | Twin 2 p | Twin 2 ΔR² | Twin 2 Δ | DIF? |
|---|---|---|---|---|---|---|---|---|---|
| Affective |  |  |  |  |  |  |  |  |  |
| 1. Happy Life | 30.193 | <.0001 | .006 | 1.292 | 54.593 | <.0001 | .009 | 2.202 | 5 of 6 |
| 2. Feel Happy | 16.117 | .0131 | .001 | 0.595 | 60.075 | <.0001 | .002 | 1.072 | 4 of 6 |
| 3. Lonely | 94.468 | <.0001 | .013 | −3.119 | 166.737 | <.0001 | .017 | −3.692 | 6 of 6 |
| 4. Tense | 20.372 | .0024 | .007 | 1.180 | 11.592 | .0717 | .004 | −0.103 | 1 of 6 |
| 5. Nervous | 31.937 | <.0001 | .010 | 0.329 | 31.298 | <.0001 | .008 | −0.544 | 5 of 6 |
| 6. Sad | 5.489 | .4828 | .003 | 2.495 | 4.367 | .6272 | .005 | 2.642 | 0 of 6 |
| 7. Worthless | 50.502 | <.0001 | .008 | −1.162 | 72.037 | <.0001 | .011 | −1.849 | 6 of 6 |
| 8. Future | 137.761 | <.0001 | .014 | −2.167 | 255.971 | <.0001 | .017 | −2.228 | 6 of 6 |
| 9. Life | 10.980 | .0890 | .003 | −1.143 | 13.681 | .0334 | .001 | −0.491 | 2 of 6 |
| Somatic |  |  |  |  |  |  |  |  |  |
| 1. Long Sleep | 6.289 | .3916 | .002 | −0.436 | 13.420 | .0368 | .005 | −0.992 | 2 of 6 |
| 2. Cope | 180.754 | <.0001 | .015 | −4.374 | 171.700 | <.0001 | .016 | −4.684 | 6 of 6 |
| 3. Decisions | 11.835 | .0657 | .004 | 1.598 | 3.138 | .7913 | .001 | 0.344 | 0 of 6 |
| 4. Lost Interest | 7.654 | .2646 | .003 | −0.320 | 3.644 | .7248 | .001 | 0.074 | 0 of 6 |
| 5. Lost Energy | 5.606 | .4687 | .001 | 0.167 | 8.983 | .1745 | .001 | −0.290 | 0 of 6 |
| 6. Concentrate | 12.343 | .0547 | .004 | −0.058 | 3.331 | .7663 | .001 | −0.016 | 0 of 6 |
| 7. Speak Slowly | 12.362 | .0544 | .006 | −3.105 | 8.306 | .2165 | .004 | −2.385 | 0 of 6 |
| 8. Think Slowly | 4.212 | .6480 | .002 | 0.054 | 9.497 | .1475 | .003 | −0.098 | 0 of 6 |
Note. Likelihood ratio tests and associated p-values for associations between item-level responses and age, partialling out the effect of the corresponding (affective or somatic) depression trait. Age is treated as a four-group categorical variable. Each chi-square value is the difference in fit between two ordinal logistic regression models. In the first, the item is regressed on the depression trait only. The second adds effects of age group and of age × trait interactions. ΔR² and Δ are measures of effect size. “DIF?” gives the number of the six tests (3 age groupings × 2 twins) that found significant differential item functioning. A Bonferroni-corrected criterion of .0029 was used for significance testing and for the “DIF?” column.
The results for the affective subscale showed substantial DIF. Three of the nine affective items showed DIF in every version of the analysis (Items 3, 7 and 8 in Tables 1 and 3: “Lonely”, “Worthless” and “Future”). Effect sizes for these three items were small by ΔR² (ΔR² in [.008, .017]) but moderate to large by the Δ measure (Δ in [−3.692, −1.162]). Three additional items showed DIF in four or five of the six analyses (Items 1, 2 and 5: “Happy Life”, “Feel Happy” and “Nervous”). For these items, all uncorrected p-values were below .0137, and DIF replicated across twins in at least one treatment of age. As with the other differentially functioning affective items, effect sizes were small by ΔR². Their Δ effects ranged from negligible (0.329 and −0.544) to large (2.202). Only the “Sad” item was completely free of DIF in all versions of the analysis.
The impact of these differentially functioning items is shown in the four plots of Figure 1. Each plot shows the loess-smoothed probability of endorsing each CAMDEX item as a function of age. Items plotted with solid black lines showed DIF that replicated across subsamples (a significant effect in 4, 5 or 6 of the 6 tests) in at least one set of age bins, while items plotted with dotted grey lines showed little or no evidence of DIF. The top two plots show the probability of endorsement conditional on the trait, expressed as a residual from the model-expected probability, so that each item should have an expected value of zero at every age if no DIF is occurring. The bottom two plots show the raw probability of endorsement. The plots for the somatic items show little evidence of differential functioning: the age curves for the items track one another closely in both plots and stay reasonably close to zero in the residual plot, with the notable exception of the “Cope” item. In contrast, the affective items show discrepant patterns with age: in the raw data, items show both positive and negative trends of varying shape and magnitude, and the residual plots show discrepant non-zero trends. Beyond this variability, subsets of the items do not appear to change as units, making even multidimensional interpretations of DIF implausible. While the similarity among the somatic items indicates that a unidimensional trait is plausible, the age trends of the affective items cannot be explained by a single construct that changes with age.
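The residual curves in the top panels can be approximated in two steps: subtract each respondent's model-expected endorsement probability from the observed 0/1 response, then smooth those residuals against age. The sketch below uses a local-linear smoother with tricube weights (LOWESS without robustness iterations) as a stand-in for whatever loess settings produced Figure 1; the function names, the span of 0.5 and the toy data are hypothetical.

```python
from typing import List, Sequence


def residuals(observed: Sequence[float], expected: Sequence[float]) -> List[float]:
    """Observed 0/1 endorsement minus the model-expected probability."""
    return [o - e for o, e in zip(observed, expected)]


def local_linear_smooth(x: Sequence[float], y: Sequence[float],
                        span: float = 0.5) -> List[float]:
    """Tricube-weighted local-linear fit of y on x, evaluated at each x.

    Each local fit uses the nearest `span` fraction of the points, with
    weights that fall to zero at the edge of the window.
    """
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours per window
    fitted = []
    for x0 in x:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)  # > 0, since x0 itself has weight 1
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


if __name__ == "__main__":
    ages = [52, 57, 61, 66, 71, 76, 82, 88]
    observed = [0, 0, 1, 0, 1, 1, 0, 1]                     # hypothetical responses
    expected = [0.2, 0.25, 0.3, 0.3, 0.35, 0.4, 0.45, 0.5]  # hypothetical model probabilities
    smooth = local_linear_smooth(ages, residuals(observed, expected))
    print([round(v, 3) for v in smooth])
```

If an item has no DIF, its residuals hover around zero at every age and so does the smoothed curve; a systematic departure from zero is the visual signature of age-related DIF.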
Uniform and Non-Uniform DIF
It is important to assess not just the presence of differential item functioning over age, but also the nature and form of the age-specific effects. Tests of DIF commonly distinguish between uniform and non-uniform DIF, as these two forms of non-invariance denote different types of measurement problems (Embretson & Reise, 2000). Uniform DIF indicates changes in item difficulty across groups with no change in item discrimination, while non-uniform DIF indicates that the relationship between the trait and the item (i.e., item discrimination) varies across groups⁴. In the context of this study, uniform DIF effects would consist of age differences in baseline symptom frequency that are unrelated to the affective or somatic depression constructs, while non-uniform DIF effects would indicate that the relationship between depression and the depression symptoms differs across ages. For applied use, the key difference is that uniform DIF is correctable, while non-uniform DIF is not. If DIF in the CAMDEX is largely uniform, then trait scores could be adjusted for age-based symptom endorsement in the same way that Kim et al. (2002) provided adjusted cutoff scores for the Beck Depression Inventory. However, if the DIF is largely non-uniform, no such adjustment can be made, because the size of the age effect on an item then depends on the respondent's trait level.
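The correctability argument can be made explicit with the ordinal logistic regression model used in the DIF tests. In the notation below (ours, and not necessarily that of the Methods section), item $j$ has thresholds $\alpha_k$, trait slope $\beta_1$, and age-group effects $\beta_{2g}$ (uniform) and $\beta_{3g}$ (non-uniform), both fixed at zero for a reference group $r$. Equating endorsement probabilities gives the trait shift $\theta^{*} - \theta$ that would make a group-$g$ respondent look like a reference-group respondent:

```latex
\begin{aligned}
\operatorname{logit} P(Y_j \ge k \mid \theta, g)
  &= \alpha_k + \beta_1\theta + \beta_{2g} + \beta_{3g}\,\theta,
  \qquad \beta_{2r} = \beta_{3r} = 0,\\
\alpha_k + \beta_1\theta^{*}
  &= \alpha_k + \beta_1\theta + \beta_{2g} + \beta_{3g}\,\theta
  \;\Longrightarrow\;
  \theta^{*} - \theta = \frac{\beta_{2g} + \beta_{3g}\,\theta}{\beta_1}.
\end{aligned}
```

If $\beta_{3g} = 0$ (uniform DIF), the shift is the constant $\beta_{2g}/\beta_1$, so a single age-specific adjustment, such as an adjusted cutoff, removes it. If $\beta_{3g} \neq 0$ (non-uniform DIF), the shift changes with $\theta$, so no single adjustment works across the trait range.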
The DIF procedure outlined in the Methods section presented three model comparisons that can be used to indicate DIF: the omnibus test compares models 3a and 3c, and tests for DIF in any form. The comparison between models 3a and 3b tests for uniform DIF, as these models differ only by a constant age effect on item difficulty. The comparison between models 3b and 3c tests for non-uniform DIF, as these models differ by the inclusion of a trait by age interaction term that models changes in item discrimination over age.
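As a sketch of how these comparisons turn into the tabled statistics, assume each nested pair is compared with a likelihood ratio test whose degrees of freedom equal the number of added parameters: G − 1 age-group terms for the uniform test, G − 1 trait × age terms for the non-uniform test, and 2(G − 1) for the omnibus test, with G age groups. Under that assumption, the χ² values checked below from Tables 3, A1, A2 and A3 reproduce their reported p-values. The helper names are hypothetical, and the chi-square tail uses closed forms for integer degrees of freedom so that no SciPy install is needed. Reading the .0029 criterion as 0.05/17 (one test per item across both subscales) is an inference, not something the paper states.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square with integer df.

    Uses the standard closed forms for integer degrees of freedom.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^((2r-1)/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(root / math.sqrt(2.0)) + 2.0 * density * total


def dif_df(n_groups: int, comparison: str) -> int:
    """Degrees of freedom for the nested-model comparisons with G age groups."""
    extra = n_groups - 1  # one term per non-reference group
    return {"uniform": extra, "non-uniform": extra, "omnibus": 2 * extra}[comparison]


def lr_test(loglik_restricted: float, loglik_full: float, df: int):
    """Likelihood ratio statistic and p-value for two nested models."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2_sf(stat, df)


ALPHA = 0.05 / 17  # assumed Bonferroni over 17 items, about .0029

if __name__ == "__main__":
    checks = [(10.477, dif_df(2, "omnibus")),  # Happy Life, twin 1, Table A1
              (18.946, dif_df(3, "omnibus")),  # Happy Life, twin 1, Table A2
              (5.606, dif_df(4, "omnibus")),   # Lost Energy, twin 1, Table 3
              (9.160, dif_df(2, "uniform"))]   # Happy Life, twin 1, Table A3
    for stat, df in checks:
        p = chi2_sf(stat, df)
        print(f"chi2={stat:.3f} df={df} p={p:.4f} significant={p < ALPHA}")
```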
Table 4 summarizes the uniform and non-uniform DIF tests; complete statistics for all effects can be found in Tables A3 and A4. These results show the same overall rate of differential functioning as the omnibus DIF tests, but split each effect into uniform, non-uniform or both. Three items on the affective subscale (“Happy Life”, “Future” and “Life”) are characterized by uniform DIF, exhibiting exclusively uniform DIF in 11 of 14 significant tests (the two-, three- and four-group versions show 4/5, 4/5 and 3/4, respectively; see Tables A3 and A4). However, the remaining differentially functioning affective items (“Feel Happy”, “Lonely”, “Nervous” and “Worthless”) and both differentially functioning somatic items predominantly show non-uniform DIF, either instead of or in addition to uniform DIF. In total, 31 of the 43 significant DIF effects found in this analysis were either exclusively non-uniform or jointly uniform and non-uniform (the two-, three- and four-group versions show 11/15, 10/15 and 10/13, respectively; see Tables A3 and A4), indicating that the differential functioning found in this depression scale is too complex to be explained by uniform age-related shifts in item endorsement. The items “Lonely”, “Nervous”, “Worthless” and “Cope” showed decreased discrimination in the older adult samples, indicating a weakening of the relationship between these indicators and the affective and somatic constructs. The items “Feel Happy” and “Future” showed higher discriminations in the oldest old, indicating a strengthening of the relationships between these items and depression in late life.
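The classification behind these counts can be sketched as a simple rule: a test is uniform-only if only the uniform comparison is significant, non-uniform-only if only the non-uniform comparison is, and both if both are. The sketch applies that rule at the .0029 criterion to the two-group p-values transcribed from Tables A3 and A4, entering “<.0001” as .0001. The data layout and function name are illustrative; the same loop applies unchanged to the three- and four-group columns.

```python
from collections import Counter
from typing import Optional

ALPHA = 0.0029  # Bonferroni-corrected criterion reported in the paper

# (uniform p, non-uniform p) for the two-group age split, from Tables A3 and A4.
TWO_GROUP = {
    "Twin 1": {
        "Happy Life": (0.0025, 0.2138), "Feel Happy": (0.0041, 0.6046),
        "Lonely": (1e-4, 1e-4), "Tense": (0.0244, 0.3603),
        "Nervous": (0.2524, 1e-4), "Sad": (0.2624, 0.7541),
        "Worthless": (0.0267, 1e-4), "Future": (1e-4, 0.0133),
        "Life": (0.1239, 0.5687), "Long Sleep": (0.1295, 0.2947),
        "Cope": (1e-4, 1e-4), "Decisions": (0.2328, 0.3608),
        "Lost Interest": (0.7276, 0.0193), "Lost Energy": (0.4534, 0.0872),
        "Concentrate": (0.8607, 0.2164), "Speak Slowly": (0.0769, 0.1734),
        "Think Slowly": (0.7098, 0.0964),
    },
    "Twin 2": {
        "Happy Life": (0.9719, 1e-4), "Feel Happy": (1e-4, 1e-4),
        "Lonely": (1e-4, 1e-4), "Tense": (0.0643, 0.1151),
        "Nervous": (0.0009, 0.0001), "Sad": (0.0391, 0.7121),
        "Worthless": (1e-4, 1e-4), "Future": (1e-4, 0.0076),
        "Life": (0.0001, 0.7603), "Long Sleep": (0.3208, 1e-4),
        "Cope": (1e-4, 0.0011), "Decisions": (0.4069, 0.5225),
        "Lost Interest": (0.7898, 0.0673), "Lost Energy": (0.0522, 0.0470),
        "Concentrate": (0.3661, 0.1131), "Speak Slowly": (0.7207, 0.2376),
        "Think Slowly": (0.5999, 0.4857),
    },
}


def dif_type(p_uniform: float, p_nonuniform: float,
             alpha: float = ALPHA) -> Optional[str]:
    """Classify one item-by-sample test; None means no significant DIF."""
    uniform, nonuniform = p_uniform < alpha, p_nonuniform < alpha
    if uniform and nonuniform:
        return "both"
    if uniform:
        return "uniform only"
    if nonuniform:
        return "non-uniform only"
    return None


tally: Counter = Counter()
for sample, items in TWO_GROUP.items():
    for item, (p_u, p_nu) in items.items():
        kind = dif_type(p_u, p_nu)
        if kind is not None:
            tally[kind] += 1

total = sum(tally.values())
involving_nonuniform = tally["non-uniform only"] + tally["both"]

if __name__ == "__main__":
    print(dict(tally))
    print(f"{involving_nonuniform} of {total} effects involve non-uniform DIF")
```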
Table 4.
Summary of Uniform and Non-Uniform DIF Tests
| Item | Discrimination (Young) | Discrimination (Old) | Omnibus DIF | Uniform | Non-Uniform | Both |
|---|---|---|---|---|---|---|
| Affective | | | | | | |
| 1. Happy Life | 1.90 (0.15) | 2.04 (0.20) | 5* | 5 | 1 | 0 |
| 2. Feel Happy | 2.25 (0.20) | 2.78 (0.33) | 4 | 1 | 0 | 3 |
| 3. Lonely | 2.02 (0.20) | 1.28 (0.16) | 6 | 0 | 0 | 6 |
| 4. Tense | 1.54 (0.14) | 1.62 (0.20) | 1** | 0 | 0 | 0 |
| 5. Nervous | 1.44 (0.15) | 1.19 (0.17) | 5 | 0 | 4 | 1 |
| 6. Sad | 2.99 (0.33) | 2.88 (0.38) | 0 | - | - | - |
| 7. Worthless | 1.54 (0.15) | 0.82 (0.14) | 6 | 0 | 3 | 3 |
| 8. Future | 1.34 (0.12) | 1.71 (0.18) | 6 | 4 | 0 | 2 |
| 9. Life | 2.03 (0.23) | 1.76 (0.24) | 2 | 2 | 0 | 0 |
| Somatic | | | | | | |
| 1. Long Sleep | 0.83 (0.16) | 1.07 (0.16) | 2 | 0 | 2 | 0 |
| 2. Cope | 2.75 (0.29) | 2.06 (0.20) | 6 | 0 | 0 | 6 |
| 3. Decisions | 2.68 (0.31) | 2.97 (0.37) | 0 | - | - | - |
| 4. Lost Interest | 2.62 (0.26) | 1.94 (0.21) | 0 | - | - | - |
| 5. Lost Energy | 2.80 (0.27) | 2.43 (0.25) | 0 | - | - | - |
| 6. Concentrate | 2.84 (0.29) | 1.95 (0.21) | 0 | - | - | - |
| 7. Speak Slowly | 2.02 (0.33) | 1.35 (0.21) | 0 | - | - | - |
| 8. Think Slowly | 2.18 (0.21) | 1.80 (0.19) | 0 | - | - | - |

Note. Summary of tests of uniform, non-uniform and omnibus (total) DIF for the affective and somatic subscales. The DIF columns give the number of the six tests (three age groupings by two twins) that were significant. In two cases, the individual uniform and non-uniform results differed from the omnibus test:
(*) The two-group condition for twin 1 showed significant uniform DIF, but did not show DIF under the omnibus test.
(**) The four-group condition for twin 1 showed significant DIF in the omnibus tests despite neither the uniform nor non-uniform tests being significant.
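The equivalence between discriminations and factor loadings noted in the footnote on item factor analysis suggests reading Table 4 on a loading scale. A sketch, assuming the discriminations are logistic graded-response slopes for a standard-normal trait: dividing by the scaling constant D ≈ 1.702 gives an approximate normal-ogive slope a*, and the standardized loading is λ = a*/√(1 + a*²). The paper does not report loadings; the conversion and names here are illustrative.

```python
import math

D = 1.702  # assumed logistic-to-normal-ogive scaling constant


def loading_from_discrimination(a: float, d: float = D) -> float:
    """Approximate standardized loading for a logistic IRT discrimination a.

    Assumes a unidimensional model with a standard-normal trait:
    a* = a / d is the normal-ogive slope and lambda = a* / sqrt(1 + a*^2).
    """
    a_star = a / d
    return a_star / math.sqrt(1.0 + a_star ** 2)


# Young vs. old discriminations from Table 4 for items whose discrimination fell.
TABLE_4 = {"Lonely": (2.02, 1.28), "Nervous": (1.44, 1.19),
           "Worthless": (1.54, 0.82), "Cope": (2.75, 2.06)}

if __name__ == "__main__":
    for item, (young, old) in TABLE_4.items():
        print(f"{item:10s} young {loading_from_discrimination(young):.2f}  "
              f"old {loading_from_discrimination(old):.2f}")
```

Under these assumptions, the drop for “Worthless” from 1.54 to 0.82 corresponds to a loading falling from about .67 to about .43.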
Discussion
The goal of the present study was to investigate the measurement invariance of the CAMDEX depression scale with respect to age. The initial analyses supported the previously established structure of two unidimensional subscales measuring somatic and affective symptoms at the population level. Tests of differential item functioning with respect to age showed few problems in the somatic subscale but significant differential functioning in the majority of the items on the affective subscale, neither of which was detectable using the initial population-wide analyses. Further testing revealed that these effects were predominantly non-uniform, indicating that the age-related changes in the items were too complex to be caused by simple shifts in endorsement and instead reflect changes in the way each differentially functioning item relates to the affective and somatic dimensions of depression.
The somatic subscale performed relatively well in the differential item functioning tests. Only two items showed evidence of DIF, and only one of these replicated across twins. The “Cope” item showed consistent uniform and non-uniform effects across twins and age splits, whereas the “Long Sleep” item showed non-uniform DIF only in the two-group and three-group analyses for twin 2. The other six items showed no evidence of DIF, either uniform or non-uniform. Given the effects found in this analysis, a version of the somatic subscale that omits the “Cope” item should be considered when making comparisons between individuals across middle and late adulthood. When possible, researchers should consider running their analyses of this scale both with and without the “Cope” item as a check that results are not due to differential functioning in the scale. However, researchers using a shortened version of any scale should note that removing items may change the construct being measured, and that omitting the “Cope” item could influence how the somatic subscale should be interpreted.
Unlike the somatic subscale, the affective subscale showed substantial evidence of differential item functioning, the majority of which was non-uniform. Among the nine items included in this subscale, only the “Sad” item was free of any age-specific effects. The remaining eight items all showed varying degrees of differential item functioning, including replication across twin samples and age groups for six of those items. Only the “Sad”, “Tense” and “Life” items were completely free of non-uniform DIF, indicating that most of the scale shows age differences in how the individual items discriminate between levels of the affective depression construct. Unlike Kim et al.’s (2002) previous work showing correctable uniform DIF across age in the Beck Depression Inventory in the form of age-specific cutoff scores, the non-uniform DIF found in this analysis cannot be corrected using age adjustments. Instead, the differential effects in the affective subscale indicate more complex lifespan differences in the structure of this scale and construct.
While this type and degree of non-invariance has not been shown previously, the results are consistent with previous findings of considerable heterogeneity in the phenotypic expression of depression in late life (Balsis & Cully, 2008; Hybels, Blazer, Landerman, & Steffens, 2011). This heterogeneity in symptom expression is not consistent with latent trait estimation, which assumes that the trait (e.g., depression) can be described by one or more dimensions or factors with linear (factor analysis) or monotonic (item response modeling) relations to the items. The implication of this assumption is that a higher trait level leads to a systematic increase in endorsement probability for all symptoms.
The use of item response modeling and the differential item functioning framework has both benefits and drawbacks for the study of measurement invariance. Item response models like the graded response model provide greater computational efficiency than equivalent structural equation models for ordinal data (Kamata & Bauer, 2008; Maydeu-Olivares, 2004; Widaman & Reise, 1997). McDonald (1999), among others, highlights the equivalence of, and transformations between, specific item response models and their SEM counterparts; understanding the relationship between these methods and the differing invariance traditions that surround them is important for all researchers. While the DIF framework provides a useful set of tools for testing the measurement invariance of items, most applications of DIF use a categorical grouping variable, and the complexity of the tests increases with the number of groups. The present analysis relied on various splits of the continuous age variable to satisfy this requirement. Age likely has a much more continuous effect on item endorsement than the two-, three- and four-group models indicate. Future analyses should include continuous effects of age (including non-linear terms) and a method for including those continuous effects on the item discrimination and difficulty parameters in the iterative portions of the DIF procedure. Continuous age effects were not included in this analysis for two reasons: they are not the most common method for testing DIF, and testing non-linear age terms would further complicate the analysis and require more degrees of freedom than the categorical tests employed here, obscuring the general conclusion that DIF exists across ages in this scale. The ordinal logistic regression framework also assumes proportional odds (i.e., that each predictor has the same effect at every item threshold), and alternative DIF procedures that allow non-proportional variation across thresholds should be used when possible.
Additionally, future work should incorporate tests of sex invariance simultaneously with age, given previous work on sex differences in incidence and invariance in trait depression (e.g., Barry, Allore, Guo, Bruce, & Gill, 2008), and should extend this work to alternative methods and paradigms for testing DIF.
The characteristics of the sample are a strength of this study. The use of national register-based data helped minimize selection bias, and the sheer size of the dataset (total n = 9,045) provided sufficient power to detect differential item functioning, as indicated by bootstrap power analysis. While the sample provided ample opportunity to test and internally replicate the effects, replication in other samples and populations remains an important step in further testing of this scale. The sample was drawn from the Danish Twin Registry, which provides a representative sample of a single national population, so the findings may require replication in other national samples. The present analysis is restricted to a single occasion; future studies should make use of both the longitudinal design and the family clustering inherent to this dataset.
The power and effect sizes underlying the DIF found in the affective subscale merit further discussion, especially given the characteristics of this sample and the disagreement between the two effect-size measures used in this study. On the one hand, this study had the advantage of a very large sample size, and the DIF effects it found were consistently small as defined by ΔR². Had a more conventional panel design with a sample size in the hundreds rather than thousands been used, it is unlikely that we would have detected all of the DIF discovered in this study, and we might have concluded that these subscales were relatively invariant. None of the changes in symptoms across age were so large as to change the sign of any item discrimination, nor were any discriminations near zero at one age and very large at another. On the other hand, this study used a strong multiple testing correction, and the twin design inherent to the data allowed us to replicate these DIF effects across twin subsamples and varying treatments of age. DIF effects in the affective scale ranged from small to large by the ETS Δ measure; medium to large Δ effect sizes accompanying small ΔR² values are a known discrepancy between the measures (Hidalgo & Lopez-Pena, 2004).
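A rough calculation illustrates the sample-size argument. Under a fixed alternative, the noncentrality of a likelihood ratio test grows roughly in proportion to sample size, so an observed χ² can be projected to a smaller study. The sketch assumes, without support from the paper, that χ² − df estimates the noncentrality, that each twin subsample has about 4,500 respondents (roughly half of 9,045), and that the .0029 criterion is kept; it uses the “Lonely” two-group omnibus test for twin 1 from Table A1 (χ² = 61.557, df = 2). It illustrates the reasoning and is not a reanalysis; the function names are hypothetical.

```python
import math


def chi2_sf_even(x: float, df: int) -> float:
    """Upper tail of a central chi-square with even df (closed form)."""
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def chi2_critical_even(alpha: float, df: int) -> float:
    """Critical value c with P(X > c) = alpha, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def ncx2_sf_even(x: float, df: int, lam: float, terms: int = 400) -> float:
    """Noncentral chi-square upper tail as a Poisson(lam / 2) mixture of
    central chi-squares with df + 2j degrees of freedom (even df only)."""
    weight = math.exp(-lam / 2.0)
    total = 0.0
    for j in range(terms):
        if j > 0:
            weight *= (lam / 2.0) / j
        total += weight * chi2_sf_even(x, df + 2 * j)
    return total


def projected_power(chi2_obs: float, df: int, n_obs: int, n_new: int,
                    alpha: float) -> float:
    """Rough power of the same LR test at another sample size, treating
    chi2_obs - df as the noncentrality and scaling it linearly with n."""
    lam = max(chi2_obs - df, 0.0) * n_new / n_obs
    return ncx2_sf_even(chi2_critical_even(alpha, df), df, lam)


if __name__ == "__main__":
    # "Lonely", twin 1, two-group omnibus test (Table A1); n_obs is assumed.
    for n_new in (4500, 900, 450):
        power = projected_power(61.557, 2, 4500, n_new, alpha=0.0029)
        print(f"n = {n_new}: approximate power {power:.2f}")
```

Under these assumptions, the projected power for this comparatively large effect is close to one at the observed sample size but falls to around 0.2 at n = 450, which is consistent with the claim that a study with hundreds of respondents would likely have missed some of the DIF reported here.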
Some question as to the impact of this degree of DIF on the use of the CAMDEX is warranted. From an empirical perspective, the range of difficulty and discrimination values found across groups is relatively small. The trait scores generated from the youngest cohort’s affective subscale scoring model correlate very highly with those generated from the oldest cohort’s model (r = .985), and both correlate very highly with sum scores as well (r = .959 for the youngest cohort’s model, r = .965 for the oldest cohort’s model). It is therefore unlikely that correlations and regressions of the CAMDEX on other scales will be substantively affected by changes in trait scores of this magnitude. However, the great majority of the DIF results persisted and replicated across samples and treatments of age.
The results of this study show that the structure of the CAMDEX depression scale, as assessed by the McGue and Christensen (1997) two-factor solution, varies as a function of age, and that neither the somatic nor the affective subscale of this measure demonstrates measurement invariance in its current form. This is an undesirable finding from a measurement perspective, but in a broader context it is somewhat illuminating. The problems found in the affective subscale do not indicate simple age-related changes in endorsement, but rather a more complex substantive problem: the symptoms of depression, as measured by the CAMDEX affective subscale, change in their relation to depression as age increases. This result may help to explain the disagreement between clinical diagnosis and self-report measures of depression discussed in the introduction. Items such as “Feel Happy” and “Future” may discriminate better between varying levels of affective depression in older adults, while more mood-related items such as “Lonely” and “Worthless” may discriminate better in middle-aged adults. Symptom endorsement does generally increase over the lifespan, but much of this increase is due to changes in scale usage with increased age. Whether these changes reflect cohort effects or developmental change, and whether they indicate idiographic manifestations of depression, a population-wide change in factor structure over age, or simply the accumulation of age-related non-depression causes of symptom endorsement, are important questions for further research. In the short term, we must test the invariance of survey-based depression scales whenever possible, both as stand-alone methodological investigations and as a first step in applied developmental research on depression. If these tests yield consistent findings, then scale construction should change to create more developmentally sensitive measures of depression.
If such a scale cannot be built, however, we must consider the possibility that depression does not change unidimensionally or independent of other age-related changes, and change our hypotheses and measures accordingly.
Acknowledgments
This research was supported by NIH grants R01 AG18436 and R25 DA026119. The Danish Twin Registry is supported by grants from The National Program for Research Infrastructure 2007 from the Danish Agency for Science, Technology and Innovation, the Velux Foundation, and the US National Institutes of Health (P01 AG08761).
Appendix
Table A1.
Omnibus DIF Tests and Effect Sizes for Two-Group Age Treatment
| Affective | Twin 1 χ² | Twin 1 p | Twin 1 ΔR² | Twin 1 Δ | Twin 2 χ² | Twin 2 p | Twin 2 ΔR² | Twin 2 Δ | DIF? |
|---|---|---|---|---|---|---|---|---|---|
| 1. Happy Life | 10.477 | 0.0053 | .005 | 1.211 | 16.504 | 0.0003 | .007 | 1.621 | 5 of 6 | ||
| 2. Feel Happy | 8.575 | 0.0137 | .000 | 0.145 | 49.832 | <.0001 | .002 | 1.495 | 4 of 6 | ||
| 3. Lonely | 61.557 | <.0001 | .009 | −2.140 | 107.493 | <.0001 | .010 | −2.353 | 6 of 6 | ||
| 4. Tense | 5.867 | 0.0532 | .004 | 0.706 | 5.898 | 0.0524 | .001 | −0.003 | 1 of 6 | ||
| 5. Nervous | 23.431 | <.0001 | .006 | −0.094 | 26.143 | <.0001 | .003 | −0.654 | 5 of 6 | ||
| 6. Sad | 1.406 | 0.4952 | .000 | 0.881 | 4.391 | 0.1113 | .003 | 0.496 | 0 of 6 | ||
| 7. Worthless | 47.883 | <.0001 | .007 | −1.133 | 67.748 | <.0001 | .010 | −1.464 | 6 of 6 | ||
| 8. Future | 109.813 | <.0001 | .011 | −1.348 | 214.371 | <.0001 | .014 | −1.635 | 6 of 6 |
| 9. Life | 2.715 | 0.2574 | .000 | −0.344 | 14.990 | 0.0006 | .000 | −0.234 | 2 of 6 | ||
| Somatic | χ² | p | ΔR² | Δ | χ² | p | ΔR² | Δ | DIF? |
| 1. Long Sleep | 3.396 | 0.1830 | .001 | −0.500 | 20.880 | <.0001 | .004 | −0.833 | 2 of 6 | ||
| 2. Cope | 144.471 | <.0001 | .012 | −2.892 | 130.827 | <.0001 | .012 | −2.594 | 6 of 6 | ||
| 3. Decisions | 2.259 | 0.3232 | .000 | −0.320 | 1.082 | 0.5821 | .000 | −0.094 | 0 of 6 | ||
| 4. Lost Interest | 5.593 | 0.0610 | .002 | −0.427 | 3.397 | 0.1829 | .001 | −0.246 | 0 of 6 | ||
| 5. Lost Energy | 3.488 | 0.1748 | .001 | 0.059 | 7.659 | 0.0217 | .001 | −0.186 | 0 of 6 | ||
| 6. Concentrate | 1.559 | 0.4586 | .001 | −0.039 | 3.308 | 0.1913 | .001 | −0.393 | 0 of 6 | ||
| 7. Speak Slowly | 4.982 | 0.0828 | .002 | −1.398 | 1.517 | 0.4683 | .001 | −0.585 | 0 of 6 | ||
| 8. Think Slowly | 2.903 | 0.2343 | .002 | 0.036 | 0.756 | 0.6851 | .001 | 0.250 | 0 of 6 | ||
Note. Likelihood ratio tests and associated p-values for associations between item-level responses and age, partialling out the effect of the corresponding (affective or somatic) depression trait. Age is treated as a two-group categorical variable for the purposes of this analysis. Chi-square values indicate the difference in fit between a model in which each item is regressed on the depression trait and a model that also includes ordinal logistic regression effects of age group and age × trait interactions. Both ΔR² and Δ are measures of effect size, while “DIF?” indicates how many of the six total tests (3 age groupings by 2 twins) find significant differential item functioning. A Bonferroni-corrected criterion value of .0029 was used for significance testing and for the “DIF?” column.
Table A2.
Omnibus DIF Tests and Effect Sizes for Three-Group Age Treatment
| Affective | Twin 1 χ² | Twin 1 p | Twin 1 ΔR² | Twin 1 Δ | Twin 2 χ² | Twin 2 p | Twin 2 ΔR² | Twin 2 Δ | DIF? |
|---|---|---|---|---|---|---|---|---|---|
| 1. Happy Life | 18.946 | 0.0008 | 0.005 | 1.447 | 51.130 | <.0001 | 0.008 | 1.888 | 5 of 6 | ||
| 2. Feel Happy | 46.887 | <.0001 | 0.001 | 0.483 | 77.483 | <.0001 | 0.003 | 1.615 | 4 of 6 | ||
| 3. Lonely | 74.944 | <.0001 | 0.008 | −2.394 | 118.914 | <.0001 | 0.012 | −2.628 | 6 of 6 | ||
| 4. Tense | 6.840 | 0.1446 | 0.005 | 0.380 | 6.997 | 0.1361 | 0.002 | −0.100 | 1 of 6 | ||
| 5. Nervous | 27.726 | <.0001 | 0.009 | −0.164 | 14.537 | 0.0058 | 0.005 | −0.570 | 5 of 6 | ||
| 6. Sad | 9.594 | 0.0479 | 0.001 | 1.691 | 2.006 | 0.7347 | 0.005 | 1.844 | 0 of 6 | ||
| 7. Worthless | 36.205 | <.0001 | 0.006 | −1.109 | 45.084 | <.0001 | 0.008 | −1.450 | 6 of 6 | ||
| 8. Future | 161.158 | <.0001 | 0.013 | −1.743 | 243.290 | <.0001 | 0.017 | −2.169 | 6 of 6 | ||
| 9. Life | 6.913 | 0.1405 | 0.001 | −0.430 | 18.237 | 0.0011 | 0.000 | −0.369 | 2 of 6 | ||
| Somatic | χ² | p | ΔR² | Δ | χ² | p | ΔR² | Δ | DIF? |
| 1. Long Sleep | 4.081 | 0.3951 | 0.001 | −0.574 | 23.526 | 0.0001 | 0.007 | −0.748 | 2 of 6 | ||
| 2. Cope | 126.892 | <.0001 | 0.010 | −3.356 | 174.567 | <.0001 | 0.016 | −3.962 | 6 of 6 | ||
| 3. Decisions | 4.582 | 0.3329 | 0.002 | 0.051 | 2.038 | 0.7288 | 0.001 | 0.224 | 0 of 6 | ||
| 4. Lost Interest | 5.779 | 0.2163 | 0.001 | −0.748 | 1.951 | 0.7448 | 0.001 | −0.017 | 0 of 6 | ||
| 5. Lost Energy | 7.168 | 0.1273 | 0.001 | 0.030 | 6.543 | 0.1621 | 0.001 | −0.108 | 0 of 6 | ||
| 6. Concentrate | 5.377 | 0.2508 | 0.003 | −0.302 | 3.983 | 0.4083 | 0.001 | −0.324 | 0 of 6 | ||
| 7. Speak Slowly | 9.922 | 0.0418 | 0.004 | −1.751 | 8.024 | 0.0907 | 0.004 | −1.456 | 0 of 6 | ||
| 8. Think Slowly | 7.238 | 0.1238 | 0.003 | −0.057 | 4.931 | 0.2944 | 0.001 | −0.176 | 0 of 6 | ||
Note. Likelihood ratio tests and associated p-values for associations between item-level responses and age, partialling out the effect of the corresponding (affective or somatic) depression trait. Age is treated as a three-group categorical variable for the purposes of this analysis. Chi-square values indicate the difference in fit between a model in which each item is regressed on the depression trait and a model that also includes ordinal logistic regression effects of age group and age × trait interactions. Both ΔR² and Δ are measures of effect size, while “DIF?” indicates how many of the six total tests (3 age groupings by 2 twins) find significant differential item functioning. A Bonferroni-corrected criterion value of .0029 was used for significance testing and for the “DIF?” column.
Table A3.
Uniform DIF
| Affective | Twin 1, two-group χ² | p | Twin 1, three-group χ² | p | Twin 1, four-group χ² | p | Twin 2, two-group χ² | p | Twin 2, three-group χ² | p | Twin 2, four-group χ² | p | DIF? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Happy Life | 9.160 | 0.0025 | 18.101 | 0.0001 | 27.210 | <.0001 | 0.001 | 0.9719 | 40.181 | <.0001 | 46.287 | <.0001 | 5 of 6 | ||||||
| 2. Feel Happy | 8.231 | 0.0041 | 35.740 | <.0001 | 13.441 | 0.0038 | 25.621 | <.0001 | 36.138 | <.0001 | 41.369 | <.0001 | 4 of 6 | ||||||
| 3. Lonely | 32.344 | <.0001 | 57.405 | <.0001 | 56.980 | <.0001 | 87.120 | <.0001 | 103.887 | <.0001 | 133.074 | <.0001 | 6 of 6 | ||||||
| 4. Tense | 5.064 | 0.0244 | 3.140 | 0.2081 | 11.171 | 0.0108 | 3.422 | 0.0643 | 4.342 | 0.1141 | 6.258 | 0.0997 | 0 of 6 | ||||||
| 5. Nervous | 1.310 | 0.2524 | 2.501 | 0.2864 | 11.393 | 0.0098 | 10.940 | 0.0009 | 6.745 | 0.0343 | 12.877 | 0.0049 | 1 of 6 | ||||||
| 6. Sad | 1.256 | 0.2624 | 7.736 | 0.0209 | 3.349 | 0.3409 | 4.257 | 0.0391 | 1.219 | 0.5436 | 2.059 | 0.5602 | 0 of 6 | ||||||
| 7. Worthless | 4.909 | 0.0267 | 6.089 | 0.0476 | 4.708 | 0.1944 | 18.986 | <.0001 | 13.313 | 0.0013 | 16.571 | 0.0009 | 3 of 6 | ||||||
| 8. Future | 103.520 | <.0001 | 150.650 | <.0001 | 132.425 | <.0001 | 207.303 | <.0001 | 229.346 | <.0001 | 233.294 | <.0001 | 6 of 6 | ||||||
| 9. Life | 2.367 | 0.1239 | 5.402 | 0.0672 | 8.013 | 0.0458 | 14.899 | 0.0001 | 17.634 | 0.0001 | 11.703 | 0.0085 | 2 of 6 | ||||||
| Somatic | χ² | p | χ² | p | χ² | p | χ² | p | χ² | p | χ² | p | DIF? |
| 1. Long Sleep | 2.298 | 0.1295 | 3.895 | 0.1426 | 3.793 | 0.2846 | 0.986 | 0.3208 | 8.691 | 0.0130 | 2.620 | 0.4541 | 0 of 6 | ||||||
| 2. Cope | 128.751 | <.0001 | 100.995 | <.0001 | 146.199 | <.0001 | 120.260 | <.0001 | 140.006 | <.0001 | 141.039 | <.0001 | 6 of 6 |
| 3. Decisions | 1.424 | 0.2328 | 4.391 | 0.1113 | 5.684 | 0.1281 | 0.688 | 0.4069 | 1.975 | 0.3724 | 2.192 | 0.5336 | 0 of 6 | ||||||
| 4. Lost Interest | 0.121 | 0.7276 | 1.094 | 0.5788 | 1.511 | 0.6798 | 0.071 | 0.7898 | 0.310 | 0.8562 | 0.494 | 0.9203 | 0 of 6 |
| 5. Lost Energy | 0.562 | 0.4534 | 5.439 | 0.0659 | 1.570 | 0.6661 | 3.769 | 0.0522 | 2.222 | 0.3293 | 2.900 | 0.4073 | 0 of 6 | ||||||
| 6. Concentrate | 0.031 | 0.8607 | 0.823 | 0.6627 | 5.684 | 0.1280 | 0.817 | 0.3661 | 1.530 | 0.4652 | 0.869 | 0.8330 | 0 of 6 | ||||||
| 7. Speak Slowly | 3.130 | 0.0769 | 8.430 | 0.0148 | 8.094 | 0.0441 | 0.128 | 0.7207 | 2.758 | 0.2518 | 4.765 | 0.1899 | 0 of 6 | ||||||
| 8. Think Slowly | 0.138 | 0.7098 | 3.170 | 0.2049 | 0.931 | 0.8179 | 0.275 | 0.5999 | 2.600 | 0.2725 | 8.191 | 0.0422 | 0 of 6 | ||||||
Note. Chi-square and associated p-values for likelihood ratio tests of uniform DIF.
Table A4.
Non-Uniform DIF
| Affective | Twin 1, two-group χ² | p | Twin 1, three-group χ² | p | Twin 1, four-group χ² | p | Twin 2, two-group χ² | p | Twin 2, three-group χ² | p | Twin 2, four-group χ² | p | DIF? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Happy Life | 1.546 | 0.2138 | 0.841 | 0.6566 | 3.008 | 0.3903 | 16.485 | <.0001 | 10.957 | 0.0042 | 8.250 | 0.0411 | 1 of 6 | ||||||
| 2. Feel Happy | 0.268 | 0.6046 | 11.166 | 0.0038 | 2.667 | 0.4459 | 24.189 | <.0001 | 41.355 | <.0001 | 18.675 | 0.0003 | 3 of 6 | ||||||
| 3. Lonely | 29.240 | <.0001 | 17.564 | 0.0002 | 37.502 | <.0001 | 20.397 | <.0001 | 15.040 | 0.0005 | 33.640 | <.0001 | 6 of 6 | ||||||
| 4. Tense | 0.837 | 0.3603 | 3.701 | 0.1572 | 9.268 | 0.0259 | 2.482 | 0.1151 | 2.652 | 0.2655 | 5.327 | 0.1493 | 0 of 6 | ||||||
| 5. Nervous | 22.102 | <.0001 | 25.254 | <.0001 | 20.553 | 0.0001 | 15.211 | 0.0001 | 7.789 | 0.0204 | 18.433 | 0.0004 | 5 of 6 | ||||||
| 6. Sad | 0.098 | 0.7541 | 1.858 | 0.3949 | 2.117 | 0.5485 | 0.136 | 0.7121 | 0.788 | 0.6744 | 2.292 | 0.5141 | 0 of 6 | ||||||
| 7. Worthless | 42.941 | <.0001 | 30.144 | <.0001 | 45.771 | <.0001 | 48.772 | <.0001 | 31.770 | <.0001 | 55.472 | <.0001 | 6 of 6 | ||||||
| 8. Future | 6.125 | 0.0133 | 10.574 | 0.0051 | 5.351 | 0.1478 | 7.118 | 0.0076 | 13.935 | 0.0009 | 22.465 | 0.0001 | 2 of 6 | ||||||
| 9. Life | 0.325 | 0.5687 | 1.518 | 0.4680 | 2.971 | 0.3962 | 0.093 | 0.7603 | 0.606 | 0.7387 | 1.951 | 0.5826 | 0 of 6 | ||||||
| Somatic | χ² | p | χ² | p | χ² | p | χ² | p | χ² | p | χ² | p | DIF? |
| 1. Long Sleep | 1.098 | 0.2947 | 0.186 | 0.9112 | 2.497 | 0.4758 | 19.910 | <.0001 | 14.803 | <.0001 | 10.842 | 0.0126 | 2 of 6 | ||||||
| 2. Cope | 15.717 | <.0001 | 25.915 | <.0001 | 34.539 | <.0001 | 10.681 | 0.0011 | 34.297 | <.0001 | 30.876 | <.0001 | 6 of 6 |
| 3. Decisions | 0.835 | 0.3608 | 0.192 | 0.9085 | 6.153 | 0.1044 | 0.409 | 0.5225 | 0.042 | 0.9792 | 0.974 | 0.8075 | 0 of 6 | ||||||
| 4. Lost Interest | 5.473 | 0.0193 | 4.684 | 0.0961 | 6.143 | 0.1049 | 3.347 | 0.0673 | 1.614 | 0.4462 | 3.175 | 0.3654 | 0 of 6 | ||||||
| 5. Lost Energy | 2.925 | 0.0872 | 1.727 | 0.4217 | 4.035 | 0.2577 | 3.945 | 0.0470 | 4.218 | 0.1214 | 6.179 | 0.1032 | 0 of 6 | ||||||
| 6. Concentrate | 1.528 | 0.2164 | 4.553 | 0.1026 | 6.659 | 0.0836 | 2.511 | 0.1131 | 2.426 | 0.2973 | 2.486 | 0.4778 | 0 of 6 | ||||||
| 7. Speak Slowly | 1.853 | 0.1734 | 1.492 | 0.4743 | 4.268 | 0.2339 | 1.395 | 0.2376 | 5.251 | 0.0724 | 3.558 | 0.3133 | 0 of 6 | ||||||
| 8. Think Slowly | 2.764 | 0.0964 | 4.068 | 0.1308 | 3.282 | 0.3502 | 0.486 | 0.4857 | 2.286 | 0.3189 | 1.333 | 0.7213 | 0 of 6 | ||||||
Note. Chi-square and associated p-values for likelihood ratio tests of non-uniform DIF.
Table A5.
Restating Four-Group DIF Results as Binary Comparisons
| Comparison | Affective, Twin 1 | Affective, Twin 2 | Somatic, Twin 1 | Somatic, Twin 2 |
|---|---|---|---|---|
| MADT 50s vs. MADT 60s | 0 of 9 | 2 of 9 (1, 8) | 0 of 8 | 0 of 8 |
| MADT 60s vs. LSADT 70s | 0 of 9 | 2 of 9 (3, 8) | 1 of 8 (2) | 1 of 8 (2) |
| LSADT 70s vs. LSADT 80s+ | 1 of 9 (3) | 2 of 9 (3, 5) | 1 of 8 (2) | 2 of 8 (2, 3) |
| MADT 50s vs. LSADT 70s | 3 of 9 (1, 7, 8) | 4 of 9 (1, 3, 7, 8) | 1 of 8 (2) | 1 of 8 (2) |
| MADT 60s vs. LSADT 80s+ | 5 of 9 (3, 4, 5, 7, 8) | 2 of 9 (3, 8) | 1 of 8 (2) | 1 of 8 (2) |
| MADT 50s vs. LSADT 80s+ | 4 of 9 (2, 3, 7, 8) | 5 of 9 (2, 3, 5, 7, 8) | 3 of 8 (2, 6, 7) | 1 of 8 (2) |
Note. Summary of DIF found in each possible binary comparison between age groups in the four-group analysis. Each cell gives the number of the subscale's items that were identified as differentially functioning. Items are identified by number rather than by name, with the number indicating the item’s position in every table. For the Affective subscale, 1=“Happy Life”, 2=“Feel Happy”, 3=“Lonely”, 4=“Tense”, 5=“Nervous”, 7=“Worthless”, 8=“Future”. For the Somatic subscale, 2=“Cope”, 3=“Decisions”, 6=“Concentrate”, 7=“Speak Slowly”, 8=“Think Slowly”.
Footnotes
At the time of this writing, the Danish population is 89.4% ethnic Dane, 3.4% immigrant from Western EU & Nordic countries, and 7.2% non-Western immigrants (StatBank Denmark, 2014). Race was not collected as part of the MADT or the LSADT.
McGue & Christensen’s (1997) scoring paper has been cited 109 times per Google Scholar at the time of this submission. Outside of sum-score based methods, no other scoring procedures have been published for this scale to the author’s knowledge.
The estimation of separate difficulty and discrimination parameters for each group is done via data manipulation in the lordif package. For example, identifying one item as differentially functioning in the two-group model for somatic symptoms (eight items) would be handled by separating that item into two new variables. For all individuals in the first group, the first variable would contain their responses to the differentially functioning item while the second variable would be missing. In the second group, the first variable would be missing for all individuals while the second variable would contain those individuals’ responses to the item in question. As a result, there would now be nine items instead of eight for the next iteration of the DIF procedure: the seven items retained as invariant, and two copies of the differentially functioning item with structured missingness inserted to allow for separate item parameters for each group. Only the invariant items are used for test equating in subsequent iterations.
The equivalence between item response models and item factor analysis (McDonald, 1985; Kamata & Bauer, 2008) provides an alternative explanation, as item discriminations are equivalent to factor loadings and item difficulties are a function of loadings and thresholds given model identification constraints (see Kamata & Bauer, 2008, for exact transformation equations). Uniform DIF means that group differences are specific to the item thresholds, while the discriminations or factor loadings do not vary across groups. Non-uniform DIF indicates group differences in discriminations or factor loadings. As such, uniform DIF can be thought of as the retention of weak (metric) factorial invariance (Meredith, 1964a, 1964b, 1993) despite the rejection of strong and strict invariance, while non-uniform DIF indicates a failure of metric invariance.
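The item-splitting step described in the third footnote can be sketched as a small data transformation. This mirrors the idea only; it is not lordif's implementation, and the function name and the "item.group" column naming are hypothetical. Structurally missing responses are represented by None.

```python
from typing import Dict, List, Optional, Sequence

Response = Optional[int]  # None marks a structurally missing response


def split_item(data: Dict[str, List[Response]], item: str,
               groups: Sequence[str]) -> Dict[str, List[Response]]:
    """Replace `item` with one column per group, missing outside that group.

    The other (invariant) items are copied unchanged, so a scale of k items
    becomes k - 1 + (number of groups) columns.
    """
    labels = list(dict.fromkeys(groups))  # group labels in order of appearance
    out: Dict[str, List[Response]] = {}
    for name, column in data.items():
        if name != item:
            out[name] = list(column)
            continue
        for g in labels:
            # Hypothetical naming scheme: "<item>.<group>"
            out[f"{name}.{g}"] = [resp if grp == g else None
                                  for resp, grp in zip(column, groups)]
    return out


if __name__ == "__main__":
    somatic = ["Long Sleep", "Cope", "Decisions", "Lost Interest",
               "Lost Energy", "Concentrate", "Speak Slowly", "Think Slowly"]
    data = {name: [0, 1, 2, 1] for name in somatic}  # four toy respondents
    groups = ["young", "old", "young", "old"]
    split = split_item(data, "Cope", groups)
    print(len(split), split["Cope.young"], split["Cope.old"])
```

In a two-group split of the eight somatic items, this yields nine columns, matching the footnote's description.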
Contributor Information
Ryne Estabrook, Department of Medical Social Sciences, Northwestern University, Chicago, IL.
Michael E. Sadler, Department of Psychology and Counseling, Gannon University, 109 University Square, Erie, PA 16541.
Matt McGue, The Danish Twin Registry and Danish Aging Research Center, University of Southern Denmark, 5000 Odense C, Denmark; Department of Psychology, University of Minnesota, N218 Elliott Hall, 75 East River Road, Minneapolis, MN 55455.
References
- Aggen SH, Kendler KS, Kubarych TS, Neale MC. Differential age and sex effects in the assessment of major depression: A population-based twin item analysis of the DSM criteria. Twin Research and Human Genetics. 2011;14(6):524–538. doi: 10.1375/twin.14.6.524.
- Alexopoulos GS. Depression in the elderly. Lancet. 2005;365:1961–1970. doi: 10.1016/S0140-6736(05)66665-2.
- Baldwin G, Shean GD. A taxometric study of the Center for Epidemiological Studies Depression Scale. Genetic, Social, and General Psychology Monographs. 2006;132(2):101–128. doi: 10.3200/mono.132.2.101-128.
- Balsis S, Cully JA. Comparing depression diagnostic symptoms across younger and older adults. Aging & Mental Health. 2008;12(6):800–806. doi: 10.1080/13607860802428000.
- Barry LC, Allore HG, Guo Z, Bruce ML, Gill TM. Higher burden of depression among older women: The effect of onset, persistence and mortality over time. Archives of General Psychiatry. 2008;65(2):172–178. doi: 10.1001/archgenpsychiatry.2007.17.
- Beck AT, Steer RA. Manual for the Beck Depression Inventory. San Antonio, TX: Psychological Corporation; 1993.
- Beck AT, Steer RA, Brown GK. Manual for the Beck Depression Inventory-II. San Antonio, TX: Psychological Corporation; 1996.
- Beekman ATF, Copeland JRM, Prince MJ. Review of community prevalence of depression in late life. British Journal of Psychiatry. 1999;174:307–311. doi: 10.1192/bjp.174.4.307.
- Blazer DG. Depression in late life: Review and commentary. Journal of Gerontology: Medical Sciences. 2003;58A(3):249–265. doi: 10.1093/gerona/58.3.m249.
- Blazer D, Burchett B, Service C, George LK. The association of age and depression among the elderly: An epidemiologic exploration. Journal of Gerontology: Medical Sciences. 1991;46(6):M210–M215. doi: 10.1093/geronj/46.6.m210.
- Blazer DG, Hybels CF. What symptoms of depression predict mortality in community-dwelling elders? Journal of the American Geriatrics Society. 2004;52:2052–2056. doi: 10.1111/j.1532-5415.2004.52564.x.
- Bonin-Guillaume S, Clément JP, Chassain AP, Léger JM. Psychometric evaluation of depression in the elderly subject: Which instruments? What are the future perspectives? L’Encéphale. 1995;21(1):25–34.
- Brislin RW. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology. 1970;1(3):185–216.
- Bruce ML, Seeman TE, Merrill SS, Blazer D. The impact of depressive symptomatology on physical disability: MacArthur Studies of Successful Aging. American Journal of Public Health. 1994;84(11):1796–1799. doi: 10.2105/ajph.84.11.1796.
- Chew-Graham C, Kovandzic M, Gask L, Burroughs H, Clarke P, Sanderson H, Dowrick C. Why may older people with depression not present to primary care? Messages from secondary analysis of qualitative data. Health and Social Care in the Community. 2012;20(1):52–60. doi: 10.1111/j.1365-2524.2011.01015.x.
- Choi SW, Gibbons LE, Crane PK. lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software. 2011;39(8):1–30. doi: 10.18637/jss.v039.i08. URL http://www.jstatsoft.org/v39/i08/
- Christensen K, Herskind AM, Vaupel JW. Why Danes are smug: Comparative study of life satisfaction in the European Union. British Medical Journal. 2006;333:1289–1291. doi: 10.1136/bmj.39028.665602.55.
- Christensen K, Holm N, McGue M, Corder L, Vaupel JW. A Danish population-based twin study on self-rated health in the elderly. Journal of Aging and Health. 1999;11:49–64. doi: 10.1177/089826439901100103.
- Christiansen L, Frederiksen H, Schousboe K, Skytthe A, von Wurmb Schwark N, Christensen K, Kyvik K. Age- and sex-differences in the validity of questionnaire-based zygosity in twins. Twin Research. 2003;6(4):275–278. doi: 10.1375/136905203322296610.
- Cole MG, Bellavance F, Mansour A. Prognosis of depression in elderly community and primary care populations: A systematic review and meta-analysis. The American Journal of Psychiatry. 1999;156(8):1182–1189. doi: 10.1176/ajp.156.8.1182.
- Crane PK, Gibbons LE, Jolley L, van Belle G. Differential item functioning analysis with ordinal logistic regression techniques: DIFdetect and difwithpar. Medical Care. 2006;44(11 Suppl 3):S115–S123. doi: 10.1097/01.mlr.0000245183.28384.ed.
- Dorans NJ. Two new approaches to assessing differential item functioning: Standardization and the Mantel-Haenszel method. Applied Measurement in Education. 1989;2:217–233.
- Drayer RA, Mulsant BH, Lenze EJ, Rollman BL, Dew MA, Kelleher K, Reynolds CF. Somatic symptoms of depression in elderly patients with medical co-morbidities. International Journal of Geriatric Psychiatry. 2005;20:973–982. doi: 10.1002/gps.1389.
- Embretson SE, Reise S. Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers; 2000.
- Evans DM, Martin NG. The validity of twin studies. GeneScreen. 2000;1:77–79.
- Feinson MC. Aging and mental health: Distinguishing myth from reality. Research on Aging. 1985;7:155–174. doi: 10.1177/0164027585007002001.
- Forkmann T, Gauggel S, Spangenberg L, Brähler E, Glaesmer H. Dimensional assessment of depressive severity in the elderly: Psychometric evaluation of the PHQ-9 using Rasch analysis. Journal of Affective Disorders. 2013;148:323–330. doi: 10.1016/j.jad.2012.12.019.
- French AW, Miller TR. Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement. 1996;33:315–332.
- Gaist D, Bathum L, Skytthe A, Jensen TK, McGue M, Vaupel JW, Christensen K. Strength and anthropometric measures in identical and fraternal twins: No evidence of masculinization of females with male co-twins. Epidemiology. 2000;11(3):340–343. doi: 10.1097/00001648-200005000-00020.
- Gallow JJ, Rabins PV, Lyketsos CG, Tien AY, Anthony JC. Depression without sadness: Functional outcomes of nondysphoric depression in later life. Journal of the American Geriatrics Society. 1997;45(5):570–578. doi: 10.1111/j.1532-5415.1997.tb03089.x.
- Gatz M, Hurwicz M. Are old people more depressed? Cross-sectional data on Center for Epidemiological Studies Depression Scale factors. Psychology and Aging. 1990;5(2):284–290. doi: 10.1037//0882-7974.5.2.284.
- Geroldi C, Frisoni GB, Zanetti O, Bianchetti A, Trabucchi M, De Leo D, Bocola V. Assessment of depression with psychometric scales in Alzheimer's disease. Past, Present and Future of Psychiatry. 1994:538–542.
- Giuffra LA, Risch N. Diminished recall and the cohort effect of major depression: A simulation study. Psychological Medicine. 1994;24(2):375–383. doi: 10.1017/s0033291700027355.
- Glaesmer H, Riedel-Heller S, Braehler E, Spangenberg L, Luppa M. Age- and gender-specific prevalence and risk factors for depressive symptoms in the elderly: A population-based study. International Psychogeriatrics. 2011;23(8):1294–1300. doi: 10.1017/S1041610211000780.
- Hamilton M. A rating scale for depression. Journal of Neurology, Neurosurgery, and Psychiatry. 1960;23:56–61. doi: 10.1136/jnnp.23.1.56.
- Hidalgo MD, López-Pina JA. Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement. 2004;64:903–915.
- Holland PW, Thayer DT. Differential item performance and the Mantel-Haenszel procedure. In: Wainer H, Braun HI, editors. Test validity. Hillsdale, NJ: Erlbaum; 1988. pp. 129–145.
- Hybels CF, Blazer DG, Landerman LR, Steffens DC. Heterogeneity in symptom profiles among older adults diagnosed with major depression. International Psychogeriatrics. 2011;23(6):906–922. doi: 10.1017/S1041610210002346.
- Iwasa H, Yoshida Y, Kumagai S, Ihara K, Yoshido H, Suzuki T. Depression status as a reliable predictor of functional status among Japanese community-dwelling older adults: A 12-year population-based cohort study. International Journal of Geriatric Psychiatry. 2009. doi: 10.1002/gps.2245. Published online.
- Jeste DV, Blazer DG, First MB. Aging-related diagnostic variations: Need for diagnostic criteria appropriate for elderly psychiatric patients. In: Narrow WE, First MB, Sirovatka PJ, Regier DA, editors. Age and gender considerations in psychiatric diagnosis: A research agenda for DSM-V. Arlington, VA: American Psychiatric Publishing, Inc; 2007. pp. 273–288.
- Jodoin MG, Gierl MJ. Evaluating Type I error and power rates using an effect size measure with logistic regression procedure for DIF detection. Applied Measurement in Education. 2001;14:329–349.
- Johnson W, McGue M, Gaist D, Vaupel JW, Christensen K. Frequency and heritability of depression symptomatology in the second half of life: Evidence from Danish twins over 45. Psychological Medicine. 2002;32:1175–1185. doi: 10.1017/s0033291702006207.
- Jorm AF. Does old age reduce the risk of anxiety and depression? A review of epidemiological studies across the adult life span. Psychological Medicine. 2000;30:11–22. doi: 10.1017/s0033291799001452.
- Kamata A, Bauer D. A note on the relation between factor analytic and item response theory models. Structural Equation Modeling. 2008;15:136–153.
- Kendler KS, Gardner CO. A longitudinal etiologic model for symptoms of anxiety and depression in women. Psychological Medicine. 2011;41:2035–2045. doi: 10.1017/S0033291711000225.
- Kendler KS, Martin NG, Heath AC, Eaves LJ. Self-report psychiatric symptoms in twins and their non-twin relatives: Are twins different? American Journal of Medical Genetics. 1995;60:588–591. doi: 10.1002/ajmg.1320600622.
- Kessler RC, Berglund P, Demler O, Jin R, Koretz D, Merikangas KR, Rush AJ, Wang PS. The epidemiology of major depressive disorder: Results from the National Comorbidity Survey Replication (NCS-R). Journal of the American Medical Association. 2003;289(23):3095–3105. doi: 10.1001/jama.289.23.3095.
- Kessler RC, Birnbaum HG, Shahly V, Bromet E, Hwang I, McLaughlin KA, Stein DJ. Age differences in the prevalence and co-morbidity of DSM-IV major depressive episodes: Results from the WHO World Mental Health Survey Initiative. Depression and Anxiety. 2010;27:351–364. doi: 10.1002/da.20634.
- Kessler RC, Mickelson KD, Walters EE, Zhao S, Hamilton L. Age and depression in the MIDUS Survey. In: Brim OG, Ryff CD, Kessler RC, editors. How healthy are we? A national study of well-being at midlife. Chicago, IL: The University of Chicago Press; 2009. pp. 227–251.
- Kim Y, Pilkonis PA, Frank E, Thase ME, Reynolds CF. Differential item functioning of the Beck Depression Inventory in late-life patients: Use of item response theory. Psychology and Aging. 2002;17(3):379–391. doi: 10.1037//0882-7974.17.3.379.
- Krause N. Mental disorder in late life: Exploring the influences of stress and socioeconomic status. In: Aneshensel CS, Phelan JC, editors. Handbook of the sociology of mental health. New York: Kluwer Academic/Plenum; 1999. pp. 183–208.
- Kyvik KO, Christensen K, Skytthe A, Harvald B, Holm NV. The Danish Twin Register. Danish Medical Bulletin. 1996;43(5):467–470.
- Law J, Laidlaw K, Peck D. Is depression viewed as an inevitable consequence of age? The "understandability phenomenon" in older people. Clinical Gerontologist. 2010;33:194–209.
- Lord FM. Applications of item response theory to practical testing problems. New York: Erlbaum Associates; 1980.
- Lumsden J. The construction of unidimensional tests. Unpublished master’s thesis, University of Western Australia, Perth, Australia; 1959.
- Lyness JM, Bruce ML, Koenig HG, Parmelee PA, Schulz R, Lawton P, Reynolds CF. Depression and medical illness in late life: Report of a symposium. Journal of the American Geriatrics Society. 1996;44:198–203. doi: 10.1111/j.1532-5415.1996.tb02440.x.
- Maydeu-Olivares A. Linear IRT, non-linear IRT, and factor analysis: A unified framework. In: Maydeu-Olivares A, McArdle JJ, editors. Contemporary psychometrics. A festschrift to Roderick P. McDonald. Mahwah, NJ: Lawrence Erlbaum; 2004. (2005)
- McDonald RP. Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum Associates; 1985.
- McDonald RP. Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates; 1999.
- McGue M, Christensen K. Genetic and environmental contributions to depression symptomatology: Evidence from Danish twins 75 years of age and older. Journal of Abnormal Psychology. 1997;106(3):439–448. doi: 10.1037//0021-843x.106.3.439.
- McGue M, Christensen K. The heritability of depression symptoms in elderly Danish twins: Occasion-specific versus general effects. Behavior Genetics. 2003;33(2):83–93. doi: 10.1023/a:1022545600034.
- McGue M, Christensen K. Growing old but not growing apart: Twin similarity in the latter half of the lifespan. Behavior Genetics. 2013;43:1–12. doi: 10.1007/s10519-012-9559-5.
- Meeks TW, Vahia IV, Lavretsky H, Kulkarni G, Jeste DV. A tune in “a minor” can “b major”: A review of epidemiology, illness course and public health implications of subthreshold depression in older adults. Journal of Affective Disorders. 2010;129:126–142. doi: 10.1016/j.jad.2010.09.015.
- Meredith W. Notes on factorial invariance. Psychometrika. 1964a;29:177–185.
- Meredith W. Rotation to achieve factorial invariance. Psychometrika. 1964b;29:186–206.
- Meredith W. Measurement invariance, factor analysis and factor invariance. Psychometrika. 1993;58:525–543.
- Miller TR, Spray JA. Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement. 1993;30:107–122.
- Narrow WE, Rae DS, Robins LN, Regier DA. Revised prevalence estimates of mental disorders in the United States. Archives of General Psychiatry. 2002;59:115–123. doi: 10.1001/archpsyc.59.2.115.
- Newman JP. Aging and depression. Psychology and Aging. 1989;4(2):150–165. doi: 10.1037//0882-7974.4.2.150.
- Pearson JL. Recent research on suicide in the elderly. Current Psychiatry Reports. 2002;4:59–63. doi: 10.1007/s11920-002-0014-9.
- Pennix et al. Depressive symptoms and physical decline in community-dwelling older persons. Journal of the American Medical Association. 1998;279(21):1720–1726. doi: 10.1001/jama.279.21.1720.
- Pincus HA, Petit AR. The societal costs of chronic major depression. Journal of Clinical Psychiatry. 2001;62(Suppl. 6):5–9.
- Pinho MX, Custodio O, Makdisse M, Carvalho ACC. Reliability and validity of the Geriatric Depression Scale in elderly individuals with coronary artery disease. Brazilian Archives of Cardiology. 2010;94(5):535–544. doi: 10.1590/s0066-782x2010005000032.
- R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011.
- Radloff LS. The CES-D Scale: A self-report depression scale for research in the general population. Applied Psychological Measurement. 1977;1:385–401.
- Rizopoulos D. ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software. 2006;17(5):1–25. http://cran.r-project.org/web/packages/ltm/ltm.pdf.
- Roth M, Tym E, Mountjoy CQ, Huppert FA, Verma S, Goddard R. CAMDEX: A standardized instrument for the diagnosis of mental disorder in the elderly with special reference to the early detection of dementia. The British Journal of Psychiatry. 1986;149:698–709. doi: 10.1192/bjp.149.6.698.
- Saz P, Dewey ME. Depression, depressive symptoms and mortality in persons aged 65 and over living in the community: A systematic review of the literature. International Journal of Geriatric Psychiatry. 2001;16:622–630. doi: 10.1002/gps.396.
- Skytthe A, Kyvik K, Holm NV, Vaupel JW, Christensen K. The Danish Twin Registry: 127 cohorts of twins. Twin Research. 2002;5(5):352–357. doi: 10.1375/136905202320906084.
- Skytthe A, Christiansen L, Kyvik KO, Bødker FL, Hvidberg L, Petersen I, Christensen K. The Danish Twin Registry: Linking surveys, national registers, and biological information. Twin Research and Human Genetics. 2013;16(1):104–111. doi: 10.1017/thg.2012.77.
- StatBank Denmark. 2014. Retrieved November 5, 2014, from http://www.statbank.dk/
- Stocking ML, Lord FM. Developing a common metric in item response theory. Applied Psychological Measurement. 1983;7(2):201–210.
- Swaminathan H, Rogers HJ. Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement. 1990;27:361–370.
- Takkinen S, Gold C, Pedersen NL, Malmberg B, Nilsson S, Rovine M. Gender differences in depression: A study of older unlike-sex twins. Aging & Mental Health. 2004;8:187–195. doi: 10.1080/13607860410001669714.
- Teachman. Aging and negative affect: The rise and fall and rise of anxiety and depression symptoms. Psychology and Aging. 2006;21(1):201–207. doi: 10.1037/0882-7974.21.1.201.
- Thielke SM, Diehr P, Unützer J. Prevalence, incidence, and persistence of major depressive symptoms in the Cardiovascular Health Study. Aging & Mental Health. 2010;14(2):168–176. doi: 10.1080/13607860903046537.
- Widaman KF, Reise SP. Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In: Bryant KJ, Windle M, West SG, editors. The science of prevention: Methodological advances from alcohol and substance abuse research. Washington, DC: American Psychological Association; 1997. pp. 281–324.
- Wong SYS, Mercer SW, Woo J, Leung J. The influence of multimorbidity and self-reported socio-economic standing on the prevalence of depression in an elderly Hong Kong population. BMC Public Health. 2008;8:119. doi: 10.1186/1471-2458-8-119.
- Xavier FMF, Ferraza MPT, Argimon I, Trentini CM, Poyares D, Bertollucci PH, …Moriguchi EH. The DSM-IV ‘minor depression’ disorder in the oldest-old: Prevalence rate, sleep patterns, memory function and quality of life in elderly people of Italian descent in southern Brazil. International Journal of Geriatric Psychiatry. 2002;17:107–116. doi: 10.1002/gps.517.
- Yesavage JA, Brink TL. Development and validation of a geriatric depression screening scale: A preliminary report. Journal of Psychiatric Research. 1983;17:37–49. doi: 10.1016/0022-3956(82)90033-4.
- Zieky M. Practical questions in the use of DIF statistics in test development. In: Holland PW, Wainer H, editors. Differential item functioning. Hillsdale, NJ: Erlbaum; 1993. pp. 337–347.
- Zumbo BD. A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999. Retrieved from http://educ.ubc.ca/faculty/zumbo/DIF/index.html.
- Zung WWK. A self-rating depression scale. Archives of General Psychiatry. 1965;12:63–70. doi: 10.1001/archpsyc.1965.01720310065008.
- Zwick R, Ercikan K. Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement. 1989;26:55–66.

