Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: Psychol Assess. 2019 Aug 8;32(1):98–107. doi: 10.1037/pas0000765

Are We Accurately Evaluating Depression in Patients with Cancer?

Rebecca M Saracino 1, Ezgi Aytürk 2, Heining Cham 3, Barry Rosenfeld 4, Leah M Feuerstahler 5, Christian J Nelson 6
PMCID: PMC6928435  NIHMSID: NIHMS1042384  PMID: 31393150

Abstract

Depression remains poorly managed in oncology, in part because of the difficulty of reliably screening and assessing for depression in the context of medical illness. Whether somatic items really skew the ability to identify “true” depression, or represent meaningful indicators of depression, remains to be determined. This study utilized item response theory (IRT) to compare the performance of traditional depression criteria with Endicott’s substitutive criteria (ESC; tearfulness or depressed appearance; social withdrawal; brooding; cannot be cheered up). The Patient Health Questionnaire (PHQ-9), ESC, and Center for Epidemiologic Studies Depression Scale (CES-D) were administered to 558 outpatients with cancer. IRT models were utilized to evaluate global and item fit for traditional PHQ-9 items compared to a modified version replacing the four somatic items with ESC. The modified PHQ-9 ESC scale was the best fit using a Partial Credit Model; model fit was improved after collapsing the middle two response categories and removing psychomotor agitation/retardation. This improved model showed satisfactory scale precision and internal consistency, and was free from differential item functioning for gender, age, and race. Concurrent and criterion validity were supported. Thus, as many have speculated, utilizing the ESC may result in more accurate identification of depressive symptoms in oncology. Depressed mood, anhedonia, and suicidal ideation retained their expected properties in the modified scale, indicating that the traditional underlying syndrome of depression likely remains the same, but the ESC may provide more specificity when assessing patients with cancer.

Keywords: depression, diagnostic criteria, oncology, IRT, screening


Accurate assessment of depression in patients with medical illness is critically important, as those with comorbid mood disorders are at significantly greater risk for non-adherence to medical treatments and premature mortality (Carney & Freedland, 2003; DiMatteo, Lepper, & Croghan, 2000; Misono, Weiss, Fann, Redman, & Yueh, 2008). Historically, clinicians and researchers have debated whether or not the reliance on somatic items when rendering a depression diagnosis inappropriately inflates the prevalence of depressive disorders among the medically ill, especially in oncology settings (Jones et al., 2015; Krebber et al., 2014; Saracino, Rosenfeld, & Nelson, 2018). Somatic items (i.e., sleep disturbance; fatigue; appetite changes; diminished concentration) may reflect side effects of treatment or the pathology of the underlying illness itself. Despite this concern, the Patient Health Questionnaire-9 item (PHQ-9; Kroenke & Spitzer, 2002), which relies exclusively on DSM criteria, remains one of the most widely utilized depression screening measures across medical settings (e.g., primary care, oncology, cardiovascular disease; Dyer et al., 2016; Forkmann, Gauggel, Spangenberg, Brahler, & Glaesmer, 2013; Gothwal, Begga, & Sumalini, 2014; Kendel et al., 2010; Pedersen, Mathiasen, Bang-Christensen, & Makransky, 2016; Williams et al., 2009).

The PHQ-9 consists of nine items, each of which corresponds to one of the nine symptoms required for a diagnosis of a major depressive disorder (MDD) as defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM; American Psychiatric Association, 2013). Respondents are asked to rate how often they have been bothered by each of the nine symptoms over the preceding two weeks. Respondents rate each item on a four-point scale (0=not at all, 1= several days, 2=more than half the days, 3=nearly every day). Due to its popularity, a handful of studies have used item response theory (IRT) to examine the PHQ-9 in samples of medical patients (Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Kendel et al., 2010; Pedersen et al., 2016; Williams et al., 2009). For example, Kendel et al. (2010) observed that among 1,271 patients undergoing coronary artery bypass graft surgery, most of the somatic items on the PHQ-9 did not meet criteria for a good overall model fit (i.e., according to fit statistics). Instead, they found that six out of seven items on the Hospital Anxiety and Depression Scale Depression subscale (HADS-D; Zigmond & Snaith, 1983), which rely entirely on cognitive and affective symptoms, and the two PHQ-9 items reflecting the DSM gateway symptoms of MDD (i.e., depressed mood and anhedonia) plus fatigue, were the strongest indicators of the underlying construct. They also identified differential item functioning (DIF) across genders on two PHQ-9 items; women were more likely than men to endorse depressed mood and fatigue conditional on the latent trait. In theory, DIF is an undesirable property of an item, as it indicates that respondents from different groups (e.g., males and females) with the same level of the latent trait have different probabilities of endorsing an item (Holland & Wainer, 1993).

A study of 1,531 patients with heart disease and implantable cardioverter defibrillators identified PHQ-9 items reflecting depressed mood, feeling bad about yourself or that you are a failure, and suicidal ideation, as being the best items for discriminating individuals with higher and lower levels of depression (Pedersen et al., 2016). They also found significant DIF for gender for the depressed mood item, such that women were more likely than men to endorse this item at the same underlying level of depression. Additionally, overall model fit was substantially improved after collapsing the two middle response options (several days and more than half the days) in the four-point scale, indicating that these two response options were not meaningfully distinguished from one another. Another study of 100 adults with a history of traumatic brain injury demonstrated similar findings, as all PHQ-9 items demonstrated good fit when the two intermediate response categories were collapsed (Dyer et al., 2016). Thus, regardless of the relative performance of individual items across clinical samples, a collapsed, three response option format may be most suitable for the PHQ-9.

In oncology settings, alternative approaches to depression assessment have been proposed (e.g., Cavanaugh, 1995; Endicott, 1984) in order to increase the specificity of depression screening measures and decrease the potential over-inclusivity of the criteria used by the DSM. The most widely recognized of these approaches are the substitutive criteria proposed by Endicott (1984; ESC), who recommended replacing the four somatic symptoms with four alternative symptoms: tearfulness or depressed appearance in face or body posture; social withdrawal or decreased talkativeness; brooding, self-pity or pessimism; and cannot be cheered up, doesn’t smile, no response to good news or funny situations. Although widely cited, there is a dearth of published research that has systematically evaluated this proposal.

Only one prior study has utilized IRT to compare the performance of traditional DSM criteria with the Endicott substitutive approach, using a structured clinical interview to rate each of the criteria under investigation. Akechi et al. (2009) examined the utility of the DSM-IV criteria for MDD, along with Endicott’s substitutive criteria and those proposed by Cavanaugh (1995), who recommended replacing the four DSM somatic items with two behavioral criteria: “not participating in medical treatment in spite of ability to do so” and “functioning at a lower level than medical condition warrants or failure to progress in recovery despite improved medical condition.” In a sample of 728 cancer patients diagnosed with depression (based on DSM-IV criteria), these authors found that Endicott’s and Cavanaugh’s criteria were among the symptoms with the most utility in assessing depression across the spectrum of severity. Endicott’s “tearfulness or depressed appearance” and “brooding, self-pity, or pessimism” were particularly good indicators of mild depression, while “not participating in medical care” (Cavanaugh) and “social withdrawal” (Endicott) were good indicators of moderate to severe depression. For patients with severe depression, Endicott’s “cannot be cheered up...” symptom was the most salient indicator. Although none of the DSM-IV criteria had a high ability to discriminate between individuals with more or less severe depression in this sample, this finding may have been affected by the study methodology, since only patients who met DSM criteria for MDD were included (thereby reducing the variability in the DSM-IV symptoms). Nevertheless, the authors suggested that the substitutive criteria proposed by Endicott and Cavanaugh are promising, given their apparent utility in discriminating depressive symptom severity. In addition to a restricted symptom range due to inclusion criteria, this study also relied on clinician interview, which is a costly and unrealistic approach to depression screening, particularly in busy oncology settings in which clinicians have neither the training nor the time to conduct psychiatric diagnostic interviews.

Despite the popularity of the PHQ-9, no studies to date have utilized IRT to examine it in patients with cancer, nor have these methods been extended to study Endicott’s substitutive criteria in a self-report format. Cancer and its treatment have unique disease sequelae and treatment side effects that are not necessarily as salient in other medical conditions such as heart disease or brain injury. While fatigue may be cross-cutting, symptoms such as appetite, concentration, and sleep disturbances are particularly salient in oncology (Akechi et al., 2003). Given the wide popularity of Endicott’s criteria and their development for specific use in oncology, the present study focused on the classic symptoms of MDD and Endicott’s criteria only; the alternative symptoms proposed by Cavanaugh were not included in the current study as they were developed for general medical settings, not specifically for use with cancer patients. While depression screening measures can identify general distress, dysphoria, and subsyndromal depression (in addition to MDD), the goal of the current study was to evaluate the DSM criteria for MDD (via the PHQ-9) and Endicott’s substitutive criteria as a first step towards further psychometric validation of the substitutive approach. The present study searched for the best-fitting measurement structure for the 13 items (nine DSM criteria plus four Endicott’s substitutive criteria items) using several IRT models. Differential item functioning (DIF) of the selected measurement structure was also tested across gender (males vs. females), age (40–69 years old vs. 70 or above), and racial groups (non-Hispanic White participants vs. ethnic minority participants). In addition, the precision and internal consistency of scale scores and the concurrent validity of score interpretations were evaluated.

Method

Participants and Procedures

Participants were recruited from outpatient clinics at Memorial Sloan Kettering Cancer Center (MSK) between January 2016 and May 2016. To be eligible for participation, patients had to be 40 years or older1, fluent in English, and have a cancer diagnosis. Patients were approached by trained research personnel while awaiting routine clinic appointments; those who were eligible were informed of the study procedures, risks and benefits, and invited to participate. The study was approved by the MSK and Fordham University Institutional Review Boards.

Measures

All participants completed a packet of questionnaires in a fixed order, including the Patient Health Questionnaire-9 (PHQ-9) and four items assessing the Endicott criteria. Table 1 presents the PHQ-9 items and Endicott’s substitutive criteria (ESC) items, along with the percentage endorsing each response option. As noted above, respondents were asked to rate how often they have been bothered by the symptoms described by the items over the last two weeks on a four-point scale (0=not at all, 1=several days, 2=more than half the days, 3=nearly every day). Endicott (1984) proposed four alternative symptoms (tearfulness or depressed appearance in face or body posture; social withdrawal or decreased talkativeness; brooding, self-pity or pessimism; and cannot be cheered up, doesn’t smile, no response to good news or funny situations) as substitutes for four DSM symptoms that are most commonly confounded by medical illness (sleep disturbance; fatigue; appetite changes; diminished concentration). These four items were assessed using the same instructions and response scale as PHQ-9 items.

Table 1.

Percentage (%) of Response Options of PHQ-9 and Endicott’s Substitutive Criteria Items

Percentage (%) Endorsing Response Option

| Abbreviated Item Label | Not at All (0) | Several Days (1) | More than Half the Days (2) | Nearly Every Day (3) |
|---|---|---|---|---|
| Patient Health Questionnaire-9 Item (PHQ-9) | | | | |
| 1. Anhedonia | 64.0 | 21.5 | 9.1 | 5.4 |
| 2. Depressed mood | 63.8 | 26.0 | 6.4 | 3.8 |
| 3. Sleep disturbances | 44.8 | 28.7 | 14.0 | 12.5 |
| 4. Fatigue | 32.4 | 37.5 | 15.4 | 14.7 |
| 5. Appetite changes | 58.6 | 22.6 | 10.2 | 8.6 |
| 6. Feeling bad about yourself | 77.8 | 14.9 | 4.8 | 2.5 |
| 7. Trouble concentrating | 65.1 | 23.7 | 6.4 | 4.8 |
| 8. Psychomotor agitation and retardation | 80.3 | 10.9 | 5.9 | 2.9 |
| 9. Suicidal ideation | 92.6 | 6.3 | 0.9 | 0.2 |
| Endicott’s Substitutive Criteria | | | | |
| 1. Socially withdrawn | 75.1 | 16.5 | 5.0 | 3.4 |
| 2. Tearfulness | 78.1 | 14.7 | 5.0 | 2.2 |
| 3. Brooding | 71.1 | 21.0 | 4.7 | 3.2 |
| 4. Could not be cheered up | 82.8 | 12.2 | 3.9 | 1.1 |

Participants were also administered the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977), a self-report measure of 20 depressive symptoms. Past research indicates acceptable psychometric properties and has supported a four-factor structure: depressed affect, positive affect, somatic complaints, and interpersonal problems (Nelson, Cho, Berk, Holland, & Roth, 2010; Saracino, Cham, Rosenfeld, & Nelson, 2018; Vodermaier et al., 2009). The CES-D was used to examine the concurrent validity of PHQ-9 and ESC item scores; it was not included in IRT analyses as the primary focus was on approximating DSM criteria for MDD, which are more directly assessed by the PHQ-9. Sociodemographic and medical data were also collected by participant self-report.

Data Analyses

Missing Data Analysis.

A total of 663 patients completed the study questionnaires. Missing data rates for the PHQ-9 and ESC items were low (mean = 7.2%, range: 6.5% to 7.7%). The differences between the sample with complete data (N = 558) and those with missing observations were small in effect size (all Cohen’s d < .29 and w < .15; Cohen, 1988) across sociodemographic and medical data, indicating that listwise deletion was an appropriate way to handle cases with missing values.

IRT Analysis.

Following prior studies (e.g., Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Kendel et al., 2010; Lamoureux, Tee, Pesudovs, Pallant, Keeffe, & Rees, 2009; Pedersen et al., 2016; Williams et al., 2009), two polytomous Rasch models were used: the partial credit model (PCM; Masters, 1982) and the rating scale model (RSM; Andrich, 1978). Two polytomous non-Rasch models were also analyzed: the generalized partial credit model (GPCM; Muraki, 1992) and the graded response model (GRM; Samejima, 1969). Rasch models (PCM and RSM) use observed item response patterns to estimate a person’s ability (in this case, depression severity) and an item’s difficulty (the depression level that the item represents) on a continuous latent variable (depression). They model the probability of a given response as a logistic function of the difference between a person’s ability and the item’s difficulty (Andrich, 1978). With dichotomous data (e.g., Yes/No or Correct/Incorrect), the higher the person’s ability relative to the item difficulty, the more likely the person is to endorse the item. With polytomous data, Rasch models estimate response category threshold parameters. A category threshold is the point on the latent variable at which the probabilities of choosing either of two adjacent response options (e.g., “not at all” versus “several days”) are equal. The RSM is the simplest (most constrained) polytomous Rasch model; it assumes equal category thresholds across all items of a given scale and estimates a difficulty parameter for each item. The PCM is less constrained than the RSM, as it estimates separate category thresholds for each item. However, both models assume the same discrimination for all items (i.e., the degree to which an item differentiates people with different depression levels). In these two models, average or sum scores of the items can be used as the overall scale score.
The GPCM and GRM differ from the polytomous Rasch models in that they estimate a separate discrimination parameter for each item. Because the items can have different discriminating power in the GPCM and GRM, both models require specialized algorithms to compute the scale scores. Unlike the GPCM, the GRM estimates the probability of choosing a particular response category or above, but assumes that the item category thresholds are always ordered.
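The threshold definition above can be made concrete with a minimal sketch of the partial credit model's category probabilities. The function names and the example thresholds (1.0 and 3.0 logits) are hypothetical and chosen only for illustration; the parameterization shown is the textbook step-parameter form, which may differ from the internal parameterization of the software used in this study.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta: person location on the latent trait (here, depression severity).
    thresholds: step parameters for an item with len(thresholds) + 1 categories.
    """
    # Category k accumulates (theta - threshold_j) for j = 1..k; category 0 is 0.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def rsm_thresholds(item_location, shared_steps):
    """Rating scale model: one location per item plus steps shared by all items."""
    return [item_location + tau for tau in shared_steps]

# Hypothetical three-category item with thresholds at 1.0 and 3.0 logits.
probs_at_first_threshold = pcm_probabilities(theta=1.0, thresholds=[1.0, 3.0])
```

When theta equals the first threshold, the probabilities of categories 0 and 1 coincide, which is the sense in which a threshold marks the crossover between adjacent response options.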

Three model fit indices were used to select the best-fitting model(s): (1) the C2 goodness-of-fit test statistic (Cai & Monroe, 2014; Maydeu-Olivares & Joe, 2006), (2) the Akaike Information Criterion (AIC; smaller values indicate better model fit), and (3) the Bayesian Information Criterion (BIC; smaller values indicate better model fit). The unidimensional structure was first tested with PHQ-9 items only (termed PHQ-9-Original) and then with the four somatic PHQ-9 items (sleep disturbances, fatigue, appetite changes, trouble concentrating) replaced by the ESC items (termed PHQ-9-Substitutive). Both measurement structures were tested with the PCM, RSM, GPCM, and GRM. Based on the results of these analyses, the models were modified by collapsing the response options of the items and removing items that negatively impacted model fit (described in more detail below).

DIF Analysis.

After deciding on the optimal measurement structure for the IRT analysis, the simultaneous item bias test (SIBTEST; Shealy & Stout, 1993) was used to examine whether there was differential functioning of PHQ-9 and Endicott items across gender (males: n = 288 vs. females: n = 270), age (younger: 40–69 years old; n = 380 vs. older: 70 or above; n = 178), and racial groups (non-Hispanic White: n = 455 vs. ethnic minority participants: n = 103). Age 70 was used to bifurcate the sample because patients with cancer who are over 70 years old have been shown to experience significantly more medical comorbidity than those younger than 70 (Bluethmann, Mariotto, & Rowland, 2016). Both uniform DIF and non-uniform DIF were tested with one crossing point (Chalmers, 2018; Li & Stout, 1996). The SIBTEST estimates a standardized mean difference (β) capturing the group differences in correct response probabilities (β = 0 indicates no DIF) and provides a significance test to determine if β is significantly different from zero. β values between zero and .05 are considered small DIF, values between .05 and .1 are considered moderate DIF, and values of .1 or above are considered large DIF (Shealy & Stout, 1993). To avoid an inflated Type I error rate due to multiple testing of β for each item, Holm’s (1979) procedure was used to adjust p values (Kim & Oshima, 2013).
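The two decision rules in this paragraph, Holm's step-down adjustment and the β size categories, can be sketched as follows. This is an illustrative implementation, not the code used in the study; the function names are hypothetical, and two details are assumptions: the absolute value of β is classified (so that DIF favoring either group is sized the same way), and a value of exactly .05 is treated as moderate.

```python
def holm_adjust(p_values):
    """Holm (1979) step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    # Visit p values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

def classify_dif(beta):
    """Size label for a SIBTEST beta using the cutoffs quoted in the text."""
    size = abs(beta)  # assumption: direction of DIF is ignored when sizing
    if size < 0.05:
        return "small"
    if size < 0.10:
        return "moderate"
    return "large"

# Hypothetical p values for three items, for illustration only.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

An item would be flagged only if its adjusted p value fell below the chosen significance level; the size label then describes how large the flagged difference is.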

Validity Analysis.

The proportion of participants who obtained the lowest possible scale score on the PHQ-9-Original and on the selected substitutive measurement structure was calculated. It was expected that there would be a higher proportion of patients with a scale score of zero in the selected substitutive structure. To examine the convergent and discriminant validity of the selected substitutive structure and compare the relative differences between the selected substitutive structure and PHQ-9-Original, we calculated the correlations between the scale scores of the selected substitutive structure, PHQ-9-Original, and the CES-D total score and factors (depressed affect, positive affect, somatic complaints, and interpersonal problems). It was expected that there would be larger correlations between the selected substitutive structure and the CES-D depressed affect factor and total scores, because the depressed affect factor is most closely aligned with the affective DSM criteria. Finally, participants who reported receiving treatment for depression and those who did not were compared on the scale scores of the selected substitutive structure and PHQ-9-Original. It was anticipated that the difference between the two groups would be larger on the selected substitutive structure than the PHQ-9-Original.

All IRT and DIF analyses (except for person separation reliability; described in more detail below) were conducted using the R mirt package (version 1.29; Chalmers, 2012). Person separation reliability indices were calculated using the R eRm package (version 0.16–1; Mair & Hatzinger, 2007).

Results

Participant Characteristics

The sample (N = 558) was approximately evenly split by gender (51.6% male; n = 288) and ranged in age from 40 to 90 years or older2 (M = 64.7, SD = 10.3; see Table 2). Most participants were white (87.6%; n = 489; including n = 455 non-Hispanic and n = 34 Hispanic), married or living with a partner (70.6%; n = 394) and had a college and/or graduate education (70.4%; n = 393). The most common cancer diagnoses were gynecological (16.8%; n = 94), lung (15.2%; n = 85), and prostate (13.1%; n = 73). Over one third of participants reported stage IV disease (37.5%; n = 209). The majority of participants had received active cancer treatment within the preceding six months (71.3%; n = 398).

Table 2.

Demographic Characteristics

| Demographic | Category | Frequency | % |
|---|---|---|---|
| Gender | Male | 288 | 51.6 |
| | Female | 270 | 48.4 |
| Race | White | 489 | 87.6 |
| | African-American | 29 | 5.2 |
| | Asian or Pacific Islander | 21 | 3.8 |
| | Other | 19 | 3.4 |
| Ethnicity | Hispanic | 48 | 8.6 |
| | Not Hispanic | 510 | 91.4 |
| Marital status | Single (never married) | 40 | 7.2 |
| | Married/living with partner | 394 | 70.6 |
| | Divorced/separated | 75 | 13.4 |
| | Widowed | 49 | 8.8 |
| Education | Did not graduate high school | 22 | 4.0 |
| | High school graduate/GED/some college | 142 | 25.5 |
| | College graduate | 168 | 30.1 |
| | Graduate degree/professional training | 225 | 40.4 |
| | Missing | 1 | 0.0 |
| Treatment status | Active treatment | 398 | 71.3 |
| | Off treatment | 138 | 24.7 |
| | Missing | 22 | 4.0 |
| Comorbidity | Present | 203 | 36.4 |
| | Absent | 353 | 63.3 |
| | Missing | 2 | 0.3 |
| Disease stage | In remission/not staged | 24 | 4.3 |
| | Stage 1 | 34 | 6.1 |
| | Stage 2 | 34 | 6.1 |
| | Stage 3 | 77 | 13.8 |
| | Stage 4 | 209 | 37.4 |
| | Missing | 180 | 32.3 |
| Primary cancer | Gynecological | 94 | 16.8 |
| | Lung | 85 | 15.2 |
| | Prostate | 73 | 13.1 |
| | Colon | 47 | 8.4 |
| Past depression treatment | Yes | 131 | 23.5 |
| | No | 427 | 76.5 |
| Current depression treatment | Yes | 90 | 16.1 |
| | No | 468 | 83.9 |

Initial Analysis of Unidimensionality

Confirmatory factor analysis (CFA) was conducted to test the unidimensionality of the PHQ-9-Original and PHQ-9-Substitutive. Models were estimated using polychoric correlations and diagonally weighted least squares estimation via the R lavaan package (Rosseel, 2012). A full report of the results can be found in the online supplementary materials. The comparative fit index (CFI) and Tucker-Lewis index (TLI) suggested good model fit for both PHQ-9-Original and PHQ-9-Substitutive (all values > .99); however, the PHQ-9-Original had a slightly worse RMSEA than the PHQ-9-Substitutive (.066 versus .028, respectively). Taken together, these model fit indices suggest that both PHQ-9-Original and PHQ-9-Substitutive were sufficiently unidimensional for IRT analysis.

IRT Analysis

All the IRT models (PCM, RSM, GPCM, GRM) converged properly in the PHQ-9-Original and PHQ-9-Substitutive measurement structures. Panels A and B in Table 3 present the global model fit results for the IRT models of the two structures. Compared to PHQ-9-Original, the PHQ-9-Substitutive structure had a better model fit in terms of AIC and BIC across all four IRT models. Therefore, the remaining analyses used only the PHQ-9-Substitutive structure. However, the PHQ-9-Substitutive structure generated a significant C2 test statistic (ps < .001) for all four models, indicating that none of the models fit the data well. Since the more complex GPCM and GRM did not fit better than the PCM and RSM, they were not considered further3.

Table 3.

Global Model Fit of the IRT Models

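| Model | C2 | df | p | AIC | BIC |
|---|---|---|---|---|---|
| (A) PHQ-9-Original | | | | | |
| PCM | 90.0 | 35 | < .001 | 7532.3 | 7653.4 |
| RSM | 142.3 | 51 | < .001 | 7573.1 | 7625.0 |
| GPCM | 69.5 | 27 | < .001 | 7472.4 | 7628.1 |
| GRM | 136.7 | 27 | < .001 | 7444.7 | 7600.4 |
| (B) PHQ-9-Substitutive | | | | | |
| PCM | 95.8 | 35 | < .001 | 5525.0 | 5646.1 |
| RSM | 134.3 | 51 | < .001 | 5538.7 | 5590.6 |
| GPCM | 82.3 | 27 | < .001 | 5468.2 | 5623.9 |
| GRM | 87.8 | 27 | < .001 | 5442.9 | 5598.6 |
| (C) PHQ-8-Substitutive | | | | | |
| PCM | 46.7 | 27 | .01 | 4863.6 | 4971.7 |
| RSM | 53.1 | 41 | .10 | 4863.8 | 4911.4 |
| (D) PHQ-8-Substitutive-Collapsed | | | | | |
| PCM | 28.7 | 27 | .37 | 3956.9 | 4030.4 |
| RSM | 37.9 | 34 | .30 | 3951.7 | 3994.9 |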

Note. PHQ-9-Original is a unidimensional structure with PHQ-9 items only. PHQ-9-Substitutive is a unidimensional structure with the four PHQ-9 items (items 3, 4, 5, 7) substituted by the Endicott items. PHQ-8-Substitutive removes PHQ-9 item 8 from PHQ-9-Substitutive. PHQ-8-Substitutive-Collapsed combines response options 1 (several days) and 2 (more than half the days) from PHQ-8-Substitutive. PCM is partial credit model, RSM is rating scale model, GPCM is generalized partial credit model, GRM is graded response model. AIC is Akaike Information Criterion. BIC is Bayesian Information Criterion.
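As a reading aid for Table 3, the sketch below shows how AIC and BIC relate to one another. It assumes the standard definitions, AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, with k free parameters and n = 558 respondents; the source does not report k or the log-likelihoods, so the function name and the use of n = 558 in the BIC are assumptions. Under those definitions the gap BIC − AIC equals k (ln(n) − 2), which lets the implied number of parameters be recovered from the tabled values.

```python
import math

def implied_parameter_count(aic, bic, n):
    """Solve BIC - AIC = k * (ln(n) - 2) for k, the number of free parameters."""
    return (bic - aic) / (math.log(n) - 2.0)

N = 558  # assumption: the complete-data sample size enters the BIC

# (AIC, BIC) pairs copied from Table 3.
table3 = {
    ("PHQ-9-Original", "PCM"): (7532.3, 7653.4),
    ("PHQ-9-Original", "RSM"): (7573.1, 7625.0),
    ("PHQ-9-Original", "GPCM"): (7472.4, 7628.1),
    ("PHQ-8-Substitutive-Collapsed", "PCM"): (3956.9, 4030.4),
    ("PHQ-8-Substitutive-Collapsed", "RSM"): (3951.7, 3994.9),
}

implied = {
    key: implied_parameter_count(aic, bic, N)
    for key, (aic, bic) in table3.items()
}
```

Under these assumptions the implied counts come out very close to whole numbers (about 28, 12, and 36 for the PCM, RSM, and GPCM in Panel A, and about 17 and 10 for the PCM and RSM in Panel D). This illustrates why the BIC penalizes the more heavily parameterized PCM more than the RSM even when their AIC values are similar.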

Next, the item fit of the PCM and RSM was compared in the PHQ-9-Substitutive structure using (1) the S-X2 item fit test statistic (Kang & Chen, 2008; Orlando & Thissen, 2000) and (2) item infit (information-weighted mean square), where a value of 1.0 indicates perfect fit and values between 0.7 and 1.3 are considered acceptable fit (Wright & Linacre, 1994). Results showed that item 8 on the PHQ-9 (“moving or speaking so slowly that other people could have noticed or the opposite – being so fidgety or restless that you have been moving a lot more than usual”) was the only item that showed both significant S-X2 test statistics, PCM: S-X2(df = 22) = 39.63, p = .01; RSM: S-X2(df = 23) = 45.37, p = .004, and infit values beyond the acceptable range (PCM: 1.45; RSM: 1.57). Following Forkmann et al. (2013) and Kendel et al. (2010), this item was removed from the PHQ-9-Substitutive structure and the PCM and RSM were fit again to this new structure (termed PHQ-8-Substitutive)4.

In PHQ-8-Substitutive, the PCM had a significant C2 test statistic, C2(df = 27) = 46.7, p = .01, while the RSM did not, C2(df = 41) = 53.1, p = .10. The AIC and BIC estimates for these models were smaller than those generated by the PHQ-9-Substitutive, further supporting the PHQ-8-Substitutive structure. However, there were still poorly fitting items in both models, especially in the RSM. The threshold parameter estimates and item characteristic curves of the eight items of PHQ-8-Substitutive in the PCM (Panel A in Figure S1 in online supplementary materials)5 and RSM were evaluated. Across items, 72.3% of the 95% confidence intervals of the threshold parameters for response 1 (several days) and 2 (more than half the days) overlapped in the PCM. Consistent with previous studies (Caplan et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Lamoureux et al., 2009; Pedersen et al., 2016), these two response options were collapsed. This structure was termed PHQ-8-Substitutive-Collapsed.
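The collapsing step amounts to a simple recode of the four-point scale into three categories. The sketch below is illustrative only (the names are hypothetical, and this is not the authors' code); it applies the recode to the anhedonia percentages from Table 1 and reproduces the three-option percentages that Table 5 reports for that item.

```python
# Original options: 0 = not at all, 1 = several days,
# 2 = more than half the days, 3 = nearly every day.
# The two middle options are merged into a single category.
COLLAPSE_MAP = {0: 0, 1: 1, 2: 1, 3: 2}

def collapse_response(score):
    """Recode one four-option response to the three-option format."""
    return COLLAPSE_MAP[score]

def collapse_percentages(four_option_pcts):
    """Combine endorsement percentages according to the same recode."""
    collapsed = [0.0, 0.0, 0.0]
    for original, pct in enumerate(four_option_pcts):
        collapsed[COLLAPSE_MAP[original]] += pct
    return [round(p, 1) for p in collapsed]

# Anhedonia percentages from Table 1.
anhedonia_collapsed = collapse_percentages([64.0, 21.5, 9.1, 5.4])
```

With eight items scored 0 to 2, a sum score on the collapsed scale would range from 0 to 16.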

The PCM and RSM were fit to the PHQ-8-Substitutive-Collapsed structure. Both models fit the data well, generating non-significant C2 test statistics, PCM: C2(df = 27) = 28.7, p = .37; RSM: C2(df = 34) = 37.9, p = .30, and lower AIC and BIC than those in PHQ-8-Substitutive (Panel D in Table 3). In the PCM, all items showed non-significant S-X2 test statistics and acceptable infit (Table 4). The item characteristic curves showed little overlap in the response options across items (Panel B in Figure S1 in online supplementary materials). In the RSM, two items had poor fit: the PHQ-9 depressed mood item, S-X2(df = 9) = 18.88, p = .03, infit = 0.66, and the Endicott “could not be cheered up” item, S-X2(df = 6) = 13.58, p = .03, infit = 0.78. Given these findings, the PCM was determined to be the best-fitting model in the PHQ-8-Substitutive-Collapsed structure.

Table 4.

Item Fit of Partial Credit Model and Rating Scale Model in PHQ-8-Substitutive-Collapsed Structure

| Item | PCM S-X2 | PCM df | PCM p | PCM Infit | PCM z (Infit) | RSM S-X2 | RSM df | RSM p | RSM Infit | RSM z (Infit) |
|---|---|---|---|---|---|---|---|---|---|---|
| Anhedonia | 12.59 | 10 | .25 | 0.83 | −2.25 | 12.82 | 10 | .23 | 0.83 | −2.33 |
| Depressed mood | 14.92 | 10 | .13 | 0.70 | −4.30 | 18.88 | 9 | .03 | 0.66 | −5.04 |
| Feeling bad about yourself | 7.43 | 8 | .49 | 1.08 | 0.97 | 6.47 | 6 | .37 | 1.11 | 1.31 |
| Suicidal ideation | 13.63 | 8 | .09 | 0.97 | −0.22 | 14.00 | 7 | .05 | 0.97 | −0.19 |
| Socially withdrawn | 10.43 | 10 | .40 | 0.91 | −1.07 | 11.95 | 9 | .22 | 0.96 | −0.49 |
| Tearfulness | 3.62 | 7 | .82 | 0.90 | −1.18 | 3.59 | 6 | .73 | 0.92 | −0.98 |
| Brooding | 10.35 | 10 | .41 | 0.85 | −1.90 | 1.07 | 9 | .34 | 0.84 | −2.03 |
| Could not be cheered up | 13.39 | 7 | .06 | 0.78 | −2.59 | 13.58 | 6 | .03 | 0.78 | −2.60 |

Note. Anhedonia, depressed mood, feeling bad about yourself, and suicidal ideation are PHQ-9 items. Socially withdrawn, tearfulness, brooding, and could not be cheered up are Endicott items.

Table 5 shows the percentages of each response option for the items, along with the threshold parameter and standard error estimates from the PCM analysis of the PHQ-8-Substitutive-Collapsed data. The PHQ-9 suicidal ideation item had the largest threshold parameter estimates, reflecting the smallest percentages of response options 1 and 2 (several/most days and nearly every day, respectively). The PHQ-9 anhedonia and depressed mood items had the lowest threshold parameter estimates, reflecting relatively higher percentages of response options 1 and 2. Overall, the estimates for the threshold between response options 1 and 2 (Threshold 2 in Table 5) were very high across items (> 3), reflecting low endorsement of the response option indicating the greatest depression severity.

Table 5.

Percentage (%) of Response Options and Parameter Estimates of the Partial Credit Model in PHQ-8-Substitutive-Collapsed Structure

| Item | Response Option 0 (%) | Response Option 1 (%) | Response Option 2 (%) | Threshold 1 (Response 0 & 1) | Threshold 2 (Response 1 & 2) | Corrected Item-Total Correlation |
|---|---|---|---|---|---|---|
| Anhedonia | 64.0 | 30.6 | 5.4 | 1.13 (.35) | 4.85 (.27) | .72 |
| Depressed mood | 63.8 | 32.4 | 3.8 | 1.10 (.37) | 5.39 (.29) | .77 |
| Feeling bad about yourself | 77.8 | 19.7 | 2.5 | 2.46 (.41) | 5.76 (.34) | .61 |
| Suicidal ideation | 92.7 | 7.2 | 0.2 | 4.56 (1.06) | 8.21 (1.03) | .51 |
| Socially withdrawn | 75.1 | 21.5 | 3.4 | 2.18 (.39) | 5.38 (.31) | .69 |
| Tearfulness | 78.1 | 19.7 | 2.2 | 2.49 (.43) | 5.95 (.36) | .68 |
| Brooding | 71.1 | 25.6 | 3.2 | 1.78 (.38) | 5.52 (.31) | .72 |
| Could not be cheered up | 82.8 | 16.1 | 1.1 | 3.02 (.52) | 6.70 (.47) | .70 |

Note. Standard errors are in parentheses. Anhedonia, depressed mood, feeling bad about yourself, and suicidal ideation are PHQ-9 items. Socially withdrawn, tearfulness, brooding, and could not be cheered up are Endicott items.

To further understand the psychometric properties of this model, the person separation reliability (PSR) was calculated. PSR is based on the replicability of the ordering of persons along the latent trait and is conceptually equivalent to Cronbach’s α (Andrich, 1982; de Ayala, 2009; Wright & Masters, 1982). The model had reliability of .81, which is considered acceptable (Nunnally, 1978). Corrected item-total correlations ranged between .51 and .77, reflecting that all items were highly correlated with the rest of the scale6.
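Because person separation reliability is described here as conceptually equivalent to Cronbach’s α, a minimal sketch of α may help readers less familiar with either index. This is not the PSR computation performed in the eRm package; it is the classical α formula applied to a hypothetical respondents-by-items score matrix, and the function name and toy data are illustrative assumptions.

```python
import statistics

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(rows[0])
    # Variance of each item across respondents.
    item_variances = [
        statistics.variance([row[j] for row in rows]) for j in range(n_items)
    ]
    # Variance of the sum score across respondents.
    total_variance = statistics.variance([sum(row) for row in rows])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses from three people to two perfectly consistent items.
toy_alpha = cronbach_alpha([[0, 0], [1, 1], [2, 2]])
```

Values closer to 1 indicate that respondents are ordered more consistently across items, which is the same idea that PSR expresses on the latent trait scale.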

DIF Analysis

The standardized mean difference β estimates and the significance test results for uniform and non-uniform DIF detection in the items comprising the PHQ-8-Substitutive-Collapsed structure across gender, age, and racial groups are presented in Table S4 of the online supplementary materials. No items showed significant DIF using Holm’s adjusted p values, which account for multiple comparisons.
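For readers unfamiliar with Holm's (1979) procedure, the following sketch shows how adjusted p values can be obtained with the step-down rule. The input p values are hypothetical and are not the DIF results reported in Table S4; the function name `holm_adjust` is illustrative.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(pvals)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p values for four tests (not study data).
raw = [0.01, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
print([round(p, 4) for p in adj])
```

An item would be flagged only if its adjusted p value fell below the chosen significance level.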

Validity Analysis

A frequency analysis showed that for the PHQ-9-Original, 117 of the 558 participants (21%) obtained the lowest possible scale score of 0, whereas on the PHQ-8-Substitutive-Collapsed, 267 individuals (47.8%) obtained a scale score of 0. The discrepancy in scores between these two versions of the PHQ supports the hypothesis that the somatic items potentially inflate the scores on the PHQ-9-Original and may overestimate depression severity.

The PHQ-8-Substitutive-Collapsed scale scores had a correlation of .81 with the CES-D depressed affect factor, r = .33 with the positive affect factor, r = .74 with the somatic complaints factor, r = .42 with the interpersonal problems factor, and r = .81 with the CES-D total score (see Footnote 7). Conversely, the PHQ-9-Original scale scores had a correlation of .74 with the CES-D depressed affect factor, r = .32 with the positive affect factor, r = .84 with the somatic complaints factor, r = .37 with the interpersonal problems factor, and r = .82 with the CES-D total score. The PHQ-8-Substitutive-Collapsed and PHQ-9-Original scale scores were also highly correlated (r = .86).

Scale scores of the PHQ-8-Substitutive-Collapsed and PHQ-9-Original were also compared between participants who reported receiving treatment for depression and those who did not. The Cohen’s d for the PHQ-8-Substitutive-Collapsed (d = .70) was larger than that of the PHQ-9-Original (d = .60), suggesting that substituting the somatic items with the ESC items provided a more accurate reflection of depression “caseness” in oncology settings.
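The group comparison above can be sketched as a standardized mean difference. The example below uses the common pooled-standard-deviation form of Cohen's d; the manuscript does not state which variant was computed, so this choice is an assumption, and the scores are hypothetical.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical scale scores (not study data).
treated = [4, 6, 5, 7, 3]      # participants reporting depression treatment
untreated = [2, 3, 1, 4, 2]    # participants reporting no treatment
d = cohens_d(treated, untreated)
print(round(d, 3))
```

A larger d for one scale than another, computed on the same two groups, indicates that the scale separates the groups by more pooled standard deviation units.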

Discussion

This is the first study to use IRT to analyze the PHQ-9 in an oncology setting, and to examine whether the Endicott substitutive criteria (ESC) items improve the detection of depression. The preliminary CFA results support the unidimensional IRT analysis for the original PHQ-9 and the revised structure, with somatic PHQ-9 items replaced by the ESC items in patients with cancer. The IRT analyses indicated that replacing the somatic items in the PHQ-9 with ESC items resulted in a large improvement in model fit relative to the original PHQ-9. These analyses supported the often discussed (but rarely investigated) recommendation to replace the somatic depression items with ESC items. Although preliminary, these results indicate that the substitutive items generate a potentially psychometrically superior measure of depression. Overall, an 8-item version of the PHQ that substitutes somatic items with Endicott items performed the best, and a three-option response format (PHQ-8-Substitutive-Collapsed) generated the best model fit.

Among the competing Rasch (i.e., RSM, PCM) and non-Rasch (i.e., GPCM, GRM) polytomous IRT models, the PCM was the best fitting model for the final PHQ-8-Substitutive-Collapsed structure. This indicates that the individual scale items differ in their thresholds; that is, some items (e.g., suicidal ideation) require higher levels of depression to choose a higher response option than others do. However, all items have similar levels of discriminating power (i.e., the degree to which an item differentiates people with different depression levels), which allows the use of average/sum scores for evaluation purposes in applied settings. If a non-Rasch model had provided significantly better fit, then sum scores would be sub-optimal (and potentially less reliable) estimates.

In terms of overall item endorsement, only 21% of participants obtained a total score of zero on the original PHQ-9, whereas over 45% of participants obtained a score of zero when the ESC items were used. This observation suggests that, as has been a frequent concern, the somatic symptoms of depression are more likely to be endorsed by patients with cancer. In addition to improved model fit over the traditional PHQ-9, the ESC items on the PHQ-8-Substitutive-Collapsed demonstrated high item-total correlations, indicating strong internal consistency. Support for utilizing the Endicott criteria in lieu of the traditional somatic items is consistent with the findings of Akechi et al. (2009), among a large sample of depressed patients with cancer who completed semi-structured diagnostic interviews. They reported that the ESC items “social withdrawal or decreased talkativeness” and “cannot be cheered up, doesn’t smile, no response to good news or funny situations” had moderate difficulty and high discrimination parameters, suggesting the potential utility of these ESC items as markers of moderate to severe depression in oncology settings. The present study extends the findings of Akechi et al. (2009) to a more heterogeneous sample of cancer patients, increasing the generalizability of these findings.

This study is also the first to examine the psychometric performance of the Endicott items when administered in a self-report format. The results suggest that the four items can be reliably and meaningfully administered in this format. Relying on the original PHQ-9, however, may increase the risk of over-inclusivity when screening for depression. This is problematic in that it can deplete already limited mental health resources and possibly subject patients without clinically significant depressive symptoms to unnecessary treatments with their own side effect profiles. On the other hand, the lower mean scores observed for the PHQ-8-Substitutive-Collapsed indicate that further research may be necessary to determine the optimal algorithm or cut-score for identifying patients with clinically significant depression.

The data also revealed large infit statistics for the psychomotor retardation and agitation item in the PHQ-9-Substitutive model, and removal of this item improved model fit. Previous IRT evaluations of the PHQ-9 have also found this item to detract from model fit in samples of community-dwelling older adults (Forkmann et al., 2013) and patients undergoing coronary artery bypass graft surgery (Kendel et al., 2010). It is unclear whether the wording of this item, which measures both hyper- and hypoactivity, is problematic, or whether psychomotor changes are simply not reliably captured in self-report format, as individuals may not be able to readily observe these changes in themselves (i.e., compared to a clinician-rated evaluation of this symptom). It is also possible that this item is more appropriately categorized as somatic, and it would therefore not be expected to perform as well as the other exclusively cognitive and affective items on the revised PHQ-9.

The two gateway symptoms, anhedonia and depressed mood, were relatively easier (i.e., were endorsed at lower levels of depression compared to other items). This observation lends support for the approach to MDD diagnosis utilized by the DSM, which requires at least one of these two symptoms to be present for the diagnosis. It suggests that even in oncology, these two symptoms likely represent important “entry criteria” for identifying patients who are experiencing genuine depressive syndromes. In contrast, suicidal ideation was the most difficult item. This finding is hardly surprising, as suicidal ideation, although infrequently endorsed, is a key indicator of the highest levels of depressive symptoms. It also suggests that although some increased preoccupation with death or dying might be expected in the context of a life-threatening illness like cancer, suicidal ideation still remains an important indicator of depression and should not be normalized without further evaluation.

Similar to previous investigations of the PHQ-9 among medical patients (Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Lamoureux et al., 2009; Pedersen et al., 2016), a collapsed range of response options, with several days and more than half the days combined into a single option, was the most suitable scoring approach. The presence of two intermediate response options appears problematic, particularly when the PHQ-9 sum score is used as an indicator of change, since clinically important changes may be obscured if patients do not perceive a meaningful difference between these two response options. Future research with this collapsed set of response options is needed in order to determine optimal clinical cutoffs for identifying levels of depression severity.
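The collapsed scoring described above can be sketched as a simple recoding step applied before summing. The sketch assumes the standard 0 to 3 PHQ response coding (not at all, several days, more than half the days, nearly every day) and merges the two middle options; the names `COLLAPSE` and `collapsed_sum_score` and the response pattern are illustrative.

```python
# Map original 0-3 response codes to the collapsed 0-2 codes:
# "several days" (1) and "more than half the days" (2) share one category.
COLLAPSE = {0: 0, 1: 1, 2: 1, 3: 2}

def collapse_responses(responses):
    """Recode a list of original 0-3 item responses to 0-2."""
    return [COLLAPSE[r] for r in responses]

def collapsed_sum_score(responses):
    """Sum score on the collapsed response scale."""
    return sum(collapse_responses(responses))

# Hypothetical responses to the 8 items, original coding (not study data).
original = [0, 1, 2, 3, 0, 2, 1, 0]
recoded = collapse_responses(original)
score = collapsed_sum_score(original)
print(recoded, score)
```

With eight items scored 0 to 2, collapsed sum scores range from 0 to 16, so cutoffs developed for the original 0 to 27 PHQ-9 range would not transfer directly.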

There was no differential item functioning identified for gender, age, or racial groups on the proposed PHQ-8-Substitutive-Collapsed, suggesting that the construct is similarly understood across these groups and that total scores and individual item endorsement can therefore be interpreted and compared across groups. Studies of the original PHQ-9 in other medical groups have occasionally observed DIF for gender, with women more readily endorsing depressed mood than men (Kendel et al., 2010; Pedersen et al., 2016). It is possible that cancer and its side effects diminish the potential gender differences typically observed on this item in other populations, but future research should examine this further.

Limitations and Suggestions for Future Research

Despite the contributions of the current study in elucidating an improved approach to depression assessment in oncology settings, several limitations warrant note. First, the sample, while diverse in terms of age, cancer type, and stage, was relatively homogeneous in terms of race, ethnicity, and education. The study deliberately recruited a heterogeneous sample of patients with cancer (both by disease and treatment status) to obtain a broad understanding of depressive symptom presentation and item endorsement in oncology outpatients. However, there are potentially important disease-specific and treatment-specific sequelae that may contribute to the relative prominence of affective, cognitive, and somatic items that could not be disentangled in the current study. Similarly, the number of individuals who were approached about participating in the study but declined was not evaluated, and thus conclusions about potential selection bias cannot be drawn. Participants were also healthy enough to receive ambulatory care, while those who were more critically ill are not represented in this sample. Given the limited sociodemographic diversity of the sample, it is unclear if the observed results would be maintained in a more diverse sample. This is important given that previous research has supported different manifestations of depression among population subgroups (e.g., Kalibatseva, Leong, & Ham, 2014). For example, one study utilized IRT to compare depressive symptom endorsement between Asian and European American community-dwelling adults (Kalibatseva et al., 2014). That study detected DIF for nearly one quarter of depression items, with European Americans reporting higher levels of affective symptoms but the same level of somatic symptoms. Thus, the results of this study must be interpreted in light of this limitation, as item utility might differ depending on the subgroup. Future studies should include more racial, ethnic, and socioeconomic diversity, as well as examination of potential differences in item utility depending on disease severity (e.g., inpatient palliative care or hospice, survivorship clinics, etc.).

The findings of the current study do not allow for a determination of classification accuracy, given the absence of a “gold standard” criterion measure such as an expert clinician interview. Therefore, while the findings provide tentative support for a revised version of the PHQ-9 using the ESC and condensed response options, further evaluation is needed, ideally using expert clinician diagnostic interviews to more fully evaluate these modifications. Similarly, because some ESC items included multiple constructs within a single item (e.g., “brooding, self-pity, or pessimism”), future research should separate these constructs in order to clarify which are the most salient and whether there are meaningful differences in the psychometric properties of each element that warrant separating compound items into unique items. Moreover, the current study focused on the substitutive criteria proposed by Endicott (1984) for cancer, but other substitutive approaches like that proposed by Cavanaugh (1995) might also warrant additional rigorous examination. Finally, the current study was cross-sectional and therefore the relationship between symptoms (i.e., both affective and somatic), antidepressant treatment, and symptom management/treatment response could not be determined. Repeated assessment of depressive symptoms over time would also allow for the determination of the reliability of the substitutive and somatic symptoms and their predictive validity in the cancer setting.

Conclusion

Notwithstanding these limitations, this study was the first to examine the performance of the PHQ-9 and Endicott’s substitutive criteria using IRT in a large sample of oncology outpatients. As expected, the model that included the somatic PHQ-9 items had poorer fit than a model that replaced these items with the four Endicott substitutive criteria. Likewise, a collapsed response scale further improved the overall model fit. With these modifications, there was no evidence of significant differential item functioning across gender, age, or racial groups. Taken together, this study provides strong preliminary support for utilizing the Endicott substitutive criteria when screening for depression in oncology settings. The potential impact of these findings on clinical practice is substantial, as the growing number of patients with cancer means an even higher burden on already limited mental health resources in these settings. Developing a more precise method (i.e., maximizing both sensitivity and specificity) of identifying patients who are experiencing genuine depressive symptoms above and beyond the somatic symptoms of their illness and treatment will decrease unnecessary triage for additional psychiatric evaluation. Moreover, it lessens the likelihood of patients being unnecessarily prescribed psychotropic medications in the absence of genuinely severe depressive symptoms. Future replication studies and studies with more heterogeneous patient samples are needed to further determine the robustness of these findings.

Supplementary Material

Supplemental Material

Public Significance Statement:

Alternative approaches to assessing depression in patients with cancer may be more accurate than current approaches, which rely heavily on physical symptoms. An improved approach might eliminate physical symptoms and focus more on emotional symptoms.

Acknowledgments

This research was supported by funding from the National Institutes of Health [T32CA009461 and P30CA008748].

Footnotes

1

Age 40 was selected as the inclusion criterion cut-off in order to differentiate the sample from what the National Comprehensive Cancer Network (Coccia et al., 2018) operationalized as “Adolescent and Young Adult,” which refers to patients from 15 to 39 years of age. This age group was selected because the primary purpose was to examine depression assessment in adults.

2

Due to HIPAA protection, participants who were 90 years or older (n = 2) checked a box indicating they were in this age range.

3

GPCM and GRM results for all steps are available upon request.

4

Analysis of the PCM threshold parameter estimates and item response curves of this item in PHQ-9-Substitutive supported this decision (available upon request).

5

Results of the RSM in PHQ-8-Substitutive were consistent with those of the PCM (available upon request). Across items, 18.7% of the 95% confidence intervals of the threshold parameters for response options 1 and 2 overlapped.

6

We examined the degree to which the local independence assumption was violated in the PCM using the PHQ-8-Substitutive-Collapsed structure. In sum, the assumption was fulfilled and unidimensionality was supported. Detailed results are summarized in the online supplementary materials.

7

The PHQ-8-Substitutive-Collapsed maximum a posteriori (MAP) factor scores under PCM had similar correlations: r = .79 with the CES-D depressed affect factor, r = .33 with the positive affect factor, r = .75 with the somatic complaints factor, r = .38 with the interpersonal problems factor, and r = .80 with the CES-D total score.

Contributor Information

Rebecca M. Saracino, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center and Psychology Department, Fordham University.

Ezgi Aytürk, Psychology Department, Fordham University.

Heining Cham, Psychology Department, Fordham University.

Barry Rosenfeld, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center and Psychology Department, Fordham University.

Leah M. Feuerstahler, Psychology Department, Fordham University.

Christian J. Nelson, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center.

References

  1. Akechi T, Ietsugu T, Sukigara M, Okamura H, Nakano T, Akizuki N, ... & Uchitomi Y (2009). Symptom indicator of severity of depression in cancer patients: A comparison of the DSM-IV criteria with alternative diagnostic criteria. General Hospital Psychiatry, 31, 225–232. 10.1016/j.genhosppsych.2008.12.004
  2. Akechi T, Nakano T, Akizuki N, Okamura M, Sakuma K, Nakanishi T, ... & Uchitomi Y (2003). Somatic symptoms for diagnosing major depression in cancer patients. Psychosomatics, 44, 244–248. 10.1176/appi.psy.44.3.244
  3. Andrich D (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573. 10.1007/BF02293814
  4. Andrich D (1982). An index of person separation in latent trait theory, the traditional KR.20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9, 95–104. Retrieved from https://rasch.org/erp7.htm
  5. American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Association. 10.1176/appi.books.9780890425596
  6. Bentler PM (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. 10.1037/0033-2909.107.2.238
  7. Bluethmann SM, Mariotto AB, & Rowland J (2016). Anticipating the “Silver Tsunami”: prevalence trajectories and comorbidity burden among older cancer survivors in the United States. Cancer Epidemiology, Biomarkers & Prevention, 25, 1029–1036. 10.1158/1055-9965.EPI-16-0133
  8. Bock RD, & Aitkin M (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459. 10.1007/BF02293801
  9. Cai L, & Monroe S (2014). A new statistic for evaluating item response theory models for ordinal data (CRESST Report 839). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Retrieved from https://files.eric.ed.gov/fulltext/ED555726.pdf
  10. Cavanaugh SVA (1995). Depression in the medically ill: Critical issues in diagnostic assessment. Psychosomatics, 36, 48–59. 10.1016/S0033-3182(95)71707-8
  11. Chalmers RP (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
  12. Chalmers RP (2018). Improving the Crossing-SIBTEST statistic for detecting non-uniform DIF. Psychometrika, 83, 376–386. 10.1007/s11336-017-9583-8
  13. Coccia PF, Pappo AS, Beaupin L, Borges VF, Borinstein SC, Chugh R, … & Gubin, (2018). Adolescent and Young Adult Oncology, Version 2.2018, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network, 16, 66–97. 10.6004/jnccn.2018.0001
  14. Cohen J (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, NJ: Lawrence Erlbaum Associates.
  15. de Ayala RJ (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
  16. DiMatteo MR, Lepper HS, & Croghan TW (2000). Depression is a risk factor for noncompliance with medical treatment: meta-analysis of the effects of anxiety and depression on patient adherence. Archives of Internal Medicine, 160, 2101–2107. 10.1001/archinte.160.14.2101
  17. Dyer JR, Williams R, Bombardier CH, … & Fann JR (2016). Evaluating the psychometric properties of 3 depression measures in a sample of persons with traumatic brain injury and major depressive disorder. Journal of Head Trauma Rehabilitation, 31, 225–232. 10.1097/HTR.0000000000000177
  18. Endicott J (1984). Measurement of depression in patients with cancer. Cancer, 53, 2243–2248. 10.1002/cncr.1984.53.s10.2243
  19. Forkmann T, Gauggel S, Spangenberg L, Brähler E, & Glaesmer H (2013). Dimensional assessment of depressive severity in the elderly general population: Psychometric evaluation of the PHQ-9 using Rasch analysis. Journal of Affective Disorders, 148, 323–330. 10.1016/j.jad.2012.12.019
  20. Gothwal VK, Bagga DK, & Sumalini R (2014). Rasch validation of the PHQ-9 in people with visual impairment in South India. Journal of Affective Disorders, 167, 171–177. 10.1016/j.jad.2014.06.019
  21. Holland PW, & Wainer H (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.
  22. Holm S (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. Retrieved from https://www.jstor.org/stable/4615733
  23. Hu LT, & Bentler PM (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453. 10.1037/1082-989X.3.4.424
  24. Jones SM, Ludman EJ, McCorkle R, Reid R, Bowles EJA, Penfold R, & Wagner EH (2015). A differential item function analysis of somatic symptoms of depression in people with cancer. Journal of Affective Disorders, 170, 131–137. 10.1016/j.jad.2014.09.002
  25. Kalibatseva Z, Leong FTL, & Ham EH (2014). A symptom profile of depression among Asian Americans: Is there evidence for differential item functioning of depressive symptoms? Psychological Medicine, 44, 2567–2578. 10.1017/S0033291714000130
  26. Kang T, & Chen T (2008). Performance of the generalized S-X² item fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406. 10.1111/j.1745-3984.2008.00071.x
  27. Kendel F, Wirtz M, Dunkel A, Lehmkuhl E, Hetzer R, & Regitz-Zagrosek V (2010). Screening for depression: Rasch analysis of the dimensional structure of the PHQ-9 and the HADS-D. Journal of Affective Disorders, 122, 241–246. 10.1016/j.jad.2009.07.004
  28. Kim J, & Oshima TC (2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73, 458–470. 10.1177/0013164412467033
  29. Krebber AMH, Buffart LM, Kleijn G, Riepma IC, De Bree R, Leemans CR, … & Verdonck-de Leeuw IM (2014). Prevalence of depression in cancer patients: A meta-analysis of diagnostic interviews and self-report instruments. Psycho-Oncology, 23, 121–130. 10.1002/pon.3409
  30. Kroenke K, & Spitzer RL (2002). The PHQ-9: A new depression diagnostic and severity measure. Psychiatric Annals, 32, 509–515. 10.3928/0048-5713-20020901-06
  31. Lamoureux EL, Tee HW, Pesudovs K, Pallant JF, Keeffe JE, & Rees G (2009). Can clinicians use the PHQ-9 to assess depression in people with vision loss? Optometry and Vision Science, 86, 139–145. 10.1097/OPX.0b013e318194eb47
  32. Li H-H, & Stout W (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677. 10.1007/BF02294041
  33. Little RJ, & Rubin DB (2002). Statistical analysis with missing data (2nd ed). Hoboken, NJ: John Wiley & Sons.
  34. Mair P, & Hatzinger R (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20(9), 1–20. http://www.jstatsoft.org/v20/i09
  35. Masters GN (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174. 10.1007/BF02296272
  36. Maydeu-Olivares A, & Joe H (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713–732. 10.1007/s11336-005-1295-9
  37. Misono S, Weiss NS, Fann JR, Redman M, & Yueh B (2008). Incidence of suicide in persons with cancer. Journal of Clinical Oncology, 26, 4731–4738. 10.1200/JCO.2007.13.8941
  38. Muraki E (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16(2), 159–176. 10.1177/014662169201600206
  39. Nelson CJ, Cho C, Berk AR, Holland J, & Roth AJ (2010). Are gold standard depression measures appropriate for use in geriatric cancer patients? A systematic evaluation of self-report depression instruments used with geriatric, cancer, and geriatric cancer samples. Journal of Clinical Oncology, 28, 348–356. 10.1200/JCO.2009.23.0201
  40. Nunnally JC (1978). Psychometric theory (2nd ed). New York, NY: McGraw-Hill.
  41. Oakes D (1999). Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society, Series B, 61, 479–482. 10.1111/1467-9868.00188
  42. Orlando M, & Thissen D (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64. 10.1177/01466216000241003
  43. Pedersen SS, Mathiasen K, Christensen KB, & Makransky G (2016). Psychometric analysis of the Patient Health Questionnaire in Danish patients with an implantable cardioverter defibrillator (The DEFIB-WOMEN study). Journal of Psychosomatic Research, 90, 105–112. 10.1016/j.jpsychores.2016.09.010
  44. Radloff LS (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. 10.1177/014662167700100306
  45. Rosseel Y (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1–36. Retrieved from http://www.jstatsoft.org/v48/i02/
  46. Samejima F (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph Supplement, 17 (4, Pt. 2).
  47. Saracino RM, Cham H, Rosenfeld B, & Nelson C (2018). Confirmatory factor analysis of the Center for Epidemiologic Studies Depression scale in oncology with examination of invariance between younger and older patients. European Journal of Psychological Assessment. Advance online publication. 10.1027/1015-5759/a000510
  48. Saracino RM, Rosenfeld B, & Nelson CJ (2018). Performance of four diagnostic approaches to depression in adults with cancer. General Hospital Psychiatry, 51, 90–95. 10.1016/j.genhosppsych.2018.01.006
  49. Shealy R, & Stout W (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194. 10.1007/BF02294572
  50. Steiger JH (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.
  51. Vodermaier A, Linden W, & Siu C (2009). Screening for emotional distress in cancer patients: a systematic review of assessment instruments. Journal of the National Cancer Institute, 101, 1464–1488. 10.1093/jnci/djp336
  52. Williams RT, Heinemann AW, Bode RK, Wilson CS, Fann JR, & Tate DG (2009). Improving measurement properties of the Patient Health Questionnaire–9 with rating scale analysis. Rehabilitation Psychology, 54, 198–203. 10.1037/a0015529
  53. Wright BD, & Linacre JM (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
  54. Wright BD, & Masters GN (1982). Rating scale analysis. Chicago, IL: MESA Press.
  55. Yen WM (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. 10.1111/j.1745-3984.1993.tb00423.x
  56. Zigmond AS, & Snaith RP (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67, 361–370. 10.1111/j.1600-0447.1983.tb09716.x
