Abstract
Depression during pregnancy is linked to adverse perinatal and offspring outcomes. The Patient Health Questionnaire-9 (PHQ-9) has been validated for identifying depression in pregnant women in limited cultural contexts. Construct validity and reliability have been assessed in Lima, Peru, but criterion validity has not. This study aimed to comprehensively evaluate the PHQ-9 among pregnant Peruvian women in the Pregnancy Outcomes, Maternal and Infant Study (PrOMIS). Using Composite International Diagnostic Interview (CIDI) criteria for past-12-month major depressive disorder as the reference standard, sensitivity, specificity, and predictive values of the PHQ-9 for detecting depression were assessed at various cutpoints. Confirmatory factor analysis (CFA) was used to evaluate one- and two-factor structures for the PHQ-9. Cronbach’s alpha was computed for the entire PHQ-9 scale and for subscales supported by CFA. A cutpoint of ≥8 maximized combined sensitivity (61%) and specificity (62%). At this cutpoint, positive predictive value was low (15%) and negative predictive value was high (93%). Reliability for the full scale was high (α=0.80). Both one- and two-factor solutions were appropriate for this population, but a two-factor solution containing an affective/mood factor (α=0.67) and a somatic factor (α=0.75) was optimal (CFI=0.93, RMSEA=0.075). Among pregnant women in Lima, screening with the PHQ-9 can identify those in need of mental health care, but may identify a large number of false positive cases.
Keywords: antepartum depression, depression, maternal health, Patient Health Questionnaire-9 (PHQ-9), pregnancy, screening, validation
Introduction
Depression during pregnancy is common worldwide, especially in low- and middle-income countries (Dadi, Miller, Bisetegn, & Mwanri, 2020). Nevertheless, it receives less research attention than postpartum depression (Gelaye, Rondon, Araya, & Williams, 2016). Depression during pregnancy has been associated with pregnancy complications (Chung, Lau, Yip, Chiu, & Lee, 2001; Larsson, Sydsjö, & Josefsson, 2004), preterm birth (Venkatesh, Riley, Castro, Perlis, & Kaimal, 2016), and low birth weight (Nasreen et al., 2019). Its effects may even extend into childhood (Davalos, Yadon, & Tregellas, 2012; Hoffman & Hatch, 2000). Early detection of depression among pregnant women could allow for interventions to prevent adverse perinatal outcomes (Breedlove & Fryzelka, 2011; Schaffir, 2018), and to reduce the likelihood of postpartum depression (Batmaz, Dane, Sarioglu, Kayaoglu, & Dane, 2015; Gulseren et al., 2006; Ongeri et al., 2018). In Peru, intimate partner violence (Gomez-Beloz, Williams, Sanchez, & Lam, 2009; Perales, Cripe, Lam, Sanchez, & Williams, 2014), unplanned pregnancy (Cripe et al., 2008), and posttraumatic stress disorder (Levey et al., 2018) are prevalent and contribute to risk of depression during pregnancy. Women in Peru, as in many low- and middle-income countries, often have their first sustained engagement with the health care system during antenatal care, making this setting ideal for early detection of depression via screening. However, effective screening relies on validated, culturally-appropriate, and easy-to-use tools. In Peru, the Patient Health Questionnaire-9 (PHQ-9) (Kroenke, Spitzer, & Williams, 2001) is often employed when screening pregnant women for depression (Manea, Gilbody, & McMillan, 2015), despite most validation data coming from high-income countries (Gelaye et al., 2016).
As the PHQ-9 relies upon subjective and culturally-dependent symptoms, its validity can vary among different populations. Studies have evaluated the validity of the PHQ-9 among pregnant women in Côte d’Ivoire (Barthel et al., 2015), Ethiopia (Woldetensay et al., 2018), Ghana (Barthel et al., 2015), Pakistan (Gallis et al., 2018), and the United States (Sidebottom, Harrison, Godecker, & Kim, 2012). Construct validity (i.e., the degree to which a test measures what it purports to measure) has previously been assessed among pregnant women in Peru, with preliminary evidence supporting a two-factor structure that can be measured with high reliability (Zhong et al., 2014). However, criterion validity (i.e., the degree to which an estimate agrees with a reference standard) has yet to be assessed among pregnant women in Peru.
The PHQ-9 yields a continuous score, which must be dichotomized at a cutpoint to define “probable depression.” The choice of a cutpoint has clinical implications: if it is too low, mental health systems may be burdened by false positive cases; if it is too high, women at risk may not be flagged for follow-up care. A cutpoint of ≥10 is frequently used (Kroenke et al., 2001; Levis, Benedetti, & Thombs, 2019; Spitzer, Kroenke, Williams, & Patient Health Questionnaire Primary Care Study Group, 1999), based on early validation studies in the United States (Kroenke et al., 2001). However, in the few studies that have assessed criterion validity among pregnant women, ideal cutpoints ranged from ≥8 in Ethiopia (Woldetensay et al., 2018) to ≥10 in Pakistan (Gallis et al., 2018) and the United States (Sidebottom et al., 2012). A pooled meta-analysis found that ≥8 to ≥11 is an acceptable range (Manea, Gilbody, & McMillan, 2012), although pooling data spanning various populations within ten countries may have obscured between-population variation.
The PHQ-9 is frequently used to screen for depression in low- and middle-income countries despite gaps in understanding criterion validity. Thus, this study aimed to comprehensively evaluate criterion validity, construct validity, and reliability of the Spanish-language PHQ-9 for detecting depression during the first trimester of pregnancy among a large cohort of 5,440 participants in Lima, Peru. This is the first attempt to assess criterion validity among pregnant women in Peru. In addition, this study enhances prior assessments of construct validity by comparing two possible factor structures: a two-factor solution supported by a preliminary analysis of a subset of the present data (Zhong et al., 2014), and a one-factor solution frequently seen in other research settings (Boothroyd, Dagnan, & Muncer, 2019; Familiar et al., 2015; Huang, Chung, Kroenke, Delucchi, & Spitzer, 2006; Kocalevent, Hinz, & Brähler, 2013).
Method
Data were derived from the Pregnancy Outcomes, Maternal and Infant Study (PrOMIS), a study of pregnant women enrolled in prenatal care clinics at the Instituto Nacional Materno Perinatal (INMP) in Lima, Peru. The INMP is the primary referral hospital for maternal and perinatal care operated by the Peruvian Ministry of Health. Details of the study are described elsewhere (Zhong et al., 2015). Eligibility criteria included: attending the INMP for the first prenatal care visit between February 2012 and March 2014, being at 16 weeks or less of gestational age, being 18-49 years of age and speaking/understanding Spanish.
Interviews were administered in a private room using a structured tool that collected information on maternal socio-demographic factors, lifestyle, medical and reproductive history, abuse history, and the PHQ-9. A total of 5,440 women were interviewed in two phases, PrOMIS one (n=3,372) and PrOMIS two (n=2,068). For the present analysis, participants missing PHQ-9 score were excluded (n=41, <1%), leaving 5,399 participants.
Participants for the criterion validity analysis were a subset of randomly-selected PrOMIS one participants (42%, n=1,413). These participants were given a diagnostic interview within 15 days of the initial interview. Of the 1,413 selected women, 1,098 (78%) completed the diagnostic interview. A total of 315 women (22%) did not participate in the diagnostic interviews for the following reasons: 123 were not reached within the stipulated 14 days after screening; 96 were no longer eligible due to abortions, malformation, or twin pregnancies; 56 had a change of address or inaccurate contact information; and 40 refused to participate citing reasons such as lack of time.
Measures
Patient Health Questionnaire-9 (PHQ-9).
The PHQ-9 is a depression screening scale assessing nine symptoms: anhedonia, depressed mood, problems with sleep, fatigue or loss of energy, problems with appetite, guilt or worthlessness, diminished ability to think or concentrate, psychomotor agitation or retardation, and suicidal thoughts (Kroenke et al., 2001; Spitzer et al., 1999). Participants rated how often during the past two weeks they experienced each item: “not at all” (0), “several days” (1), “more than half of days” (2), or “nearly every day” (3), and a sum score was calculated (range=0-27). Original validation studies found a score of ≥10 was optimal for identifying probable cases of major depressive disorder (MDD) based on Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV) criteria (sensitivity=88%, specificity=88%) (Kroenke et al., 2001; Spitzer et al., 1999). In a multi-country meta-analysis pooling data from diverse countries, a cutpoint of ≥10 was associated with high sensitivity and specificity (>80%) (Levis et al., 2019).
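The scoring rule described above can be sketched in a few lines. This is a minimal illustration only: the function names and the example respondent are hypothetical and are not taken from the study's analysis code.

```python
def phq9_total(responses):
    """Sum nine PHQ-9 item responses (each coded 0-3) into a 0-27 total."""
    if len(responses) != 9:
        raise ValueError("PHQ-9 requires exactly nine item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each item must be coded 0, 1, 2, or 3")
    return sum(responses)


def screens_positive(responses, cutpoint=10):
    """Return True when the total score meets or exceeds the cutpoint."""
    return phq9_total(responses) >= cutpoint


# Hypothetical respondent, not study data.
example = [1, 1, 2, 2, 1, 0, 1, 0, 0]
print(phq9_total(example))           # 8
print(screens_positive(example))     # False at the conventional >=10
print(screens_positive(example, 8))  # True at >=8
```

The same respondent is classified differently at ≥8 and ≥10, which is why the choice of cutpoint matters for a given population.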
World Health Organization World Mental Health Composite International Diagnostic Interview (WMH-CIDI).
The WMH-CIDI (hereafter, CIDI) is a fully structured interview that can be administered by non-clinicians. It assesses major depressive disorder (MDD) and several other mental disorders based on International Classification of Diseases-10 (ICD-10) and DSM-IV criteria (Kessler & Ustun, 2004). Lifetime, past-12-month, and past-30 day diagnoses of MDD can be generated. The CIDI is widely-used across diverse countries to identify depression (Bromet et al., 2011).
Four licensed research psychologists received structured training on CIDI administration via a training course conducted by the Social Survey Institute at the University of Michigan (WHO Training Center). Training involved item-by-item descriptions of questionnaires and role-plays, and strict onsite supervision/support in the field. Questionnaire data were entered using Blaise version 4.6 (Statistics Netherlands), which contained the entire CIDI algorithm along with an automatic checking mechanism to identify item omissions and unusual responses.
Analysis
Criterion validity.
Sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, positive predictive value (PPV), and negative predictive value (NPV) of the PHQ-9 were calculated, along with corresponding 95% confidence intervals (CI). CIDI diagnosis of past-12-month MDD was used as the reference standard. A receiver operating characteristic (ROC) curve was plotted to identify the optimal balance of sensitivity and specificity. Area under the ROC curve (AUC) and its 95% CI were also calculated.
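For readers less familiar with these measures, the sketch below shows how each follows from a 2×2 table of screening result by reference-standard diagnosis. The function name is hypothetical. The example counts are derived from values reported later in Table 4 for the ≥8 cutpoint (109 women with MDD, of whom 43 were false negatives; 989 women without MDD, of whom 372 were false positives). Confidence intervals are omitted because the interval method used in the study is not stated.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Point estimates of screening accuracy from 2x2 cell counts."""
    sens = tp / (tp + fn)   # proportion of true cases that screen positive
    spec = tn / (tn + fp)   # proportion of non-cases that screen negative
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Cell counts at a cutpoint of >=8, derived from Table 4.
metrics = diagnostic_metrics(tp=109 - 43, fn=43, fp=372, tn=989 - 372)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

This sketch does not guard against division by zero, which would arise if specificity were exactly 1 or a row or column of the table were empty.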
Although the PHQ-9 assesses depressive symptoms in the past two weeks, past-12-month MDD was chosen as the reference standard because past-two-week MDD is not assessed by the CIDI, and very few individuals met criteria for past-30-day MDD. However, a sensitivity analysis was conducted, using past-30-day MDD as the reference standard. Results of this sensitivity analysis are presented in supplemental material.
Construct validity.
The factor structure of the PHQ-9 was evaluated with confirmatory factor analysis (CFA). In line with previous research, two possible structures were considered: A) a two-factor solution identified by Zhong et al. (2014), which used a subset of the data used in the present analysis; and B) a one-factor solution, which has been identified in diverse geographic/cultural settings (Boothroyd et al., 2019; Familiar et al., 2015; Huang et al., 2006; Kocalevent et al., 2013). Because response data were ordinal, and therefore not normally distributed, robust maximum likelihood estimation was used (Li, 2016). Model fit for each solution was evaluated using the comparative fit index (CFI), Tucker Lewis index (TLI), root mean square error of approximation (RMSEA) with 90% CI, and standardized root mean square residual (SRMR). To reduce bias due to nonnormality, the sample-corrected robust CFI, TLI, and RMSEA were reported (Brosseau-Liard, Savalei, & Li, 2012). The following criteria were used as evidence of reasonably good model fit: SRMR ≤ 0.08, CFI ≥ 0.95, TLI ≥ 0.95, RMSEA ≤ 0.06 (Brown, 2015; Hu & Bentler, 1999). Finally, nested one- and two-factor models were compared using the Satorra-Bentler scaled χ2 difference test (Satorra, 2000; Satorra & Bentler, 2010).
Reliability.
Cronbach’s alpha was computed as a measure of internal consistency for the entire PHQ-9 scale assuming unidimensionality, and for each of the subscales supported by CFA. Cronbach’s alpha if each item were deleted, as well as the correlation between each item with the total PHQ-9 score, were also computed.
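As an illustration of the internal-consistency statistic, the sketch below computes Cronbach's alpha from the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total score). The study computed alpha in SAS; this Python version, its function name, and the three-item toy data are assumptions for illustration only.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of k item scores per respondent (complete data)."""
    k = len(rows[0])
    items = list(zip(*rows))                      # regroup scores by item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(r) for r in rows])  # variance of sum scores
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses from five people to three items, not study data.
toy = [[0, 0, 1], [1, 1, 1], [2, 1, 2], [3, 3, 2], [1, 2, 1]]
print(round(cronbach_alpha(toy), 2))  # 0.85
```

Alpha "if item deleted" can be obtained by dropping one column at a time and calling the same function on the remaining items.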
Analyses other than CFA were conducted in SAS version 9.4. CFA was conducted in RStudio version 3.4.3. All procedures were approved by the institutional review boards of the INMP, Lima, Peru, and the Harvard School of Public Health Office of Human Research Administration, Boston, Massachusetts.
Results
Participant characteristics
Mean age was 28.1 years (sd=6.3 years) and most participants were 20-29 years old (Table 1). Mean gestational age was 10.0 weeks (sd=3.8 weeks). The majority were Mestizo (77%) and were married or living with a partner (82%). Nearly half (47%) were employed during pregnancy and nearly half (47%) reported difficulty affording basics (e.g., food). Nearly all participants (97%) had at least seven years of education and nearly half (47%) had more than 12 years. Compared to women who were not administered the CIDI, women who did receive the diagnostic interview had less education (41% versus 48% with more than 12 years), but did not differ on other characteristics (Table 1).
Table 1.
Variable | All (N=5,399): n | All: % | Diagnostic interview (CIDI) (N=1,098): n | Diagnostic interview (CIDI): % | PHQ-9 only (N=4,301): n | PHQ-9 only: % |
---|---|---|---|---|---|---|
Maternal age, years | ||||||
18-19 | 303 | 5.6 | 52 | 4.7 | 251 | 5.8 |
20-29 | 3002 | 56 | 617 | 56 | 2385 | 55 |
30-34 | 1134 | 21 | 231 | 21 | 903 | 21 |
>34 | 960 | 18 | 198 | 18 | 762 | 18 |
Maternal age, years: mean (sd) | 28.1 (6.3) | | 28.2 (6.2) | | 28.1 (6.3) | |
Education, years | ||||||
<7 | 172 | 3.2 | 50 | 4.6 | 122 | 2.8 |
7-12 | 2692 | 50 | 593 | 54 | 2099 | 49 |
>12 | 2518 | 47 | 453 | 41 | 2065 | 48 |
Mestizo | 4169 | 77 | 837 | 76 | 3332 | 78 |
Married/living with partner | 4397 | 82 | 879 | 80 | 3518 | 82 |
Employed during pregnancy | 2553 | 47 | 512 | 47 | 2041 | 47 |
Planned pregnancy | 2184 | 41 | 472 | 43 | 1712 | 40 |
Access to basics | ||||||
Hard | 2544 | 47 | 548 | 50 | 1996 | 47 |
Not very hard | 2845 | 53 | 550 | 50 | 2295 | 53 |
Gestational age at interview, weeks: mean (sd) | 10.0 (3.8) | | 9.5 (3.4) | | 10.1 (3.8) | |
Note: Percentages were calculated among those with data available (n=17 missing education; n=6 missing race; n=18 missing marital status; n=1 missing employment; n=25 missing planned pregnancy; n=10 missing access to the basics; n=25 missing gestational age).
Among 1,098 women administered the CIDI, 109 (9.9%) fulfilled diagnostic criteria for past-12-month MDD. Compared to women without MDD, women with MDD were younger (mean 27.2 years [sd=5.7 years] versus 28.3 years [sd=6.2 years]); less likely to be married or living with a partner (70% versus 82%); and more likely to report difficulty affording basics (68% versus 48%, Table 2).
Table 2.
Variable | MDD (screen positive on CIDI) (N=109): n | MDD: % | No MDD (screen negative on CIDI) (N=989): n | No MDD: % |
---|---|---|---|---|
Maternal age, years | ||||
18-19 | 5 | 4.6 | 47 | 4.8 |
20-29 | 70 | 64 | 547 | 55 |
30-34 | 19 | 17 | 212 | 21 |
>34 | 15 | 14 | 183 | 19 |
Maternal age, years: mean (sd) | 27.2 (5.7) | | 28.3 (6.2) | |
Education, years | ||||
<7 | 5 | 4.6 | 45 | 4.6 |
7-12 | 60 | 56 | 533 | 54 |
>12 | 43 | 40 | 410 | 42 |
Mestizo | 83 | 76 | 754 | 76 |
Married/living with partner | 75 | 70 | 804 | 82 |
Employed during pregnancy | 52 | 48 | 460 | 47 |
Planned pregnancy | 44 | 41 | 428 | 44 |
Access to basic foods | ||||
Hard | 74 | 68 | 474 | 48 |
Not very hard | 35 | 32 | 515 | 52 |
Gestational age at interview, weeks: mean (sd) | 9.9 (3.4) | | 9.4 (3.4) | |
Note: Percentages were calculated among those with data available (n=2 missing education; n=5 missing marital status; n=1 missing employment; n=6 missing planned pregnancy; n=8 missing gestational age).
Item endorsement
Among all participants, the most frequently-endorsed PHQ-9 items were fatigue or loss of energy, problems with appetite, and anhedonia, with approximately three quarters or more of participants (86%, 77%, and 74%, respectively) indicating that they had difficulty with each of these things at least several days in the past two weeks (Table 3). Suicidal thoughts was the least-frequently endorsed item: 12% had suicidal thoughts several days or more during the past two weeks, and 2.7% had suicidal thoughts more than half of days or nearly every day.
Table 3.
Symptom | % Not at all | % Several days | % More than half of days | % Nearly every day |
---|---|---|---|---|
1. Little interest or pleasure in doing things (anhedonia) | 26 | 44 | 13 | 17 |
2. Feeling down, depressed, or hopeless (depressed mood) | 38 | 41 | 10 | 11 |
3. Trouble falling or staying asleep, or sleeping too much (problems with sleep) | 43 | 35 | 9.6 | 13 |
4. Feeling tired or having little energy (fatigue or loss of energy) | 14 | 55 | 14 | 17 |
5. Poor appetite or overeating (problems with appetite) | 24 | 36 | 10 | 31 |
6. Feeling bad about yourself - or that you were a failure or have let yourself or your family down (guilt or worthlessness) | 72 | 20 | 3.7 | 3.5 |
7. Trouble concentrating on things, such as reading the newspaper or watching television (diminished ability to think or concentrate) | 68 | 23 | 3.9 | 4.6 |
8. Moving or speaking so slowly that other people could have noticed. Or the opposite - being so fidgety or restless that you have been moving around a lot more than usual (psychomotor agitation or retardation) | 68 | 20 | 5 | 7.3 |
9. Thoughts that you would be better off dead, or of hurting yourself (suicidal thoughts) | 88 | 9.6 | 1.6 | 1.1 |
Validity
Criterion validity.
Sensitivity (61%, 95% CI: 51%, 70%) and specificity (62%, 95% CI: 59%, 65%) were optimized at a cutpoint of ≥8 (Table 4). At this cutpoint, positive and negative likelihood ratios indicated that compared to women without MDD, women with MDD were 1.6 times as likely to have a PHQ-9 score ≥8 (95% CI: 1.4, 1.9) and 0.63 times as likely to have a PHQ-9 score <8 (95% CI: 0.50, 0.80). PPV was 15% (95% CI: 12%, 18%) and NPV was 93% (95% CI: 92%, 95%). The AUC for detecting MDD was 0.67 (95% CI: 0.62, 0.71) (Figure 1). At a cutpoint of ≥8, 438 participants (40%) were categorized as having probable depression, compared to 511 participants (47%) at a cutpoint of ≥7, and 373 participants (34%) at a cutpoint of ≥9. Consistent with the low PPV, approximately 85% of the participants who screened positive at ≥8 were false positive cases (372 of 438).
Table 4.
Cutpoint score | Prevalence at cutpoint, % | Sensitivity, % (95% CI) | N false negative | Specificity, % (95% CI) | N false positive | Positive LR (95% CI) | Negative LR (95% CI) | PPV, % (95% CI) | NPV, % (95% CI) |
---|---|---|---|---|---|---|---|---|---|
≥4 | 79 | 95 (91, 99) | 5 | 22 (20, 25) | 768 | 1.2 (1.2, 1.3) | 0.21 (0.087, 0.49) | 12 (9.8, 14) | 98 (96, 99.7) |
≥5 | 67 | 89 (83, 95) | 12 | 36 (33, 39) | 633 | 1.4 (1.3, 1.5) | 0.31 (0.18, 0.52) | 13 (11, 16) | 97 (95, 99) |
≥6 | 54 | 80 (72, 87) | 22 | 48 (45, 52) | 510 | 1.5 (1.4, 1.7) | 0.42 (0.29, 0.61) | 15 (12, 17) | 96 (94, 97) |
≥7 | 47 | 71 (62, 79) | 32 | 56 (53, 59) | 434 | 1.6 (1.4, 1.9) | 0.52 (0.39, 0.70) | 15 (12, 18) | 95 (93, 96) |
≥8 | 40 | 61 (51, 70) | 43 | 62 (59, 65) | 372 | 1.6 (1.4, 1.9) | 0.63 (0.50, 0.80) | 15 (12, 18) | 93 (92, 95) |
≥9 | 34 | 55 (46, 64) | 49 | 68 (65, 71) | 313 | 1.7 (1.4, 2.1) | 0.66 (0.53, 0.81) | 16 (12, 20) | 93 (91, 95) |
≥10 | 28 | 49 (39, 58) | 56 | 74 (71, 77) | 258 | 1.9 (1.5, 2.3) | 0.70 (0.58, 0.84) | 17 (13, 21) | 93 (91, 95) |
≥11 | 25 | 38 (29, 47) | 68 | 77 (74, 80) | 228 | 1.6 (1.2, 2.1) | 0.81 (0.70, 0.94) | 15 (11, 20) | 92 (90, 94) |
≥12 | 21 | 33 (24, 42) | 73 | 80 (78, 83) | 196 | 1.7 (1.2, 2.2) | 0.84 (0.73, 0.96) | 16 (11, 20) | 92 (90, 93) |
≥13 | 18 | 28 (19, 36) | 79 | 83 (81, 86) | 166 | 1.6 (1.2, 2.3) | 0.87 (0.77, 0.98) | 15 (10, 20) | 91 (89, 93) |
≥14 | 15 | 22 (14, 30) | 85 | 85 (83, 88) | 145 | 1.5 (1.0, 2.2) | 0.91 (0.82, 1.0) | 14 (8.9, 19) | 91 (89, 93) |
≥15 | 11 | 17 (10, 25) | 92 | 87 (85, 90) | 105 | 1.4 (0.89, 2.2) | 0.94 (0.86, 1.0) | 13 (7.7, 19) | 91 (89, 92) |
CI = confidence interval, LR = likelihood ratio, NPV = negative predictive value, PPV = positive predictive value
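The exact rule used to pick the "optimal balance" from the ROC curve is not stated. One plausible reading, assumed here, is the cutpoint at which sensitivity and specificity are closest to each other. The sketch below applies that assumed rule to the rounded percentages in Table 4; the function and variable names are hypothetical.

```python
# (sensitivity %, specificity %) by PHQ-9 cutpoint, copied from Table 4.
table4 = {
    4: (95, 22), 5: (89, 36), 6: (80, 48), 7: (71, 56),
    8: (61, 62), 9: (55, 68), 10: (49, 74), 11: (38, 77),
    12: (33, 80), 13: (28, 83), 14: (22, 85), 15: (17, 87),
}


def most_balanced_cutpoint(table):
    """Cutpoint minimizing |sensitivity - specificity| (assumed criterion)."""
    return min(table, key=lambda c: abs(table[c][0] - table[c][1]))


print(most_balanced_cutpoint(table4))  # 8
```

Other selection rules exist, such as maximizing the sum of sensitivity and specificity, and they need not select the same cutpoint.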
Construct validity.
Results of CFA are presented in Table 5. While both the one- and two-factor solutions were reasonable, the two-factor solution had a better fit (CFI=0.93; RMSEA (90% CI)=0.075 (0.069, 0.080)). Based on the Satorra-Bentler χ2 test, the two-factor solution fit the data significantly better than the one-factor solution (χ2=145, df=1, p<.001). The first factor was labeled “somatic symptoms” and contained anhedonia, problems with sleep, fatigue or loss of energy, problems with appetite, and psychomotor agitation or retardation. The second factor was labeled “affective/mood symptoms” and contained depressed mood, guilt or worthlessness, diminished ability to think or concentrate, and suicidal thoughts. The correlation coefficient for the two factors was 0.82.
Table 5.
Solution | Satorra-Bentler χ2(df) | Robust CFI | Robust TLI | Robust RMSEA (90% CI) | SRMR | Δχ2 (df)* | p |
---|---|---|---|---|---|---|---|
1-factor | 764 (27) | 0.90 | 0.87 | 0.088 (0.083, 0.094) | 0.052 | - | - |
2-factor | 542 (26) | 0.93 | 0.91 | 0.075 (0.069, 0.080) | 0.042 | 145 (1) | <.001 |
Notes: Estimates were obtained using robust maximum likelihood estimation. Δχ2 is from the Satorra-Bentler scaled χ2 difference test. The scaling correction factor was 1.547 for the 1-factor solution and 1.510 for the 2-factor solution.
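The scaled difference statistic cannot be obtained by simply subtracting the two scaled χ2 values (764 − 542 = 222). The sketch below reproduces the reported Δχ2 from the values in Table 5 using the original (2001) form of the Satorra-Bentler scaled difference. That form is an assumption: the text cites Satorra (2000) and Satorra and Bentler (2010), and the exact variant used is not stated. The function name is hypothetical.

```python
import math


def scaled_chi2_difference(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled difference (2001 form, assumed here).

    t0, df0, c0: scaled chi-square, df, and scaling correction factor of
    the more restricted model; t1, df1, c1: the same for the less
    restricted model.
    """
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)  # difference-test scaling factor
    return (t0 * c0 - t1 * c1) / cd, df0 - df1


# One-factor (restricted) versus two-factor model, values from Table 5.
stat, df = scaled_chi2_difference(764, 27, 1.547, 542, 26, 1.510)

# Chi-square upper-tail probability; this closed form holds for df = 1 only.
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat), df, p_value < 0.001)  # 145 1 True
```

Because the inputs are the rounded values printed in the table, the result (about 144.9) agrees with the reported 145 only to rounding.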
Reliability
Considering all nine items of the PHQ-9, Cronbach’s alpha was 0.80. Correlations of individual items with the total PHQ-9 score ranged from 0.34 for suicidal thoughts to 0.61 for depressed mood. Reliability was not improved by deleting any single item.
For the somatic subscale, Cronbach’s alpha was 0.75 and correlations with the total score ranged from 0.25 to 0.45. For the affective/mood subscale, Cronbach’s alpha was 0.67 and correlations with the total score ranged from 0.26 to 0.38.
Discussion
This study comprehensively examined the criterion validity, construct validity, and reliability of the PHQ-9 in a population of pregnant women in Lima, Peru. The optimal cutpoint for detecting MDD with the PHQ-9 was ≥8. This cutpoint maximized combined sensitivity (61%) and specificity (62%). It was associated with high NPV (93%), but low PPV (15%). Both one- and two-factor structures were appropriate for this population, but the latter was optimal. Reliability for the full scale and the two factors was high.
Criterion validity
In a multi-country meta-analysis (Levis et al., 2019), the frequently-used cutpoint of ≥10 was associated with high sensitivity and specificity (>80%). However, this finding did not account for variation among specific populations. While a cutpoint of ≥10 is associated with good sensitivity and specificity among pregnant women in Pakistan (Gallis et al., 2018) and the United States (Sidebottom et al., 2012), the recommended cutpoint for pregnant women in Peru, based on this study, is lower. This recommendation is in line with the usual/acceptable range in the literature (i.e., ≥8 to ≥11) (Manea et al., 2012), and is consistent with findings from Ethiopia (Woldetensay et al., 2018).
However, even at the optimal cutpoint of ≥8, sensitivity and specificity were low. As demonstrated by Levis et al. (2019), these values would be expected to be lower when using a fully-structured reference standard such as the CIDI, versus a semi-structured reference standard such as the Structured Clinical Interview for DSM Disorders (SCID) (First, 2014). In addition, low sensitivity and specificity may have been driven by use of past-12-month depression in the CIDI as the reference standard. However, a sensitivity analysis using past-30-day MDD as the reference standard found similar results for sensitivity and specificity, with wider confidence intervals (Supplemental Table 1).
Sensitivity and specificity are inversely related, and this has clinical implications. In this study, using a culturally-appropriate cutpoint of ≥8 versus ≥10 improved sensitivity from 49% to 61%. However, improved sensitivity comes at the cost of reduced specificity (from 74% to 62%) and overestimated prevalence. A cutpoint of ≥8 identified 40% of the sample as having probable depression, which is equal to a previous estimate in Peru based on the Edinburgh Depression Scale (Luna Matos, Salinas Pielago, & Luna Figueroa, 2009). However, this estimate is higher than the “true prevalence” of MDD, as indicated by the reference standard/CIDI (9.9%), and higher than the reported prevalence of MDD in other Latin American countries (16% in São Paulo, Brazil; 11% in Colombia; and 10% in Mexico) (Bromet et al., 2011).
As a result of the PHQ-9 overestimating prevalence, PPV, or the probability of having MDD given a positive screening on the PHQ-9, was found to be very low (15%), while NPV was found to be very high (93%). In the sensitivity analysis using past-30-day MDD as the reference standard, PPV was even lower and NPV was even greater, as a result of further-reduced “true prevalence” when using a more stringent case definition. In practice, this means that pregnant women who screen positive on the PHQ-9 are unlikely to truly require follow-up care for depression, although the vast majority of patients who screen negative are unlikely to have screened negative in error.
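The dependence of predictive values on prevalence follows from Bayes' theorem and can be sketched as follows. The function name is hypothetical. The first call uses the rounded sensitivity, specificity, and CIDI prevalence reported in this study; the second uses a purely hypothetical lower prevalence of 3% and holds sensitivity and specificity fixed, which is an assumption made only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Values reported at the >=8 cutpoint (rounded) and the CIDI prevalence.
ppv, npv = predictive_values(0.61, 0.62, 0.099)
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")  # PPV=0.15, NPV=0.94

# Hypothetical lower prevalence, same sensitivity and specificity.
ppv_low, npv_low = predictive_values(0.61, 0.62, 0.03)
print(ppv_low < ppv, npv_low > npv)     # True True
```

The NPV computed from rounded inputs differs slightly from the 93% reported from exact counts. The second call shows the direction of the effect described above: with fewer true cases, PPV falls and NPV rises.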
Construct validity
Existing literature on the Spanish-language PHQ-9 supports both one-factor (Familiar et al., 2015; González-Blanch et al., 2018; Huang et al., 2006) and two-factor (González-Blanch et al., 2018; Granillo, 2012; Zhong et al., 2014) structures, as does literature on the PHQ-9 generally (Boothroyd et al., 2019). A prior analysis using preliminary data from the same study population as the present analysis (Zhong et al., 2014) found evidence of two factors: an affective/mood factor containing guilt or worthlessness, diminished ability to think or concentrate, suicidal thoughts, and depressed mood; and a somatic factor containing the remaining items. While several studies have also found evidence of a two-factor solution (Beard, Hsu, Rifkin, Busch, & Björgvinsson, 2016; Elhai et al., 2012; González-Blanch et al., 2018; Miranda & Scoppetta, 2018; Richardson & Richards, 2008), the allocation of items has differed across studies. For example, in several studies, the affective factor contained anhedonia in place of diminished ability to think or concentrate (Beard et al., 2016; Elhai et al., 2012; Miranda & Scoppetta, 2018). Notably, in the present study, one of the affective/mood items, depressed mood (“feeling down, depressed, or hopeless”), loaded onto both the affective/mood and somatic factors, but was placed in the affective/mood category based on a better conceptual fit. In a sensitivity analysis, CFA was conducted considering depressed mood in the somatic category. While this resulted in marginally better goodness-of-fit indices, the former results were reported for ease of interpretation. Future research should explore reasons for variation in factor composition among different studies.
Limitations
Several limitations should be considered, especially the lack of an ideal reference standard for MDD. First, fully structured interviews such as the CIDI do not allow for nuanced clinical assessments. Second, the CIDI does not measure past-two-week MDD, which would best align with the PHQ-9. While past-30-day MDD is closer conceptually to what is measured by the PHQ-9, past-12-month MDD was chosen as a reference standard due to the small number of participants categorized as having past-30-day MDD. Past-30-day depression was used as the reference standard in a sensitivity analysis. Third, sensitivity and specificity were calculated based on the responses of only a subset of participants. Nevertheless, PHQ-9 scores were not associated with selection into the validation study, a finding confirmed by repeating the analysis adjusting for possible verification bias (Begg & Greenes, 2009). This sensitivity analysis did not change the results in a way that would affect the conclusions. Fourth, CFA results are sensitive to the estimation method chosen, and an alternate method, such as diagonally-weighted least squares, could have been used (Li, 2016). However, using diagonally-weighted least squares did not alter the study’s conclusions. Fifth, variation in gestational age was not accounted for in the analysis. As depressive symptoms may change over the course of pregnancy, this may have impacted results.
Conclusions
This study enhances literature on the validity of the PHQ-9 among pregnant Peruvian women in two ways. First, its results build on previous findings regarding construct validity, by showing the relative merits of the PHQ-9 as a one- versus two-factor construct in this population. Second, as the first study to assess criterion validity in this population, it provides evidence for use of ≥8 as an optimal, though imperfect, cutpoint. As screening for depression during pregnancy becomes more common in Peru, especially in primary health care centers, these results may help inform best practices. Clinicians should be aware, however, that while the recommended cutpoint is sufficient for identifying the majority of pregnant women in need of follow-up care for depression, it is likely that most people who screen positive in this setting will not be found to truly have MDD upon further clinical assessment. Future research confirming this study’s criterion validity findings using a different reference standard is warranted.
Supplementary Material
Acknowledgments
The authors wish to thank Ms. Elena Sanchez and the dedicated staff members of Asociación Civil Proyectos en Salud (PROESA), Perú and Instituto Materno Perinatal, Perú for their expert technical and administrative assistance with this research.
This research was supported by awards from the National Institutes of Health (NIH), the Eunice Kennedy Shriver Institute of Child Health and Human Development (R01-HD-059835), and the National Institute of Mental Health (R01MH110453, PI: Gradus). The NIH had no further role in study design; in the collection, analysis and interpretation of data; in the writing of the report; or in the decision to submit the paper for publication.
Footnotes
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Ethical Statement
All procedures were approved by the institutional review boards of the Instituto Nacional Materno Perinatal, Lima, Peru, and the Harvard School of Public Health Office of Human Research Administration, Boston, USA.
Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.
References
- Barthel D, Barkmann C, Ehrhardt S, Schoppen S, Bindt C, Appiah-Poku J, … Tannich E (2015). Screening for depression in pregnant women from Côte d’Ivoire and Ghana: Psychometric properties of the Patient Health Questionnaire-9. Journal of Affective Disorders, 187, 232–240. 10.1016/j.jad.2015.06.042
- Batmaz G, Dane B, Sarioglu A, Kayaoglu Z, & Dane C (2015). Can we predict postpartum depression in pregnant women? Clinical and Experimental Obstetrics & Gynecology, 42(5), 605–609.
- Beard C, Hsu KJ, Rifkin LS, Busch AB, & Björgvinsson T (2016). Validation of the PHQ-9 in a psychiatric sample. Journal of Affective Disorders, 193, 267–273. 10.1016/j.jad.2015.12.075
- Begg CB, & Greenes RA (2009). Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics, 39(1), 207–215. http://www.jstor.org/stable/2530820
- Boothroyd L, Dagnan D, & Muncer S (2019). PHQ-9: One factor or two? Psychiatry Research, 271(April 2018), 532–534. 10.1016/j.psychres.2018.12.048
- Breedlove G, & Fryzelka D (2011). Depression Screening During Pregnancy. Journal of Midwifery and Women’s Health, 56(1), 18–25. 10.1111/j.1542-2011.2010.00002.x
- Bromet E, Andrade LH, Hwang I, Sampson NA, Alonso J, de Girolamo G, … Kessler RC (2011). Cross-national epidemiology of DSM-IV major depressive episode. BMC Medicine, 9(1), 90. 10.1186/1741-7015-9-90
- Brosseau-Liard PE, Savalei V, & Li L (2012). An Investigation of the Sample Performance of Two Nonnormality Corrections for RMSEA. Multivariate Behavioral Research, 47(6), 904–930. 10.1080/00273171.2012.715252
- Brown TA (2015). Confirmatory factor analysis for applied research, 2nd edn. New York: The Guilford Press.
- Chung TK, Lau TK, Yip AS, Chiu HF, & Lee DT (2001). Antepartum depressive symptomatology is associated with adverse obstetric and neonatal outcomes. Psychosomatic Medicine, 63(5), 830–834. 10.1097/00006842-200109000-00017
- Cripe SM, Sanchez SE, Perales MT, Lam N, Garcia P, & Williams MA (2008). Association of intimate partner physical and sexual violence with unintended pregnancy among pregnant women in Peru. International Journal of Gynecology and Obstetrics, 100(2), 104–108. 10.1016/j.ijgo.2007.08.003
- Dadi AF, Miller ER, Bisetegn TA, & Mwanri L (2020). Global burden of antenatal depression and its association with adverse birth outcomes: An umbrella review. BMC Public Health, 20(1). 10.1186/s12889-020-8293-9
- Davalos DB, Yadon CA, & Tregellas HC (2012). Untreated prenatal maternal depression and the potential risks to offspring: A review. Archives of Women’s Mental Health, 15(1), 1–14. 10.1007/s00737-011-0251-1
- Elhai JD, Contractor AA, Tamburrino M, Fine TH, Prescott MR, Shirley E, … Calabrese JR (2012). The factor structure of major depression symptoms: a test of four competing models using the Patient Health Questionnaire-9. Psychiatry Research, 199(3), 169–173. 10.1016/j.psychres.2012.05.018
- Familiar I, Ortiz-Panozo E, Hall B, Vieitez I, Romieu I, Lopez-Ridaura R, & Lajous M (2015). Factor structure of the Spanish version of the Patient Health Questionnaire-9 in Mexican women. International Journal of Methods in Psychiatric Research, 24(1), 74–82. 10.1002/mpr.1461
- First MB (2014). Structured clinical interview for the DSM (SCID). The Encyclopedia of Clinical Psychology, 1–6.
- Gallis JA, Maselko J, O’Donnell K, Song K, Saqib K, Turner EL, & Sikander S (2018). Criterion-related validity and reliability of the Urdu version of the patient health questionnaire in a sample of community-based pregnant women in Pakistan. PeerJ. 10.7717/peerj.5185
- Gelaye B, Rondon M, Araya R, & Williams M (2016). Epidemiology of maternal depression, risk factors, and child outcomes in low-income and middle-income countries. Lancet Psychiatry, 3(10), 973–982. 10.1016/S2215-0366(16)30284-X
- Gomez-Beloz A, Williams MA, Sanchez SE, & Lam N (2009). Intimate partner violence and risk for depression among postpartum women in Lima, Peru. Violence and Victims, 24(3), 380–398. 10.1891/0886-6708.24.3.380
- González-Blanch C, Medrano LA, Muñoz-Navarro R, Ruíz-Rodríguez P, Moriana JA, Limonero JT, … Cano-Vindel A (2018). Factor structure and measurement invariance across various demographic groups and over time for the PHQ-9 in primary care patients in Spain. PLoS ONE, 13(2), 1–16. 10.1371/journal.pone.0193356
- Granillo MT (2012). Structure and function of the Patient Health Questionnaire-9 among Latina and non-Latina White female college students. Journal of the Society for Social Work and Research, 3(2), 80–93.
- Gulseren L, Erol A, Gulseren S, Kuey L, Kilic B, & Ergor G (2006). From antepartum to postpartum: a prospective study on the prevalence of peripartum depression in a semiurban Turkish community. The Journal of Reproductive Medicine, 51(12), 955–960.
- Hoffman S, & Hatch MC (2000). Depressive symptomatology during pregnancy: evidence for an association with decreased fetal growth in pregnancies of lower social class women. Health Psychology, 19(6), 535–543.
- Hu LT, & Bentler PM (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.
- Huang FY, Chung H, Kroenke K, Delucchi KL, & Spitzer RL (2006). Using the Patient Health Questionnaire-9 to measure depression among racially and ethnically diverse primary care patients. Journal of General Internal Medicine, 21(6), 547–552. 10.1111/j.1525-1497.2006.00409.x
- Kessler RC, & Ustun TB (2004). The World Mental Health (WMH) Survey Initiative Version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI). International Journal of Methods in Psychiatric Research, 13(2), 93–121.
- Kocalevent RD, Hinz A, & Brahler E (2013). Standardization of the depression screener Patient Health Questionnaire (PHQ-9) in the general population. General Hospital Psychiatry, 35(5), 551–555. 10.1016/j.genhosppsych.2013.04.006
- Kroenke K, Spitzer RL, & Williams JB (2001). The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
- Larsson C, Sydsjö G, & Josefsson A (2004). Health, sociodemographic data, and pregnancy outcome in women with antepartum depressive symptoms. Obstetrics and Gynecology, 104(3), 459–466. 10.1097/01.AOG.0000136087.46864.e4
- Levey EJ, Gelaye B, Koenen K, Zhong QY, Basu A, Rondon MB, … Williams MA (2018). Trauma exposure and post-traumatic stress disorder in a cohort of pregnant Peruvian women. Archives of Women’s Mental Health, 21(2), 193–202. 10.1007/s00737-017-0776-z
- Levis B, Benedetti A, & Thombs BD (2019). Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: Individual participant data meta-analysis. BMJ (Online), 365. 10.1136/bmj.l1476
- Li CH (2016). Confirmatory factor analysis with ordinal data: Comparing robust maximum likelihood and diagonally weighted least squares. Behavior Research Methods, 48(3), 936–949. 10.3758/s13428-015-0619-7
- Luna Matos ML, Salinas Pielago J, & Luna Figueroa A (2009). [Major depression in pregnant women served by the National Materno-Perinatal Institute in Lima, Peru]. Revista panamericana de salud publica = Pan American journal of public health, 26(4), 310–314.
- Manea L, Gilbody S, & McMillan D (2015). A diagnostic meta-analysis of the Patient Health Questionnaire-9 (PHQ-9) algorithm scoring method as a screen for depression. General Hospital Psychiatry, 37(1), 67–75. 10.1016/j.genhosppsych.2014.09.009
- Manea LM, Gilbody S, & McMillan D (2012). Optimal cut-off score for diagnosing depression with the Patient Health Questionnaire (PHQ-9): a meta-analysis. Canadian Medical Association Journal, 184(3), E191–E196. 10.1503/cmaj.112004
- Miranda CAC, & Scoppetta O (2018). Factorial structure of the Patient Health Questionnaire-9 as a depression screening instrument for university students in Cartagena, Colombia. Psychiatry Research, 269(February), 425–429. 10.1016/j.psychres.2018.08.071
- Nasreen HE, Pasi HB, Rifin SM, Aris MAM, Rahman JA, Rus RM, & Edhborg M (2019). Impact of maternal antepartum depressive and anxiety symptoms on birth outcomes and mode of delivery: A prospective cohort study in east and west coasts of Malaysia. BMC Pregnancy and Childbirth, 19(1), 1–11. 10.1186/s12884-019-2349-9
- Ongeri L, Wanga V, Otieno P, Mbui J, Juma E, Vander Stoep A, & Mathai M (2018). Demographic, psychosocial and clinical factors associated with postpartum depression in Kenyan women. BMC Psychiatry, 18(1), 1–9. 10.1186/s12888-018-1904-7
- Perales MT, Cripe SM, Lam N, Sanchez E, & Williams MA (2014). in Lima, Peru, 224–250.
- Richardson EJ, & Richards JS (2008). Factor structure of the PHQ-9 screen for depression across time since injury among persons with spinal cord injury. Rehabilitation Psychology, 53(2), 243.
- Satorra A (2000). Scaled and adjusted restricted tests in multi-sample analysis of moment structures. In Heijmans R, Pollack D, & Satorra A (Eds.), Innovations in multivariate statistical analysis. A Festschrift for Heinz Neudecker (pp. 233–247). London: Kluwer Academic Publishers.
- Satorra A, & Bentler PM (2010). Ensuring positiveness of the scaled difference chi-square test statistic. Psychometrika, 75(2), 243–248. 10.1007/s11336-009-9135-y
- Schaffir J (2018). Consequences of antepartum depression. Clinical Obstetrics and Gynecology, 61(3), 533–543. 10.1097/GRF.0000000000000374
- Sidebottom AC, Harrison PA, Godecker A, & Kim H (2012). Validation of the Patient Health Questionnaire (PHQ)-9 for prenatal depression screening. Archives of Women’s Mental Health, 15(5), 367–374. 10.1007/s00737-012-0295-x
- Spitzer R, Kroenke K, Williams JBW, & Patient Health Questionnaire Primary Care Study Group. (1999). Validation and Utility of a Self-report Version of PRIME-MD. JAMA, 282(18), 1737–1744. 10.1001/jama.282.18.1737
- Venkatesh KK, Riley L, Castro VM, Perlis RH, & Kaimal AJ (2016). Association of antenatal depression symptoms and antidepressant treatment with preterm birth. Obstetrics and Gynecology, 127(5), 926–933. 10.1097/AOG.0000000000001397
- Woldetensay YK, Belachew T, Tesfaye M, Spielman K, Biesalski HK, Kantelhardt EJ, & Scherbaum V (2018). Validation of the Patient Health Questionnaire (PHQ-9) as a screening tool for depression in pregnant women: Afaan Oromo version. PLoS ONE, 13(2), 1–15. 10.1371/journal.pone.0191782
- Zhong Q-Y, Gelaye B, Rondon M, Sanchez SE, Garcia PJ, Sanchez E, … Williams MA (2014). Comparative performance of Patient Health Questionnaire-9 and Edinburgh Postnatal Depression Scale for screening antepartum depression. Journal of Affective Disorders, 162, 1–7. 10.1016/j.jad.2014.03.028
- Zhong Q-Y, Gelaye B, Zaslavsky AM, Farm JR, Rondon MB, Sánchez SE, & Williams MA (2015). Diagnostic validity of the generalized anxiety disorder - 7 (GAD-7) among pregnant women. PLoS ONE, 10(4), 1–17. 10.1371/journal.pone.0125096