PLOS ONE. 2022 Oct 21;17(10):e0276534. doi: 10.1371/journal.pone.0276534

Combining the GP’s assessment and the PHQ-9 questionnaire leads to more reliable and clinically relevant diagnoses in primary care

Clara Teusen 1,*, Alexander Hapfelmeier 1,2, Victoria von Schrottenberg 1, Feyza Gökce 1, Gabriele Pitschel-Walz 1, Peter Henningsen 3, Jochen Gensichen 4, Antonius Schneider 1; for the POKAL-Study-Group
Editor: Pedro Vieira da Silva Magalhaes 5
PMCID: PMC9586376  PMID: 36269712

Abstract

Background

Screening questionnaires are not sufficient to improve diagnostic quality of depression in primary care. The additional consideration of the general practitioner’s (GP’s) assessment could improve the accuracy of depression diagnosis. The aim of this study was to examine whether the GP rating supports a reliable depression diagnosis indicated by the PHQ-9 over a period of three months.

Methods

We performed a secondary data analysis from a previous study. PHQ-9 scores of primary care patients were collected at the time of recruitment (t1) and during a follow-up 3 months later (t2). At t1 GPs independently made a subjective assessment whether they considered the patient depressive (yes/no). Two corresponding groups with concordant and discordant PHQ-9 and GP ratings at t1 were defined. Reliability of the PHQ-9 results at t1 and t2 was assessed within these groups and within the entire sample by Cohen’s Kappa, Pearson’s correlation coefficient and Bland-Altman plots.

Results

364 consecutive patients from 12 practices in the region of Upper Bavaria/Germany participated in this longitudinal study. 279 patients (76.6%) sent back the questionnaire at t2. Concordance of GP rating and PHQ-9 at t1 led to higher replicability of PHQ-9 results between t1 and t2. The reliability of PHQ-9 was higher in the concordant subgroup (κ = 0.507) compared to the discordant subgroup (κ = 0.211) (p = 0.064). The Bland-Altman Plot showed that the deviation of PHQ-9 scores at t1 and t2 decreased by about 15% in the concordant subgroup. Pearson’s correlation coefficient between PHQ-9 scores at t1 and t2 increased significantly if the GP rating was concordant with the PHQ-9 at t1 (r = 0.671) compared to the discordant subgroup (r = 0.462) (p = 0.044).

Conclusions

The combination of PHQ-9 and GP rating might improve diagnostic decision making regarding depression in general practices. PHQ-9 positive results might be more reliable and accurate when a concordant GP rating is considered.

Introduction

Epidemiological studies show that depression is one of the major health problems worldwide [1, 2]. Thus, the accurate diagnosis and appropriate treatment of patients with depressive disorder is a key challenge for our health system [3–5]. Often, the first point of contact for patients with depression is the general practitioner (GP) [6]. Differentiating depressive symptoms from somatic, non-specific, functional or somatoform body complaints is difficult for GPs [7], particularly if patients are multimorbid and seek help for physical rather than psychological complaints [8–10]. Furthermore, compared to specialists in mental health care, GPs work in a setting with a high risk of both over- and underdiagnosis of depression due to the presence of multimorbidity [3]. Even though about 10% of primary care patients are likely to meet criteria for major depression, detection and treatment rates are still low [11]. In addition, the low-threshold access of patients to general practices and the confrontation with subthreshold symptoms that could indicate a multitude of possible diseases complicate the correct diagnosis of depression [12, 13]. However, a correct diagnosis is essential for further adequate treatment of depression.

Previous studies investigated how GPs deal with the challenges of diagnostic decision making regarding depression. It was shown that the GP’s approach often differs from common psychiatric diagnostic systems like the International Statistical Classification of Diseases and Related Health Problems (ICD-10) or the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [14]. According to ICD-10 and DSM-5, a depressive disorder can be diagnosed after a 2-week reference period [15]. In general practice, this time criterion is sometimes considered too short to confirm a depression diagnosis [16]. Usually, a depression diagnosis is considered and discussed carefully with the patient only if depressive symptoms persist for a longer period of time. Accordingly, GPs may be better at correctly diagnosing severe depression than mild depression [17], in part because severe depression is associated with a longer duration of symptoms [18]. Other studies found that the 2-week reference period for depression is appropriate and valid in general practice [19, 20]. However, in order to further improve the diagnosis of depression in primary care, it might be helpful to take into account GP heuristics, their diagnostic strategies and thought processes in addition to existing psychiatric diagnostic criteria [3, 21–24]. A systematic review identified the "experienced anamnesis" and the patient’s long-term history as decisive factors for the GP’s diagnosis [25]. The length of the doctor-patient relationship is thus an important factor for GP diagnostics [26]. Besides these, other heuristics that GPs use for diagnostic decision making are watchful waiting, consideration of etiological and contextual factors and stepwise diagnostic procedures [25, 27].

A common strategy to improve diagnostic decision making is the introduction of screening questionnaires [28]. It has been shown that the systematic use of validated screening tools can improve detection and diagnosis of depression in primary care [29]. However, the use of depression screening questionnaires in primary care is controversial, as a screening strategy leads to an over-estimation of the depression prevalence [30]. For example, screening questionnaires like the Patient Health Questionnaire 9 (PHQ-9) lead to high false-positive rates due to the low pre-test probability of depression in primary care [30]. In addition, the diagnosis of depression is affected by the correlation between somatic illnesses and the somatic symptoms of depression [31, 32]. Whereas a PHQ-9 positive result with a PHQ-9 score ≥10 may represent clinically significant depressive symptoms, a structured diagnostic interview is needed to confirm the presence of major depressive disorder. Beyond that, the PHQ-9 seems to be more suitable for ruling out than for ruling in the diagnosis of depression in primary care [33–35]. Therefore, there are conflicting opinions about the recommendation of routinely screening for depression in primary care [36]. The Canadian Task Force on Preventive Health Care [37] and the guideline for depression management from the United Kingdom’s National Institute for Health and Clinical Excellence [34] do not recommend routinely screening for depression in primary care settings, whereas the US Preventive Services Task Force recommends a universal screening approach for depression in the general adult population [38]. We suppose that a simplified combination of screening questionnaires and GP heuristics could be useful to improve diagnostic quality and practicability in primary care.

The aim of the present analysis is to examine whether the GP rating of the patient as depressive is related to the persistence of depressive symptoms indicated by the PHQ-9 at the time of recruitment and during a follow-up three months later. We assume that a longer duration of symptoms, as determined with repeated PHQ-9 measurements, is more strongly associated with a depression diagnosed by the GP.

Methods

Study design and sample

We performed a secondary data analysis from a previous study on the impact of a complex educational intervention on diagnostic accuracy of impaired mental health in general practices [26]. 12 general practitioners in the region of Upper Bavaria/Germany agreed to take part in this longitudinal study. Suitable practices were recruited through the network of the 210 general practitioners with teaching duties at the Technical University of Munich (TUM). The data collection was carried out between March and October 2014.

All consecutive adult patients who attended the practices in the study period on certain days at regular intervals were asked in the waiting room by a research assistant to fill in a PHQ-9 questionnaire before seeing the doctor (t1). Additionally, at time point t1, the GPs made a subjective assessment whether they considered the patient depressive (yes/no). GPs were blinded to the PHQ-9 result of the patients they were supposed to assess for depression. During a follow-up investigation three months later (t2), patients were invited to fill in the PHQ-9 again to investigate the stability of the depressive symptoms. The questionnaire was sent by post and patients were asked to return it to the Institute of General Practice and Health Services Research. The patients received an incentive of 10 € per completed questionnaire. Inclusion criteria were an age of at least 18 years, sufficient knowledge of the German language and a signed consent form. Patients were not asked if they received any therapy, counselling or medication in the meantime. The underlying data for this study are pseudonymized and the study was approved by the Medical Ethics Committee of the Technical University of Munich (approval No 15/14).

Questionnaire

The validated German version of the Patient Health Questionnaire 9 (PHQ-9) was used as a screening questionnaire to assess the presence of depressive symptoms in the patients within the past two weeks [39]. The depression severity score comprises nine items which are summed, with a range from zero (no depression) to 27 (maximum). Findings from previous studies show that a cut-off value of 10 or higher is considered useful [40], as a score of 10 represents at least a moderate level of depressive symptoms [39]. A score between five and 10 is mostly found in patients with mild or subthreshold depressive symptoms and corresponds to a mild degree of severity [39]. In addition, the questionnaire presented to the patients at t1 comprised sociodemographic items regarding education, occupation and family status.
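The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are assumptions introduced here, and the item range of 0 to 3 follows from nine items summing to a maximum of 27.

```python
from typing import Sequence

# Cut-off used in this study to label a result as PHQ-9 positive.
PHQ9_CUTOFF = 10


def phq9_total(items: Sequence[int]) -> int:
    """Sum the nine PHQ-9 item scores (each 0-3) into a total of 0-27."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each PHQ-9 item must be scored 0, 1, 2 or 3")
    return sum(items)


def phq9_positive(total: int) -> bool:
    """Return True if the total reaches the study cut-off (>= 10)."""
    return total >= PHQ9_CUTOFF
```

A total of 9 would therefore count as PHQ-9 negative and a total of 10 as PHQ-9 positive, which is why scores close to the cut-off can change category between two measurements.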

Data analysis

The distribution of quantitative data is described by mean values and standard deviations. Categorical data are presented as absolute and relative frequencies. Statistical significance of respective group differences was assessed by Chi-squared tests and t-tests. The main research question was whether patients’ self-ratings on the PHQ-9 at t1 and t2 were more reliable when the GP’s rating matched the PHQ-9 assessment at t1 than when there was no match. For this purpose, a PHQ-9 score ≥10 was used to indicate a self-rated depression. This outcome was labelled as PHQ-9 positive; a PHQ-9 score <10 was labelled as PHQ-9 negative. Two groups with concordant versus discordant PHQ-9 and GP ratings at t1 were defined. The replicability of PHQ-9 results between t1 and t2 was compared between the concordant and discordant group by a 2 × 2 table. Mean values were presented additionally to enable a comparison of the dimensional PHQ-9 results. Furthermore, reliability of the PHQ-9 test results at t1 and t2 was assessed within these groups and within the entire sample by Cohen’s Kappa, Pearson’s correlation coefficient and Bland-Altman plots. Limits of Agreement (LoA), which cover about 95% of the differences between measurements, were computed for the latter. Respective hypothesis testing of group differences was performed by Z-tests. All analyses were performed using the software package SPSS (Version 25, IBM, Armonk, NY, USA) and R 4.0.3 (The R Foundation for Statistical Computing, Vienna, Austria). Exploratory two-sided 5% significance levels were used for any hypothesis testing.
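The grouping rule can be illustrated with a short Python sketch. The analyses in the study were run in SPSS and R, so this is not the authors' code; the function name and the example patients are hypothetical.

```python
def concordance_group(phq9_total_t1: int, gp_rates_depressive: bool,
                      cutoff: int = 10) -> str:
    """Assign a patient to the concordant or discordant group at t1.

    Concordant: PHQ-9 result (positive if total >= cutoff) and the GP's
    yes/no rating point in the same direction. Discordant: they differ.
    """
    phq9_is_positive = phq9_total_t1 >= cutoff
    return "concordant" if phq9_is_positive == gp_rates_depressive else "discordant"


# Hypothetical patients: (PHQ-9 total at t1, GP rates patient as depressive)
example_patients = [(14, True), (12, False), (3, False), (7, True)]
groups = [concordance_group(total, gp) for total, gp in example_patients]
```

Both agreeing positives and agreeing negatives fall into the concordant group, so that group contains clear cases in either direction.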

Within the previous cluster randomised controlled pilot study, the 12 practices were divided into a control and an intervention group after a one-time training intervention for general practitioners [26]. The GPs in the intervention group received a one-day training on diagnostics and interviewing as well as on recognising and dealing with psychosomatic patients. The training included expert lectures (on depression, anxiety and somatization), group discussions and role-playing psychosomatic counselling situations with a simulated patient. However, a one-day training intervention alone did not seem to improve the perception and management of psychosomatic illness [26]. No relevant sociodemographic or diagnostic differences were found between the intervention and the control group, so the collected data of all patients were merged in our secondary analysis.

Results

364 consecutive patients from 12 practices in the region of Upper Bavaria/Germany participated in this longitudinal study (see Fig 1). 279 patients (76.6%) sent back the questionnaire at t2. Non-responder analysis showed no relevant differences with respect to gender and depression diagnosis by the GP or PHQ-9 (not in Table 1). However, the average age of non-responders (mean 47.7, standard deviation 19.0) was significantly (p = 0.043) lower compared to responders (mean 52.3, standard deviation 18.1). The baseline characteristics of the patients at t1 are displayed in Table 1.

Fig 1. Flowchart of patients.

Table 1. Baseline characteristics (t1).

| Parameter (missing values) | Total (n = 364), n (%) | Depression rating by the GP¹: yes (n = 85), n (%) | Depression rating by the GP¹: no (n = 271), n (%) | P-value | Depression at t1 (PHQ-9 ≥10)²: yes (n = 61), n (%) | Depression at t1 (PHQ-9 ≥10)²: no (n = 290), n (%) | P-value |
|---|---|---|---|---|---|---|---|
| Sex, female (0) | 203 (55.8) | 55 (64.7) | 142 (52.4) | p = 0.046 | 38 (62.3) | 159 (54.8) | p = 0.285 |
| Age (6), mean (SD) in years | 51.24 (18.41) | 50.98 (18.56) | 51.16 (18.23) | p = 0.803 | 45.82 (16.95) | 51.90 (18.61) | p = 0.158 |
| Marital status (2) | | | | p = 0.087 | | | p<0.001 |
| Married or in a stable relationship | 245 (67.3) | 50 (58.8) | 193 (71.2) | | 27 (44.3) | 210 (72.4) | |
| Single | 92 (25.3) | 29 (34.1) | 61 (22.5) | | 27 (44.3) | 60 (20.7) | |
| Widowed | 25 (6.9) | 6 (7.1) | 17 (6.3) | | 6 (9.8) | 19 (6.6) | |
| Education level (2) | | | | p = 0.485 | | | p = 0.659 |
| No school degree | 6 (1.6) | 3 (3.5) | 3 (1.1) | | 2 (3.3) | 4 (1.4) | |
| < 10 y of formal education | 107 (29.4) | 26 (30.6) | 79 (29.2) | | 15 (24.6) | 83 (28.6) | |
| 10 y of formal education | 112 (30.8) | 28 (32.9) | 81 (29.9) | | 22 (36.1) | 88 (30.3) | |
| 12–13 y of formal education | 116 (31.9) | 23 (27.1) | 92 (33.9) | | 17 (27.9) | 97 (33.3) | |
| Other | 21 (5.8) | 5 (5.9) | 15 (5.5) | | 4 (6.6) | 17 (5.9) | |
| Occupation (5) | | | | p = 0.017 | | | p<0.001 |
| Employed part-time | 61 (16.8) | 17 (20.0) | 43 (15.9) | | 12 (19.7) | 46 (15.9) | |
| Employed full-time | 146 (40.1) | 26 (30.6) | 119 (43.9) | | 17 (27.9) | 126 (43.4) | |
| Housewife/homemaker/non-working | 16 (4.4) | 3 (3.5) | 13 (4.8) | | 5 (8.2) | 11 (3.8) | |
| Retired | 94 (25.8) | 23 (27.1) | 68 (25.1) | | 10 (16.4) | 80 (27.6) | |
| Unemployed | 10 (2.7) | 7 (8.2) | 3 (1.1) | | 7 (11.5) | 3 (1.0) | |
| Other | 32 (8.8) | 7 (8.2) | 23 (8.5) | | 8 (13.1) | 22 (7.6) | |

¹ 8 missing values due to missing GP rating.

² 13 missing values due to incomplete PHQ-9 response. SD: Standard deviation.

Table 2 depicts the replicability of the PHQ-9 results over the period from t1 to t2. In the entire sample, 47 patients (17.9%) were PHQ-9 positive at t1 and 216 patients (82.1%) were PHQ-9 negative. Of the PHQ-9 positives, 25 (53.2%) received a positive result again at t2. Of the patients without depression at t1, 194 (89.8%) remained without depression at t2.

Table 2. Replicability of the PHQ-9 results at t1 and t2 alone compared to the stability of the PHQ-9 results in the PHQ-9 and GP rating concordant vs. discordant subgroup at t1.

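All responders¹ (N = 263)

| PHQ-9 result at t1 | Statistic | Depression (PHQ-9≥10) at t2: n (%) | Depression at t2: PHQ-9 at t1 | Depression at t2: PHQ-9 at t2 | No depression (PHQ-9<10) at t2: n (%) | No depression at t2: PHQ-9 at t1 | No depression at t2: PHQ-9 at t2 | All: n (%) |
|---|---|---|---|---|---|---|---|---|
| Depression (PHQ-9≥10) at t1 | Mean ± SD | 25 (53.2) | 15.2±4.9 | 15.2±3.5 | 22 (46.8) | 12.4±2.1 | 6.1±1.9 | 47 (17.9) |
| | Median; range | | 14; 10–27 | 15; 10–25 | | 12; 10–17 | 6.7; 1–9 | |
| No depression (PHQ-9<10) at t1 | Mean ± SD | 22 (10.2) | 5.3±2.9 | 12.2±2.4 | 194 (89.8) | 3.1±2.5 | 3.3±2.2 | 216 (82.1) |
| | Median; range | | 5; 0–9 | 11; 10–18 | | 2.5; 0–9 | 3; 0–9 | |

Concordant subgroup at t1 (N = 208)

| PHQ-9 result at t1 | Statistic | Depression (PHQ-9≥10) at t2: n (%) | Depression at t2: PHQ-9 at t1 | Depression at t2: PHQ-9 at t2 | No depression (PHQ-9<10) at t2: n (%) | No depression at t2: PHQ-9 at t1 | No depression at t2: PHQ-9 at t2 | All: n (%) |
|---|---|---|---|---|---|---|---|---|
| Depression (PHQ-9≥10) at t1 | Mean ± SD | 17 (63.0) | 16.6±5.3 | 15.5±3.9 | 10 (37.0) | 12.3±2.3 | 6.2±1.4 | 27 (13.0) |
| | Median; range | | 17; 10–27 | 15; 10–25 | | 12; 10–16 | 6.4; 4–8 | |
| No depression (PHQ-9<10) at t1 | Mean ± SD | 15 (8.3) | 4.7±2.9 | 11.8±2.2 | 166 (91.7) | 2.9±2.4 | 3.1±2.1 | 181 (87.0) |
| | Median; range | | 4; 0–9 | 11; 10–16 | | 2; 0–9 | 3; 0–9 | |

Discordant subgroup at t1 (N = 55)

| PHQ-9 result at t1 | Statistic | Depression (PHQ-9≥10) at t2: n (%) | Depression at t2: PHQ-9 at t1 | Depression at t2: PHQ-9 at t2 | No depression (PHQ-9<10) at t2: n (%) | No depression at t2: PHQ-9 at t1 | No depression at t2: PHQ-9 at t2 | All: n (%) |
|---|---|---|---|---|---|---|---|---|
| Depression (PHQ-9≥10) at t1 | Mean ± SD | 8 (40.0) | 12.1±2.0 | 14.7±2.8 | 12 (60.0) | 12.5±2.0 | 6.0±2.4 | 20 (36.4) |
| | Median; range | | 12; 10–16 | 14; 12–20 | | 12; 10–17 | 6.9; 1–9 | |
| No depression (PHQ-9<10) at t1 | Mean ± SD | 7 (20.0) | 6.7±2.1 | 13.1±2.8 | 28 (80.0) | 4.3±2.8 | 4.3±2.5 | 35 (63.6) |
| | Median; range | | 7; 4–9 | 13; 10–18 | | 3.5; 0–9 | 4; 0–9 | |

¹ With complete PHQ-9 response. Descriptive statistics are n (%), Mean ± Standard Deviation and Median (Range).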

The PHQ-9 test result in terms of inclusion or exclusion of depression was concordant with the GP rating at t1 in 208 patients. In this concordant subgroup, 27 patients (13.0%) received a PHQ-9 positive result at t1, of whom 17 (63.0%) were PHQ-9 positive again at t2. 181 patients (87.0%) were PHQ-9 negative at t1, of whom 166 (91.7%) were PHQ-9 negative again at t2.

The discordant group with a mismatch between PHQ-9 and GP assessment at t1 comprised 55 patients. In this subgroup, 20 patients (36.4%) were PHQ-9 positive at t1, of whom 8 (40.0%) received a positive result again at t2. 35 patients (63.6%) were PHQ-9 negative at t1, of whom 28 (80.0%) were PHQ-9 negative again at t2. Among PHQ-9 positives at t1, the PHQ-9 mean values were higher in the concordant subgroup than in the discordant subgroup. Likewise, among PHQ-9 negatives at t1, the PHQ-9 mean values were lower in the concordant subgroup than in the discordant subgroup, suggesting that PHQ-9 means are more clearly positioned above or below the cut-off value (≥10) in the concordant subgroup.

The PHQ-9 results at t1 and t2 showed moderate agreement overall (κ = 0.430, SE = 0.072). Agreement was moderate when the GP rating was concordant with the PHQ-9 test result at t1 (κ = 0.507, SE = 0.086) and only fair in the discordant group (κ = 0.211, SE = 0.135); however, this difference was not statistically significant (p = 0.064) (data not in Table 2).
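The κ values can be recomputed from the 2 × 2 counts in Table 2, and the group comparison can be approximated with a simple z-test on the two estimates. The sketch below is in Python rather than the SPSS and R software used in the study, and the z-test formula (difference divided by the root of the summed squared standard errors, treating the subgroups as independent) is an assumption about how the reported Z-test was carried out. Under that assumption, the counts give κ values that round to the reported 0.430, 0.507 and 0.211, and the p-value comes out close to the reported 0.064.

```python
import math


def cohens_kappa(pos_pos: int, pos_neg: int, neg_pos: int, neg_neg: int) -> float:
    """Cohen's kappa for a 2 x 2 table (rows: result at t1, columns: result at t2)."""
    n = pos_pos + pos_neg + neg_pos + neg_neg
    p_observed = (pos_pos + neg_neg) / n
    # Chance agreement from the row and column marginals.
    p_expected = ((pos_pos + pos_neg) * (pos_pos + neg_pos)
                  + (neg_pos + neg_neg) * (pos_neg + neg_neg)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def z_test_difference(est_a: float, se_a: float, est_b: float, se_b: float):
    """Two-sided z-test for the difference of two independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Counts taken from Table 2 (t1 positive/negative by t2 positive/negative).
kappa_all = cohens_kappa(25, 22, 22, 194)
kappa_concordant = cohens_kappa(17, 10, 15, 166)
kappa_discordant = cohens_kappa(8, 12, 7, 28)

# Kappa estimates and standard errors as reported in the text.
z_value, p_value = z_test_difference(0.507, 0.086, 0.211, 0.135)
```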

The Bland-Altman Plot (Fig 2) indicates that the agreement of PHQ-9 scores at t1 and t2 for the concordant subgroup is higher (Limits of Agreement (LoA): -7.25 to 7.88) than in the discordant subgroup (LoA: -9.75 to 9.33). The LoA are 95% prediction intervals describing the range in which the majority of the individual differences between the PHQ-9 measurements at t1 and t2 are expected to lie, as they cover about 95% of these values. The deviation of PHQ-9 scores at t1 and t2 decreases by about 15% when the GP rating is concordant with the PHQ-9 at t1. The plots show that patients with a PHQ-9 mean between 5 and 10 had large absolute differences between t1 and t2, suggesting high variability of the PHQ-9 scores in this range. Patients with a PHQ-9 mean <5 and a PHQ-9 mean ≥10 showed lower absolute differences between the time points, especially in the concordant subgroup.
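For readers unfamiliar with the method, Limits of Agreement are commonly computed as the mean difference (bias) plus or minus 1.96 standard deviations of the paired differences. The Python sketch below illustrates this with hypothetical scores; the direction of the difference (t1 minus t2) and the function name are assumptions, as the source does not state them.

```python
import statistics
from typing import Sequence, Tuple


def limits_of_agreement(scores_t1: Sequence[float],
                        scores_t2: Sequence[float]) -> Tuple[float, float, float]:
    """Return (lower limit, bias, upper limit) for paired measurements.

    Differences are taken as t1 - t2 (an assumption for this sketch).
    """
    if len(scores_t1) != len(scores_t2):
        raise ValueError("both time points need the same number of patients")
    differences = [a - b for a, b in zip(scores_t1, scores_t2)]
    bias = statistics.mean(differences)
    spread = 1.96 * statistics.stdev(differences)  # sample standard deviation
    return bias - spread, bias, bias + spread


# Hypothetical PHQ-9 totals for four patients at t1 and t2.
lower, bias, upper = limits_of_agreement([10, 12, 8, 15], [8, 12, 9, 11])
```

Narrower limits mean that individual scores changed less between the two time points, which is the sense in which agreement is described as higher in the concordant subgroup.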

Fig 2. Bland-Altman Plots of the subgroups with a concordant vs. discordant GP rating and PHQ-9 results at t1 (dashed lines represent the Limits of Agreement and the bias).

A scatter plot regarding the correlation of PHQ-9 scores at t1 and t2 is given in Fig 3. Pearson’s correlation coefficient between PHQ-9 scores at t1 and t2 was r = 0.646. Pearson’s correlation coefficient increased significantly if the GP rating was concordant with the PHQ-9 at t1 (r = 0.671) compared to the discordant subgroup (r = 0.462) (p = 0.044). All correlation coefficients were statistically significant (p<0.001).
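A common way to compare two correlation coefficients from independent groups is Fisher's z-transformation. The Python sketch below uses it as an assumed stand-in for the Z-test named in the Methods; the subgroup sizes of 208 and 55 are taken from Table 2 and are also an assumption, since the source does not state the n behind each correlation. Under these assumptions the p-value comes out close to the reported 0.044.

```python
import math


def compare_independent_correlations(r_a: float, n_a: int,
                                     r_b: float, n_b: int):
    """Two-sided z-test for the difference of two independent Pearson r values."""
    # Fisher's z-transformation makes the sampling distribution roughly normal.
    z_a = math.atanh(r_a)
    z_b = math.atanh(r_b)
    standard_error = math.sqrt(1 / (n_a - 3) + 1 / (n_b - 3))
    z = (z_a - z_b) / standard_error
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Reported correlations; group sizes assumed from Table 2.
z_corr, p_corr = compare_independent_correlations(0.671, 208, 0.462, 55)
```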

Fig 3. Correlation of PHQ-9 scores at t1 and t2 with discordant vs. concordant PHQ-9 and GP rating.

Discussion

The analysis showed that the concordance of the GP rating and the PHQ-9 results at t1 leads to a higher replicability of a PHQ-9 positive result over a period of three months. In addition, the replicability of a PHQ-9 negative result was improved when GP rating and PHQ-9 were concordant at t1.

For the entire sample, we found that only 53.2% of all PHQ-9 positives at t1 received the same PHQ-9 positive result at t2. This proportion increased to 63.0% if the GP rating and PHQ-9 results were concordant at t1.

Beyond that, the Bland-Altman Plot showed a higher agreement of PHQ-9 scores between baseline and follow-up in the concordant subgroup. An increased reliability was also indicated by the significantly higher correlation of PHQ-9 scores between t1 and t2 in the concordant subgroup. Assuming that the persistence of symptoms is more strongly associated with actual depression [17, 18], our results suggest that a better ruling-in of depression is achieved when the GP rating in addition to a high PHQ-9 score is considered. The explicit consideration of GP heuristics like the "experienced anamnesis", the patient’s long-term history [25], watchful waiting, consideration of etiological and contextual factors and stepwise diagnostic procedures [25, 27] could be of great use to increase the pre-test probability of depression in primary care. Therefore, as the mere use of screening questionnaires in general practices does not lead to sufficient diagnostic certainty [30], the combination of PHQ-9 and the GP rating might improve the detection of patients with depression in primary care.

Studies have shown that the false-positive rate of the PHQ-9 is around 60% in a population with a 10% prevalence of depression, so that false-positives and false-negatives have to be examined carefully [30]. Overall, we found that 46.8% of the patients with a PHQ-9 positive result at t1 had a negative result at t2 in the present study. This may indicate a substantial overestimation of PHQ-9 positives at t1. In the discordant subgroup, this was even more pronounced (60.0%) than in the concordant subgroup (37.0%). Therefore, the combination of the GP and PHQ-9 assessments may be associated with a decreased likelihood of false-positives.
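The link between prevalence and false-positives can be made concrete with Bayes' rule. In the Python sketch below, the sensitivity of 0.80 and specificity of 0.87 are hypothetical values chosen only for illustration; they are not taken from this study or from [30]. The 30% pre-test probability is likewise a hypothetical stand-in for a situation in which additional clinical information has raised the prior probability of depression.

```python
def positive_predictive_value(prevalence: float, sensitivity: float,
                              specificity: float) -> float:
    """Share of positive screening results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.80
SPECIFICITY = 0.87

ppv_low_prior = positive_predictive_value(0.10, SENSITIVITY, SPECIFICITY)
ppv_high_prior = positive_predictive_value(0.30, SENSITIVITY, SPECIFICITY)

# Share of positives that are false at 10% prevalence.
false_positive_share = 1 - ppv_low_prior
```

With these illustrative inputs, roughly six in ten positive results are false at 10% prevalence, and the share of true positives rises markedly once the pre-test probability is higher.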

The accurate ruling-out of depression is also of great importance. In our analysis, we found that patients who were not rated as depressive by their GP had fewer PHQ-9 positive results at t1 and t2. Other studies have already shown that GPs are good at ruling out depression [3, 21, 22]. PHQ-9 results were more reliable in the concordant subgroup, in particular when PHQ-9 scores were <5 at t1 and t2. This suggests that depression is ruled out more reliably when the GP’s rating is combined with negative PHQ-9 results.

In contrast to that, patients with intermediate PHQ-9 scores (5–10) remain in a “grey area” which seems to represent a major challenge for diagnostic decision making, especially if GPs are less confident in their abilities to identify depression [24]. However, it is of great importance to increase reliable diagnostic decision making in cases with milder forms of depression which are characterized by high diagnostic uncertainty.

In the discordant subgroup, patients’ PHQ-9 scores were more often just above or just below the cut-off at t1, so they were more likely to drift above or below the cut-off at t2. Recognizing stable and clear cases with pronounced depressive symptoms or without symptoms seems to be easier for GPs, while diagnosing patients with subthreshold symptoms is a major challenge. GPs are more likely to identify a more severe depression indicated by higher PHQ-9 results and are more likely to exclude a depression diagnosis correctly if the PHQ-9 result is low.

Furthermore, it has to be taken into account that among the 20 patients with PHQ-9 positive results at t1 in the discordant subgroup, a substantial proportion (40.0%) received a positive result again at t2, which implies a general tendency towards chronicity [41, 42]. Nevertheless, there is a remarkable difference from the concordant subgroup, where chronicity rates are higher (63.0%). This might again indicate that a combination of GP heuristics and screening questionnaire could improve the diagnostic process for patients with severe depression and a high risk of chronicity, which needs to be investigated in further studies.

Limitations

Our analysis has several limitations. First, influential life circumstances, such as therapy consultation, medication or critical life events were not measured between t1 and t2. This could have an impact on the interpretation of the results as we do not know what happened in the meantime between t1 and t2. Future studies should address this point to analyze the impact of such factors. Secondly, the results were derived by a secondary analysis. Therefore, the findings of the present study should be validated within further diagnostic studies. Thirdly, the GP rating of the patient as depressive is a subjective evaluation based on implicit GP heuristics which were used to indicate a depression diagnosis (yes/no). However, we did not identify which specific GP heuristics were used to diagnose depression. Therefore, an in-depth exploration of GP heuristics and the differentiation from psychiatric diagnostics seems to be of great value. Moreover, we could not verify whether the GP explicitly used the diagnostic criteria of depression during patient assessment. Further limitations arise from the patients who refused to participate in the study at t2. These non-responders were on average younger than responders, which might have an effect on the appearance of depression diagnoses at t2. An important limitation is given by the PHQ-9 itself. The questionnaire is well suited as a screening method for depression, but it cannot be used to obtain a reliable and definite psychological diagnosis. To be certain, such a diagnosis must be confirmed by a standardized diagnostic interview, which was not performed in this study. As it is difficult to determine the accurate proportion of false-positives and false-negatives without a reference standard, further studies need to investigate the accuracy of GP diagnoses in combination with screening questionnaires compared to standardized diagnostic interviews as a reference standard.

Conclusion

The combination of the PHQ-9 and the GP rating might improve diagnostic decision making regarding depression in general practice. Therefore, it is necessary to combine common diagnostic methods and GP heuristics to improve the positive predictive value of screening questionnaires like the PHQ-9. Thus, a questionnaire which specifically considers the GP heuristics as well as the psychiatric criteria might be useful. Further studies are necessary to identify explicit GP heuristics which might increase diagnostic accuracy in primary care.

Acknowledgments

The POKAL-Study-Group (PrädiktOren und Klinische Ergebnisse bei depressiven ErkrAnkungen in der hausärztLichen Versorgung (POKAL, DFG-GRK 2621)) consists of the following principal investigators: Tobias Dreischulte, Peter Falkai, Jochen Gensichen, Peter Henningsen, Markus Bühner, Caroline Jung-Sievers, Helmut Krcmar, Karoline Lukaschek, Gabriele Pitschel-Walz and Antonius Schneider. The following doctoral students are also members of the POKAL-Study-Group: Jochen Vukas, Puya Younesi, Feyza Gökce, Victoria von Schrottenberg, Petra Schönweger, Hannah Schillock, Jonas Raub, Philipp Reindl-Spanner, Lisa Hattenkofer, Lukas Kaupe, Carolin Haas, Julia Eder, Vita Brisnik, Constantin Brand, Katharina Biersack and Regina Wehrstedt von Nessen-Lapp. The study was performed for the PhD thesis of CT at the Medical Faculty of the Technical University Munich.

Data Availability

Our data contain potentially identifying or sensitive patient information. Therefore, the Medical Ethics Committee of the Technical University Munich has restricted data access. The data are held by the Institute of General Practice and Health Services Research of the Technical University Munich. The data are not publicly available due to data protection regulations, but may be obtained from the Institute of General Practice and Health Services Research of the Technical University Munich by researchers who meet the criteria for access to confidential data. Interested researchers can contact the data protection officer of the Technical University Munich if they wish to access our data (e-mail: beauftragter@datenschutz.tum.de). Alternatively, data requests may be sent to the Institute of General Practice and Health Services Research of the Technical University Munich (e-mail: allgemeinmedizin@mri.tum.de).

Funding Statement

This secondary data analysis was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, https://www.dfg.de/) (grant No GrK 2621). As principal investigator of the graduate school „PrädiktOren und Klinische Ergebnisse bei depressiven ErkrAnkungen in der hausärztLichen Versorgung (POKAL)“, AS received the funding to conduct the analysis. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Demyttenaere K, Bruffaerts R, Posada-Villa J, Gasquet I, Kovess V, Lepine JP, et al. Prevalence, severity, and unmet need for treatment of mental disorders in the World Health Organization World Mental Health Surveys. JAMA. 2004;291(21):2581–90. doi: 10.1001/jama.291.21.2581
2. Jacobi F, Höfler M, Siegert J, Mack S, Gerschler A, Scholl L, et al. Twelve-month prevalence, comorbidity and correlates of mental disorders in Germany: the Mental Health Module of the German Health Interview and Examination Survey for Adults (DEGS1-MH). Int J Methods Psychiatr Res. 2014;23(3):304–19. doi: 10.1002/mpr.1439
3. Mitchell AJ, Vaze A, Rao S. Clinical diagnosis of depression in primary care: a meta-analysis. Lancet. 2009;374(9690):609–19.
4. Mitchell AJ, Rao S, Vaze A. Can general practitioners identify people with distress and mild depression? A meta-analysis of clinical accuracy. Journal of Affective Disorders. 2011;130(1):26–36. doi: 10.1016/j.jad.2010.07.028
5. Mitchell AJ, Rao S, Vaze A. International comparison of clinicians’ ability to identify depression in primary care: meta-analysis and meta-regression of predictors. British Journal of General Practice. 2011;61(583):e72–e80. doi: 10.3399/bjgp11X556227
6. Kruse J, Schmitz N, Wöller W, Heckrath C, Tress W. [Why does the general practitioner overlooks psychological disorders in his patient?]. Psychother Psychosom Med Psychol. 2004;54(2):45–51.
7. Schneider A, Wartner E, Schumann I, Hörlein E, Henningsen P, Linde K. The impact of psychosomatic co-morbidity on discordance with respect to reasons for encounter in general practice. Journal of psychosomatic research. 2013;74(1):82–5. doi: 10.1016/j.jpsychores.2012.09.007
8. Bühring P. 121. Deutscher Ärztetag in Erfurt: Mehr Aufmerksamkeit für psychische Erkrankungen. Dtsch Arztebl International. 2018;115(17):812–4. German.
9. Schäfer I, Hansen H, Schön G, Höfels S, Altiner A, Dahlhaus A, et al. The influence of age, gender and socio-economic status on multimorbidity patterns in primary care. First results from the multicare cohort study. BMC health services research. 2012;12:89. doi: 10.1186/1472-6963-12-89
10. Haddad M. Depression in adults with a chronic physical health problem: treatment and management. International journal of nursing studies. 2009;46(11):1411–4. doi: 10.1016/j.ijnurstu.2009.08.007
11. Craven MA, Bland R. Depression in primary care: current and future challenges. Canadian journal of psychiatry Revue canadienne de psychiatrie. 2013;58(8):442–8. doi: 10.1177/070674371305800802
12. Fink P. Surgery and medical treatment in persistent somatizing patients. Journal of psychosomatic research. 1992;36(5):439–47. doi: 10.1016/0022-3999(92)90004-l
13. Holmes A, Christelis N, Arnold C. Depression and chronic pain. The Medical journal of Australia. 2013;199(6):S17–S20. doi: 10.5694/mja12.10589
14. Davidsen AS, Fosgerau CF. What is depression? Psychiatrists’ and GPs’ experiences of diagnosis and the diagnostic process. International journal of qualitative studies on health and well-being. 2014;9:24866. doi: 10.3402/qhw.v9.24866
15. Dilling H, Freyberger HJ. Taschenführer zur ICD-10-Klassifikation psychischer Störungen: nach dem Pocket Guide von J.E. Cooper. 9th ed. Bern; 2019.
16. Wockenfuss R, Frese T, Herrmann K, Claussnitzer M, Sandholzer H. Three-and four-digit ICD-10 is not a reliable classification system in primary care. Scandinavian journal of primary health care. 2009;27(3):131–6. doi: 10.1080/02813430903072215
17. Lucassen P, van Rijswijk E, van Weel-Baumgarten E, Dowrick C. Making fewer depression diagnoses: beneficial for patients? Mental health in family medicine. 2008;5(3):161–5.
18. Kessler RC, Berglund P, Demler O, Jin R, Koretz D, Merikangas KR, et al. The epidemiology of major depressive disorder: results from the National Comorbidity Survey Replication (NCS-R). JAMA. 2003;289(23):3095–105. doi: 10.1001/jama.289.23.3095
19. Pedersen SH, Stage KB, Bertelsen A, Grinsted P, Kragh-Sørensen P, Sørensen T. ICD-10 criteria for depression in general practice. J Affect Disord. 2001;65(2):191–4. doi: 10.1016/s0165-0327(00)00268-8
  • 20.Gunn J, Elliott P, Densley K, Middleton A, Ambresin G, Dowrick C, et al. A trajectory-based approach to understand the factors associated with persistent depressive symptoms in primary care. J Affect Disord. 2013;148(2–3):338–46. doi: 10.1016/j.jad.2012.12.021 [DOI] [PubMed] [Google Scholar]
  • 21.Fernández A, Pinto-Meza A, Bellón JA, Roura-Poch P, Haro JM, Autonell J, et al. Is major depression adequately diagnosed and treated by general practitioners? Results from an epidemiological study. General hospital psychiatry. 2010;32(2):201–9. doi: 10.1016/j.genhosppsych.2009.11.015 [DOI] [PubMed] [Google Scholar]
  • 22.Carey M, Jones K, Meadows G, Sanson-Fisher R, D’Este C, Inder K, et al. Accuracy of general practitioner unassisted detection of depression. Aust N Z J Psychiatry. 2014;48(6):571–8. doi: 10.1177/0004867413520047 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Menear M, Doré I, Cloutier A-M, Perrier L, Roberge P, Duhoux A, et al. The influence of comorbid chronic physical conditions on depression recognition in primary care: a systematic review. Journal of psychosomatic research. 2015;78(4):304–13. doi: 10.1016/j.jpsychores.2014.11.016 [DOI] [PubMed] [Google Scholar]
  • 24.Sinnema H, Terluin B, Volker D, Wensing M, Van Balkom A. Factors contributing to the recognition of anxiety and depression in general practice. BMC family practice. 2018;19(1):1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Schumann I, Schneider A, Kantert C, Löwe B, Linde K. Physicians’ attitudes, diagnostic process and barriers regarding depression diagnosis in primary care: a systematic review of qualitative studies. Family practice. 2012;29(3):255–63. doi: 10.1093/fampra/cmr092 [DOI] [PubMed] [Google Scholar]
  • 26.Schneider A, Mayer V, Dinkel A, Wagenpfeil S, Linde K, Henningsen P. Educational intervention to improve diagnostic accuracy regarding psychological morbidity in general practice. Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen. 2019;147:20–7. [DOI] [PubMed] [Google Scholar]
  • 27.Pilars de Pilar M, Abholz H-H, Becker N, Sielk M. How do general practitioners deal with patients they do not consider to be depressed but who are classified as such according the PHQ-9? Psychiatrische Praxis. 2011;39(2):71–8. [DOI] [PubMed] [Google Scholar]
  • 28.Levis B, Sun Y, He C, Wu Y, Krishnan A, Bhandari PM, et al. Accuracy of the PHQ-2 Alone and in Combination With the PHQ-9 for Screening to Detect Major Depression: Systematic Review and Meta-analysis. JAMA. 2020;323(22):2290–300. doi: 10.1001/jama.2020.6504 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Trivedi MH. Major Depressive Disorder in Primary Care: Strategies for Identification. The Journal of clinical psychiatry. 2020;81(2). doi: 10.4088/JCP.UT17042BR1C [DOI] [PubMed] [Google Scholar]
  • 30.Thombs BD, Kwakkenbos L, Levis AW, Benedetti A. Addressing overestimation of the prevalence of depression based on self-report screening questionnaires. Canadian Medical Association Journal. 2018;190(2):E44–E9. doi: 10.1503/cmaj.170691 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Gilbody S, Richards D, Brealey S, Hewitt C. Screening for depression in medical settings with the Patient Health Questionnaire (PHQ): a diagnostic meta-analysis. Journal of general internal medicine. 2007;22(11):1596–602. doi: 10.1007/s11606-007-0333-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Spangenberg L, Forkmann T, Braehler E, Glaesmer H. The association of depression and multimorbidity in the elderly: implications for the assessment of depression. Psychogeriatrics. 2011;11(4):227–34. doi: 10.1111/j.1479-8301.2011.00375.x [DOI] [PubMed] [Google Scholar]
  • 33.Levis B, Benedetti A, Thombs BD. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: individual participant data meta-analysis. BMJ. 2019;365:l1476. doi: 10.1136/bmj.l1476 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Joffres M, Jaramillo A, Dickinson J, Lewin G, Pottie K, Shaw E, et al. Recommendations on screening for depression in adults. Cmaj. 2013;185(9):775–82. doi: 10.1503/cmaj.130403 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Mitchell AJ, Yadegarfar M, Gill J, Stubbs B. Case finding and screening clinical utility of the Patient Health Questionnaire (PHQ-9 and PHQ-2) for depression in primary care: a diagnostic meta-analysis of 40 studies. BJPsych open. 2016;2(2):127–38. doi: 10.1192/bjpo.bp.115.001685 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Thombs BD, Coyne JC, Cuijpers P, de Jonge P, Gilbody S, Ioannidis JP, et al. Rethinking recommendations for screening for depression in primary care. Cmaj. 2012;184(4):413–8. doi: 10.1503/cmaj.111035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.National Collaborating Center for Mental Health The NICE guideline on the management and treatment of depression in adults (updated edition). London (UK): National Institute for Health and Clinical Excellence; 2010. [Google Scholar]
  • 38.Siu AL, Force atUPST. Screening for Depression in Adults: US Preventive Services Task Force Recommendation Statement. JAMA. 2016;315(4):380–7. [DOI] [PubMed] [Google Scholar]
  • 39.Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001. Sep;16(9):606–13. doi: 10.1046/j.1525-1497.2001.016009606.x ; PMCID: PMC1495268. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Costantini L, Pasquarella C, Odone A, Colucci ME, Costanza A, Serafini G, et al. Screening for depression in primary care with Patient Health Questionnaire-9 (PHQ-9): A systematic review. Journal of Affective Disorders. 2021;279:473–83. doi: 10.1016/j.jad.2020.09.131 [DOI] [PubMed] [Google Scholar]
  • 41.Verhoeven J, Verduijn J, Milaneschi Y, Beekman A, Penninx B. The Clinical Course of Depression: Chronicity is the Rule Rather than the Exception. European Psychiatry. 2017;41(S1):S144–S5. [Google Scholar]
  • 42.Ten Have M, de Graaf R, van Dorsselaer S, Tuithof M, Kleinjan M, Penninx B. Recurrence and chronicity of major depressive disorder and their risk indicators in a population cohort. Acta psychiatrica Scandinavica. 2018;137(6):503–15. doi: 10.1111/acps.12874 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Joseph Donlan

27 Jun 2022

PONE-D-21-39064

Combining the GP’s assessment and the PHQ-9 questionnaire leads to more reliable and clinically relevant diagnoses in primary care

PLOS ONE

Dear Dr. Teusen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Your manuscript has been assessed by two expert reviewers, whose comments are appended below and in the attached document. The reviewers have highlighted concerns about several aspects of the methodology and study design, among other issues. Please ensure you respond to each point carefully in your response to reviewers document, and modify your manuscript accordingly.

Please submit your revised manuscript by Aug 09 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Joseph Donlan

Editorial Office

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I have attached my detailed comments in the Reviewer comment documents. They are also itemized there. The paper is a useful contribution to the literature and my comments are addressable in a revised manuscript.

Reviewer #2: Most psychiatric care is delivered in primary care settings, where depression is the most common presenting psychiatric symptom. Given the high prevalence of depression worldwide and the well-established consequences of untreated depression, the ability of primary care clinicians to effectively diagnose and treat it is critically important. The systematic use of validated screening tools can improve recognition and diagnosis (J Clin Psychiatry. 2020 17;81(2):UT17042BR1C.). Clinical depression diagnosis by GPs was not always associated with a formal diagnosis through a SCID (Fam Pract. 2019 Jan 25;36(1):3-11.)


In the background:

1. The authors mentioned that the prevalence of depression in general practice is low. In fact, previous studies in various countries have shown that the prevalence of depression in primary care is about 5-10%. The authors may state the findings of previous studies rather than describing it as a "low-prevalence setting" (page 3).

2. A depressive disorder should be diagnosed using a 2-week reference period, which is the gold standard of the ICD-10 or DSM-5 diagnostic criteria. One may argue that the time criterion is too short to confirm a depression diagnosis; however, most studies found that the ICD-10 criteria for depression seem to be appropriate and valid in general practice (Journal of Affective Disorders 2001, 65(2):191-4; J Affect Disord. 2013 Jun;148(2-3):338-46).

3. I cannot agree with the hypothesis that a 3-month duration of symptoms as determined with repeated PHQ-9 measurement might be an indication for a “real” depression. This seems to imply that a patient in primary care needs to suffer for 3 months’ watchful waiting to confirm a diagnosis.

In the method:

1. Patients were not asked if they received any therapy (counseling, psychotherapy, medication) during the three months, which would have a significant impact on the interpretation of results.

2. The lack of psychiatric diagnostic interviews to confirm the diagnosis is a significant weakness. A gold standard (structured interview, semi-structured interview, or a diagnosis confirmed by a psychiatrist) should be used to determine whether the GP’s rating under- or over-diagnoses depression. The results of a screening scale may also suffer from false negatives and false positives.

3. The results of the PHQ-9 were based on scores (10-14; 15-19; 20-27) instead of using the German version of the PHQ-9 cut-point or the gold standard for diagnostic interviews. According to a meta-analysis in 2021 (J Affect Disord. 2021 Jan 15;279:473-483.), the accuracy of the PHQ-9 was evaluated in 31 (74%) studies with a two-stage screening system, with structured interview most often carried out by primary care and mental health professionals. Most of the studies employed a cut-off score of 10 (N=24, 57%; total range 5-15). The authors may cite this paper to support why they used PHQ-9 scores (10-14; 15-19; 20-27) instead of using the German version of the PHQ-9 cut-point. Also, they need to consider the false negatives and false positives.

In the discussion: Patients who were not rated as depressed by their GP had fewer PHQ-positive results at T1 and T2. In fact, among the 20 patients (36%) who were PHQ-9 positive at T1 in the discordant group (n=55), a significant proportion (40%) received a positive result again at T2, which implies a tendency towards chronicity.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: 04 PHQ9 vs. GP dep, PLoS (Feb).docx

PLoS One. 2022 Oct 21;17(10):e0276534. doi: 10.1371/journal.pone.0276534.r002

Author response to Decision Letter 0


9 Aug 2022

Dear Editor, dear Reviewers,

Thank you for giving us the opportunity to submit a revised version of our manuscript. We are grateful for the reviewers' thoughtful comments, which have helped to improve the manuscript. We have incorporated almost all the comments in the revised version and included a point-by-point response to the reviewers’ comments.

We have adapted the text/font of the Bland-Altman plot (Fig 2) and of the scatter plot (Fig 3) and uploaded both in a modified form. We hope that the new form of the figures is more production-ready and meets the requirements of the reviewers and the journal. I would kindly ask you to contact me if anything is missing or needs to be edited.

COMMENTS

Reviewer #1

This is a useful study examining concordance between a depression measure (PHQ-9) and general practitioners’ (GP) ratings of depression. Strengths include sample size (n=364), longitudinal assessment at two time points (baseline and 3 month follow-up) and generally appropriate analyses.

I have attached my detailed comments in the Reviewer comment documents. They are also itemized there. The paper is a useful contribution to the literature and my comments are addressable in a revised manuscript.

Thank you for the compliment and for highlighting the strengths of our study. We tried to address the reviewer’s helpful comments in our revised manuscript.

1. The greater reliability of the concordant group could be partly due to the possibility that those with GP depression at t1 had higher PHQ-9 scores and those with GP no depression had lower PHQ-9 scores, compared to the discordant group. Thus, if the discordant group had PHQ-9 scores closer to just above or just below 10 at t1, they would be more likely to drift above or below 10 at t2. This could be examined by showing mean PHQ-9 scores in Table 2. An example is shown below.

Thank you very much for your valuable comment. To examine whether the higher reliability in the concordant group was partly due to higher PHQ-9 scores when GPs indicated depression and lower PHQ-9 scores when GPs indicated no depression, compared to the discordant group, we have added the PHQ-9 mean values and standard deviations to Table 2, as you suggested. We also included the median and range, as we believe that these measures more adequately describe the distribution of PHQ-9 values and specifically their closeness to the cut-off value in the discordant subgroup. We now mention in the results section that the PHQ-9 means of the concordant subgroup are positioned more clearly above or below the cut-off:

“The PHQ-9 mean values were higher in the concordant subgroup compared to the discordant subgroup in case of PHQ-9 positives at t1. Likewise, the PHQ-9 mean values were lower in the concordant subgroup compared to the discordant subgroup in case of PHQ-9 negatives at t1, suggesting that PHQ-9 means are more clearly positioned above or below the cut-off value (≥10) in the concordant subgroup”.

Furthermore, we highlighted in the discussion that GPs are more likely to identify a more severe depression indicated by higher PHQ-9 results and are more likely to exclude a depression diagnosis correctly if the PHQ-9 result is low. However, in the discordant subgroup, there are no such clear-cut cases, as the PHQ-9 values cluster closely around the cut-off:

“In the discordant subgroup, patients' PHQ-9 scores were more often just above or just below the cut-off at t1, so they were more likely to drift above or below the cut-off at t2. Recognizing stable and clear cases with pronounced depressive symptoms or without symptoms seems to be easier for GPs, while diagnosing patients with subthreshold symptoms is a major challenge. GPs are more likely to identify a more severe depression indicated by higher PHQ-9 results and are more likely to exclude a depression diagnosis correctly if the PHQ-9 result is low”.

2. Table 1.

a. Define what PHQ-9 Cutoff value is, either in the column heading itself or a footnote.

b. Except for the Total column, the percentages represent row rather than column percentages. This is different from what published tables usually show and does not allow direct comparison of the “Yes” and “No” groups. For example, the proportions of women who were Yes and No for depression rating by the GP were 55/85 and 142/271, or 64.7% and 52.4%, respectively. These latter numbers directly show the proportional difference in women between the depressed and nondepressed groups. The current row percentages (27.1% and 70.1%) do not show this. Thus, the authors should change the percentages in the table to reflect column rather than row percentages.

a. Thank you very much for this important hint. We added the PHQ-9 cut-off value ≥ 10 in the column heading of Table 1.

b. This suggestion is very helpful. We changed the percentages in Table 1 to reflect column rather than row percentages. This makes it easier to identify the proportional differences and allows direct comparison of the “Yes” and “No” groups.
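
The column percentages quoted in the reviewer's comment can be reproduced with a short calculation. The following is a minimal Python sketch; the function name `column_percentages` and the dictionary layout are illustrative assumptions, and only the four counts (55, 85, 142, 271) are taken from the comment above.

```python
def column_percentages(counts, column_totals):
    """Express each count as a percentage of its own column total (1 decimal)."""
    return {col: round(100 * counts[col] / column_totals[col], 1) for col in counts}


# Counts quoted in the reviewer's comment: number of women among patients
# rated depressed ("Yes") and not depressed ("No") by the GP.
women = {"Yes": 55, "No": 142}
# Size of each GP-rating group (the column totals).
group_sizes = {"Yes": 85, "No": 271}

women_pct = column_percentages(women, group_sizes)
print(women_pct)  # {'Yes': 64.7, 'No': 52.4}
```

Dividing by the column total answers the question "what share of each GP-rating group is female?", which is what allows the “Yes” and “No” groups to be compared directly.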

3. Were GPs blinded to PHQ-9 when they made their assessment of depression? If not (or if we don’t know) this should be added as a study limitation in the Discussion, since knowledge of the PHQ-9 could have influenced their assessment of depression (i.e., the two methods would not be entirely independent).

Thank you for raising this point. Yes, the GPs were blinded and did not know about the PHQ-9 result of the patient they were supposed to assess for depression. We now added an explanatory sentence in the methods section:

“GPs were blinded to the PHQ-9 result of the patients they were supposed to assess for depression”.

4. Lines 81-85 are overstated:

a. The sentence: “Thus, a PHQ-9 positive result can only give a hint towards a possible depression and needs to be confirmed by a structured diagnostic interview.” There is a large amount of data supporting the construct validity of scores of 10 or greater representing clinically significant depressive symptoms. Sometimes too much is made of a “major depressive disorder diagnosis”. I might rephrase the sentence to something like: “Whereas a PHQ-9 score ≥10 may represent clinically significant depressive symptoms, a structured diagnostic interview is needed to confirm the presence of major depressive disorder.”

b. The sentence: “Therefore, the standardised and legitimised use of screening questionnaires in primary care has not yet been established.” This is an overstatement. Canadian and some European guidelines are less enthusiastic about depression screening than US guidelines and this nuance should be reflected rather than just stating a “legitimized use of screening questionnaires in primary care has not yet been established.”

a. Thank you for your suggestion. We have taken your suggestion into account and changed the sentence to:

“Whereas a PHQ-9 positive result with a PHQ-9 score ≥10 may represent clinically significant depressive symptoms, a structured diagnostic interview is needed to confirm the presence of major depressive disorder”.

b. Thank you for the important advice. We agree that the sentence was an overstatement and that a more detailed reflection of different approaches seems more appropriate. Now we write:

“Therefore, there are conflicting opinions about the recommendation of routinely screening for depression in primary care [36]. The Canadian Task Force on Preventive Health Care [37] and the guideline for depression management from the United Kingdom’s National Institute for Health and Clinical Excellence [34] do not recommend routinely screening for depression in primary care settings whereas the US Preventive Services Task Force recommends a universal screening approach for depression in the general adult population [38]”.

5. Line 123 states: “A score of 10 thus indicates the presence of depression [32].” This is too simplistic. Instead, a score of 10 or greater represents a moderate level of depressive symptoms. The way the authors state it sounds like a “depression diagnosis.” Also, the reference for a cutpoint of 10 is not reference 32 (Spitzer 1999) but instead the Kroenke et al 2001 reference on the PHQ-9 in J Gen Intern Med.

We agree that the sentence is too simplistic and the diagnosis of depression should be treated with great care. We have corrected the sentence and inserted the correct reference - thank you for pointing this out. In fact, in this context, it is important to talk about the level of depressive symptoms instead of talking about a depression diagnosis. Now we write:

“Findings from previous studies show that the use of a cut-off value of 10 or higher is considered useful [40] as a score of 10 represents at least a moderate level of depressive symptoms [39]”.

6. Line 145 – A few sentences with more detail describing the “training intervention” would be helpful since it could have affected GP diagnosis rates.

The results of the previous cluster randomized controlled pilot study show that there were no differences between the intervention and the control group. To clarify the content of the training intervention we added the following sentences in the methods section:

“The GPs in the intervention group received a one-day training on diagnostics and interviewing as well as on recognising and dealing with psychosomatic patients. The training included expert lectures (on depression, anxiety and somatization), group discussions and acting out psychosomatic counselling situations with an acting patient. However, a one-day training intervention alone did not seem to improve the perception and management of psychosomatic illness [26]”.

7. Lines 183-185 – Could the authors explain “Limits of Agreement” a little better? Is this similar to or different from a 95% CI? The interpretation of Bland-Altman graphs will be unfamiliar to many readers.

The limits of agreement (LoA) form a 95% prediction interval: they describe the distribution of the individual values and cover about 95% of them. By contrast, a 95% confidence interval is constructed to cover an unknown population parameter, e.g. the expectation µ estimated by the sample mean, with a probability of 95%. The idea behind the LoA is that we are interested in the agreement between the measurements at t1 and t2 for most (95%) of the patients. A 95% CI only informs us about the precision of the estimate of the mean value, which is not informative if we want to assess agreement of paired values on an individual level. To make this point more understandable for the reader, we have added the following sentence to the manuscript:

“The LoA are 95% prediction intervals describing the range in which the majority of the individual differences between the PHQ-9 measurements at t1 and t2 are expected to lie, as they cover about 95% of these values”.
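
The distinction between the limits of agreement and a confidence interval for the mean difference can be illustrated with a small calculation. This is a minimal Python sketch under stated assumptions: the paired scores are made-up values (not study data), the function name `bland_altman_summary` is hypothetical, and both intervals use the normal quantile 1.96, which is only an approximation for small samples.

```python
import math
import statistics


def bland_altman_summary(t1, t2):
    """Summarise paired measurements: mean difference, LoA and 95% CI of the mean."""
    diffs = [b - a for a, b in zip(t1, t2)]  # individual differences t2 - t1
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation

    # Limits of agreement: range expected to contain about 95% of the
    # INDIVIDUAL differences (normal approximation).
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

    # 95% CI: precision of the estimated MEAN difference only
    # (normal approximation; a t-quantile would be more exact for small n).
    se = sd_diff / math.sqrt(n)
    ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)

    return {"n": n, "mean_diff": mean_diff,
            "limits_of_agreement": loa, "ci_mean_diff": ci}


# Hypothetical PHQ-9 scores of ten patients at t1 and t2 (illustration only).
phq9_t1 = [12, 4, 15, 9, 7, 18, 3, 11, 6, 10]
phq9_t2 = [10, 5, 13, 11, 6, 16, 4, 9, 8, 9]

summary = bland_altman_summary(phq9_t1, phq9_t2)
print(summary)
```

In this sketch the limits of agreement are wider than the confidence interval by a factor of √n. The confidence interval therefore narrows as more patients are included, while the limits of agreement stay roughly stable because they reflect the spread of individual differences.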

8. Minor points

a. Line 54. “Subliminal” should probably be “subthreshold”

b. Bland-Altman plots text/font, along axes, is difficult to read, and a more production-ready version of these graphs should be provided.

a. We agree. Thank you for the perceptive consideration. We corrected this.

b. Thank you for this comment. We have adapted the Bland-Altman plots and hope that they are now more in line with the expected standards.  

Reviewer #2

Most psychiatric care is delivered in primary care settings, where depression is the most common presenting psychiatric symptom. Given the high prevalence of depression worldwide and the well-established consequences of untreated depression, the ability of primary care clinicians to effectively diagnose and treat it is critically important. The systematic use of validated screening tools can improve recognition and diagnosis (J Clin Psychiatry. 2020 17;81(2):UT17042BR1C.). Clinical depression diagnosis by GPs was not always associated with a formal diagnosis through a SCID (Fam Pract. 2019 Jan 25;36(1):3-11.)

We agree that GPs play a very important role in the diagnosis and treatment of depression. In order to improve diagnostic decision making in primary care, it should be investigated whether the GP's assessment in addition to a screening questionnaire leads to better diagnostic results. If the additional GP assessment is useful, GP heuristics could be identified and included in new screening tools adapted to the primary care setting. We have added a sentence to the introduction that makes it even clearer that screening tools can lead to an improvement in diagnostics in primary care:

“It has been shown that the systematic use of validated screening tools can improve detection and diagnosis of depression in primary care [29]”.

In the background:

1. The authors mentioned that the prevalence of depression in general practice is low. In fact, previous studies in various countries have shown that the prevalence of depression in primary care is about 5-10%. The authors may state the findings of previous studies rather than describing it as a "low-prevalence setting" (page 3).

Thank you very much for this valuable advice. It is true that depression is relatively prevalent in primary care compared to other diseases, at 5-10%. However, compared to the inpatient setting, this rate seems to be relatively low. With our sentence about the low-prevalence setting, we wanted to make a comparison with the inpatient setting. This comparison was possibly misleading and not clearly expressed. For this reason, we deleted the part about the low-prevalence setting and reformulated the sentence:

“Furthermore, compared to specialists in mental health care, the work of GPs takes place in a setting with a high risk of both over and under diagnosis of depression due to the presence of multimorbidity [3]. Even though about 10% of primary care patients are likely to meet criteria for major depression, detection and treatment rates are still low [11]”.

2. A depressive disorder should be diagnosed using a 2-week reference period, which is the gold standard of the ICD-10 or DSM-5 diagnostic criteria. One may argue that this time criterion is too short to confirm a depression diagnosis; however, most studies found that the ICD-10 criteria for depression seem to be appropriate and valid in general practice (Journal of Affective Disorders 2001, 65(2):191-4; J Affect Disord. 2013 Jun;148(2-3):338-46).

We have included the fact that several studies present the 2-week reference period as sufficient. We believe that the additional consideration of GP heuristics can significantly improve the diagnostic process. These heuristics sometimes include watchful waiting, so that the 2-week criterion is exceeded. However, we added a sentence to point out that previous studies have already shown that the gold standard of a 2-week reference period also works in primary care:

“Other studies found that the 2-week reference period for depression is appropriate and valid in general practice [19, 20]. However, in order to initiate even better diagnosis for depression in primary care, it might be helpful to take into account GP heuristics, their diagnostic strategies and thought processes in addition to existing psychiatric diagnostic criteria”.

3. I cannot agree with the hypothesis that a 3-month duration of symptoms as determined with repeated PHQ-9 measurement might be an indication for a “real” depression. This seems to imply that a patient in primary care needs to suffer for 3 months’ watchful waiting to confirm a diagnosis.

We agree, the implication that a patient in primary care needs to suffer for 3 months’ watchful waiting to confirm a diagnosis is not correct. We would like to apologize for the slightly misleading presentation of our hypothesis. With the sentence we wanted to express that patients with a longer duration of symptoms are more likely to suffer from major depression which is more easily detected by a GP. However, this does not rule out the possibility that patients who have only had symptoms for two weeks can also suffer from major depression. To clarify any ambiguities, we have deleted the sentence in the last paragraph of the introduction section.

In the method:

4. Patients were not asked if they received any therapy (counseling? psychotherapy? medication?) during the three months, which would have a significant impact on the interpretation of results.

Thank you for this comment. You are right, this is a significant limitation of our study. This point was already included in the limitations of our study. We have now emphasized it even more:

“This could have an impact on the interpretation of the results as we do not know what happened in the meantime between t1 and t2. Future studies should address this point to analyze the impact of such factors.”

5. No psychiatric diagnostic interviews to confirm the diagnosis is a significant weakness. Gold standard (structured interview, or semi-structured interview, or a diagnosis confirmed by psychiatrist) should be used to confirm whether GP’s rating depression is under or over-diagnosed. The results of screening scale may also have problems of false negatives and false positives.

We agree with this point. To confirm the GP’s depression rating or the PHQ-9 result, a diagnostic interview should have been performed. We mentioned this point in our limitations section. Due to limited resources we were not able to perform a standardized interview. However, we believe that the combination of the PHQ-9 and the GP rating reduces the proportion of false-positives and false-negatives. Future studies should compare the results with a reference standard. Based on your comment, we now emphasize in the limitations section that false-negatives and false-positives are difficult to determine without a gold standard:

“As it is difficult to determine the accurate proportion of false-positives and false-negatives without a reference standard, further studies need to investigate the accuracy of GP diagnoses in combination with screening questionnaires compared to standardized diagnostic interviews as a reference standard”.

6. The results of the PHQ-9 were based on score ranges (10-14; 15-19; 20-27) instead of using the German version of the PHQ-9 cut-point or the gold standard for diagnostic interviews. According to a meta-analysis in 2021 (J Affect Disord. 2021 Jan 15;279:473-483.), the accuracy of the PHQ-9 was evaluated in 31 (74%) studies with a two-stage screening system, with structured interviews most often carried out by primary care and mental health professionals. Most of the studies employed a cut-off score of 10 (N=24, 57%; total range 5-15). The authors may cite this paper to support why they used PHQ-9 score ranges (10-14; 15-19; 20-27) instead of using the German version of the PHQ-9 cut-point. Also, they need to consider the false negatives and false positives.

We are sorry for our misleading presentation. The point you raised is correct, and we agree that a cut-off score of 10 is the ideal basis for categorization; we also used it in our study. To avoid confusion for the reader, we deleted “In patients with major depressive symptoms, a score of 10 and higher can be expected, with moderate (10-14), distinct (15-19) and most severe (20-27) levels of the disorder.” In addition, we now explain the rationale for the cut-off ≥10 in more detail. Now we write:

“The depression severity score comprises nine items which can be summarized, with a range from zero (no depression) to 27 (maximum). Findings from previous studies show that the use of a cut-off value of 10 or higher is considered useful [40] as a score of 10 represents at least a moderate level of depressive symptoms [39]. A score between five and 10 is mostly found in patients with mild or subthreshold depressive symptoms and corresponds to a mild degree of severity [39]”.

We adapted the methods section to explain that analysis by scores was used as an additional analysis to the main analysis, in which we focused on the cut-off value when comparing the discordant and concordant subgroups. However, we used the PHQ-9 scores for further analysis and to examine the reliability of the results.

“For this purpose, a PHQ-9 ≥10 was used to indicate a self-rated depression. This outcome was labelled as PHQ-9 positive, PHQ-9 <10 was labelled as PHQ-9 negative. Two groups with concordant versus discordant PHQ-9 and GP ratings at t1 were defined. The replicability of PHQ-9 results between t1 and t2 was compared between the concordant and discordant group by a 2 x 2 table. Mean values were presented additionally to enable a comparison of the dimensional PHQ-9 results. Furthermore, reliability of the PHQ-9 test results at t1 and t2 was assessed within these groups and within the entire sample by Cohen’s Kappa, Pearson’s correlation coefficient and Bland-Altman plots”.
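The grouping and the agreement statistic described in this passage can be sketched as follows. This is a minimal illustration, assuming binary PHQ-9 results derived from the ≥10 cut-off and a yes/no GP rating; the function names and the example scores are hypothetical, and the study's actual analysis software is not stated here.

```python
def phq9_positive(score, cutoff=10):
    """PHQ-9 positive if the sum score reaches the cut-off (>= 10)."""
    return score >= cutoff


def is_concordant(phq9_t1, gp_rates_depressed):
    """Concordant if the PHQ-9 result at t1 matches the GP's yes/no rating."""
    return phq9_positive(phq9_t1) == gp_rates_depressed


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of binary ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_a = sum(ratings_a) / n  # share rated positive in the first sequence
    p_b = sum(ratings_b) / n  # share rated positive in the second sequence
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Illustrative, made-up scores (not study data).
t1 = [12, 4, 15, 8, 11, 3]
t2 = [11, 5, 9, 7, 13, 2]
kappa = cohens_kappa([phq9_positive(s) for s in t1],
                     [phq9_positive(s) for s in t2])
```

In such a sketch, kappa would be computed separately for the patients flagged by `is_concordant` and for the remaining patients, mirroring the subgroup comparison described above.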

In the absence of a structured interview, it is hardly possible to take false-positives and false-negatives into account in our analyses. We do, however, discuss this limitation of the present study in the limitations section. In our discussion we added the point that the false-positive rate of the PHQ-9 is around 60% and discussed it in the context of our results:

“Studies have shown that the false-positive rate of the PHQ-9 is around 60% in a population with a 10% prevalence of depression so that false-positives and false-negatives have to be examined carefully [30]. Overall we found that 46.7% of the patients with a PHQ-9 positive result at t1 had a negative result at t2 in the present study. This may indicate a significant overestimation of PHQ-9 positives at t1. In the discordant subgroup, this was even more expressed (60.0%) than in the concordant subgroup (37.0%). Therefore, agreement in the GP and PHQ-9 assessments may be associated with a decreased likelihood of false-positives.”
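A short worked calculation shows how a figure of this size can arise at 10% prevalence. It reads the "false-positive rate" as the share of positive screens that are false (one minus the positive predictive value), which is an interpretation rather than something stated in the text. The sensitivity and specificity values are illustrative assumptions and are not taken from reference [30].

```python
def false_positive_share(prevalence, sensitivity, specificity):
    """Share of positive screening results that are false (1 - PPV)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return false_positives / (true_positives + false_positives)


# Assumed, illustrative test characteristics (not from the source).
share = false_positive_share(prevalence=0.10,
                             sensitivity=0.88,
                             specificity=0.85)
```

With these assumed values, roughly six in ten positive screens would be false, even though the test itself performs reasonably well, because non-depressed patients greatly outnumber depressed ones.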

In the discussion:

7. Patients who were not rated as depressed by their GP had fewer PHQ-9 positive results at T1 and T2. In fact, among 20 patients (36%) with a positive PHQ-9 at T1 in the discordant group (n=55), a significant proportion (40%) received a positive result at T2 again, which implies a tendency towards chronicity.

Thank you for raising this interesting point; we have included this aspect in the discussion. The tendency of depression towards chronicity has been shown in several studies. However, we see this tendency even more strongly in the concordant subgroup (63.0%), which is why the GP assessment needs to be standardized and evaluated and then consistently taken into account in the diagnostic process. Considering the GP assessment could lead to a better diagnostic process and treatment for patients with severe depression and a high risk of chronicity. Now we write:

“Furthermore, it has to be taken into account that among 20 patients with PHQ-9 positive results at t1 in the discordant subgroup, a significant proportion (40.0%) of patients received a positive result at t2 again, which implies the general tendency of chronicity [41, 42]. Nevertheless, there is a remarkable difference to the concordant subgroup where chronicity rates are higher (63.0%). This might indicate that a combination of GP heuristics and screening questionnaire could improve the diagnostic process of patients with severe depression and a high risk of chronicity, which needs to be investigated in further studies.”

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Pedro Vieira da Silva Magalhaes

10 Oct 2022

Combining the GP’s assessment and the PHQ-9 questionnaire leads to more reliable and clinically relevant diagnoses in primary care

PONE-D-21-39064R1

Dear Dr. Teusen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Pedro Vieira da Silva Magalhaes, M.D., Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #3: Interesting study with robust methodology, design and results; inserting the Bland-Altman plot was very useful.

Congratulations.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

**********

Acceptance letter

Pedro Vieira da Silva Magalhaes

14 Oct 2022

PONE-D-21-39064R1

Combining the GP’s assessment and the PHQ-9 questionnaire leads to more reliable and clinically relevant diagnoses in primary care

Dear Dr. Teusen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Pedro Vieira da Silva Magalhaes

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    Attachment

    Submitted filename: 04 PHQ9 vs. GP dep, PLoS (Feb).docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Our data contain potentially identifying or sensitive patient information. Therefore, the Medical Ethics Committee of the Technical University Munich has restricted data access. The data are held by the Institute of General Practice and Health Services Research of the Technical University Munich. The data are not publicly available due to data protection regulations, but may be obtained from the Institute of General Medicine and Health Services Research of the Technical University Munich by researchers who meet the criteria for access to confidential data. Interested researchers can contact the data protection officer of the Technical University Munich if they wish to access our data (e-mail: beauftragter@datenschutz.tum.de). Alternatively, data requests may be sent to the Institute of General Medicine and Health Services Research of the Technical University Munich (e-mail: allgemeinmedizin@mri.tum.de).


    Articles from PLoS ONE are provided here courtesy of PLOS
