Abstract
Depression screening tools are useful to the extent that they accurately discriminate between depressed and non‐depressed patients. Studies with too few patients to generate precise estimates make accuracy difficult to evaluate. We conducted a survey of recently published studies on depression screening tool accuracy to evaluate the percentage with sample size calculations; the percentage that provided confidence intervals; and precision, based on the width and lower bounds of 95% confidence intervals for sensitivity and specificity. When confidence intervals were not provided, we calculated them where possible. Only three of 89 studies (3%) described a viable sample size calculation. Only 30 studies (34%) provided reasonably accurate confidence intervals. Of 86 studies where 95% confidence intervals were provided or could be calculated, only seven (8%) had interval widths for sensitivity of ≤ 10%, whereas 53 (62%) had widths of ≥ 21%. Lower bounds of confidence intervals were < 80% for 84% of studies for sensitivity and 66% of studies for specificity. Overall, few studies on the diagnostic accuracy of depression screening tools reported sample size calculations, and the number of patients in most studies was too small to generate reasonably precise accuracy estimates. The failure to provide confidence intervals in published reports may obscure these shortcomings.
Keywords: diagnostic test accuracy, sample size, depression
Introduction
Most depression care is provided outside of psychiatric settings (Meng et al., 2013), and screening has been proposed to improve depression management in primary (Thombs and Ziegelstein, 2014; US Preventive Services Task Force, 2009) and specialty care (Canadian Diabetes Association, 2013; Colquhoun et al., 2013; Eskes et al., 2015; Holland et al., 2013; Lichtman et al., 2008; National Comprehensive Cancer Network, 2008; National Institute for Clinical Excellence, 2004). Depression screening, however, is controversial, and guidelines on screening vary substantially (Gilbody et al., 2006; Palmer and Coyne, 2003; Thombs et al., 2012; Thombs and Ziegelstein, 2014).
The US Preventive Services Task Force recommends depression screening in primary care when adequate care supports are available (US Preventive Services Task Force, 2009). In the United States, depression screening is required for accreditation for many health care providers (NCQA, 2011) and is covered in public health care plans (Centers for Medicare and Medicaid Services, 2010). In the UK, the National Institute for Health and Care Excellence (National Collaborating Center for Mental Health, 2010) and the National Screening Committee (Allaby, 2010) recommend against depression screening, and Quality and Outcome Framework incentives for depression screening in place from 2006 to 2013 were discontinued due to disappointing results (Burton et al., 2013). The Canadian Task Force on Preventive Health Care similarly recommends against depression screening and has raised concerns about the quality of research on depression screening tool accuracy (Joffres et al., 2013).
If depression screening is to improve upon usual care, depression screening tools must accurately identify patients with depression who have not otherwise been identified and must effectively screen out patients who do not have depression (Joffres et al., 2013; Thombs et al., 2011; Thombs et al., 2012). Screening tool accuracy has important implications for implementation. Even with a tool that is 90% specific, for instance, 10% of non‐depressed patients would require a mental health evaluation to rule out depression. Depression screening tool accuracy is evaluated by comparing patients above or below a cutoff threshold on a symptom questionnaire to case status based on a reference standard diagnostic interview. Estimates of sensitivity and specificity based on small numbers of patients with or without depression, however, do not generate sufficiently precise estimates to evaluate the accuracy of the tool for clinical practice (Bachmann et al., 2006). As is standard for clinical trials, prior to collecting data, investigators should calculate the sample size needed to generate sufficiently narrow confidence intervals for clinical decision‐making. For example, if at least 90% sensitivity and specificity are deemed necessary, the lower bound of the 95% confidence interval should be at least 90% (Bachmann et al., 2006).
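As a rough illustration of what such a calculation involves, the short Python sketch below uses the standard normal-approximation (Wald) formula for a binomial proportion; the expected sensitivity of 0.95, the 0.05 target half-width, and the 10% prevalence are illustrative assumptions rather than values from any study discussed here.

```python
from math import ceil

def cases_needed(p_expected: float, half_width: float, z: float = 1.96) -> int:
    """Required number of cases so the 95% CI for a proportion has the
    desired half-width: n = z^2 * p * (1 - p) / d^2 (Wald approximation)."""
    return ceil(z ** 2 * p_expected * (1 - p_expected) / half_width ** 2)

# Illustrative assumptions: expected sensitivity 0.95, with the CI lower
# bound required to stay at or above 0.90 (half-width <= 0.05).
n_cases = cases_needed(0.95, 0.05)   # -> 73 depression cases
# Sensitivity is estimated from cases only, so the total sample size
# depends on prevalence; at an assumed 10% prevalence:
n_total = ceil(n_cases / 0.10)       # -> 730 patients overall
print(n_cases, n_total)
```

Even under these favorable assumptions, roughly 73 depression cases, and at 10% prevalence about 730 patients overall, would be needed to keep the lower confidence bound near 90%.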
Studies that publish estimates from small samples without taking into account precision may lead to undue confidence in the robustness and clinical utility of results. Furthermore, data‐driven methods are often used to identify and report “optimal” cutoffs that maximize diagnostic accuracy in the context of small samples. When this is done with inadequate numbers of patients, cutoffs identified as “optimal” may vary dramatically across studies and depart substantially from cutoffs that would perform best in actual practice (Ewald, 2006; Leeflang et al., 2008; Rutjes et al., 2006; Whiting et al., 2004; Whiting et al., 2013).
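The instability of data-driven cutoffs is easy to demonstrate by simulation. The sketch below is purely illustrative: it assumes hypothetical screening scores drawn from N(0, 1) for non-cases and N(1.5, 1) for cases, and in each small simulated “study” selects the cutoff that maximizes Youden's J; the distributions, sample sizes, and selection criterion are all assumptions, not features of any study reviewed here.

```python
import numpy as np

rng = np.random.default_rng(42)

def optimal_cutoff(cases, controls, grid):
    """Return the grid cutoff maximizing Youden's J = sens + spec - 1."""
    best_c, best_j = grid[0], -1.0
    for c in grid:
        j = np.mean(cases >= c) + np.mean(controls < c) - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c

# Hypothetical score distributions: non-cases ~ N(0, 1), cases ~ N(1.5, 1);
# the large-sample optimal cutoff lies midway between the means, near 0.75.
grid = np.arange(-1.0, 3.0, 0.1)
cutoffs = [
    optimal_cutoff(rng.normal(1.5, 1.0, 20),   # small study: 20 cases
                   rng.normal(0.0, 1.0, 80),   # and 80 non-cases
                   grid)
    for _ in range(1000)
]
print(f"'Optimal' cutoffs span {min(cutoffs):.1f} to {max(cutoffs):.1f}")
```

Across replications, the selected cutoff scatters widely around the large-sample optimum (about 0.75 for these distributions), mirroring the cross-study variability in “optimal” cutoffs described above.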
The objectives of the present study were to survey recently published studies of depression screening tool accuracy in order to evaluate the (1) percentage that reported sample size calculations; (2) percentage that provided confidence intervals; (3) precision of sensitivity and specificity estimates; and (4) lower bounds of sensitivity and specificity confidence intervals. We expected to find that studies would rarely report sample size calculations; that few studies would provide confidence intervals; that precision of estimates would be poor, particularly for sensitivity, which is based on the number of depression cases; and that lower bounds of confidence intervals would be below 80% for most studies.
Methods
Survey of recently published primary studies
We searched MEDLINE (PubMed interface) on March 27, 2015 for recent studies, published January 1, 2013 or later, using the search terms (depress* AND sensitivity AND specificity), restricted to title or abstract. We included only recent studies in order to reflect current research practices. We searched only MEDLINE because our aim was to survey studies representative of current practices, not to comprehensively catalogue all studies that have been conducted. A recent study found that restricting searches for diagnostic test accuracy studies to MEDLINE alone did not miss studies that would have influenced meta‐analysis results (van Enst et al., 2014).
Eligible studies were published in any language and reported sensitivity and specificity estimates for one or more depression screening tools compared to a depression diagnosis based on a diagnostic interview. Studies were excluded if the reference standard was based on chart notes or a score above a threshold on another self‐report measure or rating scale. Studies that included only patients in mental health treatment were also excluded since screening is done to identify patients with unrecognized depression (Gilbody et al., 2006; Palmer and Coyne, 2003; Thombs et al., 2012; Thombs and Ziegelstein, 2014).
Citations were uploaded from PubMed into DistillerSR (Evidence Partners, Ottawa, Canada), which was used for tracking the review process and data extraction. Two investigators independently reviewed studies for eligibility. If either reviewer deemed a study potentially eligible based on title and abstract review, full text review was conducted. Any disagreements after full‐text review were resolved by consensus.
Data extraction
One investigator extracted data from each included study with independent validation by a second reviewer. For each study we extracted the screening tool(s) evaluated; reference standard; study population; number of patients and depression cases; reporting of an appropriate sample size calculation; and sensitivity and specificity estimates with 95% confidence intervals, if provided.
For publications with multiple screening tools or reference standards, we extracted data only for the first screening tool and reference standard combination listed in the abstract or article text. When results were reported for multiple cutoff thresholds, we extracted data for the cutoff prioritized by the authors as the “primary”, “standard” or “optimal” cutoff or, if not specified, for the first cutoff for which results were reported in the abstract or article text.
Data analysis
We evaluated the percentage of studies that mentioned a sample size calculation and the percentage that described a plausible precision‐based method to calculate sample size for sensitivity and specificity estimates. We evaluated the percentage of studies that provided confidence interval estimates for sensitivity and specificity; the percentage with 95% confidence interval widths for sensitivity and specificity of 0–5%, 6–10%, 11–20%, 21–30%, 31–40%, 41–50%, and > 50%; and the percentage with lower 95% confidence interval bounds below 80%, 80–84%, 85–89%, 90–94%, and ≥ 95%. If 95% confidence intervals were not provided, we calculated them from data provided in the publication, using the approximation method for interval estimation of binomial proportions recommended by Agresti and Coull (1998). If 95% confidence intervals were provided but were clearly erroneous because they departed substantially from plausible values, we recalculated them. We conducted separate sensitivity analyses that included only journals with impact factor ≥ 3 for the year of publication and only studies of the Patient Health Questionnaire (PHQ) (Kroenke et al., 2001; Spitzer et al., 1999), the most commonly used screening tool in included studies. For 2015 publications, we used the 2014 impact factor since 2015 impact factors were not yet published.
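For readers who wish to reproduce this step, the following minimal sketch implements the Agresti–Coull approximation; the 31-of-37 example is hypothetical, although 37 matches the median number of depression cases among included studies reported below.

```python
from math import sqrt

def agresti_coull_ci(successes: int, n: int, z: float = 1.96):
    """Approximate 95% CI for a binomial proportion (Agresti & Coull, 1998):
    add z^2 pseudo-observations, half of them successes, then use the
    Wald interval on the adjusted proportion."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# Hypothetical study: sensitivity 31/37 = 0.84, with 37 chosen to match
# the median number of depression cases among included studies.
lo, hi = agresti_coull_ci(31, 37)
print(f"95% CI {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")
```

For this hypothetical study, the interval runs from roughly 0.69 to 0.93, a width of about 24 percentage points, illustrating how imprecise a sensitivity estimate from a median-sized study can be.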
Results
Search results
The database search yielded 501 unique titles and abstracts. Of these, 374 were excluded after title and abstract review and 33 after full‐text review, leaving 89 eligible primary studies (Figure 1). Sample sizes ranged from 34 to 42,676 (median = 224) and depression cases from five to 3115 (median = 37). Most studies were from Europe (28%), Asia (24%) or North America (19%). The most common depression screening tools were the PHQ (Spitzer et al., 1999; Kroenke et al., 2001) (any version, 28 studies), Edinburgh Postnatal Depression Scale (Cox et al., 1987) (11 studies), Beck Depression Inventory or Beck Depression Inventory‐II (Beck and Steer, 1987; Beck et al., 1996) (eight studies), and Hospital Anxiety and Depression Scale (HADS) (Zigmond and Snaith, 1983) (seven studies). There were 34 studies (38%) from journals with impact factor ≥ 3. See Table S1 in Supporting Information for included study characteristics.
Figure 1. Flow diagram of selection of primary studies that evaluated the diagnostic accuracy of depression screening tools.
Sample size calculations
Only seven of 89 primary studies (8%) mentioned a sample size calculation, and only three (3%) described a plausible precision‐based method, including one of 34 in journals with impact factor ≥ 3 (3%) and one of 28 PHQ studies (4%). Of the other four studies, two stated that sample size had been calculated without describing any method; one described a method based on interrater reliability rather than diagnostic accuracy; and one correctly calculated the number of cases required but incorrectly interpreted this as the total number of patients (cases plus non‐cases) required. See Table S1.
Reporting of confidence intervals
Of the 89 primary studies, 31 (35%) reported 95% confidence intervals, but one study reported implausible intervals. Thus, 30 studies (34%) published reasonably accurate 95% confidence intervals, including 14 of 34 in journals with impact factor ≥ 3 (41%) and 12 of 28 studies (43%) on the PHQ. See Table S1.
Precision of confidence intervals
As shown in Table 1, among the 86 studies for which 95% confidence intervals were provided or could be calculated, only seven (8%) had widths ≤ 10% for sensitivity, whereas 53 (62%) had intervals at least 21% wide, and 20 (23%) had intervals at least 31% wide. Among 33 studies from journals with impact factor ≥ 3, there were two (6%) with widths ≤ 10%, 18 (55%) with widths ≥ 21% and seven (21%) with widths ≥ 31%. Among 26 PHQ studies, only two (8%) had widths ≤ 10%, whereas 13 (50%) had widths ≥ 21%.
Table 1.
Precision of sensitivity and specificity among 86 primary studies for which 95% confidence intervals were published or could be calculated
| Width of 95% confidence interval | All studies: Sensitivity, N (%) | All studies: Specificity, N (%) | Impact factor ≥ 3: Sensitivity, N (%) | Impact factor ≥ 3: Specificity, N (%) |
|---|---|---|---|---|
| 0–5% | 2 (2) | 9 (10) | 1 (3) | 2 (6) |
| 6–10% | 5 (6) | 30 (35) | 1 (3) | 16 (48) |
| 11–20% | 26 (30) | 41 (48) | 13 (39) | 14 (42) |
| 21–30% | 33 (38) | 5 (6) | 11 (33) | 1 (3) |
| 31–40% | 14 (16) | 0 (0) | 5 (15) | 0 (0) |
| 41–50% | 1 (1) | 1 (1) | 1 (3) | 0 (0) |
| >50% | 5 (6) | 0 (0) | 1 (3) | 0 (0) |
| Total | 86 (100) | 86 (100) | 33 (100) | 33 (100) |
For specificity, there were 39 studies (45%) with 95% confidence interval widths ≤ 10%. Six (7%) had widths 21% or greater. Among 34 studies in journals with impact factor ≥ 3, 18 (55%) had widths ≤ 10%, and one (3%) had width ≥ 21%. For the 26 PHQ studies, there were 16 (62%) with interval widths of 10% or less. None had widths ≥ 21%.
Lower bounds of confidence intervals
As shown in Table 2, the lower bound of 95% confidence intervals was < 80% for 84% of studies for sensitivity and 66% of studies for specificity. Only one study (1%) had a lower bound ≥ 90% for sensitivity and only five (6%) for specificity. Results were similar for studies published in journals with impact factor ≥ 3 (see Table 2) and for studies of the PHQ (lower bound below 80% = 85% of studies for sensitivity, 58% for specificity; lower bound ≥ 90% = 4% of studies for sensitivity and 12% for specificity).
Table 2.
Lower bounds of 95% confidence intervals among 86 primary studies for which 95% confidence intervals were published or could be calculated
| Lower bound of 95% confidence interval | All studies: Sensitivity, N (%) | All studies: Specificity, N (%) | Impact factor ≥ 3: Sensitivity, N (%) | Impact factor ≥ 3: Specificity, N (%) |
|---|---|---|---|---|
| <80% | 72 (84) | 57 (66) | 27 (82) | 20 (61) |
| 80–84% | 9 (10) | 14 (16) | 3 (9) | 8 (24) |
| 85–89% | 4 (5) | 10 (12) | 2 (6) | 3 (9) |
| 90–94% | 0 (0) | 3 (3) | 0 (0) | 2 (6) |
| ≥95% | 1 (1) | 2 (2) | 1 (3) | 0 (0) |
| Total | 86 (100) | 86 (100) | 33 (100) | 33 (100) |
Discussion
Among the 89 recently published studies on the diagnostic accuracy of depression screening tools that we surveyed, only seven (8%) mentioned a sample size calculation, and only three (3%) described a viable method for a precision‐based sample size calculation. Only 30 studies (34%) provided reasonably accurate confidence intervals for estimates of sensitivity and specificity. Precision was generally poor, particularly for sensitivity, which is based on the number of patients with depression. For sensitivity, only 8% of studies had 95% confidence intervals with widths of 10% or less, whereas 60% had intervals with widths between 21% and 65%. For specificity, which is based on the number of patients without depression, 45% of studies had 95% confidence intervals with widths of 10% or less, and only 7% had widths of more than 20%. Lower bounds of 95% confidence intervals were less than 80% for 84% of studies for sensitivity and 66% of studies for specificity. Results were similar when only studies published in journals with impact factors of at least three were evaluated and when only studies of the PHQ were considered. That studies in higher impact‐factor journals did no better than those in lower impact‐factor journals suggests that failure to pre‐specify sample size is pervasive in studies on the accuracy of depression screening tools, not just a facet of generally weaker methodology.
Basing estimates of accuracy on samples too small for this purpose can mislead users of research about the diagnostic accuracy of depression screening tools for clinical practice, generally, or for use in particular patient populations. Nonetheless, we found that most recently published studies do not include enough patients to generate reasonably precise estimates of diagnostic accuracy. Furthermore, most had lower bounds for 95% confidence intervals that were below 80% sensitivity and specificity. Low sensitivity would result in a high number of missed cases, whereas low specificity would lead to a high false positive rate and risk of overdiagnosis (Joffres et al., 2013).
In addition, the use of small samples to identify “optimal” cutoffs that maximize performance can lead to the generation of wide arrays of data‐driven “optimal” cutoffs across different studies. For example, a systematic review of depression screening tools in cancer (Meijer et al., 2011) identified nine studies with a median of only 17 depression cases per study that reported “optimal” cutoffs for the depression subscale of the HADS (Zigmond and Snaith, 1983), the most commonly used depression screening tool in cancer. “Optimal” cutoff scores ranged from five to 11 across studies, a range far too wide to be useful for clinical practice.
Often, “standard” cutoff thresholds for depression screening tools are initially set based on very small numbers of patients. In its recent draft guideline (US Preventive Services Task Force, 2015), the US Preventive Services Task Force recommended using the HADS and PHQ‐9 for screening adults for depression. The commonly used cutoff threshold of 11 for identifying “definite” cases of depression on the HADS, however, was selected based on a dataset with only 12 cases, and the HADS cutoff of eight for “probable” cases was based on only 22 cases (Zigmond and Snaith, 1983). Similarly, the “standard” cutoff score of 10 for the PHQ‐9 was based on a study with only 41 depression cases (Kroenke et al., 2001; Spitzer et al., 1999). Meta‐analyses could potentially compensate for small sample sizes in individual studies, but reporting of results from different cutoffs across primary studies makes this difficult. A meta‐analysis of the PHQ‐9 (Manea et al., 2012) was able to include results from 16 of 18 primary studies for the “standard cutoff” of 10, but could only include results from five to 10 studies for alternative cutoffs. A meta‐analysis of the HADS, which analyzed the “standard” cutoffs of eight and 11, had to exclude 16 of 41 (39%) otherwise eligible studies because they reported only study‐specific “optimal” cutoffs, but not results from either of the standard cutoffs (Brennan et al., 2010). Thus, when small samples are used to determine the cutoffs for which results are reported, imprecision cannot be corrected in meta‐analysis without potentially substantial bias.
To ensure that studies generate reasonably precise estimates of sensitivity and specificity, investigators should consider the precision needed for clinical practice and calculate the sample size required to achieve it. The original STARD statement (Bossuyt et al., 2003), which was developed to improve the reporting of diagnostic accuracy studies, required the inclusion of statistical methods to quantify uncertainty (e.g. 95% confidence intervals) but did not mention sample size calculation. The updated 2015 STARD statement (Bossuyt et al., 2015) requires quantification of uncertainty and the "intended sample size and how it was determined" (Item 18). Compliance with STARD, however, is generally modest (Korevaar et al., 2014), and the updated standards alone are unlikely to address the problems identified in the present study unless researchers, peer reviewers and editors insist on adherence. The updated STARD standards do not explicitly require a formal statistical determination of sample size. Nonetheless, a priori sample size calculations based on precision should be required, and straightforward methods for doing this are available (Flahault et al., 2005).
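Flahault et al.'s procedure is not reproduced here, but a simple search over the Agresti–Coull lower bound, the interval method used in this survey, conveys the idea; the 0.95 expected sensitivity and 0.90 required lower bound are illustrative assumptions.

```python
from math import sqrt

def ac_lower_bound(successes: float, n: int, z: float = 1.96) -> float:
    """Lower limit of the Agresti-Coull approximate 95% CI."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    return p_adj - z * sqrt(p_adj * (1 - p_adj) / n_adj)

def min_cases(expected_sens: float, required_lower: float) -> int:
    """Smallest case count whose expected lower bound meets the target."""
    n = 10
    while ac_lower_bound(expected_sens * n, n) < required_lower:
        n += 1
    return n

# Illustrative targets: expected sensitivity 0.95, lower bound >= 0.90.
print(min_cases(0.95, 0.90))   # roughly 150 cases under these assumptions
```

Under these assumptions, roughly 150 depression cases would be required, several times the median of 37 cases observed in the studies we surveyed.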
A possible limitation of the present study was that we surveyed recent publications from journals listed in MEDLINE rather than conducting a systematic review of all studies ever conducted. This, however, was consistent with our objective of understanding sample size and reporting practices in recent studies that could potentially influence practice, rather than generating an exhaustive catalogue. The approach is also consistent with two previous surveys of sample size and precision in diagnostic test accuracy studies in general medicine (Bachmann et al., 2006) and ophthalmology (Bochmann et al., 2007) journals. Furthermore, restricting searches for diagnostic test accuracy studies to MEDLINE does not result in substantive loss of information (van Enst et al., 2014), and it is unlikely that including other databases would have changed our results substantively. Another potential limitation is that the included studies were published in many different journals and reported on a wide range of depression screening tools. However, results did not change when only studies from journals with impact factor ≥ 3 were evaluated or when we analyzed results only for studies of the PHQ, the most commonly used depression screening tool in primary care (US Preventive Services Task Force, 2015).
In summary, we found that fewer than 5% of primary studies on the diagnostic accuracy of depression screening tools published since 2013 included a viable sample size calculation and that just over a third of studies provided confidence intervals to quantify precision of estimates. In most studies, the number of patients was too small to generate robust estimates of accuracy, particularly for sensitivity, and lower bound accuracy estimates were below levels likely needed for clinical practice.
Supporting information
Table S1. Included study characteristics.
Acknowledgements
Ms Rice was supported by a CIHR Frederick Banting and Charles Best Canada Graduate Scholarship. Dr Thombs was supported by an Investigator Salary Award from the Arthritis Society. There was no specific funding for this study, and no funders had any role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Authors had full access to the data and can take responsibility for the integrity of the data and the accuracy of the data analysis.
References
- Agresti A., Coull B.A. (1998) Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2), 119–126.
- Allaby M. (2010) Screening for Depression: A Report for the UK National Screening Committee (revised report), London: UK National Screening Committee.
- Bachmann L.M., Puhan M.A., ter Riet G., Bossuyt P.M. (2006) Sample sizes of studies on diagnostic accuracy: literature survey. BMJ, 332, 1127–1129.
- Beck A.T., Steer R.A., Brown G.K. (1996) Manual for the Beck Depression Inventory‐II, San Antonio, TX: Psychological Corporation.
- Beck A.T., Steer R.A. (1987) Manual for the Revised Beck Depression Inventory, San Antonio, TX: Psychological Corporation.
- Bochmann F., Johnson Z., Azuara‐Blanco A. (2007) Sample size in studies on diagnostic accuracy in ophthalmology: a literature survey. British Journal of Ophthalmology, 91(7), 898–900.
- Bossuyt P.M., Reitsma J.B., Bruns D.E., Gatsonis C.A., Glasziou P.P., Irwig L., Lijmer J.G., Moher D., Rennie D., de Vet H.C., Kressel H.Y., Rifai N., Golub R.M., Altman D.G., Hooft L., Korevaar D.A., Cohen J.F.; STARD Group. (2015) STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ, 351, h5527.
- Bossuyt P.M., Reitsma J.B., Bruns D.E., Gatsonis C.A., Glasziou P.P., Irwig L., Moher D., Rennie D., de Vet H.C., Lijmer J.G.; Standards for Reporting of Diagnostic Accuracy. (2003) The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Annals of Internal Medicine, 138, W1–W12.
- Brennan C., Worrall‐Davies A., McMillan D., Gilbody S., House A. (2010) The Hospital Anxiety and Depression Scale: a diagnostic meta‐analysis of case‐finding ability. Journal of Psychosomatic Research, 69(4), 371–378.
- Burton C., Simpson C., Anderson N. (2013) Diagnosis and treatment of depression following routine screening in patients with coronary heart disease or diabetes: a database cohort study. Psychological Medicine, 43(3), 529–537.
- Canadian Diabetes Association (2013) Clinical practice guidelines for the prevention and management of diabetes in Canada. Canadian Journal of Diabetes, 37, S1–S212.
- Centers for Medicare and Medicaid Services (2010) Medicare Program; Payment Policies under the Physician Fee Schedule and Other Revisions to Part B for CY 2011, November 29, 2010. http://www.federalregister.gov/articles/2010/11/29/2010-27969/medicare-program-payment-policies-under-the-physician-fee-schedule-and-otherrevisions-to-part-b-forth-177 [20 December 2015].
- Colquhoun D.M., Bunker S.J., Clarke D.M., Glozier N., Hare D.L., Hickie I.B., Tatoulis J., Thompson D.R., Tofler G.H., Wilson A., Branagan M.G. (2013) Screening, referral and treatment for depression in patients with coronary heart disease. Medical Journal of Australia, 198(9), 483–484.
- Cox J., Holden J., Sagovsky R. (1987) Detection of postnatal depression: development of the 10‐item Edinburgh Postnatal Depression Scale. British Journal of Psychiatry, 150, 782–786.
- Eskes G.A., Lanctot K.L., Herrmann N., Lindsay P., Bayley M., Bouvier L., Dawson D., Egi S., Gilchrist E., Green T., Gubitz G., Hill M.D., Hopper T., Khan A., King A., Kirton A., Moorhouse P., Smith E.E., Green J., Foley N., Salter K., Swartz R.H.; Heart and Stroke Foundation of Canada Canadian Stroke Best Practices Committees. (2015) Canadian stroke best practice recommendations: mood, cognition and fatigue following stroke practice guidelines, update 2015. International Journal of Stroke, 10(7), 1130–1140.
- Ewald B. (2006) Post hoc choice of cut points introduced bias to diagnostic research. Journal of Clinical Epidemiology, 59(8), 798–801.
- Flahault A., Cadilhac M., Thomas G. (2005) Sample size calculation should be performed for design accuracy in diagnostic test studies. Journal of Clinical Epidemiology, 58(8), 859–862.
- Gilbody S., Sheldon T., Wessely S. (2006) Should we screen for depression? BMJ, 332(7548), 1027–1030.
- Holland J.C., Andersen B., Breitbart W.S., Buchmann L.O., Compas B., Deshields T.L., Dudley M.M., Fleishman S., Fulcher C.D., Greenberg D.B., Greiner C.B., Handzo G.F., Hoofring L., Hoover C., Jacobsen P.B., Kvale E., Levy M.H., Loscalzo M.J., McAllister‐Black R., Mechanic K.Y., Palesh O., Pazar J.P., Riba M.B., Roper K., Valentine A.D., Wagner L.I., Zevon M.A., McMillian N.R., Freedman‐Cass D.A. (2013) Distress management. Journal of the National Comprehensive Cancer Network, 11(2), 190–209.
- Joffres M., Jaramillo A., Dickinson J., Lewin G., Pottie K., Shaw E., Connor Gorber S., Tonelli M.; Canadian Task Force on Preventive Health Care. (2013) Recommendations on screening for depression in adults. CMAJ, 185(9), 775–782.
- Korevaar D.A., van Enst W.A., Spijker R., Bossuyt P.M.M., Hooft L. (2014) Reporting quality of diagnostic accuracy studies: a systematic review and meta‐analysis of investigations on adherence to STARD. Evidence‐Based Medicine, 19(2), 47–54.
- Kroenke K., Spitzer R.L., Williams J.B. (2001) The PHQ‐9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
- Leeflang M.M., Moons K.G., Reitsma J.B., Zwinderman A.H. (2008) Bias in sensitivity and specificity caused by data‐driven selection of optimal cutoff values: mechanisms, magnitude, and solutions. Clinical Chemistry, 54(4), 729–737.
- Lichtman J.H., Bigger J.T. Jr., Blumenthal J.A., Frasure‐Smith N., Kaufmann P.G., Lespérance F., Mark D.B., Sheps D.S., Taylor C.B., Froelicher E.S. (2008) Depression and coronary heart disease: recommendations for screening, referral, and treatment: a science advisory from the American Heart Association Prevention Committee of the Council on Cardiovascular Nursing, Council on Clinical Cardiology, Council on Epidemiology and Prevention, and Interdisciplinary Council on Quality of Care and Outcomes Research: endorsed by the American Psychiatric Association. Circulation, 118(17), 1768–1775.
- Manea L., Gilbody S., McMillan D. (2012) Optimal cut‐off score for diagnosing depression with the Patient Health Questionnaire (PHQ‐9): a meta‐analysis. CMAJ, 184(3), E191–E196.
- Meijer A., Roseman M., Milette K., Coyne J.C., Stefanek M.E., Ziegelstein R.C., Arthurs E., Leavens A., Palmer S.C., Stewart D.E., de Jonge P., Thombs B.D. (2011) Depression screening and patient outcomes in cancer: a systematic review. PLOS ONE, 6, e27181.
- Meng X., D'Arcy C., Tempier R. (2013) Trends in psychotropic use in Saskatchewan from 1983 to 2007. Canadian Journal of Psychiatry, 58(7), 426–431.
- National Collaborating Center for Mental Health (2010) The NICE Guideline on the Management and Treatment of Depression in Adults (updated edition), London: National Institute for Health and Clinical Excellence.
- National Comprehensive Cancer Network (2008) NCCN Clinical Practice Guidelines in Oncology: Distress Management, Fort Washington, PA: National Comprehensive Cancer Network.
- National Institute for Clinical Excellence (2004) Guideline on Cancer Services: Improving Supportive and Palliative Care for Adults with Cancer, London: National Institute for Health and Clinical Excellence.
- NCQA (2011) NCQA Level 3 PCMH Recognition Requirements Compared to 2011 Joint Commission Standards and EPs. http://www.jointcommission.org/assets/1/18/PCMH-NCQA_crosswalk-final_June_2011.pdf [20 December 2015].
- Palmer S.C., Coyne J.C. (2003) Screening for depression in medical care: pitfalls, alternatives, and revised priorities. Journal of Psychosomatic Research, 54(4), 279–287.
- Rutjes A.W., Reitsma J.B., Di Nisio M., Smidt N., van Rijn J.C., Bossuyt P.M. (2006) Evidence of bias and variation in diagnostic accuracy studies. CMAJ, 174(4), 469–476.
- Spitzer R.L., Kroenke K., Williams J.B. (1999) Validation and utility of a self‐report version of PRIME‐MD: the PHQ primary care study. Primary Care Evaluation of Mental Disorders. Patient Health Questionnaire. JAMA, 282(18), 1737–1744.
- Thombs B.D., Arthurs E., El‐Baalbaki G., Meijer A., Ziegelstein R.C., Steele R. (2011) Risk of bias from inclusion of already diagnosed or treated patients in diagnostic accuracy studies of depression screening tools: a systematic review. BMJ, 343, d4825.
- Thombs B.D., Coyne J.C., Cuijpers P., de Jonge P., Gilbody S., Ioannidis J.P., Johnson B.T., Patten S.B., Turner E.H., Ziegelstein R.C. (2012) Rethinking recommendations for screening for depression in primary care. CMAJ, 184(4), 413–418.
- Thombs B.D., Ziegelstein R.C. (2014) Does depression screening improve depression outcomes in primary care? BMJ, 348, g1253.
- US Preventive Services Task Force (2009) Screening for depression in adults: US Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 151(11), 784–792.
- US Preventive Services Task Force (2015) Depression in Adults: Screening. Draft Recommendation Statement. http://www.uspreventiveservicestaskforce.org/Page/Document/draft-recommendation-statement115/depression-in-adults-screening1 [20 December 2015].
- van Enst W.A., Scholten R.J., Whiting P., Zwinderman A.H., Hooft L. (2014) Meta‐epidemiologic analysis indicates that MEDLINE searches are sufficient for diagnostic test accuracy systematic reviews. Journal of Clinical Epidemiology, 67(11), 1192–1199.
- Whiting P., Rutjes A.W., Reitsma J.B., Glas A.S., Bossuyt P.M., Kleijnen J. (2004) Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Annals of Internal Medicine, 140(3), 189–202.
- Whiting P.F., Rutjes A.W., Westwood M.E., Mallett S.; QUADAS‐2 Steering Group (2013) A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. Journal of Clinical Epidemiology, 66(10), 1093–1104.
- Zigmond A.S., Snaith R.P. (1983) The Hospital Anxiety and Depression Scale. Acta Psychiatrica Scandinavica, 67(6), 361–370.