Abstract
Background:
The STAndards for Reporting of Diagnostic Accuracy (STARD) statement was published in 2003 and endorsed by some journals but not others.
Objective:
To determine whether the quality of indexing of diagnostic accuracy studies in MEDLINE and EMBASE has improved since the STARD statement was published.
Design:
Comparison of the change in the mean number of “accurate index terms” assigned to diagnostic accuracy studies in STARD (endorsing) and non-STARD (non-endorsing) journals for the 2 years before and the 2 years after STARD publication.
Results:
In MEDLINE, no differences in indexing quality were found between STARD and non-STARD journals before or after the STARD statement was published in 2003. In EMBASE, indexing in STARD journals improved relative to non-STARD journals (p = 0.02). However, articles in STARD journals had about half as many accurate indexing terms as articles in non-STARD journals, both before and after STARD statement publication (p < 0.001).
Introduction
The retrieval of diagnostic accuracy studies from large electronic databases such as MEDLINE and EMBASE is important for clinicians practicing evidence-based medicine, since an accurate diagnosis is the cornerstone of sound decision-making about clinical intervention for health problems. Finding the best evidence in these databases can be daunting, however, because relevant articles are scattered across a broad array of journal titles, high-quality, relevant studies are very dilute in a very large database, and indexing has inherent limitations, all amplified by clinicians’ lack of searching skills [1].
In an ongoing effort to help bridge the gap between research and practice, researchers in the Health Information Research Unit (HIRU) at McMaster University developed search strategies to help clinicians and researchers find the best evidence. Search strategies developed specifically to improve the retrieval of diagnostic accuracy studies when searching in MEDLINE and EMBASE [2, 3] are available for use on the Clinical Queries interface of PubMed and on the limits screen when accessing MEDLINE and EMBASE through the Ovid interface. Data used to derive the search strategies are contained in the Clinical Hedges Database stored in HIRU.
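To illustrate how such a filter can be applied programmatically, the following is a minimal sketch, not the Clinical Queries implementation itself, that submits a filtered PubMed search through Biopython's Entrez utilities. The query combines PubMed-syntax counterparts of two index terms discussed later in this paper; the clinical topic and email address are hypothetical.

```python
# A minimal sketch (not the Clinical Queries implementation) of combining a
# methodologic diagnosis filter with a clinical topic in a PubMed search.
from Bio import Entrez  # Biopython

Entrez.email = "you@example.org"  # hypothetical; NCBI asks for a contact address

# PubMed-syntax counterparts of two Ovid terms from this paper's tables:
# "sensitivity and specificity".sh. and the "diagnosis" subheading.
methods_filter = '"sensitivity and specificity"[MeSH Terms] OR diagnosis[Subheading]'
topic = "appendicitis"  # hypothetical clinical topic

handle = Entrez.esearch(db="pubmed", term=f"({methods_filter}) AND {topic}", retmax=10)
record = Entrez.read(handle)
handle.close()
print(record["IdList"])  # PubMed IDs of the top matches
```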
Once relevant articles are retrieved, the clinician must then assess the validity of the study results before incorporating evidence into practice. This can be difficult because many searchers lack critical appraisal skills [4] and, for those who have adequate skills, the substandard reporting of many published research reports hampers the critical appraisal exercise.
Awareness of the need to improve the quality of reporting of diagnostic accuracy studies was increased with the publication of the STAndards for Reporting of Diagnostic Accuracy (STARD) statement [5, 6]. The statement provides guidelines for improving the quality of reporting of diagnostic accuracy studies, on the hypothesis that complete and informative reporting will lead to better decisions in healthcare. It could also lead to improvements in indexing accuracy and completeness, either because authors’ completeness of reporting improved as a direct result of the STARD statement or because indexers became better informed about how to index such studies. To date, no studies have been published evaluating the effect the STARD statement may have had on the indexing and retrieval of diagnostic accuracy articles in large bibliographic databases. The questions addressed in this study are: has the quality of indexing of the methodologic components of diagnostic accuracy studies improved since the STARD statement was published in 2003, and is there a difference between journals endorsing STARD and non-endorsing journals?
Methods
Figure 1 outlines the study design. Briefly, diagnostic accuracy studies were identified by hand searching 6 journals that published the STARD statement in 2003 and 6 journals that did not. The 12 journals were identified using data from the Clinical Hedges Study [2, 3]. Of the 170 journals included in the Clinical Hedges Database, 6 published the STARD statement in 2003: AJR American Journal of Roentgenology, Annals of Internal Medicine, BMJ, JAMA, Lancet, and Radiology (herein referred to as the STARD journals). Five journals with similar content areas did not: Pediatric Radiology (1 diagnostic journal), Archives of Internal Medicine (1 internal medicine journal), and American Journal of Medicine, British Journal of General Practice, and New England Journal of Medicine (3 general medicine journals). To locate an additional comparator journal in the diagnostic area, as one was not available in the Clinical Hedges Database, we used a method similar to that of Moher et al. [7] when choosing a journal subset for evaluating the effect of the CONSORT statement for reporting clinical trials. The sixth comparator journal was identified as follows: 1) the journal subject category “Radiology, Nuclear Medicine and Medical Imaging” was selected on the Institute for Scientific Information Journal Citation Reports website; 2) the journals in this category were sorted in descending order of impact factor; and 3) moving down the list, we identified the first journal that was available through the Health Sciences Library at McMaster University, had published at least one diagnostic accuracy study (based on a review of the contents of two 2001 issues), and was indexed in both MEDLINE and EMBASE. European Radiology met these criteria and completes the set of 6 comparator journals, herein referred to as the non-STARD journals.
Figure 1.
Study design
Four publishing years were studied: 2001 and 2002 to obtain a pre-STARD statement assessment, and 2004 and 2005 to obtain a post-STARD statement assessment.
Diagnostic accuracy studies were defined as those in which the outcomes from one or more tests under evaluation were compared with outcomes from the reference standard, both measured in the same study population [6]. Three trained and calibrated research assistants (96% crude agreement attained after classifying 428 articles) independently reviewed all items indexed in all issues of the 12 journals for the 4 publishing years to identify all diagnostic accuracy studies.
We used data collected as part of the Clinical Hedges Study to determine whether the quality of indexing has improved. During the Clinical Hedges Study, a comprehensive list of index terms related to the methodologic features of diagnostic accuracy studies was compiled for MEDLINE [2] and EMBASE [3], and the frequency with which these indexing terms were used was documented in the present study. The mean number of times that an “accurate index term” (defined as one appearing in the compiled list of index terms) was used served as the dependent variable in an analysis of covariance (ANCOVA) with two independent factors, journal type (STARD vs. non-STARD) and year of publication (2001, 2002, 2004, and 2005), and one covariate, content area of the journal (diagnostic vs. nondiagnostic). Separate analyses were conducted for MEDLINE and EMBASE.
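To make the model concrete, here is a minimal sketch of this ANCOVA using Python and statsmodels; the input file and column names (journal_type, pub_year, content_area, n_accurate_terms) are hypothetical stand-ins for the Clinical Hedges data, not the study's actual code.

```python
# A minimal ANCOVA sketch: one row per diagnostic accuracy study.
# File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("medline_indexing.csv")  # hypothetical input

# Two categorical factors (journal type, publishing year) plus their
# interaction, with journal content area as the covariate.
model = smf.ols(
    "n_accurate_terms ~ C(journal_type) * C(pub_year) + C(content_area)",
    data=df,
).fit()

# The C(journal_type):C(pub_year) row tests whether the change across
# publishing years differs between STARD and non-STARD journals.
print(anova_lm(model, typ=2))
```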
Also available through the Clinical Hedges Study are the terms that were top performers in detecting clinically relevant, methodologically sound diagnostic accuracy study reports when searching in MEDLINE and EMBASE [2, 3]. An additional analysis was conducted using the frequency of use of “top performing terms”.
The performance of each index term was also determined by calculating sensitivity, specificity, precision, and accuracy in MEDLINE and EMBASE, and these figures were compared between STARD and non-STARD journals across the 4 publishing years studied.
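As a concrete illustration of these calculations, the sketch below derives the four measures from a 2×2 table in which the hand search is the gold standard and assignment of an index term is the test; the counts are hypothetical, chosen only to roughly reproduce the “sensitivity and specificity”.sh. row of Table 3.

```python
# A minimal sketch of the diagnostic testing model: hand search = gold
# standard, index term = test; all four measures come from one 2x2 table.
def term_performance(tp: int, fp: int, fn: int, tn: int) -> dict:
    """tp: term assigned & diagnostic study; fp: term assigned & other article;
    fn: term absent & diagnostic study; tn: term absent & other article."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of diagnostic studies retrieved
        "specificity": tn / (tn + fp),  # fraction of other articles excluded
        "precision": tp / (tp + fp),    # fraction of retrieved articles that are relevant
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts, chosen to roughly match the
# "sensitivity and specificity".sh. row of Table 3 (877 of 40,592 articles).
print(term_performance(tp=525, fp=993, fn=352, tn=38722))
```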
Results
Of the articles reviewed, 40,592 were indexed in MEDLINE, 877 of which were classified as diagnostic accuracy studies. The corresponding figures for EMBASE were 33,954 and 864.
Forty MEDLINE index terms (Medical Subject Headings [MeSH]) were identified during the Clinical Hedges Study (e.g., “area under curve.sh.”, “di.xs.”), 7 of which were shown to be “top performers” (e.g., “di.fs.”, “diagnosis.sh.”). Seventy-three EMBASE index terms were identified (e.g., “area under the curve.sh.”, “diagnostic test.sh.”), 2 of which were shown to be “top performers” (i.e., “di.fs.”, “diagnostic accuracy.sh.”). One additional term, “sensitivity and specificity.sh.”, was considered in the EMBASE analysis, as it was added to EMTREE in 2001.
Using the 877 diagnostic accuracy studies indexed in MEDLINE and the mean number of “accurate index terms” as the dependent variable, an ANCOVA showed that the interaction between the 2 independent factors (journal type and publishing year) was not significant (p = 0.49). This non-significant result means that the change in the mean number of “accurate index terms” assigned in STARD journals did not differ significantly from the change in non-STARD journals across the 4 publishing years, adjusted for whether the journal was classified as diagnostic or non-diagnostic. Additionally, the analysis showed no main effect of journal type (STARD vs. non-STARD); thus, there was no difference between STARD and non-STARD journals across the 4 publishing years (p = 0.75). The data used in the analysis are shown in Table 1. A non-significant interaction was also obtained when the number of “top performing terms” was used as the dependent variable (p = 0.14).
Table 1.
Mean number of “accurate MEDLINE terms” assigned to diagnostic accuracy studies by journal type and publishing year
| Year | Journal type | Mean # of “accurate” MEDLINE index terms (SD) |
|---|---|---|
| 2001 | STARD | 5.77 (2.17) |
| 2001 | Non-STARD | 6.16 (2.39) |
| 2002 | STARD | 6.41 (2.27) |
| 2002 | Non-STARD | 6.15 (2.16) |
| 2004 | STARD | 6.16 (2.38) |
| 2004 | Non-STARD | 6.28 (2.31) |
| 2005 | STARD | 6.04 (2.00) |
| 2005 | Non-STARD | 6.18 (2.08) |
The same analysis was conducted using EMBASE data. Using the 864 diagnostic accuracy studies indexed in EMBASE and the mean number of “accurate index terms” as the dependent variable, an ANCOVA showed that the interaction between the two independent factors was significant (p = 0.001). This significant result means that the change in the mean number of “accurate index terms” assigned in STARD journals differed significantly from the change in non-STARD journals across the 4 publishing years, adjusted for whether the journal was classified as diagnostic or non-diagnostic. Additionally, the analysis showed a significant main effect of journal type, with non-STARD journals having a higher mean number of “accurate index terms” assigned than STARD journals. The data used in the analysis are shown in Table 2. A significant interaction was also obtained when the number of “top performing terms”, with “sensitivity and specificity.sh.” included, was used as the dependent variable (p = 0.044). A non-significant interaction was obtained when only the original “top performing terms” were used (p = 0.86).
Table 2.
Mean number of “accurate EMBASE terms” assigned to diagnostic accuracy studies by journal type and publishing year
| Year | Journal type | Mean # of “accurate” EMBASE index terms (SD) |
|---|---|---|
| 2001 | STARD | 1.75 (2.74) |
| 2001 | Non-STARD | 6.28 (2.30) |
| 2002 | STARD | 2.79 (3.02) |
| 2002 | Non-STARD | 5.87 (2.01) |
| 2004 | STARD | 3.42 (3.16) |
| 2004 | Non-STARD | 6.07 (2.24) |
| 2005 | STARD | 3.01 (3.18) |
| 2005 | Non-STARD | 6.44 (2.25) |
To determine index term performance, sensitivity, specificity, precision, and accuracy were calculated. All 40 index terms compiled for use in MEDLINE were tested individually using a diagnostic testing model; that is, the hand search was used as the gold standard and the index term was used as the test. In MEDLINE, 5 index terms (“di.xs”, “exp diagnosis”, “exp diagnostic techniques and procedures”, “exp sensitivity and specificity”, and “sensitivity and specificity”) yielded a sensitivity of at least 50% when tested in the entire MEDLINE file and when tested in each publishing year by journal type (STARD and non-STARD). Multiple Fisher’s exact tests comparing 2 independent proportions by year and by journal type showed no significant differences in term performance. Since there were no significant differences by publishing year or by journal type, the performance characteristics of the 5 MEDLINE terms were calculated using the entire MEDLINE file and are shown in Table 3.
Table 3.
Performance characteristics of the 5 MEDLINE terms combining STARD and non-STARD journals and the 4 publishing years (2001, 2002, 2004, and 2005)
| Search Term – Ovid syntax* | Sensitivity (%) (95% CI) | Specificity (%) (95% CI) | Precision (%) (95% CI) | Accuracy (%) (95% CI) |
|---|---|---|---|---|
| di.xs | 97.8 (96.9 to 98.8) | 73.1 (72.7 to 73.6) | 7.4 (7.0 to 7.9) | 73.7 (73.2 to 74.1) |
| exp diagnosis | 93.5 (91.9 to 95.1) | 67.0 (66.5 to 67.5) | 5.9 (5.5 to 6.3) | 67.6 (67.1 to 68.0) |
| exp “diagnostic techniques and procedures” | 89.3 (87.2 to 91.3) | 75.0 (74.6 to 75.4) | 7.3 (6.8 to 7.8) | 75.3 (74.9 to 75.7) |
| exp “sensitivity and specificity” | 67.8 (64.8 to 70.9) | 96.4 (96.2 to 96.6) | 29.5 (27.3 to 31.5) | 95.8 (95.6 to 96.0) |
| “sensitivity and specificity”.sh. | 59.9 (56.6 to 63.1) | 97.5 (97.3 to 97.6) | 34.4 (32.0 to 36.7) | 96.7 (96.5 to 96.8) |
* di=diagnosis; xs=exploded subheading; exp=explosion; sh=subject heading
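One of the Fisher's exact comparisons reported above might look like the following minimal sketch, which tests whether a term's sensitivity differs between STARD and non-STARD journals; all counts are invented for illustration.

```python
# A minimal sketch of Fisher's exact test comparing two independent
# proportions (a term's sensitivity in STARD vs. non-STARD journals).
from scipy.stats import fisher_exact

# Rows: journal type; columns: diagnostic studies retrieved vs. missed
# by the term. All counts are hypothetical.
table = [[260, 175],   # STARD
         [265, 177]]   # non-STARD
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```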
All 73 index terms compiled for use in EMBASE were also tested individually using a diagnostic testing model. In EMBASE, 2 index terms (“di.fs” and “exp diagnosis”) yielded a sensitivity of at least 50% when tested in the entire EMBASE file. This performance was not consistent, however, across STARD and non-STARD journals and publishing years. Table 4 shows the performance characteristics of 4 EMBASE terms that yielded a sensitivity of at least 50% in any one of the 4 publishing years. The data are shown by journal type (STARD vs. non-STARD) and by publishing year (2001, 2002, 2004 and 2005).
Table 4.
Performance characteristics of 4 EMBASE terms by journal type (STARD vs. non-STARD) and by publishing year (2001, 2002, 2004, and 2005); all values are percentages
| Search term (Ovid syntax*) | Sens 2001 | Sens 2002 | Sens 2004 | Sens 2005 | Spec 2001 | Spec 2002 | Spec 2004 | Spec 2005 | Prec 2001 | Prec 2002 | Prec 2004 | Prec 2005 | Acc 2001 | Acc 2002 | Acc 2004 | Acc 2005 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| di.fs | ||||||||||||||||
| STARD | 32 | 51 | 55 | 47 | 87 | 89 | 90 | 90 | 5 | 12 | 14 | 11 | 86 | 87 | 89 | 89 |
| non-STARD | 93 | 91 | 93 | 92 | 62 | 66 | 68 | 71 | 5 | 6 | 8 | 8 | 63 | 87 | 89 | 72 |
| exp diagnosis | ||||||||||||||||
| STARD | 33 | 53 | 62 | 42 | 84 | 84 | 85 | 87 | 4 | 10 | 11 | 8 | 81 | 83 | 85 | 86 |
| non-STARD | 98 | 99 | 97 | 78 | 60 | 61 | 63 | 72 | 5 | 6 | 7 | 7 | 61 | 62 | 64 | 72 |
| diagnostic accuracy.sh. | ||||||||||||||||
| STARD | 21 | 40 | 37 | 34 | 98 | 98 | 98 | 98 | 14 | 41 | 36 | 30 | 96 | 96 | 96 | 96 |
| non-STARD | 47 | 58 | 54 | 69 | 96 | 94 | 96 | 95 | 21 | 20 | 27 | 26 | 95 | 93 | 94 | 94 |
| “sensitivity and specificity”.sh. | ||||||||||||||||
| STARD | 2 | 23 | 27 | 32 | 99 | 99 | 99 | 99 | 8 | 60 | 60 | 47 | 98 | 97 | 97 | 97 |
| non-STARD | 36 | 40 | 41 | 57 | 99 | 98 | 98 | 99 | 48 | 32 | 44 | 51 | 98 | 97 | 97 | 98 |
* di=diagnosis; fs=floating subheading; exp=explosion; sh=subject heading; Sens=sensitivity; Spec=specificity; Prec=precision; Acc=accuracy
Discussion
Our results show that indexing of diagnostic accuracy studies in MEDLINE did not change with the publication of the STARD statement and remained consistent across the publishing years 2001, 2002, 2004, and 2005. Five index terms were shown to have a sensitivity of at least 50% (Table 3). In EMBASE, the results were very different. Indexing of diagnostic accuracy studies in non-STARD journals was more comprehensive than in STARD journals and remained relatively consistent across the 4 publishing years. Indexing comprehensiveness increased somewhat in STARD journals over the 4 publishing years but did not reach non-STARD journal levels. Further, only 2 index terms were consistently shown to have a sensitivity of at least 50%, and this occurred only in non-STARD journals (Table 4).
In an effort to explain the EMBASE finding that non-STARD journals were more comprehensively indexed than STARD journals, 3 additional factors were considered: journal impact factor, journal publisher (Elsevier vs. other), and country of publication (North America vs. Europe). The hypotheses were that journals with higher impact factors are indexed more comprehensively, that journals published by Elsevier are more comprehensively indexed in EMBASE, and that journals published in Europe are more comprehensively indexed in EMBASE. Five of the non-STARD journals had an impact factor < 10 and 1 had an impact factor ≥ 10, compared with 3 and 3 for the STARD journals. Both the non-STARD and STARD groups had 1 journal published by Elsevier and 5 published by others. Three of the non-STARD journals were published in the United States and 3 in Europe, compared with 4 STARD journals published in the United States and 2 in Europe. None of these factors appears to account for the EMBASE results.
Our study has some limitations. First, we did not assess the degree to which the STARD statement guidelines were enforced by the STARD journals. Second, it is possible that not enough time had passed since the publication of the STARD statement for it to have affected reporting quality, and thus indexing, and that an assessment beyond the 2005 publishing year is required.
Conclusion
Indexing of diagnostic accuracy studies in MEDLINE did not change with the publication of the STARD statement and remained consistent across the 4 publishing years. In EMBASE, indexing of diagnostic accuracy studies in non-STARD journals was more comprehensive than in STARD journals and remained relatively consistent across the 4 publishing years. Indexing comprehensiveness increased somewhat in STARD journals over the 4 publishing years but did not reach non-STARD journal levels.
Acknowledgments
This research was funded by the U.S. National Library of Medicine and the Canadian Institutes of Health Research. PhD thesis committee members are Brian Haynes (bhaynes@mcmaster.ca), Stephen Walter, and Ruta Valaitis. Monika Kastner and Leslie Walters assisted with data collection.
References
- 1. Ely JW, Osheroff JA, Ebell MH, Chambliss ML, Vinson DC, Stevermer JJ, et al. Obstacles to answering doctors' questions about patient care with evidence: qualitative study. BMJ. 2002;324:710. doi: 10.1136/bmj.324.7339.710.
- 2. Haynes RB, Wilczynski NL. Optimal search strategies for retrieving scientifically strong studies of diagnosis from Medline: analytical survey. BMJ. 2004;328:1040.
- 3. Wilczynski NL, Haynes RB; Hedges Team. EMBASE search strategies for identifying methodologically sound diagnostic studies for use by clinicians and researchers. BMC Med. 2005;3:7.
- 4. Putnam W, Twohig PL, Burge FI, Jackson LA, Cox JL. A qualitative study of evidence in primary care: what the practitioners are saying. CMAJ. 2002;166:1525–30.
- 5. Bossuyt PM, Reitsma JB. Standards for Reporting of Diagnostic Accuracy. The STARD initiative. Lancet. 2003;361:71. doi: 10.1016/S0140-6736(03)12122-8.
- 6. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem. 2003;49:7–18. doi: 10.1373/49.1.7.
- 7. Moher D, Jones A, Lepage L; for the CONSORT Group. Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. JAMA. 2001;285:1992–5. doi: 10.1001/jama.285.15.1992.

