Epidemiologists depend on accurate assessment of disease states for almost all aspects of their work, whether for research or practice. Most epidemiologists are aware that diagnostic tests are fallible, and they have developed and applied sophisticated methods to address measurement error in outcomes. Epidemiologists are comfortable with sensitivity and specificity, the parameters used to express the accuracy of diagnostic and screening tests. However, few “traditional epidemiologists” have contributed to diagnostic test evaluation methods, despite the centrality of diagnostic tests to their work. Instead, diagnostic test evaluation has been the purview of “clinical epidemiologists” and a few biostatisticians.
Diagnostic test evaluation is subject to numerous potential biases.1 Fundamentally, evaluation of a new diagnostic test requires comparison to a reference (“gold”) standard, which is usually assumed to discriminate disease and non-disease states perfectly. Unfortunately, few, if any, reference standards are perfect. The resulting reference test bias is one of the most significant, pervasive, and challenging forms of bias.2,3
The impact of reference test bias can be described with a simple example. Consider a reference test with sensitivity=0.85 and specificity=0.90 and an improved, new test with a true, but unknown, sensitivity=0.90 and specificity=0.95. For simplicity, we assume the two tests are conditionally independent given true disease status. In a study sample with a prevalence of 0.1, the measured sensitivity and specificity of the new test would be 0.46 and 0.93, respectively. In a study sample with a prevalence of 0.5 (i.e. cases and non-cases selected independently), the estimates of sensitivity and specificity change markedly to 0.81 and 0.83, respectively. The dependence of the sensitivity and specificity estimates on prevalence is comparable to the variation of positive and negative predictive values with prevalence. False positives by the reference test increase as prevalence decreases; reference test false negatives increase as prevalence increases. The imperfect classification of disease by the reference standard leads to apparent misclassification by the new test and biased sensitivity and specificity estimates.
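These figures can be verified directly. Below is a minimal Python sketch of the calculation, using the accuracy values and the conditional-independence assumption stated above; the function name is illustrative only.

```python
# Apparent accuracy of a new test when judged against an imperfect reference
# standard, assuming the two tests err independently given true disease status.
def apparent_accuracy(prev, se_ref, sp_ref, se_new, sp_new):
    # Joint cell probabilities for (new test, reference test), summing over
    # the unknown true disease status.
    p_ref_pos  = se_ref * prev + (1 - sp_ref) * (1 - prev)
    p_both_pos = se_new * se_ref * prev + (1 - sp_new) * (1 - sp_ref) * (1 - prev)
    p_both_neg = (1 - se_new) * (1 - se_ref) * prev + sp_new * sp_ref * (1 - prev)
    return p_both_pos / p_ref_pos, p_both_neg / (1 - p_ref_pos)

for prev in (0.1, 0.5):
    se, sp = apparent_accuracy(prev, se_ref=0.85, sp_ref=0.90, se_new=0.90, sp_new=0.95)
    print(f"prevalence {prev}: apparent sensitivity {se:.2f}, apparent specificity {sp:.2f}")
# prevalence 0.1: apparent sensitivity 0.46, apparent specificity 0.93
# prevalence 0.5: apparent sensitivity 0.81, apparent specificity 0.83
```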
Reference test bias is particularly problematic with new tests that are better than the reference standard, as in the example above. The belief that a new test is inherently better has led to intuitive “solutions” to account for reference test bias. Unfortunately, as in other areas of epidemiology, intuition is often a poor statistician. In the 1990s, microbiologists adopted an intuitively appealing procedure referred to as discrepant analysis for the evaluation of new nucleic acid amplification tests, such as polymerase chain reaction, for chlamydial infection and other infectious diseases.4-6 The microbiologists recognized that these new tests represented major biological advances over culture techniques, which were known to have limited sensitivity.7 To address the concern about reference test bias when culture was the reference standard, the microbiologists chose to conduct additional testing on specimens with discordant results (i.e. positive by the new test, but negative by the reference standard). This “resolution of discrepancy” was often performed with tests mechanistically similar to the new test under evaluation. After resolving the discordant specimens, sensitivity and specificity were recalculated. Although intuitively appealing, this procedure was inherently biased, with substantial overestimation of test performance even under ideal conditions.8-10
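The mechanics of the bias can be demonstrated with a small simulation. The sketch below is hypothetical, not a reconstruction of any published evaluation: it assumes the new test's false positives arise from cross-reacting specimens that a mechanistically similar resolver test also flags, and that only new-positive/culture-negative discordants are retested. All accuracy values are the illustrative ones used above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
prev, se_ref, sp_ref, se_new, sp_new = 0.10, 0.85, 0.90, 0.90, 0.95

d = rng.random(n) < prev                                             # true infection status
ref = np.where(d, rng.random(n) < se_ref, rng.random(n) > sp_ref)    # culture result
# Assume the new test's false positives occur on specimens with cross-reacting
# material, which a mechanistically similar resolver test also detects.
cross = ~d & (rng.random(n) < 1 - sp_new)
new = (d & (rng.random(n) < se_new)) | cross
resolver = (d & (rng.random(n) < 0.98)) | cross   # assumed 98% sensitive, shares cross-reactions

# Discrepant analysis: retest only new-positive / culture-negative specimens
# and let the resolver's verdict revise the reference standard.
revised = ref.copy()
disc = new & ~ref
revised[disc] = resolver[disc]

def se_sp(test, gold):
    return test[gold].mean(), (~test[~gold]).mean()

print("true accuracy of new test:     ", se_sp(new, d))        # ~ (0.90, 0.95)
print("vs. culture alone:             ", se_sp(new, ref))
print("after discrepant 'resolution': ", se_sp(new, revised))  # specificity near 1.0
```

Under these assumptions, the resolved specificity appears essentially perfect even though the true value is 0.95: the new test's false positives are “confirmed” by a test that shares their cause and are silently converted into true positives, while the sensitivity estimate remains biased.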
The history of discrepant analysis reveals the fundamental challenges of communication between laboratory scientists and statistical methodologists. After the FDA allowed discrepant analysis for clearance of a few tests, Hadgu8,9 and others10,11 demonstrated the inherent bias in discrepant analysis. These reports led to vigorous debate between the microbiologists and methodologists in letters and commentaries.7,10,12-14 At times, the misunderstanding between the “scientists” and the “statisticians” was remarkable. The laboratory scientists recognized that the limits of detection of the new nucleic acid amplification tests under evaluation were markedly better than those of the earlier culture-based tests.7 Thus, discrepant analysis was employed to account for what they considered a biological fact. Assumptions and bias were secondary considerations, as was evident at a diagnostic test evaluation workshop in 1999 at the Centers for Disease Control and Prevention. After the statistical assumptions of various approaches to reference test bias had been carefully explained, a prominent laboratory scientist raised his hand and proclaimed, “I work in a laboratory. I make no assumptions. I simply let the data speak for itself.” This fundamentally different perspective was extremely difficult to overcome. Fortunately, the FDA recognized the inherent issues with discrepant analysis and limited its use.
In this issue of EPIDEMIOLOGY, Hadgu and colleagues address a different intuitive yet biased diagnostic test evaluation procedure, the patient infected status algorithm (PISA).15 Over the past decade, PISA has replaced discrepant analysis as a common procedure for evaluating diagnostic and screening tests for many infectious diseases.16-18 Unfortunately, this procedure also yields biased results.
PISA uses a combination of tests to create a composite reference standard.16-18 Sometimes two tests are used with specimens taken from multiple anatomical sites. Hadgu et al. clearly show that this approach can be associated with substantial bias both under the simplest conditions (assuming conditional independence between tests) and under more complex conditions (assuming conditional dependence).15 Given this bias, PISA should not be an acceptable procedure for FDA clearance.
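To see why a composite standard does not remove reference test bias even in the most favorable case, consider the following sketch. The composite rule (“infected if at least two of three comparator tests are positive”) and all accuracy values are assumptions chosen for illustration; they are not the specific algorithm or assays examined by Hadgu et al.

```python
# Hypothetical composite reference: "infected" if at least 2 of 3 comparator
# tests are positive, each comparator with sensitivity 0.85 and specificity
# 0.95, all errors conditionally independent (illustrative values only).
def at_least_two_of_three(p):
    return 3 * p**2 * (1 - p) + p**3

se_comp = at_least_two_of_three(0.85)    # composite sensitivity, ~0.94
fp_comp = at_least_two_of_three(0.05)    # composite false-positive rate, ~0.007

# Apparent accuracy of a new test (true Se = 0.90, Sp = 0.95) judged against
# the composite at prevalence 0.10, using the same identity as above.
prev, se_new, sp_new = 0.10, 0.90, 0.95
p_comp_pos = se_comp * prev + fp_comp * (1 - prev)
apparent_se = (se_new * se_comp * prev + (1 - sp_new) * fp_comp * (1 - prev)) / p_comp_pos
apparent_sp = ((1 - se_new) * (1 - se_comp) * prev
               + sp_new * (1 - fp_comp) * (1 - prev)) / (1 - p_comp_pos)
print(f"apparent sensitivity {apparent_se:.2f} (true 0.90)")
print(f"apparent specificity {apparent_sp:.3f} (true 0.95)")
```

Even this idealized composite misclassifies some specimens, so the apparent values differ from the truth; when the composite includes tests whose errors are correlated with the test under evaluation (conditional dependence), the distortion can be larger still.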
The bias of simple, intuitive approaches to diagnostic test evaluation necessitates the application of more sophisticated statistical approaches. Several alternative approaches have been developed.19-23 Latent class analysis is a probabilistic approach that treats the true disease state as an unknown, underlying latent variable. In its simplest form, conditional independence is assumed,21 but more advanced applications can incorporate conditional dependence.22 Bayesian approaches, including Bayesian latent class models, have also been developed.23 Generally, these statistically intensive approaches are important steps forward, but unfortunately, none is “ideal.” For example, latent class approaches have been criticized because the latent variable is unknown and unspecified and may not reflect the clinical entity under study.3 These approaches may also be limited by feasibility, such as the number of tests required for analysis.20
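As a concrete illustration of the simplest case (the conditional-independence model of Hui and Walter21), the sketch below fits a two-class latent class model by expectation–maximization to three simulated, conditionally independent tests; the sample size, accuracy values, and starting values are assumptions for illustration, not any published analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate three conditionally independent binary tests (illustrative values).
n, prev = 50_000, 0.10
se_true = np.array([0.85, 0.90, 0.80])
sp_true = np.array([0.90, 0.95, 0.97])
d = rng.random(n) < prev
x = np.where(d[:, None], rng.random((n, 3)) < se_true, rng.random((n, 3)) < 1 - sp_true)

# EM for a two-class latent class model under conditional independence.
# Starting values with se + sp > 1 anchor the "diseased" label to the class
# in which tests tend to be positive.
pi, se, sp = 0.3, np.full(3, 0.9), np.full(3, 0.9)
for _ in range(500):
    # E-step: posterior probability that each subject belongs to the diseased class.
    like_d  = np.prod(np.where(x, se, 1 - se), axis=1)
    like_nd = np.prod(np.where(x, 1 - sp, sp), axis=1)
    post = pi * like_d / (pi * like_d + (1 - pi) * like_nd)
    # M-step: update prevalence, sensitivities, and specificities.
    pi = post.mean()
    se = (post[:, None] * x).sum(axis=0) / post.sum()
    sp = ((1 - post)[:, None] * ~x).sum(axis=0) / (1 - post).sum()

print("estimated prevalence:    ", round(pi, 3))
print("estimated sensitivities: ", se.round(3), "truth:", se_true)
print("estimated specificities: ", sp.round(3), "truth:", sp_true)
```

With only three tests the model is just identified (seven free cell probabilities and seven parameters), which illustrates why the number of tests required for analysis can limit feasibility, as noted above.20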
The past two decades of controversy and progress in diagnostic test evaluation have largely gone unnoticed by traditional epidemiologists. A quick PubMed search reveals zero relevant diagnostic test evaluation articles in the American Journal of Epidemiology and five articles in EPIDEMIOLOGY since 1990. In contrast, the Journal of Clinical Epidemiology has 15 relevant articles since 2006. Should traditional epidemiologists be concerned about diagnostic test evaluation methods? I would have to say yes. Given that the outcomes used in epidemiological studies are determined by diagnostic tests, epidemiologists must have a deep understanding of the validity of these data. Epidemiologists have a long history of evaluating bias and developing appropriate methods to account for bias. If epidemiology’s methodologists gave attention to diagnostic test evaluation, I am convinced that new and important insights would result. Better evaluation of diagnostic tests would improve the quality of epidemiological studies and simultaneously lead to improved clinical outcomes and public health.
Acknowledgements
This work was supported by NIH grants 5R01AI067913-05 and 5UL1RR025747-04.
References
1. Begg CB. Biases in the assessment of diagnostic tests. Stat Med. 1987;6(4):411–423.
2. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med. 1988;3(5):476–481.
3. Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. J Clin Epidemiol. 2009;62(8):797–806.
4. Schachter J, Stamm WE, Quinn TC, Andrews WW, Burczak JD, Lee HH. Ligase chain reaction to detect Chlamydia trachomatis infection of the cervix. J Clin Microbiol. 1994;32(10):2540–2543.
5. Wiesenfeld HC, Uhrin M, Dixon BW, Sweet RL. Diagnosis of male Chlamydia trachomatis urethritis by polymerase chain reaction. Sex Transm Dis. 1994;21(5):268–271.
6. Vuorinen P, Miettinen A, Vuento R, Hallstrom O. Direct detection of Mycobacterium tuberculosis complex in respiratory specimens by Gen-Probe Amplified Mycobacterium Tuberculosis Direct Test and Roche Amplicor Mycobacterium Tuberculosis Test. J Clin Microbiol. 1995;33(7):1856–1859.
7. Schachter J. Two different worlds we live in. Clin Infect Dis. 1998;27(5):1181–1185.
8. Hadgu A. The discrepancy in discrepant analysis. Lancet. 1996;348(9027):592–593.
9. Hadgu A. Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis. Stat Med. 1997;16(12):1391–1399.
10. Miller WC. Bias in discrepant analysis: when two wrongs don't make a right. J Clin Epidemiol. 1998;51(3):219–231.
11. Miller WC. Can we do better than discrepant analysis for new diagnostic test evaluation? Clin Infect Dis. 1998;27(5):1186–1193.
12. Hilden J. Discrepant analysis--or behaviour? Lancet. 1997;350(9082):902.
13. Schachter J, Stamm WE, Quinn TC. Discrepant analysis and screening for Chlamydia trachomatis. Lancet. 1996;348(9037):1308–1309.
14. Schachter J, Stamm WE, Quinn TC. Discrepant analysis and screening for Chlamydia trachomatis. Lancet. 1998;351(9097):217–218.
15. Hadgu A, Dendukuri N, Wang L. Evaluation of screening tests for detecting Chlamydia trachomatis: bias associated with the patient infected status algorithm. Epidemiology. 2011; in press.
16. Martin DH, Nsuami M, Schachter J, et al. Use of multiple nucleic acid amplification tests to define the infected-patient "gold standard" in clinical trials of new diagnostic tests for Chlamydia trachomatis infections. J Clin Microbiol. 2004;42(10):4749–4758.
17. Chernesky MA, Martin DH, Hook EW, et al. Ability of new APTIMA CT and APTIMA GC assays to detect Chlamydia trachomatis and Neisseria gonorrhoeae in male urine and urethral swabs. J Clin Microbiol. 2005;43(1):127–131.
18. Schachter J, Chernesky MA, Willis DE, et al. Vaginal swabs are the specimens of choice when screening for Chlamydia trachomatis and Neisseria gonorrhoeae: results from a multicenter evaluation of the APTIMA assays for both infections. Sex Transm Dis. 2005;32(12):725–728.
19. Hadgu A, Qu Y. A biomedical application of latent class models with random effects. J R Stat Soc Ser C Appl Stat. 1998;47:603–616.
20. Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics. 1996;52(3):797–810.
21. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics. 1980;36(1):167–171.
22. Dendukuri N, Hadgu A, Wang L. Modeling conditional dependence between diagnostic tests: a multiple latent variable model. Stat Med. 2009;28(3):441–461.
23. Enoe C, Georgiadis MP, Johnson WO. Estimation of sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown. Prev Vet Med. 2000;45(1-2):61–81.
