The paper in this issue by Hripcsak and Wilcox, “Reference Standards, Judges, and Comparison Subjects: Roles for Experts in Evaluating System Performance,”1 is well written and presents a thoughtful analysis of the topic. As the authors acknowledge, however, there is more to the evaluation of clinical informatics systems than can be accomplished through comparison to experts.2,3 Hripcsak and Wilcox focus on “how to use experts in evaluating systems when one needs them,” whereas this commentary focuses on the question, “When should one use experts as part of a system’s evaluation?” The two perspectives are complementary rather than contradictory.
As noted previously,4
System evaluation in biomedical informatics should take place as an ongoing, strategically planned process, not as a single event or small number of episodes. Complex software systems and accepted medical practices both evolve rapidly, so evaluators and readers of evaluations face moving targets. … [C]urrent thinking recognizes that such systems are of value only when they help users to solve users' problems. Users, not systems, characterize and solve clinical diagnostic problems. The ultimate unit of evaluation should be whether the user plus the system is better than the unaided user with respect to a specified task or problem.…
If the ultimate evaluation of a system depends on whether users of the system perform a specified task better when they use the system than when they don't, then there must be public, objective criteria (a “gold standard”) made available before an evaluation begins, to determine the quality of performance of an individual on a task (independent of whether the individual uses a decision-support tool).
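As a purely illustrative sketch of that study design (the cases, diagnoses, and responses below are invented for this commentary, not drawn from any actual evaluation), one might score each subject’s responses against the predefined gold standard twice, once unaided and once aided by the system, and compare the paired accuracies:

```python
# Hypothetical sketch: paired comparison of unaided vs. system-aided performance
# against a gold standard fixed before the evaluation begins. All data are invented.

gold_standard = {            # case id -> diagnosis established by objective criteria
    "case-01": "systemic lupus erythematosus",
    "case-02": "Lyme disease",
    "case-03": "appendicitis",
}

# One subject's diagnoses for the same cases, first without and then with the system.
unaided = {"case-01": "Lyme disease", "case-02": "Lyme disease", "case-03": "appendicitis"}
aided = {"case-01": "systemic lupus erythematosus", "case-02": "Lyme disease", "case-03": "appendicitis"}

def accuracy(responses: dict, standard: dict) -> float:
    """Fraction of cases whose response matches the pre-specified gold standard."""
    correct = sum(1 for case, dx in responses.items() if dx == standard[case])
    return correct / len(standard)

print(f"unaided accuracy: {accuracy(unaided, gold_standard):.2f}")
print(f"aided accuracy:   {accuracy(aided, gold_standard):.2f}")
```

The point of the sketch is only that the scoring criteria exist, publicly and in advance, before either arm of the comparison is run.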
Hripcsak and Wilcox state that experts can be used in three evaluation settings: to generate, through introspection and expertise, the reference standard per se (e.g., by providing a list of “correct” diagnoses or of “correct” therapeutic interventions based on a reading of the problem at hand); to judge (and label) the individual behaviors of subjects in the study—on a scale ranging from “optimal” through “acceptable” to “inadequate”—without providing an absolute list of “correct answers”; and as actual subjects in the study, to make it possible to rate how well the system performs in comparison with the performance of human experts. Hripcsak and Wilcox’s first two scenarios assume that no absolute, independent gold standard is available, so that the experts’ opinions represent the next best metric; in their third scenario, a gold standard must exist against which both experts and study subjects are graded in performance. In each of these three settings, as in any formal, summative evaluation of a clinical informatics system, it is best to compare subjects’ performances with and without the system, no matter what absolute metric of performance is used.
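For the second of these settings, a hedged sketch of how ordinal judge ratings might be aggregated appears below; the scale values, judges, and response identifiers are assumptions made for illustration only, not part of Hripcsak and Wilcox’s method:

```python
# Hypothetical sketch of the "judges" setting: experts label each subject response
# on an ordinal scale rather than against an absolute answer key. Data are invented.

RATING_SCALE = {"optimal": 2, "acceptable": 1, "inadequate": 0}

# judge -> {response id -> ordinal label}
judgements = {
    "judge_A": {"resp-01": "optimal", "resp-02": "acceptable", "resp-03": "inadequate"},
    "judge_B": {"resp-01": "optimal", "resp-02": "inadequate", "resp-03": "inadequate"},
}

def mean_rating(response_id: str) -> float:
    """Average the judges' ordinal ratings for one subject response."""
    scores = [RATING_SCALE[labels[response_id]] for labels in judgements.values()]
    return sum(scores) / len(scores)

for rid in ("resp-01", "resp-02", "resp-03"):
    print(rid, mean_rating(rid))
```

Whatever rating mechanism is chosen, the preceding point still applies: the same subjects should be rated both with and without the system.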
Hripcsak and Wilcox state,
Experience shows that accurate reference standards rarely exist; if it were easy to obtain the correct response, a medical informatics system would be unnecessary.
However, while it is indeed difficult to find clinical settings in which absolute answers are available, system evaluators should first ask, “Can we design an evaluation for the system that involves use of a reliable, objective, external gold standard?” For example, each clinicopathologic conference published in “Case Records of the Massachusetts General Hospital” in the New England Journal of Medicine involves a definitive procedure (laboratory test, biopsy, or autopsy) as a “gold standard” to establish the patient’s correct diagnosis. Some formative evaluations of diagnostic systems in clinical informatics have used such retrospective published cases precisely because they provide an external “gold standard.” By definition, such retrospective studies cannot test system performance “on the front lines” of clinical care provision. When a prospective, summative evaluation of a clinical diagnostic system is required, the evaluation should ideally be performed “on the front lines,” at a time when assistance is truly required and no definitive answer is available.
Rather than using experts as a gold standard, however, it may often be possible to develop a protocol in which patients whose diagnoses are unknown serve as subjects and are then followed closely, by protocol, until a diagnosis is established by objective, predefined criteria.5 If, at the end of the follow-up interval, no diagnosis can be determined by the preset criteria, the case should be labeled “unable to establish/confirm a diagnosis” and dropped from the study. Only when no reliable, external gold standard can be identified should experts’ opinions be used.
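A minimal sketch of the case-disposition rule in such a follow-up protocol is given below, assuming an invented follow-up window and invented field names; the predefined confirmation criteria themselves would of course be clinical, not computational:

```python
# Hypothetical sketch: keep only cases whose diagnosis was confirmed by predefined,
# objective criteria within the follow-up interval; label and drop the remainder.
from dataclasses import dataclass
from typing import Optional

MAX_FOLLOWUP_DAYS = 365              # assumed follow-up interval, for illustration

@dataclass
class Case:
    case_id: str
    days_to_confirmation: Optional[int]   # None if the criteria were never met
    confirmed_dx: Optional[str]

def eligible(case: Case) -> bool:
    """A case enters the analysis only if its diagnosis was confirmed in time."""
    return (case.confirmed_dx is not None
            and case.days_to_confirmation is not None
            and case.days_to_confirmation <= MAX_FOLLOWUP_DAYS)

cohort = [
    Case("p-01", 90, "sarcoidosis"),
    Case("p-02", 400, "lymphoma"),   # confirmed only after the interval ended
    Case("p-03", None, None),        # never confirmed
]

analyzed = [c.case_id for c in cohort if eligible(c)]
dropped = [c.case_id for c in cohort if not eligible(c)]
print("analyzed:", analyzed)
print("unable to establish/confirm a diagnosis:", dropped)
```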
In the area of systems for therapy and prognosis, expert opinions may play a role when randomized controlled studies cannot be carried out. Each patient can follow only one trajectory of responses to interventions; thus, if subjects are allowed to select from a set of potential interventions, even when “real” patient case data are used, one can only hypothesize that an intervention different from the one actually used might have led to a different outcome for the patient. Ideally, only randomized controlled trials that match patients in the intervention group to patients in the control group, with the objective of tracking and comparing specific outcomes, can determine whether clinicians using decision-support tools provide “better” care than the same (or similar) clinicians without such tools. Such studies are difficult at best, requiring large numbers of clinicians and patients and long follow-up intervals. Matching physicians of like abilities between control and intervention groups is arduous; matching patients with “equivalent” degrees of “equivalent” illnesses between intervention and control groups is extremely difficult.
The use of experts can be misleading in the absence of a gold standard. Imagine a scenario in which patient case records are presented to students who are asked to provide diagnoses, and experts' opinions are sought for use as “gold standard” diagnoses. This may be appropriate if the evaluation of the students' diagnoses is aimed at probing their reasoning abilities. However, consider a different situation, in which instead of actual patient data, a computer-based diagnostic knowledge base is used to generate sample “patient cases.” If a case with findings of “fever, arthralgias, skin rash, and abdominal pain” is presented, would one accept the disease template used to generate the findings, systemic lupus erythematosus, as being the “correct” diagnosis? What if an expert panel determined that with the same nonspecific findings, Lyme disease were the “best” diagnosis? In the absence of a pathognomonic weight of evidence, a “definitive” opinion by experts must be taken with at least one grain of salt, since a truly expert opinion would be that the weight of the evidence in the case could not lead to the conclusion of any specific diagnosis. Experts rarely offer such opinions when they are being consulted as experts.
In the current era of evidence-based medicine, the opinions of experts should be tempered by an attempt to measure the “weight of the evidence” that the experts interpret. Even human experts are susceptible to the “garbage in, garbage out” phenomenon.

Randolph A. Miller
References
1. Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc. 2002;9:1–15.
2. Stead WW, Haynes RB, Fuller S. Designing medical informatics research and library-resource projects to increase what is learned. J Am Med Inform Assoc. 1994;1(1):28–34.
3. Wyatt JC, Friedman CP. Evaluation Methods in Medical Informatics. New York: Springer, 1997.
4. Miller RA. Evaluating evaluations of medical diagnostic systems. J Am Med Inform Assoc. 1996;3:429–31.
5. Bankowitz RA, McNeil MA, Challinor SM, et al. A computer-assisted medical diagnostic consultation service: implementation and prospective evaluation of a prototype. Ann Intern Med. 1989;110:824–32.
