Summary
Diagnostic accuracy studies address how well a test identifies the target condition of interest.
Sensitivity, specificity, predictive values and likelihood ratios (LRs) are all different ways of expressing test performance.
Receiver operating characteristic (ROC) curves plot sensitivity against 1 − specificity across a range of cut-off values for predicting a dichotomous outcome. The area under the ROC curve is a further measure of test performance.
None of these parameters is intrinsic to the test; all are determined by the clinical context in which the test is employed.
High sensitivity corresponds to high negative predictive value and is the ideal property of a “rule-out” test.
High specificity corresponds to high positive predictive value and is the ideal property of a “rule-in” test.
LRs convert the pre-test probability of a condition of interest into a post-test probability, and there is some evidence that they are more intelligible to users.
Diagnostic Accuracy Studies
Diagnostic accuracy studies address the agreement between a proposed (index) test and a reference standard in their ability to identify a target condition.1 Their fundamental design is to study a consecutive series of well-defined patients who undergo both the index and reference tests in a blinded fashion.2 Diagnostic accuracy refers to the degree of agreement between the index test and the reference standard.1 The starting point is the construction of a 2 × 2 table with the index test results on one side and those of the reference standard on the other (Table).1
| | Reference standard: disease present | Reference standard: disease absent | Total |
|---|---|---|---|
| Index test positive | True positive (TP) | False positive (FP) | TP + FP |
| Index test negative | False negative (FN) | True negative (TN) | FN + TN |
| Total | TP + FN | FP + TN | TP + FP + FN + TN |
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Positive predictive value (PPV) = TP / (TP + FP)
Negative predictive value (NPV) = TN / (TN + FN)
Positive likelihood ratio (LR+) = sensitivity / (1 – specificity)
Negative likelihood ratio (LR−) = (1 – sensitivity) / specificity
Sensitivity (“positivity in disease”) refers to the proportion of subjects who have the target condition (reference standard positive) and who give positive test results.1 Specificity (“negativity in health”) is the proportion of subjects without the target condition who give negative test results.1 Positive predictive value is the proportion of positive results that are true positives (i.e. subjects who have the target condition), whereas negative predictive value is the proportion of negative results that are true negatives (i.e. subjects who do not have the target condition).1 Predictive values will vary with the prevalence of the target condition in the population being studied, even if the sensitivity and specificity remain the same.1 In the examples discussed below, the positive predictive value of B-type natriuretic peptide (BNP) for identifying congestive heart failure (CHF) is lower in a low-prevalence setting, namely patients being screened in general practice, than among newly presenting breathless patients in the emergency department (ED).
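As a minimal sketch (using hypothetical counts, not data from the studies discussed here), the parameters defined above can be computed directly from the cells of the 2 × 2 table:

```python
def diagnostic_parameters(tp, fp, fn, tn):
    """Compute diagnostic accuracy parameters from 2 x 2 table counts."""
    sensitivity = tp / (tp + fn)              # positivity in disease
    specificity = tn / (tn + fp)              # negativity in health
    ppv = tp / (tp + fp)                      # proportion of positives that are true positives
    npv = tn / (tn + fn)                      # proportion of negatives that are true negatives
    lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv, "LR+": lr_pos, "LR-": lr_neg}

# Hypothetical counts for illustration only
print(diagnostic_parameters(tp=90, fp=30, fn=10, tn=70))
```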
This approach usually requires the creation of a cut-off point from continuous data, and the sensitivity and specificity of a test will vary depending on the cut-off selected. If the cut-off is chosen so that sensitivity increases, specificity will decrease, as discussed in the example below. ROC curves are a way of graphically displaying true positives versus false positives across a range of cut-offs and allow the optimal cut-off for clinical use to be selected.1
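As an illustrative sketch (not taken from the source article), an ROC curve can be generated by sweeping candidate cut-offs over continuous test results with known reference-standard status; the data below are hypothetical:

```python
def roc_points(values, has_condition):
    """Return (1 - specificity, sensitivity) pairs, one per candidate cut-off.

    values: continuous test results (e.g. analyte concentrations).
    has_condition: reference-standard labels (True = target condition present).
    A result is called positive when it is >= the cut-off.
    """
    points = []
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, d in zip(values, has_condition) if d and v >= cutoff)
        fn = sum(1 for v, d in zip(values, has_condition) if d and v < cutoff)
        fp = sum(1 for v, d in zip(values, has_condition) if not d and v >= cutoff)
        tn = sum(1 for v, d in zip(values, has_condition) if not d and v < cutoff)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical test results and reference-standard labels
values = [20, 35, 50, 80, 120, 150, 300, 450]
labels = [False, False, False, True, False, True, True, True]
print(roc_points(values, labels))
```

Lowering the cut-off moves the operating point towards the top right of the curve (higher sensitivity, lower specificity); raising it moves the point towards the bottom left.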
Ultimately, the value of a test depends upon its ability to convert a pre-test probability of a target condition into a post-test probability that will influence a clinical management decision. This can be achieved through the application of LRs, as discussed further below.3 The positive LR is the ratio of the proportion of patients with the target condition who test positive to the proportion of patients without the target condition who also test positive.3 The negative LR is the ratio of the proportion of patients with the target condition who test negative to the proportion of patients without the target condition who also test negative.3 A positive LR >10 or a negative LR <0.1 is considered to produce a change in probability large enough to alter clinical management.3 Fagan’s nomogram allows these changes in probability to be derived graphically.3
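As a hypothetical worked example (the numbers are illustrative, not drawn from the studies below): for a pre-test probability of 25% and a positive LR of 10,

pre-test odds = 0.25 / (1 − 0.25) = 0.33
post-test odds = 0.33 × 10 = 3.3
post-test probability = 3.3 / (1 + 3.3) = 0.77

i.e. a positive result raises the probability of the condition from 25% to 77%, a change large enough to alter management.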
Applied Examples – BNP Studies
BNP is a cardiac peptide secreted from the ventricles in response to volume expansion; plasma levels are elevated in patients with left ventricular (LV) dysfunction and correlate with New York Heart Association class and prognosis.4 The Breathing Not Properly (BNP) study is an example of a diagnostic accuracy study, in which breathless patients newly presenting to the ED were enrolled.5 Patients whose dyspnoea was clearly not secondary to CHF were excluded.5 During the initial evaluation, BNP was compared with a reference-standard diagnosis of CHF made by two cardiologists who reviewed all medical records and independently classified the diagnosis without knowledge of the BNP result. In the ROC curve shown (Figure 1), BNP levels are plotted for their ability to predict CHF (a dichotomous outcome), with true positives (sensitivity) on the vertical axis and false positives (1 − specificity) on the horizontal axis.
Figure 1.
ROC curve for various cut-off levels of BNP in differentiating between dyspnoea due to congestive heart failure and dyspnoea due to other causes. Copyright © 2002 Massachusetts Medical Society. All rights reserved.5
At lower BNP cut-offs, e.g. 50 pg/mL (17 pmol/L), there is higher sensitivity, or better ability to identify patients with CHF, although this is achieved at the cost of lower specificity (i.e. the test falsely identifies more subjects without CHF).5 The corollary of higher sensitivity, however, is higher negative predictive value; in other words, the test performs better as a “rule-out” test and enables the clinician to consider causes of dyspnoea other than CHF. Conversely, higher cut-offs are better at distinguishing dyspnoea due to CHF from dyspnoea due to other causes, giving higher specificity and positive predictive value and hence a better “rule-in” test.
The ROC curve graphically displays the trade-off between sensitivity and specificity and is useful in assigning the best cut-offs for clinical use.3 Overall accuracy is sometimes expressed as the area under the ROC curve (AUC), which provides a useful parameter for comparing test performance between, for example, different commercial BNP assays and the related N-terminal pro-BNP assay.6
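As a sketch of how AUC can be computed, the trapezoidal rule applied to the operating points of an ROC curve gives the area; the points used here are hypothetical:

```python
def auc_trapezoidal(roc_points):
    """Area under an ROC curve by the trapezoidal rule.

    roc_points: (1 - specificity, sensitivity) pairs, one per cut-off.
    """
    # Anchor the curve at (0, 0) and (1, 1) and sum the trapezoids
    pts = [(0.0, 0.0)] + sorted(roc_points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical operating points for illustration
print(auc_trapezoidal([(0.10, 0.60), (0.25, 0.80), (0.50, 0.93)]))  # ~0.83
```

An AUC of 0.5 corresponds to a test no better than chance, while 1.0 corresponds to perfect discrimination.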
The diagnostic parameters of a test are not intrinsic properties but are critically dependent upon the clinical context in which the test is employed, as illustrated in the following example.
The diagnostic accuracy of BNP was evaluated in a General Practice study of elderly patients (mean age of 74 years) who presented with breathlessness. The diagnosis of LV dysfunction was confirmed by transthoracic echocardiography, and a BNP concentration >17.9 pg/mL was considered abnormal.7
BNP was raised in the 40 patients with LV systolic dysfunction compared with those with normal ventricular systolic function.7 At a BNP concentration >17.9 pg/mL, the sensitivity was 88% and the specificity 34% for identification of LV dysfunction.7 The prevalence (or prior probability) of LV dysfunction in this study was 32%, lower than the 47% prevalence of CHF in the ED setting.5 The negative LR for a patient without a history of myocardial infarction and with normal chest radiography and electrocardiogram (ECG) is 0.53, yielding a posterior probability of LV dysfunction of 20%.7 This is derived as follows:
LRs are multiplied by the pre-test odds of a condition to give the post-test odds. A positive LR gives the post-test odds of the condition being present if the test is positive (relative to the chosen cut-off); a negative LR gives the post-test odds of the condition being present if the test is negative (again relative to the chosen cut-off). Doing the calculation long-hand requires converting probability into odds, multiplying by the LR, and then converting back into probability:

Odds = probability / (1 − probability)
Probability = odds / (1 + odds)

In the example above, the prior probability of LV dysfunction in this clinical setting was 32%, giving pre-test odds of 0.32 / 0.68 = 0.47. Multiplying by the negative LR of 0.53 gives post-test odds of 0.47 × 0.53 = 0.25, which converts back to a probability of 0.25 / 1.25 = 0.20. The posterior (or post-test) probability of LV dysfunction is therefore 20% in the presence of a normal ECG and chest radiograph and the absence of a prior myocardial infarction.
When a negative BNP (<17.9 pg/mL) is added to the above combination of tests, the negative LR becomes 0.42 (as opposed to 0.53 without BNP). It is an instructive exercise for the reader to follow the above chain of calculations starting with the given pre-test probability of 32%; with the addition of BNP, the reader should be able to derive a post-test probability of 16%.7
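A minimal sketch of this chain of calculations, reproducing the figures above (the function name is illustrative, not from the source):

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via an LR."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
    post_odds = pre_odds * likelihood_ratio         # multiply by the LR
    return post_odds / (1 + post_odds)              # odds -> probability

# Prior probability of LV dysfunction in this setting was 32%
print(post_test_probability(0.32, 0.53))  # ~0.20: without BNP
print(post_test_probability(0.32, 0.42))  # ~0.165: with a negative BNP, i.e. the 16% quoted
```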
The point of this exercise is to show that adding a BNP measurement to the patient’s history, chest radiography and ECG in the diagnostic screening process reduces the posterior probability from 20% to 16%, a small incremental advantage over that achieved by a combination of clinical history and traditional investigations likely to be undertaken in any case.7 This leaves a residual chance of LV systolic dysfunction of roughly one in six, which is unacceptably high and unlikely to deter a General Practitioner from referring the patient for echocardiography.7 Thus, in the clinical context of General Practice, where the prevalence of CHF is lower than among dyspnoeic patients newly presenting to the ED, the diagnostic performance of BNP is correspondingly lower.
Another way of applying LRs without long-hand calculation is to use Fagan’s nomogram, an example of which is shown in Figure 2.8 The prior probability is indicated on the vertical axis on the left of the nomogram; a line is drawn through the LR value on the middle axis (note the logarithmic scale) and extrapolated to the point where it intercepts the vertical axis on the right, which corresponds to the post-test probability.
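The straight-line construction works because the middle axis is logarithmic: since post-test odds = pre-test odds × LR, it follows that

log(post-test odds) = log(pre-test odds) + log(LR)

so the relationship is additive on the log scale and can be represented by a straight line through the three axes.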
Figure 2.
An example of Fagan’s nomogram.8 The prior probability is indicated on the vertical axis on the left of the nomogram; a line is drawn through the likelihood ratio value on the middle axis (note the logarithmic scale) and extrapolated to the point where it intercepts the vertical axis on the right, which corresponds to the post-test probability. Source: BMJ, 2004, 329, 168-9. Reproduced with permission from the BMJ Publishing Group.
Likelihood Ratios May Be More Intelligible
In conveying the meaning of diagnostic accuracy to clinicians, there is some evidence that LRs expressed in non-technical language are more intelligible and enable a more appropriate interpretation of test results.
General practitioners were asked to estimate the probability of endometrial cancer in a 65-year-old woman with abnormal uterine bleeding, given that the prevalence of endometrial cancer in all women with abnormal uterine bleeding is 10%.9 Participants were given the result of a transvaginal ultrasound scan in one of three different ways: “Transvaginal ultrasound showed a pathological result compatible with cancer”; “Transvaginal ultrasound showed a pathological result compatible with cancer. The sensitivity of this test is 80%, its specificity is 60%”; or “Transvaginal ultrasound showed a pathological result compatible with cancer. A positive result is obtained twice as frequently in women with an endometrial cancer as in women without this disease.” The third version was intended to present the positive LR of the second in non-technical language.

The participants who were not given any information on the test’s accuracy seemed to grossly overestimate the probability of endometrial cancer compared with the other two groups. Those provided with the sensitivity and specificity of the scan overestimated the probability to a lesser degree, and those given the LR in plain language gave the most appropriate estimates.9 Despite a long tradition of reporting diagnostic accuracy in terms of sensitivity and specificity, only a minority of clinicians correctly apply this information. Authors of diagnostic test data have been urged to reconsider the way they communicate their research findings, with more emphasis on LRs. With more structured request forms, it may be possible to elicit prior probabilities of a condition and then use the test values (converted to LRs) to derive post-test probabilities.
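The plain-language statement in the third version corresponds to a positive LR of 2, which can be verified from the stated accuracy figures and used to derive the appropriate post-test estimate:

LR+ = sensitivity / (1 − specificity) = 0.80 / (1 − 0.60) = 2.0
pre-test odds = 0.10 / (1 − 0.10) = 0.11
post-test odds = 0.11 × 2.0 = 0.22
post-test probability = 0.22 / (1 + 0.22) ≈ 0.18

so a positive scan raises the probability of endometrial cancer from 10% to only about 18%.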
Pre-test Probability – the Starting Point
An assignment of pre-test probability is the prerequisite to any decision about whether to undertake a diagnostic test and presupposes that there is diagnostic uncertainty. Related to this are the concepts of test and treatment thresholds.1 If the probability of a condition is sufficiently low (below the test threshold), the condition can be eliminated from the differential diagnosis. Conversely, if the probability is sufficiently high for treatment to be initiated (above the treatment threshold), then testing is not required.1 Where the probability lies between the two thresholds, further diagnostic testing is indicated.1 Where the thresholds are set depends upon the clinical context and clinician preference.1
Quality of Studies and STARD Criteria
Diagnostic accuracy presupposes that studies are rigorous in design and that sources of bias, of which there are many, are avoided. For example, spectrum bias, in which a group of subjects with clinically evident disease is compared with a disease-free group rather than with a consecutive series of patients in whom the test would actually be used, causes diagnostic accuracy to be overestimated (perhaps as much as three-fold), a problem commonly observed with tumour markers. Lijmer et al. give a good review of sources of bias, an essential foundation for any appraisal of the quality of diagnostic studies,10 and the Standards for Reporting of Diagnostic Accuracy (STARD) criteria give a checklist of points to be fulfilled by investigators.11 Beyond diagnostic accuracy is the consideration of whether the performance of a diagnostic test actually influences a clinical outcome, a higher stratum in the pyramid of evidence-based laboratory medicine and beyond the scope of the present review.
Conclusion
Sensitivity and specificity vary with the cut-off chosen for a diagnostic test and are not intrinsic to the test but critically dependent upon the clinical context. ROC curve analysis enables the best cut-off for the clinical purpose to be assigned, with higher sensitivity corresponding to higher negative predictive value, the ideal property of a “rule-out” test. LRs may be a more intelligible way of conveying the properties of a diagnostic test to clinicians and may merit wider adoption in operational practice.
Footnotes
Competing Interests: None declared.
References
1. Matchar DB, Orlando LA. The Relationship Between Test and Outcome. In: Price CP, editor. Evidence-Based Laboratory Medicine: Principles, Practice and Outcomes. 2nd ed. Washington DC, USA: AACC Press; 2007. pp. 53–66.
2. Bossuyt PMM. Studies for Evaluating Diagnostic and Prognostic Accuracy. In: Price CP, editor. Evidence-Based Laboratory Medicine: Principles, Practice and Outcomes. 2nd ed. Washington DC, USA: AACC Press; 2007. pp. 67–81.
3. Boyd JC. Statistical Analysis and Presentation of Data. In: Price CP, editor. Evidence-Based Laboratory Medicine: Principles, Practice and Outcomes. 2nd ed. Washington DC, USA: AACC Press; 2007. pp. 113–40.
4. Maisel A. Circulating natriuretic peptide levels in acute heart failure. Rev Cardiovasc Med. 2007;8(Suppl):S13–21.
5. Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med. 2002;347:161–7. doi: 10.1056/NEJMoa020233.
6. Lainchbury JG, Campbell E, Frampton CM, Yandle TG, Nicholls MG, Richards AM. Brain natriuretic peptide and N-terminal brain natriuretic peptide in the diagnosis of heart failure in patients with acute shortness of breath. J Am Coll Cardiol. 2003;42:728–35. doi: 10.1016/s0735-1097(03)00787-3.
7. Landray MJ, Lehman R, Arnold I. Measuring brain natriuretic peptide in suspected left ventricular systolic dysfunction in general practice: cross-sectional study. BMJ. 2000;320:985–6. doi: 10.1136/bmj.320.7240.985.
8. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329:168–9. doi: 10.1136/bmj.329.7458.168.
9. Steurer J, Fischer JE, Bachmann LM, Koller M, ter Riet G. Communicating accuracy of tests to general practitioners: a controlled study. BMJ. 2002;324:824–6. doi: 10.1136/bmj.324.7341.824.
10. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282:1061–6. doi: 10.1001/jama.282.11.1061.
11. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ. 2003;326:41–4. doi: 10.1136/bmj.326.7379.41.