J Am Med Inform Assoc. 2014 Jan 9;21(5):815–823. doi: 10.1136/amiajnl-2013-001934

Table 1.

Summary of performance measures for Topaz and MedLEE

Performance for primary analyses (all 31 findings)

| Measure | Topaz | MedLEE | p value (Topaz vs MedLEE) |
|---|---|---|---|
| Accuracy | 0.91 (0.90 to 0.91) | 0.90 (0.89 to 0.91) | 0.1650 |
| Accuracy (absent and present) | 0.78 (0.76 to 0.80) | 0.71 (0.70 to 0.73) | <0.0001* |
| Recall for present | 0.80 (0.77 to 0.82) | 0.79 (0.77 to 0.82) | 0.7499 |
| Recall for absent | 0.76 (0.73 to 0.78) | 0.62 (0.59 to 0.65) | <0.0001* |
| Precision for present | 0.85 (0.83 to 0.87) | 0.90 (0.88 to 0.92) | 0.0002* |
| Precision for absent | 0.87 (0.85 to 0.90) | 0.90 (0.88 to 0.93) | 0.0852 |

Performance for secondary analyses (17 influential findings)

| Measure | Topaz | MedLEE | p value (Topaz vs MedLEE) |
|---|---|---|---|
| Accuracy | 0.92 (0.91 to 0.92) | 0.90 (0.89 to 0.91) | 0.0537 |
| Accuracy (absent and present) | 0.75 (0.73 to 0.78) | 0.70 (0.67 to 0.73) | 0.0047* |
| Recall for present | 0.72 (0.69 to 0.76) | 0.77 (0.74 to 0.80) | 0.0453* |
| Recall for absent | 0.81 (0.77 to 0.84) | 0.58 (0.53 to 0.63) | <0.0001* |
| Precision for present | 0.92 (0.90 to 0.94) | 0.92 (0.89 to 0.94) | 0.8100 |
| Precision for absent | 0.89 (0.86 to 0.92) | 0.87 (0.83 to 0.91) | 0.5144 |

95% CIs in parentheses.

For each report, the physicians and the NLP parsers assigned a value (present, absent, or missing) to each of the 31 influenza-related findings. With the physician annotations as the gold standard, we calculated accuracy, recall, and precision as measures of NLP performance, as follows:

Accuracy: (A+E+I)/(A+B+C+D+E+F+G+H+I).

Accuracy (absent and present): (A+E)/(A+B+D+E+G+H).

Recall for present (sensitivity): A/(A+D+G).

Recall for absent (specificity): E/(B+E+H).

Precision for present (positive predictive value): A/(A+B+C).

Precision for absent (negative predictive value): E/(D+E+F).

where the letters A through I denote the numbers of findings in each combination of expert and NLP labels:

|              | Expert: present | Expert: absent | Expert: missing |
|---|---|---|---|
| NLP: present | A | B | C |
| NLP: absent  | D | E | F |
| NLP: missing | G | H | I |
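
For concreteness, here is a minimal Python sketch (not the authors' code) of how these six measures could be computed from paired expert and NLP labels. The label strings "present", "absent", and "missing", the function name, and the requirement that denominators be non-zero are illustrative assumptions.

```python
from collections import Counter

def nlp_performance(expert_labels, nlp_labels):
    """Compute the six measures from paired (expert, NLP) labels, one pair per finding."""
    c = Counter(zip(expert_labels, nlp_labels))  # keys are (expert, nlp) label pairs
    # Map the footnote's letters A-I onto the nine (expert, NLP) cells.
    A, B, C = c[("present", "present")], c[("absent", "present")], c[("missing", "present")]
    D, E, F = c[("present", "absent")],  c[("absent", "absent")],  c[("missing", "absent")]
    G, H, I = c[("present", "missing")], c[("absent", "missing")], c[("missing", "missing")]
    total = A + B + C + D + E + F + G + H + I
    # Denominators are assumed non-zero, as in the study's data.
    return {
        "accuracy": (A + E + I) / total,
        "accuracy (absent and present)": (A + E) / (A + B + D + E + G + H),
        "recall for present": A / (A + D + G),
        "recall for absent": E / (B + E + H),
        "precision for present": A / (A + B + C),
        "precision for absent": E / (D + E + F),
    }
```

Applied to all 6541 report-finding pairs (31 findings × 211 reports) for one parser, such a routine would yield that parser's column of the primary analysis.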

The 95% CI of the empirical distribution was obtained by bootstrapping with replacement (2000 resamples; sample size of 31 findings per report × 211 reports = 6541 findings per resample).
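
The following is a minimal sketch, not the authors' code, of a percentile bootstrap 95% CI for a proportion-type measure (e.g., accuracy) given per-finding 0/1 correctness scores. The function name, fixed seed, and percentile method are illustrative assumptions; the 2000 resamples match the footnote above.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Resample `scores` with replacement and return the (lower, upper) 95% CI of the mean."""
    rng = random.Random(seed)
    n = len(scores)  # e.g., 6541 = 31 findings x 211 reports
    estimates = sorted(
        sum(rng.choices(scores, k=n)) / n  # measure recomputed on each resample
        for _ in range(n_resamples)
    )
    lower = estimates[int(round(n_resamples * alpha / 2))]
    upper = estimates[int(round(n_resamples * (1 - alpha / 2))) - 1]
    return lower, upper
```

For measures such as recall or precision, each resample would instead redraw (expert, NLP) label pairs and recompute the measure from the resampled confusion counts.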

The 17 influential findings identified in BN-EM-Topaz were arthralgia, cervical lymphadenopathy, chill, cough, fever, hoarseness, influenza-like illness, lab confirmed influenza, lab order (nasal swab), malaise, myalgia, rhinorrhea, sore throat, suspected flu, viral infection, viral syndrome, and wheezing. The 95% CI of the empirical distribution was obtained by bootstrapping with replacement (2000 resamples; sample size of 17 influential findings per report × 211 reports = 3587 findings per resample).

Each p value was calculated with a two-sided z test for comparison of two proportions. *p<0.05.
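
As a reference point, this is a minimal sketch, not the authors' code, of a two-sided z test comparing two proportions using a pooled standard error; x1, n1 and x2, n2 (correctly labeled findings and totals for each parser) are illustrative parameter names.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z statistic, two-sided p value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - standard normal CDF(|z|))
    return z, p_value
```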

BN, Bayesian network; NLP, natural language processing.