Table 1.
Summary of performance measures for Topaz and MedLEE
Measures | Topaz, primary (all 31 findings) | MedLEE, primary | p Value, primary (Topaz vs MedLEE) | Topaz, secondary (17 influential findings) | MedLEE, secondary | p Value, secondary (Topaz vs MedLEE)
---|---|---|---|---|---|---
Accuracy | 0.91 (0.90 to 0.91) | 0.90 (0.89 to 0.91) | 0.1650 | 0.92 (0.91 to 0.92) | 0.90 (0.89 to 0.91) | 0.0537 |
Accuracy (absent and present) | 0.78 (0.76 to 0.80) | 0.71 (0.70 to 0.73) | <0.0001* | 0.75 (0.73 to 0.78) | 0.70 (0.67 to 0.73) | 0.0047* |
Recall for present | 0.80 (0.77 to 0.82) | 0.79 (0.77 to 0.82) | 0.7499 | 0.72 (0.69 to 0.76) | 0.77 (0.74 to 0.80) | 0.0453* |
Recall for absent | 0.76 (0.73 to 0.78) | 0.62 (0.59 to 0.65) | <0.0001* | 0.81 (0.77 to 0.84) | 0.58 (0.53 to 0.63) | <0.0001* |
Precision for present | 0.85 (0.83 to 0.87) | 0.90 (0.88 to 0.92) | 0.0002* | 0.92 (0.90 to 0.94) | 0.92 (0.89 to 0.94) | 0.8100 |
Precision for absent | 0.87 (0.85 to 0.90) | 0.90 (0.88 to 0.93) | 0.0852 | 0.89 (0.86 to 0.92) | 0.87 (0.83 to 0.91) | 0.5144 |
95% CIs in parentheses.
For each report, physicians and the NLP parsers assigned a value (present, absent, or missing) to each of the 31 influenza-related findings. Using the physician annotations as the gold standard, we calculated accuracy, recall, and precision for each NLP system as follows:
Accuracy: (A+E+I)/(A+B+C+D+E+F+G+H+I).
Accuracy (absent and present): (A+E)/(A+B+D+E+G+H).
Recall for present (sensitivity): A/(A+D+G).
Recall for absent (specificity): E/(B+E+H).
Precision for present (positive predictive value): A/(A+B+C).
Precision for absent (negative predictive value): E/(D+E+F).
where the letters A through I count findings by the combination of expert (gold standard) label and NLP label:

Expert label \ NLP label | Present | Absent | Missing
---|---|---|---
Present | A | D | G
Absent | B | E | H
Missing | C | F | I
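The six measures can be sketched directly from the nine cell counts; this is a minimal illustration with made-up counts, not the study's data:

```python
# Compute the six performance measures from a 3x3 expert-vs-NLP
# confusion matrix. Cell names A..I follow the definitions above
# (rows: expert label, columns: NLP label; order present/absent/missing).

def nlp_metrics(A, B, C, D, E, F, G, H, I):
    total = A + B + C + D + E + F + G + H + I
    return {
        "accuracy": (A + E + I) / total,                      # all three labels agree
        "accuracy_present_absent": (A + E) / (A + B + D + E + G + H),
        "recall_present": A / (A + D + G),                    # sensitivity
        "recall_absent": E / (B + E + H),                     # specificity
        "precision_present": A / (A + B + C),                 # PPV
        "precision_absent": E / (D + E + F),                  # NPV
    }

# Illustrative placeholder counts:
m = nlp_metrics(A=80, B=5, C=3, D=10, E=60, F=4, G=6, H=8, I=40)
```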
The 95% CI of the empirical distribution was obtained by bootstrapping with replacement (2000 resamples; sample size = 31 findings per report × 211 reports = 6541 per resample).
The 17 influential findings identified in BN-EM-Topaz were arthralgia, cervical lymphadenopathy, chill, cough, fever, hoarseness, influenza-like illness, lab confirmed influenza, lab order (nasal swab), malaise, myalgia, rhinorrhea, sore throat, suspected flu, viral infection, viral syndrome, and wheezing. The 95% CI of the empirical distribution was obtained by bootstrapping with replacement (2000 resamples; sample size = 17 influential findings per report × 211 reports = 3587 per resample).
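The percentile-bootstrap procedure described above can be sketched as follows; the per-(report, finding) label pairs, the toy data, and the `accuracy` metric here are illustrative assumptions, not the study's dataset:

```python
# Percentile-bootstrap 95% CI for a metric computed over a flat list of
# (expert_label, nlp_label) pairs, resampled with replacement.
import random

def bootstrap_ci(pairs, metric, n_boot=2000, seed=0):
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # resample the same number of pairs, with replacement
        sample = rng.choices(pairs, k=len(pairs))
        stats.append(metric(sample))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

def accuracy(sample):
    # fraction of pairs where expert and NLP labels agree
    return sum(e == n for e, n in sample) / len(sample)

# Toy data: 90 agreements, 10 disagreements (point estimate 0.90).
toy = [("present", "present")] * 90 + [("present", "absent")] * 10
lo, hi = bootstrap_ci(toy, accuracy)
```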
Each p value was calculated with a two-sided z test for comparison of two proportions. *p<0.05.
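A pooled two-proportion z test of this kind can be sketched as below; the counts in the example call are illustrative placeholders, not values from the table:

```python
# Two-sided z test for comparing two proportions, using the pooled
# proportion for the standard error.
import math

def two_proportion_z_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example with made-up counts: 50/100 vs 80/100 successes.
z, p_value = two_proportion_z_test(50, 100, 80, 100)
```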
BN, Bayesian network; NLP, natural language processing.