Table 3.
Dataset | Mode of dataset generation | Total questions, n | Unanswered questions, n | Average question length, tokens | Total articles, n | Average article length, tokens |
--- | --- | --- | --- | --- | --- | --- |
emrQA [26] | Semiautomatically generated | 1,295,814 | 0 | 8.6 | 2425 | 3825 |
RxWhyQA [27] | Automatically derived from the n2c2^b 2018 ADEs^c NLP^d challenge | 96,939 | 46,278 | —^a | 505 | — |
Raghavan et al [34] | Human-generated (medical students) | 1747 | 0 | — | 71 | — |
Fan [35] | Human-generated (author) | 245 | 0 | — | 138 | — |
RadQA^e [37] | Human-generated (physicians) | 6148 | 1754 | 8.56 | 1009 | 274.49 |
Oliveira et al [38] | Human-generated (author) | 18 | 0 | — | 9 | — |
Yue et al [42,74] | Trained question generation model paired with a human-in-the-loop | 1287 | 0 | 8.7 | 36 | 2644 |
DiSCQ^f [43] | Human-generated (medical experts) | 2029 | 0 | 4.4 | 114 | 1481 |
Mishra et al [45] | Semiautomatically generated | 6 questions per article | — | — | 568 | — |
Yue et al [46] | Human-generated (medical experts) | 50 | 0 | — | — | — |
CLIFT^g [47] | Validated by human experts | 7500 | 0 | 6.42, 8.31, 7.61, 7.19, and 8.40 for the smoke, heart, medication, obesity, and cancer datasets, respectively | — | 217.33, 234.18, 215.49, 212.88, and 210.16 for the smoke, heart, medication, obesity, and cancer datasets, respectively |
Hamidi and Roberts [48] | Human-generated | 15 | 5 | — | — | — |
Mahbub et al [50] | Combination of manual exploration and rule-based NLP methods | 28,855 | — | 6.22 | 2336 | 1003.98 |
Dada et al [51] | Human-generated (medical student assistants) | 29,273 | Unanswered questions available | — | 1223 | — |
^a Not applicable.
^b n2c2: National NLP Clinical Challenges.
^c ADE: adverse drug event.
^d NLP: natural language processing.
^e RadQA: Radiology Question Answering Dataset.
^f DiSCQ: Discharge Summary Clinical Questions.
^g CLIFT: Clinical Shift.