2025 Apr 8;10:401. Originally published 2021 May 19. [Version 3] doi: 10.12688/f1000research.51117.3

Table 4. Corpora used in the included publications.

RCT, randomized controlled trials; IR, information retrieval; PICO, population, intervention, comparison, outcome; UMLS, unified medical language system.

Publication Also used by Name Description Classes Size/type Availability Note
96 39, 54, 87, 95, 98, 136 Dataset adaptations: 60, 167 PubMedPICO Automatically labelled sentences from structured abstracts up to Aug’17 P, IC, O, Method 24,668 abstracts Yes, https://github.com/jind11/PubMed-PICO-Detection
55 32, 33, 36, 61, 74, 85, 95, 98, 100, 106, 130, 135, 138, 140, 157, 165, 178, 179, Via BLURB-Benchmark: 132, 169 Dataset adaptations: 34, 37, 50, 67, 134, 139, 145 EBMNLP, EBM-PICO Entities P, IC, O + age, gender, and more entities 5,000 abstracts Yes, https://github.com/bepnye/EBM-NLP
97 Entities I and dosage-related 694 abstract/full text Yes, https://ii.nlm.nih.gov/DataSets/index.shtml Domain drug-based interventions
48 Entities P, O, Design, Exposure 60 + 30 abstracts Yes, http://gnteam.cs.manchester.ac.uk/old/epidemiology/data.html Domain obesity
75 Sentence level 90,000 distant supervision annotations, 1000 manual. Target condition, index test and reference standard 90,000 + 1000 sentences Yes (labels, not text), https://zenodo.org/record/1303259 Domain diagnostic tests
52 64 (includes classifiers from), 40, 53, 54, 102, 107–110, 147, 153 NICTA-PIBOSO Structured and unstructured abstracts, multi-label on sentences. P, IC, O, Design 1000 abstracts Yes, https://drive.google.com/file/d/1M9QCgrRjERZnD9LM2FeK-3jjvXJbjRTl/view?usp=sharing Multi-label sentences
47 Sentences Drug intervention and comparative statements for each arm 300 (500 in available data) sentences Yes, https://dataverse.harvard.edu/file.xhtml?fileId=4171005&version=1.0 Domain drug-based interventions
98 Sentences P, IC, O 5099 sentences from references included in SRs, labelled using active-learning Yes, https://github.com/wds-seu/Aceso/tree/master/datasets Domain heart disease
62 based on 111 32, 61, 99, 171. Extending/adapting dataset: 177, 149 Evidence-inference 2.0 Sentences P, I, O Fulltext: 12,616 prompts stemming from 3,346 articles; Abstract-only: 6375 prompts Yes, http://evidence-inference.ebm-nlp.com/download/ Triplets for relation extraction
177 Entities and document-level classifications IC (per arm), O, N (per arm), Other 120 abstracts+results sections from existing corpus Yes, https://github.com/hyesunyun/llm-meta-analysis/tree/main/evaluation/data Extending Evidence Inference 2.0
149 LLM summaries for each entity P, IC (per arm), O, Other 345 RCT summaries created by 3 LLMs from 115 abstracts in Evidence Inference 2.0 Yes, https://utexas.app.box.com/s/mpe5idxrqrzs1wcakphng7xfi7h4g83j Extending Evidence Inference 2.0
61 MS^2 Sentences, Entities P, IC, O 470 studies from 20k reviews, entity labels initially assigned via model trained on EBM-NLP Yes, https://github.com/allenai/ms2 Relation extraction with direction of effect labels
35 Entities P, IC, diagnostic test 500 abstracts and 700 trial records Yes, http://www.lllf.uam.es/ESP/nlpmedterm_en.html Spanish dataset, UMLS normalisations
36 AbstRCT Argument Mining Dataset Entities P, O 660 RCT abstracts Yes, https://gitlab.com/tomaye/abstrct Relation extraction, domains neoplasm, glaucoma, hepatitis, diabetes, hypertension
112 50 Entities P, IC, O, Design 99 RCT abstracts Yes, https://github.com/jetsunwhitton/RCT-ART Excluded for containing only glaucoma studies
34 67, 138, 139 EBM-Comet Entities O 300 abstracts Yes, https://github.com/LivNLP/ODP-tagger Own data + adaptation of EBM-NLP with normalization to 38 domains and 5 outcome-areas
33 Entities I 1807 abstracts, labelled automatically by matching intervention strings from clinical trial registration Yes, https://data.mendeley.com/datasets/ccfnn3jb2x/1
60 137 Sentences P, IC, O 42000 sentences Yes, https://github.com/smileslab/Brain_Aneurysm_Research/tree/master/BioMed_Summarizer Own data on brain aneurysm + existing dataset from Jin and Szolovits 96
74 Sentences, Entities P, IC, O 130 abstracts from MEDLINE's PubMed Online PICO interface Yes, https://github.com/nstylia/pico_entities/
99 150 Entities I,C,O 10 RCT abstracts Yes, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8135980/bin/ocab077_supplementary_data.pdf Relation extraction, domain COVID-19
38 143, 147 CONSORT-TM Sentences P, IC, O, N + CONSORT items 50 Full text RCTs Yes, https://github.com/kilicogluh/CONSORT-TM
82 Entities, Sentences I, C, O + animal entities 400 RCT abstracts in first corpus, 10k abstracts in additional corpus from mined data Yes, https://osf.io/2dqcg/ Domain animal RCTs
51 175, 176 Entities P, I, C, O, other 211 RCT abstracts and 20 full texts Yes, https://zenodo.org/record/6365890
70 Entities N 200 RCT fulltexts from PMC, annotated N from baseline tables Yes, https://zenodo.org/record/6647853#.ZCa9dXbMJPY
63 based on 111 171 Entities I, C, O First corpus 160 abstracts, second corpus 20 Yes, https://github.com/bepnye/evidence_extraction/blob/master/data/exhaustive_ico_fixed.csv Second corpus is domain cancer
174 Entities N (per arm), N (total), Other N abstracts: 847 RCTs train+ test 150 RCTs Yes, https://github.com/windisch-paul/sample_size_extraction/tree/main
150 Entities O, IC (per arm), P 80 COVID-19 RCT abstracts + 229 general RCT abstracts Yes, https://github.com/WengLab-InformaticsResearch/EvidenceMap_Model
145 179 Entities, Sentences P, O, IC (per arm), Sections (Aim; Method etc.) Entities: 150 COVID-19 RCT abstracts + 150 Alzheimer's disease (AD) RCT abstracts. Sentences: 200 COVID-19 and AD each Yes, https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO/tree/main
143 Entities, Sentences Withdrawals or exclusions, Randomisation, Setting, Blinding, N (per arm), N (total), Design, Other 45 PMC full text sections, ti-abs-methods-results Yes, https://github.com/kellyhoang0610/RCTMethodologyIE Possible overlap with CONSORT-TM, earlier version
165 Entities P, IC, O, N (total), Country, Design 30 abstracts from RCT, animal studies, social science studies Yes, appendix of paper and https://github.com/L-ENA/ES-hackathon-GPT-evaluation
152 Entities IC (dose; duration and others), P (Condition or disease), O, Design, N (total), N (per arm) ReMedy database (cancer) and own curated leukemia dataset Partly, leukemia data no, remedy data: https://remedycancer.app.emory.edu/multi-search? Domain cancer
156 Entities, Sentences P, IC, O, N (total), Age, Randomisation, Blinding, Design 10,266 Chinese RCT paragraphs Yes, https://github.com/yizhen-buaa/Annotated-dataset-of-TCM-clinical-literature Traditional Chinese Medicine
166 Entities P, IC (per arm), O, O (primary or secondary outcome), N (total), Exposure, Design 100 abstracts of various study types + 1488 abstracts No Domain nutrition, cardiovascular
172 Entities IC (per arm), P, O, Design 870 clinical studies included in 25 meta-analyses, full texts No Domain cancer
135 Entities IC 940k distantly supervised, 200 manual gold standard No Domain physio/rehabilitation
163 Entities N (per arm), N (total), Randomisation, Other, IC (per arm) 4 NMA reviews with 29 RCT full texts No Prognostic studies
161 Entities IC, IC (dose; duration and others), Age, Design Fulltexts: cancer 16+70; Fabry 26+150 studies from reviews and PubMed. RCT, prognostic, observational No Domain cancer, Fabry disease
157 Entities N (per arm), N (total) 300 COVID-19 RCT abstracts + 100 generic RCT abstracts No
154 Entities P, IC, IC (Drug name), IC (dose; duration and others), Country, O, N (per arm), N (total), Design 245 multiple myeloma abstracts + 115 abstracts across four other cancers No Domain cancer
164 Entities P, IC, O 682,667 abstracts from PubMed, 350 labelled No
137 Sentences P, O, IC COVID-19 dataset, size unclear Domain COVID-19
162 Entities P, IC, O, Diagnostic tests, N (total), Design, Eligibility criteria, Funding org 400 RCT abstracts + 123 abstracts + included studies from 8 Cochrane reviews No
182 131, 180, 181 CHIP 2023 Task 5 Sentences P, IC, O, design 4500 abstracts No Chinese
39 Sentences, Entities P, IC, O 500 labelled abstracts for sentences and 100 for P, O entities No
73 Entities O 1300 abstracts with 3100 outcome statements No Domain cancer
63, 111 EvidenceInference 1.0 Entities Yes, but use EvidenceInference 2.0 https://github.com/jayded/evidence-inference Evidence inference, papers not included for not reporting ICO results
45 Entities P, IC, O Cochrane-provided dataset with 10137 abstracts No
61 113 Sentences and entities P, N, sections 3657 structured abstracts with sentence tags, 204 abstracts with N (total) entities No
57 Structured, auto-labelled RCT abstracts with sentence tags and 378 documents with entity-level IR query-retrieval tags P, IC, O 15,000 abstracts + 378 documents with IR tags No
84 83 (unclear) Sentences and entities IC, O, N (total + per arm) 263 abstracts No
76 53, 58 100 abstracts with P, Condition, IC, possibly on entity level. For O, 633 abstracts are annotated on sentence level. P, Condition, IC, O 633 abstracts for O, 100 for other classes No
77 Entities Age, Design, Setting (Country), IC, N, study dates and affiliated institutions 185 full texts (at least 93 labelled) No
79 Sentences and entities P, IC, Age, Gender, Design, Condition, Race 2000 sentences from abstracts No
93 200 abstracts, 140 contain sentence and entity labels P, IC 200 abstracts No
114 Auto-labelled structured abstracts, sentence level. P, IC, O 14200+ abstracts No
94 Entities P, age, gender, race 50 abstracts No
115 Sentences (and entities?) P, IC, O 3000 abstracts No
42 Entities N (total) 648 abstracts No
90 Entities IC 330 abstracts No
66 Indonesian text with sentence annotations P,I,C,O 200 abstracts No
68 Sentences from 69 (heart) +24 (random) RCTs included in Cochrane reviews Inclusion criteria 69 + 24 full texts No Domain cardiology
80 Sentences and entities P, IC, Age, Gender, P (Condition or disease) 200 abstracts No
71 4,824 sentences from 18 UpToDate documents and 714 sentences from MEDLINE citations for P. For I: CLEF 2013 shared task, and 852 MEDLINE citations P, IC, P (Condition or disease) abstracts, full texts No General topic and cardiology domain
41 102 Entity annotation as noun phrases O, IC 100 + 132 sentences from full texts No Diabetes and endocrinology journals as source
92 103 Auto-labelled structured RCT abstract sentences. 92 has 19,854 sentences, assumed same corpus as authors and technique are the same. P, IC, O 23,472 abstracts No
46 RCT abstracts and full texts: 132 + 50 articles IC (per arm), IC (drug entities), O (time point), O (primary or secondary outcome), N (total), Eligibility criteria, Enrolment dates, Funding org, Grant number, Early stopping, Trial registration, Metadata 132 + 50 abstracts and full texts No
86 Sentences and entities P, IC, O, N (per arm + total) 48 full texts No
49 Studies from 5 systematic reviews on environmental health exposure, entities P, O, Country, Exposure Studies from 5 systematic reviews No Observational studies on environmental health exposure in humans
44 Labelled via supervised distant supervision. Full texts (~12500 per class), 50 + 133 manually annotated for evaluation. P, IC, O 12700+ full texts No
89 Sentence labels, structured & unstructured abstracts. Manually annotated: 344 IC, 341 O, and 144 P and more derived by automatic labelling. P, IC, O 344+ abstracts No
88 Entities P, IC, O, O as "Instruments" or "Study Variables" 20 full texts/abstracts No
85 Entities (Brat, IOB format) P, IC, O 170 abstracts No
59 Entities assigned to UMLS concepts (probably Cochrane corpus, size unclear). '88 instances, annotated in total with 76, 87, and 139 [P, IC, O respectively]' P, IC, O Unclear, at least 88 documents No
43 Sentences and entities P, IC (per arm), N (total) 1750 title or abstracts No
116 Excluded paper, no data extraction system. Corpus of Patient, Population, Problem, Exposure, Intervention, Comparison, Outcome, Duration and Results sentences in abstracts. No Excluded from review, but describes relevant corpus
56 Sentences and entities P, IC (per arm), O, multiple more 88 full texts No