2020 Oct 23;22(10):e19810. doi: 10.2196/19810

Table 3.

Dataset preparation protocols.

Columns: Preparation protocol; PQR^a dataset (D1); CCA^b dataset (D2)

Description
D1: This dataset was created for the quality assessment of biomedical studies related to the prognosis of brain aneurysm.
D2: This dataset was curated for PICO^c sequence classification. The final dataset was specific to the prognosis of brain aneurysm.

Purpose
D1: To select only published documents that are scientifically rigorous for final summarization.
D2: To identify a sentence or a group of sentences for discovering the clinical context in terms of population, intervention, and outcomes.

Methods
D1: Manual dataset preparation is cumbersome, so AI^d models were used. Developing an AI model requires a large set of annotated documents, and annotation is tedious; therefore, the PubMed Clinical Queries narrow filter was used as a surrogate to obtain scientifically rigorous studies.
D2: N/A^e

Data sources
D1: PubMed database (for positive studies, the “Narrow[filter]” parameter was enabled).
D2: First, we collected a publicly available dataset, BioNLP 2018 [21], which was classified based on the PICO sequence in addition to “Method” and “Results” elements. To increase the dataset size, we added more sentences related to brain aneurysm, created from Medline abstracts retrieved from the NCBI^f PubMed service using the Biopython Entrez library [53].

Query
D1: The term “(Prognosis/Narrow[filter]) AND (intracranial aneurysm)” was used as the query string.
D2: The term “intracranial aneurysm” (along with its synonyms “cerebral aneurysm” and “brain aneurysm”) was used as the query string.

Size
D1: 2686 documents, including 697 positive (ie, scientifically rigorous) records.
D2: A total of 173,000 PICO sequences (131,000 BioNLP + 42,000 brain aneurysm) were included in the dataset.

Inclusion/exclusion
D1: Only studies that were relevant and met the “Prognosis/Narrow[filter]” criteria were included in the positive set. The other relevant studies not in the positive set were included in the negative set. All other studies were excluded from the final dataset.
D2: Only structured abstracts in which at least one of the PICO elements was identified were considered for text sequence extraction.

Study types
D1: RCTs^g, systematic reviews, and meta-analyses of RCTs were given more importance.
D2: RCTs, systematic reviews, and meta-analyses of RCTs were given more importance.

^a PQR: prognosis quality recognition.

^b CCA: clinical context–aware.

^c PICO: Patient/Problem, Intervention, Comparison, Outcome.

^d AI: artificial intelligence.

^e N/A: not applicable.

^f NCBI: National Center for Biotechnology Information.

^g RCT: randomized controlled trial.
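The D1 query and inclusion/exclusion rows can be illustrated with a minimal sketch. It only builds a PubMed ESearch request URL with the Python standard library (no network call is made) and labels documents by set membership. This is not the authors' pipeline: the table names the Biopython Entrez library for D2 retrieval, whereas this sketch swaps in plain URL construction against the public NCBI E-utilities ESearch endpoint. The function names `esearch_url` and `label_documents`, the `retmax` value, and the idea that "relevant" studies come from the unfiltered topic query are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string for the positive set, as given in the table (D1).
NARROW_QUERY = "(Prognosis/Narrow[filter]) AND (intracranial aneurysm)"
# Assumption: the broader "relevant" set is the topic query without the filter.
TOPIC_QUERY = "intracranial aneurysm"


def esearch_url(term, retmax=10000):
    """Build (but do not fetch) an ESearch URL that searches PubMed for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def label_documents(relevant_pmids, narrow_pmids):
    """Label each relevant PMID: 1 if it also passed the narrow filter, else 0.

    PMIDs that are not in `relevant_pmids` are left out entirely, mirroring
    "all other studies were excluded from the final dataset".
    """
    narrow = set(narrow_pmids)
    return {pmid: int(pmid in narrow) for pmid in relevant_pmids}


if __name__ == "__main__":
    print(esearch_url(NARROW_QUERY))
    print(esearch_url(TOPIC_QUERY))
    # Toy PMIDs, purely illustrative.
    print(label_documents(["101", "102", "103"], ["102"]))
```

Under this reading, the sizes reported for D1 (2686 documents, 697 of them positive) would imply 2686 − 697 = 1989 negative records.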