Abstract
Objective
To develop a computable representation for medical evidence and to contribute a gold standard dataset of annotated randomized controlled trial (RCT) abstracts, along with a natural language processing (NLP) pipeline for transforming free-text RCT evidence in PubMed into the structured representation.
Materials and methods
Our representation, EvidenceMap, consists of 3 levels of abstraction: Medical Evidence Entity, Proposition and Map, to represent the hierarchical structure of medical evidence composition. Randomly selected RCT abstracts were annotated following EvidenceMap based on the consensus of 2 independent annotators to train an NLP pipeline. Via a user study, we measured how the EvidenceMap improved evidence comprehension and analyzed its representative capacity by comparing the evidence annotation with EvidenceMap representation and without following any specific guidelines.
Results
Two corpora including 229 disease-agnostic and 80 COVID-19 RCT abstracts were annotated, yielding 12 725 entities and 1602 propositions. EvidenceMap saves users 51.9% of the time compared to reading raw-text abstracts. Most evidence elements identified during the freeform annotation were successfully represented by EvidenceMap, and users gave the enrollment, study design, and study results sections mean 5-scale Likert ratings of 4.85, 4.70, and 4.20, respectively. The end-to-end evaluations of the pipeline show that the evidence proposition formulation achieves an F1 score of 0.84 and an adjusted Rand index of 0.86.
Conclusions
EvidenceMap extends the participant, intervention, comparator, and outcome framework into 3 levels of abstraction for transforming free-text evidence from the clinical literature into a computable structure. It can be used as an interoperable format for better evidence retrieval and synthesis and an interpretable representation to efficiently comprehend RCT findings.
Keywords: knowledge representation, corpus annotation, randomized controlled trial, evidence-based medicine, natural language processing, medical literature analysis and retrieval system
INTRODUCTION
Randomized controlled trials (RCTs) are the long-time gold standard for generating high-quality medical evidence to support evidence-based medicine (EBM).1 However, most such evidence exists as peer-reviewed RCT publications, whose free-text format presents formidable information overload challenges for EBM practitioners, causing evidence to be underused in practice.2
To bring the free-text evidence from the literature into practice in a more timely and efficient manner, natural language processing (NLP) techniques have been leveraged to parse evidence text and generate machine-interpretable evidence bases.3,4 Several rule-based and machine learning NLP methods5–12 were proposed to extract PICO (ie, participant, intervention, comparator, and outcome) entities as medical evidence elements from RCT publications. However, the PICO framework has limitations in encoding clinical information needs, which can reduce the quality of clinical evidence retrieved under its representation.13 For example, a clinical question asking, “Is the combination of rivaroxaban and aspirin effective for stroke patients with high BMI?” might fail to retrieve the correct evidence assertion “the [combination of rivaroxaban and aspirin]Intervention compared with aspirin produced a consistent reduction in the primary outcome of [cardiovascular death]Outcome, [stroke]Outcome, or [myocardial infarction]Outcome, irrespective of BMI or body weight” (PMID 33538248), because the PICO framework lets us identify the Intervention and Outcome entities but not the relation “consistent reduction”.
Recent research, such as Evidence Inference14,15 and Trialstreamer,16 tries to directly extract evidence snippets regarding interventions and outcomes from full-text publications. However, the extracted evidence snippets still require additional knowledge to be computable. For example, Trialstreamer represents the evidence from the article PMID 34582477 as 6 Population, 2 Intervention, and 5 Outcome entities plus one sentence as the key finding (Supplementary Material S2b). Trialstreamer might help users understand the evidence in this article, but it makes related work hard to summarize because all the entities are listed independently without indicating their relations and the “key finding” is not in a structured form.
To address these unmet needs, we proposed a novel knowledge representation, EvidenceMap, for representing RCT evidence with 3 levels of abstraction: Medical Evidence Entity, Proposition, and Map. Evidence Entity is used to represent PICO elements, which makes EvidenceMap expressive and compatible with existing methods. Evidence Proposition formulates evidence assertion by using the semantic dependency relationships among entities. It provides a relational and logical retranslation for the medical evidence and thus can largely improve the precision of evidence retrieval. Finally, Evidence Map clusters and links all related evidence assertions to construct an evidence knowledge graph. It provides a portable way for efficient evidence comprehension and synthesis.
We further performed user evaluations of the EvidenceMap using 10 randomly selected RCT abstracts and demonstrated its effectiveness and efficiency in representing medical evidence and its potential for assisting medical evidence retrieval, synthesis and comprehension. Finally, we created a sharable corpus including 309 RCT abstracts annotated by domain experts, which can be used to train and evaluate machine learning models for automated evidence extraction. We also developed an NLP pipeline to transform free-text evidence into EvidenceMap representation.
METHODS
EvidenceMap: a 3-level evidence representation
We follow 3 desiderata to develop the 3-level representation EvidenceMap (Figure 1): interoperable with RCT reports; interpretable and expressive; and competent for evidence retrieval, appraisal, and comprehension.
Figure 1.
EvidenceMap—a 3-level representation for unstructured medical evidence. Abbreviations: M: measure; C: count; RCT: randomized controlled trial.
Level 1: medical evidence entity
Inspired by the PICO framework, medical evidence entity defines the essential information elements in clinical trials:
Participant: the characteristics of the enrolled population, such as “Obese Patients”;
Intervention/Comparison: the primary intervention considered, such as “combination of rivaroxaban and aspirin” and to what the intervention is compared, such as “aspirin alone” or “placebo”;
Outcome: the anticipated measures for determining the effect of the intervention on participants, such as “cardiovascular death” or “stroke”.
In addition, Observation entities are defined with 2 types of elements17:
Measure: quantitative or qualitative outcome measures (including negative observation(s)) that directly represent the effectiveness of the Intervention, such as “significantly reduced” or “no difference”;
Count: count of participants for an outcome measure, that describes the effectiveness of Intervention as an attribute to Participant, such as “24 (9%) patients”.
Level 2: medical evidence proposition
Based on the property of the language,18,19 we define a medical evidence proposition (MEP) as a directional dependency among Intervention, Observation, and Outcome entities. Dependency is a one-to-one correspondence. For each entity A in the sentence, there is a maximum of one corresponding entity B. Any other entities in relation to A are considered independent (example in Supplementary Figure S1). In this work, an MEP represents a clinically meaningful statement and consists of 2 consecutive dependencies: from Intervention to Observation and from Observation to Outcome. Participant entities are not linked with MEP because they are usually defined at the study level. An MEP can have 2 interpretations: (1) a hypothesis/design of an outcome if it is extracted from the method section or (2) a finding/observation if it is from the result section.
Figure 2 shows an example of a structured medical evidence snippet. We annotated 6 Medical Evidence Entities and their dependencies and then formulated 2 MEPs:
Figure 2.
Example of a randomized control trial abstract (PMID 33470369) snippet represented by medical evidence propositions. Medical evidence propositions represent the relationships between the medical evidence entities within a sentence. (Green: Intervention entity; Grey: Observation entity; and Yellow: Outcome entity).
Intervention: “70% alcohol”, Measure: “more effective in reducing”, Outcome: “number of CFU/cm2”
Intervention: “two groups”, Measure: “no difference”, Outcome: “skin colonization”.
The example sentence is retrieved from the results section; thus, the MEPs represent 2 observed findings. The first asserts that “70% alcohol” is more effective at reducing the “number of CFU/cm2”. The second asserts that there is no difference for “skin colonization” between the 2 groups.
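The first two levels can be sketched as a small data structure. This is a minimal illustration rather than the authors' implementation: the class and field names (`Entity`, `Proposition`, `context`) are assumptions introduced here, and the values are taken from the Figure 2 example.

```python
from dataclasses import dataclass
from typing import Literal

EntityType = Literal["Participant", "Intervention", "Outcome", "Measure", "Count"]


@dataclass(frozen=True)
class Entity:
    """A Level 1 medical evidence entity (hypothetical class name)."""
    text: str
    type: EntityType


@dataclass(frozen=True)
class Proposition:
    """A Level 2 MEP: Intervention -> Observation -> Outcome."""
    intervention: Entity
    observation: Entity  # a Measure or a Count entity
    outcome: Entity
    context: Literal["methods", "results"]  # section the sentence came from

    def interpretation(self) -> str:
        # Methods-section MEPs are read as hypotheses/designs,
        # results-section MEPs as findings/observations.
        return "hypothesis" if self.context == "methods" else "finding"


mep_1 = Proposition(
    Entity("70% alcohol", "Intervention"),
    Entity("more effective in reducing", "Measure"),
    Entity("number of CFU/cm2", "Outcome"),
    context="results",
)
mep_2 = Proposition(
    Entity("two groups", "Intervention"),
    Entity("no difference", "Measure"),
    Entity("skin colonization", "Outcome"),
    context="results",
)
```

Participant entities are deliberately left out of `Proposition`, mirroring the statement that they are defined at the study level.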
Level 3: medical evidence map
Medical evidence map provides a summarization of 3 major components of the RCT report: participant enrollment, study design, and study results. It is composed of the medical evidence entities and their corresponding MEPs (Figure 3). First, coreference resolution is performed on the medical evidence entities such that different terms referring to the same entity are represented in the Evidence Map by a single term. For example, “continuous positive airway pressure (CPAP)”, “CPAP”, and “prehospital CPAP” are different terms representing the same Intervention entity in the article (PMID 33771819; Supplementary Material S6d). Then, the evidence entities and MEPs are contextually merged to form the Map. The Map groups Participant entities together in the “Enrollment” section to profile the study population. The “Study Design” section is created by merging MEPs from sentences describing the background, objectives, and/or methods (ie, methods context). Within the Study Design section, if no MEP contains the Comparison Entity, it will be inferred from the full abstract (eg, “Placebo” if a placebo entity exists within the abstract, else “Control”). Similarly, the “Study Results” section is created by merging MEPs from sentences describing the results, findings, discussion, and/or conclusions (ie, results context). MEPs that further explain the findings contained in another MEP (eg, quantitative details supporting a finding) are excluded from the Map for simplicity.
Figure 3.
Example Medical Evidence Map. Medical Evidence Map for the randomized control trial abstract (PMID 33470369) (Blue: Participant entity; Green: Intervention entity; Grey: Observation entity; and Yellow: Outcome entity).
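The Map assembly steps described above (alias merging, grouping by section context, and the comparator fallback) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the tuple layout of an MEP, and the alias dictionary are hypothetical, coreference resolution is assumed to have been done already, and the `<...>` strings are placeholders rather than content from any article.

```python
def infer_comparator(abstract_entity_texts):
    """Fallback when no Study Design MEP names a comparison entity."""
    has_placebo = any("placebo" in text.lower() for text in abstract_entity_texts)
    return "Placebo" if has_placebo else "Control"


def build_evidence_map(participants, meps, aliases, abstract_entity_texts,
                       explicit_comparator=None):
    """Assemble the three Map sections from entities and MEPs.

    meps: list of (intervention, observation, outcome, context) tuples,
          where context is "methods" or "results".
    aliases: maps each surface form to the single term used in the Map.
    """
    def canonical(term):
        return aliases.get(term, term)

    evidence_map = {
        "Enrollment": list(participants),
        "Study Design": [],
        "Study Results": [],
    }
    for intervention, observation, outcome, context in meps:
        section = "Study Design" if context == "methods" else "Study Results"
        evidence_map[section].append(
            (canonical(intervention), observation, canonical(outcome))
        )
    evidence_map["Comparison"] = explicit_comparator or infer_comparator(
        abstract_entity_texts
    )
    return evidence_map


# Alias terms from the CPAP example; all other values are placeholders.
aliases = {
    "continuous positive airway pressure (CPAP)": "CPAP",
    "prehospital CPAP": "CPAP",
}
meps = [
    ("continuous positive airway pressure (CPAP)", "<observation>", "<outcome>", "methods"),
    ("prehospital CPAP", "<observation>", "<outcome>", "results"),
]
evidence_map = build_evidence_map(["<participant>"], meps, aliases, ["prehospital CPAP"])
```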
EvidenceMap corpora generation
We developed 2 corpora to facilitate building NLP models with EvidenceMap. The MeSH index “Randomized Control Trials” (D016449) was used to retrieve articles that report the design and results of RCTs from PubMed. Annotation was conducted on the titles and abstracts using the guidelines specified in the Supplementary Material.20 The annotation team consisted of one clinician (JK) and one informatics student (TK). The overall goal of annotation, the scope, and detailed rules were discussed in an iterative process until the team reached a consensus. All annotations were conducted using Brat.21 First, entities were identified and classified as Participant, Intervention, Outcome, Measure, or Count. Then, dependencies were annotated between Intervention → Measure/Count and Measure/Count → Outcome. As such, MEPs can be formulated by joining the directional relations connecting the Intervention, Measure/Count, and Outcome entities. For instance, in line #9 of Figure 4, the relation link “brovincamine → 0.004 (0.016) dB/year → Changes in CPSD (SE)” can formulate the MEP: Intervention: “brovincamine”, Measure: “0.004 (0.016) dB/year”, Outcome: “Changes in CPSD (SE).”
Figure 4.
Annotation example of Medical Evidence Entity and Dependency (PMID 10209728). By default, “Observation” labels refer to “Measure” labels, which appear more frequently. “Count” observations are specifically annotated as such.
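Joining the two kinds of annotated dependency into MEPs can be sketched as below. This is a minimal sketch, assuming each link is given as a pair of entity strings; real annotations would more likely use entity identifiers, since the same text can occur more than once in an abstract. The function name `formulate_meps` is hypothetical.

```python
def formulate_meps(intervention_to_observation, observation_to_outcome):
    """Join Intervention -> Observation and Observation -> Outcome links
    into (intervention, observation, outcome) triples."""
    # Each observation is assumed to point to at most one outcome.
    outcome_of = dict(observation_to_outcome)
    return [
        (intervention, observation, outcome_of[observation])
        for intervention, observation in intervention_to_observation
        if observation in outcome_of
    ]


# Links from line #9 of the Figure 4 example.
meps = formulate_meps(
    [("brovincamine", "0.004 (0.016) dB/year")],
    [("0.004 (0.016) dB/year", "Changes in CPSD (SE)")],
)
# meps == [("brovincamine", "0.004 (0.016) dB/year", "Changes in CPSD (SE)")]
```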
An NLP pipeline for EvidenceMap
With the annotated corpus, we developed an NLP pipeline consisting of 3 modules: evidence entity recognition, relation extraction, and MEP clustering. We treated evidence entity recognition as a named entity recognition (NER) task and extracted 5 entity types: Participant, Intervention, Observation, Count, and Outcome (Supplementary Figure S2). We adopted a deep learning architecture that achieves state-of-the-art performance on this task and used the data augmentation technique.22 It includes a bidirectional long short-term memory (LSTM) network with a conditional random field (CRF) layer for sequence tagging, and a BERT model that was initialized with weights from BlueBERT and fine-tuned on our corpora.23
For relation extraction, we treated it as a binary relation extraction task between “Intervention” and “Observation” entities or between “Observation” and “Outcome” entities. An MEP can then be formulated by linking a set of dependent “Intervention”, “Observation”, and “Outcome” entities. Supplementary Figure S3 shows the model structure, and Supplementary Table S1 shows an example with 9 possible pairs of entities and labels for dependency parsing. We adopted the same architecture as for NER and fine-tuned it on our corpora, except that, instead of using a CRF layer to decode entities, a dense layer with Softmax is used to predict a binary label indicating whether there is a dependency between each possible pair of evidence entities. The negation status was detected for each MEP using NegEx.24 For example, in “There was no evidence that hydroxychloroquine reduced symptom duration” (PMID 34145052), the evidence proposition “hydroxychloroquine reduced symptom duration” was labeled as negated.
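How the candidate pairs for the binary classifier might be enumerated can be sketched as follows. This is an illustration only: the function name and input layout are assumptions, the entities are those of the Figure 2 example, and the sketch stops at producing labeled pairs (it does not include the BERT-based classifier or NegEx).

```python
from itertools import product


def candidate_pairs(entities):
    """Enumerate every Intervention-Observation and Observation-Outcome
    pair in a sentence. `entities` is a list of (text, type) tuples."""
    by_type = {}
    for text, entity_type in entities:
        by_type.setdefault(entity_type, []).append(text)
    interventions = by_type.get("Intervention", [])
    observations = by_type.get("Observation", [])
    outcomes = by_type.get("Outcome", [])
    return list(product(interventions, observations)) + list(
        product(observations, outcomes)
    )


sentence_entities = [
    ("70% alcohol", "Intervention"),
    ("more effective in reducing", "Observation"),
    ("number of CFU/cm2", "Outcome"),
    ("two groups", "Intervention"),
    ("no difference", "Observation"),
    ("skin colonization", "Outcome"),
]
# Dependencies implied by the two MEPs of the Figure 2 example.
gold_dependencies = {
    ("70% alcohol", "more effective in reducing"),
    ("more effective in reducing", "number of CFU/cm2"),
    ("two groups", "no difference"),
    ("no difference", "skin colonization"),
}

pairs = candidate_pairs(sentence_entities)
labels = [1 if pair in gold_dependencies else 0 for pair in pairs]
# 8 candidate pairs, 4 labeled dependent and 4 labeled independent
```

Because every cross-type pair is a candidate, independent pairs tend to outnumber dependent ones as sentences get longer, which is consistent with the class imbalance visible in Table 1.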
By merging MEPs that belong to the same study arm, we formulated the Evidence Map of each study. All Intervention entities were converted to embedding vectors using the NER model. K-means clustering method was then used to group MEPs. For 2-arm clinical trial reports, K equals 2. As shown in Supplementary Figure S4, we randomly selected one MEP and used the 2 attached Intervention terms as the seed centroid for each arm respectively. Next, we assigned each MEP to the closest centroid by Intervention entities and recomputed the centroids using the current cluster memberships. We repeated the above process until it converged.
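The seeded clustering loop described above can be sketched in plain Python. This is a minimal sketch, not the authors' code: the 2-dimensional vectors are made-up stand-ins for the Intervention embeddings produced by the NER model, and Euclidean distance is an assumption, since the distance measure is not stated.

```python
import math


def seeded_kmeans(vectors, seeds, max_iter=100):
    """K-means with user-supplied initial centroids (one per study arm)."""
    centroids = [list(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign each MEP's Intervention vector to the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # memberships stopped changing: converged
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Toy embeddings: three MEPs near one arm, two near the other (hypothetical).
vectors = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0], [0.8, 0.2]]
# Seeds: the two Intervention terms attached to one selected MEP.
assignment, centroids = seeded_kmeans(vectors, seeds=[vectors[0], vectors[2]])
# assignment == [0, 0, 1, 1, 0]
```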
EvidenceMap evaluations
We retrieved 10 RCT abstracts25–34 from PubMed to evaluate EvidenceMap. KS and NM first annotated Evidence Maps based on the EvidenceMap annotation guidelines. To evaluate the expressiveness of EvidenceMap, KS and NM annotated all evidence elements they determined to be germane to evidence evaluation (eg, PICO entities and evidence assertions) without following any specific guidelines. All results were reviewed by a third expert (YS), and any disagreements were resolved after discussion. The discrepancies between the EvidenceMap and freeform evidence annotations were assessed to determine the potential limitations of EvidenceMap.
To evaluate the helpfulness of EvidenceMap, 4 domain experts (KS, YS, YZ, MW) used a 5-scale Likert rating system (very helpful [5]–very unhelpful [1]) to rate each section (Enrollment, Study Design, and Study Results) by how the visualized evidence map facilitates comprehension of the evidence. Helpfulness was assessed according to the representation’s clarity and success in communicating study findings relative to understanding the evidence directly from the source abstract. Supplementary Table S2 provides detailed evaluation criteria. To determine the limitations of EvidenceMap, the reviewers were also asked to provide explanations for scores below 5.
To evaluate the efficiency of EvidenceMap in helping users comprehend medical evidence, we measured the time consumed by 4 users (KS, YS, YZ, and MW) compared against reading the abstract raw text and 2 other state-of-the-art representations, “Evidence Inference”15 and “Trialstreamer”.16 Evidence Inference represents each MEP with PICO entities and a predicative label indicating that the intervention significantly increased, significantly decreased, or realized no significant difference, relative to the comparator and with respect to the outcome. YS manually annotated each article with the Evidence Inference representation. Trialstreamer represents each RCT abstract with lists of PICO entities, sample size, risk of bias, and a key finding summary. We acquired the Trialstreamer representation from trialstreamer.robotreviewer.net. To control for any time usage discrepancies caused by the reading order, we evaluated the representations for each RCT abstract in a random order. KS and YS, who were involved in creating the Evidence Map and Evidence Inference representations, performed the helpfulness and efficiency evaluations 4 months after performing the annotation.
To evaluate the NLP pipeline, we trained 2 sets of models: one disease-agnostic from the “General” corpus, and the other fine-tuned on the COVID-19 corpus. Each set of models contains an entity recognition model and a relation extraction model. We reported the performance of entity recognition and relation extraction and the end-to-end performance for MEP formulation and clustering. The evaluation results are reported using the standard precision, recall, and F1 metrics, with equality defined by both strict matching and partial matching. Strict matching holds between any 2 MEPs when (1) the entities are fully matched, (2) the entity types are the same, and (3) the dependencies are the same. Partial matching is defined by relaxing the requirement for entity text span matching. For example, 2 entities “significantly higher” and “higher” were considered partially matched but not strictly matched because they have overlapping word(s). For MEP clustering, the adjusted Rand index (ARI) is used.35 All models were implemented using TensorFlow 2.6 and trained on 4 × NVIDIA GeForce RTX 2080 Ti GPUs. To accelerate execution, we concatenated sentences from RCT abstracts and processed them in batch mode.
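The difference between strict and partial matching can be sketched as follows. This is one possible reading of the definitions above, not the authors' evaluation script: partial matching is approximated as sharing at least one word, matches are counted without enforcing a one-to-one alignment, and the intervention and outcome strings are hypothetical.

```python
def entity_match(pred, gold, strict):
    """Compare two (text, type) entities."""
    (pred_text, pred_type), (gold_text, gold_type) = pred, gold
    if pred_type != gold_type:
        return False
    if strict:
        return pred_text == gold_text
    # Partial: the two spans share at least one word.
    return bool(set(pred_text.lower().split()) & set(gold_text.lower().split()))


def mep_match(pred, gold, strict):
    """An MEP is an (intervention, observation, outcome) triple of entities."""
    return all(entity_match(p, g, strict) for p, g in zip(pred, gold))


def precision_recall_f1(predicted, gold, strict=True):
    matched_pred = sum(any(mep_match(p, g, strict) for g in gold) for p in predicted)
    matched_gold = sum(any(mep_match(p, g, strict) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(("drug A", "Intervention"),
         ("significantly higher", "Observation"),
         ("response rate", "Outcome"))]
predicted = [(("drug A", "Intervention"),
              ("higher", "Observation"),
              ("response rate", "Outcome"))]

strict_scores = precision_recall_f1(predicted, gold, strict=True)    # (0.0, 0.0, 0.0)
partial_scores = precision_recall_f1(predicted, gold, strict=False)  # (1.0, 1.0, 1.0)
```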
RESULTS
EvidenceMap corpora
Two corpora were annotated in accordance with EvidenceMap (Table 1). The “General” corpus covers a broad range of disease domains and consists of 229 randomly selected RCT article abstracts. The “COVID-19” corpus includes 80 randomly selected COVID-19 RCT article abstracts to accommodate the recent demand for evidence retrieval and synthesis resources related to the pandemic. We split each corpus into training (60%), validation (20%), and test (20%) sets.
Table 1.
Descriptive statistics of the General and COVID-19 corpora
| Category | Type | General: Train | General: Dev | General: Test | General: Total | COVID-19: Train | COVID-19: Dev | COVID-19: Test | COVID-19: Total |
|---|---|---|---|---|---|---|---|---|---|
| Evidence entity | Population | 765 | 255 | 255 | 1275 | 338 | 107 | 118 | 563 |
| | Intervention | 1798 | 599 | 599 | 2996 | 604 | 191 | 211 | 1006 |
| | Observation | 1203 | 308 | 378 | 1889 | 325 | 103 | 114 | 541 |
| | Count | 228 | 94 | 81 | 403 | 125 | 40 | 44 | 209 |
| | Outcome | 1827 | 459 | 572 | 2858 | 591 | 187 | 207 | 985 |
| | Total | 5821 | 1716 | 1884 | 9421 | 1982 | 628 | 694 | 3304 |
| Evidence dependency | Dependent | 2876 | 911 | 1007 | 4793 | 929 | 294 | 325 | 1548 |
| | Independent | 11 777 | 3730 | 4122 | 19 629 | 3713 | 1238 | 1238 | 6189 |
| | Total | 14 653 | 4640 | 5129 | 24 422 | 4642 | 1532 | 1563 | 7737 |
Both annotators annotated all Evidence Entities in the “General” corpus and the instance-level agreement36 was calculated. An annotation instance is represented as a triple (d, l, o), where d is a document id, l is a label, and o is a list of start-end character offset tuples. Only 2 identical instances from annotators are used for calculating agreement. Grouping annotations by labels allows us to calculate F1 per label. The instance-level agreements were 0.916 for Participant, 0.844 for Intervention, 0.727 for Outcome, 0.955 for Measure and Count, and 0.861 overall. The agreement for evidence dependency annotation is 0.691.
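The instance-level agreement described above can be sketched as an F1 over exactly matching triples. This is a minimal sketch, assuming each annotation is a hashable `(d, l, o)` tuple with the offsets stored as a tuple of `(start, end)` pairs; the document ids, labels, and offsets below are made up for illustration.

```python
def agreement_f1(annotations_a, annotations_b, label=None):
    """F1 agreement between two annotators over (doc_id, label, offsets) triples.

    Only identical triples count as agreement. With exact matching,
    F1 reduces to 2 * |A and B| / (|A| + |B|), which is symmetric in A and B.
    """
    a = {x for x in annotations_a if label is None or x[1] == label}
    b = {x for x in annotations_b if label is None or x[1] == label}
    if not a and not b:
        return 1.0  # nothing to disagree on
    return 2 * len(a & b) / (len(a) + len(b))


annotator_1 = [
    ("doc1", "Participant", ((0, 14),)),
    ("doc1", "Outcome", ((40, 52),)),
]
annotator_2 = [
    ("doc1", "Participant", ((0, 14),)),
    ("doc1", "Outcome", ((40, 58),)),  # same start, different end: no match
]

overall = agreement_f1(annotator_1, annotator_2)                     # 0.5
participant = agreement_f1(annotator_1, annotator_2, "Participant")  # 1.0
outcome = agreement_f1(annotator_1, annotator_2, "Outcome")          # 0.0
```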
EvidenceMap representation evaluations
Ten RCT abstracts25–34 were randomly retrieved from PubMed. The abstract text, Trialstreamer, Evidence Inference, and EvidenceMap representations of these trials are shown in the Supplementary Materials in the Diverse Representation for 10 Articles section. For EvidenceMap, we show the abstract text annotated with Medical Evidence Entities and Propositions alongside the Map to elucidate the annotation and construction of the Map, but only the Map was shown to the users during the evaluation.
Table 2 shows the results of the helpfulness evaluation. All modules of the Medical Evidence Map (Enrollment, Study Design and Study Results) were helpful for understanding the key study elements, with average (standard deviation) scores of 4.85 (0.39), 4.70 (0.57), and 4.20 (0.92), respectively. For 3 RCT abstracts (PMID 30535714, 30879707, and 30739747), the “Study Results” section received ratings of 3 or 2 due to missing temporal information that would distinguish the Outcome variables (eg, mortality at 7 days vs mortality at 30 days). For one RCT abstract (PMID 24548534), important geographical attributes specifying where the trial was conducted were missing for the Enrollment section, leading to an average rating of 3.75 (between “Helpful” and “Neither helpful nor unhelpful”) for the Enrollment module. A small number of evidence assertions were not represented appropriately in RCT abstracts (PMID 34526033 and 33538248), and they both were rated 4 for the Study Results module. In all other cases, the Evidence Maps were rated between “Very Helpful” and “Helpful”.
Table 2.
Five-level Likert scale rating for each section of Medical Evidence Map (mean and standard deviation) in facilitating evidence comprehension (Score 1–5: 5 is very helpful, 1 is no help.)
| PMID | Enrollment | Study design | Study results |
|---|---|---|---|
| 24548534 | 3.75 (0.5) | 5 (0) | 5 (0) |
| 34582477 | 5 (0) | 5 (0) | 5 (0) |
| 30535714 | 5 (0) | 4.5 (0.58) | 2.75 (0.5) |
| 30879707 | 5 (0) | 3.25 (0.5) | 3 (0) |
| 34526033 | 5 (0) | 5 (0) | 4 (0) |
| 33771819 | 5 (0) | 5 (0) | 5 (0) |
| 30739747 | 4.75 (0.5) | 4.25 (0.96) | 2.5 (0.58) |
| 34807243 | 5 (0) | 5 (0) | 5 (0) |
| 33538248 | 5 (0) | 5 (0) | 4 (0) |
| 34015311 | 5 (0) | 5 (0) | 5 (0) |
| Average | 4.85 (0.39) | 4.70 (0.57) | 4.20 (0.92) |
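The "mean (standard deviation)" cells in Table 2 summarize the 4 raters' scores for each section. The individual ratings are not reported, so the lists below are hypothetical; they are simply sets of four 1-to-5 ratings that reproduce two of the reported cells when the sample standard deviation is used, which is an assumption about the estimator.

```python
from statistics import mean, stdev


def summarize(ratings):
    """Format Likert ratings as 'mean (sample SD)' with two decimals."""
    return f"{mean(ratings):.2f} ({stdev(ratings):.2f})"


# Hypothetical ratings consistent with "3.75 (0.5)" and "4.5 (0.58)".
enrollment_example = summarize([4, 4, 4, 3])    # '3.75 (0.50)'
study_design_example = summarize([5, 5, 4, 4])  # '4.50 (0.58)'
```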
Next, we report the efficiency of EvidenceMap by comparing its time usage in medical evidence comprehension with the raw text, “Evidence Inference”15 and “Trialstreamer”.16 Figure 5 shows the average time usage for comprehending the evidence of each RCT abstract for the 4 different representations. Compared with the raw text, the other 3 types of representations all improved the efficiency of evidence comprehension. On average, Evidence Inference, EvidenceMap, and Trialstreamer improved comprehension time by 42.96%, 51.9%, and 53.88%, respectively. RCT abstracts containing more evidence assertions require more time for readers to understand all the study findings, so the time usage positively correlated with the number of assertions. Supplementary Table S3 lists the number of assertions that were represented by different representations for the 10 RCT abstracts. Evidence Inference and EvidenceMap were both able to represent almost all assertions, and Trialstreamer only selected a few from all assertions to present.
Figure 5.
Time usage for evidence comprehension of the 10 RCT abstracts for diverse representations. Abstracts are ranked by the number of evidence assertions. Abbreviations: RCT: randomized controlled trial.
EvidenceMap pipeline evaluations
Table 3 reports the performance of entity recognition and evidence relation extraction tasks. The entity recognition model achieved a weighted-average F1 score of 0.74 from 5 classes of entities. We further fine-tuned the entity recognition model on the COVID-19 corpus, and the specialized model achieved a weighted-average F1 score of 0.83. The relation extraction model fine-tuned on the General corpus achieved an F1 score of 0.97. By further fine-tuning on COVID-19 corpus, we achieved an F1 score of 0.96.
Table 3.
Performance on the test set for medical entity recognition and evidence relation extraction
| Task | Label | General: Precision | General: Recall | General: F1 score | COVID-19: Precision | COVID-19: Recall | COVID-19: F1 score |
|---|---|---|---|---|---|---|---|
| Entity recognition | Participant | 0.71 | 0.67 | 0.69 | 0.82 | 0.92 | 0.86 |
| | Intervention | 0.67 | 0.77 | 0.72 | 0.90 | 0.86 | 0.89 |
| | Observation | 0.77 | 0.83 | 0.80 | 0.81 | 0.79 | 0.80 |
| | Count | 0.73 | 0.69 | 0.71 | 0.79 | 0.90 | 0.84 |
| | Outcome | 0.74 | 0.79 | 0.77 | 0.76 | 0.78 | 0.77 |
| | Weighted average | 0.72 | 0.77 | **0.74** | 0.83 | 0.84 | **0.83** |
| Relation extraction | Dependent | 0.94 | 0.93 | 0.94 | 0.93 | 0.94 | 0.93 |
| | Independent | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| | Weighted average | 0.97 | 0.97 | **0.97** | 0.96 | 0.97 | **0.96** |
Note: Bold values highlight the weighted-average F1 scores for each task.
Table 4 reports the overall end-to-end performance of the evidence extraction on 129 gold standard propositions in the COVID-19 test set. The pipeline fine-tuned on the General corpus (Model Set A) achieved F1 scores of 0.72 and 0.75 with strict and partial matching, respectively. The pipeline further fine-tuned on the COVID-19 corpus (Model Set B) achieved F1 scores of 0.79 and 0.83. MEP clustering was evaluated only on 2-arm studies, which comprise 84% of the test data (10 out of 12 studies). In 80% of the studies, all formulated MEPs were clustered correctly, resulting in an average ARI of 0.86.
Table 4.
End-to-end evaluation of the pipeline with different models on COVID-19 test dataset
| Model set | Matching | Precision | Recall | F1 score |
|---|---|---|---|---|
| Model set A | Strict matching | 0.74 | 0.71 | 0.72 |
| | Partial matching | 0.77 | 0.73 | 0.75 |
| Model set B | Strict matching | 0.87 | 0.73 | 0.79 |
| | Partial matching | 0.89 | 0.78 | 0.83 |
Note: Model Set A: evidence entity recognition and relation extraction models fine-tuned on the general corpus. Model Set B: evidence entity recognition and relation extraction models fine-tuned on the general and COVID-19 corpora
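The adjusted Rand index used to score MEP clustering can be computed from the contingency counts of two labelings. The following is a sketch of the standard formula in plain Python, not the evaluation code used in the study; the arm labels in the example are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Six MEPs from a hypothetical 2-arm study, three per arm.
gold_arms = [0, 0, 0, 1, 1, 1]
perfect = adjusted_rand_index(gold_arms, [1, 1, 1, 0, 0, 0])  # 1.0: label names do not matter
one_error = adjusted_rand_index(gold_arms, [0, 0, 1, 1, 1, 1])  # about 0.32
```

A single misplaced MEP out of six already lowers the score substantially, which is why a study-level average of 0.86 is compatible with most studies being clustered perfectly.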
DISCUSSION
EvidenceMap corpora
Two manually annotated corpora were produced, one general corpus and one focused on COVID-19 trials. These corpora consist of multiple levels of annotations corresponding to the EvidenceMap and can serve as gold standards generated by domain experts to develop various approaches for automating evidence extraction following EvidenceMap. Among the PICO entity types, agreement was highest for Participant entities and lowest for Outcome entities. One potential reason is that Outcome entities have a more complicated structure that requires domain knowledge for accurate comprehension. For example, the outcome “presence of Doppler signal or synovial fluid and radiographic joint space by musculoskeletal ultrasound” (PMID 30879707) is difficult to parse because it could refer to either a single joint outcome or 3 individual outcomes.
EvidenceMap evaluation
In the EvidenceMap representation evaluation, all 3 representation models improved user comprehension time. For example, users only needed an average of 55 s to understand the EvidenceMap representation of the RCT abstract (PMID 34015311), instead of 129 s on its raw text. Though Trialstreamer’s representation was the fastest to process, it also provided the fewest evidence results since it only presented small snippets of the abstract as the study’s key findings. The 10 RCT abstracts contained 13.7 assertions on average per article. But Trialstreamer only presented 2.7 assertions on average. In contrast, EvidenceMap provided a more comprehensive view of medical evidence by clustering all MEPs for almost all assertions in the map for comprehension.
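The relative time saving for a single abstract follows directly from the two timings. The sketch below applies it to the PMID 34015311 example; the function name is hypothetical, and how the per-abstract values were aggregated into the reported averages is not stated in the text.

```python
def time_saved_pct(raw_seconds, representation_seconds):
    """Percentage of reading time saved relative to the raw-text abstract."""
    return 100 * (raw_seconds - representation_seconds) / raw_seconds


# 129 s on raw text vs 55 s with EvidenceMap for PMID 34015311.
saving = time_saved_pct(129, 55)  # about 57.4
```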
Regarding the scope, EvidenceMap was designed for parallel-group or cluster RCT types, where each group of participants receives (or does not receive) only one intervention.37 It needs to be extended for “Crossover” or “Factorial” RCT studies.38 For example, article (PMID 30879707) discusses the results of a randomized crossover trial where one group of participants was treated in the first 2 weeks but crossed over to control for the following 2 weeks, and vice versa for the other group. The representation of crossover relationships for interventions is not currently supported in EvidenceMap, so a few MEPs were missed. However, as 78% of RCTs were parallel-group trials,38 dominating over other types of RCTs, EvidenceMap is able to provide effective medical evidence representation for a large majority of RCT articles.
In the EvidenceMap pipeline evaluation, entity recognition remains the technical bottleneck and has room for improvement. In this study, we selected a particular disease domain, COVID-19, and obtained a substantial improvement by fine-tuning models on a small, domain-restricted dataset. On the other hand, the high performance of relation extraction demonstrates its effectiveness in reducing the complexity of training. In the end-to-end evaluation, we also observed that a small domain-specific annotation corpus could substantially improve performance for the particular domain.
Use case scenarios
EvidenceMap has been demonstrated to be helpful and efficient in visualizing study findings and providing interpretable results with its hierarchical and structured representation, thus facilitating the practice of EBM for various tasks. For example, a keyword search on the term “AIDS” might include not only references to “Acquired Immune Deficiency Syndrome” but also hearing aids. The PICO framework is helpful for framing a clinical question and representing evidence in a structured form, but still has limited ability to identify medical evidence with complex semantic relations. EvidenceMap formulates evidence findings as MEPs by representing not only medical evidence entities, but also their dependencies, providing a semantic relational retranslation for the medical evidence. The sentence “the combination of rivaroxaban and aspirin compared with aspirin produced a consistent reduction in the primary outcome of cardiovascular death, stroke, or myocardial infarction, irrespective of BMI or body weight” (PMID 33538248) can be represented as 5 MEPs (Supplementary Material Diverse Representations S9d). Users who are interested in determining the intervention’s effects on a specific outcome such as “stroke” can get the Observation result “reduction” by searching the MEPs. This can largely improve the precision of evidence retrieval and save manual work in abstract screening. Our previous work also shows that MEP representation can efficiently empower the neural model with better reasoning capability and substantially improve the state-of-the-art on 2 public benchmarks of evidence synthesis.22 The results of this study provide preliminary evidence of the efficacy of using EvidenceMap for evidence synthesis and appraisal.
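The outcome-specific lookup described above can be sketched as a filter over MEP triples. This is an illustration under stated assumptions: the function name and the case-insensitive substring matching are choices made here, and the three triples are built from the entities quoted in the sentence, not copied from the 5 MEPs in the Supplementary Material.

```python
def search_meps(meps, outcome_query, intervention_query=None):
    """Return MEPs whose outcome (and optionally intervention) contains the query."""
    hits = []
    for intervention, observation, outcome in meps:
        if outcome_query.lower() not in outcome.lower():
            continue
        if intervention_query and intervention_query.lower() not in intervention.lower():
            continue
        hits.append((intervention, observation, outcome))
    return hits


# Illustrative triples derived from the quoted sentence (PMID 33538248).
meps = [
    ("combination of rivaroxaban and aspirin", "consistent reduction", "cardiovascular death"),
    ("combination of rivaroxaban and aspirin", "consistent reduction", "stroke"),
    ("combination of rivaroxaban and aspirin", "consistent reduction", "myocardial infarction"),
]

stroke_hits = search_meps(meps, "stroke", intervention_query="rivaroxaban")
observations = [observation for _, observation, _ in stroke_hits]
# observations == ["consistent reduction"]
```

Unlike a keyword search over raw text, the query here is matched against a typed slot, so a term that appears only as a Participant attribute would not be returned as an Outcome.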
Limitations and future work
In this study, we created corpora and developed our tools on PubMed abstracts, but EvidenceMap is adaptable to full text RCT literature as well. In the experiments, biases were avoided as much as possible to make the comparative evaluations relatively representative, but the sample size of 10 is relatively small. In the future, we will recruit more people to annotate full-text RCT articles and increase the sample size.
In the expressivity evaluation, most evidence elements identified during the freeform evidence annotation were successfully represented by EvidenceMap. However, the reviewers identified a few limitations. Temporal and geographical information was not captured, as also identified during the helpfulness evaluation. This limitation was a design choice in EvidenceMap trading off the representative ability for comprehension efficiency. In most cases, the key evidence was clearly communicated without these attributes, but in a few studies, this led to a representation failure. For example, the RCT article (PMID 24548534) investigated Bangladeshis with diabetes in New York City, so the geographical attribute “in NYC” is important. Another limitation is related to semantic relationships beyond EvidenceMap. For example, in the RCT abstract (PMID 34526033), the conclusion was represented as the MEP “therapeutic regimen (regimen I) with low dose prednisolone” ->” shortening” ->” the length of hospital stay”, but the Comparison information “superior to other regimens” was omitted; thus, users were not informed that therapeutic regimen I was significantly more effective at reducing hospital length of stay than other regimens.
Coreference resolution is an important step required by the third level of representation in EvidenceMap, but it was not included in our NLP pipeline. Our future work will focus on enriching the corpora with coreference annotations and developing a coreference resolution tool. Integrating entity normalization resources, such as the Observational Medical Outcomes Partnership (OMOP) or UMLS synonym dictionary, into the language models might improve performance on the biomedical coreference resolution task. The generation of EvidenceMap across documents will also be explored under the same data model. We used an embedding method to calculate word semantic similarity for MEP clustering, but relevant MEPs may still be incorrectly grouped because of the complexity of the language. We will investigate new methods that capture long-range dependencies to measure the semantic textual similarity between MEPs.
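To make the clustering limitation concrete, the sketch below scores two MEP text spans by the cosine similarity of their averaged word vectors. It is an assumption-laden illustration rather than the method used in our pipeline: the 3-dimensional vectors in `TOY_VECTORS` are invented for the example, and averaging is only one simple way to combine word embeddings.

```python
import math
from typing import Dict, List, Optional

# Toy 3-dimensional word vectors, invented for illustration only; a real
# pipeline would load pretrained embeddings instead.
TOY_VECTORS: Dict[str, List[float]] = {
    "hospital": [0.9, 0.1, 0.0],
    "stay": [0.8, 0.2, 0.1],
    "length": [0.7, 0.1, 0.2],
    "mortality": [0.1, 0.9, 0.1],
    "death": [0.0, 1.0, 0.1],
}


def mean_vector(text: str) -> Optional[List[float]]:
    """Average the vectors of the known tokens; None if no token is known."""
    tokens = [t for t in text.lower().split() if t in TOY_VECTORS]
    if not tokens:
        return None
    dim = len(TOY_VECTORS[tokens[0]])
    return [sum(TOY_VECTORS[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(text_a: str, text_b: str) -> float:
    """Similarity of two spans; 0.0 when a span has no known tokens."""
    va, vb = mean_vector(text_a), mean_vector(text_b)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)


if __name__ == "__main__":
    print(similarity("length of hospital stay", "hospital stay"))
    print(similarity("hospital stay", "death"))
```

An averaged representation of this kind discards word order and sentence structure, which is one reason related MEPs can be grouped incorrectly and why methods that model longer-range context are of interest.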
CONCLUSIONS
We have introduced a 3-level computational evidence representation, EvidenceMap, for hierarchically representing free-text medical evidence in a structured format. We provided 2 corpora of RCT abstracts annotated with the representation for training machine learning pipelines. Evaluations showed that EvidenceMap is helpful and efficient for understanding medical evidence reported in clinical trials. EvidenceMap can better support various core tasks for EBM, such as evidence comprehension, retrieval and computing with reduced human effort.
Contributor Information
Tian Kang, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Yingcheng Sun, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Jae Hyun Kim, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Casey Ta, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Adler Perotte, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Kayla Schiffer, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
Mutong Wu, Department of Statistics, Columbia University, New York, New York, USA.
Yang Zhao, Department of Statistics, Columbia University, New York, New York, USA.
Nour Moustafa-Fahmy, Department of Statistics, Columbia University, New York, New York, USA.
Yifan Peng, Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA.
Chunhua Weng, Department of Biomedical Informatics, Columbia University, New York, New York, USA.
FUNDING
This work was supported by 5R01LM009886-11 (Bridging the semantic gap between research eligibility criteria and clinical data; PI: CW).
AUTHOR CONTRIBUTIONS
TK designed and carried out the experiments and drafted the manuscript. YS and AP participated in the study design and manuscript writing. JK participated in the data generation and reviewed the manuscript. CT and YP reviewed the manuscript and participated in the writing. KS, MW, YZ, and NM participated in the evaluation study. CW supervised the research, participated in study design, reviewed the representation, and critically edited the manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
The data underlying this article are available in GitHub at https://github.com/WengLab-InformaticsResearch/EvidenceMap_Model.
REFERENCES
- 1. Burns PB, Rohrich RJ, Chung KC. The levels of evidence and their role in evidence-based medicine. Plast Reconstr Surg 2011; 128 (1): 305.
- 2. Sim I. Trial Banks: An Informatics Foundation for Evidence-Based Medicine. Stanford University; 1998.
- 3. Schardt C, Adams MB, Owens T, Keitz S, Fontelo P. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med Inform Decis Mak 2007; 7 (1): 1–6.
- 4. Blake C. Beyond genes, proteins, and abstracts: identifying scientific claims from full-text biomedical articles. J Biomed Inform 2010; 43 (2): 173–89.
- 5. Huang K-C, Liu CC-H, Yang S-S, et al. Classification of PICO elements by text features systematically extracted from PubMed abstracts. In: 2011 IEEE International Conference on Granular Computing; 2011; 279–83; Kaohsiung, Taiwan.
- 6. Wallace BC, Kuiper J, Sharma A, Zhu M, Marshall IJ. Extracting PICO sentences from clinical trial reports using supervised distant supervision. J Mach Learn Res 2016; 17 (1): 4572–96.
- 7. Jin D, Szolovits P. PICO element detection in medical text via long short-term memory neural networks. In: Proceedings of the BioNLP 2018 workshop; 2018: 67–75; Melbourne, Australia. Association for Computational Linguistics.
- 8. Kang T, Zou S, Weng C. Pretraining to recognize PICO elements from randomized controlled trial literature. Stud Health Technol Inform 2019; 264: 188.
- 9. Nye B, Li JJ, Patel R, et al. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. NIH Public Access 2018; 197–207.
- 10. Chabou S, Iglewski M. Combination of conditional random field with a rule based method in the extraction of PICO elements. BMC Med Inform Decis Mak 2018; 18 (1): 1–14.
- 11. Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J Am Med Inform Assoc 2016; 23 (1): 193–201.
- 12. Marshall IJ, Kuiper J, Banner E, Wallace BC. Automating biomedical evidence synthesis: RobotReviewer. NIH Public Access 2017; 7–12.
- 13. Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation for clinical questions. Am Med Inform Assoc 2006; 359–63.
- 14. Lehman E, DeYoung J, Barzilay R, Wallace BC. Inferring which medical treatments work from reports of clinical trials. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019: 3705–17; Minneapolis, MN. Association for Computational Linguistics.
- 15. DeYoung J, Lehman E, Nye B, Marshall IJ, Wallace BC. Evidence inference 2.0: more data, better models. In: Proceedings of the BioNLP 2020 workshop [Online]; July 9, 2020: 123–32.
- 16. Nye BE, Nenkova A, Marshall IJ, Wallace BC. Trialstreamer: mapping and browsing medical evidence in real-time. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations; 2020; 63–9.
- 17. National Institutes of Health. ClinicalTrials.gov results data element definitions. https://prsinfo.clinicaltrials.gov/results_definitions.html. Accessed December 15, 2022.
- 18. King JC, Soames S, Speaks J. New Thinking about Propositions. Oxford: OUP; 2014.
- 19. Crystal D. A Dictionary of Linguistics and Phonetics. Hoboken, NJ: Wiley; 2011.
- 20. Atanassova I, Bertin M, Larivière V. On the composition of scientific abstracts. J Doc 2016; 72 (4): 636–47.
- 21. Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics; 2012: 102–107.
- 22. Kang T, Turfah A, Kim J, Perotte A, Weng C. A neuro-symbolic method for understanding free-text medical evidence. J Am Med Inform Assoc 2021; 28 (8): 1703–11.
- 23. Peng Y, Yan S, Lu Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Proceedings of the 18th BioNLP Workshop and Shared Task; 2019: 58–65; Florence, Italy. Association for Computational Linguistics.
- 24. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001; 34 (5): 301–10.
- 25. Islam N, Riley L, Wyatt L, et al. Protocol for the DREAM Project (Diabetes Research, Education, and Action for Minorities): a randomized trial of a community health worker intervention to improve diabetic management and control among Bangladeshi adults in NYC. BMC Public Health 2014; 14 (1): 1–9.
- 26. Hernandez-Cardenas C, Thirion-Romero I, Rodríguez-Llamazares S, et al.; on behalf of the Research Group on hydroxychloroquine for COVID-19. Hydroxychloroquine for the treatment of severe respiratory infection by covid-19: a randomized controlled trial. PLoS One 2021; 16 (9): e0257238.
- 27. Rezaei M, Badiei R, Badiei R. The effect of platelet-rich plasma injection on post-internal urethrotomy stricture recurrence. World J Urol 2019; 37 (9): 1959–64.
- 28. Levitsky A, Kisten Y, Lind S, et al. Joint mobilization of the hands of patients with rheumatoid arthritis: results from an assessor-blinded, randomized crossover trial. J Manip Physiol Ther 2019; 42 (1): 34–46.
- 29. Ghanei M, Solaymani-Dodaran M, Qazvini A, et al. The efficacy of corticosteroids therapy in patients with moderate to severe SARS-CoV-2 infection: a multicenter, randomized, open-label trial. Respir Res 2021; 22 (1): 1–14.
- 30. Finn JC, Brink D, Mckenzie N, et al. Prehospital continuous positive airway pressure (CPAP) for acute respiratory distress: a randomised controlled trial. Emerg Med J 2022; 39 (1): 37–44.
- 31. Hanley DF, Thompson RE, Rosenblum M, et al. Efficacy and safety of minimally invasive surgery with thrombolysis in intracerebral haemorrhage evacuation (MISTIE III): a randomised, controlled, open-label, blinded endpoint phase 3 trial. Lancet 2019; 393 (10175): 1021–32.
- 32. Salloway S, Chalkias S, Barkhof F, et al. Amyloid-related imaging abnormalities in 2 phase 3 studies evaluating aducanumab in patients with early Alzheimer disease. JAMA Neurol 2022; 79 (1): 13–21.
- 33. Guzik TJ, Ramasundarahettige C, Pogosova N, et al. Rivaroxaban plus aspirin in obese and overweight patients with vascular disease in the COMPASS trial. J Am Coll Cardiol 2021; 77 (5): 511–25.
- 34. Altorki NK, McGraw TE, Borczuk AC, et al. Neoadjuvant durvalumab with or without stereotactic body radiotherapy in patients with early-stage non-small-cell lung cancer: a single-centre, randomised phase 2 trial. Lancet Oncol 2021; 22 (6): 824–35.
- 35. Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 1971; 66 (336): 846–50.
- 36. Kolditz T, Lohr C, Hellrich J, et al. Annotating German clinical documents for de-identification. Stud Health Technol Inform 2019; 264: 203–7.
- 37. Kaiser J, Niesen W, Probst P, et al. Abdominal drainage versus no drainage after distal pancreatectomy: study protocol for a randomized controlled trial. Trials 2019; 20 (1): 1–7.
- 38. Hopewell S, Dutton S, Yu L-M, Chan A-W, Altman DG. The quality of reports of randomised trials in 2000 and 2006: comparative study of articles indexed in PubMed. BMJ 2010; 340: c723.