Journal of the American Medical Informatics Association (JAMIA). 2024 Mar 12;31(5):1163–1171. doi: 10.1093/jamia/ocae065

A span-based model for extracting overlapping PICO entities from randomized controlled trial publications

Gongbo Zhang 1, Yiliang Zhou 2, Yan Hu 3, Hua Xu 4, Chunhua Weng 5, Yifan Peng 6
PMCID: PMC11031223  PMID: 38471120

Abstract

Objectives

Extracting PICO (Populations, Interventions, Comparisons, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method, PICOX, to extract overlapping PICO entities.

Materials and Methods

PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then, it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated against one of the best-performing baselines on 4 datasets: EBM-NLP, PICO-Corpus, and randomized controlled trial publications on Alzheimer’s Disease (AD) or COVID-19, using entity-level precision, recall, and F1 scores.

Results

On EBM-NLP, PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (P ≪ .01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline.

Conclusion

PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.

Keywords: PICO extraction, artificial intelligence, span-based model, named entity recognition

Background and significance

The PICO (Population, Intervention, Comparison, and Outcome) framework is a widely adopted standard for evidence retrieval in evidence-based medicine.1,2 This framework specifies the study population of a clinical study, the intervention and comparison applied, and the expected outcomes to assist with retrieving relevant evidence.3 However, manual extraction of PICO is time-consuming.4 The exponential growth of randomized controlled trial (RCT) publications further exacerbates this challenge. It is imperative to automate PICO extraction to enable timely and efficient evidence retrieval, appraisal, and synthesis.

Existing related studies largely treat PICO extraction as a sequence labeling task of Named Entity Recognition (NER), where each token is labeled with a predefined tag that indicates the entity type, paired with the IOB2 schema symbolizing the inside, outside, and beginning of an entity. Some methods are based on Conditional Random Fields (CRFs) with well-designed features.5,6 With the evolution of neural networks, the LSTM-CRF model (Long Short-Term Memory with a Conditional Random Field layer) has demonstrated substantial promise for the NER task.7–9 More recently, language models such as ELMo10 and BERT11 have also achieved remarkable successes.12–16 While these methods have demonstrated promising results, they often falter when handling overlapping PICO entities, which are commonly observed in PICO annotations and account for 8.2% of the sentences in the EBM-NLP dataset.17 For example, in Figure 1, the text “children’s attitude and behavior intentions” contains an overlapping P entity (“children”) and O entities (“children’s attitude” and “(children’s) behavior intentions”).
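To illustrate the constraint, consider a minimal IOB2 tagging of the fragment above. The tokenization and tag names here are ours for illustration; a single tag per token can encode the P entity or the O entity, but not both:

```python
# Illustrative IOB2 tagging of "children's attitude" (hypothetical tag set).
# One tag per token forces a choice between the two gold entities.
tokens = ["children", "'s", "attitude"]
p_only = ["B-P", "O", "O"]            # keeps Population "children", drops the Outcome span
o_only = ["B-OUT", "I-OUT", "I-OUT"]  # keeps Outcome "children's attitude", drops the Population span
```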

Figure 1. An example sentence that contains overlapping PICO entities. The population entity is contained within the outcome entity.

Span-based methods have been frequently adopted to extract overlapping entities.18–25 These methods identify spans containing named entities and classify the candidates by entity type, but they have not yet been used for PICO extraction. Moreover, to limit the number of span candidates, most studies presume a maximum length for an entity span.26–29 While this presumption works effectively for generic named entities like people or locations, it is not necessarily optimal for PICO extraction, because the length of PICO entity spans varies considerably. Some can be as condensed as a single word, like treatments, while others can expand across an entire sentence, providing comprehensive descriptions of population groups, including age, gender, and sample size. Hence, assuming a maximum span length might fail to capture PICO entities of various lengths.

To tackle these unsolved problems, we define PICO extraction as a “span detection” task and present a novel span-based model called PICOX. Figure 2 shows the 2-step workflow of PICOX. At step 1, drawing inspiration from the works of Tan et al21 and Shen et al,30 it locates the entities in a sentence by determining whether a word signals the start or end of an entity. It is distinguished from previous works by not only assessing whether a token marks the start or end of a span but also categorizing non-boundary tokens as “inside” or “outside” spans. This distinction provides advantages for model training, as it considers the relative order of these tokens. For instance, the last token in an entity is more likely to be followed by a word outside any entity than by a word inside one. At step 2, it classifies each span using a multi-label classifier. The objective here is not only to identify the entity type of a span but also to differentiate spans that represent an entity from those that do not. Since a sentence might contain multiple entities, not all extracted span candidates represent a valid entity; refer to the first and fourth extracted spans in Figure 2 for examples. To distinguish PICO entities from such invalid span candidates, we introduce a new strategy that augments the training data.

Figure 2. The workflow of PICOX consists of 2 steps. Initially, it detects the start and end positions in the text sequence. Subsequently, it categorizes each span bounded by a pair of start and end positions. In the provided example from PMID 20840173, PICOX discerned 2 potential start positions and 2 end positions, resulting in 4 valid pairs where the start position does not surpass the end position. After span classification, PICOX identified a population entity “A total of 403 adult … to bupropion use” and an intervention entity “bupropion SR.” In this instance, the Intervention entity is encapsulated within the Population entity.

This research makes 3 key contributions. First, we contribute a novel framework that parses overlapping spans and accommodates entities of unlimited length. Second, we present a data augmentation strategy that boosts the training data for the span classifier, which not only identifies the entity type within a span but also differentiates between spans that represent an entity and those that do not. Third, we extensively evaluate our model on 4 diverse benchmark datasets.

Materials and methods

Data description

This study utilized 4 benchmark datasets: EBM-NLP,17 PICO-Corpus,31 and 2 sets of RCT publications focusing on Alzheimer’s disease (AD) and COVID-19 (Table 1).

Table 1. Number of annotations in the datasets.

                Training                    Test                     Total
  Dataset       P       I       O           P      I      O          P       I       O        % of overlapping entities
  EBM-NLP       17,952  32,859  33,554      643    1726   1833       18,595  34,585  35,387   8.2
  PICO-Corpus   1404    1849    3580        1561   2054   3978       3121    4108    7956     0
  AD            98      221     295         109    245    328        218     490     656      0
  COVID-19      118     281     279         132    313    310        263     625     619      0

P = population, I = intervention/comparison, O = outcome, AD = Alzheimer’s disease.

EBM-NLP

The EBM-NLP dataset consists of 4993 abstracts describing RCTs.17 These abstracts were annotated with entities related to “population,” “intervention/comparison,” and “outcome.” The annotations for interventions and comparisons were merged during the annotation process. The training labels for this dataset were obtained via crowdsourcing on Amazon Mechanical Turk and subsequently aggregated. The test set, in contrast, was manually labeled by medical professionals. We used the standard training and test sets provided in EBM-NLP. Following a previous practice,32 we randomly selected 5% of the training set for validation.

AD and COVID-19 datasets

We collected a total of 1980 abstracts from RCT publications related to AD (as of November 5, 2021) and 552 abstracts related to COVID-19 (as of September 1, 2021). From these collections, we randomly selected and annotated 150 RCT abstracts for each disease.33 To enhance the annotation process, we used a sentence classification model that filtered and focused solely on method-related sentences. This narrowed the scope of annotation.34 The RCT abstracts filtered through the sentence classification model were annotated by 2 independent annotators, both with medical training. To ensure high accuracy and consistency, we followed a rigorous annotation test loop grounded in the original EBM-NLP annotation guidelines. These were supplemented with rules and examples in each iteration to improve clarity.

In each loop, 10 abstracts were selected randomly and manually annotated using the evolving guidelines. After each loop, we evaluated the inter-annotator agreement using Cohen’s kappa statistic. If the measure of agreement was below 0.7, we addressed the discrepancies between annotators and refined the guidelines by enriching rules and examples. By conducting multiple annotation loops and continuously improving the guidelines, we achieved final inter-annotator agreements of 0.714, 0.808, 0.701, and 0.790 for the 4 PICO elements, respectively. The combined coefficient for all PICO elements stood at 0.746.33

With the finalized guidelines, we annotated the above abstracts in this study. To create a sufficiently large test set, we divided the AD and COVID-19 datasets into 2 subsets of the same size, 1 for training and the other for testing. Following an earlier biomedical text mining work,32 we took 5% of the samples from the training set as the validation set to tune the model’s hyperparameters.

PICO-Corpus

The PICO-Corpus comprises 1011 PubMed abstracts, all of which are RCTs primarily focusing on breast cancer.31 Each abstract was annotated with specific textual elements that represent the Population, Intervention, Control, and Outcome (PICO). As in the EBM-NLP dataset,17 the annotations for interventions and comparisons were consolidated into 1 category. As with the AD and COVID-19 corpora, we divided the PICO-Corpus into 3 distinct subsets: 45% for training, 5% for validation, and the remaining 50% for testing.

Proposed model

PICOX comprises 2 parts (Figure 2): one to locate entity spans in a sentence and the other to determine the types of the located spans.

Span localization

The input of the model is a text sequence of length $n$, denoted as $s = \{t_1, t_2, \ldots, t_n\}$. We apply a BERT-based encoder to obtain a contextual representation of each token, denoted as $H = \{h_1, h_2, \ldots, h_n\}$. Next, we develop a model to determine whether a token marks the beginning or end of an entity. We define a set of 5 relative-position categories $L = \{$inside, outside, start, end, both-start-and-end$\}$. The “both-start-and-end” category covers situations where a single-word entity results in a token functioning as both the start and the end of a span. Because a sentence may contain multiple entities, we identify the starts and ends of all entities present in the sentence. We calculate the probability $P_{ij}^{L}$ of token $t_i$ being in relative position $j \in L$ to the nearest entity:

$$P_{ij}^{L} = \frac{\exp\left(W_j^{L} h_i + b_j^{L}\right)}{\sum_{l \in L} \exp\left(W_l^{L} h_i + b_l^{L}\right)},$$

where $W^{L}$ and $b^{L}$ are parameters to be learned. We use cross-entropy as the loss function,

$$\mathcal{L}^{L} = \sum_{i=1}^{n} \sum_{j \in L} -y_{ij}^{L} \log P_{ij}^{L},$$

where $y_{ij}^{L}$ denotes the gold-standard relative-position label of token $t_i$.

To determine whether a token $t_i$ is the starting point of a span, we select the set of relative positions $L^{S} = \{\, j \mid \hat{P}_{ij}^{L} \ge t \,\}$, where $t$ is a threshold parameter. We then consider $t_i$ a starting point if either “both-start-and-end” or “start” is in $L^{S}$. We identify ending positions using the same procedure.
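As an illustration, the following is a minimal sketch of the localization step under the formulation above, assuming a Hugging Face encoder. The class, helper names, and decision logic layout are ours, not the released PICOX code:

```python
# A minimal sketch of the span localization module (our naming; not the
# official PICOX implementation). Each token encoding h_i is mapped to a
# distribution over the 5 relative-position categories L.
import torch
from transformers import AutoModel

BOUNDARY_LABELS = ["inside", "outside", "start", "end", "both-start-and-end"]

class SpanLocalizer(torch.nn.Module):
    def __init__(self, encoder_name: str):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One linear layer maps each token encoding to 5 relative-position logits.
        self.classifier = torch.nn.Linear(
            self.encoder.config.hidden_size, len(BOUNDARY_LABELS))

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        return torch.softmax(self.classifier(h), dim=-1)  # P_{ij}^L, shape (B, n, 5)

def boundary_positions(token_probs, threshold=0.25):
    """Select start/end candidates for one sentence: token i is a start (end)
    if 'start' ('end') or 'both-start-and-end' reaches the threshold t."""
    starts, ends = [], []
    for i, probs in enumerate(token_probs):
        selected = {lab for lab, p in zip(BOUNDARY_LABELS, probs.tolist())
                    if p >= threshold}
        if selected & {"start", "both-start-and-end"}:
            starts.append(i)
        if selected & {"end", "both-start-and-end"}:
            ends.append(i)
    return starts, ends
```

Training would minimize the token-level cross-entropy defined above, eg, by applying torch.nn.functional.cross_entropy to the pre-softmax logits.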

Span classification

After predicting the span boundaries, we extract all spans $t_{s:e}$ defined by pairs of start and end positions $s$ and $e$ where $s \le e$. Subsequently, we classify these spans by entity type, drawn from $C = \{$Population, Intervention/Comparison, Outcome$\}$. Following the work of Nye et al,17 we merge Interventions and Comparisons into a single category (I). For each span $t_{s:e}$, we feed it into BERT, adding a special token [CLS] at the beginning, and use the encoding of the inserted [CLS] token, denoted as $h_{s:e}$, as the feature vector.27 To determine whether span $t_{s:e}$ falls under category $c \in C$, we calculate

$$P_{s:e}^{c} = \sigma\left(W_c h_{s:e} + b_c\right),$$

where $\sigma$ is the sigmoid function and $W_c$ and $b_c$ are weight parameters. The objective function of span classification is defined as

$$\mathcal{L}^{C} = \sum_{c \in C} -y_{s:e}^{c} \log P_{s:e}^{c},$$

where $y_{s:e}^{c}$ indicates whether the span $t_{s:e}$ represents an entity span in category $c$.
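A corresponding sketch of the span classifier, under the same assumptions (our naming and a 0.5 decision threshold, neither taken from the released code): each candidate span is re-encoded with a leading [CLS] token and scored by independent sigmoid heads.

```python
# A minimal sketch of the multi-label span classifier (our naming). The
# encoding of the inserted [CLS] token serves as the span feature h_{s:e}.
import torch
from transformers import AutoModel

PICO_TYPES = ["Population", "Intervention/Comparison", "Outcome"]

class SpanClassifier(torch.nn.Module):
    def __init__(self, encoder_name: str):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One sigmoid head per category c, so a span can carry several labels
        # or none at all (the composite-span case).
        self.heads = torch.nn.Linear(self.encoder.config.hidden_size,
                                     len(PICO_TYPES))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_span = out.last_hidden_state[:, 0]      # [CLS] encoding, ie, h_{s:e}
        return torch.sigmoid(self.heads(h_span))  # P_{s:e}^c for each category c

def span_labels(probs, threshold=0.5):
    """Map per-category probabilities to PICO labels; an empty list means the
    candidate is rejected as a non-entity (eg, composite) span."""
    return [t for t, p in zip(PICO_TYPES, probs.tolist()) if p >= threshold]
```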

Data augmentation

In scenarios where a sentence contains multiple entities, some extracted spans may not represent any entity. For example, consider a sentence with 2 entity spans, $t_{s_1:e_1}$ and $t_{s_2:e_2}$, contributing 2 pairs of start and end positions. This would result in a total of 4 span candidates being extracted. However, 2 of these candidates, $t_{s_1:e_2}$ and $t_{s_2:e_1}$, are not associated with any PICO elements. We refer to such candidates as “composite spans” because their boundaries are made up of the start of one entity and the end of another.

Throughout the span classification stage, it is necessary to filter out the composite spans from the extracted candidates and accurately identify the entity type for the remaining ones. To address this challenge, we augment span classification training data by including composite spans (Algorithm 1). By introducing these extra spans, we provide the model with examples that aid it in learning to differentiate between spans representing named entities and those that do not.

Algorithm 1. Data augmentation.

Let l_a, l_b be two entity lists of different categories.

  D ← {}
  for (e_a, e_b) in l_a × l_b do
    if START(e_a) ≤ END(e_b) then
      D ← D ∪ {⟨START(e_a), END(e_b)⟩}
    end if
    if START(e_b) ≤ END(e_a) then
      D ← D ∪ {⟨START(e_b), END(e_a)⟩}
    end if
  end for
  return D
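A direct Python rendering of Algorithm 1 may make the construction clearer. Representing entities as (start, end) token-index pairs is our convention, not the paper's:

```python
# Composite-span generation per Algorithm 1 (our data representation).
from itertools import product

def augment_composite_spans(entities_a, entities_b):
    """Cross the boundaries of two entity lists of different categories to
    build composite spans, used as negative examples for the span classifier."""
    d = set()
    for (start_a, end_a), (start_b, end_b) in product(entities_a, entities_b):
        if start_a <= end_b:
            d.add((start_a, end_b))
        if start_b <= end_a:
            d.add((start_b, end_a))
    return d

# Example: a Population entity spanning tokens (0, 9) and an Intervention
# entity at (4, 5) yield the composite spans (0, 5) and (4, 9).
print(augment_composite_spans([(0, 9)], [(4, 5)]))  # {(0, 5), (4, 9)}
```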

Evaluation metrics

We employed span-level precision, recall, and F1 scores as evaluation metrics. For a prediction to be considered a true positive, it must satisfy 2 conditions: (1) the predicted span is identical to the ground-truth entity span, and (2) the predicted and ground-truth entities have the same PICO type.
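For concreteness, this strict entity-level matching can be scored as in the short sketch below; the (start, end, type) triple representation is our assumption:

```python
# Span-level precision/recall/F1 under strict matching: a true positive must
# match a gold entity on both boundaries and PICO type.
def span_prf1(gold, pred):
    """gold and pred are sets of (start, end, pico_type) triples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```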

Experimental settings

In our experiments, we compared PICOX with multiple open-source models, including BioELECTRA,16 BioBERT,15 SciBERT,14 and PubMedBERT12 on EBM-NLP, the largest dataset with PICO annotation. We selected PubMedBERT as our backbone model. To the best of our knowledge, PubMedBERT is one of the best language models for biomedical text. We fine-tuned an instance of the large, uncased version on EBM-NLP. As a sanity check, our baseline instance achieved a token-level macro F1 score of 73.33, closely matching the reported 73.38 in the work of Gu et al.12

On the EBM-NLP dataset, we fine-tuned the span localization and span classification models using their respective training data. On the PICO-Corpus, AD, and COVID-19 datasets, we continued fine-tuning the models to benefit from the knowledge learned from the larger EBM-NLP dataset.

We applied a learning rate of 5e-5, a batch size of 8, and 3 epochs of training. The default span localization threshold was 0.25 (0.4 was used for AD). Experiments ran on an Intel Core i9-9960X 16-core processor, an NVIDIA Quadro RTX 5000 GPU, and 128 GB of memory.
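Collected in one place, the reported settings would look like the following sketch; only the values come from the paper, the key names are ours:

```python
# Hyperparameters as reported above (values from the paper; keys illustrative).
TRAIN_CONFIG = {
    "learning_rate": 5e-5,
    "batch_size": 8,
    "num_epochs": 3,
    "span_threshold": 0.25,  # default; 0.4 was used for the AD dataset
}
```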

Results

Overall performance

We evaluated PICOX on the following benchmark datasets: EBM-NLP,17 PICO-Corpus,31 and the AD and COVID-19 trials.33 The results demonstrate that our model outperforms previously established state-of-the-art BERT-based models.12,14–16 Our model significantly (P ≪ .01) enhances the identification of both overlapping entities (with a 4.82% increase in F1 score) and non-overlapping entities (with a 5.16% increase in F1 score). Further analysis shows that the data augmentation strategy effectively reduces false positive errors, resulting in higher precision.

Table 2 compares PICOX with BioELECTRA, BioBERT, SciBERT, and PubMedBERT on EBM-NLP. In this study, we selected PubMedBERT as the baseline because it was pretrained specifically on a biomedical text corpus and achieved similar or better overall performance (macro F1) than the others for PICO extraction12 (Table 2). We replicated the performance of the baseline model as reported by Gu et al.12 In a paired t-test, our method significantly outperformed PubMedBERT (P ≪ .01).

Table 2. Performance comparison of BioELECTRA, SciBERT, BioBERT, PubMedBERT, and PICOX on EBM-NLP.

              Participants            Interventions           Outcomes                Average F1
              P      R      F1        P      R      F1        P      R      F1        micro   macro
BioELECTRA    52.30  63.76  57.46     52.86  41.83  46.70     44.40  33.93  38.47     45.25   47.54
SciBERT       55.36  62.67  58.79     51.77  41.43  46.03     45.07  33.17  38.21     45.07   47.68
BioBERT       54.19  62.36  57.99     55.02  41.89  47.57     44.69  33.06  38.01     45.45   47.85
PubMedBERT    56.38  63.91  59.91     51.32  41.54  45.92     44.44  32.90  37.81     45.05   47.88
PICOX         55.93  66.72  60.85     57.13  52.43  54.68     48.25  38.41  42.77     50.87   52.11

P = precision, R = recall, F1 = F1 score. The best scores are highlighted in bold.

To demonstrate our method’s generalizability, we also compared the performance of PICOX on the EBM-NLP, PICO-Corpus, AD, and COVID-19 datasets with PubMedBERT, as shown in Table 3. On the EBM-NLP dataset, PICOX achieved higher precision, recall, and F1 scores across all categories than the baseline, with the micro F1 score increasing from 45.05 to 50.87. Specifically for interventions, the precision is improved from 51.32 to 57.13, the recall is improved from 41.54 to 52.43, and F1 is improved from 45.92 to 54.68.

Table 3. Performance for PICO entity recognition on EBM-NLP, PICO-Corpus, AD, and COVID-19 datasets.

                 EBM-NLP                 PICO-Corpus             AD                      COVID-19
                 P      R      F1        P      R      F1        P      R      F1        P      R      F1
PubMedBERT
  Populations    56.38  63.92  59.91     69.53  78.07  73.55     83.62  80.83  82.20     82.39  87.97  85.09
  Interventions  51.32  41.54  45.92     67.78  51.68  58.65     71.43  74.14  72.76     73.31  73.31  73.31
  Outcomes       44.44  32.90  37.81     73.99  53.85  62.33     74.01  74.92  74.46     74.32  80.95  77.49
  micro          49.70  41.19  45.05     70.61  56.66  62.87     74.58  75.64  75.11     75.30  78.99  77.10
  macro          50.71  46.12  47.88     70.43  61.20  64.84     76.35  76.63  76.48     76.67  80.75  78.63
PICOX
  Populations    55.93  66.72  60.85     55.92  80.09  65.86     88.29  81.67  84.85     90.16  82.71  86.27
  Interventions  57.13  52.43  54.68     68.12  65.24  66.65     76.69  68.82  72.55     78.98  76.07  77.50
  Outcomes       48.25  38.41  42.77     64.29  64.81  64.55     76.14  72.14  74.09     76.53  85.42  80.73
  micro          53.49  48.50  50.87     63.98  67.33  65.61     78.41  72.52  75.35     79.53  81.13  80.32
  macro          53.77  52.52  52.77     62.78  70.05  65.69     80.38  74.21  77.16     81.89  81.40  81.50

P = precision, R = recall, F1 = F1 score. The best scores are highlighted in bold.

On the PICO-Corpus, PICOX achieved lower precision but higher recall and F1 scores than the baseline. The micro recall score improved from 56.66 to 67.33, and the micro F1 score increased from 62.87 to 65.61. For interventions, the recall increased from 51.68 to 65.24. These results underscore PICOX’s capability to correctly classify entities within the PICO-Corpus.

On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision but lower recall when compared to the baseline. This result suggests that PICOX achieved more accurate predictions but misclassified certain entities as non-PICO elements. Nevertheless, the overall F1 score remained similar.

On the COVID-19 dataset, PICOX outperformed the baseline, achieving higher precision, recall, and F1 scores. The micro F1 score was improved from 77.10 to 80.32, indicating the effectiveness of PICOX in accurately extracting and classifying entities on the COVID-19 dataset.

Performance on overlapping PICO entities

To further investigate the discrepancy in performance, we divided the sentences in the EBM-NLP test set into 2 groups: one containing sentences with overlapping PICO elements and the other without any overlapping entities. We compared the precision, recall, and F1 scores (micro-averaged) across all categories. Table 4 shows that PICOX consistently outperforms the baseline, with F1 scores improving from 28.15 to 32.97 for overlapping entity detection and from 46.51 to 51.67 for non-overlapping entity detection. It is also worth noting that both PICOX and the baseline exhibit lower performance in detecting overlapping entities than non-overlapping ones, highlighting the challenge of identifying overlapping spans in PICO extraction tasks. Supplementary Appendix Table 1 breaks down the overall performance on EBM-NLP by entity length into categories of 1 word, 2-5 words, and >5 words.

Table 4. Performance of detecting overlapping and non-overlapping entities on EBM-NLP.

                   P      R      F1
PubMedBERT
  Overlapping      38.05  22.34  28.15
  Non-overlapping  50.51  43.10  46.51
PICOX
  Overlapping      41.09  27.53  32.97
  Non-overlapping  54.22  49.36  51.67

P = precision, R = recall, F1 = F1 score.

The impact of data augmentation

To assess the effectiveness of the data augmentation strategy, we compared 2 variants of our model. In the first variant, we trained the span classifier on a collection of text spans, each representing a PICO entity. In the second variant, we augmented the training data according to Algorithm 1. We evaluated both variants on EBM-NLP and report the results in Table 5. The second variant achieved higher precision (53.77 vs 51.49) while maintaining similar recall (52.52 vs 52.77) compared to the first.

Table 5. Comparison of 2 implementations, one incorporating the data augmentation strategy and one without. The performance was evaluated on EBM-NLP.

                        P      R      F1
w/o data augmentation
  Populations           50.00  66.87  57.22
  Interventions         56.96  52.61  54.70
  Outcomes              47.50  38.84  42.74
  micro                 51.86  48.79  50.28
  macro                 51.49  52.77  51.55
w/ data augmentation
  Populations           55.93  66.72  60.85
  Interventions         57.13  52.43  54.68
  Outcomes              48.25  38.41  42.77
  micro                 53.49  48.50  50.87
  macro                 53.77  52.52  52.77

P = precision, R = recall, F1 = F1 score. The best scores are highlighted in bold.

The effectiveness of the span localization threshold

In our approach, we utilized a threshold to determine whether a word should be considered the start or end point of an entity span. In this section, we analyze the effect of this threshold on the model’s performance. We plotted the changes in precision, recall, and F1 on the EBM-NLP dataset for threshold values ranging from 0.2 to 0.5. The minimum threshold of 0.2 was chosen because, with 5 pre-defined relative-position categories, a uniform distribution assigns each category a probability of 0.2, so any lower threshold would select labels with less-than-chance probability. The maximum threshold of 0.5 was chosen because, when the threshold is larger than 0.5, the localization model will assign only a single label to each word. Consequently, it may fail to locate spans that consist of a single word, which is both the start and the end of the span.

As shown in Figure 3, an increase in the threshold leads to higher precision but lower recall. This observation aligns with the understanding that a more selective model filters out false positives but may also overlook true positives.

Figure 3. Precision, recall, and F1 scores on EBM-NLP for span localization threshold values ranging from 0.2 to 0.5. P: precision, R: recall, F1: F1 score.

Discussion

Across all 4 datasets, PICOX achieved superior performance to PubMedBERT. On the EBM-NLP dataset, PICOX surpassed the baseline in precision, recall, and F1 scores, with the micro F1 score rising from 45.05 to 50.87. On the PICO-Corpus, PICOX had lower precision but better recall and F1 scores than the baseline; notably, the micro recall score rose from 56.66 to 67.33. On the AD dataset, PICOX had a similar F1 score and higher precision but lower recall than the baseline. On the COVID-19 dataset, PICOX surpassed the baseline in all metrics.

As shown in Table 4, F1 scores rose from 28.15 to 32.97 for overlapping entity detection and from 46.51 to 51.67 for non-overlapping entity detection. Both PICOX and the baseline had more difficulty detecting overlapping entities than non-overlapping ones, underlining the intricacy of pinpointing overlapping spans in PICO extraction tasks. These results show that PICOX performs better than the baseline at identifying entities that are contained within another, longer one, ie, overlapping spans.

Figure 4 provides several examples that allow for a detailed comparison of the extraction capabilities of PICOX and the baseline. In the first example, PICOX correctly identified the Intervention entity “physiotherapy assessment/intervention” within the Population entity. Similarly, in the second example, PICOX recognized the Intervention entity “bupropion SR.” In contrast, the baseline could not extract Intervention entities that are part of the description of a Population entity. This is because the baseline assigns only a single label to each word, while these Intervention entities should be attributed multiple labels.

Figure 4. Examples of PICO extraction by PICOX and the baseline.

We further implemented a data augmentation strategy to enhance the training of our span classifier. The results indicate that the data augmentation strategy effectively reduces false positive errors, leading to higher precision. In addition, even without the data augmentation, our model continues to outperform the baseline on EBM-NLP.

This study has several limitations. While PICOX has shown advantages in identifying overlapping PICO entities, it leaves room for further improvement. The first limitation lies in accurately locating span boundaries. In the third example in Figure 4, PICOX overlooked the first Outcome entity because it did not recognize the starting position, although it correctly recognized the ending position. A more effective span localization module would alleviate this issue.

The second challenge was accurately identifying boundaries for long spans. In the first example in Figure 4, the entire sentence was marked as a Population entity, but PICOX did not include the first 3 words, leading to an inaccurate determination of span boundaries. In the second example, PICOX identified the entire sentence as the Population entity, even though the annotation does not contain the first 3 words. In our approach, we utilized a threshold to determine whether a word should be considered the start or end point of an entity span. As the initial step in our pipeline, our strategy is to include as many plausible span candidates as possible, instead of selectively choosing boundaries with high confidence, because false positive spans can be filtered out further down the pipeline. We consider a word as a possible start or end even if the localization model is uncertain about its classification. Our results show that PICOX maintains higher precision than the baseline, but the recall falls below the baseline once the threshold exceeds 0.4. The optimal F1 score was achieved when the threshold was between 0.2 and 0.3. However, this optimal threshold value is dataset dependent. In our experiments, we empirically selected a 0.25 threshold for EBM-NLP, PICO-Corpus, and COVID-19 and a 0.4 threshold for AD. We plan to learn these hyperparameters adaptively in the future.35 Other future directions include exploring larger model architectures and distinguishing interventions from comparisons, which were aggregated into one class in most existing studies.12,17,31

The third challenge was accurately identifying short spans located near each other, posing difficulty in defining the exact boundaries of an entity. Take, for instance, the third example, where “epistaxis” and “headache” are annotated as 2 separate Outcome entities. Our model identified “epistaxis” as the beginning of the span and “headache” as its end. Consequently, the extracted outcome entity was “epistaxis and headache,” introducing a false positive error and 2 false negative errors.

Finally, the span localization and classification modules are trained separately. Moving forward, we plan to train these 2 sub-modules jointly. This idea is motivated by the fact that both sub-modules share the same encoder backbone; joint training could exploit this shared representation for further improvement.

Conclusion

We introduced a span-based model, PICOX, for recognizing PICO entities in RCT publications. Our model demonstrated improvements in identifying overlapping entities and in extracting entities of any length without presuming a maximum span width. Experiments on benchmark datasets across 4 distinct fields showed superior results compared to state-of-the-art models. A comprehensive examination suggests that span-based models offer valuable perspectives on future pathways for the efficient and effective detection of overlapping PICO elements in RCT publications.


Contributor Information

Gongbo Zhang, Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States.

Yiliang Zhou, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States.

Yan Hu, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States.

Hua Xu, Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, United States.

Chunhua Weng, Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States.

Yifan Peng, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States.

Author contributions

Gongbo Zhang, Chunhua Weng, Yifan Peng (study concepts/study design); all authors (manuscript drafting or manuscript revision for important intellectual content); all authors (approval of the final version of the submitted manuscript); all authors (agreement to ensure any questions related to the work are appropriately resolved); Gongbo Zhang, Yifan Peng (literature research); Gongbo Zhang, Yiliang Zhou (experimental studies); Yan Hu, Hua Xu (data annotation); Gongbo Zhang, Yifan Peng (data interpretation and statistical analysis); and all authors (manuscript editing).

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This project was sponsored by National Library of Medicine grants R01LM009886 and R01LM014344 and by National Center for Advancing Translational Sciences award UL1TR001873.

Conflicts of interest

The authors declare no competing interests.

Data availability

The data and codes underlying this article will be shared on https://github.com/WengLab-InformaticsResearch/PICOX.

References

1. Richardson WS, Wilson MC, Nishikawa J, Hayward RS. The well-built clinical question: a key to evidence-based decisions. ACP J Club. 1995;123(3):A12-A13.
2. Kang T, Sun Y, Kim JH, et al. EvidenceMap: a three-level knowledge representation for medical evidence computation and comprehension. J Am Med Inform Assoc. 2023;30(6):1022-1031. doi: 10.1093/jamia/ocad036
3. Peng Y, Rousseau JF, Shortliffe EH, Weng C. AI-generated text may have a role in evidence-based medicine. Nat Med. 2023;29(7):1593-1594. doi: 10.1038/s41591-023-02366-9
4. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545.
5. McCallum A, Li W. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. Association for Computational Linguistics; 2003. doi: 10.3115/1119176.1119206
6. Finkel JR, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics; 2005:363-370.
7. Ma X, Hovy E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol 1: Long Papers). Association for Computational Linguistics; 2016:1064-1074.
8. Yang J, Liang S, Zhang Y. Design challenges and misconceptions in neural sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics; 2018:3879-3889.
9. Kang T, Zou S, Weng C. Pretraining to recognize PICO elements from randomized controlled trial literature. Stud Health Technol Inform. 2019;264:188-192.
10. Peters ME, Neumann M, Iyyer M, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol 1: Long Papers). Association for Computational Linguistics; 2018:2227-2237.
11. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171-4186.
12. Gu Y, Tinn R, Cheng H, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare. 2021;3(1):1-23.
13. Wang Q, Liao J, Lapata M, Macleod M. PICO entity extraction for preclinical animal literature. Syst Rev. 2022;11(1):209.
14. Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:3615-3620.
15. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240.
16. Kanakarajan KR, Kundumani B, Sankarasubbu M. BioELECTRA: pretrained biomedical text encoder using discriminators. In: Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics; 2021:143-154.
17. Nye B, Li JJ, Patel R, et al. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Vol 1: Long Papers). Association for Computational Linguistics; 2018. doi: 10.18653/v1/p18-1019
18. Wadden D, Wennberg U, Luan Y, Hajishirzi H. Entity, relation, and event extraction with contextualized span representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:5784-5789.
19. Luan Y, Wadden D, He L, Shah A, Ostendorf M, Hajishirzi H. A general framework for information extraction using dynamic span graphs. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; 2019. doi: 10.18653/v1/n19-1308
20. Liu S, Sun Y, Li B, Wang W, Zhao X. HAMNER: headword amplified multi-span distantly supervised method for domain specific named entity recognition. AAAI. 2020;34(05):8401-8408.
21. Tan C, Qiu W, Chen M, Wang R, Huang F. Boundary enhanced neural span classification for nested named entity recognition. AAAI. 2020;34(05):9016-9023.
22. Fu J, Huang X, Liu P. SpanNER: named entity re-/recognition as span prediction. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol 1: Long Papers). Association for Computational Linguistics; 2021:7183-7195.
23. Li F, Lin Z, Zhang M, Ji D. A span-based model for joint overlapping and discontinuous named entity recognition. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol 1: Long Papers). Association for Computational Linguistics; 2021:4814-4828.
24. Wan J, Ru D, Zhang W, Yu Y. Nested named entity recognition with span-level graphs. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Vol 1: Long Papers). Association for Computational Linguistics; 2022:892-903.
25. Zhu E, Li J. Boundary smoothing for named entity recognition. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Vol 1: Long Papers). Association for Computational Linguistics; 2022:7096-7108.
26. Golam Sohrab M, Shoaib Bhuiyan M. Span-based neural model for multilingual flat and nested named entity recognition. In: 2021 IEEE 10th Global Conference on Consumer Electronics (GCCE), Kyoto, Japan; 2021:80-84.
27. Fei H, Zhang Y, Ren Y, Ji D. A span-graph neural model for overlapping entity relation extraction in biomedical texts. Bioinformatics. 2021;37(11):1581-1589.
28. Zaratiana U, Tomeh N, Holat P, Charnois T. GNNer: reducing overlapping in span-based NER using graph neural networks. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics; 2022:97-103.
29. Zaratiana U, Tomeh N, Holat P, Charnois T. Named entity recognition as structured span prediction. In: Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS). Association for Computational Linguistics; 2022:1-10.
30. Shen Y, Ma X, Tan Z, Zhang S, Wang W, Lu W. Locate and label: a two-stage identifier for nested named entity recognition. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Vol 1: Long Papers). Association for Computational Linguistics; 2021:2782-2794.
31. Mutinda F, Liew K, Yada S, Wakamiya S, Aramaki E. PICO corpus: a publicly available corpus to support automatic data extraction from biomedical literature. In: Proceedings of the First Workshop on Information Extraction from Scientific Publications. Association for Computational Linguistics; 2022:26-31.
32. Zhang G, Roychowdhury D, Li P, et al. Identifying experimental evidence in biomedical abstracts relevant to drug-drug interactions. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Association for Computing Machinery; 2018:414-418.
33. Hu Y, Keloth VK, Raja K, Chen Y, Xu H. Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach. Bioinformatics. 2023;39(9):btad542. doi: 10.1093/bioinformatics/btad542
34. Hu Y, Chen Y, Xu H. Improving sentence classification in abstracts of randomized controlled trial using prompt learning. In: 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI), Rochester, Minnesota, USA; 2022:606-607.
35. Zhou W, Huang K, Ma T, Huang J. Document-level relation extraction with adaptive thresholding and localized context pooling. AAAI. 2021;35(16):14612-14620.


