Journal of the American Medical Informatics Association (JAMIA). 2018 Feb 1;25(7):774–779. doi: 10.1093/jamia/ocx155

Tool for filtering PubMed search results by sample size

Carlos Baladrón 1, Alejandro Santos-Lozano 2, Javier M Aguiar 3, Alejandro Lucia 4, Juan Martín-Hernández 2
PMCID: PMC7647020  PMID: 29409012

Abstract

Objective

The most widely used search engine for scientific literature, PubMed, provides tools to filter results by several fields. When searching for reports on clinical trials, sample size can be among the most important factors to consider. However, PubMed does not currently provide any means of filtering search results by sample size. Such a filtering tool would be useful in a variety of situations, including meta-analyses or state-of-the-art analyses to support experimental therapies. In this work, a tool was developed to filter articles identified by PubMed based on their reported sample sizes.

Materials and Methods

A search engine was designed to send queries to PubMed, retrieve results, and compute estimates of reported sample sizes using a combination of syntactical and machine learning methods. The sample size search tool is publicly available for download at http://ihealth.uemc.es. Its accuracy was assessed against a manually annotated database of 750 random clinical trials returned by PubMed.

Results

Validation tests show that the sample size search tool is able to accurately (1) estimate sample size for 70% of abstracts and (2) classify 85% of abstracts into sample size quartiles.

Conclusions

The proposed tool was validated as useful for advanced PubMed searches of clinical trials when the user is interested in identifying trials of a given sample size.

Keywords: text mining, clinical trial, knowledge discovery, sample size

BACKGROUND AND SIGNIFICANCE

PubMed is probably the most important search engine for literature on biomedical and life science topics, and is the main point of entry to the popular Medline database. The engine provides tools to locate papers in the form of filters on fields such as title, author, publication date, species, or Medical Subject Heading (MeSH) terms.

Notwithstanding, the recent exponential growth of scientific literature (Medline1 currently hosts >23 million references) is making it increasingly difficult for users to extract the information they need from all the available data. Indeed, since the advent of information technologies such as sensor networks and the Internet of Things, this problem has spread quickly to many areas of knowledge and is considered a major challenge in the currently popular field of big data.

To deal with the many challenges of complex, large databases, tools are being developed to help identify and extract specific information from sources of raw data. This practice is known as knowledge discovery (deriving actual knowledge from data) and is based on a combination of statistics, machine learning, database design, data mining, and semantic data processing techniques. Knowledge discovery is quickly gaining momentum for scientific literature analysis as well.

A plethora of new third-party tools is available to improve filtering, the relevance of search results, and information structure alongside or beyond PubMed, as reviewed by Wildgaard and Lund2 and by Lu.3 Many of these tools are text-mining processors capable of classifying and/or extracting relevant concepts from abstracts and full texts4,5 using biomedical language processing, a highly specialized branch of text mining.6 Other solutions are designed to extract higher-level concepts and abstractions, such as clustering of results or visualization tools that graphically represent relations and links among results.7

Among these third-party tools, we find a specific subgroup consisting of extensions that restrict searches to specific domains, such as instruments for measuring patient-reported outcomes,8 identifying studies carried out in a given country, such as Spain,9 classifying results according to the hallmarks of cancer,10 or identifying sex- and gender-specific literature.11 The most basic of these solutions are long filters written in PubMed search syntax, while others are complex data-processing algorithms involving text mining, ontologies, and machine learning.

Despite the long list of available tools, as far as we are aware, there is currently no tool able to filter references for clinical trials by sample size. Such a tool would be useful for a variety of research and clinically oriented tasks, for instance meta-analyses or assessing the level of scientific evidence supporting a given therapy or drug.

This study was designed to develop a tool to automatically extract information on sample size from clinical trial reports by processing their abstracts.

METHODS

The sample size search (SSS) tool was developed in 3 stages:

  1. Ground-truth database. A set of clinical trial search results randomly obtained from PubMed was manually examined and annotated to build a ground-truth database to develop the text processing algorithm, train the machine learning components, and validate results.

  2. Development. A set of algorithms was generated to extract sample sizes from abstracts through a user interface. Two software components were developed:
    1. A sample size estimator (SSE), an algorithm that automatically extracts a candidate sample size from an abstract.
    2. A low-confidence detector (LCD), an algorithm that detects and marks low-confidence estimations produced, in order to let the user review them manually if necessary.
  3. Validation. The tool developed was tested and its accuracy, sensitivity, and specificity measured using a dedicated portion of the ground truth from the first stage.

Ground-truth database

Before implementing the algorithms, a ground-truth database was created as a set of abstracts of clinical trial reports returned by PubMed associated with the actual sample sizes they report. This ground truth serves a 3-fold purpose during the following steps:

  1. Feature inspection. Before deciding on the family of algorithms to be employed, the features of the data to be handled were examined. These features included the syntax and word structures normally used to report sample size, different reporting styles (eg, reporting the total sample size vs reporting the sizes of the different groups), and number formats (among others, textual representation or use of the decimal comma).

  2. Machine learning training. Some of the algorithms considered are based on machine learning techniques, which need sets of training data to learn how to extract the appropriate information.

  3. Validation. Finally, the accuracy of the sample size estimated by the search tool was quantified. This was done on the ground-truth database abstracts and the results were compared with the manually extracted actual sample sizes.

The database hosted 750 search results returned for the search:

“clinical trial”[Publication Type] OR “clinical trials as topic”[MeSH Terms] OR “clinical trial”[All Fields]

In order to eliminate selection bias, the PubMed indices of all matches (1 003 047) were stored, and 750 abstracts were randomly selected. Access to the PubMed database was automated through the Hypertext Transfer Protocol interface of the Entrez Programming Utilities (E-Utilities)12 provided by the National Center for Biotechnology Information.
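Access of this kind can be scripted directly against the E-Utilities endpoints. The following is a minimal sketch of the retrieval step, assuming Python and the requests library (the paper does not specify its implementation language); the query string is the one given below, while the retmax value and the random sampling step are illustrative.

    # Sketch of automated PubMed access through NCBI E-Utilities (assumption: Python + requests).
    import random
    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    QUERY = ('"clinical trial"[Publication Type] OR '
             '"clinical trials as topic"[MeSH Terms] OR '
             '"clinical trial"[All Fields]')

    # 1. esearch: retrieve PubMed IDs matching the query.
    #    Only retmax IDs are returned per call; paging with retstart is needed for the full set.
    search = requests.get(f"{EUTILS}/esearch.fcgi", params={
        "db": "pubmed", "term": QUERY, "retmax": 10000, "retmode": "json",
    }).json()
    pmids = search["esearchresult"]["idlist"]

    # 2. Randomly select 750 results, mirroring the selection step described in the text.
    sample = random.sample(pmids, min(750, len(pmids)))

    # 3. efetch: download the corresponding abstracts (POST is preferred for long ID lists).
    abstracts_xml = requests.post(f"{EUTILS}/efetch.fcgi", data={
        "db": "pubmed", "id": ",".join(sample), "rettype": "abstract", "retmode": "xml",
    }).text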

The 750 randomly selected reports of clinical trials were split among the researchers for manual inspection. For each abstract, the following information was extracted:

  • Nature of the paper: whether the identified abstract corresponded to an actual randomized controlled trial.

  • Direct or indirect report of sample size: for abstracts reporting some kind of sample size, whether it is reported directly or indirectly.
    • Directly: the total sample size was provided as a single number.
    • Indirectly: the total sample size was the sum of partial sample sizes reported individually or the result of a complex operation described in the text. For instance, a paper may report that “the sample included 607 patients who underwent pancreaticogastrostomy and 604 who underwent pancreaticojejunostomy,” not explicitly stating that the total sample size was actually the sum of both numbers (ie, N = 1211).
  • Actual sample size.

  • Format in which the information about sample size was provided: numeric or textual representation, digit grouping/separation, etc.

  • Keywords associated with the numbers describing the sample size (eg, patients, volunteers, etc.), and distance (in words) separating keywords from the numbers they represent.
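For concreteness, the annotation extracted for each abstract could be stored in a record like the following minimal sketch; the field names and structure are illustrative assumptions, not the authors' actual schema.

    # Illustrative annotation record for one abstract in the ground-truth database
    # (field names are hypothetical; the paper does not describe its storage format).
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class GroundTruthRecord:
        pmid: str                           # PubMed identifier of the search result
        is_clinical_trial: bool             # nature of the paper
        reports_sample_size: bool           # is any sample size reported in the abstract?
        reported_directly: Optional[bool]   # True = single total figure; False = sum of group sizes
        sample_size: Optional[int]          # manually extracted total N (eg, 607 + 604 = 1211)
        number_format: str = ""             # numeric vs textual, digit grouping, decimal comma, etc.
        keywords: List[str] = field(default_factory=list)  # words flanking the numbers (eg, "patients")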

Three researchers (all with a PhD in biomedicine) participated in the ground-truth creation process. The database was split into 2 halves, each annotated by one of 2 researchers, and then reviewed by the third researcher to detect errors.

This analysis (graphically represented in Figure 1) revealed that, of the 750 references identified, only 648 had valid abstracts (for the remaining 102 results, the PubMed database did not even provide an abstract, only information on the title and authors). Of these 648 abstracts, 79 reported no sample size, either directly or indirectly, and 117 did not correspond to clinical trials. However, in 83 of the 117 studies that were not clinical trials, information about a valid sample size was provided in the abstract (eg, total sample size of all the studies included in a meta-analysis).

Figure 1. Abstracts in the ground-truth database classified according to how they report sample size.

For the purposes of development, training, and validation, the 102 results that did not provide an abstract were discarded, with the remaining 648 being usable. Even those abstracts reporting no sample size are valid, since the SSS tool is aimed at recognizing them.

Sample size estimation

Several candidate algorithms were considered for extracting sample size information from abstracts. Machine-learning approaches (such as latent semantic analysis13) were ruled out because of the huge variety of number formats, sentences, expressions, and aggregations leading to the actual sample size. The ground-truth database required to train such an algorithm would have to be many times larger than the one used for this study (possibly of the same order of magnitude as the actual number of clinical trials hosted by PubMed), because the system would need to learn all the nuances of human language employed to express the “sample size” concept. In this particular case, therefore, machine learning offered a poor return on investment.

Review of the development portion of the database showed that sample sizes were usually surrounded by varied sets of common words (eg, “6 healthy older participants,” “20 young women,” or “N = 272”). Therefore, a syntactic approach was employed: a set of keywords is identified within the abstract, and the vicinity of each keyword (the V nearest words, stopping when a punctuation mark is found, with V being a configurable parameter) is scanned for numbers; all the candidate numbers are collected, and the maximum is taken as the estimated sample size. As detailed in the Results section, this approach, based on expert knowledge, worked well for a large number of abstracts in the database.

To eliminate overfitting bias, the database was split into 2 portions: a development/training portion (250 samples) and a validation portion (500 samples). The list of keywords associated with sample size and the value of parameter V were derived exclusively from abstracts in the training portion of the database, ensuring that validation results would not be biased by “manually overfitting” the keyword list (ie, by pinpointing all the appropriate keywords). The resulting list contains 23 keywords that are normally related to how sample size is defined; for instance, “patients,” “individuals,” “cases,” “subjects,” “adults,” “volunteers,” and “men/women” are some of the keywords most commonly found in abstracts.

The parameter V (ie, the number of words examined around a keyword in search of a number) was determined experimentally by optimizing its value against the development portion of the database. The best results were obtained with V = 6.
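As an illustration of this estimation step, the following is a minimal sketch of the keyword/vicinity scan, assuming Python; the keywords shown are the partial set quoted above (the released tool uses 23), V = 6 as reported, and the tokenization and number-parsing details are simplified assumptions.

    import re

    # Partial keyword list quoted in the text; the released tool uses 23 keywords.
    KEYWORDS = {"patients", "individuals", "cases", "subjects", "adults",
                "volunteers", "men", "women"}
    V = 6  # vicinity size (words scanned around a keyword); optimized value reported in the paper

    # Crude number matcher: plain integers or integers with thousands separators (eg, "1,211").
    NUMBER = re.compile(r"^\d+$|^\d{1,3}(?:[,.]\d{3})+$")

    def candidate_sample_sizes(abstract: str):
        """Collect numbers found within V words of a keyword, stopping at punctuation marks."""
        tokens = abstract.replace(";", " ; ").replace(". ", " . ").split()
        candidates = []
        for i, tok in enumerate(tokens):
            if tok.lower().strip(",.:") not in KEYWORDS:
                continue
            for step in (-1, +1):            # scan both directions (simplified vicinity rule)
                j = i + step
                while 0 <= j < len(tokens) and abs(j - i) <= V:
                    if tokens[j] in {".", ";", ":"}:
                        break                # stop the scan at a punctuation mark
                    cleaned = tokens[j].strip(".,;:()")
                    if NUMBER.match(cleaned):
                        candidates.append(int(cleaned.replace(",", "").replace(".", "")))
                    j += step
        return candidates

    def estimate_sample_size(abstract: str):
        """Return the maximum candidate as the estimated sample size, or None if none was found."""
        candidates = candidate_sample_sizes(abstract)
        return max(candidates) if candidates else None

Applied to the pancreaticogastrostomy example above, such a scan would collect 607 and 604 and return 607 rather than the true 1211; this is exactly the kind of indirect-reporting case the low-confidence detector described next is meant to flag.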

LCD

The main source of errors found in this syntactic approach was abstracts that did not report a total sample size, but rather the individual sizes of the different experimental groups. To minimize the impact of these errors, a second stage was implemented to tag estimates that were likely not 100% accurate due to indirect reporting of sample size.

Again, a latent semantic analysis–based classifier was ruled out, because the amount of training data required for this approach to work is several orders of magnitude larger, potentially close to the entire number of clinical trials hosted by PubMed.

Instead, we followed a quantitative approach, under the assumption that when the sample size is reported indirectly as individual group sizes, the candidate numbers retrieved by the first stage of the algorithm will be similar, while for direct reporting of sample size, one of the candidate numbers will be larger than the others. To capture these relations among candidate numbers as predictors of the confidence level of the estimated sample size, a set of quantitative features (described in Table 1) was designed and computed for each abstract.

Table 1.

List of quantitative features used to identify indirect sample size reporting

  • Number of candidate sample sizes
  • Mean of candidate sample sizes
  • Standard deviation of candidate sample sizes
  • Maximum candidate sample size / second largest candidate sample size
  • Maximum candidate sample size / sum of candidate sample sizes
  • Maximum candidate sample size / minimum candidate sample size
  • (Maximum candidate sample size − second largest candidate sample size) / mean of all candidate sample sizes, excluding the maximum
  • Mean of candidate sample sizes / mean of all candidate sample sizes, excluding the maximum
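A minimal sketch of how the features in Table 1 might be computed from the candidate numbers produced by the first stage; the function and key names are illustrative, not taken from the paper.

    from statistics import mean, pstdev

    def confidence_features(candidates):
        """Compute the Table 1 features from the candidate sample sizes of one abstract."""
        n = len(candidates)
        if n == 0:
            return None  # no candidates: nothing for the low-confidence detector to judge
        ordered = sorted(candidates, reverse=True)
        largest = ordered[0]
        second = ordered[1] if n > 1 else float("nan")
        rest = ordered[1:]                      # all candidates except the maximum
        mean_all = mean(candidates)
        mean_rest = mean(rest) if rest else float("nan")
        return {
            "count": n,                                       # number of candidates
            "mean": mean_all,                                 # mean of candidates
            "std": pstdev(candidates) if n > 1 else 0.0,      # standard deviation of candidates
            "max_over_second": largest / second,              # maximum / second largest
            "max_over_sum": largest / sum(candidates),        # maximum / sum of candidates
            "max_over_min": largest / min(candidates),        # maximum / minimum
            "gap_over_mean_rest": (largest - second) / mean_rest,  # (max - second) / mean excluding max
            "mean_over_mean_rest": mean_all / mean_rest,      # mean / mean excluding max
        }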

Manually defining what constitutes a low-confidence sample size estimation on the basis of these parameters is an extremely difficult task. Instead, a machine learning classifier was used to build this “low-confidence” model automatically. To do so, the classifier is presented with training data containing the sample size estimation produced by the SSE and the true (manually annotated) sample size, and it learns to identify situations in which the estimation produced is of low confidence, based on the quantitative features listed in Table 1.

Several classification algorithms, including support vector machines and decision trees, were systematically tested, with the boosted tree ensemble classifier showing the best results. Initially, the classifiers were trained with the 250-sample training portion of the database and tested with the 500-sample validation portion. However, it soon became evident that the 250 results of the training portion provided insufficient data for this approach, as the area under the receiver operating characteristic (ROC) curve14 quickly increased as abstracts were added to the training data. To solve this problem, for this second stage, the training portion of the database was increased to 500 samples, leaving 250 for validation.

During training, to identify the optimum operation point of the classifier, a 5-fold cross-validation approach15 was implemented: the available data were split into 5 sections of equal length, 5 classifiers were each trained with a different combination of 4 of those sections and validated with the remaining one, and the validation results were then pooled. The resulting ROC curve was used to select the appropriate operation point (in terms of the trade-off between true positive rate and false positive rate). The classifier was then validated as an LCD with the remaining 250 samples to obtain statistical measures.
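The paper does not name the software used for training; as an assumption, the boosted tree ensemble and the 5-fold cross-validated ROC analysis could be reproduced with scikit-learn roughly as follows (the data arrays are synthetic placeholders).

    # Sketch of LCD training and operating-point selection (assumption: scikit-learn).
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import cross_val_predict

    # Placeholder data: one row of Table 1 features per training abstract, and a label
    # marking whether the SSE estimate was wrong (1 = low confidence) or correct (0).
    rng = np.random.default_rng(0)
    X = rng.random((500, 8))
    y = rng.integers(0, 2, 500)

    clf = GradientBoostingClassifier()  # boosted tree ensemble
    # 5-fold cross-validation: every abstract is scored by a model that never saw it in training.
    scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

    fpr, tpr, thresholds = roc_curve(y, scores)
    print("Cross-validated AUC:", roc_auc_score(y, scores))  # the paper reports 0.71 on its data

    # Pick the probability threshold giving the desired true positive / false positive trade-off,
    # refit on all training data, and flag SSE estimates whose score exceeds that threshold.
    clf.fit(X, y)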

Figure 2 is a simplified flow diagram of the developed algorithm. A screenshot of the tool’s user interface is provided in Figure 3.

Figure 2. Simplified flow diagram of the sample size extraction algorithm.

Figure 3. Screenshot of the tool’s user interface.

RESULTS

The sample size extraction tool developed here was quantitatively tested on the 500-sample validation portion of the database, which contained 435 abstracts. Two validation tests were carried out:

  • Absolute estimation: A correct prediction (hit) was counted when the sample size was estimated correctly (true positive) or when no sample size was found and the abstract did not report one (true negative). A failure (miss) was counted when a sample size was estimated but did not match the actual one, or was estimated even though the abstract did not report one (false positive), or when no sample size was estimated but the abstract reported one (false negative).

  • Quartile classification: For some tasks (eg, quickly searching for large clinical trials), the user does not need a 100% accurate prediction; a rough estimation is enough. To evaluate performance for these tasks, the SSE was used to classify abstracts into 1 of 4 quartiles: sample size <100, 100–299, 300–1000, or >1000 (as illustrated in the sketch below). A true positive was counted when the quartile of the sample size was estimated correctly, a false positive when the estimated quartile did not match the actual one, a true negative when no quartile was found and the abstract did not report a sample size, and a false negative when no estimation for the sample size was found but the abstract reported one. The thresholds of 100, 300, and 1000 were chosen arbitrarily as examples for classifying a sample size as “small,” “normal,” “large,” or “very large”; that is, they were not selected based on evidence, but serve to exemplify how the tool can perform rough estimations. In other scenarios, a user searching for a specific subset of clinical trials could define other threshold values, depending on the specific purpose of the search and the clinical trials in question.

In both cases, precision and recall were computed to measure the usefulness and completeness of results. We used the number of returned abstracts as the reference (rather than the total number of results, including those without an abstract) in order to obtain measures specific to the SSS tool: the percentage of results without an abstract varies with the specific search performed and is outside the SSS tool’s control.
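As an illustration of how these tests could be scored, the following sketch bins an estimate into the 4 categories above (thresholds 100, 300, and 1000) and counts hits and misses according to the definitions given; it is a simplified assumption of the evaluation logic, not the authors' actual script.

    def quartile(n):
        """Map a sample size to the illustrative bins used in the quartile classification test."""
        if n is None:
            return None
        if n < 100:
            return "small"
        if n < 300:
            return "normal"
        if n <= 1000:
            return "large"
        return "very large"

    def evaluate(estimates, truths):
        """estimates/truths: sample sizes (or None when absent), aligned by abstract."""
        tp = tn = fp = fn = 0
        for est, true in zip(estimates, truths):
            if est is not None and true is not None and est == true:
                tp += 1        # correct estimate
            elif est is None and true is None:
                tn += 1        # correctly found no sample size
            elif est is None:
                fn += 1        # missed a reported sample size
            else:
                fp += 1        # wrong estimate, or estimate where none was reported
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        return precision, recall, accuracy

    # For the quartile test, compare quartile(est) with quartile(true) instead of exact values.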

The results of these tests are summarized in Table 2.

Table 2.

Validation of the SSS tool

Sample size estimator (SSE), absolute estimation
  Number of abstracts in the validation database: 435
  True positives (TPs), correctly estimated sample sizes: 210
  True negatives (TNs), abstracts correctly identified as reporting no sample size: 97
  False positives (FPs), abstracts for which a sample size was incorrectly found: 88
  False negatives (FNs), abstracts reporting a sample size that were incorrectly labeled as not reporting any: 40
  Precision = TP / (TP + FP): 70.4%
  Recall = TP / (TP + FN): 84.0%
  Accuracy = (TP + TN) / (TP + TN + FP + FN): 70.6%

SSE, quartile classification
  Number of abstracts in the validation database: 435
  TPs, correctly estimated quartiles: 275
  TNs, abstracts correctly identified as reporting no sample size: 97
  FPs, abstracts for which a quartile was incorrectly found: 23
  FNs, abstracts reporting a sample size that were incorrectly labeled as not reporting any: 40
  Precision = TP / (TP + FP): 92.2%
  Recall = TP / (TP + FN): 87.3%
  Accuracy = (TP + TN) / (TP + TN + FP + FN): 85.5%

Low-confidence detector (LCD)
  Number of abstracts in the validation database: 219
  Wrong predictions provided by the SSE: 70
  Correct predictions provided by the SSE: 149
  TPs, wrong predictions detected: 17
  TNs, correct predictions detected: 134
  FPs, correct predictions labeled as low confidence: 15
  FNs, wrong predictions not labeled as low confidence: 53
  Precision = TP / (TP + FP): 53.1%
  Recall = TP / (TP + FN): 24.3%
  Accuracy = (TP + TN) / (TP + TN + FP + FN): 68.9%

For the LCD, a boosted trees ensemble16 showed the best results, with an area under the ROC curve of 0.71. This ROC curve, together with the operation point selected for the final version of the tool, is shown in Figure 4.

Figure 4. Receiver operating characteristic curve of the low-confidence detector component; the circle indicates the operation point selected for the released version of the SSS tool.

DISCUSSION

When analyzing our results, it is important to consider first the absolute estimation scenario for the SSE. The precision value shows that 70% of the estimations exactly matched the sample sizes reported in the abstracts, and the 84% recall figure indicates that few reported sample sizes go unnoticed, showing that the search results returned by the tool are of high quality and quite complete. The LCD may help to improve these figures by additionally labeling estimations as low confidence. This task is extremely complex, but with 53% precision, roughly 1 of every 2 predictions labeled as low confidence is indeed wrong, and the 24% recall shows that about one-fourth of wrong estimations are detected.

The performance figures of the SSE component may also improve considerably in scenarios where rough estimations suffice. For example, for a literature review, the user may not care whether the sample size of a given study equals 1200 or 1300 subjects, as long as it is identified as a large trial. We have provided the quartile classification test as an example of one of these scenarios: in this test, precision reached a very high value (92%) and recall increased to 87%. These specific figures may vary in other cases, depending on what is considered a “good rough estimation,” but they help to show that even when the estimation provided does not exactly match the actual sample size, it is, in many cases, a good estimation, close to the actual figure given in the abstract.

The effectiveness of the tool may be questioned if we consider that some important studies may pass unnoticed by the user. However, this problem is also inherent in many established search tools. For instance, when building the ground-truth database, we found that roughly 15% of the results returned when searching for clinical trials were not, in fact, clinical trials.

After reviewing the estimation errors, it was found that the most common sources of error were the following:

  • Indirect reporting: Many abstracts simply do not directly report sample size. Some of them individually report the sizes of the various groups of the trial or mention that the trial consisted of X groups, each composed of Y subjects; others report sample size mixed with other figures or use sentences that do not facilitate analysis (eg, stating, “From an initial pool of 4000 patients 300 were enrolled in the trial”).

  • Abstracts that individually report the sizes of the various groups without providing the total figure are the ones most consistently detected by the LCD.

  • Keyword-number confusion: In other cases, the abstract contains a number unrelated to sample size located near one of the keywords, which is then wrongly identified as a candidate sample size. For instance, in the phrase “3000 doses administered to patients,” the number 3000 could be incorrectly associated with the keyword “patients”; in the phrase “in 1998 patients were treated,” the year 1998 could be incorrectly identified as a sample size (a comma after the year would have prevented the error); or in “100 twelve-year-old boys were included,” the age (12) could be the final estimation because it is closer to the keyword “boys.”

  • Narrow subject terms: A third source of error is abstracts that report sample size using a very narrowly defined word for the subjects, one not present in the keyword list. Examples would be “100 fourth-graders,” “37 employees,” or “500 soldiers.”

Though the proposed tool might not offer a 100% exhaustive, 100% accurate sample size–filtered literature review of a given subject, in many cases the user will simply want to quickly and easily screen the literature for large trials that provide solid evidence about the subject, even if some research is missed. For these cases, the performance figures of the tool show that the returned search results will be very helpful and will save a considerable amount of time.

For exhaustive literature reviews, manual filtering of abstracts might also be an option. In these cases, even when the amount of time and effort required is acceptable, attention-intensive repetitive tasks like this are notoriously error-prone. As an example, during the execution of this work, the automated tool detected 3 errors inadvertently introduced by the researchers into the manually annotated database. Therefore, even in cases with high accuracy requirements, the proposed tool might also be useful as a means of double-checking.

CONCLUSION

In conclusion, the tool developed here proved useful for sample size filtering of the Medline database, helping in the identification of meaningful research. Validation tests show that it is capable of correctly estimating the sample sizes of 70% of abstracts and of correctly classifying 85% of them into sample size quartiles. Our work also provides direction for future studies developing machine learning approaches based on user-tool interactions to enhance data collection and analysis.

FUNDING

Research by AL is funded by Fondo de Investigaciones Sanitarias (grant # PI15/00558) and cofinanced by Fondos FEDER.

COMPETING INTERESTS

The authors have no competing interests to declare.

CONTRIBUTORS

All authors contributed to the idea of the tool. AS, AL, and JM created the ground-truth database and reviewed the state of the art. CB and JA designed and developed the algorithms and the graphical user interface of the tool. CB, AS, and JM performed the validation study. All authors participated in data analysis and article writing and reviewed the final version of the manuscript.

REFERENCES

  1. Medline. US National Library of Medicine, Medline Factsheet. 2007. www.nlm.nih.gov/pubs/factsheets/medline.html. Accessed April 28, 2017.
  2. Wildgaard LE, Lund H. Advancing PubMed? A comparison of third-party PubMed/Medline tools. Libr Hi Tech. 2016;34(4):669–84.
  3. Lu Z. PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford). 2011;2011:baq036.
  4. Rani J, Shah AB, Ramachandran S. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts. J Biosci. 2015;40(4):671–82.
  5. Schardt C, Adams MB, Owens T, et al. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med Inform Decis Mak. 2007;7:16.
  6. Hunter L, Cohen KB. Biomedical language processing: what’s beyond PubMed? Mol Cell. 2006;21(5):589–94.
  7. Lee K, Shin W, Kim B, et al. HiPub: translating PubMed and PMC texts to networks for knowledge discovery. Bioinformatics. 2016;32(18):2886–88.
  8. Terwee CB, Jansma EP, Riphagen II, et al. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115–23.
  9. Valderas JM, Mendivil J, Parada A, et al. Construcción de un filtro geográfico para la identificación en PubMed de estudios realizados en España [Development of a geographic filter for the identification in PubMed of studies performed in Spain]. Rev Esp Cardiol. 2006;59(12):1244–51.
  10. Baker S, Silins I, Guo Y, et al. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics. 2016;32(3):432–40.
  11. Song MM, Simonsen CK, Wilson JD, et al. Development of a PubMed based search tool for identifying sex and gender specific health literature. J Womens Health (Larchmt). 2016;25(2):181–87.
  12. Acland A, Agarwala R, Barrett T, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2014;42(Database issue):D7.
  13. Landauer TK. Latent semantic analysis. In: Encyclopedia of Cognitive Science. John Wiley & Sons, Ltd; 2016.
  14. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29–36.
  15. Rodriguez JD, Perez A, Lozano JA. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans Pattern Anal Mach Intell. 2010;32(3):569–75.
  16. Bauer E, Kohavi R. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach Learn. 1999;36(1–2):105–39.
