Author manuscript; available in PMC: 2020 Jan 1.
Published in final edited form as: J Biomed Inform. 2018 Nov 22;89:1–10. doi: 10.1016/j.jbi.2018.11.010

Table 1.

Classification features used in HI-TA to identify high impact clinical articles on disease treatment.

Feature name Feature description Data Type Used in HI-TS
Number of clinically useful sentences Clinicians prefer sentences that provide patient-specific, actionable recommendations for a particular intervention [33–37]. Clinically useful sentences in the citation title and abstract were identified using a sentence classifier developed in a previous study [38]. Numeric No
Journal impact factors Journal impact factors (JIF) are measures of the reputation and impact of a journal. In general, they are calculated as the ratio between the number of citations received by articles published in the journal and the number of articles published in the journal during a certain period of time (e.g., two years) [39,40]. We retrieved the JIFs for each citation at the time of its publication. We obtained JIFs from the Scimago Journal & Country Rank® (SJR®), developed by Scimago Lab with the data source provided by Scopus® [28]. We obtained three different JIFs from SJR: 1) the SJR (SCImago Journal Rank) indicator, which represents the ratio of weighted citation counts to the documents published in the journal of interest over the past three years [41]; 2) the journal h-index, which is the largest number h such that h articles in the journal have each received at least h citations; and 3) citations per document for the past two years. Numeric Yes
Study sample size Number of participants in the study. High impact clinical studies often have larger sample sizes [5]. The sample size was extracted using enhanced EasyCIE [42], a rule-based information extraction tool. This tool uses the ConText algorithm [43] to identify numbers within the context of sample size-related descriptions in abstracts. It then applies predefined rules to resolve any conflicts. For instance, if multiple numbers are likely to be the sample size, it chooses the first one. We developed the rules based on 700 training abstracts randomly sampled from PubMed and evaluated them on another 100 abstracts. We measured performance with two metrics: the F1-score and the average numeric difference rate (the average of the normalized difference between the extracted sample sizes and the true sample sizes). The F1-score on the test set is 0.82, and the average numeric difference rate ((Σ|Es − Ts|/Ts)/n, Es = extracted sample size, Ts = true sample size, n = number of abstracts) is 0.12. Analyses of the extracted sample sizes showed that values greater than 30,000 or smaller than 10 are usually not the actual study sample size. We treated them as missing values of this feature. Numeric Yes
Number of grants Research shows that publications sponsored by grants have higher impact than studies without grant support [44]. We obtained the number of grants supporting a study using the Scopus API [45]. Numeric No
Number of authors The number of authors is an independent predictor for the number of citations an article will receive [46]. We obtained this feature using the Scopus API [45]. Numeric No
Scientific impact of the authors’ institution The overall scientific impact of the authors’ institution could be a surrogate for the impact of the authors’ work. We collected a 2017 snapshot of the following features for both the first author and the corresponding author: 1) total number of citations to publications from the first/corresponding author’s institution; 2) total number of authors from the first/corresponding author’s institution; 3) institution’s average citation count per author. If an author had multiple affiliations, we used the institution with the highest reputation. We obtained all these features using the Scopus API [45]. Numeric No
Number of institutions and countries in a study Multi-center studies are more likely to have high impact. Collaboration helps better utilize resources and produces higher quality research [47]. In addition, collaborative studies receive more citations [47,48]. Multi-center studies may also have stronger designs, such as larger sample sizes and more diverse subjects. We obtained the number of institutions and countries participating in a study using the Scopus API [45]. Numeric No
Number of bibliographic references Research shows that article impact (based on citation count) is correlated with the number of bibliographic references included in the article [19,46,49,50]. We obtained this feature using the Scopus API [45]. Numeric No
Article page count and title word count Research shows that article impact (based on citation count) is correlated with the article length [46] and title length [51,52]. We used the Scopus API [45] to obtain page count and a Java program to obtain the title word count. Numeric No
Core clinical journal Represents whether the journal in which the study is published is part of a subset of core clinical journals. The list was obtained from the union of journals in the MEDLINE Core Clinical journals [53] and the McMaster Premium LiteratUre Service (Plus) journals [54]. Periodic evaluation and updates by experts ensure the quality of these lists [55–57]. Categorical Yes
Trial registration in ClinicalTrials.gov Represents whether the study that produced the publication is registered in ClinicalTrials.gov. Registering a clinical trial in national registries, such as ClinicalTrials.gov, is required by funding agencies and by many of the core clinical journals. This feature was extracted from PubMed citation metadata using the eUtils API. Categorical Yes
Publication in PubMed Central® Represents whether the article is included in the PubMed Central database. Studies funded by the US National Institutes of Health (NIH) are included in PubMed Central. They tend to be more balanced than commercially funded studies, which is an indication of strong clinical impact [58–60]. This feature was extracted from PubMed citation metadata using the eUtils API. Categorical Yes
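
The numeric definitions in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`clean_sample_size`, `average_numeric_difference_rate`, `h_index`) are hypothetical, and the sketch assumes that the plausibility thresholds are inclusive of 10 and 30,000 and that the difference rate is computed only over abstracts with both an extracted and a true sample size. The table specifies neither detail.

```python
from typing import Optional, Sequence


def clean_sample_size(extracted: Optional[int],
                      low: int = 10,
                      high: int = 30000) -> Optional[int]:
    """Treat implausible extracted sample sizes as missing (None).

    Values smaller than `low` or greater than `high` are discarded,
    following the thresholds stated in Table 1. Keeping the boundary
    values themselves is an assumption of this sketch.
    """
    if extracted is None or extracted < low or extracted > high:
        return None
    return extracted


def average_numeric_difference_rate(extracted: Sequence[float],
                                    true: Sequence[float]) -> float:
    """Compute (sum over abstracts of |Es - Ts| / Ts) / n.

    `extracted` and `true` are paired per abstract; n is their length.
    """
    if len(extracted) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(es - ts) / ts for es, ts in zip(extracted, true))
    return total / len(true)


def h_index(citations: Sequence[int]) -> int:
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For example, extracted sizes of 90 and 220 against true sizes of 100 and 200 each deviate by 10%, so the average numeric difference rate for that pair of abstracts is 0.1.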