
Table 3.

Features we used for the target learning tasks and additional features we used in learning to map from candidate sets (the distant supervision) to ‘true’ labels. We set discrete (‘binned’) feature thresholds heuristically, reflecting intuition; we did not experiment at length with alternative coding schemes. Note that separate models were learned for each PICO domain.

Feature | Description
Bag-of-Words | Term frequency-inverse document frequency (TF-IDF) weighted uni- and bi-gram count features extracted for each sentence. We include up to 50,000 unique tokens that appear in at least three unique sentences.
Positional | Indicator variable coding for the decile (with respect to length) of the article in which the corresponding sentence is located.
Line lengths | Variables indicating whether at least 10% or at least 25% of the lines in a sentence are 'short' (operationally defined as comprising 10 or fewer characters); a heuristic for identifying tabular data.
Numbers | Indicators encoding the fraction of numerical tokens in a sentence (fewer than 20% or fewer than 40%).
New-line count | Binned indicators for the number of new-line characters in a sentence. Bins were: 0–1, fewer than 20, and fewer than 40 new-line characters.
Drugbank | An indicator encoding whether the sentence contains any known drug names (as enumerated in a stored list of drug names from http://www.drugbank.ca/).

Additional features used for the SDS task (encoded by X̃)

Shared tokens | TF-IDF weighted features capturing the uni- and bi-grams present both in a sentence and in the Cochrane summary for the target field.
Relative similarity score | 'Score' (here, token overlap count) for a sentence with respect to the target summary in the CDSR. Specifically, we use the score for the sentence minus the average score over all candidate sentences.
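
To make the featurization concrete, the sketch below shows one way these per-sentence features could be computed in Python. It is illustrative only and not the authors' implementation: the scikit-learn TfidfVectorizer settings mirror the 50,000-token and three-sentence cutoffs above, while the helper function names, the `drug_names` set (assumed to be loaded from DrugBank), the tokenization, and the exact bin boundaries are assumptions.

```python
# Illustrative sketch of the Table 3 featurization (not the authors' code).
# Assumptions: each `sentence` is a string that may contain new-line characters,
# `drug_names` is a lower-cased set loaded from DrugBank, and the token and bin
# definitions below are stand-ins for the paper's heuristics.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def bag_of_words(sentences):
    """TF-IDF weighted uni-/bi-grams; at most 50,000 tokens seen in >= 3 sentences."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=3)
    return vectorizer.fit_transform(sentences), vectorizer


def heuristic_features(sentence, sentence_index, n_sentences, drug_names):
    """Discrete ('binned') features for one sentence, using Table 3 thresholds."""
    lines = sentence.split("\n")
    tokens = sentence.split()
    short_frac = sum(len(line) <= 10 for line in lines) / max(len(lines), 1)
    numeric_frac = sum(
        bool(re.fullmatch(r"[\d.,%]+", tok)) for tok in tokens
    ) / max(len(tokens), 1)
    n_newlines = sentence.count("\n")
    decile = min(int(10 * sentence_index / max(n_sentences, 1)), 9)

    return {
        # Positional: decile of the article in which the sentence occurs.
        "decile_%d" % decile: 1.0,
        # Line lengths: fraction of 'short' (<= 10 character) lines.
        "short_lines_geq_10pct": float(short_frac >= 0.10),
        "short_lines_geq_25pct": float(short_frac >= 0.25),
        # Numbers: fraction of numerical tokens.
        "numeric_lt_20pct": float(numeric_frac < 0.20),
        "numeric_lt_40pct": float(numeric_frac < 0.40),
        # New-line count bins (0-1, fewer than 20, fewer than 40).
        "newlines_0_1": float(n_newlines <= 1),
        "newlines_lt_20": float(1 < n_newlines < 20),
        "newlines_lt_40": float(20 <= n_newlines < 40),
        # Drugbank: does the sentence mention any known drug name?
        "has_drug_name": float(any(tok.lower() in drug_names for tok in tokens)),
    }


def sds_features(sentence, summary, candidate_sentences):
    """Extra SDS features: overlap with the Cochrane summary and the relative
    score (this sentence's overlap minus the mean over all candidates)."""
    summary_tokens = set(summary.lower().split())

    def overlap(s):
        return len(set(s.lower().split()) & summary_tokens)

    scores = [overlap(c) for c in candidate_sentences]
    return {
        # Plain token overlap stands in for the TF-IDF weighted shared n-grams.
        "shared_token_count": float(overlap(sentence)),
        "relative_score": overlap(sentence) - (float(np.mean(scores)) if scores else 0.0),
    }
```

In use, the dictionaries returned by the two helper functions could be vectorized (for example with scikit-learn's DictVectorizer) and concatenated with the TF-IDF matrix; only the SDS mapping model would additionally receive the X̃ features.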