Abstract
Objective
Evidence synthesis teams, physicians, policy makers, and patients and their families all have an interest in following the outcomes of clinical trials and would benefit from being able to evaluate both the results posted in trial registries and the results reported in the publications that arise from them. Manual searching for publications arising from a given trial is a laborious and uncertain process. We sought to create a statistical model to automatically identify PubMed articles likely to report clinical outcome results from each registered trial in ClinicalTrials.gov.
Materials and Methods
A machine learning-based model was trained on trial–publication pairs (publications known to be linked to specific registered trials). Multiple features were constructed based on the degree of matching between the PubMed article metadata and specific fields of the trial registry entry, as well as on matching with the set of publications already known to be linked to that trial.
Results
Evaluation of the model using known linked articles as gold standard showed that they tend to be top ranked (median best rank = 1.0), and 91% of them are ranked in the top 10.
Discussion
Based on this model, we have created a free, public web-based tool that, given any registered trial in ClinicalTrials.gov, presents a ranked list of the PubMed articles in order of estimated probability that they report clinical outcome data from that trial. The tool should greatly facilitate studies of trial outcome results and their relation to the original trial designs.
Keywords: clinical trials as topic, ClinicalTrials.gov, evidence-based medicine, bibliometrics, systematic reviews
OBJECTIVE
Clinical trials drive improvements in health care. Evidence-based medicine seeks to collect and evaluate all possible evidence on a given question, giving highest priority to randomized controlled trials.1 Evidence may reside in peer-reviewed publications that report trial clinical outcome data; clinical outcome data that are deposited in trial registries; gray literature (eg, conference proceedings or online technical reports); and patient-level trial data. Posted trial results often give more information about adverse events than the corresponding publications do, and even such basic information as primary clinical outcome measures may differ between these sources.2–6 Although it is unclear how often the conclusions of a systematic review are altered by including posted trial results,7–9 physicians, policy makers, and patients and their families all have an interest in following the outcomes of registered trials,10 and would benefit from being able to evaluate both the trial registry results and the publications that arise from them.
Manual searching for publications arising from a given trial is a laborious and uncertain process.11,12 Several machine learning methods employ textual similarity and other features to link publications to individual trials,13–15 although no web-based systems are currently available. In the present study, we have created a free, public web-based tool that, given any registered trial in ClinicalTrials.gov, presents a ranked list of PubMed articles in order of the estimated probability that they arise from that trial.
BACKGROUND AND SIGNIFICANCE
Ideally, all publications related to a given clinical trial would be findable as part of a single “thread” by virtue of being indexed with the same registry number.16,17 Finding the publications that arise from a given clinical trial is no easy matter, however, since only about half of trials give rise to any publications,18–20 and of those that do, fewer than half mention the trial registry number to permit unambiguous linkage back to the trial.11,12 Identifying publications that refer to a given clinical trial is not a straightforward problem, for several reasons. First, the number of trials is large (∼392 000 in ClinicalTrials.gov as of October 1, 2021) and the number of potentially linked publications is even larger (∼33 million articles indexed in PubMed, of which ∼1.96 million mention the words trial or trials in title or abstract). Second, there is great variability among trials in the number of publications and their publication lags.18,19,21–24 Half of trials lack any publications, and most of those that do have only 1 or 2, yet there is a long tail with some having >20. Most are published between 2 and 5 years after the completion of the trial, but a few may be published after 10 or more years. Third, previously proposed methods of linking trials to publications (eg, overall matching of textual similarity15 or shared authors between trials and publications11) have limited predictive performance on their own. Fourth, the textual fields and metadata of trial registries are not well standardized,14,25–27 which complicates matching specific textual fields of trials to those of publications. Finally, a wide variety of ancillary publications may arise from a trial, such as questionnaire development, genome-wide association studies carried out on trial subjects, reanalysis of data across multiple trials, and so on, which may not share word usage, topics, or investigators with the registered trial entry. Thus, similarity-based methods may be expected to be more successful for clinical outcome articles than for ancillary articles. However, our intent was to identify both clinical outcome and ancillary articles, and our modeling and evaluation efforts employed the full range of article types.
We built our Trials to Publications tool specifically for ClinicalTrials.gov and PubMed because they allow comprehensive and regularly updated XML-formatted downloading of all their trials and publications, which are not available for other trial registries and bibliographic databases. We have employed some of the features used in previous studies14,15 but have carried out additional trial-publication feature engineering, and have added a new set of article–article similarity features based on the Aggregator model,27,28 which scores the degree of matching between a given candidate publication and the set of publications that are definitely known to be linked to the trial. As we will show, this innovation substantially improves the performance of the model and surpasses previous efforts.
MATERIALS AND METHODS
See the Supplementary File for full details.
Definitions and overview
Each registered trial in ClinicalTrials.gov is assigned a unique National Clinical Trial (NCT) number and has a trial registry entry consisting of multiple templated fields (eg, start date, sponsor, inclusion and exclusion criteria, etc). Publications that arise from a given trial comprise 3 different cases:
Some publications mention the NCT number explicitly in the abstract and/or have it indexed in the PubMed record. Most of these are posted as a templated field in the ClinicalTrials.gov registry. In addition, we found that the automatic recognition system misses some textual variants, as well as NCT numbers mentioned in the Corporate Author field; therefore, we supplemented this set of articles with additional articles found using our own algorithms (Supplementary Methods). These are referred to as “NCT-linked articles” and were utilized as training data for our modeling, excluding publications that contained multiple NCT numbers. They were also regarded as gold standards for evaluation (see below).
Some trials list publications that were submitted by the trial investigators but that do not contain NCT numbers. Some provide clinical outcome results of the trial, whereas others are earlier studies or reviews that provided motivation for carrying out the trial (Figure 1). We attempted to identify clinical outcome articles by placing restrictions on publication date and requiring that the article be listed in the specific “results_reference” field of ClinicalTrials.gov (Supplementary Methods). These “investigator-submitted articles” were used as silver standards for evaluation.
Finally, some PubMed-indexed publications arise from a given trial, but are not listed or explicitly attached to the ClinicalTrials.gov registry. (Some of these articles mention the NCT number only within the full-text; others do not mention any NCT numbers at all.) Our goal is to identify such articles—whether or not they are clinical trial outcome articles, and whether or not they have been assigned Medical Subject Headings or Publication Types.
Figure 1.
Relationship between trial start date and linked article publication dates for all registered trials in ClinicalTrials.gov. Data were collected in February 2020. Publications with an explicit NCT link (shown in gold) most commonly appear 3–4 years after the start date of a trial. In contrast, investigator-submitted publications (shown in gray) exhibit a bimodal distribution of publication dates: Some are reviews or other publications that predate the start of a trial and provide motivation for carrying out the trial; in contrast, those published after the start of a trial appear to comprise primarily articles that arose from the trial itself.
Data preparation
We downloaded all public data as provided by the U.S. National Library of Medicine on the ClinicalTrials.gov website (https://clinicaltrials.gov/ct2/resources/download, accessed February 20, 2020). The full dataset contained approximately 330 000 unique clinical trial registrations with all public data supplied in XML format. All XML fields were imported into a relational database for efficient data retrieval, with a filter that retains only alphanumeric characters and common punctuation. We also obtained the PubMed records (metadata) that are readily available for all ∼30 million articles at pubmed.gov.
Machine learning
Our strategy was to create a “positive training set” comprising 21 430 known trial registry entry–publication pairs versus an equal-sized “negative training set” constructed by randomly pairing trials with publications from other trials, matched such that they study the same condition or intervention. This was done to be consistent with how we anticipate our model would be applied in the real world, in which searches will generally be constrained to studies of a particular condition or intervention. We extracted multiple features based on different aspects of matching between the registered trial and its linked publication, and used these to create a monotonic multidimensional measure of similarity that optimally distinguishes pairs in the positive versus negative training set. Finally, the similarity score is further modeled to estimate the probability that the publication arose from that trial.
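For concreteness, a minimal sketch of how such condition-matched negative pairs might be constructed (the data structures and random sampling below are illustrative assumptions, not the authors' actual pipeline):

```python
import random
from collections import defaultdict

def build_negative_pairs(positive_pairs, trial_condition, seed=0):
    """Pair each trial with a publication from a *different* trial that
    studies the same condition, yielding condition-matched negatives.

    positive_pairs: list of (nct_id, pmid) known trial-publication links
    trial_condition: dict mapping nct_id -> condition string (assumed)
    """
    rng = random.Random(seed)
    pairs_by_condition = defaultdict(list)
    for nct_id, pmid in positive_pairs:
        pairs_by_condition[trial_condition[nct_id]].append((nct_id, pmid))

    negatives = []
    for nct_id, _ in positive_pairs:
        # Candidate negatives: publications linked to other trials of the
        # same condition (conditions with a single trial are skipped).
        candidates = [pmid for other, pmid
                      in pairs_by_condition[trial_condition[nct_id]]
                      if other != nct_id]
        if candidates:
            negatives.append((nct_id, rng.choice(candidates)))
    return negatives
```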
Feature engineering—I: matching fields from trials to articles
We investigated various ways of matching different fields from trial registry entries with article metadata, to create a series of potential features for machine learning. For comparisons of English-like text, we computed weighted and unweighted shared terms and implicit scores as described in Smalheiser et al,29 as well as Okapi BM25 similarity scores. For comparisons involving authors and investigators, we applied the Authority author name matching algorithm,30 which assigns a score from 0 to 100 determined by levels of partial match for last name and initials; human names were parsed from the trial XML elements using the Python Nameparser library.31 For data elements that generally contained IDs or lists, such as MeSH terms, we computed simple occurrence counts and/or the proportion of items in the list found. Table S1 contains a list of all combinations of trial registry field attributes, PubMed article metadata fields, and matching algorithms that were considered as potential predictive features.
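To illustrate the flavor of these trial-to-article features, here is a simplified sketch; the actual weighted, implicit, and BM25 text scores and the Authority algorithm are considerably more elaborate than the stand-ins below:

```python
def shared_term_fraction(trial_text: str, article_text: str) -> float:
    """Unweighted shared-term overlap (Jaccard) between two text fields;
    a crude stand-in for the weighted/implicit/BM25 text scores."""
    a, b = set(trial_text.lower().split()), set(article_text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def author_match_score(trial_name: str, article_name: str) -> int:
    """Toy partial-match score in the spirit of the Authority algorithm:
    100 for last name + first initial, 50 for last name only, else 0."""
    t, a = trial_name.lower().split(), article_name.lower().split()
    if not t or not a or t[-1] != a[-1]:
        return 0
    return 100 if t[0][0] == a[0][0] else 50

def list_overlap_proportion(trial_items, article_items) -> float:
    """Proportion of trial list items (eg, MeSH terms) found in the article."""
    if not trial_items:
        return 0.0
    found = {x.lower() for x in article_items}
    return sum(x.lower() in found for x in trial_items) / len(trial_items)
```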
Because articles arising from a registered trial must be published after the trial onset, the relation between trial dates and article publication dates was an important feature for predicting the likelihood of a trial–article match. We found the discriminant ratio between the start date, as provided by the //clinical_study/start_date XML field of ClinicalTrials.gov, and the article publication date (PubMed PD field) to be the most stable and predictive date feature for modeling.
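The exact discriminant-ratio formula is given in the Supplementary Methods; as a minimal sketch of the underlying signal, a signed elapsed-time feature could be computed as:

```python
from datetime import date

def years_since_trial_start(trial_start: date, article_pub: date) -> float:
    """Signed elapsed time in years between the //clinical_study/start_date
    value and the article's publication date; negative values (publication
    before trial start) argue strongly against a trial-article link."""
    return (article_pub - trial_start).days / 365.25

# Example: trial starting March 2014, article published June 2017 -> ~3.3 years
print(years_since_trial_start(date(2014, 3, 1), date(2017, 6, 15)))
```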
Feature engineering—II: matching fields from articles to articles
Many registered trials generate more than 1 publication. In such situations, not only does each publication show some matching relationship with the trial entry, but the publications are also likely to show a high degree of matching among themselves. We have previously investigated how 2 publications linked to the same registered trial share textual and metadata features.27 We employed this model to create a tool, Aggregator, which takes a list of clinical trial articles as input and identifies those that are likely to arise from the same registered trial.27,28 In the present study, we supplemented the trial–article features described above with additional article–article features derived from the Aggregator model. That is, given a specific registered trial, each article in the candidate pool is scored both on its similarity to the trial entry using trial–article features, and on its similarity to any publications that are known to be linked to that registered trial (ie, NCT-linked or investigator-submitted articles) using article–article features. The article–article features provided a substantial contribution to our model’s overall performance (see Results section).
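Schematically, an article–article feature of this kind reduces to the best pairwise score between the candidate and the known-linked set. A sketch follows; the actual Aggregator scoring scheme is described in Supplementary File 1.6, and the dict-based records here are illustrative assumptions:

```python
def aggregator_feature(candidate, known_linked, pair_score) -> float:
    """Best article-article similarity between a candidate publication and
    the set of articles already known to be linked to the trial.

    candidate: article record (dict with a 'pmid' key, assumed here)
    known_linked: list of article records already linked to the trial
    pair_score: callable scoring 2 articles (eg, the Aggregator model)
    """
    scores = [pair_score(candidate, known) for known in known_linked
              if known["pmid"] != candidate["pmid"]]  # never score vs itself
    return max(scores, default=0.0)
```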
Feature selection and optimizing the model
We chose to fit a classical Logistic Regression model because of its interpretability and ease of implementation, and because we expected no strong nonlinearities or nonseparable feature interactions (see Supplementary Methods). We examined a neural network (a multilayer perceptron with 1 hidden layer) and found that it gave the same F1 value as Logistic Regression. In our previous modeling, we also found that Logistic Regression gave performance comparable to decision trees and linear SVMs.27 Thirty percent of our modeling dataset was held out for use as a test sample, leaving us with 29 818 training cases and 13 042 cases for model validation, both balanced with equal numbers of positive and negative trial–article pairs.
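A minimal scikit-learn sketch of this fit-and-evaluate step (the feature matrix below is a synthetic placeholder, and the paper does not state which software library was actually used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

# X: one row of engineered features per trial-article pair; y: 1 = linked.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))            # synthetic placeholder features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic placeholder labels

# 30% hold-out, mirroring the split described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)                 # implicit 0.5 decision threshold
prob = model.predict_proba(X_te)[:, 1]
print(f"F1={f1_score(y_te, pred):.3f}  AUC={roc_auc_score(y_te, prob):.3f}")
```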
Model development began with a forward stepwise process, with features added until the model’s adjusted R2 no longer increased. We further optimized the model and reduced feature multicollinearity by choosing only the best-performing comparison method for each trial–article pair of elements when multiple methods had entered the model via forward selection. Other model adjustments were selected by careful examination of model performance and specific errors. Features that compared trial sponsors with article affiliations were error-prone, as described in previous work.1,6 After experimenting with ways to mute the effect of erroneous and mismatched organizational names, we excluded the affiliation features from the model.
Final model parameters are shown in Table S2 (Supplementary Methods). The strongest predictor was the feature that compares candidate articles to articles that are known to be linked to each trial. After that, the top predictors include trial intervention matching, the weighted text similarity score29 between the trial summary field and the article title/abstract, and the time differential from the trial start date. The remaining predictors (Table S2) provided lesser but still significant contributions.
Converting the model’s raw similarity score to an estimated probability value
Given that our model is derived from a balanced sample of positive and negative cases, we engineered a simple odds ratio function to transform model scores into estimated probabilities. Using the observed odds ratios for pairs in the positive and negative training sets, we smoothed the estimated probabilities across 3 quasi-linear segments, from 0.00 to 0.75, from 0.76 to 0.96, and from 0.97 to 1.00 (Figure S2, Supplementary Methods). We then adjusted the odds ratio for each segment and interpolated observed similarity scores against these baselines to estimate the probabilities displayed in the web tool.
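In code, such a segment-wise mapping can be realized as a piecewise-linear interpolation; the probability values at the breakpoints below are hypothetical stand-ins, not the fitted values from Figure S2:

```python
import numpy as np

# Breakpoints of the 3 quasi-linear segments described above; the PROB
# values are hypothetical placeholders for the odds-derived estimates.
SCORE_KNOTS = [0.00, 0.75, 0.96, 1.00]
PROB_KNOTS = [0.00, 0.02, 0.50, 1.00]

def score_to_probability(score: float) -> float:
    """Map a raw model similarity score to an estimated probability by
    piecewise-linear interpolation between the segment breakpoints."""
    return float(np.interp(score, SCORE_KNOTS, PROB_KNOTS))

print(score_to_probability(0.763))  # a mid-high score maps to a low probability
```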
RESULTS
The machine learning performance of the fitted model, assuming a binary decision threshold of 0.5, was precision = 90.43% and recall = 84.57%, with F1 = 87.41%, on the hold-out test data (which did not include any examples used for training). Overall accuracy was 87.81%, with an area under the receiver operating characteristic (ROC) curve of 0.95. These results, particularly the area under the ROC curve, indicate that the model is inherently able to discriminate positive examples (trials paired with NCT-linked publications) from negative examples (trials randomly paired with publications from different trials, matched such that they study the same condition or intervention). However, it is more relevant to evaluate the model in the context of the implemented web-based ranking tool, which, given a registered trial as input, first identifies a pool of 5000 candidate articles and then uses the model to produce a ranked list of articles according to their similarity scores.
We evaluated how known NCT-linked articles (ie, the gold standards) were ranked by the model; however, several nuances in the evaluation scheme needed to be dealt with. First, investigator-submitted articles, and any true-positive articles that are not explicitly linked, are all counted as negatives in this evaluation, which may underestimate true performance. Second, trials that are associated with exactly 1 NCT-linked article will always rank that article at the top (due to article–article matching), which will overestimate performance in that specific situation unless corrected. Third, trials that are not associated with any known articles require manual assessment of the highly ranked articles to see which, if any, appear to be truly associated with the trial. To make a more robust evaluation that deals with these issues, we separately examined trials with 2 or more NCT-linked articles; trials with exactly 1 NCT-linked article (but possibly 1 or more investigator-submitted articles); and trials with no NCT-linked articles.
Trials with 2 or more NCT-linked articles
In this situation, each candidate article is compared not only to the registered trial, but to each of the other articles known to be linked to the trial (including the NCT-linked articles and any investigator-submitted articles having publication dates later than the trial start date). A candidate article is not compared against itself, however. This situation provides the most accurate assessment of performance. As shown in Table 1, the model ranks NCT-linked articles extremely well, with a median first rank of 1.0 and median first similarity score of 0.993. Overall, about 78% of the NCT-linked articles arising from a trial are ranked in the top 10. (Note that maximum recall is less than 100% for this category, since some trials have more than 10 linked articles.)
Table 1.
Performance parameters for registered trials having 2 or more NCT-linked articles
| Test set | No. of trials in sample | Median first rank (IQR) | MRR (95% CI) | Recall@5 (95% CI) | Recall@10 (95% CI) | Median first score (95% CI) |
|---|---|---|---|---|---|---|
| Trials with 2 or more NCT-linked articles | 300 | 1.0 (1.0–3.0) | 0.690 (0.618–0.765) | 0.660 (0.591–0.727) | 0.779 (0.719–0.841) | 0.993 (0.988–0.996) |
| Trials with 2 or more NCT-linked articles, without aggregator features | 300 | 2.0 (1.0–6.0) | 0.576 (0.498–0.657) | 0.534 (0.461–0.612) | 0.642 (0.566–0.717) | 0.951 (0.912–0.973) |
Note: Three hundred registered trials were randomly chosen that were known to be linked to 2 or more NCT-linked articles, and we applied the full model or the model lacking the Aggregator feature. The 5000 candidate articles were all ranked but only the top 10 were displayed by the web-based tool. The parameters include: mean reciprocal rank, that is, 1/(mean rank across all linked articles); Recall@5, recall of the top 5 ranked articles; Recall@10, recall of the top 10 ranked articles; median first rank, that is, the top rank among all its NCT-linked articles; median first similarity score, that is, the top similarity score among all its NCT-linked articles; IQR, interquartile range of the median first rank scores across the dataset; 95% CI, 95% confidence interval of the corresponding parameter. CI was calculated from a simple bootstrap of 1000 samples with replacement.
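The ranking metrics defined in the note can be computed directly from the gold-standard ranks. A minimal sketch, implementing MRR exactly as defined above (ie, the reciprocal of the mean rank pooled across all linked articles):

```python
def mrr(ranks_per_trial) -> float:
    """Mean reciprocal rank as defined in the table note:
    1 / (mean rank across all linked articles, pooled over trials)."""
    all_ranks = [r for ranks in ranks_per_trial for r in ranks]
    return len(all_ranks) / sum(all_ranks)

def recall_at_k(ranks_per_trial, k: int) -> float:
    """Fraction of linked articles ranked within the top k."""
    all_ranks = [r for ranks in ranks_per_trial for r in ranks]
    return sum(r <= k for r in all_ranks) / len(all_ranks)

# Toy example: 2 trials whose linked articles rank (1, 4) and (2, 12).
ranks = [[1, 4], [2, 12]]
print(mrr(ranks), recall_at_k(ranks, 5), recall_at_k(ranks, 10))
```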
The Aggregator features (ie, article–article matching) contribute substantially to overall performance, since if they are removed entirely from the model, then the performance drops to a median first rank of 2.0 and median first score of 0.951—significantly lower (P = 1.04 × 10−5) but still quite respectable (Table 1).
Trials with exactly 1 NCT-linked article
Overall, trials in this situation (Table 2, row a) produce a ranked list of articles whose median first rank is 1.0 and median first similarity score is 0.993, essentially the same as observed above in the case of 2 or more NCT-linked articles. However, we were concerned that this might be an overestimate of the true performance since when there is only 1 article linked to a trial, the model allows that article to be compared with itself. In order to assess the performance a bit more realistically, we examined trials that had 1 NCT-linked article but also had 1 or more investigator-submitted articles—in this situation, the test article is compared against the other investigator-submitted articles but not against itself. Table 2 row b shows that the median score of the NCT-linked article remains very high, 0.969.
Table 2.
Performance parameters for registered trials having 1 NCT-linked article
| Test set | No. of trials in sample | Median first rank (IQR) | MRR (95% CI) | Recall@5 (95% CI) | Recall@10 (95% CI) | Median first score (95% CI) |
|---|---|---|---|---|---|---|
| a) One NCT-linked article | 300 | 1.0 (1.0–1.0) | 0.851 (0.800–0.902) | 0.953 (0.910–0.990) | 0.983 (0.950–1.000) | 0.993 (0.988–0.996) |
| b) One NCT-linked article, but has other investigator submitted articles (only gold standard examples marked as positives) | 300 | 3.0 (1.0–14.0) | 0.492 (0.412–0.578) | 0.650 (0.560–0.740) | 0.730 (0.630–0.810) | 0.969 (0.951–0.982) |
| c) One NCT-linked article, but has other investigator submitted articles (gold and silver standard examples both marked as positives) | 300 | 1.5 (1.0–5.0) | 0.615 (0.537–0.693) | 0.533 (0.454–0.606) | 0.637 (0.561–0.707) | 0.984 (0.973–0.991) |
| d) One NCT-linked article, without Aggregator features | 300 | 2.0 (1.0–23.0) | 0.506 (0.422–0.593) | 0.607 (0.520–0.700) | 0.690 (0.600–0.780) | 0.887 (0.753–0.942) |
Note: Three hundred registered trials were randomly sampled having exactly 1 NCT-linked article and were processed either by the full model (top row) or by the model lacking the Aggregator features (bottom row). An additional 300 trials were sampled having exactly 1 NCT-linked article but 1 or more investigator-submitted articles whose publication dates were after the trial start date; these were processed by the full model (middle rows). Parameters are defined as in Table 1.
Although the median rank of the NCT-linked article falls from 1.0 to 3.0, this effect is more apparent than real, since the investigator-submitted articles are competing for the top ranks, yet are not counted as positives in this Gold Standard evaluation. Table 2 row c corrects for this by counting both NCT-linked and investigator-submitted articles as positives, and in this case, the median first rank is 1.5 and median first similarity score is 0.984. This may be the fairest estimate of the performance of the model applied to trials that have a single NCT-linked article. As shown in Table 2, row d, if article–article features are removed entirely from the model, the performance falls substantially and with statistical significance (note the nonoverlapping confidence intervals), with the median score of the NCT-linked article centered at 0.887 (compare with 0.993 observed in Table 2, row a).
Trials with no known associated articles
This category comprises 3 different subcases: (1) Roughly half of trials do not generate any publications at all. (2) Some trials generate publications that are not indexed in PubMed. (3) Trials may generate PubMed articles that do not specify NCT numbers in the abstract or record metadata.
For this category, precision and recall are not appropriate performance parameters. Instead, we created a random sample of 100 registered trials lacking any NCT-linked or investigator-submitted articles, made a ranked candidate list of 5000 PubMed articles for each trial, and plotted the best predictive score for each trial. As shown in Figure 2, only 12 of the 100 trials had best similarity scores above 0.98, versus 79 of 100 randomly chosen trials that had 2 or more NCT-linked articles. In 9 of the 100 trials, the best-scoring article had a score >0.99; of these, 2 were definitely linked to the trial in question (the NCT number was listed within the full text) and 2 probably belonged (same topic, investigator, and institution). Two were associated with different trials (a different NCT number was given in the full text) but were closely related, for example, the same investigator studying “hepatic function during and following 3 days of acetaminophen dosing” versus “aminotransferase trends during prolonged acetaminophen dosing.” This suggests that very high-scoring candidate articles include at least some true positives as well as articles associated with closely related trials.
Figure 2.
Best model-derived similarity scores for ranked article lists processed for 100 trials with no known results publications versus 100 trials with 2 or more NCT-linked publications.
Comparison of our model to previously published methods
The current state of the art is represented by Dunn et al,15 who published a trials-to-publications model based on textual and conceptual similarity between article and trial metadata. To compare our methods directly against theirs as the previously published baseline, we used the same test corpus of 300 trials as reported in Table 1 (a few articles could not be parsed by Dunn’s system, archived publicly at https://github.com/pmartin23/tfidf, and were removed from our comparison as well). As shown in Table 3, our performance was better on all parameters, and the differences were statistically highly significant.
Table 3.
Performance parameters comparing our method with that of Dunn et al15
| Test set | No. of trials in sample | Median first rank (IQR) | MRR (95% CI) | Recall@5 (95% CI) | Recall@10 (95% CI) |
|---|---|---|---|---|---|
| Sample 4: Trials with 2 or more NCT-linked articles | 300 | 1.0 (1.0–3.0) | 0.690 (0.617–0.761) | 0.661 (0.593–0.731) | 0.780 (0.722–0.845) |
| Dunn scoring: Trials with 2 or more NCT-linked articles | 300 | 2.0 (1.0–14.0) | 0.558 (0.471–0.641) | 0.418 (0.344–0.493) | 0.517 (0.448–0.593) |
| P-value for difference | 3.4 × 10−9** | 1.67 × 10−5** | 5.36 × 10−17** | 1.28 × 10−18** |
Note: *P < 0.05;
**P < 0.001.
Goodwin et al14 have also published a deep learning-based, multifeature model to identify publications that are linked to registered trials. It is not possible to make a direct comparison between our model and theirs, because their system is not currently available and because their evaluated trials and articles were not explicitly documented. Nevertheless, when we computed the same information retrieval performance metrics as Goodwin et al in order to make an approximate comparison of methods, we achieved superior performance (see Table S4, Supplementary Methods).
From our ablation studies (Tables 1 and 2) and the comparisons above (Tables 3 and S4), the improved performance of our model relative to Dunn et al and Goodwin et al is likely explained, at least in part, by our incorporation of article–article similarity features (“Aggregator” features), which were not used in prior work.14,15
Error analysis
We manually examined a selection of articles that were given predictive scores >0.99 yet definitely did not arise from the registered trial as predicted (eg, they were associated with different NCT numbers). The most common errors were caused by investigators who have individually produced numerous trials and numerous publications regarding the same condition or treatment. Thus, articles may be predicted to belong to 1 registered trial when, in fact, they belong to a related or follow-up trial (eg, a phase III trial rather than phase II). A similar type of error can also occur for heavily studied conditions (eg, metformin to treat diabetes) in which many similar trials produce many similar publications. Even these errors do not negate the utility of our tool, however: since most trials generate only a few publications at most, displaying the top 10 publications (along with any that explicitly list NCT or other trial registry numbers) should point users to closely related trials.
Implementing the tool based on the model
A web-based query interface that implements our model is shown in Figure 3. The user is prompted to enter a valid NCT number of a registered trial in ClinicalTrials.gov, and receives a list of PubMed articles ranked according to similarity score (Figure 4). Note that in the Basic Search, we have restricted the pool of candidate articles to those that were published after the trial start date and that share at least some minimal features with the registered trial (Supplementary Methods). If more than 5000 articles satisfy these criteria, then the pool is limited to the 5000 that match best on the initial feature set. A pool of 5000 articles can be scored and ranked in real time within ∼10 min. Because most trials have no or very few associated publications, displaying the top 10 articles will capture nearly all relevant articles in most cases. However, if all 10 have similarity scores >0.8, then the display is extended to show all articles that have scores ≥0.8. In addition to displaying the similarity score for each article, we also display the estimated probability that the article arose from the given trial (see Figure 4 and Supplementary Methods). Note that probabilities are not simply proportional to similarity scores; for example, a similarity score of 76.3% implies only a 1.9% chance that the article arose from that trial (Figure 4). Articles that share the same NCT number as the trial are displayed at the top of the page, along with investigator-submitted articles, regardless of their similarity score. Finally, we have scraped registry numbers for a large number of international trial registries (Supplementary Methods); if a ranked article is linked to any of these registries, we display the registry number next to the article. On the back end of the web service, our database is automatically updated with newly registered trials and newly published articles on a weekly basis.
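The display rule just described reduces to a short function (a sketch of the stated logic, not the production code):

```python
def articles_to_display(ranked, top_n=10, threshold=0.8):
    """Show the top 10 ranked articles; if all 10 score above 0.8, extend
    the display to every article scoring >= 0.8.
    `ranked`: list of (pmid, score) tuples sorted by descending score."""
    top = ranked[:top_n]
    if len(top) == top_n and all(score > threshold for _, score in top):
        return [(pmid, score) for pmid, score in ranked if score >= threshold]
    return top
```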
Figure 3.
Screenshot of the Trials to Publications query interface. Visitors may view top-ranked articles for any valid ClinicalTrials.gov NCT number with a predetermined candidate set (Basic Search) or a PubMed.gov-compatible query (Advanced Search). In the Advanced Search, the user is offered a PubMed query that is pre-populated with suggested terms taken from the condition, intervention, and investigator fields of the registered trial, but the query can be freely edited so that, in effect, the user can enter any PubMed query at all and create a candidate set of up to 100 000 PubMed articles. This allows maximal flexibility. However, some guidance will be required, since such queries cannot be precalculated but must be run in real time, and large sets may potentially take hours to process. See Supplementary Methods for details.
Figure 4.
Screenshot of the Results page for the query shown in Figure 3.
DISCUSSION
We present a machine learning-based model and web-based tool that automatically predicts, for any given registered trial in ClinicalTrials.gov, which PubMed articles are the most likely to arise from that trial. The underlying assumption is that such articles, particularly those that describe clinical outcomes from a trial, will tend to share similarities with the trial registry entry in terms of text, MeSH terms, and/or investigator names. In addition, we introduced a new set of features,27,28 in which articles were also compared for similarity with any articles already known to be linked to the trial. The similarity metrics were trained using a corpus of positive and negative examples taken from ClinicalTrials.gov, and evaluated on different test sets of articles and trials.
Our evaluations showed that the model ranked the majority of linked articles in the top 5. Conversely, articles that had extremely high similarity scores with a given registered trial (eg, >0.99) often arose from that trial. What about ancillary publications that arise from a trial, such as questionnaire development studies, secondary analyses of data, comments or letters, which are likely to have lesser similarity with the registered trial? Our training sets, and gold and silver standard evaluations, encompassed all these types of publications, and our results confirmed that overall the majority of articles associated with a given trial do rank in the top 10 (recall of 64% or better, Tables 1 and 2). Direct comparisons with previous published methods suggest that our system exceeds the current state of the art. The web-based tool is free and publicly available at http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/TrialPubLinking/trial_pub_link_start.cgi.
Some limitations include: (1) Queries to the web tool are currently processed in real time. Preselecting 5000 candidate articles reduces the time for presenting ranked results to under 10 minutes at present, though users who formulate their own larger queries via the Advanced option may encounter processing times ranging up to several hours. We are currently in the process of pre-processing all existing trials and their candidate publications, which will allow almost instantaneous display in most cases in the future, and will create a comprehensive dataset which we will disseminate publicly. (2) We have not evaluated the performance of our tool as a function of the type of clinical trial nor as a function of the popularity of the topic. For example, adaptive/platform trials (ie, trials in which a prospectively planned opportunity is included to modify trial designs and hypotheses based on analysis of data from subjects in the study) do not have fixed pre-specified criteria and so may represent a challenge to match against candidate publications. (3) Although our scoring of article–article features is the best among several schemes examined (Supplementary File 1.6), future work is needed to assess whether the current scheme is fully optimal.
CONCLUSION
The ability to find articles that arise from a given registered trial should provide value for physicians, patients, and their families who seek to follow up on individual trials, as well as for evidence synthesis teams seeking to find relevant publications. Since most registered trials give rise to at most a few publications, the top-10 ranking of articles generated by our tool will not only identify articles that arise from a given registered trial, but also those that are relevant to the trial, eg, articles arising from other closely related trials. Since many systematic review (SR) teams search clinical trial registries as part of their workflow,32 our tool should be valuable for identifying publications relevant to those trials, both for traditional SRs and for updates and living evidence syntheses. Our ranked predictions hence may find articles not captured in registers such as the Cochrane Central Register of Controlled Trials (CENTRAL).
The current web-based tool is restricted to matching ClinicalTrials.gov with PubMed articles. However, the underlying model can potentially be applied to other clinical trial registries, to study-based registries,33,34 and to other bibliographic databases (including preprint repositories).
Future refinements to the tool, for example, the Advanced search option, will depend on user feedback. Similarly, if user demand is adequate, we may engineer the same model in the reverse direction, that is, given a PubMed clinical trial article, identify its most similar registered trials. This may assist in retrieval efforts, particularly for older literature; however, it does not obviate the need for publications to mention explicit trial registry numbers in their metadata.
FUNDING
This work was supported by National Library of Medicine grant R01LM010817 and National Institute on Aging grant P01AG039347. The funders did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
AUTHOR CONTRIBUTIONS
NS: conceptualization, methodology, writing—original draft, writing—review and editing, supervision, and funding acquisition. AH: methodology, software, validation, formal analysis, investigation, writing—original draft, writing—review and editing, visualization.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
ACKNOWLEDGEMENTS
The authors thank Adam Dunn and Shifeng Liu for their cooperation and advice in evaluating their previously published model in comparison with ours.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
The code underlying the implementation of the web tool is available in the Dryad Digital Repository, at https://dx.doi.org/10.5061/dryad.sqv9s4n5f.
Contributor Information
Neil R Smalheiser, Department of Psychiatry, University of Illinois College of Medicine, Chicago, Illinois, USA.
Arthur W Holt, Department of Psychiatry, University of Illinois College of Medicine, Chicago, Illinois, USA.
References
- 1. Murad MH, Asi N, Alsawas M, Alahdab F. New evidence pyramid. Evid Based Med 2016; 21 (4): 125–7.
- 2. Hartung DM, Zarin DA, Guise JM, McDonagh M, Paynter R, Helfand M. Reporting discrepancies between the ClinicalTrials.gov results database and peer-reviewed publications. Ann Intern Med 2014; 160 (7): 477–83.
- 3. van Lent M, IntHout J, Out HJ. Differences between information in registries and articles did not influence publication acceptance. J Clin Epidemiol 2015; 68 (9): 1059–67.
- 4. Jones CW, Keil LG, Holland WC, Caughey MC, Platts-Mills TF. Comparison of registered and published outcomes in randomized controlled trials: a systematic review. BMC Med 2015; 13: 282.
- 5. Earley A, Lau J, Uhlig K. Haphazard reporting of deaths in clinical trials: a review of cases of ClinicalTrials.gov records and matched publications—a cross-sectional study. BMJ Open 2013; 3 (1): e001963.
- 6. Riveros C, Dechartres A, Perrodeau E, Haneef R, Boutron I, Ravaud P. Timing and completeness of trial results posted at ClinicalTrials.gov and published in journals. PLoS Med 2013; 10 (12): e1001566.
- 7. Adam GP, Springs S, Trikalinos T, et al. Does information from ClinicalTrials.gov increase transparency and reduce bias? Results from a five-report case series. Syst Rev 2018; 7 (1): 59.
- 8. Wilson LM, Sharma R, Dy SM, Waldfogel JM, Robinson KA. Searching ClinicalTrials.gov did not change the conclusions of a systematic review. J Clin Epidemiol 2017; 90: 127–35.
- 9. Isojarvi J, Wood H, Lefebvre C, Glanville J. Challenges of identifying unpublished data from clinical trials: getting the best out of clinical trials registers and other novel sources. Res Synth Methods 2018; 9 (4): 561–78.
- 10. Jones CW, Keil LG, Weaver MA, Platts-Mills TF. Clinical trials registries are under-utilized in the conduct of systematic reviews: a cross-sectional analysis. Syst Rev 2014; 3: 126.
- 11. Huser V, Cimino JJ. Linking ClinicalTrials.gov and PubMed to track results of interventional human clinical trials. PLoS One 2013; 8 (7): e68409.
- 12. Bashir R, Bourgeois FT, Dunn AG. A systematic review of the processes used to link clinical trial registrations to their published results. Syst Rev 2017; 6 (1): 123.
- 13. Huser V, Cimino JJ. Precision and negative predictive value of links between ClinicalTrials.gov and PubMed. AMIA Annu Symp Proc 2012; 2012: 400–8.
- 14. Goodwin TR, Skinner MA, Harabagiu SM. Automatically linking registered clinical trials to their published results with deep highway networks. AMIA Jt Summits Transl Sci Proc 2018; 2017: 54–63.
- 15. Dunn AG, Coiera E, Bourgeois FT. Unreported links between trial registrations and published articles were identified using document similarity measures in a cross-sectional analysis of ClinicalTrials.gov. J Clin Epidemiol 2018; 95: 94–101.
- 16. Chalmers I, Altman DG. How can medical journals help prevent poor medical research? Some opportunities presented by electronic publishing. Lancet 1999; 353 (9151): 490–3.
- 17. Altman DG, Furberg CD, Grimshaw JM, Shanahan DR. Linked publications from a single trial: a thread of evidence. Trials 2014; 15: 369.
- 18. Manzoli L, Flacco ME, D’Addario M, et al. Non-publication and delayed publication of randomized trials on vaccines: survey. BMJ 2014; 348: g3058.
- 19. Sreekrishnan A, Mampre D, Ormseth C, et al. Publication and dissemination of results in clinical trials of neurology. JAMA Neurol 2018; 75 (7): 890–1.
- 20. Ross JS, Mulvey GK, Hines EM, Nissen SE, Krumholz HM. Trial publication after registration in ClinicalTrials.gov: a cross-sectional analysis. PLoS Med 2009; 6 (9): e1000144.
- 21. Asiimwe IG, Rumona D. Publication proportions for registered breast cancer trials: before and following the introduction of the ClinicalTrials.gov results database. Res Integr Peer Rev 2016; 1: 10.
- 22. Al-Durra M, Nolan RP, Seto E, Cafazzo JA, Eysenbach G. Nonpublication rates and characteristics of registered randomized clinical trials in digital health: cross-sectional analysis. J Med Internet Res 2018; 20 (12): e11924.
- 23. Zwierzyna M, Davies M, Hingorani AD, Hunter J. Clinical trial design and dissemination: comprehensive analysis of clinicaltrials.gov and PubMed data since 2005. BMJ 2018; 361: k2130.
- 24. Schmucker C, Schell LK, Portalupi S, et al. Extent of non-publication in cohorts of studies approved by research ethics committees or included in trial registries. PLoS One 2014; 9 (12): e114023.
- 25. Miron L, Gonçalves RS, Musen MA. Obstacles to the reuse of study metadata in ClinicalTrials.gov. Sci Data 2020; 7 (1): 443.
- 26. Chaturvedi N, Mehrotra B, Kumari S, Gupta S, Subramanya HS, Saberwal G. Some data quality issues at ClinicalTrials.gov. Trials 2019; 20 (1): 378.
- 27. Shao W, Adams CE, Cohen AM, et al. Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial. Methods 2015; 74: 65–70.
- 28. Smalheiser NR, Holt AW. New improved Aggregator: predicting which clinical trial articles derive from the same registered clinical trial. JAMIA Open 2020; 3 (3): 338–41.
- 29. Smalheiser NR, Cohen AM, Bonifield G. Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings. J Biomed Inform 2019; 90: 103096.
- 30. Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data 2009; 3 (3): 1–29.
- 31. Gulbranson D. Python Human Name Parser [Computer software]. 2018. https://github.com/derek73/python-nameparser.git. Accessed September 15, 2021.
- 32. Cochrane MECIR Manual. 2021. https://community.cochrane.org/sites/default/files/uploads/MECIR-February-2021.pdf. Accessed September 15, 2021.
- 33. Shokraneh F, Adams CE. Study-based registers reduce waste in systematic reviewing: discussion and case report. Syst Rev 2019; 8 (1): 129.
- 34. Metzendorf MI, Featherstone RM. Evaluation of the comprehensiveness, accuracy and currency of the Cochrane COVID-19 Study Register for supporting rapid evidence synthesis production. Res Synth Methods 2021; 12 (5): 607–17.