Skip to main content
International Journal of Epidemiology logoLink to International Journal of Epidemiology
. 2018 Jan 12;47(2):369–379. doi: 10.1093/ije/dyx251

MELODI: Mining Enriched Literature Objects to Derive Intermediates

Reviewed by: Benjamin Elsworth 1,, Karen Dawe 1, Emma E Vincent 1, Ryan Langdon 1, Brigid M Lynch 2,3,4, Richard M Martin 1, Caroline Relton 1, Julian P T Higgins 1, Tom R Gaunt 1
PMCID: PMC5913624  PMID: 29342271

Abstract

Background

The scientific literature contains a wealth of information from different fields on potential disease mechanisms. However, identifying and prioritizing mechanisms for further analytical evaluation presents enormous challenges in terms of the quantity and diversity of published research. The application of data mining approaches to the literature offers the potential to identify and prioritize mechanisms for more focused and detailed analysis.

Methods

Here we present MELODI, a literature mining platform that can identify mechanistic pathways between any two biomedical concepts.

Results

Two case studies demonstrate the potential uses of MELODI and how it can generate hypotheses for further investigation. First, an analysis of ETS-related gene ERG and prostate cancer derives the intermediate transcription factor SP1, recently confirmed to be physically interacting with ERG. Second, examining the relationship between a new potential risk factor for pancreatic cancer identifies possible mechanistic insights which can be studied in vitro.

Conclusions

We have demonstrated the possible applications of MELODI, including two case studies. MELODI has been implemented as a Python/Django web application, and is freely available to use at [www.melodi.biocompute.org.uk].

Keywords: Data mining, risk factors, publications


Key Messages

  • The biomedical literature contains information on potential mechanisms linking risk factors and disease.

  • We describe MELODI, a data mining tool to enable potential mechanisms to be derived from the literature.

  • MELODI prioritizes known and unknown mechanisms for further detailed investigation.

Introduction

An understanding of the mechanisms that relate identified risk factors to health outcomes is important in the discovery of potential drug targets and disease biomarkers. Identifying the mechanistic pathway from a given exposure to a given disease allows us to consider which steps along the pathway are potentially modifiable. This offers the potential to identify new biomarkers and potential treatments to reduce subsequent risk of the disease.

Before embarking on novel research, either in vitro or in silico, it is important to examine what evidence is already available, so that the most promising opportunities are pursued and resources are not wasted. However, most scientific subjects produce too many publications to read in detail, and consequently researchers often rely on inefficient and potentially biased methods to prioritize which mechanisms to investigate. Existing approaches to collating relevant published evidence include manual filtering/selection of the literature, ranking by publication metrics such as impact factor/number of citations, examining media/social media reports, and word of mouth. It is clear that a more systematic, automated data mining approach offers enormous potential to assist in identification of existing evidence and therefore prioritization of mechanisms to investigate. To this end, tools which help us search and refine a set of literature are becoming increasingly important.

To derive potential mechanisms on the pathway between two concepts, we can search the literature for overlapping mechanisms between these two concepts. One option to do this is to simply search the two concepts simultaneously with a boolean ‘AND’ operator, using PubMed [https://www.ncbi.nlm.nih.gov/pubmed] or a standard search engine. This approach, however, will only find cases where both concepts are described together in the same place. This will miss cases where the overlapping mechanisms are described in independent places, i.e. in one place in relation to the risk factor and in another in relation to the outcome. To address this, we need to assess each concept separately and look for overlapping elements.

Many text mining tools are available which can be used to extract potential mechanistic terms from the free text of articles, usually titles and abstracts. A search of OMICtools (March 2017) produced 104 tools in the ‘Information Extraction’ category alone [https://omictools.com/information-extraction-category]. These vary widely in their approach to key aspects such as synonyms, computational performance and precision (predictive value). Measures of precision and recall are vital in understanding the effectiveness of a tool; however, they are hard to measure when searching for novel mechanisms. An alternative to extracting information from raw literature text is to use pre-calculated literature annotation objects. These can either be generated by humans, e.g. Medical Subject Headings (MeSH) [https://www.nlm.nih.gov/mesh/] or computationally derived, e.g. Semantic MEDLINE Database (SemMedDB).1

The MeSH thesaurus is a well-established set of terms used to index articles from over 5000 of the world’s leading biomedical journals. These terms are freely provided by the National Library of Medicine and currently consist of over 27 000 descriptors, which are used by the MeSH staff to assign the most appropriate terms to each MEDLINE/PubMed article. Data are arranged in a complex hierarchy and are available to download in many different formats including summary data, hierarchy and frequency.

SemMedDB is a repository of semantic predications extracted by SemRep2 which uses the Unified Medical Language System (UMLS) and a set of defined rules. For every title and abstract, a Subject-PREDICATE-object triple is generated where the subject and object are terms from the UMLS Metathesaurus and the predicate is a relation from the UMLS Semantic Network. For example, the sentence ‘We used haemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalaemia’ produces the following four triples [https://semrep.nlm.nih.gov/]:

  1. Haemofiltration-TREATS-Patient;

  2. Digoxin overdose-PROCESS_OF-Patients;

  3. hyperkalaemia-COMPLICATES-Digoxin overdose;

  4. Haemofiltration-TREATS(INFER)-Digoxin overdose.

Identifying patterns and overlapping elements across two sets of scientific literature could involve either finding single terms common to both articles, e.g. MeSH or SemMedDB terms, or more complex associations, e.g. finding pairs of SemMedDB triples where the object of a triple in one article set is the subject of a triple in the other (Figure 1). This has been attempted by a number of others. One well-established method is Arrowsmith,3 which is based on single overlapping words from article titles only, limited to a maximum of 25 000 articles per article set and has limited options to download, visualize, filter and explore the resulting data (Supplementary Table 2, available as Supplementary data at IJE online). Another application is TeMMPo [https://www.temmpo.org.uk/] which was developed for identifying intermediate biological mechanisms underpinning the effects of lifestyle factors on cancer outcomes, using a predefined set of possible intermediates. This approach, however, is hypothesis driven (because intermediates have to be identified in advance) which limits the capacity to identify novel intermediates. There are also examples of searching SemMedDB data for overlapping elements4,5 and even proof of principle examples of using graph databases.6 However, there is currently no modern software tool to perform this task on custom data sets in an efficient and intuitive way using the enrichment of annotation objects to refine the list of terms.

Figure 1.

Figure 1

An overview of the enrichment analysis. An example analysis of two article sets, one with 70 000 articles and one with 100 000 articles. For each article set, the SemMedDB and MeSH data are retrieved from the graph (open circles) with roughly twice as many objects as articles. An enrichment step identifies those objects that are enriched in that particular data set compared with the background (solid circles and numbers in parentheses). Those objects that are found to be overlapping between the article sets are returned for further analysis.

We present MELODI (Mining Enriched Literature Objects to Derive Intermediates), a web application which provides a user-friendly environment in which to identify overlapping intermediates from two sets of articles representing two distinct biomedical concepts. MELODI includes an enrichment step, whereby the frequencies of terms within a set of articles are compared with the background frequencies in the whole database. This step is particularly important for identification of overlapping elements, due to the abundance of common terms and the error rate associated with extracting terms from free text. Enrichment prioritizes cases where an object has been derived from text on multiple separate occasions, and often these are from independent studies. This reduces the effect of artificial object enrichment caused by frequent mentions of a term/phrase in a single abstract and incorrect object assignment (where an object has been incorrectly extracted/assigned) based on writing style. MELODI also has an upper limit on the number of articles per article set of one million and, via authentication, provides user space which retains user-specific search results and article sets.

Methods: implementation

Application construction

We constructed MELODI using the Django web framework [https://djangoproject.com] with the following additional plugins and features. Authentication is handled via the Django Social Authentication plugin [https://django-social-auth.readthedocs.org/en/latest/], providing a method to make all data, jobs and results both user-specific and retrievable. As some of the database queries were proving too intensive for a responsive user experience, we integrated a task management system. This uses the distributed task queue Celery [http://www.celeryproject.org/] and the in-memory data structure store Redis [http://redis.io/]. Allowing jobs to be handled by a sophisticated task management system removes the need to wait on-screen for analyses to complete and provides the opportunity for large complex queries. Supplementary Figure 1 (available as Supplementary data at IJE online) summarizes the flow of the application.

The graph database

Identifying connections between two sets of articles can involve searching many millions of objects and relationships. Data storage and analysis of this type are suited to graph databases, and with recent advances in this field we decided from the outset to use a graph database [Neo4j, https://neo4j.com/]. Neo4j is a database constructed of nodes, relationships and properties, which structures data on the basis of relationships rather than in a conventional tabular structure. We chose this approach in preference to a relational database due to: the data being relationship rich; the predicted search strategies (i.e. identifying novel relationships between data sets); and the intuitive nature of using a graph to contain and search data of this type. We implemented an additional MySQL database [https://www.mysql.com/] to provide job progress reports, record user-specific job data and improve data processing at the front end.

Data

We preloaded the graph database with publicly available data from MeSH and SemMedDB, and then augmented and modified with frequencies per year for each annotated term and user-provided relationship datum (Table 1). As of March 2017, the graph consists of over 44 million nodes and 200 million relationships, the main components of which are listed in Table 2. We will update the graph as new releases of MeSH and SemMedDB data are released.

Table 1.

Sources of data

Entry Name Version Description
P MeSH countsa 2015 Frequency counts for main MeSH terms
P MeSH structureb 2016 MeSH hierarchy structure
P PubMedc 26 Basic article data up to 30 April 2016
P PubMed-MeSHc 2016 MeSH terms for each PubMed article
P SemMedDBd 26 SemMedDB summary data for each article
U Article set N/A Collection of PubMed articles
U Article set: PubMed N/A Article set to PubMed relationships
C New PubMed Daily PubMed information not already in database
C PubMed-Mesh Daily PubMed to MeSH relationships

Table 2.

Graph details

Data Type Number
PubMed article Node 25698930
MeSH term Node 464122
MeSH tree Node 56326
SemMedDB triple Node 17713740
SemMedDB concept Node 284806
PubMed-MeSH Relationship 80395022
PubMed-SemMedDB triple Relationship 84621296
SemMedDB Triple-SemMedDB concept Relationship 35498324

Preloaded data

Table 1 lists the sources of data that were preloaded into the graph, notably summary data from SemMedDB and MeSH. We first transformed SemMedDB data sets from SQL and then all data were converted to a standard delimited format. We then inserted the data into the graph using the ‘neo4j-import’ command, a very efficient method for inserting large amounts of data into a Neo4j graph. Data needed to be pre-processed to include the same separator, with correct header information and care taken so that all other insertion requirements were met. The application code, scripts that we used to transform the raw data into appropriate files and the neo4j import command are available at [https://github.com/MRCIEU/melodi/]. Once data were inserted, the graph was indexed in a similar way to standard relational databases.

Uploaded data

Our main aim of preloading a large amount of data into the graph was to minimize the time spent processing user-supplied information. Therefore, the only modifications to the graph required when generating a user set of articles are the specification of the article set as a node and specification of the relationships between that node and the publications to which it relates, e.g. the lines connecting the blue ‘Article Set’ node and the green ‘Publication’ nodes in Supplementary Figure 1. We initially attempted to create an empty graph that was organically populated with user information; however, the overheads associated with dynamically creating a graph using large amounts of data and using the less efficient insertion methods were too great.

Using the application

Creating an article set

MELODI is based on identifying mechanisms linking two article sets. An article set is simply a set of articles that represent a defined concept, e.g. ‘body mass index’. There are two ways to create a new article set, each of which is defined by a set of PubMed IDs. The first is to perform a PubMed search within the application. This retrieves the PubMed IDs using the Entrez programming utilities,7 and populates the graph with the new relationships as described above. The second option is to upload a set of PubMed IDs. The benefit of this option is that a set of IDs can be hand curated, improving the focus of the article set, removing the reliance on the methodology of a PubMed search and allowing for greater flexibility in how the set is created. When carefully constructed, producing article sets in this way will generate the most informative results with higher specificity (although this ‘manual curation’ has the potential to introduce additional bias).

Comparing article sets

A simple hypothesis-free approach to comparing the two article sets would be to identify terms which overlap across the two sets and quantify their occurrence using the number of articles mentioning them. This would identify the most common elements present in the article sets, but many of these will be uninformative (e.g. commonly occurring terms). MELODI reduces the problem with a two-step strategy (Figure 2), using enrichment to identify terms that occur in the article sets more often than would occur by chance.

Figure 2.

Figure 2

Exploring the results. An example of the visualizations and filters within the results. Panel A contains the main filtering options for the Semantic Medline data, including P-value, odds ratio and predicate frequency rank (PFR). This final metric allows the user to filter results based on the global frequencies of the SemMedDB predicates, on the assumption that less frequent predicates are more informative. Panel B gives an example of the force-directed graph visualization of the results. Each article set, subject and object is displayed as a node, with the relationships between them collapsed into single arcs. The thickness of these arcs represents the number of publications generating the relationship. Panel C represents the same data in a Sankey diagram. This time the individual predicates are listed in square brackets and the thickness of the bands represents the number of publications, each SemMed subject and object as a type associated with it. Those present in the figures are listed in panel D. This gives an immediate overview of the types of terms present in the results on display. Panel E shows the table displayed below panels A-D on the results page. This table is the data source behind the visualizations, and therefore controls what is displayed. Clicking on any of the coloured terms adds that term to a filter box in panel A, which can then be set to exclude (N) or include (Y) in the results. Panel F shows the detailed publication data from one of these links, allowing the user to correlate the enriched concepts to the actual articles they were derived from.

In the first step, for a given article set the enriched elements are identified. For MeSH terms, this is based on the number of times a MeSH term has been annotated as a main MeSH term in the article set, compared with the frequency of the main MeSH terms across all articles in MEDLINE (calculated from the entire set of MeSH data). For SemMedDB, two alternative analysis methods are available. The first is very similar to the MeSH approach, using the single SemMedDB terms (extracted from the triples) and then identifying enriched terms using their pre-calculated global frequencies. The second has a similar enrichment step but using the entire triplet and, again, frequencies calculated from the global data set as the reference. An extra filtering step is also employed for SemMedDB triples, which restricts the objects of article set A and the subjects of article set B to terms which are present fewer than 150 000 times in the SemMedDB data set. This vastly reduces the number of overlapping terms, removing 88 high-level terms such as ‘Patients’ and ‘Cells’ which would otherwise introduce unnecessary noise (Supplementary Table 1, available as Supplementary data at IJE online). In all three cases, the frequency of the term is compared between the article set and the entire literature set using a two-tailed Fisher’s exact test (FET). P-values are corrected for multiple testing using the Benjamini/Hochberg (non-negative) correction with a cutoff of P < 1e-5. The results (P-value, corrected P-value and odds ratio) are stored on file for later use to avoid the need for repeating this step which can be computationally demanding. The enrichment ensures that when large numbers of articles are involved, multiple instances of a term are required to define a term as enriched, i.e. a single instance of a term is unlikely to be sufficient to be marked as enriched unless the article set in question is relatively small.

Exploring the results

An analysis of two article sets can be performed using the MeSH terms and/or the SemMedDB data. The results of these independent approaches are provided separately for further investigation. As the analysis is hypothesis free, we believe it is important to provide a first-pass filtered set of results on the initial view but also include access to the entire set of results, allowing the user to explore the data in their entirety. We have provided the following filtering parameters: FET corrected P-value, FET odds ratio and the top number of results (all methods), as well as minimum position in the MeSH hierarchy (MeSH method) and frequency of the predicate term (SemMedDB triple method). The inclusion of all predicates in the final data is deliberate, as cases where there are low numbers of enriched overlapping SemMedDB triple objects can still hold valuable information but may use more common predicates such as ASSOCIATED_WITH and AFFECTS. This is unlike other approaches5 in which a universal predicate filter was used. As article set comparisons produce differing numbers of overlapping elements, a basic set of dynamic filtering rules are employed when displaying the results for the first time, adjusting the filters based on the number of results, e.g. low numbers of overlapping enriched elements are treated with a more relaxed set of parameters.

Visualization of results is provided using interactive Sankey diagrams and, for SemMedDB triples, an additional force-directed network diagram (Figure 2). Both are based on the same data displayed in the table on the page, with strength of supporting evidence represented by the thickness of connections. The network-based view is often more intuitive when trying to identify paths between two article sets, as the same SemMedDB object may be present multiple times. This view also highlights the possibility of multiple steps between article sets.

The automatic filtering performed before visualization attempts to highlight the most informative enriched overlapping elements. However, as free text is inherently noisy (from the automatic language processing point of view) and computational predictions are not perfect, an additional text filtering option is provided. This consists of a number of text boxes that can be used either to ‘filter out’ or to ‘restrict to’ keywords within the results, and in the case of SemMedDB analysis this can be done on any of the five elements (subject A—predicate 1—[object A/subject B] —predicate 2—object B) simply by clicking on the corresponding term in the table (Figure 2). The filter option is particularly useful where there are many overlapping terms between two article sets, and the restrict option is useful in cases where a more focused search is required, for example with defined exposure and outcome terms (Supplementary Figure 2, available as Supplementary data at IJE online). Examples of this are discussed later.

The order in which the results are delivered is critical, as this determines the top set that are displayed for closer inspection. This order can be based on a number of factors, e.g. the mean corrected P-value across the two article sets, the minimum position in the MeSH hierarchy or the predicate frequency rank. In addition to these, a custom-designed score is used which aims to identify overlapping elements with a high number of supporting articles in both article sets (Equation 1), where uniq_a and uniq_b are the number of unique articles from each article set that contain the overlapping term. This is based on the assumption that these elements may be the most reliable, due to many and equal articles producing the objects. This assumption does risk ignoring cases where a small number of valid articles support one side of the relationship, and should therefore be used with caution.

Equation 1

score=min(uniqa,uniqb)max(uniqa,uniqb)×(uniqa+uniqb)

Applying MELODI to scientific research

The hypothesis generation methods provided by MELODI can be applied to a wide range of scientific disciplines. As we have implemented MELODI using PubMed and SemMedDB, the current applications have a biomedical focus. In all cases, the overarching aim would be to derive intermediates between two clearly defined scientific concepts. These two scientific concepts can arise from a range of different techniques and approaches, and the intermediate mechanisms identified can be further investigated using a similar range of techniques and approaches (Figure 3). For simplicity, we have grouped some of these approaches into three categories.

  1. Laboratory investigation. There are many ways in which laboratory studies could result in two concepts that would benefit from a methodical search of the literature for mechanisms that may connect them. For example, a cell line study that identifies differential levels of a particular protein in a treated cancer cell line, compared with untreated, could be followed up by a MELODI analysis to identify potential intermediate proteins. These could then be investigated by further laboratory analysis, or through literature review.

  2. Epidemiology. Observational epidemiology focuses on broad ‘exposure’ and ‘outcome’ concepts that encompass a lot of underlying mechanisms (using cohort studies and case-control studies in particular). A relationship such as that between alcohol intake and heart disease could be investigated to identify potential intermediate mechanisms that could be investigated in the laboratory, as biomarkers or risk predictors in epidemiology, or via review of the literature. For causal analysis in epidemiology, Mendelian randomization (MR) uses genetic variants as proxies (instruments) to investigate whether an exposure (such as alcohol intake) has a causal effect on disease outcome.8–11 Causal relationships identified with MR could also be investigated in MELODI (exactly as with conventional observational studies), but a real advantage of MR is in following up potential mechanisms identified by MELODI, to determine whether they are truly on the causal pathway or are simply biomarkers.

  3. Scientific literature. MELODI is based on the scientific literature and can very naturally be used to identify potential mechanisms underlying relationships identified from the literature (particularly those with strong evidence provided by formal systematic review12 and meta-analysis). The results of a MELODI analysis are also amenable to literature review for the initial investigation of a potential mechanism. MELODI provides access to all the articles that support a particular mechanism, enabling comprehensive manual review to determine whether the candidate mechanism is plausible. If so, this may be followed up with a formal systematic review or meta-analysis, or using laboratory or epidemiological approaches if additional evidence is needed.

Figure 3.

Figure 3

Concept origins and post-MELODI investigations. MELODI takes two concepts and derives intermediates. These concepts (solid lines) can originate from a number of places, e.g. an epidemiological observation, or a finding from a laboratory experiment. By creating article sets that represent these two concepts, potential intermediates (dashed lines) can be derived which may provide testable hypotheses. These can then be investigated further using approaches similar to those that created the initial concepts.

In addition to the approaches described above, an iterative MELODI analysis may be performed, in which an intermediate derived from a MELODI analysis can be used as the starting point for a new investigation using MELODI.

Results

Performance

We evaluated the time taken for insertion and indexing of all data. This was completed in under 2 h and produced a graph database around 50 GB in size, using 80 GB RAM. By pre-populating the graph with almost all the necessary data, we greatly reduced the burden of uploading data dynamically. Extracting large amounts of heavily connected data is still, however, a challenge. An article set containing 100 000 articles will take around 1 min to create relationships, 2 min for MeSH enrichment and 5 min for SemMedDB enrichment. When comparing two article sets of that size, the overlap step takes 2–5 min, depending on the number of overlapping elements. Therefore, in total, performing a complete analysis on two large article sets can take around 10 min, depending on server load and queues. However, as the results of each of the enrichment steps are stored to disk, they are only performed once. This means that if the same article set is used for subsequent analysis this step is skipped, reducing computational requirements by around 80%.

Our inclusion of a task management system allows multiple users to work with MELODI simultaneously, as jobs are either added to an available worker or held in a queue until one becomes available. Currently MELODI is running on a virtual machine with four central processing units (CPUs) (and therefore four workers), but can be scaled easily to support demand by adding more CPUs and distributing the graph on a cluster.

Case studies

Two case studies describing how MELODI can be used to generate hypotheses, and how they can be explored, further are described below.

ERG and prostate cancer

ETS-related gene (ERG) is an oncogene that has, in the past decade, become closely linked with prostate cancer.13 Chromosomal rearrangements cause ERG and transmembrane protease, serine 2 (TMPRSS2) to fuse together, forming an oncogenic fusion gene which then disrupts the ability of stem cells to differentiate into prostate cells, leading to unregulated and unorganized tissue. This gene fusion is the most common type found in prostate cancer and can be identified by overexpression of ERG in prostate carcinomas.14

Much work has been done on elucidating the mechanism by which ERG is associated with prostate cancer. We used MELODI, therefore, to assess the literature on ERG and prostate cancer. Article sets for ‘TMPRSS2:ERG or ETS-related gene’ and ‘prostate cancer’ were generated and compared (1 March 2017) using 867 and 138 391 articles, respectively. As expected, many genes previously described were identified, notably phosphatase and tensin homolog (PTEN) and androgen receptor (AR) in the SemMedDB triple results (Table 3, example 1) and ETS Proto-Oncogene 1 (ETS1), ETS variant 1 (ETV1) and ETS variant 4 (ETV4) in the SemMedDB single-term results (Table 3, example 2). In addition, by ordering this latter set of results by those that share the fewest articles, the Sp1 transcription factor gene (SP1) appears, a transcription factor known to bind to many promoters with a wide-ranging set of proposed functions (Figure 4). The SemMedDB data used to identify this intermediate gene was version 26 which included publication data up to 30 April 2016. In August 2016, Sharon et al. published a paper describing the role of SP1 in the regulation of insulin-like growth factor receptor 1 (IGFR1) by the TMPRSS2-ERG fusion gene.15 In this paper they describe a physical interaction between the ERG and SP1 transcription factors, identified by co-immunoprecipitation assays. This work demonstrates the kind of further investigation that could have been performed on the basis of identifying SP1 as a potential intermediate. and confirms the value of MELODI in identifying novel intermediates.

Table 3.

MELODI result uniform resource locators (URLs)

Figure 4.

Figure 4

ERG, SP1 and prostate cancer. The Sankey Plot visualization of the ERG-prostate cancer analysis with SP1 highlighted.

In addition, the seven articles supporting the connection between ERG and SP1 are from 2007 and earlier, suggesting that the previous studies of Ewing’s sarcoma and acute myeloid leukaemia, and not prostate cancer, may have been overlooked. This indicates that the connection between ERG, SP1 and prostate cancer could have been identified many years ago. This can be demonstrated by running the same analysis but using publications up to and including 2005, which still finds SP1 as enriched and overlapping (Table 3, example 3).

Carnitine and pancreatic cancer

The following example illustrates the utility of MELODI in the dissection of causal pathways. Using Mendelian randomization (MR), we have recently found that elevated levels of the amino acid derivative, carnitine, are associated with an increased risk of developing pancreatic cancer (unpublished work at the time of article press). In this situation, MELODI can be used as a starting point to investigate how the exposure and the outcome might be connected. Our aim was to generate mechanistic hypotheses that might explain how carnitine increases the risk of pancreatic cancer, for further investigation using in vitro studies in the laboratory.

Article sets for ‘carnitine’ and ‘pancreatic cancer’ were created with 14 631 and 82 226 articles, respectively, and the intermediates derived (Table 3, example 4). Figure 5 shows the enriched relationships identified between these two sets of data, and highlights possible intermediates between them. Of interest were the intermediates ‘fatty acid oxidation’ and ‘insulin’. Upon investigation of the literature underpinning these connections (this can be done easily in MELODI which links directly to the articles in PubMed), we found that carnitine can increase fatty acid oxidation.16–18 Metabolic reprogramming is a known feature of cancer cells,19 and fatty acid oxidation can be used by cancer cells for energy generation.20 We can therefore generate the following hypothesis for further investigation in the laboratory: ‘‘carnitine increases fatty acid oxidation which provides pancreatic cancer cells with a metabolic advantage’.

Figure 5.

Figure 5

Carnitine and pancreatic cancer. In this example two article sets were compared, one focused around ‘Carnitine’ and the other around ‘Pancreatic Cancer’, the results of which are displayed using a force-directed graph. Each article set (large nodes), subject and object (small nodes) is displayed as a node, with the relationships between them collapsed into single arcs. The thickness of these arcs represents the number of publications generating the relationship.

Insulin is also a highlighted intermediate linking carnitine and pancreatic cancer. Upon investigation of the literature underpinning the connections, we find that insulin treatment of skeletal muscle increases the expression of the carnitine transporter protein OCTN2. Investigation of the literature highlighted by MELODI informs us that increased insulin secretion is associated with some forms of pancreatic cancer.21 In addition, we have also found using MR that elevated fasting levels of insulin are causally associated with pancreatic cancer (unpublished data at time of article press). Therefore, with information from MELODI and our own MR investigations, we can generate the following hypothesis for further investigation in the laboratory: ‘elevated levels of insulin cause pancreatic cancer in part through increased expression of the carnitine transporter in pancreatic cancer cells’. This demonstrates the power of MELODI as a hypothesis generator to investigate the mechanisms underlying causal relationships.

Limitations

When identifying single overlapping terms, the structure of the text is not so important; however, the SemMedDB triple data are dependent on the structure. For example, an article set might contain the triples geneA-ASSOCIATED_WITH-geneB and geneB-ASSOCIATED_WITH-geneA. The underlying directionality provided by SemMedDB will lead to these two triples being treated as separate entities. There are cases where predicates imply direction, e.g. INHIBITS and STIMULATES require this restriction, but other predicates such as PART_OF and ASSOCIATED_WITH might not. This highlights how the structure of titles and abstracts are key to the extraction of SemMedDB triples and ultimately the identification of overlapping terms. In addition, the assumed directionality from article set A to article set B, and the method for the identification of overlapping SemMedDB triples, will likely miss many intermediates. However, the potential gain of a small number of true-positives is not worth the increase in false-negatives that would occur if this restriction was removed. This dependence on a rigid rule-based approach to data extraction could be alleviated with the inclusion of more sophisticated methods, such as machine learning or natural language processing. Their absence from MELODI is simply due to a lack of methods and resources available at the time with which one could process a large data set such as MEDLINE and extract complex data structures such as object-predicate-subject. SemRep and SemMedDB are well-developed tools that are under active development and provide frequent data releases. In the future, additional types of data extracted from articles can be simply added as extra nodes and relationships to the graph and as additional analysis methods.

An important limitation of any literature-based tool is that the published literature may be a biased subset, or a biased over-representation, of research that has been undertaken. A large proportion of negative findings are never published, and groups often publish many related papers with similar ideas discussed in the abstract. In addition, the algorithms used to produce the SemMedDB data and the humans used to assign the MeSH terms may introduce unconscious bias. Using more flexible agnostic methods such as those mentioned above would enable the use of other publicly available data sets, alleviating some of the bias associated with published literature. Even so, MELODI is always going to give a biased representation of what is really known about a topic. However, the alternative to a computational approach is manual curation, which is impossible at this scale and potentially prone to much greater bias. As long as the caveats and limitations are understood, then the output of this kind of approach can still be valuable and provide reliable hypotheses.

Discussion

MELODI is a hypothesis-free application that derives mechanisms, both known and novel, from the published literature. It uses a graph database to find enriched relationships between two sets of articles from the entire collection of MEDLINE articles, using both the manually curated MeSH terms and computationally derived SemMedDB terms. We have demonstrated its ability to derive both known and unknown intermediates across large complex data sets. Our examples have shown how it can derive novel intermediates for new studies and generate mechanistic targets underlying observed epidemiological associations.

An additional advantage of this kind of approach is the inclusion of data from any MEDLINE article, regardless of impact factor or citation number, which has obvious benefits for the low-impact paper. Often in the course of a scientific investigation, a decision is made to publish when further work could still be carried out, but time or financial constraints are overriding. This has the benefit of releasing information and data early even if the findings and hypotheses have not been validated. The authors of such papers may feel that more could have been done and the resulting work may lack impact; however, tools like MELODI can still use the output from these papers and, combined with similar findings, can be used to encourage others to continue this avenue of research, rather than the article being lost in the deluge of papers published every day.

MELODI is already capable of deriving reliable intermediate hypotheses. There is, however, potential for future development. The graph database is suited very well to the inclusion of complementary data sets, and the addition of these is planned. They include drug targets, metabolic pathways and protein-protein interactions. These extra data would permit reliable multi-step connections between article sets using known data connections. The database already contains information on the SemMedDB term types, which provides the necessary connections to the extra data and further filtering options, e.g. to show only intermediates that are genes, druggable, proteins, etc.

Filtering of the overlapping elements between article sets could also be improved with more computational methods. Machine learning techniques could be applied to train the filtering steps to improve the accuracy and usefulness of intermediates, by learning from user filtering and overlap with well-established data sets.

Supplementary Data

Supplementary data are available at IJE online.

Funding

This work was supported by a Cancer Research UK Programme Grant, the Integrative Cancer Epidemiology Programme (C18281/A19169) and the UK Medical Research Council (MRC Integrative Epidemiology Unit, MC UU 12013/8). B.M.L. was supported by a UICC Yamagiwa-Yoshida Memorial International Cancer Study Grant (YY1/16/422204) and a Fellowship from the National Breast Cancer Foundation (ECF-15–012).

Conflict of interest: None declared.

Supplementary Material

Supplementary Data

References

  • 1. Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics 2012;28:3158–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003;36:462–77. [DOI] [PubMed] [Google Scholar]
  • 3. Smalheiser NR, Torvik VI, Zhou W. Arrowsmith two-node search interface: a tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Methods Programs Biomed 2009;94:190–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hristovski D, Dinevski D, Kastrin A, Rindflesch T. Biomedical question answering using semantic relations. BMC Bioinformatics 2015;16:6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Cameron D, Kavuluru R, Rindflesch TC, Sheth AP, Thirunarayan K, Bodenreider O. Context-driven automatic subgraph creation for literature-based discovery. J Biomed Inform 2015;54:141–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Hristovski D, Kastrin A, Dinevski D, Rindflesch TC. Constructing a graph database for semantic literature-based discovery. Stud Health Technol Inform 2015;216:1094. [PubMed] [Google Scholar]
  • 7. Sayers E. Entrez Programming Utilities Help. URL http://www.ncbi.nlm.nih.gov/books/NBK25499. 2009. (17 November 2017, date last accessed). [Google Scholar]
  • 8. Davey Smith G, Ebrahim S. Epidemiology: is it time to call it a day? Int J Epidemiol 2001;30:1–11. [DOI] [PubMed] [Google Scholar]
  • 9. Timpson NJ, Wade KH, Davey Smith G. Mendelian randomization: application to cardiovascular disease. Curr Hypertens Rep 2012;14:29–37. [DOI] [PubMed] [Google Scholar]
  • 10. Davey Smith G, Ebrahim S. Data dredging, bias, or confounding. BMJ 2002;325:1437–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol 2003;32:1–22. [DOI] [PubMed] [Google Scholar]
  • 12. Egger M, Davey Smith G, Altman D. Systematic Reviews in Health Care: Meta-analysis in Context. Chichester, UK: Wiley, 2008. [Google Scholar]
  • 13. Adamo P, Ladomery MR. The oncogene ERG: a key factor in prostate cancer. Oncogene 2016;35:403–14. [DOI] [PubMed] [Google Scholar]
  • 14. Tomlins SA, Rhodes DR, Perner S, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005;310:644–48. [DOI] [PubMed] [Google Scholar]
  • 15. Sharon SM, Pozniak Y, Geiger T, Werner H. TMPRSS2-ERG fusion protein regulates insulin-like growth factor-1 receptor (IGF1R) gene expression in prostate cancer: involvement of transcription factor Sp1. Oncotarget 2016;7:51375–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. di San Filippo CA, Wang Y, Longo N. Functional domains in the carnitine transporter OCTN2, defective in primary carnitine deficiency. J Biol Chem 2003;278:47776–84. [DOI] [PubMed] [Google Scholar]
  • 17. Lahjouji K, Mitchell GA, Qureshi IA. Carnitine transport by organic cation transporters and systemic carnitine deficiency. Mol Genet Metab 2001;73:287–97. [DOI] [PubMed] [Google Scholar]
  • 18. Chapoy PR, Angelini C, Brown WJ, Stiff JE, Shug AL, Cederbaum SD. Systemic carnitine deficiency—a treatable inherited lipid-storage disease presenting as Reye’s syndrome. N Engl J Med 1980;303:1389–94. [DOI] [PubMed] [Google Scholar]
  • 19. Pavlova NN, Thompson CB. The emerging hallmarks of cancer metabolism. Cell Metab 2016;23:27–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Rodríguez-Enríquez S, Hernández-Esquivel L, Marín-Hernández A, et al. Mitochondrial free fatty acid β-oxidation supports oxidative phosphorylation and proliferation in cancer cells. Int J Biochem Cell Biol 2015;65:209–21. [DOI] [PubMed] [Google Scholar]
  • 21. Grygiel K, Szmidt J, Jeleńska M, Pawlak K. Surgical treatment of hyperinsulinism during the course of pancreatic cancer (insulinoma) - one center experience. Pol J Surg 2012;84:31–36. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from International Journal of Epidemiology are provided here courtesy of Oxford University Press

RESOURCES