Author manuscript; available in PMC: 2023 Jul 20.
Published in final edited form as: Nat Protoc. 2021 Jan 22;16(3):1376–1418. doi: 10.1038/s41596-020-00455-4

Cognitive Analysis of Metabolomics Data for Systems Biology

Erica L-W Majumder 1,+, Elizabeth M Billings 1,+, H Paul Benton 1, Richard L Martin 2, Amelia Palermo 1, Carlos Guijas 1, Markus M Rinschen 1, Xavier Domingo-Almenara 1, J Rafael Montenegro-Burke 1, Bradley A Tagtow 2, Robert S Plumb 3, Gary Siuzdak 1,*
PMCID: PMC10357461  NIHMSID: NIHMS1901300  PMID: 33483720

Abstract

Cognitive computing is revolutionizing the way big data is processed and integrated, with Artificial Intelligence (AI) Natural Language Processing (NLP) platforms helping researchers to efficiently search and digest the vast scientific literature. NLP provides literature-based contextualization of metabolic features efficiently thereby decreasing the time and expert-level subject knowledge required during the prioritization, identification and interpretation steps in the metabolomics data analysis pipeline. Here we describe and demonstrate four workflows that combine metabolomics data with NLP-based literature searches of scientific databases, to aid in the analysis of metabolomics data and their biological interpretation. The four protocols may be used individually or consecutively depending on the research questions. The first generates improved metabolite annotation and prioritization. The second workflow finds literature evidence of the activity of metabolites and metabolic pathways in governing the biological condition on a systems biology level. The third finds candidate biomarkers and the fourth looks for metabolic conditions or drug-repurposing targets between two diseases.

Introduction:

Metabolomics has predominantly been used as a biomarker discovery tool 1-4; however, it also has the unique ability to identify the broad metabolic changes that occur within an organism 5, characterize the pathways that regulate those changes, and identify active metabolites that modulate phenotype 6. Since untargeted metabolomic profiling typically compares two or more conditions, such as healthy to diseased samples, dysregulated metabolites are tightly associated with the biological process studied. Knowledge of the function and activities of the metabolites would reveal the underlying molecular mechanisms governing the biological condition of interest, from microbial response to environmental contamination to plant drought resistance to cancer metastasis 7. There are a number of challenges for the metabolomics field to move beyond biomarker discovery and make use of its broader potential. One primary bottleneck is the amount of time and manual effort spent analyzing data, especially in 1) prioritizing which of several thousand detected features to characterize and formally identify and 2) determining the biological importance of the hundreds of statistically significant metabolites 8. Improvements in data processing and analysis by bioinformatic pipelines, inclusive of machine learning for feature annotation accompanied by comprehensive metabolite databases 9, have helped facilitate semi-automation of the first challenge 10-14. Categorizing and determining the function of the identified metabolites with respect to biological importance is the time-consuming aspect of metabolomics and requires expert-level subject knowledge. Despite the current bioinformatic improvements in selection of metabolites for further investigation and identification pipelines, these methods could be enhanced with other non-mass spectrometry-generated data sources.
One approach that has not been explored to accomplish more effective metabolite prioritization is contextualizing the mass spectrometry-derived data with the vast biological knowledge found in scientific literature.

As scientific literature expands continuously, it is not possible for a researcher to be aware of all relevant papers 15-17. In addition, crucial information could be available in the manuscript that may not be found through conventional literature searching, i.e. keyword searching, which is limited to titles, abstracts and exact matches with search terms 18. In order to address this challenge and add autonomous literature contextualization to metabolomics, we have coupled our XCMS metabolomics data analysis platform with Natural Language Processing (NLP), an artificial intelligence (AI) subfield used to computationally interpret unstructured language, i.e., natural language used to communicate among humans. NLP is used to translate the information within a human-written manuscript into structured data based on ontologies and reference databases that can be easily interpreted by a computer (Figure 1). In that sense, this technology can highlight the sections of the paper that are relevant to the scientist by going beyond the simple occurrence of keywords, pulling out interactions between keywords and understanding their synonymy 19. Further, by searching vast databases of literature quickly, NLP systems can find, annotate and make connections between entities in relevant papers in a holistic manner, including those from disparate fields outside the researcher's primary focus. To that end, computational linguistics and explainable AI have the potential to completely change how, and how quickly, scientists can access, read and interpret scientific literature.

Figure 1: Framework of computational literature searching tool components for cognitive and conventional searches.

This diagram displays examples of the main elements of any computer-based literature searching tool: Data Input, Search Algorithms, Processing Algorithms and the Result Output. To begin, the user inputs keywords, phrases or chemical information to the existing body of knowledge housed, digested or analyzed by the search engine. The scientific literature is searched or understood by the machine either through exact match of keywords or text in a document title or abstract, or by more advanced algorithms that recognize nouns as entities and decipher the relationship between entities. Then, the search tool may perform additional processing to rank, categorize, sort or make predictions about the user input data. The user then receives the results, which can be further filtered and interrogated depending on research question. Of the many literature searching tools that exist, all use a different combination of inputs, algorithms and outputs. As such, no one tool is perfect for every application or research question.

In this effort, we developed four metabolomic workflows using NLP to improve output annotation and decrease data analysis and interpretation time via literature contextualization of metabolomics results (Figure 2) within a biological system of interest 20-22. In the first workflow, AI-NLP tools are used to accelerate the assignment of putative identifications to metabolic features and, based on literature evidence of the relationship of each metabolite to the biological condition, to prioritize metabolites for full characterization. Entity information and chemical similarity calculations are used to narrow down possible metabolite matches during feature annotation and identification. Then, predictive analytics is employed to prioritize metabolic features from an untargeted profiling experiment for further characterization by showing metabolites with known or unknown relationships to the biological condition of the study. The second workflow seeks to find mechanistic connections between dysregulated metabolites and the biological condition under investigation by mapping the dysregulated metabolites onto pathways and searching the literature on the systems biology level. In this way, the biological function of the top dysregulated pathways from metabolomics results is deduced via co-occurrence searching of the genes, proteins and metabolites against the biological condition. The third workflow uses AI-NLP predictive analytics to identify biomarker candidates from diseases or other biological conditions of interest. This is based on the candidate metabolite’s strong literature association with one disease and lack of association with other diseases. We used the workflow to validate previously identified biomarker metabolites in genetically caused metabolic disorders and found new candidate biomarker metabolites.
The fourth workflow employs similar logic as the third, but instead of looking for metabolites that are different between conditions, we examine metabolites that are shared between conditions to find metabolic pathways that mechanistically link observed co-presenting biological conditions. This workflow also identifies drugs that have known activity in one disease that may be re-purposed for another. These workflows demonstrate the utility and future potential of AI-based NLP to optimize the interpretation of metabolomic experiments in their biological context.

Figure 2: Overview of workflows incorporating AI-based Natural Language Processing into metabolomics data analysis and interpretation pipelines.

Evidence from scientific literature is standardly used in metabolomics workflows to facilitate decision making about methods, but also to decipher the identity and function of key metabolite actors in the biological system of interest. A key challenge in metabolomics is the large number of molecules detected and the ability to analyze and interpret the results quickly. We have incorporated AI-based literature searching into our metabolomics workflows to analyze results more expediently, but also more effectively. In the first workflow for untargeted metabolomics, we use AI-literature searching to narrow down putative identifications of key metabolites of interest and to prioritize which metabolites to study based on known condition association from literature evidence. Secondly, we move to a systems biology level and use metabolomics-identified biological pathways to determine the activity of the metabolites in producing the observed phenotype. In our third and fourth approaches, we examine the relationship of diseases on the metabolome level, as metabolites are known actors in disease causation and can be drugs or drug targets themselves. By predicting from literature evidence which metabolites are known to be associated with particular diseases, we compare the disease metabolomes. Metabolites that are unique to one condition are potential biomarkers. Metabolites that are shared can point to pathways that overlap in both conditions and are therefore applicable in drug re-purposing or discovery applications. Overall, the cognitive analysis of metabolomics data has yielded actionable and testable hypotheses about the activity of metabolites in phenotype causation.

Development of the protocol:

These workflows emerged from several previous workflows we have developed, first in the protocols for conducting targeted and untargeted metabolomics experiments, followed by tools and protocols for data processing, feature annotation and systems biology analysis. We then extended these protocols with new steps to analyze the biological function of the metabolites with AI-based literature searching tools.

Metabolite Extraction and Metabolomics Data Acquisition

Metabolite extraction, Liquid Chromatography separation and detection by Mass Spectrometry are as reported for each experiment. Numerous papers outline approaches and protocols for metabolite extraction and metabolomics data acquisition, which vary by biological source organism or tissue type and by the chromatography and mass spectrometer available 23-29.

Data Processing, Annotation and Pathway Mapping

Data Processing steps are as reported for each experiment. Raw mass spectra were generally processed using XCMS Online; any variations from default parameters are discussed in the published studies. XCMS generates a results table of detected metabolic features and, in the Systems Biology portal, a list of dysregulated pathways 24,27-31.

Metabolic features were annotated or given putative identifications with several different published algorithms and metabolomics annotation tools depending on the workflow. METLIN was used as the metabolite database 9. MISA was used to gain putative identifications of metabolites from in-source fragmentation spectra 32. MUMMICHOG is the algorithm that operates the Systems Biology Portal of XCMS Online 30. Metabolites are annotated using the pathways predicted from the genome of the organism.

Cognitive Metabolomics Data Analysis

Identification of Cognitive Literature Searching Tools and Development of Watson for Drug Discovery

Scientific literature searching tools that use machine learning or artificial intelligence were found by performing traditional literature searches, reading highlights, omics user forums or new technology features. Additionally, some of the capabilities of IBM Watson for Drug Discovery were developed or newly adapted for the creation of these workflows. IBM Watson® for Drug Discovery (WDD) is a cognitive computing platform for accelerating early stage life sciences research. WDD uses natural language processing and machine learning to extract key concepts from the scientific literature, understand the semantic context, and predict and identify potential connections between entities not explicitly described in the text.

WDD is trained to understand and recognize entity types within three core groups: genes (including proteins and mutant genes), conditions (diseases, signs and symptoms) and compounds (drugs and chemicals). Each entity in WDD is one of these types; for example, aspirin is a drug entity, whose various synonymous representations include acetylsalicylic acid. WDD is loaded with foundational knowledge composed of curated data sources which enumerate entities and their various synonymous representations, structures and identifiers, known connections between these entities, and classification of entities within various ontologies. This a priori ingestion of structured data is utilized to inform and support WDD’s reading of the unstructured natural language data which includes over 30 million documents. This structured metadata for each entity informs how WDD normalizes an individual mention of an entity type in text to the most appropriate corresponding entity. For instance, the ambiguous gene mention Drp1 could refer to multiple possible genes. WDD normalizes Drp1 to exactly one entity based on WDD’s contextual understanding of those genes across the entire corpus in the context of the unstructured text and structured metadata of the specific document in question. Following these named entity recognition and resolution phases of natural language processing, each document in the corpus is analyzed semantically to identify relationships between entities described explicitly in the text, and to construct a semantic fingerprint reflecting how each entity is described and the context in which it is discussed across the literature.

The semantic fingerprint for each entity is the foundation for one of WDD’s predictive models, which has been leveraged in various recent publications 20,22,33, including a prior work ranking metabolites with respect to their predicted endocrine-disrupting potential 34. The principle of this predictive method is that entities which occur in similar contexts and with similar descriptions in the literature are likely to have similar properties. This constitutes the basis for a dynamic predictive ranking model, which uses a training set of entities to rank a candidate set of entities.

This predictive method has three steps: 1) identification of a semantic fingerprint for each entity; 2) a pairwise comparison of all inputs from either the training or candidate set, to construct a symmetric matrix of semantic fingerprint similarities; and 3) application of a graph diffusion algorithm to distribute ‘heat’ from the training set, to the candidate set. The semantic fingerprint for each entity is a vector containing words and phrases used in the top 1,000 documents that mention that entity. Members of the candidate set are ranked by a heat score which approximates the overall similarity of each individual candidate to the entire training set.
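The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not WDD's implementation: the fingerprint is a plain bag-of-words count, similarity is cosine similarity, and 'heat' is spread with a random-walk-with-restart style iteration chosen here for simplicity. The function names, the `alpha` and `steps` parameters, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def fingerprint(docs):
    """Step 1: bag-of-words vector built from the documents mentioning an entity.

    The text describes using the top 1,000 documents; here we use whatever is passed in.
    """
    return Counter(word for doc in docs for word in doc.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(docs_by_entity, training, candidates, alpha=0.5, steps=20):
    """Rank candidates by the heat they receive from the training set."""
    names = list(training) + list(candidates)
    fps = {n: fingerprint(docs_by_entity[n]) for n in names}
    # Step 2: symmetric similarity matrix over training and candidate entities
    # (diagonal set to zero so an entity does not heat itself).
    sim = {a: {b: 0.0 if a == b else cosine(fps[a], fps[b]) for b in names} for a in names}
    out = {n: sum(sim[n].values()) or 1.0 for n in names}  # guard against division by zero
    # Step 3 (assumed diffusion scheme): training entities start with heat 1.
    start = {n: 1.0 if n in training else 0.0 for n in names}
    heat = dict(start)
    for _ in range(steps):
        heat = {
            n: (1 - alpha) * start[n]
            + alpha * sum(sim[m][n] / out[m] * heat[m] for m in names)
            for n in names
        }
    return sorted(candidates, key=lambda n: heat[n], reverse=True)


# Toy, made-up documents purely for illustration.
docs = {
    "estradiol": ["estrogen receptor binding hormone"],
    "compound_x": ["estrogen receptor binding plastic"],
    "compound_y": ["sugar energy glycolysis"],
}
ranking = rank_candidates(docs, ["estradiol"], ["compound_x", "compound_y"])
```

In this toy case the candidate whose documents share vocabulary with the training entity receives more heat and is ranked first, which mirrors the principle that entities described in similar contexts are likely to have similar properties.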

WDD is pre-trained in these entity types such that it can recognize and normalize the various ways in which they are expressed in literature. WDD can also support more traditional keyword-based search and analytics, in conjunction with the synonym-aware support for entities. In a keyword search, only a match to the keyword is retrieved, since no synonyms for the keyword are known; this is the essential contrast to an entity-based search in WDD which retrieves matches to any of its synonyms. Keywords can be used as synonyms to supplement constructed entities or can be grouped to create user-defined entities. An entity is comprised of various synonyms, so by combining various keywords together to form a group, a group of keywords can behave similarly to an entity. This principle is used to expand WDD’s semantic search function beyond the recognized entity types.
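The contrast between a keyword search and an entity-based search can be illustrated with a short sketch. It assumes simple case-insensitive substring matching, which is far cruder than WDD's entity recognition; the function names and the three example sentences are hypothetical.

```python
def keyword_search(docs, keyword):
    """Return indices of documents containing the exact keyword (case-insensitive)."""
    return [i for i, doc in enumerate(docs) if keyword.lower() in doc.lower()]


def entity_search(docs, synonyms):
    """Return indices of documents containing any synonym of the entity."""
    lowered = [s.lower() for s in synonyms]
    return [i for i, doc in enumerate(docs) if any(s in doc.lower() for s in lowered)]


docs = [
    "Aspirin reduces fever.",
    "Acetylsalicylic acid inhibits cyclooxygenase.",
    "Glucose fuels glycolysis.",
]
keyword_hits = keyword_search(docs, "aspirin")
# A user-defined entity is just a group of keywords treated as synonyms.
entity_hits = entity_search(docs, ["aspirin", "acetylsalicylic acid"])
```

The keyword search retrieves only the first document, whereas the synonym group also retrieves the second, which is the behavior the paragraph above attributes to entities and to grouped keywords.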

WDD’s understanding of compounds includes storing and indexing each compound on the basis of its chemical structure, which is either retrieved from foundational knowledge or calculated algorithmically. This structure-based understanding of compounds allows for chemical informatics analysis within WDD. WDD has identified ‘formulation groups’ which are comprised of the desalted and standardized (e.g. charges removed) structural representation of a compound. This can be leveraged to maximize matches within the literature for a chemical or metabolite.

Development of workflows for cognitive metabolomics analysis:

With the newly developed cognitive literature searching tools in hand, we began to establish methods for their use with metabolomics data. In 2017, for an exposome application, we used AI-based literature searching tools to predict potential endocrine disruptors, specifically molecules that would affect an estrogen receptor, from sets of dysregulated metabolites from untargeted metabolomics studies 34. This initial work based on chemical similarity implying similar biological activity served as the foundation for workflows 1, 3 and 4, which seek to identify metabolites, biomarkers and metabolic connections underlying co-presenting diseases. The further development of all four workflows came with our recent study investigating the biological function of metabolites found to be associated with temperature regulation during calorie restriction, which promotes the increased lifespan of mice on this diet. The workflows were used to look for known metabolites associated with different aspects of the hypothermic response to calorie restriction, prioritize metabolites detected in our metabolomics results, map metabolites onto biological pathways and use literature evidence to predict the functional role of those metabolites and pathways in the promotion of temperature regulation. From our metabolomics-AI-NLP searching we identified the citrulline-nitric oxide pathway as one likely key pathway, and tests in mice then confirmed the biological function of these metabolites (Guijas et al. Submitted). We also employed our cognitive metabolomics analysis strategies to interpret the biological findings of a multi-omic study on the changes to kidney metabolism under hypertensive conditions 35.
Hypertension and Chronic Kidney Disease each have tens of thousands of papers in scientific literature databases, and the use of AI-NLP tools enabled us to quickly find literature evidence about different components of the changes we observed and stitch together the mechanism of metabolic rewiring in the damaged kidneys. Workflow 2 and the ability to use metabolomics and proteomics data together on the pathway level were especially crucial to this multi-omic study linking together AMPK and mTOR signaling with lipid degradation and decreased branched chain amino acids 35. As our science demanded, the workflows were improved and validated as described below. Additional use cases for each protocol are found in the anticipated results section of this manuscript.

Workflow Validation

Each of the four workflows was validated independently as follows.

Validation of Workflow 1

Metabolite Identification:

To demonstrate this cognitive assistance in annotation, we examined a significant feature in our metabolomics mass spectrometry data from a previously published experiment 30. The bacterium from the study, Desulfovibrio vulgaris Hildenborough, is a soil bacterium that was recently shown to also be a commensal member of the human gut microbiome and associated with colorectal cancer 36-38. The peak of interest had not been formally identified in our previous work. The peak has an m/z= 148.0427 in electrospray negative ionization mode. The m/z value was searched in the METLIN database with M-H, M-H2O-H, M+Na-2H and M+Cl adducts 9,27 and yielded 481 potential hits (Figure 3). The matches were further filtered by removing peptides and limiting the mass error to the highest resolution of the instrument used, 10 ppm, which reduced the potential matches to 53 molecules. Removing toxicants, limiting the mass error to 10 ppm and keeping only the M-H adduct reduced the hits to three possible matches. The three compounds were then searched in WDD for literature context. Based on the evidence and disease co-occurrence with colorectal cancer, which was not found for the other two possible annotations, methionine was the likely identification for the feature. This putative identification was validated and formally identified using tandem mass spectrometry (MS/MS). In processing this example, we noted that it was important to find the best level of taxonomy of the bacterium of interest to capture relevant information. Using co-occurrence, we looked at all levels of taxonomy for Desulfovibrio against metabolites, environments and diseases of interest. For this bacterium, the genus level returned the most useful literature connections (Supplemental Figure 1).
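The mass-based filtering step described above can be sketched as follows. This is an illustration only, not the METLIN search itself: the adduct mass shifts are approximate monoisotopic values, the two-entry candidate dictionary is a hypothetical stand-in for a database lookup, and the function names are assumptions.

```python
# Approximate monoisotopic mass shifts (Da) for singly charged negative-mode adducts,
# defined so that: theoretical m/z = neutral monoisotopic mass + shift.
ADDUCT_SHIFTS = {
    "M-H": -1.007276,
    "M-H2O-H": -19.017841,
    "M+Na-2H": 20.974668,
    "M+Cl": 34.969401,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_candidates(observed_mz, candidates, adducts, tol_ppm=10.0):
    """Return (name, adduct, ppm error) for candidates within the ppm tolerance."""
    hits = []
    for name, neutral_mass in candidates.items():
        for adduct in adducts:
            theoretical = neutral_mass + ADDUCT_SHIFTS[adduct]
            err = ppm_error(observed_mz, theoretical)
            if abs(err) <= tol_ppm:
                hits.append((name, adduct, round(err, 1)))
    return hits


# Hypothetical stand-in for a database query; values are monoisotopic neutral masses.
candidates = {"methionine": 149.05105, "glutamic acid": 147.05316}
hits = match_candidates(148.0427, candidates, ["M-H"], tol_ppm=10.0)
```

With the observed m/z of 148.0427 and only the M-H adduct allowed, methionine falls inside the 10 ppm window while the second illustrative compound does not. Widening the `adducts` list, as in the initial METLIN query, would admit more candidates before the literature-based filtering is applied.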

Figure 3: Literature-assisted metabolite identification.

Representation of filtering of possible metabolite identifications with mass spectrometry data and then literature evidence for association with the putative metabolite with certain diseases, genes or drugs. Data from mass spectrometry for the feature of interest is evaluated and searched in a metabolite database such as METLIN. Putative identifications based on the spectrum and m/z value are filtered based on the resolution of the instrument and level of information available in the database. Then, the putative identifications are searched for relationship to the condition of interest in an AI-based scientific literature searching tool. Based on literature evidence, the number of putative identifications is reduced. Finally, formal MS/MS identification procedures with standards are conducted. The AI-search tool decreased the number of standards to purchase and test. The example presented here is results from a feature in a bacterium belonging to the genus Desulfovibrio that was determined to be methionine.

Metabolite Prioritization:

To validate this approach, we used two publicly available untargeted metabolomics data sets for Non-Alcoholic Fatty Liver Disease (NAFLD), where the studies had determined the disease relevance of certain metabolites with biochemical experiments. The first study analyzed the serum of human patients with and without NAFLD (MetaboLights MTBLS298). Combined with global transcriptomic data of samples from liver biopsies, the authors measured the metabolic activity of the liver. They found that the excess fat stimulates a change in the metabolic pathways the liver uses to acquire energy, which may explain the contribution of NAFLD to type 2 diabetes 39. The second data set was obtained from the HMDB and comprises the only metabolites this database has associated with NAFLD. In this study, the authors were studying the contribution of the gut microbiome to NAFLD and used human fecal samples to profile the volatile organic compound metabolome. They also found changes in liver metabolism with NAFLD and many significantly dysregulated metabolites 40. Both data sets, containing several hundred metabolites each, were ranked by similarity score against the same set of training entities or keywords describing NAFLD that were pulled from the abstracts of both papers (Supplemental Tables 1-3, Supplemental Figures 3-5).

HMDB data set available at: http://www.hmdb.ca

First, we performed a retrospective analysis where the papers in WDD were limited to exclude the publication of the study. We saw that the ranking of key metabolites in the study decreased when the knowledge from the papers was excluded (for example, phosphoenolpyruvate went from 35 to 72; Supplemental Tables 1&2). The decrease in ranking confirms that WDD is accurately determining the relationship and biological context of these metabolites and improves when new papers or new knowledge is added to the corpus. Secondly, we performed a validation of the keyword training set. WDD randomly selects one third of the keywords to keep and re-ranks the metabolites. In both datasets, the rankings were not significantly changed and both the Fisher’s Exact test p-value (study 1: 0.00233, study 2: 0.000537) and the Wilcoxon Rank Sum p-value (study 1: 0.000427, study 2: 0.00359) were highly significant. This indicates that the keywords in the training set were similar enough to one another and that the rankings based on them have a high probability of being accurate. Finally, we examined the rankings manually for biological sense and found that even though the untargeted metabolomics studies of human NAFLD were from different bodily fluids, metabolites from the same chemical classes or biological pathways in the body were ranked similarly and painted the same story of alteration to sugar and fat metabolism in the liver. These validation sets add confidence to our interpretation of untargeted metabolomics annotation and prioritization using AI-based literature searching tools.

Validation of Workflow 2

Validation study based on Phenylketonuria

To first validate this approach of finding meaningful activity predictions of metabolites and pathways in disease function, we performed this workflow on a disease with a known causal pathway. Phenylketonuria (PKU) is a heritable condition resulting from the deficiency of the phenylalanine hydroxylase (PAH) enzyme 41,42. PAH is the first enzyme on the phenylalanine and tyrosine metabolic pathway. PKU causes mental retardation and organ damage predominantly from the buildup of phenylalanine, which becomes toxic at higher concentrations 43,44. Since the entire pathway is then blocked by lack of the first enzyme, the patients also suffer complications as a result of missing the downstream metabolites such as tyrosine, an essential amino acid, and fumarate, a backup energy source. To verify that AI-NLP capabilities can point to known pathway-disease relationships, we searched for the co-occurrence of the genes and metabolites on the phenylalanine metabolism pathway with PKU disease and evaluated the literature evidence (Supplemental Table 6). In the co-occurrence tool of WDD, the user selects the entity category of co-occurrence to search for. Since this workflow operates on the pathway level, we first searched for genes, as WDD treats genes and proteins as the same entity. The user then enters entities of interest for co-occurrence with the category. To ascertain disease-pathway activity relationships, we first input the genes on the pathway and then the disease(s) of interest. In this validation example all of the genes on the Phenylalanine and Tyrosine metabolism pathway were entered along with PKU. In this case, all the genes (and the disease) were known entities in the WDD corpus. Next, the user selects which document repository to search. We selected MEDLINE + PMCOA to have as many documents as possible from the National Library of Medicine. Further parameters can be applied such as publication date range, although in our analysis we permitted all publications. 
The search in WDD finds genes that co-occur with the entity of interest and outputs an affinity score, an affinity category, the number of documents with the co-occurrence and links to all of the papers with the entities highlighted. We use each piece of the WDD results sequentially to evaluate the function of genes in diseases.

The WDD co-occurrence affinity score is based on a chi-squared test and is reported as a p-value, which WDD labels the affinity score, for the hypothesis that the two entities co-occur only by random chance. A very low score corresponds to a strong affinity between the two entities that is unlikely to have appeared by chance in the literature. Entities are categorized from no to high affinity. In our searching, high affinity category results have generally indicated a very well-known and documented relationship for which there is ample evidence. Therefore, high co-occurring results yield disease activity predictions that are likely to be correct and often have in vivo evidentiary support. Lower-affinity results have given us actionable predictions as to the function of that pathway in the disease, but they require further investigation in the laboratory to confirm and are therefore hypothesis generating. The number of papers has not been a particularly useful metric to us, but it does give an indication of whether the topic has been well-studied in the community. Finally, viewing the documents is where we actually glean the predictions of functions from the published knowledge in the area. Another advantage of this approach is that documents are grouped from many different journals. Entity searching also finds matches of synonyms that conventional searches miss, increasing the likelihood of finding meaningful connections.
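To make the idea of a chi-squared-based co-occurrence score concrete, the sketch below applies a Pearson chi-squared test with one degree of freedom to a 2x2 table of document counts. The exact statistic and corrections used inside WDD are not described here, so this is an assumption-laden illustration; the function name and the counts are hypothetical.

```python
import math


def cooccurrence_p_value(both, only_a, only_b, neither):
    """P-value of a Pearson chi-squared test on a 2x2 table of document counts.

    both    : documents mentioning entity A and entity B
    only_a  : documents mentioning A but not B
    only_b  : documents mentioning B but not A
    neither : documents mentioning neither entity
    """
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        return 1.0  # an entity is never (or always) mentioned: no evidence either way
    chi2 = n * (both * neither - only_a * only_b) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up counts: the first pair co-occurs far more often than chance predicts,
# the second pair co-occurs exactly as often as independence predicts.
strong = cooccurrence_p_value(both=10, only_a=90, only_b=90, neither=9810)
chance = cooccurrence_p_value(both=1, only_a=99, only_b=99, neither=9801)
```

Note that a plain chi-squared test is two-sided: a pair that co-occurs much less often than expected would also receive a low p-value, so a practical tool would additionally check the direction of the association.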

In our validation set of the phenylalanine pathway activity in PKU, we looked at the affinity category, score and then literature evidence for the co-occurrence of each gene with PKU disease (Table 1). Since this pathway is known to cause this disease, we were not surprised to see high affinity category results for all genes on the pathway except one, which had a moderate affinity score. The number of papers in the high affinity category ranged from 12 to 1377 for the different genes, demonstrating that the number of papers is not as important as the strength of the relationship within the paper, an advantage of NLP (Supplemental Table 6). The evidence in the co-occurring papers explained the disease causation and treatments related to this pathway and, as such, explained the biological activity 41-45. This confirms that AI-NLP applied to metabolomics-guided systems biology can find evidence explaining the roles of a pathway in a disease. As with any tool in this category that makes predictions or finds associations in the body of scientific literature, AI-NLP tools are limited by what data are published and accessible. We recognize that not all diseases have genetic causes and that it is often the metabolite rather than the gene that is the active player in a disease process 46. Therefore, the same procedure was applied to the metabolites on the pathway, searching for chemicals that co-occur with the disease. The validation example was a genetic disorder, but searching via metabolites was still fruitful (Supplemental Table 6). In the absence of identified genes or proteins on a pathway or genetic causation, metabolite-disease co-occurrence searching is effective. This highlights an advantage of pathway-level interpretation in that key evidence can be gleaned from the most effective level of input data for any given experiment.

Table 1:

Definitions of result types (Database vs. Discovered) for IBM WDD

Term | Definition
Database Results | Database results include all known entity information including structure, common annotations, and synonyms. Sometimes ontologically relevant entities to the search term will be listed.
Discovered Results | Discovered results include terms that were found to co-occur with your input entity or entities. These are displayed as ranked lists and separated by entity type.

Validation of Workflow 3

Literature-assisted putative biomarker identification is one potential application resulting from broad NLP-based literature searching in the metabolomics field. The identification of biomarkers to distinguish metabolically or symptomatically related diseases is desired clinically for accurate and rapid diagnosis. In many diseases that are caused by problems in metabolic pathways, early diagnosis is necessary to avoid negative side effects from the buildup of toxic metabolites 1,47,48. In this workflow demonstration, we reexamined results from a previous biomarker discovery study for two related diseases from the propionate metabolism pathway for the biosynthesis of adenosylcobalamin, Methylmalonic Acidemia (MMA) and Propionic Acidemia (PA) 47. Typical diagnosis of the two uses GC-MS detection of elevated propionyl-CoA in neonatal blood screening. However, this biomarker is used for both and does not distinguish the two diseases, which is problematic clinically because the two diseases require different treatments 47. In that study, an untargeted metabolomics workflow was used to identify perturbations in these disorders of propionate metabolism, which differ only by mutations in two different genes on the pathway, and it identified additional potential biomarkers. To validate the biomarker prediction workflow, we re-analyzed this MMA and PA study with a cognitive computing-based literature search used to rank likely target metabolites for analysis before any mass spectrometry is performed.

With the WDD platform's cognitive computing tools, a large set of endogenous metabolites was ranked based on relevance to disease keyword training sets for MMA and PA independently (Figure 2). These ranks were generated from heat scores for a list of endogenous metabolites against the disease keyword training sets. Heat scores are relative and depend on the number of training set members, so normalized ranks were used for comparisons between searches. For this search, each metabolite was queried as a formulation group, which includes desalted and standardized structures of a compound, rather than as an individual chemical structure. This was done for each metabolite so as to capture all occurrences, including non-specific references to the metabolite in the literature. Ranked lists of disease-associated metabolites were generated, and the lists for the two diseases were then compared, allowing for correlation analysis between the search results for each disease (Figure 6). While many of the metabolites were similarly ranked across both MMA and PA query results, distinct areas of non-correlating metabolites emerged (Figure 6, shaded area). The metabolites in these areas were ranked highly for one disease but not for the other and are therefore potential biomarker candidates that could discriminate these diseases. Additional metabolites ranked in the top 100 for both diseases are likely the most significant metabolites but may not be unique to either disease (Figure 6, metabolites on central diagonal). Carnitine formulations were identified as a top 10 hit for both MMA and PA, appearing in the biomarker candidates section of the graph. Metabolites of this type were identified as highly increased in both diseases in the previous study, but with certain sub-formulations of carnitine being disease specific 47.
When searching is limited to papers published before 2007, to exclude the validation study, valine, isoleucine, and carnitine formulations are ranked in the top 10 compounds for both the MMA and PA searches. This demonstrates that the high ranking of these compounds of interest was not contingent on the validation study being included in the query. While our ranked list pointed to the previously identified biomarkers, the metabolomic profiling in the study was necessary to pinpoint the exact molecule to be further evaluated as a biomarker. Ranked literature searching gives an initial direction for research and assists in experimental development, but a complete metabolomics workflow is still necessary for formal biomarker identification.
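
The rank comparison described above can be sketched as follows. This is a minimal illustration rather than the WDD implementation: the heat scores, the placeholder metabolite names, and the `top_fraction` and `gap` cutoffs are all hypothetical. Ranks are normalized to the range 0 to 1 so that lists from separate searches can be compared.

```python
def normalized_ranks(scores):
    """Map metabolite -> rank scaled to [0, 1]; 0 is the top-ranked metabolite."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    denom = max(len(ordered) - 1, 1)
    return {name: i / denom for i, name in enumerate(ordered)}

def split_candidates(scores_a, scores_b, top_fraction=0.25, gap=0.5):
    """Separate metabolites ranked highly for both diseases from those
    ranked highly for only one (candidate discriminating biomarkers).

    top_fraction and gap are hypothetical cutoffs chosen for illustration.
    """
    ranks_a = normalized_ranks(scores_a)
    ranks_b = normalized_ranks(scores_b)
    shared, only_a, only_b = [], [], []
    for name in sorted(set(ranks_a) & set(ranks_b)):
        ra, rb = ranks_a[name], ranks_b[name]
        if ra <= top_fraction and rb <= top_fraction:
            shared.append(name)      # near the diagonal, top of both lists
        elif ra <= top_fraction and rb - ra >= gap:
            only_a.append(name)      # high for disease A, low for disease B
        elif rb <= top_fraction and ra - rb >= gap:
            only_b.append(name)      # high for disease B, low for disease A
    return shared, only_a, only_b

# Placeholder heat scores; these are not values from the study.
scores_mma = {"m1": 0.9, "m2": 0.8, "m3": 0.1, "m4": 0.5, "m5": 0.3}
scores_pa = {"m1": 0.95, "m2": 0.1, "m3": 0.85, "m4": 0.5, "m5": 0.3}
shared, mma_specific, pa_specific = split_candidates(scores_mma, scores_pa)
```

Metabolites returned in the disease-specific lists correspond to the off-diagonal regions of a rank-versus-rank plot, and those in the shared list to the top of the central diagonal.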

Figure 6: Metabolome level disease comparisons for biomarker discovery.

Biomarkers have clinical relevance as a diagnostic to discriminate between diseases. Scientific literature evidence of metabolite-disease relationships was investigated to predict biomarker candidates for two diseases. MMA and PA result from mutations of two different genes on the same pathway, but require different treatment programs. The AI-based search tool was asked to predict and rank metabolites by literature association with each disease. The ranks of these metabolites were then plotted, revealing the highly correlating ranked metabolites for MMA vs PA. The blue region contains putative PA biomarkers, the purple region contains putative MMA biomarkers, and the overlapping area contains top-ranked biomarkers, circled in red. Metabolites that are highly associated with one disease and not found in the metabolome of the other disease represent biomarker candidates for formal validation. The strongest predicted candidates are bolded.

Validation of Workflow 4

To demonstrate this workflow for potential drug repurposing based on metabolite relationships and activities, we predicted metabolite associations for two diseases that often co-present in patients. Chronic Kidney Disease (CKD) and Acute Myocardial Infarction (AMI) often co-occur in patients despite affecting distinct organ systems 49. The metabolome relationships of the two diseases were examined using WDD's explore a network tool. A query searched all available literature for chemicals and drugs associated with the two diseases. The result is a network diagram of the entities that have a high confidence score for their relationship to the disease, as determined from literature evidence in Medline + PMCOA. This search is unbiased by the user because it is based solely on the knowledge held in WDD for the two queried conditions. The network results can be filtered by entity category, ontology, type of relationship, number of papers citing that relationship or confidence score (Supplemental Figure 6). The first view of the unfiltered search provides a comprehensive, literature-based overview of metabolites that are related to CKD and AMI (Figure 7).

Figure 7: Metabolome level disease comparisons for drug repurposing.

Metabolites are actors in disease progression, and metabolic pathways can link diseases that co-present mechanistically and on a molecular level. Here we predict metabolites that may be acting in this intersectional capacity based on the literature association and relationship between metabolites or drugs and diseases. (A) Iterative filtering of the literature-based chemical-disease similarity network for CKD and AMI. Vitamin D is highlighted in red. (B) Drugs predicted for the diseases, with Vitamin D as the top-ranked drug compound for CKD and ranked 70th for AMI. (C) Categorization of shared metabolites by type: drugs (orange), endogenous metabolites (green), hormones (blue) and well-known actors (grey). MDA is highlighted in cyan as a new shared metabolite of interest that may link the diseases metabolically.

Through iterative filtering of the metabolic association network with the confidence of association parameter, the metabolites most relevant to the diseases were pinpointed and can be used to direct further study such as drug repurposing (Figure 7a). This confidence score is an internal calculation in WDD, based on the number of papers and the consistency with which the relationship appears in those papers, that yields the probability that the relationship between two entities is valid. Our threshold is generally set at 95% confidence given standard calculations of error and probability. These filtered molecules were categorized by type and then the evidence connecting the diseases to the metabolite was considered. For example, Vitamin D was a shared metabolite between the diseases, but was found to be related with higher confidence to CKD than to AMI. When specifically analyzing predicted drug relationships, Vitamin D is the top-ranked drug related to CKD and is ranked 70th for AMI (Figure 7b). Through iterative analysis, these metabolite-disease relationships can be characterized, contextualized and understood. We observed that Vitamin D has been studied for both diseases, but far more articles were found for the CKD-Vitamin D relationship than for the AMI-Vitamin D relationship (260 vs 24). Of the 24 articles on AMI and Vitamin D, many cite the known relationship between CKD and Vitamin D as the nexus for further studying Vitamin D and AMI (Xu et al., 2017, Tunon et al., 2016). These results suggest that CKD and AMI could be metabolically linked on pathways involving Vitamin D, since Vitamin D is synthesized in the kidneys and has been suggested as a cardioprotectant. Also, Vitamin D or drugs targeting Vitamin D pathways might be useful in treating both disorders.
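
The filtering step can be sketched as a confidence threshold followed by an intersection of the two disease networks. The record layout, chemical names and confidence values below are placeholders chosen for illustration and are not WDD output; only the 95% threshold is taken from the text.

```python
def shared_chemicals(edges, disease_a, disease_b, min_confidence=0.95):
    """Return chemicals linked to both diseases at or above min_confidence.

    edges: iterable of (chemical, disease, confidence) tuples.
    This tuple layout is an assumption, not an export format of any tool.
    """
    kept = {}
    for chemical, disease, confidence in edges:
        if confidence >= min_confidence:
            kept.setdefault(disease, {})[chemical] = confidence
    a = kept.get(disease_a, {})
    b = kept.get(disease_b, {})
    # Keep the confidence for each disease so the two can be compared.
    return {name: (a[name], b[name]) for name in sorted(set(a) & set(b))}

# Placeholder network edges; the values are invented for the example.
edges = [
    ("chem_1", "CKD", 0.99), ("chem_1", "AMI", 0.96),
    ("chem_2", "CKD", 0.97), ("chem_2", "AMI", 0.80),
    ("chem_3", "AMI", 0.99),
]
shared_at_95 = shared_chemicals(edges, "CKD", "AMI")
shared_at_98 = shared_chemicals(edges, "CKD", "AMI", min_confidence=0.98)
```

Re-running the function with progressively stricter thresholds mimics the iterative filtering: each pass leaves fewer shared chemicals to inspect against the literature evidence.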

Applications of the method:

This protocol has been designed for use with targeted and untargeted metabolomics experiments leading to such outcomes as biomarker discovery and validation, metabolite identification, flux-based analysis modeling, systems biology analysis, disease comparisons and Metabolomics Activity Screening. This protocol can easily be adapted for use in any of the other main ‘omics fields, including but not limited to genomics, transcriptomics and proteomics. As many of the tools were originally designed for genes and proteins, not metabolites, there are even more options with higher fidelity results for these other technologies.

Comparison with other methods:

There are many AI-NLP implementations for scientific research (Semantic Scholar 50, Iris, Core, Content Mine, Microsoft Academic, Watson for Drug Discovery, HUPO B/D-HPP, Source Data), biomarker discovery and precision medicine (Euretos 51-53) that are changing the way literature searching is done. New tools are being released continuously as well 54,55. The biggest differences among the tools are the types of inputs they can use, their search and processing algorithms and the types of results the user receives as output. The tools can be classified according to the type of output generated: tools that report lists of papers, tools that rank and score the papers and tools that make predictions based on the content of papers (Figure 1). Although no one tool can be used for every research question or application, the use of cognitive computing-based literature search applied to metabolomics enables new and deeper insights into metabolite activity in biological systems and the formulation of testable predictions of biomarkers, new applications for existing drugs or the metabolic mechanisms underlying a condition.

Traditional literature searching platforms include Web of Knowledge, Scopus, Google Scholar and PubMed Advanced, but even these incorporate new processing algorithms, as in the recent release of SciFinder-n. Without a cognitive computing tool designed for scientific literature, a researcher would manually search databases by keyword or text matching. They would then compile all the results and individually inspect each one for applicability to their topic of interest. The advantages of AI-NLP tools include the reduction of time, the simultaneous searching of several levels of biology, the efficiency of the searching, the connections found between disparate sources of information and the predictive power of the results. The disadvantages of the AI-NLP based searching tools and this protocol are that the tools are not built specifically for metabolomics and that they are limited by the database they search, their entity definitions and the availability of structured data. However, these are common disadvantages of any approach to scientific literature searching.

Experimental design:

When designing the experiment for this protocol, it is first necessary to design a well-controlled metabolomics experiment. Metabolomics experimental design is beyond the scope of this protocol, but we refer readers to several references on the subject 23-25,27. To perform the workflows included in this protocol effectively, consideration should be given to each workflow. In general, users should consider internal checks or validations of the results: if they have a known metabolite, gene or protein, they should examine the results for that hit before interpreting the unknown results. These workflows also rely on the accuracy of the metabolite identifications or annotations, so better metabolite annotation leads to better results. It is also important to check that the molecule naming convention you are using is compatible with your literature searching tool of choice and, if not, to convert the formatting. In workflow 1 prioritization, the choice of the group of keywords or training set to describe the biological condition of interest is crucial to obtaining successful results. Users should have at least six terms that describe the condition by name and the symptoms or environmental descriptors. The collection of these terms should distinguish the condition from other similar disorders. It should be noted that all words selected will be weighted equally whenever that function is employed throughout the workflows. In workflow 2, it is advisable to check if the genome of your source organism is known and well-annotated. If not, you should build a metabolic model to achieve better pathway mapping. In workflow 3, it is best to choose diseases that have a medical reason for needing to distinguish them with biomarkers. For workflow 4, it is important to choose diseases that co-present or have phenotypic evidence of a relationship.

Limitations:

The limitations of this protocol derive from the limitations of the AI-NLP tools themselves and the databases they search. Novel or understudied compounds or diseases may not have literature results in the databases searched and will therefore not yield meaningful hits in this protocol. Each tool searches only certain databases and, within them, only certain sections of the text are available. For instance, some databases only provide the title and abstract to the AI-NLP servers while others provide the full text. A further hindrance is that much of the scientific literature exists as unstructured or siloed data, which the AI-NLP tools cannot access or process. AI-NLP tools are also limited by which entities are in their corpus and do not cover all possible molecules and conditions. Text matching can be used in these cases, but some of the advantages of the NLP servers are lost. Before use, data needs cleaning and processing to be in a readable format for each tool. Finally, the protocol is not automated at this time and the results need to be verified in a physical laboratory.

Materials

Equipment

Computer requirements

  • Browser requirements: XCMS Online supports many of the mainstream web browsers. For the best results, we recommend using the latest version of Google Chrome (v57+) or Mozilla Firefox (v51+).

  • Internet connection requirements: A fast upload connection is recommended, with a minimum of 5 Mbps, to upload files to data set storage. This can be done directly from the instrument computer or from a personal computer, provided there is adequate hard-drive space for data files. Physical Ethernet connections are normally preferred over wireless (WiFi) connections.

  • Hardware requirements: To view and work with XCMS Online results, a minimum of a Pentium 4 processor with 8-GB of RAM and a screen resolution of 1,280 × 800 or higher is recommended.

Software:

Spreadsheet Software: Microsoft Excel or something with the same capabilities, including VLOOKUP

Access to at least one of the following tools:

WDD, SciFinder, HUPO B/D-HPP, Microsoft Academic, Semantic Scholar

Data files

  • XCMS Online currently supports upload of both raw data files and numerous converted MS data formats

  • Gene and protein data format for XCMS Systems Biology: Differentially expressed gene and protein data should be in the format of a comma-separated (.csv) or tab-separated (.tsv) file. Gene names should be in the format of gene symbols, and protein names should be in the format of gene symbols or UniProt accession numbers24,56. If multiple data sets are available, they must be uploaded individually.
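
Before upload, the identifier column of such a file can be checked with a short script. The sketch below rests on several assumptions: the column name `id` is hypothetical, the gene-symbol pattern is a loose heuristic rather than an official rule, and the accession pattern follows the format that UniProt documents for its accession numbers. It is not part of XCMS Online.

```python
import csv
import io
import re

# Pattern for UniProt accession numbers, as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Loose heuristic (an assumption): starts with a letter, then letters,
# digits or hyphens. Real gene nomenclature rules are stricter.
GENE_SYMBOL_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")

def classify_identifiers(text, column="id", delimiter=","):
    """Sort the values of one column into accession, symbol-like or unrecognized."""
    result = {"uniprot": [], "symbol": [], "unrecognized": []}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        value = (row.get(column) or "").strip()
        if UNIPROT_ACCESSION.match(value):
            result["uniprot"].append(value)
        elif GENE_SYMBOL_LIKE.match(value):
            result["symbol"].append(value)
        else:
            result["unrecognized"].append(value)
    return result

# Hypothetical file content; the fold_change values are invented.
example = "id,fold_change\nPAH,2.1\nP00439,1.8\n12 34,0.5\n"
report = classify_identifiers(example)
```

For a tab-separated file, pass `delimiter="\t"`. Any entries reported as unrecognized should be corrected by hand before the file is uploaded.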

Procedure

Cognitive Annotation and Prioritization of Untargeted Metabolomics Results

Workflows may be used individually or consecutively.

Timing: ~1-4 hours depending on runtimes of XCMS, MISA, and annotation tools

Workflow 1:

Part 1: Metabolite Identification:

Steps 1-4 and 7 are covered in detail in our previous protocol papers as referenced.

  1. Standard Global Untargeted Metabolomics Approach
    1. Grow samples, extract metabolites, measure by mass spectrometry, process data, begin to evaluate features as described in our previous protocols23-26
      1. The protocol can also be applied to targeted metabolomics approaches.
      2. This approach is compatible with any type of chromatography and mass spectrometry.
  2. To process the metabolomics data from the mass spectrometer into a feature table, run an XCMS or XCMS Online Job24,28,29,31
    1. Full instructions for XCMS use are found in our published protocols and instructional videos.
    2. This workflow can be performed on single, pairwise and multi-group XCMS jobs.
  3. In the Feature table in the Results section of XCMS Online, use the Filter button to filter the features to the significantly dysregulated features with quality mass spectral data. Parameters for filtering include detection intensity values, fold change and statistics tests outputs for pairwise and multi-group jobs. Suggested filtering parameters are reported in our previous protocols24,29,31.

  4. Assign putative identifications, a process known as annotation, to the filtered features of interest. Annotating features is the first step in formally identifying metabolites per the Metabolomics Standards Initiative levels of metabolite identification57,58. There are a number of computational tools that assign putative identifications to metabolites from different mass spectrometry techniques. These tools use only data from the mass spectrum to make an assignment and frequently annotate a feature with dozens of possible identifications. It is beyond the scope of this protocol to list all such tools and databases, but the tools we have used during the development of these protocols are listed. Each tool should be used according to its own published methods.
    1. METLIN MS and MS/MS searching9,27
    2. Autonomous METLIN-Guided In-source Fragment Annotation (MISA)32
  5. Literature contextualization of putative identifications of metabolites of interest. In this step you will add information about the metabolite from what has already been published to the mass spectrometry data in the process of formally identifying metabolites.
    1. Using IBM Watson For Drug Discovery
      1. Log in to IBM Watson for Drug Discovery.
      2. On the left sidebar, click the second icon from the top to open the explore an entity tool.
      3. Select All or Any Inputs. When searching multiple metabolites or terms at once, you must select Find either All Inputs or Any Inputs. All Inputs will only return results where all inputs are included in the text of the document or abstract. Any Inputs will return results when at least one of the terms is present.
      4. In the top search bar, type in the metabolite(s) or term(s) to search. To add a term to the search, hit enter to autofill in the top entity suggestion or select one from the list. After the term is entered it will turn blue. To edit the search term, click on it and a drop-down menu will appear. From here you can select a different entity such as chemical, drug or formulation group. You can edit the term by clicking the pencil icon. To search the term as a text, select the bottom option on this drop down. You can select whether to match capitalization and punctuation for your term. Text search terms are only recommended when no entity fits your metabolite or search term of interest.
      5. Select the database of documents to search. For quick results, it is recommended that you search Medline Abstracts. For more comprehensive results, Medline + PMCOA is recommended as it includes Medline Abstracts and PubMed Central® open access papers.
      6. Click explore to load results.
      7. Results are divided into two categories: Database results and Discovered results.
      8. To view papers where two terms co-occur, click the number of documents. Papers can be sorted using the sort by menu in the top right-hand corner of the screen and can be sorted by publishing date, document title, or publisher name.
        1. To export data, click the download button next to the sort by button. A document list will be exported to Excel and will include the document identifier, title, published, main author list, document type, PubMed identifier, DOI, and publication date. This information can be easily shared with collaborators and used to locate the papers.
        2. To view a specific paper, click the title (in blue) and the annotated abstract or paper will load. On the right sidebar, select the types of entities to highlight. To view all entities found within the document, scroll to the bottom or click found entities to see a list of entities by type. This list can also be exported. The paper can also be viewed on PubMed by clicking view article on PubMed or can be viewed using DOI.
      9. Entities found to be co-occurring with your search can be saved as a search set or search group by clicking all the entities of interest and selecting save to set or a group. To access sets and groups, click the folder icon for saved items on the left bar and click the type of saved item desired. Sets can be exported, renamed and edited. To edit a set or group, click the set/group name. Entities in the set can be edited, activated or deactivated, deleted or modified. Notes can also be recorded for entities.
    2. Using Microsoft Academic
      1. Type the metabolite(s) of interest into the search bar. If Microsoft Academic recognizes it as a topic, a flask icon will appear next to it.
      2. Press enter or click the magnifying glass to run the search.
      3. Once the search results load, a group of top topics and authors will display on the left sidebar. On the right sidebar, parent and related topics will display. Sometimes a topic paragraph will display. You can click this to learn more about the nature of the literature published related to the compound.
      4. Use the left panel to filter results. It is recommended that the filter is set to display only journal publications.
      5. The results are displayed as a list of papers with the abstract or first paragraph and top topics. To view the complete article information, click the article title. Some articles may be downloaded directly by clicking the download button. References, cited by and related articles can be accessed from the view article page. On this page there is also a full list of topic annotations. To access the full article, click the DOI link.
      6. To further contextualize metabolites, additional search terms may be used, for example a metabolite together with the disease model in which it was observed.
      7. To search for a group of metabolites, one search can be done on multiple metabolites together that are assumed to be related. Top results will include as many metabolites as possible based on this assumption. This type of search is for contextualizing known ids but is not ideal for comparing and distinguishing putative ids. To compare putative ids, search metabolites individually with or without additional keywords and compare the individual results.
      Note: Microsoft Academic also allows direct searching for papers and searching by author, institution, journal and conference, but these applications were not explored in this protocol.
    3. Using SciFinder
      1. Navigate to https://scifinder.cas.org. A license is necessary for access, but many institutions have access.
      2. Type the metabolite of interest into the search box. For best results, search using a short contextualizing phrase as suggested in the search examples. Further filtering can be done before searching by using the advanced search settings to limit results by document type. Press the search button or hit the enter key.
      3. Research Topic Candidates will be displayed. SciFinder can return results matching the exact phrase, close association between concepts, and concepts co-occurring anywhere in the text. It will also return results for the concepts alone. Select as many of these options as are of interest, but note the number of references, as more references may be more difficult to sort through. In most cases, when multiple terms are searched, documents containing close associations between concepts will yield the most relevant results. For single search terms, select the topic for references found containing the concept for the most comprehensive results. Click get references.
      4. Once the references are loaded, further filtering can be done using the left menu. The three tabs can be used to analyze, refine and categorize results. There are helpful tooltips for each which can be accessed by clicking the question mark near the menu of interest. Each of these tools can be used to further filter and refine results. These changes to your search are documented on the top of the screen.
      5. To view a paper, click the title. Search terms will be highlighted in the title and the body of the abstract. To view the full text, click link to other sources.
      6. To export results, click the export button in the top right corner of the screen. Select specific papers to export or a range of papers. Papers can be exported to multiple file types and formats. For quick review of multiple papers, it is recommended that you select the results of interest, export them for Offline Review as a Portable Document Format (*.pdf) file and export the summary with full abstracts. The exported pdf is searchable, and terms used in the original query will be highlighted. Papers can be selected for export using the check box on the left of each result. Selected papers can also be sent to SciPlanner for export and review.
    4. Using Semantic Scholar
      1. In the drop down that says All Fields, select Medicine.
      2. Type the metabolite of interest into the search bar and hit enter.
      3. To filter the results, select more filters to view all filtering options. By default, papers are sorted by relevance but can also be sorted by publication date.
      4. If the search term is a topic, the sidebar can be expanded to view topic which includes a brief description pulled from NIH and a paper-based overview of highly cited papers. This is very helpful for putative identification.
      5. To view a paper, click the title to navigate to the paper view page, which includes the first paragraph of the paper. To view the full paper, click the view on button. The destination will vary from paper to paper, with some linking directly to pdfs. Related topics, similar papers, and publications citing the paper can also be viewed. To navigate between sections, use the navigation bar at the top of the screen.
      6. To search multiple metabolites, we recommend searching each compound individually as this tool does not appear to currently be suited for searching lists of compounds. Short contextualizing words or phrases including diseases can also be searched with a metabolite.
  6. Narrow down the list of putative identifications for each feature of high interest based on the literature context by eliminating impossible or improbable identifications. Consider these factors or others in your decision making on the likelihood of the identification.
    1. Is this putative identification manufactured synthetically? Are you expecting to see synthetic molecules such as drugs in your data set?
    2. Does the putative identification have a plausible biosynthetic route in your samples’ source organism?
    3. Is this identification a possibility given your experimental conditions inclusive of tissue type, cell type, sampling location, growth medium, and extraction methods?
    4. Has the putative identification ever previously been associated with something similar to the condition or source organism you are testing in your study?
  7. Complete formal identification of high interest metabolites using the reduced putative identification list to the desired MSI standards level with targeted MS/MS data acquisitions and molecular standards, purchased or synthesized, or experimental reference databases58.
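
The feature filtering in step 3 can also be reproduced offline on an exported feature table. The following is a minimal sketch: the field names (`fold`, `pvalue`, `max_intensity`) and the threshold values are hypothetical placeholders, fold change is assumed to be reported as a value of 1 or greater regardless of direction, and the filtering parameters suggested in the referenced protocols should be used in practice.

```python
def filter_features(features, min_fold=1.5, max_p=0.01, min_intensity=10000):
    """Keep features that pass fold-change, p-value and intensity cutoffs.

    features: list of dicts with hypothetical keys 'fold', 'pvalue' and
    'max_intensity'. The default thresholds are illustrative only.
    Returns the passing features sorted by ascending p-value.
    """
    kept = [
        f for f in features
        if f["fold"] >= min_fold
        and f["pvalue"] <= max_p
        and f["max_intensity"] >= min_intensity
    ]
    return sorted(kept, key=lambda f: f["pvalue"])

# Invented example rows standing in for an exported feature table.
feature_table = [
    {"name": "feature_1", "fold": 3.2, "pvalue": 0.0004, "max_intensity": 250000},
    {"name": "feature_2", "fold": 1.1, "pvalue": 0.0002, "max_intensity": 900000},
    {"name": "feature_3", "fold": 5.0, "pvalue": 0.2000, "max_intensity": 400000},
    {"name": "feature_4", "fold": 2.4, "pvalue": 0.0001, "max_intensity": 50000},
    {"name": "feature_5", "fold": 4.1, "pvalue": 0.0050, "max_intensity": 2000},
]
significant = [f["name"] for f in filter_features(feature_table)]
```

Only the features that pass all three cutoffs are carried forward to annotation in step 4, which keeps the list of putative identifications to review in step 5 manageable.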

Workflow 1, Part 2: Metabolite Prioritization:

  1. Make an ingestible list of metabolite names of interest from your study. Formal or putative metabolite identifications may be used in this part. Compound names can be in InChIKey or SMILES formats. Acceptable formats will vary with the AI-NLP tool.

  2. Create a list of terms to search that describe your biological condition of interest. It is recommended that the set includes the biological condition entity and relevant keywords related to symptoms, treatment and phenotype.

  3. After assigning putative or formal identifications, rank the features/metabolites against a known training set of keywords describing the biological condition of interest22
    1. Using IBM Watson for Drug Discovery
      1. To load your list of search terms into WDD, navigate to saved items sets. Create a new set. Manually enter each term. For entity terms, confirm the entity type that loads is correct. For text terms, select if capitalization and formatting must be matched. If a text term incorrectly disambiguates into an entity, click the term and edit it to select text rather than the entity. Text terms can also be entered in quotation marks to prevent this issue. This is also how you can modify entity types.
      2. Navigate to the predictive analytics tool.
      3. In the known set box either load your known set or enter terms manually.
      4. In the candidate set box, load your set of candidate metabolites.
      5. Once both sets are loaded select your database to search. For quick searching, search Medline Abstracts. For comprehensive searching, search Medline Abstracts+PMCOA.
      6. Results can be viewed visually in one of two ways: a similarity tree or a distance network. To view text evidence for a relationship, click two nodes and select view text evidence at the bottom of the screen.
      7. To export these results, click the download button at the bottom right and select export similarity scores, which will export a .csv file.
    2. Using HUPO BD-HPP
      1. Navigate to rebrand.ly/metapurpose
      2. On the left panel labeled metabolite prioritization, first select the query type: HPP Targeted Area, Human Disease (PheWAS codes), Organ Systems or Custom Search. For this protocol, the search focused on human diseases and organ systems, but other options may be more suitable to your model.
      3. To search for an organ system, select an organ system and choose species. To search for a disease, select Query Type: Human Diseases (PheWAS Codes) or use custom search.
      4. Decide to exclude or include black listed metabolites. By default, exclude is selected.
      5. Click Run.
      6. The results returned include 5 tabs of information. The prioritized list returned is a ranked list of metabolites including HMDB ids. Search the top metabolites and compare to your putative ids.
      7. To view literature evidence for each metabolite, navigate to the view publication tab to view highly cited papers for each metabolite sorted by number of citations. Be aware this may make it difficult to find less cited papers.
      8. To determine the rank threshold for relevant metabolites, view the publication score distribution tab and publication/citation number tab. These tabs can be used to further analyze the ranking of metabolites for a given search topic.
  4. Prioritize metabolites for further study and experiments based on association rankings, visual outputs and literature evidence.
    1. Prioritization based on association rankings.
      1. Metabolites ranked highly for literature association with the condition or search terms have strong evidence supporting their biological function in the condition. The researcher may wish to prioritize the highly ranked metabolites if they have a new hypothesis or insight into a well-studied area or are looking for known safe and efficacious molecules. Metabolites often have more than one function, and the literature is unlikely to have documented all possible biological functions. Conversely, researchers may be able to explain their observations with the literature evidence found and would then not prioritize these highly ranked metabolites for further study.
      2. Metabolites ranked poorly for literature association with the condition or search terms do not have strong evidence supporting their presence or biological function. These represent molecules with a high potential for novel findings.
    2. Prioritization based on visualizations of metabolite-condition rankings.
      1. As with the ranking-based prioritization above, visualization of rankings by branching or similarity tree allows researchers to see which metabolites found in their samples have known literature associations with the search terms representing their conditions, and to decide whether they want to study molecules with known or unknown roles in their condition of interest.
      2. Visualization also permits the rapid assessment of which search terms describing the condition were associated with which molecules. Not all terms contribute equally and some molecules will be associated with different aspects of the conditions, permitting a more nuanced assessment than straight rankings.
    3. Prioritization based on literature evidence. Access and read the papers providing the evidence for the association ranking as described above. Review the literature for evidence of tightly associated metabolites. Ranking alone may not be sufficient to explain the relationship between the metabolite and condition.
      1. Tight associations: The literature may explain the role of the metabolite in that condition and negate the need for further study.
      2. Tight associations with inadequate evidence: The literature may not adequately explain the role of the metabolite, which makes it a good target for further study as something of high interest in the field.
      3. No or weak associations: Metabolites with little or no literature evidence likely represent novel targets. Their characterization would add new knowledge to the field, and they can be selected for further study.
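
The ranking and prioritization steps above can also be scripted once the similarity scores have been exported. The following is a minimal sketch only: the column names (`entity`, `set`, `similarity_score`), the example rows and the `top_n` threshold are hypothetical placeholders, and an actual WDD or HUPO BD-HPP export may be laid out differently.

```python
import csv
import io

# Hypothetical export layout; the real .csv column names may differ.
EXAMPLE_CSV = """entity,set,similarity_score
metabolite_A,candidate,0.82
metabolite_B,candidate,0.10
keyword_1,known,1.00
metabolite_C,candidate,0.47
"""


def rank_candidates(csv_text, name_col="entity", set_col="set",
                    score_col="similarity_score"):
    """Return (name, score) pairs for candidates, highest score first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Drop the known (training) terms so only candidate metabolites are ranked.
    candidates = [r for r in rows if r[set_col].strip().lower() != "known"]
    ranked = sorted(candidates, key=lambda r: float(r[score_col]), reverse=True)
    return [(r[name_col], float(r[score_col])) for r in ranked]


def split_by_rank(ranked, top_n):
    """Split a ranked list into the top_n entries and the remainder."""
    return ranked[:top_n], ranked[top_n:]


ranked = rank_candidates(EXAMPLE_CSV)
well_documented, under_studied = split_by_rank(ranked, top_n=1)
print("Strong literature association:", well_documented)
print("Weak or no literature association:", under_studied)
```

Which of the two groups is prioritized depends on the research question, as described in step 4; the script only separates them.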

Workflow 2: Biological Contextualization of Metabolomics Predicted Pathways

  1. Perform a metabolomics experiment and process the data with XCMS Online as described in Workflow 1, Part 1, Steps 1-3.

  2. Map metabolites onto the metabolic pathways using a computational tool
    1. Using the XCMS Systems Biology Portal (MUMMICHOG algorithm)
      1. Follow the steps in our 2018 Nature Protocols procedure24,30.
      2. Retrieve gene list for each pathway
      3. Retrieve metabolites found list for each pathway
    2. Using the MetaCyc Omics Tools
      1. Navigate to https://metacyc.org/omics.shtml and select the “Paint metabolomics data onto metabolic map” under Metabolomics Data Analysis.
      2. Follow the database’s set of instructions to upload and then overlay metabolomics data onto a metabolic map59.
    3. Determine Metabolite-Protein Interactions with STITCH
      1. Navigate to http://stitch.embl.de/ . To view relevant protein-metabolite interactions, access the Protein-Metabolite interactions tab. Select the number of top metabolites to search. 20 metabolites is the default setting and a good starting point.
      2. Click the view PMI button. This will load STITCH results for all the selected metabolites. Follow the instructions in the database60.
      3. Review the list and remove any metabolites that are not of interest to you. Select any genes that are of interest to you. Each gene includes a small description with the related metabolite highlighted to assist in this decision.
  3. Save the gene list and metabolite list for each pathway in a table format such as a .csv file, using standard names or abbreviations such as GO (Gene Ontology) terms.

  4. Navigate to IBM Watson for Drug Discovery

  5. On the left panel, select the 4th icon from the top to navigate to the co-occurrence tool.

  6. On the top search bar select what entity type you would like to find co-occurring with your search terms such as genes, proteins or chemicals/drugs. By default, genes are selected.

  7. Type in the entities of interest you would like to search. Note that each additional entity will increase processing time and the size of results. Enter all the entities of the type selected from that pathway. For instance, enter all genes on the chosen pathway if genes were selected in step 6.

  8. Let the search run.

  9. Filter results by searching for gene name in list or by clicking the column for each gene to sort it. This brings the co-occurring hits to the top for easy visualization and analysis.

  10. Identify and evaluate the co-occurrence of each gene or metabolite on the pathway with the term for the condition of interest based on the affinity score, confidence level, and literature evidence of co-occurrence

  11. Assemble information from all components of pathway to learn or make predictions about the role of the pathway and the metabolites in the condition of interest.
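
Steps 9-11 amount to grouping co-occurrence hits by pathway and sorting them by affinity. If the hits are transcribed or exported from WDD, this could be sketched as below. The tuple layout, pathway and gene names, and all numbers are hypothetical placeholders rather than results from this protocol, and the `min_documents` filter is an assumption about how one might require literature support.

```python
from collections import defaultdict

# Hypothetical co-occurrence hits:
# (pathway, gene, affinity score, number of supporting documents).
hits = [
    ("pathway_1", "GENE_A", 0.91, 12),
    ("pathway_1", "GENE_B", 0.15, 1),
    ("pathway_2", "GENE_C", 0.64, 5),
    ("pathway_2", "GENE_D", 0.00, 0),
]


def summarize_pathways(hits, min_documents=1):
    """Group hits by pathway, keep supported genes, sort by affinity."""
    by_pathway = defaultdict(list)
    for pathway, gene, affinity, n_docs in hits:
        # Keep only genes with at least min_documents supporting documents.
        if n_docs >= min_documents:
            by_pathway[pathway].append((gene, affinity, n_docs))
    # Highest affinity first within each pathway.
    return {
        pathway: sorted(genes, key=lambda g: g[1], reverse=True)
        for pathway, genes in by_pathway.items()
    }


summary = summarize_pathways(hits)
for pathway, genes in summary.items():
    print(pathway, genes)
```

The sorted lists only indicate which papers to read first; the role of the pathway still has to be judged from the literature evidence itself.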

Workflow 3: Metabolome Level Disease Comparison for Biomarker Discovery

  1. Run search on each disease.
    1. Using HUPO BD-HPP
      1. Navigate to rebrand.ly/metapurpose
      2. Select Query Type: Human Diseases (PheWAS Codes)
      3. Type the disease into the step 2 box. Some suggested diseases will appear below while typing. If your disease of interest is present, click to select it. If the disease is not suggested in the dropdown, it can still be entered as a search query. Diseases not resolved in the dropdown can also be searched using the custom search feature. Unresolved diseases searched this way may yield different results. For comprehensive results, it is suggested that you run both searches for unlisted diseases. For a quick search, the disease search alone can be run.
      4. Step 3 by default is set to Human as the species. If analyzing a disease model in a common model organism you may want to switch the species to match.
      5. Decide to exclude or include blacklisted metabolites. By default, exclude is selected.
      6. Click Run
      7. To download results, scroll to the bottom of the prioritized list tab and click download. This will load the table into a .csv.
      8. Save this file as an Excel file.
      9. Repeat the above steps for each disease you are analyzing.
    2. Using IBM Watson for Drug Discovery
      1. Navigate and login to WDD.
      2. Load the metabolite set from the supplemental or your metabolite list of choice. Navigate to Saved Entities, Sets and click create a new set. Name your set.
      3. Add entities. Option 1: Copy and paste the entire entity name column from the spreadsheet. Each compound will be delimited as its own entity. Option 2: Add from CSV. Select the saved file. Compounds will import with the selected entity types. To import the entire set as formulation groups, edit the entity type column to “BASECHEMICAL”.
      4. To temporarily disable an entity within a list, change its status to suspend. To remove it, delete the entity.
      5. You can also create sets for your search terms for your disease. It is recommended that the disease set includes the disease condition entity and relevant keywords related to symptoms, treatment and phenotype. Follow the same instructions as above but manually enter each term. For text terms, select if capitalization and formatting must be matched. If a text term incorrectly disambiguates into an entity, click the term and edit it to select text rather than the entity. Text terms can also be entered in quotation marks to prevent this issue. This is also how you can modify entity types.
      6. Navigate to the Predictive Analytics tool.
      7. In the known set box either load your known set or enter terms manually. The same instructions as above can be followed for editing terms as entities or text.
      8. In the candidate set box, load your set of candidate metabolites. This can be the list included in this protocol or an external list of metabolites. For exploratory searches, a broad list of metabolites should be used. If the search is focused on a specific system or organ, a narrower list may be appropriate. This method can also be used to rank observed metabolites based on disease condition.
      9. Once both sets are loaded select your database to search. For quick searching, search Medline Abstracts. For comprehensive searching, search Medline Abstracts+PMCOA.
      10. Results can be viewed visually in one of two ways: a similarity tree or a distance network. To view text evidence for a relationship, click two nodes and select view text evidence at the bottom of the screen.
      11. To export these results, click the download button at the bottom right and select export similarity scores, which will export a .csv file.
      12. Save this file as an Excel file.
      13. In the file, filter out known entities. The similarity score and rank are of the greatest interest. Rows labeled “known” in the set column can be removed or hidden and should not be plotted on the distribution chart or the rank comparison plot.
      14. Repeat this for each disease of interest using the same candidate set.
  2. Graph the distribution scores in Excel by inserting a scatter plot.
    1. IBM Watson for Drug Discovery: For the x axis, select the candidate rank column. For the y axis select the similarity score column.
    2. HUPO BD-HPP:
      1. For the x axis select the rank column and for the y axis select the PURPOSE_Score.
      2. Change the y axis units to logarithmic scale. This is functionally the same as the publication score distribution chart in browser.
  3. Analyze exported data. Once all data are exported, they can be cross-referenced. Searches should be compared based on the normalized ranks rather than scores, as scores are not absolute across searches. Score distributions can be plotted to determine the rank cutoff for significance. To determine this cutoff, look for the point where the data first begin to plateau. At this point the scores are very similar across ranks in that range. If this threshold is vastly different between the two results, use your best discretion to select a cutoff.

  4. To compare scores across sets
    1. For HUPO BD-HPP
      1. Create a new sheet as the first page of your first Excel file.
      2. Copy the results from the second file to a third sheet.
      3. On the first sheet, insert all the HMDB identifiers from each set of results into column A.
      4. Select column A of sheet 1 and click remove duplicates, generating a comprehensive list of all HMDB identifiers with ranks across the disease searches.
      5. Label column B with disease one (sheet 2). Label the columns from B onward with each disease of interest.
      6. For each disease sheet, cut the rank column and insert the cut cells to the right of citation/year. The rank column should become column H (the 8th column) and column A should be UniProtKB_ID. It is important that this is the first column for each disease sheet. Delete any blank columns that remain after moving the rank column.
      7. On sheet 1, label column B onward with each disease of interest.
      8. In each disease column of the first sheet, use the formula =IFERROR(VLOOKUP($A2,disease_name!$A:$H,8,FALSE),0). Replace disease_name with the name of each disease sheet. This will return the rank for each metabolite from each search. Some searches will not have a rank for every metabolite listed; these will return 0.
    2. For IBM Watson for Drug Discovery
      1. Create a new sheet as the first page of your first Excel file.
      2. Copy the results from the second file to a third sheet.
      3. On the first sheet, copy all of the entity names from each set of results into column A. Select column A and click remove duplicates. If the same metabolite data set was used to search for both diseases, you only need to copy one set of names and can skip the remove duplicates step.
      4. Label column B with disease one (sheet 2). Label the columns from B onward with each disease of interest.
      5. In each disease column of the first sheet, use the formula =IFERROR(VLOOKUP($A2,disease_name!$A:$H,8,FALSE),0). Replace disease_name with the name of each disease sheet. This formula assumes the candidate rank is output in column H and the user entity name is in column A. If this is not the case, adjust the formula as necessary. This will return the rank for each metabolite from each search. If a compound is unranked in a search, its rank will return as 0.
  5. Plot the ranks of disease 1 vs disease 2 in a scatter plot. Note that points lying on an axis (y = 0 or x = 0) are ranked for one disease but not the other.
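
For readers who prefer scripting to spreadsheets, steps 3-5 can be approximated in a few lines. This is a sketch under stated assumptions, not the procedure used in this protocol: `merge_ranks` mirrors the `IFERROR(VLOOKUP(...),0)` lookup, while `plateau_rank` is one possible heuristic for "where the data first begin to plateau" (the `tolerance` and `window` parameters are invented for illustration). All identifiers and scores are placeholders.

```python
def merge_ranks(rank_tables):
    """Combine per-disease ranks; unranked identifiers get 0.

    rank_tables maps disease name -> {identifier: rank}.
    Returns {identifier: [rank in disease 1, rank in disease 2, ...]}.
    """
    diseases = list(rank_tables)
    # Union of identifiers, like filling column A and removing duplicates.
    identifiers = sorted(set().union(*(t.keys() for t in rank_tables.values())))
    return {
        ident: [rank_tables[d].get(ident, 0) for d in diseases]
        for ident in identifiers
    }


def plateau_rank(scores, tolerance=0.01, window=3):
    """Return the 1-based rank at which rank-ordered scores first plateau.

    A plateau is taken to start at the first rank whose next `window`
    consecutive drops are each below tolerance * (max score - min score).
    """
    span = scores[0] - scores[-1]
    if span <= 0:
        return 1
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    for start in range(len(drops) - window + 1):
        if all(d < tolerance * span for d in drops[start:start + window]):
            return start + 1
    return len(scores)


# Placeholder identifiers and ranks for two disease searches.
rank_tables = {
    "disease_1": {"ID_A": 1, "ID_B": 2},
    "disease_2": {"ID_B": 1, "ID_C": 2},
}
merged = merge_ranks(rank_tables)
for ident, ranks in merged.items():
    print(ident, ranks)

# Placeholder score distribution, already sorted by rank.
scores = [100, 60, 30, 10, 9.8, 9.7, 9.6, 9.5]
cutoff = plateau_rank(scores)
print("Suggested rank cutoff:", cutoff)
```

The rows of `merged` correspond to the points of the disease 1 vs disease 2 scatter plot, with a 0 marking a metabolite that lies on an axis. The suggested cutoff should still be checked against the plotted distribution.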

Workflow 4: Metabolome Level Disease Comparison for Drug Repurposing

  1. Navigate to IBM Watson for Drug Discovery.

  2. Click the explore a network tool, 3rd icon down on the left-hand bar.

  3. In the top text box, a set can be loaded, or entity terms can be typed in. To compare two diseases, enter each as a condition entity. For fastest results, search MEDLINE abstracts only. For most comprehensive results, search MEDLINE + PMCOA. For quick and clear results, search two conditions at a time and in general limit the total number of condition search terms to fewer than five per query.

  4. Under advanced options select the type of entity to relate to your search terms. Gene is selected by default. In the provided example, chemical was selected. For efficiency and clarity of results, it is recommended that only one to two entity types at a time are selected for searches with potentially large numbers of results. If one entity type does not return many results, more can be searched together but each additional entity type will increase the run time.

  5. Run the search.

  6. Select the type of relationships to display. By default, Discovered and Both are selected but database only results are excluded. With these settings, all relationships displayed will have at least one supporting paper/abstract. Database relationships are based on information from the databases used by WDD to populate the corpus and can be included if desired. Discovered relationships are the most likely to contain novel information.

  7. Results can be filtered in two ways: by confidence percentage and by number of supporting documents. Use the default threshold or select a range for the confidence and support threshold. Be aware that changing the threshold may dramatically change the number of results displayed. The minimum or maximum number of supporting documents (documents relating the entities) can also be used to filter results. Increasing the minimum number of papers shows well-documented relationships. Decreasing the maximum number of papers focuses on less-documented relationships. Note that in our example case, a compound’s relationships to diseases can shift based on filtering, so it is wise to test multiple filtering levels when analyzing results.

  8. Documents can be viewed in two ways. The first is on a node-by-node basis. Click the node and select view documents. All documents related to this entity will be displayed. If multiple entity nodes are selected, all related documents will be displayed. The second is on a relationship basis. Documents relating two entities can be viewed by clicking the line linking the two entities. Documents supporting relationships are annotated to highlight each entity, and the terms used to relate the two entities are highlighted in gray. For each entity relationship, the number of supporting documents and confidence score are provided.
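
The filtering logic of step 7 can be illustrated outside the WDD interface. The sketch below assumes each relationship has been recorded with a confidence percentage and a supporting-document count. The `Relationship` structure, the entity names and all values are hypothetical, and WDD itself may apply its filters differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relationship:
    source: str        # e.g. a condition entity
    target: str        # e.g. a chemical entity
    confidence: float  # confidence percentage, 0-100
    documents: int     # number of supporting documents


def filter_relationships(rels: List[Relationship],
                         min_confidence: float = 0.0,
                         min_documents: int = 1,
                         max_documents: Optional[int] = None) -> List[Relationship]:
    """Keep relationships within the confidence and document-count limits."""
    kept = []
    for rel in rels:
        if rel.confidence < min_confidence:
            continue
        if rel.documents < min_documents:
            continue
        if max_documents is not None and rel.documents > max_documents:
            continue
        kept.append(rel)
    return kept


# Placeholder relationships between two conditions and three compounds.
rels = [
    Relationship("condition_1", "compound_A", 92.0, 14),
    Relationship("condition_1", "compound_B", 71.0, 2),
    Relationship("condition_2", "compound_A", 88.0, 7),
    Relationship("condition_2", "compound_C", 40.0, 1),
]

# Raising the minimum document count keeps well-documented relationships.
well_documented = filter_relationships(rels, min_confidence=50, min_documents=5)
# Lowering the maximum document count focuses on less-documented ones.
less_documented = filter_relationships(rels, min_confidence=50, max_documents=2)
print([r.target for r in well_documented])
print([r.target for r in less_documented])
```

Running the same data through several threshold combinations, as recommended above, shows how a compound can appear or disappear from the shared set depending on the filter.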

Anticipated results:

Each workflow has a unique set of anticipated results and a use case for each workflow is discussed sequentially in this section. The validation data sets for each workflow in this protocol are discussed in the development of the protocol section.

Workflow 1

AI-assisted Metabolite Identification:

As an example, we used a peak from an on-going untargeted metabolomics profiling study that was strongly upregulated in the frontal part of rat brains in comparison to other areas of the brain. This peak has an m/z value of 329.2478 in ESI positive ionization mode. Restricting the METLIN filters to 5 ppm mass error, only M+H adducts and no toxicants or peptides reduced the number of hits from 229 to 15. All 15 of the remaining METLIN hits for the m/z value of the dysregulated peak of interest had the same elemental composition, but fragmentation data and neutral loss peaks eliminated the cyclic structures, leaving five options for putative identification. These five were searched using the protocol for Workflow 1. The polyunsaturated molecule (4Z,7Z,10Z,13Z,16Z,19Z)-docosahexaenoic acid (DHA) was selected as the likely putative identification from the entity associations with humans and rats in WDD. The other fatty acids were associated with fatty liver disorders, not the brain. DHA was confirmed by MS/MS and retention time by analysis of an authentic standard (Supplemental Figure 2).
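
The 5 ppm mass error filter can be checked by hand for the final identification. The sketch below computes the theoretical [M+H]+ m/z of DHA and its deviation from the observed peak. The elemental formula C22H32O2 and the atomic masses are standard values supplied here for illustration; they are not taken from the text above.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, and the proton mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
PROTON = 1.00727647


def monoisotopic_mass(formula_counts):
    """Sum the monoisotopic masses for an elemental composition."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula_counts.items())


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# DHA, C22H32O2 (formula assumed here, not stated in the protocol text).
dha = {"C": 22, "H": 32, "O": 2}
theoretical_mz = monoisotopic_mass(dha) + PROTON  # [M+H]+ adduct
error = ppm_error(329.2478, theoretical_mz)

print(f"Theoretical [M+H]+ m/z: {theoretical_mz:.4f}")
print(f"Mass error: {error:.2f} ppm (within 5 ppm: {abs(error) <= 5})")
```

Because all 15 remaining hits shared the same elemental composition, this check cannot distinguish among them; it only confirms that the candidates pass the mass filter, which is why fragmentation data and literature evidence were needed.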

Metabolomics Feature Prioritization:

This workflow provides an efficient and effective semi-autonomous method to prioritize metabolites based on literature evidence of known and unknown disease or environmental relationships.

We re-analyzed a previous global metabolomics colorectal cancer data set of metabolic changes in colon biofilms and tissues 30,61. Metabolic features were filtered by p-value less than 0.01, fold change greater than 5 and maximum intensity greater than 10,000 to obtain the significantly dysregulated features. These approximately 200 features were annotated with our recently published MISA pipeline 32. The resulting approximately 60 annotated putative metabolites were searched in WDD Predictive Analytics with the default settings against a keyword training set describing colon cancer symptoms and causes (Figure 4, green triangles) in the MEDLINE + PMCOA (PubMed Central Open Access) databases. A similar search can be performed with the HUPOP server 62. The resulting similarity scores between the dysregulated (i.e. disease-associated) metabolites and disease terms were enumerated as a similarity matrix (Supplemental Table 4) and visualized in a similarity tree (Figure 4). The top branch represents the metabolites with a known relationship to colon cancer, and the bottom branch contains the unknowns that should be considered for additional mass spectrometry or biological investigation. The nature of the known relationship between the metabolite and disease can be ascertained from the linked papers. In this case, most of the metabolites with a known disease relationship were pro-inflammatory molecules. Depending on the research questions of the study, metabolites with known activity might be prioritized for further characterization, or metabolites without known activity might be selected to determine the function empirically.
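
The three feature filters described above are simple to express in code. The following sketch assumes a feature table with a p-value, a fold change reported as a magnitude (irrespective of direction) and a maximum intensity per feature. The `Feature` structure and the example rows are hypothetical and do not come from the colorectal cancer data set.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    p_value: float
    fold_change: float    # assumed to be a magnitude, regardless of direction
    max_intensity: float


def is_dysregulated(f, max_p=0.01, min_fold=5.0, min_intensity=10_000):
    """Apply the three thresholds: p < 0.01, fold > 5, intensity > 10,000."""
    return (f.p_value < max_p
            and f.fold_change > min_fold
            and f.max_intensity > min_intensity)


# Placeholder features; only feature_1 passes all three filters.
features = [
    Feature("feature_1", 0.001, 8.2, 250_000.0),
    Feature("feature_2", 0.2, 12.0, 90_000.0),   # fails the p-value filter
    Feature("feature_3", 0.005, 2.1, 40_000.0),  # fails the fold change filter
    Feature("feature_4", 0.0004, 6.5, 8_000.0),  # fails the intensity filter
]

selected = [f.name for f in features if is_dysregulated(f)]
print(selected)
```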

Figure 4: Metabolite prioritization similarity tree.

Green triangles are the keywords from the training set used to describe colorectal cancer. Red circles designate dysregulated metabolites given putative identifications by an annotation algorithm. Branches represent literature derived evidence of connection between the terms and the metabolites plotted by the similarity score. In this tree, the top branch contains the metabolites with known relationships to the disease studied and the bottom branch metabolites have not been associated with colorectal cancer in the literature previously.

Workflow 2

We exemplify this workflow for an unknown pathway-disease relationship with a re-analysis of a metabolomics study of the differentiation and proliferation defects of oligodendrocyte precursor cells (OPCs) associated with multiple sclerosis remyelination 30,63, whereby NLP searches of genes from top dysregulated metabolomics-identified pathways yielded functional knowledge of the role of those pathways in multiple sclerosis progression that was not found in the original studies. In the original study, global untargeted metabolomic profiling was carried out on rat primary optic nerve OPCs that were either treated with DMSO (vehicle) or stimulated for differentiation by triiodothyronine (T3), both while growing in differentiation medium 30,64. The authors’ goal was to find the molecular mechanisms connecting differentiation defects to MS-related remyelination. In their findings, the authors observed that the dysregulated metabolic pathways during OPC differentiation had a similar profile to those during pluripotent stem cell proliferation, but did not find connections between the pathways and disease mechanisms 63. We re-processed the metabolomics data in XCMS Online using the same parameters as previously reported 30 (Figure 5, Supplemental Table 7). We then pulled the gene lists from the top dysregulated pathways and searched all genes from a pathway together in a co-occurrence gene search against multiple sclerosis, as a disease entity, on the MEDLINE abstracts database in WDD. Resulting hits were sorted by each gene, and the papers where the genes and multiple sclerosis co-occurred were examined for insight into the role of that pathway in disease pathobiology (Figure 5).

Figure 5: Determination of function of dysregulated metabolic pathway in disease state.

Experimental mass spectrometry metabolomics data are fed into the XCMS Online Systems Biology portal. Top dysregulated metabolic pathways are listed with the metabolites found from each pathway. Genes comprising the pathway can also be seen, and the gene list was searched in WDD Co-occurrence against the disease of interest. Literature found in the co-occurrence search was evaluated to learn the function of the pathway in the disease state. In some cases, the papers pointed back to the metabolites found in the metabolomics experiment and spoke to the activity of that metabolite in disease. Investigating metabolite-disease activity on the systems biology level yielded mechanistic insights into phenotype causation.

Connections between genes on metabolomics-identified dysregulated pathways and disease state were found from NLP-based co-occurrence searching of the pathway and the disease of interest. The most dysregulated pathway in our re-analysis was thyroid hormone metabolism I. A search of the MEDLINE database found only one paper where any of the genes on the pathway co-occurred with multiple sclerosis (Figure 5, Supplemental Table 7). The paper pointed back to one of the metabolic features on the thyroid pathway found in the metabolomics data and indicated that expression of this metabolite was linked to advantageous remyelination 65. In fact, this molecule, T3, was the molecule applied to differentiate the cells. In this way, the dysregulation of the metabolism of T3 is an internal validation of selecting condition-relevant dysregulated pathways since T3 was only added to one condition and it and its metabolites should be highly dysregulated compared to non-treated. The knowledge of the activity of T3 may even have been what motivated its use in this example study.

Literature results from the other highly dysregulated pathways pointed to direct activity of pathways and active metabolites in the disease progression. The catecholamine biosynthesis pathway had several hits with higher affinity scores for involvement in multiple sclerosis (Supplemental Tables 8-11). Co-occurring papers explained that metabolites on this pathway, such as norepinephrine, are used as multiple sclerosis biomarkers but, more significantly, are involved in sympathoadrenergic mechanisms of apoptosis in peripheral blood mononuclear cells linked to multiple sclerosis (Supplemental Tables 10-11) 66-68. The pentose phosphate pathway was connected to multiple sclerosis through the transaldolase enzyme, which is recognized by the same autoimmune response as myelin in multiple sclerosis patients (Supplemental Table 12) 69-72. We gained biological insights into the function of metabolites in multiple sclerosis that were not identified in the original investigation. More generally, we observed two advantages of searching by whole pathway with NLP platforms: genes on a pathway clustered together by hit numbers and affinity scores, and entity searching captures all possible names for diseases, genes and metabolites. Both increase the likelihood of finding literature that connects the pathway to the disease because all permutations are considered simultaneously and autonomously. Reanalysis of this data set yielded insight into disease pathology from the metabolomics-identified pathways searched against the disease with AI-NLP.

Workflow 3: AI-assisted Biomarker Prediction

To demonstrate anticipated results from Workflow 3, we revisit the validation study seeking biomarkers to discriminate MMA and PA, but focus on the targets newly identified by this AI-assisted workflow. Ranked lists of disease-associated metabolites were compared (Figure 6). High-ranking but unique metabolites are selected as the new candidate biomarkers (Figure 6, shaded area). Based on the literature evidence of the association between MMA and the experimentally identified metabolites, formulations of homogentisate are candidates as MMA biomarkers and, in the same fashion for PA, 2-hydroxybutanoic acid formulations are potential PA biomarkers. Further candidates can be selected from an examination of those shaded regions in the plot.
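
Selecting "high-ranking but unique" metabolites from the rank comparison can be sketched as a simple rule. Here a metabolite counts as high-ranking if its rank is at or above a chosen cutoff, and as unique if it is unranked (0) or below the cutoff in the other disease. This definition, the cutoff value and the metabolite names are assumptions for illustration; the shaded regions in Figure 6 may have been drawn differently.

```python
def unique_high_rankers(ranks, cutoff):
    """Return metabolites ranked highly in only one of two diseases.

    ranks maps metabolite -> (rank in disease 1, rank in disease 2),
    where 0 means the metabolite was unranked in that search.
    """
    def high(rank):
        return 0 < rank <= cutoff

    only_disease_1 = sorted(
        m for m, (r1, r2) in ranks.items() if high(r1) and not high(r2)
    )
    only_disease_2 = sorted(
        m for m, (r1, r2) in ranks.items() if high(r2) and not high(r1)
    )
    return only_disease_1, only_disease_2


# Placeholder ranks for two disease searches.
ranks = {
    "metabolite_A": (1, 0),
    "metabolite_B": (2, 3),
    "metabolite_C": (0, 1),
    "metabolite_D": (40, 2),
    "metabolite_E": (3, 55),
}

candidates_1, candidates_2 = unique_high_rankers(ranks, cutoff=10)
print("Candidate biomarkers for disease 1:", candidates_1)
print("Candidate biomarkers for disease 2:", candidates_2)
```

Metabolites ranked highly for both diseases, such as `metabolite_B` in this toy example, are excluded because they would not discriminate between the two conditions.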

Workflow 4: Drug Re-purposing, Metabolically linked diseases

Continuing with our investigation of metabolic links and potential drug-repurposing targets of the frequently co-presenting diseases Chronic Kidney Disease (CKD) and Acute Myocardial Infarction (AMI) using the protocol in Workflow 4, we examined metabolite-disease relationships that were not well known. In the same set of CKD-AMI shared metabolites where the validation metabolite Vitamin D was seen, we also examined malondialdehyde (MDA), which appeared in seven papers for each disease. Research from the 1980s to the 2000s saw elevated levels of MDA in both conditions individually, usually in relationship to lipid peroxidation from oxidative stress 73-75. MDA was predictive of AMI events in a type 2 diabetes cohort 76. The most recent study linked MDA and AMI mechanistically via chemical modification of collagen type IV 77. CKD progression studies have also found SNPs on this collagen IV locus, and collagen IV mutations cause Alport’s syndrome, a severe proteinuric kidney disease 78,79. Therefore, MDA and the collagen IV locus could be a connection point for these two diseases that could be exploited further, either pathophysiologically or therapeutically. Notably, this potentially very interesting connection could not have been made by a conventional literature search because of the vast literature (43000 studies on CKD, 229000 on AMI) and the lack of co-occurrence of CKD and AMI with metabolite annotation in the respective papers. Through cognitive computing, relationships between diseases and metabolites can be quickly predicted and ranked with literature searching, leading to actionable predictions of disease mechanisms or potential treatments.

Outlook on the Future use of AI in Metabolomics Data Interpretation:

The future is in improved AI cognition. AI-based NLP literature searching addresses a major challenge in metabolomics: the categorization of metabolite relationships within biological contexts. We have demonstrated four workflows that highlight the potential for AI-assisted analysis of metabolomics experiments. A complete targeted or untargeted mass spectrometry workflow will always be necessary to formally identify dysregulated metabolites; however, literature evidence was used to accelerate this process. Literature searching allowed for the prioritization of understudied metabolites with respect to a disease or condition of interest. Natural language processing software quickly identified relevant articles, pointing to the pertinent research that was performed on a particular metabolite. With this information, research workflows were accelerated, providing an understanding of the biological activity of metabolites in the specific disease condition and reducing the need for expert-level knowledge.

As AI technologies evolve and AI-NLP databases and search platforms improve, AI-NLP can be integrated even more tightly into metabolomics data analysis. Such changes include creating a distinct entity category representing metabolites, direct import of MS and MS/MS data, prediction of related metabolites, and categorization of endogenous metabolites from exogenous drug compounds or environmental molecules. A forward-looking example of metabolomics AI integration is an automated biomarker detection, identification and quantification system. Following data acquisition and feature annotation, AI-NLP searching of scientific literature databases is combined with MS/MS identification databases. Resulting metabolite identifications of putative biomarkers would be validated with a high-throughput quantitative LC/MS system, thereby allowing complete biomarker workflows in batch without the need for expert user intervention.

As NLP systems understand more types of biological relationships, we foresee being able to mine big data more effectively with the aim of connecting dispersed clues to determine causal mechanisms linking metabolite with phenotype. That ability to ascertain the key causal action in the complex and heterogeneous networks of biology is the linchpin for predictive biology going forward. Without that mechanistic knowledge, fields that depend on guided intervention, such as precision medicine, crop pest resistance and bioremediation, will remain correlative, nonspecific and ultimately ineffective. We are only beginning to see how NLP-derived literature evidence can improve biological interpretation and analysis of metabolomics data. Drawing on the knowledge of the global scientific literature, incorporation of NLP or other AI-based technologies into metabolomics data processing and interpretation offers the opportunity to decipher those key connections and start to address major health and environmental challenges.

Supplementary Material

7-11-23 Supplementary Material

Figure 8: Literature Contextualization of Metabolites Using IBM Watson For Drug Discovery Explore an Entity.

To contextualize metabolites using the Explore an Entity tool, select Explore an Entity (left) and type the metabolite of interest into the search bar (top). Once results are loaded, Database and Discovered results can be viewed.

Figure 9: Literature Contextualization of Metabolites Using Microsoft Academic.

To contextualize a metabolite using Microsoft Academic, type the metabolite of interest into the search bar. On the results page, the search bar is shown at the top, top topics for the searched metabolite are in the left panel, annotated literature results are in the center panel, and an overview of the searched term is in the right panel.

Figure 10: Literature Contextualization of Metabolites Using SciFinder.

Once a metabolite or topic is searched, the results page will appear. Search terms will be highlighted in the literature results (center). The left panel can be used to analyze, refine and categorize search results.

Figure 11: Literature Contextualization of Metabolites Using Semantic Scholar.

Once a metabolite is searched in Semantic Scholar, the results page shown above will load. The search bar (top) can be edited to run additional searches. Literature results load with search terms in bold. The right panel includes overview topic information, related topics and the ability to filter results by year. Additional filters can be found below the search bar.

Figure 12: Co-occurrence of Metabolites in Literature Using IBM WDD.

To search for co-occurring terms, navigate to the co-occurrence tool (left). Select the entity of interest to find, type the terms of interest into the search bar, and select which subset of the literature to search (top).
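The idea behind a co-occurrence search can be illustrated with a minimal sketch that counts, for each pair of terms, how many documents mention both. This is only an illustration under stated assumptions: the function name `cooccurrence_counts`, the plain substring matching and the toy sentences are all hypothetical, and the sketch does not reproduce how IBM WDD recognizes entities or scores co-occurrence.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for every pair of terms, the documents that mention both.

    Matching is a case-insensitive substring test, which is a deliberate
    simplification (no synonym handling or entity recognition).
    """
    lowered = [t.lower() for t in terms]
    counts = Counter()
    for text in documents:
        text_l = text.lower()
        # Terms found in this document, sorted so each pair has one fixed order.
        present = sorted(t for t in lowered if t in text_l)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy sentences standing in for literature abstracts.
abstracts = [
    "Taurine was measured in colon cancer samples.",
    "Hydrogen sulfide and taurine were both measured.",
    "Colon cancer samples were tested for hydrogen sulfide.",
]
counts = cooccurrence_counts(
    abstracts, ["taurine", "colon cancer", "hydrogen sulfide"]
)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that appear together in many documents are candidates for closer reading; a raw count like this says nothing about the nature of the relationship.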

Figure 13: Ranking Top Related Metabolites for Diseases using HUPO BD-HPP.

To rank metabolites using HUPO BD-HPP, follow the steps in the metabolite prioritization box (left). Once steps 1-4 are complete, run the search and the results will load (right). To navigate the results, use the gray bar highlighted at the top.

Figure 14: Ranking Top Related Metabolites for Diseases using IBM WDD.

To rank metabolites using the Predictive Analytics tool, select the tool in the left navigation bar. Type your known set of terms in the top left box and the set of metabolites you wish to rank in the top right box. Select the subset of literature to search, then click Rank.

Figure 16: Metabolome Level Comparison of Diseases for Drug Repurposing using IBM WDD.

To compare diseases at the metabolome level, navigate to the Explore a Network tool (left). Type the disease conditions of interest in the search bar and select the subset of literature to explore (top). To choose which entities are displayed in the search results, use the advanced options (below the search bar) and select the type of entity (chemical in the figure). To filter results, use the left panel. To download results, use the download button at the bottom right.

Table 2:

Definitions of Group vs Set in the context of IBM WDD

Term | Definition | Example Use
Group | A combination of inputs that together create one custom entity. | Custom groups are useful for defining entities that are not currently in WDD. For example, organisms are not a current entity class, so to search for an organism as an entity, create a group that defines the organism by including multiple taxonomic identifications and common names for it.
Set | A collection of individual entities and keywords that can be used for searching. | List of metabolites; disease training set
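The distinction in Table 2 can be sketched as two small data structures. This is a hypothetical illustration only: the class names `Group` and `EntitySet`, their methods and the example text are assumptions made for this sketch and are not part of the IBM WDD interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Several inputs merged into ONE custom entity (e.g. an organism)."""
    name: str
    synonyms: frozenset

    def matches(self, text: str) -> bool:
        # A hit on any synonym counts as a hit for the single entity.
        t = text.lower()
        return any(s.lower() in t for s in self.synonyms)


@dataclass
class EntitySet:
    """A collection of separate entities or keywords, each searched on its own."""
    name: str
    members: list

    def hits(self, text: str) -> list:
        # Each member is reported individually.
        t = text.lower()
        return [m for m in self.members if m.lower() in t]


# Hypothetical example: one organism defined as a group, and a metabolite set.
organism = Group("Escherichia coli", frozenset({"Escherichia coli", "E. coli"}))
metabolites = EntitySet("metabolite list", ["taurine", "lactate", "succinate"])

text = "E. coli cultures secreted lactate and succinate."
print(organism.matches(text))   # True: one of the synonyms is present
print(metabolites.hits(text))   # ['lactate', 'succinate']
```

A group answers "is this one entity mentioned?", whereas a set answers "which of these entities are mentioned?".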

References:

  • 1. Kurczy ME et al. Determining conserved metabolic biomarkers from a million database queries. Bioinformatics 31, 3721–3724, doi: 10.1093/bioinformatics/btv475 (2015).
  • 2. Monteiro MS, Carvalho M, Bastos ML & Guedes de Pinho P Metabolomics analysis for biomarker discovery: advances and challenges. Current Medicinal Chemistry 20, 257–271 (2013).
  • 3. Zhang A, Sun H, Yan G, Wang P & Wang X Metabolomics for Biomarker Discovery: Moving to the Clinic. BioMed Research International 2015, 354671, doi: 10.1155/2015/354671 (2015).
  • 4. Zhang F et al. Metabolomics for biomarker discovery in the diagnosis, prognosis, survival and recurrence of colorectal cancer: a systematic review. Oncotarget 8, 35460–35472, doi: 10.18632/oncotarget.16727 (2017).
  • 5. Taylor J, King RD, Altmann T & Fiehn O Application of metabolomics to plant genotype discrimination using statistics and machine learning. Bioinformatics 18, S241–S248, doi: 10.1093/bioinformatics/18.suppl_2.S241 (2002).
  • 6. Guijas C, Montenegro-Burke JR, Warth B, Spilker ME & Siuzdak G Metabolomics activity screening for identifying metabolites that modulate phenotype. Nature Biotechnology 36, 316, doi: 10.1038/nbt.4101 (2018).
  • 7. Paris LP et al. Global metabolomics reveals metabolic dysregulation in ischemic retinopathy. 12, 15, doi: 10.1007/s11306-015-0877-5 (2015).
  • 8. Goodacre R et al. Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3, 231–241, doi: 10.1007/s11306-007-0081-3 (2007).
  • 9. Guijas C et al. METLIN: A Technology Platform for Identifying Knowns and Unknowns. Analytical Chemistry, doi: 10.1021/acs.analchem.7b04424 (2018).
  • 10. Domingo-Almenara X, Montenegro-Burke JR, Benton HP & Siuzdak G Annotation: A Computational Solution for Streamlining Metabolomics Analysis. Analytical Chemistry 90, 480–489, doi: 10.1021/acs.analchem.7b03929 (2018).
  • 11. Smolinska A et al. Current breathomics—a review on data pre-processing techniques and machine learning in metabolomics breath analysis. Journal of Breath Research 8, 027105 (2014).
  • 12. Kell DB Metabolomics and systems biology: making sense of the soup. Current Opinion in Microbiology 7, 296–307, doi: 10.1016/j.mib.2004.04.012 (2004).
  • 13. Spasić I et al. MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. 7, 281, doi: 10.1186/1471-2105-7-281 (2006).
  • 14. Spasić I et al. Facilitating the development of controlled vocabularies for metabolomics technologies with text mining. 9, S5, doi: 10.1186/1471-2105-9-s5-s5 (2008).
  • 15. Tenopir C, King DW, Christian L & Volentine R Scholarly article seeking, reading, and use: a continuing evolution from print to electronic in the sciences and social sciences. Learned Publishing 28, 93–105, doi: 10.1087/20150203 (2015).
  • 16. Bornmann L & Mutz R Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references. 66, 2215–2222, doi: 10.1002/asi.23329 (2015).
  • 17. de Solla Price DJ Networks of Scientific Papers. Science 149, 510–515, doi: 10.1126/science.149.3683.510 (1965).
  • 18. Yandell MD & Majoros WH Genomics and natural language processing. Nature Reviews Genetics 3, 601, doi: 10.1038/nrg861 (2002).
  • 19. Hirschberg J & Manning CD Advances in natural language processing. Science 349, 261–266, doi: 10.1126/science.aaa8685 (2015).
  • 20. Chen Y, Elenee Argentinis JD & Weber G IBM Watson: How Cognitive Computing Can Be Applied to Big Data Challenges in Life Sciences Research. Clinical Therapeutics 38, 688–701, doi: 10.1016/j.clinthera.2015.12.001 (2016).
  • 21. Choi B-K et al. Literature-based automated discovery of tumor suppressor p53 phosphorylation and inhibition by NEK2. Proceedings of the National Academy of Sciences 115, 10666–10671, doi: 10.1073/pnas.1806643115 (2018).
  • 22. Bakkar N et al. Artificial intelligence in neurodegenerative disease research: use of IBM Watson to identify additional RNA-binding proteins altered in amyotrophic lateral sclerosis. 135, 227–247, doi: 10.1007/s00401-017-1785-8 (2018).
  • 23. Ivanisevic J et al. Toward ‘Omic Scale Metabolite Profiling: A Dual Separation–Mass Spectrometry Approach for Coverage of Lipid and Central Carbon Metabolism. Analytical Chemistry 85, 6876–6884, doi: 10.1021/ac401140h (2013).
  • 24. Forsberg EM et al. Data processing, multi-omic pathway mapping, and metabolite activity analysis using XCMS Online. Nature Protocols 13, 633, doi: 10.1038/nprot.2017.151 (2018).
  • 25. Zhu ZJ et al. Liquid chromatography quadrupole time-of-flight mass spectrometry characterization of metabolites guided by the METLIN database. Nat Protoc 8, 451–460, doi: 10.1038/nprot.2013.004 (2013).
  • 26. Patti GJ, Tautenhahn R & Siuzdak G Meta-analysis of untargeted metabolomic data from multiple profiling experiments. Nat Protoc 7, 508–516, doi: 10.1038/nprot.2011.454 (2012).
  • 27. Domingo-Almenara X et al. XCMS-MRM and METLIN-MRM: a cloud library and public resource for targeted analysis of small molecules. Nature Methods 15, 681–684, doi: 10.1038/s41592-018-0110-3 (2018).
  • 28. Smith CA, Want EJ, O'Maille G, Abagyan R & Siuzdak G XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification. Analytical Chemistry 78, 779–787, doi: 10.1021/ac051437y (2006).
  • 29. Tautenhahn R, Patti GJ, Rinehart D & Siuzdak G XCMS Online: A Web-Based Platform to Process Untargeted Metabolomic Data. Analytical Chemistry 84, 5035–5039, doi: 10.1021/ac300698c (2012).
  • 30. Huan T et al. Systems biology guided by XCMS Online metabolomics. Nat Meth 14, 461–462, doi: 10.1038/nmeth.4260 (2017).
  • 31. Gowda H et al. Interactive XCMS Online: Simplifying Advanced Metabolomic Data Processing and Subsequent Statistical Analyses. Analytical Chemistry 86, 6931–6939, doi: 10.1021/ac500734c (2014).
  • 32. Domingo-Almenara X et al. Autonomous METLIN-Guided In-source Fragment Annotation for Untargeted Metabolomics. Analytical Chemistry 91, 3246–3253, doi: 10.1021/acs.analchem.8b03126 (2019).
  • 33. Johnston TH et al. Repurposing drugs to treat l-DOPA-induced dyskinesia in Parkinson's disease. Neuropharmacology, doi: 10.1016/j.neuropharm.2018.05.035 (2018).
  • 34. Warth B et al. Exposome-Scale Investigations Guided by Global Metabolomics, Pathway Analysis, and Cognitive Computing. Analytical Chemistry 89, 11505–11513, doi: 10.1021/acs.analchem.7b02759 (2017).
  • 35. Rinschen MM et al. Metabolic rewiring of the hypertensive kidney. Science Signaling 12, eaax9760, doi: 10.1126/scisignal.aax9760 (2019).
  • 36. Rey FE et al. Metabolic niche of a prominent sulfate-reducing human gut bacterium. Proceedings of the National Academy of Sciences 110, 13582–13587, doi: 10.1073/pnas.1312524110 (2013).
  • 37. Junping Z et al. N-Acetyl-Cysteine Alleviates Gut Dysbiosis and Glucose Metabolic Disorder in High-Fat Diet-Induced Mice. Journal of Diabetes 0, doi: 10.1111/1753-0407.12795.
  • 38. Hale VL et al. Synthesis of multi-omic data and community metabolic models reveals insights into the role of hydrogen sulfide in colon cancer. Methods, doi: 10.1016/j.ymeth.2018.04.024 (2018).
  • 39. Hyötyläinen T et al. Genome-scale study reveals reduced metabolic adaptability in patients with non-alcoholic fatty liver disease. Nature Communications 7, 8994, doi: 10.1038/ncomms9994 (2016).
  • 40. Raman M et al. Fecal Microbiome and Volatile Organic Compound Metabolome in Obese Humans With Nonalcoholic Fatty Liver Disease. Clinical Gastroenterology and Hepatology 11, 868–875.e863, doi: 10.1016/j.cgh.2013.02.015 (2013).
  • 41. Scheller R et al. Toward mechanistic models for genotype–phenotype correlations in phenylketonuria using protein stability calculations. Human Mutation 40, 444–457, doi: 10.1002/humu.23707 (2019).
  • 42. Chen T et al. Mutational and phenotypic spectrum of phenylalanine hydroxylase deficiency in Zhejiang Province, China. Scientific Reports 8, 17137, doi: 10.1038/s41598-018-35373-9 (2018).
  • 43. Duan H et al. Non-invasive prenatal testing of pregnancies at risk for phenylketonuria. Archives of Disease in Childhood - Fetal and Neonatal Edition 104, F24–F29, doi: 10.1136/archdischild-2017-313929 (2019).
  • 44. Zori R et al. Induction, titration, and maintenance dosing regimen in a phase 2 study of pegvaliase for control of blood phenylalanine in adults with phenylketonuria. Molecular Genetics and Metabolism 125, 217–227, doi: 10.1016/j.ymgme.2018.06.010 (2018).
  • 45. Brantley KD, Douglas TD & Singh RH One-year follow-up of B vitamin and Iron status in patients with phenylketonuria provided tetrahydrobiopterin (BH4). Orphanet Journal of Rare Diseases 13, 192, doi: 10.1186/s13023-018-0923-2 (2018).
  • 46. Johnson CH, Ivanisevic J & Siuzdak G Metabolomics: beyond biomarkers and towards mechanisms. Nature Reviews Molecular Cell Biology 17, 451, doi: 10.1038/nrm.2016.25 (2016).
  • 47. Wikoff WR, Gangoiti JA, Barshop BA & Siuzdak G Metabolomics Identifies Perturbations in Human Disorders of Propionate Metabolism. Clinical Chemistry 53, 2169–2176, doi: 10.1373/clinchem.2007.089011 (2007).
  • 48. Kenny LC et al. Novel biomarkers for pre-eclampsia detected using metabolomics and machine learning. 1, 227, doi: 10.1007/s11306-005-0003-1 (2005).
  • 49. Go AS, Chertow GM, Fan D, McCulloch CE & Hsu C-Y Chronic Kidney Disease and the Risks of Death, Cardiovascular Events, and Hospitalization. 351, 1296–1305, doi: 10.1056/NEJMoa041031 (2004).
  • 50. Fricke S Semantic Scholar. Journal of the Medical Library Association 106, 3, doi: 10.5195/jmla.2018.280 (2018).
  • 51. Toonen LJA et al. Transcriptional profiling and biomarker identification reveal tissue specific effects of expanded ataxin-3 in a spinocerebellar ataxia type 3 mouse model. 13, 31, doi: 10.1186/s13024-018-0261-9 (2018).
  • 52. Roden D & Denny J Integrating electronic health record genotype and phenotype datasets to transform patient care. 99, 298–305, doi: 10.1002/cpt.321 (2016).
  • 53. Krittanawong C, Zhang H, Wang Z, Aydar M & Kitai T Artificial Intelligence in Precision Cardiovascular Medicine. Journal of the American College of Cardiology 69, 2657–2664, doi: 10.1016/j.jacc.2017.03.571 (2017).
  • 54. Palmblad M Visual and Semantic Enrichment of Analytical Chemistry Literature Searches by Combining Text Mining and Computational Chemistry. Analytical Chemistry, doi: 10.1021/acs.analchem.8b05818 (2019).
  • 55. Venkatesan A et al. SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data [version 2; referees: 2 approved, 1 approved with reservations]. Wellcome Open Research 1, doi: 10.12688/wellcomeopenres.10210.2 (2017).
  • 56. UniProt: a hub for protein information. Nucleic Acids Res 43, D204–212, doi: 10.1093/nar/gku989 (2015).
  • 57. Sansone S-A et al. Metabolomics standards initiative: ontology working group work in progress. 3, 249–256, doi: 10.1007/s11306-007-0069-z (2007).
  • 58. Sumner LW et al. Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics: Official journal of the Metabolomic Society 3, 211–221, doi: 10.1007/s11306-007-0082-2 (2007).
  • 59. Caspi R et al. The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Research 46, D633–D639, doi: 10.1093/nar/gkx935 (2017).
  • 60. Szklarczyk D et al. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44, D380–384, doi: 10.1093/nar/gkv1277 (2016).
  • 61. Johnson CH et al. Metabolism Links Bacterial Biofilms and Colon Carcinogenesis. Cell Metabolism 21, 891–897, doi: 10.1016/j.cmet.2015.04.011 (2015).
  • 62. Yu K-H et al. A Cloud-Based Metabolite and Chemical Prioritization System for the Biology/Disease-Driven Human Proteome Project. Journal of Proteome Research, doi: 10.1021/acs.jproteome.8b00378 (2018).
  • 63. Beyer BA et al. Metabolomics-based discovery of a metabolite that enhances oligodendrocyte maturation. Nature Chemical Biology 14, 22, doi: 10.1038/nchembio.2517 (2017).
  • 64. Wolswijk G Oligodendrocyte precursor cells in the demyelinated multiple sclerosis spinal cord. Brain 125, 338–349, doi: 10.1093/brain/awf031 (2002).
  • 65. Boelen A et al. Type 3 Deiodinase Expression in Inflammatory Spinal Cord Lesions in Rat Experimental Autoimmune Encephalomyelitis. 19, 1401–1406, doi: 10.1089/thy.2009.0228 (2009).
  • 66. Gallai V et al. Neuropeptide Y plasma levels and serum dopamine-beta-hydroxylase activity in MS patients with and without abnormal cardiovascular reflexes. Acta Neurol Belg 94, 44–52 (1994).
  • 67. Mann MB et al. Association between the phenylethanolamine N-methyltransferase gene and multiple sclerosis. Journal of Neuroimmunology 124, 101–105, doi: 10.1016/S0165-5728(02)00009-7 (2002).
  • 68. Cosentino M et al. Catecholamine production and tyrosine hydroxylase expression in peripheral blood mononuclear cells from multiple sclerosis patients: effect of cell stimulation and possible relevance for activation-induced apoptosis. Journal of Neuroimmunology 133, 233–240, doi: 10.1016/S0165-5728(02)00372-7 (2002).
  • 69. Niland B et al. Cleavage of Transaldolase by Granzyme B Causes the Loss of Enzymatic Activity with Retention of Antigenicity for Multiple Sclerosis Patients. The Journal of Immunology, ji_0804174, doi: 10.4049/jimmunol.0804174 (2010).
  • 70. Samland AK & Sprenger GA Transaldolase: from biochemistry to human disease. The International Journal of Biochemistry & Cell Biology 41, 1482–1494, doi: 10.1016/j.biocel.2009.02.001 (2009).
  • 71. Esposito M et al. Human Transaldolase and Cross-Reactive Viral Epitopes Identified by Autoantibodies of Multiple Sclerosis Patients. 163, 4027–4032 (1999).
  • 72. Banki K et al. Oligodendrocyte-specific expression and autoantigenicity of transaldolase in multiple sclerosis. The Journal of Experimental Medicine 180, 1649–1663, doi: 10.1084/jem.180.5.1649 (1994).
  • 73. Dousset J-C, Trouilh M & Foglietti M-J Plasma malonaldehyde levels during myocardial infarction. Clinica Chimica Acta 129, 319–322, doi: 10.1016/0009-8981(83)90035-9 (1983).
  • 74. Loughrey CM et al. Oxidative stress in haemodialysis. Qjm 87, 679–683 (1994).
  • 75. Lim CS & Vaziri ND The effects of iron dextran on the oxidative stress in cardiovascular tissues of rats with chronic renal failure. Kidney Int 65, 1802–1809, doi: 10.1111/j.1523-1755.2004.00580.x (2004).
  • 76. Virella G & Lopes-Virella MF The Pathogenic Role of the Adaptive Immune Response to Modified LDL in Diabetes. Front Endocrinol (Lausanne) 3, 76, doi: 10.3389/fendo.2012.00076 (2012).
  • 77. Vallejo J, Duner P, Fredrikson GN, Nilsson J & Bengtsson E Autoantibodies against aldehyde-modified collagen type IV are associated with risk of development of myocardial infarction. J Intern Med 282, 496–507, doi: 10.1111/joim.12659 (2017).
  • 78. Hudson BG, Tryggvason K, Sundaramoorthy M & Neilson EG Alport's Syndrome, Goodpasture's Syndrome, and Type IV Collagen. 348, 2543–2556, doi: 10.1056/NEJMra022296 (2003).
  • 79. Wang Y et al. COL4A3 Gene Variants and Diabetic Kidney Disease in MODY. Clin J Am Soc Nephrol 13, 1162–1171, doi: 10.2215/cjn.09100817 (2018).
