Literature-based predictions of Mendelian disease therapies

Cole A Deisseroth; Won-Seok Lee; Jiyoen Kim; Hyun-Hwan Jeong; Ryan S Dhindsa; Julia Wang; Huda Y Zoghbi; Zhandong Liu

doi:10.1016/j.ajhg.2023.08.018

Am J Hum Genet. 2023 Sep 22;110(10):1661–1672. doi: 10.1016/j.ajhg.2023.08.018


PMCID: PMC10577072 PMID: 37741276

Summary

In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways—including drug-gene and gene-gene relationships. To address this challenge, we present “parsing modifiers via article annotations” (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN’s drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most-promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.

Keywords: natural language processing, literature search, text mining, NLP, drug repurposing, drug screening, pathway analysis

To facilitate the investigation of Mendelian disease therapeutics, we introduce a prediction tool called parsing modifiers via article annotations (PARMESAN). This tool extracts known gene-gene and drug-gene relationships from the biomedical literature, predicts unknown relationships, and prioritizes the predictions based on the amount of supporting and opposing evidence.

Introduction

The increased adoption of next-generation sequencing has led to the discovery of thousands of genes associated with Mendelian disorders. Even after diagnosis, many of these diseases are treated only at the symptomatic level, since a therapy that corrects the underlying molecular imbalance has yet to be identified. Addressing this large unmet clinical need will require tools that can rapidly and systematically identify candidate therapeutics at scale. However, prioritizing the right therapies requires understanding the molecular pathways involved in a given disease and the drugs that modulate that pathway in the correct direction. A database of known gene-gene and drug-gene relationships can aid in exploring these molecular pathways to identify candidate therapeutics. For example, some of these disorders may be directly treatable with a drug that targets the causative gene: haploinsufficiency disorders such as Rett syndrome¹ (MIM: 312750) and protein toxicity disorders such as spinocerebellar ataxia type 1² (MIM: 164400) may be rescuable by increasing or decreasing (respectively) levels of the responsible protein.

There are several available resources for inferring gene-gene relationships. For example, there exist protein-protein interaction databases, such as BioGRID,³ STRING,⁴ and BioPlex.⁵ While these tools have enabled network analyses, they do not provide information on directionality—which we define as whether one protein increases or decreases the activity of another. There are pathway databases that do provide directionality, such as Reactome⁶ and the Kyoto Encyclopedia of Genes and Genomes (KEGG),⁷ as well as genetic modifier databases such as NeuroGeM⁸ and PhenoModifier.⁹ These databases are manually curated through expert literature review, as are drug-gene relationship databases such as DrugBank¹⁰ and the Drug-Gene Interaction Database (DGIdb).¹¹

While manually curated databases are carefully annotated, there are more than 2 million publications in this space per year, which poses a significant challenge to keeping these databases up to date. There is a growing number of publicly available tools that extract this information automatically from the medical literature and offer an alternative to manual curation. For example, PubTator¹² automatically annotates genes, drugs, and species mentioned in PubMed and PubMed Central. SemRep¹³ and BioBERT¹⁴ extract named entities and entity relationships from the medical literature. Both represent significant advances in the field, but their utility in identifying promising disease therapeutics remains unknown. Additionally, BioBERT can take months to run on the full body of medical literature, and SemMedDB (the database constructed using SemRep) requires all users to have a Unified Medical Language System (UMLS)¹⁵ license, which limits the accessibility of both the database and any tools that use it. MediKanren¹⁶ is another tool that uses knowledge graphs to predict drug repurposing opportunities. However, its network analyses rely on manually curated pathway databases, whereas the primary literature is the first place that newly discovered relationships appear, and for many discoveries possibly the only place.

Here, we set out to build a tool that predicts therapeutics for genetic disorders and remains current with the latest discoveries in the literature. We present PARMESAN (parsing modifiers via article annotations), an automated literature search tool that scans through PubMed (27 million abstracts) and PubMed Central (5 million downloadable full-text articles) for descriptions of gene-gene and drug-gene relationships. Once the database is aggregated, PARMESAN can predict a drug’s or protein’s effect on a disease gene. Here, we demonstrate PARMESAN’s accuracy, validity, and potential to prioritize new therapeutic candidates for genetic disorders.

Material and methods

Building a knowledge base

PARMESAN processes articles and summarizes the modifier relationships that they describe in the following format: “action modifier predicate target.” The action is the way the modifier gene is being manipulated, and the predicate is the effect that manipulating the modifier has on the target gene. As an example, Monteiro et al. published an article titled “Pharmacological disruption of the MID1/α4 interaction reduces mutant Huntingtin levels in primary neuronal cultures.”¹⁷ The modifier and target are MID1 (MID1 [MIM: 300552]) and Huntingtin (HTT [MIM: 613004]), respectively. The predicate—what MID1 is doing to Huntingtin—is “reduces.” And the action—what is being done to MID1 to achieve this effect on Huntingtin—is “disruption.” The resulting sentence is therefore “disruption MID1 reduces Huntingtin” (Figure 1).

Figure 1.

Summary sentence construction

PARMESAN constructs summary sentences from sentences in the format “action modifier predicate target,” where the predicate is the effect the modifier has on the target gene (such as increase or decrease), and the action is what is being done to the modifier to achieve this effect (whether the modifier should be increased or decreased to have the mentioned effect on the target). “Action” is the only optional part of the summary sentence. We use an example extracted from Monteiro et al.¹⁷

We gave PARMESAN a vocabulary of 70 positive predicates, 142 negative predicates, and 63 negative actions (Table 1). PARMESAN constructed a knowledge base of known gene-gene relationships, which were positively directed (the modifier, which we will call a “positive regulator,” increases the level or activity of the target gene’s protein) or negatively directed (the modifier, a “negative regulator,” decreases the level or activity of the target gene’s protein).

Table 1.

Vocabulary used by PARMESAN

Positive predicate	accelerate, accelerated, accelerates, activate, activated, activates, allowed, allows, augment, augmentation, augmented, augments, coactivator, costimulate, elevate, elevated, elevates, elicit, elicited, elicits, enable, enables, enhance, enhanced, enhancement, enhancer, enhances, enhancing, exacerbate, exacerbated, exacerbates, expedited, facilitate, facilitated, facilitates, improve, improved, improvement, improves, induce, induced, -induced, induces, potentiate, potentiated, potentiates, promote, promoted, promoter, promotes, reactivate, reactivated, reactivation, restore, restored, restores, stabilize, stabilized, stimulate, stimulated, stimulates, strengthened, strengthening, strengthens, transactivated, transactivating, transactivation, upregulate, upregulated, upregulates
Negative predicate	ablated, abolish, abolished, abolishes, abolishing, abrogate, abrogated, abrogates, antagonized, antagonizes, antagonize, antagonized, antagonizes, coactivated, coinhibition, compete, competes, counteract, counteracted, counteracting, counteracts, deactivated, deactivates, degrade, degraded, degrader, depressed, destabilize, destabilized, destroyed, destroying, diminish, diminished, diminishes, diminishing, disassembled, disassembles, displace, displaced, displacement, displaces, disrupt, disrupting, disruption, dissociate, dissociated, disturb, disturbances, disturbed, disturbing, disturbs, divert, downregulate, down-regulate, eliminate, eliminated, eliminates, hinder, hindered, hindering, hinders, impair, impaired, impairing, impairment, impairs, impede, impeded, impedes, inactivate, inactivated, inhibit, inhibited, inhibiting, inhibition, inhibitions, inhibitive, inhibitory, inhibits, interceptor, interfere, interfered, interference, interferes, interfering, interrupt, interrupted, interrupting, interruption, interrupts, negated, negates, neutralize, neutralized, neutralizes, nullify, obliterated, obstructive, oppose, opposed, opposes, outcompeted, perturbation, perturbed, prevent, prevented, preventing, prevention, prevents, prohibited, prohibiting, reduce, reduced, reduces, reducing, repress, repressed, represses, repressing, repressive, restrict, restricting, restricts, revert, reverted, reverting, suppress, suppressed, suppresses, suppressing, suppression, suppressive, terminated, terminates, transrepresses, transrepression, undermined, unstimulated, weaken, weakened, weakening, weakens
Negative action	abolished, abrogated, absence, antibody, blockage, blocked, blocking, decrease, decreased, decreasing, deficiency, deficient, deleted, deleting, deletion, depleted, depleting, depletion, diminished, disruption, downregulate, down-regulate, downregulated, down-regulated, downregulates, downregulating, down-regulating, downregulation, down-regulation, eliminated, impaired, inhibit, inhibited, inhibiting, inhibition, inhibitor, inhibitors, inhibitory, inhibits, interfere, interference, interfering, knock, knockdown, knock-down, knocked, knocking, knockout, knock-out, loss, neutralizing, prevent, prevented, reduce, reduced, reduces, reducing, rnai, sirna, suppressed, suppresses, suppressing, suppression


A positive predicate is a word that suggests an increase in the expression or activity of the target gene. A negative predicate is a word that suggests a decrease in the expression or activity of the target gene. A negative action is a word that suggests decreased expression or activity of the modifier. Positive action words are not considered, because they do not add any information in the absence of a negative action. For example, “gene A suppresses gene B” has the same meaning as “increasing gene A suppresses gene B,” as both indicate gene A to be a negative regulator of gene B.
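The way these word classes combine into a signed relationship can be sketched as follows. This is a minimal illustration rather than PARMESAN's actual code: the function name `relationship_sign` and the small word sets are hypothetical stand-ins for Table 1, and the rule that a negative action flips the sign implied by the predicate is our reading of the definitions above.

```python
# Hypothetical, heavily abbreviated stand-ins for the Table 1 vocabulary.
POSITIVE_PREDICATES = {"activates", "enhances", "induces", "upregulates"}
NEGATIVE_PREDICATES = {"inhibits", "reduces", "suppresses"}
NEGATIVE_ACTIONS = {"disruption", "knockdown", "deletion", "inhibitor"}


def relationship_sign(predicate, action=None):
    """Return +1 if the modifier is a positive regulator, -1 if negative.

    Assumption: a negative action (the modifier is being decreased) flips
    the sign implied by the predicate; any other action is ignored.
    """
    if predicate in POSITIVE_PREDICATES:
        sign = 1
    elif predicate in NEGATIVE_PREDICATES:
        sign = -1
    else:
        raise ValueError(f"unknown predicate: {predicate!r}")
    if action in NEGATIVE_ACTIONS:
        sign = -sign
    return sign


# "gene A suppresses gene B": A is a negative regulator of B.
print(relationship_sign("suppresses"))
# "disruption MID1 reduces Huntingtin": decreasing MID1 lowers Huntingtin,
# so under this rule MID1 comes out as a positive regulator.
print(relationship_sign("reduces", action="disruption"))
```

Under this rule, a positive action word would leave the sign unchanged, which is consistent with such words being ignored.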

PARMESAN is designed to determine a consensus across different articles on the relationship between two entities, even if those articles mention different synonyms of the same entity. In order to do this, PARMESAN indexes genes by Entrez gene ID and drugs by PubChem ID. Genes in other species have different Entrez IDs from their human orthologs. PubTator identifies which species is being discussed when a gene is mentioned and extracts gene IDs pertaining to that species.

When PubTator extracts non-human genes, PARMESAN translates them to human genes through ortholog matching. It uses the Drosophila RNAi Screening Center Integrative Ortholog Prediction Tool (DIOPT),¹⁸ which predicts orthology between the genes of two species, and calculates a “mention value” based on the strength of the orthology. If all genes in the relationship belong to humans, the mention value is 1 or −1 (for positive or negative regulators, respectively). For each non-human gene in the relationship, we multiply this value by the DIOPT score of the orthology divided by the largest possible DIOPT score for that species. For example, Guerrero-Esteo et al.¹⁹ wrote that endoglin transfection leads to inhibition of fibronectin synthesis in mouse fibroblasts. PubTator extracted the Entrez IDs NCBI Gene: 13805 for mouse endoglin (Eng) and NCBI Gene: 14268 for mouse fibronectin (Fn1). DIOPT maps mouse endoglin to human endoglin (ENG [MIM: 131195]) with a score of 19 and mouse fibronectin to human fibronectin (FN1 [MIM: 135600]) with a score of 16. The highest possible DIOPT score for a mouse-human ortholog match is 20. Therefore, the mention value for this mention is −1 × (19/20) × (16/20) = −0.76.
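The mention-value arithmetic can be written as a short function. The sketch below only restates the calculation described above; the function name `mention_value` and its argument layout are hypothetical and not taken from PARMESAN's code.

```python
def mention_value(sign, ortholog_scores=()):
    """Scale a +1/-1 mention by the orthology strength of each non-human gene.

    `ortholog_scores` holds one (diopt_score, max_diopt_score) pair per
    non-human gene in the relationship; it is empty for human-only mentions.
    """
    value = float(sign)
    for score, max_score in ortholog_scores:
        value *= score / max_score
    return value


# Mouse endoglin (19/20) inhibiting mouse fibronectin (16/20):
print(round(mention_value(-1, [(19, 20), (16, 20)]), 2))  # -0.76
# A human-only positive mention keeps its full weight:
print(mention_value(1))  # 1.0
```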

For each modifier-target pair, the tool then calculates a directionality score based on the evidence supporting positive versus negative regulation. Conflicting information on this directionality will lower the score. For a given modifier-target pair B and A, let X = the sum of the mention values stating that B positively regulates A, and let Y = the absolute value of the sum of the mention values stating that B negatively regulates A.

The score D_BA is defined per Equation 1:

D_{BA} = (X - Y) \cdot \frac{|X - Y|}{X + Y}

(Equation 1)

A large positive score indicates that gene B likely positively regulates gene A. A large negative score indicates that gene B likely negatively regulates gene A. For a large score, there must be proportionally more articles in one direction than in the other.

As an example, in a pre-2022 SNCA (MIM: 163890) modifier search (Table S11), PARMESAN found 2 articles (with mention values of 1) stating that HSPA5 ([MIM: 138120], written as GRP78 in the extracted articles) positively regulates SNCA, and 0 stating that it negatively regulates SNCA.

D_{\mathrm{HSPA5},\,\mathrm{SNCA}} = (2 - 0) \cdot \frac{|2 - 0|}{2 + 0} = 2
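Equation 1 can be sketched in a few lines of Python. The function name `directionality_score` is hypothetical, and returning 0 when there is no evidence at all is our assumption, since Equation 1 is undefined in that case.

```python
def directionality_score(mention_values):
    """Equation 1: D = (X - Y) * |X - Y| / (X + Y)."""
    x = sum(v for v in mention_values if v > 0)       # positive evidence, X
    y = abs(sum(v for v in mention_values if v < 0))  # negative evidence, Y
    if x + y == 0:
        return 0.0  # assumption: no evidence gives a neutral score
    return (x - y) * abs(x - y) / (x + y)


# HSPA5 -> SNCA example: two positive mentions, none negative.
print(directionality_score([1, 1]))         # 2.0
# Three positive mentions and one negative mention: the conflict lowers the score.
print(directionality_score([1, 1, 1, -1]))  # 1.0
```

In the second call, the net evidence (X − Y) is the same as in the first, but the score is halved because only half of the total evidence is unopposed.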

PARMESAN automatically identified hundreds of thousands of drug-gene relationships and more than 1 million gene-gene relationships among tens of thousands of drugs (indexed by PubChem ID) and genes (indexed by Entrez gene ID, Table S1). Further technical details on the scoring algorithm are included in the supplemental methods.

We used the same scoring algorithm to derive directionality scores from SemMedDB, a database of entity relationships (including proteins and drugs) automatically extracted from PubMed.

Testing the accuracy of extracted relationships

To test PARMESAN’s accuracy, we compared all of its extracted gene-gene relationships to Reactome’s knowledge base of functional interactions and its drug-gene relationships to DGIdb. Extractions were “consistent” with the manually curated relationships if they agreed on the directionality of the relationship (PARMESAN and the manually curated database both state that A upregulates B), and “contradicted” if they stated opposite directionalities (PARMESAN states that A upregulates B, while the manually curated database states that A downregulates B). We took the extracted relationships with the largest directionality scores (D_BA), lowering the threshold until all extractions were included, and determined the number of consistent and contradicted relationships at each threshold.
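The threshold sweep described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual code: the function name, the dictionary layout, and the toy data are hypothetical, and we assume that "largest" refers to the magnitude of the directionality score.

```python
def consistency_by_threshold(scores, curated, thresholds):
    """Compare extractions with |score| > threshold to a curated database.

    scores:  {(modifier, target): directionality score}
    curated: {(modifier, target): +1 or -1}
    Returns {threshold: (n_above, consistent, contradicted)}.
    """
    results = {}
    for t in thresholds:
        above = {pair: s for pair, s in scores.items() if abs(s) > t}
        consistent = contradicted = 0
        for pair, s in above.items():
            if pair not in curated:
                continue  # not in the curated database: cannot be judged
            if (s > 0) == (curated[pair] > 0):
                consistent += 1
            else:
                contradicted += 1
        results[t] = (len(above), consistent, contradicted)
    return results


# Toy data (hypothetical pairs and scores, for illustration only).
scores = {("A", "B"): 6.5, ("C", "D"): -2.0, ("E", "F"): 0.5, ("G", "H"): 3.0}
curated = {("A", "B"): 1, ("C", "D"): 1, ("E", "F"): 1}
for t, (n, ok, bad) in consistency_by_threshold(scores, curated, [4, 1, 0]).items():
    print(t, n, ok, bad, ok / (ok + bad))
```

The last printed column is the fraction consistent / (consistent + contradicted) among the relationships that overlap with the curated database.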

SemMedDB is another database of relationships (including gene-gene and drug-gene) extracted automatically from the literature by the SemRep tool. We used the same algorithm to calculate directionality scores (D_BA) from SemMedDB instead of PARMESAN’s extractions and performed the same comparison against Reactome and DGIdb described above.

We also performed a holistic evaluation of the extracted relationships. We took random samples (selected through a seeded random shuffle) of 100 extracted gene-gene relationships, once from the full set of extractions and once from the set of all extractions with scores above 4 (the power analysis used to derive this minimum score is described in the supplemental methods), and manually checked the articles supporting each relationship. There were three possible outcomes: (1) correct, meaning that the relationship was correctly extracted from the source article, as with the above finding that “disruption MID1 reduces Huntingtin”; (2) misdirected, meaning that the modifier does affect the target but in the opposite direction, as with the finding that “CoA reduced ACP” (Table S5) extracted from Lambrechts et al., which accurately reflected that CoA modifies the activity of NDUFAB1 (written as ACP [MIM: 603836]), although the article suggests that impeding CoA synthesis is what reduces NDUFAB1 levels, indicating a positive rather than a negative effect;²⁰ or (3) incorrect, meaning that the modifier is not indicated to modify the activity of the target in the source article. An example of the last category is the finding “MYOF upregulated FAH” (Table S3), extracted from Mukhopadhyay et al.: the relationship was not described in the source article, which was instead listing genes that were up- and down-regulated, including MYOF (MIM: 604603) and FAH (MIM: 613871).²¹

We repeated this same test on the extracted drug-gene relationships. Using the same power analysis, we set the “high score” threshold at 3. We sampled extractions scoring above 3 and compared the accuracy to a sample of the full set of extractions (Table S7). Because this accuracy analysis is done manually and partly depends on human intuition, we provide all of the material (source article PMIDs and extraction summaries) from the extractions we evaluated. We compare the accuracies of the unfiltered and high-scoring predictions using a 2-proportion binomial test.

After manually establishing a set of “correct” extractions, we determined whether these correct relationships were registered in the manually curated databases. Indexing genes by Entrez ID and drugs by PubChem ID, we determined whether each relationship was described in the manually curated database.

Predicting indirect relationships

PARMESAN uses known connections to hypothesize on unknown relationships. If gene C positively regulates gene B and gene B negatively regulates gene A, then gene C may negatively regulate gene A as well. We score the suspected relationship between C and A based on the quantity and strength of the connections from C to B to A. The score is calculated as follows.

For every gene B where modifier C has a reported effect on B and B has a reported effect on gene A, we define the indirect link score L_CBA per Equation 2:

L_{CBA} = \frac{D_{CB} \, D_{BA}}{|D_{CB}| + |D_{BA}|}

(Equation 2)

This formula favors balanced connections. If one article shows that B positively regulates A and 200 show that C positively regulates B, that connection has an L_CBA value of 0.995. If 6 articles show that B positively regulates A and 5 show that C positively regulates B, that connection has an L_CBA value of 2.727. An indirect connection is invalid if either intermediate connection is invalid, and this scoring algorithm downweights the former case, which is more likely to be invalid than the latter.
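Equation 2 and the two examples above can be checked with a short sketch. The function name `indirect_link_score` is hypothetical. The example calls assume that every mention has a value of 1, so that N unanimous articles give a directionality score of N under Equation 1. Returning 0 when both scores are 0 is our assumption.

```python
def indirect_link_score(d_cb, d_ba):
    """Equation 2: L_CBA = D_CB * D_BA / (|D_CB| + |D_BA|)."""
    denominator = abs(d_cb) + abs(d_ba)
    if denominator == 0:
        return 0.0  # assumption: no evidence on either link gives no score
    return d_cb * d_ba / denominator


# 200 articles for C -> B, 1 article for B -> A (unbalanced evidence):
print(round(indirect_link_score(200, 1), 3))  # 0.995
# 5 articles for C -> B, 6 articles for B -> A (balanced evidence):
print(round(indirect_link_score(5, 6), 3))    # 2.727
# A positive link followed by a negative link gives a negative score:
print(indirect_link_score(4, -4))             # -2.0
```

Because the product is divided by the sum of the magnitudes, the score never exceeds the magnitude of the weaker link, which is how a single-article link limits the whole connection.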

The L_CBA value is calculated for every intermediate gene B between C and A. Since C might, for instance, upregulate one downregulator of A and downregulate another downregulator of A, there can be multiple positive and negative L_CBA values for a given C and A. Ultimately, we want to make a final prediction of the effect of modifier C on target A. Similarly to Equation 1, we score a putative relationship based on the strength and consistency of the supporting evidence (i.e., the overwhelming support of one direction over the other). To do this, we take a similar approach to Equation 1 and calculate the final indirect relationship prediction score I_CA:

Let X = the sum of the positive L_CBA values between genes C and A.

Let Y = the absolute value of the sum of the negative L_CBA values between genes C and A.

I_{CA} = (X - Y) \cdot \frac{|X - Y|}{X + Y}

(Equation 3)
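Equations 2 and 3 together can be sketched as one aggregation step over all intermediate genes B. The function names and the input layout (one `(D_CB, D_BA)` pair per intermediate gene) are hypothetical, and returning 0 when there is no usable evidence is our assumption.

```python
def indirect_link_score(d_cb, d_ba):
    """Equation 2: L_CBA = D_CB * D_BA / (|D_CB| + |D_BA|)."""
    denominator = abs(d_cb) + abs(d_ba)
    return d_cb * d_ba / denominator if denominator else 0.0


def indirect_prediction_score(links):
    """Equation 3: combine the L_CBA values of all intermediates B into I_CA.

    links: iterable of (D_CB, D_BA) pairs, one per intermediate gene B.
    """
    l_values = [indirect_link_score(d_cb, d_ba) for d_cb, d_ba in links]
    x = sum(l for l in l_values if l > 0)       # support for positive regulation
    y = abs(sum(l for l in l_values if l < 0))  # support for negative regulation
    if x + y == 0:
        return 0.0  # assumption: no evidence gives a neutral score
    return (x - y) * abs(x - y) / (x + y)


# Two intermediates that agree (L = 3.0 and L = 2.0):
print(indirect_prediction_score([(6, 6), (4, 4)]))
# The same intermediates when the second one points the other way (L = -2.0):
print(indirect_prediction_score([(6, 6), (4, -4)]))
```

In the second call, X = 3 and Y = 2, so the prediction score drops from 5 to 0.2: disagreement among intermediates is penalized in the same way that disagreement among articles is penalized in Equation 1.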

Comparing indirect relationship predictions to existing knowledge

We compared PARMESAN’s predictions to knowledge bases of four types: (1) its own relationship extractions, (2) SemMedDB’s relationship extractions, (3) publicly available databases, and (4) two regulator screens for specific genes: Ataxin 1 (ATXN1 [MIM: 601556]), a protein whose unstable CAG trinucleotide repeat expansion causes neurodegeneration in spinocerebellar ataxia type 1,²,²² and Microtubule-Associated Protein Tau (MAPT [MIM: 157140]), a protein that forms intracellular tangles in Alzheimer disease and is also implicated in frontotemporal dementia (MIM: 172700) and Parkinson disease²³,²⁴,²⁵ (MIM: 168600). We repeat each of these comparisons after filtering PARMESAN’s predictions to those above different cutoff scores. Technical details of the comparisons are provided in the supplemental methods.

In the comparison to its own relationship extractions, we ran PARMESAN’s gene-gene relationship predictor (Figure 3A, referencing Liu et al.,²⁶ Roscic et al.,²⁷ and Monteiro et al.¹⁷) using only articles published before January 1, 2012. We then compared the predictions to relationships extracted from articles published before 2012 and before 2022 (Figure 3B; Table S13). We consider a prediction consistent if PARMESAN extracted the predicted relationship from the literature with the same proposed directionality (both agree on whether gene A increases or decreases the activity of gene B). For this time-capsule comparison, if a predicted relationship was not extracted or it was extracted but with the opposite directionality to that which was predicted, then the prediction is not consistent.

Figure 3.

Discovery of pre-2012 predictions over time

(A) We test PARMESAN’s long-term predictive capability with a time-capsule test, where PARMESAN makes predictions of modifiers using articles from before a given year, and we then compare those predictions to modifiers reported on after the given year. We use the example of MID1’s indirect effect on Huntingtin, putatively through its effect on MTOR (MIM: 601231).¹⁷,²⁶,²⁷

(B and C) We used PARMESAN to predict gene-gene (B) and drug-gene (C) relationships using only articles from before 2012, and show the fraction of those predictions that were consistent with relationships extracted before 2012 (black solid line) and before 2022 (black dashed line). The difference between these values (shown in red) represents the change in the fraction of predictions that were consistent with identified relationships over the decade after the predictions were made.

Similarly, we compare PARMESAN’s pre-2012 predictions to SemMedDB’s extracted relationships before 2012 and before 2022. As before, a prediction is consistent if PARMESAN’s predicted relationship matches the direction of SemMedDB’s extracted relationship.
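The time-capsule comparison can be sketched as follows. This is an illustration only: the function name, the dictionary layout, and the toy data are hypothetical. Following the definition above, a prediction counts as consistent only if the relationship was extracted with the same sign, and unextracted predictions stay in the denominator.

```python
def consistency_rate(predictions, extractions):
    """Fraction of predictions matched, with the same sign, by an extraction.

    predictions: {(modifier, target): predicted score}
    extractions: {(modifier, target): extracted directionality score}
    """
    if not predictions:
        return 0.0
    hits = sum(
        1
        for pair, score in predictions.items()
        if pair in extractions and (score > 0) == (extractions[pair] > 0)
    )
    return hits / len(predictions)


# Toy data (hypothetical): predictions made from pre-2012 articles only.
predictions = {("C", "A"): 3.2, ("D", "A"): -1.5, ("E", "A"): 0.4, ("F", "A"): 2.0}
pre_2012 = {("C", "A"): 1.0}
pre_2022 = {("C", "A"): 1.0, ("D", "A"): -2.0, ("E", "A"): -1.0}

rate_2012 = consistency_rate(predictions, pre_2012)
rate_2022 = consistency_rate(predictions, pre_2022)
print(rate_2012, rate_2022, rate_2022 - rate_2012)  # 0.25 0.5 0.25
```

The last value is the rate of newly consistent predictions, that is, the share of predictions that were first matched by an extraction during the decade after they were made.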

We compare PARMESAN’s extracted and predicted relationships to two manually curated databases: DGIdb for the drug-gene and Reactome for the gene-gene relationship predictions. These databases specify the directionality of each relationship—whether A positively or negatively regulates B. A prediction was consistent if it matched the directionality of the relationship in the database, and contradicted if it contradicted the database’s directionality. We measured accuracy as [consistent/(consistent + contradicted)].

To benchmark PARMESAN’s performance, we generate predictions (using the same prediction algorithm) from five different automatically extracted knowledge bases: PARMESAN’s extractions from PubMed alone; PARMESAN’s extractions from PubMed Central alone; PARMESAN’s combined extractions from PubMed and PubMed Central; the extracted relationships in SemMedDB; and all of the extractions from PARMESAN and SemMedDB combined. In practice, users would likely evaluate the highest-scoring predictions first and stop evaluating after reaching a self-defined accuracy threshold. As such, we compared the highest-scoring predictions (above different score thresholds) to the manually curated database, lowering the score threshold until it reached zero.

Using the same definitions of consistent and contradicted, we compare PARMESAN’s gene-gene relationship predictions to two in vitro disease gene modifier screens: one for genes whose knockdown decreases ATXN1 levels²⁸ and one for druggable genes whose knockdown increases or decreases endogenous and over-expressed MAPT protein in a human cell line (unpublished data). As we increased the score threshold, we compared the declining numbers of correct and incorrect predictions with a one-phase decay least-squares fit and an extra sum-of-squares F test comparing the decay constant.

Lastly, to explore the potential utility of this tool in identifying drugs for relatively rare genetic disorders, we determined the number of predicted drugs for genes reported on by the Undiagnosed Diseases Network (UDN). We used a bash script to find genes (annotated by PubTator) mentioned in the titles of articles that list the UDN as an author, and we manually verified that all such genes were legitimately mentioned. Our script was stringent enough that we did not have to manually remove any genes from the list. We then determined how many drugs (above each integer score threshold from 0 to 100) PARMESAN predicted to modulate these genes.

Results

PARMESAN’s extraction confidence scores associate strongly with the accuracy of extracted relationships

Using PARMESAN, we extracted more than 1 million gene-gene and 600,000 drug-gene relationships from PubMed and PubMed Central (Tables S1 and S2). Among the 20,676 gene-gene relationships extracted by PARMESAN that were also present in Reactome, 15,095 (73%) had matching directionality. Likewise, PARMESAN and DGIdb agreed on 3,020 (83%) of the 3,626 drug-gene relationships that were present in both of their knowledge bases (Figure 2A). PARMESAN assigns a directionality score to each extracted relationship based on the amount of supporting and opposing evidence. Filtering the gene-gene and drug-gene relationships to those with larger directionality scores led to a stronger match with the manually curated databases, and both had a score threshold above which they achieved over 90% consistency. Of the 3,143 gene-gene relationships scoring above 6 that overlap with Reactome, 2,848 (90.6%) matched in directionality—and PARMESAN had 34,767 extracted gene-gene relationships above this score. Meanwhile, of the 1,507 drug-gene relationships scoring above 1 that overlapped with DGIdb, 1,411 (93.6%) matched in directionality, and PARMESAN had 104,702 drug-gene relationships above this threshold (Table S2).

Figure 2.

Accuracy of extracted modifier relationships

(A) We compare PARMESAN’s relationship extractions to the manually curated databases Reactome (for gene-gene relationships) and DGIdb (for drug-gene relationships). “MCDB” means “manually curated database.” At different score thresholds, we plot the consistency with the manually curated database (Y axis) against the total number of extracted relationships above that score threshold (X axis). To measure the consistency, we isolate the relationships that overlap with the manually curated database, and among them, determine the percent of the relationships that had matching directionality (whether drug A positively or negatively regulates gene B). As we lower our score threshold, the consistency declines, suggesting that these scores are indicative of the consistency with the manually curated databases.

(B) In four different trials, we randomly collected 100 relationships extracted by PARMESAN and checked the supporting articles to see whether at least one of them truly signified the relationship. “Correct” relationships were confirmed, with correct directionality, in at least one of the supporting articles—for example, PARMESAN states that gene A negatively regulates gene B, and a supporting article indicates that gene A negatively regulates gene B. “Misdirected” relationships were confirmed by at least one supporting article, but never with correct directionality—for example, PARMESAN states that gene A negatively regulates gene B, and the supporting articles indicate that gene A positively regulates gene B. “Incorrect” relationships were not confirmed by any of the supporting articles. For genes and drugs, we randomly select from (1) the full set of extracted relationships (“unfiltered”) and (2) the set of all predictions scoring over a power-analysis-defined threshold—4 for genes and 3 for drugs (“high score”). The error bars represent the 95% confidence intervals for a binomial distribution. In both the manual evaluation and the comparison to the manually curated relationships, PARMESAN’s accuracy improves when relationships are limited to those with a score above the given threshold. This suggests that these scores are a promising measure of confidence in an automatically extracted relationship.

We also calculated directionality scores from SemMedDB’s relationship extractions and found that these extractions were also more consistent with DGIdb and Reactome as the score threshold increased (Figure S1). SemMedDB had a total of 181,041 gene-gene and 193,920 drug-gene relationships and also exceeded 90% accuracy at sufficiently high score thresholds (for drug-gene relationships, no threshold was needed to achieve this). There were 5,369 gene-gene relationships scoring above 3, and 633 of those relationships overlapped with Reactome. Among the overlapping relationships, 575 (90.8%) matched in directionality. Among all of the drug-gene relationships in SemMedDB (unfiltered), there were 1,160 relationships overlapping with DGIdb. Among the overlapping relationships, 1,059 (91.3%) matched in directionality.

We then manually evaluated random subsets of gene-gene and drug-gene relationship extractions from PARMESAN. Among the randomly selected gene-gene relationship extractions, 45% were correct (Figure 2B; Tables S3 and S7). We performed the same accuracy test on 100 random gene-gene relationships scoring above 4 (the threshold yielding a detectable accuracy difference on a binomial power analysis, Tables S4 and S7). The accuracy increased to 73% (p = 3.36 × 10⁻⁵). We next performed the same test on drug-gene relationships extracted by PARMESAN. Among the unfiltered extractions, 52% were correct (Tables S5 and S7). When we evaluated extractions with scores above 3 (Tables S6 and S7), the accuracy improved to 77% (p = 0.000159).

PARMESAN and the manually curated databases were largely disjoint from each other. Reactome’s functional interaction database contains 124,746 relationships that we could map to Entrez IDs, and DGIdb contains 15,724 that we could map to Entrez and PubChem IDs. PARMESAN’s extractions intersected with only 20,676 of the Reactome relationships and 3,626 of the DGIdb relationships. Among the relationships extracted by PARMESAN that we deemed “correct” in our manual evaluation (118 gene-gene and 129 drug-gene relationships), we found 9 of the gene-gene and 3 of the drug-gene relationships to be registered in Reactome and DGIdb, respectively (Table S8). The discrepancy between the sets of relationships covered by PARMESAN and the manually curated databases may be due to each of them covering a fairly small fraction of what is currently known about gene-gene and drug-gene relationships. Because Reactome and DGIdb are manually curated, they are limited in breadth, and they appear to miss a large fraction of what we know. If we were to view PARMESAN, Reactome, and DGIdb as samples of the set of known relationships, and if we assumed that Reactome and DGIdb were both independent of PARMESAN’s extractions (even though we would not realistically assume this), then the manually curated databases would contain 2%–8% of our collective knowledge and PARMESAN would contain 17%–23%.
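The percentage ranges in this paragraph can be reproduced from the counts it quotes. The sketch below shows one reading that yields the same numbers, assuming a simple capture-recapture style proportion (under independence, the share of all known relationships held by one resource equals its share of the other resource's entries). This interpretation is our assumption and is not stated explicitly in the text.

```python
# Counts quoted in the paragraph above.
reactome_total, reactome_overlap = 124_746, 20_676
dgidb_total, dgidb_overlap = 15_724, 3_626
correct_gene, correct_gene_in_reactome = 118, 9
correct_drug, correct_drug_in_dgidb = 129, 3


def percent(part, whole):
    """Share of `whole` covered by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Share of curated relationships that PARMESAN also extracted,
# read as PARMESAN's coverage of what is known.
parmesan_coverage = (
    percent(reactome_overlap, reactome_total),
    percent(dgidb_overlap, dgidb_total),
)

# Share of manually verified PARMESAN extractions found in the curated
# databases, read as the curated databases' coverage of what is known.
curated_coverage = (
    percent(correct_drug_in_dgidb, correct_drug),
    percent(correct_gene_in_reactome, correct_gene),
)

print(parmesan_coverage)  # (17, 23)
print(curated_coverage)   # (2, 8)
```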

PARMESAN’s higher-scoring predictions are more likely to match extracted relationships

We ran PARMESAN’s relationship extraction and prediction algorithms using only the articles published before 2012. This gave us a “time capsule” that represents what PARMESAN would have extracted and predicted at that time. We made another time capsule of PARMESAN’s extractions using articles from before 2022 to show how PARMESAN’s knowledge base had changed in the following decade. When comparing PARMESAN’s predicted gene-gene relationships from before 2012 to the ones that it extracted by 2022 (Figure 3A), 2.8% of the pre-2012 predicted relationships were present in the pre-2022 extracted knowledge base (with matching directionality). When focusing on predictions with higher scores (Figure 3B), the match exceeded 90%. As an example, we show the pre-2012 predictions (Table S9), the pre-2012 extractions (Table S10), and the pre-2022 extractions (Table S11) for genes that regulate alpha-synuclein (SNCA), a protein abundant in the plaques that form in Lewy body dementia (MIM: 127750).

Unexpectedly, the rate of newly consistent predictions (measured as the difference between the consistency rates before 2012 and before 2022) peaked at 18.7%, and then fell below 10% (Table S12). This suggests that predictions with high confidence scores are more likely to be accurate, but also more likely to have already been found.
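To make the "newly consistent" measure concrete, the sketch below computes it on toy data. It assumes that the consistency rate is the fraction of all predictions above a score threshold that appear, with matching direction, in an extracted knowledge base; the data, type aliases, and function names are illustrative only and do not come from PARMESAN's code.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]                 # (regulator, target)
Prediction = Tuple[Pair, int, float]   # (pair, direction +1/-1, score)


def consistency_rate(predictions: List[Prediction],
                     extracted: Dict[Pair, int],
                     threshold: float) -> float:
    """Fraction of predictions scoring above `threshold` that are present
    in `extracted` with the same direction."""
    kept = [(pair, direction) for pair, direction, score in predictions
            if score > threshold]
    if not kept:
        return 0.0
    hits = sum(1 for pair, direction in kept if extracted.get(pair) == direction)
    return hits / len(kept)


def newly_consistent_rate(predictions: List[Prediction],
                          extracted_2012: Dict[Pair, int],
                          extracted_2022: Dict[Pair, int],
                          threshold: float) -> float:
    """Difference between the pre-2022 and pre-2012 consistency rates."""
    return (consistency_rate(predictions, extracted_2022, threshold)
            - consistency_rate(predictions, extracted_2012, threshold))


# Toy pre-2012 predictions for a hypothetical target gene X.
toy_predictions = [(("A", "X"), +1, 9.0), (("B", "X"), -1, 6.0),
                   (("C", "X"), +1, 3.0), (("D", "X"), -1, 1.0)]
toy_2012 = {("A", "X"): +1}
toy_2022 = {("A", "X"): +1, ("B", "X"): -1, ("C", "X"): -1}

for t in (0, 5, 8):
    print(t, newly_consistent_rate(toy_predictions, toy_2012, toy_2022, t))
```

In this toy example the rate rises from 0.25 to 0.5 and then drops to 0 at the highest threshold, because the only remaining prediction had already been extracted before 2012. This mirrors the peak-then-decline pattern described above.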

We performed the same test with the pre-2012 drug-gene relationships (Figure 3C; Table S12) and saw a similar pattern: 1.5% of all predictions were consistent with the extractions from 10 years later; this rate rose above 90% as the score threshold increased; and the rate of newly consistent predictions peaked at 17.4%.

We repeated this test, comparing PARMESAN’s pre-2012 predictions to SemMedDB’s relationship extractions from before 2012 and before 2022 (Figure S2). For gene-gene predictions (Figure S2A), 0.5% of PARMESAN’s pre-2012 predictions matched SemMedDB’s extractions from before 2022. As the score threshold increased, this consistency rose above 60%, and the rate of newly consistent predictions peaked at 7.3%. For drug-gene interactions (Figure S2B), the pre-2012 predictions overall had a 0.27% match to SemMedDB’s pre-2022 extractions, the match rose above 20% with higher score thresholds, and the rate of newly consistent predictions peaked at 5.5%.

PARMESAN accurately predicts drug-gene relationships

We generated gene-gene and drug-gene relationship predictions from the following knowledge bases: (1) PARMESAN’s extractions from PubMed (27 million abstracts), (2) PARMESAN’s extractions from PubMed Central (5 million downloadable full-text articles), (3) PARMESAN’s extractions from both PubMed and PubMed Central, (4) SemMedDB’s relationships, and (5) the combined knowledge base using both PARMESAN’s extractions (PubMed and PubMed Central) and the relationships in SemMedDB.

We compared each set of drug-gene relationship predictions to those contained in DGIdb¹¹ (Figure 4A; Table S13). Predictions generated from each of the five knowledge bases show a stronger match to DGIdb with increasing prediction scores, and each prediction set has a score level above which DGIdb never contradicts it. However, for a given accuracy level, some knowledge bases yield a larger number of promising predictions than others. For example, among the drug predictions generated from PARMESAN’s extractions from the abstracts in PubMed, the predictions with a score above 10 were never contradicted by DGIdb (100% accuracy, out of the 8 relationships that overlapped with DGIdb), and there were 4,042 total predictions above this threshold. Meanwhile, the predictions generated from PARMESAN’s extractions from the full-text articles in PubMed Central that scored above 18 were never contradicted by DGIdb (among the 53 that overlapped), and there were 29,774 total predictions above this threshold. This suggests that PARMESAN’s PubMed Central extractions (with PubMed Central having a larger corpus of text than PubMed) yield more than 7 times as many predictions at this accuracy level as PARMESAN’s PubMed extractions do.
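The comparison described above can be sketched as follows. This is an illustration on toy data, assuming that accuracy is computed only over the predictions that overlap with the curated database and that a prediction is "consistent" when its direction matches the curated one; the drug and gene names and the function `consistency_with_curated` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]                 # (drug, gene)
Prediction = Tuple[Pair, int, float]   # (pair, direction +1/-1, score)


def consistency_with_curated(
    predictions: List[Prediction],
    curated: Dict[Pair, int],
    threshold: float,
) -> Tuple[int, int, Optional[float]]:
    """Return (total predictions above threshold,
               predictions that overlap with the curated database,
               fraction of the overlap with matching direction)."""
    above = [(pair, d) for pair, d, score in predictions if score > threshold]
    overlap = [(pair, d) for pair, d in above if pair in curated]
    matches = sum(1 for pair, d in overlap if curated[pair] == d)
    rate = matches / len(overlap) if overlap else None
    return len(above), len(overlap), rate


toy_predictions = [
    (("drugA", "GENE1"), -1, 12.0),
    (("drugB", "GENE1"), +1, 7.5),
    (("drugC", "GENE2"), -1, 4.0),
    (("drugD", "GENE2"), +1, 2.5),
    (("drugE", "GENE3"), +1, 1.0),
]
toy_curated = {("drugA", "GENE1"): -1,
               ("drugC", "GENE2"): -1,
               ("drugD", "GENE2"): -1}

print(consistency_with_curated(toy_predictions, toy_curated, 2))
print(consistency_with_curated(toy_predictions, toy_curated, 3))
```

The first returned value is the quantity of interest when comparing knowledge bases: two prediction sets can reach the same accuracy on the overlap while differing greatly in how many total predictions survive the threshold.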

Comparison of PARMESAN’s gene-gene and drug-gene relationship predictions to manually curated relationships

We compared all of PARMESAN’s predictions to manually curated databases of drug-gene and gene-gene relationships. The accuracy, or “percent consistent,” has the same definition as it does in Figure 2A. We generated predictions from five knowledge bases: PARMESAN’s extractions from PubMed, PARMESAN’s extractions from PubMed Central, PARMESAN’s combined extractions from PubMed and PubMed Central, SemMedDB’s extractions, and the combined extractions from PARMESAN (PubMed and PubMed Central) and SemMedDB.

(A) The drug-gene relationship predictions were compared to the relationships presented in DGIdb. We take the top n predictions for a given number n (X axis) and observe the consistency in directionality with DGIdb. For example, PARMESAN (using PubMed alone) generated 453,892 predictions with scores above 2. Among the 255 predictions that scored above 2 and overlapped with DGIdb, 204 (80%) matched the directionality displayed by DGIdb. Therefore, the orange “PARMESAN (PubMed)” line contains the point at X = 453,892, Y = 0.8. The best predictions came from combining the extractions from PARMESAN and SemMedDB, although in this trial, the difference from using PARMESAN alone was not statistically significant.

(B) Gene-gene relationship predictions were compared to the gene-gene relationships presented in Reactome. This panel is formatted in the same way as (A). All prediction sets demonstrated increased accuracy with higher scores. In this setting, the combination of PARMESAN and SemMedDB showed the best predictive ability. Its differences from the other knowledge bases tested were all statistically significant.

(C) We compared PARMESAN’s genetic modifier predictions (using extractions from PubMed and PubMed Central combined) for *ATXN1* and *MAPT* to corresponding modifier screens, and the consistent predictions outnumbered the contradicted ones at higher score thresholds.

When users filter the predictions to achieve a desired accuracy level, one knowledge base will yield the largest number of predictions. Without knowing what accuracy level a user will accept, we must compare the number of predictions at any accuracy from 50% to 100%. To compare the number of predictions at each accuracy level generated from the different knowledge bases, we used the Friedman rank-sum test with Dunn’s multiple comparisons test (see supplemental methods). The Friedman statistic was 172.4 (p < 0.0001). The best drug predictions came from the combination of PARMESAN and SemMedDB—although despite having the largest rank sum (224), its difference from PARMESAN alone (using PubMed and PubMed Central, rank sum 211) was not statistically significant (p > 0.9999). However, the combination of PARMESAN and SemMedDB significantly outperformed the remaining knowledge bases (p = 0.0325 against PARMESAN’s PubMed Central extractions, and <0.0001 against PARMESAN’s PubMed extractions and SemMedDB). Furthermore, the three knowledge bases that used PARMESAN’s PubMed Central extractions all outperformed the two that did not (p < 0.0001).

Ultimately, our goal is to have at least one high-confidence prediction for as many target genes as possible. As such, we measured the accuracy in the comparison to the manually curated database against the number of target genes with at least one prediction (normalized to the highest target gene count achieved in any prediction set, which was 18,899, achieved by the gene predictions from combining PARMESAN and SemMedDB). This gave us a precision-recall-like curve for each set of predictions. This is not to be mistaken for an actual precision-recall curve, which would evaluate PARMESAN’s ability to predict the relationship (or nonexistence thereof) of any given gene-gene or drug-gene pair—a functionality that PARMESAN does not have. PARMESAN’s PubMed extractions yielded an area under the curve (AUC) of 0.552244, the PubMed Central extractions yielded 0.84776, the PubMed and PubMed Central extractions together yielded 0.861687, SemMedDB yielded 0.324165, and PARMESAN’s extractions combined with SemMedDB yielded 0.855783. In this respect, knowledge bases that include PARMESAN’s PubMed Central extractions all outperform the knowledge bases that do not (Figure S3A).
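As an illustration of how such an area could be computed, the sketch below integrates accuracy against normalized target-gene coverage with the trapezoidal rule. The integration method is an assumption (the text does not state it), the toy points are invented, and `curve_auc` is a hypothetical name; only the normalizing constant 18,899 comes from the text.

```python
from typing import List, Tuple


def curve_auc(points: List[Tuple[float, float]], max_gene_count: float) -> float:
    """Trapezoidal area under accuracy vs. normalized gene coverage.

    `points` holds (number of target genes with >= 1 prediction, accuracy)
    pairs, one per score threshold. Gene counts are divided by
    `max_gene_count` so that the x axis runs from 0 to 1.
    """
    if max_gene_count <= 0:
        raise ValueError("max_gene_count must be positive")
    normalized = sorted((count / max_gene_count, acc) for count, acc in points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(normalized, normalized[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy curve: accuracy falls as more target genes are covered.
toy_points = [(0, 1.0), (50, 0.9), (100, 0.5)]
print(curve_auc(toy_points, max_gene_count=100))

# For the real analysis the normalizer would be the largest gene count
# reported in the text:
MAX_TARGET_GENES = 18_899
```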

Lastly, we counted the drugs that PARMESAN predicted for 64 genes reported on by the UDN (Table S17) at each score level from 0 to 100. PARMESAN predicted at least one drug for all 64 of the genes, and at least one drug with a score above 4 (corresponding to a 90% match with DGIdb) for 35 of them (Table S18). Only 12 of these 64 genes had drugs registered in DGIdb that were known to modulate them (Table S19). This highlights PARMESAN’s potential in identifying drugs that can modulate genes involved in rare disorders.

PARMESAN accurately predicts gene-gene relationships

We compared the predicted gene-gene relationships derived from each source knowledge base (PARMESAN with PubMed, PARMESAN with PubMed Central, PARMESAN’s full database, SemMedDB, and PARMESAN combined with SemMedDB) to all of the gene-gene relationships described in Reactome⁶ (Figure 4B; Table S14). For all prediction sets, raising the score threshold led to an increase in the consistency of the predictions with Reactome.

We used the Friedman rank-sum test with Dunn’s multiple comparisons test to determine which prediction set yielded the most predictions across the different score levels. The Friedman statistic was 191.0. The best predictions came from combining the extracted relationships from PARMESAN and SemMedDB. It outperformed all other knowledge bases (p = 0.009 against PARMESAN’s full knowledge base and p < 0.0001 against the remaining knowledge bases). As was true for the drug predictions, prediction sets that used PARMESAN’s PubMed Central extractions outperformed those that did not (all p values < 0.001).

As we did with the drug predictions, we generated precision-recall-like curves for the genetic modifier predictions, measuring the percent directional consistency with the manually curated databases against the normalized number of target genes covered (Figure S3B). PARMESAN’s PubMed extractions alone yielded an AUC of 0.584728, the PubMed Central extractions yielded 0.855866, the PubMed and PubMed Central extractions together yielded 0.869541, and SemMedDB yielded 0.375288. The best AUC, 0.88151, came from the combination of PARMESAN’s extractions and SemMedDB.

Comparing PARMESAN predictions to experimental genetic screens

As another approach to evaluate PARMESAN’s performance, we compared its predictions to functional genetic screens. Specifically, we compared PARMESAN’s predictions (PubMed + PubMed Central) to screens for regulators of ATXN1²⁸ (93 genes, 76 of which PARMESAN predicted as modifiers of ATXN1) and MAPT (97 genes, 93 of which PARMESAN predicted as modifiers of MAPT). The consistent predictions outnumbered the contradicted ones (Figure 4C; Tables S15 and S16), and filtering the predictions by score significantly favored the consistent predictions in the ATXN1 screen (p < 0.0001) but not in the MAPT screen.

Discussion

Testing a drug for therapeutic effects in a given disorder is an expensive and time-consuming process, and it is essential to know which treatments are the most likely to work, especially for newly discovered genetic disorders for which no therapeutics have been developed. PARMESAN’s drug-predicting capability can save time and resources in the investigation of promising therapeutics for disorders caused by protein haploinsufficiency or toxic gain-of-function variants.

The predictive power of this tool expands with time, as new publications emerge that report gene-gene and drug-gene relationships. In theory, the most effective way to collect information on gene-gene and drug-gene relationships would be to require researchers to upload new discoveries to a public database. Automated literature curation may be the next best option until such a system exists. PARMESAN can construct an entire knowledge base automatically, accomplish years of manual curation within hours, and repeat the process regularly to capture new data. This knowledge base contains hundreds of thousands of relationships, which are more than 40% accurate overall and more than 70% accurate at higher scores. Future work will involve adjusting the extraction mechanism to account for recurring errors—for example, many of the erroneous high-scoring extractions are due to repeatedly extracting from the same article title cited in the reference sections of many publications. As such, removing the references section from the full-text articles before parsing them may be beneficial.

The known relationships allowed PARMESAN to accurately predict relationships that it had not already registered. An effective relationship predictor has two key aspects: (1) the ability to distinguish high- versus low-confidence predictions and (2) the ability to generate many high-confidence predictions. All of the knowledge bases we used could fuel predictions that accomplished the former, but they performed differently in the latter. Overall, the largest number of promising predictions (particularly for gene-gene relationships) came from bringing together the extractions of PARMESAN and SemMedDB. We anticipate that incorporating additional automatically curated knowledge bases, possibly with the use of BioBERT or other large language models, would further strengthen the predictions.

The drug predictions may guide the development of therapeutics for diseases caused by heterozygous loss-of-function variants (for which the goal would be to increase the activity of the healthy protein) or gain-of-function variants (for which one would want to deactivate or destroy the toxic protein). PARMESAN’s potential in helping us find treatments for genetic disorders is highlighted by the fact that it predicts drugs with confidence scores above 4 (of which the overlapping relationships with DGIdb displayed 91%-matching directionality) for more than half of the disease genes we identified that were reported on by the UDN.

PARMESAN can also guide genetic modifier screens, generating more than 1 million predictions at the score level that corresponds to a 91% match with Reactome (10,012 predictions out of 11,060 that overlapped with Reactome had matching directionality). Modifier screens serve to improve our understanding of gene-gene relationships, which can further guide therapeutic development. For both modifier and drug screens, PARMESAN has the potential to save a tremendous amount of time and resources by identifying the most promising experiments to conduct.

There are three key limitations of PARMESAN that future work will address. (1) The dependence on the availability of literature leads to fewer hypotheses for newly discovered disease genes than for well-known ones. This will be partially addressed by continuing to update PARMESAN’s knowledge base with new discoveries, but finding new ways to expand the reach of the tool will be essential. (2) One cannot verify that two entities have no relationship, which makes the accuracy of PARMESAN’s predictions on undiscovered relationships difficult to reliably assess. We can estimate this accuracy mathematically, but the best estimate will come from empirically testing more of PARMESAN’s hypotheses. (3) PARMESAN’s predictions are limited to upstream regulators, but it would be useful to also predict treatments that work downstream of a disease gene and facilitate the biological processes that the defective protein can no longer facilitate. Despite its limitations, PARMESAN can greatly expedite the literature-search process and prioritize the testing of candidate therapeutics for currently untreatable genetic disorders.

Appendix A

Statistical analyses

p values comparing accuracies from the manual accuracy evaluation (Table S7) are calculated using the 2-proportion binomial test provided in the statsmodels Python package. The threshold scores are calculated using the 2-proportion binomial power analysis, also using the statsmodels Python package.
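For readers who want to see the shape of such a comparison, the sketch below hand-codes a pooled two-proportion z test using only the Python standard library. It is not the statsmodels routine used for the reported values (the exact function is not named here), so it need not reproduce the reported p values exactly. It assumes 100 evaluated relationships in each group, which is consistent with the 118 and 129 correct totals given in the results; the function name is hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z test; returns (z, two-sided p value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Gene-gene extractions: 45/100 correct unfiltered vs. 73/100 above the threshold.
z_gene, p_gene = two_proportion_z_test(45, 100, 73, 100)
# Drug-gene extractions: 52/100 correct unfiltered vs. 77/100 above the threshold.
z_drug, p_drug = two_proportion_z_test(52, 100, 77, 100)

print(f"gene-gene: z = {z_gene:.2f}, p = {p_gene:.2e}")
print(f"drug-gene: z = {z_drug:.2f}, p = {p_drug:.2e}")
```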

95% confidence intervals are calculated using the binomial distribution Microsoft Excel function, “=CRITBINOM(t,s/t,0.025),” where s is the number of accurate relationship extractions, and t is the number of relationships evaluated (100). Each experiment was performed once.
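Excel's CRITBINOM returns the smallest count whose cumulative binomial probability is at least the given criterion. A minimal Python equivalent is sketched below. The formula above gives only the lower bound (criterion 0.025); taking the upper bound at criterion 0.975 is our assumption, and the function names are hypothetical.

```python
import math
from typing import Tuple


def binom_cdf(k: int, trials: int, prob: float) -> float:
    """P(X <= k) for X ~ Binomial(trials, prob)."""
    return sum(math.comb(trials, i) * prob**i * (1.0 - prob) ** (trials - i)
               for i in range(k + 1))


def critbinom(trials: int, prob: float, criterion: float) -> int:
    """Smallest k such that P(X <= k) >= criterion (as in Excel's CRITBINOM)."""
    cumulative = 0.0
    for k in range(trials + 1):
        cumulative += math.comb(trials, k) * prob**k * (1.0 - prob) ** (trials - k)
        if cumulative >= criterion:
            return k
    return trials


def binomial_interval(successes: int, trials: int,
                      level: float = 0.95) -> Tuple[int, int]:
    """Lower and upper counts bounding the central `level` of the distribution."""
    tail = (1.0 - level) / 2.0
    prob = successes / trials
    return critbinom(trials, prob, tail), critbinom(trials, prob, 1.0 - tail)


# Example: 45 accurate extractions out of 100 evaluated.
low, high = binomial_interval(45, 100)
print(f"95% interval: {low}% to {high}%")
```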

When comparing the numbers of predictions at each accuracy level between prediction sets, we calculate statistical significance using the Friedman test with Dunn’s multiple comparisons test, using GraphPad Prism. A data point was taken at each integer percent accuracy from 50 to 100 (50, 51, 52 … 99, 100).
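The ranking step of this test can be illustrated in a few lines of Python. In the sketch below, each row is one accuracy level (a block) and each column is one knowledge base. It computes the rank sums and the uncorrected Friedman statistic only: it applies no tie correction, does not compute a p value, and does not implement Dunn's post-test, all of which were handled by GraphPad Prism in our analysis. The toy data and function names are illustrative.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for index in order[i:j + 1]:
            ranks[index] = mean_rank
        i = j + 1
    return ranks


def friedman(blocks: Sequence[Sequence[float]]) -> Tuple[List[float], float]:
    """Return (rank sum per column, Friedman chi-square statistic)."""
    n = len(blocks)        # number of blocks (accuracy levels)
    k = len(blocks[0])     # number of treatments (knowledge bases)
    rank_sums = [0.0] * k
    for row in blocks:
        if len(row) != k:
            raise ValueError("all blocks must have the same number of columns")
        for column, rank in enumerate(average_ranks(row)):
            rank_sums[column] += rank
    statistic = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return rank_sums, statistic


# Toy data: prediction counts for 3 knowledge bases at 4 accuracy levels.
toy_counts = [[10, 20, 30], [10, 30, 20], [10, 20, 30], [10, 20, 30]]
toy_rank_sums, toy_statistic = friedman(toy_counts)
print(toy_rank_sums, toy_statistic)
```

With 51 accuracy levels and five knowledge bases, the largest possible rank sum is 51 × 5 = 255, which puts the reported rank sums of 224 and 211 in context.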

When comparing the modifier predictions to the results of modifier screens, a statistically significant difference between the diminishing numbers of consistent and contradicted predictions as the score threshold increases was calculated by one-phase decay least-squares fit with an extra sum-of-squares F test, comparing the decay constant K. In contrast to the comparison between prediction sets from different knowledge bases (for which we use the Friedman test), we are not simply determining whether one value is larger than the other at different accuracy levels, but whether the scoring system is effectively removing the contradicted predictions—in other words, whether increasing the score threshold gets rid of contradicted predictions more quickly than it gets rid of consistent ones. These calculations were done in GraphPad Prism.

Data and code availability

• The in vitro modifier screens are intellectual property of the Huda Zoghbi lab. Information on the genes in these screens is available upon reasonable request. Modifier extractions from an earlier version of PARMESAN for neurodegenerative disease genes are available at the Neurodegeneration Hub (https://nddb.nrihub.org/).
• Upon publication, PARMESAN will be available at parmesan.nrihub.org as a searchable web interface. Its extracted and predicted relationships can be queried for on the site or downloaded in bulk. Additionally, the code used to download the needed mapping and manually curated databases, run PARMESAN, and compare its predicted relationships to DGIdb and Reactome is available in a public GitHub repository, at https://github.com/coledeisseroth/PARMESAN. The code used to make predictions using SemMedDB will be kept in a separate repository at https://github.com/coledeisseroth/PARMESAN_SemMedDB, which will require the user to separately download the SemMedDB data and the UMLS Metathesaurus.

Acknowledgments

We thank Roopashri Holehonnur, Nigel Lee, Dongxue Mao, Sasidhar Pasupuleti, Ying-Wooi Wan, Megan Mair, Tarik Onur, Juan Botas, Shinya Yamamoto, Hugo Bellen, and the members of the labs of Huda Zoghbi and Zhandong Liu for their feedback and advice on the improvement and analysis of this tool.

C.A.D. and J.W. are supported by the Medical Scientist Training Program of Baylor College of Medicine. J.W. is also supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health (NIH) under award number F30HD094503 and the Robert and Janice McNair Foundation McNair MD/PhD Student Scholar Program. W.-S.L. was a Howard Hughes Medical Institute International Student Research fellow. J.K. and H.Y.Z. are supported by the BrightFocus Foundation and JPB Foundation for the MAPT modifier screen. R.S.D. is supported by NIH/NINDS grant F32NS127854. H.Y.Z. is supported by the Howard Hughes Medical Institute and by NIH/NINDS grant 2R37NS027699. Z.L. is supported by the National Institutes of Health and the National Institute on Aging (R01AG057339), the CHDI Foundation, the Huffington Foundation, and the Chao Foundation.

Author contributions

C.A.D.: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing – original draft, writing – review & editing, visualization. W.-S.L.: data curation, writing – review & editing. J.K.: data curation, writing – review & editing. H.-H.J.: investigation, data curation, writing – review & editing. R.S.D.: writing – review & editing. J.W.: conceptualization, methodology, writing – review & editing. H.Y.Z.: conceptualization, methodology, resources, writing – review & editing, supervision. Z.L.: conceptualization, methodology, resources, writing – review & editing, supervision, project administration, funding acquisition.

Declaration of interests

H.Y.Z. collaborates with UCB Pharma to modify levels of MAPT and ATXN1. R.S.D. is a paid consultant of AstraZeneca.

Published: September 22, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2023.08.018.

Web resources

ATXN1 modifier screen, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9057624/
DGIdb, https://www.dgidb.org/downloads
DIOPT, https://www.flyrnai.org/tools/diopt/web/api
NCBI (Entrez) gene database, https://www.ncbi.nlm.nih.gov/gene/
PubChem, https://pubchem.ncbi.nlm.nih.gov/
PubMed, https://pubmed.ncbi.nlm.nih.gov/
PubMed Central, https://www.ncbi.nlm.nih.gov/pmc/
PubTator, https://www.ncbi.nlm.nih.gov/research/pubtator/
Reactome, https://reactome.org/download-data
SemMedDB, https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR.html
Unified Medical Language System (UMLS), https://www.nlm.nih.gov/research/umls/index.html

Supplemental information

Document S1. Figures S1–S3 and supplemental methods: mmc1.pdf (698.2 KB, pdf)

Data S1. Tables S1–S19: mmc2.xlsx (1.1 MB, xlsx)

Document S2. Article plus supplemental information: mmc3.pdf (3.6 MB, pdf)

References

1. Amir R.E., Van den Veyver I.B., Wan M., Tran C.Q., Francke U., Zoghbi H.Y. Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat. Genet. 1999;23:185–188. doi: 10.1038/13810.
2. Orr H.T., Chung M.Y., Banfi S., Kwiatkowski T.J., Servadio A., Beaudet A.L., McCall A.E., Duvick L.A., Ranum L.P., Zoghbi H.Y. Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1. Nat. Genet. 1993;4:221–226. doi: 10.1038/ng0793-221.
3. Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A., Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34:D535–D539. doi: 10.1093/nar/gkj109.
4. Snel B., Lehmann G., Bork P., Huynen M.A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000;28:3442–3444. doi: 10.1093/nar/28.18.3442.
5. Huttlin E.L., Ting L., Bruckner R.J., Gebreab F., Gygi M.P., Szpyt J., Tam S., Zarraga G., Colby G., Baltier K., et al. The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell. 2015;162:425–440. doi: 10.1016/j.cell.2015.06.043.
6. Fabregat A., Sidiropoulos K., Viteri G., Forner O., Marin-Garcia P., Arnau V., D’Eustachio P., Stein L., Hermjakob H. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinf. 2017;18:142. doi: 10.1186/s12859-017-1559-2.
7. Ogata H., Goto S., Fujibuchi W., Kanehisa M. Computation with the KEGG pathway database. Biosystems. 1998;47:119–128. doi: 10.1016/S0303-2647(98)00017-3.
8. Na D., Rouf M., O’Kane C.J., Rubinsztein D.C., Gsponer J. NeuroGeM, a knowledgebase of genetic modifiers in neurodegenerative diseases. BMC Med. Genomics. 2013;6:52. doi: 10.1186/1755-8794-6-52.
9. Sun H., Guo Y., Lan X., Jia J., Cai X., Zhang G., Xie J., Liang Q., Li Y., Yu G. PhenoModifier: a genetic modifier database for elucidating the genetic basis of human phenotypic variation. Nucleic Acids Res. 2020;48:D977–D982. doi: 10.1093/nar/gkz930.
10. Wishart D.S., Feunang Y.D., Guo A.C., Lo E.J., Marcu A., Grant J.R., Sajed T., Johnson D., Li C., Sayeeda Z., et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46:D1074–D1082. doi: 10.1093/nar/gkx1037.
11. Wagner A.H., Coffman A.C., Ainscough B.J., Spies N.C., Skidmore Z.L., Campbell K.M., Krysiak K., Pan D., McMichael J.F., Eldred J.M., et al. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res. 2016;44:D1036–D1044. doi: 10.1093/nar/gkv1165.
12. Wei C.-H., Kao H.-Y., Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013;41:W518–W522. doi: 10.1093/nar/gkt441.
13. Kilicoglu H., Shin D., Fiszman M., Rosemblat G., Rindflesch T.C. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics. 2012;28:3158–3160. doi: 10.1093/bioinformatics/bts591.
14. Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–1240. doi: 10.1093/bioinformatics/btz682.
15. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–D270. doi: 10.1093/nar/gkh061.
16. Foksinska A., Crowder C.M., Crouse A.B., Henrikson J., Byrd W.E., Rosenblatt G., Patton M.J., He K., Tran-Nguyen T.K., Zheng M., et al. The precision medicine process for treating rare disease using the artificial intelligence tool mediKanren. Front. Artif. Intell. 2022;5 doi: 10.3389/frai.2022.910216.
17. Monteiro O., Chen C., Bingham R., Argyrou A., Buxton R., Pancevac Jönsson C., Jones E., Bridges A., Gatfield K., Krauß S., et al. Pharmacological disruption of the MID1/α4 interaction reduces mutant Huntingtin levels in primary neuronal cultures. Neurosci. Lett. 2018;673:44–50. doi: 10.1016/j.neulet.2018.02.061.
18. Hu Y., Flockhart I., Vinayagam A., Bergwitz C., Berger B., Perrimon N., Mohr S.E. An integrative approach to ortholog prediction for disease-focused and other functional studies. BMC Bioinf. 2011;12:357. doi: 10.1186/1471-2105-12-357.
19. Guerrero-Esteo M., Lastres P., Letamendía A., Pérez-Alvarez M.J., Langa C., López L.A., Fabra A., García-Pardo A., Vera S., Letarte M., Bernabéu C. Endoglin overexpression modulates cellular morphology, migration, and adhesion of mouse fibroblasts. Eur. J. Cell Biol. 1999;78:614–623. doi: 10.1016/S0171-9335(99)80046-6.
20. Lambrechts R.A., Schepers H., Yu Y., van der Zwaag M., Autio K.J., Vieira-Lara M.A., Bakker B.M., Tijssen M.A., Hayflick S.J., Grzeschik N.A., Sibon O.C. CoA-dependent activation of mitochondrial acyl carrier protein links four neurodegenerative diseases. EMBO Mol. Med. 2019;11 doi: 10.15252/emmm.201910488.
21. Ramberg H., Richardsen E., de Souza G.A., Rakaee M., Stensland M.E., Braadland P.R., Nygård S., Ögren O., Guldvik I.J., Berge V., et al. Proteomic analyses identify major vault protein as a prognostic biomarker for fatal prostate cancer. Carcinogenesis. 2021;42:685–693. doi: 10.1093/carcin/bgab015.
22. Orr H.T., Zoghbi H.Y. SCA1 molecular genetics: a history of a 13 year collaboration against glutamines. Hum. Mol. Genet. 2001;10:2307–2311. doi: 10.1093/hmg/10.20.2307.
23. Alonso A.C., Grundke-Iqbal I., Iqbal K. Alzheimer’s disease hyperphosphorylated tau sequesters normal tau into tangles of filaments and disassembles microtubules. Nat. Med. 1996;2:783–787. doi: 10.1038/nm0796-783.
24. Rademakers R., Cruts M., van Broeckhoven C. The role of tau (MAPT) in frontotemporal dementia and related tauopathies. Hum. Mutat. 2004;24:277–295. doi: 10.1002/humu.20086.
25. Lei P., Ayton S., Finkelstein D.I., Spoerri L., Ciccotosto G.D., Wright D.K., Wong B.X.W., Adlard P.A., Cherny R.A., Lam L.Q., et al. Tau deficiency induces parkinsonism with dementia by impairing APP-mediated iron export. Nat. Med. 2012;18:291–295. doi: 10.1038/nm.2613.
26. Liu E., Knutzen C.A., Krauss S., Schweiger S., Chiang G.G. Control of mTORC1 signaling by the Opitz syndrome protein MID1. Proc. Natl. Acad. Sci. USA. 2011;108:8680–8685. doi: 10.1073/pnas.1100131108.
27. Roscic A., Baldo B., Crochemore C., Marcellin D., Paganetti P. Induction of autophagy with catalytic mTOR inhibitors reduces huntingtin aggregates in a neuronal cell model. J. Neurochem. 2011;119:398–407. doi: 10.1111/j.1471-4159.2011.07435.x.
28. Lee W.-S., Al-Ramahi I., Jeong H.-H., Jang Y., Lin T., Adamski C.J., Lavery L.A., Rath S., Richman R., Bondar V.V., et al. Cross-species genetic screens identify transglutaminase 5 as a regulator of polyglutamine-expanded ataxin-1. J. Clin. Invest. 2022;132 doi: 10.1172/JCI156616.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S1–S3 and supplemental methods

mmc1.pdf^{(698.2KB, pdf)}

Data S1. Tables S1–S19

mmc2.xlsx^{(1.1MB, xlsx)}

Document S2. Article plus supplemental information

mmc3.pdf^{(3.6MB, pdf)}

Data Availability Statement

•
The in vitro modifier screens are intellectual property of the Huda Zoghbi lab. Information on the genes in these screens is available upon reasonable request. Modifier extractions from an earlier version of PARMESAN for neurodegenerative disease genes are available at the Neurodegeneration Hub (https://nddb.nrihub.org/).
•
Upon publication, PARMESAN will be available at parmesan.nrihub.org as a searchable web interface. Its extracted and predicted relationships can be queried for on the site or downloaded in bulk. Additionally, the code used to download the needed mapping and manually curated databases, run PARMESAN, and compare its predicted relationships to DGIdb and Reactome is available in a public GitHub repository, at https://github.com/coledeisseroth/PARMESAN. The code used to make predictions using SemMedDB will be kept in a separate repository at https://github.com/coledeisseroth/PARMESAN_SemMedDB, which will require the user to separately download the SemMedDB data and the UMLS Metathesaurus.
