TrendyGenes, a computational pipeline for the detection of literature trends in academia and drug discovery

Guillermo Serrano Nájera; David Narganes Carlón; Daniel J Crowther

doi:10.1038/s41598-021-94897-9

. 2021 Aug 3;11:15747. doi: 10.1038/s41598-021-94897-9

PMCID: PMC8333311 PMID: 34344904

Abstract

Target identification and prioritisation are prominent first steps in modern drug discovery. Traditionally, individual scientists have used their expertise to manually interpret scientific literature and prioritise opportunities. However, increasing publication rates and the wider routine coverage of human genes by omic-scale research make it difficult to maintain meaningful overviews from which to identify promising new trends. Here we propose an automated yet flexible pipeline that identifies trends in the scientific corpus which align with the specific interests of a researcher and facilitate an initial prioritisation of opportunities. Using a procedure based on co-citation networks and machine learning, genes and diseases are first parsed from PubMed articles using a novel named entity recognition system together with publication date and supporting information. Then recurrent neural networks are trained to predict the publication dynamics of all human genes. For a user-defined therapeutic focus, genes generating more publications or citations are identified as high-interest targets. We also used topic detection routines to help understand why a gene is trendy and implement a system to propose the most prominent review articles for a potential target. This TrendyGenes pipeline detects emerging targets and pathways and provides a new way to explore the literature for individual researchers, pharmaceutical companies and funding agencies.

Subject terms: Target identification, Data mining, Literature mining

Introduction

Pharmaceutical companies are actively looking for ways to reduce their attrition rates, the time taken for drug development, and the associated development costs^1–4. One approach being explored to address this productivity challenge is the exploitation of big biomedical data sets through machine learning^5,6. Evidence is emerging that machine learning can be used to speed up and reduce the costs of all stages of drug discovery^5,6: drug repurposing^7,8, clinical trials^9,10, de-novo drug design^11–20, and target-disease associations^21–25. However, target identification and prioritisation remain the first step for the majority of drug discovery programmes^25–28. Only 10% of drug targets progress through clinical trials^28–30 and this success rate appears lower for novel targets^30–32. Historically, target identification has been broadly carried out on a case-by-case basis, based on the scientific interpretation of the available literature. However, thousands of peer-reviewed articles are published every day, even before pre-prints, patent data, and clinical trial reports are taken into account³³. PubMed alone contained more than 30 million publications as of 2020, and the scientific output doubles every nine years³⁴, creating a corpus of "undiscovered public knowledge"³⁵. Thus, there is a high demand for machine learning and other computational methods that exploit current knowledge and help maintain an overview of this overwhelming volume of literature. The development of (i) alert systems to identify and rank emerging targets at genomic scale and (ii) recommendation systems to prioritise detailed reading of scientific reviews is of importance for both pharmaceutical companies and the whole scientific community^25,27,36.

One of the most significant obstacles for the automatic analysis of biomedical literature is the use of non-redundant alternative gene synonyms, symbols, and acronyms from competing sources that can have other meanings in different areas of research³⁷. Therefore, it is imperative to disambiguate biomedical entities in the scientific literature at the outset. There have been several attempts in this line of research^{21–24,37–46}. However, these attempts do not unambiguously map gene and disease entities in scientific literature to controlled ontologies, nor do they define an ambiguity measure for gene and disease synonyms. Although there have been multiple attempts at trend detection and burst term detection^48,49, and more concretely at analysing the biomedical literature of targets and small molecules^50–52, to our knowledge this is the first attempt to analyse emerging trends for human protein-coding genes.

Here we propose a new disambiguation algorithm based on co-citation networks and natural language processing to obtain accurate publication dynamics for every protein-coding gene in the human genome. This time-series data was used to train recurrent neural networks (RNN) on historical data and predict the state of the literature in recent years. We identify which genes are being mentioned in the literature more than expected in order to highlight and rank potential targets. This genome-scale ranking is not sufficient on its own for target assessment, since it does not include an assessment of tractability, commercial opportunity or clinical translatability, but the identification of emerging biology is a key component of novel target identification. When the actual number of published articles exceeds predictions, there may have been a paradigm shift for that particular gene. Finally, we implemented topic detection algorithms along with recommendation systems to validate trendy targets. Therefore, the aims of this paper are fourfold: (i) to unambiguously detect genes and diseases within articles with a novel named entity recogniser, (ii) to generate a ranking of genes and diseases based on a novel metric that defines their trendiness, (iii) to generate an automatic pipeline to analyse why these biological entities may be trendy, and (iv) to generate a recommendation system that suggests which articles to read in order to maximise the information coverage in subnetworks.

Results

Gene annotation

We gathered the human gene synonyms from different sources (Ensembl, UniProt, HGNC, Entrez and OpenTargets; Fig. 1B) to sample the potential publications mentioning human gene names. Human genes had around 10 synonyms on average and many of those synonyms were ambiguous (Table 1): more than 30% of gene symbols had at least one promiscuous synonym, around 10% of the gene symbols were unsafe and had at least one gene synonym in the English dictionary, and almost 50% of gene symbols had a nested synonym. Combining these problems, almost 60% of the 19,082 gene symbols had one or more of these four types of ambiguity. To determine which synonyms are potentially ambiguous (“unsafe gene synonyms”; Fig. 1C) we engineered features that characterise unsafe synonyms (e.g. longer gene names are less likely to be ambiguous; Table 2). Next, we used a positive-unlabelled (PU) bagging strategy, following the implementation of Mordelet et al.⁵⁵, with a random forest classifier trained on the engineered features to calculate the probability that a gene synonym is “unsafe” (see Methods).

Workflow. Chart summarising the process from the downloading of the data to the detection and analysis of trends in the literature. (A) Creation of a graph database with the information contained in PubMed baseline 2020. (B) Acquisition of a comprehensive collection of human coding gene names and synonyms. (C) Automatic determination of potential ambiguous (unsafe) gene names. (D) Annotation of the graph database with unambiguous gene symbols by combining co-citation network topology and binary classifiers. (E) Prediction of per-gene publication trends using RNN. When a gene has significantly more publications or citations than expected by the model it is considered to be trendy. (F) Automatic topic detection of collections of publications. We used this algorithm to quantify the evolution of topics in trendy gene publications over time. (G) A review recommender system that uses information from the citation network and topic detection to recommend the most efficient set of reviews to explore the literature.

Table 1.

Gene synonyms are ambiguous.

Type of synonym	Total counts	Percentage of the total number of synonyms
Nested	18,845	10.16
Promiscuous	11,744	6.32
English	1,247	0.67
Manually discarded	58	0.03
Unsafe	24,491	13.20

Manually discarded synonyms were labelled as unsafe during the unsafe gene synonym detection in an active learning fashion (see Methods). Unsafe aggregates the data from all the other categories. Data for 19,082 gene symbols and 185,549 gene synonyms. The total counts represent the number of individual synonyms when grouped by gene symbol and gene synonym. Promiscuous synonyms are counted as many times as they act as a synonym.

Table 2.

Unsafe features.

Variable	Meaning
Total	Number of total PubMed ID candidates retrieved in ElasticSearch when querying for all gene synonyms for a given gene symbol
Contribution	The percentage of PubMed IDs that a given gene synonym contributes to the total for a particular gene symbol
Number of characters	The length of the gene synonym in characters
Bits	The sum of the bits of information of every character in a gene synonym based on the frequencies of each character in PubMed’s corpus of titles and abstracts
Number of nested	The number of other gene synonyms that contain the gene synonym. For example: “Insulin” is part of “Insulin Receptor”
Prob. of the synonym given an alternative	The conditional probability of finding the gene synonym given that an alternative synonym for the same gene symbol also appears in the text
Prob. of an alternative given the synonym	The conditional probability of finding alternative gene synonyms given that the synonym appears in the text
Is gene symbol	Whether the synonym is also an accepted gene symbol

Engineered features used to evaluate the probability that a given gene synonym is ambiguous (unsafe).
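Two of the features in Table 2 can be illustrated with a short sketch. The code below is not the authors' implementation: the function names, the toy corpus, the lower-casing and the treatment of unseen characters are all assumptions made for illustration. It computes the "Bits" feature as the sum of the per-character information content, taking the character frequencies from a corpus, and the "Number of nested" feature as the number of other synonyms that contain a given synonym.

```python
import math
from collections import Counter


def char_bits(corpus_texts):
    """Bits of information per character, -log2(frequency), from a corpus."""
    counts = Counter(ch for text in corpus_texts for ch in text.lower())
    total = sum(counts.values())
    return {ch: -math.log2(n / total) for ch, n in counts.items()}


def synonym_bits(synonym, bits_per_char):
    """Sum the per-character information content of a synonym."""
    # Assumption: characters absent from the corpus get the largest observed value.
    unseen_bits = max(bits_per_char.values())
    return sum(bits_per_char.get(ch, unseen_bits) for ch in synonym.lower())


def number_of_nested(synonym, all_synonyms):
    """Count the other synonyms that contain this synonym as a substring."""
    s = synonym.lower()
    return sum(1 for other in all_synonyms
               if other.lower() != s and s in other.lower())


# Hypothetical toy corpus standing in for PubMed titles and abstracts.
corpus = ["insulin receptor signalling in the liver",
          "the orca is a toothed whale"]
bits = char_bits(corpus)
synonyms = ["Insulin", "Insulin Receptor", "INSR", "ORCA"]

for syn in synonyms:
    print(syn, len(syn), round(synonym_bits(syn, bits), 2),
          number_of_nested(syn, synonyms))
```

Under these assumptions, longer synonyms accumulate more bits, which is in line with the observation that longer gene names are less likely to be ambiguous.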

To link every human gene to a subset of publications we implemented a disambiguation pipeline based on co-citation networks and machine learning (Fig. 1D). We gathered the titles, abstracts and keywords of the publications that had a match for any of the synonyms using regex with ElasticSearch (Fig. 1D). Nevertheless, this original set of publications potentially contains false positives: publications that contain an ambiguous gene synonym in their titles or abstracts, that do not refer to the gene of interest.

We assumed that true-positive and false-positive synonyms will tend to belong to different communities of publications from different research fields. To detect these communities we used co-citation networks (Fig. 1D): weighted graphs where the weight of an edge represents how often two publications are cited simultaneously (co-cited) by a third publication. When two publications are repeatedly co-cited, it strongly suggests that both belong to the same field of study⁵⁶. We used the fast greedy modularity optimisation algorithm from iGraph to determine communities in the co-citation network and distinguished communities of publications focusing on the gene of interest by detecting the presence of “safe gene synonyms” in their titles and abstracts (Fig. 1D). The process is summarised in Fig. 2.
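The construction of co-citation weights can be sketched as follows. This is a minimal illustration with hypothetical identifiers and a made-up function name, not the pipeline's code: every pair of publications that appears together in the reference list of a citing paper has its edge weight increased by one.

```python
from collections import Counter
from itertools import combinations


def cocitation_weights(reference_lists):
    """Map each unordered pair of cited papers to the number of papers citing both."""
    weights = Counter()
    for cited in reference_lists.values():
        # Sorting gives every pair a canonical order, so (a, b) and (b, a) coincide.
        for a, b in combinations(sorted(set(cited)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical citing paper -> cited PubMed IDs (toy identifiers).
reference_lists = {
    "R1": ["P1", "P2", "P3"],
    "R2": ["P1", "P2"],
    "R3": ["P4", "P5"],
}

weights = cocitation_weights(reference_lists)
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

In this toy example P1 and P2 are co-cited twice, while P4 and P5 share no co-citations with the other publications, so a community detection routine run on the resulting weighted graph would be expected to place them in a separate community.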

Disambiguation pipeline. **(A)** Citation network for a subset of PubMed IDs mentioning any of the gene synonyms of the gene symbol LRWD1, including ORCA. **(B)** Co-citation network of the same subset of PubMed IDs as in **(A)**. **(C)** Communities for the co-citation graph obtained after using iGraph's fast greedy algorithm: killer whale community, orca plant cluster, LRWD1 in Drosophila and LRWD1 in heterochromatin. **(D)** Number of safe synonyms per PubMed ID in title or abstract in the same co-citation network. **(E)** Citation network with reviews citing any of the PubMed IDs. **(F)** Review information as defined by the recommender system scaled from 0 to 1.

Finally, because we only used citations from open-access publications contained in PubMed Central (PMC)⁵⁷, 46% of the publications were disconnected in the PubMed co-citation graph. To tackle this problem, we again used the inductive positive-unlabelled bagging approach to train multiple classifiers that associate the disconnected publications with the previously computed co-citation network components (Fig. 1C), using the words, phrases and n-grams of length one to four contained in titles and abstracts. All available machine learning classifiers in Scikit Learn were used, but logistic regression was selected due to its speed to accuracy ratio (Table 3).

Table 3.

Classifier comparison.

Classifier	Accuracy	Average precision	Brier loss	F1	Log loss	Precision	Recall	AUC	Time (s)
ETC	0.95	0.93	0.05	0.95	1.71	0.95	0.95	0.95	1.35
GPC	0.88	0.85	0.12	0.87	4.25	0.89	0.88	0.88	6.12
KNC	0.86	0.84	0.14	0.86	4.74	0.89	0.86	0.86	2.22
LOG	0.93	0.91	0.07	0.93	2.36	0.94	0.93	0.93	0.54
MLP	0.92	0.89	0.08	0.92	2.85	0.91	0.92	0.92	1.27
RDC	0.86	0.83	0.14	0.86	4.74	0.87	0.86	0.86	0.22
RFC	0.95	0.93	0.05	0.95	1.81	0.95	0.95	0.95	1.26
SVC	0.94	0.92	0.06	0.94	2.14	0.94	0.93	0.94	1.96

Performance metrics for the 8 classifiers (Extra Trees Classifier, ETC; Gaussian Process Classifier, GPC; K-Nearest Neighbours Classifier, KNC; Logistic Regression, LOG; MultiLayer Perceptron Classifier, MLP; Ridge Classifier, RDC; Random Forest Classifier, RFC; and Support Vector Machine classifier, SVC; in descending order) used for the disambiguation in “Topic detection” for a random sample of 2000 genes. The metrics shown in this table were obtained by averaging the results on the validation set during the threefold cross-validation. Subsequently, the results were averaged for a sample of 2000 genes. The logistic regression classifier was the fastest and second most accurate model for a random sample of 2000 genes and therefore it was selected as the default model to run the disambiguation on the remaining 17,082 human protein-coding genes. This high validation score verified that there was no over-fitting after the threefold cross-validation.

To test the performance of the disambiguation pipeline we compared the disambiguation results with the gene-publication annotations from GeneRIF⁵⁸ (manually curated annotations), DISEASES⁵⁹ (computational annotations), and UniProt⁶⁰ (computational and manually curated annotations) (Table 4). On average, the disambiguation recovers > 85% of all publications contained in these databases. Publications annotated in GeneRIF and UniProt do not necessarily contain a gene synonym in the title or abstract; such publications are therefore outside the scope of our pipeline. The disambiguation results show an average precision of about 70% against UniProt, the only collection of disambiguated publications of a similar magnitude. Finally, we included the disambiguated gene-publication annotations in the graph database.

Table 4.

Comparison of disambiguation methods.

	Recall	Precision	Total annotations
UniProt	0.86	0.71	10,329,240
DISEASES	0.90	0.14	1,140,129
GeneRIF	0.86	0.11	726,532
Ours	-	-	9,658,406

Average recall and precision of our disambiguation when compared against other databases. The low precision values for DISEASES and GeneRIF are due to the smaller size of these databases.
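The effect of reference-set size on these metrics can be seen in a small sketch. The function name and the toy identifiers are hypothetical; the code only illustrates the standard set-based definitions of precision and recall for one gene.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted PubMed IDs against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotations for a single gene.
ours = {"P1", "P2", "P3", "P4", "P5"}
small_reference = {"P1", "P2"}

precision, recall = precision_recall(ours, small_reference)
print(precision, recall)
```

Here every reference publication is recovered (recall of 1.0), yet precision is only 0.4 because the reference set is much smaller than the predicted set, even if the three extra predictions happen to be correct.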

Trend detection

To detect incoming trends in the literature we gathered the publication dynamics of a given human gene from the disambiguated graph database (Fig. 1E). These time series include the number of publications, clinical trials, reviews and publications from big and medium-sized pharmaceutical companies, as well as the citations made by publications in each of these categories, per calendar year. Specifically, if a manuscript with author affiliations to big pharma cites other publications, these citations are categorised as big pharma citations. Conversely, citations of this manuscript by other publications are not categorised as big pharma citations, unless the citing authors are themselves affiliated to big pharma.

Time-series data from 1980 to 2013 was used to predict the per gene publication dynamics in each category between 2014 and 2019 using a Recurrent Neural Network model with an encoder-decoder architecture preceded by an attention layer, where both the encoder and decoder are composed of five hidden layers of Gated Recurrent Units (GRU). The time-series were created in a cumulative fashion, where each year contains the new publications and citations in addition to the previous ones.
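As a rough illustration of the cumulative construction, the sketch below turns per-year counts into a running total and splits it into a training window and a prediction window. The function names and the toy counts are hypothetical, and a short range of years is used instead of 1980 to 2019 to keep the example small.

```python
from itertools import accumulate


def cumulative_series(yearly_counts, first_year, last_year):
    """Running total of new items per calendar year; missing years count as zero."""
    years = range(first_year, last_year + 1)
    running_totals = accumulate(yearly_counts.get(year, 0) for year in years)
    return dict(zip(years, running_totals))


def split_series(series, last_training_year):
    """Split a series into the training years and the years to be predicted."""
    train = {y: v for y, v in series.items() if y <= last_training_year}
    target = {y: v for y, v in series.items() if y > last_training_year}
    return train, target


# Hypothetical number of new publications per year for one gene.
new_publications = {2010: 3, 2012: 1, 2013: 4, 2014: 6, 2016: 2}

series = cumulative_series(new_publications, 2010, 2016)
train, target = split_series(series, 2013)
print(train)
print(target)
```

A cumulative series never decreases, so years without new publications simply repeat the previous value.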

For most genes, the model produces accurate predictions of the publication dynamics (Table 5), but for a small subset of genes the real number of publications or citations is significantly higher than expected (Fig. 3A). When the number of publications or citations exceeds the predictions, we interpret that the publication dynamics changed substantially in a way that cannot be explained simply by the gene’s publication history, implying that a meaningful discovery in the field has recently occurred (Fig. 3A; orange). Trendiness is defined as the probability of the fold-change between the predicted and real number of publications and citations for a given gene. We used this metric to identify the trendiest genes in the academic community (using all publications) or in the pharmaceutical industry (using publications coming from pharmaceutical companies) (Table S1, supplementary material).
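One possible reading of this metric can be sketched in a few lines. The code is an assumption-laden illustration rather than the published definition: it uses the log2 ratio of real to predicted counts, so that positive values mean more publications than expected, adds a pseudocount to avoid division by zero, and approximates the probability by the empirical fraction of genes with a smaller fold-change. The gene names and counts are hypothetical.

```python
import math


def log2_fold_change(real, predicted, pseudocount=1.0):
    """log2 ratio of real to predicted counts; the pseudocount is an assumption."""
    return math.log2((real + pseudocount) / (predicted + pseudocount))


def empirical_rank(fold_changes):
    """Fraction of genes with a strictly smaller fold-change, per gene."""
    values = list(fold_changes.values())
    return {gene: sum(v < fc for v in values) / len(values)
            for gene, fc in fold_changes.items()}


# Hypothetical (real, predicted) publication counts.
counts = {"GENE_A": (120, 40), "GENE_B": (50, 50), "GENE_C": (30, 45)}

fold = {gene: log2_fold_change(real, pred) for gene, (real, pred) in counts.items()}
rank = empirical_rank(fold)
for gene in sorted(rank, key=rank.get, reverse=True):
    print(gene, round(fold[gene], 2), round(rank[gene], 2))
```

With this sign convention, a gene whose real count matches the prediction has a fold-change of zero, and the genes at the top of the ranking are those that most exceed their predictions.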

Table 5.

Performance of the predictions.

Variables	MASE	Percentage of error	RMSE	Total (2013)
CIT. BIG PHARMA	0.42	12.51	10.60	1.86E+06
CIT. MED. PHARMA	0.50	14.90	5.20	6.59E+05
CIT. REVIEWS	0.30	3.35	45.60	2.38E+07
CIT. TRIALS	0.45	6.82	5.60	3.70E+06
CITATIONS	0.26	2.66	198.20	1.27E+08
PUB. BIG PHARMA	0.58	21.73	0.00	4.43E+05
PUB. MED. PHARMA	0.63	23.64	0.00	4.31E+05
PUBLICATIONS	0.33	8.66	32.60	9.48E+06
REVIEWS	0.52	13.37	2.40	9.07E+05
TRIALS	0.61	13.33	0.00	5.68E+05

Performance of the predictions of the publication dynamics. The model predicts the publication dynamics per gene between 2014 and 2019 using data from 1980 to 2013. Numbers represent the median across 13,380 human genes.

MASE, mean absolute scaled error; RMSE, root mean square error; Total, number of elements in the database up to 2013.
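For reference, the two error metrics can be computed as in the sketch below. It assumes the conventional definition of MASE, in which the mean absolute error of the forecast is divided by the in-sample mean absolute error of a naive one-step forecast; whether the paper scaled the errors in exactly this way is not stated here, and the toy series is hypothetical.

```python
import math


def mase(train, actual, forecast):
    """Forecast MAE scaled by the in-sample MAE of a naive one-step forecast."""
    naive_errors = [abs(b - a) for a, b in zip(train, train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant, MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def rmse(actual, forecast):
    """Root mean square error between actual and forecast values."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Hypothetical cumulative publication counts for one gene.
train = [10, 12, 15, 19]
actual = [24, 30]
forecast = [23, 32]

print(mase(train, actual, forecast))
print(round(rmse(actual, forecast), 3))
```

A MASE below 1 means the forecast is more accurate, on average, than simply repeating the previous year's value within the training window.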

Trend detection and gene–gene-disease co-occurrence. **(A)** Logarithmic scatter plots showing the predicted number of publications, reviews, citations and citations from big pharma companies against real data in the year 2018. **(B)** Trendiness (log2(predicted/real)) for genes associated with groups of diseases (MeSH parent categories). Left; Average trendiness of publications, reviews, citations and citations from reviews. Right; Average trendiness of citations coming from big and medium sized pharmaceutical companies. **(C)** Gene–Gene–Disease co-occurrence network of the first neighbours of CD274. Orange nodes are diseases, grey nodes are genes and the size of gene nodes represents the trendiness. The grey edges are gene-disease associations, the blue edges are gene-diseases with the width of the edges reflecting the number of co-occurrences.

Finally, to identify trendy genes of pharmaceutical interest, we computed the normalised mutual information of genes and diseases in the titles and abstracts of publications (Fig. 3B). Disease names and their synonyms were obtained from the Medical Subject Headings (MeSH) ontology at BioPortal⁶¹. The MeSH ontology contains 4818 different disease nodes at different levels of the ontology. We created a dictionary for each disease with the preferred and alternative names (see Methods). The diseases were disambiguated in titles and abstracts using the same disambiguation pipeline used for the genes.
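A minimal sketch of such an association score is given below. It treats the presence of a gene and of a disease in a publication as two binary variables and normalises their mutual information by the geometric mean of the two entropies. This normalisation is one of several in common use and is an assumption here, as are the function names and the counts.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def normalised_mutual_information(n_total, n_gene, n_disease, n_both):
    """NMI between gene presence and disease presence across publications."""
    joint = [
        n_both / n_total,                                  # gene and disease
        (n_gene - n_both) / n_total,                       # gene only
        (n_disease - n_both) / n_total,                    # disease only
        (n_total - n_gene - n_disease + n_both) / n_total  # neither
    ]
    p_gene, p_disease = n_gene / n_total, n_disease / n_total
    h_gene = entropy([p_gene, 1 - p_gene])
    h_disease = entropy([p_disease, 1 - p_disease])
    mutual_information = h_gene + h_disease - entropy(joint)
    denominator = math.sqrt(h_gene * h_disease)
    return mutual_information / denominator if denominator > 0 else 0.0


# Hypothetical counts over 100 publications.
always_together = normalised_mutual_information(100, 20, 20, 20)
independent = normalised_mutual_information(100, 20, 50, 10)
print(round(always_together, 3), round(independent, 3))
```

Under this normalisation the score ranges from 0, when gene and disease mentions are statistically independent, to 1, when they always appear together.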

We noticed that many trendy genes cluster together, forming trendy pathways, in the gene–gene and gene-disease association networks (Fig. 3C). We used enrichment of gene ontology (GO) terms for biological processes to uncover common pathways among the top 100 trendiest genes (Table S1, supplementary material). Among the most enriched GO terms in both academia and pharma are T cell co-stimulation, execution phase of necroptosis and pyroptosis. The enrichment of these biological processes among trendy genes presumably reflects that these fields of study are generating the most innovation and expectations in current biomedical research.

Topic detection

After the detection of gene trends, the next step was to understand why those genes might be trendy and to curate possible mistakes in the disambiguation. With this aim we implemented a topic detection pipeline as an automatic, fast discovery tool to study groups of publications that mention the gene of interest (Fig. 1F). In this context, we used topic modelling algorithms. A topic is a collection of similar words, specific to a group of documents⁶². We used non-negative matrix factorisation to generate a set of latent topics for each query (Fig. 4A; word clouds).

Topic timelines. Topic timelines for publications mentioning any of the genes for the immune checkpoint inhibitor **(A)**, necroptosis **(B)** and pyroptosis **(C)** pathways. The x-axis represents the time in years and the y-axis represents the likelihood of a given topic. Colours represent different topics defined by the keywords contained in the corresponding word clouds. The four latent topics were obtained by applying non-negative matrix factorisation to all publications annotated with the genes after disambiguation. Word clouds were created using the phrases with the highest TF-IDF for groups of publications belonging to each topic. All timelines show at least one rising topic after 2013 that represents the reason why these genes became trendy, namely their implications in human disease: immune checkpoint inhibitors and monoclonal antibodies (yellow and orange in A), activation of necroptosis (orange in B), agonists of STING1 in cancer (black in C).

We explored the evolution of the topics associated with some of the trendiest genes. For the immune checkpoint inhibitors (CD274, PDCD1, TIGIT and CTLA4) the topic timeline suggests that there has been a rapid decrease in the likelihood of publications discussing the biological role of these immune checkpoint inhibitors since 2010 (Fig. 4A in grey), which coincides with a notable increase in topics that discuss cancer therapies (Fig. 4A in orange) and monoclonal antibodies that target these four different transmembrane immunoglobulins (Fig. 4A in yellow). In this way, the topic-detection pipeline is able to capture the evolution of the research from its biological description to the clinical application.

The topic timeline of the members of the necroptosis pathway (RIPK1, RIPK3 and MLKL; Fig. 4B) suggests that in the last decade there has been a decrease in the likelihood of publications discussing these genes in the context of apoptosis (Fig. 4B in grey), in favour of publications that discuss the newly discovered form of cell death, the necroptotic pathway (Fig. 4B in orange), as well as the translational medicine perspective of this pathway, as suggested by words like mouse, treatment, activity or cancer (Fig. 4B in blue).

Finally, the topic timeline of the members of the pyroptosis pathway (CGAS, TMEM173, GSDMA and GSDMD; Fig. 4C) shows a fast increase from 2013 in publications discussing the therapeutic opportunity in cancer immunotherapy with agonists for TMEM173 (Fig. 4C in grey), while, again, the remaining topics seem to contain information on the biochemistry and biological role of the genes.

Recommender system

In addition to the automatic topic detection, we designed a review recommender system to accelerate the screening of the publications that cover most of the information in a network (Fig. 1G). There are an average of 2.9 reviews citing any publication that mentions at least one gene name. The aim was to minimise the reading time while maximising the information covered within a gene subnetwork. The algorithm aggregates both topic and network information from the citation subgraph of the publications that mention the gene of interest to obtain the most query-centric reviews. The topic information comes from the latent topics obtained from the topic detection algorithm. The network information is captured by the PageRank scores of the subgraph (see Methods). This approach ensures that reviews citing publications with the highest PageRank scores are prioritised. To further minimise the number of reviews for initial human analysis, we avoid repetition of information by simultaneously maximising the cumulative PageRank score whilst minimising the overlap of combined citations. In this way, we expect to obtain a small set of reviews that covers the main topics and publications in the field. We used this recommender system to select the optimal subset of reviews to assess why genes might be trendy (see Discussion). An example of the output for the genes covered in the Discussion can be found in the supplementary data file.
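The trade-off between cumulative PageRank and citation overlap can be sketched with a greedy selection. This is an illustrative stand-in, not the published algorithm: it ignores the topic information, assumes that PageRank scores for the cited publications are already available, and at each step picks the review whose not-yet-covered citations carry the most PageRank. All identifiers, scores and names are hypothetical.

```python
def select_reviews(review_citations, pagerank, max_reviews):
    """Greedily pick reviews that add the most not-yet-covered PageRank."""
    covered, chosen = set(), []
    candidates = dict(review_citations)
    for _ in range(max_reviews):
        best, best_gain = None, 0.0
        for review, cited in sorted(candidates.items()):
            # Only publications not covered by earlier picks contribute,
            # which penalises reviews that overlap with the chosen set.
            gain = sum(pagerank.get(p, 0.0) for p in set(cited) - covered)
            if gain > best_gain:
                best, best_gain = review, gain
        if best is None:  # no remaining review adds new information
            break
        chosen.append(best)
        covered |= set(candidates.pop(best))
    return chosen, covered


# Hypothetical PageRank scores and review reference lists.
pagerank = {"P1": 0.4, "P2": 0.3, "P3": 0.2, "P4": 0.1}
review_citations = {
    "RevA": ["P1", "P2"],
    "RevB": ["P1", "P2", "P3"],
    "RevC": ["P4"],
    "RevD": ["P3"],
}

chosen, covered = select_reviews(review_citations, pagerank, max_reviews=3)
print(chosen, sorted(covered))
```

In this toy example two reviews are enough to cover every publication, so the selection stops before reaching the requested maximum.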

Discussion and conclusions

We present TrendyGenes as a first attempt to (i) establish a systematic analysis of contemporary topics associated with human genes and diseases, (ii) develop an alert system for emerging targets and trends in the scientific literature across the human protein-coding genome, and (iii) use topic modelling to rapidly generate timelines of phrases that facilitate the understanding of why these genes are trendy.

We constructed a graph database containing PubMed data where publications are connected by citations and authors and are annotated with disambiguated human gene-names and diseases. We expect this new resource to provide new ways to navigate the scientific literature, detect and visualise networks of discussion and analyse networks of influence from key opinion leaders. Disambiguating author names from PubMed, MedRxiv, or BioRxiv would further improve the quality of the database. Machine or deep learning algorithms could be trained on already labelled data to improve on previously published approaches^63–65 and address this issue.

Similarly, further improvements in gene-name disambiguation would be of assistance. The precision and recall metrics on our validation set suffer for different reasons. GeneRIF and DISEASES include fewer publications than the genome-wide set identified by our pipeline, so many of our annotations are counted as potential “false positives”. This makes the precision of our approach appear lower than it may actually be. On the other hand, GeneRIF and UniProt contain publications which either are not gene-specific or do not mention the gene in the title or abstract.

However, the disambiguated genes and diseases can serve as labelled data for more sophisticated deep learning approaches to annotate biomedical entities. Gene and disease entities could be better annotated by combining representation learning, to capture the network topology, with transformer layers, to capture contextual information. Topic detection could be improved by using state-of-the-art deep learning text summarisers.

The number of publications per gene in aggregate is very predictable⁶⁶. However, occasionally genes present significantly more publications than expected, meaning that a recent breakthrough occurred which cannot be accounted for from the publication dynamics. In this study, we show that trendiness can identify emerging targets from the literature for rapid profiling at genome scale. We combined trendiness with gene-disease associations to prioritise potential drug targets: emergent genes that are associated with diseases but not yet included in pharmaceutical publications are worthy of being investigated as potential targets. We observe that trendy genes usually cluster into the same biological pathways (Fig. 3C for CD274, PDCD1, CTLA4 and TIGIT). Here, using topic modelling and the recommendation system, we identify the trendiest genes and pathways and discuss some case studies to exemplify our pipeline. We selected genes of pharmacological relevance by choosing genes with high trendiness in both academia and the pharmaceutical industry, a high association with disease and more than 100 publications. Reviews suggested by the recommender system for these genes are included in the supplementary material (whatReview2read.zip).

Immune checkpoint inhibitors: CTLA4, CD274, PDCD1, TIGIT

CTLA4, PDCD1 (PD-1), CD274 (PD-L1) and TIGIT are among the trendiest genes in academia and pharma in 2019 (Fig. 3A). The CTLA4, PDCD1, CD274 and TIGIT genes encode four different transmembrane immunoglobulins that act as co-inhibitory receptors: checkpoints or ‘brakes’ for the adaptive immune response that prevent T cells from exerting their functions^67,68. CTLA4 competes with its analogue CD28 for CD80 and CD86 to prevent a premature activation of T cells⁶⁸. The PDCD1-CD274 interaction counters the positive signals that may have already activated T effector cells⁶⁸. TIGIT interacts with CD155 to down-regulate natural killer cells and T lymphocytes⁶⁹. Cancer cells attempt to impair these checkpoints and currently there are 7 FDA-approved monoclonal antibodies that target three of these proteins (CTLA4: Ipilimumab⁷⁰; PDCD1: Nivolumab⁷¹, Pembrolizumab⁷², Cemiplimab⁷³; CD274: Atezolizumab⁷⁴, Avelumab⁷⁵) and multiple candidates targeting TIGIT (BGB-A1217⁷⁶, OMP-313M32⁷⁷, MTIG7192A⁷⁸, AB154⁷⁹). Moreover, James Allison and Tasuku Honjo received the Nobel Prize in Medicine in 2018 for their research on immune checkpoint inhibitors⁴⁷.

Neurodegeneration: TREM2 and C9orf72

C9orf72 encodes a guanine nucleotide exchange factor involved in endosomal trafficking and autophagy^80,81. Hexanucleotide repeat expansions in promoter or intronic regions of C9orf72 are some of the major causes of sporadic and familial forms of both amyotrophic lateral sclerosis and frontotemporal dementia⁸⁰. Antisense oligonucleotides are being used to impede the transcription of C9orf72^82–84, and the CRISPR–Cas9 system is being used to target the GGGGCC repeat in the DNA⁸⁵ or RNA^85,86.

The TREM2 gene encodes a transmembrane immunoglobulin receptor expressed in macrophages, osteoclasts, dendritic cells, and brain microglia^87,88. TREM2 variants have been associated with Nasu-Hakola disease^89,90, late-onset Alzheimer’s disease^91–94, frontotemporal dementia^95–100, amyotrophic lateral sclerosis^101,102 and Parkinson’s disease^101,103. TREM2 activates a pathway, through TYROBP/DAP12, that promotes inflammation^87,88 and the phagocytosis of cellular waste, remains of apoptotic cells, and pathogens^87,88. Currently, two independent groups have generated anti-TREM2 antibodies to stimulate microglia to remove amyloid plaques¹⁰⁴. Furthermore, the mAb generated by one of these groups, Alector, in collaboration with AbbVie, has entered Phase I clinical trials^105,106.

DNA sensing by cGAS–STING: cGAS, TMEM173, GSDMD, GSDMA

The cytosolic nucleic acid-sensing pathway leads to pyroptosis, a lytic pro-inflammatory type of cell death involved in antiviral, antibacterial, and anticancer responses¹⁰⁷. cGAS is a nucleotidyl-transferase that catalyses the production of cyclic GMP-AMP (cGAMP) upon the recognition of double-stranded DNA¹⁰⁷. TMEM173 (STING) binds to cGAMP and promotes the activation of both TBK1 and IRF3, increasing the transcription of genes encoding type I interferons¹⁰⁷. GSDMA and GSDMD are effector proteins that form pores in the plasma membrane to release proinflammatory interleukins like IL-1β and IL-18¹⁰⁸. The cGAS-STING pathway has been associated with multiple autoimmune and chronic inflammatory diseases like non-alcoholic fatty liver disease¹⁰⁹, systemic lupus erythematosus¹¹⁰, vascular and pulmonary syndrome¹¹¹, macular degeneration¹¹², Bloom syndrome¹¹³, Aicardi-Goutières syndrome¹¹⁴, cancer¹¹⁵, DNA damage¹¹⁶, neurodegeneration¹¹⁷ and beyond. Currently, there are ongoing clinical trials for TMEM173^118–120 and GSDMD¹²¹, although there are no reported trials for GSDMA or cGAS.

Necroptosis: RIPK1, RIPK3, and MLKL

RIPK1, RIPK3 and MLKL form part of the tumour necrosis factor-induced necroptosis pathway^122–124. This pathway has been associated with multiple pathologies: systemic inflammatory response syndrome^125,126, ulcerative colitis^127,128, psoriasis¹²⁸, rheumatoid arthritis¹²⁸, neurodegenerative diseases¹²⁹ and even cancer^130–132. TNFR1, FasL, TRAIL, and TLR can all activate RIPK1 to decide the cell’s fate: inflammation, apoptosis or necrosis¹³³. If caspase-8 is inhibited, RIPK1 and RIPK3 form the necrosome that subsequently phosphorylates MLKL¹³⁴. MLKL forms homo-trimers^135,136, migrates to the plasma membrane^135,136, binds to highly phosphorylated inositol phosphates¹³⁷, creates pores in the membrane¹³⁸ and disrupts cell integrity. The discovery of RIPK1 dates back to 1995¹³⁹. Since then, four inhibitor programs have progressed through human phase II safety trials^140–143. The first publication mentioning MLKL is more recent¹⁴⁴ and, despite its lack of kinase activity, pharmaceutical companies have cited its publications 60 times more often since 2013. Although there are no clinical trials yet, at least three distinct chemical inhibitors are known¹⁴⁵.

Mechanobiology: YAP1/WWTR1, PIEZO1 and PIEZO2

Cells use mechanical cues from their environment to guide behaviours such as proliferation and migration. Forces act as signals which are transduced to the nucleus, where they control gene expression¹⁴⁶. Mechanical forces are critical regulators of organ and tissue homeostasis, morphogenesis and regeneration, and are important aspects of diseases like cancer, metastasis, fibrosis and cardiac hypertrophy. YAP1/WWTR1 (TAZ) are transcriptional co-activators and mechanotransducers¹⁴⁷. YAP/TAZ is hyperactivated in cancers¹⁴⁸; its inhibition reduces atherogenesis¹⁴⁹ and fibrosis¹⁵⁰; it triggers pulmonary hypertension¹⁵¹; and it is necessary for epithelial regeneration in the intestine¹⁵². PIEZO1 and PIEZO2 are two mechano-sensitive cation channels that play a key role in cell number regulation^153,154 and migration¹⁵⁵, hearing¹⁵⁶, neural¹⁵⁷ and vascular¹⁵⁸ development, somatosensory functions¹⁵⁹, proprioception¹⁶⁰ and beyond. Piezo channels have recently been associated with multiple pathologies like arthrogryposis¹⁶¹, apnea¹⁶², congenital lymphatic dysplasia¹⁶³, hyperalgesia^164,165, malaria¹⁶⁶, pancreatitis¹⁶⁷, xerocytosis¹⁶⁸, Gordon syndrome, Marden-Walker Syndrome, and Distal Arthrogryposis Type 5¹⁶⁹. The discovery of mechanotransduction signalling pathways has received notable attention in recent years and may open the door to new therapeutic strategies to treat these diseases¹⁴⁷.

Trends in scientific literature are useful for pharmaceutical and biomedical companies. Moreover, this approach can offer funding agencies crucial information for prioritising projects and a new way to study research impact. Finally, individual researchers may benefit from a new methodology to explore the literature and from algorithms that make navigating an increasingly vast biomedical literature more efficient.

Material and methods

Terminology

Here we use the term gene symbol to mean the approved symbol for any of the 19,084 human protein-coding genes accepted by the HUGO Gene Nomenclature Committee. We refer to gene synonyms as any of the possible gene name variations by which the scientific community has ever referred to a given gene. Approved gene symbols are also included in the gene synonyms. For example, ‘EGFR’ is the approved gene symbol, whereas ‘EGFR’, ‘Epidermal Growth Factor Receptor’, ‘ERBB1’, ‘ErbB-1’, ‘c-erbB1’, ‘HER1’ and ‘ERBB’ are gene synonyms. We define promiscuous gene names as any gene synonym that is a synonym of more than one gene. This can include previous official gene symbols, since these will not have been expunged from the literature. An example is ‘ARP1’, which is a promiscuous gene synonym for the gene symbols ‘NR2F2’, ‘ACTR1A’, ‘ACTR1B’, ‘ANGPTL1’, ‘APOBEC2’, ‘ARFRP1’ and ‘PITX2’⁴⁷. Unsafe gene synonyms are gene synonyms that may have a different meaning in other areas of research or in another context, for instance in standard English. The ‘STAR’ gene symbol is unsafe, as opposed to its gene synonym ‘Steroidogenic Acute Regulatory Protein’; similarly, ‘CCP4’ is both a gene synonym and the name of a crystallography software package. The final type of synonym we distinguish is the nested gene synonym. These are gene synonyms that are part of another gene synonym. For instance, ‘insulin’ is a nested gene synonym of ‘insulin receptor’, and ‘TNF’ is a nested gene synonym of ‘TNF Receptor Superfamily Member 1A’ (gene symbol ‘TNFRSF1A’) and ‘TNF Receptor Associated Factor 2’ (gene symbol ‘TRAF2’).
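The promiscuous and nested categories defined above can be illustrated with a minimal Python sketch. This is not the pipeline's named entity recognition system: the function names, the toy synonym table and the whole-word (token) matching rule for nesting are assumptions made for illustration only. Detecting unsafe synonyms would additionally require an external vocabulary (for example, a list of common English words), which is not sketched here.

```python
from collections import defaultdict

# Toy synonym table (hypothetical subset; approved symbols are listed among
# their own synonyms, as in the terminology above).
TOY_SYNONYMS = {
    "NR2F2": ["NR2F2", "ARP1"],
    "ACTR1A": ["ACTR1A", "ARP1"],
    "INS": ["INS", "insulin"],
    "INSR": ["INSR", "insulin receptor"],
    "TNF": ["TNF"],
    "TNFRSF1A": ["TNFRSF1A", "TNF Receptor Superfamily Member 1A"],
    "TRAF2": ["TRAF2", "TNF Receptor Associated Factor 2"],
}


def find_promiscuous(synonyms_by_gene):
    """Return {synonym: gene symbols} for synonyms shared by more than one gene."""
    genes_by_synonym = defaultdict(set)
    for symbol, synonyms in synonyms_by_gene.items():
        for synonym in synonyms:
            # Case-insensitive comparison is an assumption of this sketch.
            genes_by_synonym[synonym.lower()].add(symbol)
    return {s: g for s, g in genes_by_synonym.items() if len(g) > 1}


def find_nested(synonyms_by_gene):
    """Return {synonym: longer synonyms containing it as a run of whole words}."""
    all_synonyms = {s.lower() for syns in synonyms_by_gene.values() for s in syns}
    nested = defaultdict(set)
    for short in all_synonyms:
        short_tokens = short.split()
        n = len(short_tokens)
        for longer in all_synonyms:
            if longer == short:
                continue
            long_tokens = longer.split()
            # Slide a window of n tokens over the longer synonym.
            if any(long_tokens[i:i + n] == short_tokens
                   for i in range(len(long_tokens) - n + 1)):
                nested[short].add(longer)
    return dict(nested)


promiscuous = find_promiscuous(TOY_SYNONYMS)
nested = find_nested(TOY_SYNONYMS)
```

Matching on whole tokens rather than raw substrings keeps ‘INS’ from being reported as nested inside ‘INSR’. The pairwise comparison is quadratic in the number of synonyms, which is acceptable for a toy table but would need an index for a full synonym dictionary.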

PubMed as a graph database

PubMed baseline 2020⁵³ comprises 30,419,056 publications of biomedical literature from MEDLINE and life science journals, and 173,572,773 citations from the full-text archive of open-access publications, PubMed Central (PMC). PubMed was imported into a graph database (Fig. 1A) for fast retrieval of highly relational data such as authorship and citation networks. In a graph database, information is represented as nodes and edges, which makes queries about relationships fast to answer. We loaded the PubMed 2020 baseline into Neo4J⁵⁴, an open source graph-database management system. We introduced five node types (publications, authors, human protein-coding genes, human diseases, medical subheadings), and four edge types (published, from authors to publications; cited by, between publications; gene annotation, from genes to publications; and disease annotation, from diseases to publications). Furthermore, PUBLICATION nodes have the following attributes: PubMed identifier, title, abstract, affiliations, is_review, is_clinical_trial, big_pharma, med_pharma and date of publication. Profiling of the graph is included in Table 6, Database Profiling. Neo4J offers an interactive approach to navigating PubMed: (i) references of publications are easily accessible, (ii) specific genes and diseases, already disambiguated, can be queried directly, and (iii) the graph can serve as the basis of a knowledge graph for further exploration of gene-disease associations. The database is available for download at: https://zenodo.org/record/8362679.

Table 6.

Database profiling.

Graph entity	Type	Counts
PUBLICATION	Node	30,419,056
AUTHOR	Node	8,331,251
GENE	Node	19,082
DISEASE	Node	4,818
MESH	Node	29,133
CITED_BY	Relationship	173,572,773
PUBLISHED	Relationship	121,879,576
GENE_PMID	Relationship	9,656,712
DISEASE_PMID	Relationship	39,605,276
MESH_PMID	Relationship	279,331,447


We loaded the PubMed 2020 baseline into Neo4J, an open source graph-database management system. We introduced four node types (PUBLICATION, AUTHOR, GENE, DISEASE), and four edge types (PUBLISHED, from AUTHOR to PUBLICATION; CITED_BY, between PUBLICATION nodes; GENE_PMID_ASSOCIATION, from GENE to PUBLICATION; and DISEASE_PMID_ASSOCIATION, from DISEASE to PUBLICATION). Furthermore, PUBLICATION nodes have the following attributes: PMID, TITLE, ABSTRACT, AFFILIATIONS, IS_REVIEW, IS_CLINICAL_TRIAL, BIG_PHARMA, MEDIUM_PHARMA and DATE. The database is accessible at: https://mega.nz/file/4E8QjCaQ#oqtm7jof-lsG7ySget8uakh7m26bDLo1HrPu3mtdAV8.
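To illustrate how such a graph might be queried, the following Python sketch builds a parameterised Cypher query that ranks review articles annotated with a given gene by the number of publications citing them. The node labels and relationship types follow Table 6 (PUBLICATION, GENE, GENE_PMID, CITED_BY); the property names (`symbol`, `pmid`, `title`, `is_review`), the direction of CITED_BY (from the cited to the citing publication) and the helper name `top_reviews_query` are assumptions and should be checked against the actual database schema.

```python
def top_reviews_query(gene_symbol, limit=10):
    """Build a Cypher query and its parameters (hypothetical schema details).

    Returns the query text and a parameter dict; nothing is executed here.
    """
    query = (
        # Publications annotated with the gene (edge from GENE to PUBLICATION).
        "MATCH (g:GENE {symbol: $symbol})-[:GENE_PMID]->(p:PUBLICATION)\n"
        # Keep review articles only; property name is an assumption.
        "WHERE p.is_review = true\n"
        # Count citing publications; the edge direction is an assumption.
        "OPTIONAL MATCH (p)-[:CITED_BY]->(c:PUBLICATION)\n"
        "RETURN p.pmid AS pmid, p.title AS title, count(c) AS citations\n"
        "ORDER BY citations DESC\n"
        "LIMIT $limit"
    )
    # Query parameters avoid interpolating user input into the query text.
    params = {"symbol": gene_symbol, "limit": limit}
    return query, params


query, params = top_reviews_query("TREM2", limit=5)
```

The returned pair could then be passed to a session of the official Neo4j Python driver, for example `session.run(query, params)`, once a connection to a local copy of the database has been opened.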

Gold standard sets

GeneRIF⁵⁸, UniProt, and DISEASES⁵⁹ were used as gold standards for validation.

Pharmaceutical companies

A list of organisation names was generated from Cortellis¹⁷⁰. Organisations with more than 100 patents in Cortellis were considered ‘big pharma’; all other organisations were considered ‘mid pharma’.
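The patent-count rule above amounts to a single threshold, sketched below in Python. The function and constant names are hypothetical, the returned labels reuse the big_pharma and med_pharma attribute names of the PUBLICATION nodes, and the patent counts in the example are invented for illustration rather than taken from Cortellis.

```python
# Threshold stated in the text: more than 100 patents means 'big pharma'.
BIG_PHARMA_PATENT_THRESHOLD = 100


def pharma_tier(patent_count):
    """Classify an organisation by its number of patents (strictly greater than)."""
    if patent_count > BIG_PHARMA_PATENT_THRESHOLD:
        return "big_pharma"
    return "med_pharma"


# Invented example counts; organisation names are placeholders.
example_counts = {"Organisation A": 2500, "Organisation B": 100, "Organisation C": 12}
tiers = {name: pharma_tier(count) for name, count in example_counts.items()}
```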