iScience. 2024 Feb 23;27(3):109309. doi: 10.1016/j.isci.2024.109309

Predicting potential target genes in molecular biology experiments using machine learning and multifaceted data sources

Kei K Ito 1,3,, Yoshimasa Tsuruoka 2,∗∗, Daiju Kitagawa 1,∗∗∗
PMCID: PMC10933549  PMID: 38482491

Summary

Experimental analysis of functionally related genes is key to understanding biological phenomena. The selection of genes to study is a crucial and challenging step, as it requires extensive knowledge of the literature and diverse biomedical data resources. Although software tools that predict relationships between genes are available to accelerate this process, they do not directly incorporate experiment information derived from the literature. Here, we develop LEXAS, a target gene suggestion system for molecular biology experiments. LEXAS is based on machine learning models trained with diverse information sources, including 24 million experiment descriptions extracted from full-text articles in PubMed Central by using a deep-learning-based natural language processing model. By integrating the extracted experiment contexts with biomedical data sources, LEXAS suggests potential target genes for upcoming experiments, complementing existing tools like STRING, FunCoup, and GOSemSim. A simple web interface enables biologists to consider newly derived gene information while planning experiments.

Subject areas: Molecular biology, Natural language processing

Graphical abstract


Highlights

  • A machine learning system suggests potential target genes for future experiments

  • The system was trained on the sequence of experiment descriptions and biomedical data

  • Twenty-four million experiments were extracted from PubMed Central’s results sections

  • LEXAS is a simple web interface for searching and suggesting biological experiments



Introduction

In molecular biology, researchers often face questions like "What gene should we target in the next experiment?" once they have completed an experiment on a gene. To find the answer to this question, they usually consult the relevant literature and biomedical databases to look for potentially functionally related genes. Due to the large amount of literature and databases available today, they spend a lot of time planning their experiments.

There are many text-mining-based applications that can help researchers quickly grasp various gene-related information described in the literature. Some tools display biomedical concepts such as diseases and compounds related to the gene of interest.1,2,3,4 Others perform biological event extraction from parts of the literature, such as abstracts, to find the relationships between genes or proteins, such as physical interaction, activation, inhibition, and phosphorylation.5,6,7,8,9,10 Based on the text-mined information provided by these tools, researchers can hypothesize functionally related genes to analyze in their next experiment.

In addition to displaying text-mined information, many tools have been developed to predict functionally related genes. For example, GeneMania,11 FunCoup,12 HumanBase,13 and HumanNet14 predict gene-gene functional relations using information obtained from multiple databases, including protein-protein interactions, expression levels, and co-evolution. GIREM15 predicts functionally related genes using only text-mined information from PubMed abstracts. STRING employs both databases and text-mined information to predict functional and physical interactions between proteins.16 GOSemSim17 calculates the semantic similarity between two genes by applying a graph-based method to the Gene Ontology (GO) hierarchy.18,19 These resources use machine learning models to predict functional associations between genes or proteins, using information in databases as the "gold standard" for training the models. For example, FunCoup uses the protein-protein interaction database iRefIndex20 as a gold standard for predicting physical interactions and the signaling pathway database KEGG21 for predicting functional interactions in signaling pathways. STRING uses the Complex Portal database22 as a gold standard to predict physical interactions. However, none of these existing systems are designed to directly answer the aforementioned question, i.e., they do not suggest the target genes of the next experiment.

Deep reinforcement learning and active learning approaches have already been used to help biologists design experiments. Deep reinforcement learning is used to optimize the parameters of experiments such as nutrient concentrations.23 On the other hand, active learning approaches are used to predict genes to be analyzed.24 Active learning approaches start by inferring gene networks and then suggesting informative genes for subsequent experiments to improve network accuracy. However, many of these gene networks are constructed based solely on data such as gene expression levels, ignoring prior knowledge from the published literature. Although this allows for broad application across various gene networks or functions regardless of their prior research status, it may compromise the accuracy that could be gained by incorporating knowledge from published literature.24 A minority of these approaches, including the robot scientist Adam,25 do make use of information from publications, yet their adaptability remains limited. There is still a lack of a comprehensive tool that effectively utilizes information from the literature to suggest genes for future experiments.

In this work, we have developed LEXAS (Life science EXperiment seArch and Suggestion), a system that can suggest genes to be analyzed in the next experiment by using the information on the order of experiments as described in biomedical articles, as well as various biomedical data sources. We first obtained comprehensive gene-related experiment descriptions from the articles archived in PubMed Central using a deep-learning-based natural language processing model. We then focused on the sequential order of experiments in each article and trained machine learning models that can predict the target gene of the next experiment from the target gene of the previous experiment. LEXAS provides results that align well with the researcher’s decision-making process, offering a useful complement to existing tools. LEXAS is available at https://lexas.f.u-tokyo.ac.jp as a web application. The overview of LEXAS is shown in Figure 1.

Figure 1. Overview and design of LEXAS

Sentences containing at least one gene name and one experiment method were extracted from the result section of articles archived in PubMed Central. The gene name and experiment method were masked by special tokens and then fed into a fine-tuned bio-BERT model for relation extraction. This resulted in the experiment list, which is browsable through the search interface. The context of the experiments was extracted from the experiment list and represented as a tuple. Along with corresponding negative examples, these tuples were used to generate feature vectors for training a prediction model for future experiments on genes. Result tables can be obtained by querying “TP53, immunofluorescence” for the search interface and “Cep63” for the suggestion interface.

Results

Extraction of experiment information from the literature

We first extracted information on gene-related experiments from the biomedical literature. In this study, we define an experiment as a research activity in which genes or proteins are analyzed using a certain experiment method (Figure 2A). For example, the sentence "Immunostaining showed that both RNF187 and P53 were localized mainly in the nucleus" indicates that "immunostaining" was performed on “RNF187” and "P53."

Figure 2. Experiment retrieval from full-text biomedical articles

(A) Examples of biological experiments defined as research activities in which genes are analyzed using experiment methods.

(B) Extraction of gene-experiment relations by using BioBERT. A sentence in which a gene name and an experiment method are masked with [GENE] and [EXPE] is fed into the fine-tuned BioBERT model to predict whether the gene and the method are in a gene-experiment relation.

(C) Cumulative scatterplot indicating the number of experiments in which a gene was analyzed. The top 10 genes with the highest number of experiments are shown in the table.

(D) Pie chart depicting the percentage of target genes mentioned in the titles or the abstracts. See also Figure S1 and Tables S1, S2, and S3.

Note that not every combination of gene names and experiment methods mentioned in the same sentence indicates that the experiment has been performed on the gene. For example, the sentence "Inhibition of Plk1 suppressed loss of HsSAS6 from the centrioles" contains one experiment method ("inhibition") and two gene names ("Plk1", "HsSAS6"). However, this sentence describes only one experiment in which inhibition was applied to Plk1, not to HsSAS6. Thus, treating all combinations of gene names and experiment methods as mentions of experiments leads to many false detections of experiments. To predict whether or not a pair of a gene name and an experiment method is indeed in a gene-experiment relation, we formulated this task as a relation extraction problem and fine-tuned a BioBERT model,26 which is a variant of BERT27 pretrained with PubMed articles (Figure 2B).

To fine-tune the BioBERT model, we created a gene-experiment relationship dataset. This dataset consisted of 1,600 randomly sampled pairs of a gene and an experiment method mentioned in the same sentence. Each pair was annotated with a label indicating whether the experiment method was performed to analyze the gene. Of the 1,600 pairs, 587 were annotated as positive and 1,013 as negative, with a Cohen’s kappa coefficient of 0.901. To evaluate the performance of our relation extraction model, we calculated the F1 score using 5-fold cross-validation. We increased the number of annotations in increments of 100, from 100 to 1,600 pairs of genes and experiments. Diminishing returns in performance were observed before 1,600 annotations (Figure S1), indicating that additional annotations would not significantly improve the performance. At this point, the precision of the relation extraction was 0.824, with a recall of 0.810. As a baseline, a precision of 0.271 and a recall of 1.0 can be obtained by considering every combination of a gene name and a method in each sentence to indicate a gene-experiment relation. Note that gene-experiment relations can also be described across multiple sentences rather than within a single sentence. We leave such relations for future work, so the recall reported here is only an upper bound.

We applied this model to all sentences from the result sections of PubMed Central articles that included at least one human gene name and one experiment method (Tables S1 and S2). In total, 24,635,147 gene-method pairs representing gene-experiment relations were obtained. Hereafter, the extracted gene-method pairs are simply referred to as “experiments.”

In our experiment collections, 24,226 genes were targeted at least once in the experiments, and 19,008, 12,818, 4,035, and 435 genes were targeted more than or equal to 10, 100, 1,000, and 10,000 times, respectively (Figure 2C). The top 10 target genes with the highest number of experiments were TP53, EGFR, IL6, AKT1, INS, VEGFA, MTOR, CD4, APOE, and MYC (Figure 2C; Table S3).

We also found that only about 55% of the target genes were mentioned in the title or the abstract (Figure 2D). The remaining 45% of the target genes were only mentioned in the main text.

Collecting consecutive experiment pairs for training machine learning models

Our goal is to develop a machine learning model that can recommend genes for analysis following an experiment on a given gene. A gene rarely works alone in a cell; more often it works with other genes as part of a functional module.28 Therefore, many molecular biologists focus their research not on a single gene alone but on the gene together with potentially related genes.28 For example, if a researcher focuses on gene A, he/she may next study gene B, which he/she thinks is part of the same functional module as gene A. The selection of gene B is based on various factors, such as the known interaction of protein B with protein A, its association with the same congenital disorder as gene A, or its similar tissue expression pattern. By training a machine learning model to predict the next target gene (from gene A to gene B), the model can mimic the researchers' decision-making process in selecting genes for analysis.

This idea is based on the hypothesis that the order of the experiment descriptions reflects the actual order in which the authors conducted the experiments. To test this hypothesis, we randomly selected 300 pairs of consecutive experiment descriptions, each from a different research article. We then manually reviewed the text of the articles to check if the experiments were indeed performed sequentially.

Of the 300 pairs, 167 described experiments on the same gene, whereas 133 pairs described experiments on different genes. For the 167 same-gene pairs, 108 (64.7%, 95% confidence interval [CI]: 57.5%–71.9%) described sequentially performed experiments. However, the remaining 59 same-gene pairs did not refer to sequentially performed experiments—49 of them referred to the same experiment (for instance, the consecutive sentences "We depleted gene X using siRNA" and "Depletion of gene X caused … "). In contrast, for the 133 pairs describing different genes, 122 (91.7%, 95% CI: 86.5%–96.2%) showed a sequential relationship between the experiments. Therefore, we conclude that for descriptions associated with different genes, there is a significant match between their sequence and the actual order of experiments. In training our machine learning models, we exclusively utilized the pairs describing different genes.

Each pair of two consecutive experiments is represented as a tuple consisting of two elements, where the first element is the target gene of an experiment, and the second element is the target gene of the following experiment (Figure 3A). The tuples representing the experiments described in the articles up to 2018 were used to train the machine learning models (628,965 tuples). The tuples representing the experiments described in 2019 were used for validation (63,850 tuples) and those described between 2020 and 2023 (221,318 tuples) were used for evaluation (Figure 3B).

Figure 3. Training and evaluation of machine learning models

(A) Schematic illustration of the flow for training a machine learning model using the experiment information extracted from articles. Tuples indicating the context of the experiments were generated from the experiment list and converted into feature vectors. These feature vectors were then used to train the machine learning models.

(B) Schematic illustration of data division. Experiments extracted from PMC articles published up to 2018 were used as the training dataset, experiments from 2019 articles were used for validation, and experiments from articles published in 2020 or later were used as the test set for comparing our models with related tools.

(C) Comparison of prediction accuracy between algorithms. The area under the ROC curve (AUROC) @100 was calculated using seven different models. Data are presented as the mean AUROC@100 ± 95% confidence interval (n = 8,278).

(D) Comparison of prediction accuracy between our models and other resources. The area under the ROC curve (AUROC) @100 was calculated using eight different models. Data are presented as the mean AUROC@100 ± 95% confidence interval (n = 13,381).

(E) Scatterplots depicting the mean AUROC@100 according to the number of articles mentioning the query gene before 2018. Mann-Whitney tests with Bonferroni correction were used in (C) and (D) to compare the mean and obtain the p value. ∗∗∗p < 0.001; NS, p > 0.05. See also Figure S2 and Tables S4 and S5.

Constructing feature vectors

To reduce the computational resources required for training, we transformed the task of predicting the gene to be analyzed from the set of human genes into a set of binary classification problems by using a negative sampling approach.29 For each tuple representing an experiment context, three negative examples were generated by random sampling. We trained machine learning models to classify whether or not a tuple described actually consecutive experiments.

Feature vectors were constructed using the data sources listed in Table 1. The data sources included categorical features such as those from the Gene Ontology,30 genetic diseases,31 and protein-protein interactions,32 as well as numerical features such as expression levels33 and gene dependency in cancer cells.34 The selection of these data sources was based on their frequent use by professional researchers in our institute, their diverse types of information, and their robust data collection methodologies. Each value in the feature vector reflects a relationship between two genes in a tuple. For example, a value corresponding to a categorical feature indicates whether the two genes in a tuple share the feature. A value corresponding to a numerical feature reflects the degree to which the features of the two genes are correlated.

Table 1. Information sources used to train machine learning models

Gene feature | Information source | Type | Dimension | Used to train LEXAS-data
Chromosome location | HGNC35 | Categorical | 46 | Yes
Phenotypes of knockout mice | Mouse Genome Informatics36 | Categorical | 4,165 | Yes
Subcellular localization | Human Protein Atlas37 | Categorical | 58 | Yes
Protein-protein interaction | iRefIndex20 | Categorical | 14,131 | Yes
Transcription factor | ENCODE38 | Categorical | 180 | Yes
Phenotypic abnormalities in human | Human Phenotype Ontology31 | Categorical | 3,257 | No
Genetic diseases | Online Mendelian Inheritance in Man, Orphanet | Categorical | 83 | No
Results of sparse matrix learning for the DepMap data | Webster39 | Categorical | 218 | No
Biological process, molecular function, and cellular component | Gene Ontology19 | Categorical | 4,313 | No
Cancer cell growth under CRISPR/Cas9-mediated suppression of genes | DepMap34 | Numerical | – | Yes
Expression levels among cancer cell lines | DepMap | Numerical | – | Yes
Expression levels among tissues | Human Protein Atlas | Numerical | – | Yes
Similarity of the usage of gene terms in the MEDLINE abstracts | Word2Vec | Numerical | – | No

This table summarizes the diverse gene features incorporated into the LEXAS model, detailing their sources, types, dimensions, and whether they were used to train the LEXAS-data model.

Evaluation of the model performance

We evaluated the performance of our models by predicting what gene should be examined next for each query gene and calculating the mean of the area under the ROC curve (AUROC). In the validation and test processes, a suggested gene was considered correct if it was indeed analyzed just after the query gene in the validation and test datasets, respectively. It should be noted that the AUROC scores computed in this evaluation are only approximations of the true accuracy of gene suggestion, because the absence of the suggested gene in the validation/test set does not necessarily mean that the gene is not a reasonable target in the next experiment. A more direct evaluation may be possible by having molecular biology researchers manually inspect the results, but such an evaluation will be small scale compared with the present evaluation. Here, we employed an approximate but large-scale approach for evaluation.

For the validation stage, we first trained several machine learning algorithms using the experiments described up to 2018 and evaluated the results using the experiments described in 2019. The algorithms used were XGBoost, logistic regression, support vector machine, random forest, multilayer perceptron, Naive Bayes, and k-nearest neighbor. Among these algorithms, logistic regression had the highest AUROC, followed by XGBoost (Figure S2A).

We also calculated AUROC@1000 and AUROC@100 to assess whether the true positive gene appeared in the top 1,000 or top 100 suggested genes. These metrics are important because they represent how often the correct gene is among the top suggestions, which is crucial for a recommendation system.40 Our results showed that XGBoost significantly outperformed the other algorithms on these metrics (p < 0.001, n = 8,278) (Figures 3C and S2B; Table S4). For comparison, we also implemented a baseline model that simply ranks genes based on the number of experiments in the literature published up to 2018. This baseline model scored lower than XGBoost in terms of AUROC@100 (mean difference: 0.044, 95% CI: 0.042–0.046). We, therefore, chose XGBoost as our final model. Hereafter, we use AUROC@100 to evaluate our model and other resources.
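
To make the setup concrete, the following minimal sketch illustrates training an XGBoost binary classifier on tuple feature vectors and ranking candidate genes by predicted probability. The data and hyperparameters are illustrative stand-ins, not the configuration used to train LEXAS.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_train = rng.random((1000, 50))          # toy stand-ins for tuple feature vectors (see Table 1)
y_train = rng.integers(0, 2, size=1000)   # 1 = tuple describes actually consecutive experiments

model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

# Rank candidate genes for one query gene by predicted probability
X_candidates = rng.random((19393, 50))    # one row per protein-coding human gene
scores = model.predict_proba(X_candidates)[:, 1]
top100 = np.argsort(-scores)[:100]        # indices of the 100 best-ranked candidates
```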

For the test stage, we then compared the XGBoost model, which is named LEXAS, with other popular resources using the test dataset. Among many tools that predict functionally related genes, we chose GOSemSim, STRING, and FunCoup for comparison because they can predict the functional relationship between genes in general, not specific to a disease or a tissue. In addition, we applied the Random Walk with Restart (RWR) method,41 a network-based method that calculates node-to-node proximity within a network, to STRING and FunCoup networks. The application of RWR allows us to assess the degree of relatedness or similarity between genes more comprehensively within complex networks. The original STRING and FunCoup data are denoted as “STRING-raw” and “FunCoup-raw,” respectively.

We compared the mean AUROC@100 score of LEXAS with those of the other tools using the test data. We found that the AUROC score of LEXAS (0.568) was significantly higher than those of STRING-raw (mean difference: 0.013, 95% CI: 0.009–0.017), FunCoup (0.038, 95% CI: 0.036–0.041), FunCoup-raw (0.048, 95% CI: 0.045–0.052), and GOSemSim (0.018, 95% CI: 0.016–0.021) (p < 0.001, n = 13,381, Mann-Whitney U test) (Figure 3D; Table S5). However, the AUROC scores of LEXAS and STRING did not differ significantly (mean difference: −0.001, 95% CI: −0.004 to 0.002).

We next attempted to improve the LEXAS model by introducing the information from STRING, FunCoup, and GOSemSim. This new model, LEXAS-plus, showed significantly higher performance (0.573) compared with the other models including STRING (mean difference: 0.005, 95% CI: 0.002–0.008) (p < 0.001). In addition, we developed a model called LEXAS-data, which only uses information from relatively objective databases and FunCoup, where all data are derived from comprehensive analyses such as RNA sequencing or mass spectrometry. The AUROC@100 score of LEXAS-data (0.536) was higher than that of FunCoup alone (mean difference: 0.006, 95% CI: 0.004–0.008, p < 0.001) (Figure 3D).

In the aforementioned evaluation, the gene examined immediately after the query gene was considered the true positive. However, it is important to test whether the models can predict not only the next gene but also all the following genes. To test this, we considered all the following genes in the article as true positives and recalculated the metrics. In the validation process, XGBoost again showed the best score (Figure S2C; Table S4). In the test process, the AUROC score of LEXAS (0.576) was significantly higher than those of STRING (mean difference: 0.006, 95% CI: 0.004–0.008), STRING-raw (0.019, 95% CI: 0.017–0.021), FunCoup (0.045, 95% CI: 0.044–0.046), FunCoup-raw (0.044, 95% CI: 0.042–0.046), and GOSemSim (0.028, 95% CI: 0.026–0.030) (p < 0.001, n = 13,381, Mann-Whitney U test) (Figure S2D; Table S5). The AUROC value of LEXAS-plus (0.581) was the highest among all tested models including LEXAS (mean difference: 0.005, 95% CI: 0.003–0.006) (Figure S2D; Table S5). These results suggest that LEXAS and LEXAS-plus can reliably predict not only the genes examined immediately after the query gene but also those examined in later experiments.

The relationship between suggestion accuracy and the number of descriptions in the past literature

Next, we tested the relationship between gene suggestion accuracy and the number of descriptions of the query gene in the past literature. Figure 3E illustrates the relationship between the number of descriptions and the mean AUROC scores. We observed that the LEXAS-plus model consistently outperformed GOSemSim and FunCoup. When compared with STRING, no clear trend was evident overall; nevertheless, for query genes mentioned fewer than 10 times in the literature, LEXAS-plus tended to outperform STRING. These results suggest that LEXAS-plus is a highly effective model for suggesting genes, particularly those with limited mentions in the existing literature.

Effect of subcellular localization, molecular function, and biological process of gene transcripts on the suggestion accuracy

We also investigated whether the prediction performance was affected by the category of the gene transcripts, such as subcellular localization, molecular function, and biological process. We used Gene Ontology annotations and selected the top 22 terms by the number of genes annotated to each term. We calculated the mean of the AUROC score for genes annotated with each term.

Our analysis showed that for subcellular localization, the predictability was highest for genes localized in the intracellular-membrane-bound organelle and lowest for integral components of the nucleus (Figure 4A; Table S6). LEXAS-plus showed the highest accuracy for genes in almost all subcellular localizations. Regarding biological processes, we found that genes related to the adaptive immune system or spermatogenesis had higher AUROC scores, whereas genes related to protein phosphorylation or ubiquitination had lower scores (Figure 4B; Table S6). Once again, LEXAS-plus showed the highest accuracy for genes involved in most biological processes. In addition, LEXAS-plus also showed the highest AUROC score for genes related to most molecular functions (Figure 4C; Table S6). Therefore, we conclude that the LEXAS-plus model is the optimal model for gene suggestion, regardless of the localization, biological process, and molecular function of gene transcripts.

Figure 4. Evaluation of the output of the LEXAS system

(A) Comparison of prediction accuracy between the categories of query genes classified by subcellular localization. Data are presented as the mean AUROC@100.

(B) Comparison of prediction accuracy between the categories of query genes classified by biological process. Data are presented as the mean AUROC@100.

(C) Comparison of prediction accuracy between the categories of query genes classified by molecular function. Data are presented as the mean AUROC@100. See also Table S6.

Impact of gene features on the suggestion

To evaluate the impact of gene features on the suggestions, we calculated SHAP values,42 game-theoretic measures that are often used to explain the local outputs of machine learning models such as XGBoost.43 For each query gene, the SHAP values were calculated for the top 10 suggested genes. Figure 5A shows the frequency of the gene features that have the highest impact on prediction. In the LEXAS-data model trained with objective databases alone, the pfc score from the FunCoup database was the most influential, followed by expression levels in cancer cell lines and tissues and gene essentiality in cancer cell lines from the DepMap database. In the LEXAS model, on the other hand, the information from Word2vec was the most influential, followed by the expression levels in cancer cell lines. Word2vec is a machine learning approach that represents a word as a vector reflecting its context.44 In our model, a Word2vec model trained on the full text of PMC articles published up to 2018 was used to vectorize gene names. Therefore, unsurprisingly, the textual context in which the gene name is mentioned is most informative for predicting the genes to be analyzed. In the LEXAS-plus model, the information from STRING was the most influential in about half of the suggestions, and the information sources used to train the LEXAS model were the most influential in the remaining suggestions. These results indicate that the LEXAS models suggest genes to be analyzed by incorporating a variety of gene features.

Figure 5. Analysis of important features and scores provided by LEXAS

(A) Pie chart depicting, for each of the three models, the percentage of features with the highest SHAP values when predicting the candidate genes to be analyzed in the next experiments.

(B) The distribution of probabilities (scores) for all suggestions and for positive suggestions by the LEXAS-plus model. The histogram represents the number of genes with the indicated probabilities. The probabilities were calculated with the LEXAS-plus model trained with the experiments up to 2018. If an experiment on a suggested gene was performed after an experiment on the query gene in any article, the suggested gene was defined as a positive suggestion.

(C) The histogram represents the ratio of positive suggestions to all suggestions with the indicated probabilities. The bin width in the histograms is 0.02 in (B) and (C). See also Figure S3.

Analysis of score distribution in LEXAS-plus model

Figure 5B shows the distribution of probabilities calculated by the LEXAS-plus model. The probabilities for all suggestions were distributed around 0.15, whereas the probabilities for positive suggestions showed a broader distribution. Figure 5C shows the proportion of positive suggestions. Among the suggestions with a probability greater than 0.98, the ratio of positive suggestions is about 70%. On the other hand, the ratio is under 5% among the suggestions with a probability below 0.8. The distribution of probabilities calculated by LEXAS and LEXAS-data models is shown in Figure S3. The probabilities are referred to as “scores” in the user interface.

A proof of concept for LEXAS

Table 2 shows an example of the output of the LEXAS model trained using the experiments before 2018. In this example, the query was CEP44. CEP44 was identified as a centrosomal gene by proteomics methods in 2011,45 but its function was not described until 2020. This table lists the top eight genes with the highest probabilities suggested by our LEXAS model. Interestingly, five out of these eight genes were actually analyzed and demonstrated to be functionally related to CEP44 in the articles published in 202046 or 2022.47 Furthermore, although the other three genes were not analyzed in the articles, the transcripts of RTTN, CEP350, and CCDC77 are also localized to the centrosome and appear to be worth analyzing.48,49,50 These results illustrate that our model can generate reasonable suggestions for the query genes even without any functional information.

Table 2. An example of the target gene prediction by the LEXAS model

Rank | Gene | Score | Top 5 features with the highest importance
1 | CEP120∗ | 0.984 | DepMap; GO term: centriole; Cancer gene expression; GO term: centrosome; iRefIndex: TP53BP2
2 | RTTN | 0.966 | DepMap; GO term: centriole; GO term: centrosome; Tissue gene expression; Cancer gene expression
3 | CEP350 | 0.966 | GO term: centrosome; DepMap; GO term: centriole; Cancer gene expression; Tissue gene expression
4 | CEP295∗ | 0.951 | DepMap; GO term: centrosome; GO term: centriole; Tissue gene expression; Cancer gene expression
5 | CCP110∗ | 0.949 | GO term: centrosome; DepMap; GO term: centriole; Cancer gene expression; Tissue gene expression
6 | CENPJ∗ | 0.938 | GO term: centrosome; DepMap; GO term: centriole; Cancer gene expression; iRefIndex: TP53BP2
7 | CCDC77 | 0.935 | DepMap; Cancer gene expression; GO term: centriole; Tissue gene expression; iRefIndex: KIAA0753
8 | CEP135∗ | 0.929 | GO term: centrosome; DepMap; GO term: centriole; Tissue gene expression; iRefIndex: TP53BP2

This table shows the potential target genes after an experiment on CEP44, as predicted by the LEXAS model using experiments before 2018. Genes marked with a star were reported to be functionally related to CEP44 in articles published in 202046 or 2022.47

The top five features with the highest SHAP values are shown in the rightmost column. For example, CCDC77 was suggested based on information on cancer dependency (DepMap), cancer gene expression, an annotation for the GO term “centriole,” tissue gene expression, and a protein interaction with KIAA0753. As CEP44 is a centrosomal protein, the interaction with KIAA0753, which also localizes to the centrosome,51 and the GO term “centriole” are reasonable clues for the suggestion.

User interface

Using the collected experiments and trained machine learning models, we built a web application for LEXAS with two interfaces: search and suggestion. The search interface allows users to retrieve a list of experiment descriptions extracted by the fine-tuned BioBERT model. Given a gene name and the category of an experiment method, the system displays the list of matching experiment descriptions (Figure 6A). The search system offers distinct advantages over previous search methods. First, researchers can quickly access information about experiments on a gene of interest simply by reading a single sentence that describes the experiment, rather than having to read the full text of an article. Second, the system automatically searches for experiment descriptions that include not only the query gene name but also synonymous gene names. Furthermore, we found that 45% of the target genes in experiments were not mentioned in article titles or abstracts (Figure 2D), highlighting the limitations of relying solely on these searches. Our experiment search system can help researchers overcome this limitation by searching the full text. These advantages suggest that our experiment search system is a valuable tool for researchers who want to quickly and accurately identify relevant experiment information for genes of interest.

Figure 6. Web interface of LEXAS

(A) LEXAS search. The result table is obtained when searching for experiments in which “TP53” was examined with “immunofluorescence.”

(B) LEXAS suggestion. The result table is obtained as suggestions for the genes that could be examined after an experiment on CEP44.

The suggestion interface allows users to find a list of genes that LEXAS deems should be analyzed after an experiment on a given gene, along with possible experiment methods. The important gene features are displayed along with their SHAP values,42 helping the user understand why these genes are suggested (Figure 6B). The system also allows the user to choose between three machine learning models for the suggestion: LEXAS-data, LEXAS, and LEXAS-plus. LEXAS and LEXAS-plus are “plausible” models that use various databases and text-mined information. These models are suitable for those who are seeking plausible suggestions in line with a published body of knowledge. By contrast, the other “exploratory” model (LEXAS-data) was built using the information from objective databases acquired from comprehensive analysis alone. This model is thus suitable for those who are seeking novel and unexpected connections based on relatively objective features.

We will update the data sources and retrain the models at least quarterly in the future to ensure the accuracy and relevance of the search results.

Discussion

In this work, we developed a gene suggestion system named LEXAS by using machine learning models trained with the information from the experiment descriptions and biomedical data sources. Given a gene of interest, LEXAS produces a list of potentially functionally related genes to be analyzed next. The suggestion accuracy of LEXAS was higher than those of STRING, FunCoup, and GOSemSim.

An important aspect to consider in our study is the potential impact of publication bias on the results provided by our system. Our model relies on the experiment descriptions in the published literature for training. This training dataset inherently carries the publication biases that could affect the performance and generalizability of our model. For example, researchers often report “positive” results more frequently than “negative” ones. This could lead us to miss gene combinations that researchers performed experiments on, but the results did not show the expected outcomes. Furthermore, research on certain genes may be more prevalent in the literature due to historical emphasis or availability of resources, which could also bias our data toward these more popular research areas. We acknowledge this as a limitation of our work.

In this study, we trained machine learning models to predict reasonable target genes for the next experiment. Therefore, in our main evaluation, we considered only the genes analyzed in the immediate next experiment as true positives to demonstrate the effectiveness of our approach (Figure 3). However, it is also possible that the models predict other genes analyzed in the following experiments in the result section of the same article. To test this idea, we employed an alternative approach in which we regarded all subsequent experiments as true positives rather than just the immediate next experiment (Figure S2). This analysis showed that our LEXAS system is also effective at predicting genes in the remaining subsequent experiments. Thus, although LEXAS was trained to predict only the next experiment, it is also useful for predicting further experiments. In the LEXAS web application, a list of potentially related genes to be analyzed after the query gene is displayed. Users can select multiple genes from this list to gain insight beyond only the immediate next step.

Our machine learning models predict genes to analyze based on a single query gene. However, this approach could encounter difficulties when the query gene possesses multiple functions, because the output would include a mix of genes associated with each of its functions. An example would be a scenario where a researcher is interested in studying function A of gene X, which is also involved in another function, function B. Querying gene X in our system would return a mix of genes related to either function A or function B, which could obscure the next direction of the study. To mitigate this, the ability to use multiple query genes would be beneficial: by simultaneously querying gene X together with a group of genes related to function A, the system could identify genes that are related to gene X in the context of function A. Extending the LEXAS model to handle multiple query genes would be a promising direction for future research.

The LEXAS suggestion interface provides a list of genes as well as reasonable experiments for analyzing each suggested gene. To suggest experiment methods, we grouped them into 20 categories and built a model to predict the experiment category. The interface then displays reasonable experiment categories, such as "knockdown" or "immunofluorescence analysis." However, actual experiments are usually more complicated than these categories, and suggesting more specific and concrete experiment methods would be another interesting direction for further study.

Limitations of the study

This study extracted information on experiments described within single sentences. Consequently, experiments described across multiple sentences were not included in the extraction process. In addition, because LEXAS’s predictions rely on published articles, the results can be influenced by publication biases, which might overrepresent or underrepresent specific areas of biological research.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited data

HGNC (Tweedie et al., 2021)35 | HUGO Gene Nomenclature Committee | https://www.genenames.org/
Mouse Genome Informatics (Blake et al., 2021)36 | The Jackson Laboratory | https://www.informatics.jax.org/
Human Protein Atlas (Thul et al., 2017)37 | The Human Protein Atlas project | https://www.proteinatlas.org/
iRefIndex (Razick, Magklaras and Donaldson, 2008)20 | VIB Technologies | https://irefindex.vib.be/
Harmonizome52 | Ma’ayan Laboratory of Computational Systems Biology | https://maayanlab.cloud/Harmonizome/
Human Phenotype Ontology (Köhler et al., 2021)31 | The Human Phenotype Ontology (HPO) project | https://hpo.jax.org/
Webster (Pan et al., 2022)39 | Pan et al.39 | https://depmap.org/webster/
Gene Ontology (Carbon et al., 2019)19 | Gene Ontology Consortium | https://geneontology.org/
DepMap (Meyers et al., 2017)34 | Broad Institute | https://depmap.org/portal/
Code for the development and evaluation of LEXAS | This paper | https://doi.org/10.5281/zenodo.10115270

Software and algorithms

Python version 3.7 | Python Software Foundation | https://www.python.org

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Kei K Ito (ito-delightfully-kei@g.ecc.u-tokyo.ac.jp).

Materials availability

This study did not generate new unique reagents.

Data and code availability

  • This paper analyzes existing, publicly available data. The accession numbers and URLs for these datasets are listed in the key resources table.

  • All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Method details

Article retrieval

Full-text articles archived in PubMed Central (PMC) were downloaded in XML format via the PubMed FTP service on January 20, 2023. The “sec” elements (sections) containing the word "result" in their titles were extracted by parsing the articles. The text within these sections was then retrieved for further analysis.
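
As a rough illustration of this step (our own sketch, not the authors' code), the following Python snippet uses the standard xml.etree.ElementTree module to collect the text of result sections from a JATS-formatted PMC XML file; the file name is hypothetical.

```python
import xml.etree.ElementTree as ET

def result_section_text(xml_path):
    """Return the concatenated text of all <sec> elements whose title contains 'result'."""
    tree = ET.parse(xml_path)
    texts = []
    for sec in tree.iter("sec"):
        title = sec.find("title")
        if title is not None and title.text and "result" in title.text.lower():
            texts.append("".join(sec.itertext()))  # flatten nested markup to plain text
    return "\n".join(texts)

print(result_section_text("PMC1234567.xml")[:500])  # hypothetical file name
```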

Sentence extraction

Sentence segmentation was performed using scispaCy.53 Sentences that contained at least one human gene name and one experiment method were extracted using a dictionary-matching algorithm, specifically the Aho–Corasick algorithm, which was implemented as a Python package called "ahocorapy". We used two different term lists for the genes and the experiment methods.

The gene term list consists of 106,953 terms covering gene symbols, gene names, alias gene symbols, alias gene names, previous gene symbols, and previous gene names provided by the HUGO Gene Nomenclature Committee (HGNC).35 Terms shorter than three characters and 103 stopword-like terms such as “was” and “can” were excluded to avoid false gene detections (Table S1).

The experiment method list consists of 4,303 terms in total, including 3,870 Medical Subject Headings (MeSH) terms and 433 manually compiled terms (Table S2). The manually compiled terms were expanded using a word2vec model trained on the texts of PubMed Central, as described in the 'collection of gene features' subsection of the STAR Methods. We converted the experiment method terms into vectors and then inspected the top 10 terms most similar to each term; if a term was deemed appropriate and not already included, we added it to the list. The MeSH terms are descendants of the following categories: E05.196 (Chemistry Techniques, Analytical), E05.393 (Genetic Techniques), E05.478 (Immunologic Techniques), E05.200 (Clinical Laboratory Techniques), E05.301 (Electrochemical Techniques), E05.601 (Molecular Probe Techniques), E05.595 (Microscopy), E05.242 (Cytological Techniques), E05.591 (Micromanipulation), E01.370.225 (Clinical Laboratory Techniques), E01.370.350 (Diagnostic Imaging), or E01.370.500 (Mass Screening).
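
The following sketch illustrates this filtering step under simplifying assumptions: it uses the scispaCy en_core_sci_sm model for sentence segmentation and the ahocorapy KeywordTree for dictionary matching, with toy term lists standing in for the HGNC and MeSH-derived lists described above.

```python
import spacy
from ahocorapy.keywordtree import KeywordTree

nlp = spacy.load("en_core_sci_sm")           # scispaCy model used for sentence segmentation

def build_tree(terms):
    tree = KeywordTree(case_insensitive=True)
    for term in terms:
        tree.add(term)
    tree.finalize()
    return tree

gene_tree = build_tree(["TP53", "PLK1", "HsSAS6"])           # toy stand-in for the HGNC gene list
method_tree = build_tree(["immunostaining", "inhibition"])   # toy stand-in for the method list

def candidate_sentences(result_text):
    """Yield sentences containing at least one gene name and one experiment method."""
    for sent in nlp(result_text).sents:
        if gene_tree.search(sent.text) and method_tree.search(sent.text):
            yield sent.text
```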

Relation extraction for gene and experiment

To train a model for relation extraction between a gene and an experiment method in a sentence, a set of masked sentences was prepared. For each pair of a gene and an experiment method in a sentence, a new sentence was created by masking the gene name and the experiment method with special tokens, [GENE] and [EXPE], respectively. For example, from the sentence, “Inhibition of Plk1 suppressed loss of HsSAS6 from the centrioles.", we replaced the gene name "Plk1" or “HsSAS6” with the [GENE] token and replaced the experiment method "inhibition" with the [EXPE] token. This resulted in the following two sentences:

  • [EXPE] of [GENE] suppressed loss of HsSAS6 from the centrioles.

  • [EXPE] of Plk1 suppressed loss of [GENE] from the centrioles.

The first sentence is positive, indicating that the experiment method [EXPE] was applied to the gene [GENE], while the second sentence is negative, indicating no relationship between the experiment method [EXPE] and the gene [GENE]. In the annotation process, 1,600 masked sentences were randomly chosen and manually annotated by K.K.I. as to whether the experiment method [EXPE] was performed on the gene [GENE]. To validate this process, R.Y., a trained PhD student majoring in cell biology, also annotated 100 of these sentences. The agreement between K.K.I. and R.Y. was assessed using Cohen’s kappa (0.901).
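
A minimal sketch of applying such a fine-tuned classifier with the Hugging Face transformers library is shown below; the checkpoint path is hypothetical, and the [GENE] and [EXPE] tokens are assumed to have been registered with the tokenizer during fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/to/finetuned-biobert-relation"   # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
# [GENE] and [EXPE] are assumed to be additional special tokens added before fine-tuning.

sentence = "[EXPE] of [GENE] suppressed loss of HsSAS6 from the centrioles."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
is_relation = logits.argmax(dim=-1).item() == 1     # label 1 = gene-experiment relation
```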

Negative sampling

The context of two consecutive gene-related experiments was represented as a tuple consisting of two genes. Negative examples were generated by replacing the second element of a positive example with a gene randomly selected from all human genes, excluding any replacement that produced a tuple already contained in the positive examples. Three negative examples were generated per positive example.
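
A minimal sketch of this sampling scheme is shown below; the tuple and gene lists are toy stand-ins.

```python
import random

def sample_negatives(positive_tuples, all_genes, k=3, seed=0):
    """Draw k negative (previous_gene, next_gene) tuples per positive tuple."""
    rng = random.Random(seed)
    positives = set(positive_tuples)
    negatives = []
    for prev_gene, _ in positive_tuples:
        drawn = 0
        while drawn < k:
            candidate = (prev_gene, rng.choice(all_genes))
            if candidate not in positives:      # skip combinations that are actually positive
                negatives.append(candidate)
                drawn += 1
    return negatives

# toy usage
print(sample_negatives([("CEP63", "CEP152")], ["TP53", "PLK1", "CEP152", "STIL"]))
```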

Collection of gene features

Gene locus

The gene locus data was acquired from the HUGO Gene Nomenclature Committee (HGNC).35 Information on chromosome number and arm (p or q) was used to generate feature vectors.

Gene Ontology

The ontology and annotation data were downloaded from the Gene Ontology. Version 2018-12-01 was used to train models for evaluating performance, while version 2023-01-01 was used to train models for the user interface. The ontology terms annotated on at least 10 human genes were used to generate feature vectors.

Mouse genome informatics

Information on genotype-phenotype annotations was downloaded. The mouse gene names were converted to human homologs and then phenotypes annotated on at least 10 human genes were used to generate feature vectors.

HPO, OMIM, Orphanet

Information on Human Phenotype Ontology (HPO) annotations version 2023-01-27, which also includes the information on Online Mendelian Inheritance in Man and Orphanet, was downloaded. The terms annotated on at least 10 human genes were used to generate feature vectors.

Human Protein Atlas

Sub-cellular location data based on the results of comprehensive immunofluorescence analysis and gene expression data among 256 tissues obtained from RNA-seq analysis were downloaded from Human Protein Atlas version 22.0. From the RNA-seq data, information on transcripts per million (TPM) was used to generate feature vectors.

iRefIndex

Information on protein-protein interactions of human proteins was downloaded from iRefIndex. iRefIndex 19.0 released 2022-08-22 was used to generate feature vectors.

DepMap, Webster

Information on expression levels and gene essentiality scores (Chronos) among various cancer cell lines was downloaded from DepMap version 22Q4. The gene functions inferred from the DepMap gene essentiality scores using a sparse dictionary learning approach called Webster were downloaded from the supplementary information of the article.39 From the Webster data, the gene-function relationships with loadings greater than 0.1 or less than −0.1 were used to generate feature vectors.

ENCODE

The information about transcription factors and their target genes provided by ENCODE was downloaded through Harmonizome.52 Each gene was annotated with the transcription factors that can affect its expression.

Word2Vec

The word2vec model was trained using the full text of PubMed Central articles. For the validation and test, only articles published up to 2018 were used to train the word2vec model, while all articles were utilized for the web application. The training process utilized the continuous bag-of-words (CBOW) algorithm,44 a window size of 10, and a vector size of 100. The word2vec model was implemented using the gensim library in Python.54
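
A minimal sketch of this training step with gensim is shown below; the toy corpus stands in for the tokenized PMC full texts, and note that gensim version 4 and later uses the parameter name vector_size (older versions use size).

```python
from gensim.models import Word2Vec

corpus = [
    ["immunostaining", "showed", "rnf187", "and", "p53", "in", "the", "nucleus"],
    ["depletion", "of", "plk1", "reduced", "centriole", "duplication"],
]  # toy stand-in for the tokenized PMC full texts

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality
    window=10,         # context window size
    sg=0,              # 0 selects the CBOW algorithm
    min_count=1,       # a larger threshold would be used on the real corpus
)
vector = model.wv["plk1"]   # embedding for a gene mention
```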

Construction of feature vector

The tuples composed of two gene names were converted into the feature vectors using the information sources including categorical and numerical features listed in Table 1.

For categorical features, three dimensions were assigned to each term in the feature vector. Each of the two genes in a tuple is associated with several feature terms, such as “GO:0006281 (DNA repair)” and “HP:0000252 (microcephaly)”. When both genes in a tuple were attributed to the feature term, the values corresponding to the term were set to (1, 0, 0). When only the gene in the previous experiment was attributed to the feature, the values were set to (0, 1, 0). When only the gene in the next experiment was attributed to the feature, the values were set to (0, 0, 1). If neither gene was attributed to the feature, the values were set to (0, 0, 0).

One dimension was assigned to each numerical feature in the feature vector. For each numerical feature other than word2vec, the value is the Pearson correlation coefficient between the values of the two genes. The value corresponding to word2vec is the cosine similarity between the embedding vectors of the two genes in a tuple.
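
The following sketch illustrates this conversion; the dictionaries of categorical terms and numerical profiles are hypothetical stand-ins for the sources in Table 1, and the word2vec cosine-similarity dimension is omitted for brevity.

```python
import numpy as np
from scipy.stats import pearsonr

def categorical_block(annotated_genes, gene_prev, gene_next):
    """Three-dimensional encoding of one categorical feature term."""
    in_prev, in_next = gene_prev in annotated_genes, gene_next in annotated_genes
    if in_prev and in_next:
        return (1, 0, 0)
    if in_prev:
        return (0, 1, 0)
    if in_next:
        return (0, 0, 1)
    return (0, 0, 0)

def tuple_features(gene_prev, gene_next, categorical_terms, numerical_profiles):
    """categorical_terms: term -> set of annotated genes; numerical_profiles: source -> {gene: value vector}."""
    values = []
    for annotated_genes in categorical_terms.values():
        values.extend(categorical_block(annotated_genes, gene_prev, gene_next))
    for profiles in numerical_profiles.values():
        r, _ = pearsonr(profiles[gene_prev], profiles[gene_next])
        values.append(r)
    return np.array(values)
```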

Comparison of related tools

Our models and online resources used for the comparison are listed below.

  • LEXAS

Our machine learning model trained using the experiments described up to 2018 together with the GO terms, gene databases, and text-derived information listed in Table 1. For each query gene, all genes were ranked by their probability of being examined after an experiment on the query gene.

  • STRING-raw55

The STRING database is a comprehensive resource that provides information on protein-protein interactions. It integrates information from numerous sources, including experimental data and text-mined information. The STRING database provides a link score between two genes as "scored links between proteins". STRING v10, updated in 2017, was selected for the comparison so that STRING would not use text-mined information from the articles after 2018. For each query gene, all genes were ranked by scored links between proteins.

  • STRING55

We applied the Random Walk with Restart (RWR) method, a random-walk-based method that measures node-to-node proximity in a network,41 to the scored links between proteins in the STRING network. For each query gene, all genes were ranked by their proximities.

  • GOSemSim17

GOSemSim is a software package that estimates the semantic similarity between gene products based on GO terms. Using GOSemSim, we calculated the semantic similarity between the two genes in a tuple, choosing Wang’s graph-based method30 as the calculation method. For each query gene, all genes were ranked by their semantic similarity to the query gene.

  • FunCoup-raw12

FunCoup provides functional coupling information between two proteins, represented by a probabilistic confidence value called pfc (probability of functional coupling). The predictions are based on integrating data from various sources, such as protein-protein interactions and gene co-expression. For each query gene, all genes were ranked by pfc.

  • FunCoup12

The RWR method was also applied to the pfc of the FunCoup network. For each query gene, all genes were ranked by the proximities.

Calculation of the area under the ROC curve

To assess the performance of our machine learning model, we generated 19,393 tuples for each query gene, with that gene in the first element and another human gene in the second element. These 19,393 genes were obtained from HGNC and are annotated as "gene with protein product". Each tuple was converted into a feature vector and assigned a probability of representing an actual experimental context by the machine learning models. We then ranked the tuples based on their probabilities and calculated the area under the ROC curve (AUROC) for each query gene, using validation or test tuples as true examples. Additionally, we computed AUROC@k, which is similar to the AUROC but only considers the top-k ranked tuples. To calculate this metric, we set the probabilities to 0 for any genes that were ranked below k. We also calculated the same metrics for the GOSemSim semantic similarities and for the scores and RWR proximities of STRING and FunCoup. To break ties, small random values ranging between 0 and 1 × 10⁻¹⁰ were added to the scores.

In the validation process, genes that were analyzed following the query gene before 2018 but not in 2019 were removed from the result table, because these genes cannot be considered positive or negative examples. Similarly, in the evaluation process, genes that were analyzed before 2019 but not after 2020 were removed from the result table.
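
A minimal sketch of the AUROC@k computation described above, using scikit-learn, is shown below; the scores and labels in the usage example are toy values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_at_k(scores, labels, k=100, seed=0):
    """AUROC computed after zeroing the scores of all genes ranked below the top k."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    scores = scores + rng.uniform(0.0, 1e-10, size=scores.shape)  # break ties
    below_top_k = np.argsort(-scores)[k:]
    truncated = scores.copy()
    truncated[below_top_k] = 0.0
    return roc_auc_score(labels, truncated)

# toy usage: one positive gene among four candidates, k = 2
print(auroc_at_k(scores=[0.9, 0.2, 0.7, 0.1], labels=[1, 0, 0, 0], k=2))
```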

Random Walk with Restart implementation

The Random Walk with Restart (RWR) algorithm was implemented using the PyRWR library (https://github.com/jinhongjung/pyrwr.git), a Python implementation. RWR is a graph-based algorithm that measures the proximity between a given seed node and all other nodes. It was applied to calculate node proximity within the STRING and FunCoup networks. All PyRWR parameters were used with default values. Cutoffs were implemented during the integration of RWR data into the LEXAS-plus model (STRING: 0.0005, FunCoup: 0.001).
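
The following sketch is based on the PyRWR package's typical usage; the edge-list file name and node id are hypothetical, and the exact interface should be checked against the library version used.

```python
from pyrwr.rwr import RWR

rwr = RWR()
# edge list: "node_a<TAB>node_b<TAB>weight", e.g., exported from STRING or FunCoup
rwr.read_graph("string_links.tsv", "undirected")    # hypothetical file name
seed_node = 0                                       # integer id of the query gene's node
proximities = rwr.compute(seed_node)                # RWR proximity from the seed to every node
top_candidates = proximities.argsort()[::-1][:100]  # 100 most proximal nodes
```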

Calculation of SHAP values

To interpret the importance of features in our XGBoost model, we calculated SHAP values using the TreeSHAP implementation in the SHAP Python library. This implementation is specifically designed for tree-based models like XGBoost.56 We created a TreeExplainer instance with the model, specifying 'tree_path_dependent' for feature perturbation. Then, we used this explainer to calculate SHAP values for our input data. The 'approximate=True' argument was passed to the shap_values method to speed up computation while maintaining a reasonable level of accuracy. The calculated SHAP values were then used to assess feature importance.
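
A minimal sketch mirroring these TreeSHAP settings is shown below; the model and the feature vectors of the top 10 suggestions are toy stand-ins for the trained LEXAS model and its inputs.

```python
import numpy as np
import shap
import xgboost as xgb

# toy stand-ins for the trained LEXAS model and the feature vectors
# of the ten highest-ranked suggestions for one query gene
rng = np.random.default_rng(0)
X, y = rng.random((500, 20)), rng.integers(0, 2, size=500)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
X_top10 = X[:10]

explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X_top10, approximate=True)

# most influential feature (largest absolute SHAP value) for each suggestion
top_feature_idx = np.abs(shap_values).argmax(axis=1)
```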

Experiment method prediction

Given the feature vectors used to train the LEXAS model, a multi-class logistic regression classifier was trained to predict the category of the experiment method in the next experiment. Experiment methods were categorized into 20 groups as follows: knockdown/knockout, overexpression, immunofluorescence, protein-protein interaction, RT-PCR or qPCR, bioinformatics, immunohistochemistry, next-generation sequencing, rescue experiment, protein structure, FISH, screening, mass spectrometry, super-resolution microscopy, electron microscopy, GWAS, live imaging, X-ray scattering, circular dichroism, and others. The accuracy of the method category prediction was 45.2%.
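
A minimal sketch of such a multi-class classifier with scikit-learn is shown below; the feature vectors and category labels are toy stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((400, 50))             # toy stand-ins for the LEXAS feature vectors
y_train = rng.integers(0, 20, size=400)     # integer labels for the 20 method categories

clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
clf.fit(X_train, y_train)

predicted_category = clf.predict(rng.random((1, 50)))[0]   # index of the suggested category
```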

Web application

For the user interface, the LEXAS, LEXAS-plus, and LEXAS-data models were trained using all collected experiments and the latest information sources as of 2023-01-30. When a user queries a gene, the top 100 genes with the highest probabilities are shown in the suggestion interface. The web application is built using Python, Flask, uWSGI, and Nginx and is accessible through a web browser. To ensure data privacy and security, user inputs are encrypted using SSL/TLS protocols.
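
For illustration only, a minimal Flask endpoint of the kind such an interface might expose is sketched below; the route, helper function, and returned fields are hypothetical and do not describe the production LEXAS deployment.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def suggest_genes(gene, model_name="LEXAS-plus", top_n=100):
    """Hypothetical helper: rank all human genes with the chosen model and return the best top_n."""
    return [{"gene": "CEP120", "score": 0.984}]      # placeholder result

@app.route("/suggest")
def suggest():
    gene = request.args.get("gene", "")
    model_name = request.args.get("model", "LEXAS-plus")
    return jsonify(suggest_genes(gene, model_name))

if __name__ == "__main__":
    app.run()   # served behind uWSGI and Nginx in production
```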

Quantification and statistical analysis

Mann-Whitney U tests were conducted in Figures 3 and S2 using the Python SciPy library57 to calculate p values. p values are denoted as ∗∗∗ for p < 0.001 and NS for p > 0.05 (not significant). Bootstrapping tests were conducted in Figures 3 and S2 using Python to obtain 95% confidence intervals; we performed 10,000 runs of resampling with replacement to calculate the mean values.58

Additional resources

LEXAS web interface: https://lexas.f.u-tokyo.ac.jp.

Acknowledgments

We gratefully acknowledge T. Mizuno, N. Kono, Y. Kishi, and Kitagawa lab members for constructive feedback on the user interface of LEXAS and J. Nakahara and R. Yabuki for the evaluation of the annotations.

This work was supported by JSPS KAKENHI grants (Grant number: 19H05651, 21J21432) from the Ministry of Education, Science, Sports and Culture of Japan, JST CREST (Grant number JPMJCR22E1), and the IIW program of The University of Tokyo, Japan.

Author contributions

K.K.I. conceived the study. K.K.I., Y.T., and D.K. designed the study. K.K.I. constructed the LEXAS system and web interface. K.K.I., Y.T., and D.K. analyzed the data. K.K.I., Y.T., and D.K. wrote the manuscript.

Declaration of interests

The user interface of this system has been applied for a patent in Japan (application number: 2021-133597, pending examination).

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used GPT-4 to improve language and readability. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Published: February 23, 2024

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2024.109309.

Contributor Information

Kei K. Ito, Email: ito-delightfully-kei@g.ecc.u-tokyo.ac.jp.

Yoshimasa Tsuruoka, Email: yoshimasa-tsuruoka@g.ecc.u-tokyo.ac.jp.

Daiju Kitagawa, Email: dkitagawa@mol.f.u-tokyo.ac.jp.

Supplemental information

Document S1. Figures S1–S3 and Tables S1, S4, and S5
mmc1.pdf (477KB, pdf)
Table S2. The list of experiment methods, related to Figure 2
mmc2.xlsx (84.3KB, xlsx)
Table S3. The number of experiments per gene, related to Figure 2
mmc3.xlsx (439KB, xlsx)
Table S6. Mean of AUROC@100 for genes classified by Gene Ontology terms, related to Figure 4
mmc4.xlsx (25.4KB, xlsx)

References

1. Tsuruoka Y., Tsujii J., Ananiadou S. FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics. 2008;24:2559–2560. doi: 10.1093/bioinformatics/btn469.
2. Tsuruoka Y., Miwa M., Hamamoto K., Tsujii J., Ananiadou S. Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics. 2011;27:i111–i119. doi: 10.1093/bioinformatics/btr214.
3. Rindflesch T.C., Kilicoglu H., Fiszman M., Rosemblat G., Shin D. Semantic MEDLINE: An advanced information management application for biomedicine. Information Services and Use. 2011.
4. Shen J., Vasaikar S., Zhang B. DLAD4U: deriving and prioritizing disease lists from PubMed literature. BMC Bioinf. 2018;19:495. doi: 10.1186/s12859-018-2463-0.
5. Chen H., Sharp B.M. Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinf. 2004;5:147. doi: 10.1186/1471-2105-5-147.
6. Björne J., Salakoski T. Biomedical Event Extraction Using Convolutional Neural Networks and Dependency Parsing. Proceedings of the BioNLP 2018 Workshop. Association for Computational Linguistics; 2018. pp. 98–108.
7. Miwa M., Pyysalo S., Ohta T., Ananiadou S. Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinf. 2013;14:175. doi: 10.1186/1471-2105-14-175.
8. Wang X.D., Weber L., Leser U. Biomedical event extraction as multi-turn question answering. EMNLP 2020 - 11th International Workshop on Health Text Mining and Information Analysis (LOUHI 2020); 2020.
9. Trieu H.L., Tran T.T., Duong K.N.A., Nguyen A., Miwa M., Ananiadou S. DeepEventMine: End-to-end neural nested event extraction from biomedical texts. Bioinformatics. 2020;36:4910–4917. doi: 10.1093/bioinformatics/btaa540.
10. Björne J., Salakoski T. Generalizing biomedical event extraction. Proceedings of BioNLP Shared Task 2011 Workshop at the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011); 2011.
11. Warde-Farley D., Donaldson S.L., Comes O., Zuberi K., Badrawi R., Chao P., Franz M., Grouios C., Kazi F., Lopes C.T., et al. The GeneMANIA prediction server: Biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 2010;38:W214–W220. doi: 10.1093/nar/gkq537.
12. Persson E., Castresana-Aguirre M., Buzzao D., Guala D., Sonnhammer E.L.L. FunCoup 5: Functional Association Networks in All Domains of Life, Supporting Directed Links and Tissue-Specificity. J. Mol. Biol. 2021;433. doi: 10.1016/j.jmb.2021.166835.
13. Greene C.S., Krishnan A., Wong A.K., Ricciotti E., Zelaya R.A., Himmelstein D.S., Zhang R., Hartmann B.M., Zaslavsky E., Sealfon S.C., et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 2015;47:569–576. doi: 10.1038/ng.3259.
14. Kim C.Y., Baek S., Cha J., Yang S., Kim E., Marcotte E.M., Hart T., Lee I. HumanNet v3: an improved database of human gene networks for disease research. Nucleic Acids Res. 2022;50:D632–D639. doi: 10.1093/nar/gkab1048.
15. Al-Aamri A., Taha K., Al-Hammadi Y., Maalouf M., Homouz D. Constructing Genetic Networks using Biomedical Literature and Rare Event Classification. Sci. Rep. 2017;7. doi: 10.1038/s41598-017-16081-2.
16. Szklarczyk D., Gable A.L., Nastou K.C., Lyon D., Kirsch R., Pyysalo S., Doncheva N.T., Legeay M., Fang T., Bork P., et al. The STRING database in 2021: Customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49:D605–D612. doi: 10.1093/nar/gkaa1074.
17. Yu G., Li F., Qin Y., Bo X., Wu Y., Wang S. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26:976–978. doi: 10.1093/bioinformatics/btq064.
18. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
19. Carbon S., Douglass E., Dunn N., Good B., Harris N.L., Lewis S.E., Mungall C.J., Basu S., Chisholm R.L., Dodson R.J., et al. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;47:D330–D338. doi: 10.1093/nar/gky1055.
20. Razick S., Magklaras G., Donaldson I.M. iRefIndex: A consolidated protein interaction database with provenance. BMC Bioinf. 2008;9:405. doi: 10.1186/1471-2105-9-405.
21. Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45:D353–D361. doi: 10.1093/nar/gkw1092.
22. Meldal B.H.M., Bye-A-Jee H., Gajdoš L., Hammerová Z., Horácková A., Melicher F., Perfetto L., Pokorný D., Lopez M.R., Türková A., et al. Complex Portal 2018: Extended content and enhanced visualization tools for macromolecular complexes. Nucleic Acids Res. 2019;47:D550–D558. doi: 10.1093/nar/gky1001.
23. Treloar N.J., Braniff N., Ingalls B., Barnes C.P. Deep reinforcement learning for optimal experimental design in biology. PLoS Comput. Biol. 2022;18. doi: 10.1371/journal.pcbi.1010695.
24. Sverchkov Y., Craven M. A review of active learning approaches to experimental design for uncovering biological networks. PLoS Comput. Biol. 2017;13. doi: 10.1371/journal.pcbi.1005466.
25. King R.D., Rowland J., Oliver S.G., Young M., Aubrey W., Byrne E., Liakata M., Markham M., Pir P., Soldatova L.N., et al. The Automation of Science. Science. 2009;324:85–89. doi: 10.1126/science.1165620.
26. Lee J., Yoon W., Kim S., Kim D., Kim S., So C.H., Kang J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–1240. doi: 10.1093/bioinformatics/btz682.
27. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL HLT 2019 - 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol.; 2018.
28. Hartwell L.H., Hopfield J.J., Leibler S., Murray A.W. From molecular to modular cell biology. Nature. 1999;402:C47–C52. doi: 10.1038/35011540.
29. Chen C., Ma W., Zhang M., Wang C., Liu Y., Ma S. Revisiting Negative Sampling vs. Non-sampling in Implicit Recommendation. ACM Trans. Inf. Syst. 2023;41:1–25. doi: 10.1145/3522672.
30. Wang J.Z., Du Z., Payattakool R., Yu P.S., Chen C.F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23:1274–1281. doi: 10.1093/bioinformatics/btm087.
31. Köhler S., Gargano M., Matentzoglu N., Carmody L.C., Lewis-Smith D., Vasilevsky N.A., Danis D., Balagura G., Baynam G., Brower A.M., et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207–D1217. doi: 10.1093/nar/gkaa1043.
32. Oughtred R., Stark C., Breitkreutz B.J., Rust J., Boucher L., Chang C., Kolas N., O'Donnell L., Leung G., McAdam R., et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 2019;47:D529–D541. doi: 10.1093/nar/gky1079.
33. Cai Y., Hossain M.J., Hériché J.K., Politi A.Z., Walther N., Koch B., Wachsmuth M., Nijmeijer B., Kueblbeck M., Martinic-Kavur M., et al. Experimental and computational framework for a dynamic protein atlas of human cell division. Nature. 2018;561:411–415. doi: 10.1038/s41586-018-0518-z.
34. Meyers R.M., Bryan J.G., McFarland J.M., Weir B.A., Sizemore A.E., Xu H., Dharia N.V., Montgomery P.G., Cowley G.S., Pantel S., et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 2017;49:1779–1784. doi: 10.1038/ng.3984.
35. Tweedie S., Braschi B., Gray K., Jones T.E.M., Seal R.L., Yates B., Bruford E.A. Genenames.org: The HGNC and VGNC resources in 2021. Nucleic Acids Res. 2021;49:D939–D946. doi: 10.1093/nar/gkaa980.
36. Blake J.A., Baldarelli R., Kadin J.A., Richardson J.E., Smith C.L., Bult C.J., Mouse Genome Database Group. Mouse Genome Database (MGD): Knowledgebase for mouse-human comparative biology. Nucleic Acids Res. 2021;49:D981–D987. doi: 10.1093/nar/gkaa1083.
37. Thul P.J., Åkesson L., Wiking M., Mahdessian D., Geladaki A., Ait Blal H., Alm T., Asplund A., Björk L., Breckels L.M., et al. A subcellular map of the human proteome. Science. 2017;356. doi: 10.1126/science.aal3321.
38. Dunham I., Kundaje A., Aldred S.F., Collins P.J., Davis C.A., Doyle F., Epstein C.B., Frietze S., Harrow J., Kaul R., et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
39. Pan J., Kwon J.J., Talamas J.A., Borah A.A., Vazquez F., Boehm J.S., Tsherniak A., Zitnik M., McFarland J.M., Hahn W.C. Sparse dictionary learning recovers pleiotropy from human cell fitness screens. Cell Syst. 2022;13:286–303.e10. doi: 10.1016/j.cels.2021.12.005.
40. Schröder G., Thiele M., Lehner W. Setting goals and choosing metrics for recommender system evaluations. CEUR Workshop Proceedings; 2011.
41. Pan J.Y., Yang H.J., Faloutsos C., Duygulu P. Automatic multimedia cross-modal correlation discovery. KDD-2004 - Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004. pp. 653–658.
42. Lundberg S., Lee S.-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (NIPS 2017). Curran Associates, Inc.; 2017. pp. 4765–4774.
43. Saleem R., Yuan B., Kurugollu F., Anjum A., Liu L. Explaining deep neural networks: A survey on the global interpretation methods. Neurocomputing. 2022;513:165–180. doi: 10.1016/j.neucom.2022.09.129.
44. Mikolov T., Chen K., Corrado G., Dean J. Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings; 2013.
45. Jakobsen L., Vanselow K., Skogs M., Toyoda Y., Lundberg E., Poser I., Falkenby L.G., Bennetzen M., Westendorf J., Nigg E.A., et al. Novel asymmetrically localizing components of human centrosomes identified by complementary proteomics methods. EMBO J. 2011;30:1520–1535. doi: 10.1038/emboj.2011.63.
46. Atorino E.S., Hata S., Funaya C., Neuner A., Schiebel E. CEP44 ensures the formation of bona fide centriole wall, a requirement for the centriole-to-centrosome conversion. Nat. Commun. 2020;11. doi: 10.1038/s41467-020-14767-2.
47. Vásquez-Limeta A., Lukasik K., Kong D., Sullenberger C., Luvsanjav D., Sahabandu N., Chari R., Loncarek J. CPAP insufficiency leads to incomplete centrioles that duplicate but fragment. J. Cell Biol. 2022;221. doi: 10.1083/jcb.202108018.
48. Chen H.Y., Wu C.T., Tang C.J.C., Lin Y.N., Wang W.J., Tang T.K. Human microcephaly protein RTTN interacts with STIL and is required to build full-length centrioles. Nat. Commun. 2017;8. doi: 10.1038/s41467-017-00305-0.
49. Karasu O.R., Neuner A., Atorino E.S., Pereira G., Schiebel E. The central scaffold protein CEP350 coordinates centriole length, stability, and maturation. J. Cell Biol. 2022;221. doi: 10.1083/jcb.202203081.
50. Fritz-Laylin L.K., Cande W.Z. Ancestral centriole and flagella proteins identified by analysis of Naegleria differentiation. J. Cell Sci. 2010;123:4024–4031. doi: 10.1242/jcs.077453.
51. Chang C.H., Chen T.Y., Lu I.L., Li R.B., Tsai J.J., Lin P.Y., Tang T.K. CEP120-mediated KIAA0753 recruitment onto centrioles is required for timely neuronal differentiation and germinal zone exit in the developing cerebellum. Genes Dev. 2021;35:1445–1460. doi: 10.1101/GAD.348636.121.
52. Rouillard A.D., Gundersen G.W., Fernandez N.F., Wang Z., Monteiro C.D., McDermott M.G., Ma'ayan A. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database. 2016;2016. doi: 10.1093/database/baw100.
53. Neumann M., King D., Beltagy I., Ammar W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics; 2019. pp. 319–327.
54. Rehurek R., Sojka P. Gensim: Python framework for vector space modelling. NLP Centre, Fac. Informatics, Masaryk Univ., Brno, Czech Republic. 2011;3:2.
55. Szklarczyk D., Franceschini A., Wyder S., Forslund K., Heller D., Huerta-Cepas J., Simonovic M., Roth A., Santos A., Tsafou K.P., et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015;43:D447–D452. doi: 10.1093/nar/gku1003.
56. Lundberg S.M., Erion G., Chen H., DeGrave A., Prutkin J.M., Nair B., Katz R., Himmelfarb J., Bansal N., Lee S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020;2:56–67. doi: 10.1038/s42256-019-0138-9.
57. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
58. Efron B., Tibshirani R.J. An Introduction to the Bootstrap. Chapman and Hall/CRC; 1994.

Associated Data


Data Availability Statement

  • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table.

  • All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

