Abstract
The most common approaches to discovering genes associated with specific diseases are based on machine learning and use a variety of feature selection techniques to identify significant genes that can serve as biomarkers for a given disease. More recently, the integration in this process of prior knowledge-based approaches has shown significant promise in the discovery of new biomarkers with potential translational applications. In this study, we developed a novel approach, GediNET, that integrates prior biological knowledge to gene Groups that are shown to be associated with a specific disease such as a cancer. The novelty of GediNET is that it then also allows the discovery of significant associations between that specific disease and other diseases. The initial step in this process involves the identification of gene Groups. The Groups are then subjected to a Scoring component to identify the top performing classification Groups. The top-ranked gene Groups are then used to train a Machine Learning Model. The process of Grouping, Scoring and Modelling (G-S-M) is used by GediNET to identify other diseases that are similarly associated with this signature. GediNET identifies these relationships through Disease–Disease Association (DDA) based machine learning. DDA explores novel associations between diseases and identifies relationships which could be used to further improve approaches to diagnosis, prognosis, and treatment. The GediNET KNIME workflow can be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS.
Subject terms: Cancer, Computational biology and bioinformatics, Genetics, Molecular biology, Biomarkers, Diseases
Introduction
Complex diseases like diabetes, Alzheimer’s, and cancer are influenced by genetics, lifestyle, and environmental factors and do not follow any clear inheritance patterns. Research targeting gene expression patterns seeks identify disease associated genes that can potentially be used to identify biomarker patterns associated with early diagnosis, prognosis, and development of an effective drug design1. Biomarker identification and sample classification, has become an attractive research area in the field of bioinformatics2–5.
Over the last decade, the availability of large datasets has contributed to forming rich data repositories such as miRTarBase6 for microRNA target genes, Gene Ontology (GO)7, Gene Expression Omnibus (GEO), which provides access to microarray measurements8, TCGA—a database for gene expression, RNA-seq9, and KEGG—a knowledge-base of pathways10. Another widely used biological resource is DisGeNET, a knowledge-based platform for gene-disease–variant associations11. Researchers can leverage these resources for in-silico validation and to train statistical machine learning models for classification and biomarker discovery.
Hallmarks of human diseases include the critical perturbation in gene(s)/protein(s) in critical molecular pathways that can produce divergent or lethal phenotypes. This “principle of guilt-by-association” suggests that associated genes can share functions through genetic or physical interactions12. In other words, genes responsible for similar diseases/phenotypes are likely to be similar. This finding has motivated a shift from the traditional pure data-oriented approaches to knowledge-based integrative approaches. Insights can be better attained when advanced tools exploit biological knowledge for deep analysis rather than just using the traditional clustering and machine learning approaches13,14.
Different studies identifying genes associated with human diseases have resulted in the development of tools for diagnosis and, in some cases, have led to the design of novel drugs. Many computational tools that differ in their approaches and use of resources have been described, including those that integrate various types of biological information into machine learning15,16. One integrative approach is to use the aggregation of multiple datasets to increase the statistical power to effectively identify a small subset of genes to predict disease types17. BioGraph, presented by Liekens et al.18 is a data-mining platform for disease gene prioritization and identification that integrates 21 curated biomedical databases in order to rank disease-gene relations and identify potential susceptibility genes. Other approaches, such as GeP-HMRF integrate Genome-wide association studies (GWAS), expression quantitative trait loci (eQTL), and protein–protein interaction (PPI) data19. GeP-HMRF is a unified statistical model to predict disease-related genes that is reported to outperform Sherlock20, COLOC21, and NetWAS22 tools. The work of Peng et al.23 proposes a new network-based disease gene prediction method called SLN-SRW (Simplified Laplacian Normalization-Supervised Random Walk) to generate edge weights of a new biomedical network by integrating heterogeneous sources of biomedical data.
The study by Asif et al.201816 demonstrated that machine learning classifiers trained on functional gene similarities, using Gene Ontology (GO) to compute similarities between genes improves the identification of genes involved in complex diseases such as autism spectrum disorder (ASD). Luo et al.24 proposed EdgCSN, an ensemble learning algorithm that uses protein–protein interaction networks extracted from clinical sample-based networks, to predict disease-associated genes.
DisGeNET is a database11 that includes a variety of data for different diseases. Hamzeh and Rueda have proposed a new machine learning method incorporating the DisGeNET database to detect biomarkers in prostate cancer. A wrapper-based feature-selection approach was used to group genes-related diseases based on their classification accuracy. Results for each iteration were saved for further validation by researchers based on the best AUC or the highest number of detected genes in each group11.
Yousef et al. developed the Grouping-Scoring-Modeling (G-S-M) approach for integrating biological knowledge through different computational tools such as SVM-RCE-R25,26 maTE27, CogNet28, mirCorrnet29, miRModuleNet30, and PriPath31. Integrating biological knowledge with gene expression selection was reviewed in38 SVM-RCE-R25,26 tools were the first reports that considered groups of genes rather than individual genes, SVM-RCE (Support Vector Machines -Recursive Cluster Elimination), groups genes based on their gene expression values and scores each cluster of genes by a machine learning algorithm. In a recent study, Yousef et al.32, used the G-S-M model to integrate Gene Ontology data for grouping genes. In SVM-RNE (Recursive Network elimination)33 they detected gene networks that serve as gene groups for scoring and ranking by adopting the G-S-M model. Although different studies have used mRNA expression data and knowledge bases such as DisGeNet in their studies, our main objective using the G-S-M approach, has been to group genes to identify the best groups that were related to a specific disease. GediNET, our novel machine learning approach with two-class classification does not need other data annotations. With Monte Carlo cross-validation (MCCV), fractions of the samples are randomly selected as training dataset, and the rest is assigned for the testing dataset. The most accurate disease-gene groups are then identified in each training iteration, later accumulative top-ranked groups are combined to train the model. We also examined the results using similar approaches that follow the same merit, such as maTE27, CogNet28, mirCorrnet29, miRModuleNet30, and PriPath31.
However, the aim of the GediNET is not to compete with other tools that focus on single disease signatures but rather the aim is to discover novel gene groups with associations across a subset of disease based on machine learning.
Materials and methods
All methods were performed in accordance with the relevant guidelines and regulations.
Datasets
We downloaded 10 human gene expression datasets for different types of complex diseases from GEO database8. For each dataset, the name of the disease and the number of samples were defined. Moreover, positive and negative samples were available. Table 1 describes the 10 datasets in more detail.
Table 1.
GEO accession | Title | Disease | #Samples | Classes |
---|---|---|---|---|
GDS1962 | Glioma-derived stem cell factor effect on angiogenesis in the brain | Glioma | 180 | Negative = 23 |
Positive = 157 | ||||
GDS2545 | Metastatic prostate cancer (HG-U95A) | Prostate cancer | 171 | Negative = 81 |
Positive = 90 | ||||
GDS2771 | Large airway epithelial cells from cigarette smokers with suspect lung cancer | Lung cancer | 192 | Negative = 90 |
Positive = 102 | ||||
GDS3257 | Cigarette smoking effect on lung adenocarcinoma | Lung adenocarcinoma | 107 | Negative = 49 |
Positive = 58 | ||||
GDS4206 | Pediatric acute leukemia patients with early relapse: white blood cells | Leukemia | 197 | Negative = 157 |
Positive = 40 | ||||
GDS5499 | Pulmonary hypertension: PBMCs | Pulmonary hypertension | 140 | Negative = 41 |
Positive = 99 | ||||
GDS3837 | Non-small cell lung carcinoma in female nonsmokers | Lung cancer | 120 | Negative = 60 |
Positive = 60 | ||||
GDS4516_4718 | Colorectal cancer: laser microdissected tumor tissues | Colorectal cancer | 148 | Negative = 44 |
Positive = 104 | ||||
GDS2547 | Metastatic prostate cancer (HG-U95C) | Prostate cancer | 164 | Negative = 75 |
Positive = 89 | ||||
GDS3268 | Colon epithelial biopsies of ulcerative colitis patients | Colitis | 202 | Negative = 73 |
Positive = 129 |
Each entry has the GEO accession, the name of the disease, the number of samples and the data classes.
DisGeNET disease-gene association dataset
The dataset containing genes and their associated diseases was downloaded from DisGeNET version 7.011. The dataset contains 30,170 diseases and 21,666 genes that form 3,241,576 gene-disease connections. Given the massive dataset size, two filters were used to reduce the number of associations in terms of practicality and to reduce the computational complexity. The filters were set on the columns diseaseType and diseaseSemanticType in the DisGeNET dataset. The diseaseType column divided the data into three categories—disease, phenotype, and group—and we only chose disease as concerning for our study. On the column diseaseSemanticType, we only chose those rows categorized as Neoplastic Process and Disease. This was done to increase compatibility and to better understand the workflow results. After filtering, only 15,991 genes and 3929 diseases remained for further analysis, which accounted for 329,936 gene-disease associations. Figure 1 illustrates a part of the disease distribution over the number of genes for each disease.
The merit of GediNET in the discovery of disease-disease associations
Let D be a two-class gene expression dataset designed to study a specific disease (for example, Lung Cancer or Breast cancer) in order to detect significant genes that will serve as a biomarker for distinguishing cancer vs non-cancer. The traditional approach of the classification model suggests a list of k genes that can serve as biomarkers for predicting those patients with the disease. In other words, identifying disease-gene associations. One possible solution could be a linear function F(X) that might be expressed as:
F(X) = w1g1 + w2g2 + … + wkgk, where wi are the weights (scores) while the gi are the gene expression values. The weights indicate the importance (significant) of each gene expression for the linear model. For instance, a value weight close to zero indicates that the associated genes contribute less to the equation model. In other words, F(X) describes the biological interaction between those k individual genes to form a biomarker signature.
GediNET differs from traditional approaches by considering groups of genes, rather than individual genes. A group is a disease name that represents pre-existing biological knowledge of the associations between sets of genes and the disease. GediNET scores those individual groups and their contribution to the classification task by applying the S component of GediNET (see section (The S component). The top j-scored genes groups will be used for training the final model of GediNET. In other words, the genes that appear on those j groups will be used to train the machine learning model. The S component relies on representing the gene groups as a sub-dataset of the original dataset D preserving the class labels, as described in detail in the two following sections (Grouping Genes based on Disease (The G component) and Creating a Sub-dataset).
For simplicity, the final model might be visualized as a decision tree, as illustrated in Fig. 2 (Right panel). The left panel of Fig. 2 illustrates the decision tree model of the significant genes selected by the traditional approach. The right panel of Fig. 2 shows that the decision tree model consists of genes associated with the top three GediNET ranked diseases (groups). This model contains information about biological knowledge of the diseases showing the disease-disease associations.
For example, considering the dataset GDS1962 that studies Glioma, GediNET suggests a model that is based on the top three significant groups/diseases, as follows:
The following are the sets of genes associated with each disease:
Applying GediNET will compute F*(x) that describes the association between the Grp 1, 2 and 3_diseases with the disease under study (in this case Glioma disease). This might lead to new discoveries that have not been observed before by traditional approaches.
The G-S-M components of GediNET
GediNET is based on the generic approach named G-S-M, which has been adopted by different tools such as SVM-RCE 34, SVM-RCE-R25, SVM-RCE-R-OPT26, SVM-RNE33, maTE27, CogNet28 , miRcorrNet29, Integrating Gene Ontology-Based Grouping and Ranking32, miRModuleNet30, PriPath31 and recently reviewed in Yousef et al.35. The main workflow of GediNET is illustrated in Fig. 3, where the G-S-M approach is presented in the three main sections labeled with the orange section (G), the yellow section (S), and the green section (M), which represent:
1. The G Component (Grouping): where the genes are grouped according to the biological pre-existing knowledge of disease. Each group is represented by an extracted two-class subdataset from the main given dataset.
2. The S Component (Scoring): where the groups are scored and ranked by considering the related two-class subdatasets.
3. The M Component (Machine Learning model): where the model is created by training a classifier (Random Forest) on the top ranked groups’ genes.
The inputs for GediNET are a two-class gene expression dataset and a table that represents the biological pre-existing knowledge of the diseases. The dataset consists of two classes of samples: control (negative) and disease (positive). The dataset is split into training and testing. The training dataset is used for the G-S-M components, while the testing dataset is used to evaluate the model’s performance. The whole workflow is repeated 100 iterations using the cross-validation loop, where the input is randomly split into 90% training and 10% testing in each iteration. A Statistical t test (testing of equality of variances, Levene’s test)36 is performed on the training dataset to detect the top differentially expressed genes. The top 2000 differentially expressed genes with a P-value less than 0.05 are selected. The main contribution of the generic approach and the description of each component’s functions are explained in detail in the following sections.
G component: grouping genes based on disease
The first component GediNET is the grouping component G (the orange section in Fig. 3), which separates genes into groups. The G component might be based on any pre-existing biological knowledge, such as miRTarBase, KEGG pathway, etc., for creating groups of genes. In this tool, the G component group genes based on the DisGeNET v7 database11, which are gene-disease associations. Table 2 is an example of such groups that includes the disease name (group name), the set of genes associated with this disease, and the last column is the number of genes in the associated group.
Table 2.
Group name | Genes | #Genes |
---|---|---|
Small cell carcinoma of lung | VPS13B, SLC16A1, ANXA1, CD99, SMARCC1, PCNA… | 41 |
Leukemia, B-cell | TP53, LAMA4, STK11, CSPG4, CD40, TNFRSF1A… | 43 |
Stage III breast cancer Ajcc V6 | TP53, BRCA2 | 2 |
Head and neck carcinoma | PRMT5, ANXA1, LGALS1, TIMP3, IGFBP7, PCNA, TNC, TP53… | 149 |
Secondary malignant neoplasm of bone | ADAM9, SLC16A1, CD99, NME1-NME2, DPYSL3, TNC, TP53, NRAS… | 145 |
Malignant glioma | TK1, NPAS3, CD63, HMGB1, TAGLN2, TXNIP… | 162 |
Adenocarcinoma, tubular | PCNA, TP53, EFEMP1, APOE, STK11, PRKD1… | 31 |
Childhood brain neoplasm | TP53, NRAS, SOX9, MYC, TNFRSF11B | 5 |
Adult myelodysplastic syndrome | CSNK1A1, CTNNA1, HMGB1, PCNA, TOP2A, TP53… | 58 |
Non-small cell lung cancer stage I | TP53, PRRX1, IGFBP3, VEGFA, S100A6, GSTK1… | 22 |
The last column represents the number of genes in each group (group size).
G component: creating two-class subdataset
We assume that D consists of columns that represent the genes expressions while the rows represent the samples. D also has a class label column with information about each sample, as illustrated in Fig. 4 at the Input panel (labeled by I).
To score each group, we have created a two-class subdataset related to each group/disease. Each subdataset is specific for one group/disease that contains the genes belonging to that group/disease. This is achieved by extracting the genes columns belonging to the specific group and their original class label from the original dataset D. Let m be the number of groups. In this stage, we will extract or create m two-class subdatasets that will be input to the S (Scoring) component. In Fig. 4, the I panel (input panel) contains two matrices. The left one is an example of the gene expression matrix D with the class label for each sample appearing in column “Class”. The right one is the pre-existing biological knowledge containing the disease name (group name) with its set of genes. In our example, the right matrix contains four group diseases labeled with group_diseasei, i = 1,…,4. For example, group_disease1 represents the disease named “Well Differentiated Pancreatic Endocrine Tumor,” along with three genes associated with this specific disease. The genes are RBMS3, TFE3, and NTRK1.
Within the G component, the extraction of two-class subdatasets is performed. As evident in Fig. 4, four subdatasets are created. For each subdataset, the gene columns belonging to each disease group are extracted from the D dataset with the original class label, where pos is for the positive class and neg for the negative class. The four subdatasets serve as input to the following component, S, to be scored and ranked.
S component: scoring the groups
As a result of the G component, m, two-classes subdatasets are created, each representing one group. The task of the S component is to compute a score that measures to what extent it is differentially expressed considering the given two classes. The group is a set of genes; one way of computing a group-score is by computing each individual genes t statistics and then averaging those scores to be the final score of the group, as suggested in37. The following equations might be used to compute this score for given gene i:
1 |
where and are the average expressions over the positive and negative class respectively. and are the standard deviations over the positive and negative class, while, is the number of positive class samples, and is the negative class samples.
Based on equation number 1, one might compute a score for a given group that consists of k genes as the following:
2 |
However, GediNET uses a more progressive approach based on machine learning to compute such scores. Figure 5 illustrates the steps of the S component that ends by assigning the performance measurement as the group score. In our case, we consider the accuracy. Each two-class subdataset is randomly split into training and testing (90% training and 10% testing) as shown in Fig. 5, Panel S-Splitting, where this procedure is repeated r times. The training is used to train the machine learning algorithm (we have used Random Forest), and the model’s performance is evaluated on the test split as seen in the Panel, S-FitTestModel. The accuracy average of the r splits is computed to form the group score. All of the group scores are collected to form a table of m scores. For the M component, we perform a ranking step by ordering the table in descending order. An example of such an output of the Scoring component applied to the GDS2545 dataset is presented in Table 3.
Table 3.
Disease | Genes set | Score | Rank |
---|---|---|---|
Papillary renal cell carcinoma | TP53, VEGFA, SNORD35B, … | 0.98 | 1 |
Plasma cell neoplasm | LYN, IGF1, NME1, … | 0.96 | 2 |
Adult glioblastoma | BRD2, DNMT1, MAOB, … | 0.94 | 3 |
Intestinal cancer | CDKN2A, TP53, RPL24, … | 0.91 | 4 |
Malignant neoplasm of colon stage IV | LARP1, PES1, IFI27, MEN1, … | 0.89 | 5 |
Dermatofibrosarcoma | POSTN, AR, CDKN2A, TP53, … | 0.87 | 6 |
GediNET uses the accuracy measurement to assign a score; one might use a different measurement or a combination of measurements (such as sensitivity, specificity, the Area under the curve, etc.). For more information on such an option, we refer to26.
M component: fitting the model
The M component considers the top-ranked j groups of disease, and their genes are merged to form the top-ranked associated genes (as seen in Fig. 5, the output panel). A subdataset is extracted considering the top-ranked associated genes from the training part of the dataset (90% training, 10% testing, as mentioned before). An RF model is trained on the extracted subdataset. Finally, the model is evaluated on the testing dataset represented by those genes, and the performance statistics are recorded. We have reported the performance of j = 1,…,10.
In our implementation, many RF classifiers are trained on randomly selected data using 90% data for training and 10% for testing the classifier. However, such settings can be adjusted in our KNIME implementation of GediNET.
Implementation of GediNET
We have implemented the GediNET tool using the free and open-source platform KNIME38 due to its simple and intuitive graphical user interface. KNIME is a highly integrative platform that has enabled the scope to utilize scripts in both python and R in tandem to implement our tool as a KNIME workflow.
The workflow created on KNIME comprises several nodes with their separate functions. Meta-nodes are created as a collection of nodes that perform specific tasks.
The KNIME workflow for GediNET is presented in Fig. 6. It starts by uploading a list of the names of the dataset via the “List Files/Folders” node. Then a loop over those datasets is run to read each dataset by the node “Table Reader”, which is then processed by the meta-node “FilterMissingValues” to remove and or filter out rows with missing values. It then sends the filtered data as input to the GediNET meta-node. While the “Integer Input” node allows modifying the number of iterations, the tool should be used while training the model.
The GediNET KNIME workflow could be downloaded from: https://github.com/malikyousef/GediNET or https://kni.me/w/3kH1SQV_mMUsMTS.
Model performance evaluation
We used the Random Forest Classifier while splitting the data into 90% training and 10% testing. Since the datasets are imbalanced, meaning the dataset’s class label has an uneven distribution of observations, we employed the under-sampling method. Such a method deals with imbalanced datasets by maintaining all of the samples in the minority class while decreasing the size of the majority class. For model training, we applied tenfold Monte Carlo cross-validation (MCCV)39. With Monte Carlo cross-validation (MCCV), fractions of the samples are randomly selected as training data, and the rest is assigned for the test data. The performance measures are computed as the average of 100-fold MCCV. We use MCCV rather than traditional CV because the MCCV method is more repeatable since the variance is low.
To evaluate the performance of the RF model, several quantitative metrics were calculated, such as Accuracy, Sensitivity and Specificity40, using the following formulations:
3 |
4 |
5 |
where TP = true positive; FP = false positive, TN = true negative; and FN = false negative. Moreover, the Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve41. We used the AUC to evaluate the performance results.
In each iteration, our approach generates lists of disease groups and their associated genes that are slightly different. Hence, there is a need to apply a prioritization approach on those lists. As utilized in miRcorrNet, we have used rank aggregation methods. In this respect, we have embedded the RobustRankAggreg R package42, developed by (Kolde et al.42), into the GediNET workflow. The RobustRankAggreg assigns a P-Value to each element in the aggregated list, which describes how well each element/entity was ranked compared to the expected value.
Results
Performance evaluation of GediNET
Table 4 presents an example of the average 100-fold MCCV performance table of GediNET for aggregated top-ranked 10 groups for the GDS1962 dataset. The last row presents the performance of the top-ranked group (#Groups = 1). The AUC obtained is 97% using 21.61 genes on average. The row of #Groups = 2 presents the performance metrics obtained for the top 2 groups, where the genes of the first top-ranked group and the second-highest scoring group are aggregated together. That is to say that GediNET reports the performance results for the top 10 groups cumulatively.
Table 4.
#Groups | #Genes | Accuracy | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|
10 | 136.74 | 0.928 | 0.93 | 0.92 | 0.98 |
9 | 127.68 | 0.93 | 0.93 | 0.92 | 0.98 |
8 | 116.02 | 0.93 | 0.94 | 0.92 | 0.98 |
7 | 111.16 | 0.93 | 0.93 | 0.91 | 0.98 |
6 | 102.02 | 0.93 | 0.9 | 0.92 | 0.98 |
5 | 92.88 | 0.93 | 0.93 | 0.93 | 0.98 |
4 | 78.37 | 0.93 | 0.93 | 0.92 | 0.98 |
3 | 62.47 | 0.93 | 0.94 | 0.92 | 0.98 |
2 | 45.57 | 0.93 | 0.93 | 0.93 | 0.97 |
1 | 21.61 | 0.92 | 0.93 | 0.92 | 0.97 |
Table 5 shows the GediNET performance over 10 datasets for the top 2 gene groups. All values are the results of an average of 100-MCCV iterations while considering the AUC for presenting the performance. The complete performance results are attached in the supplementary data. The table shows the GEO accession in the first column, the number of genes in column #Genes while ACC is the accuracy, SEN is the sensitivity, SPE is the specificity, and the AUC is the area under the curve. We see only one unsuccessful result for the dataset GDS4206. However, a similar observation was made when applying other tools to this specific dataset, as illustrated in Fig. 7.
Table 5.
GEO Accession | #Genes | ACC | SEN | SPE | AUC |
---|---|---|---|---|---|
GDS1962 | 45.57 | 0.93 | 0.93 | 0.93 | 0.97 |
GDS2545 | 113.76 | 0.73 | 0.72 | 0.74 | 0.81 |
GDS2771 | 97.83 | 0.64 | 0.69 | 0.59 | 0.70 |
GDS3257 | 74.81 | 0.97 | 0.99 | 0.94 | 0.99 |
GDS3837 | 21 | 0.92 | 0.83 | 1 | 0.92 |
GDS4206 | 83 | 0.66 | 0.3 | 0.82 | 0.58 |
GDS4516_4718 | 40.72 | 0.99 | 0.99 | 0.99 | 1 |
GDS2574 | 102.49 | 0.76 | 0.77 | 0.76 | 0.83 |
GDS3268 | 115.7 | 0.67 | 0.7 | 0.63 | 0.73 |
GDS5499 | 80.23 | 0.9 | 0.96 | 0.77 | 0.95 |
ACC accuracy, SEN sensitivity, SPE specificity, FM F-measure, AUC area under the ROC curve.
The average number of genes associated with the top 2 groups is slightly high because the distribution of genes over the disease is slightly high compared, for example, to other biological knowledge such as microRNA target or KEGG pathways. Moreover, this number of genes could be reduced by removing the least contributed genes when processing each group. This step will be considered in the future version of the algorithm. Also, one can use additional biological knowledge to filter out more genes from the group by, for example, leaving the most associated genes with the disease. The last suggestion requires other biological resources to be embedded into the GediNET.
Comparative evaluation with other biological G-S-M
For comparison, we have considered similar tools that apply the G-S-M approach by integrating biological knowledge for grouping the genes and performing the scoring on the group, such as CogNet30, maTE29, and PriPath33 use RF with the same default parameters (Split criteria: Information Gain Ratio and number of models 100). Moreover, a similar approach was applied in the text mining domain where a TextNetTopics tool was developed43. Within the TextNetTopics, a performance comparison was performed with three different feature selection methods namely Extreme Gradient Boosting (XGBoost), Fast Correlation Based Filter (FCBF), and selectKBest (SKB), through four classifiers. These classifiers are Adaboost, DT, RF, and LogitBoost. The results showed that RF with SKB feature selection provided the highest performance.
We have recorded the AUC values for the top 1–10 groups ranked by the scoring component for each tool by applying 100-MCCV. More specifically, we considered the top two groups for comparison purposes.
Figure 7 illustrates the mean AUC values of the four tools for the 10 datasets. Meanwhile, Fig. 8 plots the mean number of genes for the four tools. As apparent in Fig. 7, the AUC values of GediNET, CogNet, maTE, and PriPath for 10 different datasets for the top two clusters are nearly similar. Thus, the performance of those tools is comparable. This close performance indicates that the developed tool GediNET is consistent and robust. However, the outcome of each tool is different as each one of those tools has its merit and its aim of detecting significant groups related to specific pre-biological knowledge.
Figure 8 implies that, on average, GediNET uses a tenfold higher number of genes than other tools. This is due to the fact that the groups of genes associated with the diseases are much higher than others.
One of the tool’s outputs is a list of ranked disease groups that were assigned a P-value by the robust rank aggregation package42. Table 6 is an example of this tool for the GDS1962 dataset.
Table 6.
GDS1962 | |||
---|---|---|---|
Disease name | P-value | #Genes | List of genes |
Papillary renal cell carcinoma | 0.00052 | 22 | SLC16A1, TAGLN2, TIMP3, IGFBP7… |
Plasma cell neoplasm | 0.0010 | 11 | CD99, TP53, LPL, CD40… |
Common acute lymphoblastic leukemia | 0.001772 | 3 | KNG1, MME, BCL2 |
Ductal breast carcinoma | 0.002363 | 13 | TCF21, AFAP1L2, PLG… |
Gastric mucosa-associated lymphoid tissue lymphoma | 0.002953 | 2 | BCL2, EPCAM |
Intrahepatic cholangiocarcinoma | 0.003544 | 27 | SHBG, BAX, TYMS, GPC3… |
Lymphoma, non-hodgkin | 0.004135 | 44 | BAX, SLC23A1, MME, TYMS, … |
Malignant neoplasm of colon stage iv | 0.004725 | 7 | TYMS, MYCN, KLK6, NDRG1, … |
Neuroectodermal tumor, primitive | 0.005316 | 14 | SFRP1, PCSK2, MYCN, CAPS… |
Papillary thyroid carcinoma | 0.005907 | 75 | BAX, PKHD1L1, MME, GPC3… |
This is a novel output of the feature selection techniques that GediNET is providing. This table will be used to analyze the relationship between the diseases further. For example, Table 6 raises a biological question about the association between the top-ranked diseases (PAPILLARY RENAL CELL CARCINOMA, PLASMA CELL NEOPLASM,…) and the target disease of the study (dataset GDS1962 with target disease Glioma). Additionally, GediNET provides a list of significant genes that were also aggregated by the Robust Rank Aggregation tool. While scoring each group, the genes associated with the group is scored with the same score as the group. This list with its scores is aggregated at the end to compile and report a list of significant genes. Table 7 provides an example of such a list.
Table 7.
Genes | P-value |
---|---|
MYL1 | 0.003 |
RNF44 | 0.016 |
UBN1 | 0.051 |
N4BP2L1 | 0.060 |
GDI1 | 0.066 |
ARL17B | 0.093 |
MYLPF | 0.133 |
The user can consider the list of significant genes for functional and enrichment analysis as was done in similar studies such as PriPath and miRmodulnet using different tools such as David44, EnrichR45, and GeneMANIA46.
Biological interpretations
One of the outputs of GediNET is a list of significant diseases which had been scored by the S component, as illustrated in Table 6. This list is ranked by P-value (ranked by RobustRankAggreg).
For all the 10 GEO datasets, the top 2 diseases and their set of genes were considered to perform pathway enrichment analysis. Their total number of distinct genes is 1184.
The web tool, EnrichR45 was used to perform the pathway enrichment analysis. The tool was run to collect the top enriched pathways for each disease-gene group per dataset, and the top pathways (with the least P-values) were selected. WikiPathway database47 version 2021 for human genes was used to select our results. The top cell signaling pathways’ names for the 10 GEO datasets, P-values, adjusted P-value, and associated genes are illustrated in Table 8. Evidence from literature was then gathered for the dataset cancer and the top-performing disease, along with the enriched genes and pathways found from the enrichment analysis.
Table 8.
Cell signaling pathways term | P-value | Adjusted P-value | List of genes | #Genes |
---|---|---|---|---|
Head and neck squamous cell carcinoma WP4674 | 2.24E-13 | 6.31E-11 | CCND1; CDKN2A; AKT1… | 9 |
DNA damage response (only ATM dependent) WP710 | 2.95E-16 | 1.08E-13 | GSK3B; SMAD4; CDKN1A,… | 14 |
VEGFA-VEGFR2 signaling pathway WP3888 | 1.66E-10 | 6.37E-08 | LRRC59; NRP2; PRKAA2;… | 27 |
VEGFA-VEGFR2 signaling pathway WP3888 | 1.05E-11 | 2.59E-09 | HSP90AA1; ANXA1;… | 18 |
Lung fibrosis WP3624 | 6.32E-09 | 1.73E-06 | GREM1; CSF3;IL6; PLAU; EGF; MUC5B; MMP9 | 7 |
IL-18 signaling pathway WP4754 | 2.33E-17 | 1.05E-14 | GSK3B; CEBPB; CXCL8;… | 29 |
Effects of nitric oxide WP1995 | 2.93E-05 | 0.00310457 | NOS1; XDH | 2 |
TP53 network WP1742 | 2.14E-13 | 9.13E-11 | CDKN1A; CDKN2A; MYC;… | 9 |
Apoptosis WP254 | 1.88E-06 | 4.25E-04 | CASP10; MYC; PMAIP1;… | 6 |
Hepatitis C and hepatocellular carcinoma WP3646 | 5.41E-12 | 2.07E-09 | CDKN1A; IL6; CXCL8;… | 10 |
The first column is the name of the cell signaling pathway, the second column is the P-values, the third column is the adjusted P-value, the Genes column represents an example of the associated genes, and finally, the last column is the total number of associated genes.
Next, we used the cytoscape tool48 to visualize the correlation network between the cell signaling pathways with the overlapping genes for all the top enriched pathways from the previous step. In total, we took the most 10 significant pathways that were enriched among the 20 disease-gene group pairs to visualize. Figure 9 represents the signaling pathway networks with overlapping genes across different GEO datasets.
As we have stated, we examine 10 different GEO gene expression datasets, studying mostly different diseases. Figure 9 illustrates the most significant pathways related to all given datasets, indicating that disease genes are correlated and associated even when studying different diseases. The network in Fig. 9 shows that GediNET discovers important biological information related to various diseases. Moreover, we have studied the significance of GediNET on the data GDS3257 by considering the top 2 significant diseases having 12 distinct genes. Figure 10 illustrates the network of the most significant pathways and their related genes.
Disease-disease associations
We assume that a disease is represented by a set of genes. The simple approach for finding a disease-disease association is by applying different association indices that consider the number of shared genes between the two diseases. For example, one might use the Jaccard Simpson, Geometric, Cosine, and even Pearson correlation coefficient (PCC)32,33.
Recently, different efforts toward Disease-Disease associations (DDA) are gaining attention for their importance in exploring novel associations of diseases and enhancing knowledge of disease relationships, which could further improve approaches to disease diagnosis, prognosis, and treatment. Yet, shared genes offer only limited information about the relationship between two diseases.
The number of known DDA and reliable associations is very small. Thus, it suggests that more efforts are required for DDA detections.
Disease-disease relationships through the incomplete human interactome49 are computational approaches that derive mathematical conditions for the identifiability of disease modules and show that the network-based location of each disease module determines its pathobiological relationship to other diseases. Suratanee A, Plaimas K.50 have developed a novel network-based scoring algorithm called DDA to identify the relationships between diseases in a large-scale study. Their method is developed based on a random walk prioritization in a protein–protein interaction network.
DisGeNET provides through its API, disease-disease associations that have been obtained by computing the number of shared genes and shared variants between pairs of diseases by source. DisGeNet uses two metrics to compute the DDA. The first one is the Jaccard Index (JI) , G1 is the set of genes associated with Disease 1, and G2 is the set of genes related to Disease 2.
The second one is Jaccard variance , V1 is the set of variants associated with Disease 1, and V2 is the set of variants associated with Disease 2.
In order to compute for each dataset, the standard DDA in GediNET, we have computed the fraction of the number of shared genes for each pair of the top-scored disease group for 4 datasets as illustrated in Fig. 11.
GediNET differs from the tools mentioned above in that it is based on machine learning for detecting the relationships between diseases, DDAs, which detect novel and previously unknown associations. We conducted a further analysis to explore if GediNET can identify novel relationships between diseases using DisGeNET API.
Table 9 illustrates for each data set its three top detected diseases by DisGeNET API and the top 3 ranked diseases by GediNET. For each detected disease by DisGeNet we have looked up the disease in the list of ranked diseases by GediNET to examine the two tools.
Table 9.
GEO data set/target disease | The data disease | Top 1 disease name | Top 2 disease name | Top 3 disease name |
---|---|---|---|---|
GDS1962/brainstem glioblastoma | DisGeNET | Recurrent endometrial cancer (#193, pv = 0.16) | Adult astrocytic tumor (#253, pv = 0.22) | Alpha-thalassemia/mental retardation syndrome, nondeletion type, x-linked |
GediNET | Papillary renal cell carcinoma | Plasma cell neoplasm | Adult glioblastoma | |
GDS2545/metastatic prostate cancer | DisGeNET | Metastasis from malignant tumor of prostate (#25, pv = 0.01) | Hormone refractory prostate cancer (#274, pv = 0.34) | Secondary malignant neoplasm of bone (#62, pv = 0.04) |
GediNET | Childhood rhabdomyosarcoma | Rhabdomyosarcoma | Secondary malignant neoplasm of liver | |
GDS2771/lung cancer | DisGeNET | Primary malignant neoplasm of lung (#50, pv = 0.03) | Carcinoma of lung (#97, pv = 0.08) | Non-small cell lung carcinoma (#141, pv = 0.14) |
GediNET | Mantle cell lymphomA | Gastrointestinal carcinoid tumor | Mucinous adenocarcinoma | |
GDS3257/lung adenocarcinoma | DisGeNET | Non-small cell lung cancer recurrent (#116, pv = 0.11) | Adenosquamous cell lung cancer (#137, pv = 0.15) | Adenocarcinoma, metastatic (#200, 0.22) |
GediNET | Acoustic neuroma | Adenocarcinoma of colon | Adenocarcinoma of esophagus | |
GDS4206/Pediatric acute leukemia patients with early relapse: white blood cells | DisGeNET | Childhood leukemia (#96, pv = 0.13) | Melanoma (#29, pv = 0.03) | Glioblastoma multiforme (#115, pv = 0.18) |
GediNET | Acute leukemia | Adult diffuse large b-cell lymphoma | Esophageal carcinoma | |
GDS5499/pulmonary hypertension | DisGeNET | Idiopathic pulmonary hypertension | Vascular diseases | Endothelial dysfunction |
GediNET | Cholangiocarcinoma | Hepatocarcinogenesis | Papilloma | |
GDS3837/Non-small cell lung carcinoma in female nonsmokers | DisGeNET | Primary malignant neoplasm of lung | Carcinoma of lung (#10, pv = 0.009) | Neoplasm metastasis |
GediNET | Early-stage breast carcinoma | Meningioma, benign, no icd-o subtype | Colorectal carcinoma | |
GDS4516_4718/colorectal carcinoma | DisGeNET | Malignant neoplasm of colon and/or rectum (#3, pv = 0.002) | Carcinogenesis | Neoplasm metastasis |
GediNET | Acute leukemia | Acute lymphocytic leukemia | Malignant neoplasm of colon and/or rectum | |
GDS2547/metastatic prostate cancer | DisGeNET | Metastasis from malignant tumor of prostate (#27, pv = 0.02) | Hormone refractory prostate cancer (#91, pv = 0.1) | Secondary malignant neoplasm of bone (#123, pv = 0.18) |
GediNET | Malignant neoplasm of lung | Carcinoma of bladder | prostate carcinoma | |
GDS3268/ulcerative colitis | DisGeNET | Crohn disease | Inflammatory bowel diseases | Colitis |
GediNET | Malignant neoplasm of thyroid | Adenomatous polyposis coli | Leukemia, myelocytic, acute |
For each detected disease by DisGeNET, we have looked up the disease in the list of robust ranked aggregated disease results by GediNET. The values in parenthesis for the rows of DisGeNET are the position of the disease and the P-value assigned by GediNET.
In Table 9 we have included additional information, the values in parenthesis for the rows of DisGeNET are the position of the disease and the P-value assigned by GediNET. Interestingly, excluding just one disease all the top three significant diseases detected by GediNET are novel. This suggests that the tool detects a new biological knowledge that the biology researcher should consider.
Discussion and conclusion
In this study, we describe a novel approach for discovering disease-disease associations and detecting the genes/biomarkers associated with those diseases.
The approach is based on grouping the genes by their disease associations and then scoring those groups in terms of classification significance to train the machine learning model. For example, if a model created from the given data associated with a specific disease, such as lung cancer, is also found to apply to a subset of different diseases, this could suggest a previously undetected biological relationship with those other diseases that could inform clinical approaches not previously considered. The traditional approach of searching for genes that could be used as a biomarker in most cases yields a list of significant genes that solve the computational problem and does not take into account any prior knowledge about those genes, as such, their association with other diseases or even with other biological knowledge such as microRNA targets (see maTE tool27), or Pathways (See CogNet tool28), GeneOntology (See tool32).
Potential limitations and future plans
The novelty of the GediNET approach lies in the fact that it scores gene groups by considering the contribution of all its members. One potential limitation of this approach that might be considered, is whether some members (genes) within a group may have a noisy impact and as a result adversely affect the overall classification performance. Other feature selection approaches that consider each gene individually, will not have this problem. However, to avoid this, we used a statistical t-test on the training dataset to first detect the top differentially expressed genes. The top 2000 differentially expressed genes were then used to extract the training datasets that were used as input to the G component. Thus, GediNET will always be dealing with the least noisy genes. One direction of future work is to perform internal gene scoring for each gene group to consider only those genes with the highest scores (Supplementary table S1).
Another potential limitation of our approach is the possibility that the size of the (gene) group could influence the performance. For example, by influencing Scoring component. Groups that contain larger numbers of gene would tend to have higher scores. This issue might be solved by considering a fixed number of representative genes from each group. An area of feature selection or feature ranking (scoring) that we have not addressed in this study, is the possibility that two groups of features that are useless when considered separately can be useful when they are combined. In GediNET, the scoring component treats each group individually. One potential future approach would be to develop the S component to score groups simultaneously to address this possibility.
Our GediNET tool is unique in that: (1) the search for the significant biomarkers/genes focuses on gene groups rather than single genes associated with the disease and (2) the final list of genes can be used to define new disease-disease associations as presented in Fig. 2, right panel. GediNET identifies important relationships between diseases, using DDA based machine learning, which explores novel associations that can enhance our knowledge of disease relationships and which could further improve approaches to disease diagnosis, prognosis, and treatment by detecting new relationship between diseases.
Supplementary Information
Acknowledgements
The work of M.Y. has been supported by the Zefat Academic College. L. Showe was supported by The Commonwealth of Pennsylvania–CURE Formula Funding: SAP #4100088567.
Author contributions
These authors contributed equally to this work. All authors reviewed the manuscript.
Data availability
The datasets generated during and/or analyzed during the current study are available in the GEO (https://www.ncbi.nlm.nih.gov/geo/). The GediNET KNIME workflow can be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Emma Qumsiyeh, Email: emma.qumsiyeh@hotmail.com.
Malik Yousef, Email: malik.yousef@gmail.com.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-022-24421-0.
References
- 1.Wang X, Gulbahce N, Yu H. Network-based methods for human disease gene prediction. Brief. Funct. Genom. 2011;10:280–293. doi: 10.1093/bfgp/elr024. [DOI] [PubMed] [Google Scholar]
- 2.Chen B, Shang X, Li M, Wang J, Wu F-X. Identifying individual-cancer-related genes by rebalancing the training samples. IEEE Trans. NanoBiosci. 2016;15:1–1. doi: 10.1109/TNB.2016.2553119. [DOI] [PubMed] [Google Scholar]
- 3.Browne F, Wang H, Zheng H. A computational framework for the prioritization of disease-gene candidates. BMC Genom. 2015 doi: 10.1186/1471-2164-16-S9-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010;26:1057–1063. doi: 10.1093/bioinformatics/btq076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Advances in translational bioinformatics: Computational approaches for the hunting of disease genes | Briefings in bioinformatics | Oxford academic. https://academic.oup.com/bib/article/11/1/96/193936 (Accessed 30 November 2021). [DOI] [PMC free article] [PubMed]
- 6.MiRTarBase 2016: Updates to the experimentally validated MiRNA-target interactions database | nucleic acids research | Oxford academic. https://academic.oup.com/nar/article/44/D1/D239/2503072 (Accessed on 30 November 2021). [DOI] [PMC free article] [PubMed]
- 7.Gene ontology: Tool for the unification of biology | Nature Genetics. https://www.nature.com/articles/ng0500_25/ (Accessed 30 November 2021). [DOI] [PMC free article] [PubMed]
- 8.Clough E, Barrett T. The gene expression omnibus database. Methods Mol. Biol. Clifton NJ. 2016;1418:93–110. doi: 10.1007/978-1-4939-3578-9_5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Tomczak K, Czerwińska P, Wiznerowicz M. The cancer genome atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 2015;19:A68–A77. doi: 10.5114/wo.2014.47136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.From genomics to chemical genomics: New developments in KEGG | nucleic acids research | Oxford Academic. https://academic.oup.com/nar/article/34/suppl_1/D354/1133379 (Accessed 30 November 2021). [DOI] [PMC free article] [PubMed]
- 11.Piñero J, Bravo À, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E, García-García J, Sanz F, Furlong LI. DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017;45:D833–D839. doi: 10.1093/nar/gkw943. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Gillis J, Pavlidis P. “Guilt by Association” is the exception rather than the rule in gene networks. PLOS Comput. Biol. 2012;8:e1002444. doi: 10.1371/journal.pcbi.1002444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Ben-dor, A. Gene-Expression Profiles in Hereditary Breast Cancer. Adv. Anat. Pathol. (2002). [DOI] [PubMed]
- 14.Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature. 2000;406:536–540. doi: 10.1038/35020115. [DOI] [PubMed] [Google Scholar]
- 15.van Driel MA, Brunner HG. Bioinformatics methods for identifying candidate disease genes. Hum. Genom. 2006;2:429–432. doi: 10.1186/1479-7364-2-6-429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Identifying disease genes using machine learning and gene functional similarities, assessed through gene ontology | PLoS ONE. 10.1371/journal.pone.0208626, https://journals.plos.org/plosone/article?id (Accessed 6 October 2022). [DOI] [PMC free article] [PubMed]
- 17.Multi-view based integrative analysis of gene expression data for identifying biomarkers | scientific reports. https://www.nature.com/articles/s41598-019-49967-4 (Accessed 30 November 2021). [DOI] [PMC free article] [PubMed]
- 18.Liekens AM, De Knijf J, Daelemans W, Goethals B, De Rijk P, Del-Favero J. BioGraph: Unsupervised biomedical knowledge discovery via automated hypothesis generation. Genome Biol. 2011;12:R57. doi: 10.1186/gb-2011-12-6-r57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Wang J, Zheng J, Wang Z, Li H, Deng M. Inferring gene-disease association by an integrative analysis of EQTL genome-wide association study and protein-protein interaction data. Hum. Hered. 2018;83:117–129. doi: 10.1159/000489761. [DOI] [PubMed] [Google Scholar]
- 20.He X, Fuller CK, Song Y, Meng Q, Zhang B, Yang X, Li H. Sherlock: Detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. 2013;92:667–680. doi: 10.1016/j.ajhg.2013.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, Plagnol V. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 2014;10:e1004383. doi: 10.1371/journal.pgen.1004383. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 2015;47:569–576. doi: 10.1038/ng.3259. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Peng J, Bai K, Shang X, Wang G, Xue H, Jin S, Cheng L, Wang Y, Chen J. Predicting disease-related genes using integrated biomedical networks. BMC Genom. 2017;18:1043. doi: 10.1186/s12864-016-3263-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Luo P, Tian L-P, Chen B, Xiao Q, Wu F-X. Ensemble disease gene prediction by clinical sample-based networks. BMC Bioinform. 2020;21:79. doi: 10.1186/s12859-020-3346-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Yousef M, Bakir-Gungor B, Jabeer A, Goy G, Qureshi R, Showe LC. Recursive cluster elimination based rank function (SVM-RCE-R) implemented in KNIME. F1000Research. 2020;9:1255. doi: 10.12688/f1000research.26880.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Yousef M, Jabeer A, Bakir-Gungor B. Optimization of Scoring Function for SVM-RCE-R. In: Kotsis G, editor. Database and Expert Systems Applications - DEXA 2021 Workshops. Cham: Communications in Computer and Information Science, Springer International Publishing; 2021. pp. 215–224. [Google Scholar]
- 27.Yousef M, Abdallah L, Allmer J. MaTE: Discovering expressed interactions between MicroRNAs and their targets. Bioinformatics. 2019;35:4020–4028. doi: 10.1093/bioinformatics/btz204. [DOI] [PubMed] [Google Scholar]
- 28.Yousef M, Ülgen E, Uğur Sezerman O. CogNet: Classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis. PeerJ Comput. Sci. 2021;7:e336. doi: 10.7717/peerj-cs.336. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Yousef M, Goy G, Mitra R, Eischen CM, Jabeer A, Bakir-Gungor B. MiRcorrNet: Machine learning-based integration of MiRNA and MRNA expression profiles, combined with feature grouping and ranking. PeerJ. 2021;9:e11458. doi: 10.7717/peerj.11458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Yousef M, Goy G, Bakir-Gungor B. MiRModuleNet: Detecting MiRNA-MRNA regulatory modules. Front. Genet. 2022;13:767455. doi: 10.3389/fgene.2022.767455. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Yousef M., Ozdemir F., Jaaber A., Allmer J., Bakir-Gungor B. PriPath: Identifying dysregulated pathways from differential gene expression via grouping, scoring and modeling with an embedded machine learning approach, In review (2022). [DOI] [PMC free article] [PubMed]
- 32.Yousef, M., Sayici, A., Bakir-Gungor, B. Integrating gene ontology based grouping and ranking into the machine learning algorithm for gene expression data analysis. 1479 10.1007/978-3-030-87101-7_20.
- 33.Yousef M, Ketany M, Manevitz L, Showe LC, Showe MK. Classification and biomarker identification using gene network modules and support vector machines. BMC Bioinform. 2009;10:337. doi: 10.1186/1471-2105-10-337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Yousef M, Jung S, Showe LC, Showe MK. Recursive cluster elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinform. 2007;8:144. doi: 10.1186/1471-2105-8-144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Yousef M, Kumar A, Bakir-Gungor B. Application of biological domain knowledge based feature selection on gene expression data. Entropy Basel Switz. 2020;23:E2. doi: 10.3390/e23010002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Brown MB, Forsythe AB. Robust tests for the equality of variances. J. Am. Stat. Assoc. 1974;69:364–367. doi: 10.1080/01621459.1974.10482955. [DOI] [Google Scholar]
- 37.Nacu Ş, Critchley-Thorne R, Lee P, Holmes S. Gene expression network analysis and applications to immunology. Bioinformatics. 2007;23:850–858. doi: 10.1093/bioinformatics/btm019. [DOI] [PubMed] [Google Scholar]
- 38.Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B. KNIME: The Konstanz Information Miner. In: Preisach C, Burkhardt H, Schmidt-Thieme L, Decker R, editors. Proceedings of the Data Analysis Machine Learning and Applications. Springer; 2008. pp. 319–326. [Google Scholar]
- 39.Xu Q-S, Liang Y-Z. Monte carlo cross validation. Chemom. Intell. Lab. Syst. 2001;56:1–11. doi: 10.1016/S0169-7439(00)00122-2. [DOI] [Google Scholar]
- 40.El-Hadj Imorou S. Socio-economic and health determinants of rural households consent to prepay for their health care in N’Dali (North of Benin) Open J. Soc. Sci. 2020;08:348–360. doi: 10.4236/jss.2020.85024. [DOI] [Google Scholar]
- 41.Hand D, Till R. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach. Learn. 2004;45(171):186. [Google Scholar]
- 42.Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 2012;28:573–580. doi: 10.1093/bioinformatics/btr709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Yousef M, Voskergian D. TextNetTopics: Text classification based word grouping as topics and topics’ scoring. Front. Genet. 2022;13:893378. doi: 10.3389/fgene.2022.893378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.DAVID: Functional annotation tools. https://david.ncifcrf.gov/tools.jsp (Accessed 8 April 2022).
- 45.Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, Koplev S, Jenkins SL, Jagodnik KM, Lachmann A, et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–W97. doi: 10.1093/nar/gkw377. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.GeneMANIA. https://genemania.org/ (Accessed 8 April 2022).
- 47.Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, Miller AR, Digles D, Lopes EN, Ehrhart F, et al. WikiPathways: Connecting communities. Nucleic Acids Res. 2021;49:D613–D621. doi: 10.1093/nar/gkaa1024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Franz M, Lopes CT, Huck G, Dong Y, Sumer O, Bader GD. Cytoscape.Js: A graph theory library for visualisation and analysis. Bioinformatics. 2016;32:309–311. doi: 10.1093/bioinformatics/btv557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, Barabási A-L. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347:1257601. doi: 10.1126/science.1257601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Suratanee A, Plaimas K. DDA: A novel network-based scoring method to identify disease-disease associations. Bioinform. Biol. Insights. 2015;9:BBI.S35237. doi: 10.4137/BBI.S35237. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets generated during and/or analyzed during the current study are available in the GEO (https://www.ncbi.nlm.nih.gov/geo/). The GediNET KNIME workflow can be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS.