Abstract
Machine learning (ML) algorithms are powerful tools for finding complex patterns and biomarker signatures when conventional statistical methods fail to identify them. While the ML field has made significant progress, state-of-the-art methodologies for building efficient and non-overfitting models are not always applied in the literature. To this end, automatic programs, such as BioDiscML, were designed to identify biomarker signatures and correlated features while avoiding overfitting using multiple evaluation strategies, such as cross-validation, bootstrapping and repeated holdout. To further improve BioDiscML and reach a broader audience, better visualization support and flexibility in choosing the best models and signatures are needed. Thus, to provide researchers with an easily accessible and usable tool for in-depth investigation of BioDiscML outputs, we developed a visual interaction tool called BioDiscViz. This tool provides summaries, tables and graphics, in the form of Principal Component Analysis (PCA) plots, UMAP, t-SNE, heatmaps and boxplots for the best model and the correlated features. Furthermore, this tool also provides visual support to extract a consensus signature from BioDiscML models using a combination of filters. BioDiscViz provides visual support for research using ML and opens new opportunities in this field by making it accessible to a broader community.
Introduction
In recent years, new methods of Artificial Intelligence (AI) have been deployed in bioinformatics research to provide pattern classification, biomarker identification and forecast modeling using omics data. Studying biomarker signatures is an important part of the research process as they are correlated to biological functions. Machine learning and feature selection can identify multivariate associations of biomarkers (i.e., features) and detect complex hidden patterns in the data. Because many algorithms exist for feature selection and classification, multiple models are often generated with different signatures, and these signatures frequently overlap inconsistently even when the models perform equivalently [1]. Furthermore, correlated features may not be retained by the models during their optimization when avoiding redundancy of information. Indeed, selecting a “best model” and its signatures is a balance between simplifying the model and retaining all valuable biomarkers. Various approaches, like ensemble learning or the union of overlapping features, tend to find optimized solutions, but at the cost of one side of this balance.
A solution to facilitate the generation of multiple models and signatures has been proposed with an automatic ML tool, BioDiscML [2]. BioDiscML is a new generation ML tool which has been demonstrated to be highly efficient in multiple research topics involving the identification of biomarker signatures from various types of data, such as proteomics [3], transcriptomics [4] and multi-omics (metagenomics/metabolomics, metagenomics/lipidomics) [5, 6]. Furthermore, BioDiscML proposes various conditions for choosing a “best model”, but this is complex to determine as some data are too heterogeneous to propose ideal decision threshold metrics. Unfortunately, this tool does not provide visualization of the signature, hence limiting a rapid view of the results. Thus, to help in these decisions, we propose a visual tool, BioDiscViz, to support the choice of consensus features within a set of trained classifiers with their corresponding signatures.
Design and implementation
BioDiscViz is a visual Shiny application that runs on Windows and Unix operating systems. It supports BioDiscML by presenting an interactive interface and graphs that help researchers understand the results. The application is based on R [7]. It uses the Shiny framework [8] together with the shinydashboard package [9], and requires RStudio [10], an integrated development environment for R.
Input
BioDiscViz takes as input a directory containing BioDiscML output in csv format and their summary results. The best model and the classification or regression results are independently accessible. Furthermore, the tool supports multiple BioDiscML outputs in the same directory and allows rapid switching between them.
Layout
BioDiscViz’s interface is divided into two parts: a sidebar on the left and the main section on the right.
Starting with the sidebar, the first item is the “input directory” button. Clicking this button opens an interface where it’s possible to choose the directory containing the BioDiscML outputs. Below, there is a submit button, which runs BioDiscViz on the selected BioDiscML results and an “example button” which runs the analysis of the example data provided. The main part of the interface is divided into four sections: Short Signature, Long Signature, Attribute Distribution, and Consensus Signatures.
Once the BioDiscML results are submitted to BioDiscViz, additional options appear in the sidebar. First, a scrollable list allows the selection of a specific BioDiscML output to study if there are multiple outputs in the same directory. Then, two sliders adjust the font and label sizes of the figures; the user can change them at any time, which triggers an update of the plots. Finally, a button allows the user to download an HTML report of the results, including all the figures.
On the main part of the application, four sections are accessible through the sidebar. Each section is divided into two to three parts, consisting of a results summary for the models, plots, and a table.
Short Signature: This section represents the results obtained by the best model of BioDiscML.
Long Signature: This section displays the correlated features.
Attribute Distribution: This is an additional feature of the visualizer that allows users to interactively visualize the features most frequently used by the different classifiers that were tested. Classifiers can be filtered with thresholds on metrics such as the Matthews correlation coefficient and its standard deviation. Moreover, the user sets the number of attributes to match the experimental design.
Consensus Signatures: This section provides a representation of the different signatures called by the majority of classifiers based on user-defined parameters.
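The filtering described in the Attribute Distribution and Consensus Signatures sections can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code of BioDiscViz. The function name, the dictionary layout and the example models are hypothetical, and the tie-breaking rule (alphabetical order) is an assumption. The default thresholds are those quoted in the caption of Fig 1B.

```python
from collections import Counter

def consensus_signature(models, min_mcc=0.75, max_std=0.15, top_n=10):
    """Return the top_n (feature, count) pairs among models passing both thresholds."""
    # Keep only the classifiers that pass the MCC and standard deviation thresholds.
    kept = [m for m in models if m["mcc"] >= min_mcc and m["std_mcc"] <= max_std]
    # Count in how many retained signatures each feature appears.
    counts = Counter(f for m in kept for f in set(m["features"]))
    # Sort by decreasing frequency, then by name so that ties are deterministic
    # (this tie-breaking rule is an assumption of the sketch).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical models: names, metrics and features are invented for illustration.
models = [
    {"name": "model_a", "mcc": 0.78, "std_mcc": 0.04, "features": ["g1", "g2", "g3"]},
    {"name": "model_b", "mcc": 0.80, "std_mcc": 0.10, "features": ["g2", "g3", "g4"]},
    {"name": "model_c", "mcc": 0.60, "std_mcc": 0.05, "features": ["g1", "g5"]},  # fails the MCC threshold
    {"name": "model_d", "mcc": 0.77, "std_mcc": 0.20, "features": ["g5", "g6"]},  # fails the STD threshold
    {"name": "model_e", "mcc": 0.76, "std_mcc": 0.12, "features": ["g3", "g4"]},
]
top = consensus_signature(models, top_n=3)
print(top)
```

In this toy example only three models pass both filters, so features that occur solely in rejected models (here g5 and g6) never enter the consensus.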
Representations
The short, long and consensus signatures are each represented by a heatmap, a PCA plot, a t-SNE plot, a UMAP plot and a boxplot (Fig 1A). The heatmap was made using ComplexHeatmap [11], a user-friendly package for better representation of heatmaps. The PCA was built using factoextra [12], and the UMAP, t-SNE (Rtsne) and boxplots were drawn with ggplot2 [13].
Fig 1. Representation of the best signature and attribute distribution sections.
A. Heatmap and PCA produced by BioDiscViz for the best model found by BioDiscML, a Kstar model, on the colon cancer dataset. Tumor tissues are shown in cyan and normal tissues in red. B. Selection of the consensus signatures with BioDiscViz on the colon cancer dataset. The 10 attributes shown are those most frequently called by the classifiers passing the thresholds of a Matthews correlation coefficient ≥ 0.75 and a standard deviation ≤ 0.15.
The attribute distribution is represented in the form of an UpSetR plot [14] (Fig 1B). UpSetR is an R package generating static UpSet plots to visualize the intersections between the different features in the different classifiers.
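The counts behind an UpSet-style plot can be sketched as follows. This Python snippet is an illustration only and does not call UpSetR. It treats each classifier's signature as a set and groups features by the exact combination of classifiers that use them. Which of features or classifiers plays the role of the sets in BioDiscViz is an assumption here, and all names are hypothetical.

```python
from collections import defaultdict

def exclusive_intersections(signatures):
    """Group features by the exact set of classifiers whose signature contains them."""
    # For each feature, record every classifier that uses it.
    membership = defaultdict(set)
    for classifier, features in signatures.items():
        for feature in features:
            membership[feature].add(classifier)
    # Features sharing the same classifier combination form one intersection bar.
    groups = defaultdict(list)
    for feature, classifiers in membership.items():
        groups[frozenset(classifiers)].append(feature)
    return {combo: sorted(feats) for combo, feats in groups.items()}

# Hypothetical signatures, invented for illustration.
signatures = {
    "clf_a": {"g1", "g2", "g3"},
    "clf_b": {"g2", "g3"},
    "clf_c": {"g3", "g4"},
}
bars = exclusive_intersections(signatures)
for combo, feats in sorted(bars.items(), key=lambda kv: -len(kv[0])):
    print(sorted(combo), feats)
```

Each key of the returned dictionary corresponds to one bar of the plot, and the length of its feature list gives the bar height.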
BioDiscViz also gives access to the summary details for the short signature and a table of the data used for the short and long signature. The table in the shiny application is an integration of the datasets. It allows users to search for specific information using a search field. If there is a particular instance of interest, it can be easily found and highlighted within the table (S1A–S1C Fig).
Because non-numerical features cannot be easily integrated into a PCA or a heatmap alongside numerical values, a particularity of BioDiscViz is the transformation of categorical features into numerical ones. In this form, users can annotate them on the side of the heatmaps and integrate the information they contain into the PCA clustering.
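One simple way to turn a categorical feature into a numerical one is integer coding, sketched below in Python for illustration. The article does not state which encoding BioDiscViz applies, so the scheme (one integer per category, assigned in sorted order), the function name and the example values are all assumptions.

```python
def encode_categorical(values):
    """Map each distinct category to an integer code and return (codes, mapping)."""
    # Codes follow the sorted order of the categories so the result is reproducible.
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical categorical feature, invented for illustration.
codes, mapping = encode_categorical(["low", "high", "medium", "low"])
print(codes, mapping)
```

Integer codes impose an arbitrary order on the categories, which matters for distance-based methods such as PCA; a one-hot encoding would avoid this at the cost of extra columns.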
Outputs
BioDiscViz also possesses different functions to facilitate use and export of the results for archiving, sharing and publication. The first one is the creation of a report of the different graphs represented in the application which takes into account the modifications carried out by the user. The second functionality is to be able to download a sub dataset containing the information for the selected features in the “attribute distribution”.
The study of consensus signatures is of great interest because it allows researchers to identify new molecular targets. Although the best model shows which useful features were selected, it does not necessarily use the features providing the most information. We consider that the signatures most frequently called by the classifiers contain important information for the problem at hand. These are the signatures we call consensus signatures. As such, giving BioDiscML’s models these consensus signatures, some of which were left out by the best model, could potentially improve the initial results obtained by the previous best model.
Results
To demonstrate the functionalities of BioDiscViz, we used a colon cancer dataset [15] which was used for the BioDiscML publication and which is available in the BioDiscViz GitLab repository. This dataset contains gene expression in 40 tumor and 22 normal colon tissue samples.
Visualize the best signature
The identification of the best signatures was studied from two perspectives: first the signatures from the best model, then the consensus signatures.
For signatures retrieved from the best model, different plots were generated. In this case, two classes were distinctly separated on the PCA and the heatmap (Fig 1A), showing that they provide enough information to the model to correctly predict tumor tissues and healthy tissues. Then, differential expressions of genes identified in the model were visualized using boxplots (Fig 2). Interestingly, all the signatures showed promising results as there is a clear difference for each gene between the two classes.
Fig 2. Boxplot of the best model obtained by BioDiscML.
Biomarkers used by the best model selected by BioDiscML to classify the healthy and cancerous tissues.
Consensus signature
BioDiscML uses ensemble methods to create an association of signatures, but it does not take advantage of all generated models during its learning stage. Ensemble methods also keep all features of the models’ signatures, without any optimization, thus making the model more complex. We hypothesized that the features frequently called by the different models contain valuable information relevant to the problem at hand. These features, which we call consensus signatures, offer a promising avenue for constructing a new signature and enhancing or streamlining the model. To investigate these signatures further, a dataset of consensus signatures can be generated directly with BioDiscViz. Following numerous tests, we selected the top 10 signatures from the classifiers that surpassed the Matthews correlation coefficient (MCC) threshold of 0.76 and had a standard deviation of MCC (STD MCC) no greater than 0.15 (Fig 1B). The quality of the selection was assessed using the heatmap (Fig 3A) and PCA (Fig 3B), which presented a better separation between the classes than the best model identified by BioDiscML. Compared to the best model signature, these consensus signatures consist of 3 genes overlapping with the best signature and 7 newly added genes. To look further into these new signatures, the boxplot was used to select the genes which were differentially expressed between the healthy and cancerous tissues.
Fig 3. Graphical representation of the 10 best consensus signatures.
A. Heatmap. B. PCA.
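For readers unfamiliar with the MCC used as a threshold above, the following Python sketch computes it from the four cells of a binary confusion matrix using the standard formula. It is an illustration only, not BioDiscML code. The function name and the example counts are hypothetical, and returning 0.0 when a marginal total is zero is a common convention adopted here as an assumption.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed here): return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Hypothetical confusion matrix for 40 tumor and 22 normal samples;
# the counts are invented and are not results from the article.
score = matthews_corrcoef(tp=36, tn=18, fp=4, fn=4)
print(round(score, 3))
```

The MCC ranges from -1 (total disagreement) to 1 (perfect prediction) and accounts for all four cells, which makes it more informative than accuracy on imbalanced classes such as the 40/22 split of this dataset.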
Following the identification of the consensus signature, we ran BioDiscML a second time to find an optimal machine learning classifier with the full signature, without any feature selection. The initial best classifier was a Kstar model with a 6-attribute signature, which had an MCC of 0.776 with a standard deviation (STD) across all evaluation procedures of 0.037, a reasonable performance considering past work on MCC evaluations [16]. Moreover, the model exhibited an accuracy of 0.857, which is comparable to, and in some cases even superior to, the results reported in the existing literature for this particular type of data [17]. With the consensus signature, we obtained a Fuzzy Lattice Reasoning model with an MCC of 0.791 (STD 0.032), which is slightly better than the previous best model (MCC increased by 1.9% and STD decreased by 15.6%).
In conclusion, our tool is able to provide visual support to BioDiscML and new insights outside of the best model by looking into the consensus signatures. Furthermore, these consensus signatures could be used to rerun BioDiscML and may enhance the quality of the model.
Supporting information
Search for a specific instance: “12” in the table (A). The first approach involves scanning the entire table for any values that contain “12” (B). Alternatively, the user can focus on a particular column and instance and search for the one with a value of “12” (C).
(EPS)
Data Availability
BioDiscViz is directly implemented in R and is available under the GNU-GPL 3 license on Gitlab (https://gitlab.com/SBouirdene/biodiscviz.git) and online at https://sophiane-bouirdene.shinyapps.io/BiodiscViz_shinyapp/. The version used for this article can be found under the release 1.0.
Funding Statement
Dr Steve Bilodeau received a grant from the Canadian Institutes of Health Research (Grant Number: 387762) for the broader project encompassing BioDiscViz. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Li YH, Xu JY, Tao L, Li XF, Li S, Zeng X, et al. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLOS ONE. 2016. p. e0155290. doi: 10.1371/journal.pone.0155290
- 2. Leclercq M, Vittrant B, Martin-Magniette ML, Scott Boyer MP, Perin O, Bergeron A, et al. Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data. Front Genet. 2019;10: 452. doi: 10.3389/fgene.2019.00452
- 3. Roux-Dalvai F, Gotti C, Leclercq M, Hélie M-C, Boissinot M, Arrey TN, et al. Fast and Accurate Bacterial Species Identification in Urine Specimens Using LC-MS/MS Mass Spectrometry and Machine Learning. Mol Cell Proteomics. 2019;18: 2492–2505. doi: 10.1074/mcp.TIR119.001559
- 4. Rabaglino MB, Kadarmideen HN. Machine Learning Approach to Integrated Endometrial Transcriptomic Datasets Reveals Biomarkers Predicting Uterine Receptivity in Cattle at Seven Days after Estrous. Sci Rep. 2020;10: 1–10. doi: 10.1038/s41598-020-72988-3
- 5. Khorraminezhad L, Leclercq M, O’Connor S, Julien P, Weisnagel SJ, Gagnon C, et al. Dairy Product Intake Modifies Gut Microbiota Composition among Hyperinsulinemic Individuals. Eur J Nutr. 2020;60: 159–167. doi: 10.1007/s00394-020-02226-z
- 6. Doré E, Joly-Beauparlant C, Morozumi S, Mathieu A, Lévesque T, Allaeys I, et al. The Interaction of Secreted Phospholipase A2-IIA with the Microbiota Alters Its Lipidome and Promotes Inflammation. JCI Insight. 2022;7. doi: 10.1172/jci.insight.152638
- 7. R Core Team (2021) R: A language and environment for statistical computing. R Foundation for Statistical Computing Vienna, Austria. R Foundation for Statistical Computing; 2020. Available from: https://www.R-project.org/.
- 8. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, et al. shiny: Web Application Framework for R. R package version 1.7.1. Available from: https://rstudio.github.io/shiny/index.html.
- 9. Chang W, Borges Ribeiro B. shinydashboard: Create Dashboards with ’Shiny’. R package version 0.7.2. Available from: https://rstudio.github.io/shinydashboard/.
- 10. RStudio Team (2020) RStudio: Integrated Development for R. RStudio, Boston, MA, USA. Available from: https://github.com/rstudio/rstudio.
- 11. Gu Z, Eils R, Schlesner M. Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data. Bioinformatics. 2016;32: 2847–2849. doi: 10.1093/bioinformatics/btw313
- 12. Kassambara A, Mundt F. factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7. Available from: https://cran.r-project.org/web/packages/factoextra/readme/README.html.
- 13. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4. Available from: https://ggplot2.tidyverse.org.
- 14. Conway JR, Lex A, Gehlenborg N. UpSetR: An R Package for the Visualization of Intersecting Sets and Their Properties. Bioinformatics. 2017;33: 2938–2940. doi: 10.1093/bioinformatics/btx364
- 15. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci U S A. 1999;96: 6745–6750. doi: 10.1073/pnas.96.12.6745
- 16. Schober P, Boer C, Schwarte LA. Correlation Coefficients: Appropriate Use and Interpretation. Anesth Analg. 2018;126: 1763–1768. doi: 10.1213/ANE.0000000000002864
- 17. Fahami M, Roshanzamir M, Izadi N, Keyvani V, Alizadehsani R. Detection of Effective Genes in Colon Cancer: A Machine Learning Approach. Inform Med Unlocked. 2021;24: 100605.



