Abstract
Motivation
A common task in scientific research is the comparison of lists or sets of diverse biological entities such as biomolecules, ontologies, sequences and expression profiles. Such comparisons rely, one way or another, on calculating a measure of similarity either by means of vector correlation metrics, set operations such as union and intersection, or specific measures to capture, for example, sequence homology. Subsequently, depending on the data type, the results are often visualized using heatmaps, Venn, Euler, or Alluvial diagrams. While most of the abovementioned representations offer simplicity and interpretability, their effectiveness holds only for a limited number of lists and specific data types. Conversely, network representations provide a more versatile approach where data lists are viewed as interconnected nodes, with edges representing pairwise commonality, correlation, or any other similarity metric. Networks can represent an arbitrary number of lists of any data type, offering a holistic perspective and most importantly, enabling analytics for characterizing and discovering novel insights in terms of centralities, clusters and motifs that can exist in such networks. While several tools that implement the translation of lists to the various commonly used diagrams, such as Venn and Euler, have been developed, a similar tool that can parse, analyze the commonalities and generate networks from an arbitrary number of lists of the same or heterogenous content does not exist.
Results
To address this gap, we introduce List2Net, a web-based tool that can rapidly process and represent lists in a network context, either in a single-layer or multi-layer mode, facilitating network analysis on multi-source/multi-layer data. Specifically, List2Net can seamlessly handle lists encompassing a wide variety of biological data types, such as named entities or ontologies (e.g., lists containing gene symbols), sequences (e.g., protein/peptide sequences), and numeric data types (e.g., omics-based expression or abundance profiles). Once the data is imported, the tool then (i) calculates the commonalities or correlations (edges) between the lists (nodes) of interest, (ii) generates and renders the network for visualization and analysis and (iii) provides a range of exporting options, including vector, raster format visualization but also the calculated edge lists and metrics in tabular format for further analysis in other tools. List2Net is a fast, lightweight, yet informative application that provides network-based holistic insights into the conditions represented by the lists of interest (e.g., disease-to-disease, gene-to-phenotype, drug-to-disease, etc.). As a case study, we demonstrate the utility of this tool applied on publicly available datasets related to Multiple Sclerosis (MS). Using the tool, we showcase the translation of various ontologies characterizing this specific condition on disease-to-disease subnetworks of neurodegenerative, autoimmune and infectious diseases generated from various levels of information such as genetic variation, genes, proteins, metabolites and phenotypic terms.
Keywords: Biological network construction, Network representation of lists commonalities, Network analysis, Multi-layer data representation
Graphical Abstract
1. Introduction
The rapid development of high-throughput sequencing (HTS) technologies, supported by a range of bioinformatics methodologies and applications, has generated a considerable volume of readily available data characterizing the molecular basis of the physiology and pathology of humans and other organisms. As a result, the user can be informed from large databases or view, access and re-analyze a wealth of publicly available experimental data. Nevertheless, the realization of the new vision of Medicine, namely Precision and Personalized Medicine, necessitates the advancement of translational research, where proper tools for extensive and in-depth data interpretation and knowledge extraction are available. In most cases, data generated by HTS or other means are or can be stored or represented as lists, in database structures, spreadsheets or simply text files. These lists may contain symbols and ontologies corresponding, for example, to genomic variants, perturbed pathways, sequences, or expressions/abundances of biomolecules that are found to be significant in the conditions under study. These findings require a holistic and integrative approach to be interpreted and to enable their future translation into clinical applications. Moreover, comparing the lists of novel findings with those already deposited in existing data repositories is imperative for clarifying results, realizing an accurate context, or even validating the insights we have obtained.
Thus far, simple questions concerning the commonalities or correlations between lists or sets of biological information, such as the gene/protein/metabolite/pathway ontologies that are related to several diseases, their corresponding expression or abundance levels across groups of samples, the pathways that are targeted from a number of different drugs and the sequences of genes/proteins across species, cannot be probed holistically through existing tools. For instance, most tools draw Venn diagrams commonly used to highlight the intersection across several submitted lists. On one hand, this approach is limited in the number of sets that can be effectively visualized. On the other hand, the string-based intersection of biological sequences (DNA, RNA and Amino Acid) cannot capture the homology between non-identical sequences. Through a network-based approach, one can represent and analyze an arbitrary number of lists/sets of seemingly any type of content since the underlying commonalities can be determined, among others, by string-based matching, sequence alignment metrics and statistical relationships. More importantly, Network Science offers a range of tools that can be exploited to understand further the systemic effects emerging from the structure and connectivity of the network modules rather than of isolated pairs of entities.
Existing tools such as VENNY 2.1 (https://bioinfogp.cnb.csic.es/tools/venny/index.html), jvenn [1] and Intervene [2] were designed to generate Venn diagrams and UpSet plots, but they are limited in the number of lists one can use as input. Other tools, such as the Multiple List Comparator (https://molbiotools.com/listcompare.php), accept multiple lists and offer a classical Venn diagram and a matrix with pairwise intersections, yet the network format is absent. Moreover, tools such as NetVenn [3] accept multiple lists as input and also offer mapping of the findings on a network (e.g., a protein interaction network), but they do not generate the network of associations between the input lists of data. In addition to these tools, a recent web-based application called Arena3Dweb [4] provides interactive 3D visualization and analysis of networks. However, it needs an already constructed network file to proceed with its visualization options. A comparison of various tools for computing and visualizing intersections for Venn, Euler, Edwards diagrams, UpSet plots, pairwise heatmaps, and networks can be found in Supplementary Table S1. The first row displays the name of the tool or package, while the other rows list their primary features, such as input data type, maximum number of input lists, plot types, output formats, etc. The table sheds light on several aspects of intersection production and highlights the limitations mentioned above. The innovative contribution of List2Net is that it generates association networks from lists of various data types. These networks can be visualized and analyzed either with List2Net or with other visualization tools like Arena3Dweb and Cytoscape [5].
To overcome all previously mentioned limitations, we have developed List2Net as a web-based tool aiming to represent lists of data with various data types linked in a network context based on pairwise commonalities or correlations. In the current era, with the plethora of findings from many levels of analysis ending up with a list format, such a tool is handy due to its potential to compare and associate many lists from many levels and with a diverse range of data types, keeping its use as simple as possible. We demonstrate the use of List2Net in a case study aiming to generate a disease-to-disease subnetwork around Multiple Sclerosis (MS). We have used publicly available data (genes, proteins, variants, metabolites and phenotypic information) that were related to three different categories of diseases (neurodegenerative, autoimmune and infectious diseases) to generate disease-to-disease subnetworks around MS under a range of association queries addressing disease similarity based on those data.
2. Methods
List2Net was designed and developed as a lightweight, user-friendly web-based tool to analyze, transform and visualize lists of data as networks. This type of network visualization assigns the list-names as nodes and their pairwise commonalities/correlations as edges with a weight proportional to the magnitude of the pairwise commonality. List2Net was implemented as an interactive web application using the R package ‘Shiny’ (https://shiny.rstudio.com/). The List2Net main workflow comprises four distinct steps as illustrated in Fig. 1, namely 1) Data import, 2) Edge list creation, 3) Network Visualization and 4) Analysis and Exporting. The main User Interface (UI) of List2Net consists of six sub-interfaces or panels: i) data import and edge filtering sidebar, ii) preview of the parsed data types and content from the uploaded files, iii) provision of multi-layer layout settings in cases of different datatypes or user-defined grouping, iv) calculation of edge list/s tables, v) visualization, analysis and exporting panel of the generated networks and vi) detailed documentation and help on the application usage.
Fig. 1.
A schematic representation of the workflow of List2Net. Step 1: Upload Input List and Mode setup: Initially, the users upload the list/s, choosing whether to use the single- or multi-layer mode of the tool. Step 2: Edge Lists Creation: Depending on the specific input/s, the edge list/s is generated. Users can change this table-format edge list by changing specific metrics and thresholds. The multi-layer network has a sub-step that includes specific multi-layer options (Sub-step – Multi-Layer Network Setup), with several user choices according to this mode. Step 3: Network Visualization: The generated edge list/s is used to create the undirected and weighted network. Moreover, the tool offers users the ability to edit some visualization settings. Step 4: Network Analysis and Exporting: Users can get different network features and metrics along with their distributions. Finally, the user can export/download all the calculated tables (edge list and analytics tables) and the network in both a network file and an image format.
2.1. Data uploading and processing
List2Net supports three different data types as input (Table 1). These include: i) lists of strings (characters) that represent names/ontologies or descriptions, ii) lists of biological sequences (nucleic or amino acids), or iii) numeric vectors (e.g., expression values). More specifically, the input format can be either a single multi-column file in txt, tsv, csv, xlsx/xls with a single data type across all columns, multiple single-column files in txt, tsv, csv, xlsx/xls or fasta of the same data type or multiple multi-column files where the users can upload multi-column files of any data type across files with the same data type within each file. In the single file with multiple columns, the first row is used as the name of the lists. In multiple single-column files, the user can select whether the input files include headers to be used by the tool as list names, or alternatively assign the input file names as list/node names. In the multiple multi-column files, each file represents a different layer and the name of the file is used as the name of the layer.
Table 1.
Supported data types in List2Net with a short description.
| List Type | R data type | Description |
|---|---|---|
| Named Entities/Ontologies | Character | Sets of character strings include names/descriptions of genes, proteins, metabolites, pathways, SNPs, related diseases, variants, HPO terms, etc. |
| Biological Sequences | Biostrings | Sequences of the standard DNA, RNA, or AA single letter code |
| Numeric Sets | Numeric | Vectors of numeric sets that may represent gene/protein expression, metabolite/microbial/viral abundance, drug concentrations, etc., across several samples |
SNPs: Single nucleotide polymorphisms, HPO: Human Phenotype Ontology, AA: Amino acid
From the sidebar of the web application, the users are allowed to select using the top radio buttons whether they are uploading a single or multiple files. Once the user uploads the file(s), the app automatically switches to the ‘List of files’ tab. Under this tab, the top panel shows information about the name, size, type of the file as well as the data type of the uploaded list. The data type is assigned automatically by the tool, but the user has the option to override the detected data type using the dropdown menu ‘Data Type(s)’ in the specific panel. The bottom panel previews in an interactive table the imported dataset. The Search field on the top right of each table allows the user to browse and verify the raw data prior to the edge list calculation.
The web-app is set by default to the single-layer mode. Once data uploading is complete, from single or multiple single-column files, the user can create the edge list and the corresponding network. Alternatively, the multi-layer mode can be activated from the relevant switch at the sidebar or automatically when the user uploads multiple multi-column files as layers.
The upload size limit is 30 MB. For a single multi-column file, the file itself may be up to this limit. When multiple files are uploaded, their total size should not exceed 30 MB. This limit helps to avoid overly large networks and keeps the visualization appealing and informative.
2.2. Edge List/s Creation
The edge lists are generated by clicking the ‘Calculate Edge List’ button under the ‘Edge Lists’ tab, which automatically appears once data-uploading is complete. Once the calculation is finished, the button ‘Calculate Edge List’ is removed and the users can interact with the calculated edge lists shown as interactive table(s). By selecting an entry from the edge list table, details of the common information, alignment and correlations (depending on the data type) among lists are rendered below. The construction of the edge lists and the different panels in named entities, biological sequences and numeric sets cases will be described in the following subsections.
2.2.1. Named entities
In the case where the user uploads lists of named entities (strings that represent names/descriptions), the tool calculates the number of common elements between all pairs of uploaded lists, and four panels are constructed in this tab: ‘Weighted Edge List(s)’, ‘Detailed List for the selected row’, ‘Unique in Object 1’ and ‘Unique in Object 2’. The ‘Weighted Edge List(s)’ panel shows an edge list as a four-column table. Columns ‘Object 1’ and ‘Object 2’ correspond to the two lists connected by the specific edge. The third column, ‘Edge Factor’, gives the number of common elements between the connected lists, and the fourth column, ‘Jaccard Index’, contains the corresponding value of the Jaccard similarity coefficient (Table 2). Note that entities are processed in a case-sensitive manner, so lowercase and uppercase spellings are interpreted as two different entities. The user can browse a specific edge by selecting the corresponding row from the edge list table. Upon selection, the ‘Detailed List for the selected row’ panel renders a table that shows the common elements between the selected lists, and the panels ‘Unique in Object 1’ and ‘Unique in Object 2’ render the elements unique to each list of the selected edge. Users can copy, print and download the tables in CSV, EXCEL and PDF format. Since the tables can be large, the user can use the Search field located at the top right of each table to retrieve information on specific entities.
Table 2.
Description of the edge lists’ weights in different list types.
| List Type | Description of the weight |
|---|---|
| Named Entities/Ontologies | Edge Factor: Number of common elements. Jaccard Index (Jaccard similarity coefficient): The number of common elements divided by the size of the union of two lists |
| Biological Sequences | Weight: Number of sequences with a Homology degree (%) higher than the cutoff. The homology is determined from pairwise alignments and choice of metrics |
| Numeric Sets | Different correlation coefficients: Pearson, Spearman, Kendall |
The tab ‘Edge Settings’ located at the sidebar can be used to modify the default filters applied during the edge list calculation. In the dropdown menu ‘Layer List Set’, the user can select specific thresholds based on the similarity metric for the ‘Edge Factor’ or the ‘Jaccard Index’. The radio buttons control which similarity metric to be applied as edge weight while the slider of the selected metric controls its cutoff value. This filtering can help reduce the network’s density and make visualization more appealing and informative while focusing on stronger connections between nodes. The corresponding table in the ‘Weighted Edge List(s)’ will be automatically updated upon selection.
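The edge weights for named entities described above can be illustrated with a minimal Python sketch. It is not the List2Net implementation (which is written in R); the function name `build_edge_list`, the toy lists and the use of `>=` for the cutoff are assumptions made only for illustration.

```python
from itertools import combinations


def build_edge_list(lists, metric="edge_factor", cutoff=1):
    """Return weighted edges between every pair of named-entity lists.

    `lists` maps a list name to its entries. Matching is case-sensitive,
    so "tp53" and "TP53" count as two different entities.
    """
    sets = {name: set(items) for name, items in lists.items()}
    edges = []
    for a, b in combinations(sorted(sets), 2):
        common = sets[a] & sets[b]
        union = sets[a] | sets[b]
        edge_factor = len(common)
        jaccard = edge_factor / len(union) if union else 0.0
        # Assumption: an edge is kept when the chosen metric is >= cutoff.
        value = edge_factor if metric == "edge_factor" else jaccard
        if value >= cutoff:
            edges.append({
                "Object 1": a,
                "Object 2": b,
                "Edge Factor": edge_factor,
                "Jaccard Index": jaccard,
            })
    return edges


# Toy input, invented for this example.
toy_lists = {
    "A": ["TP53", "BRCA1", "EGFR", "MYC"],
    "B": ["TP53", "EGFR", "KRAS"],
    "C": ["tp53", "PTEN"],
}
toy_edges = build_edge_list(toy_lists, metric="edge_factor", cutoff=1)
```

With the toy input, only the pair A and B shares elements (two of the five entities in their union), so it is the only edge that survives a cutoff of one common element; list C stays unconnected because `tp53` does not match `TP53`.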
2.2.2. Biological sequences
In the case that a user uploads lists containing biological sequences, the tool performs an all-by-all pairwise alignment and three panels are rendered under the ‘Edge Lists’ tab: ‘Weighted Edge List(s)’, ‘Detailed List for the selected row’ and ‘Alignment’. The ‘Weighted Edge List(s)’ table has three columns, where ‘SetA’ and ‘SetB’ correspond to the connected lists, while the third column, ‘Weight’, contains the value of the weight. The latter is equal to the number of sequence alignments between the two lists with a similarity (%) above a predefined threshold (Table 2). More specifically, the function ‘pid’ of the R package Biostrings [6] is used to calculate the percentage of sequence identity measures (‘PID1’, ‘PID2’, ‘PID3’, ‘PID4’) for all pairwise alignments (Table 3). Moreover, the user can retrieve information on specific alignments by selecting an entry from the edge list table. Specifically, ‘Detailed List for the selected row’ shows the similarity results for each of the aligned sequences, while individual alignments are shown on the bottom right (‘Alignment’ panel) after selecting a specific set of sequences (by row). For the first two panels, users can copy, print and download the specific tables in CSV, EXCEL and PDF format.
Table 3.
Similarity metric options available in List2Net.
| PID | Description of the PID types |
|---|---|
| PID1 | 100*(identical positions) / (aligned positions + internal gap positions) |
| PID2 | 100*(identical positions) / (aligned positions) |
| PID3 | 100*(identical positions) / (length shorter sequence) |
| PID4 | 100*(identical positions) / (average length of the two sequences) |
PID: Percentage identity
In the dropdown menu ‘Layer List Set’ of the tab ‘Edge Settings’, the user can select the similarity metric from a set of radio buttons with the options shown in Table 3. The sequence similarity (%) threshold can be adjusted using the slider below. The corresponding table in the ‘Weighted Edge List(s)’ will be automatically updated upon selection.
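To make the four identity definitions in Table 3 concrete, the following Python sketch computes them from one already aligned pair of gapped sequences and then counts the pairs above a cutoff, as the ‘Weight’ column does. It is an illustration only and not the Biostrings code: the function names are hypothetical, and treating gap columns that lie between the first and last residue-to-residue columns as "internal gap positions" is an assumption of this sketch.

```python
def pid_metrics(aligned_a, aligned_b, gap="-"):
    """Compute PID1-PID4 (Table 3) for two gapped, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))
    # Columns where both sequences carry a residue.
    both = [i for i, (x, y) in enumerate(columns) if x != gap and y != gap]
    if not both:
        return {"PID1": 0.0, "PID2": 0.0, "PID3": 0.0, "PID4": 0.0}
    identical = sum(1 for i in both if columns[i][0] == columns[i][1])
    aligned = len(both)
    # Assumption: internal gaps are gap columns between the first and
    # last residue-to-residue columns; terminal gaps are ignored.
    internal_gaps = sum(
        1 for i in range(both[0], both[-1] + 1) if gap in columns[i]
    )
    len_a = len(aligned_a.replace(gap, ""))
    len_b = len(aligned_b.replace(gap, ""))
    return {
        "PID1": 100 * identical / (aligned + internal_gaps),
        "PID2": 100 * identical / aligned,
        "PID3": 100 * identical / min(len_a, len_b),
        "PID4": 100 * identical / ((len_a + len_b) / 2),
    }


def sequence_edge_weight(aligned_pairs, metric="PID2", cutoff=80.0):
    """Count aligned pairs whose identity is strictly above the cutoff."""
    return sum(
        1 for a, b in aligned_pairs if pid_metrics(a, b)[metric] > cutoff
    )
```

For the aligned pair `ACGT-A` / `ACCTTA` there are four identical positions, five aligned positions and one internal gap, so PID1 is about 66.7, PID2 and PID3 are 80, and PID4 is about 72.7. The choice of metric can therefore move a pair above or below the slider threshold.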
2.2.3. Numeric sets
In the case that the lists contain numeric data, the tool calculates all the pairwise correlation coefficients (Pearson, Spearman, or Kendall) before rendering three panels, named ‘Weighted Edge List(s)’, ‘Detailed List for the selected row’ and ‘Scatter Plot’. Similar to the other data types, in the 'Weighted Edge List(s)' panel, the first two columns show the lists connected by a specific edge while the third, fourth and fifth columns contain the Pearson, Spearman and Kendall correlation coefficients, respectively (Table 2). Kendall coefficients are not calculated for large vectors (>300 values) as the corresponding function is computationally heavy.
The user can focus on a specific pairwise comparison by selecting a specific row from the edge list table. Upon selection, the detailed correlation results (correlation coefficients, p-values), will appear under the ‘Detailed List for the selected row’, next to a scatter plot ('Scatter Plot' panel) with the numeric values of the lists of interest. For the first two panels, users can copy, print and download the specific tables in CSV, EXCEL and PDF format.
In the dropdown menu ‘Layer List Set’ of the ‘Edge Settings’ tab, the user can select specific thresholds based on the correlation metric (Pearson, Spearman, Kendall) to filter the generated edge list. The top radio buttons control which correlation metric to be applied as edge weight. Users can switch to create edges by considering only positive, negative, or both correlations (default). Finally, the users can set the upper and lower thresholds to adjust the number of edges. The corresponding table in the ‘Weighted Edge List(s)’ will be automatically updated upon selection.
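The three coefficients and the sign/threshold filter can be sketched in plain Python as follows. This is a didactic approximation rather than the tool's R code: the function names are hypothetical, the Kendall variant shown is tau-b, and applying the lower and upper thresholds to the absolute value of the coefficient is an assumption. Constant vectors (zero variance) are not handled.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    # Spearman is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def kendall_tau_b(x, y):
    """Naive pairwise Kendall tau-b; the double loop is quadratic in n."""
    concordant = discordant = untied_x = untied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            untied_x += dx != 0
            untied_y += dy != 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    return (concordant - discordant) / math.sqrt(untied_x * untied_y)


def correlation_edges(vectors, max_kendall_n=300):
    """One row per pair: (list A, list B, Pearson, Spearman, Kendall)."""
    rows = []
    for a, b in combinations(sorted(vectors), 2):
        x, y = vectors[a], vectors[b]
        # Kendall is skipped for vectors longer than 300 values.
        tau = kendall_tau_b(x, y) if len(x) <= max_kendall_n else None
        rows.append((a, b, pearson(x, y), spearman(x, y), tau))
    return rows


def keep_edge(r, sign="both", lower=0.5, upper=1.0):
    """Assumed filter: sign selection, then thresholds on |r|."""
    if sign == "positive" and r <= 0:
        return False
    if sign == "negative" and r >= 0:
        return False
    return lower <= abs(r) <= upper
```

The quadratic pairwise loop in `kendall_tau_b` gives a sense of why the Kendall coefficient becomes expensive for long vectors.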
2.2.4. Multi - layer edge lists
When the user selects the multi-layer mode and/or uploads multiple multi-column files as layers, the ‘Edge Lists’ tab appears after clicking on the ‘Layer Settings’ table. In multi-layer cases, the tool will generate one edge list per layer by iteratively applying the methods described above for each data type (Table 2). In the ‘Edge Settings’ tab, located at the sidebar, an edge filtering submenu for each layer (uploaded file) will be created for multiple multi-column files. This allows the orthogonal control of edge list filtering settings across the different layers. The number of vertices/nodes shown in the sidebar represents the sum of vertices (lists) from all layers (unique names of lists). It is important to note that the number of edges shown includes the sum of edges from all layers, based on the specific thresholds and filters. However, in the case of inter-layer membership ‘By Name’, the number of edges of the network may differ (more details about this option are described in Section 2.3). This is because the edges between the layers are not included in the edge list, yet, they are present in the network.
2.3. Sub-step – multi-layer network setup
In case the multi-layer mode was activated manually or automatically due to multiple files upload, the ‘Layer Settings’ tab is rendered. This tab contains the user interface for controlling the layer layout and node membership settings and it is an intermediate step before the edge list creation.
By default, the node membership (or layout) is automatically generated. In case of single multi-column files or multiple single-column files, the layer membership is generated randomly and all lists are distributed across three layers. The user can then modify the auto-generated membership by using the group radio buttons located at the right half of the panel and clicking the ‘Commit Layers’ button.
When multiple multi-column files are used, the tool will automatically assign the membership of each list to its originating file without amendment options for the user.
Alternatively, an index file containing the layer membership can be uploaded. The user can upload a list-layer mapping in the form of a two-column txt, tsv, csv, xlsx/xls file format with the first column being the name of the list and the second, the layer name (headers are required).
In addition, under the ‘Layer Settings’ heading, the user can select the ‘Layers relationship’. There are two options: the ‘Hierarchical’ (default) or ‘Non-hierarchical’ layers. These options offer a more detailed representation of the network than a flat structure and allow the extraction of hidden information in flat networks. Hierarchical layers will result in stacked layers and any inter-layer connections are allowed only between adjacent layers in the stack, for instance, layer 2 is scanned for connections only with layers 1 and 3. Non-hierarchical layers allow connection across all layers and visualization of the layers as faces on an n-sided prism (n being the number of layers). For example, three layers will result in a triangular prism, six layers in a hexagonal, etc.
Moreover, the user can also select the ‘Multi-layer membership’, where the same list name can be a member of a single, multiple, or all network layers. Also, the type of information used to construct the inter-layer relationships can be chosen between two options under the heading ‘Inter-layer mode connectivity’: ‘By Name’ or ‘By Content’ (default). Choosing to construct the relationships by name means that nodes/lists in different layers will be connected only if they have the same name, whereas choosing to connect them by content means the tool will estimate inter-layer edges the same way as within the layer. Consequently, the ‘By Content’ option can only be used for multiple layers of the same data type. In the case of ‘Multiple multi-column files as layers’, inter-layer edges are determined only ‘By Name’ and the ‘Layer Membership’ part cannot be changed.
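A small Python sketch can illustrate how ‘By Name’ inter-layer edges interact with the ‘Hierarchical’ and ‘Non-hierarchical’ options. The function name, the tuple layout of the returned edges and the toy layers are hypothetical; the sketch only encodes the two rules stated in the text (same-name nodes are linked, and hierarchical layers connect only to adjacent layers in the stack).

```python
from itertools import combinations


def inter_layer_edges_by_name(layers, hierarchical=True):
    """Link nodes that share a name across layers.

    `layers` is an ordered list of (layer_name, set_of_node_names);
    the order of the list is the order of the layer stack.
    Returns tuples (node, source_layer, node, target_layer).
    """
    edges = []
    for (i, (name_i, nodes_i)), (j, (name_j, nodes_j)) in combinations(
        enumerate(layers), 2
    ):
        # Hierarchical stacks only scan adjacent layers for connections.
        if hierarchical and j - i != 1:
            continue
        for node in sorted(nodes_i & nodes_j):
            edges.append((node, name_i, node, name_j))
    return edges


# Toy layers, invented for this example.
toy_layers = [
    ("genes", {"MS", "AD", "PD"}),
    ("proteins", {"MS", "AD"}),
    ("phenotypes", {"MS", "PD"}),
]
stacked = inter_layer_edges_by_name(toy_layers, hierarchical=True)
prism = inter_layer_edges_by_name(toy_layers, hierarchical=False)
```

In the toy example the hierarchical setting yields three inter-layer edges, while the non-hierarchical setting adds the two links between the first and third layers, for five in total. These edges carry no weight, consistent with the ‘By Name’ behaviour described in the text.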
In multiple multi-column file cases, the number of layers is predetermined and equals the number of uploaded files. In any other cases, the user can define the number of layers using the numeric slider (up to ten layers) and edit the names, order, and colors of the layers from the different inputs in this panel. Once the user makes any of the above changes (name, order), they must click on the ‘Rename/Reorder’ button to apply the changes in the 'Layer Membership' panel. Finally, the ‘Commit Layers’ (under the ‘Layer Settings’ tab) or ‘Calculate Edge List’ (under the ‘Edge Lists’ tab) buttons initiate the edge list/s calculation and network generation. Any changes made on the ‘Layer Membership’ part also require the user to click on the ‘Commit Layers’ button to save the changes on this section and re-calculate the edge list/s.
2.4. Visualization of networks and network analytics
Once the edge list is calculated, the ‘Network’ tab appears at the top tabset of the main UI. Under the ‘Network’ tab, four panels are constructed namely i) ‘Network’, ii) ‘First neighbors of the selected node’, iii) ‘Subnetwork based on the selected node’ (in the case of multi-layer network) and iv) ‘Network Metrics’.
2.4.1. Single-layer and multi-layer networks
The single-layer or multi-layer network visualization is rendered in the first panel, ‘Network’.
2.4.1.1. Single-layer network
In the single-layer modes, a 2D network visualization is provided using the visNetwork [7] package, and a panel named ‘Graph Settings’ is created on the right to allow users to edit the visualization settings. The user can adjust the node and the edge size, choose between the different layout algorithms to visualize a network, and select a node color.
2.4.1.2. Multi-layer network
For multi-layer networks, the user can choose to visualize the network object either using the ‘threejs’ (3JS) library (default) or switch to visNetwork using the ‘3d Visualiser’ switch button in the panel ‘Graph Settings’. The 3JS library renders a highly interactive 3D network, whereas visNetwork gives a better annotated representation. In fact, visNetwork will produce a pseudo-3D network where each node is aligned on the perspective plane of its layer.
For both visualizers, specific sliders control the relative node and edge size, while a dropdown list lets the user select and apply a range of within-layer network layouts. For visNetwork, additional sliders control the X, Y and Z rotations. In scenarios where inter-layer connections are determined ‘By Name’, the user can select the ‘Fixed Layout’ to align the connected nodes across all layers.
2.4.1.3. Network Layouts
The List2Net network visualization module (under the ‘Network’ tab) comes with some of the layout algorithms offered by the igraph [8] package in R. These include the Fruchterman Reingold [9] layout, which uses the force-directed layout algorithm developed by Fruchterman and Reingold to place nodes on the plane; Random, which places the vertices of a plane in a uniformly random manner; Circle, which places vertices on a circle, which are ordered by their vertex IDs; Reingold-Tilford (Tree) [10], a tree-like layout which is suitable for trees or graphs without many cycles; LGL – Large Graph Layout, a force-directed layout that is suitable for use with larger graphs; Graphopt, which optimizes the layout of the vertices using the graphopt algorithm; GEM [11], which uses the GEM force-directed layout algorithm to place vertices on the plane; Star, which sets one vertex in the center of a circle and the remaining vertices equidistantly around the circumference; Davidson-Harel [12], which places the vertices of a graph on the plane in accordance with Davidson and Harel’s simulated annealing algorithm; MDS which uses multidimensional scaling to place vertices on a plane; DrL – Distributed Recursive (Graph) Layout, which is an implementation of the force-directed DrL layout generator.
In the single-layer network, when the ‘Star’ or ‘Tree’ layout is selected, a dropdown list named ‘Select Root’ appears in the ‘Graph Settings’ panel, and the user can select the name of the node to be placed at the center of the star layout or used as the root of the tree layout. These two layouts are not available for multi-layer networks.
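As a geometric illustration of the two simplest layouts, the Python sketch below places nodes on a circle in the order given and builds a star layout around a chosen root. It is not the igraph implementation; the function names and the unit radius are assumptions for illustration.

```python
import math


def circle_layout(nodes, radius=1.0):
    """Place nodes equidistantly on a circle, in the order given."""
    n = len(nodes)
    return {
        node: (
            radius * math.cos(2 * math.pi * k / n),
            radius * math.sin(2 * math.pi * k / n),
        )
        for k, node in enumerate(nodes)
    }


def star_layout(nodes, center, radius=1.0):
    """Put `center` at the origin and the other nodes around it."""
    others = [node for node in nodes if node != center]
    positions = circle_layout(others, radius)
    positions[center] = (0.0, 0.0)
    return positions


# The root selected in a 'Select Root'-style control goes to the centre.
toy_positions = star_layout(["MS", "AD", "PD", "T1D"], center="MS")
```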
2.4.1.4. Nodes and Edges
In single and multi-layer cases visualized by visNetwork, the orange-red-bordered nodes represent the articulation points of the network. In single-layer networks, the node size is proportional to the size of the uploaded list for named entities and biological sequences, or to the mean value of each numeric set for the numeric data type. The width of the edge/connection is proportional to the weight calculated in the edge list tab based on the data type of the uploaded lists.
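Articulation points are nodes whose removal disconnects part of the network. The Python sketch below finds them with the standard depth-first search approach (discovery times and low-link values); it is shown for illustration and is not the routine used by List2Net. The function names and the toy graph are hypothetical, and the recursive search is only suitable for small graphs.

```python
def to_adjacency(edges):
    """Build a symmetric adjacency map from undirected (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def articulation_points(adjacency):
    """Return the set of articulation points of an undirected graph."""
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in adjacency[node]:
            if neighbor not in disc:
                children += 1
                dfs(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # A non-root node is an articulation point if a child
                # subtree cannot reach above it.
                if parent is not None and low[neighbor] >= disc[node]:
                    result.add(node)
            elif neighbor != parent:
                low[node] = min(low[node], disc[neighbor])
        # The DFS root is an articulation point if it has 2+ children.
        if parent is None and children > 1:
            result.add(node)

    for node in adjacency:
        if node not in disc:
            dfs(node, None)
    return result


# Toy graph: a triangle A-B-C with a tail C-D-E.
toy_graph = to_adjacency(
    [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
)
toy_cut_nodes = articulation_points(toy_graph)
```

In the toy graph, removing C or D separates the tail from the triangle, so these two nodes would be the ones drawn with the highlighted border.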
In multi-layer networks, the nodes in different layers are represented by different node colors. The node size in each layer is proportional to the size of the uploaded list for named entities and biological sequences, or to the mean value of each numeric set for the numeric data type. The width of the edge/connection in each layer is proportional to the weight calculated in the edge list tab based on the data type of the uploaded lists. In visNetwork graphs, hovering the mouse pointer over an edge will display its weight value. In cases where inter-layer membership is determined ‘By Name’, the edges between different layers have no value/weight.
It is essential to mention that in the case of numeric sets in both single and multi-layer networks, the edges are colored based on the selection of positive (correlation-blue colored) or negative (anti-correlation-red colored) correlations from the user.
Additionally, the tool gives a variety of options for the creation of optimal custom views. The network is interactive, with zooming and panning functionalities. Users can drag any node and place it anywhere on the plane in visNetwork graphs. In the 3JS networks, the user can hover over the nodes to reveal their names.
2.4.1.5. Network Exporting
Furthermore, the user can download the network as a vector image (PDF, SVG) or raster image (PNG format), as an edge list, and as a network file (tsv format) for further analysis with other tools dedicated to network analytics such as Cytoscape and Arena3Dweb. Based on the network file, the user can export a tsv file with six columns, named ‘SourceNode’, ‘TargetNode’, ‘SourceLayer’, ‘TargetLayer’, ‘Weight’ and ‘EdgeColor’, to upload it to other tools, such as Arena3Dweb, for further visualization. To clarify, in numeric sets, the ‘Weight’ attribute contains the absolute correlation value. The direction of the correlation is denoted by the ‘EdgeColor’ column (positive correlation in blue, #4169E1, and negative correlation in red, #FF0000). The user selects one option and clicks on the download button. The 3JS network can be downloaded only in HTML format.
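The layout of the exported network file for numeric sets can be sketched in Python as follows. The column names and colour codes are taken from the description above; the function name, the input tuple layout and the treatment of a zero correlation as positive are assumptions, and the exact file requirements of downstream tools should be checked against their own documentation.

```python
import csv
import io

COLUMNS = ["SourceNode", "TargetNode", "SourceLayer",
           "TargetLayer", "Weight", "EdgeColor"]
POSITIVE_COLOR = "#4169E1"  # blue: positive correlation
NEGATIVE_COLOR = "#FF0000"  # red: negative correlation


def edges_to_tsv(edges):
    """Serialize correlation edges as tab-separated text.

    `edges` holds tuples (source, target, source_layer, target_layer, r),
    where r is the signed correlation coefficient.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for source, target, source_layer, target_layer, r in edges:
        # The weight stores |r|; the sign is carried by the colour.
        color = POSITIVE_COLOR if r >= 0 else NEGATIVE_COLOR
        writer.writerow(
            [source, target, source_layer, target_layer, abs(r), color]
        )
    return buffer.getvalue()
```

For example, an edge with a correlation of -0.8 is written with a weight of 0.8 and the red colour code, so no information about the direction is lost in the export.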
2.4.1.6. Intra-layer and Inter-layer relationships
In addition, the ‘Network’ tab enables the user to focus on intra-layer and inter-layer relationships. Specifically, either by selecting a specific node from a dropdown list (unique node names in case the multi-layer membership is ‘Yes’) in the ‘Network’ panel or by clicking on a specific node from the network, the first neighbors of the selected node and subnetworks based on the selected node are presented. If multi-layer membership is used, a list's name can be a member in more than one layer. When the user selects a node name from the dropdown list, all nodes with the same name across different layers will be highlighted.
2.4.1.7. First neighbors
Upon selection of a node, its first neighbors are listed by layer under the ‘First neighbors of the selected node’ panel, together with the neighbors that are common to all the layers it belongs to, if any. To the left of the ‘First neighbors of the selected node’ panel, a set of checkboxes named ‘Neighbors per Layer’ allows the user to select any combination of layers for which the common neighbors are presented. In single-layer mode, the names of the first neighbors of the selected node are provided in a table format.
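The per-layer and common neighbor listing described above amounts to a set intersection. The sketch below shows one way to express it in Python; the function names, the dictionary-based input and the placeholder node names are hypothetical and are not taken from the tool. A layer is considered here only if the selected node has at least one edge in it.

```python
def first_neighbors(layer_edges, node):
    """Return {layer: set of first neighbors of `node`} for layers where it has edges."""
    result = {}
    for layer, edges in layer_edges.items():
        neighbors = set()
        for a, b in edges:
            if a == node:
                neighbors.add(b)
            elif b == node:
                neighbors.add(a)
        if neighbors:
            result[layer] = neighbors
    return result


def common_neighbors(per_layer, layers=None):
    """Intersect the neighbor sets of the chosen layers (all layers by default)."""
    chosen = [per_layer[name] for name in (layers or per_layer)]
    return set.intersection(*chosen) if chosen else set()


# Toy three-layer network with placeholder node names.
layer_edges = {
    "layer1": [("X", "A"), ("X", "B"), ("A", "B")],
    "layer2": [("X", "A"), ("X", "B"), ("X", "C")],
    "layer3": [("X", "A")],
}
per_layer = first_neighbors(layer_edges, "X")
shared_all = common_neighbors(per_layer)
shared_two = common_neighbors(per_layer, ["layer1", "layer2"])
```

Passing a subset of layer names mirrors the checkbox selection: only the chosen layers take part in the intersection.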
2.4.1.8. Subnetworks
Furthermore, the tool creates subnetworks in the panel 'Subnetwork based on the selected node', which include only the first neighbors of the selected node from all the layers it belongs to, if any. The weight of each edge is calculated as the sum of the weights of the common edges present in each layer for the selected node, and can be normalized to the range 0–1 by dividing by the maximum value. In the case of numeric sets, the normalization option is disabled, since the weight is instead the mean of the absolute values of the selected correlation coefficient over the common edges. Also, the user can select any combination of the layers to include in the subnetwork. Users can download this subnetwork in edge list (tsv) format. The subnetwork panel is not presented in the single-layer mode.
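The weighting rule for the subnetwork can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `subnetwork_weights`, its arguments and the toy weights are hypothetical, and the input is assumed to hold, for each selected layer, the weight of the edge between the selected node and each of its neighbors.

```python
def subnetwork_weights(layer_weights, normalize=False, numeric=False):
    """Combine per-layer edge weights between the selected node and each neighbor.

    layer_weights: {layer: {neighbor: weight}} for the selected layers.
    numeric=False: weights are summed, optionally divided by the maximum sum.
    numeric=True: the mean of the absolute values is returned (no normalization).
    """
    collected = {}
    for weights in layer_weights.values():
        for neighbor, weight in weights.items():
            collected.setdefault(neighbor, []).append(weight)
    if numeric:
        return {n: sum(abs(w) for w in ws) / len(ws) for n, ws in collected.items()}
    combined = {n: sum(ws) for n, ws in collected.items()}
    if normalize and combined:
        top = max(combined.values())
        combined = {n: w / top for n, w in combined.items()}
    return combined


# Toy commonality counts per layer (placeholder names and values).
counts = {"layer1": {"A": 3, "B": 1}, "layer2": {"A": 5}, "layer3": {"B": 1, "C": 2}}
summed = subnetwork_weights(counts)
scaled = subnetwork_weights(counts, normalize=True)

# Toy correlation coefficients per layer for the numeric case.
correlations = {"layer1": {"A": 0.8, "B": -0.4}, "layer2": {"A": -0.6}}
averaged = subnetwork_weights(correlations, numeric=True)
```

Restricting `layer_weights` to a subset of layers corresponds to the user selecting a combination of layers for the subnetwork.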
2.4.1.9. Network analysis
The fourth panel, named ‘Network Metrics’, includes three tabs, namely (i) ‘Network Features’, (ii) ‘Analytics’ and (iii) ‘Distributions’.
Under the first tab, ‘Network Features’, the tool reports the following, either by layer or for the network as a whole: (a) whether the network is fully connected or, in the case of a non-connected network, the number of disconnected components, (b) the number of Nodes/Vertices, (c) the number of Edges, which is the number of connections in the network, (d) the node/list names marked as articulation points (nodes whose removal increases the number of components) and (e) the names of the unconnected nodes, which represent lists with unique content.
The second tab, ‘Analytics’, is used for automated topological analysis, utilizing the analytics functions from the R igraph package. The metrics are reported in a tabular format and include: (a) the degree, which is the total number of edges/connections of a node; (b) the Betweenness centrality, which is the number of times a node participates in the shortest path connecting other nodes; (c) the Eigenvector centrality, which is a metric of the influence of a node in a network; (d) the Clustering Coefficient, which is a metric that measures the tendency of a node to form a group with the neighboring nodes and (e) the Strength, which is the summation of the connectivity weights of the edges attached to each node. The user can print the resulting table or download it in CSV, Excel or PDF format. The table also features a Search field for browsing the metrics of a specific node, while the radio buttons offer layer-level filtering of the table.
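The features listed above can be illustrated with a small, dependency-free Python sketch. It is not the tool's implementation: the function names and the toy network are hypothetical, and articulation points are found here by brute force (remove each node in turn and recount the components), which is only practical for small networks.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def articulation_points(nodes, edges):
    """Nodes whose removal increases the number of components (brute force)."""
    baseline = len(connected_components(nodes, edges))
    points = []
    for node in nodes:
        remaining = [n for n in nodes if n != node]
        kept = [(a, b) for a, b in edges if node not in (a, b)]
        if len(connected_components(remaining, kept)) > baseline:
            points.append(node)
    return points


def unconnected_nodes(nodes, edges):
    """Nodes without any edge, i.e. lists that share nothing with the others."""
    touched = {n for edge in edges for n in edge}
    return [n for n in nodes if n not in touched]


# Toy network: a triangle A-B-C, a pendant node D attached to C, and an isolated node E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
components = connected_components(nodes, edges)
cut_nodes = articulation_points(nodes, edges)
isolated = unconnected_nodes(nodes, edges)
```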
The third tab, ‘Distributions’, shows visualizations of the distributions of the aforementioned topological metrics as histogram/density plots. In addition, the last histogram shows the edge weight distribution of the edge list created in previous steps. In the case of the multi-layer network, each layer is assigned a different color and all layers appear in histograms.
In general, analytics or centrality metrics can estimate the importance a node or edge has regarding the connectivity or the information flow of the network. There are several centrality metrics available that can help us understand the biological significance of the entities in a network. For instance, degree centrality, one of the most widely used network metrics, describes the total number of edges that are connected to a node. High-degree nodes (hubs) are topologically and often functionally significant. Deleting a hub node (e.g., a gene), for example, may be more detrimental than deleting a non-hub node [13]. When dealing with biological networks, hubs are often involved in regulatory and transcriptional processes. For example, in signaling networks, a protein with a very high degree centrality is suggested to have a central regulatory role. Transcription factors [14], docking proteins [15], kinases [16], etc., are some types of proteins that are key players in biological networks. The weighted analog of node degree, known as strength, sums the connectivity weights of the edges attached to each node. Both degree and strength are useful summary measures of the connectivity of each node in a network.
In addition, betweenness centrality is another widely used centrality metric, which describes how many times a node lies on the shortest path connecting other nodes [17]. In other words, it identifies the nodes that act as bridges along the shortest paths between pairs of nodes. High betweenness centrality nodes are also known as “bottlenecks”. They can have a significant influence on a network, since their deletion can often result in a complete loss of connectivity and affect the topology of the network, which in turn halts the propagation of information. In biological networks, deletion of a high betweenness centrality protein can result in the deactivation of specific molecular processes [18], [19], [20].
Moreover, eigenvector centrality is a measure of the influence of a node in a network. In other words, a node is considered important if it is connected to other important nodes. In biological networks, eigenvector centrality captures the influence a gene can have based on the influence of the genes it is connected to [17].
Lastly, the clustering coefficient is another metric used in network analysis. It measures how much specific nodes in a network tend to cluster together. A node with a high clustering coefficient can be regarded as important, since its neighbors are themselves well connected to each other and together form a well-connected cluster within the network [17]. For instance, in biological networks, high clustering coefficient nodes can be used as potential biomarkers [21].
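The difference between degree and strength can be seen in a few lines of Python. The sketch below is illustrative only; the function name and the toy weighted edge list are hypothetical.

```python
def degree_and_strength(weighted_edges):
    """Return (degree, strength) dictionaries for an undirected weighted edge list."""
    degree, strength = {}, {}
    for a, b, weight in weighted_edges:
        for node in (a, b):
            degree[node] = degree.get(node, 0) + 1            # count of attached edges
            strength[node] = strength.get(node, 0) + weight   # sum of attached weights
    return degree, strength


# Toy weighted edges (placeholder names; weights stand for numbers of shared elements).
weighted_edges = [("A", "B", 3), ("A", "C", 1), ("B", "C", 5), ("A", "D", 2)]
degree, strength = degree_and_strength(weighted_edges)
```

In this toy network node A has the highest degree (3) while node B has the highest strength (8), which shows that the two measures can rank nodes differently.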
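For readers who want to see how betweenness is obtained, the following Python sketch implements Brandes' algorithm for an unweighted, undirected graph. It is a simplified illustration and not the igraph routine used by the tool: edge weights are ignored here, shortest paths that tie are counted fractionally, and the function name and toy path graph are hypothetical.

```python
from collections import deque


def betweenness_centrality(nodes, edges):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    centrality = {n: 0.0 for n in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {n: [] for n in nodes}
        path_count = {n: 0 for n in nodes}
        distance = {n: -1 for n in nodes}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to the source.
        dependency = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {n: value / 2 for n, value in centrality.items()}


# Toy path graph A - B - C - D: the inner nodes act as bridges.
path_nodes = ["A", "B", "C", "D"]
path_edges = [("A", "B"), ("B", "C"), ("C", "D")]
betweenness = betweenness_centrality(path_nodes, path_edges)
```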
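The idea that a node inherits importance from its neighbors can be sketched with a simple power iteration. The Python example below is an assumption-laden illustration rather than the tool's method: it ignores edge weights, expects a connected graph, iterates with the adjacency matrix plus the identity (which leaves the leading eigenvector unchanged but avoids oscillation on bipartite graphs), and scales the result so that the largest score is 1. The function name and the toy star graph are hypothetical.

```python
def eigenvector_centrality(nodes, edges, iterations=1000, tolerance=1e-10):
    """Eigenvector centrality by power iteration on (A + I), scaled to a maximum of 1."""
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node keeps its own score and adds the scores of its neighbors.
        updated = {n: scores[n] + sum(scores[m] for m in adjacency[n]) for n in nodes}
        top = max(updated.values())
        updated = {n: value / top for n, value in updated.items()}
        change = max(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tolerance:
            break
    return scores


# Toy star graph: hub H connected to three leaves.
star_nodes = ["H", "L1", "L2", "L3"]
star_edges = [("H", "L1"), ("H", "L2"), ("H", "L3")]
eigen_scores = eigenvector_centrality(star_nodes, star_edges)
```

For this star graph the hub receives a score of 1 and each leaf a score of 1/√3, reflecting that the leaves are only as influential as their single connection to the hub allows.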
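As an illustration, the local clustering coefficient of a node with k neighbors is the number of edges among those neighbors divided by k(k-1)/2. The Python sketch below computes it for a toy graph; the function name is hypothetical, and returning 0.0 for nodes with fewer than two neighbors is a choice made here (other implementations may report such nodes as undefined).

```python
from itertools import combinations


def local_clustering(nodes, edges):
    """Local clustering coefficient per node for an unweighted, undirected graph."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    coefficients = {}
    for node in nodes:
        neighbors = adjacency[node]
        k = len(neighbors)
        if k < 2:
            coefficients[node] = 0.0  # assumption: undefined cases are reported as 0
            continue
        # Count the pairs of neighbors that are themselves connected.
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
        coefficients[node] = 2 * links / (k * (k - 1))
    return coefficients


# Toy graph: triangle A-B-C plus a pendant node D attached to C.
toy_nodes = ["A", "B", "C", "D"]
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
clustering = local_clustering(toy_nodes, toy_edges)
```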
2.5. Examples
List2Net offers a number of examples that can be loaded using the ‘Load Examples’ menu located in the sidebar. The menu items include the following single-layer scenarios:
(i) ‘Neurodegenerative Disease Network based on their common genes’, which loads lists of genes (entities data type) involved in specific neurodegenerative diseases from the DisGeNET database to detect their commonalities. The column name represents the name of the neurodegenerative disease and each row consists of the gene's name in a text format.
(ii) ‘Gene co-expression across samples’ for numeric expression vectors (numeric). This is a synthetic dataset, loaded as a matrix where each column represents a list/node, and aims to demonstrate the co- and anti-occurrence/expression functionalities of the tool.
(iii) ‘Sequences of Antimicrobial Peptides per Species’ for a set of peptides (amino-acid sequences). Each column represents a species and each row represents a peptide sequence. The example aims to demonstrate the biological sequence analysis functions of the tool.
(iv) ‘The organelle proteome network’ loads a collection of proteins (entities) from The Human Protein Atlas, involved in different organelles' (lists) functions. Each column has the name of the organelle and each row has the name of the protein, and the network is created based on the commonality of the proteins involved across the organelles.
Any of the above examples can be processed to generate single-layer networks or, by activating the multi-layer switch in the sidebar, handled as multi-layer cases. In addition, the examples menu includes two natively multi-layer scenarios:
(v) ‘Multi-layer Network – Simple Form’ is constructed by numeric vectors, and uses the Gene co-expression example to represent the functionality of the multi-layer option for List2Net in a simple form. Upon selection of this example, the tool automatically switches to the multi-layer mode, creates three layers (automated layers) by default and assigns the membership between the lists and the layers. The user is free to edit/modify the membership and explore other properties of the layer layout, e.g., Non-hierarchical layer relationship.
(vi) ‘Multi-Layer Network – Multiple Sclerosis’, which loads multiple files with multiple columns containing entities. The latter is in fact the dataset used for the Multiple Sclerosis case study, which is presented in this manuscript in detail under Section 3.
From the examples above, it is evident that there are no restrictions on the comparisons of the different lists. Overall, users can use List2Net to answer their biological questions and create new hypotheses. The content of the input lists is up to the users. The tool can be used to perform simple tasks, for instance, to define the commonalities between different lists and see these relationships in a network. Additionally, it can be used to answer questions from the multi-layer perspective of the network, where users can define unknown connections before processing the lists into networks.
3. Application of List2Net in a case study with a focus on Multiple Sclerosis
3.1. Multiple Sclerosis
To demonstrate the functionality and usefulness of this application, List2Net-supported analyses were conducted using Multiple Sclerosis (MS) as a case study. MS is a chronic inflammatory, demyelinating and degenerative disorder of the central nervous system (CNS) affecting both the brain and the spinal cord. The name of the disease denotes the formation of inflammatory lesions leading to demyelinating plaques in the CNS and, therefore, to the damage of the protective shield surrounding neurons, called the myelin sheath. MS is one of the most common neurological diseases of young adults. The classification system of MS includes four major phenotypes: a) the relapsing-remitting MS (RRMS), b) the secondary-progressive MS (SPMS), c) the primary progressive MS (PPMS) and d) the progressive relapsing MS (PRMS) [22], [23], [24].
We utilized List2Net to generate a synthetic map connecting MS with other diseases through the prism of neurodegenerative, autoimmune and infectious maladies, under various relationship questions regarding disease similarity based on gene, variant, protein, metabolite and phenotypic commonalities. The procedure included three main steps: i) retrieving the data from different databases, ii) preparing the files in List2Net format and iii) extracting and interpreting the results.
3.2. Data extraction
3.2.1. Disease Ontology data
To accomplish the first step, we used Disease Ontology (DO), a standardized human disease ontology, to extract all the names and available codes, such as UMLS, OMIM, ORDO, GARD, ICD-9, ICD-10 and MeSH, for the three categories of diseases (neurodegenerative, autoimmune and infectious) [25]. The aim of DO is to provide a consistent, reusable, and sustainable description of human disease terminology, phenotypic characteristics, and related disease medical vocabulary concepts to the biomedical community [25]. To achieve this, we downloaded the ‘HumanDO.obo’ Disease Ontology file (an Open Biological and Biomedical Ontologies formatted file that comprises the DO’s representation) from the database and, using an in-house developed R script, extracted all the available codes for the diseases belonging to the last two levels of the hierarchy in the disease ontology tree for the three disease categories above, which resulted in a total of 939 diseases.
3.2.2. DisGeNET data
In addition, we retrieved publicly available data from various levels. More specifically, we collected curated genes and variants through DisGeNET, a platform that contains one of the largest publicly available collections of genes and variants associated with human diseases [26]. To download the data from DisGeNET, we gave all the available codes for each disease retrieved from DO as input to the query. Thresholds of gene-disease association (gda) score ≥ 0.5 and variant-disease association (vda) score ≥ 0.7 were used to keep the most important genes and variants linked to the categories of diseases mentioned previously. As a result, we retrieved 274 disease association files with the corresponding genes and 299 disease association files with the corresponding variants.
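The score filtering step can be sketched in Python as follows. Only the two threshold values come from the text; the record layout, the field names `gene` and `score`, the function name and the placeholder entries are hypothetical, and the authors' own processing was done with R scripts that are not shown in the paper.

```python
GDA_THRESHOLD = 0.5  # gene-disease association score cut-off (from the text)
VDA_THRESHOLD = 0.7  # variant-disease association score cut-off (from the text)


def filter_associations(records, score_key, threshold):
    """Keep the association records whose score is at or above the threshold."""
    return [record for record in records if float(record[score_key]) >= threshold]


# Placeholder gene-disease records (not real DisGeNET entries).
gene_records = [
    {"gene": "GENE1", "score": "0.72"},
    {"gene": "GENE2", "score": "0.31"},
    {"gene": "GENE3", "score": "0.5"},
]
kept_genes = [r["gene"] for r in filter_associations(gene_records, "score", GDA_THRESHOLD)]
```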
3.2.3. UniProt data
Moreover, we downloaded all the human protein names and the specific section of ‘pathology and biotech’ from UniProt, a database of protein sequences and functional information across different species [27]. We used an in-house developed R script to parse the OMIM code and name of the disease used for the involvement of proteins in diseases, as well as the unique protein identifier. Thus, 6202 entries were collected.
3.2.4. Human Metabolome Database data
We fetched the disease-related metabolites from the Human Metabolome Database (HMDB), which contains detailed information about small molecule metabolites in the human body [28]. Specifically, we parsed xml files for metabolites that were found in CSF (cerebrospinal fluid), feces, saliva, serum, sweat and urine, and with an in-house developed R script, we extracted the metabolite’s accession number and name, and also the name of the diseases and the associated OMIM codes that were in ‘associated disorders and diseases’ section. The output file included 47781 entries.
3.2.5. Human Phenotype Ontology data
Finally, we retrieved phenotypic information from Human Phenotype Ontology (HPO), a standardized vocabulary of phenotypic abnormalities associated with many diseases [29], where we downloaded the ‘phenotype.hpoa’ file (a file containing the HPO annotations for rare diseases) and got the associated OMIM and ORPHA codes, the name of the diseases and the associated HPO codes and names for each disease in the database. Thus, the final entries were 228852.
3.3. Preparation of files in List2Net format
Following the data retrieval, we prepared all the aforementioned files in List2Net format for input to the tool. To do this, we mapped the diseases from the different databases either by OMIM code, ORPHA code, or by name or synonym, using as a reference name the one provided by DO, either by an in-house developed R script or manually. For the data from DisGeNET, the ‘gene symbol’ for the gene part and the ‘variant id’ for the variant part were kept. This was done for each downloaded .txt file that corresponds to a disease. We grouped disease types within the same disease family by merging the disease names into the parent-disease name from DO. After this merging, we ended up with 59 files (corresponding to 59 diseases) for genes and 85 files (corresponding to 85 diseases) for variants. Subsequently, we constructed two multiple-column files, one for the genes and one for the variants. For the rest of the lists, we worked on each level separately for the selected disease categories. For data from UniProt, HMDB and HPO, we created the respective files, in which the first row corresponds to the name of the disease the information refers to. This information includes the proteins involved in a specific disease, the metabolites and the phenotypic information for the disease.
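The preparation steps described above (merging disease subtypes under the parent DO name and writing one column per disease) could look roughly like the Python sketch below. It is an illustration under several assumptions: the function names, the parent mapping, the tab delimiter and the placeholder disease and gene names are all hypothetical, and the actual processing was performed with in-house R scripts.

```python
import csv
import io
from itertools import zip_longest


def merge_by_parent(disease_to_items, child_to_parent):
    """Pool the items of disease subtypes under their parent disease name."""
    merged = {}
    for disease, items in disease_to_items.items():
        parent = child_to_parent.get(disease, disease)
        merged.setdefault(parent, set()).update(items)
    return merged


def write_multicolumn(lists, handle):
    """Write one column per list: the first row holds the names, the rest the items."""
    names = sorted(lists)
    columns = [sorted(lists[name]) for name in names]
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    # Shorter columns are padded with empty cells.
    for row in zip_longest(*columns, fillvalue=""):
        writer.writerow(row)


# Placeholder input: two subtypes of 'disease A' and one unrelated disease.
per_disease = {
    "disease A type 1": {"G1", "G2"},
    "disease A type 2": {"G2", "G3"},
    "disease B": {"G4"},
}
parents = {"disease A type 1": "disease A", "disease A type 2": "disease A"}
merged_lists = merge_by_parent(per_disease, parents)
out = io.StringIO()
write_multicolumn(merged_lists, out)
file_text = out.getvalue()
```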
3.4. Case study results
3.4.1. Construction of the merged network
To create and visualize the multi-layer network based on our input lists (different files: genes, variants, proteins, metabolites, hpo), we selected the option ‘Multiple multi-column files as layers’ from the sidebar. In addition, from the ‘Layer Settings’ tab, we selected the hierarchical option. We used the default options for multiple multi-column files as layers, ‘Multi-layer membership’ (activation), and for the inter-layer membership (node connectivity), the option ‘By Name’. Fig. 2 illustrates the multi-layer network. The resulting multi-layer network comprises 5 layers, with the gene layer at the bottom, followed by the phenotypic information-hpo, metabolite, protein and variant layers. Each subnetwork is created from all the pairwise comparisons. These pairwise comparisons are generated from the common entities from the different edge lists created in the ‘Edge Lists’ tab for each list and layer separately. The first layer, starting from the bottom of the multi-layer network, is the gene layer and represents the common genes between the different diseases; the second layer illustrates the common phenotypic information between the different diseases; the third layer shows the common metabolites that are involved in the diseases; the fourth represents the common proteins that are involved in diseases and the final layer denotes the common variants associated with the diseases. Each node from each subnetwork represents the different diseases and an edge is created if they share common information. Each edge between the different diseases has a weight based on the number of their common elements (genes, phenotypic information, metabolites, proteins and variants). The edges between the 5 different layers denote that the two nodes-diseases are the same, meaning they have the same name (node connectivity).
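The edge construction within a layer, where two diseases are linked if their lists share elements and the weight is the number of shared elements, can be sketched in Python as follows. The function name and the placeholder lists are hypothetical; the sketch covers only the entity (set intersection) case.

```python
from itertools import combinations


def pairwise_common(lists):
    """Return (list_a, list_b, number_of_shared_items) for every pair that shares items."""
    edges = []
    for a, b in combinations(sorted(lists), 2):
        shared = set(lists[a]) & set(lists[b])
        if shared:
            edges.append((a, b, len(shared)))
    return edges


# Placeholder gene lists for one layer (not real disease data).
gene_layer = {
    "disease A": ["G1", "G2", "G3"],
    "disease B": ["G2", "G3", "G4"],
    "disease C": ["G5"],
    "disease D": ["G3", "G5"],
}
gene_edges = pairwise_common(gene_layer)
```

Running the same function once per layer (genes, variants, proteins, metabolites, phenotypes) yields one edge list per layer, and nodes with the same name can then be linked across layers.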
Fig. 2.
The Multi-Layer network. Using the different levels of information of the publicly available data on neurodegenerative, autoimmune and infectious diseases, we have created a multi-layer network. This network is comprised of five layers, where each subnetwork (layer) is created from all the pairwise comparisons. The first layer, starting from the bottom of the multi-layer network, is the gene layer and represents the common genes between the different diseases; the second layer illustrates the common phenotypic information between the different diseases; the third layer shows the common metabolites that are involved in the diseases; the fourth layer represents the common proteins that are involved in the diseases and the final layer denotes the common variants associated with the diseases. Each node from each subnetwork represents the different diseases and an edge is present if they share common information. The nodes in different layers are represented by different node colors and the orange-red-bordered nodes in different layers represent the articulation points of the whole multi-layer network. The edges between the five different layers denote that the two node-diseases have the same name.
3.4.2. Insights from each layer separately
Note that each layer can connect different diseases, which can provide layer-specific insight into the particular disease of interest (in our case, MS). Fig. 3 illustrates three different disease-disease networks. The ‘molecular disease-disease commonality’ network is the subnetwork comprising the disease nodes from the 4 layers (the gene, the protein, the metabolite and the variant layer) that are linked with the disease of interest (MS) based on their common information. This subnetwork was generated in Cytoscape, using the diseases as nodes and the commonalities with MS, as exported from List2Net, as edges. The different colors of the edges represent the different layers.
Fig. 3.
Disease-disease networks. Molecular disease-disease commonality network: This network shows the connections between Multiple Sclerosis and other diseases (neurodegenerative, autoimmune and infectious diseases) from the four layers (the gene, the protein, the metabolite and the variant layer) that are linked based on their common information. Each node represents a different disease and an edge is present if they have common genes, proteins, metabolites or variants. The different colors of the edges represent the different layers and the number on the edges denotes the number of common elements. Phenotypic disease-disease commonality network: This network illustrates the connections between Multiple Sclerosis and other diseases (neurodegenerative, autoimmune and infectious diseases) that share common symptoms. Each node represents a different disease and an edge is present if they share common symptoms. The number on the edges represents the number of common phenotypic-hpo terms. To have a better view of the similarities between Multiple Sclerosis and other diseases and depict unique and shared disease nodes, we constructed a third network. We set as ‘core nodes’ the five layers (MS Genes, MS Metabolites, MS Phenotypes, MS Proteins and MS Variants) and connected them to the rest of the diseases from the above two networks (Molecular disease-disease commonality network and Phenotypic disease-disease commonality network). The node size denotes the degree and the color of a disease node shows the layer or layers that it belongs to.
‘Rheumatoid arthritis’, an autoimmune chronic inflammatory arthritis and ‘ulcerative colitis’, an autoimmune disease of the gastrointestinal tract and one of the predominant subtypes of Inflammatory bowel disease, were shown to share common elements with MS in three out of four layers. ‘Rheumatoid arthritis’ was shown to share the largest number of components with a total edge weight equal to 143. This number represents the sum of the weight (edge factor) of the common edges presented in three layers (gene, protein and variant layer). Furthermore, ‘type 1 diabetes mellitus’, an autoimmune disease of the endocrine system, was shown to have the second highest weight with the involvement of two layers (gene and variant layer). Other diseases with a two-layer involvement include ‘Parkinson’s disease’ (metabolite and variant layer), ‘lupus erythematosus’ (gene and variant layer), ‘celiac disease’ (metabolite and variant layer) and ‘amyotrophic lateral sclerosis’ (ALS) (metabolite and variant layer). Other diseases like ‘Alzheimer’s disease’ (metabolite layer), ‘psoriasis’ (variant layer) and ‘primary biliary cholangitis’ (variant layer) have a single-layer commonality association with MS. Additionally, the majority of the diseases shown in this subnetwork are autoimmune diseases (21), followed by neurodegenerative diseases (8) and infectious diseases (5).
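The summed weights quoted above can be reproduced directly from the molecular-layer values in Table 4. The short Python sketch below does this for the four diseases with the highest variant weights; the variable names are arbitrary and a dash in the table is treated as zero.

```python
# Molecular-layer values copied from Table 4 (a dash in the table is treated as 0).
molecular_counts = {
    "rheumatoid arthritis": {"gene": 3, "protein": 1, "variant": 139},
    "type 1 diabetes mellitus": {"gene": 3, "variant": 77},
    "ulcerative colitis": {"gene": 1, "metabolite": 2, "variant": 68},
    "lupus erythematosus": {"gene": 1, "variant": 46},
}

# Total edge weight with MS = sum over the layers in which the disease appears.
totals = {disease: sum(counts.values()) for disease, counts in molecular_counts.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
```

For ‘rheumatoid arthritis’ this gives 3 + 1 + 139 = 143, and ‘type 1 diabetes mellitus’ follows with 3 + 77 = 80, consistent with the ordering reported in the text.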
More specifically, regarding the gene layer (dark-purple color), ‘rheumatoid arthritis’ and ‘type 1 diabetes mellitus’ were shown to share the largest number of genes with an edge weight equal to 3. Similarly, most diseases connected with MS are autoimmune diseases, such as ‘lupus erythematosus’ (an autoimmune disease of the musculoskeletal system) and ‘ulcerative colitis’, with an edge weight equal to 1 for both diseases. In general, patients with autoimmune diseases are more likely to develop an additional autoimmune disease, and there is a correlation between MS and other autoimmune diseases. Several genetic and environmental factors, such as viruses, have been implicated in the co-existence of MS with other autoimmune diseases [30], [31]. Specifically, studies showed that MS patients had a higher incidence of rheumatoid arthritis [32], [33]. Also, some studies investigated the co-occurrence of MS and ‘type 1 diabetes mellitus’ [34]. At the gene level, we can also observe the connection of a neurodegenerative disease with MS, the ‘secondary Parkinson disease’ (Parkinsonism), with an edge weight equal to 1. Regarding the protein level (turquoise color), ‘rheumatoid arthritis’ was detected to share a single protein with MS. No other commonalities between MS and other diseases were detected.
Moreover, we can observe that the contribution of the metabolite layer (blue color) in the subnetwork brings up neurodegenerative diseases, for instance, ‘Alzheimer’s disease’ and ‘Parkinson’s disease’, which had high scores with MS, 27 and 26, respectively. According to recent studies, different neurodegenerative diseases have altered metabolic processes in common [35]. Potential stage predictors of diseases like ‘Alzheimer’s disease’, ‘Parkinson’s disease’ and MS include metabolites that measure the degree of neuronal damage and alterations in the myelin sheath [36]. Further, infectious diseases like ‘human immunodeficiency virus’ (HIV) (2) and ‘bacterial meningitis’ (1) have also appeared in the network due to metabolic involvement. Additionally, autoimmune diseases like ‘ulcerative colitis’ (2), ‘Guillain-Barre syndrome’ (1), which is an acute, immune-mediated disease of the peripheral nervous system, and ‘celiac disease’ (1), an autoimmune disease of the gastrointestinal tract, share commonalities with MS. The same number of neurodegenerative and autoimmune diseases have contributed to this layer (3 neurodegenerative and 3 autoimmune diseases).
Compared to the previous layers, the variant layer (green color) adds more MS-associated diseases to the network, the majority of which are autoimmune diseases. Examples include ‘psoriasis’ (25), an autoimmune disease of skin and connective tissue, ‘primary biliary cholangitis’ (19), an autoimmune disease of the gastrointestinal tract, and ‘ankylosing spondylitis’ (14), an autoimmune disease of the musculoskeletal system. In total, 20 autoimmune diseases were found to be associated with MS. In addition, there were six neurodegenerative diseases, such as ‘Parkinson’s disease’ (1), ‘Huntington’s disease’ (1) and ‘cerebellar ataxia’ (1), and three infectious diseases, such as ‘hepatitis B’ (6) and ‘leprosy’ (2). ‘Rheumatoid arthritis’ had the highest weight (139), followed by ‘type 1 diabetes mellitus’ (77), ‘ulcerative colitis’ (68) and ‘lupus erythematosus’ (46).
Remarkably, the phenotypic (shared symptoms) information-hpo (blue-purple color) subnetwork shown in Fig. 3 as ‘phenotypic disease-disease commonality’ illustrates that the majority of the diseases that are connected with MS are neurodegenerative diseases instead of autoimmune and infectious diseases. The node-disease with the highest weight was ‘cerebellar ataxia’ (8). Different studies support that in both RRMS and progressive MS, cerebellar dysfunction leads to a variety of neurological symptoms [37]. Some other neurodegenerative diseases are ‘X-linked hereditary ataxia’ (5), ‘neurodegeneration with brain iron accumulation’ (5), ‘episodic ataxia’ (5) as well as ‘ALS’ (5). Common phenotypic abnormalities with the aforementioned neurodegenerative diseases and MS include ‘depression’, ‘muscle weakness’, ‘incoordination’, ‘spasticity’, ‘diplopia’ and ‘urinary incontinence’. Also, high weight was observed in some autoimmune diseases like ‘temporal arteritis’ (4), which is an autoimmune vasculitis and ‘myasthenia gravis’ (4), an autoimmune disease of the nervous system. Common phenotypic abnormalities with the previous diseases and MS include ‘muscle weakness’, ‘diplopia’ and ‘paresthesias’. It is essential to mention that ‘rheumatoid arthritis’, which had the highest edge weight in the previous subnetwork, as well as other diseases with high edge weight score like ‘type 1 diabetes mellitus’, did not appear in this subnetwork. In total, MS shared common phenotypic abnormalities with 19 neurodegenerative diseases, 11 autoimmune diseases and only 3 infectious diseases.
Table 4 shows the number of common elements between MS and other diseases. Each row represents a disease and each column represents one of the 5 layers. The numbers represent the common elements between a specific disease and MS. In addition, the combined network in Fig. 3 shows in network format the unique and shared diseases between the layers. This network was constructed using R and the igraph package. We used as ‘core nodes’ the specific information about MS (MS Metabolites, MS Genes, MS HPO Phenotypes, MS Proteins and MS Variants) to show where the information and the disease nodes from the two networks above were extracted from. The node color refers to the specific association type (molecular evidence, phenotypic manifestations, or both phenotypic and molecular evidence) and the node size refers to the degree (the number of its adjacent edges). We can observe diseases like ‘Guillain-Barre syndrome’ and ‘HIV’ to be part only of the metabolite layer (node size equal to 1). Furthermore, unique diseases from the variant layer include ‘hepatitis B’, ‘autoimmune thyroiditis’ and ‘ankylosing spondylitis’ (node size equal to 1). Also, the phenotypic information-hpo layer has unique diseases such as ‘episodic ataxia’, ‘neurodegeneration with brain iron accumulation’ and ‘X-linked hereditary ataxia’ (node size equal to 1). Additionally, based on shared diseases, we can observe that the variant and the phenotypic information-hpo layers had the most common diseases (node size equal to 2). ‘Cerebellar ataxia’, ‘Huntington’s disease’, ‘psoriasis’, ‘myasthenia gravis’, ‘leprosy’ and ‘temporal arteritis’ were some of those common diseases. Common diseases among the molecular evidence include ‘rheumatoid arthritis’, ‘type 1 diabetes mellitus’, ‘lupus erythematosus’ and ‘ulcerative colitis’. The gene, protein and variant layers are shown to share only 1 disease, ‘rheumatoid arthritis’ (node size equal to 3).
The metabolite, variant and gene layers had ‘ulcerative colitis’ as the common disease (node size equal to 3). The gene and variant layers had 2 common diseases: ‘type 1 diabetes mellitus’ and ‘lupus erythematosus’ (node size equal to 2). Moreover, the metabolite, variant and phenotypic information-hpo layers had 3 common diseases (‘Parkinson’s disease’, ‘ALS’ and ‘celiac disease’). The metabolite and phenotypic information-hpo layers had ‘Alzheimer’s disease’ and ‘bacterial meningitis’ (node size equal to 2) in common, and the gene and phenotypic information-hpo layers had only 1 disease in common, ‘secondary Parkinson disease’ (Parkinsonism) (node size equal to 2).
Table 4.
The number of common elements between Multiple Sclerosis and other diseases. The first column depicts the disease that has commonalities with Multiple Sclerosis. The other columns represent the numbers of the common elements between MS and other diseases in the phenotypic information (hpo), the gene, the metabolite, the protein and the variant layers.
| disease | hpo | gene | metabolite | protein | variant |
|---|---|---|---|---|---|
| cerebellar ataxia | 8 | - | - | - | 1 |
| amyotrophic lateral sclerosis | 5 | - | 2 | - | 1 |
| episodic ataxia | 5 | - | - | - | - |
| neurodegeneration with brain iron accumulation | 5 | - | - | - | - |
| x-linked hereditary ataxia | 5 | - | - | - | - |
| myasthenia gravis | 4 | - | - | - | 3 |
| parkinson's disease | 4 | - | 26 | - | 1 |
| temporal arteritis | 4 | - | - | - | 1 |
| neuroacanthocytosis | 3 | - | - | - | - |
| spastic ataxia | 3 | - | - | - | - |
| spinal muscular atrophy | 3 | - | - | - | - |
| celiac disease | 2 | - | 1 | - | 36 |
| leprosy | 2 | - | - | - | 2 |
| multiple system atrophy | 2 | - | - | - | - |
| pontocerebellar hypoplasia | 2 | - | - | - | - |
| secondary parkinson disease (parkinsonism) | 2 | 1 | - | - | - |
| sjogren's syndrome | 2 | - | - | - | 1 |
| alopecia areata | 1 | - | - | - | 7 |
| alzheimer's disease | 1 | - | 27 | - | - |
| amyotrophic lateral sclerosis-parkinsonism_dementia complex 1 | 1 | - | - | - | 1 |
| autoimmune hemolytic anemia | 1 | - | - | - | - |
| autoimmune hepatitis | 1 | - | - | - | - |
| autoimmune polyendocrine syndrome | 1 | - | - | - | - |
| bacterial meningitis | 1 | - | 1 | - | - |
| familial encephalopathy with neuroserpin inclusion bodies | 1 | - | - | - | - |
| graves' disease | 1 | - | - | - | 11 |
| huntington's disease | 1 | - | - | - | 1 |
| huntington's disease-like 2 | 1 | - | - | - | 1 |
| lateral sclerosis | 1 | - | - | - | - |
| ornithine translocase deficiency | 1 | - | - | - | - |
| primary thrombocytopenia | 1 | - | - | - | - |
| psoriasis | 1 | - | - | - | 25 |
| stress-induced childhood-onset neurodegeneration with variable ataxia and seizures | 1 | - | - | - | - |
| ankylosing spondylitis | - | - | - | - | 14 |
| autoimmune thyroiditis | - | - | - | - | 6 |
| behcet's disease | - | - | - | - | 1 |
| common variable immunodeficiency | - | - | - | - | 2 |
| guillain-barre syndrome | - | - | 1 | - | - |
| hepatitis b | - | - | - | - | 6 |
| human immunodeficiency virus infectious disease | - | - | 2 | - | - |
| latent autoimmune diabetes in adults | - | - | - | - | 2 |
| lupus erythematosus | - | 1 | - | - | 46 |
| osteomyelitis | - | - | - | - | 1 |
| pemphigus | - | - | - | - | 1 |
| primary biliary cholangitis | - | - | - | - | 19 |
| psoriatic arthritis | - | - | - | - | 1 |
| rheumatoid arthritis | - | 3 | - | 1 | 139 |
| type 1 diabetes mellitus | - | 3 | - | - | 77 |
| ulcerative colitis | - | 1 | 2 | - | 68 |
| vitiligo | - | - | - | - | 5 |
Overall, MS is a complex disease with elements of both autoimmune and neurodegenerative disorders. We observed that ‘rheumatoid arthritis’ had the highest summed weight across all layers except the phenotypic information layer, where most of the connected diseases were neurodegenerative and the top-scoring disease was ‘cerebellar ataxia’.
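The two quantities used in this comparison can be reproduced from Table 4 with a few lines of code. The sketch below is illustrative only and is not taken from the List2Net source: it copies a subset of rows from Table 4, assumes that node size is the number of layers in which a disease shares at least one element with MS (our reading of the node sizes reported above), and assumes that the overall weight is the plain sum of the per-layer counts. The names `shared_with_ms`, `node_size` and `summed_weight` are hypothetical.

```python
# Subset of Table 4: number of elements each disease shares with MS, per layer.
# A layer that is absent here corresponds to "-" in the table.
shared_with_ms = {
    "rheumatoid arthritis": {"gene": 3, "protein": 1, "variant": 139},
    "type 1 diabetes mellitus": {"gene": 3, "variant": 77},
    "ulcerative colitis": {"gene": 1, "metabolite": 2, "variant": 68},
    "parkinson's disease": {"hpo": 4, "metabolite": 26, "variant": 1},
    "alzheimer's disease": {"hpo": 1, "metabolite": 27},
    "cerebellar ataxia": {"hpo": 8, "variant": 1},
}


def node_size(layer_counts):
    """Number of layers in which the disease shares elements with MS (assumed definition)."""
    return len(layer_counts)


def summed_weight(layer_counts):
    """Sum of the shared-element counts over all layers (assumed definition)."""
    return sum(layer_counts.values())


# Disease with the largest summed weight over all layers in this subset.
top_overall = max(shared_with_ms, key=lambda d: summed_weight(shared_with_ms[d]))

# Disease with the largest count in the phenotypic information (hpo) layer only.
top_hpo = max(shared_with_ms, key=lambda d: shared_with_ms[d].get("hpo", 0))

print(top_overall, summed_weight(shared_with_ms[top_overall]))
print(top_hpo, shared_with_ms[top_hpo]["hpo"])
```

On this subset, the largest summed weight belongs to ‘rheumatoid arthritis’ and the largest hpo count to ‘cerebellar ataxia’, in line with the observations above.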
4. Conclusion
In this work, we present List2Net, an interactive, user-friendly web application that calculates pairwise comparisons from different types of lists of biological identifiers and offers an easy way to visualize this information as an association network. To our knowledge, List2Net is the first tool to accept an arbitrary number of lists as input, calculate all the pairwise comparisons, and then visualize them as networks. We used this tool, focusing on MS as a case study, to reveal similarities with other diseases on various levels of molecular and phenotypic information. List2Net offers a holistic view of the disease of interest by adding different types of biological information (genes, proteins, metabolites, etc.). For MS, many connections were identified, which can be further analyzed for possible drug repurposing and biomarker discovery.
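The core idea of turning lists into a network can be illustrated with a minimal sketch. It is not the List2Net implementation: it assumes that the lists contain named entities, that an edge is drawn whenever two lists share at least `min_shared` elements, and that the edge weight is the size of the intersection. The Jaccard index is added here only as one possible normalized similarity. The function name `lists_to_edges` and the identifiers `G1` to `G5` are hypothetical.

```python
from itertools import combinations


def lists_to_edges(named_lists, min_shared=1):
    """Compare every pair of lists and return one edge per pair with enough overlap."""
    sets = {name: set(items) for name, items in named_lists.items()}
    edges = []
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        if len(shared) < min_shared:
            continue  # no edge between lists that do not overlap enough
        union = sets[a] | sets[b]
        edges.append({
            "source": a,
            "target": b,
            "weight": len(shared),  # number of common elements
            "jaccard": len(shared) / len(union) if union else 0.0,
        })
    return edges


# Hypothetical input: three lists of made-up identifiers.
example = {
    "list_A": ["G1", "G2", "G3"],
    "list_B": ["G2", "G3", "G4"],
    "list_C": ["G5"],
}

edges = lists_to_edges(example)
for edge in edges:
    print(edge)
```

Here only `list_A` and `list_B` overlap, so the resulting edge list contains a single edge, and `list_C` would appear as an isolated node. An edge list of this shape can then be passed to any network library for layout and analysis.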
The List2Net tool is not without limitations. Currently, when the user uploads multiple multi-column files as layers, the only available option for building the inter-layer memberships of the network is by name. In a future version of List2Net, we aim to add the option to build these inter-layer memberships by content as well. Additionally, List2Net supports up to ten layers in the multi-layer mode, a limit chosen to keep the visualization both appealing and informative. As part of future work, we plan to include associations between imaging and other clinical features to produce a more complete network profile.
CRediT authorship contribution statement
Sotiroula Afxenti: Conceptualization, Methodology, Software, Validation, Investigation, Resources, Writing – original draft, Writing – review & editing. Marios Tomazou: Conceptualization, Methodology, Software, Visualization, Validation, Investigation, Resources, Writing – original draft, Writing – review & editing. George Tsouloupas: Software, Resources, Writing – review & editing. Anastasia Lambrianides: Writing – review & editing, Supervision. Marios Pantzaris: Writing – review & editing, Supervision. George M. Spyrou: Conceptualization, Writing – review & editing, Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research was supported by the Muscular Dystrophy Association Cyprus/Telethon Cyprus.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.11.020.
Appendix A. Supplementary material
Supplementary material
Data availability
The List2Net web application is available at https://list2net.cing-big.hpcf.cyi.ac.cy/