Author manuscript; available in PMC: 2021 Nov 1.
Published in final edited form as: Genomics. 2020 Jul 20;112(6):4288–4296. doi: 10.1016/j.ygeno.2020.07.028

Proteinarium: Multi-Sample Protein-Protein Interaction Analysis and Visualization Tool

David Armanious 1,^, Jessica Schuster 2,3,^, George A Tollefson 2, Anthony Agudelo 2, Andrew T DeWan 4, Sorin Istrail 1,5, James Padbury 2,3,5, Alper Uzun 2,3,5,*
PMCID: PMC7749048  NIHMSID: NIHMS1615491  PMID: 32702417

Abstract

We posit the likely architecture of complex diseases is that subgroups of patients share variants in genes in specific networks which are sufficient to give rise to a shared phenotype. We developed Proteinarium, a multi-sample protein-protein interaction (PPI) tool, to identify clusters of patients with shared gene networks. Proteinarium converts user defined seed genes to protein symbols and maps them onto the STRING interactome. A PPI network is built for each sample using Dijkstra’s algorithm. Pairwise similarity scores are calculated to compare the networks and cluster the samples. A layered graph of PPI networks for the samples in any cluster can be visualized. To test this newly developed analysis pipeline, we reanalyzed publicly available data sets, from which modest outcomes had previously been achieved. We found significant clusters of patients with unique genes which enhanced the findings in the original study.

Keywords: Software, Protein-protein interactions, networks, multi-sample, data visualization

1. Introduction

Genome-wide association studies (GWAS) have become a popular approach to the investigation of complex diseases (1,2) and have made possible the discovery of insights not previously recognized (3–5). However, GWAS have often failed to account for the “missing heritability” in many common diseases (6–10). A major factor contributing to the lack of success of GWAS in identifying “missing heritability” is the fundamental nature of the genetic architecture interrogated by GWAS. The computational approaches underlying GWAS reflect the “common disease common variant hypothesis,” that complex disease architecture is due to additive genetic effects of variants in individual genes. The genetics of complex diseases, however, suggests that this is unlikely. The more likely architecture is that subgroups of patients share variants in genes in specific networks and pathways which are sufficient to give rise to a shared phenotype. It is also likely that variants in genes in different networks and pathways express similar phenotypes and define different subgroups of patients. All nodes in the pathways are unlikely to be equally represented. Because of purifying selection, pathogenic variants may be clustered in overlapping pathways, rate limiting steps or regions of high centrality (11). Thus, resources are needed to identify these shared and individual networks and pathways in clusters of patients within diseases or phenotypes.

With the development of high throughput screening methods, extensive protein–protein interaction (PPI) networks have been built (11,12). PPI networks potentially harbor great power as they reflect the functional action of genes (13,14). Large scale protein interaction maps show that genes involved in related phenotypes frequently interact physically at the level of proteins in model organisms, as well as in humans (15–19). This is also true for disease phenotypes. Proteins that are associated with similar disease phenotypes have a strong tendency to interact with each other (17, 20). In addition, they tend to cluster in the same network zone (21). PPI networks have been used to identify candidate genes and subnetworks associated with complex diseases such as cancer and Alzheimer’s disease (14, 22, 23).

We developed Proteinarium, a multi-sample protein-protein interaction tool to identify clusters of patients with shared networks to better understand the mechanism of complex diseases and phenotypes. This tool was designed to increase the power in the analysis of experimental data by identifying disease associated biological networks that define clusters of patients, as well as the visualization of such networks with user specified parameters. Compared to other PPI network analysis tools, a distinguishing feature of Proteinarium is its ability to identify clusters of samples (patients), their shared PPI networks and representation of samples within the clusters by their group assignment.

2. Material and Methods

Figure 1 presents an overview of the workflow for Proteinarium.

Figure 1. Proteinarium workflow.


(a) Proteinarium accepts gene lists for each sample. (b) All gene symbols are converted to their protein product names. (c1) Interactome information is provided by the STRING database. (c2) Proteins for each sample are mapped onto the interactome (blue circles). (c3) Interactions for each sample are found using Dijkstra’s algorithm. (c4) A layered graph is built by including the imputed proteins (orange) for each sample based on its gene list. (d) The network similarity matrix between samples is built using the Jaccard index. (e) Samples are clustered by the UPGMA method based on their network similarities; for any chosen branch of the dendrogram, the layered graph of PPI networks can be visualized on the fly and saved as an interaction file. (f) Proteinarium provides several output files so that results can be used in other programs such as ITOL, Gephi or Cytoscape.

2.1. PPI Network Data Source

Proteinarium uses the STRING database, version 11, for humans as its network data source. It includes known and predicted PPIs (7). Each PPI has an associated score between 0 and 1000 indicating the confidence of the interaction.

2.2. Input Data

Proteinarium can be used to analyze data for samples with dichotomous phenotypes, multiple samples from a single phenotype or a single sample. For each sample, a list of genes is required as input (seed genes). HUGO Gene Nomenclature Committee (HGNC) gene symbols are used as input. If the samples are from a dichotomous phenotype, the genes for each sample must be in one of two discrete files, group1GeneSetFile or group2GeneSetFile (see Supplementary Materials – (S2) Configuration Table). The seed genes may be derived from deep sequence data, transcriptome analysis, or from any analysis that generates candidate gene lists. It is up to the user how to generate the seed genes as input for Proteinarium.

2.3. Multi-sample Analysis: Constructing Networks

2.3.1. Mapping onto the interactome

Proteinarium is initialized by mapping the seed genes onto STRING’s protein-protein interactome (19). For each seed gene, Proteinarium maps the HUGO Symbol to the associated proteins in STRING’s database (10). Proteinarium then finds the PPIs within the STRING database that correspond to the seed genes, forming a subset of the protein interactome for each sample.

2.3.2. Building a graph with Dijkstra’s Algorithm

For each sample, Proteinarium builds an interaction graph from the seed proteins. We use Dijkstra’s shortest path algorithm to find short, high-confidence paths between all pairs of seed proteins where such a path exists (24). STRING provides a confidence score S between 0 and 1000 for every edge connecting two proteins. The higher the score, the higher the confidence in the specific protein-protein interaction. We define the cost of an edge between two proteins to be 1000 – S, so that the highest confidence interactions are the lowest cost edges. The algorithm has been modified to consider only paths whose number of vertices (maxPathLength) and sum of edge weights (maxPathCost) are below user-specified values (see Supplementary Materials, (S2) Configuration Table). For each sample i, the graph Gi = (Vi, Ei) is generated, where each vertex v ∈ Vi corresponds to a protein and each edge e ∈ Ei corresponds to a protein interaction. Only the seed proteins and the proteins that were required to minimally connect the seed proteins are included in the set of vertices Vi.
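The bounded path search described above can be sketched as follows. This is a minimal illustration, not Proteinarium's Java implementation: the function name, the edge dictionary layout and the choice to treat both bounds as inclusive are assumptions made for the example. To respect the vertex bound, the search runs Dijkstra over (protein, vertices used) states rather than over proteins alone.

```python
import heapq


def bounded_shortest_path(edges, source, target, max_path_length, max_path_cost):
    """Cheapest path from source to target under a vertex-count and a cost bound.

    edges: dict mapping (protein_a, protein_b) -> STRING confidence score S.
    Returns (cost, path) or None when no path satisfies both bounds.
    """
    # Undirected adjacency list with edge cost 1000 - S.
    adjacency = {}
    for (a, b), score in edges.items():
        cost = 1000 - score
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    # A state is (protein, number of vertices on the path so far), so a cheap
    # but long route cannot hide a costlier route that fits the length bound.
    best = {(source, 1): 0}
    heap = [(0, 1, source, (source,))]
    while heap:
        cost, n_vertices, node, path = heapq.heappop(heap)
        if node == target:
            return cost, list(path)
        if cost > best.get((node, n_vertices), float("inf")):
            continue  # stale queue entry
        if n_vertices == max_path_length:
            continue  # extending would exceed the vertex bound
        for neighbor, edge_cost in adjacency.get(node, []):
            new_cost = cost + edge_cost
            if new_cost > max_path_cost:
                continue  # extending would exceed the cost bound
            state = (neighbor, n_vertices + 1)
            if new_cost < best.get(state, float("inf")):
                best[state] = new_cost
                heapq.heappush(
                    heap, (new_cost, n_vertices + 1, neighbor, path + (neighbor,))
                )
    return None


# Toy interactome: A-B and B-C are high confidence, A-C is lower confidence.
toy_edges = {("A", "B"): 900, ("B", "C"): 900, ("A", "C"): 700}
via_b = bounded_shortest_path(toy_edges, "A", "C", 3, 500)
direct = bounded_shortest_path(toy_edges, "A", "C", 2, 500)
```

With three vertices allowed, the search prefers the route through B (cost 200) and B becomes an imputed protein; with only two vertices allowed, it falls back to the direct, lower-confidence edge (cost 300).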

2.3.3. Building the similarity matrix and clustering samples with UPGMA

After generating the graphs for each sample, Proteinarium calculates the similarity between each pair of graphs using the Jaccard distance (25). The distance di,j between any two graphs Gi and Gj is calculated as:

$$d_{i,j} = 1 - \frac{|G_i \cap G_j|}{|G_i \cup G_j|}$$
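As a small worked example, the Jaccard distance (one minus the size of the intersection divided by the size of the union) can be computed on sets. The sketch below assumes that each graph is represented by its set of vertex names; the text does not spell out whether vertices, edges or both are compared, so that choice and the function names are illustrative only.

```python
def jaccard_distance(graph_i, graph_j):
    """Jaccard distance between two graphs given as sets of elements."""
    union = graph_i | graph_j
    if not union:
        return 0.0  # two empty graphs are treated as identical (assumption)
    return 1.0 - len(graph_i & graph_j) / len(union)


def distance_matrix(graphs):
    """graphs: dict sample name -> set of elements. Returns {(a, b): distance}."""
    names = sorted(graphs)
    return {
        (a, b): jaccard_distance(graphs[a], graphs[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


sample_graphs = {
    "s1": {"ESR1", "NCOA1", "NFKB1"},
    "s2": {"NCOA1", "NFKB1", "B2M"},
    "s3": {"B2M", "HLA-A"},
}
distances = distance_matrix(sample_graphs)
```

Here s1 and s2 share two of four distinct proteins, giving a distance of 0.5, while s1 and s3 share none, giving the maximum distance of 1.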

The resulting distance matrix is then used as the input to cluster the set of graphs. Clustering is performed hierarchically using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (12). Initially, all clusters consist of a single graph, corresponding to a leaf of a dendrogram. While more than one cluster remains, we combine the two clusters that are closest to each other according to the distance metric d and then update all cluster distances.

In the standard UPGMA algorithm, the weight wi would correspond to the number of graphs in that cluster, so all the weights of all leaves in the tree would equal 1. Our modified algorithm scales either the Group 1 graph factor or the Group 2 graph factor to a constant ρ ≥ 1 such that the product of the size of the graph set and ρ is equal to the size of the other graph set.
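The clustering loop can be sketched as below. This is an illustration under stated assumptions rather than the tool's own code: clusters are represented as nested tuples, the merge distance is reported as the height without normalization, and `balanced_leaf_weights` reflects one reading of the group scaling, in which every leaf of the smaller group receives weight ρ (larger group size divided by smaller group size) and every other leaf receives weight 1.

```python
def balanced_leaf_weights(group1, group2):
    """Leaf weights so that both (non-empty) groups carry the same total weight."""
    small, large = sorted((group1, group2), key=len)
    rho = len(large) / len(small)
    weights = {name: rho for name in small}
    weights.update({name: 1.0 for name in large})
    return weights


def upgma(names, dist, leaf_weights=None):
    """Agglomerative clustering with weighted-average distance updates.

    names: list of leaf names; dist: dict {(a, b): distance} for every leaf pair.
    Returns (tree as nested tuples, list of (cluster_a, cluster_b, height)).
    """
    weights = {n: (leaf_weights or {}).get(n, 1.0) for n in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        height = d[frozenset((a, b))]
        merged = (a, b)
        # Distance from the merged cluster to every other cluster is the
        # weight-averaged distance of its two parts.
        for c in clusters:
            if c == a or c == b:
                continue
            d[frozenset((merged, c))] = (
                weights[a] * d[frozenset((a, c))] + weights[b] * d[frozenset((b, c))]
            ) / (weights[a] + weights[b])
        weights[merged] = weights[a] + weights[b]
        clusters = [c for c in clusters if c != a and c != b] + [merged]
        merges.append((a, b, height))
    return clusters[0], merges


toy_dist = {("s1", "s2"): 0.2, ("s1", "s3"): 0.8, ("s2", "s3"): 0.6}
tree, merges = upgma(["s1", "s2", "s3"], toy_dist)
```

With unit leaf weights this reduces to standard UPGMA: s1 and s2 merge first at 0.2, and the merged cluster joins s3 at the average of 0.8 and 0.6.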

2.4. Visualizing the “Global Dendrogram”

Proteinarium outputs the results of the clustering algorithm as a dendrogram in a .PNG file for visualization and in a text file in Newick tree format. The horizontal length of the line segments is proportional to the height of the cluster in the tree. The lower the height, the more similar the graphs within the cluster are. All heights are normalized between 0 and 1, with the leaves occupying height 0 and the ancestor occupying height 1. The lines are also colored according to the (weighted) percentage of Group 1 versus Group 2 samples comprising the cluster: if a group comprises more than 60% of the weight, then the edge is colored by that group; otherwise, the edge is colored black.
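The branch coloring rule can be written as a short function. The sketch below is illustrative: the function name and return labels are made up for the example, and the 60% cutoff is taken from the text as a strict "more than" comparison.

```python
def branch_color(group1_weight, group2_weight, threshold=0.6):
    """Return which group colors a dendrogram branch, or 'black' if neither dominates."""
    total = group1_weight + group2_weight
    if total == 0:
        return "black"
    if group1_weight / total > threshold:
        return "group1"
    if group2_weight / total > threshold:
        return "group2"
    return "black"


dominant = branch_color(8, 1)   # Group 1 holds about 89% of the weight
balanced = branch_color(3, 2)   # exactly 60%, which is not "more than 60%"
```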

2.5. Visualizing “Local” Clusters

2.5.1. Constructing a layered graph

For a given cluster C comprising n graphs G1 = (V1, E1), …, Gn = (Vn, En), Proteinarium can construct and output a summary layered graph LG = (VLG, ELG) consisting of each of the n graphs with the following vertices and edges:

$$V_{LG} = V_1 \cup V_2 \cup \dots \cup V_n$$
$$E_{LG} = E_1 \cup E_2 \cup \dots \cup E_n$$

Additionally, a count annotation is defined for each vertex v in LG:

$$LG.count(v) = \sum_{i=1}^{n} \mathbf{1}_{V_i}(v)$$

The count annotation is the number of sample networks in which the protein (v) of the layered graph is found. If a cluster C contains samples from both sample groups, there are 5 possible networks being created on the fly. These networks are as follows:

  1. [Group 1] the network for Group 1 samples only;

  2. [Group 2] the network for Group 2 samples only;

  3. [Group 1 + Group 2] the network for both Group 1 and Group 2 samples;

  4. [Group 1 − Group 2] the network where Group 2 is subtracted from Group 1;

  5. [Group 2 − Group 1] the network where Group 1 is subtracted from Group 2.

Details on the construction of graphs 4 and 5 are in Supplementary Materials (S1a).
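The union-and-count construction of the layered graph can be sketched as follows. The data layout (each sample graph as a pair of a vertex set and an edge set) and the function name are assumptions for illustration; the subtraction networks are not covered here because their construction is only given in the Supplementary Materials.

```python
from collections import Counter


def layered_graph(graphs):
    """Union of sample graphs plus, per vertex, the number of graphs containing it.

    graphs: list of (vertices, edges); vertices is a set of protein names and
    edges is a set of frozenset({a, b}) pairs.
    """
    vertices, edges, count = set(), set(), Counter()
    for sample_vertices, sample_edges in graphs:
        vertices |= sample_vertices
        edges |= sample_edges
        count.update(sample_vertices)  # each sample contributes at most 1 per vertex
    return vertices, edges, count


g1 = ({"ESR1", "NCOA1"}, {frozenset(("ESR1", "NCOA1"))})
g2 = ({"NCOA1", "NFKB1"}, {frozenset(("NCOA1", "NFKB1"))})
lg_vertices, lg_edges, lg_count = layered_graph([g1, g2])
```

In this toy cluster NCOA1 appears in both sample networks and therefore has a count of 2, while ESR1 and NFKB1 each have a count of 1.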

2.5.2. Visualizing and annotating the layered graph

Given a layered graph, LG, corresponding to a cluster C, we lay out the vertices according to a simple implementation of a force-directed layout algorithm. We color each vertex v according to the samples in which v’s corresponding gene is present:

  1. Only samples from Group 1: yellow [group1VertexColor]

  2. Only the samples from Group 2: blue [group2VertexColor]

  3. Samples from both Group 1 and Group 2: green [bothGroupsVertexColor] or a 50/50 mixture of (1) and (2)

  4. Neither samples from Group 1 nor Group 2 (Protein was inferred from the pairwise path finding algorithm): red [defaultVertexColor]

Additionally, the opacity of each vertex v is calculated linearly between a minimum opacity m and 255:

$$opacity(v) = m + (255 - m) \cdot \frac{LG.count(v)}{\max_{v'} LG.count(v')}$$

The same calculation is used for the opacity of the edges.
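A short numeric sketch of the opacity scaling follows. It assumes that the count is normalized by the largest count in the layered graph, so that the most frequent vertex is fully opaque (255) and rarer vertices approach the minimum opacity m; the function name and the default value of `min_opacity` are hypothetical.

```python
def vertex_opacity(count, min_opacity=55):
    """Map each vertex count linearly onto the range (min_opacity, 255]."""
    max_count = max(count.values())
    return {
        vertex: min_opacity + (255 - min_opacity) * c / max_count
        for vertex, c in count.items()
    }


opacity = vertex_opacity({"NCOA1": 4, "ESR1": 2, "NFKB1": 1})
```

With m = 55, a protein found in all four samples is drawn at 255, one found in two samples at 155, and one found in a single sample at 105.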

Several parameters in the rendering and the force-directed algorithm can be set by the user to obtain the best visualization possible (see Supplementary Materials (S2) Config Table). These include parameters that control the number of nodes to render which allows the user to control how much network information is to be displayed to keep the visualization tractable. We rank the seed genes of a given graph according to the number of pairwise paths that particular vertex appears in. We then retain the top k vertices such that the total number of unique vertices found within all pairwise paths of k selected vertices is less than or equal to the maximum number of vertices to be displayed. In doing so, all graphs will show only complete paths whose endpoints originate from seed genes.

2.6. Fisher’s exact test p-value

The p-value for a cluster is given by Fisher’s exact test and indicates the probability of observing, among all n samples, a cluster of size m with the given proportions of Group 1 and Group 2 samples relative to the total number of Group 1 and Group 2 samples. This measure provides a sense of how disproportionately a particular group (either Group 1 or Group 2) is represented in a cluster; the lower the p-value, the more confidence we have that the over-representation is not due to random chance. The user can set the significance threshold (significanceThreshold); clusters that have a p-value under this threshold are colored red in the dendrogram visualization (colorSignificantBranchLabels).
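The test can be illustrated with a standard-library sketch built on the hypergeometric distribution. The two-sided convention used here (summing all tables that are no more probable than the observed one) is an assumption, since the text does not state which alternative is used, and the function name is illustrative.

```python
from math import comb


def cluster_fisher_p(cluster_g1, cluster_g2, total_g1, total_g2):
    """Two-sided Fisher's exact p-value for the group composition of one cluster.

    cluster_g1, cluster_g2: Group 1 and Group 2 samples inside the cluster.
    total_g1, total_g2: Group 1 and Group 2 samples in the whole data set.
    """
    m = cluster_g1 + cluster_g2  # cluster size

    def ways(k):
        # Number of clusters of size m containing exactly k Group 1 samples.
        return comb(total_g1, k) * comb(total_g2, m - k)

    observed = ways(cluster_g1)
    low, high = max(0, m - total_g2), min(total_g1, m)
    extreme = sum(ways(k) for k in range(low, high + 1) if ways(k) <= observed)
    return extreme / comb(total_g1 + total_g2, m)


# Toy data set: 9 Group 1 and 7 Group 2 samples; a cluster holds 8 and 2 of them.
p_toy = cluster_fisher_p(8, 2, 9, 7)
```

For this toy table the p-value is 280/8008, roughly 0.035. A cluster such as the one with 8 of 51 cases and 1 of 114 controls could be scored the same way with `cluster_fisher_p(8, 1, 51, 114)`.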

2.7. Clustering coefficients

For each cluster, we calculate the global clustering coefficients for each of the four graphs: Group 1 graph, Group 2 graph, and the two graphs resulting from the weighted differences of each graph as described above. The clustering coefficient is a measure describing the tendency of vertices in a particular graph to cluster together. For a graph G = (V, E) with |V| = n vertices, the global clustering coefficient is given by

$$\frac{1}{n} \sum_{i=1}^{n} \frac{|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}|}{|N_i| \cdot (|N_i| - 1)}$$

where Ni is the set of neighbors of vertex vi, i.e. Ni = {vj : eij ∈ E}.
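The coefficient can be computed directly from an edge list, as in the sketch below. Two conventions are assumed here because the formula leaves them open: each undirected edge between two neighbors is counted in both directions, so a fully connected neighborhood scores 1, and vertices with fewer than two neighbors contribute 0 to the average.

```python
def mean_clustering_coefficient(vertices, edges):
    """Average over vertices of (linked neighbor pairs) / (possible neighbor pairs)."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    total = 0.0
    for v in vertices:
        n_i = neighbors[v]
        k = len(n_i)
        if k < 2:
            continue  # no neighbor pairs; contributes 0
        # Ordered pairs (j, l) of distinct neighbors that are themselves adjacent.
        linked = sum(1 for j in n_i for l in n_i if j != l and l in neighbors[j])
        total += linked / (k * (k - 1))
    return total / len(vertices) if vertices else 0.0


# Triangle A-B-C with a pendant vertex D attached to C.
toy_vertices = ["A", "B", "C", "D"]
toy_edge_list = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
coefficient = mean_clustering_coefficient(toy_vertices, toy_edge_list)
```

In the toy graph, A and B each score 1, C scores 1/3 (one of its three neighbor pairs is linked) and D scores 0, giving an average of 7/12.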

2.8. Program Availability

Proteinarium is a command-line tool written entirely in Java with no external dependencies. We also built a fully functional graphical user interface (GUI) for users who are not familiar or comfortable using the command line option. Both the command-line version and the GUI are available to download from https://github.com/alperuzun/Proteinarium/. There are two videos on how to download, install and run the command-line and GUI versions of Proteinarium. They are both available from the GitHub page.

3. Results

3.1. Running Proteinarium on Simulated Data

We evaluated Proteinarium’s performance on simulated datasets and on the datasets for two use cases. The STRING database was used to generate two dissimilar PPI networks, “1” and “2.” A different seed gene was selected for each, and a network was generated by extending the path length. The seed genes were ESR1 (HGNC:3467) and B2M (HGNC:914), respectively. PPI networks 1 and 2 were built using the following parameter settings for the STRING database: (1) medium confidence of 0.4; (2) maximum number of interactions to show of no more than 50 interactors in the 1st shell and no more than 20 interactors in the 2nd shell; (3) all active interaction sources included. The proteins in these networks do not overlap and are located in different regions of the interactome (separation score, 0.578) (26). Both networks have a total of 71 proteins and their clustering coefficients were 0.590 and 0.691, respectively. The list of protein names for both networks is provided in a supplementary file (Supplementary 1). We simulated two groups, each with 50 samples (Figure 2). For each sample in Group 1, 10 genes were randomly selected from PPI network 1, and for each sample in Group 2, 10 genes were randomly selected from PPI network 2. We ran Proteinarium using these Group 1 and Group 2 seed gene files. To add noise to the simulated data, for each sample in Group 1 a percentage of its genes was randomly replaced with genes from PPI network 2 (20%, 30%, 40% and then 50%). Similarly, for samples in Group 2, a percentage of genes was replaced at random with genes from PPI network 1. Representative dendrograms reflecting the output of Proteinarium for the simulations with 0%, 20%, 30%, 40% and 50% noise are shown in Figure 2, and the performance is summarized in Table 1. Samples in Group 1 are colored yellow, and samples in Group 2 are colored blue. The five dendrograms show the clustering of the samples based on their network similarities.

With no noise, as expected, samples from Group 1 clustered perfectly together and samples from Group 2 clustered perfectly together (power: 100%, p<.05). As more noise was added to the data sets, the power decreased from 100% at 20% noise to 95.4% at 30% noise and 15.6% at 40% noise (Table 1). At 50% noise (i.e. the null hypothesis) the power to discriminate groups was less than 5%.
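The noisy seed lists used in such a simulation can be generated along the following lines. This is a sketch with placeholder protein names, not the script used for the paper: it draws the noisy fraction of each gene list directly from the other network, which is intended to be equivalent to sampling from the sample's own network and then replacing genes, and the function name and seed handling are assumptions.

```python
import random


def simulate_group(own_network, other_network, n_samples=50, n_genes=10,
                   noise=0.0, seed=0):
    """Return n_samples gene lists; a fraction `noise` comes from the other network."""
    rng = random.Random(seed)
    n_noise = round(n_genes * noise)
    own, other = sorted(own_network), sorted(other_network)
    samples = []
    for _ in range(n_samples):
        genes = rng.sample(own, n_genes - n_noise)  # genes from the group's own network
        genes += rng.sample(other, n_noise)         # "noise" genes from the other network
        samples.append(genes)
    return samples


# Placeholder networks of 71 proteins each, with no shared members.
network_1 = {f"N1_{i}" for i in range(71)}
network_2 = {f"N2_{i}" for i in range(71)}
group_1 = simulate_group(network_1, network_2, noise=0.3, seed=1)
group_2 = simulate_group(network_2, network_1, noise=0.3, seed=2)
```

At 30% noise, each Group 1 sample carries seven genes from network 1 and three from network 2, and the reverse holds for Group 2.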

Figure 2. Simulations with Proteinarium on noisy data.


(a1 and a2) The STRING database was used to obtain PPI Networks 1 and 2, respectively. Both networks contain 71 proteins and their clustering coefficients were 0.590 and 0.691. There are no shared proteins between these two networks. (b) Simulated data were generated for Group 1 and Group 2, each with 50 samples. For each sample in Group 1, 10 genes were randomly selected from PPI Network 1. To add noise to the simulated data, for each sample in Group 1 a percentage of its genes was randomly replaced with genes from PPI Network 2 (20%, 30%, 40% and 50%). Similarly, for samples in Group 2, a percentage of genes was replaced at random with genes from PPI Network 1. This is represented by a series of pie chart schematics, showing the different amounts of “noise” added to each sample in the various simulations (S1–S5). (c1–c5) Representative dendrograms reflecting the output of Proteinarium for the simulations with 0%, 20%, 30%, 40% and 50% noise, respectively. Group 1 samples are represented in yellow, and Group 2 samples are represented in blue. The five dendrograms show the clustering of the samples based on their network similarities.

Table 1.

Proteinarium’s performance on simulated datasets

| Test | Replicate (noise fraction) | Number of simulations with p < 0.05 | Total simulations | Fraction of simulations with p < 0.05 | Number of simulations with p ≥ 0.05 | Fraction of simulations with p ≥ 0.05 |
|---|---|---|---|---|---|---|
| 1 | 0.2 | 500 | 500 | 1 | 0 | 0 |
| 2 | 0.3 | 477 | 500 | 0.954 | 23 | 0.046 |
| 3 | 0.4 | 78 | 500 | 0.156 | 422 | 0.844 |
| 4 | 0.5 | 23 | 500 | 0.046 | 477 | 0.954 |

3.2. Running Proteinarium on Use Cases

We have analyzed publicly available data sets, from which modest outcomes had previously been achieved, to test implementation of our newly developed analysis pipeline.

3.2.1. Genome-wide expression study of preterm birth

We implemented Proteinarium on a previously published genome wide expression study of preterm birth (27). The aim of the original study was to investigate maternal whole blood gene expression profiles associated with spontaneous preterm birth (SPTB, <37 weeks gestation) in asymptomatic pregnant women. The study population consisted of SPTB cases and a matched subgroup of women who delivered at term (51 SPTBs, 114 term delivery controls). We used the gene expression microarray data generated from maternal blood collected between 27–33 weeks of gestation. Raw expression data are accessible at the NCBI GEO database, GSE59491 (28). The authors had performed univariate analyses to determine the differential gene expression associated with SPTB. We calculated z-scores for each gene in the preterm birth samples, using the mean of the control samples for reference (and vice-versa). For each sample, the genes were ranked according to their z-score and the top 50 genes were used as the sample’s gene list for input to Proteinarium. The top 50 genes for each case and control are provided as supplementary files (Supplementary 2a and 2b). The configuration file used to run Proteinarium is available as a supplementary file (Supplementary 3). The dendrogram in Figure 3a demonstrates the clustering of the samples based on their network similarities.
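The seed gene selection step can be sketched as follows. Several details are assumptions made for the example and are not stated in the text: the standard deviation is taken from the reference group, genes are ranked by the absolute value of their z-score, and genes with zero variance in the reference group are skipped. The function name and data layout are illustrative.

```python
from statistics import mean, stdev


def top_genes_by_z(sample, reference, k=50):
    """Rank a sample's genes by |z| relative to a reference group and keep the top k.

    sample: dict gene -> expression value in this sample.
    reference: dict gene -> list of expression values in the reference group.
    """
    z_scores = {}
    for gene, value in sample.items():
        ref_values = reference[gene]
        sd = stdev(ref_values)
        if sd == 0:
            continue  # z-score undefined when the reference group does not vary
        z_scores[gene] = (value - mean(ref_values)) / sd
    return sorted(z_scores, key=lambda g: abs(z_scores[g]), reverse=True)[:k]


case_sample = {"G1": 10.0, "G2": 5.0, "G3": 0.0}
control_values = {"G1": [4.0, 5.0, 6.0], "G2": [4.0, 5.0, 6.0], "G3": [1.0, 2.0, 3.0]}
seed_genes = top_genes_by_z(case_sample, control_values, k=2)
```

In this toy case G1 has a z-score of 5, G3 of −2 and G2 of 0, so the two seed genes are G1 and G3.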

Figure 3. A genome wide expression study of preterm birth.


Proteinarium was implemented on a previously published genome wide expression study of preterm birth. (a) Represents the dendrogram for cases and controls (51 SPTBs, 114 term delivery controls). (b) Cluster 102 (C102) represented in a zoom view. It contains significantly more preterm birth samples (n=8) than control samples (n=1). (c) Shows the layered PPI network of C102 for these preterm birth samples, consisting of 43 nodes. We found that 11 (in black circles) out of these 43 genes had previously been shown to be nominally differentially expressed in SPTB.

Cluster 102 contained significantly more preterm birth samples (n=8) than control samples (n=1). Figure 3c shows the layered PPI network for these preterm birth samples, consisting of 43 proteins. We ran GO term analysis via DAVID software on all 43 members of the network. We found significantly enriched GO terms, based on Bonferroni-corrected p-values, for biological processes, molecular functions and cellular components. The enriched biological processes are nuclear-transcribed mRNA catabolic process, nonsense-mediated decay (GO:0000184); mRNA splicing, via spliceosome (GO:0000398); and movement of cell or subcellular component (GO:0006928). The enriched molecular functions are poly(A) RNA binding (GO:0044822), protein binding (GO:0005515), GDP binding (GO:0019003) and GTP binding (GO:0005525). The enriched cellular components are cytosol (GO:0005829), membrane (GO:0016020), extracellular exosome (GO:0070062), cytoplasm (GO:0005737), phagocytic vesicle (GO:0045335), F-actin capping protein complex (GO:0008290) and plasma membrane (GO:0005886). We also found that 11 out of these 43 genes had previously been found to be nominally differentially expressed (27). Using over-representation analysis, it was determined that this cluster is significantly enriched for preterm birth associated genes, both overall and amongst the inferred genes only (Chi-sq. 2 tailed p value=0.01). Additionally, the overall network density is 0.08, whereas the density for the subnetwork of the 11 PTB genes (density = 0.16) is significantly greater (p=0.039). Using permutation testing, the probability of seeing a subnetwork of 11 nodes with a density of 0.16 or greater is less than 5%. Thus, the output from Proteinarium confirms and extends the results of this study.

The results further support the validity of the assumptions underlying the design of Proteinarium, that the genetic architecture of complex diseases is characterized by clusters of patients with shared protein-protein interaction networks.
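The density comparison can be illustrated with a small permutation sketch. The permutation scheme used in the study is not described in detail, so the version below is only one plausible reading: it repeatedly draws random node subsets of the same size from the cluster network and reports how often their induced density reaches the observed value. Function names, the number of permutations and the seed are hypothetical.

```python
import random


def density(nodes, edges):
    """Density of the subgraph induced by `nodes`: 2E / (n (n - 1))."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    n_edges = sum(1 for a, b in edges if a in nodes and b in nodes)
    return 2 * n_edges / (n * (n - 1))


def permutation_p_value(all_nodes, edges, subset, n_permutations=10000, seed=0):
    """Fraction of random same-size node subsets at least as dense as `subset`."""
    rng = random.Random(seed)
    observed = density(subset, edges)
    pool = sorted(all_nodes)
    hits = sum(
        density(rng.sample(pool, len(subset)), edges) >= observed
        for _ in range(n_permutations)
    )
    return hits / n_permutations


# Toy network: triangle A-B-C plus a pendant vertex D; edges are listed once each.
toy_nodes = ["A", "B", "C", "D"]
toy_net = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
p_dense = permutation_p_value(toy_nodes, toy_net, ["A", "B", "C"], n_permutations=2000)
```

In the toy network only one of the four possible three-node subsets is as dense as the triangle, so the estimated probability should be close to 0.25.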

3.2.2. Gene expression in prostatic tissue from benign prostatic hyperplasia (BPH) patients.

We used this pipeline to reanalyze a gene expression study of prostatic samples characterized by their methylation status at the SRD5A2 promoter. This study compared the expression profiles of 12 SRD5A2-methylated and 10 SRD5A2-unmethylated prostate samples (NCBI GEO database, GSE101486) (29, 30). SRD5A2 is known to regulate prostate growth. Epigenetic silencing of SRD5A2 is mediated by methylation of the SRD5A2 promoter region (31). We identified the top 100 seed genes for each patient based on their z-score (Supplementary 4a and 4b). The configuration file used to run Proteinarium is available as a supplementary file (Supplementary 5). Our analysis identified 2 significant clusters, one containing 11 out of the 12 methylated samples and the other containing 8 out of the 10 unmethylated samples. The layered graphs of these two clusters are shown in Figure 4. These two networks have a positive separation score, indicating that they reside in different parts of the interactome. In the SRD5A2-methylated dominant cluster (Cluster 1), phosphorylated human estrogen receptor-α (pERα) and aromatase were selectively upregulated. Aromatase mediates the conversion of androgens to estrogens. In our analysis, estrogen receptor 1 (ESR1) protein was imputed and interacts with NCOA1, which is involved in androgenic pathways. Moreover, NFKB1 was also imputed and has an important role in the regulation of methylation of SRD5A2. These two genes are unique to Cluster 1. Annotating these networks using DAVID software showed that prostate cancer ranked higher in the Cluster 1 network (p = 0.0006) than in the Cluster 2 network (p = 0.002). Thus, the output from Proteinarium confirms and extends the results of this study.

Figure 4. A genome wide expression study of prostatic tissue samples from patients with benign prostatic hyperplasia.


Proteinarium was implemented on a previously published genome wide expression study of benign prostatic hyperplasia samples. There are a total of 22 prostatic samples, grouped by methylated SRD5A2 (n=12, in dark purple) and unmethylated SRD5A2 (n=10, in yellow) status. In the dendrogram, two significant clusters were identified. Cluster 1 contains the methylated SRD5A2 samples. Cluster 2 contains the unmethylated SRD5A2 samples. The layered PPI graphs of the two clusters are displayed; red proteins are those imputed from the interactome and blue proteins are the seed genes.

4. Discussion

Proteinarium is a multi-sample, protein-protein interaction tool built to identify clusters of samples with shared networks underlying complex disease. The Proteinarium pipeline takes the seed genes from each sample and converts them to protein symbols, which are then mapped onto the STRING PPI interactome. For each sample, its specific PPI network is built using Dijkstra’s algorithm by searching for the shortest path between each pair of protein inputs. The similarities between all subjects’ PPI networks are calculated and used as the distance metric for clustering samples. In complex genetic diseases, Proteinarium can identify clusters of samples whose shared networks differ between cases and controls.

In order to test Proteinarium, we simulated several data sets, each with a varying percentage of noise added to the data. With no noise, we confirmed full power to distinguish Group 1 samples from Group 2 samples (with 50 samples per group and input gene lists of size 10). As more noise was added to the data sets, it became more difficult to discriminate between groups, with the power decreasing from 100% down to 15% for 40% noise. We also validated that for the null hypothesis (i.e. there is no difference between Group 1 and Group 2) there was no power to discriminate with high confidence. This simulation was useful as a proof of concept that the algorithm performs as it should for both the most extreme case and the null case. It also serves as a reference for users who might have prior information regarding the complexity or heterogeneity of their groups.

We also implemented Proteinarium on previously published genome wide expression studies of two conditions, preterm birth and benign prostatic hyperplasia. As already mentioned, the aim of the preterm birth study was to investigate maternal whole blood gene expression profiles associated with spontaneous preterm birth (SPTB, <37 weeks) in asymptomatic pregnant women. We ran Proteinarium on this data set, using as input the genes that had the highest z-scores when comparing SPTB cases with term controls. We found one cluster in which the “CASE” cohort was highly represented. This cluster included 8 out of the 47 SPTB cases. We analyzed the layered PPI networks of these 8 subjects and found that there was significant over-representation of genes that had been previously found by the authors to be nominally significantly differentially expressed (27). Additionally, the network of these genes was denser than the whole network, with a density so high that it was unlikely to occur by chance alone. These findings support the validity of the assumptions underlying the design of Proteinarium and lend validity to the concept that the genetic architecture of complex disease is characterized by clusters of patients that share variants in genes in specific networks and pathways which are sufficient to give rise to a shared phenotype. In the second data set, we reanalyzed a gene expression study of patients with benign prostatic hyperplasia, in which the expression profiles of 12 SRD5A2-methylated and 10 SRD5A2-unmethylated prostatic samples were compared. We found two significant clusters based on their methylation status. We found unique genes within these clusters which enhanced the findings in the original study.

One of the strengths of Proteinarium is the ability of the user to configure all parameters throughout the pipeline. For example, one of the configuration options is the maximum path length (MPL). MPL is the maximum number of vertices (or imputed genes) within any given path. Thus, a user defined MPL of 2 would allow single genes to be imputed between any two of the seed genes. This parameter allows for flexibility in building a network with varying degrees of distant neighbors. With an MPL of 2 or greater, proteins that connect a pair of the input nodes will be added as imputed nodes to the network. This allows for inferences on genes which may not originally have been considered relevant based on the results used to generate the input data. Once the individual networks are generated and pairwise similarities are calculated, hierarchical clustering is used to cluster the samples. The output of this analysis is displayed as a dendrogram. In addition to the image of the dendrogram (.png), Proteinarium provides the dendrogram structure in Newick tree format. This allows users to visualize the dendrogram on other platforms like Dendroscope or ITOL (23, 24) and to carry out subsequent analysis with other software. An additional output is the Cluster Analysis File, which provides detailed information about clustering results in a tabular format (.csv). This file contains the clustering coefficient of the layered graph for each cluster, the number of samples in the cluster, the p-values for statistical abundance of group type, and the average distance (height) for the branches of the dendrogram for each cluster. This cluster information is also available on the fly on the command line screen when a cluster is selected.

In addition to its use as an analytic tool, Proteinarium provides useful visualizations. The dendrogram allows the user a global view of the distribution of samples into clusters and the group coloring allows for identification of patterns related to phenotypic class. Additionally, as mentioned above, users can select any branch of the dendrogram and display the layered PPI network of the samples within that cluster. This functionality enables users to visualize the layered networks of multiple samples. If a cluster contains samples from both phenotypic groups, then there are 5 possible networks created on the fly. These network images allow users to visually identify interesting aspects of the network, like the most connected genes, the most frequently appearing genes amongst the samples in the cluster, the distribution of genes originated from each phenotypic group as well as those that are imputed. Proteinarium also provides an output file for these networks. The file includes information about the genes in the network, if they were imputed and the number of samples that contain each gene. Additionally, Proteinarium provides a PPI interaction file for each network. This allows users to analyze the network(s) of interest on other platforms such as Cytoscape or Gephi (32, 33).

4.1. Limitations and Future applications

There are some limitations with the methods in this version of Proteinarium that we are aware of. The current release of Proteinarium uses protein-protein interaction data from a single repository, the STRING database. We plan to extend the resource options in the next release by allowing the PPI information data to be derived from any IMEX consortium that the users choose (32). In addition, we plan to allow PPI from other organisms in Proteinarium, e.g. mouse.

In the current version, we use Dijkstra’s algorithm to measure the distance between nodes in the interaction graph. Since Proteinarium finds interactions based on the seed genes of patients with the same disease, we expect these genes to interact frequently and, moreover, to cluster in the same neighborhood of the interactome. The distance between disease genes tends to be smaller than expected at random (26). While we expect our genes to have a high degree of interaction, shortest path algorithms tend to include highly connected and unspecific hubs in the calculation. There are other methods for measuring the distance between nodes of protein-protein interaction networks, such as diffusion kernels, which have been shown to limit this bias. In the next version of Proteinarium, we will incorporate this method into the pipeline as an option for users.

Additionally, in the current pipeline we used a single method for computing the similarity matrix, the Jaccard Index, and a single method for building the rooted tree, the UPGMA hierarchical clustering method. One limitation of this method is that it is possible for subsets of samples to have a distance of 0, while being in close proximity within the interactome (26). The main focus of Proteinarium was single phenotype analysis. We hypothesize that there will be substantial gene overlap amongst the samples, and thus that the samples will be localized in similar network neighborhoods. However, for this step we are considering additional methods, such as maximum likelihood and Bayesian methods, to build dendrograms in the next version of Proteinarium. We were reassured that the analyses of the use cases for actual complex diseases were able to identify significant clusters with over-representation of genes nominally differentially expressed by other methods.
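
The clustering step can be sketched in a few lines. The example below is a simplified, hypothetical Python illustration that assumes each sample's network is reduced to a plain set of gene names and that distance is taken as 1 minus the unweighted Jaccard index; Proteinarium's actual similarity calculation may differ. It builds a UPGMA tree and writes it as a Newick string, the format mentioned above for the dendrogram output. The function names and sample data are invented for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def upgma_newick(samples):
    """Cluster samples (name -> gene set) by UPGMA and return a Newick string."""
    names = sorted(samples)
    # cluster id -> (Newick label, number of samples, height in the tree)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {(i, j): jaccard_distance(samples[names[i]], samples[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Closest pair; ties are broken by cluster id for reproducibility.
        i, j = min(dist, key=lambda p: (dist[p], p))
        height = dist[(i, j)] / 2.0
        (li, ni, hi), (lj, nj, hj) = clusters.pop(i), clusters.pop(j)
        label = f"({li}:{height - hi:.3f},{lj}:{height - hj:.3f})"
        # UPGMA update: size-weighted average distance to every other cluster.
        for k in clusters:
            dik = dist.pop((min(i, k), max(i, k)))
            djk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        del dist[(i, j)]
        clusters[next_id] = (label, ni + nj, height)
        next_id += 1
    (label, _, _), = clusters.values()
    return label + ";"

# Hypothetical samples: S1 and S2 share all genes, S3 shares none with them.
toy_samples = {"S1": {"A", "B", "C"}, "S2": {"A", "B", "C"}, "S3": {"X", "Y"}}
newick = upgma_newick(toy_samples)
```

Because the Jaccard index looks only at shared gene names, S3 is placed at the maximum distance from S1 and S2 no matter how close its genes might lie to theirs in the interactome, which is the motivation for considering alternative distance measures.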

5. Conclusion

In conclusion, we have created a multi-sample protein-protein interaction network tool to support analysis and visualization of single or paired samples. Proteinarium provides several different user-defined outputs with more than 30 configurable options. The tool allows investigators to identify important associations from high throughput data for a variety of disease phenotypes based on PPI networks. Results from two use cases identified significant clusters of patients with shared PPI networks which enhanced the findings in the original studies. Proteinarium was built specifically to identify such clusters. These findings lend support to our hypothesis that the likely architecture of complex diseases is that subgroups of patients share variants in genes in specific networks which are sufficient to give rise to a shared phenotype.

Supplementary Material

Supplementary files 1 to 8 accompany this article.

Highlights

Proteinarium is a multi-sample protein-protein interaction (PPI) tool to identify clusters of samples with shared networks.

Proteinarium provides useful analysis and visualizations: the dendrogram and the layered PPI network.

The dendrogram gives the user a global view of the distribution of samples into clusters. In Proteinarium, any branch of the dendrogram can be displayed as a layered PPI network of the samples within the cluster.

Acknowledgements

We thank Stephen Kidd from the Department of Classics at Brown University for naming the tool. We thank the Center for Computation and Visualization (CCV) and Computational Biology Core at Brown University.

Funding

This work was supported by the National Institutes of Health (grants 5P20GM109035-04, 5P30GM114750) and the Kilguss Research Core at Women & Infants Hospital.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Competing interests

The authors declare that they have no competing interests.

Availability and requirements

Project name: Proteinarium

Project home page: https://github.com/alperuzun/Proteinarium/

Operating system(s): Linux, Mac OS X, Windows

Programming language: Java

Other requirements: none

License: GNU Affero GPL (Version 3)

Any restrictions to use by non-academics: None

Consent for publication

Not applicable.

Availability of data and material

The sample datasets used during the development of current software and the latest version can be freely downloaded at https://github.com/alperuzun/Proteinarium/

References

  • 1. Ku CS, Loy EY, Pawitan Y, Chia KS. The pursuit of genome-wide association studies: where are we now? J Hum Genet. 2010;55(4):195–206.
  • 2. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31–46.
  • 3. Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10(6):392–404.
  • 4. Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010;11(12):843–54.
  • 5. Moore JH. Detecting, characterizing, and interpreting nonlinear gene-gene interactions using multifactor dimensionality reduction. Adv Genet. 2010;72:101–16.
  • 6. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456(7218):18–21.
  • 7. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11(6):446–50.
  • 8. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
  • 9. Gibson G. Hints of hidden heritability in GWAS. Nat Genet. 2010;42(7):558–60.
  • 10. McClellan J, King MC. Genetic heterogeneity in human disease. Cell. 2010;141(2):210–7.
  • 11. Lage K. Protein-protein interactions and genetic diseases: The interactome. Biochim Biophys Acta. 2014;1842(10):1971–80.
  • 12. Xing S, Wallmeroth N, Berendzen KW, Grefen C. Techniques for the Analysis of Protein-Protein Interactions in Vivo. Plant Physiol. 2016;171(2):727–58.
  • 13. Brubaker D, Liu Y, Wang J, Tan H, Zhang G, Jacobsson B, et al. Finding lost genes in GWAS via integrative-omics analysis reveals novel sub-networks associated with preterm birth. Hum Mol Genet. 2016;25(23):5254–64.
  • 14. Nibbe RK, Markowitz S, Myeroff L, Ewing R, Chance MR. Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer. Mol Cell Proteomics. 2009;8(4):827–45.
  • 15. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(Database issue):D535–9.
  • 16. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, et al. A map of the interactome network of the metazoan C. elegans. Science. 2004;303(5657):540–3.
  • 17. Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein-protein interactions. J Med Genet. 2006;43(8):691–8.
  • 18. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006;440(7084):631–6.
  • 19. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607–D13.
  • 20. Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, et al. Variations in DNA elucidate molecular networks that cause disease. Nature. 2008;452(7186):429–35.
  • 21. Barabasi AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56–68.
  • 22. Auffray C. Protein subnetwork markers improve prediction of cancer outcome. Mol Syst Biol. 2007;3:141.
  • 23. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol. 2010;6(1):e1000641.
  • 24. Dijkstra EW. A note on two problems in connexion with graphs. Numerische Mathematik. 1959;1(1):269–71.
  • 25. Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901;37:547–79.
  • 26. Menche J, Sharma A, Kitsak M, Ghiassian SD, Vidal M, Loscalzo J, et al. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347(6224):1257601.
  • 27. Heng YJ, Pennell CE, McDonald SW, Vinturache AE, Xu J, Lee MW, et al. Maternal Whole Blood Gene Expression at 18 and 28 Weeks of Gestation Associated with Spontaneous Preterm Birth in Asymptomatic Women. PLoS One. 2016;11(6):e0155191.
  • 28. Heng YJ, Pennell CE, McDonald SW, Vinturache AE, Xu J, Lee MW, Briollais L, Lyon AW, Slater DM, Bocking AD, de Koning L, Olson DM, Dolan SM, Tough SC, Lye SJ. Maternal Whole Blood Gene Expression at 18 and 28 weeks of Gestation Associated with Spontaneous Preterm Birth in Asymptomatic Women. NCBI GEO database; 2016.
  • 29. Salari K, O.A. Genome-wide analysis of prostatic tissue gene expression from patients with benign prostatic hyperplasia. NCBI GEO database; 2017.
  • 30. Wang Z, et al. Androgenic to oestrogenic switch in the human adult prostate gland is regulated by epigenetic silencing of steroid 5alpha-reductase 2. J Pathol. 2017;243(4):457–67.
  • 31. Bechis SK, et al. Age and Obesity Promote Methylation and Suppression of 5alpha-Reductase 2: Implications for Personalized Therapy of Benign Prostatic Hyperplasia. J Urol. 2015;194(4):1031–7.
  • 32. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
  • 33. Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media; 2009.
