Structural Analysis of Genomic and Proteomic Signatures Reveal Dynamic Expression of Intrinsically Disordered Regions in Breast Cancer and Tissue

Nicole Zatorski; Yifei Sun; Abdulkadir Elmas; Christian Dallago; Timothy Karl; David Stein; Burkhard Rost; Kuan-Lin Huang; Martin Walsh; Avner Schlessinger

doi:10.1101/2023.02.23.529755

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2023 Feb 24:2023.02.23.529755. [Version 1] doi: 10.1101/2023.02.23.529755

PMCID: PMC9980136 PMID: 36865220

Summary

Structural features of proteins capture underlying information about protein evolution and function, which enhances the analysis of proteomic and transcriptomic data. Here we develop Structural Analysis of Gene and protein Expression Signatures (SAGES), a method that describes expression data using features calculated from sequence-based prediction methods and 3D structural models. We used SAGES, along with machine learning, to characterize tissues from healthy individuals and those with breast cancer. We analyzed gene expression data from 23 breast cancer patients and genetic mutation data from the COSMIC database as well as 17 breast tumor protein expression profiles. We identified prominent expression of intrinsically disordered regions in breast cancer proteins as well as relationships between drug perturbation signatures and breast cancer disease signatures. Our results suggest that SAGES is generally applicable to describe diverse biological phenomena including disease states and drug effects.

Keywords: protein structural features, breast cancer, human precision transcriptomics and proteomics

1. Introduction

With the advent of improved sequencing technology, great emphasis has been placed on using proteomic and transcriptomic data to understand underlying disease etiologies and characteristics¹. This has led to successful identification of biomarkers that have advanced precision medicine² in fields such as oncology³. For example, single cell RNA sequencing of primary breast cancer tumors has explored the heterogeneity of gene expression in tumor tissue and the preponderance of immune cell response to disease⁴. Analysis of gene sets from RNA expression can be based on a comparison of the names of genes found to be differentially expressed in a sample population relative to a control population. An alternative, well-established method for analyzing sequencing data is gene ontology (GO) enrichment of the differentially expressed genes, which provides a standardized vocabulary and set of relationships for genes⁵. This gives a slightly more nuanced view of the signature in that it provides annotations about the function of proteins encoded by genes. In proteomic analysis, protein names or fragments of protein sequences found using mass spectrometry are analyzed for their relative abundances in a sample⁶.

Despite the developments in transcriptomic and proteomic analysis tools, analysis of expression frequencies at the gene name or GO annotation level does not capture functional similarities of genes based on their encoded protein structures. Structural features of proteins can robustly describe biological function. For example, we have shown that using one structural feature, i.e., the structural fold, family, and superfamily compositions of gene sets from transcriptomics data, captured similarities and differences among tissues and drug treated cell lines when combined with machine learning techniques⁷. Others showed that the use of structural information can detect off target effects of drugs when structurally similar proteins were targeted⁸. This holds true for structure-derived representations of proteins, such as secondary structure, solvent accessibility, and intrinsically disordered regions (IDRs)⁹. Accurate classifiers predicting structural features are available⁹⁻¹³, yielding 1D or string representations for input proteins. Moreover, the 3D structure prediction method AlphaFold2¹⁴ allows genome-scale 3D structure prediction, with representations ranging from full 3D coordinates to distance and contact maps.

Here we tested whether representation of a gene or protein set with a variety of structural features can describe phenotypes. We first developed Structural Analysis of Gene and protein Expression Signatures (SAGES), a method of generating structural features from gene and protein sets. We applied SAGES, in combination with random forest models and recursive feature elimination, to GTEx¹⁵, a dataset of RNA expression data taken from human tissue samples. We also analyzed SAGES of breast cancer gene and protein expression data from patient cohorts, as well as breast cancer data from the COSMIC¹⁶ database of mutated and overexpressed genes. Furthermore, we applied SAGES to drug perturbation signatures from the Connectivity Map¹⁷ to investigate the similarity between existing breast cancer drugs and the breast cancer SAGES signature.

2. Results

2.1. Structural Features are Predictive of Normal Tissue Type

To test whether structural features can capture meaningful biological patterns, we evaluated the ability of SAGES to predict normal tissue type from GTEx using only structural features and no gene name information. For each of the 30 tissue types in GTEx, we trained three random forest models using different input features: structural features, gene names, and a combination of structural features with gene names. We measured average performance for all tissues following 10-fold cross validation (Table 1). Notably, average accuracy and AUROC are approximately 97% for all three feature sets (0.967 to 0.973). Because performance of the model trained on gene names was already high, comparable performance with structural features alone, which include no gene-specific information, is notable. For example, in the model trained on normal breast tissue SAGES, the AUROC and F score were 0.949±0.012 and 0.951±0.011, respectively (Figure 1), while the gene name trained model obtained similar performance (0.951±0.012 and 0.953±0.011, respectively) (Supplementary Table 2). This demonstrates that, while gene names contain adequate information for recapitulating tissue type, structural features alone also contain sufficient information for recapitulating the biological identity of tissues. The breast tissue prediction model trained on both types of features performed similarly to the structural features trained model, with AUROC and F score of 0.953±0.014 and 0.955±0.014 (Figure 1). This is unsurprising considering that it uses the same structural features in addition to the gene names. The performance of normal tissue type predictors using structural features, compared to gene names alone, demonstrates that the orthogonal information based on underlying protein characteristics is comparable to the information used in gene expression analysis.

Table 1. Performance of tissue type predictors with different input features.

Averaged performance of random forest tissue type predictors on all thirty GTEx tissues following 10-fold cross validation. The Structural Features row marks the performance obtained when the model was trained using only structural features. The Gene Names row denotes the performance of the model trained on one-hot encoded gene name symbols. The Structural Features and Gene Names row represents the performance of the model trained on both structural features and gene names.

Features Trained on	Tissue Type	Accuracy	AUROC	F score	Precision	Recall
Structural Features and Gene Names	Average of all tissues	0.972±0.061	0.972±0.061	0.974±0.054	0.971±0.071	0.98±0.049
Gene Names	Average of all tissues	0.973±0.057	0.973±0.057	0.975±0.054	0.972±0.068	0.979±0.051
Structural Features	Average of all tissues	0.967±0.063	0.967±0.063	0.97±0.054	0.965±0.077	0.979±0.049

Figure 1. Graphs depicting performance of the random forest predictor of normal breast tissue trained on structural features, gene names, and a combination of all features based on GTEx samples. The line labeled random represents performance of a model that has no skill and instead randomly selects a classification. (A) The Receiver Operating Characteristic Curve (ROC) for the breast tissue prediction model trained on different features, which all perform comparably. The AUROC for this model is 94.9% for training on structural features, 95.1% for training on gene names, and 95.3% for training on all features. (B) The precision recall curve for the breast tissue prediction model trained on different features. The precision and recall for each of the feature sets used to train the model are 92.8% and 97.5% for structural features, 92.6% and 98.0% for gene names, and 93.0% and 98.0% for all features combined.
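As a quick consistency check on the reported metrics, the F score can be recomputed from the precision and recall quoted in the Figure 1 caption. The sketch below assumes that the F score is the standard F1 (the harmonic mean of precision and recall), which the text does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); assumed definition."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the breast tissue model, from the Figure 1 caption.
reported = {
    "structural features": (0.928, 0.975),
    "gene names": (0.926, 0.980),
    "all features": (0.930, 0.980),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f_score(p, r):.3f}")
```

Under this assumption the recomputed values (about 0.951, 0.952, and 0.954) agree with the reported F scores of 0.951, 0.953, and 0.955 to within the rounding of the quoted precision and recall.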

2.2. Structural Features Reveal Characteristics of Normal Breast Tissue

To interrogate the contribution of different features to the model, we used recursive feature elimination (RFE)¹⁸ to analyze the normal breast tissue predictor trained on a combination of structural features and genes from GTEx samples (Table 2). Using all features as input to RFE allows us to explore the relative importance of genes and structural features when used in tandem. We observe that prominent genes are differentially expressed in normal breast tissue, including APOD, a component of HDL; FABP4, a fatty acid binding protein; SAA1, serum amyloid; and TIMP3, a metallopeptidase inhibitor. The identification of genes such as APOD and FABP4, which are known to be associated with breast tissue, increases our confidence in the result¹⁹. Other genes such as TIMP3 and SAA1 are known more for their breast cancer prognostic value²⁰,²¹ and further support the robustness of the feature selection process.

Table 2.

Ranking and description of features found most informative in the breast tissue predictor

The twenty-five features that contributed most to the random forest model predicting normal breast tissue from structural features and gene names of GTEx samples.

Rank^a	Feature Name^b	Feature Description^c
1.3	FABP4	Fatty acid binding protein
1.7	APOD	Component of HDL
3.3	SAA1	Serum amyloid
6.6	Number of glutamines in protein	Structural feature
8.9	AZGP1	Zinc binding glycoprotein
9.8	Number of threonines in conserved regions	Structural feature
9.8	CD59	CD59 blood group
13.9	Number of phenylalanines in conserved regions	Structural feature
16	RPL23	Ribosomal protein
18.1	XBP1	X box binding protein
18.4	Number of serines in protein	Structural feature
18.7	MGP	Matrix Gla Protein
19.7	VIM	Vimentin
22.2	TIMP3	TIMP metallopeptidase inhibitor 3
22.8	TXNIP	Thioredoxin interacting protein
23.3	C7	Complement C7
25.3	IGFBP4	Insulin like growth factor binding protein 4
26.5	SERPING1	Serpin family G member 1
29.1	Number of glutamic acids in protein	Structural feature
32.1	PNPLA2	Patatin Like Phospholipase Domain Containing 2
32.2	C10orf10	DEPP1 Autophagy Regulator
32.5	Number of glutamic acids in sheet region	Structural feature
34.6	GSN	Gelsolin

a. Rank: the lowest rank value corresponds to the feature that contributes most to the model, as assigned using recursive feature elimination with 10 random seeds.

b. Feature Name: the gene name symbol or structural feature name that corresponds to the rank in the same row.

c. Feature Description: additional information about the function or name of the feature in that row. Gene feature descriptors were summarized from GeneCards, the Human Gene Database, version 5.10 [1].
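The fractional ranks in Table 2 are consistent with averaging over repeated runs. A minimal sketch of such an averaging step is shown here, assuming each run yields an integer rank per feature (1 = most informative) and that the reported rank is the arithmetic mean over seeds. The feature names and per-seed ranks are hypothetical illustrative values, not the study's data, and `average_ranks` is not a function from the SAGES code.

```python
from statistics import mean


def average_ranks(rankings_per_seed):
    """Average each feature's rank over several runs and sort ascending.

    rankings_per_seed: list of dicts mapping feature name -> integer rank
    (1 = most informative), one dict per random seed.
    """
    features = rankings_per_seed[0].keys()
    averaged = {f: mean(run[f] for run in rankings_per_seed) for f in features}
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical ranks from three seeds (the text describes 10 seeds).
runs = [
    {"gene_A": 1, "gene_B": 2, "structural_feature_X": 3},
    {"gene_A": 2, "gene_B": 1, "structural_feature_X": 3},
    {"gene_A": 1, "gene_B": 2, "structural_feature_X": 4},
]
ranked = average_ranks(runs)
for feature, rank in ranked:
    print(f"{rank:.1f}\t{feature}")
```

Averaging in this way rewards features that are ranked highly across all seeds rather than in a single run.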

Notably, features related to intrinsically disordered regions (IDRs) and conserved regions contributed greatly to the breast tissue predictor, even amongst all features, including gene names as well as other structural features (Table 2). These include the composition of glutamine, serine, and glutamate in the protein list; threonine and phenylalanine in conserved regions of proteins; glutamate residues in beta sheet regions; and histidine residues in the IDRs. In particular, this reveals that certain amino acids are differentially represented in proteins that define breast tissue. For example, glutamine can potentially contribute to protein structural instability when found in IDRs²², and histidine is known to be disorder promoting²³. Taken together, the overall importance of disorder related features is consistent with the enrichment of IDRs in signaling proteins, particularly those implicated in breast cancer, such as HER2²⁴, SIPA1²⁵, and BRCA1²⁶.

2.3. Breast Cancer Structural Features Differ from those of Normal Breast Tissue

2.3.1. Transcriptomic data reveals an overexpression of proteins with IDRs and intrinsically disordered binding regions: experimentally derived samples

Human breast tumor samples and normal breast tissue controls were collected from 23 individuals during their surgical treatment procedures and sequenced. We compared SAGES of these biopsied breast tumors to those of normal tissue samples from the same individual to identify changes in the type of protein features overexpressed in the diseased tissue. The SAGES analysis of all 23 patients, as well as of the following subsets: HER2 negative patients, PR negative patients, ER negative patients, TCHP treated patients, and AC-T treated patients, all demonstrated protein features that differ statistically significantly from background. Interestingly, the most highly represented features in overexpressed genes in breast cancer patients are long IDRs and IDRs that interact with other proteins (intrinsically disordered binding regions) (Figure 2A). This trend is observed both in the agglomeration of all 23 samples and specifically in HER2 negative patients (Figure 2B). The important role of IDRs in tumorigenesis of a wide variety of cancers has previously been highlighted²⁷. Many of the corresponding proteins with IDRs associated with cancer have been implicated in cell signaling pathways²⁸. An examination of domains, folds, superfamilies, and families that are expressed in the breast cancer samples but not their corresponding normal breast tissues reveals cell signaling related substructures such as the PDZ domain²⁹ and the tyrosine-protein kinase catalytic domain³⁰. Additional features of note include zinc finger domains and immunoglobulin substructures, which are both implicated in breast cancer progression³¹,³². The immunoglobulin V-set domain is highly overexpressed in the set of breast cancer patients with PR negative disease and does not appear in HER2 or ER negative patients (Figure 2B).
Interestingly, patients treated with TCHP had some features that were overexpressed compared to background (Supplementary Figure 2A), while those treated with AC-T only had features that were underexpressed at a low magnitude compared to background (Supplementary Figure 2B). These trends provide motivation for further exploration of SAGES features in drug treatment.

Figure 2. Average log ratios of SAGES features that are statistically significantly different in breast cancer samples as compared to normal breast tissue. The log ratio of sample divided by background for each feature was averaged if the calculated p value for that feature was less than the Bonferroni adjusted significance level. Error bars represent the standard deviation of the values used to calculate the average. If only one instance of a feature was found to be statistically significant, the error bars were set to zero. Features that are not families, folds, superfamilies, domains, or frequency counts of a secondary structure consist of the number of amino acids within the protein that make up the secondary structure. (A) The average log ratio of SAGES features for gene expression from 23 breast cancer tumors compared to normal breast tissue from the same individuals. (B) The average log ratio of SAGES features for gene expression from 23 breast cancer patients separated by negative receptor status.

Most SAGES analyses of breast cancer patients showed that features previously generalized to cancer also appear in breast cancer. In contrast, SAGES of triple negative disease (HER2, PR, and ER negative), SAGES of Anastrozole treated disease, and GO analysis of all breast cancer samples did not reveal any statistically significant differences (in SAGES features or gene ontology classifications, respectively) between the cancerous and noncancerous breast tissue when also using a Bonferroni adjusted significance level.

2.3.2. Transcriptomic data reveals an overexpression of proteins with IDRs and intrinsically disordered binding regions: COSMIC database

SAGES captured breast cancer disease trends across datasets from multiple sources, as seen in the analysis of breast cancer gene expression data from COSMIC¹⁶, an established cancer mutation database. Comparisons of mutated genes and of unmutated but overexpressed breast cancer genes from COSMIC against normal breast tissue gene expression from GTEx demonstrated clear overexpression of IDRs and intrinsically disordered binding regions (Figure 3), in agreement with the analysis of the patient-derived breast cancer gene expression samples (Figure 2).

Figure 3. The average log ratio of SAGES features for gene expression of mutated and unmutated but overexpressed genes from COSMIC breast cancer tumors compared to normal breast tissue from GTEx. Features are displayed by size of average log ratio level for mutated genes, where positive values correspond to features with increased length or frequency in breast cancer compared to normal breast tissue. Because the COSMIC data were based on a set of samples combined before analysis, there are no error bars.

Further, SAGES of the COSMIC data also provides insight into the difference between breast cancer genes that are mutated and those that are overexpressed but have unchanged amino acid sequences. In particular, the most common mutations include missense substitutions, nonsense substitutions, frameshift insertions, and frameshift deletions¹⁶. Both sets (i.e., mutated genes and overexpressed genes) overexpress proteins with long IDRs and intrinsically disordered binding regions, and the mutated genes are also enriched in Immunoglobulin Subtype domains compared to normal breast tissue. Overall, the mutated genes differ more from normal background breast tissue than the unmutated genes do, in terms of both the number of statistically significantly different features and the log ratio of expression. Interestingly, both mutated and overexpressed gene sets are enriched in transmembrane helices.

Addition of GO enrichment analysis further differentiates the roles associated with the mutated and the unmutated but overexpressed breast cancer genes. Mutated breast cancer genes are associated with adhesion (such as 0007156, 0098742, 0031344), signaling (such as 0099177, 0050804, 0007267), cellular differentiation (such as 0000902, 0007420, 0048858), metabolism, including that of lipids (such as 0019222, 0044237, 0044238), signaling related to immune response (such as 0002757, 0002429, 0002768), and regulation of cell death (0010941). Localization of these proteins primarily coincides with the plasma membrane (e.g. 0098590) and the extracellular region (e.g. 0005615). Notably, the mutated genes were implicated in processes related to adhesion. This is unsurprising given the loss of cellular adhesion associated with most cancers³³. Overexpressed breast cancer genes that have no amino acid sequence changes are associated with metabolism (such as 0008152, 0044238, 0006629), development (such as 0030154), signaling (such as 0023052, 0007165), regulation of cell death (0010942), and immune response (such as 0045087). Some of these proteins are associated with the extracellular space (0005615); however, many others are associated with the endoplasmic reticulum and ribosomes (e.g. 0005788, 0005840). The localization to organelles related to protein synthesis is supported by the known increased metabolic activity of cancer cells³⁴. Importantly, both of these RNA expression derived samples have processes related to signaling.

2.3.3. Proteomic data reveals an overrepresentation of transmembrane proteins

Proteomic data, which can reveal differences in protein abundance relative to transcriptomic data³⁵, provide a unique view of expressed breast cancer protein features. This supplements the understanding of breast cancer proteins gained through gene expression analysis. Proteomics data from 17 breast cancer tumors, with normal tissues from the same patients, were analyzed using SAGES³⁶. The results showed the presence of immunoglobulin domains and of proteins whose transmembrane helix regions contain a larger number of amino acids on average (Figure 4). With the exception of the immunoglobulin like domain, the proteomics data revealed feature patterns that differ from those calculated using gene expression signatures from the 23 breast cancer patients sequenced for this study and from the COSMIC gene expression data, such as the presence of the kinase regulatory SH3 domain³⁷. Interestingly, both domain types are often found in receptor tyrosine kinases (RTKs), which play a key role in signaling³⁸. Notably, features from the proteomics breast cancer data set had shorter IDRs and intrinsically disordered binding regions. Because partially intrinsically disordered proteins such as RTKs are often involved in cell signaling, they are tightly regulated and have short half-lives³⁹. The difference in structural features found in the analysis of gene and protein expression of breast cancer can therefore be attributed to the transient nature of signaling proteins, which contain IDRs.

Figure 4. The average log ratio of SAGES features for proteomics of 17 breast cancer tumors extracted from surgical patients compared to normal breast tissue from the same patients. Negative values correspond to features with decreased length or frequency in breast cancer compared to normal breast tissue.

This absence of transient signaling proteins in the proteomic sample is reflected in the GO analysis. At the biological process level, the proteins in breast cancer had functions related to metabolic processes (e.g. 0044281) and transcription (e.g. 0006351, 0097659). One of the areas where these proteins localize in the cell is the nucleus, specifically chromatin (0000785) and chromosomes (0005694). This aligns with the biological processes related to transcription. The proteins also localize to the extracellular space (0005615), which could be related to the increased presence of proteins with transmembrane helices, and the endoplasmic reticulum lumen (0031093). Notably absent from the overrepresented proteins found in the proteomic analysis of breast cancer are GO terms related to cellular signaling or cellular adhesion. Taken together, our results suggest that combining enrichment analyses of both GO terms and structural features generated with SAGES supports the correlation between the lack of signaling molecules and of IDRs in the proteomic samples.

2.4. SAGES Captures Similarity Between Perturbation Signatures of Existing Breast Cancer Therapies and Disease

It has been proposed that gene signatures can be applied to drug repurposing⁴⁰. In particular, the signature reversion principle (SRP) is based on the assumption that a drug perturbation gene expression signature is negatively correlated with a disease gene expression signature⁴⁰. To investigate an extension of the SRP, we applied SAGES to gene and protein expression changes induced by breast cancer drugs and by breast cancer. The number of statistically significantly different SAGES features between each drug perturbation and both the genomic and proteomic breast cancer analyses was calculated and averaged for the two drug types (breast cancer drugs and other drugs). Comparisons between drug perturbation with proteomic and genomic breast cancer backgrounds show that breast cancer drugs produce a perturbation in cell lines with SAGES features that are more similar to those obtained from breast cancer than those obtained from other drugs (Figure 5A). With the proteomic background, the breast cancer drugs had a lower average number of SAGES features with statistically significant differences (314) than other drugs (321) (p=0.00070). With the genomic background, the breast cancer drugs had a lower average number of features with statistically significant differences (323) than other drugs (327) (p=0.050).

Figure 5. Difference between Connectivity Map breast cancer and other drug perturbation signatures compared to signatures derived from breast cancer COSMIC gene expression and experimental proteomics expression. (A) Counts of statistically significantly different features between SAGES of breast cancer and drug perturbation. (B) Jaccard coefficient representing similarity of drug perturbation signatures to gene and protein breast cancer signature backgrounds.

This increased similarity between drug perturbation and breast cancer is captured in a gene-protein level comparison between the perturbations and backgrounds but not in a gene-gene level comparison (Figure 5B). When the Jaccard coefficients of the breast cancer (0.0060) and other (0.0055) drug genes are compared to the proteomic breast cancer background, there is more similarity to the background in the breast cancer drug genes (p=0.00102). However, this trend of SAGES features and proteins of breast cancer drug perturbations being more similar to the targeted disease is not recapitulated in the comparison of gene expression of drug perturbations and breast cancer for both drug groups. Breast cancer drugs have an average Jaccard coefficient of 0.00575 compared to other drugs with a Jaccard coefficient of 0.00572 (p=0.77). Because drug repurposing using perturbation signatures is classically done at the gene level⁴¹, the emergence of this association at the protein and structural level demonstrates the power of the orthogonal information provided by SAGES and raises a potential application of SAGES in signature based repurposing.
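For reference, the Jaccard coefficient used in these comparisons is the size of the intersection of two sets divided by the size of their union. The small sketch below uses hypothetical gene symbols standing in for a drug perturbation signature and a disease background; neither set is taken from the study.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical signatures; not taken from the study.
drug_signature = {"GENE1", "GENE2", "GENE3", "GENE4"}
disease_background = {"GENE3", "GENE4", "GENE5", "GENE6", "GENE7", "GENE8"}

# Two shared symbols out of eight distinct symbols in total.
similarity = jaccard(drug_signature, disease_background)
print(similarity)
```

Coefficients near 0.006, as reported above, mean that well under 1% of the combined set is shared, so the comparison between drug groups is a comparison of small overlaps.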

3. Discussion

Genomic and proteomic data contain valuable information about biological states. Currently, gene expression is assessed on a gene name similarity basis or through enrichment of GO annotations in gene lists; however, these existing methods do not capture underlying structure based functional information.

Application of features derived from different levels of protein structure can greatly enhance transcriptomic analysis. SAGES is distinct from existing methods in that it provides both sequence and 3D structure based orthogonal data which supplement transcriptomic biological insight (Figure 6). These detailed, structural level, descriptions of proteins encoded by the genes in the input protein data set provide orthogonal information to gene names and GO functional descriptions.

Figure 6. (A) SAGES takes as input a set of genes or proteins. (B) For each protein, SAGES determines structural features of protein regions from protein sequences and 3D models. (C) SAGES normalizes features by the magnitude of their corresponding protein’s expression level. (D) Sample structural features are then compared to a background using p values from a Fisher’s exact test or Welch’s t test. Bonferroni and false discovery rate corrected significance levels are provided based on the found feature size. (E) The output is a vectorized depiction of the structural features of a transcriptomics or proteomics sample that can be used as the input for biological analysis of samples and machine learning applications.

The analysis of normal tissue transcriptomic data from GTEx, through the development of random forest tissue type predictors, showed that structural features generalize to all tissue types (Table 1). Use of structural features alone captured comparable information about biological state to the traditionally used gene names. Feature selection applied to the normal breast tissue samples further revealed that, interestingly, amino acid composition of IDRs plays an informative role in breast tissue prediction (Table 2).

We applied SAGES to multiple breast cancer datasets to demonstrate how this method can be used to inform our understanding of breast tissue and tumor biology. SAGES, applied to samples excised from 23 women during their breast cancer treatment surgeries, revealed structural and functional differences between tumor and normal tissue. Notably, proteins with IDRs and intrinsically disordered binding regions are overexpressed in diseased breast cancer tissue (Figure 2). This was supported by SAGES analysis of breast cancer genes that are mutated and overexpressed in breast cancer according to COSMIC (Figure 3). GO analysis of these samples reveals that the proteins that are overexpressed in breast cancer and contain these IDR features are associated with cell signaling, metabolism, and immune response. Uniquely, mutated breast cancer proteins also were associated with cellular adhesion. Proteomic analysis of human breast cancer samples revealed a different feature landscape which was correlated with a loss of representation of proteins related to cell signaling (Figure 4). This is thought to be due to the transient nature of these cell signaling proteins.

SAGES was used to interrogate drug perturbation signatures from the Connectivity Map, and the resulting analysis showed that breast cancer drug perturbation signatures are more similar to breast cancer expression signatures at the feature and protein level than those of other drugs (Figure 5). This, along with SAGES analysis of notable features in breast cancer patients before receiving AC-T or TCHP treatment (Supplementary Figure 2), has interesting implications for the signature reversion principle (SRP). It further demonstrates the capacity of structural features to capture both established and novel underlying biological function. The scope of SAGES is wide, and we anticipate that it has the potential for a vast number of future applications.

4. Experimental Procedures

4.1. Structural Feature Generation

Transcriptomic signatures can be represented by the structural and functional features of the proteins transcribed in the gene set. These structural features can be derived from protein sequence or from three-dimensional, resolved structure. SAGES calculates sequence based features including length, frequency and amino acid composition from sequence predictions such as: IDRs predicted with IUPRED2a⁴²; protein binding regions in intrinsically disordered peptides predicted with ANCHOR⁴³; transmembrane helical regions predicted with TMHMM 2.0⁴⁴; and globular, helical, positive amino acid containing regions, negative amino acid containing regions, coiled-coil, sheet, loop, conserved, and non-conserved regions predicted with PredictProtein⁴⁵. SAGES also assigns SCOPe⁴⁶ and UniProt⁴⁷ folds, families, superfamilies, and domains using HHpred⁴⁸, with optimized parameters determined in previous work⁷. SAGES applies the same amino acid frequency calculation to features derived from structural models generated by AlphaFold2¹⁴ (default parameters): 3₁₀ helix, alpha helix, beta bridge, beta bulge, turns, and high curvature regions from the Dictionary of Protein Secondary Structure (DSSP)⁴⁹; distances from the center of mass from biopython’s pdb parser⁵⁰; and aggregation propensity from Aggrescan3d⁵¹. SAGES determines the number of amino acids (length), the number of separate instances of a feature type (frequency), and the number of each type of amino acid (composition) for secondary features. For all secondary features SAGES uses a cutoff of 50% predicted probability according to the various feature prediction tools listed above when determining amino acid content of a region. Unlike the other features, aggregation propensity is denoted with a range of values that can be negative or positive rather than a percentage of predicted probability so all amino acids with positive values are included in the analysis. 
Additionally, SAGES tabulates the total number of contacts as well as the minimum, maximum, and average amino acid distance from the protein’s center of mass. All features are listed in Supplementary Table 1.
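
The length, frequency, and composition calculation described above can be sketched as follows. This is a minimal illustration and not the SAGES implementation: the function name `summarize_feature`, the toy sequence, and the per-residue probabilities are hypothetical, and treating a probability equal to the cutoff as inside the feature is an assumption.

```python
from collections import Counter


def summarize_feature(sequence, probabilities, cutoff=0.5):
    """Return (length, frequency, composition) for one predicted feature type.

    sequence: amino acid string; probabilities: per-residue predicted
    probability of the feature (hypothetical predictor output).
    """
    if len(sequence) != len(probabilities):
        raise ValueError("sequence and probabilities must have equal length")
    # Assumption: residues at or above the cutoff belong to the feature.
    in_feature = [p >= cutoff for p in probabilities]
    # Length: number of residues assigned to the feature.
    length = sum(in_feature)
    # Frequency: number of separate contiguous regions of the feature.
    frequency = sum(
        1
        for i, flag in enumerate(in_feature)
        if flag and (i == 0 or not in_feature[i - 1])
    )
    # Composition: count of each amino acid type inside the feature.
    composition = Counter(aa for aa, flag in zip(sequence, in_feature) if flag)
    return length, frequency, dict(composition)


length, frequency, composition = summarize_feature(
    "MKSSPEELLK", [0.9, 0.8, 0.2, 0.1, 0.6, 0.7, 0.7, 0.3, 0.2, 0.95]
)
```

In this toy example, six residues fall in three separate regions, so the length is 6 and the frequency is 3.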

SAGES normalizes the features using the expression level of the input genes or proteins which correspond to each feature. This weighted average normalization ensures feature frequency adequately reflects prominence within the sample. SAGES further normalizes the output feature values using the number of genes or proteins input in the sample to ensure consistency between different sized samples. Family, fold, superfamily, and domain were not normalized due to the underabundance of these features within each input set.
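
One plausible reading of this two-step normalization is sketched below. The exact formula is not given in the text, so the function `weighted_feature_value`, the gene labels, and the choice to divide the expression-weighted mean by the number of input genes are all assumptions made for illustration.

```python
def weighted_feature_value(feature_by_gene, expression_by_gene):
    """Expression-weighted mean of a feature, scaled by the number of genes.

    Both arguments are dicts keyed by gene name (hypothetical inputs).
    """
    genes = list(expression_by_gene)
    total_expression = sum(expression_by_gene[g] for g in genes)
    if total_expression == 0:
        raise ValueError("total expression must be non-zero")
    # Step 1 (assumed): weight each gene's feature value by its expression.
    weighted_mean = (
        sum(feature_by_gene.get(g, 0.0) * expression_by_gene[g] for g in genes)
        / total_expression
    )
    # Step 2 (assumed): scale by the number of input genes so that samples
    # of different sizes remain comparable.
    return weighted_mean / len(genes)


value = weighted_feature_value(
    {"geneA": 10.0, "geneB": 30.0},  # e.g. IDR length per gene (toy values)
    {"geneA": 3.0, "geneB": 1.0},    # expression weights (toy values)
)
```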

4.2. Statistical significance of features

The statistical tests included in the method compare the current sample to a background sample. The preloaded background consists of the entire human proteome from all GTEx V8 samples¹⁵. This background was replaced with a control background matched to each experimental sample in the breast cancer feature analysis. The frequencies of categorical variables are compared to the background using Fisher's exact test from the scipy⁵² statistical package to determine a p value. A two-tailed, type two t test was used to determine the t value for the average lengths of secondary structures compared to the background (Eq. (1)). Because the variances of the sample and background were not assumed to be equal, Welch's t-test was used⁵³.

\text{t value} = \frac{\text{background mean} - \text{sample mean}}{\sqrt{\frac{(\text{standard deviation background})^{2}}{\text{background size}} + \frac{(\text{standard deviation sample})^{2}}{\text{sample size}}}}

(1)

The scipy statistical package’s t.sf function with degrees of freedom equal to sample size minus two was used to determine the p value from the t value.
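
As a worked illustration of the t value, the sketch below uses the standard Welch form, in which the two variance terms in the denominator are summed. The function name `welch_t_value` and the toy length values are hypothetical. The p value step is omitted so that the block runs with the Python standard library alone.

```python
import math
import statistics


def welch_t_value(background, sample):
    """Welch's t statistic, computed as (background mean - sample mean) / SE."""
    mean_b = statistics.mean(background)
    mean_s = statistics.mean(sample)
    # statistics.variance returns the sample variance (n - 1 denominator),
    # i.e. the squared sample standard deviation.
    var_b = statistics.variance(background)
    var_s = statistics.variance(sample)
    standard_error = math.sqrt(var_b / len(background) + var_s / len(sample))
    return (mean_b - mean_s) / standard_error


# Toy average-length values for one feature (hypothetical data).
background_lengths = [10, 12, 14, 16, 18]
sample_lengths = [20, 22, 24]
t_value = welch_t_value(background_lengths, sample_lengths)
```

Here the means are 14 and 22 and the variances are 10 and 4, so the denominator is the square root of 10/5 + 4/3 and the t value is about -4.38.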

The p value cutoff was set at 0.05 and Bonferroni corrected⁵⁴ according to the number of features found in that set of feature types. For families, folds, superfamilies, and domains, this means the number of each of those types of features found. For this correction, the number of amino acids and secondary structure elements were grouped together and always summed to 219, leading to an adjusted p value cutoff of 0.000228. The p value cutoff for the averages was corrected by 13, resulting in an adjusted value of 0.00385. The false discovery rate correction⁵⁵ from the statsmodels stats.multitest package was also calculated to provide an additional metric for determining significance for large samples. Base 2 log ratios of feature frequency in samples over background were also provided for categorical variables (Figure 6). The code for structural features can be found on the following GitHub: https://github.com/schlessinger-lab/structural_features
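
The adjusted cutoffs quoted above follow directly from dividing 0.05 by the number of tests, as the short sketch below shows. The helper names `bonferroni_cutoff` and `log2_ratio` and the toy frequencies are hypothetical and are not taken from the SAGES code.

```python
import math


def bonferroni_cutoff(alpha, n_tests):
    """Bonferroni-adjusted significance level for n_tests comparisons."""
    return alpha / n_tests


def log2_ratio(sample_frequency, background_frequency):
    """Base 2 log ratio of a feature frequency in the sample over background."""
    return math.log2(sample_frequency / background_frequency)


composition_cutoff = bonferroni_cutoff(0.05, 219)  # amino acid and secondary structure counts
averages_cutoff = bonferroni_cutoff(0.05, 13)      # average lengths
example_ratio = log2_ratio(8.0, 2.0)               # toy frequencies: 4-fold enrichment
```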

4.3. Prediction of Normal Tissues

Structural features were used to predict tissue type from RNA expression of normal tissues. Furthermore, the breast tissue model was used to highlight important structural characteristics of proteins that were highly predictive in this tissue type. The 250 most highly expressed genes for each GTEx sample were extracted from GTEx V8¹⁵ along with their level of overexpression, which is defined as a log fold change greater than zero. As seen in previous work, this number of genes is sufficient to capture underlying tissue specificity⁷. Structural features (Section 4.1) for each sample were determined. There were 3,532 structural features and 5,376 gene name features for each sample. Sample features were min-max normalized (Eq. (2)) in preparation for the tissue type prediction, where x_i equals the value of a feature for a sample and x equals the set of values for all samples corresponding to that feature.

normalized value = \frac{x_{i} - min (x)}{max (x) - min (x)}

(2)
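
Eq. (2) can be applied to one feature column as in the following sketch. The function name `min_max_normalize` is hypothetical, and mapping a constant feature to zeros is an assumption, since the text does not say how such features were handled.

```python
def min_max_normalize(values):
    """Scale the values of one feature across all samples to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a feature with no spread carries no signal and maps to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Toy values of a single feature across four samples.
normalized = min_max_normalize([2.0, 4.0, 6.0, 10.0])
```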

For each of the 30 tissue types, the samples were split into 90/10 training and test sets with an equal number of positive and negative labels. The scikit-learn version 0.24.1 random forest classifier was used to predict whether a sample could be classified as that type of tissue. This classifier fits 100 decision trees on data subsets and averages them into a meta predictor to improve performance. The following parameter values were used: criterion to measure split quality was gini impurity, maximum depth was none, minimum sample split was two, minimum samples leaves was one, minimum weight fraction leaf was zero, maximum features was the square root of the number of features, maximum leaf nodes was none, minimum impurity decrease was zero, bootstrap samples were used to build the trees, out of bag samples were not used to estimate the generalization score, number of parallel jobs was none, verbosity was none, warm start was set to false so that a new forest was fit on every call, class weights were none, complexity parameter used for pruning was zero, and maximum samples was none. Following 10-fold cross validation with random seeds 0–9, performance was measured using: the area under the receiver operating characteristic curve (AUROC); accuracy (Eq. (3)), where y_i is the value of the i^th sample and $y_{i}^{'}$ is the corresponding predicted value; precision (Eq. (4)), where TP is true positives and FP is false positives; recall (Eq. (5)), where FN is false negatives; and F-score (Eq. (6)). Standard deviations for all metrics were computed using the stdev function from Python's statistics module.

accuracy (y, y^{'}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} 1 (y_{i}^{'} = y_{i})

(3)

P = \frac{TP}{TP + FP}

(4)

R = \frac{TP}{TP + FN}

(5)

F_{1} = 2 \frac{PR}{P + R}

(6)
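
Eqs. (3) to (6) can be checked on a small set of binary labels with the sketch below. The function name `classification_metrics` and the toy labels are hypothetical, and returning zero when a denominator is empty is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumption: metrics default to 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 0],
)
```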

Additionally, the most predictive features for normal breast tissue were determined using sklearn recursive feature elimination (RFE) with random forest as the estimator¹⁸. Weights were assigned to all features, and the less important features were then eliminated recursively until a single feature remained. This generated a ranking of features based on their contribution to the model. RFE was conducted 10 times with random seeds 0–9, and the rankings were summed. The 25 features with the lowest scores, and therefore the best ranks, are reported in Table 2.
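
The rank aggregation step can be sketched as follows, assuming each RFE run yields a mapping from feature name to rank, with 1 as the best rank. The function name `aggregate_rankings`, the feature names, and the alphabetical tie-breaking are hypothetical. The RFE runs themselves are not shown.

```python
def aggregate_rankings(rankings_per_seed, top_n):
    """Sum per-seed ranks and return the top_n features with the lowest totals."""
    totals = {}
    for ranking in rankings_per_seed:
        for feature, rank in ranking.items():
            totals[feature] = totals.get(feature, 0) + rank
    # Lowest summed rank first; ties broken alphabetically (assumption).
    return sorted(totals, key=lambda feature: (totals[feature], feature))[:top_n]


# Toy rankings from three seeds (the text describes ten seeds and top_n = 25).
runs = [
    {"idr_length": 1, "helix_count": 2, "sheet_count": 3},
    {"idr_length": 2, "helix_count": 1, "sheet_count": 3},
    {"idr_length": 1, "helix_count": 3, "sheet_count": 2},
]
best_features = aggregate_rankings(runs, top_n=2)
```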

4.4. Breast Cancer Datasets: Generation

Breast tumor tissue and normal control breast tissue were excised from 23 breast cancer patients during surgery. Samples were then prepared and sequenced according to a standard protocol⁵⁶. The women enrolled had the following demographics: 13 were post-menopausal, 4 were Estrogen Receptor (ER) negative, 16 were Human Epidermal Growth Factor Receptor (HER2) negative, 5 were Progesterone Receptor (PR) negative, 16 had invasive ductal carcinoma, and 2 had internal mammary chain involvement. Five women received treatment prior to sample collection. One woman was administered Taxotere, Carboplatin, Herceptin, Perjeta (TCHP). Three women were administered Adriamycin, cyclophosphamide, Taxol (AC-T). One woman was administered Anastrozole. Global proteomics data from primary breast tumors and matched normal tissues were generated through mass spectrometry (MS) (PMID: 33212010)³⁶ and pre-processed as previously described (PMID: 34552204). Data pertaining to mutated and overexpressed genes in human breast cancer were also sourced from the Catalogue of Somatic Mutations In Cancer (COSMIC)¹⁶. These data were divided into two sets, one containing only the most commonly mutated genes in breast cancer and the other containing only unmutated, overexpressed genes in breast cancer. For weighting purposes, the number of times a gene was mutated compared to the total number of times that gene was expressed served as the weight for the first set, and the number of times the gene was overexpressed was used as the weight for the second set.

4.5. Breast Cancer Datasets: Analysis

The 250 most highly expressed genes were extracted from each genomic and proteomic sample and SAGES was applied. Each patient breast tumor sample had a matching normal breast tissue control from the same patient, which was used to generate the background for the statistical analysis described as part of the SAGES method. The COSMIC samples were compared to SAGES of the top 250 genes from all GTEx normal breast tissue samples. For each sample, if a feature was statistically significant (p value less than the Bonferroni corrected significance level of 0.0011, determined by dividing 0.05 by the number of non-amino-acid-specific features), the log ratio was included in the averages visualized in Figure 3. The standard deviation was determined for all features that came from more than one sample. Due to the large number of amino acid related features, these were excluded from the visualization. Folds, superfamilies, and domains that were present only in the breast cancer samples and not in the background were excluded from this analysis.

Gene Ontology (GO) analysis was also performed using the top 250 most overexpressed genes for each dataset with the PANTHER classification system and the Bonferroni adjusted significance level⁵⁷. The following sample-background pairs were compared using this approach: breast cancer transcriptomic samples-normal breast transcriptomic samples, breast cancer proteomic samples-normal breast proteomic samples, breast cancer proteomic samples-breast cancer transcriptomic samples, normal breast tissue proteomic samples-normal breast tissue transcriptomic samples, COSMIC breast cancer samples with and without mutations-GTEx normal breast tissue.

4.6. Breast Cancer Drug Perturbation

Perturbation signatures of cell lines treated with drugs from the Connectivity Map¹ were compared to various breast cancer backgrounds to investigate how drug induced gene expression relates to disease signature. Unweighted SAGES for all overexpressed genes for each experimental sample were calculated and compared to unweighted SAGES of all proteins from the proteomics samples with log ratio expression greater than one and to unweighted SAGES of all genes from the COSMIC mutated breast cancer gene dataset. Unweighted SAGES were used to ensure that all overexpressed proteins contributed equally to the analysis. For each sample, the number of features significantly different from background was counted. A significance level of 0.05 was selected. Multiple hypothesis testing correction was not employed because the aim was ultimately to assess similarity to background rather than difference from background, and using this technique would increase the type II error. The samples were divided into breast cancer treatment drugs (doxorubicin, fulvestrant, letrozole, megestrol, methotrexate, paclitaxel, raloxifene, tamoxifen, and vinblastine), according to a list published by the National Cancer Institute, and all other drugs in the Connectivity Map database. The average number of statistically significantly different features for each group was calculated, and a two-sided, type 2, Student's t test was used to determine the p value.

Gene expression of the perturbation samples and the expression signatures used to calculate the SAGES of the two backgrounds were also directly compared. The Jaccard coefficient, which is the intersection of the genes overexpressed in both sets over the union of the genes overexpressed in both sets, was used to determine signature similarity. Samples were divided into breast cancer and other drugs and compared with a two-sided, type 2, Student's t test. Additionally, the average Jaccard coefficient for both groups was determined.
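
The Jaccard coefficient defined above can be computed as in the sketch below. The function name `jaccard_coefficient` and the two gene sets are hypothetical examples, and returning zero for two empty sets is an assumption.

```python
def jaccard_coefficient(genes_a, genes_b):
    """Size of the intersection of two gene sets divided by the size of their union."""
    set_a, set_b = set(genes_a), set(genes_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty signatures are treated as having no similarity.
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy overexpressed gene sets (illustrative only, not taken from the data).
drug_signature = {"ESR1", "ERBB2", "TP53"}
disease_signature = {"TP53", "ERBB2", "MYC", "BRCA1"}
similarity = jaccard_coefficient(drug_signature, disease_signature)
```

The two sets share two genes out of five distinct genes, giving a coefficient of 0.4.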

Supplementary Material

Supplement 1

media-1.docx (116.3 KB, docx)

Acknowledgments

We would like to thank Michael Bernhofer and Nolan Caile for their assistance. This work is supported by the National Institutes of Health grant F30 HL160179 to N.Z.

Footnotes

The authors declare no competing interests.

References

1. Lamb J., Crawford E.D., Peck D., Modell J.W., Blat I.C., Wrobel M.J., Lerner J., Brunet J.P., Subramanian A., Ross K.N., et al. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935. 10.1126/science.1132939.
2. Heidecker B., and Hare J.M. (2007). The use of transcriptomic biomarkers for personalized medicine. Heart Fail Rev 12, 1–11. 10.1007/s10741-007-9004-7.
3. Merry E., Thway K., Jones R.L., and Huang P.H. (2021). Predictive and prognostic transcriptomic biomarkers in soft tissue sarcomas. NPJ Precis Oncol 5, 17. 10.1038/s41698-021-00157-4.
4. Chung W., Eum H.H., Lee H.O., Lee K.M., Lee H.B., Kim K.T., Ryu H.S., Kim S., Lee J.E., Park Y.H., et al. (2017). Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun 8, 15081. 10.1038/ncomms15081.
5. Liu W., Liu J., and Rajapakse J.C. (2018). Gene Ontology Enrichment Improves Performances of Functional Similarity of Genes. Sci Rep 8, 12100. 10.1038/s41598-018-30455-0.
6. Zeidan B.A., Townsend P.A., Garbis S.D., Copson E., and Cutress R.I. (2015). Clinical proteomics and breast cancer. Surgeon 13, 271–278. 10.1016/j.surge.2014.12.003.
7. Rahman R., Zatorski N., Hansen J., Xiong Y., van Hasselt J.G.C., Sobie E.A., Birtwistle M.R., Azeloglu E.U., Iyengar R., and Schlessinger A. (2021). Protein structure-based gene expression signatures. Proc Natl Acad Sci U S A 118. 10.1073/pnas.2014866118.
8. MacDonald M.L., Lamerdin J., Owens S., Keon B.H., Bilter G.K., Shang Z., Huang Z., Yu H., Dias J., Minami T., et al. (2006). Identifying off-target effects and hidden phenotypes of drugs in human cells. Nat Chem Biol 2, 329–337. 10.1038/nchembio790.
9. Lee D., Redfern O., and Orengo C. (2007). Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005. 10.1038/nrm2281.
10. Rost B., Liu J., Nair R., Wrzeszczynski K.O., and Ofran Y. (2003). Automatic prediction of protein function. Cellular and Molecular Life Sciences 60, 2637–2650. 10.1007/s00018-003-3114-8.
11. Gerstein M., and Levitt M. (1997). A structural census of the current population of protein sequences. Proc Natl Acad Sci U S A 94, 11911–11916. 10.1073/pnas.94.22.11911.
12. Radivojac P., Clark W.T., Oron T.R., Schnoes A.M., Wittkop T., Sokolov A., Graim K., Funk C., Verspoor K., Ben-Hur A., et al. (2013). A large-scale evaluation of computational protein function prediction. Nat Methods 10, 221–227. 10.1038/nmeth.2340.
13. Rost B., Radivojac P., and Bromberg Y. (2016). Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 590, 2327–2341. 10.1002/1873-3468.12307.
14. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Zidek A., Potapenko A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. 10.1038/s41586-021-03819-2.
15. GTEx Consortium (2020). The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330. 10.1126/science.aaz1776.
16. Tate J.G., Bamford S., Jubb H.C., Sondka Z., Beare D.M., Bindal N., Boutselakis H., Cole C.G., Creatore C., Dawson E., et al. (2019). COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res 47, D941–D947. 10.1093/nar/gky1015.
17. Lamb J. (2007). The Connectivity Map: a new tool for biomedical research. Nat Rev Cancer 7, 54–60. 10.1038/nrc2044.
18. Darst B.F., Malecki K.C., and Engelman C.D. (2018). Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 19, 65. 10.1186/s12863-018-0633-8.
19. Sjostedt E., Zhong W., Fagerberg L., Karlsson M., Mitsios N., Adori C., Oksvold P., Edfors F., Limiszewska A., Hikmet F., et al. (2020). An atlas of the protein-coding genes in the human, pig, and mouse brain. Science 367. 10.1126/science.aay5947.
20. Danforth D.N. Jr. (2016). Genomic Changes in Normal Breast Tissue in Women at Normal Risk or at High Risk for Breast Cancer. Breast Cancer (Auckl) 10, 109–146. 10.4137/BCBCR.S39384.
21. Ignacio R.M.C., Gibbs C.R., Kim S., Lee E.S., Adunyah S.E., and Son D.S. (2019). Serum amyloid A predisposes inflammatory tumor microenvironment in triple negative breast cancer. Oncotarget 10, 511–526. 10.18632/oncotarget.26566.
22. Tompa P. (2005). The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett 579, 3346–3354. 10.1016/j.febslet.2005.03.072.
23. Zhao J., Zou L., Li Y., Liu X., Zeng C., Xu C., Jiang B., Guo X., and Song X. (2021). HisPhosSite: A comprehensive database of histidine phosphorylated proteins and sites. J Proteomics 243, 104262. 10.1016/j.jprot.2021.104262.
24. Wright P.E., and Dyson H.J. (2015). Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16, 18–29. 10.1038/nrm3920.
25. Ma Y., Weng J., Wang N., Zhang Y., Minato N., and Su L. (2021). A novel nuclear localization region in SIPA1 determines protein nuclear distribution and epirubicin-sensitivity of breast cancer cells. Int J Biol Macromol 180, 718–728. 10.1016/j.ijbiomac.2021.03.101.
26. Mark W.Y., Liao J.C., Lu Y., Ayed A., Laister R., Szymczyna B., Chakrabartty A., and Arrowsmith C.H. (2005). Characterization of segments from the central region of BRCA1: an intrinsically disordered scaffold for multiple protein-protein and protein-DNA interactions? J Mol Biol 345, 275–287. 10.1016/j.jmb.2004.10.045.
27. Meszaros B., Hajdu-Soltesz B., Zeke A., and Dosztanyi Z. (2021). Mutations of Intrinsically Disordered Protein Regions Can Drive Cancer but Lack Therapeutic Strategies. Biomolecules 11. 10.3390/biom11030381.
28. Iakoucheva L.M., Brown C.J., Lawson J.D., Obradovic Z., and Dunker A.K. (2002). Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol 323, 573–584. 10.1016/s0022-2836(02)00969-5.
29. Jelen F., Oleksy A., Smietana K., and Otlewski J. (2003). PDZ domains - common players in the cell signaling. Acta Biochim Pol 50, 985–1017.
30. Li E., and Hristova K. (2006). Role of receptor tyrosine kinase transmembrane domains in cell signaling and human pathologies. Biochemistry 45, 6241–6251. 10.1021/bi060609y.
31. Jen J., and Wang Y.C. (2016). Zinc finger proteins in cancer progression. J Biomed Sci 23, 53. 10.1186/s12929-016-0269-9.
32. Whiteside T.L., and Ferrone S. (2012). For breast cancer prognosis, immunoglobulin kappa chain surfaces to the top. Clin Cancer Res 18, 2417–2419. 10.1158/1078-0432.CCR-12-0566.
33. Cavallaro U., and Christofori G. (2004). Cell adhesion and signalling by cadherins and Ig-CAMs in cancer. Nat Rev Cancer 4, 118–132. 10.1038/nrc1276.
34. DeBerardinis R.J., and Chandel N.S. (2016). Fundamentals of cancer metabolism. Sci Adv 2, e1600200. 10.1126/sciadv.1600200.
35. Pascal L.E., True L.D., Campbell D.S., Deutsch E.W., Risk M., Coleman I.M., Eichner L.J., Nelson P.S., and Liu A.Y. (2008). Correlation of mRNA and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate. BMC Genomics 9, 246. 10.1186/1471-2164-9-246.
36. Elmas A., Tharakan S., Jaladanki S., Galsky M.D., Liu T., and Huang K.L. (2021). Pan-cancer proteogenomic investigations identify post-transcriptional kinase targets. Commun Biol 4, 1112. 10.1038/s42003-021-02636-7.
37. Kurochkina N., and Guha U. (2013). SH3 domains: modules of protein-protein interactions. Biophys Rev 5, 29–39. 10.1007/s12551-012-0081-z.
38. Lemmon M.A., and Schlessinger J. (2010). Cell signaling by receptor tyrosine kinases. Cell 141, 1117–1134. 10.1016/j.cell.2010.06.011.
39. Gsponer J., Futschik M.E., Teichmann S.A., and Babu M.M. (2008). Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science 322, 1365–1368. 10.1126/science.1163581.
40. Pushpakom S., Iorio F., Eyers P.A., Escott K.J., Hopper S., Wells A., Doig A., Guilliams T., Latimer J., McNamee C., et al. (2019). Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov 18, 41–58. 10.1038/nrd.2018.168.
41. Hodos R.A., Kidd B.A., Shameer K., Readhead B.P., and Dudley J.T. (2016). In silico methods for drug repurposing and pharmacology. Wiley Interdiscip Rev Syst Biol Med 8, 186–210. 10.1002/wsbm.1337.
42. Meszaros B., Erdos G., and Dosztanyi Z. (2018). IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46, W329–W337. 10.1093/nar/gky384.
43. Meszaros B., Simon I., and Dosztanyi Z. (2009). Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5, e1000376. 10.1371/journal.pcbi.1000376.
44. Krogh A., Larsson B., von Heijne G., and Sonnhammer E.L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567–580. 10.1006/jmbi.2000.4315.
45. Bernhofer M., Dallago C., Karl T., Satagopam V., Heinzinger M., Littmann M., Olenyi T., Qiu J., Schutze K., Yachdav G., et al. (2021). PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res 49, W535–W540. 10.1093/nar/gkab354.
46. Fox N.K., Brenner S.E., and Chandonia J.M. (2014). SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42, D304–309. 10.1093/nar/gkt1240.
47. UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489. 10.1093/nar/gkaa1100.
48. Gabler F., Nam S.Z., Till S., Mirdita M., Steinegger M., Soding J., Lupas A.N., and Alva V. (2020). Protein Sequence Analysis Using the MPI Bioinformatics Toolkit. Curr Protoc Bioinformatics 72, e108. 10.1002/cpbi.108.
49. Kabsch W., and Sander C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. 10.1002/bip.360221211.
50. Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., and de Hoon M.J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423. 10.1093/bioinformatics/btp163.
51. Kuriata A., Iglesias V., Pujols J., Kurcinski M., Kmiecik S., and Ventura S. (2019). Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility. Nucleic Acids Res 47, W300–W307. 10.1093/nar/gkz321.
52. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17, 261–272. 10.1038/s41592-019-0686-2.
53. Rosner B. (2011). Fundamentals of biostatistics, 7th Edition (Brooks/Cole, Cengage Learning).
54. Armstrong R.A. (2014). When to use the Bonferroni correction. Ophthalmic Physiol Opt 34, 502–508. 10.1111/opo.12131.
55. van Iterson M., Boer J.M., and Menezes R.X. (2010). Filtering, FDR and power. BMC Bioinformatics 11, 450. 10.1186/1471-2105-11-450.
56. Rengasamy M., Zhang F., Vashisht A., Song W.M., Aguilo F., Sun Y., Li S., Zhang W., Zhang B., Wohlschlegel J.A., and Walsh M.J. (2017). The PRMT5/WDR77 complex regulates alternative splicing through ZNF326 in breast cancer. Nucleic Acids Res 45, 11106–11120. 10.1093/nar/gkx727.
57. Mi H., Muruganujan A., Huang X., Ebert D., Mills C., Guo X., and Thomas P.D. (2019). Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0). Nat Protoc 14, 703–721. 10.1038/s41596-019-0128-8.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

media-1.docx^{(116.3KB, docx)}

[R1] 1.Lamb J., Crawford E.D., Peck D., Modell J.W., Blat I.C., Wrobel M.J., Lerner J., Brunet J.P., Subramanian A., Ross K.N., et al. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935. 10.1126/science.1132939. [DOI] [PubMed] [Google Scholar]

[R2] 2.Heidecker B., and Hare J.M. (2007). The use of transcriptomic biomarkers for personalized medicine. Heart Fail Rev 12, 1–11. 10.1007/s10741-007-9004-7. [DOI] [PubMed] [Google Scholar]

[R3] 3.Merry E., Thway K., Jones R.L., and Huang P.H. (2021). Predictive and prognostic transcriptomic biomarkers in soft tissue sarcomas. NPJ Precis Oncol 5, 17. 10.1038/s41698-021-00157-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Chung W., Eum H.H., Lee H.O., Lee K.M., Lee H.B., Kim K.T., Ryu H.S., Kim S., Lee J.E., Park Y.H., et al. (2017). Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun 8, 15081. 10.1038/ncomms15081. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Liu W., Liu J., and Rajapakse J.C. (2018). Gene Ontology Enrichment Improves Performances of Functional Similarity of Genes. Sci Rep 8, 12100. 10.1038/s41598-018-30455-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Zeidan B.A., Townsend P.A., Garbis S.D., Copson E., and Cutress R.I. (2015). Clinical proteomics and breast cancer. Surgeon 13, 271–278. 10.1016/j.surge.2014.12.003. [DOI] [PubMed] [Google Scholar]

[R7] 7.Rahman R., Zatorski N., Hansen J., Xiong Y., van Hasselt J.G.C., Sobie E.A., Birtwistle M.R., Azeloglu E.U., Iyengar R., and Schlessinger A. (2021). Protein structure-based gene expression signatures. Proc Natl Acad Sci U S A 118. 10.1073/pnas.2014866118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.MacDonald M.L., Lamerdin J., Owens S., Keon B.H., Bilter G.K., Shang Z., Huang Z., Yu H., Dias J., Minami T., et al. (2006). Identifying off-target effects and hidden phenotypes of drugs in human cells. Nat Chem Biol 2, 329–337. 10.1038/nchembio790. [DOI] [PubMed] [Google Scholar]

[R9] 9.Lee D., Redfern O., and Orengo C. (2007). Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005. 10.1038/nrm2281. [DOI] [PubMed] [Google Scholar]

[R10] 10.Rost B., Liu J., Nair R., Wrzeszczynski K.O., and Ofran Y. (2003). Automatic prediction of protein function. Cellular and Molecular Life Sciences 60, 2637–2650. 10.1007/s00018-003-3114-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Gerstein M., and Levitt M. (1997). A structural census of the current population of protein sequences. Proc Natl Acad Sci U S A 94, 11911–11916. 10.1073/pnas.94.22.11911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Radivojac P., Clark W.T., Oron T.R., Schnoes A.M., Wittkop T., Sokolov A., Graim K., Funk C., Verspoor K., Ben-Hur A., et al. (2013). A large-scale evaluation of computational protein function prediction. Nat Methods 10, 221–227. 10.1038/nmeth.2340. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Rost B., Radivojac P., and Bromberg Y. (2016). Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 590, 2327–2341. 10.1002/1873-3468.12307. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Zidek A., Potapenko A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Consortium G.T. (2020). The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330. 10.1126/science.aaz1776. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Tate J.G., Bamford S., Jubb H.C., Sondka Z., Beare D.M., Bindal N., Boutselakis H., Cole C.G., Creatore C., Dawson E., et al. (2019). COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res 47, D941–D947. 10.1093/nar/gky1015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Lamb J. (2007). The Connectivity Map: a new tool for biomedical research. Nat Rev Cancer 7, 54–60. 10.1038/nrc2044. [DOI] [PubMed] [Google Scholar]

[R18] 18.Darst B.F., Malecki K.C., and Engelman C.D. (2018). Using recursive feature elimination in random forest to account for correlated variables in high dimensional data. BMC Genet 19, 65. 10.1186/s12863-018-0633-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Sjostedt E., Zhong W., Fagerberg L., Karlsson M., Mitsios N., Adori C., Oksvold P., Edfors F., Limiszewska A., Hikmet F., et al. (2020). An atlas of the protein-coding genes in the human, pig, and mouse brain. Science 367. 10.1126/science.aay5947. [DOI] [PubMed] [Google Scholar]

[R20] 20.Danforth D.N. Jr. (2016). Genomic Changes in Normal Breast Tissue in Women at Normal Risk or at High Risk for Breast Cancer. Breast Cancer (Auckl) 10, 109–146. 10.4137/BCBCR.S39384. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Ignacio R.M.C., Gibbs C.R., Kim S., Lee E.S., Adunyah S.E., and Son D.S. (2019). Serum amyloid A predisposes inflammatory tumor microenvironment in triple negative breast cancer. Oncotarget 10, 511–526. 10.18632/oncotarget.26566. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22. Tompa P. (2005). The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett 579, 3346–3354. 10.1016/j.febslet.2005.03.072.

[R23] 23. Zhao J., Zou L., Li Y., Liu X., Zeng C., Xu C., Jiang B., Guo X., and Song X. (2021). HisPhosSite: A comprehensive database of histidine phosphorylated proteins and sites. J Proteomics 243, 104262. 10.1016/j.jprot.2021.104262.

[R24] 24. Wright P.E., and Dyson H.J. (2015). Intrinsically disordered proteins in cellular signalling and regulation. Nat Rev Mol Cell Biol 16, 18–29. 10.1038/nrm3920.

[R25] 25. Ma Y., Weng J., Wang N., Zhang Y., Minato N., and Su L. (2021). A novel nuclear localization region in SIPA1 determines protein nuclear distribution and epirubicin-sensitivity of breast cancer cells. Int J Biol Macromol 180, 718–728. 10.1016/j.ijbiomac.2021.03.101.

[R26] 26. Mark W.Y., Liao J.C., Lu Y., Ayed A., Laister R., Szymczyna B., Chakrabartty A., and Arrowsmith C.H. (2005). Characterization of segments from the central region of BRCA1: an intrinsically disordered scaffold for multiple protein-protein and protein-DNA interactions? J Mol Biol 345, 275–287. 10.1016/j.jmb.2004.10.045.

[R27] 27. Meszaros B., Hajdu-Soltesz B., Zeke A., and Dosztanyi Z. (2021). Mutations of Intrinsically Disordered Protein Regions Can Drive Cancer but Lack Therapeutic Strategies. Biomolecules 11. 10.3390/biom11030381.

[R28] 28. Iakoucheva L.M., Brown C.J., Lawson J.D., Obradovic Z., and Dunker A.K. (2002). Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol 323, 573–584. 10.1016/s0022-2836(02)00969-5.

[R29] 29. Jelen F., Oleksy A., Smietana K., and Otlewski J. (2003). PDZ domains - common players in the cell signaling. Acta Biochim Pol 50, 985–1017.

[R30] 30. Li E., and Hristova K. (2006). Role of receptor tyrosine kinase transmembrane domains in cell signaling and human pathologies. Biochemistry 45, 6241–6251. 10.1021/bi060609y.

[R31] 31. Jen J., and Wang Y.C. (2016). Zinc finger proteins in cancer progression. J Biomed Sci 23, 53. 10.1186/s12929-016-0269-9.

[R32] 32. Whiteside T.L., and Ferrone S. (2012). For breast cancer prognosis, immunoglobulin kappa chain surfaces to the top. Clin Cancer Res 18, 2417–2419. 10.1158/1078-0432.CCR-12-0566.

[R33] 33. Cavallaro U., and Christofori G. (2004). Cell adhesion and signalling by cadherins and Ig-CAMs in cancer. Nat Rev Cancer 4, 118–132. 10.1038/nrc1276.

[R34] 34. DeBerardinis R.J., and Chandel N.S. (2016). Fundamentals of cancer metabolism. Sci Adv 2, e1600200. 10.1126/sciadv.1600200.

[R35] 35. Pascal L.E., True L.D., Campbell D.S., Deutsch E.W., Risk M., Coleman I.M., Eichner L.J., Nelson P.S., and Liu A.Y. (2008). Correlation of mRNA and protein levels: cell type-specific gene expression of cluster designation antigens in the prostate. BMC Genomics 9, 246. 10.1186/1471-2164-9-246.

[R36] 36. Elmas A., Tharakan S., Jaladanki S., Galsky M.D., Liu T., and Huang K.L. (2021). Pan-cancer proteogenomic investigations identify post-transcriptional kinase targets. Commun Biol 4, 1112. 10.1038/s42003-021-02636-7.

[R37] 37. Kurochkina N., and Guha U. (2013). SH3 domains: modules of protein-protein interactions. Biophys Rev 5, 29–39. 10.1007/s12551-012-0081-z.

[R38] 38. Lemmon M.A., and Schlessinger J. (2010). Cell signaling by receptor tyrosine kinases. Cell 141, 1117–1134. 10.1016/j.cell.2010.06.011.

[R39] 39. Gsponer J., Futschik M.E., Teichmann S.A., and Babu M.M. (2008). Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science 322, 1365–1368. 10.1126/science.1163581.

[R40] 40. Pushpakom S., Iorio F., Eyers P.A., Escott K.J., Hopper S., Wells A., Doig A., Guilliams T., Latimer J., McNamee C., et al. (2019). Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov 18, 41–58. 10.1038/nrd.2018.168.

[R41] 41. Hodos R.A., Kidd B.A., Shameer K., Readhead B.P., and Dudley J.T. (2016). In silico methods for drug repurposing and pharmacology. Wiley Interdiscip Rev Syst Biol Med 8, 186–210. 10.1002/wsbm.1337.

[R42] 42. Meszaros B., Erdos G., and Dosztanyi Z. (2018). IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46, W329–W337. 10.1093/nar/gky384.

[R43] 43. Meszaros B., Simon I., and Dosztanyi Z. (2009). Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5, e1000376. 10.1371/journal.pcbi.1000376.

[R44] 44. Krogh A., Larsson B., von Heijne G., and Sonnhammer E.L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567–580. 10.1006/jmbi.2000.4315.

[R45] 45. Bernhofer M., Dallago C., Karl T., Satagopam V., Heinzinger M., Littmann M., Olenyi T., Qiu J., Schutze K., Yachdav G., et al. (2021). PredictProtein - Predicting Protein Structure and Function for 29 Years. Nucleic Acids Res 49, W535–W540. 10.1093/nar/gkab354.

[R46] 46. Fox N.K., Brenner S.E., and Chandonia J.M. (2014). SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42, D304–D309. 10.1093/nar/gkt1240.

[R47] 47. UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489. 10.1093/nar/gkaa1100.

[R48] 48. Gabler F., Nam S.Z., Till S., Mirdita M., Steinegger M., Soding J., Lupas A.N., and Alva V. (2020). Protein Sequence Analysis Using the MPI Bioinformatics Toolkit. Curr Protoc Bioinformatics 72, e108. 10.1002/cpbi.108.

[R49] 49. Kabsch W., and Sander C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. 10.1002/bip.360221211.

[R50] 50. Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., and de Hoon M.J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423. 10.1093/bioinformatics/btp163.

[R51] 51. Kuriata A., Iglesias V., Pujols J., Kurcinski M., Kmiecik S., and Ventura S. (2019). Aggrescan3D (A3D) 2.0: prediction and engineering of protein solubility. Nucleic Acids Res 47, W300–W307. 10.1093/nar/gkz321.

[R52] 52. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17, 261–272. 10.1038/s41592-019-0686-2.

[R53] 53. Rosner B. (2011). Fundamentals of biostatistics, 7th Edition (Brooks/Cole, Cengage Learning).

[R54] 54. Armstrong R.A. (2014). When to use the Bonferroni correction. Ophthalmic Physiol Opt 34, 502–508. 10.1111/opo.12131.

[R55] 55. van Iterson M., Boer J.M., and Menezes R.X. (2010). Filtering, FDR and power. BMC Bioinformatics 11, 450. 10.1186/1471-2105-11-450.

[R56] 56. Rengasamy M., Zhang F., Vashisht A., Song W.M., Aguilo F., Sun Y., Li S., Zhang W., Zhang B., Wohlschlegel J.A., and Walsh M.J. (2017). The PRMT5/WDR77 complex regulates alternative splicing through ZNF326 in breast cancer. Nucleic Acids Res 45, 11106–11120. 10.1093/nar/gkx727.

[R57] 57. Mi H., Muruganujan A., Huang X., Ebert D., Mills C., Guo X., and Thomas P.D. (2019). Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0). Nat Protoc 14, 703–721. 10.1038/s41596-019-0128-8.
