PLOS One. 2023 Feb 21;18(2):e0277786. doi: 10.1371/journal.pone.0277786

On the gene expression landscape of cancer

Augusto Gonzalez 1,2, Dario A Leon 2,3,*, Yasser Perera 4,5, Rolando Perez 1,6
Editor: Elingarami Sauli 7
PMCID: PMC9942972  PMID: 36802377

Abstract

The Kauffman picture of normal and tumor states as attractors in an abstract state space is used to interpret gene expression data for 15 cancer localizations obtained from The Cancer Genome Atlas. A principal component analysis of these data unveils the following qualitative aspects of tumors: 1) The state of a tissue in gene expression space can be described by a few variables. In particular, there is a single variable describing the progression from a normal tissue to a tumor. 2) Each cancer localization is characterized by a gene expression profile, in which genes have specific weights in the definition of the cancer state. There are no fewer than 2500 differentially expressed genes, which lead to power-like tails in the expression distribution functions. 3) Tumors in different localizations share hundreds or even thousands of differentially expressed genes. There are 6 genes common to the 15 studied tumor localizations. 4) The tumor region is a kind of attractor. Tumors in advanced stages converge to this region independently of patient age or genetic characteristics. 5) There is a landscape of cancer in gene expression space with an approximate border separating normal tissues from tumors.

Introduction

Cancer is currently one of the biggest health problems [1]. Common knowledge relates it to aging [2], although the exact causes triggering it are unknown. There is also evidence that genetic factors play a very important role [3], especially in tumors developing in early childhood or in familial tumors. The role of external factors with carcinogenic potential is also recognized [4, 5], as is the role of lifestyle [6]. In spite of their diversity, cancers are known to follow certain general hallmarks [7, 8].

Enormous coordinated efforts aimed at understanding basic aspects of cancer have led to many important results. A great deal of information about genes, cells and tissues has been compiled and shared in public databases (see, for example, Refs. [9–11]). The analysis of such information has allowed the identification of mutation signatures, immune characteristics, etc. [12–19].

From the theoretical point of view, there has also been great progress. We focus mainly on two particularly attractive developments. The first is the idea that cancer is an atavism, that is, a cooperative state of multicellular organisms that predates modern metazoans [20–24]. It means that the genetic code of cancer is implicit in our DNA and may be reestablished under certain circumstances. The second very stimulating and general idea is Kauffman’s picture of biological states as attractors in an abstract space, related to the solutions of gene regulatory networks [25, 26].

In the present paper, we use the Kauffman picture to interpret gene expression (GE) data for tumors. The data, provided by The Cancer Genome Atlas (TCGA) project [11], are processed by standard Principal Component Analysis (PCA) [27–29]. For a given tissue, normal and tumor samples are grouped [30]. The centers of the normal and tumor clouds of samples are identified as the corresponding Kauffman attractors. We show that the Kauffman picture may be further elaborated by stressing qualitative and semi-quantitative aspects of cancer which follow directly from the analysis of GE data. Five major conclusions of this study are presented in the body of the paper. As examples, let us briefly mention two of them.

First, let us stress that the coordinate along the first PC axis, PC1, which is the expression of a metagene [31], may be taken as an indicator of progression to cancer. Indeed, below we show data for kidney clear cell carcinoma (KIRC) in which it is apparent that only tumors in initial stages populate the intermediate region between the normal and cancer attractors. That is, tumors in initial stages show coordinates in the intermediate region, whereas tumors in advanced stages have coordinate values close to the center of the cancer attractor. Additionally, in a separate analysis of prostate adenocarcinoma (PRAD) [32], we show that the coordinate along PC1 correlates with clinical data on the tumor cellularity of samples. In other words, the fraction of tumor cells in samples increases as we move along PC1 towards the tumor attractor.

As a second example, let us comment that for any pair of tumors we are able to compute their distance in GE space and the number of common differentially expressed genes. We show that the distance exhibits an inverse correlation with the number of common genes, i.e. the shorter the distance the larger the number of common genes. Thus, the physical distance in GE space has an actual meaning of closeness between tumors.

Besides the aforementioned 5 major conclusions, additional results are presented as Supporting Information. Other specific lines of work have been developed in separate publications, such as the analysis of PRAD mentioned above [32], the concept of smooth and abrupt transitions and their relation to GE rearrangements [33], the computation of the volumes of the attractor basins and their relation to the transition rates [34], the elaboration of a 1D model for carcinogenesis based on the coordinate along PC1 [35], etc.

Results

In this section we describe our procedures and present five main qualitative and semi-quantitative results in the form of statements. We take tissue expression data for 15 cancer localizations from the TCGA project [11], which correspond to localizations with sufficient normal and tumor samples for the PCA processing. The studied cases are shown in Table 1.

Table 1. The number of normal, Nn, and tumor, Nt, samples in the 15 cancer types under study.

The TCGA notation is included.

TCGA notation | Cancer type | Nn | Nt
BLCA | Bladder urothelial carcinoma | 19 | 414
BRCA | Breast invasive carcinoma | 112 | 1096
COAD | Colon adenocarcinoma | 41 | 473
ESCA | Esophageal carcinoma | 11 | 160
HNSC | Head and neck squamous cell carcinoma | 44 | 502
KIRC | Kidney clear cell carcinoma | 72 | 539
KIRP | Kidney papillary cell carcinoma | 32 | 289
LIHC | Liver hepatocellular carcinoma | 50 | 374
LUAD | Lung adenocarcinoma | 59 | 535
LUSC | Lung squamous cell carcinoma | 49 | 502
PRAD | Prostate adenocarcinoma | 52 | 499
READ | Rectum adenocarcinoma | 10 | 167
STAD | Stomach adenocarcinoma | 32 | 375
THCA | Thyroid carcinoma | 58 | 510
UCEC | Uterine corpus endometrial carcinoma | 23 | 552

In the Methods section we explain the PCA methodology. However, let us stress two definitions which will be used in the figures below. For each gene, the differential expression in a given sample is defined as ediff = e/eref, where eref is the geometric mean of the expression of this gene over normal samples. The logarithmic fold variation, in turn, is defined as efold = log2(ediff). This latter variable is used in the PCA results.

Statement 1

The state of a tissue in gene expression space can be described by a few variables. In particular, there is a single variable describing the progression from a normal tissue to a tumor.

Fig 1 shows the PCA results for three of the studied tissues: kidney renal papillary cell carcinoma (KIRP), lung squamous cell carcinoma (LUSC) and liver hepatocellular carcinoma (LIHC). In the upper panel, the distribution of normal and tumor samples in the (PC1, PC2) plane shows well-defined normal and tumor regions, which are identified as the corresponding Kauffman attractors. Note that the first variable, PC1, discriminates between a normal sample and a tumor. PC1 is thus labeled as the cancer axis, and the projection along it is used to quantify the progression from a normal tissue to a tumor. The comparison with clinical variables such as tumor stage or tumor cellularity is discussed below.

Fig 1. Principal component analysis of the TCGA gene expression data for three of the studied tumors.

The upper panel contains the (PC1, PC2) plane, whereas the bottom panel shows the variance captured by the first 18 PCs.

The lower panel, on the other hand, evaluates the fraction of total variance captured by the first n PC variables. In LUSC, for example, the first 3 components account for 74% of the variance. This reduced number of variables, well below the number of constituent genes, may be interpreted as the effective number of degrees of freedom of the complex system represented by a tissue.

The fraction of variance depends on the sample set. Thus, we should take the results as approximate, semi-quantitative ones, which should improve as the sample size is enlarged. Nevertheless, in spite of the statistical origin of the new variables, we can use them to describe the actual state of a given sample. Notice that each variable in fact defines an expression profile, that is, a concerted variation of a group of genes or metagene [31].
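The cumulative variance fraction described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' script: the function name `variance_fractions` and the toy matrix are assumptions, and the only convention taken from the paper is that deviations are measured from efold = 0, so no mean is subtracted.

```python
import numpy as np

def variance_fractions(efold, n_components=18):
    """Cumulative fraction of variance captured by the leading PCs.

    efold: array (n_samples, n_genes) of log2 fold values. Deviations are
    measured from efold = 0 (the normal reference), so no mean is subtracted.
    """
    # Singular values of efold give the covariance eigenvalues:
    # lambda_k = s_k**2 / (n_samples - 1).
    s = np.linalg.svd(efold, compute_uv=False)
    eigvals = s**2 / (efold.shape[0] - 1)
    total = eigvals.sum()  # equals the trace of the covariance matrix
    return np.cumsum(eigvals[:n_components]) / total

# Synthetic stand-in for real data: 40 samples, 200 genes, one dominant axis.
rng = np.random.default_rng(0)
axis = rng.normal(size=200)
axis /= np.linalg.norm(axis)
# 10 "normal" samples at the origin, 30 "tumor" samples displaced along the axis.
progress = np.concatenate([np.zeros(10), rng.uniform(5, 10, size=30)])
efold = np.outer(progress, axis) + 0.1 * rng.normal(size=(40, 200))

frac = variance_fractions(efold)
print(frac[:3])
```

In this toy case the first component dominates by construction; with real TCGA data the curve would rise more gradually, as in the lower panel of Fig 1.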

Statement 2

Each cancer localization is characterized by a gene expression profile, in which genes have specific weights in the definition of the cancer state.

Let us call v1 the eigenvector along PC1 (boldface denotes vectors). We showed above that the PC1 axis accounts for a very large fraction of the variance and that the projection along it can be taken as an indicator of progression towards the malignant state. For a given sample with fold expression vector efold, the projection x1 onto the PC1 axis is defined as

$$x_1 = \mathbf{e}_{\mathrm{fold}} \cdot \mathbf{v}_1 = \sum_i e_{\mathrm{fold},i}\, v_{1i}, \tag{1}$$

where the sum runs over genes and v1 is the unit vector along PC1. The v1 vector may be thought of as providing a metagene [31] or gene expression profile of cancer in the tissue, i.e. the set of over- or under-expressed genes (and their relative importance) that define the cancer state. Each component v1i of the normalized vector v1 is the amplitude weighting gene i in the cancer state. A positive or negative sign indicates that the gene is over- or under-expressed, respectively, in the tumor.

The left panels of Fig 2 show the 30 genes with the largest contributions to v1 in the same tumors represented in Fig 1. In principle, because of their large weights, these 30 genes could be used as cancer biomarkers; however, their specific roles deserve a careful study in each tissue. In LUSC, for example, the gene with the largest weight is Surfactant Protein C (SFTPC), a silenced gene with an important role in lung homeostasis [36, 37]. The analogous genes in KIRP and LIHC are Uromodulin (UMOD) and Cytochrome P450 family 1 subfamily A member 2 (CYP1A2), respectively. The ranking of genes offered by PCA for each tumor is promising and, to the best of our knowledge, has not been sufficiently exploited so far. The top-ranked genes are hub genes [38] showing high absolute values of the differential expression, high frequencies of dysregulation in the tumor set of samples, and interactions with a large number of relevant genes. In Ref. [32] we performed an analysis of the top 33 genes in the ranking for PRAD. Some of these genes have already been validated in the literature, but there are also a number of promising, not yet validated, indications for biomarkers or target genes.
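As an illustration of the projection in Eq (1) and of the gene ranking by weight, consider the following minimal sketch. The helper names `project_on_pc1` and `top_genes`, the gene labels and the numbers are hypothetical placeholders, not values from the TCGA data.

```python
import numpy as np

def project_on_pc1(efold, v1):
    """x1 = efold . v1 for one sample (1D) or many samples (2D, one per row)."""
    return efold @ v1

def top_genes(v1, gene_ids, n_top=30):
    """Genes with the largest |v1_i|, returned with the sign of their weight."""
    order = np.argsort(-np.abs(v1))[:n_top]
    return [(gene_ids[i], float(v1[i])) for i in order]

# Toy example with 5 hypothetical genes (names are placeholders).
gene_ids = ["g0", "g1", "g2", "g3", "g4"]
v1 = np.array([0.1, -0.8, 0.5, 0.0, 0.3])
v1 = v1 / np.linalg.norm(v1)                     # unit vector along PC1
sample = np.array([0.0, -3.0, 2.0, 0.1, 1.0])    # log2 fold values of one sample

x1 = project_on_pc1(sample, v1)
ranking = top_genes(v1, gene_ids, n_top=3)
print(x1, ranking)
```

A negative weight, as for the placeholder gene g1, would correspond to a gene that is under-expressed in the tumor.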

Fig 2.

Left panel. The 30 genes with the highest weights in the v1 vector. The same tumors as in Fig 1 are used as examples. The numbering of genes is the one used in the TCGA data. Positive signs correspond to over-expressed genes, and negative signs to under-expressed genes. Central and right panels. Over- and under-expression tails in the integrated distribution function of genes at the center of the tumor cloud. For each gene we compute the geometric mean over tumor samples. The over-expression tail, for example, is then obtained as the number of genes for which the average differential expression is greater than a given value.

The central and right panels of Fig 2 show the gene distribution functions for the centers of the tumor clouds (geometric averages over tumor samples). In these figures, the genes are ordered according to their differential expression values. Only the over- or under-expression tails are shown. Notice that the tails involve a few thousand genes; the rest of the nearly 60000 measured RNA- and protein-coding genes are not differentially expressed. The distributions are not symmetrical (see S1 Fig in S1 File). Whereas KIRP is dominated by silenced genes, LUSC has nearly equal proportions of under- and over-expressed genes, and in LIHC the over-expressed genes are more numerous. The log-log plots in the central and right panels of Fig 2 stress that the tails exhibit a power-like (Pareto) dependence [33, 39, 40], i.e. the number of genes with differential expression greater than a given value is proportional to an inverse power of the expression.
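The integrated over-expression tail can be sketched as a simple counting function. The example below uses synthetic Pareto-distributed values as a stand-in for the differential expression at a tumor center; the function names, the exponent 1.5 and the threshold grid are illustrative assumptions, and the straight-line fit in log-log scale is only a rough way of reading off a power-like dependence.

```python
import numpy as np

def over_expression_tail(ediff_center, thresholds):
    """Number of genes whose average differential expression exceeds each threshold."""
    return np.array([(ediff_center > t).sum() for t in thresholds])

def tail_exponent(ediff_center, thresholds):
    """Least-squares slope of log N(>x) versus log x (a rough estimate only)."""
    counts = over_expression_tail(ediff_center, thresholds)
    keep = counts > 0
    slope, _ = np.polyfit(np.log(thresholds[keep]), np.log(counts[keep]), 1)
    return slope

# Synthetic Pareto-distributed values standing in for a tumor center.
rng = np.random.default_rng(1)
ediff_center = 1.0 + rng.pareto(1.5, size=20000)       # all values >= 1
thresholds = np.logspace(np.log10(2), np.log10(20), 10)

counts = over_expression_tail(ediff_center, thresholds)
slope = tail_exponent(ediff_center, thresholds)
print(counts, slope)
```

The under-expression tail could be treated in the same way by applying the functions to 1/ediff.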

Statement 3

Tumors in different localizations share hundreds or even thousands of differentially expressed genes. There are 6 genes common to the 15 tumor localizations.

For each localization, we select the 2500 genes with the largest contributions to the vector v1 along the PC1 direction defining the cancer axis. This number, 2500, is roughly the number of genes with significant differential expression values and great importance in the definition of the cancer state, as is apparent from the central and right panels of Fig 2 (see also S2 Fig in S1 File).

Fig 3 shows the number of shared genes for pairs of tumors. Notice that these numbers vary in the interval between 314 and 1889. Large numbers of shared genes are characteristic of tumors in the same organ but originating in different cells (lung, kidney). However, there are also tumors sharing unexpectedly large numbers of genes. For example, tumors in the uterine corpus (UCEC) and bladder (BLCA) share more than 1300 differentially expressed genes. It is worth noticing that the number of shared genes seems to be related to the proximity of tumors in the expression space (see S3 Fig in S1 File).
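The pairwise counting behind a heatmap of this kind can be sketched as set intersections of the top-ranked genes. In the sketch below, the tumor labels, vectors and the small `n_top` are toy values, and matching genes regardless of the sign of their weight is an assumption of this example rather than a statement about the authors' procedure.

```python
import numpy as np
from itertools import combinations

def top_gene_set(v1, n_top):
    """Indices of the n_top genes with the largest |v1_i|."""
    return set(np.argsort(-np.abs(v1))[:n_top].tolist())

def shared_gene_counts(v1_by_tumor, n_top=2500):
    """Number of common top genes for every pair of tumors."""
    sets = {name: top_gene_set(v, n_top) for name, v in v1_by_tumor.items()}
    return {(a, b): len(sets[a] & sets[b])
            for a, b in combinations(sorted(sets), 2)}

# Toy example: 3 hypothetical tumors, 10 genes, top 4 genes each.
v1_by_tumor = {
    "A": np.array([9, 8, 7, 6, 0, 0, 0, 0, 0, 0], dtype=float),
    "B": np.array([9, 8, 0, 0, 7, 6, 0, 0, 0, 0], dtype=float),
    "C": np.array([0, 0, 0, 0, 0, 0, 9, -8, 7, 6], dtype=float),
}
counts = shared_gene_counts(v1_by_tumor, n_top=4)
print(counts)
```

With real data, the same dictionary of pair counts could be arranged into a symmetric matrix for plotting.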

Fig 3. Heatmap representation of the number of common differentially-expressed genes in pairs of cancer localizations.

The intersection of the vertical and horizontal coordinates shows the number of shared genes. The acronyms for the cancer names are the same as in Table 1.

Let us stress that there are 49 genes common to a group of 11 tumors, BLCA-BRCA-COAD-ESCA-HNSC-LUAD-LUSC-PRAD-READ-STAD-UCEC, and six genes common to all of the studied localizations (see S1 Table). The common (pan-cancer) genes are MMP11 (+), C7 (-), ANGPTL1 (-), UBE2C (+), IQGAP3 (+) and ADH1B (-). The signs in parentheses indicate whether the gene is over- or under-expressed in tumors. Their absolute differential expression values are very similar in all of the studied tumors.

The six identified pan-cancer genes have recently been pointed out as playing a significant role in many cancers [41–46]. It is noteworthy that these genes are directly related to cancer hallmarks [7, 8], i.e. invasion, suppression of the immune response, angiogenesis, proliferation and changes in metabolism. Shared genes among groups of tumors suggest the possibility of global therapies in these groups. Below, we notice that pan-cancer genes play a role both in tissue differentiation and in the definition of the border between normal tissues and tumors.

Statement 4

The tumor region is a kind of attractor. Tumors in advanced stages converge to this region independently of patient age or genetic characteristics.

As shown in Fig 1, the regions corresponding to normal and tumor samples are well defined and partially disjoint in the expression space. The sample dispersion comes from genetic differences, the age of patients and the evolution history of each individual. Well-defined regions in expression space support the attractor paradigm of cancer [25, 26, 47], in which the cancer region should be the region of confluence of all somatic evolution trajectories which leave the normal area. In a very reductionist view one may think, for example, of the normal homeostatic state and cancer as two stable solutions of a global gene regulatory network [48, 49].

In order to test the attractor hypothesis more precisely, we study the dependence of the gene expression distribution functions on patient age. In particular, we want to check whether the distribution functions of tumors in the cancer region are age independent, i.e. whether the advanced tumor reaches on average a unique distribution function regardless of the somatic evolution history.

Let us consider, again, LUSC as an example. According to age, we may classify the samples as young or old, defining 4 subgroups of samples in normal or tumor states: Normal Young (NY), Normal Old (NO), Tumor Young (TY) and Tumor Old (TO). These subgroups are in some sense arbitrarily defined with a threshold age of 62 years, the median of the sample set.

The (over-) expression distribution functions are visualized in Fig 4. We compute geometric averages over the NY, NO, TY and TO subgroups, and use the NY values as references in order to define normalized (differential) expression values in the remaining subgroups: ediff(NO), ediff(TY), and ediff(TO). These vectors characterize the centers of their respective clouds of samples. Genes are sorted with regard to their normalized expression values.

Fig 4. Integrated gene (over-) expression distribution functions in LUSC.

According to age, samples are grouped into four sets: Normal Young (NY), Normal Old (NO), Tumor Young (TY) and Tumor Old (TO). The average over the NY set is used to define reference values in order to normalize the expressions. Each set of points represents the average over the respective group.

In the NO group, only a small number of genes, around 10, exhibit differential expression values above 2 as a consequence of aging. In tumors, however, deviations are much larger: for instance, there are around 1000 over-expressed genes with |ediff| > 5.

More striking is the similarity between the distribution functions in the TY and TO subgroups. That is, for tumors the distribution function in the final state is nearly independent of the age at which the tumor initiates. This is an argument in favor of the attractor hypothesis. Similar results (not shown) are obtained for the under-expression tail.
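The age-subgroup comparison can be sketched as follows, assuming an expression matrix that is already strictly positive (for instance after adding the 0.1 offset described in Methods). The function names, the synthetic data and the rule that samples older than the threshold count as old are assumptions of this example.

```python
import numpy as np

def geometric_mean(expr):
    """Gene-wise geometric mean over samples (rows); expr must be positive."""
    return np.exp(np.log(expr).mean(axis=0))

def age_group_profiles(expr, is_tumor, age, threshold=None):
    """Differential expression of the NO, TY and TO centers relative to NY."""
    if threshold is None:
        threshold = np.median(age)
    old = age > threshold
    groups = {
        "NY": ~is_tumor & ~old, "NO": ~is_tumor & old,
        "TY": is_tumor & ~old, "TO": is_tumor & old,
    }
    centers = {k: geometric_mean(expr[m]) for k, m in groups.items()}
    return {k: centers[k] / centers["NY"] for k in ("NO", "TY", "TO")}

# Synthetic data: 20 normal and 20 tumor samples, 50 genes, two age values.
rng = np.random.default_rng(2)
n_genes = 50
is_tumor = np.repeat([False, True], 20)
age = np.tile([55.0, 70.0], 20)
expr = np.exp(0.05 * rng.normal(size=(40, n_genes)))
expr[is_tumor, :5] *= 8.0            # 5 genes over-expressed in tumors

ediff = age_group_profiles(expr, is_tumor, age, threshold=62)
print(ediff["TY"][:5], ediff["TO"][:5])
```

In this toy set the TY and TO profiles coincide up to noise by construction; the point of the real analysis is that the TCGA data show the same behavior without it being imposed.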

A slightly different test comes from considering a second “time” or progression variable: the clinically determined tumor stage [50]. It is a classification assigned to the tumor at the moment of diagnosis, but in some sense it also quantifies the somatic evolution once a portion of the tissue acquires the tumor condition. Fig 5 shows the distribution of tumors by stages in KIRC. Normal tissues are represented by blue points, whereas tumors are drawn in red. The four panels refer to the four stages: I, II, III and IV. All the blue points are present in the four panels, but only red points with the corresponding stage are included in each panel. We use a contour plot with a blue-to-red gradient color scale in order to visualize the difference in the density of points, ρn − ρt, between normal and tumor samples in the state space. Two regions of high intensity are apparent in blue and red, corresponding to the normal and tumor clouds respectively, while a light zone in between indicates a density difference of small modulus.
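A density difference of this kind can be estimated on a grid, for example with two-dimensional histograms. The sketch below is not the authors' plotting code: the histogram estimator, the grid size, the synthetic clouds and the sign convention (normal minus tumor) are all assumptions made for illustration.

```python
import numpy as np

def density_difference(normal_xy, tumor_xy, bins=20, extent=None):
    """Histogram estimate of rho_n - rho_t on a common grid in the (PC1, PC2) plane."""
    all_xy = np.vstack([normal_xy, tumor_xy])
    if extent is None:
        extent = [(all_xy[:, 0].min(), all_xy[:, 0].max()),
                  (all_xy[:, 1].min(), all_xy[:, 1].max())]
    # density=True normalizes each histogram so that it integrates to 1.
    rho_n, xe, ye = np.histogram2d(normal_xy[:, 0], normal_xy[:, 1],
                                   bins=bins, range=extent, density=True)
    rho_t, _, _ = np.histogram2d(tumor_xy[:, 0], tumor_xy[:, 1],
                                 bins=bins, range=extent, density=True)
    return rho_n - rho_t, xe, ye

# Synthetic clouds: normal samples near PC1 = 0, tumors near PC1 = 100.
rng = np.random.default_rng(3)
normal_xy = rng.normal(loc=(0.0, 0.0), scale=10.0, size=(200, 2))
tumor_xy = rng.normal(loc=(100.0, 0.0), scale=15.0, size=(500, 2))

diff, xe, ye = density_difference(normal_xy, tumor_xy)
cell_area = (xe[1] - xe[0]) * (ye[1] - ye[0])
print(diff.shape, (diff * cell_area).sum())
```

The resulting array could be passed to a contour-plotting routine with a diverging color map, so that positive values (normal-dominated) and negative values (tumor-dominated) appear in different colors.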

Fig 5. Stages in the evolution of tumors in clear cell kidney cancer (KIRC).

Blue points are normal samples (included in the four panels), whereas red points are samples from tumors in a given stage of evolution. Contours represent the difference between the normal and tumor densities of sample points. Stages I and II seem to be “transitional”: there are many points in the intermediate region. On the other hand, stages III and IV are “final”, in the sense that most of the tumors are concentrated in the high-intensity red region.

Naively, one expects that tumors move along the transition region from the normal to the tumor region as the stage evolves from I to IV. In the actual measurements, we do not track individual tumors as a function of stage, but get pictures of different tumors at different stages. Thus, in the initial stages we should observe a fraction of red points captured in the transition region, whereas in the final stages most tumors should be concentrated in the optimal region. This is what actually follows from the figure, again supporting the attractor paradigm [25, 26]. We may speculate that the optimal region could be related to a region of maximal fitness for the tumor in the given tissue.

Intuition suggests relating the tumor stage to the coordinate along the tumor axis PC1. The correspondence, however, is not exact. Although there seems to be a correlation between stage and mean displacement towards the tumor region, many tumors in the initial stages are already at the center of the cloud. This could be related to the fact that the observed distribution of samples is probably related to the fitness distribution and the transition region should be a low-fitness zone [33]. Despite the scarcity of tumor samples in stage IV of LIHC and LUSC, a similar conclusion can be reached from the analysis of their corresponding stage evolution (see S4 Fig in S1 File).

Statement 5

There is a landscape of cancer in gene expression space with an approximate border separating normal tissues from tumors.

We want to draw a picture in which both normal tissues and tumors in different localizations are represented. In a way, this is a picture involving tissue differentiation and cancer. It is not surprising that pan-cancer genes will play a role in both processes.

We shall use the normal, enormal, and tumor, etumor, averages (geometric mean) of the gene expression vectors for each localization in order to reduce normal and tumor clouds to their respective normal and tumor centers. The common reference for all the tissues, erefall, is then computed as the average expression vector of the normal centers, i.e. the center of the cloud of normal centers. By using this reference, we can obtain the logarithmic fold variation, efoldtissue, for each tissue and build the covariance matrix of all the localizations. The dimension of this matrix is still equal to the number of genes, which can be reduced by the PCA processing in order to obtain a simplified description in 2–3 variables.
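The construction of the common reference and the subsequent PCA over tissue centers can be sketched as below. Two points are assumptions of this example and are not stated in the text: the center of the normal centers is taken as a geometric mean, and the PCA is uncentered (deviations measured from the common reference, by analogy with the single-tissue case in Methods). The function name and the synthetic centers are hypothetical.

```python
import numpy as np

def landscape_pca(normal_centers, tumor_centers, n_components=3):
    """PCA of tissue centers relative to a common normal reference.

    normal_centers, tumor_centers: arrays (n_tissues, n_genes) of positive
    geometric-mean expressions, one row per localization.
    """
    # Common reference: center of the cloud of normal centers
    # (geometric mean assumed here).
    e_ref_all = np.exp(np.log(normal_centers).mean(axis=0))
    centers = np.vstack([normal_centers, tumor_centers])
    efold = np.log2(centers / e_ref_all)
    # Uncentered PCA via SVD: rows are points, right singular vectors are PC axes.
    u, s, vt = np.linalg.svd(efold, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    coords = efold @ vt[:n_components].T
    return coords, var_frac[:n_components], vt[:n_components]

# Synthetic centers for 4 hypothetical tissues and 30 genes.
rng = np.random.default_rng(4)
n_tissues, n_genes = 4, 30
normal_centers = np.exp(rng.normal(size=(n_tissues, n_genes)))
tumor_centers = normal_centers * np.exp(rng.normal(loc=0.5, size=(n_tissues, n_genes)))

coords, var_frac, axes = landscape_pca(normal_centers, tumor_centers)
print(coords.shape, var_frac)
```

Because the number of points (normal plus tumor centers) is far smaller than the number of genes, a thin SVD of the fold matrix is sufficient here and the full gene-by-gene covariance matrix never needs to be formed.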

The first aspect to be stressed is that the first 2 PCs account for only 37% of the total data variance. The relative importance of these two variables is not as apparent as in the case of individual tissues. This is probably due to the large dispersion of the data for normal tissues, related to tissue differentiation, which is sometimes even larger than the separation between a normal tissue and the respective tumor.

As a consequence of the dispersion of normal tissues, we do not have a “cancer axis” or direction, as in individual tissues. In order to draw a frontier between normal and tumor regions, we shall include higher PCs. The next component, PC3, accounts for 12% of the data variance.

We show in Fig 6 the (PC1, PC3) plane, which indeed suggests that there is a border. Actually, the regions and the border are high dimensional, but the 2D figure captures the essential features. We may call this figure the “approximate normal versus cancer” or “tissue differentiation versus cancer” plane. It is apparent from the figure that the transition from a normal tissue to the corresponding tumor implies crossing the border, and involves simultaneous displacements along the PC1 and PC3 axes.

Fig 6. The gene expression landscape in the (PC1, PC3) plane.

Each point in the diagram represents the average of samples in a given localization. For simplicity, normal tissues are labeled with the corresponding tumor indexes. The approximate border between the normal and tumor regions is apparent.

The unit vectors along these axes allow the identification of the genes with the highest weights. Notably, pan-cancer genes are among the most important genes in these vectors. For example, ADH1B and UBE2C are included in the set of 8 most important genes along PC1: PI3 (+), ADH1B (-), MYBL2 (+), UBE2C (+), ALB (-), CEACAM5 (+), CST1 (+) and MMP1 (+).

A more detailed analysis of the border between normal tissues and tumors is needed, in particular a study of the role of pan-cancer genes. We notice that the separation between the normal and cancer manifolds also takes place in protein expression space [51]. Fig 6 also provides a view of groups or clusters of tumors, based on distances in GE space which, as mentioned above, are a true measure of closeness. A similar analysis is performed in S6 Fig in S1 File. This topic, however, deserves further work.

Discussion

Data processing techniques for dimensional reduction like PCA have proved to be very useful tools for analyzing gene expression data. The current application to cancer allows us to extract information contributing to the understanding of the cancer state from a theoretical point of view. In particular, the results of the PCA processing of the GE data for the 15 studied tumors support Kauffman’s theory of cancer as an attractor state, i.e. once a portion of the tissue escapes from the normal region it is driven to the cancer basin of attraction.

Already with the first component, PCA manages to separate normal and tumor samples into two well-separated groups for all studied tumor localizations. The PC variables, although of statistical origin, can be used to describe the state and evolution of the tissue. By computing the cumulative variance we show that the number of relevant PC coordinates for a tissue, i.e. the effective number of degrees of freedom of the biological system, does not seem to be much larger than 10, in spite of the huge number (more than 60000) of constituent genes.

The transition from a normal tissue to a tumor involves the modification of thousands of genes. These genes do not act independently, but in a concerted way. There is a gene expression profile for each tumor, which can be obtained from the unit vector along PC1. The profile can be used to identify potential biomarkers and target genes.

Tumors in different localizations share hundreds or even thousands of genes from their profiles. The number of shared genes inversely correlates with the distance between the two tumors in GE space. We found 6 genes common to the 15 tumor localizations. These genes seem to play a significant role both in the process of tissue differentiation and in delimiting the border between the set of normal tissues and their corresponding tumors.

Our results are approximate and semi-quantitative, in the sense that they are limited by the number of samples and could be numerically modified if a larger data set becomes available, but at the same time they are simple, general and unbiased, in the sense that no modeling or elaborate mathematical treatments are used. We try to keep the interpretation of the results as devoid as possible of any speculation. The results of our paper raise some interesting questions. For example, what would happen if we targeted a few of the most significant genes in the cancer profile? Could this kind of intervention induce a rearrangement of the whole profile, preventing the tumor from evolving to more advanced stages? Questions are also raised in relation to the overall landscape in gene expression space shown in Fig 6. For example, could we use the pan-cancer genes (i.e., the genes common to all 15 tumors) in therapies designed to act in this group of tumors?

In the paper, we focused on qualitative aspects. However, it is also possible to quantify and to model some aspects of cancer. For example, notice that from figures like Fig 1 we can estimate the dimensions of the normal and cancer regions in gene expression space and the distance between their centers. From these data, and the statement that progression is described by a single variable, we may devise a one-dimensional model for tumorigenesis in a given tissue [35]. As mentioned above, the GE data also allow one to measure the volumes of the basins of attraction in the normal and cancer states [33], to suggest a picture of smooth and abrupt transitions and the relation with GE rearrangements [34], etc. Work along these and other interesting directions is in progress.

Methods

We analyze TCGA data for gene expression [11]. This is a well curated database. We selected the 15 cancer types shown in Table 1 on the basis of two conditions: i) The number of normal samples is greater than or equal to 10, and ii) The number of tumor samples is greater than or equal to 160.
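The two selection conditions can be written as a simple filter. The sample counts below are copied from Table 1; the dictionary and function names are illustrative assumptions, and in practice the filter would be applied to the full list of TCGA projects rather than to the 15 already selected.

```python
# (Nn, Nt) pairs copied from Table 1.
SAMPLE_COUNTS = {
    "BLCA": (19, 414), "BRCA": (112, 1096), "COAD": (41, 473),
    "ESCA": (11, 160), "HNSC": (44, 502), "KIRC": (72, 539),
    "KIRP": (32, 289), "LIHC": (50, 374), "LUAD": (59, 535),
    "LUSC": (49, 502), "PRAD": (52, 499), "READ": (10, 167),
    "STAD": (32, 375), "THCA": (58, 510), "UCEC": (23, 552),
}

def select_cancer_types(counts, min_normal=10, min_tumor=160):
    """Cancer types with at least min_normal normal and min_tumor tumor samples."""
    return sorted(name for name, (n_normal, n_tumor) in counts.items()
                  if n_normal >= min_normal and n_tumor >= min_tumor)

selected = select_cancer_types(SAMPLE_COUNTS)
stricter = select_cancer_types(SAMPLE_COUNTS, min_normal=11)
print(len(selected), len(stricter))
```

READ (10 normal samples) and ESCA (160 tumor samples) sit exactly at the thresholds, which is why both conditions are stated as "greater than or equal to".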

The TCGA data involve measurements of the gene expressions for 60483 RNA- and protein-coding genes [52, 53]. This is the dimension of the matrices used in our principal component analysis. The data are in the FPKM format, corresponding to the number of fragments per kilobase of gene length per million mapped reads [54]. This is a way of normalizing measurements in a given sample.

On the other hand, the median or the geometric mean is commonly used to compute the average expression of a gene in a set of samples. We prefer geometric averages, but then the data must be slightly shifted to avoid zeros. To this end, we added a constant 0.1 to the data. S5 Fig in S1 File shows expressions from a typical data file (PRAD case). Notice that there are around 28000 non-transcribed genes (expression exactly zero), and only around 30000 genes with expression above 0.1. With this regularization procedure, the differential expression of genes with expression values below 0.1 remains close to 1, and they make practically zero contribution to the PCA. This is a very simple way of guaranteeing that only statistically significant genes enter the analysis.

For each cancer localization, we take the geometric mean over normal samples in order to define the reference expression for each gene, eref. The normalized or differential expression is then defined as ediff = e/eref. The logarithmic fold variation is defined in terms of the base 2 logarithm, efold = log2(ediff). Besides reducing the variance, the logarithm allows treating over- and under-expression in a symmetrical way. In addition, we verified that the log transformation makes the data closer to a normal distribution, a requirement of the PCA method [27–29].
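The regularization and normalization steps can be sketched in a few lines. The function name `log_fold` and the toy matrix are assumptions; the offset of 0.1 and the definitions of eref, ediff and efold follow the text.

```python
import numpy as np

def log_fold(expr, is_normal, offset=0.1):
    """log2 fold values relative to the geometric mean over normal samples."""
    shifted = expr + offset                       # avoid zeros before taking logs
    # Gene-wise geometric mean over the normal samples (rows).
    e_ref = np.exp(np.log(shifted[is_normal]).mean(axis=0))
    e_diff = shifted / e_ref
    return np.log2(e_diff)

# Toy FPKM-like matrix: 4 samples (2 normal, 2 tumor) x 3 genes.
expr = np.array([
    [0.0,  4.0, 10.0],
    [0.0,  1.0, 10.0],
    [0.0, 16.0, 10.0],
    [0.0, 64.0, 10.0],
])
is_normal = np.array([True, True, False, False])

efold = log_fold(expr, is_normal)
print(efold)
```

The first gene is never transcribed and the third is constant, so both end up with efold = 0 in every sample and do not contribute to the covariance matrix; only the second gene carries a signal.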

Deviations and variances are measured with respect to efold = 0, that is, with respect to the average over normal samples. This choice is quite natural, because normal tissues are the majority in a population; individuals with cancer are rare.

With these assumptions, the covariance matrix is written as

$$\sigma_{ij} = \sum_s e_{\mathrm{fold},i}(s)\, e_{\mathrm{fold},j}(s) / (N_{\mathrm{samples}} - 1), \tag{2}$$

where the sum runs over the samples, s, and Nsamples is the total number of samples (normal plus tumor). efold,i(s) is the fold variation of gene i in sample s.

The matrix σ is then diagonalized. As mentioned, its dimension is 60483. The obtained eigenvectors define the Principal Component (PC) axes: PC1, PC2, etc., and the projections onto them define the new state variables. By definition, the index of the component is assigned from the highest to the lowest fraction of the total variance captured by the PC in the sample set; thus the highest percentage of variance corresponds to the PC1 axis.

With this procedure, around 10 PCs are enough for an approximate description of the region of the gene expression space occupied by the set of samples. Thus, we need only a small number of the eigenvalues and eigenvectors of σ. To this end, we use a Lanczos routine written in Python, available at https://github.com/DarioALeonValido/evolp, and run it on a node with 2 processors, 12 cores and 64 GB of RAM. As a result, we get the first 100 eigenvalues and their corresponding eigenvectors.
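A partial diagonalization of this kind can be sketched with SciPy. This is a stand-in, not the routine from the repository: `scipy.sparse.linalg.eigsh` (an ARPACK implicitly restarted Lanczos solver) is used here, the function name `leading_pcs` and the synthetic data are assumptions, and the covariance matrix of Eq (2) is applied as an operator so that the full gene-by-gene matrix is never stored.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_pcs(efold, k=10):
    """Largest k eigenpairs of sigma = efold.T @ efold / (N - 1) without forming sigma."""
    n_samples, n_genes = efold.shape

    def matvec(v):
        # sigma @ v computed as two thin products: (G x N) @ ((N x G) @ v).
        return efold.T @ (efold @ v) / (n_samples - 1)

    sigma = LinearOperator((n_genes, n_genes), matvec=matvec, dtype=efold.dtype)
    eigvals, eigvecs = eigsh(sigma, k=k, which="LA")   # largest algebraic eigenvalues
    order = np.argsort(eigvals)[::-1]                  # PC1 first
    return eigvals[order], eigvecs[:, order]

# Synthetic fold matrix: 30 samples, 500 genes, with a shared shift in 20 "tumors".
rng = np.random.default_rng(5)
efold = rng.normal(size=(30, 500))
efold[10:] += 3.0 * rng.normal(size=500)

eigvals, eigvecs = leading_pcs(efold, k=5)
print(eigvals)
```

Each matrix-vector product costs two passes over the sample matrix, which is what makes the approach practical when the number of genes is of order 60000 and the number of samples is a few hundred.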

Supporting information

S1 File. Appendixes with figures.

1.1) Fraction of over- and under-expressed genes in the 15 tumors under study. 1.2) The contribution of genes to the unitary vector along PC1. 1.3) The proximity of tumors in GE space and the number of shared genes. 1.4) Stages in the evolution of tumors. 1.5) Pan-cancer genes and their characteristics. 1.6) Range of expression values in a typical TCGA data file. 1.7) Differentially expressed genes and top pathways. 1.8) Clustering analysis based on S2 Table.

(PDF)

S1 Table. Pan-cancer genes and their characteristics.

(XLS)

S2 Table. Differentially expressed genes and top pathways.

(XLS)

Acknowledgments

A.G. acknowledges the Office of External Activities of the Abdus Salam Centre for Theoretical Physics and the University of Electronic Science and Technology of China for support. D.A.L. acknowledges support from the Norwegian University of Life Sciences. The research is carried out under a project of the Platform for Bio-informatics of BioCubaFarma, Cuba. The data for the present analysis come from the TCGA Research Network [11].

Data Availability

All the relevant data are in the paper and its Supporting information files. Moreover, the information about the data we used, the procedures and results are integrated in a public repository that is part of the project "Processing and Analyzing Mutations and Gene Expression Data in Different Systems": https://github.com/DarioALeonValido/evolp. In particular, the PCA processing of the TCGA data is performed with the specific Python scripts that are found in the folder ../evolp/Landscape_cancer/.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer Statistics, 2021. CA: A Cancer Journal for Clinicians. 2021;71(1):7–33. doi: 10.3322/caac.21654
  • 2. DePinho RA. The age of cancer. Nature. 2000;408(6809):248–254. doi: 10.1038/35041694
  • 3. Garber JE, Offit K. Hereditary Cancer Predisposition Syndromes. Journal of Clinical Oncology. 2005;23(2):276–292. doi: 10.1200/JCO.2005.10.042
  • 4. Wei Q, Cheng L, Amos CI, Wang LE, Guo Z, Hong WK, et al. Repair of Tobacco Carcinogen-Induced DNA Adducts and Lung Cancer Risk: a Molecular Epidemiologic Study. JNCI: Journal of the National Cancer Institute. 2000;92(21):1764–1772. doi: 10.1093/jnci/92.21.1764
  • 5. Lala PK, Chakraborty C. Role of nitric oxide in carcinogenesis and tumour progression. The Lancet Oncology. 2001;2(3):149–156. doi: 10.1016/S1470-2045(00)00256-4
  • 6. Shammas MA. Telomeres, lifestyle, cancer, and aging. Current opinion in clinical nutrition and metabolic care. 2011;14(1):28–34. doi: 10.1097/MCO.0b013e32834121b1
  • 7. Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell. 2000;100(1):57–70. doi: 10.1016/S0092-8674(00)81683-9
  • 8. Hanahan D, Weinberg RA. Hallmarks of Cancer: The Next Generation. Cell. 2011;144(5):646–674. doi: 10.1016/j.cell.2011.02.013
  • 9. Safran M, Solomon I, Shmueli O, Lapidot M, Shen-Orr S, Adato A, et al. GeneCards™ 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics. 2002;18:1542–1543. doi: 10.1093/bioinformatics/18.11.1542
  • 10. Uhlen M, Zhang C, Lee S, Sjöstedt E et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357(6352):eaan2507. doi: 10.1126/science.aan2507
  • 11. Katarzyna T, Patrycja C, Maciej W. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn). 2015;19(1A):A68–A77. doi: 10.5114/wo.2014.47136
  • 12. Bailey M, Tokheim C, Porta-Pardo E, Sengupta S. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell. 2018;173(2):371–385.e18. doi: 10.1016/j.cell.2018.02.060
  • 13. Huang K, Mashl R, Wu Y, Ritter D. Pathogenic Germline Variants in 10,389 Adult Cancers. Cell. 2018;173(2):355–370.e14. doi: 10.1016/j.cell.2018.03.039
  • 14. Thorsson V, Gibbs D, Brown S, Wolf D. The Immune Landscape of Cancer. Immunity. 2018;48(4):812–830.e14. doi: 10.1016/j.immuni.2018.03.023
  • 15. Ding L, Bailey M, Porta-Pardo E, Thorsson V. Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics. Cell. 2018;173(2):305–320.e10. doi: 10.1016/j.cell.2018.03.033
  • 16. The ICGC/TCGA Consortium. Pan-cancer analysis of whole genomes. Nature. 2020;578:82–93. doi: 10.1038/s41586-020-1969-6
  • 17. Gerstung M, Jolly C, Leshchiner I, Dentro S. The evolutionary history of 2,658 cancers. Nature. 2020;578(7793):122–128. doi: 10.1038/s41586-019-1907-7
  • 18. PCAWG Transcriptome Core Group, Calabrese C, Davidson NR, Demircioğlu D. Genomic basis for RNA alterations in cancer. Nature. 2020;578(7793):129–136. doi: 10.1038/s41586-020-1970-0
  • 19. Reyna M, Haan D, Paczkowska M, Verbeke L. Pathway and network analysis of more than 2500 whole cancer genomes. Nature Communications. 2020;11(1):729. doi: 10.1038/s41467-020-14367-0
  • 20. Davies PCW, Lineweaver CH. Cancer tumors as Metazoa 1.0: tapping genes of ancient ancestors. Physical Biology. 2011;8(1):015001. doi: 10.1088/1478-3975/8/1/015001
  • 21. Domazet-Lošo T, Tautz D. Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biology. 2010;8(1):66. doi: 10.1186/1741-7007-8-66
  • 22. Lineweaver CH, Davies PCW, Vincent MD. Targeting cancer’s weaknesses (not its strengths): Therapeutic strategies suggested by the atavistic model. BioEssays. 2014;36(9):827–835. doi: 10.1002/bies.201400070
  • 23. Cisneros L, Bussey KJ, Orr AJ, Miočević M, Lineweaver CH, Davies P. Ancient genes establish stress-induced mutation as a hallmark of cancer. PLOS ONE. 2017;12(4):1–22. doi: 10.1371/journal.pone.0176258
  • 24. Trigos AS, Pearson RB, Papenfuss AT, Goode DL. Somatic mutations in early metazoan genes disrupt regulatory links between unicellular and multicellular genes in cancer. eLife. 2019;8:e40947. doi: 10.7554/eLife.40947
  • 25. Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology. 1969;22(3):437–467. doi: 10.1016/0022-5193(69)90015-0
  • 26. Huang S, Ernberg I, Kauffman S. Cancer attractors: A systems view of tumors from a gene network dynamics and developmental perspective. Seminars in Cell & Developmental Biology. 2009;20(7):869–876. doi: 10.1016/j.semcdb.2009.07.003
  • 27. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems. 1987;2(1):37–52. doi: 10.1016/0169-7439(87)80084-9
  • 28. Lever J, Krzywinski M, Altman N. Principal component analysis. Nature Methods. 2017;14(7):641–642. doi: 10.1038/nmeth.4346
  • 29. Ringnér M. What is principal component analysis? Nature biotechnology. 2008;26:303–304. doi: 10.1038/nbt0308-303
  • 30. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences. 1999;96(12):6745–6750. doi: 10.1073/pnas.96.12.6745
  • 31. Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, et al. Gene expression predictors of breast cancer outcomes. The Lancet. 2003;361(9369):1590–1596. doi: 10.1016/S0140-6736(03)13308-9
  • 32. Perera Y, Gonzalez A, Perez R. Principal component analysis of RNA-seq data unveils a novel prostate cancer-associated gene expression signature. Arch Can Res. 2021;9(S4):002.
  • 33. Gonzalez A, Nieves J, Leon DA, Bringas ML, Valdes-Sosa P. Gene expression rearrangements denoting changes in the biological state. Scientific Reports. 2021;11:8470. doi: 10.1038/s41598-021-87764-0
  • 34. Gonzalez A, Quintela F, Leon DA, Bringas-Vega ML, Valdes-Sosa PA. Estimating the number of available states for normal and tumor tissues in gene expression space. Biophysical Reports. 2022;2(2):100053. doi: 10.1016/j.bpr.2022.100053
  • 35. Herrero R, Leon DA, Gonzalez A. A one-dimensional parameter-free model for carcinogenesis in gene expression space. Scientific Reports. 2022;12:4748. doi: 10.1038/s41598-022-08502-8
  • 36. Whitsett JA, Weaver TE. Hydrophobic Surfactant Proteins in Lung Function and Disease. New England Journal of Medicine. 2002;347(26):2141–2148. doi: 10.1056/NEJMra022387
  • 37. Mulugeta S, Beers MF. Surfactant protein C: Its unique properties and emerging immunomodulatory role in the lung. Microbes and Infection. 2006;8(8):2317–2323. doi: 10.1016/j.micinf.2006.04.009
  • 38. van Dam S, Võsa U, van der Graaf A, Franke L, de Magalhães JP. Gene co-expression analysis for functional classification and gene–disease predictions. Briefings in Bioinformatics. 2017;19(4):575–592. doi: 10.1093/bib/bbw139
  • 39. Newman MEJ. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics. 2005;46(5):323–351. doi: 10.1080/00107510500052444
  • 40. Kuznetsov VA, Knott GD, Bonner RF. General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells. Genetics. 2002;161(3):1321–1332. doi: 10.1093/genetics/161.3.1321
  • 41. Li QG, He YH, Wu H, Yang CP, Pu SY, Fan SQ, et al. A Normalization-Free and Nonparametric Method Sharpens Large-Scale Transcriptome Analysis and Reveals Common Gene Alteration Patterns in Cancers. Theranostics. 2017;7:2888–2899. doi: 10.7150/thno.19425
  • 42. Carbone C, Piro G, Merz V, Simionato F, Santoro R, Zecchetto C, et al. Angiopoietin-Like Proteins in Angiogenesis, Inflammation and Cancer. International Journal of Molecular Sciences. 2018;19(2). doi: 10.3390/ijms19020431
  • 43. Afshar-Kharghan V. The role of the complement system in cancer. The Journal of Clinical Investigation. 2017;127(3):780–789. doi: 10.1172/JCI90962
  • 44. Yang Y, Zhao W, Xu QW, Wang XS, Zhang Y, Zhang J. IQGAP3 Promotes EGFR-ERK Signaling and the Growth and Metastasis of Lung Cancer Cells. PLOS ONE. 2014;9(5):1–10. doi: 10.1371/journal.pone.0097578
  • 45. Gobin E, Bagwell K, Wagner J, Mysona D, Sandirasegarane S, Smith N, et al. A pan-cancer perspective of matrix metalloproteases (MMP) gene expression profile and their diagnostic/prognostic potential. BMC Cancer. 2019;19(1):581. doi: 10.1186/s12885-019-5768-0
  • 46. Dastsooz H, Cereda M, Donna D, Oliviero S. A Comprehensive Bioinformatics Analysis of UBE2C in Cancers. International Journal of Molecular Sciences. 2019;20(9). doi: 10.3390/ijms20092228
  • 47. Huang S, Eichler G, Bar-Yam Y, Ingber DE. Cell Fates as High-Dimensional Attractor States of a Complex Gene Regulatory Network. Phys Rev Lett. 2005;94:128701. doi: 10.1103/PhysRevLett.94.128701
  • 48. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology. 2008;9(10):770–780. doi: 10.1038/nrm2503
  • 49. Emmert-Streib F, Dehmer M, Haibe-Kains B. Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Frontiers in Cell and Developmental Biology. 2014;2:38. doi: 10.3389/fcell.2014.00038
  • 50. Amin MB, Greene FL, Edge SB, Compton CC, Gershenwald JE, Brookland RK, et al. The Eighth Edition AJCC Cancer Staging Manual: Continuing to build a bridge from a population-based to a more “personalized” approach to cancer staging. CA: A Cancer Journal for Clinicians. 2017;67(2):93–99.
  • 51. Nieves J, Gonzalez A. The tissue differentiation and cancer manifolds in gene and protein expression spaces; 2021. doi: 10.1101/2021.08.20.457160
  • 52. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode M, Armean I, et al. Ensembl 2022. Nucleic Acids Research. 2021;50(D1):D988–D995. doi: 10.1093/nar/gkab1049
  • 53. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10(1):57–63. doi: 10.1038/nrg2484
  • 54. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010;28(5):511–515. doi: 10.1038/nbt.1621

Decision Letter 0

Elingarami Sauli

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

20 Sep 2022

PONE-D-22-21174

On the gene expression landscape of cancer

PLOS ONE

Dear Dr. Dario Alejandro Leon Valido,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised (by Reviewers 1 and 2) during the review process.

Please submit your revised manuscript by Nov 04 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the reviewers. You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Elingarami Sauli, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

3. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. 

4. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

5. Thank you for stating the following financial disclosure: 

"The authors received no specific funding for this work."

At this time, please address the following queries:

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution. 

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c) If any authors received a salary from any of your funders, please state which authors and which funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

6. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

"A.G. acknowledges the Office of External Activities of the Abdus Salam Centre for Theoretical Physics and the University of Electronic Science and Technology of China for support. DA.L. acknowledges financialn support from the Norwegian University of Live Sciences for publication fees. The research is carried on under a project of the Platform for Bio-informatics of BioCubaFarma, Cuba. The data for the present analysis come from the TCGA Research Network."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"The authors received no specific funding for this work."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

7. Thank you for stating the following in your Competing Interests section:  

"The authors declare no compiting interest."

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

8. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Review of “On the Gene Expression Landscape of Cancer” by Gonzalez et al.

This manuscript is interesting and thought-provoking in that it addresses “grand” questions of cancer from a higher-order viewpoint, rather than ever narrower questions in ever more detail as most papers seem to do nowadays. However, the writing could be clearly improved and be more concise, precise and use a more scientific language. There are many “filling” words and sentences that can be omitted without losing information.

The following major and minor points should be addressed:

Major points

1. The first two paragraphs of the Results section could be considerably shortened. Parts of it should be transferred to the Methods section, and/or covered by references. It is not necessary to explain in this paper how RNA-seq works, and it is not part of this manuscript’s results. The authors state “In the Methods section, the used PCA methodology is briefly explained”. This statement is sufficient for the Results section, further details on their PCA methodology should be in the Methods.

2. Figure 2 and the corresponding Text shows “a few thousand of genes” differentially expressed between tumors and normal tissue, and the magnitude of this differential expression is at least two-fold, and up to >100-fold. This is unusual, since similar previous analyses show fewer differentially expressed genes, and a lower magnitude of the differences. Are all these differences statistically significant (by t-test or similar test statistics)? Have the authors properly normalized the gene expression data (per-gene and per-sample normalization)?

3. Legends to Figures and Tables should explain what is shown in corresponding Figures and Tables in more detail and with more precision, e.g. all abbreviations should be defined (or at least refer to an item where those are defined), and x- and y-axes should be well defined (e.g. x-axis in Figure 3 is labelled “Differential Expression” – are these fold-changes? Probably yes, but this should be defined!).

4. In statement 4, second part (page 6), the effect of tumor stage (I, II, III, IV) on gene expression of tumors is discussed, and it is assumed that higher-stage tumors should be progressed further on the normal-tumor gene expression axis. However, whereas expression is analysed in the primary tumor, the tumor stage is not a property of the primary tumor alone, e.g. stage IV commonly means that distant metastases are present, and lymph node metastases also play a key role in determining the tumor stage (which may or may not affect the gene expression state of the primary tumor analysed here). Probably the tumor size or tumor grade (both a property of the primary tumor) would be a better parameter than tumor stage to address this question. In any case, it should be defined (or a reference provided), how tumor stage I-IV are defined in KIRC, and this potential limitation should be discussed.

Minor points

1. “Statement 1-5” in Results: Are these results, conclusions or hypothesis? More precision in wording is needed.

2. Statement 1 (page 4): the authors refer to “the left panel” and “the right panel” of Figure 1. They probably mean the upper panel and the lower panel.

3. This reviewer recommends to omit the scale lines within the panels in Figure 2, they are distracting and may cause problems during printing and photo-copying.

4. Page 4 bottom: “the rest of the 60000 genes”. The authors should explain how they arrive at this number. There are about 25000 human protein-coding genes; were additional genes (miRNAs, long ncRNAs etc.) included in this number? The authors should provide their (or the TCGA’s) definition of “gene”.

5. Page 6: “Well defined regions in expression space is a fact in favor of the attractor paradigm of cancer” could be replaced by “Well defined regions in expression space support the attractor paradigm of cancer”.

6. “sub-expression” should rather be termed “under-expression”, as opposed to over-expression, or “up-regulation” and “down-regulation” should be used (throughout the manuscript).

7. The Formatting of the References apparently does not adhere to PLOS One style, in particular, the full first names of authors are provided.

Reviewer #2: PCA has been used to analyze GE data from TCGA dataset. In detail, the results of the PCA of 15 different tumors seem to support 5 different facts or statements. Overall, the manuscript is well written and PCA well performed. I have some minor comments:

- As authors stated: "results are simple, general and unbiased, in the sense that no modeling or elaborated mathematical treatments are used". It would be better to describe the possible limits of this approach.

- For example, could we use the pan-cancer genes (i.e, the genes common to all 15 tumors) in therapies designed to act in this group of tumors? Are these druggable targets?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Feb 21;18(2):e0277786. doi: 10.1371/journal.pone.0277786.r002

Author response to Decision Letter 0


24 Oct 2022

Reviewer 1:

This manuscript is interesting and thought-provoking in that it addresses “grand” questions of cancer from a higher-order viewpoint, rather than ever narrower questions in ever more detail as most papers seem to do nowadays. However, the writing could be clearly improved and be more concise, precise and use a more scientific language. There are many “filling” words and sentences that can be omitted without losing information.

Authors Reply:

We thank the reviewer for his/her careful reading of the manuscript and his/her positive comments and suggestions.

We have rephrased some sentences throughout the whole manuscript and added more phrases to make the text clearer and more comprehensible. All the changes are highlighted in the new version.

Reviewer 1:

The following major and minor points should be addressed: Major points 1. The first two paragraphs of the Results section could be considerably shortened. Parts of it should be transferred to the Methods section, and/or covered by references. It is not necessary to explain in this paper how RNA-seq works, and it is not part of this manuscript’s results. The authors state “In the Methods section, the used PCA methodology is briefly explained”. This statement is sufficient for the Results section, further details on their PCA methodology should be in the Methods.

Authors Reply:

We thank the referee for his/her suggestions. We have shortened the two paragraphs and moved the non-essential information to the Methods section.

Reviewer 1:

2. Figure 2 and the corresponding text show “a few thousand of genes” differentially expressed between tumors and normal tissue, and the magnitude of this differential expression is at least two-fold, and up to >100-fold. This is unusual, since similar previous analyses show fewer differentially expressed genes, and a lower magnitude of the differences. Are all these differences statistically significant (by t-test or similar test statistics)? Have the authors properly normalized the gene expression data (per-gene and per-sample normalization)?

Authors Reply:

We thank the referee for this comment. In the new version of the manuscript we explain the normalization and the methodology in more detail. Our simple procedure discards statistically irrelevant genes: their differential expression becomes 1 and the log-fold variation, which enters the principal component analysis, becomes 0. Genes with high differential expression, highlighted by the methodology, always have a biological meaning.

Here, we would like to show data for PRAD as an example. The 1st figure is similar to Fig. 3 of our manuscript. There are only two genes (dlx1 and pca3) with differential expression higher than 20. There are also around 1000 genes with e_diff > 2, and around 5000 genes with e_diff > 1.5. Note that a study of statistically significant differentially expressed genes in PRAD identifies around 11,000 genes (doi: 10.1371/journal.pone.0145322).

In the paper doi: 10.1038/s41598-021-87764-0, we relate this relatively high GE rearrangement (that is, thousands of differentially expressed genes) to a discontinuous transition between the normal and the tumor state.

The 2nd figure plots sample data for pca3 (a known PRAD marker) in age groups. Geometric averages are represented in red. The NY group is taken as reference for normalization. This 2nd figure contributes 3 points to the 1st figure: that is, one for each of the NO, TY and TO curves.

Reviewer 1:

3. Legends to Figures and Tables should explain what is shown in corresponding Figures and Tables in more detail and with more precision, e.g. all abbreviations should be defined (or at least refer to an item where those are defined), and x- and y-axes should be well defined (e.g. x-axis in Figure 3 is labelled “Differential Expression” – are these fold-changes? Probably yes, but this should be defined!).

Authors Reply:

We thank the referee for his/her comments. We have added more details to the captions of the figures and tables. In particular, the x-axis in Fig. 3 is precisely the differential expression. The log-log plot simply serves to show the Pareto law in the tail distribution function.

FIG. 1: PRAD

FIG. 2: PRAD
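The remark on the Pareto law in the tail distribution function can be illustrated with a short Python sketch. It is a hedged example rather than the authors' analysis: the data are synthetic draws from a Pareto distribution with a known exponent, the threshold `x_min` is an arbitrary choice, and the least-squares fit on the log-log scale is just one simple way to read off the slope.

```python
import numpy as np


def tail_distribution(values):
    """Empirical tail distribution: fraction of entries greater than or equal to x."""
    x = np.sort(np.asarray(values, dtype=float))
    tail = 1.0 - np.arange(x.size) / x.size
    return x, tail


def tail_exponent(x, tail, x_min):
    """Slope of log(tail) versus log(x) for x >= x_min, returned with positive sign."""
    mask = x >= x_min
    slope, _intercept = np.polyfit(np.log(x[mask]), np.log(tail[mask]), 1)
    return -slope


# Synthetic "differential expression" values with a power-law tail.
rng = np.random.default_rng(0)
alpha_true = 2.0
synthetic = rng.pareto(alpha_true, size=20000) + 1.0  # classical Pareto, minimum 1

x, tail = tail_distribution(synthetic)
alpha_est = tail_exponent(x, tail, x_min=2.0)
print(f"estimated tail exponent: {alpha_est:.2f} (true value {alpha_true})")
```

A power-law tail appears as a straight line in such a log-log plot, which is why that representation makes the Pareto behaviour visible at a glance.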

Reviewer 1:

4. In statement 4, second part (page 6), the effect of tumor stage (I, II, III, IV) on gene expression of tumors is discussed, and it is assumed that higher-stage tumors should be progressed further on the normal-tumor gene expression axis. However, whereas expression is analysed in the primary tumor, the tumor stage is not a property of the primary tumor alone, e.g. stage IV commonly means that distant metastases are present, and lymph node metastases also play a key role in determining the tumor stage (which may or may not affect the gene expression state of the primary tumor analysed here). Probably the tumor size or tumor grade (both a property of the primary tumor) would be a better parameter than tumor stage to address this question. In any case, it should be defined (or a reference provided) how tumor stages I-IV are defined in KIRC, and this potential limitation should be discussed.

Authors Reply:

We recognize the limitations of drawing “attractor basins”, even in the context of advanced tumor stages (III-IV), by surveying only the primary tumor. However, our assumption here is that some information concerning the original seeding of the distant metastasis remains in the gene expression profiles of primary tumors, and/or that the process of cell invasion and dissemination from this primary site is continuous.

In the same vein, other groups have also extracted and proposed prognostic signatures for advanced KIRC stages by studying primary tumor samples. See, for example:

https://doi.org/10.3389/fbioe.2019.00270

https://doi.org/10.1111/cbdd.14141

https://doi.org/10.3389/fonc.2022.912155

Reviewer 1:

Minor points

1. “Statement 1-5” in Results: Are these results, conclusions or hypotheses? More precision in wording is needed.

Authors Reply:

We have clarified at the beginning of the Results section that the statements refer to results/conclusions that follow straightforwardly from the analysis of the GE data.

Reviewer 1:

2. Statement 1 (page 4): the authors refer to “the left panel” and “the right panel” of Figure 1. They probably mean the upper panel and the lower panel.

Authors Reply:

We thank the referee for pointing this out; we have corrected it.

Reviewer 1:

3. This reviewer recommends omitting the scale lines within the panels in Figure 2; they are distracting and may cause problems during printing and photo-copying.

Authors Reply:

We thank the reviewer for this comment; we have removed the grids.

Reviewer 1:

4. Page 4 bottom: “the rest of the 60000 genes”. The authors should explain how they arrive at this number. There are about 25000 human protein-coding genes; were additional genes (miRNAs, long ncRNAs etc.) included in this number? The authors should provide their (or the TCGA’s) definition of “gene”.

Authors Reply:

The referee is right: TCGA data involve measurements of the expression of 60487 RNA- and protein-coding genes in the Ensembl notation. We have made this point explicit in the text.

Reviewer 1:

5. Page 6: “Well defined regions in expression space is a fact in favor of the attractor paradigm of cancer” could be replaced by “Well defined regions in expression space support the attractor paradigm of cancer”.

Authors Reply:

We thank the referee for pointing this out; we have corrected it accordingly.

Reviewer 1:

6. “sub-expression” should rather be termed “under-expression”, as opposed to over-expression, or “up-regulation” and “down-regulation” should be used (throughout the manuscript).

Authors Reply:

We thank the referee for pointing this out; we have made the terminology uniform.

Reviewer 1:

7. The formatting of the References apparently does not adhere to PLOS One style, in particular, the full first names of authors are provided.

Authors Reply:

We thank the referee for pointing this out; we have corrected the format of the references and the manuscript accordingly.

Reviewer 2:

PCA has been used to analyze GE data from TCGA dataset. In detail, the results of the PCA of 15 different tumors seem to support 5 different facts or statements. Overall, the manuscript is well written and PCA well performed.

Authors Reply:

We thank the referee for his/her very positive comments.

Reviewer 2:

I have some minor comments:

- As authors stated: “results are simple, general and unbiased, in the sense that no modeling or elaborated mathematical treatments are used”. It would be better to describe the possible limits of this approach.

- For example, could we use the pan-cancer genes (i.e., the genes common to all 15 tumors) in therapies designed to act in this group of tumors? Are these druggable targets?

Authors Reply:

We thank the referee for raising these points. We have stressed in the new version of the manuscript that the results of our PCA analysis are limited mainly by the number of available samples. The second question, on the pan-cancer genes, is really interesting, but beyond the scope of the paper.

Druggable targets are defined as those molecules (i.e. proteins, peptides, nucleic acids) whose levels and/or specific activities can be modulated by a drug, which can consist of a small molecular weight chemical compound (SMOL) or a biologic (BIOL), such as an antibody or a recombinant protein. Two issues need to be considered regarding the putative druggability of the reported pan-cancer genes: 1) The gene expression changes highlighted here need to be corroborated at the protein level if we attempt to develop a targeted therapy against the final protein product. 2) Usually, protein down-regulation and/or diminished activity are harder to tackle than protein over-expression and/or activity inhibition. However, as described below, there are experimental and approved drugs for two of the three up-regulated pan-cancer genes, whereas there are recently reported drugs aimed at the three down-regulated genes:

a) MMP11 (Up, Matrix Metallopeptidase 11): at least two experimental drugs are reported at https://www.genecards.org/cgi-bin/carddisp.pl?gene=MMP11

b) C7 (Down, Complement Component C7): at least one experimental drug is reported at https://www.genecards.org/cgi-bin/carddisp.pl?gene=C7&keywords=C7

c) ANGPTL1 (Down, Angiopoietin Like 1): one inferred drug is reported at https://www.genecards.org/cgi-bin/carddisp.pl?gene=ANGPTL1&keywords=ANGPTL1#drugs_compounds

d) UBE2C (Up, Ubiquitin Conjugating Enzyme E2 C): at least three approved drugs are reported at https://www.genecards.org/cgi-bin/carddisp.pl?gene=UBE2C&keywords=UBE2C#drugs_compounds

e) IQGAP3 (Up, IQ Motif Containing GTPase Activating Protein 3): no reported or inferred drug so far at https://www.genecards.org/cgi-bin/carddisp.pl?gene=IQGAP3&keywords=IQGAP3

f) ADH1B (Down, Alcohol Dehydrogenase 1B (Class I), Beta Polypeptide): there are more than 5 approved drugs and another handful in developmental stages.

Of note, in an earlier comprehensive redefinition of the druggable genome (doi: 10.1126/scitranslmed.aag1166), the authors stratified the druggable gene set into three tiers corresponding to position in the drug development pipeline. Tier 1 (1427 genes) included efficacy targets of approved small molecules and bio-therapeutic drugs as well as clinical-phase drug candidates. Tier 2 was composed of 682 genes encoding targets with known bioactive drug-like small-molecule binding partners as well as those with ≥ 50% identity (over ≥ 75% of the sequence) with approved drug targets. Tier 3 contained 2370 genes encoding secreted or extracellular proteins, proteins with more distant similarity to approved drug targets, and members of key druggable gene families not already included in tier 1 or 2. Furthermore, the most frequent Pfam-A domain content in the three tiers of druggable genes was also described.

After inspecting these data, we noted that ADH1B was placed in Tier 1, whereas MMP11, C7 and ANGPTL1 were placed in Tier 3B. Overall, only IQGAP3 remains elusive as a putative druggable target among the six pan-cancer genes. IQGAP3 contains the Rho GTPase activation protein domains not included in tiers 1-3.

Attachment

Submitted filename: plosone_response_vT.pdf

Decision Letter 1

Elingarami Sauli

3 Nov 2022

On the gene expression landscape of cancer

PONE-D-22-21174R1

Dear Dr. Dario Alejandro Leon Valido,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Elingarami Sauli, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: My comments have been all addressed and the manuscript improved. I suggest accepting the manuscript.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Elingarami Sauli

11 Nov 2022

PONE-D-22-21174R1

On the gene expression landscape of cancer

Dear Dr. Leon Valido:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Elingarami Sauli

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Appendixes with figures.

    1.1) Fraction of over- and under-expressed genes in the 15 tumors under study. 1.2) The contribution of genes to the unitary vector along PC1. 1.3) The proximity of tumors in GE space and the number of shared genes. 1.4) Stages in the evolution of tumors. 1.5) Pan-cancer genes and their characteristics. 1.6) Range of expression values in a typical TCGA data file. 1.7) Differentially expressed genes and top pathways. 1.8) Clustering analysis based on S2 Table.

    (PDF)

    S1 Table. Pan-cancer genes and their characteristics.

    (XLS)

    S2 Table. Differentially expressed genes and top pathways.

    (XLS)

    Attachment

    Submitted filename: plosone_response_vT.pdf

    Data Availability Statement

    All the relevant data are in the paper and its Supporting information files. Moreover, the information about the data we used, the procedures and results are integrated in a public repository that is part of the project "Processing and Analyzing Mutations and Gene Expression Data in Different Systems": https://github.com/DarioALeonValido/evolp. In particular, the PCA processing of the TCGA data is performed with the specific python scripts that are found in the folder ../evolp/Landscape_cancer/.


    Articles from PLOS ONE are provided here courtesy of PLOS
