Abstract
PURPOSE
To analyze a microarray data set to identify genes whose expression varies after the diagnosis of breast cancer.
METHODS
A total of 44 928 probe sets in a publicly available Affymetrix microarray data set from the Gene Expression Omnibus, covering 249 patients with breast cancer, were analyzed using nonparametric multivariate adaptive splines. The identified genes with turning points were then grouped by K-means clustering, and their network relationships were analyzed with Ingenuity Pathway Analysis.
RESULTS
In total, 1640 probe sets (genes) were reliably identified as having turning points in their expression profiles along age at diagnosis; 927 were expressed at lower levels after their turning points and 713 at higher levels. K-means clustering assigned them to 3 groups with turning points centered at 54, 62.5, and 72 years, respectively. The pathway analysis showed that the identified genes were actively involved in various cancer-related functions or networks.
CONCLUSIONS
In this article, we applied the nonparametric multivariate adaptive splines method to a publicly available gene expression data set and successfully identified genes with expressions varying before and after breast cancer diagnosis.
Keywords: MASAL, breast cancer, turning point, microarray, Ingenuity
Introduction
Since the development of microarrays for high-throughput analysis of gene expression,1 this technology has been widely used in many biomedical studies.2–8 Cancer research is one of the main areas that apply this technology.2,9–14 Among these, breast cancer research is the most common.15–19
Although the development of microarrays and the ability to perform massively parallel gene expression analysis of human tumors have contributed greatly to breast cancer classification, prognostication, and prediction during the last decade,20 the heterogeneous nature of breast cancer and the lack of reliable pathological or molecular markers still reflect the complexity of the molecular alterations that underlie the development and progression of this disease and pose serious problems for clinical management.21 Some studies have proposed methods to identify genes consistently expressed at different levels in diseased and normal cases, which is useful for elucidating pathways in breast cancer progression.21,22 Other studies have explored the gene expression changes that are associated with the various stages of breast cancer progression.23 However, genes whose expression changes with age at diagnosis are still poorly understood, although the risk of breast cancer is highly age related.
Multivariate adaptive regression splines (MARS) is a nonparametric regression procedure that fits a model with a number of piecewise linear functions (ie, truncated functions with knots).24 Zhang25 further extended this method to analyze data under longitudinal settings. By applying this method to a publicly available microarray data set,26 this study aims to identify a novel set of genes whose expression varies with the diagnosis of breast cancer. One of our aims is to improve our understanding of the biological mechanisms of disease progression at different ages and to provide novel potential drug targets for breast cancer. The other aim is to demonstrate that the approach developed in this study can be generalized to studies with similar problems in other diseases (eg, lung cancer). The outline of the article is as follows.
Section “Method” describes the methods used in the study. Section “Results” presents the results. In section “Discussion,” we discuss the findings identified by our method, the study limitations, and future work. The last section presents the conclusions.
Method
Microarray data set
We studied 44 928 probe sets in a publicly available Affymetrix microarray data set from the Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/) (series number: GSE4922). The data came from 249 patients with breast cancer, aged between 28 and 93 years, who were enrolled in a cohort at Uppsala.26 The tumor specimens were assessed on Affymetrix U133 A and B arrays. Of the 249 samples, 68 were grade 1 tumors, 126 were grade 2 tumors, and 55 were grade 3 tumors.
Processing microarray data through Robust Multichip Average
Robust Multichip Average (RMA) is an algorithm developed to extract the expression matrix from Affymetrix data.27 Through RMA, the raw probe-level intensity values from the Affymetrix data are background corrected, log2 transformed, quantile normalized, and then summarized via a linear model to obtain an expression measure for each probe set. In this step, the raw Affymetrix data (GSE4922) are transformed to a normalized expression value matrix (44 928 probe sets × 249 patients) via RMA in R 2.13.0 (www.bioconductor.org).
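The log2 transformation and quantile normalization steps described above can be illustrated with a minimal sketch. It is not the Bioconductor RMA implementation: background correction and the linear-model summarization are omitted, the function names are hypothetical, and the toy intensities are invented purely for illustration.

```python
import math


def log2_transform(matrix):
    """Log2-transform a probes x samples matrix of positive intensities."""
    return [[math.log2(v) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every sample (column) the same empirical distribution.

    matrix is a list of rows (probes), each a list of per-sample values.
    Ties are broken by probe order; production code usually averages tied ranks.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    # The mean of the k-th smallest value across samples is the target distribution.
    sorted_cols = [sorted(col) for col in columns]
    target = [sum(sc[k] for sc in sorted_cols) / n_cols for k in range(n_rows)]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            normalized[r][c] = target[rank]
    return normalized


# Toy data (3 probes x 3 samples), invented for illustration only.
raw = [[2.0, 4.0, 8.0],
       [16.0, 8.0, 32.0],
       [4.0, 64.0, 16.0]]
normalized = quantile_normalize(log2_transform(raw))
```

After normalization, each column of `normalized` contains the same set of values, only in a different probe order, which is the property that makes samples comparable before summarization.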
Identifying genes with expressions varying after diagnosis via multivariate adaptive splines
Compared with MARS,24 Multivariate Adaptive Splines for the Analysis of Longitudinal Data (MASAL) not only can analyze data under longitudinal settings but also has the advantage of identifying distinct phases in which the rate of increase (or decrease) of the outcome changes.25 Below is a brief description of the model and algorithm in MASAL.
Assume that the outcomes are repeatedly measured at $q$ different time points for each of the $N$ units. The outcome of unit $i$ at the $j$th observation ($i = 1, \ldots, N$; $j = 1, \ldots, q$), $Y_{ij}$, is equal to the following:
$$Y_{ij} = f(t_{ij}, x_{1,ij}, \ldots, x_{p,ij}) + e_i(x_{*,ij}, t_{ij}) \qquad (1)$$
where $f$ is a smooth function, $t_{ij}$ is the time of measurement, $x_{k,ij}$ ($k = 1, \ldots, p$) is the $k$th covariate, $e_i$ is an error term, and $x_{*}$ denotes the covariates on which the error term depends.
Here, MASAL is used to regress gene expressions on age at diagnosis through R-package MASAL25,28–30 and in so doing identify changing points for genes with age-varying expressions.
In this study, we mainly focus on the changes of gene expression with age ($t$), and only a single measurement is available for each unit. Hence, in the absence of covariates and multiple measurements, the function $f$ in equation (1) reduces to the following:
$$f(t) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(t)$$
where $\beta_0$ is an intercept parameter, $M$ is an unknown number of nonconstant terms, and $h_m(t)$ is a basis function in the function set $\Gamma = \{(t - \tau)^+,\ t\}_{\tau \in (-\infty, +\infty)}$, with $(t - \tau)^+ = \max(0, t - \tau)$,25 or a product of 2 or more such functions. For each gene, $\beta_0$, $\beta_m$, $M$, and $h_m(t)$ are estimated from the data using the R package MASAL.
Another advantage of using MASAL is that it removes a restriction of MARS: MARS24 searches for the knot τ* only over the observed values of ti, whereas MASAL searches without this restriction at no additional computational cost.25
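To make the role of the truncated basis function concrete, the following is a minimal sketch of fitting a single turning point by least squares. It is a deliberately simplified stand-in and not the MASAL algorithm: it assumes exactly one knot, fits expression as β0 + β1·t + β2·(t − τ)+, and picks τ from a user-supplied grid by smallest residual sum of squares. All function names, the candidate grid, and the synthetic data are hypothetical.

```python
def hinge(t, tau):
    """Truncated linear basis function (t - tau)+ = max(0, t - tau)."""
    return max(0.0, t - tau)


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_one_knot(ages, expr, tau):
    """Least-squares fit of expr ~ b0 + b1*t + b2*(t - tau)+; returns (coefs, rss)."""
    rows = [[1.0, t, hinge(t, tau)] for t in ages]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, expr)) for i in range(3)]
    coefs = solve_linear_system(xtx, xty)
    rss = sum((y - sum(c * v for c, v in zip(coefs, r))) ** 2
              for r, y in zip(rows, expr))
    return coefs, rss


def find_turning_point(ages, expr, candidates):
    """Return (tau, coefs) for the candidate knot with the smallest RSS."""
    best = min(candidates, key=lambda tau: fit_one_knot(ages, expr, tau)[1])
    return best, fit_one_knot(ages, expr, best)[0]


# Synthetic, noise-free profile: flat until age 60, then decreasing.
ages = [float(a) for a in range(30, 91, 2)]
expr = [8.0 - 0.05 * hinge(a, 60.0) for a in ages]
# The grid includes odd ages that were never observed in `ages`.
candidates = [float(c) for c in range(40, 81)]
tau, (b0, b1, b2) = find_turning_point(ages, expr, candidates)
```

In this sketch the sign of `b2` indicates whether the slope decreases or increases after the turning point. The candidate grid is not tied to the observed ages, which loosely mirrors the point made above about knot locations, although a grid search is only an approximation of a search over a continuum.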
Grouping the genes with the similar turning points by K-means clustering
K-means clustering is a method for finding clusters and cluster centers in a set of unlabeled data (ie, unsupervised machine learning).31 Given a number of cluster centers for the genes’ turning points, the K-means procedure iteratively moves the centers to minimize the total within-cluster variance. The number of clusters is chosen as the smallest number for which the ratio of between-cluster variance to total variance is larger than 90%. The goal of the clustering analysis in this study is to group genes with similar expression turning points and then conduct the subsequent pathway analysis of their associations.
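The selection of the number of clusters can be sketched as follows. This is an illustration only: it reads the 90% criterion as between-cluster sum of squares divided by total sum of squares, uses a plain one-dimensional Lloyd iteration with a deterministic start rather than any particular library routine, and runs on hypothetical toy turning points chosen so that the centers are easy to verify by hand.

```python
def assign(values, centers):
    """Assign each value to its nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on scalars, started from evenly spaced order statistics."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = assign(values, centers)
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(values, centers)


def between_ss_ratio(values, centers, clusters):
    """Between-cluster SS / total SS, valid once centers equal the cluster means."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = sum((v - c) ** 2 for c, cl in zip(centers, clusters) for v in cl)
    return 1.0 - within_ss / total_ss


def choose_k(values, threshold=0.90, max_k=10):
    """Smallest k whose between/total ratio exceeds the threshold."""
    for k in range(1, max_k + 1):
        centers, clusters = kmeans_1d(values, k)
        if between_ss_ratio(values, centers, clusters) > threshold:
            return k, centers
    return max_k, centers


# Hypothetical turning points (years), not taken from the study data.
turning_points = [52.0, 53.0, 54.0, 55.0, 56.0,
                  61.0, 62.0, 63.0, 64.0,
                  70.0, 71.0, 72.0, 73.0, 74.0]
k, centers = choose_k(turning_points)
```

On this toy input, one and two clusters explain roughly 0% and 77% of the total variance, so the loop stops at three clusters. K-means results generally depend on initialization, so a real analysis would typically use several random starts.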
Modeling the biological systems for the clustered genes with Ingenuity Pathway Analysis
Ingenuity Pathway Analysis (IPA) is applied to understand the molecular and chemical interactions, cellular phenotypes, and disease processes within a system from RNA expression microarrays or single-nucleotide polymorphism microarrays. Ingenuity Pathway Analysis also provides insight into the causes of observed gene expression changes and into the predicted downstream biological effects of those changes. The main transcription factors and biological functions for each cluster were analyzed with IPA (Ingenuity Systems, www.ingenuity.com).
Results
Out of 44 928 probe sets, 5765 (12.8%) were identified by MASAL as changing expression after diagnosis. However, because of the high variation at the tails, we focused only on the 1640 probe sets with turning points between ages 49 and 75, which are close to the 20th and 80th percentiles of the diagnostic age. Out of these 1640 probe sets, 927 were expressed at lower levels after their turning points and 713 at higher levels. Table 1 lists the first 20 probe sets, in Affymetrix GeneChip order, with turning points at diagnosis. The full list can be found in Appendix 1. Figure 1 shows 2 example probe sets selected from Table 1.
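The filtering and counting step described above can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the window is computed as the 20th and 80th percentiles of the diagnostic ages, and the direction of change is read from the sign of the fitted slope change at the turning point. The function names, record layout, and example values are hypothetical.

```python
import statistics


def percentile_window(ages, lower=20, upper=80):
    """Return the (lower, upper) percentiles of the diagnostic ages."""
    cuts = statistics.quantiles(ages, n=100, method="inclusive")
    return cuts[lower - 1], cuts[upper - 1]


def summarize_turning_points(fits, window):
    """Keep fits whose turning point lies inside the window and count directions.

    fits is a list of (probe_id, turning_point, slope_change) tuples.
    """
    low, high = window
    kept = [f for f in fits if low <= f[1] <= high]
    lower_after = sum(1 for f in kept if f[2] < 0)
    higher_after = sum(1 for f in kept if f[2] > 0)
    return kept, lower_after, higher_after


# Hypothetical ages and fitted turning points, for illustration only.
ages = [float(a) for a in range(30, 91)]
window = percentile_window(ages)
fits = [("probe_a", 40.0, -0.10),
        ("probe_b", 54.0, -0.20),
        ("probe_c", 62.5, 0.30),
        ("probe_d", 72.0, -0.05),
        ("probe_e", 85.0, 0.40)]
kept, lower_after, higher_after = summarize_turning_points(fits, window)
```

Restricting attention to the central part of the age distribution avoids turning points that are supported by only a few patients at either extreme.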
Table 1.
Top 20 probe sets with expressions changing after diagnosis.
Figure 1.
Examples of genes (probe sets) with expressions changing after diagnosis.
K-means clustered the identified turning points into 3 major groups, and the corresponding centers are 54 (group 1), 62.5 (group 2), and 72 (group 3) (unit: year) (Figure 2).
Figure 2.

Distribution of the identified turning points and the grouping results from K-means clustering.
The pathway analysis was first applied to all probe sets identified by MASAL (named “group all”) and then followed by the 3 groups clustered by K-means (named “group 1,” “group 2,” and “group 3”).
For the top canonical pathways, IPA shows that gonadotropin-releasing hormone (GnRH) signaling, inhibition of matrix metalloproteinases, and T-cell receptor signaling are the top-ranked pathways for these 4 groups (including group all, group 1, group 2, and group 3) (Table 2).
Table 2.
Top 5 canonical pathways identified by Ingenuity Pathway Analysis.
For the disease and disorder function, the cancer-related function is among the top 5 functions for group all, group 1, and group 3 (Table 3). The inflammatory response function is the top disease and disorder function for group 2.
Table 3.
Top 5 disease and disorder functions identified by Ingenuity Pathway Analysis.
For the molecular and cellular function, cellular growth and proliferation is the no. 1 function for group all, group 1, and group 3 and the no. 2 function for group 2. The function related to cell death and survival appears among the top 5 for 3 out of 4 groups (Table 4).
Table 4.
Top 5 molecular and cellular functions identified by Ingenuity Pathway Analysis.
Interestingly, for the top 5 networks identified by IPA (Table 5), there is very little overlap among these 4 groups. In particular, RNA posttranscriptional modification, protein synthesis, and gene expression is the no. 1 network identified for group all. Developmental disorder, hereditary disorder, and ophthalmic disease is the no. 1 network for group 1. For group 2, hereditary disorder, respiratory disease, and cell cycle is the no. 1 network, and for group 3, the no. 1 network is amino acid metabolism, drug metabolism, and molecular transport.
Table 5.
Top 5 networks identified by Ingenuity Pathway Analysis.
For the upstream regulator analysis (Table 6), IPA found that the tumor protein p53 (TP53), a tumor suppressor protein encoded by the TP53 gene in humans, is the no. 1 transcription regulator in group 1. TGFB1, a secreted protein that performs many cellular functions (eg, control of cell growth, cell proliferation, and apoptosis), appears as one of the top 5 regulators in all 4 groups.
Table 6.
Top 5 upstream regulators identified by Ingenuity Pathway Analysis for the 3 clusters of genes with expressions changing after diagnosis.

Discussion
In cancer research, it is natural to envision that there are genes that change expression before and after the onset of cancer. Developing methods to identify these genes can help us further understand the process of carcinogenesis and provide potential drug targets for cancer treatment. In this article, we applied a method developed by Zhang et al29 to a publicly available gene expression data set collected using microarray technology, which has been widely used to study biomedical problems, particularly in cancer research, in the past 20 years.
Our approach successfully identified genes whose expression varies before and after the diagnosis of breast cancer. The network analysis by IPA shows that many of the identified or associated genes play important roles in the process of cancer development and progression. For instance, the tumor suppressor gene p53 and the transforming growth factor TGFB1 have been shown to be associated with breast cancer risk in many studies.32–36 Figures 3 and 4 show the regulator networks of these 2 factors from Ingenuity.
Figure 3.
P53-mediated regulatory network via Ingenuity pathway analysis.
Figure 4.
TGFB1-mediated regulatory network via Ingenuity pathway analysis.
In addition to genes, our approach can also identify networks or functions that are associated with breast cancer risk. For instance, we found that “Cancer” is among the top 5 disease and disorder functions and that GnRH signaling is among the top 5 canonical pathways for many clustered groups (Tables 2 and 3), and there is evidence that GnRH signaling is associated with the risk of many types of cancer.37–39
For comparison, we also separated the genes according to the sign of their turning points after the clustering analysis and then re-ran the IPA. The results are uploaded as supplemental data.
The ideal data set for studying this kind of problem is one that collects longitudinal measurements for each gene from the same patient over a time span that covers disease diagnosis. In addition, to have enough power, the study must have an adequate sample size (eg, at least several hundred). However, this kind of data set is difficult to obtain even though the cost of each array (or gene chip) has been greatly reduced. Our study findings show that the measurements of a gene’s expression from different patients can be pooled together to study the change of expression before and after disease diagnosis.
One of the study limitations is the lack of covariate modeling in the discovery of turning points, which will be one of our future research areas. In addition, we plan to apply the approaches demonstrated in this study to data sets of other cancer types to search for common genes whose expression varies before and after diagnosis.
Conclusions
In this article, we applied the nonparametric MASAL method to a publicly available gene expression data set and successfully identified genes with expressions varying before and after breast cancer diagnosis.
Appendix 1. MASAL-identified Affymetrix probe sets with expressions varying after age at diagnosis
Footnotes
PEER REVIEW: Three peer reviewers contributed to the peer review report. Reviewers’ reports totaled 540 words, excluding any confidential comments to the academic editor.
FUNDING: The author(s) received no financial support for the research, authorship, and/or publication of this article.
DECLARATION OF CONFLICTING INTERESTS: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
Both authors contributed to the conception and design; acquisition, analysis, and interpretation; drafting and review for important intellectual content; and final approval of the manuscript.
REFERENCES
- 1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–470. doi: 10.1126/science.270.5235.467.
- 2. DeRisi J, Penland L, Brown PO, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 1996;14:457–460. doi: 10.1038/ng1296-457.
- 3. Spellman PT, Sherlock G, Zhang MQ, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273.
- 4. Behr MA, Wilson MA, Gill WP, et al. Comparative genomics of BCG vaccines by whole-genome DNA microarray. Science. 1999;284:1520–1523. doi: 10.1126/science.284.5419.1520.
- 5. Lim LP, Lau NC, Engele PG, et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature. 2005;433:769–773. doi: 10.1038/nature03315.
- 6. Shi L, Reid LH, Jones WD, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24:1151–1161. doi: 10.1038/nbt1239.
- 7. Miller DT, Adam MP, Aradhya S, et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet. 2010;86:749–764. doi: 10.1016/j.ajhg.2010.04.006.
- 8. Mertzanidou A, Wilton L, Cheng J, et al. Microarray analysis reveals abnormal chromosomal complements in over 70% of 14 normally developing human embryos. Hum Reprod. 2013;28:256–264. doi: 10.1093/humrep/des362.
- 9. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16:906–914. doi: 10.1093/bioinformatics/16.10.906.
- 10. Inza I, Sierra B, Blanco R, Larranaga P. Gene selection by sequential search wrapper approaches in microarray cancer class prediction. J Intell Fuzzy Syst. 2002;12:25–33.
- 11. Rhodes DR, Yu J, Shanker K, et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A. 2004;101:9309–9314. doi: 10.1073/pnas.0401994101.
- 12. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005;21:631–643. doi: 10.1093/bioinformatics/bti033.
- 13. Konishi H, Ichikawa D, Komatsu S, et al. Detection of gastric cancer-associated microRNAs on microRNA microarray comparing pre- and post-operative plasma. Br J Cancer. 2012;106:740–747. doi: 10.1038/bjc.2011.588.
- 14. Botling J, Edlund K, Lohr M, et al. Biomarker discovery in non-small cell lung cancer: integrating gene expression profiling, meta-analysis, and tissue microarray validation. Clin Cancer Res. 2013;19:194–204. doi: 10.1158/1078-0432.CCR-12-1139.
- 15. Kononen J, Bubendorf L, Kallioniemi A, et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med. 1998;4:844–847. doi: 10.1038/nm0798-844.
- 16. van’t Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.
- 17. Iorio MV, Ferracin M, Liu CG, et al. MicroRNA gene expression deregulation in human breast cancer. Cancer Res. 2005;65:7065–7070. doi: 10.1158/0008-5472.CAN-05-1783.
- 18. Prat A, Parker JS, Karginova O, et al. Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res. 2010;12:R68. doi: 10.1186/bcr2635.
- 19. Prat A, Adamo B, Cheang MC, Anders CK, Carey LA, Perou CM. Molecular characterization of basal-like and non-basal-like triple-negative breast cancer. Oncologist. 2013;18:123–133. doi: 10.1634/theoncologist.2012-0397.
- 20. Weigelt B, Baehner FL, Reis-Filho JS. The contribution of gene expression profiling to breast cancer classification, prognostication and prediction: a retrospective of the last decade. J Pathol. 2010;220:263–280. doi: 10.1002/path.2648.
- 21. Cimino D, Fuso L, Sfiligoi C, et al. Identification of new genes associated with breast cancer progression by gene expression analysis of predefined sets of neoplastic tissues. Int J Cancer. 2008;123:1327–1338. doi: 10.1002/ijc.23660.
- 22. Nacht M, Ferguson AT, Zhang W, et al. Combining serial analysis of gene expression and array technologies to identify genes differentially expressed in breast cancer. Cancer Res. 1999;59:5464–5470.
- 23. Ma X-J, Salunga R, Tuggle JT, et al. Gene expression profiles of human breast cancer progression. Proc Natl Acad Sci U S A. 2003;100:5974–5979. doi: 10.1073/pnas.0931261100.
- 24. Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19:1–67. doi: 10.1214/aos/1176347963.
- 25. Zhang H. Multivariate adaptive splines for analysis of longitudinal data. J Comput Graph Stat. 1997;6:74–91.
- 26. Ivshina AV, George J, Senko O, et al. Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res. 2006;66:10292–10301. doi: 10.1158/0008-5472.CAN-05-4414.
- 27. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15. doi: 10.1093/nar/gng015.
- 28. Zhang H. Analysis of infant growth curves using multivariate adaptive splines. Biometrics. 1999;55:452–459. doi: 10.1111/j.0006-341x.1999.00452.x.
- 29. Zhang H, Singer B. Recursive Partitioning in the Health Sciences. Berlin, Germany: Springer; 1999.
- 30. Zhang H. Mixed effects multivariate adaptive splines model for the analysis of longitudinal and growth curve data. Stat Methods Med Res. 2004;13:63–82. doi: 10.1191/0962280204sm353ra.
- 31. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Berlin, Germany: Springer-Verlag; 2009.
- 32. Allred DC, Clark GM, Elledge R, et al. Association of P53 protein expression with tumor cell proliferation rate and clinical outcome in node-negative breast cancer. J Natl Cancer Inst. 1993;85:200–206. doi: 10.1093/jnci/85.3.200.
- 33. Lakhani SR, Van De Vijver MJ, Jacquemier J, et al. The pathology of familial breast cancer: predictive value of immunohistochemical markers estrogen receptor, progesterone receptor, HER-2, and p53 in patients with mutations in BRCA1 and BRCA2. J Clin Oncol. 2002;20:2310–2318. doi: 10.1200/JCO.2002.09.023.
- 34. Miller LD, Smeds J, George J, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A. 2005;102:13550–13555. doi: 10.1073/pnas.0506230102.
- 35. Polyak K. Breast cancer: origins and evolution. J Clin Invest. 2007;117:3155–3163. doi: 10.1172/JCI33295.
- 36. Barcellos-Hoff MH, Akhurst RJ. Transforming growth factor-beta in breast cancer: too much, too late. Breast Cancer Res. 2009;11:202. doi: 10.1186/bcr2224.
- 37. Everest HM, Hislop JN, Harding T, et al. Signaling and antiproliferative effects mediated by GnRH receptors after expression in breast cancer cells using recombinant adenovirus. Endocrinology. 2001;142:4663–4672. doi: 10.1210/endo.142.11.8503.
- 38. Emons G, Gründker C, Günthert AR, Westphalen S, Kavanagh J, Verschraegen C. GnRH antagonists in the treatment of gynecological and breast cancers. Endocr Relat Cancer. 2003;10:291–299. doi: 10.1677/erc.0.0100291.
- 39. Gründker C, Emons G. Role of gonadotropin-releasing hormone (GnRH) in ovarian cancer. Reprod Biol Endocrinol. 2003;1:65. doi: 10.1186/1477-7827-1-65.