Bayesian Mixture Models for Assessment of Gene Differential Behaviour and Prediction of pCR through the Integration of Copy Number and Gene Expression Data

Filippo Trentini; Yuan Ji; Takayuki Iwamoto; Yuan Qi; Lajos Pusztai; Peter Müller

PLoS ONE. 2013 Jul 12;8(7):e68071. doi: 10.1371/journal.pone.0068071

Editor: Xiaofeng Wang

PMCID: PMC3709899 PMID: 23874497

Abstract

We consider the joint modeling of microarray RNA expression and DNA copy number data. We propose Bayesian mixture models that define latent Gaussian probit scores for the DNA and RNA, and integrate the two platforms via a regression of the RNA probit scores on the DNA probit scores. Such a regression conveniently allows us to include additional sample-specific covariates such as biological conditions and clinical outcomes. We develop two methods: the first makes inference on the differential behaviour of genes in patients with different subtypes of breast cancer, and the second predicts the pathological complete response (pCR) of patients by borrowing strength across the genomic platforms. Posterior inference is carried out via MCMC simulations. We demonstrate the proposed methodology using a published data set consisting of 121 breast cancer patients.

Introduction

Biological Background

Copy number and arrayCGH

Human beings have two copies of each gene, defined as a segment of DNA. The normal copy number of a gene is therefore two. Copy number aberration (CNA) refers to cytogenetic events in which the DNA replication process is disrupted such that the gene is either replicated multiple times (copy number gain) or loses one or both copies (copy number loss) in newly generated cells. Comparative Genomic Hybridization (CGH) has emerged as a dominant technique for detecting CNA [1], especially when combined with microarrays. The resulting arrayCGH techniques [2], [3], [4], [5] measure thousands or millions of genomic targets or “probes” that are spotted or printed on a glass surface. These probes usually span the whole genome, with a resolution ranging from 1 Mb (one million base pairs) for BAC (bacterial artificial chromosome) arrays to 50–100 kb (kilo base pairs) for more recent microarrays. In an arrayCGH experiment, a DNA test sample of interest is labeled with a dye (say Cy3) and then mixed with a diploid reference sample labeled with a different dye (say Cy5). The combined sample is then hybridized to the microarrays and the intensities of both colors are measured through an imaging process. The quantity of interest is the $\log_2$ ratio of the intensities of the two colors. The collection of intensity ratios then provides useful information about genome-wide changes in copy number between the two samples. Since the reference sample is presumed to be diploid, the intensity ratio is determined by the copy number of the DNA in the test sample. If the copy number of the test sample is also two, the theoretical $\log_2$ intensity ratio equals zero. If there is a single copy loss in the test sample, the theoretical ratio is $\log_2(1/2) = -1$, assuming all the cells in the test sample lost one copy of the DNA fragment. If there is a single copy gain, the theoretical ratio is $\log_2(3/2) \approx 0.58$. Multiple copy gains are called amplifications, and the corresponding theoretical intensity ratios are $\log_2(4/2)$, $\log_2(5/2)$, etc. When both copies are lost, the theoretical ratio is $-\infty$, and a large negative value is usually observed in experiments.
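The arithmetic behind these theoretical ratios can be sketched in a few lines. This is only an illustration under the stated idealization (a pure test sample hybridized against a diploid reference); the function name `theoretical_log2_ratio` is hypothetical and does not come from the original analysis.

```python
import math

def theoretical_log2_ratio(test_copies, reference_copies=2):
    """Theoretical log2 intensity ratio, assuming every cell in the test
    sample carries `test_copies` copies and the reference is diploid."""
    if test_copies == 0:
        # Both copies lost: the ratio is 0 and its log2 is minus infinity.
        return float("-inf")
    return math.log2(test_copies / reference_copies)

# Copy numbers 0..5: double loss, single loss, neutral, gain, amplifications.
for copies in range(6):
    print(copies, f"{theoretical_log2_ratio(copies):.3f}")
```

In real experiments the observed values are shrunk toward zero, for instance because tumor samples are mixed with normal cells, which is one reason a model-based treatment of the ratios is useful.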

Integration of DNA copy number and RNA expression

Expression microarrays measure RNA expression, which, by the central dogma of molecular biology, results from the transcription of DNA. Microarray technology for measuring RNA gene expression is well known to the statistical community, and its review is omitted here. It is natural to expect that CNAs affect the intensities of the corresponding RNA expression, in that more copies of DNA should lead to higher levels of RNA expression. It is therefore of great interest to study the strength of such interaction, if any, between aCGH and RNA expression measurements on different genes.

Gene expression and copy number variation data have been broadly studied, to assess differential expression of genes [6] and to find segments along the DNA that show CNAs [7], [8]. Statistical and computational models for integrating different types of data are becoming a popular topic in the recent literature, even though only a few works have considered fully model-based approaches. The study in [9] was among the earliest to investigate the direct association between the two types of data in breast cancer cell lines and tissue samples, with an approach based mainly on descriptive statistics. Van Wieringen and Van de Wiel [10], attempting to mitigate the high noise in the raw measurements of the DNA and RNA, proposed a sampling model for RNA expression incorporating estimated probabilities of the corresponding CNAs. They subsequently developed nonparametric adaptive tests to study whether the estimated copy number variations at the DNA level induce differential gene expression at the RNA level. More recently, [11] presented a double-layered mixture model (DLMM) that directly modeled segmental patterns in the copy number data to produce CNA profiles, and simultaneously scored the association between copy number and gene expression data. The DLMM assigned high scores to elevated or reduced expression measurements only if the expression changes were observed consistently across samples with copy number aberration.

An important biological premise of the model is that, by integrating DNA copy number and RNA expression data, we gain more knowledge about the underlying biological process. For example, a high or low correlation between a copy number aberration (CNA) for a gene marker and its abnormal RNA expression would indicate different carcinogenic mechanisms and therefore different treatment selections [12], [13].

We describe a Bayesian mixture model that converts the noisy raw intensity measurements of the DNA and RNA into probabilities of expression, which are subsequently modeled as latent parameters. The integration of the two platforms is thus realized by jointly modeling the probabilities of expression through a probit regression. Our aim, however, is not only to evaluate the relative contribution of large genetic variants such as CNAs to gene expression, but also to make inference using both the differential expression of the genes and the differential copy number variation of the same set of genes. Moreover, once new information on the patients in the study is acquired, our fully model-based approach allows us to exploit the latent integrated structure of the model and achieve better predictive performance for the clinical outcome of new patients coming into the study.

In the next section we present a motivating example with matched arrayCGH and microarray samples from breast cancer patients. In the Materials and Methods section we introduce the probability models, with a particular focus on the probit regression that allows for integration of both platforms, along with some simulation studies. In the Results section, the focus is on posterior inference on the interaction between the two platforms, on differential behaviour, which takes into account both differential gene expression and differential CNA, and on prediction of the pCR of patients after treatment. Final remarks are provided in the Discussion section.

Motivating Example

We consider breast cancer data from 121 patients in three disease subgroups: ER+, HER2+, and triple negative (TN). ER+ patients have estrogen receptors, proteins related to hormone signaling and regulation of gene expression, in their cancer cells. HER2+ patients are those whose tumor cells test positive for a protein called human epidermal growth factor receptor 2. Finally, TN patients lack three “receptors” in their cancer cells: ER, HER2, and progesterone receptors. ER+ and HER2+ patients were collapsed into a single group in order to compare TN patients with all others.

On a slightly reduced set of 116 patients we have a dichotomous measure of positive or negative pCR to treatment. Counts are given in Table 1.

Table 1. Contingency table to classify patients with respect to subgroup of breast cancer and pathological complete response.

	Triple Negative	Positive to ER, HER2 or both	TOT
Positive pCR	20	11	31
No pCR	33	52	85
Missing	3	2	5
TOT	56	65	121


The mRNA expression data were obtained with Affymetrix U133A gene chips. The data were normalized with the MAS5 algorithm, scaled to a target intensity of 600 and log2 transformed. The expression profiles of the cancers are available at GEO accession number GSE22093 [14]. The DNA copy number data were generated with Agilent 4x44K CGH arrays, processed as log2 ratios of the intensities of the two colors, and are available at ArrayExpress accession number E-TABM-584.

ArrayCGH and microarray RNA experiments have been performed using the 121 breast samples to obtain the copy number data on 22,944 probes and RNA expression data for 11,306 genes. We then mapped 22,944 probes to the 11,306 genes, which gave us a matching between the probe ids on the aCGH and the gene ids on the microarrays.

Materials and Methods

Ethics Statement

All the research used public data, published in 2009 in the following paper: “Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array” written by Andre F. et al. and published in Clinical Cancer Research [14].

Sampling model for copy number and gene expression data

On arrayCGH, the experimental unit is probe b belonging to gene g. On the RNA microarray, the experimental unit is gene g. For each sample t we observe the log2 intensity ratio for probe b and the RNA expression level for gene g. The matched copy number and RNA expression data for sample t then consist of the expression level of gene g together with the log2 ratios of the set of arrayCGH probes corresponding to gene g.

We propose mixture models for the copy number and expression measurements and introduce latent variables representing the differential expression status of the DNA and RNA, respectively. We then integrate the two models by constructing a prior probit regression linking the latent variables from both platforms.

We use a mixture model [15] to introduce trinary latent indicator variables for the CNA state of each probe and the differential expression (DE) state of each gene. Specifically, the CNA indicator takes one of three values, corresponding respectively to the copy-loss (copy number below two), copy-neutral (copy number equal to two), and copy-gain (copy number above two) states, and the DE indicator takes one of three values, corresponding respectively to the under-, normal-, and over-expression states. Conditional on these indicators, the sampling models for the copy number log2 ratios and for gene expression are given by

graphic file with name pone.0068071.e036.jpg

(1)

graphic file with name pone.0068071.e037.jpg

(2)

In (2), the mixture model for the gene expression data includes a gene effect and a sample effect. This is not the case in the mixture model (1) for the aCGH data, because each aCGH measurement is already a log ratio between the cancer sample copy number and the reference sample copy number, so the corresponding effects should cancel when the ratio is taken. The sampling model is indexed by parameters representing the normal ranges of variability in the observed copy number and expression measurements. Further parameters define the tail overdispersion with respect to normality, associated with copy losses or gains for aCGH and with under- or over-expression for microarrays.

Latent probit scores and probit regression

Anticipating the integration of both platforms using a regression model, we further introduce latent Gaussian variables, one for each platform, that define probit scores for the trinary CNA and DE indicators. Specifically, define

graphic file with name pone.0068071.e053.jpg

(3)

Before we introduce the probit regression for integration, we present a prior for the aCGH probit score that allows inference on differential CNA across conditions, in our case different subtypes of breast cancer. Letting a clinical categorical covariate indicate which subgroup the patient belongs to, we assume that

where the covariate takes one of two values according to whether the patient belongs to the TN subgroup or not, a probe-specific mean describes a baseline CNA status (e.g., a reference subtype), and a trinary indicator accounts for differential CNA in the two subtypes, following a prior distribution given by

graphic file with name pone.0068071.e060.jpg

The integration of the two platforms is easily done using the latent probit scores and a linear model. First, we introduce a gene-level score for the aCGH data, summarizing the probit scores of the probes that belong to the gene. Keeping in mind that there is a natural biological causal relationship between DNA copy number change and altered expression of the corresponding RNA, we assume that

where the clinical binary covariate is the one introduced above, and two trinary indicators account respectively for differential gene expression in the TN subgroup and for the interaction between the two measurements for gene g, each following a prior similar to the one given above for the differential CNA indicator.

Markov dependence across probes

A Markov dependence is assumed across the probes, defined through the following conditional prior on the probe-specific effect. Assuming that the probe index is ordered according to locus proximity on the chromosome, the dependence across adjacent probes is described as follows.

In this formulation the dependence parameters can be directly interpreted as partial correlation coefficients, defining the strength of dependence between the ratios associated with probes that are adjacent on the chromosome.

Priors

The last step is the specification of the priors for the set of parameters that index the sampling model. We assume conditionally conjugate priors. Denoting Inline graphic a gamma distribution with mean , we assume

Particular attention is given to the formulation of the prior for Inline graphic where

graphic file with name pone.0068071.e081.jpg

with Inline graphic much larger than and fixed at 1. The prior for 's is given by

for Inline graphic , with so that the marginal variance of 's is bounded above. Note that this model assumes that adjacent probes are equally correlated, characterized by 's and . Alternatively, one could model the correlation between probes as a function of their genomics distances, and this can be easily achieved by modeling Inline graphic as a distance between probes and , for example. Finally we assume conditionally conjugate priors for the gene and slide specific effects

subject to Inline graphic . Finally, the normal range of variability in mRNA expression

the tail over-dispersion parameters

and the regression parameters

graphic file with name pone.0068071.e101.jpg

graphic file with name pone.0068071.e102.jpg

with the same assumptions on Inline graphic , and , fixed at 1.

A summary of the model is given in the upper part of Figure 1.

Figure 1. Boxes refer to variables in the model, where latent variables are represented by dotted-line boxes. Circles refer to parameters, where the red ones are the indicators used for posterior inference.

Modified Probability Model for the prediction of pCR

This section arises from the question of whether we can use the same latent structure underlying the gene expression and copy number variation data to make inference on a clinical outcome of new patients in the study, in particular the pCR of patients to treatment.

The chosen approach is to state a model for the data from the two platforms conditional on the clinical outcome, and to assume a Bernoulli distribution for the outcome. This leads to the sought model for the outcome given the data, and the posterior probability that the outcome equals 1 gives us a measure for predicting the outcome of a new patient.

The advantages of our model over, for example, a simple logistic regression are mainly twofold: the noise reduction achieved by assuming a latent structure underlying the data, i.e. the latent probability of expression (POE) scores for gene expression, and the natural variable selection performed within the model itself. The binary indicators in equations (4) and (5), with Bernoulli priors whose probabilities are very close to 1, reduce the number of covariates (genes) and help avoid overfitting.

In summary, when a new patient enters the study and measurements of gene expression and copy number variation are available, we run the model and assume a Bernoulli distribution for the patient's clinical outcome. Through MCMC methods we obtain the updated posterior probability that the outcome equals 1, which gives us a measure for predicting the outcome.

In this particular case the outcome refers to the pCR to the treatment of patients in a breast cancer study, which is defined as a complete disappearance of the tumor with no more than a few scattered tumor cells detected by the pathologist in the resection specimen [16].

As before, we use a mixture model (equations 1 and 2) [15] to introduce trinary latent indicator variables for the CNA state of each probe and the expression state of each gene, and latent Gaussian variables that define probit scores for these trinary indicators (3).

The next two equations embody our assumption that a positive or negative clinical response could be related to the differential behaviour, i.e. copy number variation and gene expression, of a small subgroup of the 11,306 genes. We assume

(4)

where the clinical outcome is the pCR status mentioned above, measured on the 116 patients, and a binary indicator is introduced to control the number of covariates in the regression.

The integration of the two platforms is implemented as a regression with the probit scores,

(5)

where a regression parameter characterizes the relationship between the two platforms, a binary indicator is introduced to control the number of covariates in the regression, and the clinical outcome is the same variable as above.

As new patients come into the study, presumably without information on pCR, an assumption on their outcome is made, as follows:

(6)

so that we can learn about the outcome of a new patient through the above prior and the sampling model, using Bayes' formula and MCMC methods. The Bernoulli probability was set equal to the sample proportion of patients with positive pCR.
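The Bayes update for a binary outcome can be illustrated with a deliberately simplified sketch. It assumes a single scalar score with a Gaussian likelihood whose mean depends on the outcome; the function names, the means and the standard deviation are hypothetical, and the actual model updates the outcome within a Gibbs sampler over many latent scores rather than in closed form.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_prob_positive(score, prior, mean_neg, mean_pos, sd):
    """P(outcome = 1 | score) by Bayes' formula with a Bernoulli(prior) prior
    and Gaussian likelihoods that differ only in their mean (an assumption)."""
    like_pos = normal_pdf(score, mean_pos, sd)
    like_neg = normal_pdf(score, mean_neg, sd)
    numerator = prior * like_pos
    return numerator / (numerator + (1.0 - prior) * like_neg)

# Prior set to the sample proportion of positive pCR in Table 1 (31 of 116).
prior = 31 / 116
for score in (-0.5, 0.5, 1.5):
    print(score, round(posterior_prob_positive(score, prior, 0.0, 1.0, 1.0), 3))
```

A score exactly halfway between the two means carries no information, so the posterior probability equals the prior; scores closer to the positive-outcome mean push the probability above it.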

Priors

Priors are defined as in the earlier Priors section, with the only exception of the two regression parameters and the two binary indicators in (4) and (5). For both regression parameters an informative prior is assumed

with Inline graphic , and the variances estimated using the data. While for the two indicators

with the Bernoulli probability very close to 1, to allow for the selection of a very small subgroup of genes as covariates in the two regressions.

A summary of the model is given in the lower part of Figure 1.

Bayesian Multiplicity Control

Posterior inference for the proposed model is carried out using MCMC simulations by a Gibbs sampling scheme, iterating from the complete set of full conditionals reported in the appendix.

Since the analysis deals with high-throughput gene expression data and our final aim is to select genes [17], multiple comparison problems arise.

A useful generalization of frequentist Type-I error rates to multiple hypothesis testing is the false discovery rate (FDR) introduced in Benjamini and Hochberg [18], and reviewed in a Bayesian framework by Storey [19], [20].

Let a binary indicator denote whether gene g is differentially expressed under two biological conditions of interest (in our case we face two different indicators, depending on whether the comparison is ER+ vs TN or HER2+ vs TN).

Let a second indicator denote the rejection of the g-th comparison, and count the total number of rejections. The FDR is defined as the fraction of false rejections relative to the total number of rejections. As such it is neither Bayesian nor frequentist. From a Bayesian perspective, since the only unknown quantity is the true differential expression indicator in the numerator, one can define a posterior expected FDR by replacing that indicator with its marginal posterior probability.

It was proved by Müller et al. [21] that, under several loss functions that combine false negative and false discovery counts, the optimal rule rejects the comparisons whose marginal posterior probability exceeds a threshold. The problem is now that of specifying the threshold so that the FDR is controlled at a desirable level.
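For concreteness, these quantities can be written out as follows. The symbols are assumed here purely for illustration and are not necessarily those of the original equations: $r_g$ is the true differential expression indicator, $\delta_g$ the rejection indicator, $D$ the number of rejections, $v_g$ the marginal posterior probability and $t$ the threshold. The expressions are taken to be zero when $D = 0$.

```latex
\mathrm{FDR} = \frac{\sum_{g} \delta_g \,(1 - r_g)}{D},
\qquad D = \sum_{g} \delta_g ,

\overline{\mathrm{FDR}} = E(\mathrm{FDR} \mid \text{data})
  = \frac{\sum_{g} \delta_g \,(1 - v_g)}{D},
\qquad v_g = P(r_g = 1 \mid \text{data}),

\delta_g = I(v_g > t).
```

Under this rule the posterior expected FDR of a rejection list is simply the average of $1 - v_g$ over the rejected genes, which is what makes a threshold search over sorted probabilities possible.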

An algorithm that computes FDR levels for any number of discoveries, and therefore selects differentially expressed genes so that the FDR is controlled at a chosen level, consists in sorting the marginal posterior probabilities from the lowest to the highest. If the first sorted value already exceeds the chosen level, we do not reject any null hypothesis; otherwise, if adding the second gene would exceed the level, we reject the first only. We iterate this procedure until the first time the level is exceeded, and reject the hypotheses accumulated up to that point.
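A minimal sketch of this kind of selection is given below. It assumes that the sorted quantities are the posterior probabilities of the null hypothesis (one minus the posterior probability of differential expression) and that the criterion is the running average of those probabilities, i.e. the posterior expected FDR of the current rejection list. The function name and the example values are hypothetical.

```python
def select_by_expected_fdr(post_prob_de, alpha):
    """Return (indices of rejected genes, posterior expected FDR of the list).

    post_prob_de[g] is the marginal posterior probability that gene g is
    differentially expressed; alpha is the target FDR level.
    """
    # Posterior probability of the null for each gene, sorted low to high.
    null_probs = sorted((1.0 - v, g) for g, v in enumerate(post_prob_de))
    selected, running_sum = [], 0.0
    for null_prob, gene in null_probs:
        # Expected FDR if this gene were added to the rejection list.
        if (running_sum + null_prob) / (len(selected) + 1) > alpha:
            break  # the running average is non-decreasing, so stop here
        running_sum += null_prob
        selected.append(gene)
    expected_fdr = running_sum / len(selected) if selected else 0.0
    return selected, expected_fdr

# Toy posterior probabilities for five genes (illustrative values only).
genes, fdr = select_by_expected_fdr([0.99, 0.50, 0.97, 0.20, 0.95], alpha=0.05)
print(genes, round(fdr, 3))
```

Because the null probabilities are sorted in increasing order, the running average can only grow, so the first violation of the target level marks the largest admissible list.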

Simulation Study

We perform a small simulation study and generate data such that the last 50 (out of 1,000) genes show joint differential behaviour in copy number and RNA expression. We first generated two matrices, for gene expression and for copy number ratios, of dimensions 1,000 × 50 and 2,000 × 50 respectively, with 2,000 probes, 1,000 genes (exactly two probes per gene) and 50 samples. The clinical covariate is set to 1 for the first 10 patients and 0 for the remaining 40 patients. Sample and gene effects were generated from the corresponding priors in the model, subject to the model's constraint. Observed log2 ratios and expression values were sampled from two Gaussian distributions, respectively centred at and 0. To induce differential joint behaviour for the last 50 genes, we did the following:

for RNA expression, we generated Inline graphic for and ;

for copy number, we generated Inline graphic for and ;

The second simulation study generates data from the proposed mixture model. We started from setting Inline graphic to be 2 for the first 50 genes and 0 for the remaining 950. and generated the latent scores from the corresponding priors in the model, for , , and for , for and for , for and , , and , randomly with proportions respectively 30 and 70, for and . Once the latent scores are generated, using (1 and 2), we generate gene expression and CNA measurements, setting the hyperparameters as follows:

In both cases roughly 2000 iterations were needed for convergence of the MCMC chain.

For the sake of simplicity, we report only results for the second simulation. Figure 2 shows the posterior probabilities of positive interaction between platforms, of differential CNA, and of joint CNA and RNA differential expression. As expected, the posterior probabilities of positive interaction between platforms for the first 50 genes, and the posterior probabilities of differential CNA and differential joint behaviour for the first 100 genes, are among the highest.

Figure 2. The red dots highlight posterior probabilities of genes which are claimed by the model to show, respectively, positive interaction between the two platforms, differential CNA and differential joint behaviour.

While these simulations merely show that our proposed models behave as expected, we now turn to the selection of differentially behaved genes with multiplicity control and then to the data analysis based on the breast cancer samples.

Results

We applied our model to the breast cancer data set. For comparison, we also applied a simpler version of our models in which the link between the two platforms is switched off for all the genes. The simpler models assume that gene expression and copy number are independent, so there is no integration. We call these simpler versions “marginal models”.

In the upper plot of Figure 3, dots refer to the posterior probabilities of DNA copy number amplification and of over-expression based on the marginal models; black dots highlight the over-expressed genes that jointly showed copy number amplification according to the integrated model. As expected, the joint model mostly selects genes in the upper right corner, but its selection still differs from the intersection of the marginal selections.

Figure 3. Black dots highlight posterior probabilities of genes which are claimed by the model to show joint differential behaviour (A). Comparison between differences in means of the gene expression data and posterior probabilities of differential expression (B). Comparison between sample correlations and posterior probabilities of positive interaction between platforms (C).

A simple model check was carried out by plotting the posterior probabilities of differential gene expression against the difference in means of the gene expression measurements for the TN and non-TN groups. Following the same criterion, we plotted the posterior probabilities of positive interaction between platforms against the sample correlations. The lower plots of Figure 3 show, respectively, a very good match between no difference in sample means and low posterior probabilities of differential expression, and between strong positive sample correlations and high posterior probabilities of positive interaction between platforms.

Our main focus was on five lists of interesting genes: under (over)-expressed genes which jointly showed DNA copy number deletion (amplification) in TN subgroup, under (over)-expressed genes conditional on DNA copy number aberration only in TN subgroup and genes which showed positive interaction between the two platforms. We therefore respectively defined

•

where Inline graphic and indicates all the probes belonging to the gene g.

FDR levels were computed with the algorithm presented in the previous section for each of the five lists, and genes were selected by choosing a cutoff. The lists of selected genes could be of great interest for clinicians, since they indicate which genes show differential expression and copy number variation in TN patients versus patients who test positive for ER or HER2 receptors.

For the prediction of pCR, we split the data set into a training set and a test set. The training set, consisting of 94 patients, was used to obtain samples from the posterior distribution of the parameters, while the test set, consisting of 22 patients, was used to assess predictive performance through the ROC curve. Both sets were randomly selected, and their counts with respect to pCR are reported in Table 2. We constrained the counts so that the test sample is balanced between positive and negative pCR and the training sample respects the proportions of the original data set.

Table 2. Numerosities in the training set and test set.

	Training sample	Test sample	TOT
Positive pCR	20	11	31
No pCR	74	11	85
TOT	94	22	116
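A split with the counts of Table 2 can be sketched as follows. This is an illustration only: the function name `balanced_test_split`, the seed, and the ordering of the outcome vector are hypothetical, and the original random selection is not reproduced.

```python
import random

def balanced_test_split(outcomes, n_test_per_class, seed=0):
    """Draw a test set with the same number of positive and negative
    outcomes; all remaining patients form the training set."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(outcomes) if y == 1]
    negatives = [i for i, y in enumerate(outcomes) if y == 0]
    test = rng.sample(positives, n_test_per_class) \
        + rng.sample(negatives, n_test_per_class)
    test_set = set(test)
    train = [i for i in range(len(outcomes)) if i not in test_set]
    return train, sorted(test)

# 116 patients with known pCR: 31 positive and 85 negative, as in Table 1.
outcomes = [1] * 31 + [0] * 85
train, test = balanced_test_split(outcomes, n_test_per_class=11)
print(len(train), len(test))
```

With 11 patients per class in the test set, the training set is left with 20 positive and 74 negative outcomes, matching the table.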


The smoothed ROC curve was estimated with the method of Lloyd and Yong [22], which has been shown to perform better than empirical estimation. They proposed estimating the curve by kernel smoothing of the distribution functions of the diagnostic measurement underlying the binary decision rule (here, the conditional posterior probabilities of positive pCR), and showed that, for realistic sample sizes, this method is markedly more accurate than the empirical estimate.
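The idea of a kernel-smoothed ROC curve can be sketched as below. The Gaussian kernel and the rule-of-thumb bandwidth are assumptions made for this illustration and are not necessarily the choices of Lloyd and Yong [22]; all function names and the toy scores are hypothetical, and each group needs at least two distinct scores.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smoothed_cdf(x, sample, bandwidth):
    """Kernel-smoothed distribution function (Gaussian kernel) at x."""
    return sum(std_normal_cdf((x - s) / bandwidth) for s in sample) / len(sample)

def rule_of_thumb_bandwidth(sample):
    """Simple bandwidth 0.9 * sd * n^(-1/5); an assumed, generic choice."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 0.9 * sd * n ** (-0.2)

def smoothed_roc(neg_scores, pos_scores, n_grid=201):
    """Return (false positive rate, true positive rate) pairs on a grid."""
    h0 = rule_of_thumb_bandwidth(neg_scores)
    h1 = rule_of_thumb_bandwidth(pos_scores)
    pad = 4.0 * max(h0, h1)
    low = min(neg_scores + pos_scores) - pad
    high = max(neg_scores + pos_scores) + pad
    points = [(0.0, 0.0)]
    for k in range(n_grid):
        cut = high - (high - low) * k / (n_grid - 1)  # threshold, high to low
        fpr = 1.0 - smoothed_cdf(cut, neg_scores, h0)
        tpr = 1.0 - smoothed_cdf(cut, pos_scores, h1)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy posterior probabilities for patients without and with pCR.
neg = [0.10, 0.20, 0.15, 0.30, 0.25]
pos = [0.70, 0.80, 0.60, 0.90, 0.75]
print(round(area_under_curve(smoothed_roc(neg, pos)), 3))
```

Smoothing the two distribution functions separately yields a continuous curve even for a test set as small as 22 patients, where the empirical ROC curve would be a coarse step function.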

As mentioned above, the tests were performed on a sample of 22 patients with previously measured pCR, and are based on the posterior probabilities of the clinical outcome being 1, obtained by running the Gibbs sampler for 30,000 iterations. We performed the same analysis using each of the two platforms marginally, obtaining the corresponding posterior probabilities. These posterior probabilities, obtained through the joint and marginal models, are shown in Figure 4.

The ROC curves are compared in Figure 5. The comparison supports our choice of borrowing information between the two genomic platforms, since the ROC curve of the integrated model has by far the highest area under the curve (AUC), slightly below 0.9.

Finally, we compared our method with a simple logistic regression with LASSO variable selection (LLR) [23], [24], whose corresponding ROC curves are plotted in Figure 6. We performed the analysis using the package glmnet in R, and set the elastic net mixing parameter $\alpha$ to 1. The penalty is defined as

$$\lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right],$$

and $\alpha = 1$ corresponds to the Lasso penalty, which in this case gave the best prediction performance.
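As a small numerical illustration, the elastic-net penalty documented for glmnet, $\lambda[(1-\alpha)\lVert\beta\rVert_2^2/2 + \alpha\lVert\beta\rVert_1]$, can be evaluated directly. The function name and the coefficient values below are hypothetical; this sketch does not fit a model and only shows how the mixing parameter moves between the ridge and Lasso penalties.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty lam * ((1 - alpha) / 2 * ||b||_2^2 + alpha * ||b||_1)."""
    l1_norm = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) * 0.5 * l2_squared + alpha * l1_norm)

coefs = [1.0, -2.0, 0.0]  # illustrative regression coefficients
print(elastic_net_penalty(coefs, lam=0.1, alpha=1.0))  # pure Lasso penalty
print(elastic_net_penalty(coefs, lam=0.1, alpha=0.0))  # pure ridge penalty
```

With the mixing parameter at 1 only the absolute values of the coefficients are penalized, which is what allows coefficients to be set exactly to zero and genes to be dropped from the regression.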

We therefore plotted in Figure 7 the smoothed ROC curves based on the posterior probabilities of pCR obtained through the integrated model and on the predictive probabilities obtained through LLR using only copy number variation data. The AUC obtained through our integrated model is much higher than the one obtained through LLR.

Discussion

We have introduced a Bayesian hierarchical model to integrate two types of genomics data, copy number and RNA expression. The proposed model can be easily extended to multiple platforms, with modification to the modeling of latent probit scores. Since the entire statistical inference is based on a coherent probability model, scientific questions can be addressed with probability statements, allowing for reporting uncertainty measures such as FDR. This is the main advantage of the proposed models over existing ones.

Table 3 reports the list of genes that jointly show over-expression and copy number amplification in TN patients, which was of great interest for clinicians and was also the list associated with the lowest FDR levels. The gene MYC appears in the list. This result is promising, since MYC is a key regulator of cell growth, proliferation, metabolism, differentiation, and apoptosis; MYC deregulation contributes to breast cancer development and progression and is associated with poor outcomes. Multiple mechanisms are involved in MYC deregulation in breast cancer, including gene amplification, transcriptional regulation, and mRNA and protein stabilization, which correlate with loss of tumor suppressors and activation of oncogenic pathways [25].

Table 3. List of genes which jointly show over expression and copy number amplification in TN group.

Symbol	EntrezID	Cytoband	postprob
E2F3	1871	6p22	0.951
MYC	4609	8q24.21	0.954
PLCG2	5336	16q24.1	0.954
PEPD	5184	19q13.11	0.954
C12orf32	83695	12p13.33	0.954
C10orf10	11067	10q11.21	0.954
FOLH1	2346	11p11.2	0.955
GTPBP2	54676	6p21	0.956
KARS	3735	16q23.1	0.957
CD14	929	5q22-q32	0.958
SHCBP1	79801	16q11.2	0.959
CHD1L	9557	1q12	0.959
CCDC86	79080	11q12.2	0.962
SLAMF7	57823	1q23.1-q24.1	0.962
CTPS	1503	1p34.1	0.962
IRAK1	3654	Xq28	0.964
C1GALT1	56913	7p14-p13	0.965
STK38	11329	6p21	0.965
AK2	204	1p34	0.966
HEPH	9843	Xq11-q12	0.966
VIM	7431	10p13	0.967
CDH3	1001	16q22.1	0.968
TRIT1	54802	1p34.2	0.969
GAS1	2619	9q21.3-q22	0.971
HLA-DRA	3122	6p21.3	0.972
ST8SIA1	6489	12p12.1-p11.2	0.973
FXYD5	53827	19q13.12	0.975
C1S	716	12p13	0.975
RECK	8434	9p13.3	0.976
C11orf75	56935	11q21	0.976
MOBKL2B	79817	9p21.2	0.977
HLA-E	3133	6p21.3	0.978
FAM107A	11170	3p21.1	0.979
ICAM1	3383	19p13.3-p13.2	0.979
INSL4	3641	9p24	0.980
PRKD3	23683	2p21	0.982
SLC2A3	6515	12p13.3	0.983
PVR	5817	19q13.2	0.984
TPX2	22974	20q11.2	0.985
NDRG1	10397	8q24.3	0.985
NFKBIE	4794	6p21.1	0.985
TIMM44	10469	19p13.3-p13.2	0.986
C1orf38	9473	1p35.3	0.986
PDSS1	23590	10p12.1	0.986
SH2D2A	9047	1q21	0.986
USP25	29761	21q11.2	0.989
HMGN4	10473	6p21.3	0.989
CHODL	140578	21q11.2	0.990
POLR1E	64425	9p13.2	0.990
STIL	6491	1p32	0.992
BTG3	10950	21q21.1	0.992
MCM4	4173	8q11.2	0.992
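
As a rough illustration of how a gene list like Table 3 relates to an FDR level, the sketch below computes the posterior expected FDR of a list selected by thresholding posterior probabilities, i.e. the average of one minus the posterior probability over the selected genes. The function name, the 0.95 threshold, and the choice of five rows copied from Table 3 are assumptions made for illustration only; this is not necessarily the exact decision rule used in the paper.

```python
def posterior_expected_fdr(post_probs, threshold):
    """Posterior expected FDR of the list {i : post_probs[i] >= threshold}.

    Each selected gene is a false discovery with posterior probability
    1 - post_probs[i], so the expected FDR is the mean of those values.
    Returns (expected_fdr, number_selected).
    """
    selected = [p for p in post_probs if p >= threshold]
    if not selected:
        return 0.0, 0
    return sum(1.0 - p for p in selected) / len(selected), len(selected)


# Five rows copied from Table 3 (a subset chosen only for illustration).
table3_subset = {
    "E2F3": 0.951,
    "MYC": 0.954,
    "VIM": 0.967,
    "TPX2": 0.985,
    "MCM4": 0.992,
}

# The 0.95 cutoff is a hypothetical choice, not a value taken from the paper.
fdr, n_selected = posterior_expected_fdr(list(table3_subset.values()), 0.95)
print(f"{n_selected} genes selected, posterior expected FDR = {fdr:.4f}")
```

Raising the threshold shortens the list and lowers the expected FDR, which is the trade-off behind reporting the list with the lowest FDR level.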


Breast cancer has been classified into five or more subtypes based on gene expression profiles, and each subtype has distinct biological features and clinical outcomes. Among these subtypes, the basal-like tumor is associated with a poor prognosis and lacks therapeutic targets. MYC is overexpressed in the basal-like subtype and may serve as a target for this aggressive subtype of breast cancer. The tumor suppressor BRCA1 inhibits MYC's transcriptional and transforming activity [25]. Loss of BRCA1 together with MYC overexpression leads to the development of breast cancer, especially basal-like breast cancer. As a downstream effector of the estrogen receptor and epidermal growth factor receptor family pathways, MYC may contribute to resistance to adjuvant therapy. Targeting MYC-regulated pathways in combination with inhibitors of other oncogenic pathways may provide a promising therapeutic strategy for breast cancer, the basal-like subtype in particular [26].

As far as the model is concerned, the procedure has a few possible weaknesses, mainly in the prior specification for the parameters related to differential expression and prediction. We were dealing with highly parametrized models and data sets with few observations, which is why we chose simpler specifications in order to achieve faster MCMC convergence. Some modifications of our prior specifications remain to be implemented, since the literature now offers more efficient approaches to sparsity, such as the horseshoe prior [27].
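
To make the horseshoe prior mentioned above concrete, here is a minimal sketch that draws coefficients from it using only the Python standard library. It assumes the usual formulation of [27], in which each coefficient is normal with a half-Cauchy local scale, and it fixes the global scale `tau` at 1. The function name, sample size, and seed are hypothetical, and nothing here reproduces the priors actually used in this paper.

```python
import math
import random


def sample_horseshoe(n, tau=1.0, seed=0):
    """Draw n (beta, kappa) pairs from a horseshoe prior.

    lambda_i ~ half-Cauchy(0, 1)          (local scale)
    beta_i   ~ Normal(0, (lambda_i*tau)^2)
    kappa_i  = 1 / (1 + (lambda_i*tau)^2) (shrinkage weight, unit noise variance)
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Inverse-CDF draw of a standard Cauchy, folded to the positive half-line.
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))
        scale = lam * tau
        beta = rng.gauss(0.0, scale)
        kappa = 1.0 / (1.0 + scale * scale)
        draws.append((beta, kappa))
    return draws


draws = sample_horseshoe(20000)
kappas = [k for _, k in draws]
extreme = sum(1 for k in kappas if k < 0.1 or k > 0.9) / len(kappas)
middle = sum(1 for k in kappas if 0.45 <= k <= 0.55) / len(kappas)
print(f"share of kappa in the two tails (<0.1 or >0.9): {extreme:.3f}")
print(f"share of kappa in the middle band [0.45, 0.55]: {middle:.3f}")
```

With `tau = 1` the shrinkage weight follows a Beta(1/2, 1/2) distribution, so about 41% of its mass sits in the two tails counted above, against 20% under a uniform. That U shape is what lets the prior shrink noise coefficients almost to zero while leaving large signals nearly untouched.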

It was also very hard to compare the performance of our models with other methods, either because code was unavailable or because little work exists on prediction using integrated genomic platforms. We therefore chose a simple LASSO logistic regression as a comparator. It turned out to be a poor fit for this particular data set, mainly because of the high correlation among the covariates.
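
For readers unfamiliar with the comparator, the following is a minimal sketch of L1-penalized (LASSO) logistic regression fitted by proximal gradient descent, written with the standard library only. The toy data are entirely synthetic: two predictors are noisy copies of one latent signal, standing in for the copy number and expression of the same gene, and a third is pure noise. The function names, step size, penalty values, and data-generating choices are all assumptions for illustration; the software and tuning used for the comparison in this paper are not described in this section.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimise mean log-loss + lam * ||beta||_1 by proximal gradient descent.

    The intercept b0 is not penalised. Returns (b0, beta).
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (fitted probability minus label) at the current iterate.
        resid = [sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * sum(resid) / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


def make_toy_data(n=120, seed=1):
    """Hypothetical data: x1, x2 are noisy copies of a latent z; x3 is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x1 = z + 0.1 * rng.gauss(0.0, 1.0)
        x2 = z + 0.1 * rng.gauss(0.0, 1.0)
        x3 = rng.gauss(0.0, 1.0)
        X.append([x1, x2, x3])
        y.append(1 if rng.random() < sigmoid(1.5 * z) else 0)
    return X, y


X, y = make_toy_data()
for lam in (0.01, 0.1, 10.0):
    b0, beta = lasso_logistic(X, y, lam)
    print(f"lam={lam}: beta={[round(b, 3) for b in beta]}")
```

With a very large penalty every coefficient is set exactly to zero, and with a small one the fit relies on the correlated pair. How the penalty divides weight between two nearly collinear predictors can be unstable, which is one plausible way strong correlation hurts this kind of model.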

Future work includes the development of models for the integration of three or more platforms, and the extension to new types of genomic data, such as next-generation sequencing (NGS) data. In the latter case, the main challenge is the inclusion of a model for the count data from the NGS experiment. A natural statistical approach to such an extension would be a graphical model with network priors, in which each platform is treated as a node and edges between nodes are interpreted as dependence between platforms.

Finally, this entire project focused on a specific data set with rather particular features. The natural hierarchical structure and the correlation between DNA and RNA make it hard to envision applying our methodology to different problems. One interesting direction could be the demographic sciences, where a similar hierarchical structure could be found, for example, in data at the country and regional levels.

Funding Statement

The research was funded by the parent grant NIH R01 CA132897 and no additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray J, et al. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumor. Science 258: 818–821.
2. Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, et al. (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20: 399–407.
3. Pinkel D, Albertson DG (2005) Array comparative genomic hybridization and its applications in cancer. Nature Genetics 23: 41–46.
4. Snijders A, Nowak N, Segraves R, Blackwood S, Brown N, et al. (1998) Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 29: 263–264.
5. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, et al. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics 20: 207–211.
6. Do KA, Müller P, Tang F (2005) A Bayesian Mixture Model for differential gene expression. Journal of the Royal Statistical Society C 54: 627–644.
7. Fridlyand J, Snijders A, Pinkel DG, Jain AN (2004) Application of Hidden Markov Models to the analysis of the array CGH data. Journal of Multivariate Analysis 90: 132–153.
8. Baladandayuthapani V, Ji Y, Talluri R, Nieto-Barajas LE, Morris JS (2010) Bayesian Random Segmentation Models to Identify Shared Copy Number Aberrations for Array CGH Data. JASA 105: 1358–1375.
9. Pollack J, Sorlie T, Perou C, Ress C, Jeffrey S, et al. (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumor. Proceedings of the National Academy of Sciences 99: 12963–12968.
10. Van Wieringen W, Wiel MA (2009) Nonparametric testing for DNA copy number induced differential mRNA gene expression. Biometrics 65: 19–29.
11. Choi H, Qin ZS, Ghosh D (2010) A double-layered mixture model for the joint analysis of DNA copy number and gene expression data. Journal of Computational Biology 17: 121–137.
12. Tsang RY, Finn RS (2012) Beyond trastuzumab: novel therapeutic strategies in HER2-positive metastatic breast cancer. Br J Cancer 106(1): 6–13.
13. Verma S, Miles D, Gianni L, Krop IE, Welslau M, et al. (2012) Trastuzumab emtansine for HER2-positive advanced breast cancer. N Engl J Med 367(19): 1783–91.
14. Andre F, Job B, Tordai A, Michiels S, Liedtke C, et al. (2009) Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array. Clin Cancer Res 15: 441–451.
15. Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E (2002) A statistical framework for expression-based molecular classification in cancer. Journal of Royal Statistical Society B 64: 717–736.
16. Bonnefoi H (2007) Validation of gene signatures that predict the response of breast cancer to neoadjuvant chemotherapy: a substudy of the EORTC 10994/BIG 00-01 clinical trial. Lancet Oncology 8(12): 1071–1078.
17. Efron B, Tibshirani R (2006) On testing the significance of sets of genes. The Annals of Applied Statistics 1: 101–129.
18. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57: 289–300.
19. Storey JD (2002) A direct approach to false discovery rate. Journal of Royal Statistical Society B 64: 479–498.
20. Storey JD (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 31: 2013–2035.
21. Müller P, Parmigiani G, Rice K (2007) FDR and Bayesian multiple comparison. In: Bayesian Statistics, Oxford University Press.
22. Lloyd CJ, Yong Z (1999) Kernel estimators of the ROC curve are better than empirical. Statistics & Probability Letters 44: 221–228.
23. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. Second Edition. Springer: Canada.
24. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58 (1): 267–288.
25. Xu J, Chen Y, Olopade OI (2010) MYC and breast cancer. Genes & cancer 1 (6): 629–640.
26. Horiuchi D, Kusdra L, Huskey NE, Chandriani S, Lenburg ME, et al. (2012) MYC pathway activation in triple-negative breast cancer is synthetic lethal with CDK inhibition. Journal Exp Med 209 (4): 679–96.
27. Carvalho CM, Polson NG, Scott JG (2010) The horseshoe estimator for sparse signal. Biometrika 1: 1–16.

