Abstract
DNA copy number variations (CNVs) have been shown to be associated with cancer development and progression. The detection of these CNVs has the potential to impact the basic knowledge and treatment of many types of cancers, and can play a role in the discovery and development of molecular-based personalized cancer therapies. One of the most common types of high-resolution chromosomal microarrays is array-based comparative genomic hybridization (aCGH), which assays DNA CNVs across the whole genomic landscape in a single experiment. In this article we propose methods to use aCGH profiles to predict disease states. We employ a Bayesian classification model and treat disease states as outcome, and aCGH profiles as covariates in order to identify significant regions of the genome associated with disease subclasses. We propose a principled two-stage method where we first make inferences on the underlying copy number states associated with the aCGH emissions based on hidden Markov model (HMM) formulations to account for serial dependencies in neighboring probes. Subsequently, we infer associations with disease outcomes, conditional on the copy number states, using Bayesian linear variable selection procedures. The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their aCGH profiles. Using simulated datasets, we investigate the method’s accuracy in detecting disease category. Our methodology is motivated by and applied to a breast cancer dataset consisting of aCGH profiles assayed on patients from multiple disease subtypes.
Keywords: breast cancer, classification, Bayesian network, hidden Markov model
Introduction
DNA copy number variations (CNVs) have been shown to be associated with cancer development and progression.1 Somatic CNVs can lead to tumorigenesis. For example, loss of copy numbers for tumor suppressor genes or amplification for oncogenes both lead to cancer. The detection of these CNVs has the potential to impact the basic knowledge and treatment of many types of cancers, and can play a role in the discovery and development of molecular-based personalized cancer therapies.2
In the early years, cytogeneticists were limited to visually examining whole genomes with a microscope, a technique known as karyotyping or chromosome analysis. In the mid-1970s and 1980s, the development and application of molecular diagnostic methods such as Southern blots, polymerase chain reaction (PCR), and fluorescence in situ hybridization (FISH) allowed clinical researchers to make many important advances in genetics, including clinical cytogenetics. However, these techniques have several limitations. First, they are very time consuming and labor intensive, and only a limited number of chromosomal regions can be tested simultaneously. Further, because the probes are targeted to specific chromosome regions, the analysis requires prior knowledge of an abnormality and is of limited use for screening complex karyotypes. More recently, scientists have developed techniques, called chromosomal microarrays, that integrate aspects of both traditional and molecular cytogenetic techniques.3 These high-throughput high-resolution microarrays have allowed researchers to diagnose numerous subtle genome-wide chromosomal abnormalities that were previously undetectable and find many cytogenetic abnormalities in part or all of a single gene. Such information helps biologists detect new genetic disorders and provides a better understanding of the pathogenetic mechanisms of many chromosomal aberrations.
One of the most common types of high-resolution chromosomal microarrays is array-based comparative genomic hybridization (aCGH), which assays DNA CNVs across the whole genomic landscape in a single experiment.4 With aCGH, differentially labeled test and reference samples’ genomic DNAs are cohybridized to normal chromosomes, and fluorescence intensities/ratios along the length of chromosomes provide a cytogenetic representation of the relative DNA CNV across the whole genome. Whereas early aCGH arrays were mainly used in research settings, recent improvements in algorithms for aCGH data analysis as well as rapidly falling costs now enable clinical applications of aCGH arrays, particularly as a diagnostic tool in the study of cancer genomics.2
In this article, we propose methods to use aCGH profiles to predict disease states. We employ a Bayesian classification model, and treat disease states as outcome and aCGH profiles as covariates, to identify significant regions of the genome associated with disease subclasses. Statistical challenges for aCGH classification include not only high dimensionality (ie, a large number, in the tens of thousands, of probes together with a relatively small number of samples) but also, more importantly, the presence of serial correlation among the features: nearby probes (by genomic location) tend to be highly correlated. Classical methods usually used for multivariate classification of high-dimensional genomic data, eg, penalized approaches (Zhu and Hastie5 and the references therein), do not account for the specific structure of aCGH data, as they ignore the serial dependence in the probes. To exploit the serial genomic information, typical approaches first segment the data6 and then conduct downstream classification. Alternative methods are based on kernel techniques such as the support vector machine (SVM)7 and variants of it that exploit genomic continuity.8 While offering excellent prediction capabilities, these methods do not explicitly utilize the inherent discrete nature of the latent copy number states (gain/loss/normal) in their variable selection procedures; doing so is one of the primary aims of this article.
In the Bayesian framework, several innovative variable selection strategies have been developed in various contexts, with reasonable degrees of success. Some of these approaches can be regarded as linear variable selection methods. These include stepwise selection,9 penalized regression approaches such as lasso (and its variants),10 and non-concave penalized likelihood approaches.11 The technique applied in this paper is based on Bayesian linear variable selection approaches, including spike and slab mixture priors,12 stochastic search variable selection,13 Gibbs-based variable selection,14 Bayesian model averaging,15,16 and indicator priors.17 The stochastic search variable selection approach of George and McCulloch13 has been extended to multivariate settings by Brown et al.18 and to generalized linear mixed models by Cai and Dunson.19 Effective variable selection methods have also been developed for multinomial probit models by Sha et al.20 and for microarray data with censored outcomes by Lee and Mallick21 and Sha et al.22 However, none of these approaches account for natural spatial/serial dependency in the covariates (as in our case) – which might lead to biased estimates.
In this article we propose a principled two-stage method for disease classification using covariates exhibiting serial dependence. In general, the technique is applicable to datasets having the following structure. For individuals i = 1, …, n, we have (i) two disease categories coded as the binary response yi and (ii) aCGH emissions ei1, …, eip corresponding to p probes, with p typically being much larger than n. The analysis broadly consists of two stages. In Stage 1, we make inferences on underlying copy number states associated with the aCGH emissions based on hidden Markov model (HMM) formulations23 to account for serial dependencies. Subsequently in Stage 2, we analyze the model parameters associated with the binary responses, conditional on the parameters discovered in Stage 1, using Bayesian linear variable selection procedures. In particular, we select the aCGH probes having a linear regression relationship with the disease categories. The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their aCGH emissions. Our methodology is motivated by and applied to a dataset consisting of 111 breast cancer patients24 and falling into two disease subgroups, ER+ and triple negative (TN). There are 56 TN patients and 55 ER+ patients. For each patient, DNA copy number data were generated using Agilent 4x44K CGH arrays (available at ArrayExpress accession number E-TABM-484).
The remainder of the paper is organized as follows. Section 2 provides details of the model for the two-stage analysis. Section 3 develops the posterior inference and prediction technique based on Markov chain Monte Carlo (MCMC) methods. In Section 4, using simulated datasets, we investigate the method’s accuracy in detecting disease category. Finally, Section 5 analyzes the motivating breast cancer dataset and makes test case predictions.
Model
Our modeling framework consists of two stages: In Stage 1, we model the aCGH emissions, relying on HMMs to account for the serial correlations among the emissions. Then, in Stage 2, the relationship between the HMM parameters and the subject-specific binary responses is specified using a probit regression model and the latent indicator variables using the approaches proposed by George and McCulloch,13 Kuo and Mallick,17 and Brown et al.18 We expound on each of these below.
Stage 1: relationship between aCGH emissions and latent copy number states
For subjects i = 1, …, n and probes j = 1, …, p, we have the binary responses y1, …, yn representing the two disease subcategories and the set of real-valued aCGH emissions {eij}. Let sij ∈ {−1, 0, +1} be a latent variable called the copy number state, representing a loss, no change, and gain in copy number for individual i at probe j. The copy number state is inferred using a Bayesian HMM that accounts for the serial correlations of the aCGH emissions.
Similarly to Guha et al.,23 conditional on sij, the aCGH emissions are assumed to be normally distributed:

$$e_{ij} \mid s_{ij} = s \;\sim\; N(\mu_s, \sigma_s^2), \qquad s \in \{-1, 0, +1\},$$

where, because of the specific biological interpretations associated with the HMM states, we assume that μ−1 < μ0 < μ+1. This assumption also prevents label switching, a well-known problem with mixture models, thereby making inferences even more efficient. The latent states si1, …, sip are assumed to follow a three-state HMM with stationary transition probability matrix A = ((aut))3×3 having row sums ∑t = 1,2,3aut = 1 for u = 1, 2, 3. That is, P[si,j+1 = t | sij = u] = aut for j = 1, …, (p − 1). To further facilitate inferences of the state-specific parameters, informative conjugate priors are assigned to the parameters of the normal distribution, ie, μs and σs for s ∈ {−1, 0, +1}. Refer to Guha et al.23 for further details about MCMC inference of the underlying copy number states of the probes for the individuals. The technique developed in that paper is applied to infer the latent copy number states (gain/loss/normal) si1, …, sip for subjects i = 1, …, n that are subsequently used in Stage 2 below.
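To make the Stage 1 structure concrete, the following minimal sketch decodes copy number states from one individual's emissions. It is not the Bayesian MCMC procedure of Guha et al.23; it swaps in plain Viterbi decoding with fixed, purely illustrative parameter values (the means, standard deviations, transition matrix, and initial distribution below are all assumptions), and only illustrates how a three-state Gaussian HMM with ordered means uses neighboring probes.

```python
import math

STATES = (-1, 0, 1)  # loss, copy neutral, gain


def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def viterbi_copy_states(emissions, mu, sigma, trans, init):
    """Most probable state path for one individual's probe-ordered emissions."""
    k = len(STATES)
    # score[u]: best log-probability of any path ending in state index u
    score = [math.log(init[u]) + log_normal_pdf(emissions[0], mu[u], sigma[u])
             for u in range(k)]
    back = []
    for e in emissions[1:]:
        new_score, pointers = [], []
        for t in range(k):
            best_u = max(range(k), key=lambda u: score[u] + math.log(trans[u][t]))
            pointers.append(best_u)
            new_score.append(score[best_u] + math.log(trans[best_u][t])
                             + log_normal_pdf(e, mu[t], sigma[t]))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(k), key=lambda u: score[u])
    path = [last]
    for pointers in reversed(back):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return [STATES[u] for u in path]


# Illustrative (assumed) parameters with ordered means mu_{-1} < mu_0 < mu_{+1}.
mu = [-0.5, 0.0, 0.5]
sigma = [0.15, 0.15, 0.15]
trans = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
init = [1 / 3, 1 / 3, 1 / 3]
emissions = [0.02, -0.05, -0.48, -0.55, -0.51, 0.03, 0.47, 0.52]
states = viterbi_copy_states(emissions, mu, sigma, trans, init)
```

In a fully Bayesian treatment the state-specific parameters and the transition matrix would be sampled rather than fixed, and the states would be summarized from their posterior distribution.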
Stage 2: relationship between disease classification and latent copy number states
In the second stage of the analysis, we model the relationship between the disease category and latent copy number states of the genomic probes for each individual. These values are copy number states inferred from analysis in Section 2.1.
Let I(sij = −1) and I(sij = +1) be indicator functions of loss and gain. To simplify the notation, for subjects i = 1, …, n, we collectively represent the vector of 2p covariates as wi = (wi1, …, wi,2p)′. For covariate j = 1, …, 2p, averaging over the individuals, let w̄j = (1/n) ∑i wij. Centering and scaling over the n individuals, we transform the covariates as follows:

$$v_{ij} = \frac{w_{ij} - \bar{w}_j}{\sqrt{\sum_{k=1}^{n} (w_{kj} - \bar{w}_j)^2}},$$

with vij = 0 for all i when covariate j is constant across the individuals. This scaling gives ∑i vij² = 1 for every non-constant covariate, which is the normalization required by Theorem 7.1 of the Appendix.
Let Q be the set of covariates j for which wij assumes at least two distinct values. That is, Q = {j : wij ≠ wkj for some i ≠ k}. Because the variables vij are centered, j ∉ Q if and only if v1j = … = vnj = 0.
A key assumption of our model is that probes that do not belong to Q, ie, for which wij does not assume at least two distinct values, are not predictive of disease subcategory, although the probes could possibly be predictive of the disease. For this reason, we identify Q as the set of potential predictors of disease subcategory and write q = |Q| ≤ 2p. We discard all probes j ∉ Q, relabeling the variables {vij: j ∈ Q} as {xij: j = 1, …, q}.
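The covariate construction can be sketched as follows. This is a minimal illustration under stated assumptions: the ordering of the 2p indicators (all losses first, then all gains) and the unit-norm scaling (chosen to match the condition xTx = 1 in the Appendix theorem) are assumptions, and the function name `build_covariates` is hypothetical.

```python
import math


def build_covariates(states):
    """Turn an n x p table of copy number states into centered, scaled covariates.

    Returns (x, kept): x is an n x q table and kept lists the indices (out of 2p)
    of the indicator covariates with at least two distinct values, i.e. the set Q.
    """
    n, p = len(states), len(states[0])
    # Assumed ordering: p loss indicators followed by p gain indicators.
    w = [[1.0 if row[j] == -1 else 0.0 for j in range(p)]
         + [1.0 if row[j] == 1 else 0.0 for j in range(p)]
         for row in states]
    kept, columns = [], []
    for j in range(2 * p):
        col = [w[i][j] for i in range(n)]
        mean = sum(col) / n
        centered = [c - mean for c in col]
        norm = math.sqrt(sum(c * c for c in centered))
        if norm > 0:  # not constant across individuals, so j belongs to Q
            kept.append(j)
            columns.append([c / norm for c in centered])
    x = [[col[i] for col in columns] for i in range(n)]
    return x, kept


# Three individuals, three probes (toy data).
toy_states = [[-1, 0, 1], [0, 0, 1], [-1, 0, 0]]
x, kept = build_covariates(toy_states)
```

For test-case prediction, the training-sample means and scaling constants would also need to be stored so that new profiles can be transformed identically.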
For individuals i = 1, …, n, we assume the probit regression model proposed by Albert and Chib25:

$$y_i = I(z_i > 0), \qquad z_i = \beta_0 + \sum_{j=1}^{q} x_{ij}\beta_j + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, 1). \tag{1}$$

For the intercept β0, we assume the prior β0 ∼ N(0, τ0²). Let γ = (γ1, …, γq)′ be i.i.d. Bernoulli variables with P[γj = 1] = ω, where ω is expected to be relatively small and is assigned the uniform prior on (0, 0.1). The remaining coefficients in (1) are independently distributed as

$$\beta_j \mid \gamma_j \;\sim\; \gamma_j\, N(0, \tau^2) + (1 - \gamma_j)\, \delta_0, \qquad j = 1, \ldots, q,$$

where δ0 denotes the point mass at 0. In other words, each probe is predictive of disease classification with probability ω. We assume independent exponential priors with mean 1 for τ0−2 and τ−2.
Gibbs Sampling Procedure
Let ρ = 1 + ∑j γj be the random number of variables (including the intercept β0) that participate in the disease classification. Let rij = zi − ∑k≠jxikβk for i = 1, …, n. For a set of numbers {θij: i = 1, …, n, j = 1, …, q}, let θj represent the vector (θ1j, …, θnj)′ for probe j = 1, …, q.
Although the Gibbs sampler is conceptually straightforward, updating γ can be computationally intensive for large q. The step is described as follows. For probe j = 1, …, q, let β−j represent the set of regression coefficients excluding βj. With In denoting the identity matrix of order n and Bj = In + τ²xjxjT, the posterior probability P[γj | β−j, ω, rj] is proportional to (1 − ω) · Nn(rj | 0, In) when γj = 0 and is proportional to ω · Nn(rj | 0, Bj) when γj = 1. The density Nn(rj | 0, In) can be quickly computed even in large problems. However, the density Nn(rj | 0, Bj) involves the inversion and determinant calculation for the non-diagonal matrix Bj. Because this must be repeated for every probe j at every iteration, it can be computationally expensive, or can at least require large amounts of memory, when q is large. Theorem 7.1 of the Appendix exploits the structure of Bj to drastically simplify the computation. For probe j = 1, …, q, let

$$\phi_j = \left\{(1+\tau^2)^{-1/2} - 1\right\} x_j^T r_j, \qquad h_j = \phi_j x_j + r_j, \qquad L_{j1} = h_j^T h_j. \tag{2}$$

Applying Theorem 7.1, we have det(Bj) = 1 + τ², and Nn(rj | 0, Bj) is proportional to (1 + τ²)−1/2 exp(−Lj1/2). The calculation is feasible even for large q.
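The computational saving can be checked numerically. The sketch below is an illustration, not the authors' code: it evaluates log Nn(r | 0, In + τ²xxT) for a unit-norm x in O(n) operations through a vector of the form h = ϕx + r, with ϕ taken as {(1 + τ²)−1/2 − 1}xTr (derived here from the rank-one structure), and cross-checks it against the Sherman–Morrison form of the inverse, B−1 = In − τ²/(1 + τ²) xxT. Function names are hypothetical.

```python
import math


def log_density_rank_one(r, x, tau2):
    """log N_n(r | 0, I + tau2 * x x^T) for unit-norm x, without forming the matrix."""
    n = len(r)
    xr = sum(a * b for a, b in zip(x, r))
    phi = (1.0 / math.sqrt(1.0 + tau2) - 1.0) * xr
    h = [phi * a + b for a, b in zip(x, r)]   # h equals B^{-1/2} r
    quad = sum(v * v for v in h)              # equals r^T B^{-1} r
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log1p(tau2) - 0.5 * quad


def log_density_sherman_morrison(r, x, tau2):
    """Same density via B^{-1} = I - tau2 / (1 + tau2) * x x^T (cross-check)."""
    n = len(r)
    xr = sum(a * b for a, b in zip(x, r))
    rr = sum(b * b for b in r)
    quad = rr - tau2 / (1.0 + tau2) * xr * xr
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log1p(tau2) - 0.5 * quad


x_demo = [0.6, 0.8, 0.0]          # unit norm: 0.36 + 0.64 = 1
r_demo = [0.5, -1.0, 2.0]
fast = log_density_rank_one(r_demo, x_demo, 2.5)
check = log_density_sherman_morrison(r_demo, x_demo, 2.5)
```

Both routines cost O(n) per probe, so a sweep over all q probes costs O(nq) rather than the O(n³q) of repeated dense inversions.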
Outline of procedure
Let F·I(c, d) denote the distribution F restricted to the interval (c, d). The Gibbs sampler consists of the following steps:
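- Applying Theorem 7.1, the binary indicators for probes j = 1, …, q are updated as follows:

$$P[\gamma_j = 1 \mid \beta_{-j}, \omega, r_j] = \frac{\omega\,(1+\tau^2)^{-1/2}\, e^{-L_{j1}/2}}{\omega\,(1+\tau^2)^{-1/2}\, e^{-L_{j1}/2} + (1-\omega)\, e^{-L_{j0}/2}},$$

where Lj0 = rjTrj and Lj1 is as defined in (2).
- Writing xi = (1, xi1, …, xiq)T for individuals i = 1, …, n, the subject-specific latent variables z are independently distributed as

$$z_i \mid y_i, \beta \;\sim\; \begin{cases} N(x_i^T\beta, 1)\cdot I(0, \infty) & \text{if } y_i = 1, \\ N(x_i^T\beta, 1)\cdot I(-\infty, 0) & \text{if } y_i = 0. \end{cases}$$

- Let βI be the elements of β corresponding to the intercept and to the set of probes j for which γj = 1. Then βI is a vector of length ρ. Vector βI is jointly updated as

$$\beta_I \mid z, \gamma, \tau_0, \tau \;\sim\; N_\rho\!\left(\Sigma_I U_I^T z,\; \Sigma_I\right),$$

where UI is an n × ρ matrix with the first column equal to a vector of n 1’s and the remaining columns equal to the vectors xj for which γj = 1. The variance matrix is ΣI = (UITUI + D−1)−1, where D = diag(τ0², τ², …, τ²).
- τ0−2 | β0 is distributed as gamma (3/2, 1 + β0²/2), parameterized by shape and rate.
- τ−2 | βI, γ is distributed as gamma ((ρ + 1)/2, 1 + ½ ∑j: γj = 1 βj²), parameterized by shape and rate.
- ω | γ is distributed as beta (ρ, q − ρ + 1) · I(0, 0.1).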
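Two of the steps above can be sketched in a few lines. The code is a hedged illustration under the assumption of a N(0, τ²) slab and unit-norm covariates, not the authors' implementation: `prob_gamma_one` computes the inclusion probability from the log ratio of the two candidate densities of rj, and `sample_latent_z` draws a truncated normal latent variable by inverse-CDF sampling. Both function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_gamma_one(r_j, x_j, tau2, omega):
    """P[gamma_j = 1 | rest] for unit-norm x_j under an assumed N(0, tau2) slab."""
    xr = sum(a * b for a, b in zip(x_j, r_j))
    # log of N(r | 0, I + tau2 x x^T) / N(r | 0, I), from the rank-one identity
    log_ratio = -0.5 * math.log1p(tau2) + 0.5 * tau2 / (1.0 + tau2) * xr * xr
    log_odds = math.log(omega) - math.log1p(-omega) + log_ratio
    if log_odds >= 0:                     # numerically stable logistic function
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def sample_latent_z(mean, y, rng):
    """Draw z ~ N(mean, 1) restricted to (0, inf) if y == 1, else to (-inf, 0).

    Inverse-CDF sampling; adequate for moderate |mean|, while extreme tails
    would need a dedicated truncated normal sampler.
    """
    p_neg = STD_NORMAL.cdf(-mean)         # P[z < 0]
    u = rng.uniform(p_neg, 1.0) if y == 1 else rng.uniform(0.0, p_neg)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside its open domain
    return mean + STD_NORMAL.inv_cdf(u)


rng = random.Random(0)
p_incl = prob_gamma_one([0.0, 3.0], [0.0, 1.0], 4.0, 0.05)
gamma_j = 1 if rng.random() < p_incl else 0
z_pos = sample_latent_z(0.3, 1, rng)
z_neg = sample_latent_z(-1.2, 0, rng)
```

Working on the log-odds scale avoids underflow when the residual vector is long, which matters because the two densities are compared for every probe at every iteration.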
Test case predictions
Suppose we have the aCGH profiles of n* additional test case individuals from the same hypothetical disease population. Using the within-variable means and variances of the training sample, we transform the aCGH profiles to obtain the covariates xi∗ = (1,xi∗1,…,xi∗q)T for individuals i* = 1, …, n* belonging to the test sample. Let D represent the training set data. The posterior probability that individual i* belongs to disease category 1 is

$$P[y_{i^*} = 1 \mid x_{i^*}, D] = \int \Phi\!\left(x_{i^*}^T \beta\right) [\beta \mid D]\, d\beta,$$

where Φ denotes the standard normal distribution function. A consistent (in simulation size) estimate of this probability is then

$$\hat{P}[y_{i^*} = 1 \mid x_{i^*}, D] = \frac{1}{M} \sum_{t=1}^{M} \Phi\!\left(x_{i^*}^T \beta^{(t)}\right),$$

where β(t) is the value of β generated at the tth of the M stored MCMC iterates. We declare the disease category of the test case individual labeled i* as
| (3) |
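The prediction step amounts to averaging probit probabilities over stored posterior draws. The sketch below assumes the draws of β are available as a list of coefficient vectors aligned with xi∗, and it assumes a cutoff of 0.5 on the estimated probability; the cutoff and the function name `predict_category` are assumptions for illustration.

```python
from statistics import NormalDist


def predict_category(x_star, beta_draws, threshold=0.5):
    """Monte Carlo estimate of P[y* = 1 | x*, D] and the implied category.

    x_star: covariate vector with a leading 1 for the intercept.
    beta_draws: stored MCMC draws of beta, each aligned with x_star.
    threshold: assumed decision cutoff on the estimated probability.
    """
    phi = NormalDist().cdf
    probs = [phi(sum(a * b for a, b in zip(x_star, beta))) for beta in beta_draws]
    p_hat = sum(probs) / len(probs)
    return p_hat, int(p_hat > threshold)


p_hat, label = predict_category([1.0, 2.0], [[1.0, 1.0], [0.5, 1.5]])
```

Averaging Φ(xTβ) over draws, rather than plugging in a single posterior mean of β, propagates the uncertainty about which probes are selected into the predicted probability.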
Simulation Study
We generated a training sample consisting of aCGH profiles on p = 2000 probes for n = 100 individuals. The individuals were regarded as random draws from a disease population where 100 × (1 − p*) = 25% of the individuals had “disease 0” and the remaining 100 × p* = 75% individuals had “disease 1,” so that p* = 0.75 represented the prior probability of disease 1 in the population.
Disease 0 was assumed to be characterized by losses (s = −1) from probes 201 to 400 and gains (s = 1) from probes 1401 to 1800. Disease 1 was characterized by losses from probes 301 to 500 and also from probes 1601 to 1800. The remaining probes were assigned a copy number state of 0. For each disease subcategory, we randomly selected 10% of the probes that were associated with the disease and randomly set their copy number states to be copy neutral, gains, or losses with equal probability. Additionally, random noise at the probe level was then added to the profiles by selecting 2% (ie, 4000) of the remaining probes and randomly changing their copy number states. These values constituted the variables sij in Stage 2 of the Section 2 model, and were assumed to be known in the simulation.
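A design of this kind can be simulated as sketched below. Several details are assumptions made for illustration, since the description leaves them open: perturbations are drawn separately for each individual, the 2% noise is applied to each individual's non-associated probes, and a noisy probe is switched to a gain or a loss with equal probability. Function names are hypothetical.

```python
import random

P, N, P_STAR = 2000, 100, 0.75


def disease_template(disease):
    """Noise-free copy number states for one disease category (0-based probes)."""
    states = [0] * P
    if disease == 0:
        for j in range(200, 400):      # losses at probes 201-400
            states[j] = -1
        for j in range(1400, 1800):    # gains at probes 1401-1800
            states[j] = 1
    else:
        for j in range(300, 500):      # losses at probes 301-500
            states[j] = -1
        for j in range(1600, 1800):    # losses at probes 1601-1800
            states[j] = -1
    return states


def simulate_individual(disease, rng, flip_frac=0.10, noise_frac=0.02):
    """Perturb the template: reset 10% of associated probes, add 2% noise elsewhere."""
    states = disease_template(disease)
    associated = [j for j, s in enumerate(states) if s != 0]
    others = [j for j, s in enumerate(states) if s == 0]
    for j in rng.sample(associated, int(flip_frac * len(associated))):
        states[j] = rng.choice((-1, 0, 1))   # neutral, gain, or loss, equally likely
    for j in rng.sample(others, int(noise_frac * len(others))):
        states[j] = rng.choice((-1, 1))      # assumed: noise turns neutral into gain/loss
    return states


rng = random.Random(0)
diseases = [1 if rng.random() < P_STAR else 0 for _ in range(N)]
profiles = [simulate_individual(d, rng) for d in diseases]
```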
As described in Section 2, the variables were then transformed to obtain the covariates wij and vij for i = 1, …, n and j = 1, …, 2p. The set Q was evaluated to identify q = 2571 probes for which the individuals had at least two distinct values. These variables were relabeled as {xij:j = 1, …, q}, and the remaining variables were discarded. The model was fit using the Gibbs sampler of Section 3. An initial set of 10,000 samples was run as burn-in to allow the MCMC chain to forget its starting values. A 1-in-10 subsample of M = 100,000 additional draws was stored for posterior inferences. Figure 1 presents histograms for the marginal posteriors of the intercept β0, standard deviations τ0 and τ, and Bernoulli probability ω, which are used in the sequel to make predictions for the disease categories of the test case individuals.
Figure 1.
Histogram of selected model parameters for the simulation study.
We evaluated the success of the predictive ability of our approach by drawing 50 independent test samples of n* = 200 individuals from the same hypothetical disease population and generating their aCGH profiles based on their disease categories. Exactly 50 of these 200 test case individuals had disease 0, and the remaining 150 individuals had disease 1. Using the within-variable means and variances of each training sample, we transformed the aCGH profiles to obtain the covariates xi∗ = (1,xi∗1,…,xi∗q)T for individuals i* = 1, …, n* belonging to the test sample of each of the 50 datasets.
For each dataset, using the stored MCMC sample of size M = 100,000 and as described in Section 3, we computed the posterior probability of disease 1 for each of the n* = 200 individuals. The estimated disease categories for the n* = 200 individuals were computed as in (3). These estimates versus the true disease categories yi∗ are summarized in Table 1. The table reveals the remarkable accuracy of the proposed methodology in detecting disease category. Specifically, for all 50 datasets, the technique resulted in perfect disease prediction with no misclassification.
Table 1.
For the 200 individuals belonging to each of the 50 test samples of the simulation study, the estimated disease category versus the true category, averaged over the 50 test samples. Perfect classification was obtained for each dataset. As a result, the standard errors shown in parentheses are all zero.
| Truth | Estimated category 0 | Estimated category 1 |
|---|---|---|
| yi∗ = 0 | 50 (0) | 0 (0) |
| yi∗ = 1 | 0 (0) | 150 (0) |
Breast Cancer Data Analysis
We analyzed the breast cancer dataset from Andre et al.24 which consists of n = 111 individuals with either disease subcategory ER+ (label “1”) or TN (label “0”). There are 56 TN and 55 ER+ patients. aCGH emissions for these individuals were available on the same set of p = 42,416 probes along with the probes’ locations. Specifically, the chromosome and the distance in megabases (MB) from a telomere are available for every probe.
As described in Section 2.1, we used this information to first infer the latent copy number states sij of the probes using a Bayesian HMM, where i = 1, …, 111 and j = 1, …, 42,416. Then, as described in Section 2.2, we obtained the indicator functions of gain and loss. These indicator variables were transformed to obtain the covariates wij and vij for i = 1, …, n and j = 1, …, 84,832. The set Q was evaluated to identify q = 5,543 covariates having at least two distinct values for the 111 individuals. These variables were relabeled as {xij: j = 1, …, 5,543} and retained as potential regressors. The remaining variables were discarded because they were unlikely to be associated with the subcategory classification.
To investigate the reliability of the proposed method on this actual dataset, we performed 50 independent replications of the following steps. (i) We randomly split the data into training and test sets in a 4:1 ratio. (ii) We analyzed the disease subcategories and the q = 5,543 covariates of the 89 training set individuals using the Bayesian probit regression model with likelihood function (1). The model was fit using the Gibbs sampler of Section 3. An initial set of 10,000 samples was run as burn-in to allow the MCMC chain to forget its initial values. A 1-in-10 subsample of M = 100,000 additional draws was stored for posterior inferences. (iii) As described in Section 3, we used the q = 5,543 covariates of the 22 test case individuals to predict their disease subcategories. These predictions were compared with the actual disease subcategories of these 22 individuals to compute the classification error rate for the specific training–test case random split. An average of the 50 independent estimates in Step (iii) yielded a simulation-based estimate of the classification error rate for the proposed method. This was estimated to be 22.55% with a standard error of 1.16%.
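The replication scheme can be organized as in the sketch below, which shows only the random 4:1 split and the summary of replicate error rates. The model fit itself is omitted, and the helper names are hypothetical.

```python
import math
import random


def random_split(n, train_frac, rng):
    """Randomly partition indices 0..n-1 into training and test sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(train_frac * n)
    return idx[:n_train], idx[n_train:]


def mean_and_standard_error(values):
    """Average of replicate error rates and the standard error of that average."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var / len(values))


rng = random.Random(1)
train_idx, test_idx = random_split(111, 0.8, rng)   # 89 training, 22 test individuals
```

In a full replication loop, each split would be followed by fitting the model on `train_idx`, predicting for `test_idx`, and appending the resulting misclassification rate to the list passed to `mean_and_standard_error`.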
The significant probes (covariates) that were found to be predictive of disease subtype are plotted in Figures 2–4. We assumed a posterior probability threshold of δ = 0.15 that yielded 500 markers along the entire genome predictive of the disease classification. Figure 2 plots a bar graph of the chromosomal breakdown of these markers. As can be seen, most of the significant markers are located on chromosomes 5, 12, 16, and 17. The corresponding karyograms Figures 3 and 4 show the breakdown on the markers by chromosomal locations for negative (red) and positive (green) associations with the disease states, respectively.
Figure 2.
Number of significant markers broken down for each chromosome.
Figure 3.
Human karyogram with significant locations. This figure is a karyogram that depicts the significant probes identified using our approach. The red color corresponds to negative regression coefficients.
Figure 4.
Human karyogram with significant locations. This figure is a karyogram that depicts the significant probes identified using our approach. The green color corresponds to positive regression coefficients.
Our results are promising based on the locations of selected markers. As noted, most markers are on chromosomes 5, 12, 16, and 17. It has been shown that chromosome 5q deletions are the most frequent aberration in breast tumors from BRCA1 mutation carriers. The deletions in 5q occur at high frequencies on putative tumor suppressor genes such as XRCC4, RAD50, RASA1, APC, and PPP2R2B.26 Chromosome 16q has been a target region for the detection of biomarkers for breast cancer.24 We identified a high concentration of biomarkers in 16q as well. In addition, our flagged biomarkers on chromosome 17 are also convincing, since chromosome 17 is the host for the most famous breast cancer gene BRCA1 as well as ER. Interestingly, little is known about the association of CNVs on chromosome 12 with subgroups of breast cancer. Our findings on chromosome 12 could be potentially new discoveries that might warrant further functional validation.
Conclusions and Discussion
The detection of CNVs in aCGH methods is important for the treatment of many types of cancers, especially in the development of molecular-based personalized cancer therapies. We propose a framework for the prediction of disease types using aCGH profiles. We employ a Bayesian classification model and treat disease states as outcome and aCGH profiles as covariates in order to identify significant regions of the genome associated with disease subclasses. Specifically, we propose a principled two-stage method using the covariates exhibiting serial dependence. Stage 1 makes inferences on the underlying copy number states associated with the aCGH emissions based on HMM formulation. Using Bayesian linear variable selection procedures, Stage 2 detects the model parameters associated with the binary responses, conditional on the parameters of Stage 1.
The selected probes and their effects are parameters that are useful for predicting the disease categories of any additional individuals on the basis of their copy number profiles. A simulation study demonstrates the method’s remarkable accuracy in detecting disease category. The methodology is applied to a breast cancer dataset, and we find several markers that are associated with disease subtype using the copy number profiles. Some of these discoveries confirm existing literature, and novel associations could be potential targets for future validation studies.
Our methods are general and could be potentially applied to SNP arrays as well that yield copy number profiles. A nice generalization of the method would be to incorporate genotype information (eg, allelic frequencies) in the models (especially, Stage 1) that could lead to more refined estimation of the latent copy number states. Furthermore, current technologies enable collection of multiplatform data on matched patient samples such as mRNA expression (eg, The Cancer Genome Atlas (TCGA)) that can be leveraged to provide a more detailed understanding of the biological mechanisms involved in cancer development and progression. We leave these tasks for future consideration.
Appendix
Theorem 7.1: Let x = (x1, …, xn)T be a vector such that xTx = 1. Define the matrices A = xxT and B = In + τ²A. Then the determinant of matrix B is 1 + τ². Given r ∈ Rn, define the vector h = (h1, …, hn)T = ϕ x + r and scalar ϕ = {(1 + τ²)−1/2 − 1} xTr. Let L = hTh = ∑i hi². Then the n-variate normal density

$$N_n(r \mid 0, B) = (2\pi)^{-n/2}\,(1+\tau^2)^{-1/2} \exp(-L/2).$$

Proof. Since A = xxT has rank 1 and xTx = 1, the eigenvalues of A consist of a single 1 and (n − 1) number of 0’s. Furthermore, the eigenvector corresponding to eigenvalue 1 must be x. Let ΛA be the diagonal matrix of the eigenvalues, and P be the matrix of eigenvectors of A. Then A = P ΛA PT.
Since PPT = In and B = In + τ²A, B has the same eigenvectors as A and its eigenvalues are 1 + τ² and (n − 1) number of 1’s. The product of these eigenvalues is

$$\det(B) = 1 + \tau^2. \tag{4}$$

Matrix B−1/2 has the same eigenvectors as B and its eigenvalues are (1 + τ²)−1/2 and (n − 1) number of 1’s. Thus, B−1/2 = In + {(1 + τ²)−1/2 − 1} xxT and B−1/2r = r + ϕ x = h.
Given r ∈ Rn, we have

$$r^T B^{-1} r = \left(B^{-1/2} r\right)^T \left(B^{-1/2} r\right) = h^T h = L. \tag{5}$$

We obtain the result on substituting (4) and (5) in the n-variate normal density.
Footnotes
SUPPLEMENT: Classification, Predictive Modelling, and Statistical Analysis of Cancer Data (A)
ACADEMIC EDITOR: JT Efird, Editor in Chief
FUNDING: This work was supported by the National Science Foundation under award DMS-0906734 to SG. YJ’s research was supported by NIH R01 CA132897. VB’s research is partially supported by NIH grant R01 CA160736 and the Cancer Center Support Grant (CCSG) (P30 CA016672). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation, National Cancer Institute, or the National Institutes of Health.
COMPETING INTERESTS: Authors disclose no potential conflicts of interest.
This paper was subject to independent, expert peer review by a minimum of two blind peer reviewers. All editorial decisions were made by the independent academic editor. All authors have provided signed confirmation of their compliance with ethical and legal obligations including (but not limited to) use of any copyrighted material, compliance with ICMJE authorship and competing interests disclosure guidelines and, where applicable, compliance with legal and ethical guidelines on human and animal research participants. Provenance: the authors were invited to submit this paper.
Author Contributions
Conceived and designed the experiments: SG, YJ, VB. Analyzed the data: SG, VB. Wrote the first draft of the manuscript: SG, YJ, VB. Contributed to the writing of the manuscript: SG, YJ, VB. Agree with manuscript results and conclusions: SG, YJ, VB. Jointly developed the structure and arguments for the paper: SG, YJ, VB. Made critical revisions and approved final version: SG, YJ, VB. All authors reviewed and approved of the final manuscript.
REFERENCES
- 1. Pinkel D, Albertson DG. Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005;37(suppl):S11–7. doi: 10.1038/ng1569.
- 2. Chin L, Hahn WC, Getz G, Meyerson M. Making sense of cancer genomic data. Genes Dev. 2011;25:534–55. doi: 10.1101/gad.2017311.
- 3. Vissers LE, de Vries BB, Veltman JA. Genomic microarrays in mental retardation: from copy number variation to gene, from research to diagnosis. J Med Genet. 2010;47:289–97. doi: 10.1136/jmg.2009.072942.
- 4. Kallioniemi A, Kallioniemi OP, Sudar D, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992;258:818–21. doi: 10.1126/science.1359641.
- 5. Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004;5:427–43. doi: 10.1093/biostatistics/5.3.427.
- 6. Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005;21:4084–91. doi: 10.1093/bioinformatics/bti677.
- 7. Liu J, Ranka S, Kahveci T. Classification and feature selection algorithms for multi-class CGH data. Bioinformatics. 2008;24:86–95. doi: 10.1093/bioinformatics/btn145.
- 8. Rapaport F, Barillot E, Vert JP. Classification of arrayCGH data using fused SVM. Bioinformatics. 2008;24:i375–82. doi: 10.1093/bioinformatics/btn188.
- 9. Peduzzi PN, Hardy RJ, Holford TR. A stepwise variable selection procedure for nonlinear regression models. Biometrics. 1980;36:511–6.
- 10. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–95. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- 11. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann Stat. 2002;30:74–99.
- 12. Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Am Stat Assoc. 1988;83:1023–36.
- 13. George E, McCulloch R. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88:881–9.
- 14. Dellaportas P, Forster JJ, Ntzoufras I. Bayesian Variable Selection using the Gibbs Sampling. New York: Marcel Dekker, Inc; 1982. pp. 273–86.
- 15. Madigan D, Raftery A. Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc. 1994;89:1535–46.
- 16. Volinsky C, Madigan D, Raftery AE, Kronmal RA. Bayesian model averaging in proportional hazard models: assessing the risk of stroke. Appl Stat. 1997;46:433–48.
- 17. Kuo L, Mallick B. Bayesian semiparametric inference for the accelerated failure time model. Can J Stat. 1997;25:457–72.
- 18. Brown PJ, Vannucci M, Fearn T. Multivariate Bayesian variable selection and prediction. J R Stat Soc Series B Stat Methodol. 1998;60:627–41.
- 19. Cai B, Dunson D. Bayesian covariance selection in generalized linear mixed models. Biometrics. 2006;62:446–57. doi: 10.1111/j.1541-0420.2005.00499.x.
- 20. Sha N, Vannucci M, Tadesse MG, et al. Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics. 2004;60:812–19. doi: 10.1111/j.0006-341X.2004.00233.x.
- 21. Lee K, Mallick B. Bayesian methods for variable selection in survival models with application to DNA microarray data. Sankhya. 2004;66:756–78.
- 22. Sha N, Tadesse MG, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcome. Bioinformatics. 2006;22:2262–8. doi: 10.1093/bioinformatics/btl362.
- 23. Guha S, Li Y, Neuberg D. Bayesian Hidden Markov Modeling of Array CGH Data. J Am Stat Assoc. 2008;103:485–97. doi: 10.1198/016214507000000923.
- 24. Andre F, Job B, Dessen P, et al. Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array. Clin Cancer Res. 2009;15:441–51. doi: 10.1158/1078-0432.CCR-08-1791.
- 25. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993;88:669–79.
- 26. Johannsdottir H, Jonsson G, Johannesdottir G, et al. Chromosome 5 imbalance mapping in breast tumors from BRCA1 and BRCA2 mutation carriers and sporadic breast tumors. Int J Cancer. 2006;119:1052–60. doi: 10.1002/ijc.21934.




