Bioinformatics. 2009 May 6;25(15):1959–1960. doi: 10.1093/bioinformatics/btp307

RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions

Oscar M Rueda 1,*, Ramon Diaz-Uriarte 1,*
PMCID: PMC2712338  PMID: 19420051

Abstract

Summary: Several methods have been proposed to detect copy number changes and recurrent regions of copy number variation from aCGH, but few methods return probabilities of alteration explicitly, which are the direct answer to the question ‘is this probe/region altered?’ RJaCGH fits a Non-Homogeneous Hidden Markov model to the aCGH data using Markov Chain Monte Carlo with Reversible Jump, and returns the probability that each probe is gained or lost. Using these probabilities, recurrent regions (over sets of individuals) of copy number alteration can be found.

Availability: RJaCGH is available as an R package from CRAN repositories (e.g. http://cran.r-project.org/web/packages).

Contact: rueda.om@gmail.com

1 INTRODUCTION

Genomic DNA copy number alterations (CNAs) are associated with complex diseases (McCarroll and Altshuler, 2007), and are often studied using array-based comparative genomic hybridization (aCGH). To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question, ‘What is the probability that this gene/region has CNAs?’ Estimates of the probabilities of alteration (instead of P-values or smoothed means) are the most direct and usable answer to this problem (Broët and Richardson, 2006). Probabilities can be used in contexts ranging from basic research to clinical applications (Lockwood et al., 2006; Pinkel and Albertson, 2005): a clinician might require high certainty of alteration of a specific gene before invasive procedures, whereas a basic researcher can consider for further study genes that show only a moderate probability of alteration. In addition, many aCGH platforms have probes located at variable distances, which should be incorporated in the analysis (Broët and Richardson, 2006; Lockwood et al., 2006). A variety of methods have been developed for the analysis of aCGH data (see reviews in Lai et al., 2005; Rueda and Diaz-Uriarte, 2007a,b; Willenbrock and Fridlyand, 2005), but most of them neither return probabilities of alteration nor make use of the distance between probes. The few approaches that return probabilities of alteration either do not use distance between probes, or fix the number of possible states of alteration to three or four, a biologically unrealistic assumption. In addition to locating probes that show copy number changes, the identification of common or recurrent regions of alteration is a frequent study objective: the regions more likely to harbor disease-critical genes are those that are recurrent or common among samples (Diskin et al., 2006; Pinkel and Albertson, 2005).
The identification of these regions should use the information about the probability of alteration (to avoid giving the same weight to probes with strong and weak evidence of alteration), and should allow the discovery of regions over subsets of samples, as it is known that many complex diseases, such as cancer or autism, are composed of subtypes of syndromes (Sebat, 2007). Most available methods for locating common regions (Klijn et al., 2008; Rouveirol et al., 2006; Shah et al., 2007; Taylor et al., 2008) neither allow for among-subject heterogeneity nor use probabilities. Finally, many of the existing methods are not as readily and freely available as packages on CRAN, nor as easy to use without forcing (often arbitrary) choices on the user. We have developed a freely available R package, RJaCGH, for the analysis of aCGH data that incorporates distance between probes, returns probabilities of alteration and allows the identification of recurrent regions of CNA.

2 RESULTS

To estimate probabilities of copy number changes, we use a non-homogeneous Hidden Markov model (HMM) with an unknown number of hidden states fitted via Reversible Jump Markov Chain Monte Carlo (Cappé et al., 2005). By using a non-homogeneous HMM, we can account for the variable distance between probes/genes, and Reversible Jump allows us to use HMMs without fixing the number of hidden states. By exploring the full posterior probabilities and retaining the probabilities of models of different sizes, we can employ Bayesian model averaging (Hoeting et al., 1999), thus incorporating model uncertainty and not conditioning our inferences on the selection of a particular model. The statistical model is described in Rueda and Diaz-Uriarte (2007a), where it is shown that the method performs as well as, or better than, the competing methods ACE (Lingjaerde et al., 2005), BioHMM (Marioni et al., 2006), HMM (Fridlyand et al., 2004), CGHseg (Picard et al., 2005), DNAcopy (Venkatraman and Olshen, 2007) and GLAD (Hupé et al., 2004) in terms of calling gains and losses, and that the performance advantage increases as the variability in inter-probe distance increases.
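The two ideas in this paragraph, distance-dependent transitions and Bayesian model averaging, can be illustrated with a minimal Python sketch. It is not the RJaCGH implementation (which is written in C and called from R). The function names, the `beta` parameter matrix and the exponential form of the transition probabilities are assumptions made for illustration; the actual model specification is given in Rueda and Diaz-Uriarte (2007a).

```python
import numpy as np

def transition_matrix(beta, x):
    """Transition matrix between two adjacent probes (illustrative form).

    beta : (k, k) array of non-negative parameters with zeros on the
           diagonal (hypothetical parameterisation).
    x    : distance to the next probe, rescaled to lie in [0, 1].
    """
    beta = np.asarray(beta, dtype=float)
    # Off-diagonal weights grow with distance; diagonal weights stay at 1,
    # so nearby probes tend to keep the same hidden state.
    weights = np.exp(-beta * (1.0 - x))
    return weights / weights.sum(axis=1, keepdims=True)

def model_average(model_probs, prob_altered_by_model):
    """Bayesian model averaging of per-probe alteration probabilities.

    model_probs           : posterior probability of each model size, shape (m,)
    prob_altered_by_model : P(probe altered | model), shape (m, n_probes)
    """
    w = np.asarray(model_probs, dtype=float)
    w = w / w.sum()  # normalise in case the inputs are raw visit counts
    return w @ np.asarray(prob_altered_by_model, dtype=float)

# Two hidden states; compare adjacent probes that are close (x = 0)
# with probes at the maximum rescaled distance (x = 1).
beta = [[0.0, 2.0], [3.0, 0.0]]
q_near = transition_matrix(beta, 0.0)
q_far = transition_matrix(beta, 1.0)

# Two candidate models with posterior probabilities 0.25 and 0.75.
averaged = model_average([0.25, 0.75], [[0.2, 0.8], [0.6, 0.4]])
```

Under this assumed form, a probe at the maximum distance carries no information about its neighbour (every row of `q_far` is uniform), whereas close probes strongly favour remaining in the same state. The averaged probabilities weight each model's answer by that model's posterior probability rather than conditioning on a single selected model.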

For the identification of recurrent regions of CNA, we have developed two algorithms, pREC-A and pREC-S (fully described in the documentation of the program and as a technical report available from http://biostats.bepress.com/cobra/ps/art43/). pREC-A (probabilistic recurrent copy number regions, common threshold over all arrays) does not allow for among-subject heterogeneity and is, thus, similar in objectives to previous approaches except for the fact that we explicitly use probabilities. pREC-S (probabilistic recurrent copy number regions, subsets of arrays) identifies common regions over subsets of arrays; alternatively, we can think of this algorithm as identifying subsets of arrays that share regions of alteration. This is a novel algorithm, explicitly targeted to incorporate heterogeneity and use probabilities. Both methods use probabilities of alteration as returned by the non-homogeneous HMM. No hard thresholds are imposed, and thus the user decides what constitutes sufficient evidence (in terms of probability of alteration) to call a probe gained (or lost). The probabilities that we use are not the marginal probabilities of alteration but the joint probabilities of alteration of a region of probes. Our approach incorporates both within- and among-array variability: we use the information on the certainty of each call of gain/loss (i.e. the probability) in all computations of recurrent regions. Moreover, using probabilities of alteration (instead of magnitude of change), in addition to differentiating between evidence of alteration and estimated fold change, prevents inter-array differences in range of log2 ratios and tissue mixture from being confounded with evidence of alteration. Finally, both algorithms use at most two parameters and their biological meaning is immediate: probability of alteration, and number of samples that share an alteration.
We can use the output of pREC-S as the basis for clustering and to display patterns of groupings of arrays; an example is shown in the documentation of the program.
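The two criteria described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pREC-A or pREC-S algorithms themselves, which are specified in the package documentation and the technical report. It assumes that posterior draws of the hidden state sequence are available for each array (hypothetical input `draws_per_array`, with state code 1 meaning "gained"), estimates the joint probability of a region as the fraction of draws in which every probe in the region is altered, and uses a greedy left-to-right scan that returns non-overlapping regions only. The function names and arguments are invented for this example.

```python
import numpy as np

def joint_prob(draws, start, end, state):
    """Fraction of posterior draws in which all probes start..end are in `state`.

    draws : (n_draws, n_probes) integer array of sampled hidden states.
    """
    region = draws[:, start:end + 1]
    return float(np.mean(np.all(region == state, axis=1)))

def recurrent_regions(draws_per_array, p, state=1, min_arrays=None):
    """Greedy scan for contiguous regions that pass a probability criterion.

    min_arrays=None : mean joint probability over all arrays >= p
                      (a pREC-A-like criterion).
    min_arrays=m    : joint probability >= p in at least m arrays
                      (a pREC-S-like criterion).
    Returns a list of (start, end) probe indices, inclusive.
    """
    n_probes = draws_per_array[0].shape[1]

    def passes(start, end):
        probs = [joint_prob(d, start, end, state) for d in draws_per_array]
        if min_arrays is None:
            return np.mean(probs) >= p
        return sum(q >= p for q in probs) >= min_arrays

    regions = []
    start = 0
    while start < n_probes:
        if not passes(start, start):
            start += 1
            continue
        # The joint probability cannot increase as the region grows,
        # so extend to the right until the criterion first fails.
        end = start
        while end + 1 < n_probes and passes(start, end + 1):
            end += 1
        regions.append((start, end))
        start = end + 1
    return regions

# Toy data: two arrays, four posterior draws each, five probes (1 = gained).
array_a = np.array([[0, 1, 1, 0, 0],
                    [0, 1, 1, 0, 0],
                    [0, 1, 1, 1, 0],
                    [0, 1, 1, 0, 0]])
array_b = np.array([[0, 1, 1, 0, 0],
                    [0, 1, 1, 0, 0],
                    [0, 0, 1, 0, 0],
                    [0, 1, 1, 0, 1]])

common_all = recurrent_regions([array_a, array_b], p=0.7)
common_both = recurrent_regions([array_a, array_b], p=0.9, min_arrays=2)
```

In the toy data, probes 1 and 2 are jointly gained in every draw of the first array but in only three of the four draws of the second. Because the joint probability is computed over whole draws, a region is not credited with the product of its probes' marginal probabilities, which is the distinction the text draws between joint and marginal probabilities.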

The RJaCGH method has been implemented as an R package (R Development Core Team, 2006). All of the MCMC code for the HMM as well as the two algorithms for common regions have been implemented in C (dynamically loaded from R) for speed. The program is available from the standard R repositories (e.g. http://cran.r-project.org/web/packages/) under the GPL (v. 3) license and has been submitted to BioConductor. The package depends on no additional software (besides R itself). The flexibility and comprehensiveness of RJaCGH does have a computational cost: estimation of probabilities by RJaCGH is considerably slower than segmentation by alternative approaches. If probabilities of alteration are desired (but finding recurrent regions or incorporating distance between probes is not needed), the bcp method of Erdman and Emerson (2007, 2008) is a much faster alternative. pREC-A and pREC-S, once the probabilities have been obtained, are very fast (on the order of seconds to a few minutes for datasets that include 50–70 samples).

ACKNOWLEDGEMENTS

We thank J. Fadista, A. Ivens and D. Grove for comments and bug reports on the package, and three reviewers for comments on the manuscript.

Funding: Fundacion de Investigacion Medica Mutua Madrileña, RTIC COMBIOMED (RD07/0067/0014) Spanish Health Ministry, Supercomputacion y Ciencia (CSD2007-00050) Spanish Ministry of Education and Science, Spanish National Bioinformatics Institute (www.inab.org) a platform of Genoma España.

Conflict of Interest: none declared.

REFERENCES

  1. Broët P, Richardson S. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics. 2006;22:911–918. doi: 10.1093/bioinformatics/btl035.
  2. Cappé O, et al. Inference in Hidden Markov Models (Springer Series in Statistics). New York: Springer; 2005.
  3. Diskin S, et al. Stac: a method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res. 2006;16:1149–1158. doi: 10.1101/gr.5076506.
  4. Erdman C, Emerson JW. bcp: an R package for performing a Bayesian analysis of change point problems. J. Stat. Softw. 2007;23:1–13.
  5. Erdman C, Emerson JW. A fast Bayesian change point analysis for the segmentation of microarray data. Bioinformatics. 2008;24:2143–2148. doi: 10.1093/bioinformatics/btn404.
  6. Fridlyand J, et al. Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal. 2004;90:132–153.
  7. Hoeting JA, et al. Bayesian model averaging: a tutorial (with discussion). Stat. Sci. 1999;14:382–417.
  8. Hupé P, et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–3422. doi: 10.1093/bioinformatics/bth418.
  9. Klijn C, et al. Identification of cancer genes using a statistical framework for multiexperiment analysis of nondiscretized array CGH data. Nucleic Acids Res. 2008;36. doi: 10.1093/nar/gkm1143.
  10. Lai WRR, et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21:3763–3770. doi: 10.1093/bioinformatics/bti611.
  11. Lingjaerde OC, et al. CGH-explorer: a program for analysis of array-CGH data. Bioinformatics. 2005;21:821–822. doi: 10.1093/bioinformatics/bti113.
  12. Lockwood WW, et al. Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur. J. Hum. Genet. 2006;14:139–148. doi: 10.1038/sj.ejhg.5201531.
  13. Marioni JC, et al. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006;22:1144–1146. doi: 10.1093/bioinformatics/btl089.
  14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat. Genet. 2007;39(Suppl. 7):S37–S42. doi: 10.1038/ng2080.
  15. Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. doi: 10.1186/1471-2105-6-27.
  16. Pinkel D, Albertson DG. Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 2005;37(Suppl.):S11–S17. doi: 10.1038/ng1569.
  17. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2006. ISBN 3-900051-07-0.
  18. Rouveirol C, et al. Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics. 2006;22:2066–2073. doi: 10.1093/bioinformatics/btl004.
  19. Rueda OM, Diaz-Uriarte R. Flexible and accurate detection of genomic copy-number changes from aCGH. PLoS Comput. Biol. 2007a;3:e122. doi: 10.1371/journal.pcbi.0030122.
  20. Rueda OM, Diaz-Uriarte R. A response to Yu et al. ‘a forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array’. BMC Bioinformatics, 2007. 2007b;8:145. doi: 10.1186/1471-2105-8-145. BMC Bioinformatics 8:394.
  21. Sebat J. Major changes in our DNA lead to major changes in our thinking. Nat. Genet. 2007;39:S3–S5. doi: 10.1038/ng2095.
  22. Shah S, et al. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics. 2007;23:i450–i458. doi: 10.1093/bioinformatics/btm221.
  23. Taylor BSS, et al. Functional copy-number alterations in cancer. PLoS ONE. 2008;3. doi: 10.1371/journal.pone.0003179.
  24. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–663. doi: 10.1093/bioinformatics/btl646.
  25. Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005;21:4084–4091. doi: 10.1093/bioinformatics/bti677.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
