Cancer Informatics. 2011 May 25;10:159–173. doi: 10.4137/CIN.S6873

Model-Integrated Estimation of Normal Tissue Contamination for Cancer SNP Allelic Copy Number Data

Susann Stjernqvist 1, Tobias Rydén 1, Chris D Greenman 1
PMCID: PMC3118450  PMID: 21695067

Abstract

SNP allelic copy number data provides intensity measurements for the two different alleles separately. We present a method that estimates the number of copies of each allele at each SNP position, using a continuous-index hidden Markov model. The method is especially suited for cancer data, since it includes the fraction of normal tissue contamination, often present when studying data from cancer tumors, into the model. The continuous-index structure takes into account the distances between the SNPs, and is thereby appropriate also when SNPs are unequally spaced. In a simulation study we show that the method performs favorably compared to previous methods even with as much as 70% normal contamination. We also provide results from applications to clinical data produced using the Affymetrix genome-wide SNP 6.0 platform.

Keywords: allelic copy number, hidden Markov model, cancer, normal cell contamination

1. Introduction

DNA in tumor cells can contain abnormalities in the form of copy number aberrations such as segments with losses or gains of one or several copies of either allele. The lengths of such aberrations can vary between short segments up to an entire chromosome, and their positions are essential both for detecting and for improving knowledge of various sorts of cancer. Therefore, methods that localize copy number aberrations are of great importance. In addition to changes in the total copy number of both alleles together, changes in the allelic copy numbers, ie, the number of copies of each allele, are also important. We denote the two different alleles at a given genomic location by A and B, so that for normal cells the possible genotypes are AA, AB and BB. One example of a genotype aberration is loss of heterozygosity (LOH), for which the only attainable genotypes are AA and BB.

Different techniques to measure DNA copy numbers have been developed, as have methods to evaluate the measurement data. One technique is array comparative genomic hybridization (aCGH), which provides ratios of the copy numbers of a sample DNA, compared to those of some reference DNA. Several different statistical methods have been applied to this kind of data, including different segmentation methods,4,20,21 smoothing6,12 and hidden Markov models.1,7,9,16,19,24,27,28,29 aCGH data provides information only about the total copy number and gives no information about the amount of each allele. Another drawback of such data is the limited number of probes on the arrays. For this reason there is an increased use of single nucleotide polymorphism (SNP) data, which offers denser measurements and provides intensities for the two alleles separately. Using SNP data it is possible not only to estimate copy number changes, but also to find allelic changes such as LOH. Indeed, a copy number amplification may be caused by different allelic changes. For example, a copy number of four could correspond either to {AAAA, AAAB, ABBB, BBBB}, to {AAAA, AABB, BBBB} or to {AAAA, BBBB}, depending on which allele has gained extra copies.

SNP data has previously been analyzed using various sorts of methods, such as smoothing11,15 and pattern recognition.22 The most frequently used methods are however based on hidden Markov models (HMMs).3,8,13,17,18,26,30,31 A brief introduction to Markov chains and HMMs is found in Appendix 1. HMMs suit SNP data well since genomic alterations often appear in longer or shorter segments, implying that copy numbers across probes in a small genomic region are correlated. For example, Wang et al31 and Colella et al3 model SNP data from the Illumina array, which provides log R ratio data (log2-ratio of total observed intensities to total expected intensities) and BAF data (normalized measure of the relative intensities of the two alleles), using an HMM with six states, while Sun et al30 apply a more comprehensive model with nine states. Korn et al13 combine an HMM to model copy number variants with a clustering algorithm to detect genotypes. Li et al18 also model the proportion of the major allele while Lamy et al17 use both allelic intensities provided by the Affymetrix array and model them using bivariate Normal distributions.

Several of the methods above assume that the ploidy, ie, mean copy number, of a chromosome is two. This holds for normal cells, but cancer cells are often aneuploid, ie, their ploidy may differ from two. The necessity for considering ploidy when modeling cancer data is well described by Greenman et al,8 but in brief one can say that the measured normalized intensity for a probe in a diploid chromosome is twice as large as for a probe with the same copy number in a tetraploid chromosome. Two methods that include ploidy are those of Attiyeh et al2 and Greenman et al,8 which both contain a pre-processing step in which the ploidy is estimated. Greenman et al then continue by using an HMM while Attiyeh et al apply a window-based model.

Another feature common in tumor samples, arising from the difficulty of dissecting only tumor cells from a tissue sample, is contamination of the tumor cell sample by normal cells. As a result the measured allelic intensities are mixtures of intensities from tumor and normal cells, thus yielding non-integer DNA copy numbers. One way to incorporate such contamination is to model total copy numbers of the mixed sample in a non-parametric way,2,29 but this provides limited information about the copy numbers of the cancer cells. Sun et al30 estimate the fraction of normal tissue contamination using an empirical method and Colella et al3 write that it is possible to extend their method to handle contamination, but without being more specific. Li et al18 show that their method can handle a fraction of normal tissue contamination up to 30%, while Lamy et al17 report a simulation study with slightly better results. Some tumors however form in a manner such that even with microdissection, a significant proportion of normal cells (say 50% or more) can arise in the sample, and none of the above methods provide results that are satisfactory for such high fractions of normal tissue contamination.

The purpose of the present paper is to devise a method to estimate allelic copy numbers, with ploidy and fraction of normal tissue contamination integrated in the model. Indeed, in all of the above papers, ploidy and/or normal fraction are estimated by adding more or less ad-hoc steps to a model that does not account for these parameters in itself. The model reported here is thus particularly suited for cancer data, for which both of these features are common. By including these parameters in the model they can be estimated alongside the other parameters using all data, rather than adding a pre-processing step or empirical methods using only a small subset of the data. In the simulation study presented below, samples with 30%, 50% and 70% normal contamination are simulated and even for the largest amount of contamination, 97% of the probes are reconstructed to the correct copy number state.

An additional feature of our model is that it is based on a continuous-index Markov chain, which accounts for the fact that the SNP probes are often unevenly spread over the genome. The relevance of a continuous-index model was highlighted by Gupta and Mitra10 (Section 5.3) for the different but related problem of classifying regions of DNA as nucleosome free regions (NFR) or non-NFR using a two-state HMM. Indeed, they showed that with irregularly spaced probes, a continuous-index model can provide substantially better results than a discrete-index model; 99% vs. 85% or 68% correct classifications in simulations for two different arrangements of the probes. Also the methods by Wang et al,31 Colella et al3 and Li et al,18 who apply discrete-index HMMs to SNP data, aim to take distances between probes into account by letting the Markov transition probabilities depend on these distances in different ways. Common to all of these methods is however that the stipulated transition probabilities violate the Chapman-Kolmogorov equation of Markov chains. That is, letting P(t) be the matrix of transition probabilities over a distance t between two probes, the equality P(t1 + t2) = P(t1)P(t2) does not hold. In essence this means that there is in fact no Markov chain with the given transition probabilities.

The paper is organized as follows. The data are described in Section 2 and the model in Section 3. Section 4 provides results from a simulation study as well as from an application to clinical data. Concluding remarks are given in Section 5.

2. Data

The data used in this study are the cancer samples in Greenman et al,8 produced using the Affymetrix genome-wide SNP 6.0 platform. We applied the algorithm to about 15 different cell line and primary tumor samples, representing various cancer forms including breast, lung and renal cancer. The primary tumor sample PD1753a, for which results are reported in Section 4.2, is from a clear cell renal cell carcinoma.32

For probes at SNPs the intensities of the two different alleles are provided, while at other positions only a single total copy number intensity is available. Following Greenman et al,8 the intensities are normalized by first dividing each measurement by the total intensity of the sample (ie, the sum of all probe intensities over the entire genome), to remove chip-to-chip variation. The mean signals for each allele (or probe at non-SNP positions) are then transformed into a copy number intensity and a genotype intensity that are indicators of total copy number and allelic ratio dosages. The model presented below incorporates intensities for SNP probes only, but is easily extendable to include also probes measuring total copy number only; we elaborate on this further in the Discussion. The cancer data is available from the Cancer Genome Project, subject to a manual transfer agreement, and our Matlab code is available on the WWW.33

3. The Model

3.1. Basic structure

Let there be Nc probes on chromosome c, and denote these probes as probe (k, c), k = 1, 2, …, Nc. The genomic location of probe (k, c) is denoted by tkc, measured in the unit base pairs (bp) starting from the beginning of the chromosome. We denote the two different alleles at any genomic location by A and B. We will write g = (gA, gB) for the allelic copy numbers, ie, gA and gB are the number of copies of the A and B allele respectively. For example, the genotype AAB corresponds to g = (2, 1). Obviously the genotype and the allelic copy numbers are in a one-to-one correspondence to each other, and at times we will make no real distinction between the two. The allelic intensities are modeled using an HMM for which each state i corresponds to one genotype set Gi as specified in Table 1. The Markov chain can be extended to include more states with copy numbers above six, but the model as stated here has proved to be enough for the studied samples. To explain the genotype sets in Table 1, we note that through cancer development any region in the genome starts with one parental copy of each region and ends up with m copies of one allele and n copies of the other. If the genotype was originally AA or BB then the genotype will be (m + n)A or (m + n)B, respectively. If the SNP was heterozygous then we must end up with either mA and nB, or mB and nA. These are the genotypes indicated in Table 1. We refer to state 4, with genotype set {AA, AB, BB} as the normal state, and by an abnormal state we mean any other state.

Table 1.

Genotype sets for the different states of the Markov chain, sorted in the order given by the total copy number and copy number of the minor allele.

State i | (Total CN, minor CN) | Genotype set Gi
1 | (0,0) | { }
2 | (1,0) | {A, B}
3 | (2,0) | {AA, BB}
4 | (2,1) | {AA, AB, BB}
5 | (3,0) | {AAA, BBB}
6 | (3,1) | {AAA, AAB, ABB, BBB}
7 | (4,0) | {4A, 4B}
8 | (4,1) | {4A, 3AB, A3B, 4B}
9 | (4,2) | {4A, 2A2B, 4B}
10 | (5,0) | {5A, 5B}
11 | (5,1) | {5A, 4AB, A4B, 5B}
12 | (5,2) | {5A, 3A2B, 2A3B, 5B}
13 | (6,0) | {6A, 6B}
14 | (6,1) | {6A, 5AB, A5B, 6B}
15 | (6,2) | {6A, 4A2B, 2A4B, 6B}
16 | (6,3) | {6A, 3A3B, 6B}
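
The rule used to build the genotype sets (homozygous SNPs keep all copies on one allele, heterozygous SNPs end with m copies of one allele and n of the other) can be sketched in a few lines of Python. This is only an illustration: the names `STATES` and `genotype_set`, and the representation of a genotype as a tuple (gA, gB), are assumptions made here and are not taken from the authors' Matlab code.

```python
# Illustrative sketch: enumerate the 16 states and their genotype sets.
# A genotype is represented by its allelic copy numbers (gA, gB).

# (total CN, minor CN) pairs in the order used by the table:
# total CN from 0 to 6, minor CN from 0 up to half the total.
STATES = [(total, minor)
          for total in range(7)
          for minor in range(total // 2 + 1)]


def genotype_set(total_cn, minor_cn):
    """Allelic copy numbers (gA, gB) attainable in a (total CN, minor CN) state."""
    major_cn = total_cn - minor_cn
    # Germline AA or BB: all copies end up on a single allele.
    genotypes = {(total_cn, 0), (0, total_cn)}
    # Germline AB: m copies of one allele and n copies of the other.
    genotypes |= {(major_cn, minor_cn), (minor_cn, major_cn)}
    # Sort so that A-rich genotypes come first, e.g. AAA, AAB, ABB, BBB.
    return sorted(genotypes, reverse=True)


for index, (total, minor) in enumerate(STATES, start=1):
    print(index, (total, minor), genotype_set(total, minor))
```

In this representation the zero-copy state 1 holds the single genotype (0, 0), so every state has between one and four genotypes.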

For each chromosome c the sequence of copy number states, according to Table 1, is modeled by a continuous-index Markov chain (Xc(t)), t1c ≤ t ≤ Tc, where t and Tc are respectively the genomic location (in bp) within the chromosome and the length (in bp) of the chromosome. The Markov chains for different chromosomes are assumed independent. The genomic location (in bp) is, strictly speaking, a discrete variable, but since the number of bp’s within a chromosome is much larger than the number of jumps of the Markov chain, the error caused by using a continuous approximation is negligible. With a discrete-index model the Markov transition probabilities would either be very close to unity (for staying in the same state from one bp to another) or close to zero (for changing state), and dealing with such probabilities is unstable numerically. For a continuous-index model, using transition rates rather than probabilities, this problem does not exist.

With 16 different states there are 240 different types of jumps and equally many transition rates (per chromosome). It is infeasible to estimate so many rates, and to make the model more parsimonious we assume that many of them are equal. Specifically we assume, for chromosome c, a common rate λc for jumps from any state (normal or abnormal) to the group of abnormal states, with each such state (except the current one, if the chain resides in an abnormal state) being equally likely, and another common rate ηc for jumps to the normal state from any abnormal state. The total rate out of any abnormal state, for chromosome c, is thus λc + ηc, while the total rate out of the normal state is λc. This dynamic provides Markov chains whose stationary versions are time-reversible.29 Finally we let δic = P(Xc(t1c) = i) denote the initial probability for Markov state i in chromosome c.
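
The rate structure just described can be written out as a generator (rate) matrix, from which transition probabilities over any distance follow as a matrix exponential. The Python sketch below is illustrative only: the function names, the 0-based state indexing (the normal state 4 has index 3) and the Taylor-series matrix exponential are assumptions made here, not the authors' implementation. The final lines check numerically that the resulting matrices satisfy the Chapman-Kolmogorov equation discussed in the Introduction.

```python
def build_generator(lam, eta, n_states=16, normal=3):
    """Rate matrix Q (unit: 1/bp) for one chromosome; `normal` is a 0-based index."""
    Q = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        # Abnormal target states: all abnormal states except the current one.
        targets = [j for j in range(n_states) if j != normal and j != i]
        for j in targets:
            Q[i][j] = lam / len(targets)   # rate lam shared equally among targets
        if i != normal:
            Q[i][normal] = eta             # return to the normal state
        Q[i][i] = -sum(Q[i])               # each row of a generator sums to zero
    return Q


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, distance, terms=20):
    """P(distance) = exp(Q * distance), via a Taylor series with scaling and squaring."""
    n = len(Q)
    norm = max(sum(abs(x) for x in row) for row in Q) * distance
    squarings = 0
    while norm > 0.5:                      # shrink the argument for fast convergence
        norm /= 2.0
        squarings += 1
    scale = distance / (2 ** squarings)
    M = [[q * scale for q in row] for row in Q]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, M)]   # M^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):             # undo the scaling: exp(2M) = exp(M)^2
        P = matmul(P, P)
    return P


Q = build_generator(lam=1e-7, eta=1e-9)
P_a, P_b, P_ab = (transition_matrix(Q, d) for d in (2e6, 5e6, 7e6))
# Chapman-Kolmogorov: P(t1 + t2) should equal P(t1) P(t2).
ck_gap = max(abs(x - y)
             for row1, row2 in zip(matmul(P_a, P_b), P_ab)
             for x, y in zip(row1, row2))
print("largest Chapman-Kolmogorov discrepancy:", ck_gap)
```

Because every transition matrix is derived from the same generator, the distances between probes enter only through the argument of the exponential.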

Write ykc = (yAkc, yBkc) for the measured allelic intensities at probe (k, c). Greenman et al8 studied the correlation between the allele A and B intensities, for each probe, using 460 wild-type samples. For probe (k, c), plotting the two allele intensities for all wild-type samples against each other reveals three clusters (see Figure 1 of Greenman et al8 for an example). These clusters correspond to the genotypes AA, AB and BB, with the coordinates of the cluster centers written as (A0kc + 2A1kc, B0kc), (A0kc + A1kc, B0kc + B1kc) and (A0kc, B0kc + 2B1kc) respectively for suitable parameters A0kc, B0kc, A1kc and B1kc. These parameters were all estimated by Greenman et al8 using the wild-type samples. Their interpretation is that A0kc is the background intensity of the A allele (at diploid probes BB), and A1kc is the increase in A allele intensity from BB to AB and from AB to AA; B0kc and B1kc have analogous interpretations.

Figure 1.

Proportions of probes at which the Markov state was incorrectly reconstructed by the Viterbi algorithm with MAP parameter estimates computed by the EM algorithm. Markov transition rates were λc = ηc = 10^−7 (top left), λc = 10^−7, ηc = 10^−9 (top right), λc = 10^−9, ηc = 10^−7 (bottom left), λc = ηc = 10^−9 (bottom right) (unit: bp^−1). Confidence intervals were obtained by exponentiating two-sided 95% Student-t confidence limits based on the log-proportions for 10 genome replicates.

Further denote by (μAkcg, μBkcg) the mean allele A and B intensities at probe (k, c) for allelic copy numbers g = (gA, gB). The cluster centers above then write

$$\mu_{kcg} = \left(\mu^{A}_{kcg},\, \mu^{B}_{kcg}\right) = \left(A_{0kc} + g_A A_{1kc},\; B_{0kc} + g_B B_{1kc}\right), \tag{1}$$

and this model applies for the normal Markov state i = 4, ie, for allelic copy numbers such that gA + gB = 2. Moreover, the clusters in Greenman et al8 (Fig. 1) are tilted ovals, indicating that the intensities for alleles A and B are correlated and have unequal variances. Greenman et al8 found that a suitable model for the covariance matrix is

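$$\Sigma_{kcg} = v_{kc} \begin{pmatrix} \left(\mu^{A}_{kcg}\right)^{2} & \rho_{kc}\,\mu^{A}_{kcg}\,\mu^{B}_{kcg} \\ \rho_{kc}\,\mu^{A}_{kcg}\,\mu^{B}_{kcg} & \left(\mu^{B}_{kcg}\right)^{2} \end{pmatrix}; \tag{2}$$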

note that the variances are taken proportional to the squared means. The probe-specific variance factors vkc and correlations ρkc, as well as the means parameters A0kc, B0kc, A1kc and B1kc described above, were all estimated by Greenman et al8 using the wild-type samples and assuming a bivariate Normal distribution for each cluster.

We now carry this model further by assuming that for each probe, the allele intensities follow the mean-variance model given by Eqs. (1)–(2) also for genotypes (gA, gB) for which gA and gB do not sum to two, ie, for all pairs (gA, gB) corresponding to genotypes listed in Table 1. That is, we assume that the response from amount of each allele on the microarray to measured intensity is linear, with the standard deviation also increasing linearly (the variance being proportional to the squared mean). In reality the allelic intensities have a linear response for lower copy numbers, while at higher copy numbers the intensities start to saturate and our method is approximate. This could be adjusted for by a non-linear transformation, cf. Section 5, but we have not attempted such an adjustment in the analyses presented in this paper.

The above model specifies the conditional density of Ykc given a particular genotype. To specify the conditional density of Ykc given a Markov state, we recall that each Markov state has a genotype set comprising between one and four different genotypes. Thus the conditional density of Ykc, given the state, is a mixture of bivariate Normal distributions for which each mixture component represents a different genotype. The mixture weights were taken as the Hardy-Weinberg weights; for the copy number-aberrated genotypes, Hardy-Weinberg was used to compute the germline genotype proportions. Thus letting pkc be the allele frequency for an A allele at probe (k, c), the probabilities for the different genotypes, denoted by wkcig, are the binomial probabilities pkc and 1 – pkc for states with two genotypes, pkc^2, 2pkc(1 – pkc) and (1 – pkc)^2 for states with three genotypes, and pkc^2, pkc(1 – pkc), pkc(1 – pkc) and (1 – pkc)^2 respectively for states with four different genotypes. The frequencies pkc were also estimated by Greenman et al,8 using the wild-type samples. The conditional density for a measurement Ykc given the Markov state, often referred to as the emission density of the HMM, thus writes

$$f_{Y_{kc} \mid X_c(t_{kc})}(y \mid i) = \sum_{g \in G_i} w_{kcig}\, f_{Y_{kc} \mid G_{kc}}(y \mid g), \tag{3}$$

where Gkc denotes the allelic copy numbers for probe (k, c) and fYkc|Gkc(·|g) is the bivariate Normal density with mean and covariance matrix as in Eqs. (1)–(2).
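
As an illustration of how the emission density of Eq. (3) is assembled from Eqs. (1) and (2) and the Hardy-Weinberg weights, consider the following Python sketch. It is not the authors' code: the function names, the dictionary layout of the probe parameters and the numerical values in `probe` are all hypothetical. The weights are obtained by assigning each germline genotype (AA, AB with either allele as the major one, BB) to the tumor genotype it leads to and summing weights that land on the same genotype, which reproduces the weights stated in the text.

```python
import math


def bivariate_normal_pdf(y, mean, cov):
    """Density of a bivariate Normal distribution at y = (yA, yB)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # Quadratic form (y - mean)' cov^{-1} (y - mean), using the 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def hardy_weinberg_weights(total_cn, minor_cn, p):
    """Map each genotype (gA, gB) of a state to its mixture weight."""
    major_cn = total_cn - minor_cn
    contributions = [
        ((total_cn, 0), p * p),                # germline AA
        ((major_cn, minor_cn), p * (1 - p)),   # germline AB, A is the major allele
        ((minor_cn, major_cn), p * (1 - p)),   # germline AB, B is the major allele
        ((0, total_cn), (1 - p) ** 2),         # germline BB
    ]
    weights = {}
    for genotype, w in contributions:
        weights[genotype] = weights.get(genotype, 0.0) + w
    return weights


def emission_density(y, total_cn, minor_cn, probe):
    """Mixture density of Eq. (3) for the state with the given (total, minor) CN."""
    density = 0.0
    weights = hardy_weinberg_weights(total_cn, minor_cn, probe["p"])
    for (gA, gB), w in weights.items():
        mu_A = probe["A0"] + gA * probe["A1"]          # Eq. (1)
        mu_B = probe["B0"] + gB * probe["B1"]
        v, rho = probe["v"], probe["rho"]
        cov = [[v * mu_A ** 2, v * rho * mu_A * mu_B],  # Eq. (2)
               [v * rho * mu_A * mu_B, v * mu_B ** 2]]
        density += w * bivariate_normal_pdf(y, (mu_A, mu_B), cov)
    return density


# Hypothetical probe parameters, chosen only to make the example run.
probe = {"A0": 0.2, "A1": 0.5, "B0": 0.3, "B1": 0.4, "v": 0.01, "rho": 0.3, "p": 0.4}
y_ab = (0.7, 0.7)  # centre of the AB cluster for this probe
print(emission_density(y_ab, 2, 1, probe), emission_density(y_ab, 2, 0, probe))
```

A measurement at the AB cluster centre is far more likely under the normal state (2,1) than under the LOH state (2,0), as one would expect.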

As pointed out in the introduction we include the ploidy K, ie, average copy number over the entire genome, in the model to make it suitable for cancer data. The ploidy is defined genome-wide and not per chromosome, as the probe intensities are normalized per genome. The HMM described above models the normalized intensities, and its parameters were estimated for wild-type samples (ie, diploid samples; K = 2). For a sample with K > 2 the normalized intensities will thus be smaller by a factor 2/K (on average), so that the model for the normalized intensities becomes

$$Y_{kc} \mid G_{kc} = g \;\sim\; N\!\left(\frac{2}{K}\,\mu_{kcg},\; \frac{4}{K^{2}}\,\Sigma_{kcg}\right). \tag{4}$$
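
The factors 2/K and 4/K^2 follow from the usual rule for a linear transformation of a Normal vector. The short derivation below spells this out; the auxiliary symbol Z, standing for the intensity under the diploid normalization, is introduced here for illustration only and does not appear elsewhere in the paper.

```latex
% Z: intensity vector under the diploid (K = 2) normalization.
% Y: the same quantity for a sample with ploidy K, i.e. rescaled by 2/K.
Z \sim N\left(\mu_{kcg}, \Sigma_{kcg}\right), \qquad Y = \frac{2}{K}\, Z
\quad\Longrightarrow\quad
\mathrm{E}[Y] = \frac{2}{K}\,\mu_{kcg}, \qquad
\mathrm{Cov}(Y) = \left(\frac{2}{K}\right)^{2} \Sigma_{kcg} = \frac{4}{K^{2}}\,\Sigma_{kcg}.
% Example: for K = 4 the mean is halved and the covariance is divided by four.
```

For K = 4 this reproduces the statement in the Introduction that a probe in a diploid chromosome has twice the normalized intensity of a probe with the same copy number in a tetraploid one.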

This completes the specification of the basic model. As described above, the parameters A0kc, A1kc, B0kc, B1kc, vkc, ρkc and pkc were all estimated from the wild type samples, and were thus considered as fixed when the model was applied to cancer cell data. The intensities λc and ηc, the initial probabilities δc and the ploidy K were on the other hand estimated from the actual cancer data.

3.2. Normal tissue contamination

As mentioned above it is often difficult to dissect cancer cells without including any surrounding normal tissue, ie, diploid tissue. Such contamination implies that the measured allelic intensities correspond to a mixture of cancer and normal cells. We denote the fraction of normal tissue in the sample by γ, and consequently the fraction of tumor tissue is 1 – γ. Then for a given probe with, as above, copy numbers gA and gB of alleles A and B in the tumor but also copy numbers gAN and gBN for the two alleles in the normal tissue, we assumed the same mean-covariance model as in Eqs. (1)–(2), but with (gA, gB) replaced by

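$$\left(g_A^{\gamma},\, g_B^{\gamma}\right) = \left((1-\gamma)\, g_A + \gamma\, g_A^{N},\; (1-\gamma)\, g_B + \gamma\, g_B^{N}\right). \tag{5}$$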

Similarly, the conditional distribution of Ykc given Markov state i is a mixture of bivariate Normals, but now each four-tuple (gA, gB, gAN, gBN) contributes to a component of that mixture. Thus, the number of mixture components will for some Markov states be larger than without normal tissue contamination (see Table 2).

Table 2.

Combined genotype sets for the different states of the Markov chain, in a model with normal contamination γ. The weights for the respective combined genotypes are the Hardy-Weinberg weights as in the model without normal tissue contamination, and the total and minor copy numbers for the aberrated components are as in Table 1.

State i | Combined genotype set Gi
1 | {2γA, γAγB, 2γB}
2 | {(1 + γ)A, AγB, γAB, (1 + γ)B}
3 | {2A, (2 − γ)AγB, γA(2 − γ)B, 2B}
4 | {AA, AB, BB}
5 | {(3 − γ)A, (3 − 2γ)AγB, γA(3 − 2γ)B, (3 − γ)B}
6 | {(3 − γ)A, (2 − γ)AB, A(2 − γ)B, (3 − γ)B}
7 | {(4 − 2γ)A, (4 − 3γ)AγB, γA(4 − 3γ)B, (4 − 2γ)B}
8 | {(4 − 2γ)A, (3 − 2γ)AB, A(3 − 2γ)B, (4 − 2γ)B}
9 | {(4 − 2γ)A, (2 − γ)A(2 − γ)B, (4 − 2γ)B}
10 | {(5 − 3γ)A, (5 − 4γ)AγB, γA(5 − 4γ)B, (5 − 3γ)B}
11 | {(5 − 3γ)A, (4 − 3γ)AB, A(4 − 3γ)B, (5 − 3γ)B}
12 | {(5 − 3γ)A, (3 − 2γ)A(2 − γ)B, (2 − γ)A(3 − 2γ)B, (5 − 3γ)B}
13 | {(6 − 4γ)A, (6 − 5γ)AγB, γA(6 − 5γ)B, (6 − 4γ)B}
14 | {(6 − 4γ)A, (5 − 4γ)AB, A(5 − 4γ)B, (6 − 4γ)B}
15 | {(6 − 4γ)A, (4 − 3γ)A(2 − γ)B, (2 − γ)A(4 − 3γ)B, (6 − 4γ)B}
16 | {(6 − 4γ)A, (3 − 2γ)A(3 − 2γ)B, (6 − 4γ)B}

The weights for the combined genotypes are Hardy-Weinberg weights as in the model without normal contamination. For example, for a state in Table 2 with three combined genotypes, the weights are pkc^2, 2pkc(1 – pkc) and (1 – pkc)^2 respectively.
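
The combined genotype sets follow mechanically from Eq. (5): each germline genotype fixes both the normal-tissue copy numbers and the tumor copy numbers it can lead to. The Python sketch below illustrates this bookkeeping; the function name `combined_genotypes` and the tuple representation of the effective copy numbers are assumptions made for the example.

```python
def combined_genotypes(total_cn, minor_cn, gamma):
    """Effective copy numbers (gA, gB) of Eq. (5) for one Markov state."""
    major_cn = total_cn - minor_cn
    # (tumor copy numbers, normal-tissue copy numbers) for each germline genotype.
    pairs = [
        ((total_cn, 0), (2, 0)),           # germline AA
        ((major_cn, minor_cn), (1, 1)),    # germline AB, A is the major allele
        ((minor_cn, major_cn), (1, 1)),    # germline AB, B is the major allele
        ((0, total_cn), (0, 2)),           # germline BB
    ]
    combined = []
    for (gA, gB), (gAN, gBN) in pairs:
        mixed = ((1 - gamma) * gA + gamma * gAN,
                 (1 - gamma) * gB + gamma * gBN)
        if mixed not in combined:          # balanced states yield AB only once
            combined.append(mixed)
    return combined


# State 2 (one copy, LOH) with 50% normal tissue: four clusters instead of two.
print(combined_genotypes(1, 0, 0.5))
```

With γ = 0.5, state 2 yields the effective copy numbers (1.5, 0), (1, 0.5), (0.5, 1) and (0, 1.5), ie, the combined genotypes (1 + γ)A, AγB, γAB and (1 + γ)B.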

3.3. Estimation of parameters and the Markov path

The parameters estimated from a tumor sample are the transition rates λc and ηc, the initial probabilities δc, the ploidy K and also the fraction γ of normal tissue contamination.

For a model like the present one, the maximum-likelihood estimator (MLE) typically overestimates the transition rates λc and ηc (ref. 25, Section 4.3), thereby letting an a posteriori reconstruction of the Markov chain trajectory capture also very short transients of the observed data. When using the EM algorithm to compute the MLE, this becomes visible as an overestimated number of jumps of the Markov chain. In order to control the jumps and make their number biologically plausible, we take a Bayesian approach and penalize overly large transition rates by placing Gamma distribution priors on each λc and ηc. Other parameters are assigned uniform (flat) priors. All parameters are a priori independent. We then compute the maximum a posteriori (MAP) parameter estimate using the EM algorithm, by incorporating the priors into the M-step (ref. 5, p. 6). Otherwise this algorithm is a variant of the EM algorithm described by Roberts and Ephraim,23 designed to estimate parameters of a continuous-index HMM observed at discrete positions. The method is detailed in Appendix 2.1.
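
To see how a Gamma prior restrains a transition rate, consider a single rate whose complete-data likelihood is proportional to rate^N exp(−rate · T), where N is the expected number of jumps and T the expected length (in bp) over which such jumps could occur. This generic form is an assumption made here for illustration; the exact M-step used in the paper is the one in Appendix 2.1 and may differ in detail. Under this assumption the posterior is again a Gamma distribution, and its mode is available in closed form. The function name `map_rate` is hypothetical.

```python
def map_rate(expected_jumps, expected_length, shape, prior_mean):
    """Posterior mode of a jump rate under a Gamma prior.

    The prior has the given shape and mean, hence rate parameter shape / prior_mean.
    The likelihood is assumed proportional to rate**N * exp(-rate * T).
    """
    prior_rate = shape / prior_mean
    return (expected_jumps + shape - 1.0) / (expected_length + prior_rate)


# Illustrative numbers: 3 expected jumps over 1e8 bp.
mle = 3 / 1e8                                        # unpenalized estimate, 3e-8 per bp
penalized = map_rate(3, 1e8, shape=2, prior_mean=1e-9)
print(mle, penalized)                                # the prior pulls the rate towards 1e-9
```

A prior with shape 2 thus acts like one extra observed jump spread over an additional 2/mean base pairs, which shrinks large rate estimates without fixing the rate outright.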

Finally, to construct an estimate of the trajectory of the hidden Markov chain we use a Viterbi algorithm adapted to continuous-index Markov chains (see Appendix 2.2).
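
The adaptation used in the paper is given in Appendix 2.2. As a rough illustration of the idea, the sketch below runs the standard Viterbi recursion at the probe positions, with transition probabilities that depend on the distance to the previous probe. To stay short it uses only two states (0 = normal, 1 = abnormal), for which the transition probabilities of a continuous-index chain have a closed form; the function names, the two-state reduction and the toy emission values are all assumptions made for this example.

```python
import math


def two_state_transition(lam, eta, distance):
    """P(distance) for a two-state chain with rates lam (0 -> 1) and eta (1 -> 0)."""
    s = lam + eta
    e = math.exp(-s * distance)
    return [[(eta + lam * e) / s, lam * (1 - e) / s],
            [eta * (1 - e) / s, (lam + eta * e) / s]]


def viterbi(positions, log_emissions, log_initial, lam, eta):
    """Most probable state sequence at strictly increasing probe positions."""
    n_states = len(log_initial)
    score = [log_initial[i] + log_emissions[0][i] for i in range(n_states)]
    back = []
    for k in range(1, len(positions)):
        P = two_state_transition(lam, eta, positions[k] - positions[k - 1])
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at probe k.
            best = max(range(n_states), key=lambda i: score[i] + math.log(P[i][j]))
            pointers.append(best)
            new_score.append(score[best] + math.log(P[best][j]) + log_emissions[k][j])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(range(n_states), key=lambda i: score[i])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data: three probes favouring state 0, then three favouring state 1.
positions = [0, 1000, 2000, 3000, 4000, 5000]
favour_0 = [math.log(0.9), math.log(0.1)]
favour_1 = [math.log(0.1), math.log(0.9)]
log_emissions = [favour_0] * 3 + [favour_1] * 3
path = viterbi(positions, log_emissions, [math.log(0.5)] * 2, lam=1e-5, eta=1e-5)
print(path)
```

Working with logarithms avoids numerical underflow when the number of probes is large.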

4. Results

4.1. Application to simulated data

To evaluate our method’s ability to make correct reconstructions for different amounts of normal contamination, we simulated data from the assumed model, computed MAP parameter estimates using the EM algorithm, reconstructed the hidden Markov chain using the Viterbi algorithm, and finally computed the proportion of probes at which the Markov state was correctly reconstructed. For each simulated dataset we first simulated the Markov chain and the genotypes for each probe position, then computed μkcg and Σkcg using Eqs. (1)–(2), Eq. (5) and the fixed A0, A1, B0, B1, ρ and v (estimated from the wild-type samples), and finally simulated data from the bivariate Normal distributions of Eq. (4) with K = 2. Note that the actual value of K is irrelevant for these simulations, since the model given by Eqs. (1)–(2) describes the data after normalization.
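
The first simulation step, drawing the hidden Markov chain and reading off its state at each probe, can be sketched as follows. This is a minimal illustration rather than the simulation code used for the study: the function name, the 0-based state indices, the evenly spaced probe positions and the choice to start the chain in the normal state (instead of drawing from initial probabilities) are all assumptions.

```python
import random


def simulate_states_at_probes(positions, lam, eta, rng, n_states=16, normal=3):
    """Simulate the continuous-index chain and return its state at each probe."""
    abnormal = [s for s in range(n_states) if s != normal]
    state = normal                  # assumption: the chain starts in the normal state
    location = positions[0]
    states = []
    for probe in positions:
        while True:
            # Total rate out of the current state.
            rate_out = lam if state == normal else lam + eta
            jump_at = location + rng.expovariate(rate_out)
            if jump_at > probe:
                # No jump before this probe; by memorylessness of the
                # exponential distribution the clock can restart at the probe.
                location = probe
                break
            location = jump_at
            if state != normal and rng.random() < eta / (lam + eta):
                state = normal      # jump back to the normal state
            else:
                # Jump to a uniformly chosen abnormal state other than the current one.
                state = rng.choice([s for s in abnormal if s != state])
        states.append(state)
    return states


rng = random.Random(0)
positions = list(range(0, 10_000_000, 1000))   # hypothetical probe locations (bp)
states = simulate_states_at_probes(positions, lam=1e-7, eta=1e-7, rng=rng)
changes = sum(a != b for a, b in zip(states, states[1:]))
print(len(states), "probes,", changes, "state changes between neighbouring probes")
```

Genotypes and intensities would then be drawn probe by probe, conditionally on the simulated states.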

The simulations were carried out for 30%, 50% and 70% normal contamination, and four combinations of transition rates: λc = ηc = 10^−7; λc = 10^−7 and ηc = 10^−9; λc = 10^−9 and ηc = 10^−7; and λc = ηc = 10^−9 (in units of bp^−1). For each combination of contamination and rates, 10 replicates were simulated. For the Gamma priors of λc and ηc we chose shape parameter 2 and means equal to the true transition rates. These choices yield priors that are not overly informative, but which are concentrated enough on small values to prevent the Markov chain from jumping too frequently in our samples.

To verify the convergence of the EM algorithm we present the EM iterations for three different simulated replicates in Figure 2. The proportions of incorrectly reconstructed probes are plotted in Figure 1.

Figure 2.

Estimates of normal contamination γ for iterations 1–10 of the EM algorithm and three simulated replicates with different values of γ: γ = 0.3 (top), γ = 0.5 (middle), and γ = 0.7 (bottom). The initial value for γ was 0.5 in all simulations.

These results can be compared to those from the simulation study by Lamy et al.17 For a normal contamination of 30% the results are similar, but for 45%, which is the largest fraction studied by Lamy et al, their method provides 8%–18% incorrectly estimated probes while at 50% contamination our model provides an error rate below 1%. In addition, the present model performs well even at such a high amount of normal contamination as 70%, when the Markov state is correctly reconstructed at more than 97% of the probes. Obviously the differences between our results and those of Lamy et al depend not only on the different estimation algorithms but also on differences between the number and location of the probes, and on the model for the observed allele intensities and its parameters. However, given the magnitude of the performance improvement, a significant part of it must be attributed to the estimation algorithm as such.

4.2. Application to clinical data

We applied our method to a number of samples from the data described in Section 2. An example is displayed in Figure 3, which shows the Viterbi reconstruction of the Markov chain as well as the corresponding copy numbers compared to the data, for chromosome 3 in primary sample PD1753. For the Gamma priors for λc and ηc we chose shape parameter 2 and means 10^−15.

Figure 3.

Top: Viterbi reconstruction of the Markov path for chromosome 3 in PD1753. Bottom: sum of (standardized) allele intensities for probes within the same chromosome (grey dots), and the copy number of the corresponding state (black solid line).

The reconstruction divides the chromosome into two regions, reconstructed to state 2 ({A, B}) and state 4 ({AA, AB, BB}) respectively. As a simple check of this reconstruction we plotted the standardized allele intensities against each other for all probes in the respective region (Figs. 4 and 5). Figure 5, corresponding to the normal state, shows three clusters representing the three genotypes AA, AB and BB, while Figure 4 shows four clusters. In Table 1 state 2 is associated with two genotypes, A and B, but with normal contamination this state comprises four combined genotypes (1 + γ)A, AγB, γAB and (1 + γ)B (Table 2). Here γ is estimated at 0.53.

Figure 4.

Scatter plot of standardized measured allele intensities in the segment reconstructed to Markov state 2 in Figure 3. The fraction of normal contamination was estimated at 0.53.

Figure 5.

Scatter plot of standardized measured allele intensities in the segment reconstructed to Markov state 4 in Figure 3.

For some of the genomes the values of A0kc, A1kc, B0kc and B1kc needed small adjustments before applying our model; without them, the model did not produce a reasonable fit. A possible explanation for this adjustment being required is a drift in the measured intensities between the time when data from the wild-type samples, used to estimate most model parameters, was collected and the time when the tumor samples were analyzed. A suitable construction of the adjustment was as a common, ie, genome-wide multiplier c0 for all A0kc and B0kc, and another common multiplier c1 for all A1kc and B1kc. The multipliers c0 and c1 were estimated using data from a chromosome segment known to belong to the normal state. The data within this segment was clustered into three parts using the k-means algorithm, and then c0 and c1 were estimated by a least squares fit.

5. Discussion

We have presented a method to estimate the number of copies of each of the two alleles in SNP data, taking three features common in cancer data into account: unequally spaced probes, aneuploidy, and normal contamination. Unequally spaced probes are modeled using a continuous-index Markov chain instead of a discrete-index one, which is the usual choice in the literature. The ploidy and fraction of normal contamination are both included as parameters in the model, which allows us to estimate them along with other parameters and using all the data, rather than estimating them separately in a pre-processing step. This set-up also allows us to retain the integer structure of the allele copy numbers. The model’s ability to estimate the fraction of normal contamination has been demonstrated in a simulation study, with the results being far better than for previous methods and excellent even with as much as 70% normal contamination.

Above we denoted Markov state 4, ie, the state with genotypes {AA, AB, BB}, the normal state, irrespective of the ploidy of the chromosome. The reason for singling out this particular state is that it is often particularly interesting whether the Markov chain is in this state or not, at any given probe. One could argue that if the ploidy differs from two this is not ‘normal’, but it is straightforward to select a different state as ‘normal’ and then modify the transition rate structure and estimation algorithm accordingly.

The emission model, ie, Eqs. (1)–(4), assumes that the means and standard deviations of the measured intensities are both linear in the amount of each allele. In practice this assumption may fail, eg, because for large copy numbers the response is nonlinear. One could then include such a non-linearity in the model, and model the mean intensities as μkcg = hkc(g;θkc) where h is some function and θkc are parameters of this function. Ideally the functional form h as well as all its probe-specific parameters θkc should be well estimated beforehand, so that they are essentially known when evaluating an unknown sample. Similar comments apply to the variance of the measured intensities.

In this paper we have only considered probes that provide allele-specific intensity measurements, but, as mentioned in Section 2, microarrays often also contain probes that measure the total copy number only, ie, the sum of the number of alleles. Such probes can easily be included in our model by specifying a corresponding suitable emission density, ie, a density corresponding to Eq. (3). For instance, this could be a univariate Normal density with mean μkcg = C0kc + C1kc(gA + gB) and variance σkcg^2 = νkc μkcg^2 for parameters C0kc, C1kc and νkc that again need to be estimated prior to analyzing an unknown sample. Should the response function from total copy number to intensity not be linear for large copy numbers, this could be handled similarly to what can be done for SNP probes; cf. the previous paragraph.
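
A minimal Python sketch of such an emission density for a total-copy-number probe is given below. The function name and the numerical parameter values are hypothetical; the paper does not estimate C0kc, C1kc or νkc.

```python
import math


def total_cn_density(y, gA, gB, C0, C1, nu):
    """Univariate Normal density for a probe measuring total copy number only."""
    mu = C0 + C1 * (gA + gB)       # mean is linear in the total copy number
    var = nu * mu ** 2             # variance proportional to the squared mean
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


# Hypothetical parameters: background 0.2, increment 0.4 per copy, nu = 0.01.
# The genotypes AAB and ABB have the same total copy number and hence the same density.
print(total_cn_density(1.4, 2, 1, 0.2, 0.4, 0.01),
      total_cn_density(1.4, 1, 2, 0.2, 0.4, 0.01))
```

Since such a probe depends on the genotype only through gA + gB, it helps to separate states with different total copy numbers but carries no information about the minor copy number.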

Finally we mention some possible limitations of our method. Firstly, the accuracy of the method is likely to be reduced in regions of very high copy number where signal saturation occurs, such as in amplicons, and bespoke nonlinear adjustments may be required (as discussed above). Secondly, we have ignored copy number polymorphisms. These will produce non-integer copy numbers in the cancer sample due to the skewed ratio between the cancer and the contaminating normal. If copy number data is available for the normal, it may be possible to generalise these methods to make such an adjustment; however, such regions are generally a lot smaller in scale than the somatic copy number changes seen in cancer and were not considered further. Lastly, we have assumed that the sample in question is derived from a homogeneous collection of cells. However, cell-to-cell variation may well produce many different clones with differing copy numbers, and more general methods will be required to deal with such complexities.

To sum up, copy number variations in cancer are common, and their accurate determination is important for identifying homozygous deletions, amplifications and breakpoints, all of which can be functionally implicated in cancer. This problem is compounded by normal contamination, which makes the accurate estimation of integer copy numbers in cancer samples difficult. Here we have introduced a method that addresses this problem.

Acknowledgments

CDG was supported by the Wellcome Trust at the Sanger Centre. The authors would like to thank the anonymous reviewers for their constructive comments and suggestions that improved the presentation of this paper.

Appendix

1. A Primer on Markov Chains and Hidden Markov Models

The purpose of this section is to provide a brief and rather elementary introduction to Markov chains with discrete and continuous index, and to hidden Markov models. A monograph entirely devoted to bioinformatics applications of hidden Markov models is the text by Koski.14

Consider a sequence t1, t2, …, tN of locations (in our case these will be probe locations), and a set {1, 2, …, r} of states (which will in our case be as in Tables 1 or 2). At any location tk there is an actual state x(tk) (ie, a true copy number state), which we think of as the realization of a random variable X(tk). These random variables are dependent, since copy number states at nearby probes are correlated. To model this dependence, we use Markov chains.

A discrete-index Markov chain (we use the term index rather than the more common ‘time’, since bp location is not a temporal variable) is specified by transition probabilities pij(tk–1, tk), giving the (conditional) probability that if the chain happens to be in state i at location tk–1, it will move to state j at location tk. For j = i, the probability concerns the event that the chain will stay in the same state, ie, not move at all. Implicit in this characterization is also the fact that if the states x(t1), x(t2), …, x(tk–1) at all foregoing locations t1, t2, …, tk–1 are known, this does not affect the conditional probability, which only depends on the state x(tk–1) at the closest location tk–1; this is the Markov property. To complete the specification of the Markov chain, we must also provide the initial probabilities, ie, the probabilities that at the first location t1, the chain starts in state i for each respective state.

In our model, the probe locations tk are separated by different distances tk − tk–1, ie, these distances are not equal. We wish to incorporate this feature into the Markov model, so that the transition probabilities pij(tk–1, tk) depend not only on the states i and j that the chain moves from and to respectively, but also on the distance hk = tk − tk–1 between the probes. One way to accomplish this is to think of the base pair location along a chromosome, which we denote by t, as a continuous variable rather than as a discrete one, and to model the state changes of the Markov chain using this continuous variable, or index. In contrast to a discrete-index Markov chain, a continuous-index Markov chain is specified in terms of transition rates. For any state i and any other state j, ie, different from i, there is a transition rate qij from state i to state j. For any state i we also define the total rate out of i, qi, as the sum of all transition rates out of this state, ie, qi = Σj≠i qij. One way to interpret these rates is in terms of sojourn lengths and jump probabilities. Given that the chain has entered state i, it will stay there for a sojourn whose length is random and follows an exponential distribution with rate qi (mean 1/qi); the probability that this sojourn exceeds length s is thus exp(–qi s). When the chain eventually leaves state i, the probability that it jumps to state j is given by qij/qi.
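The sojourn-and-jump interpretation can be illustrated with a short simulation. The sketch below is not part of the method itself; the function names and the 3-state rate matrix are hypothetical, and it assumes that every state has a strictly positive total rate (no absorbing states).

```python
import math
import random

def total_rate(Q, i):
    """Total rate q_i out of state i: sum of the off-diagonal rates in row i."""
    return sum(q for j, q in enumerate(Q[i]) if j != i)

def jump_probabilities(Q, i):
    """Probability q_ij / q_i of jumping to each state j when leaving state i."""
    qi = total_rate(Q, i)
    return [0.0 if j == i else Q[i][j] / qi for j in range(len(Q))]

def sojourn_survival(Q, i, s):
    """Probability exp(-q_i * s) that a sojourn in state i exceeds length s."""
    return math.exp(-total_rate(Q, i) * s)

def simulate(Q, start, length, rng):
    """Simulate a path on [0, length]; returns a list of (entry_location, state)."""
    t, state, path = 0.0, start, [(0.0, start)]
    while True:
        t += rng.expovariate(total_rate(Q, state))  # exponential sojourn length
        if t >= length:
            return path
        weights = jump_probabilities(Q, state)
        state = rng.choices(range(len(Q)), weights=weights)[0]
        path.append((t, state))

# Hypothetical rate matrix (rows sum to zero), for illustration only.
Q = [[-0.3, 0.1, 0.2],
     [0.5, -0.5, 0.0],
     [0.25, 0.25, -0.5]]
path = simulate(Q, start=0, length=50.0, rng=random.Random(1))
```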

For a continuous-index Markov chain it is also possible to compute the probability that for two locations tk–1 and tk separated by distance hk, if the chain is in state i at location tk–1, it will be in state j at location tk. Denoting these probabilities by pij(hk) and collecting them into an r × r matrix P(hk) (thus pij(hk) is the row i column j element of this matrix), it holds that P(hk) = exp(Qhk), where Q is the r × r rate matrix (or intensity matrix, or generator) with off-diagonal elements given by the transition rates qij and diagonal elements qii = –qi, ie, the negative of the total rates out of the respective states. Moreover, exp(Qhk) is the matrix-exponential function, defined by the power series exp(A) = I + A + A²/2! + A³/3! + … for any square matrix A, where I is the identity matrix, ie, a matrix of the same size as A and with diagonal elements equal to one and all off-diagonal elements being zero, and k! is the factorial 1 × 2 × … × k. This definition is a direct generalization of the power series for the ordinary (real-valued) exponential function.
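The power-series definition translates directly into code. The following sketch truncates the series after a fixed number of terms, which is adequate only when the entries of Qhk are moderate in size; this truncation, the function names and the two-state rates below are assumptions made for illustration, and a production implementation would more likely rely on a library routine based on scaling and squaring.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expm_series(A, terms=60):
    """Truncated power series I + A + A^2/2! + ... for a square matrix A."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]           # current term A^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def transition_matrix(Q, h):
    """Transition probabilities P(h) = exp(Q h) over a distance h."""
    return expm_series([[q * h for q in row] for row in Q])

# Hypothetical two-state chain with rates 0.2 (state 1 to 2) and 0.3 (state 2 to 1).
Q = [[-0.2, 0.2],
     [0.3, -0.3]]
P = transition_matrix(Q, 2.0)
```

For a two-state chain the result can be checked against the closed form p11(h) = b/(a + b) + a/(a + b) exp(−(a + b)h), with a and b the two rates, and each row of P(h) sums to one.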

In a hidden Markov model (HMM), the Markov chain is not directly observable, but only as disturbed by noise. In the present setting the copy number state cannot be observed with certainty, but for any probe the intensity measurements, for each allele, provide partial information about the copy number state. In an HMM, the link between the state X(tk) at some location tk and the corresponding measurement Yk (here, intensities) is specified through an emission density fYk|X(tk)(y|i), which is the conditional density of Yk given that X(tk) = i. In the present context the emission density is thus the density of the measured intensities given a certain copy number state. Since there are two intensities available, one for each allele, the density is a bivariate one. Furthermore, since each copy number state (Markov state) contains several genotypes, the emission density for a copy number state is a mixture (weighted average) of densities corresponding to each of these genotypes; this is Eq. (3).

Specifying the HMM thus amounts to specifying the structure and parameters of the Markov chain, and those of the emission densities. When this has been done, typical tasks are to i) estimate parameters from data, and ii) find the most likely realization of the Markov chain, given data. The first task, parameter estimation, is commonly carried out using the so-called EM (expectation maximization) algorithm, which is an iterative procedure that in each iteration increases the likelihood of the model parameters. The purpose is thus to iterate until convergence, and then to report the resulting parameters as the MLE (maximum likelihood estimate); convergence to the MLE is not guaranteed, however. For our HMM, the algorithm is outlined in Appendix 2.1. The second problem above can be viewed as that of reconstructing the Markov trajectory, given data (and model parameters, usually estimated ones). This problem is solved using the so-called Viterbi algorithm, which is a dynamic programming algorithm that recursively finds the most likely path. This algorithm, for our HMM, is described in Appendix 2.2.

2. Methods

2.1. The EM algorithm

The parameters to estimate are the ploidy K, the fraction γ of normal tissue, and, for each chromosome c, the two transition rates λc and ηc and the initial distribution δc. Our starting point is the EM algorithm for continuous-index hidden Markov chains by Roberts and Ephraim.23 As latent (unobserved) data we take the whole Markov trajectory (Xc(t)), t1c ≤ t ≤ Tc, for each chromosome c, but the complete likelihood involves only the sufficient statistics consisting of the initial state Xc(t1c), the total lengths Tnc and Tac of sojourns in the normal state and in abnormal states respectively, and the numbers m·ac and manc of jumps to abnormal states, and from abnormal states to the normal state respectively, for chromosome c.

With these sufficient statistics, and recalling that each λc has a Gamma prior with shape and intensity parameters say αcλ and βcλ, and analogously for each ηc, the complete log-posterior, ie, the sum of the complete log-likelihood and the log-prior, is, up to a constant not depending on the parameters,

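$$L_c(\theta; X, y) = \sum_c \Big\{ \log \delta_{X(t_1),c} + m_{\cdot ac} \log \lambda_c + m_{anc} \log \eta_c - \lambda_c T_{nc} - (\lambda_c + \eta_c) T_{ac} + (\alpha_c^{\lambda} - 1) \log \lambda_c - \beta_c^{\lambda} \lambda_c + (\alpha_c^{\eta} - 1) \log \eta_c - \beta_c^{\eta} \eta_c + \sum_{k=1}^{N_c} \log f_{Y_{kc}|X_{kc}}(y_{kc} \mid X_{kc}; \gamma, K) \Big\},$$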

where θ = (δc, λc, ηc, K, γ). Moreover, y = {ykc} is the collection of all data and X = {(Xc(t))} is the collection of all (unobserved) Markov chain trajectories. The quantity to maximize in one iteration of the EM algorithm is

$$Q(\theta; \theta') = E_{\theta}[L_c(\theta'; X, y) \mid y],$$

where maximization is with respect to θ′ and the notation Eθ indicates that the expectation is computed under the current parameter (estimate) θ. Note that Lc(θ′;X,y) and hence also Q(θ;θ′) split into two distinct parts, one of which depends on (δc, λc, ηc) only and one of which depends on K and γ only. Maximization with respect to (δc, λc, ηc) and with respect to (K, γ) can thus be carried out separately.

Also note that K and γ are common across the genome, and therefore estimated using the data for all chromosomes. For each iteration of the EM algorithm we compute the forward and backward variables for each chromosome, store them, and then re-estimate K and γ using the information from all chromosomes.

The M-steps for the transition rates read

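$$\hat\lambda_c = \frac{\alpha_c^{\lambda} - 1 + \hat m_{\cdot ac}}{\beta_c^{\lambda} + \hat T_{ac} + \hat T_{nc}}, \qquad \hat\eta_c = \frac{\alpha_c^{\eta} - 1 + \hat m_{anc}}{\beta_c^{\eta} + \hat T_{ac}},$$

where m̂·ac = Eθ[m·ac | y1c, …, yNc,c] etc. Note that Tac + Tnc equals the length of the Markov chain trajectory for chromosome c, ie, Tc − t1c, so that also T̂ac + T̂nc = Tc − t1c. Moreover, the M-step for the initial distributions is

$$\hat\delta_{ic} = P_{\theta}(X_c(t_{1c}) = i \mid y_{1c}, \ldots, y_{N_c,c}).$$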
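The rate updates above amount to simple ratios once the E-step quantities are available. A minimal sketch follows; the function and argument names are hypothetical, and the numbers in the example are invented purely to show the arithmetic.

```python
def m_step_rates(m_to_abn, m_abn_to_norm, T_abn, T_norm,
                 alpha_lam, beta_lam, alpha_eta, beta_eta):
    """M-step for the transition rates of one chromosome.

    m_to_abn, m_abn_to_norm: expected numbers of jumps (from the E-step).
    T_abn, T_norm: expected total sojourn lengths in abnormal / normal states.
    alpha_*, beta_*: shape and intensity parameters of the Gamma priors.
    """
    lam = (alpha_lam - 1.0 + m_to_abn) / (beta_lam + T_abn + T_norm)
    eta = (alpha_eta - 1.0 + m_abn_to_norm) / (beta_eta + T_abn)
    return lam, eta

# Invented values: 3 expected jumps to abnormal states, 2 back to normal,
# 40 and 60 length units spent in abnormal and normal states respectively.
lam_hat, eta_hat = m_step_rates(3.0, 2.0, 40.0, 60.0,
                                alpha_lam=2.0, beta_lam=100.0,
                                alpha_eta=2.0, beta_eta=100.0)
```

With these inputs, λ̂ = (2 − 1 + 3)/(100 + 40 + 60) = 0.02 and η̂ = (2 − 1 + 2)/(100 + 40) = 3/140, so the prior acts like additional pseudo-observations of jumps and sojourn length.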

The M-step for the ploidy is

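$$\hat K = \frac{V}{4U} + \sqrt{\frac{V^2}{16U^2} + \frac{\sum_c N_c}{U}},$$

where

$$U = \sum_{c,k,i,g \in G_i} \frac{1}{8\nu_{kc}(1 - \rho_{kc}^2)} \left( \frac{y_{Akc}^2}{\mu_{Akcg}^2} - \frac{2\rho_{kc}\, y_{Akc}\, y_{Bkc}}{\mu_{Akcg}\, \mu_{Bkcg}} + \frac{y_{Bkc}^2}{\mu_{Bkcg}^2} \right) \times P_{\theta}(X_c(t_{kc}) = i, G_{kc} = g \mid y_{1c}, \ldots, y_{N_c,c})$$

and

$$V = \sum_{c,k,i,g \in G_i} \frac{1}{2\nu_{kc}(1 + \rho_{kc})} \left( \frac{y_{Akc}}{\mu_{Akcg}} + \frac{y_{Bkc}}{\mu_{Bkcg}} \right) \times P_{\theta}(X_c(t_{kc}) = i, G_{kc} = g \mid y_{1c}, \ldots, y_{N_c,c}).$$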

For the fraction γ of normal contamination there is no closed-form expression for the M-step, and to re-estimate γ we maximize Q(θ; ·), as a function of γ, numerically. Note however that K̂ above depends on γ, which appears implicitly in the means μAkcg and μBkcg used to compute U and V. Therefore, by maximizing w.r.t. K′ (using the current γ) and then w.r.t. γ′ (using the re-estimated K̂), as we do, and not w.r.t. K′ and γ′ jointly, we in fact obtain a generalized EM algorithm rather than an EM algorithm, in the terminology of Dempster et al5 (Eq. (3.5)).

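The conditional expectations m̂·ac etc. are computed in the E-step, which follows that of Roberts and Ephraim23 with minor changes. Now write yk:l,c for {ykc, yk+1,c, …, ylc}, and let mijc be the number of jumps by the Markov chain from state i to state j, in chromosome c. Then

$$\begin{aligned} \hat m_{ijc} &= E_{\theta}[m_{ijc} \mid y_{1:N_c,c}] = \int_0^{T_c} P_{\theta}(X_c(t-) = i, X_c(t) = j \mid y_{1:N_c,c})\, dt \\ &= \int_0^{T_c} \frac{P_{\theta}(X_c(t-) = i, X_c(t) = j)}{p_{\theta}(y_{1:N_c,c})} \times p_{\theta}(y_{1:N_c,c} \mid X_c(t-) = i, X_c(t) = j)\, dt \\ &= \sum_{k=2}^{N_c} \int_{t_{k-1,c}}^{t_{kc}} \frac{P_{\theta}(X_c(t) = j \mid X_c(t-) = i)\, P_{\theta}(X_c(t-) = i)}{p_{\theta}(y_{1:N_c,c})} \times p_{\theta}(y_{1:k-1,c}, y_{k:N_c,c} \mid X_c(t-) = i, X_c(t) = j)\, dt; \end{aligned}$$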

here the symbol P denotes probabilities as well as densities; note that Pθ(Xc(t) = j|Xc(t−) = i) = qijc, where qijc is the transition rate from state i to state j in chromosome c. Thus, with rabnormal being the number of abnormal states, qijc is equal to λc/rabnormal if i is the normal state and j is any abnormal state, equal to λc/(rabnormal − 1) if i and j are both abnormal states (because the chain cannot jump from a state to itself), and equal to ηc if i is any abnormal state and j is the normal state. Given the HMM structure it follows that y1:k–1,c and yk:Nc,c are conditionally independent given Xc(t−) = i and Xc(t) = j, whence

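$$\begin{aligned} \hat m_{ijc} &= \frac{q_{ijc}}{p_{\theta}(y_{1:N_c,c})} \sum_{k=2}^{N_c} \int_{t_{k-1,c}}^{t_{kc}} P_{\theta}(X_c(t-) = i)\, p_{\theta}(y_{1:k-1,c} \mid X_c(t-) = i) \times p_{\theta}(y_{k:N_c,c} \mid X_c(t) = j)\, dt \\ &= \frac{q_{ijc}}{p_{\theta}(y_{1:N_c,c})} \sum_{k=2}^{N_c} \int_0^{h_{kc}} P_{\theta}(y_{1:k-1,c}, X_c(t_{k-1,c} + t) = i) \times p_{\theta}(y_{k:N_c,c} \mid X_c(t_{k-1,c} + t) = j)\, dt \end{aligned}$$

with hkc = tkc − tk–1,c. Here, the two factors in the integrand on the right-hand side are the forward and backward densities respectively.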
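The rate structure described above (λc spread evenly over the abnormal target states, ηc for returns to the normal state) can be assembled into a rate matrix as in the following sketch. The function name and the 0-based `normal` index are assumptions for illustration, and the construction requires at least two abnormal states.

```python
def build_rate_matrix(lam, eta, r, normal):
    """Rate matrix for r states, where index `normal` (0-based) is the normal state.

    Normal -> abnormal j:      lam / r_abnormal
    Abnormal i -> abnormal j:  lam / (r_abnormal - 1)
    Abnormal i -> normal:      eta
    Diagonal entries are minus the total rate out of each state.
    """
    r_abn = r - 1
    Q = [[0.0] * r for _ in range(r)]
    for i in range(r):
        for j in range(r):
            if i == j:
                continue
            if i == normal:
                Q[i][j] = lam / r_abn
            elif j == normal:
                Q[i][j] = eta
            else:
                Q[i][j] = lam / (r_abn - 1)
        Q[i][i] = -sum(Q[i][j] for j in range(r) if j != i)
    return Q

# Hypothetical example: 5 states, the fourth (index 3) being the normal state.
Q = build_rate_matrix(lam=0.02, eta=0.05, r=5, normal=3)
```

With this parametrization the total rate out of the normal state is λc and the total rate out of any abnormal state is λc + ηc, which matches the terms −λcTnc − (λc + ηc)Tac in the complete log-posterior.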

To compute these factors, and similar ones, we use a forward-backward type algorithm. Let r be the number of Markov states, and let Bkc be the r × r diagonal matrix whose (i, i)-th element is the probability density function of ykc given Markov state i at position tkc, ie, fYkc|Xc(tkc)=i(ykc) in Eq. (3). Further let Fkc be the r × r matrix whose (i, j)-th element is

$$[F_{kc}]_{ij} = P_{\theta}(y_{kc}, X_c(t_{kc}) = j \mid X_c(t_{k-1,c}) = i) = [\exp(Q_c h_{kc})]_{ij}\, [B_{kc}]_{jj},$$

where Qc is the matrix with off-diagonal elements qijc, i, j = 1, 2, …, r for i ≠ j, and diagonal elements qiic being the negative of the total rate out of state i for chromosome c (the row sums of Qc then become zero). We note that the discrete-index process (Xc(tkc))1≤k≤Nc, ie, the continuous-index process (Xc(·)) sampled at the locations of the probes, is a non-homogeneous discrete-index Markov chain with transition probability matrices, from tk–1,c to tkc, given by exp(Qchkc). With this matrix notation we have Fkc = exp(Qchkc)Bkc, and the likelihood for chromosome c can be written

$$p_{\theta}(y_{1:N_c,c}) = \delta_c B_{1c} \Big( \prod_{k=2}^{N_c} F_{kc} \Big) \mathbf{1},$$

where 1 is the r × 1 vector of all ones. The forward densities are

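$$\begin{aligned} P_{\theta}(y_{1:k-1,c}, X_c(t_{k-1,c} + t) = i) &= \sum_{s=1}^{r} P_{\theta}(y_{1:k-1,c}, X_c(t_{k-1,c}) = s) \times P_{\theta}(X_c(t_{k-1,c} + t) = i \mid X_c(t_{k-1,c}) = s) \\ &= \sum_{s=1}^{r} \Big( \delta_c B_{1c} \prod_{\kappa=2}^{k-1} F_{\kappa c} \Big) \mathbf{1}_s\, [\exp(Q_c t)]_{si}, \end{aligned}$$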

where 1j is the r × 1 vector whose elements are zero except for element j which is one.

The backward densities are

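$$\begin{aligned} p_{\theta}(y_{k:N_c,c} \mid X_c(t_{k-1,c} + t) = j) &= \sum_{s=1}^{r} p_{\theta}(y_{k+1:N_c,c} \mid X_c(t_{kc}) = s)\, p_{\theta}(y_{kc} \mid X_c(t_{kc}) = s) \times P_{\theta}(X_c(t_{kc}) = s \mid X_c(t_{k-1,c} + t) = j) \\ &= \sum_{s=1}^{r} [B_{kc}]_{ss}\, [\exp(Q_c(h_{kc} - t))]_{js}\, \mathbf{1}_s^{\top} \Big( \prod_{\kappa=k+1}^{N_c} F_{\kappa c} \Big) \mathbf{1}. \end{aligned}$$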

The above matrix multiplications are numerically unstable, as the products will either tend to zero or infinity exponentially fast as the number of factors increases. Therefore scaled versions of these recursions are introduced. The scaled forward densities with normalizing constants dkc at probe (k, c) are

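$$L_c(k) = \frac{\delta_c B_{1c}}{d_{1c}} \prod_{\kappa=2}^{k} \frac{F_{\kappa c}}{d_{\kappa c}},$$

which we compute recursively as

$$L_c(k) = \frac{L_c(k-1)\, F_{kc}}{d_{kc}}$$

with d1c = δc B1c 1, Lc(1) = δc B1c/d1c and dkc = Lc(k – 1) Fkc 1, where 1 again denotes the r × 1 vector of all ones.

The scaled backward densities are

$$R_c(k) = \Big( \prod_{\kappa=k+1}^{N_c} \frac{F_{\kappa c}}{d_{\kappa c}} \Big) \mathbf{1},$$

which we compute as

$$R_c(k) = \frac{F_{k+1,c}\, R_c(k+1)}{d_{k+1,c}}$$

with Rc(Nc) = 1, the vector of all ones.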
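The scaled recursions can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions rather than the implementation used for the results: the matrix exponential is computed by a truncated power series (reasonable only for moderate values of Qc hkc), the emission densities are passed in as precomputed numbers `emis[k][i]` standing for [Bkc]ii, and all names and example values are hypothetical.

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expm_series(A, terms=60):
    """Truncated power series for exp(A); adequate for small matrices only."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def forward_backward(delta, Q, locations, emis):
    """Scaled forward vectors L, backward vectors R and scaling constants d."""
    r, N = len(delta), len(locations)
    L, R, d, F = [None] * N, [None] * N, [0.0] * N, [None] * N
    unnorm = [delta[i] * emis[0][i] for i in range(r)]
    d[0] = sum(unnorm)
    L[0] = [u / d[0] for u in unnorm]
    for k in range(1, N):
        h = locations[k] - locations[k - 1]
        P = expm_series([[q * h for q in row] for row in Q])
        # F_k = exp(Q h_k) B_k, with B_k diagonal
        F[k] = [[P[i][j] * emis[k][j] for j in range(r)] for i in range(r)]
        unnorm = [sum(L[k - 1][i] * F[k][i][j] for i in range(r)) for j in range(r)]
        d[k] = sum(unnorm)
        L[k] = [u / d[k] for u in unnorm]
    R[N - 1] = [1.0] * r
    for k in range(N - 2, -1, -1):
        R[k] = [sum(F[k + 1][i][j] * R[k + 1][j] for j in range(r)) / d[k + 1]
                for i in range(r)]
    return L, R, d

# Hypothetical two-state example with three unequally spaced probes.
delta = [0.6, 0.4]
Q = [[-0.2, 0.2], [0.3, -0.3]]
locations = [0.0, 1.0, 3.5]
emis = [[0.9, 0.1], [0.2, 0.7], [0.5, 0.4]]
L, R, d = forward_backward(delta, Q, locations, emis)
```

The product of the scaling constants dkc equals the likelihood, so the log-likelihood is the sum of their logarithms, and the element-wise product of Lc(k) and Rc(k) gives the smoothed state probabilities at probe k, which sum to one.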

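Using these scaled quantities, the matrix m̂c with entries m̂ijc can be expressed as

$$\hat m_c = Q_c \odot \sum_{k=2}^{N_c} I_{kc}^{\top},$$

where ⊙ denotes element-wise multiplication, ⊤ denotes transposition, and

$$I_{kc} = \int_0^{h_{kc}} \exp(Q_c(h_{kc} - t))\, V_{kc} \exp(Q_c t)\, dt$$

with

$$V_{kc} = \frac{B_{kc}\, R_c(k)\, L_c(k-1)}{d_{kc}}.$$

The integrals Ikc are evaluated using the 2r × 2r block matrix

$$D_{kc} = \begin{pmatrix} Q_c & V_{kc} \\ 0 & Q_c \end{pmatrix};$$

Ikc is then the upper right r × r block of exp(Dkchkc).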
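The block-matrix device for evaluating such integrals can be illustrated as follows. The sketch builds a 2r × 2r matrix with Q in both diagonal blocks and V in the upper right block, exponentiates it, and reads off the upper right block. The truncated power series used for the matrix exponential, the function names and the example values are assumptions for illustration only.

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expm_series(A, terms=60):
    """Truncated power series for exp(A); adequate for small matrices only."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def integral_via_block_exp(Q, V, h):
    """Integral over [0, h] of exp(Q (h - t)) V exp(Q t) dt via a block exponential."""
    r = len(Q)
    D = [[0.0] * (2 * r) for _ in range(2 * r)]
    for i in range(r):
        for j in range(r):
            D[i][j] = Q[i][j]          # upper left block
            D[i][j + r] = V[i][j]      # upper right block
            D[i + r][j + r] = Q[i][j]  # lower right block
    E = expm_series([[x * h for x in row] for row in D])
    return [row[r:] for row in E[:r]]  # upper right r x r block

# Scalar sanity check: the integral equals v * h * exp(a * h).
I_scalar = integral_via_block_exp([[-0.5]], [[2.0]], 1.5)
```

When V is the identity matrix the integrand reduces to exp(Qh) for every t, so the result should equal h exp(Qh); this gives a convenient check in the matrix case.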

Finally, recalling that the normal state is state 4,

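$$\hat m_{\cdot ac} = \sum_{1 \le i \le r}\ \sum_{1 \le j \le r,\ j \ne i,\ j \ne 4} \hat m_{ijc}, \qquad \hat m_{anc} = \sum_{1 \le i \le r,\ i \ne 4} \hat m_{i4c}.$$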

Using similar types of computations it follows that

$$\hat T_{ic} = E[T_{ic} \mid y_{1:N_c,c}] = \hat m_{iic} / q_{iic},$$

where Tic is the total length of all sojourns of the Markov chain in state i within chromosome c. Moreover,

$$P_{\theta}(X(t_{1c}) = i \mid y_{1:N_c,c}) \propto L_c(1)_i\, R_c(1)_i,$$

and the conditional probabilities in the expressions for U and V are computed using

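$$P(X_c(t_{kc}) = i, G_{kc} = g \mid y_{1:N_c,c}) = P(G_{kc} = g \mid X_{kc} = i, y_{1:N_c,c})\, P(X_{kc} = i \mid y_{1:N_c,c}) \propto w_{kcig}\, f(y_{kc} \mid G_{kc} = g)\, L_c(k)_i\, R_c(k)_i,$$

where the weights and densities on the right-hand side are those in Eq. (3) and where, for each state i, the normalization over g ∈ Gi is such that the probabilities sum to P(Xkc = i | y1:Nc,c).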

2.2. The Viterbi algorithm

We used a Viterbi algorithm, adapted to the continuous-index structure, to find the a posteriori most likely Markov chain trajectory. The algorithm is the usual Viterbi algorithm, but with transition probability matrices exp(Qchkc) that vary with probe index (c, k). The algorithm thus finds the most likely sequence at the probe locations only. When the estimated reconstruction of each Markov state X(tkc) is available, one may also estimate the corresponding genotype Gkc (see below).

For any chromosome c, the Viterbi algorithm is as follows. To ensure numeric stability, it operates on log-scale.

  1. Put ξ1c(i) = log(δic[B1c]ii) for i = 1, …, r.

  2. Iterate for k = 2, 3, …, Nc,

    $$\xi_{kc}(j) = \max_i \{ \xi_{k-1,c}(i) + \log[\exp(Q_c h_{kc})]_{ij} + \log[B_{kc}]_{jj} \}$$

    for j = 1, 2, …, r.

  3. Put x̂c(Nc) = arg maxi ξNc,c(i).

  4. Iterate for k = Nc – 1, Nc – 2, …, 1,
    $$\hat x_c(k) = \arg\max_i \{ \xi_{kc}(i) + \log[\exp(Q_c h_{k+1,c})]_{i,\hat x_c(k+1)} \}.$$
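The four steps above can be sketched as follows for a single chromosome. As before, this is an illustration under stated assumptions: the matrix exponential is a truncated power series, `emis[k][i]` is a hypothetical stand-in for the precomputed emission density [Bkc]ii, states are indexed from 0, and the example values are invented.

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expm_series(A, terms=60):
    """Truncated power series for exp(A); adequate for small matrices only."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def safe_log(x):
    return math.log(x) if x > 0.0 else float("-inf")

def viterbi(delta, Q, locations, emis):
    """Most likely state sequence at the probe locations, computed on log scale."""
    r, N = len(delta), len(locations)
    xi = [[safe_log(delta[i] * emis[0][i]) for i in range(r)]]       # step 1
    logP = [None]
    for k in range(1, N):                                            # step 2
        h = locations[k] - locations[k - 1]
        P = expm_series([[q * h for q in row] for row in Q])
        logP.append([[safe_log(p) for p in row] for row in P])
        xi.append([max(xi[k - 1][i] + logP[k][i][j] for i in range(r))
                   + safe_log(emis[k][j]) for j in range(r)])
    path = [0] * N
    path[N - 1] = max(range(r), key=lambda i: xi[N - 1][i])          # step 3
    for k in range(N - 2, -1, -1):                                   # step 4
        path[k] = max(range(r), key=lambda i: xi[k][i] + logP[k + 1][i][path[k + 1]])
    return path

# Hypothetical two-state example with four equally spaced probes.
delta = [0.5, 0.5]
Q = [[-0.1, 0.1], [0.1, -0.1]]
locations = [0.0, 1.0, 2.0, 3.0]
emis = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
path = viterbi(delta, Q, locations, emis)
```

In this toy example the emissions favour the first state at the first two probes and the second state at the last two, and the reconstructed path switches state once between the second and third probe.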

Having reconstructed the states x̂c(tkc), it holds that the corresponding genotypes, given the Markov chain and intensity data, are conditionally independent with

$$P(G_{kc} = g \mid X_c(t_{kc}) = i, Y_{kc} = y) = \frac{w_{kcig}\, f_{Y_{kc}|G_{kc}}(y \mid g)}{\sum_{g' \in G_i} w_{kcig'}\, f_{Y_{kc}|G_{kc}}(y \mid g')}$$

for all g ∈ Gi; here fYkc|Gkc(y|g) is the bivariate Normal density as in Eq. (3). Selecting, for each probe (k, c), Gkc as the genotype g ∈ Gi maximizing this expression thus yields a maximum a posteriori (MAP) reconstruction of the genotypes at all probes.
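The genotype step is a simple normalization followed by an arg max, as in the sketch below. The function names are hypothetical, and the weights and density values are invented for illustration; in practice the densities would be bivariate Normal densities evaluated at the observed intensities.

```python
def genotype_posterior(weights, densities):
    """Posterior probabilities of the genotypes within one copy number state.

    weights[g]:   mixture weight of genotype g within the state.
    densities[g]: emission density of the observed intensities given genotype g.
    """
    joint = {g: weights[g] * densities[g] for g in weights}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

def map_genotype(weights, densities):
    """Genotype with the largest posterior probability."""
    post = genotype_posterior(weights, densities)
    return max(post, key=post.get)

# Invented numbers for a normal (two-copy) state with genotypes AA, AB and BB.
weights = {"AA": 0.25, "AB": 0.5, "BB": 0.25}
densities = {"AA": 0.02, "AB": 0.4, "BB": 0.01}
posterior = genotype_posterior(weights, densities)
best = map_genotype(weights, densities)
```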

Footnotes

Disclosure

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

References

  • 1. Andersson R, Bruder CEG, Piotrowski A, et al. A segmental maximum a posteriori approach to genome-wide Copy Number profiling. Bioinformatics. 2008;24:751–8. doi: 10.1093/bioinformatics/btn003.
  • 2. Attiyeh EF, Diskin SJ, Attiyeh MA, et al. Genomic copy number determination in cancer cells from single nucleotide polymorphism microarrays based on quantitative genotyping corrected for aneuploidy. Genome Res. 2009;19:276–83. doi: 10.1101/gr.075671.107.
  • 3. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–25. doi: 10.1093/nar/gkm076.
  • 4. Daruwala R, Rudra A, Ostrer H, Lucito R, Wigler M, Mishra B. A versatile statistical analysis algorithm to detect genome copy number variation. Proc Nat Acad Sci. 2004;101:16292–7. doi: 10.1073/pnas.0407247101.
  • 5. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Statist Soc B. 1977;39:1–38.
  • 6. Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics. 2005;21:1146–53. doi: 10.1093/bioinformatics/bti148.
  • 7. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004;90:132–53.
  • 8. Greenman CD, Bignell G, Butler A, et al. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatist. 2010;11:164–75. doi: 10.1093/biostatistics/kxp045.
  • 9. Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Amer Statist Assoc. 2008;103:485–97. doi: 10.1198/016214507000000923.
  • 10. Mitra R, Gupta M. A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA. Biostatist. to appear.
  • 11. Huang J, Wei W, Chen J, et al. CARAT: A novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics. 2006;7:83. doi: 10.1186/1471-2105-7-83.
  • 12. Hupé P, Stransky N, Thiery J, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–22. doi: 10.1093/bioinformatics/bth418.
  • 13. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs common copy number polymorphisms and rare CNVs. Nature Genetics. 2008;40:1253–60. doi: 10.1038/ng.237.
  • 14. Koski T. Hidden Markov Models for Bioinformatics. Dordrecht: Kluwer Academic Publishers; 2001.
  • 15. Laframboise T, Harrington D, Weir BA. PLASQ: a generalized linear model-based procedure to determine allelic dosage in cancer cells from SNP array data. Biostatist. 2007;8:323–36. doi: 10.1093/biostatistics/kxl012.
  • 16. Lai TL, Xing H, Zhang N. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatist. 2008;9:290–307. doi: 10.1093/biostatistics/kxm031.
  • 17. Lamy P, Andersen CL, Dyrskjot L, Torring N, Wiuf C. A hidden Markov model to estimate population mixture and allelic copy-numbers in cancers using Affymetrix SNP arrays. BMC Bioinformatics. 2007;8:434. doi: 10.1186/1471-2105-8-434.
  • 18. Li C, Beroukhim R, Weir BA, Winckler W, Garraway LA, Sellers WT, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics. 2008;9:204. doi: 10.1186/1471-2105-9-204.
  • 19. Marioni JC, Thorne NP, Tavaré S. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006;22:1144–6. doi: 10.1093/bioinformatics/btl089.
  • 20. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatist. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
  • 21. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. doi: 10.1186/1471-2105-6-27.
  • 22. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biology. 2009;10:R128. doi: 10.1186/gb-2009-10-11-r128.
  • 23. Roberts WJJ, Ephraim Y. An EM Algorithm for ion-channel current estimation. IEEE Trans Signal Proc. 2008;56:26–33.
  • 24. Rueda OM, Días R. Flexible and accurate detection of genomic copy-number changes from aCGH. PLoS Comput Biol. 2007;3:1115–22. doi: 10.1371/journal.pcbi.0030122.
  • 25. Rydén T. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective (with discussion). Bayesian Anal. 2008;3:659–88.
  • 26. Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I. Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. Ann Appl Statist. 2008;2:687–713. doi: 10.1214/07-AOAS155.
  • 27. Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22:e431–9. doi: 10.1093/bioinformatics/btl238.
  • 28. Stjernqvist S, Rydén T, Sköld M, Staaf J. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics. 2007;23:1006–14. doi: 10.1093/bioinformatics/btm059.
  • 29. Stjernqvist S, Rydén T. A continuous-index hidden Markov jump process for modelling DNA copy number data. Biostatist. 2009;10:773–8. doi: 10.1093/biostatistics/kxp030.
  • 30. Sun W, Wright FA, Tang Z, et al. Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res. 2009;37:5365–77. doi: 10.1093/nar/gkp493.
  • 31. Wang K, Li M, Hadley D. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–74. doi: 10.1101/gr.6861907.
  • 32. www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=sample&id=919182
  • 33. www.maths.lth.se/matstat/staff/susann/

Articles from Cancer Informatics are provided here courtesy of SAGE Publications
