Model-Integrated Estimation of Normal Tissue Contamination for Cancer SNP Allelic Copy Number Data

Susann Stjernqvist; Tobias Rydén; Chris D Greenman

doi:10.4137/CIN.S6873

2011 May 25;10:159–173.

PMCID: PMC3118450 PMID: 21695067

Abstract

SNP allelic copy number data provides intensity measurements for the two different alleles separately. We present a method that estimates the number of copies of each allele at each SNP position, using a continuous-index hidden Markov model. The method is especially suited for cancer data, since it includes the fraction of normal tissue contamination, often present when studying data from cancer tumors, into the model. The continuous-index structure takes into account the distances between the SNPs, and is thereby appropriate also when SNPs are unequally spaced. In a simulation study we show that the method performs favorably compared to previous methods even with as much as 70% normal contamination. We also provide results from applications to clinical data produced using the Affymetrix genome-wide SNP 6.0 platform.

Keywords: allelic copy number, hidden Markov model, cancer, normal cell contamination

1. Introduction

DNA in tumor cells can contain abnormalities in the form of copy number aberrations such as segments with losses or gains of one or several copies of either allele. The lengths of such aberrations can vary between short segments up to an entire chromosome, and their positions are essential both for detecting and for improving knowledge of various sorts of cancer. Therefore, methods that localize copy number aberrations are of great importance. In addition to changes in the total copy number of both alleles together, changes in the allelic copy numbers, ie, the number of copies of each allele, are also important. We denote the two different alleles at a given genomic location by A and B, so that for normal cells the possible genotypes are AA, AB and BB. One example of a genotype aberration is loss of heterozygosity (LOH), for which the only attainable genotypes are AA and BB.

Different techniques to measure DNA copy numbers have been developed, as have methods to evaluate the measurement data. One technique is array comparative genomic hybridization (aCGH), which provides ratios of the copy numbers of a sample DNA, compared to those of some reference DNA. Several different statistical methods have been applied to this kind of data, including different segmentation methods,⁴,²⁰,²¹ smoothing⁶,¹² and hidden Markov models.¹,⁷,⁹,¹⁶,¹⁹,²⁴,²⁷,²⁸,²⁹ aCGH data provides information only about the total copy number and gives no information about the amount of each allele. Another drawback with such data is the limited number of probes on the arrays. For this reason there is an increased use of single nucleotide polymorphism (SNP) data, which offers denser measurements and provides intensities for the two alleles separately. Using SNP data it is possible not only to estimate copy number changes, but also to find allelic changes such as LOH. Indeed, a copy number amplification may be caused by different allelic changes. For example, a copy number of four could correspond either to {AAAA, AAAB, ABBB, BBBB}, to {AAAA, AABB, BBBB} or to {AAAA, BBBB}, depending on which allele has gained extra copies.

SNP data has previously been analyzed using various sorts of methods, such as smoothing¹¹,¹⁵ and pattern recognition.²² The most frequently used methods are however based on hidden Markov models (HMMs).³,⁸,¹³,¹⁷,¹⁸,²⁶,³⁰,³¹ A brief introduction to Markov chains and HMMs is found in Appendix 1. HMMs suit SNP data well since genomic alterations often appear in longer or shorter segments, implying that copy numbers across probes in a small genomic region are correlated. For example, Wang et al³¹ and Colella et al³ model SNP data from the Illumina array, which provides log R ratio data (log₂-ratio of total observed intensities to total expected intensities) and BAF data (normalized measure of the relative intensities of the two alleles), using an HMM with six states, while Sun et al³⁰ apply a more comprehensive model with nine states. Korn et al¹³ combine an HMM to model copy number variants with a clustering algorithm to detect genotypes. Li et al¹⁸ also model the proportion of the major allele while Lamy et al¹⁷ use both allelic intensities provided by the Affymetrix array and model them using bivariate Normal distributions.

Several of the methods above assume that the ploidy, ie, mean copy number, of a chromosome is two. This holds for normal cells, but cancer cells are aneuploid, ie, their ploidy may differ from two. The necessity for considering ploidy when modeling cancer data is well described by Greenman et al,⁸ but in brief one can say that the measured normalized intensity for a probe in a diploid chromosome is twice as large as for a probe with the same copy number in a tetraploid chromosome. Two methods that include ploidy are those of Attiyeh et al² and Greenman et al,⁸ which both contain a pre-processing step in which the ploidy is estimated. Greenman et al then continue by using an HMM while Attiyeh et al apply a window-based model.

Another feature common in tumor samples, arising from the difficulty of dissecting only tumor cells from a tissue sample, is contamination of the tumor cell sample by normal cells. As a result the measured allelic intensities are mixtures of intensities from tumor and normal cells, thus yielding non-integer DNA copy numbers. One way to incorporate such contamination is to model total copy numbers of the mixed sample in a non-parametric way,²,²⁹ but this provides limited information about the copy numbers of the cancer cells. Sun et al³⁰ estimate the fraction of normal tissue contamination using an empirical method and Colella et al³ write that it is possible to extend their method to handle contamination, but without being more specific. Li et al¹⁸ show that their method can handle a fraction of normal tissue contamination up to 30%, while Lamy et al¹⁷ report a simulation study with slightly better results. Some tumors however form in a manner such that even with microdissection, a significant proportion of normal cells (say 50% or more) can arise in the sample, and none of the above methods provide results that are satisfactory for such high fractions of normal tissue contamination.

The purpose of the present paper is to devise a method to estimate allelic copy numbers, with ploidy and fraction of normal tissue contamination integrated in the model. Indeed, in all of the above papers, ploidy and/or normal fraction are estimated by adding more or less ad-hoc steps to a model that does not account for these parameters in itself. The model reported here is thus particularly suited for cancer data, for which both of these features are common. By including these parameters in the model they can be estimated alongside the other parameters using all data, rather than adding a pre-processing step or empirical methods using only a small subset of the data. In the simulation study presented below, samples with 30%, 50% and 70% normal contamination are simulated and even for the largest amount of contamination, 97% of the probes are reconstructed to the correct copy number state.

An additional feature of our model is that it is based on a continuous-index Markov chain, which accounts for the fact that the SNP probes are often unevenly spread over the genome. The relevance of a continuous-index model was highlighted by Gupta and Mitra¹⁰ (Section 5.3) for the different but related problem of classifying regions of DNA as nucleosome free regions (NFR) or non-NFR using a two-state HMM. Indeed, they showed that with irregularly spaced probes, a continuous-index model can provide substantially better results than a discrete-index model; 99% vs. 85% or 68% correct classifications in simulations for two different arrangements of the probes. Also the methods by Wang et al,³¹ Colella et al³ and Li et al,¹⁸ who apply discrete-index HMMs to SNP data, aim to take distances between probes into account by letting the Markov transition probabilities depend on these distances in different ways. Common to all of these methods is however that the stipulated transition probabilities violate the Chapman-Kolmogorov equation of Markov chains. That is, letting P(t) be the matrix of transition probabilities over a distance t between two probes, the equality P(t₁ + t₂) = P(t₁)P(t₂) does not hold. In essence this means that there is in fact no Markov chain with the given transition probabilities.
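As a reminder of why a rate-based formulation avoids this problem, the standard argument can be written out as follows. The symbol Q, denoting the matrix of transition rates of a continuous-index chain, is notation introduced for this sketch only.

\begin{aligned} P(t) &= e^{tQ} = \sum_{n=0}^{\infty} \frac{t^{n}}{n!} Q^{n}, \\ P(t_{1} + t_{2}) &= e^{(t_{1} + t_{2}) Q} = e^{t_{1} Q} e^{t_{2} Q} = P(t_{1}) P(t_{2}), \end{aligned}

where the factorization of the exponential holds because t₁Q and t₂Q commute. Transition probabilities derived from a single rate matrix are therefore mutually consistent for any set of probe distances, whereas transition probabilities specified directly as functions of distance need not be.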

The paper is organized as follows. The model is described in Section 3. Section 4 provides results from a simulation study as well as from an application to clinical data. Concluding remarks are given in Section 5.

2. Data

The data used in this study are the cancer samples in Greenman et al,⁸ produced using the Affymetrix genome-wide SNP 6.0 platform. We applied the algorithm to about 15 different cell line and primary tumor samples, representing various cancer forms including breast, lung and renal cancer. The primary tumor sample PD1753a, for which results are reported in Section 4.2, is a clear cell renal cell carcinoma sample.³²

For probes at SNPs the intensities of the two different alleles are provided, while at other positions only a single total copy number intensity is available. Following Greenman et al,⁸ the intensities are normalized by first dividing each measurement by the total intensity of the sample (ie, the sum of all probe intensities over the entire genome), to remove chip-to-chip variation. The mean signals for each allele (or probe at non-SNP positions) are then transformed into a copy number intensity and a genotype intensity that are indicators of total copy number and allelic ratio dosages. The model presented below incorporates intensities for SNP probes only, but is easily extendable to include also probes measuring total copy number only; we elaborate on this further in the Discussion. The cancer data is available from the Cancer Genome Project, subject to a manual transfer agreement, and our Matlab code is available on the WWW.³³

3. The Model

3.1. Basic structure

Let there be N_c probes on chromosome c, and denote these probes as probe (k, c), k = 1, 2, …, N_c. The genomic location of probe (k, c) is denoted by t_kc, measured in the unit base pairs (bp) starting from the beginning of the chromosome. We denote the two different alleles at any genomic location by A and B. We will write g = (g_A, g_B) for the allelic copy numbers, ie, g_A and g_B are the number of copies of the A and B allele respectively. For example, the genotype AAB corresponds to g = (2, 1). Obviously the genotype and the allelic copy numbers are in a one-to-one correspondence to each other, and at times we will make no real distinction between the two. The allelic intensities are modeled using an HMM for which each state i corresponds to one genotype set G_i as specified in Table 1. The Markov chain can be extended to include more states with copy numbers above six, but the model as stated here has proved to be enough for the studied samples. To explain the genotype sets in Table 1, we note that through cancer development any region in the genome starts with one parental copy of each region and ends up with m copies of one allele and n copies of the other. If the genotype was originally AA or BB then the genotype will be (m + n)A or (m + n)B, respectively. If the SNP was heterozygous then we must end up with either mA and nB, or mB and nA. These are the genotypes indicated in Table 1. We refer to state 4, with genotype set {AA, AB, BB} as the normal state, and by an abnormal state we mean any other state.

Table 1.

Genotype sets for the different states of the Markov chain, sorted in the order given by the total copy number and copy number of the minor allele.

State i	(Total CN, minor CN)	Genotype set G_i
1	(0,0)	{ }
2	(1,0)	{A, B}
3	(2,0)	{AA, BB}
4	(2,1)	{AA, AB, BB}
5	(3,0)	{AAA, BBB}
6	(3,1)	{AAA, AAB, ABB, BBB}
7	(4,0)	{4A, 4B}
8	(4,1)	{4A, 3AB, A3B, 4B}
9	(4,2)	{4A, 2A2B, 4B}
10	(5,0)	{5A, 5B}
11	(5,1)	{5A, 4AB, A4B, 5B}
12	(5,2)	{5A, 3A2B, 2A3B, 5B}
13	(6,0)	{6A, 6B}
14	(6,1)	{6A, 5AB, A5B, 6B}
15	(6,2)	{6A, 4A2B, 2A4B, 6B}
16	(6,3)	{6A, 3A3B, 6B}


For each chromosome c the sequence of copy number states, according to Table 1, is modeled by a continuous-index Markov chain (X_c(t))_{t_1c≤t≤T_c}, where t and T_c are respectively the genomic location (in bp) within the chromosome and the length (in bp) of the chromosome. The Markov chains for different chromosomes are assumed independent. The genomic location (in bp) is, strictly speaking, a discrete variable, but since the number of bp’s within a chromosome is much larger than the number of jumps of the Markov chain, the error caused by using a continuous approximation is negligible. With a discrete-index model the Markov transition probabilities would either be very close to unity (for staying in the same state from one bp to another) or close to zero (for changing state), and dealing with such probabilities is unstable numerically. For a continuous-index model, using transition rates rather than probabilities, this problem does not exist.

With 16 different states there are 240 different types of jumps and equally many transition rates (per chromosome). It is infeasible to estimate so many rates, and to make the model more parsimonious we assume that many of them are equal. Specifically we assume, for chromosome c, a common rate λ_c for jumps from any state (normal or abnormal) to the group of abnormal states, with each such state (except the current one, if the chain resides in an abnormal state) being equally likely, and another common rate η_c for jumps to the normal state from any abnormal state. The total rate out of any abnormal state, for chromosome c, is thus λ_c + η_c. This dynamic provides Markov chains whose stationary versions are time-reversible.²⁹ Finally we let δ_ic = P(X_c(t_1c) = i) denote the initial probability for Markov state i in chromosome c.
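The rate structure just described can be written down as a 16 × 16 rate matrix. The following is a minimal sketch, not the authors' Matlab implementation: the function names `generator`, `matmul` and `transition_matrix` are hypothetical, states are indexed from zero (so the normal state 4 has index 3), and the matrix exponential is computed with a plain truncated Taylor series with scaling and squaring, which is an assumption made to keep the example free of external libraries.

```python
import math

N_STATES = 16
NORMAL = 3  # zero-based index of state 4 (the normal state) in Table 1


def generator(lam, eta, n_states=N_STATES, normal=NORMAL):
    """Rate matrix Q: off-diagonal entries are jump rates, each row sums to zero."""
    n_abn = n_states - 1  # number of abnormal states
    Q = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(n_states):
            if i == j:
                continue
            if i == normal:
                Q[i][j] = lam / n_abn        # normal -> each abnormal state
            elif j == normal:
                Q[i][j] = eta                # abnormal -> normal
            else:
                Q[i][j] = lam / (n_abn - 1)  # abnormal -> each other abnormal state
        Q[i][i] = -sum(Q[i])                 # diagonal is still 0.0 when summed
    return Q


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, distance, terms=20):
    """P(distance) = exp(distance * Q), by Taylor series with scaling and squaring."""
    n = len(Q)
    norm = max(sum(abs(x) for x in row) for row in Q) * distance
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = distance / (2 ** squarings)      # scaled matrix has norm at most 1/2
    M = [[q * scale for q in row] for row in Q]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, M)]
        P = [[p + t for p, t in zip(p_row, t_row)] for p_row, t_row in zip(P, term)]
    for _ in range(squarings):
        P = matmul(P, P)
    return P
```

With λ_c = 10⁻⁷ and η_c = 10⁻⁹ per bp, for instance, `transition_matrix(generator(1e-7, 1e-9), d)` gives the transition probabilities between two probes d bp apart, and products of such matrices over adjacent gaps agree with the matrix for the combined gap up to rounding.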

Write y_kc = (y_Akc, y_Bkc) for the measured allelic intensities at probe (k, c). Greenman et al⁸ studied the correlation between the allele A and B intensities, for each probe, using 460 wild-type samples. For probe (k, c), plotting the two allele intensities for all wild-type samples against each other reveals three clusters (see Figure 1 of Greenman et al⁸ for an example). These clusters correspond to the genotypes AA, AB and BB, with the coordinates of the cluster centers written as (A_0kc + 2A_1kc, B_0kc), (A_0kc + A_1kc, B_0kc + B_1kc) and (A_0kc, B_0kc + 2B_1kc) respectively for suitable parameters A_0kc, B_0kc, A_1kc and B_1kc. These parameters were all estimated by Greenman et al⁸ using the wild-type samples. Their interpretation is that A_0kc is the background intensity of the A allele (at diploid probes BB), and A_1kc is the increase in A allele intensity from BB to AB and from AB to AA; B_0kc and B_1kc have analogous interpretations.

Figure 1. Proportions of probes at which the Markov state was incorrectly reconstructed by the Viterbi algorithm with MAP parameter estimates computed by the EM algorithm. Markov transition rates were λ_c = η_c = 10⁻⁷ (top left), λ_c = 10⁻⁷, η_c = 10⁻⁹ (top right), λ_c = 10⁻⁹, η_c = 10⁻⁷ (bottom left), λ_c = η_c = 10⁻⁹ (bottom right) (unit: bp⁻¹). Confidence intervals were obtained by exponentiating two-sided 95% Student-t confidence limits based on the log-proportions for 10 genome replicates.

Further denote by (μ_Akcg, μ_Bkcg) the mean allele A and B intensities at probe (k, c) for allelic copy numbers g = (g_A, g_B). The cluster centers above then write

\begin{aligned} μ_{kcg} &= (μ_{Akcg}, μ_{Bkcg}) \\ &= (A_{0kc} + g_{A} A_{1kc},\; B_{0kc} + g_{B} B_{1kc}), \end{aligned}

(1)

and this model applies for the normal Markov state i = 4, ie, for allelic copy numbers such that g_A + g_B = 2. Moreover, the clusters in Greenman et al⁸ (Fig. 1) are tilted ovals, indicating that the intensities for alleles A and B are correlated and have unequal variances. Greenman et al⁸ found that a suitable model for the covariance matrix is

Σ_{kcg} = v_{kc} \begin{pmatrix} μ_{Akcg}^{2} & ρ_{kc} μ_{Akcg} μ_{Bkcg} \\ ρ_{kc} μ_{Akcg} μ_{Bkcg} & μ_{Bkcg}^{2} \end{pmatrix};

(2)

note that the variances are taken proportional to the squared means. The probe-specific variance factors v_kc and correlations ρ_kc, as well as the mean parameters A_0kc, B_0kc, A_1kc and B_1kc described above, were all estimated by Greenman et al⁸ using the wild-type samples and assuming a bivariate Normal distribution for each cluster.
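To make Eqs. (1)–(2) concrete, the following minimal sketch computes the mean vector and covariance matrix for one probe and evaluates the corresponding bivariate Normal log-density. The function names and the numerical parameter values used in the usage note are hypothetical and chosen only for illustration; in the paper the probe parameters are estimated from the wild-type samples.

```python
import math


def mean_and_cov(g, A0, A1, B0, B1, v, rho):
    """Mean (Eq. 1) and covariance matrix (Eq. 2) for allelic copy numbers g = (gA, gB)."""
    gA, gB = g
    mu_a = A0 + gA * A1
    mu_b = B0 + gB * B1
    cov = [[v * mu_a ** 2, v * rho * mu_a * mu_b],
           [v * rho * mu_a * mu_b, v * mu_b ** 2]]
    return (mu_a, mu_b), cov


def bivariate_normal_logpdf(y, mean, cov):
    """Log-density of a bivariate Normal, using the closed-form 2 x 2 inverse."""
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    quad = (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
```

For example, `mean_and_cov((1, 1), A0=0.2, A1=0.5, B0=0.1, B1=0.4, v=0.01, rho=0.3)` gives the center and spread of the AB cluster for a probe with these illustrative parameters.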

We now carry this model further by assuming that for each probe, the allele intensities follow the mean-variance model given by Eqs. (1)–(2) also for genotypes (g_A, g_B) for which g_A and g_B do not sum to two, ie, for all pairs (g_A, g_B) corresponding to genotypes listed in Table 1. That is, we assume that the response from the amount of each allele on the microarray to measured intensity is linear, with the standard deviation also increasing linearly (by Eq. (2), the variances are proportional to the squared means). In reality the allelic intensities have a linear response for lower copy numbers, while at higher copy numbers the intensities start to saturate and our method is approximate. This could be adjusted for by a non-linear transformation, cf. Section 5, but we have not attempted such an adjustment in the analyses presented in this paper.

The above model specifies the conditional density of Y_kc given a particular genotype. To specify the conditional density of Y_kc given a Markov state, we recall that each Markov state has a genotype set comprising between one and four different genotypes. Thus the conditional density of Y_kc, given the state, is a mixture of bivariate Normal distributions for which each mixture component represents a different genotype. The mixture weights were taken as the Hardy-Weinberg weights; for the copy number-aberrated genotypes, Hardy-Weinberg was used to compute the germline genotype proportions. Thus, letting p_kc be the allele frequency for an A allele at probe (k, c), the probabilities for the different genotypes, denoted by w_kcig, are the binomial probabilities p_kc and 1 – p_kc for states with two genotypes; p_kc², 2p_kc(1 – p_kc) and (1 – p_kc)² for states with three genotypes; and p_kc², p_kc(1 – p_kc), p_kc(1 – p_kc) and (1 – p_kc)² respectively for states with four different genotypes. The frequencies p_kc were also estimated by Greenman et al,⁸ using the wild-type samples. The conditional density for a measurement Y_kc given the Markov state, often referred to as the emission density of the HMM, thus writes

f_{Y_{k c} | X_{c} (t_{k c})} (y | i) = \sum_{g \in G_{i}} w_{k c i g} f_{Y_{k c} | G_{k c}} (y | g),

(3)

where G_kc denotes the allelic copy numbers for probe (k, c) and f_{Y_kc|G_kc}(·|g) is the bivariate Normal density with mean and covariance matrix as in Eqs. (1)–(2).
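The genotype sets of Table 1 and the weights in Eq. (3) can be generated from the germline genotype, as the following sketch illustrates. It assumes that a state is identified by its major and minor copy numbers (m, n), that a heterozygous germline SNP is equally likely to gain copies on either allele, and, for the empty genotype set of state 1, that the single component (0, 0) has weight one; the function names are hypothetical, and the component density is passed in as a callable so that the sketch does not commit to a particular implementation of Eqs. (1)–(2).

```python
def genotype_weights(m, n, p):
    """Genotypes (gA, gB) and Hardy-Weinberg weights for a state with m major
    and n minor copies (m >= n), given A-allele frequency p."""
    total = m + n
    if total == 0:
        return {(0, 0): 1.0}          # assumption: one component for state 1
    weights = {}

    def add(genotype, weight):
        # Genotypes that coincide (eg, when n == 0 or m == n) pool their weights.
        weights[genotype] = weights.get(genotype, 0.0) + weight

    add((total, 0), p * p)            # germline AA
    add((m, n), p * (1.0 - p))        # germline AB, A is the major allele
    add((n, m), p * (1.0 - p))        # germline AB, B is the major allele
    add((0, total), (1.0 - p) ** 2)   # germline BB
    return weights


def emission_density(y, m, n, p, component_density):
    """Mixture density of Eq. (3); component_density(y, g) is the density of
    the measurement y given allelic copy numbers g."""
    return sum(w * component_density(y, g)
               for g, w in genotype_weights(m, n, p).items())
```

Pooling coinciding genotypes reproduces the weights listed above: a state such as (2, 0) ends up with two genotypes weighted p and 1 – p, the normal state (1, 1) with three genotypes, and a state such as (2, 1) with four.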

As pointed out in the introduction we include the ploidy K, ie, average copy number over the entire genome, in the model to make it suitable for cancer data. The ploidy is defined genome-wide and not per chromosome, as the probe intensities are normalized per genome. The HMM described above models the normalized intensities, and its parameters were estimated for wild-type samples (ie, diploid samples; K = 2). For a sample with K > 2 the normalized intensities will thus be smaller by a factor 2/K (on average), so that the model for the normalized intensities becomes

Y_{k c} | G_{k c} = g \sim N (\frac{2}{K} μ_{k c g}, \frac{4}{K^{2}} Σ_{k c g}) .

(4)

This completes the specification of the basic model. As described above, the parameters A_0kc, A_1kc, B_0kc, B_1kc, v_kc, ρ_kc and p_kc were all estimated from the wild-type samples, and were thus considered as fixed when the model was applied to cancer cell data. The intensities λ_c and η_c, the initial probabilities δ_c and the ploidy K were on the other hand estimated from the actual cancer data.

3.2. Normal tissue contamination

As mentioned above it is often difficult to dissect cancer cells without including any surrounding normal tissue, ie, diploid tissue. Such contamination implies that the measured allelic intensities correspond to a mixture of cancer and normal cells. We denote the fraction of normal tissue in the sample by γ, and consequently the fraction of tumor tissue is 1 – γ. Then for a given probe with, as above, copy numbers g_A and g_B of alleles A and B in the tumor but also copy numbers $g_{A}^{N}$ and $g_{B}^{N}$ for the two alleles in the normal tissue, we assumed the same mean-covariance model as in Eqs. (1)–(2), but with (g_A, g_B) replaced by

(g_{A}^{γ}, g_{B}^{γ}) = ((1 - γ) g_{A} + γ g_{A}^{N}, (1 - γ) g_{B} + γ g_{B}^{N}) .

(5)

Similarly, the conditional distribution of Y_kc given Markov state i is a mixture of bivariate Normals, but now each four-tuple (g_A, g_B, $g_{A}^{N}$ , $g_{B}^{N}$ ) contributes to a component of that mixture. Thus, the number of mixture components will for some Markov states be larger than without normal tissue contamination (see Table 2).

Table 2.

Combined genotype sets for the different states of the Markov chain, in a model with normal contamination γ. The weights for the respective combined genotypes are the Hardy-Weinberg weights as in the model without normal tissue contamination, and the total and minor copy numbers for the aberrated components are as in Table 1.

State i	Combined genotype set G_i
1	{2γA, γAγB, 2γB}
2	{(1 + γ)A, AγB, γAB, (1 + γ )B}
3	{2A, (2 − γ)AγB, γA(2 − γ)B, 2B}
4	{AA, AB, BB}
5	{(3 − γ)A, (3 − 2γ)AγB, γA(3 − 2γ)B, (3 − γ)B}
6	{(3 − γ)A, (2 − γ)AB, A(2 − γ)B, (3 − γ)B}
7	{(4 − 2γ)A, (4 − 3γ)AγB, γA(4 − 3γ)B, (4 − 2γ)B}
8	{(4 − 2γ)A, (3 − 2γ)AB, A(3 − 2γ)B, (4 − 2γ)B}
9	{(4 − 2γ)A, (2 − γ)A(2 − γ)B, (4 − 2γ)B}
10	{(5 − 3γ)A, (5 − 4γ)AγB, γA(5 − 4γ)B, (5 − 3γ)B}
11	{(5 − 3γ)A, (4 − 3γ)AB, A(4 − 3γ)B, (5 − 3γ)B}
12	{(5 − 3γ)A, (3 − 2γ)A(2 − γ)B, (2 − γ)A(3 − 2γ)B, (5 − 3γ)B}
13	{(6 − 4γ)A, (6 − 5γ)AγB, γA(6 − 5γ)B, (6 − 4γ)B}
14	{(6 − 4γ)A, (5 − 4γ)AB, A(5 − 4γ)B, (6 − 4γ)B}
15	{(6 − 4γ)A, (4 − 3γ)A(2 − γ)B, (2 − γ)A(4 − 3γ)B, (6 − 4γ)B}
16	{(6 − 4γ)A, (3 − 2γ)A(3 − 2γ)B, (6 − 4γ)B}


The weights for the combined genotypes are Hardy-Weinberg weights as in the model without normal contamination. For example, for a state in Table 2 with three combined genotypes, the weights are p_kc², 2p_kc(1 – p_kc) and (1 – p_kc)² respectively.
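The entries of Table 2 follow from applying Eq. (5) to each compatible pair of tumor and germline genotypes. The sketch below illustrates this under the assumption that a germline AA (BB) SNP carries all tumor copies on the A (B) allele and that a germline AB SNP has either allele as the major one with equal probability; the function name `combined_genotypes` is hypothetical.

```python
def combined_genotypes(m, n, gamma, p):
    """Effective copy numbers (Eq. 5) and Hardy-Weinberg weights for a state
    with m major and n minor tumor copies, normal fraction gamma and
    A-allele frequency p."""
    def mix(g_tumor, g_normal):
        return tuple((1.0 - gamma) * t + gamma * s
                     for t, s in zip(g_tumor, g_normal))

    total = m + n
    pairs = [
        ((total, 0), (2, 0), p * p),           # germline AA
        ((m, n), (1, 1), p * (1.0 - p)),       # germline AB, A major in tumor
        ((n, m), (1, 1), p * (1.0 - p)),       # germline AB, B major in tumor
        ((0, total), (0, 2), (1.0 - p) ** 2),  # germline BB
    ]
    result = {}
    for g_tumor, g_normal, weight in pairs:
        key = mix(g_tumor, g_normal)
        result[key] = result.get(key, 0.0) + weight  # pool coinciding genotypes
    return result
```

For state 2, (m, n) = (1, 0), with γ = 0.5 this yields the four combined genotypes (1.5, 0), (1, 0.5), (0.5, 1) and (0, 1.5), in agreement with the second row of Table 2, while for γ = 0 the sets collapse to those of Table 1.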

3.3. Estimation of parameters and the Markov path

The parameters estimated from a tumor sample are the transition rates λ_c and η_c, the initial probabilities δ_c, the ploidy K and also the fraction γ of normal tissue contamination.

For a model like the present one, the maximum-likelihood estimator (MLE) typically overestimates the transition rates λ_c and η_c ²⁵ (Section 4.3), thereby letting an a posteriori reconstruction of the Markov chain trajectory capture also very short transients of the observed data. When using the EM algorithm to compute the MLE, this becomes visible as an overestimated number of jumps of the Markov chain. In order to control the jumps and make their number biologically plausible, we take a Bayesian approach and penalize overly large transition rates by placing Gamma distribution priors on each λ_c and η_c. Other parameters are assigned uniform (flat) priors. All parameters are a priori independent. We then compute the maximum a posteriori (MAP) parameter estimate using the EM algorithm, by incorporating the priors into the M-step⁵ (p. 6). Otherwise this algorithm is a variant of the EM algorithm described by Roberts and Ephraim,²³ designed to estimate parameters of a continuous-index HMM observed at discrete positions. The method is detailed in Appendix 2.1.
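The effect of a Gamma prior on a transition rate can be illustrated with the textbook conjugate update for a single jump rate; this is a sketch under stated assumptions and not the M-step of Appendix 2.1. It assumes that the E-step supplies the expected number of jumps of the relevant type and the expected genomic length (in bp) over which such a jump could occur, so that the complete-data likelihood is proportional to rate^jumps · exp(−rate · exposure). The function name and argument names are hypothetical.

```python
def map_rate(expected_jumps, expected_exposure, prior_shape=2.0, prior_mean=1e-7):
    """Posterior mode of a jump rate under a Gamma(shape, rate) prior.

    The prior is parameterized by its shape and mean, so that its rate
    parameter is shape / mean.  Without the prior, the estimate would be
    expected_jumps / expected_exposure.
    """
    prior_rate = prior_shape / prior_mean
    return (expected_jumps + prior_shape - 1.0) / (expected_exposure + prior_rate)
```

With shape 2, the prior acts like one additional jump observed over an additional 2/mean bp, which pulls large rate estimates downwards when the data contain many short apparent segments.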

Finally, to construct an estimate of the trajectory of the hidden Markov chain we use a Viterbi algorithm adapted to continuous-index Markov chains (see Appendix 2.2).
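As a rough illustration of path reconstruction with unequally spaced probes, the sketch below runs a standard log-space Viterbi recursion in which the transition matrix between consecutive probes depends on their distance. It is a simplification and not the algorithm of Appendix 2.2: it reconstructs the state at probe positions only, and for brevity it uses a hypothetical two-state chain whose transition probabilities have a closed form. All function names are assumptions made for this example.

```python
import math


def two_state_log_trans(rate01, rate10, distance):
    """Log transition matrix over `distance` bp for a two-state continuous-index chain."""
    total = rate01 + rate10
    decay = math.exp(-total * distance)
    p01 = rate01 / total * (1.0 - decay)
    p10 = rate10 / total * (1.0 - decay)
    return [[math.log(1.0 - p01), math.log(p01)],
            [math.log(p10), math.log(1.0 - p10)]]


def viterbi(log_init, log_trans, log_emis):
    """Most probable state sequence at the probes.

    log_trans[k][i][j]: log-probability of state j at probe k + 1 given state i at probe k.
    log_emis[k][i]:     log emission density of the measurement at probe k in state i.
    """
    n_states = len(log_init)
    score = [log_init[i] + log_emis[0][i] for i in range(n_states)]
    back = []
    for k in range(1, len(log_emis)):
        step = log_trans[k - 1]
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + step[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + step[best_i][j] + log_emis[k][j])
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because the transition probabilities grow with the distance between probes, a change of state is cheaper across a long gap than between two neighboring probes, so an isolated deviating measurement among closely spaced probes tends not to trigger a jump.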

4. Results

4.1. Application to simulated data

To evaluate our method’s ability to make correct reconstructions for different amounts of normal contamination, we simulated data from the assumed model, computed MAP parameter estimates using the EM algorithm, reconstructed the hidden Markov chain using the Viterbi algorithm, and finally computed the proportion of probes at which the Markov state was correctly reconstructed. For each simulated dataset we first simulated the Markov chain and the genotypes for each probe position, then computed μ_kcg and Σ_kcg using Eqs. (1)–(2), Eq. (5) and the fixed A₀, A₁, B₀, B₁, ρ and v (estimated from the wild-type samples), and finally simulated data from the bivariate Normal distributions of Eq. (4) with K = 2. Note that the actual value of K is irrelevant for these simulations, since the model given by Eqs. (1)–(2) describes the data after normalization.

The simulations were carried out for 30%, 50% and 70% normal contamination, and transition rates λ_c = η_c = 10⁻⁷, λ_c = 10⁻⁷ and η_c = 10⁻⁹, λ_c = 10⁻⁹ and η_c = 10⁻⁷, and λ_c = η_c = 10⁻⁹ (in units of bp⁻¹) respectively. For each combination of contamination and rates, 10 replicates were simulated. For the Gamma priors of λ_c and η_c we chose shape parameter 2 and means equal to the true transition rates. These choices yield priors that are not overly informative, but which are concentrated enough on small values to prevent the Markov chain from jumping too frequently in our samples.
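The first simulation step, drawing a path of the continuous-index Markov chain along a chromosome, can be sketched as follows. This is an illustration under assumptions rather than the authors' simulation code: the chain is started in the normal state instead of being drawn from initial probabilities δ_c, the rate λ_c is assumed positive, and the function name `simulate_chain` is hypothetical. States are numbered 1 to 16 as in Table 1.

```python
import random


def simulate_chain(length_bp, lam, eta, n_states=16, normal=4, start=4, rng=None):
    """Simulate one path; returns a list of (segment start in bp, state)."""
    rng = rng or random.Random()
    abnormal = [s for s in range(1, n_states + 1) if s != normal]
    position, state = 0.0, start
    segments = [(position, state)]
    while True:
        # Total rate of leaving the current state (requires lam > 0).
        rate_out = lam if state == normal else lam + eta
        position += rng.expovariate(rate_out)   # exponential holding length
        if position >= length_bp:
            return segments
        if state != normal and rng.random() < eta / (lam + eta):
            state = normal                       # return to the normal state
        else:
            # Jump to an abnormal state other than the current one, uniformly.
            state = rng.choice([s for s in abnormal if s != state])
        segments.append((position, state))
```

For rates of 10⁻⁷ per bp a chromosome of 2 × 10⁸ bp typically contains a few tens of segments, while for rates of 10⁻⁹ per bp most simulated chromosomes contain no jump at all.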

To verify the convergence of the EM algorithm we present the EM iterations for three different simulated replicates in Figure 2. The proportions of incorrectly reconstructed probes are plotted in Figure 1.

Figure 2. Estimates of normal contamination γ for iterations 1–10 of the EM algorithm and three simulated replicates with different values of γ: γ = 0.3 (top), γ = 0.5 (middle), and γ = 0.7 (bottom). The initial value for γ was 0.5 in all simulations.

These results can be compared to those from the simulation study by Lamy et al.¹⁷ For a normal contamination of 30% the results are similar, but for 45%, which is the largest fraction studied by Lamy et al, their method provides 8%–18% incorrectly estimated probes while at 50% contamination our model provides an error rate below 1%. In addition, the present model performs well even at such a high amount of normal contamination as 70%, when the Markov state is correctly reconstructed at more than 97% of the probes. Obviously the differences between our results and those of Lamy et al depend not only on the different estimation algorithms but also on differences between the number and location of the probes, and on the model for the observed allele intensities and its parameters. However, given the magnitude of the performance improvement, a significant part of it must be attributed to the estimation algorithm as such.

4.2. Application to clinical data

We applied our method to a number of samples from the data described in Section 2. An example is displayed in Figure 3, which shows the Viterbi reconstruction of the Markov chain as well as the corresponding copy numbers compared to the data, for chromosome 3 in primary sample PD1753. For the Gamma priors for λ_c and η_c we chose shape parameters 2 and means 10⁻¹⁵.

Figure 3. Top: Viterbi reconstruction of the Markov path for chromosome 3 in PD1753. Bottom: sum of (standardized) allele intensities for probes within the same chromosome (grey dots), and the copy number of the corresponding state (black solid line).

The reconstruction divides the chromosome into two regions, reconstructed to state 2 ({A, B}) and state 4 ({AA, AB, BB}) respectively. As a simple check of this reconstruction we plotted the standardized allele intensities against each other for all probes in the respective region (Figs. 4–5). Figure 5, corresponding to the normal state, shows three clusters representing the three genotypes AA, AB and BB, while Figure 4 shows four clusters. In Table 1 state 2 is associated with two genotypes, A and B, but with normal contamination this state comprises four combined genotypes (1 + γ)A, AγB, γAB and (1 + γ)B (Table 2). Here γ is estimated at 0.53.

Figure 4. Scatter plot of standardized measured allele intensities in the segment reconstructed to Markov state 2 in Figure 3. The fraction of normal contamination was estimated at 0.53.

Figure 5. Scatter plot of standardized measured allele intensities in the segment reconstructed to Markov state 4 in Figure 3.

For some of the genomes the values of A_0kc, A_1kc, B_0kc and B_1kc needed small adjustments before applying our model; without them, the model did not produce a reasonable fit. A possible explanation for why this adjustment was required is a drift in the measured intensities between the time when the data from the wild-type samples, used to estimate most model parameters, was collected and the time when the tumor samples were analyzed. A suitable construction of the adjustment was as a common, ie, genome-wide, multiplier c₀ for all A_0kc and B_0kc, and another common multiplier c₁ for all A_1kc and B_1kc. The multipliers c₀ and c₁ were estimated using data from a chromosome segment known to belong to the normal state. The data within this segment was clustered into three parts using the k-means algorithm, and then c₀ and c₁ were estimated by a least squares fit.

5. Discussion

We have presented a method to estimate the number of copies of each of the two alleles in SNP data, taking three features common in cancer data into account: unequally spaced probes, aneuploidy, and normal contamination. Unequally spaced probes are modeled using a continuous-index Markov chain instead of a discrete-index one, which is the usual choice in the literature. The ploidy and fraction of normal contamination are both included as parameters in the model, which allows us to estimate them along with other variables and using all the data, rather than estimating them separately in a pre-processing step. This set-up also allows us to retain the integer structure of the allele copy numbers. The model’s ability to estimate the fraction of normal contamination has been demonstrated in a simulation study, with the results being far better than for previous methods and excellent even with as much as 70% normal contamination.

Above we denoted Markov state 4, ie, the state with genotypes {AA, AB, BB}, the normal state, irrespective of the ploidy of the chromosome. The reason for singling out this particular state is that it is often particularly interesting whether the Markov chain is in this state or not, at any given probe. One could argue that if the ploidy differs from two this is not ‘normal’, but it is straightforward to select a different state as ‘normal’ and then modify the transition rate structure and estimation algorithm accordingly.

The emission model, ie, Eqs. (1)–(4), assumes that the means and standard deviations of the measured intensities are both linear in the amount of each allele. In practice this assumption may fail, eg, because for large copy numbers the response is nonlinear. One could then include such a non-linearity in the model, and model the mean intensities as μ_kcg = h_kc(g;θ_kc) where h is some function and θ_kc are parameters of this function. Ideally the functional form h as well as all its probe-specific parameters θ_kc should be well estimated beforehand, so that they are essentially known when evaluating an unknown sample. Similar comments apply to the variance of the measured intensities.

In this paper we have only considered probes that provide allele-specific intensity measurements, but, as mentioned in Section 2, microarrays often also contain probes that measure the total copy number only, ie, the sum of the number of alleles. Such probes can easily be included in our model by specifying a corresponding suitable emission density, ie, a density corresponding to Eq. (3). For instance, this could be a univariate Normal density with mean μ_kcg = C_0kc + C_1kc(g_A + g_B) and variance $σ_{k c g}^{2} = ν_{k c} μ_{k c g}^{2}$ for parameters C_0kc, C_1kc and ν_kc that again need to be estimated prior to analyzing an unknown sample. Should the response function from total copy number to intensity not be linear for large copy numbers, this could be handled similarly to what can be done for SNP probes; cf. the previous paragraph.
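A minimal sketch of the univariate emission density suggested above is given below. The function name and the parameter values in the usage note are hypothetical, and the sketch omits the ploidy scaling of Eq. (4) and the contamination adjustment of Eq. (5), which would have to be added in the same way as for SNP probes.

```python
import math


def total_cn_logpdf(y, g, C0, C1, nu):
    """Log-density of a total copy number probe for allelic copy numbers g = (gA, gB).

    The mean is C0 + C1 * (gA + gB) and the variance is nu times the squared mean.
    """
    mu = C0 + C1 * (g[0] + g[1])
    var = nu * mu * mu
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mu) ** 2 / var
```

Since the density depends on g only through g_A + g_B, all genotypes within one genotype set of Table 1 give the same value, for example `total_cn_logpdf(0.9, (2, 0), 0.2, 0.4, 0.01)` and `total_cn_logpdf(0.9, (1, 1), 0.2, 0.4, 0.01)`, so for such probes the mixture over genotypes reduces to a single Normal component per state.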

Finally we mention some possible limitations of our method. Firstly, the accuracy of the method is likely to be reduced in regions of very high copy number where signal saturation occurs, such as in amplicons, and bespoke nonlinear adjustments may be required (as discussed above). Secondly, we have ignored copy number polymorphisms. These will produce non-integer copy numbers in the cancer sample due to the skewed ratio between the cancer and the contaminating normal. If copy number data is available for the normal, it may be possible to generalise these methods to make such an adjustment; however, such regions are generally much smaller in scale than the somatic copy number changes seen in cancer and were not considered further. Lastly, we have assumed that the sample in question is derived from a homogeneous collection of cells. However, cell-to-cell variation may well produce many different clones with differing copy numbers, and more general methods will be required to deal with such complexities.

To sum up this paper, copy number variations in cancer are common and their accurate determination is important for determining homozygous deletion, amplifications and breakpoints, all of which can be functionally implicated in cancer. This problem is compounded by normal contamination, making the accurate estimation of integer copy numbers in cancer samples with normal contamination difficult. Here we have introduced a method that addresses this problem.

Acknowledgments

CDG was supported by the Wellcome Trust at the Sanger Centre. The authors would like to thank the anonymous reviewers for their constructive comments and suggestions that improved the presentation of this paper.

Appendix

1. A Primer on Markov Chains and Hidden Markov Models

The purpose of this section is to provide a brief and rather elementary introduction to Markov chains with discrete and continuous index, and to hidden Markov models. A monograph entirely devoted to bioinformatics applications of hidden Markov models is the text by Koski.¹⁴

Consider a sequence t₁, t₂, …, t_N of locations (in our case these will be probe locations), and a set {1, 2, …, r} of states (which will in our case be as in Tables 1 or 2). At any location t_k there is an actual state x(t_k) (ie, a true copy number state), which we think of as the realization of a random variable X(t_k). These random variables are dependent, since copy number states at nearby probes are correlated. To model this dependence, we use Markov chains.

A discrete-index Markov chain (we use the term index rather than the more common ‘time’, since bp location is not a temporal variable) is specified by transition probabilities p_ij(t_k–₁, t_k), giving the (conditional) probability that if the chain happens to be in state i at location t_k–₁, it will move to state j at location t_k. For j = i, the probability concerns the event that the chain will stay in the same state, ie, not move at all. Implicit in this characterization is also the fact that if the states x(t₁), x(t₂), …, x(t_k–₁) at all foregoing locations t₁, t₂, …, t_k–₁ are known, this does not affect the conditional probability, which only depends on the state x(t_k–₁) at the closest location t_k–₁; this is the Markov property. To complete the specification of the Markov chain, we must also provide the initial probabilities, ie, the probabilities that at the first location t₁, the chain starts in state i for each respective state.

In our model, the probe locations t_k are separated by different distances t_k – t_k–₁, ie, these distances are not equal. We wish to incorporate this feature into the Markov model, so that the transition probabilities p_ij(t_k–₁, t_k) depend not only on the states i and j that the chain moves from and to respectively, but also on the distance h_k = t_k – t_k–₁ between the probes. One way to accomplish this is to think of the base pair location along a chromosome, which we denote by t, as a continuous variable rather than as a discrete one, and to model the state changes of the Markov chain using this continuous variable, or index. In contrast to a discrete-index Markov chain, a continuous-index Markov chain is specified in terms of transition rates. For any state i and any other state j, ie, different from i, there is a transition rate q_ij from state i to state j. For any state i we also define the total rate out of i, q_i, as the sum of all transition rates out of this state, ie, q_i = Σ_j≠i q_ij. One way to interpret these rates is in terms of sojourn lengths and jump probabilities. Given that the chain has entered state i, it will stay there for a sojourn whose length is random and follows an exponential distribution with rate q_i (mean 1/q_i); the probability that this sojourn exceeds length s is thus exp(–q_i s). When the chain eventually leaves state i, the probability that it jumps to state j is given by q_ij/q_i.
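The sojourn-and-jump interpretation can be illustrated by simulating a trajectory directly from a rate matrix. The sketch below is only an illustration: the 3-state rate matrix is made up, states are numbered from 0, and the function name is hypothetical.

```python
import numpy as np

def simulate_chain(Q, start, length, rng):
    """Simulate a continuous-index Markov chain on [0, length).

    Q is a rate matrix (off-diagonal rates q_ij, diagonal -q_i).
    Returns a list of (entry_location, state) pairs.
    """
    Q = np.asarray(Q, dtype=float)
    t, state = 0.0, start
    path = [(t, state)]
    while True:
        rate_out = -Q[state, state]          # total rate q_i out of the state
        if rate_out <= 0.0:                  # absorbing state: never leaves
            break
        t += rng.exponential(1.0 / rate_out)  # sojourn length, mean 1/q_i
        if t >= length:
            break
        jump_p = Q[state].copy()
        jump_p[state] = 0.0
        jump_p /= rate_out                   # jump probabilities q_ij / q_i
        state = int(rng.choice(len(jump_p), p=jump_p))
        path.append((t, state))
    return path

# Made-up rate matrix; each row sums to zero.
Q = np.array([[-0.3, 0.2, 0.1],
              [0.5, -0.5, 0.0],
              [0.4, 0.6, -1.0]])
path = simulate_chain(Q, start=0, length=50.0, rng=np.random.default_rng(1))
```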

For a continuous-index Markov chain it is also possible to compute the probability that for two locations t_k–₁ and t_k separated by distance h_k, if the chain is in state i at location t_k–₁, it will be in state j at location t_k. Denoting these probabilities by p_ij(h_k) and collecting them into an r × r matrix P(h_k) (thus p_ij(h_k) is the row i column j element of this matrix), it holds that P(h_k) = exp(Qh_k), where Q is the r × r rate matrix (or intensity matrix, or generator) with off-diagonal elements given by the transition rates q_ij and diagonal elements q_ii = –q_i, ie, the negative of the total rates out of the respective states. Moreover, exp(Qh_k) is the matrix-exponential function, defined by the power series exp(A) = I + A + A²/2! + A³/3! + … for any square matrix A, where I is the identity matrix, ie, a matrix of the same size as A and with diagonal elements equal to one and all off-diagonal elements being zero, and k! is the factorial 1 × 2 × … × k. This definition is a direct generalization of the power series for the ordinary (real-valued) exponential function.
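A short numerical sketch of the relation P(h) = exp(Qh), using SciPy's `expm` and, for comparison, a truncated version of the power series. The rate matrix is made up and the helper names are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# Made-up 3-state rate matrix; rows sum to zero.
Q = np.array([[-0.3, 0.2, 0.1],
              [0.5, -0.5, 0.0],
              [0.4, 0.6, -1.0]])

def transition_matrix(Q, h):
    """Transition probability matrix P(h) = exp(Q h) over a distance h."""
    return expm(Q * h)

def expm_series(A, terms=30):
    """Truncated power series I + A + A^2/2! + ... (for illustration only)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n      # term now holds A^n / n!
        result = result + term
    return result

P = transition_matrix(Q, 0.7)
```

Each row of `P` sums to one, and matrices for adjacent intervals multiply, ie, P(h₁ + h₂) = P(h₁)P(h₂), which is why unequal probe spacings are handled naturally.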

In a hidden Markov model (HMM), the Markov chain is not directly observable, but only as disturbed by noise. In the present setting the copy number state cannot be observed with certainty, but for any probe the intensity measurements, for each allele, provide partial information about the copy number state. In an HMM, the link between the state X(t_k) at some location t_k and the corresponding measurement Y_k (here, intensities) is specified through an emission density f_{Y_k|X(t_k)}(y|i), which is the conditional density of Y_k given that X(t_k) = i. In the present context the emission density is thus the density of the measured intensities given a certain copy number state. Since there are two intensities available, one for each allele, the density is a bivariate one. Furthermore, since each copy number state (Markov state) contains several genotypes, the emission density for a copy number state is a mixture (weighted average) of densities corresponding to each of these genotypes; this is Eq. (3).

Specifying the HMM thus amounts to specifying the structure and parameters of the Markov chain, and those of the emission densities. When this has been done, typical tasks are to i) estimate parameters from data, and ii) find the most likely realization of the Markov chain, given data. The first task, parameter estimation, is commonly carried out using the so-called EM (expectation maximization) algorithm, which is an iterative procedure that in each iteration increases the likelihood of the model parameters. The purpose is thus to iterate until convergence, and then to report the resulting parameters as the MLE (maximum likelihood estimate); convergence to the MLE is not guaranteed, however. For our HMM, the algorithm is outlined in Appendix 2.1. The second problem above can be viewed as that of reconstructing the Markov trajectory, given data (and model parameters, usually estimated ones). This problem is solved using the so-called Viterbi algorithm, which is a dynamic programming algorithm that recursively finds the most likely path. This algorithm, for our HMM, is described in Appendix 2.2.

2. Methods

2.1. The EM algorithm

The parameters to estimate are the ploidy K, the fraction γ of normal tissue, and, for each chromosome c, the two transition rates λ_c and η_c and the initial distribution δ_c. Our starting point is the EM algorithm for continuous-index hidden Markov chains by Roberts and Ephraim.²³ As latent (unobserved) data we take the whole Markov trajectory (X_c(t))_{t_1c≤t≤T_c} for each chromosome c, but the complete likelihood involves only the sufficient statistics consisting of the initial state X_c(t_1c), the total lengths T_{nc} and T_{ac} of sojourns in the normal state and in abnormal states respectively, and the numbers m_{·ac} and m_{anc} of jumps to abnormal states, and from abnormal states to the normal state respectively, for chromosome c.

With these sufficient statistics, and recalling that each λ_c has a Gamma prior with shape and intensity parameters say $α_{c}^{λ}$ and $β_{c}^{λ}$ , and analogously for each η_c, the complete log-posterior, ie, the sum of the complete log-likelihood and the log-prior, is, up to a constant not depending on the parameters,

\begin{array}{rcl} L^{c} (θ; X, y) & = & \sum_{c} \Big\{ \log δ_{X_{c} (t_{1 c}), c} + m_{\cdot a c} \log λ_{c} + m_{a n c} \log η_{c} \\ & & - λ_{c} T_{n c} - (λ_{c} + η_{c}) T_{a c} \\ & & + (α_{c}^{λ} - 1) \log λ_{c} - β_{c}^{λ} λ_{c} \\ & & + (α_{c}^{η} - 1) \log η_{c} - β_{c}^{η} η_{c} \\ & & + \sum_{k = 1}^{N_{c}} \log f_{Y_{k c} | X_{k c}} (y_{k c} | X_{k c}; γ, K) \Big\}, \end{array}

where θ = (δ_c, λ_c, η_c, K, γ). Moreover, y = {y_kc} is the collection of all data and X = {(X_c(t))} is the collection of all (unobserved) Markov chain trajectories. The quantity to maximize in one iteration of the EM algorithm is

Q (θ; θ') = E_{θ} [L^{c} (θ'; X, y) | y],

where maximization is with respect to θ′ and the notation E_θ indicates that the expectation is computed under the current parameter (estimate) θ. Note that L^c(θ′;X,y) and hence also Q(θ;θ′) split into two distinct parts, one of which depends on (δ_c, λ_c, η_c) only and one of which depends on K and γ only. Maximization with respect to (δ_c, λ_c, η_c) and with respect to (K, γ) can thus be carried out separately.

Also note that K and γ are common across the genome, and therefore estimated using the data for all chromosomes. For each iteration of the EM algorithm we compute the forward and backward variables for each chromosome, store them, and then re-estimate K and γ using the information from all chromosomes.

The M-steps for the transition rates read

{\hat{λ}}_{c} = \frac{α_{c}^{λ} - 1 + {\hat{m}}_{\cdot a c}}{β_{c}^{λ} + {\hat{T}}_{a c} + {\hat{T}}_{n c}}, {\hat{η}}_{c} = \frac{α_{c}^{η} - 1 + {\hat{m}}_{a n c}}{β_{c}^{η} + {\hat{T}}_{a c}},

where m̂_{·ac} = E_θ[m_{·ac} | y_{1c}, ⋯, y_{c,N_c}] etc. Note that T_{ac} + T_{nc} equals the length of the Markov chain trajectory for chromosome c, ie, T_c – t_{1c}, so that also T̂_{ac} + T̂_{nc} = T_c – t_{1c}. Moreover, the M-step for the initial distributions is

{\hat{δ}}_{i c} = P_{θ} (X (t_{1}) = i | y_{1 c}, \dots, y_{c, N_{c}}) .
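The M-step for the transition rates is a pair of closed-form ratios, which the following Python sketch transcribes. The function and argument names are hypothetical, and the numbers in the usage line are made up; the expected counts and sojourn lengths would come from the E-step.

```python
def m_step_rates(m_to_abnormal, m_abnormal_to_normal, T_abnormal, T_normal,
                 alpha_lam, beta_lam, alpha_eta, beta_eta):
    """Re-estimate (lambda_c, eta_c) for one chromosome.

    m_to_abnormal        : expected number of jumps to abnormal states
    m_abnormal_to_normal : expected number of jumps from abnormal to normal
    T_abnormal, T_normal : expected total sojourn lengths
    alpha_*, beta_*      : shape and intensity parameters of the Gamma priors
    """
    lam_hat = (alpha_lam - 1.0 + m_to_abnormal) / (beta_lam + T_abnormal + T_normal)
    eta_hat = (alpha_eta - 1.0 + m_abnormal_to_normal) / (beta_eta + T_abnormal)
    return lam_hat, eta_hat

# Made-up expected statistics for a chromosome of length 1e7 bp, flat priors.
lam_hat, eta_hat = m_step_rates(3.0, 2.0, 4.0e6, 6.0e6, 1.0, 0.0, 1.0, 0.0)
```

With shape 1 and intensity 0 the prior terms vanish and each update reduces to an expected jump count divided by the expected length over which such a jump could occur.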

The M-step for the ploidy is

\hat{K} = - \frac{V}{4 U} + \sqrt{\frac{V^{2}}{16 U^{2}} - \frac{Σ_{c} N_{c}}{U}},

where

\begin{array}{rcl} U & = & - \sum_{c, k, i, g \in G_{i}} \frac{1}{8 ν_{k c} (1 - ρ_{k c}^{2})} \left( \frac{y_{A k c}^{2}}{μ_{A k c g}^{2}} - \frac{2 ρ_{k c} y_{A k c} y_{B k c}}{μ_{A k c g} μ_{B k c g}} + \frac{y_{B k c}^{2}}{μ_{B k c g}^{2}} \right) \\ & & \times P_{θ} (X_{c} (t_{k c}) = i, G_{k c} = g | y_{1 c}, \dots, y_{c, N_{c}}) \end{array}

and

\begin{array}{rcl} V & = & - \sum_{c, k, i, g \in G_{i}} \frac{1}{2 ν_{k c} (1 + ρ_{k c})} \left( \frac{y_{A k c}}{μ_{A k c g}} + \frac{y_{B k c}}{μ_{B k c g}} \right) \\ & & \times P_{θ} (X_{c} (t_{k c}) = i, G_{k c} = g | y_{1 c}, \dots, y_{c, N_{c}}) \end{array}
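As a small numerical check of the closed-form ploidy update, the sketch below evaluates K̂ from given values of U, V and Σ_c N_c. Algebraically, the expression for K̂ is the positive root of the quadratic 2UK² + VK + 2Σ_c N_c = 0 whenever U < 0, which the definition of U implies when |ρ_kc| < 1. The function name and the numerical values are made up for illustration; computing U and V themselves requires the emission model of Eqs. (1)–(4) and is not shown.

```python
import math

def ploidy_update(U, V, n_probes):
    """Closed-form M-step for the ploidy K, given U, V and the total probe count."""
    if U >= 0:
        raise ValueError("U is expected to be negative")
    centre = -V / (4.0 * U)
    disc = V * V / (16.0 * U * U) - n_probes / U   # positive because U < 0
    return centre + math.sqrt(disc)

# Made-up values, only to exercise the formula.
U, V, n_probes = -2.0, 1.0, 10
K_hat = ploidy_update(U, V, n_probes)
```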

For the fraction γ of normal contamination there is no closed form expression for the M-step, and to re-estimate γ we maximize Q(θ;·), as a function of γ, numerically. Note however that K̂ above depends on γ, which appears implicitly in the means μ_{Akcg} and μ_{Bkcg} used to compute U and V. Therefore, by maximizing w.r.t. K′ (using the current γ) and then w.r.t. γ′ (using the re-estimated K̂), as we do, and not w.r.t. K′ and γ′ jointly, we in fact obtain a generalized EM algorithm rather than an EM algorithm, in the terminology of Dempster et al⁵ (Eq. (3.5)).

The conditional expectations m̂_{·ac} etc. are computed in the E-step, which follows that of Roberts and Ephraim²³ with minor changes. Now write y_{k:l,c} for {y_{kc}, y_{k+1,c}, …, y_{lc}}, and let m_{ijc} be the number of jumps by the Markov chain from state i to state j, in chromosome c. Then

\begin{array}{rcl} {\hat{m}}_{i j c} & = & E_{θ} [m_{i j c} | y_{1 : N_{c}, c}] \\ & = & \int_{0}^{T_{c}} P_{θ} (X_{c} (t -) = i, X_{c} (t) = j | y_{1 : N_{c}, c}) d t \\ & = & \int_{0}^{T_{c}} \frac{P_{θ} (X_{c} (t -) = i, X_{c} (t) = j)}{p_{θ} (y_{1 : N_{c}, c})} \\ & & \times p_{θ} (y_{1 : N_{c}, c} | X_{c} (t -) = i, X_{c} (t) = j) d t \\ & = & \sum_{k = 2}^{N_{c}} \int_{t_{k - 1, c}}^{t_{k c}} \frac{P_{θ} (X_{c} (t) = j | X_{c} (t -) = i) P_{θ} (X_{c} (t -) = i)}{p_{θ} (y_{1 : N_{c}, c})} \\ & & \times p_{θ} (y_{1 : k - 1, c}, y_{k : N_{c}, c} | X_{c} (t -) = i, X_{c} (t) = j) d t; \end{array}

here the symbol P denotes probabilities as well as densities; note that P_θ(X_c(t) = j|X_c(t−) = i) = q_ijc, where q_ijc is the transition rate from state i to state j in chromosome c. Thus, with r_abnormal being the number of abnormal states, q_ijc is equal to λ_c/r_abnormal if i is the normal state and j is any abnormal state, equal to λ_c/(r_abnormal − 1) if i and j are both abnormal states (because the chain cannot jump from a state to itself), and equal to η_c if i is any abnormal state and j is the normal state. Given the HMM structure it follows that y_{1:k–1,c} and y_{k:N_c,c} are conditionally independent given X_c(t−) = i and X_c(t) = j, whence

\begin{array}{rcl} {\hat{m}}_{i j c} & = & \frac{q_{i j c}}{p_{θ} (y_{1 : N_{c}, c})} \sum_{k = 2}^{N_{c}} \int_{t_{k - 1, c}}^{t_{k c}} P_{θ} (X_{c} (t -) = i) p_{θ} (y_{1 : k - 1, c} | X_{c} (t -) = i) \\ & & \times p_{θ} (y_{k : N_{c}, c} | X_{c} (t) = j) d t \\ & = & \frac{q_{i j c}}{p_{θ} (y_{1 : N_{c}, c})} \sum_{k = 2}^{N_{c}} \int_{0}^{h_{k c}} P_{θ} (y_{1 : k - 1, c}, X_{c} (t_{k - 1, c} + t -) = i) \\ & & \times p_{θ} (y_{k : N_{c}, c} | X_{c} (t_{k - 1, c} + t) = j) d t \end{array}

with h_{kc} = t_{kc} – t_{k–1,c}. Here, the two factors in the integrand on the right-hand side are the forward and backward densities respectively.

To compute these factors, and similar ones, we use a forward-backward type algorithm. Let r be the number of Markov states, and let B_kc be the r × r diagonal matrix whose (i, i)-th element is the probability density function of y_kc given Markov state i at position t_kc, ie, f_{Y_kc|X_c(t_kc)=i}(y_kc) in Eq. (3). Further let F_kc be the r × r matrix whose (i, j)-th element is

\begin{array}{rcl} {[F_{k c}]}_{i j} & = & P_{θ} (y_{k c}, X_{c} (t_{k c}) = j | X_{c} (t_{k - 1, c}) = i) \\ & = & {[\exp (Q_{c} h_{k c})]}_{i j} {[B_{k c}]}_{j j}, \end{array}

where Q_c is the matrix with elements q_ijc, i, j = 1, 2, …, r for i ≠ j, and diagonal elements q_iic being the negative of the total rate out of state i for chromosome c (the row sums of Q_c then become zero). We note that the discrete-index process (X_c(t_kc))_{1≤k≤N_c}, ie, the continuous-index process (X_c(·)) sampled at the locations of the probes, is a non-homogeneous discrete-index Markov chain with transition probability matrices, from t_{k–1,c} to t_kc, given by exp(Q_ch_kc). With this matrix notation we have F_kc = exp(Q_ch_kc)B_kc, and the likelihood for chromosome c can be written

p_{θ} (y_{1 : N_{c}, c}) = δ_{c} B_{1 c} (\prod_{k = 2}^{N_{c}} F_{k c}) 1

where 1 is the r × 1 vector of all ones. The forward densities are

\begin{array}{l} P_{θ} (y_{1 : k - 1, c}, X_{c} (t_{k - 1, c} + t -) = i) \\ = \sum_{s = 1}^{r} P_{θ} (y_{1 : k - 1, c}, X_{c} (t_{k - 1, c}) = s) \\ \times P_{θ} (X_{c} (t_{k - 1, c} + t -) = i | X_{c} (t_{k - 1, c}) = s) \\ = \sum_{s = 1}^{r} \left( δ_{c} B_{1 c} \prod_{κ = 2}^{k - 1} F_{κ c} \right) 1_{s} {[\exp (Q_{c} t)]}_{si}, \end{array}

where 1_j is the r × 1 vector whose elements are zero except for element j which is one.

The backward densities are

\begin{array}{l} p_{θ} (y_{k : N_{c}, c} | X_{c} (t_{k - 1, c} + t) = j) \\ = \sum_{s = 1}^{r} p_{θ} (y_{k + 1 : N_{c}, c} | X_{c} (t_{kc}) = s) p_{θ} (y_{kc} | X_{c} (t_{kc}) = s) \\ \times P_{θ} (X_{c} (t_{kc}) = s | X_{c} (t_{k - 1, c} + t) = j) \\ = \sum_{s = 1}^{r} {[B_{kc}]}_{ss} {[exp (Q_{c} (h_{kc} - t))]}_{js} {1^{'}}_{s} (\prod_{κ = k + 1}^{N_{c}} F_{κ c}) 1 . \end{array}

The above matrix multiplications are numerically unstable, as the products will either tend to zero or infinity exponentially fast as the number of factors increases. Therefore scaled versions of these recursions are introduced. The scaled forward densities with normalizing constants d_kc at probe (k, c) are

L_{c} (k) = \frac{δ_{c} B_{1 c}}{d_{1 c}} \prod_{κ = 2}^{k} \frac{F_{κ c}}{d_{κ c}},

which we compute recursively as

L_{c} (k) = \frac{L_{c} (k - 1) F_{kc}}{d_{kc}}

with d_{1c} = δ_c B_{1c} 1, L_c(1) = δ_c B_{1c}/d_{1c}, and d_{kc} = L_c(k – 1) F_{kc} 1.

The scaled backward densities are

R_{c} (k) = \prod_{κ = k + 1}^{N_{c}} \frac{F_{κ c}}{d_{κ c}} 1,

which we compute as

R_{c} (k) = \frac{F_{k + 1, c} R_{c} (k + 1)}{d_{k + 1, c}}

with R_c(N_c) = 1.
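The scaled forward and backward recursions can be sketched in Python for a single chromosome as follows. This is an illustration under stated assumptions, not the authors' implementation: probes are indexed from 0, `h[k]` is the distance between probes k−1 and k (so `h[0]` is unused), `b[k, i]` holds the emission density of probe k under state i (the diagonal of B_kc), and the rate matrix, distances and densities are made up.

```python
import numpy as np
from scipy.linalg import expm

def forward_backward(delta, Q, h, b):
    """Scaled forward rows L[k], backward columns R[k] and constants d[k]."""
    N, r = b.shape
    L = np.zeros((N, r))
    R = np.ones((N, r))            # R at the last probe is the all-ones vector
    d = np.zeros(N)
    F = [None] * N
    v = delta * b[0]               # delta B_1
    d[0] = v.sum()
    L[0] = v / d[0]
    for k in range(1, N):
        # F_k = exp(Q h_k) B_k: broadcasting scales column j by b[k, j]
        F[k] = expm(Q * h[k]) * b[k]
        v = L[k - 1] @ F[k]
        d[k] = v.sum()
        L[k] = v / d[k]
    for k in range(N - 2, -1, -1):
        R[k] = F[k + 1] @ R[k + 1] / d[k + 1]
    return L, R, d

# Made-up toy example with three states and five probes.
Q = np.array([[-0.2, 0.1, 0.1],
              [0.3, -0.4, 0.1],
              [0.3, 0.1, -0.4]])
delta = np.array([0.8, 0.1, 0.1])
h = np.array([0.0, 1.0, 0.5, 2.0, 1.5])
b = np.array([[0.9, 0.2, 0.1],
              [0.8, 0.3, 0.1],
              [0.1, 0.7, 0.2],
              [0.2, 0.6, 0.3],
              [0.7, 0.2, 0.2]])
L, R, d = forward_backward(delta, Q, h, b)
log_likelihood = np.log(d).sum()   # the product of the d_k is the likelihood
```

Because the product of the normalizing constants equals the likelihood, the log-likelihood is obtained as a sum of logarithms and never underflows.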

Using these scaled quantities, the matrix m̂_c with entries m̂_ijc can be expressed as

{\hat{m}}_{c} = Q_{c} ⊙ {I^{'}}_{k, c},

where ⊙ denotes element-wise multiplication and

I_{kc} = \int_{0}^{h_{kc}} exp (Q_{c} (h_{kc} - t)) V_{c} exp (Q_{c} t) dt

with

V_{c} = \sum_{k = 2}^{N_{c}} \frac{B_{kc} R_{c} (k) L_{c} (k - 1)}{d_{k, c}} .

The integrals I_kc are evaluated using the matrix

D_{c} = \left( \begin{array}{cc} Q_{c} & V_{c} \\ 0 & Q_{c} \end{array} \right);

I_kc is then the upper right r × r block of exp(D_ch_kc).
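The block-matrix device for evaluating such integrals can be checked numerically. The sketch below computes the upper right block of exp(D h) and compares it with a plain trapezoidal approximation of ∫₀ʰ exp(Q(h − t)) V exp(Qt) dt. The matrices Q and V are made up, and the function names are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

def integral_block_expm(Q, V, h):
    """Upper right r x r block of exp(D h), with D = [[Q, V], [0, Q]]."""
    r = Q.shape[0]
    D = np.block([[Q, V], [np.zeros((r, r)), Q]])
    return expm(D * h)[:r, r:]

def integral_trapezoid(Q, V, h, n=400):
    """Trapezoidal approximation of the same integral (for comparison only)."""
    ts = np.linspace(0.0, h, n + 1)
    vals = np.array([expm(Q * (h - t)) @ V @ expm(Q * t) for t in ts])
    step = h / n
    return step * (0.5 * vals[0] + vals[1:-1].sum(axis=0) + 0.5 * vals[-1])

# Made-up rate matrix and weight matrix.
Q = np.array([[-0.2, 0.1, 0.1],
              [0.3, -0.4, 0.1],
              [0.3, 0.1, -0.4]])
V = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.3, 0.1],
              [0.0, 0.4, 0.6]])
I_block = integral_block_expm(Q, V, 1.5)
I_quad = integral_trapezoid(Q, V, 1.5)
```

A single matrix exponential of size 2r × 2r thus replaces a numerical quadrature for every probe interval.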

Finally, recalling that the normal state is state 4,

\begin{array}{l} {\hat{m}}_{\cdot a c} = \sum_{1 \leq i \leq r} \; \sum_{1 \leq j \leq r, \, j \neq i, \, j \neq 4} {\hat{m}}_{ijc}, \\ {\hat{m}}_{an c} = \sum_{1 \leq i \leq r, \, i \neq 4} {\hat{m}}_{i 4 c} . \end{array}

Using similar types of computations it follows that

{\hat{T}}_{ic} = E [T_{ic} | y_{1 : N_{c}, c}] = {\hat{m}}_{iic} / q_{iic},

where T_ic is the total length of all sojourns of the Markov chain in state i within chromosome c. Moreover,

P_{θ} (X (t_{1 c}) = i | y_{1 : N_{c}, c}) \propto L_{c} {(1)}_{i} R_{c} {(1)}_{i},

and the conditional probabilities in the expressions for U and V are computed using

\begin{array}{l} P (X_{c} (t_{kc}) = i, G_{kc} = g | y_{1 : N_{c}, c}) \\ = P (G_{kc} = g | X_{kc} = i, y_{1 : N_{c}, c}) P (X_{kc} = i | y_{1 : N_{c}, c}) \\ \propto w_{kcig} f (y_{kc} | G_{kc} = g) L_{c} {(k)}_{i} R_{c} {(k)}_{i}, \end{array}

where the weights and densities on the right-hand side are those in Eq. (3).

2.2. The Viterbi algorithm

We used a Viterbi algorithm, adapted to the continuous-index structure, to find the a posteriori most likely Markov chain trajectory. The algorithm is the usual Viterbi algorithm, but with transition probability matrices exp(Q_ch_kc) that vary with probe index (c, k). The algorithm thus finds the most likely sequence at the probe locations only. When the estimated reconstruction of each Markov state X(t_kc) is available, one may also estimate the corresponding genotype G_kc (see below).

For any chromosome c, the Viterbi algorithm is as follows. To ensure numeric stability, it operates on log-scale.

Put ξ_{1c}(i) = log(δ_{ic}[B_{1c}]_{ii}) for i = 1, …, r.
Iterate for k = 2, 3, …, N_c,

$\begin{array}{l} ξ_{kc} (j) = \max_{i} \{ ξ_{k - 1, c} (i) + \log {[\exp (Q_{c} h_{kc})]}_{ij} \\ + \log {[B_{kc}]}_{jj} \} \end{array}$

for j = 1, 2, …, r.
Put x̂_c(N_c) = arg max_i ξ_{N_c,c}(i).
Iterate for k = N_c – 1, N_c – 2, …, 1,
$\begin{array}{l} {\hat{x}}_{c} (k) = \arg \max_{i} \{ ξ_{kc} (i) \\ + \log {[\exp (Q_{c} h_{k + 1, c})]}_{i, {\hat{x}}_{c} (k + 1)} \} . \end{array}$
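The steps above can be sketched in Python for one chromosome as follows. The sketch assumes 0-based probe and state indices, `h[k]` as the distance between probes k−1 and k (`h[0]` unused), and `b[k, i]` as the emission density of probe k under state i; the function name and the toy inputs are made up, and very small probabilities are clipped before taking logarithms to avoid log(0).

```python
import numpy as np
from scipy.linalg import expm

def viterbi(delta, Q, h, b):
    """Most likely state sequence at the probe locations, on log scale."""
    N, r = b.shape
    tiny = 1e-300                                   # guards against log(0)
    log_b = np.log(np.clip(b, tiny, None))
    log_P = [None] + [np.log(np.clip(expm(Q * h[k]), tiny, None))
                      for k in range(1, N)]
    xi = np.zeros((N, r))
    xi[0] = np.log(np.clip(delta, tiny, None)) + log_b[0]
    for k in range(1, N):
        # scores[i, j] = xi[k-1, i] + log P_ij(h_k)
        scores = xi[k - 1][:, None] + log_P[k]
        xi[k] = scores.max(axis=0) + log_b[k]
    path = np.zeros(N, dtype=int)
    path[N - 1] = int(np.argmax(xi[N - 1]))
    for k in range(N - 2, -1, -1):                  # backtracking
        path[k] = int(np.argmax(xi[k] + log_P[k + 1][:, path[k + 1]]))
    return path

# Made-up toy example: symmetric rates and strongly informative emissions.
Q = np.array([[-0.2, 0.1, 0.1],
              [0.1, -0.2, 0.1],
              [0.1, 0.1, -0.2]])
delta = np.array([1.0, 1.0, 1.0]) / 3.0
h = np.array([0.0, 1.0, 1.0, 1.0, 1.0])
b = np.full((5, 3), 0.005)
for k, s in enumerate([0, 0, 2, 2, 1]):
    b[k, s] = 0.99
path = viterbi(delta, Q, h, b)
```

Recomputing exp(Q_c h_kc) for every probe is the only change relative to a Viterbi algorithm for a homogeneous discrete-index chain.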

Having reconstructed the states x_c(t_kc), it holds that the corresponding genotypes, given the Markov chain and intensity data, are conditionally independent with

\begin{array}{l} P (G_{kc} = g | X_{c} (t_{kc}) = i, Y_{kc} = y) \\ = \frac{w_{kcig} f_{Y_{kc} | G_{kc}} (y | g)}{Σ_{g^{'} \in G_{i}} w_{kci g^{'}} f_{Y_{kc} | G_{kc}} (y | g^{'})} \end{array}

for all g ∈ G_i; here f_{Y_kc|G_kc} (y| g) is the bivariate Normal density as in Eq. (3). Selecting, for each probe (k, c), G_kc as the genotype g ∈ G_i maximizing this expression thus yields a maximum a posteriori (MAP) reconstruction of the genotypes at all probes.
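For a single probe, the genotype reconstruction amounts to normalizing weight-times-density products over the genotypes of the reconstructed state and taking the largest. A minimal Python sketch, with hypothetical function names and made-up weights and density values (evaluating the bivariate Normal densities of Eq. (3) is outside this sketch):

```python
import numpy as np

def genotype_posterior(weights, densities):
    """Posterior over genotypes in G_i: w_g f(y|g) / sum_g' w_g' f(y|g')."""
    joint = np.asarray(weights, dtype=float) * np.asarray(densities, dtype=float)
    return joint / joint.sum()

def map_genotype(genotypes, weights, densities):
    """Genotype with the largest posterior probability, and the full posterior."""
    post = genotype_posterior(weights, densities)
    return genotypes[int(np.argmax(post))], post

# Made-up mixture weights and density values for the state {AA, AB, BB}.
best, post = map_genotype(["AA", "AB", "BB"], [0.25, 0.5, 0.25], [0.2, 1.0, 0.05])
```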

Footnotes

Disclosure

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

References

1. Andersson R, Bruder CEG, Piotrowski A, et al. A segmental maximum a posteriori approach to genome-wide Copy Number profiling. Bioinformatics. 2008;24:751–8. doi: 10.1093/bioinformatics/btn003.
2. Attiyeh EF, Diskin SJ, Attiyeh MA, et al. Genomic copy number determination in cancer cells from single nucleotide polymorphism microarrays based on quantitative genotyping corrected for aneuploidy. Genome Res. 2009;19:276–83. doi: 10.1101/gr.075671.107.
3. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–25. doi: 10.1093/nar/gkm076.
4. Daruwala R, Rudra A, Ostrer H, Lucito R, Wigler M, Mishra B. A versatile statistical analysis algorithm to detect genome copy number variation. Proc Nat Acad Sci. 2004;101:16292–7. doi: 10.1073/pnas.0407247101.
5. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Statist Soc B. 1977;39:1–38.
6. Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics. 2005;21:1146–53. doi: 10.1093/bioinformatics/bti148.
7. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004;90:132–53.
8. Greenman CD, Bignell G, Butler A, et al. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatist. 2010;11:164–75. doi: 10.1093/biostatistics/kxp045.
9. Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Amer Statist Assoc. 2008;103:485–97. doi: 10.1198/016214507000000923.
10. Mitra R, Gupta M. A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA. Biostatist. to appear.
11. Huang J, Wei W, Chen J, et al. CARAT: A novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics. 2006;7:83. doi: 10.1186/1471-2105-7-83.
12. Hupé P, Stransky N, Thiery J, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–22. doi: 10.1093/bioinformatics/bth418.
13. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs common copy number polymorphisms and rare CNVs. Nature Genetics. 2008;40:1253–60. doi: 10.1038/ng.237.
14. Koski T. Hidden Markov Models for Bioinformatics. Dordrecht: Kluwer Academic Publishers; 2001.
15. Laframboise T, Harrington D, Weir BA. PLASQ: a generalized linear model-based procedure to determine allelic dosage in cancer cells from SNP array data. Biostatist. 2007;8:323–36. doi: 10.1093/biostatistics/kxl012.
16. Lai TL, Xing H, Zhang N. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatist. 2008;9:290–307. doi: 10.1093/biostatistics/kxm031.
17. Lamy P, Andersen CL, Dyrskjot L, Torring N, Wiuf C. A hidden Markov model to estimate population mixture and allelic copy-numbers in cancers using Affymetrix SNP arrays. BMC Bioinformatics. 2007;8:434. doi: 10.1186/1471-2105-8-434.
18. Li C, Beroukhim R, Weir BA, Winckler W, Garraway LA, Sellers WT, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics. 2008;9:204. doi: 10.1186/1471-2105-9-204.
19. Marioni JC, Thorne NP, Tavaré S. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006;22:1144–6. doi: 10.1093/bioinformatics/btl089.
20. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatist. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.
21. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. doi: 10.1186/1471-2105-6-27.
22. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biology. 2009;10:R128. doi: 10.1186/gb-2009-10-11-r128.
23. Roberts WJJ, Ephraim Y. An EM Algorithm for ion-channel current estimation. IEEE Trans Signal Proc. 2008;56:26–33.
24. Rueda OM, Días R. Flexible and accurate detection of genomic copy-number changes from aCGH. PLoS Comput Biol. 2007;3:1115–22. doi: 10.1371/journal.pcbi.0030122.
25. Rydén T. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective (with discussion). Bayesian Anal. 2008;3:659–88.
26. Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I. Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. Ann Appl Statist. 2008;2:687–713. doi: 10.1214/07-AOAS155.
27. Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22:e431–9. doi: 10.1093/bioinformatics/btl238.
28. Stjernqvist S, Rydén T, Sköld M, Staaf J. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics. 2007;23:1006–14. doi: 10.1093/bioinformatics/btm059.
29. Stjernqvist S, Rydén T. A continuous-index hidden Markov jump process for modelling DNA copy number data. Biostatist. 2009;10:773–8. doi: 10.1093/biostatistics/kxp030.
30. Sun W, Wright FA, Tang Z, et al. Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res. 2009;37:5365–77. doi: 10.1093/nar/gkp493.
31. Wang K, Li M, Hadley D. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–74. doi: 10.1101/gr.6861907.
32. www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=sample&id=919182
33. www.maths.lth.se/matstat/staff/susann/

[b1-cin-1-2011-159] 1.Andersson R, Bruder CEG, Piotrowski A, et al. A segmental maximum a posteriori approach to genome-wide Copy Number profiling. Bioinformatics. 2008;24:751–8. doi: 10.1093/bioinformatics/btn003. [DOI] [PubMed] [Google Scholar]

[b2-cin-1-2011-159] 2.Attiyeh EF, Diskin SJ, Attiyeh MA, et al. Genomic copy number determination in cancer cells from single nucleotide polymorphism microarrays based on quantitative genotyping corrected for aneuploidy. Genome Res. 2009;19:276–83. doi: 10.1101/gr.075671.107. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b3-cin-1-2011-159] 3.Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–25. doi: 10.1093/nar/gkm076. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b4-cin-1-2011-159] 4.Daruwala R, Rudra A, Ostrer H, Lucito R, Wigler M, Mishra B. A versatile statistical analysis algorithm to detect genome copy number variation. Proc Nat Acad Sci. 2004;101:16292–7. doi: 10.1073/pnas.0407247101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b5-cin-1-2011-159] 5.Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion) J Roy Statist Soc B. 1977;39:1–38. [Google Scholar]

[b6-cin-1-2011-159] 6.Eilers PHC, de Menezes RX. Quantile smoothing of array CGH data. Bioinformatics. 2005;21:1146–53. doi: 10.1093/bioinformatics/bti148. [DOI] [PubMed] [Google Scholar]

[b7-cin-1-2011-159] 7.Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004;90:132–53. [Google Scholar]

[b8-cin-1-2011-159] 8.Greenman CD, Bignell G, Butler A, et al. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatist. 2010;11:164–75. doi: 10.1093/biostatistics/kxp045. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b9-cin-1-2011-159] 9.Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Amer Statist Assoc. 2008;103:485–97. doi: 10.1198/016214507000000923. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b10-cin-1-2011-159] 10.Mitra R, Gupta M. A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA. Biostatist. to appear. [DOI] [PMC free article] [PubMed]

[b11-cin-1-2011-159] 11.Huang J, Wei W, Chen J, et al. CARAT: A novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics. 2006;7:83. doi: 10.1186/1471-2105-7-83. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b12-cin-1-2011-159] 12.Hupé P, Stransky N, Thiery J, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20:3413–22. doi: 10.1093/bioinformatics/bth418. [DOI] [PubMed] [Google Scholar]

[b13-cin-1-2011-159] 13.Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs common copy number polymorphisms and rare CNVs. Nature Genetics. 2008;40:1253–60. doi: 10.1038/ng.237. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b14-cin-1-2011-159] 14.Koski T. Hidden Markov Models for Bioinformatics. Dordrecht: Kluwer Academic Publishers; 2001. [Google Scholar]

[b15-cin-1-2011-159] 15.Laframboise T, Harrington D, Weir BA. PLASQ: a generalized linear model-based procedure to determine allelic dosage in cancer cells from SNP array data. Biostatist. 2007;8:323–36. doi: 10.1093/biostatistics/kxl012. [DOI] [PubMed] [Google Scholar]

[b16-cin-1-2011-159] 16.Lai TL, Xing H, Zhang N. Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatist. 2008;9:290–307. doi: 10.1093/biostatistics/kxm031. [DOI] [PubMed] [Google Scholar]

[b17-cin-1-2011-159] 17. Lamy P, Andersen CL, Dyrskjot L, Torring N, Wiuf C. A hidden Markov model to estimate population mixture and allelic copy-numbers in cancers using Affymetrix SNP arrays. BMC Bioinformatics. 2007;8:434. doi: 10.1186/1471-2105-8-434.

[b18-cin-1-2011-159] 18. Li C, Beroukhim R, Weir BA, Winckler W, Garraway LA, Sellers WT, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics. 2008;9:204. doi: 10.1186/1471-2105-9-204.

[b19-cin-1-2011-159] 19. Marioni JC, Thorne NP, Tavaré S. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006;22:1144–6. doi: 10.1093/bioinformatics/btl089.

[b20-cin-1-2011-159] 20. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatist. 2004;5:557–72. doi: 10.1093/biostatistics/kxh008.

[b21-cin-1-2011-159] 21. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. doi: 10.1186/1471-2105-6-27.

[b22-cin-1-2011-159] 22. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biology. 2009;10:R128. doi: 10.1186/gb-2009-10-11-r128.

[b23-cin-1-2011-159] 23. Roberts WJJ, Ephraim Y. An EM algorithm for ion-channel current estimation. IEEE Trans Signal Proc. 2008;56:26–33.

[b24-cin-1-2011-159] 24. Rueda OM, Díaz-Uriarte R. Flexible and accurate detection of genomic copy-number changes from aCGH. PLoS Comput Biol. 2007;3:1115–22. doi: 10.1371/journal.pcbi.0030122.

[b25-cin-1-2011-159] 25. Rydén T. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective (with discussion). Bayesian Anal. 2008;3:659–88.

[b26-cin-1-2011-159] 26. Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I. Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. Ann Appl Statist. 2008;2:687–713. doi: 10.1214/07-AOAS155.

[b27-cin-1-2011-159] 27. Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22:e431–9. doi: 10.1093/bioinformatics/btl238.

[b28-cin-1-2011-159] 28. Stjernqvist S, Rydén T, Sköld M, Staaf J. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics. 2007;23:1006–14. doi: 10.1093/bioinformatics/btm059.

[b29-cin-1-2011-159] 29. Stjernqvist S, Rydén T. A continuous-index hidden Markov jump process for modelling DNA copy number data. Biostatist. 2009;10:773–8. doi: 10.1093/biostatistics/kxp030.

[b30-cin-1-2011-159] 30. Sun W, Wright FA, Tang Z, et al. Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res. 2009;37:5365–77. doi: 10.1093/nar/gkp493.

[b31-cin-1-2011-159] 31. Wang K, Li M, Hadley D. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17:1665–74. doi: 10.1101/gr.6861907.

[b32-cin-1-2011-159] 32. www.sanger.ac.uk/perl/genetics/CGP/cosmic?action=sample&id=919182

[b33-cin-1-2011-159] 33. www.maths.lth.se/matstat/staff/susann/
