Author manuscript; available in PMC: 2009 Jul 15.
Published in final edited form as: Ann Appl Stat. 2008 Jun 1;2(2):687–713. doi: 10.1214/07-AOAS155

Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays

Robert B Scharpf 1, Giovanni Parmigiani 1,2, Jonathan Pevsner 3, Ingo Ruczinski 1,*
PMCID: PMC2710854  NIHMSID: NIHMS114455  PMID: 19609370

Abstract

Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g. aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and therefore of particular interest. These include changes in copy number (deletions and duplications) and genotype (e.g. the occurrence of regions of homozygosity). Hidden Markov models (HMM) are particularly useful for detecting such alterations, modeling the spatial dependence between neighboring SNPs. Here, we improve previous approaches that utilize HMM frameworks for inference in high throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and experimental data, we in particular demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package vanillaICE.

1 Introduction

Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g. aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and translocations), and changes to small genomic regions (including single nucleotide polymorphisms). A variety of alterations that occur in chromosomal DNA, many of which can be detected using high density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as disease and therefore of particular interest (Shaw-Smith et al., 2004; Aguirre et al., 2004; Aggarwal et al., 2005; Dutt and Beroukhim, 2007; Sebat et al., 2007; Szatmari et al., 2007). These include changes in copy number (deletions and duplications) and genotype (e.g. the occurrence of regions of homozygosity).

Copy number variations can arise through somatic and germline events. While naturally occurring and often (but not always) benign, germline copy number variations are more abundant than previously thought (Freeman et al. (2006); Redon et al. (2006); Eichler et al. (2007)). On the other hand, somatic copy number changes such as gene amplifications and deletions frequently contribute to tumorigenesis (or might be the consequence of it). Regions of homozygosity (i.e. long stretches of homozygous SNPs) can also occur through somatic and germline events. A hemizygous deletion of one chromosomal allele results in only one DNA copy, and therefore SNPs in that region will appear as homozygous (given current genotyping technologies that generate only biallelic calls). The definition of loss of heterozygosity (LOH) refers to such a somatic event: for example, comparing a tumor and normal sample from the same person, any heterozygous SNPs in the normal sample appear as homozygous SNPs in the tumor sample, in any region where an allele was lost. As already noted, regions of homozygosity can also occur through germline events. While chromosomal DNA is typically inherited from both parents, under some circumstances an individual inherits two copies of a chromosome from one parent. The inheritance of both homologues of a pair of chromosomes from only one parent can be due to autozygosity (homozygosity in which alleles are identical by descent) or to uniparental disomy (UPD, Robinson (2000); Engel (2006)). Autozygosity and UPD do not involve an aneuploidy (change in chromosomal copy number), and the region of homozygosity may extend over an entire chromosome or segmentally across a subregion of a chromosome. The condition is termed uniparental isodisomy (iUPD) if the two copies inherited from one parent are identical, and results in stretches of homozygous SNPs. 
(If the two inherited copies are different homologues, the result is uniparental heterodisomy (hUPD), which does not produce stretches of homozygous SNPs.) In some cases UPD is thought to be benign, but it can also be associated with disease (Prader-Willi syndrome, Angelman syndrome, Beckwith-Wiedemann syndrome, see for example Altug-Teber et al. (2005)). UPD can disrupt genomic imprinting, such that imprinted genes (expressed preferentially from the paternal or maternal alleles) fail to be expressed. UPD can also cause homozygosity for autosomal recessive traits such as cystic fibrosis (Zlotogora, 2004).

A variety of technologies have been applied for the assessment of chromosomal abnormalities, including conventional karyotyping (e.g. Giemsa staining of metaphase chromosomes) and fluorescence in situ hybridization (FISH). While the former only allows for the genome-wide detection of major chromosomal amplifications and deletions, the latter allows for the verification of suspected microdeletions as well as translocations and some duplications. Array comparative genomic hybridization (aCGH) permits a genome-wide measurement of copy number variation using bacterial artificial chromosome (BAC) clones deposited on a microarray. This is a high throughput technique, but the resolution is limited to tens or hundreds of thousands of base pairs and no genotype data are obtained.

SNP microarray technology permits the genome-wide search for chromosomal abnormalities, providing genotype and copy number estimates for hundreds of thousands of SNPs in genomic DNA isolated from a biological sample. Statistical tools for the analysis of such SNP chip data are typically employed to assess where the chromosomal changes have occurred, and whether or not these changes are associated with disease. Regions of interest are typically aneuploidies, i.e. regions where copy number changes (deletions and amplifications) have occurred, or regions with unusually long stretches of homozygous genotypes (either naturally occurring, for example through evolutionary pressure on a DNA segment, or through loss of heterozygosity, LOH).

For the analysis of SNP chip data in general, three different tiers of estimation problems arise. 1) By SNP: how can we use the low-level data (such as the fluorescence measurements in Affymetrix SNP chips) to optimally estimate the genotype and DNA copy number for each SNP in the array? 2) By sample: how can we borrow strength between neighbouring SNPs, and infer regions of LOH and copy number changes in the genome of the subject studied? 3) Between samples: how can we compare the genotype of many subjects, infer common regions of abnormality, and for example assess differences between affected subjects and normal controls? This manuscript revolves around methods for tier 2, the assessment of chromosomal abnormalities in one particular sample. However, information derived from tier 1, in particular uncertainty estimates of copy number and genotype estimates, can be critically important and will be incorporated in the analysis. In particular for the Affymetrix platform, originally described as a high-throughput assay for calling genotypes at thousands of SNPs (Kennedy et al., 2003), there have been several algorithms proposed for the appropriate adjustment and pre-processing of probe-level data, and the estimation of SNP-level summaries of probe-level data for genotype (DM, Di et al. (2005), RLMM, Rabbee and Speed (2006), BRLMM, Affymetrix (2006), CRLMM, Carvalho et al. (2006), SNiPer-HD, Hua et al. (2007)) and copy number (CNAG, Nannya et al. (2005), CARAT, Huang et al. (2006), PLASQ, Laframboise et al. (2006), CN-RLMM, Wang et al. (2006)). Notably, Laframboise et al. (2006) and Wang et al. (2006) provide allele-specific estimates of copy number.

We caution that, as with gene expression technologies, pre-processing of probe-level data is an important consideration. For instance, several recent papers have described fragment-length and sequence effects that may be introduced by the polymerase chain reaction (PCR) used to amplify the DNA (Nannya et al., 2005; Carvalho et al., 2006). We assume that SNP-level summaries for each interrogated SNP have been adjusted for probe-specific biases to the extent possible. Statistical models such as CRLMM that use HapMap data for training have been shown to provide better genotype calls when the centers of the bivariate scatterplots for the A and B allele intensities are less well-defined (Carvalho et al., 2006). Genotype calls for most genotyping algorithms are concordant for over 99.9% of the measured SNPs in the Affymetrix 100k and 500k chips when performance is compared on apparently normal individuals represented in the HapMap study.

Statistical methods that provide an indication of the uncertainty of the genotype call (for example based on the signal to noise ratio (SNR) and log likelihood ratio (LLR) defined by CRLMM) can be particularly useful for statistical algorithms devised to infer chromosomal abnormalities. Specifically, statistical models that borrow strength from neighboring SNPs to infer loss or retention of heterozygosity should incorporate the uncertainty of the genotype call estimate, giving less weight to genotype calls that are measured with high uncertainty and more weight to well-estimated genotypes. To our knowledge, this manuscript is the first one to address this issue. Figure 1 illustrates why the uncertainty in genotype calls can differ substantially. Similarly, probe-specific biases for copy number estimates have been described before, see for example Wang et al. (2006).

Figure 1.


HapMap genotype calls (the gold standard) for a bad SNP (left) and a good SNP (right) for 269 samples measured on Affymetrix 100k SNP chips. The HapMap consensus genotype call (taken to be the gold standard) is indicated by color: AA (medium grey), AB (white), and BB (dark grey). The separation between genotype clusters is SNP-specific. This figure motivates an approach that incorporates uncertainty estimates to control smoothing.

Before high-throughput SNP chips were widely available, array comparative genomic hybridization (aCGH) was the most commonly used method to assess DNA copy numbers and to identify regions in the genome where deletions or amplifications occurred in a particular sample. Thus, many statisticians have proposed approaches for aCGH based copy number estimation, and some of these proposed methods are also relevant for SNP chip based copy number analysis. Approaches for aCGH data include hidden Markov models (Fridlyand et al. (2004); Guha et al. (2006)), segmentation algorithms (Olshen et al. (2004), Picard et al. (2005), Venkatraman and Olshen (2007)), wavelets (Hsu et al. (2005)), smoothing (Hupe et al. (2004), Eilers and de Menezes (2005)), regression (Houseman et al. (2006), Huang et al. (2005)), clustering (Wang et al. (2005)), and resampling (Lai and Zhao (2005)). The manuscripts by Lai et al. (2005) and Willenbrock and Fridlyand (2005) contain reviews and comparisons of the performances of several of these proposed methods. In addition, many useful extensions or alternative approaches for the above listed methods are being proposed. Some recent publications have confirmed that naturally occurring DNA copy number variations are more abundant than previously thought (Freeman et al. (2006); Redon et al. (2006)), which can produce outliers in the aCGH data. Integrating these known copy number variations as permissible outliers into a hidden Markov model to assess where abnormal copy number alterations have occurred has been proposed by Shah et al. (2006).

For the statistical analysis, SNP chip data differ from array CGH data in two important ways: a) SNP chips also provide information for the genotype, i.e. give homozygous/heterozygous SNP calls, and b) they provide a much denser coverage, currently generating genotype information and copy number estimates at more than 500,000 SNP locations. The correlation structure between those estimates has to be an essential part of any statistical modeling approach. The most promising methods currently available are based on hidden Markov models. In particular, to infer LOH regions and to estimate copy number changes, the dChip software and methods are among the most widely used in the scientific literature for the analysis of SNP chip data. The dChip methods are based on separate hidden Markov models for genotype analysis (Lin et al. (2004); Beroukhim et al. (2006)) and copy number (Zhao et al. (2004)). The original dChipSNP HMM (Lin et al. (2004)) was devised to assess loss of heterozygosity regions (a region with an allelic loss, where heterozygous SNPs in a normal sample appear as homozygous SNPs in a tumor sample). This required paired tumor and normal samples from the same subject. As these are often not available, an extension of this model was proposed by Beroukhim et al. (2006) to allow for LOH assessment without paired samples (e.g. tumor only). Note that such an approach using unpaired data would also be required in settings that do not involve abnormal tissue, for example when subjects with mental retardation and apparently normal controls are investigated to assess possible differences in the karyotypes. The dChipSNP hidden Markov model for copy number assessment is somewhat similar in nature to the one used for LOH analysis, see Zhao et al. (2004).

Copy number estimates and genotype calls, however, can provide complementary information. For example, without copy number information, genotype calls alone would not allow a distinction between LOH due to deletion and LOH due to uniparental isodisomy (iUPD), which occurs when a subject inherits the same copy of a chromosome (or parts thereof) twice from one parent. While this has been recognized and concurrent analyses have been reported (see for example Zhou et al. (2004, 2005) and Ninomiya et al. (2006)), these analyses were carried out separately for genotype calls and copy number estimates, and the results visually compared. Only very recently has the need for an integrated analysis of copy number and genotype been addressed. Colella et al. (2007) propose a Bayesian hidden Markov model approach (QuantiSNP), using both genotype and copy number estimates to infer underlying states (deletions, amplifications, copy neutral regions of homozygosity, etc.) of interest. We caution though that data derived from cancer samples might create substantial problems for HMM based methods like QuantiSNP and our approach: DNA copy numbers larger than three are quite possible in such settings, and thus, the number of possible states expands dramatically. Further, non-integer copy numbers do make sense in tumors due to the mix of normal and abnormal cells in the sample (i.e. mosaicism, see Ting et al. (2006) for an example). In these settings, copy number based segmentation approaches might be more promising (Olshen et al. (2004), Picard et al. (2005), Venkatraman and Olshen (2007)), in particular as the definition of a “genotype” is unclear. In this manuscript, we propose a hidden Markov model for the integrated analysis of copy number and genotype estimates, most applicable for abnormalities as a consequence of germline events. 
We also develop the methodology to integrate genotype and copy number estimate uncertainty measures, and illustrate how integrating such confidence scores of the SNP-level summaries in the HMM can improve inference for the underlying hidden states using simulated and experimental data. These ideas are implemented in the R package vanillaICE.

2 Methods

In this section we describe three HMMs, dependent on whether genotype estimates (abbreviated GT^), copy number estimates (abbreviated CN^), or both GT^ and CN^ are available as defined by three classes of objects for SNP array data (Scharpf et al., 2007).

2.1 Genotype calls

Most algorithms that provide SNP-level summaries of genotype assume a copy number of two, and report the genotype estimates as such. We therefore assume throughout this paper that the GT^ are of the generic form AA or BB, corresponding to HOM^, and AB, corresponding to HET^. The vanilla HMM with hidden states retention (◐) and loss (○) of heterozygosity requires specification of the initial state probability distribution, the emission probabilities (denoted by β below), and the transition probabilities (denoted by τ below) between the true states. Commonly employed in the literature for the transition probability is the “instability-selection” model for LOH analysis (Newton et al., 1998; Beroukhim et al., 2006) that describes the dependencies between the underlying states of adjacent SNPs as a function of distance. For any two adjacent SNPs, θ is defined as the probability that the state of the first marker is not informative (denoted by $I^c$) for the state of the second marker. As the distance between SNPs affects this probability, it is modeled as $\theta(d) = 1 - e^{-2d}$, where d is a genetic or physical distance (for example in units of 100 Mb, see Beroukhim et al. (2006)) between adjacent SNPs. We assume that with probability $1 - \theta(d)$, SNP(i) is informative (denoted by $I$) for SNP(i+1) and that no change in state occurs between the adjacent SNPs. For example, this leads to

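$$
\begin{aligned}
\tau_{○/○}(d) &= P(○_{i+1} \mid ○_i, d) \\
&= P(○_{i+1}, I \mid ○_i, d) + P(○_{i+1}, I^c \mid ○_i, d) \\
&= P(○_{i+1} \mid I, ○_i, d) \times P(I \mid ○_i, d) + P(○_{i+1} \mid I^c, ○_i, d) \times P(I^c \mid ○_i, d) \\
&= 1 - \theta(d) + P(○) \times \theta(d)
\end{aligned}
\tag{2.1}
$$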

as the probability that the state of SNP(i+1) is ○, given that the state of SNP(i) with distance d was ○. Also,

$$
\tau_{○/◐}(d) = P(◐_{i+1} \mid ○_i, d) = 1 - P(○_{i+1} \mid ○_i, d) = \theta(d) \times P(◐). \tag{2.2}
$$

P(◐) and P(○) refer to the initial probabilities for ◐ and ○, respectively. These initial probabilities can be set as fixed constants using knowledge from previous experiments, or alternatively, learned via the EM algorithm (Dempster et al., 1977).
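The transition model in Equations 2.1 and 2.2 can be sketched in a few lines of Python. This is only an illustration, not the vanillaICE implementation: the function names and the row/column layout are assumptions, and the row for transitions out of the retention state is obtained by applying the same argument as in Equation 2.1 with the roles of the two states exchanged.

```python
import math


def theta(d):
    """Probability that SNP i is not informative for SNP i+1 at distance d."""
    return 1.0 - math.exp(-2.0 * d)


def transition_matrix(d, p_loss):
    """2x2 transition matrix at distance d (hypothetical helper).

    p_loss is the initial probability P(loss); P(retention) = 1 - p_loss.
    Rows are the state of SNP i, columns the state of SNP i+1,
    with index 0 = retention and index 1 = loss.
    """
    t = theta(d)
    p_ret = 1.0 - p_loss
    return [
        [1.0 - t + t * p_ret, t * p_loss],  # from retention (by symmetry)
        [t * p_ret, 1.0 - t + t * p_loss],  # from loss: Eq. 2.2, then Eq. 2.1
    ]
```

At distance zero the matrix is the identity, and as d grows each row approaches the initial state distribution, which is the behavior the instability-selection model describes.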

Emission probabilities for states ○ and ◐ are estimated as

$$
\beta_{○}(\widehat{GT}) \sim \mathrm{Binomial}(p = 0.99) \quad \text{and} \quad \beta_{◐}(\widehat{GT}) \sim \mathrm{Binomial}(p = 0.7), \tag{2.3}
$$

where p is the probability of a homozygous genotype call. We use the above probabilities as defaults to reflect values typically seen in experimental data. In a region of retention ◐, about 70% of SNPs on average are homozygous, while in a region of loss ○ all SNPs are homozygous, but genotyping errors do occur. Alternatively, as these probabilities are affected by the quality of the assay, they can also be learned via the EM algorithm. In practice, we find that our approaches are rather insensitive to changes in these parameters. It is certainly also possible to use SNP-specific homozygosity rates here if they are known from a reference population. Efficient computation of the probability of the observed sequence given the model is carried out using the forward algorithm as described in Rabiner (1989). The most probable state sequence given the model is calculated via the Viterbi algorithm (Viterbi, 1967; Rabiner, 1989).
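As a rough illustration of the forward algorithm for this two-state model, the sketch below computes the log probability of a sequence of genotype calls with per-step scaling to avoid underflow. The state labels, function names, and the initial probabilities passed in by the caller are assumptions made for the example; only the emission defaults (0.7 and 0.99) and the form of θ(d) come from the text.

```python
import math

STATES = ("retention", "loss")
P_HOM = {"retention": 0.7, "loss": 0.99}  # default emission parameters (Eq. 2.3)


def emission(call, state):
    """Probability of a 'HOM' or 'HET' call given the hidden state."""
    p = P_HOM[state]
    return p if call == "HOM" else 1.0 - p


def transition(prev, nxt, d, init):
    """Distance-dependent transition probability (Eqs. 2.1 and 2.2)."""
    t = 1.0 - math.exp(-2.0 * d)
    return (1.0 - t if prev == nxt else 0.0) + t * init[nxt]


def forward_loglik(calls, distances, init):
    """Log probability of the calls; distances[i] separates SNP i and SNP i+1."""
    alpha = {s: init[s] * emission(calls[0], s) for s in STATES}
    loglik = 0.0
    for i in range(1, len(calls)):
        scale = sum(alpha.values())  # rescale so the recursion stays stable
        loglik += math.log(scale)
        alpha = {
            s: emission(calls[i], s)
            * sum(alpha[r] / scale * transition(r, s, distances[i - 1], init)
                  for r in STATES)
            for s in STATES
        }
    return loglik + math.log(sum(alpha.values()))
```

A call such as `forward_loglik(["HOM", "HET", "HOM"], [0.001, 0.002], {"retention": 0.95, "loss": 0.05})` returns the log-likelihood of three calls under hypothetical initial probabilities; the same quantity is what an EM update of the parameters would monitor.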

Integrating confidence estimates (ICE)

When confidence estimates are available, the observed data at a SNP are the genotype call ($\widehat{GT}$) and the uncertainty measure $S_{\widehat{GT}}$. The joint distribution of $\widehat{GT}$ and $S_{\widehat{GT}}$ depends on the underlying state. For example, if the state for a particular SNP is ○, the emission probability is

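$$
\beta_{○}\{\widehat{GT}, S_{\widehat{GT}}\} = f\{\widehat{GT} \mid ○\} \times f\{S_{\widehat{GT}} \mid \widehat{GT}, ○\}. \tag{2.4}
$$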

Note that the first of the two terms on the right hand side of equation (2.4) is simply the emission probability when estimates of uncertainty are not available. The second term can be understood as a weight for the former term that depends on the confidence with which the call is made. The second term can be approximated using a density estimate of the $S_{\widehat{GT}}$ where the gold standard is available. For example, using CRLMM on the 269 HapMap samples, the distributions of the respective uncertainty measures for all four possible combinations of called and true genotypes measured on the Affymetrix 100k SNP chips are known. We use kernel based density estimates to obtain the distributions of the confidence scores, given the true and called genotype (separately for the Xba and Hind 50k chips):

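$$
f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HOM\}, \quad f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HET\}, \quad f\{S_{\widehat{HET}} \mid \widehat{HET}, HOM\}, \quad f\{S_{\widehat{HET}} \mid \widehat{HET}, HET\} \tag{2.5}
$$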

The first term in (2.5), for example, denotes the density of the scores when the genotype is correctly called homozygous ($\widehat{HOM}$) and the true genotype is homozygous (HOM). If the underlying state is ○, then the true genotype is always HOM and we assume that

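$$
f\{S_{\widehat{HOM}} \mid \widehat{HOM}, ○\} = f\{S_{\widehat{HOM}} \mid \widehat{HOM}, HOM\} \quad \text{and} \quad f\{S_{\widehat{HET}} \mid \widehat{HET}, ○\} = f\{S_{\widehat{HET}} \mid \widehat{HET}, HOM\}. \tag{2.6}
$$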

If the underlying state is ◐, then the true genotype can be HET or HOM. We therefore estimate the emission probabilities for state ◐ as

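$$
\begin{aligned}
\beta_{◐}\{\widehat{GT}, S_{\widehat{GT}}\} &= f\{\widehat{GT} \mid ◐\} \, f\{S_{\widehat{GT}} \mid \widehat{GT}, ◐\} \\
&= f\{\widehat{GT} \mid ◐\} \left( f\{S_{\widehat{GT}}, HOM \mid \widehat{GT}, ◐\} + f\{S_{\widehat{GT}}, HET \mid \widehat{GT}, ◐\} \right) \\
&= f\{\widehat{GT} \mid ◐\} \left( f\{S_{\widehat{GT}} \mid HOM, \widehat{GT}, ◐\} \, f\{HOM \mid \widehat{GT}, ◐\} + f\{S_{\widehat{GT}} \mid HET, \widehat{GT}, ◐\} \, f\{HET \mid \widehat{GT}, ◐\} \right) \\
&= f\{\widehat{GT} \mid ◐\} \left( f\{S_{\widehat{GT}} \mid HOM, \widehat{GT}\} \, f\{HOM \mid \widehat{GT}, ◐\} + f\{S_{\widehat{GT}} \mid HET, \widehat{GT}\} \, f\{HET \mid \widehat{GT}, ◐\} \right).
\end{aligned}
\tag{2.7}
$$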

The unknown terms in Equation 2.7, $f\{HOM \mid \widehat{GT}, ◐\}$ and $f\{HET \mid \widehat{GT}, ◐\}$, are also estimated from the HapMap samples.
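The ICE emission probabilities in Equations 2.4 to 2.7 can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian kernel, the bandwidth, and the dictionary layout are choices made for the example, and in practice the reference scores and the terms f{HOM | call, retention} would come from HapMap data rather than from the caller.

```python
import math


def kde(samples, bandwidth):
    """Gaussian kernel density estimate built from reference confidence scores."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density


def ice_emission(call, score, state, p_hom, dens, p_true_hom):
    """Emission probability of a (call, score) pair under a hidden state.

    p_hom[state]       probability of a HOM call in that state (Eq. 2.3)
    dens[(call, true)] density of the score given called and true genotype (Eq. 2.5)
    p_true_hom[call]   f{HOM | call, retention}, assumed estimated from reference data
    """
    f_call = p_hom[state] if call == "HOM" else 1.0 - p_hom[state]
    if state == "loss":
        # Eq. 2.6: in a region of loss the true genotype is always HOM
        weight = dens[(call, "HOM")](score)
    else:
        # Eq. 2.7: mix over the two possible true genotypes
        w = p_true_hom[call]
        weight = (w * dens[(call, "HOM")](score)
                  + (1.0 - w) * dens[(call, "HET")](score))
    return f_call * weight
```

The ratio of the retention and loss emissions for a heterozygous call then depends on the score: where the reference densities for correct and incorrect HET calls separate well, a high-confidence HET call gives strong evidence for retention, while a low-confidence one gives little.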

2.2 Copy number

The hidden states for autosomal copy numbers are hemizygous deletion (↘), two copies (→), and more than two copies (↗). A typical and, from practical experience, quite reasonable assumption when only copy number is considered (applied to aCGH and SNP chip data) is that the logarithm of the copy number estimate, after normalization, is roughly normally distributed around the true log copy number (see for example Zhao et al. (2004)), although slightly heavier tails may also be observed in practice. More important, however, is the fact that the variability is not necessarily constant across SNPs, which we will address in the ICE HMM. If the variance is assumed to be constant (as done in the vanilla HMM), this parameter can be learned via the EM algorithm (Dempster et al., 1977), or estimated in a robust manner, for example using quantiles from the observed data. In the examples presented here, we obtained a robust estimate for the standard deviation of copy number estimates using the 16th and 84th percentiles of the $\log_2$ transformed CN^ (corresponding to plus or minus one standard deviation from the median). For a state S, the mean $\mu_S$ and variance $\sigma_S^2$ of the Gaussians used to describe the emission probabilities can be fixed at starting values, or updated by EM. In the vanilla HMM we assume a constant $\sigma^2$ and estimate the emission probabilities for state ↘, for instance (on the $\log_2$ scale, not divided by 2), as

$$
\beta_{↘}(\widehat{CN}) = f(\widehat{CN} \mid ↘) \sim N(\mu_S = 0, \mathrm{var} = \sigma^2). \tag{2.8}
$$

The transition probability for the copy number HMM is the same as the one described above.
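The vanilla copy number emission can be sketched as below. Two points are assumptions rather than statements from the text: the robust standard deviation is taken as half the distance between the 16th and 84th percentiles, and the location for the state with more than two copies is set to log2(3), the value used for amplifications in the simulation. The state labels and function names are hypothetical.

```python
import math
import statistics

# Location parameters on the log2 scale; "amplification" uses log2(3) by assumption
LOG2_MEAN = {
    "deletion": math.log2(1),
    "normal": math.log2(2),
    "amplification": math.log2(3),
}


def robust_sd(log2_cn):
    """Half the distance between the 16th and 84th percentiles."""
    cuts = statistics.quantiles(log2_cn, n=100, method="inclusive")
    return (cuts[83] - cuts[15]) / 2.0  # cuts[k - 1] is the k-th percentile


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def cn_emission(cn_hat, state, sd):
    """Gaussian emission density of a copy number estimate on the log2 scale."""
    return normal_pdf(math.log2(cn_hat), LOG2_MEAN[state], sd)
```

For Gaussian data the 16th and 84th percentiles sit roughly one standard deviation either side of the median, so this estimate is close to the usual standard deviation while being less sensitive to the outliers that true copy number changes introduce.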

Integrating confidence estimates (ICE)

The emission probabilities for the HMM retain the same location parameters for the Gaussian, but with SNP-specific standard errors for the CN^. For a given SNP, the emission probability for copy number two (→) for example is

$$
\beta_{→}\{\widehat{CN} \mid S_{\widehat{CN}}\} \sim N\left(1, (\sigma \times S_{\widehat{CN}})^2\right). \tag{2.9}
$$

The scalar σ can be estimated from the sample at hand, or set equal to one if $S_{\widehat{CN}}$ measures the actual variability of the copy number estimate around the true copy number.
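To see how a SNP-specific standard error changes the evidence a single copy number estimate carries, the following sketch evaluates the Gaussian of Equation 2.9 on the log scale and compares the deletion state (location 0) with the two-copy state (location 1). The function names are hypothetical, and the comparison of two states by a log odds is an illustration added here, not a quantity defined in the text.

```python
import math


def ice_cn_log_emission(cn_hat, se, state_mean, sigma=1.0):
    """Log Gaussian density of log2(cn_hat) with SNP-specific sd = sigma * se."""
    sd = sigma * se
    z = (math.log2(cn_hat) - state_mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_odds_deletion_vs_normal(cn_hat, se, sigma=1.0):
    """Log emission ratio of one copy (mean 0) versus two copies (mean 1)."""
    return (ice_cn_log_emission(cn_hat, se, 0.0, sigma)
            - ice_cn_log_emission(cn_hat, se, 1.0, sigma))
```

Because both states share the same standard error at a given SNP, the log odds scale with the inverse of the squared standard error: halving the standard error quadruples the evidence, which is how precise estimates pull the state path toward themselves while imprecise ones defer to their neighbors.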

2.3 Copy number and genotype

For the joint analysis of copy number and genotype, we extend the transition probabilities in Equations 2.1 and 2.2 to the hidden states normal (◒), amplification, LOH (⊖), and deletion (⦰). For the emission probabilities, we assume conditional independence between the copy number estimates and the genotype calls:

$$
f(\widehat{CN}, \widehat{GT} \mid S) = f(\widehat{CN} \mid S) \times f(\widehat{GT} \mid S). \tag{2.10}
$$

This equation can be further simplified, as the copy number distribution only depends on the true copy number, and the genotype distribution only depends on the true underlying state being ◐ or ○. For example, for the deletion state we have

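$$
f\{\widehat{CN}, \widehat{GT} \mid ⦰\} = f\{\widehat{CN} \mid ⦰\} \times f\{\widehat{GT} \mid ⦰\} = f\{\widehat{CN} \mid ↘\} \times f\{\widehat{GT} \mid ○\}. \tag{2.11}
$$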

The terms in Equation 2.11 can be estimated as described above for genotype and copy number. Emission probabilities for the other states can be obtained similarly.
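A sketch of the joint emission under the conditional independence assumption of Equation 2.10 follows. Only the decomposition of the deletion state is given explicitly in the text; the mapping of the other three joint states to a copy number state and a genotype state is inferred here from the description of the simulated regions and should be read as an assumption, as should the state labels and the log2(3) location for amplification.

```python
import math

# Joint state -> (copy number state, genotype state); only "deletion" is
# spelled out in Eq. 2.11, the other rows are assumed by analogy
JOINT = {
    "normal": ("two_copies", "retention"),
    "amplification": ("gain", "retention"),
    "LOH": ("two_copies", "loss"),
    "deletion": ("one_copy", "loss"),
}
LOG2_MEAN = {"one_copy": 0.0, "two_copies": 1.0, "gain": math.log2(3)}
P_HOM = {"retention": 0.7, "loss": 0.99}


def joint_log_emission(cn_hat, call, state, sd):
    """log f(CN, GT | state) = log f(CN | cn state) + log f(GT | gt state)."""
    cn_state, gt_state = JOINT[state]
    z = (math.log2(cn_hat) - LOG2_MEAN[cn_state]) / sd
    log_f_cn = -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
    p = P_HOM[gt_state]
    log_f_gt = math.log(p if call == "HOM" else 1.0 - p)
    return log_f_cn + log_f_gt
```

Working on the log scale turns the product in Equation 2.10 into a sum, which is also the form a Viterbi recursion over the four joint states would consume.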

2.4 Simulation

The simulated data are available in the Bioconductor package vanillaICE. The simulation comprises one subject’s genotype, copy number, and confidence scores for 9165 SNPs on chromosome 1. A description of the 5 features simulated in chromosome 1, referred to by regions A–E, and the underlying hidden states in these regions follows.

Genotype calls

With the exception of Regions A, B, and C in Figure 2, we simulated 9165 genotypes (the approximate number of SNPs in the two 50k SNP chips) from a Bernoulli distribution with probability 0.7 of HOM^. Unless otherwise indicated, confidence scores for GT^ were obtained by random draws of confidence scores in the HapMap data when the CRLMM GT^ agreed with the gold standard as defined by consensus of the HapMap genotyping centers. The reference distributions were constructed separately for the Affymetrix 50k Xba and Hind chips, and hence the confidence score for each SNP was sampled from the distribution for the chip on which that SNP is measured.
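The Bernoulli draw for the background genotype calls is simple enough to sketch directly. The function name and the fixed seed are assumptions for reproducibility of the example; the confidence scores, which the text samples from HapMap reference data, are not simulated here.

```python
import random


def simulate_calls(n, p_hom=0.7, seed=0):
    """Draw n genotype calls, each 'HOM' with probability p_hom, else 'HET'."""
    rng = random.Random(seed)
    return ["HOM" if rng.random() < p_hom else "HET" for _ in range(n)]


calls = simulate_calls(9165)
```

Regions with a different true state would then be overwritten in place, for example by replacing a slice of `calls` with homozygous calls to represent a region of loss.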

Figure 2.


A simulated chromosome with 9165 SNPs. Top: The simulated GT^ with uniform noise added to reduce overplotting (vertical axis) plotted against physical position (horizontal axis). Bottom: A magnification of region A. Two SNPs in region A with high simulated confidence scores are indicated by the square plotting symbol. Regions A–E are described in more detail in Section 2.4. In truth, there are 4 different segments in state loss (○, indicated in light grey above). The predicted hidden states from the vanilla (Van) and ICE HMMs are denoted by color in the two bars beneath the data points. The ICE HMM detects each of the 4 ○ segments, whereas the vanilla HMM smooths over a segment in A containing two heterozygous SNPs at position 52.8 Mb. Utilizing confidence scores for the genotype predictions, the ICE HMM may provide more precise locations for ○ breakpoints.

Copy number

The Affymetrix CNAT tool (version 3.0) was used to obtain CN^ for the 9165 SNPs from a presumably normal individual in the HapMap dataset (sample NA06993). Deletions and amplifications were simulated from Gaussian distributions with location parameters $\log_2(1)$ and $\log_2(3)$, respectively. For the scale parameter, we used a robust estimate of the $\log_2$ transformed copy number standard deviation, denoted by ε. To illustrate how a confidence score such as a standard error of the copy number estimate could be useful, we simulated standard errors from a shifted Gamma: Γ(1, 2) + 0.3, where 1 is the shape parameter and 2 is the rate parameter. To ascertain the effect of qualitatively high confidence scores on the ICE HMM, we scaled ε by 1/2. Similarly, to simulate less precise CN^ we scaled ε by 2.
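The two simulation ingredients described above, Gaussian copy number estimates on the log2 scale and shifted Gamma standard errors, can be sketched with the Python standard library. The function names and seeds are hypothetical. Note that `random.gammavariate` is parameterized by shape and scale, so the rate of 2 is converted to a scale of 1/2.

```python
import math
import random


def simulate_standard_errors(n, shape=1.0, rate=2.0, shift=0.3, seed=0):
    """Draw n standard errors from Gamma(shape, rate) + shift."""
    rng = random.Random(seed)
    # gammavariate expects a scale parameter, which is 1 / rate
    return [rng.gammavariate(shape, 1.0 / rate) + shift for _ in range(n)]


def simulate_log2_cn(n, true_copies, eps, seed=0):
    """Draw n log2 copy number estimates centered at log2(true_copies)."""
    rng = random.Random(seed)
    return [rng.gauss(math.log2(true_copies), eps) for _ in range(n)]
```

Passing `eps / 2` or `2 * eps` as the scale argument of `simulate_log2_cn` would correspond to the more precise and less precise estimates mentioned in the text.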

Regions A–E were simulated as follows:

  • Region A contains 200 SNPs spanning a physical distance of approximately 5 Mb. Two chromosomal segments of 99 homozygous genotypes are separated by a chromosomal segment of 14 kb containing two heterozygous SNPs. Using a 2-state hidden Markov model and using only the simulated genotypes as the observed data, the true underlying states (number of SNPs) are ○ (99), ◐ (2), and ○ (99) for the 3 segments, respectively. We augment the genotype calls with copy number estimates obtained directly from the CNAT analysis of a normal HapMap subject’s chromosome 1. Using the 3-state HMM for copy number, the true underlying state is → (200). Modeled jointly, the true underlying states are ⊖ (99), ◒ (2), and ⊖ (99).

  • Region B contains 100 SNPs spanning a physical distance of approximately 2 Mb. Two chromosomal segments each containing 49 SNPs are both in regions of a hemizygous deletion. We assigned a homozygous genotype call to all 98 SNPs in the two hemizygous deletions. The two hemizygous deletions are separated by a chromosomal segment of 360 basepairs with copy number two. To simulate an incorrect genotype call (the true genotype is homozygous for the 2 SNPs on the diploid segment), confidence scores for the two heterozygous SNPs are drawn from the distribution of confidence scores when the CRLMM genotype call of HET was incorrect. Copy number estimates and corresponding confidence scores (standard errors) for the hemizygous deletion were simulated as described above, with the exception that high confidence scores were assigned to the two SNPs in the chromosomal segment with normal copy number. The true underlying state for the genotypes in Region B is ○ (100). The true state for the copy number in region B is ↘ (49), → (2), and ↘ (49). Modeled jointly, the true states are ⦰ (49),◒ (2), and ⦰ (49).

  • Region C is a segment containing 100 homozygous SNPs spanning < 2 Mb in a hemizygous deletion. The true underlying states are ○ (100) in the genotype HMM, ↘ (100) in the copy number HMM, and ⦰ (100) in the joint HMM.

  • Region D contains two segments with copy number 3 (< 1 Mb), separated by a diploid segment containing 2 SNPs (9.8 kb). The true underlying states are ◐ (200) in the genotype HMM; ↗ (99), → (2), and ↗ (99) in the copy number HMM; and amplification (99), ◒ (2), and amplification (99) in the joint HMM.

  • Region E contains a microdeletion spanning 5 SNPs (94 kb) and a microamplification containing 3 SNPs (294 kb). We assigned high confidence scores to the copy number estimates in both regions. The true underlying states are ○ (5) and ◐ (3) in the genotype HMM, ↘ (5) and ↗ (3) in the copy number HMM, and ⦰ (5) and amplification (3) in the joint HMM.

3 Results

This section describes results obtained from fitting HMMs to simulated and experimental data. The HMMs are written in the statistical language R (http://www.r-project.org) using S4 classes and methods (Chambers, 1998). In particular, the HMM is dependent on whether genotype estimates (abbreviated GT^), copy number estimates (abbreviated CN^), or both GT^ and CN^ are available as defined by three classes of objects for SNP array data (Scharpf et al., 2007). Organizing the statistical methods in this way allows more flexibility to users interested only in characterizing chromosomal abnormalities in genotype (loss of heterozygosity, LOH) or copy number (deletion or amplification) respectively. When both GT^ and CN^ are available, the HMM will distinguish between copy-neutral LOH and deletion-induced LOH. We use the term LOH in this context as an unusually long stretch of homozygous SNPs, though these regions can be completely naturally occurring, for example due to evolutionary pressure on chromosomal segments. For the simulation, we simulate GT^ and CN^ as described in Section 2.4, analyzing the GT^ and CN^ separately and then jointly. For the experimental data, we use a HapMap sample with a previously identified region of uniparental isodisomy, a mechanism for copy neutral LOH. Both the simulation and experimental data are based on 100k Affymetrix SNP chips (comprised of the Xba and Hind 50k chips). All figures shown are also available in color as supplementary material at http://biostat.jhsph.edu/~iruczins/publications/sm/

3.1 Simulated data

SNP-level summaries were obtained using a combination of real (experimental) and simulated data for 9165 SNPs measured on chromosome 1 of the 50k Hind and Xba Affymetrix SNP chips; see Section 2.4 for additional details. Because the states of the HMM are determined by whether genotype estimates (GT^), copy number estimates (CN^), or both GT^ and CN^ are available, we organize the results accordingly. For each example, we plot both the predictions of an HMM that uses only the observed SNP-level summaries as input (vanilla), and an HMM that integrates confidence estimates (ICE) for the SNP-level summaries.

Genotype HMM

The hidden states for the genotype HMM are retention (◒) and loss (○) of heterozygosity. In the upper panel of Figure 2 the simulated GT^ are plotted with uniform noise added to reduce overplotting. The predicted states from the vanilla and ICE HMMs are also shown. The predictions from the vanilla HMM are the same as the predictions of the ICE HMM, with the exception of the region (A) magnified in the lower panel of Figure 2, where the ICE HMM correctly identifies the ◒ segment. Both approaches miss the microdeletion spanning 5 SNPs in region E, but otherwise correctly predict the true underlying states (see Section 2.4 for details). In general, for both the vanilla and ICE HMMs, the Viterbi algorithm (conditional on the other parameters of the HMM) chooses an optimal sequence of states that maximizes the likelihood of the observed genotype calls. The predicted states reflect a trade-off between the likelihood of the observed genotypes given the underlying states, and the transition probabilities. Unlike the vanilla HMM, emission probabilities in the ICE HMM are a function of the confidence scores (as described in Section 2), and factor into the likelihood. Intuitively, a high confidence score at a particular SNP has the effect of giving more weight to the emission probability and less weight to the state of the neighboring SNPs when determining the optimal sequence of states in the Viterbi algorithm. Hence, the sequence of states that maximizes the likelihood of the observed genotype calls differs between the ICE and vanilla HMMs when the confidence scores shift the balance between the opposing forces of the emission and transition probabilities. In particular, the high confidence scores at the two heterozygous SNPs in region A favor the emission probability for ◒, causing two breakpoints in this region of ○ and, hence, a more local smoothing of the HMM.
Although the emission probability for state ◒ is greater than for state ○ at these two SNPs in the vanilla HMM, the probability of having two breakpoints in a region of ○ for SNPs that are physically close is small, as reflected in the transition probability. Therefore, the vanilla HMM provides a smoothing that is less localized, corresponding to a sequence of ○ predictions in region A without transitions to the normal state.
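The trade-off between emission and transition probabilities can be illustrated with a minimal Python sketch. It is not the vanillaICE implementation: the emission model (a confidence score read as the probability that a call is correct), the heterozygosity rate of 0.3, and the constant switch probability of 0.01 are all assumptions chosen for illustration, whereas the HMM described above uses the emission probabilities of Section 2 and transition probabilities that depend on the distance between SNPs.

```python
import math

NORMAL, LOH = 0, 1  # hidden states: retention and loss of heterozygosity


def emission(call, conf, p_het=0.3):
    """Toy emission probabilities (P(call | NORMAL), P(call | LOH)).

    conf is the assumed probability that the genotype call is correct.
    Under LOH the true genotype is homozygous, so a 'het' call must be an error.
    """
    if call == "het":
        return (p_het * conf + (1 - p_het) * (1 - conf), 1 - conf)
    return ((1 - p_het) * conf + p_het * (1 - conf), conf)


def viterbi(calls, confs, p_switch=0.01):
    """Most likely state path for a two-state HMM with a constant switch probability."""
    log_stay, log_switch = math.log(1 - p_switch), math.log(p_switch)
    e = emission(calls[0], confs[0])
    score = [math.log(0.5) + math.log(e[s]) for s in (NORMAL, LOH)]
    back = []
    for call, conf in zip(calls[1:], confs[1:]):
        e = emission(call, conf)
        new_score, pointers = [], []
        for s in (NORMAL, LOH):
            cands = [score[r] + (log_stay if r == s else log_switch)
                     for r in (NORMAL, LOH)]
            best = max((NORMAL, LOH), key=lambda r: cands[r])
            pointers.append(best)
            new_score.append(cands[best] + math.log(e[s]))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max((NORMAL, LOH), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Two heterozygous calls inside a long run of homozygous calls.
calls = ["hom"] * 20 + ["het", "het"] + ["hom"] * 20
vanilla_conf = [0.99] * len(calls)      # one fixed error rate for every SNP
ice_conf = list(vanilla_conf)
ice_conf[20] = ice_conf[21] = 0.999     # high confidence at the two het calls

vanilla_path = viterbi(calls, vanilla_conf)
ice_path = viterbi(calls, ice_conf)
```

With these assumed numbers, the fixed error rate leaves the whole run in the loss state, because two breakpoints cost more than the two heterozygous calls gain. Raising the confidence at the two heterozygous SNPs tips the balance, and the path switches to the normal state for exactly those two SNPs.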

Copy number HMM

The hidden states for autosomal copy numbers are hemizygous deletion (↘), normal (two) copies (→), and more than two copies (↗). Figure 3 (upper panel) shows the CN^ of the simulated dataset. In our simulation, chromosome 1 contains three amplifications ↗ (two segments in D separated by a segment with normal copy number, and one in E), and four deletions ↘ (two segments in B separated by a segment with normal copy number, and one segment each in regions C and E). Also shown are the predicted states from the vanilla and ICE HMMs, respectively. The predictions from the two HMMs differ in regions B, D, and E, magnified in the lower panel. Without confidence estimates for the copy number, the transition probabilities dominate the contribution of the emission probabilities to the likelihood, and the vanilla HMM smoothes over the two SNPs with copy number 2 in regions B and D, and the amplification in region E. The high confidence scores used in this simulation for the copy number estimates in these regions make the transition between states more favorable, and thus the ICE HMM makes the transition back to the normal state for regions B and D, and detects the amplification in region E. Note that when the confidence scores for the CN^ are low, as for the 2 SNPs with copy number near two in the hemizygous deletion in region C, the predictions with ICE and vanilla are identical. Also, the vanilla HMM detects a spurious deletion to the left of region A. As the confidence scores for those copy number estimates were low, the likelihood specified in the ICE HMM does not favor a transition to a non-normal state.
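To make the role of the confidence scores concrete, the following Python sketch compares the evidence that a single SNP provides for a deletion when its copy number estimate is precise versus noisy. It assumes Gaussian emissions on the log2 copy number scale with state means log2(1), log2(2), and log2(3), and a constant switch probability of 0.01; the standard deviations 0.2 and 0.5 are hypothetical values, not estimates from the data.

```python
import math

STATE_MEANS = {
    "deletion": math.log2(1),
    "normal": math.log2(2),
    "amplification": math.log2(3),
}


def log_emission(log2_cn, state, sd):
    """Gaussian log density of a log2 copy number estimate given the hidden state."""
    z = (log2_cn - STATE_MEANS[state]) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2 * math.pi))


def evidence_for_deletion(log2_cn, sd):
    """Log-likelihood ratio of 'deletion' versus 'normal' at a single SNP."""
    return log_emission(log2_cn, "deletion", sd) - log_emission(log2_cn, "normal", sd)


# Cost, in log-likelihood units, of entering and leaving a state
# when the (assumed) switch probability is 0.01.
two_breakpoints = 2 * math.log(0.99 / 0.01)

precise = evidence_for_deletion(0.0, sd=0.2)  # high confidence: small SD
noisy = evidence_for_deletion(0.0, sd=0.5)    # low confidence: large SD
```

Under these assumptions a single precise estimate at one copy contributes 12.5 log-likelihood units, more than the roughly 9.2 units that two breakpoints cost, while a noisy estimate contributes only 2.0, so about five consecutive noisy SNPs would be needed before a transition pays off.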

Figure 3.

Top: Copy number estimates (vertical axis) versus physical position (horizontal axis) for 9165 SNPs on a simulated chromosome. Bottom: A magnification of regions D, B, and E. High confidence scores for the copy number estimates were simulated for the square points in regions D, B, and E. The two bars beneath the data points in each figure show the predicted hidden states from the vanilla (Van) and ICE HMMs. Note that where the predictions differ in regions D, B, and E, the ICE correctly classified the hidden states. Note that the vanilla HMM also indicates a (spurious) deletion to the left of region A, not indicated by the ICE HMM due to high variability in those copy number estimates.

Genotype and copy number HMM

We plot both the GT^ and CN^ in the upper panel of Figure 4. By modeling GT^ and CN^ simultaneously, we expand the state space of the HMM to include deletion-induced LOH (⦰), copy-neutral LOH (⊖), normal (◒), and amplification. The predicted states from the vanilla and ICE HMMs are also shown, and differences in predictions are indicated in the lower panel. As before, ICE correctly classifies all SNPs into the respective states, while the vanilla HMM, in the absence of uncertainty estimates, smoothes over some loci (regions A, B, D), and fails to detect the amplification (with high confidence scores) in region E. In contrast, the vanilla HMM does detect the microdeletion in region E. The ability of the vanilla HMM to detect the microdeletion in this example even in the absence of confidence scores is attributable to the additional information that the genotype provides: SNPs in deleted regions all appear as homozygous, in contrast to amplifications, where both homozygous and heterozygous SNPs occur. Additionally, the extra genotype information may reduce the occurrence of spurious predicted deletions. For instance, in the absence of information on genotype calls in Figure 3, the vanilla HMM predicts a small deletion to the left of region A. As heterozygous genotype estimates in this region are incompatible with a deletion, the vanilla HMM no longer predicts this region to be a deletion in Figure 4.
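The effect of the genotype on a spurious deletion call can be sketched in Python as follows. The sketch assumes that, given the hidden state, the genotype call and the copy number estimate are independent, so that the joint emission is a product; the state means, the probabilities of a heterozygous call, and the standard deviation of 0.3 are hypothetical. It looks at a single SNP only and ignores the transition probabilities.

```python
import math

# Hypothetical parameters per hidden state:
# (mean log2 copy number, probability of a heterozygous call).
STATES = {
    "deletion-induced LOH": (math.log2(1), 0.001),
    "copy-neutral LOH": (math.log2(2), 0.001),
    "normal": (math.log2(2), 0.3),
    "amplification": (math.log2(3), 0.3),
}


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def copy_number_emission(log2_cn, sd=0.3):
    """Emission of the copy number estimate alone, for every state."""
    return {s: normal_pdf(log2_cn, mean, sd) for s, (mean, _) in STATES.items()}


def joint_emission(call, log2_cn, sd=0.3):
    """Product of genotype and copy number emissions (assumed independent given the state)."""
    cn = copy_number_emission(log2_cn, sd)
    joint = {}
    for s, (_, p_het) in STATES.items():
        p_call = p_het if call == "het" else 1 - p_het
        joint[s] = p_call * cn[s]
    return joint


def most_likely(emission):
    return max(emission, key=emission.get)


# A low copy number estimate (log2 value 0.3) at a SNP called heterozygous.
cn_only = most_likely(copy_number_emission(0.3))
joint = most_likely(joint_emission("het", 0.3))
```

In this toy setting the copy number estimate alone favors the deletion state, whereas the joint emission favors the normal state, because a heterozygous call is very unlikely under a hemizygous deletion.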

Figure 4.

Top: The CN^ in Figure 3 are superimposed on the GT^ in Figure 2. We fit HMMs to the joint observation sequence of CN^ and GT^ without (vanilla) and with (ICE) confidence scores of the SNP-level summaries. The predictions from these two HMMs are represented by different shades of grey in the two bars beneath the data points in each panel. We used square plotting symbols to indicate SNPs for which we assigned high confidence scores to the genotype and copy number estimates.

3.2 Experimental data

To illustrate the HMM approaches on experimental data, we used a HapMap sample with a previously identified (and experimentally confirmed) uniparental disomy (UPD) in chromosome 2. The Affymetrix tool CNAT (version 3.0) and the R software CRLMM were used to obtain SNP-level summaries of copy number and genotype, respectively. We caution that at this point in time the GT^ obtained using CRLMM (or the Affymetrix tools) implicitly assume that the copy number is two; ideally, allele-specific estimates should be used, and methods are under development (Rafael Irizarry, personal communication). Also, software to obtain confidence scores for CN^ based on probe-level variability and signal-to-noise ratio on the chip (such as described in Wang et al. (2006)) is not yet available. However, differences in the SNP-specific standard deviations of the CN^ across a reference set of 90 HapMap samples have previously been reported (see for example Zhao et al. (2004)), and can be used in a straightforward manner as measures of uncertainty (specifically, using those deviations as the SCN^ in Equation 2.9, and estimating the scalar σ from the autosomal SNP copy number estimates in the sample).

The upper panel in Figure 5 shows CN^ on the vertical axis against physical position on chromosome 2. The region of predominantly homozygous SNP calls at 190–200 Mb is a previously identified UPD (Ting et al., 2006). Also shown are the predictions from the vanilla and ICE HMMs. The confirmed UPD between 190 and 200 Mb is detected by both HMMs, though the vanilla HMM incorrectly predicts a small deletion of 3 SNPs in the middle of this region, whereas the ICE HMM provides a more global (and correct) smoothing of the copy number estimates. Also, the vanilla HMM finds a spurious amplification at about 210 Mb. The lower panel on the left provides a magnified view of the region between 135 and 155 Mb, where the vanilla and ICE HMMs differ. Only the middle region (at about 143 Mb) is identified by both HMMs as LOH (we again stress that we use the term LOH here for copy-neutral stretches of homozygous SNPs, which may occur naturally, possibly due to evolutionary pressure on this chromosomal segment). The chromosomal segment at about 140 Mb contains two heterozygous SNPs (confirmed in the HapMap data, and called as such by CRLMM), and thus is not a region of LOH, contrary to the prediction of the vanilla HMM. The lower panel on the right further zooms in on the vanilla and ICE predictions in the region around 150 Mb. The two SNPs with heterozygous genotype calls at about 151 Mb are truly heterozygous, and therefore the ICE HMM correctly identifies the chromosomal segment containing these two SNPs as normal. Due to the abundance of markers exclusively called homozygous in the segment around 151.25 Mb, the ICE HMM still indicates an LOH segment there. Several studies have recognized the abundance of short, copy-neutral, entirely homozygous regions (see for example Beroukhim et al. (2006)).
To illustrate the prevalence of short, homozygous sequences, we fit the vanilla and ICE HMMs to the chromosome 2 data of the parents in the 30 CEPH trios available from HapMap (60 independent samples), and highlight these copy-neutral, all-homozygous regions in Figure 6. Clearly visible are the abundance of these regions and the locations along chromosome 2 in which they are enriched (possibly explained by evolutionary pressure).

Figure 5.

Top: A confirmed UPD between 190 and 200 Mb is detected by both HMMs in a HapMap sample from the CEPH dataset. Note that the vanilla HMM incorrectly predicts a small deletion of 3 SNPs in the middle of this region, whereas the ICE HMM provides a more global smoothing of the copy number estimates. Bottom left: a magnified view of three possible LOH regions (not confirmed). Only the middle region (143 Mb) is identified by both HMMs as LOH. Because the CRLMM genotype calls agree with the HapMap consensus, the chromosomal segment containing the two heterozygous SNPs at 140 Mb is not a region of LOH, contrary to the prediction of the vanilla HMM. Bottom right: magnification of the vanilla (top) and ICE (bottom) predictions for the feature at 150 Mb. Again, the true genotypes are heterozygous, and so the ICE HMM correctly identifies the chromosomal segment containing the two heterozygous SNPs as normal.

Figure 6.

An image of the predictions from the vanilla HMM fit to chromosome 2 of the 60 parental samples in the CEPH trios dataset (top). The x- and y-coordinates used for the image are physical position and subject, respectively. Subject NA07056 has a confirmed UPD at 195 Mb. Also plotted are the frequencies of LOH across the 60 samples (middle) and the cytoband (bottom).

3.3 A vanilla/ICE comparison

We performed additional simulations to contrast the performances of the vanilla and ICE HMMs. Since large deletions and amplifications can easily be picked up by both approaches, we focused on small deletions and amplifications, spanning between 2 and 10 consecutive SNPs. Since the results were as expected, we only describe the effects of the copy number variability and confidence scores on the detection of small deletions in detail.

The experimental data consisted of genotype calls and copy number estimates as described in Section 2.4. Copy number confidence scores were obtained by weighting the robust estimate of the within-chip log2 copy number standard deviation by the standardized SNP-specific standard deviation derived from a reference set of 90 HapMap samples (i.e., the weight for a particular SNP was the ratio of the across-sample standard deviation for that SNP to the median of these standard deviations across all SNPs). In these data, we simulated 450 sets of copy number estimates and confidence scores for deletions ranging from two to ten consecutive SNPs (50 data sets for each deletion size). The locations of the deletions were randomly selected on chromosome 1 for each data set. The copy numbers in the deletions were simulated from a log-normal distribution with mean zero on the log scale (indicating a true DNA copy number equal to one), and a standard deviation equal to a scaled version of the SNP-specific variability described above. The scalar K controlled whether more (K < 1) or less (K > 1) precise copy number data than average were encountered in the deletion. For both vanilla and ICE, we calculated for each simulated data set the difference in log likelihoods between making a transition to the state for deletion (⦰) from the normal state (◒) for the range of the simulated deletion (and back after the deletion), versus staying in the normal state throughout. In other words, we calculated the log likelihood of the true state sequence minus the log likelihood of assigning the normal state ◒ to all SNPs.
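The weighting and the log likelihood difference described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the simulation code used for the results: emissions are taken to be Gaussian on the log2 scale with means 0 (deletion) and 1 (normal), the switch probability is a constant 0.01, the vanilla HMM is assumed to use the common within-chip standard deviation for every SNP, and all numerical inputs are hypothetical. The observed values are held fixed across the three comparisons so that only the assumed precision changes.

```python
import math
import statistics


def snp_specific_sd(within_chip_sd, reference_sd):
    """Weight a within-chip SD by each SNP's reference SD divided by the median reference SD."""
    median_sd = statistics.median(reference_sd)
    return [within_chip_sd * s / median_sd for s in reference_sd]


def gaussian_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def loglik_difference(log2_cn, sds, p_switch=0.01):
    """Log likelihood of 'normal, deletion, normal' minus that of 'normal throughout'.

    log2_cn and sds cover only the SNPs inside the candidate deletion; all other
    SNPs contribute equally to both paths and cancel in the difference.
    """
    emission = sum(
        gaussian_logpdf(x, 0.0, sd) - gaussian_logpdf(x, 1.0, sd)
        for x, sd in zip(log2_cn, sds)
    )
    # Entering and leaving the deletion replaces two 'stay' transitions by two 'switch' ones.
    transition = 2 * (math.log(p_switch) - math.log(1 - p_switch))
    return emission + transition


reference_sd = [0.20, 0.25, 0.30, 0.25, 0.35]  # hypothetical across-sample SDs, one per SNP
within_chip_sd = 0.25                          # hypothetical robust within-chip SD
sds = snp_specific_sd(within_chip_sd, reference_sd)

deleted = [0.05, -0.10, 0.10]                  # log2 copy number at the first three SNPs
vanilla = loglik_difference(deleted, [within_chip_sd] * 3)
ice_precise = loglik_difference(deleted, [0.4 * sd for sd in sds[:3]])  # K = 0.4
ice_noisy = loglik_difference(deleted, [1.3 * sd for sd in sds[:3]])    # K = 1.3
```

A positive difference means the path containing the deletion is favored. In this toy example the difference is largest when the SNP-specific standard deviations are scaled by K = 0.4 and smallest when they are scaled by K = 1.3, with the constant-variance calculation in between.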

The upper row of panels in Figure 7 shows the distributions of the differences in the log likelihoods for both the vanilla (light grey) and ICE (dark grey) HMMs, for deletions of different sizes and four different scale parameters K. For the first two panels, the variability in the simulated copy number estimates in the deleted region was less than in the original data (the standard deviations were reduced to 40% and 70% of the original, respectively), and for the fourth panel the standard deviation in the simulated copy number estimates in the deleted region was increased by 30%. The middle row of panels shows the respective estimated probabilities of the differences in log likelihoods being positive, i.e. the proportion of instances when the correct model was favored over the incorrect one. The lower row of panels shows the difference in these probabilities between ICE and vanilla. The ability to detect micro-deletions of a few SNPs clearly depends on precise data, and on knowledge of that precision. For example, when the standard deviation of the simulated copy number estimates in the deletion was reduced to 40%, ICE was able to consistently detect even the smallest deletions, while vanilla was only able to do so for deletions of size 5 or larger (left panels). Naturally, larger deletions are easier to detect for both methods. As the quality of the data decreases (simulated here as an increase in the variability of the copy number estimates in the deletion), the ability of ICE to detect the deletion suffers substantially, while vanilla is almost agnostic to these changes. When the standard deviation of the simulated copy number estimates in the deletion was increased by 30%, vanilla picked up the deletion more often than ICE (right panels).
The reason for this is as follows: since the variability in the copy number estimates is increased, the evidence of a deletion being present decreases, and ICE acknowledges this fact by incorporating the confidence estimates. Thus, the decrease in the proportion of instances where ICE favors a deletion over the normal state is a feature of the algorithm. The price to pay otherwise is in the number of false positives (i.e. the number of incorrectly inferred deletions at other loci). Simulating 200 “synthetic” normal chromosome 1q arms with K=1.3 across all SNPs, vanilla indicated spurious small deletions in 50 of these artificial chromosomal arms (for a total of 86 incorrect state predictions), while ICE indicated none.

Figure 7.

Differences between the log likelihoods for the correct and incorrect state sequences for the vanilla (light grey) and ICE (dark grey) HMMs are indicated in the upper panels. The differences are shown for deletions of different sizes (horizontal axis), and four different scale parameters K for the copy number estimate variability in the simulated deletions (0.4, 0.7, 1.0, 1.3, left to right). The data were scaled to fit the panels, and slightly smoothed from the raw data by exploiting an obvious mean and variance relationship. The middle row of panels shows the estimated probabilities of the differences in log likelihoods being positive (i.e. the proportion of instances when the correct model was favored over the incorrect one) assuming normality of the differences in the log likelihoods. The lower row of panels shows the estimated differences in these probabilities between ICE and vanilla.
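The probability of a positive difference under the normality assumption mentioned in the caption can be computed as in the following Python sketch. Plugging the sample mean and standard deviation into the normal distribution function is one straightforward reading of that assumption, and the five input values are hypothetical rather than taken from the simulations.

```python
import math
import statistics


def prob_positive(differences):
    """Estimate P(D > 0) for D ~ Normal(mean, sd), with mean and sd taken from the sample."""
    mean = statistics.mean(differences)
    sd = statistics.stdev(differences)
    return 0.5 * (1 + math.erf(mean / (sd * math.sqrt(2))))


# Hypothetical log likelihood differences from five simulated data sets.
differences = [2.0, -1.0, 3.0, 0.5, 1.5]
p_normal = prob_positive(differences)
p_empirical = sum(d > 0 for d in differences) / len(differences)
```

Here the normal approximation gives a probability close to the empirical proportion of positive differences (four out of five), and it varies smoothly with the mean and spread of the differences rather than in steps.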

4 Discussion

Chromosomal DNA varies between individuals at the level of entire chromosomes, chromosomal segments, and small genomic regions down to a single nucleotide (including single nucleotide polymorphisms, SNPs). Many of these variations appear to be completely benign, but some are known or suspected to be associated with disease. Association studies often use a few SNPs (in candidate gene studies) or hundreds of thousands of SNPs (in genome wide association studies) as potential candidates or markers of genes to investigate the relationship between genotype and phenotype. However, the abundance of copy number variations in the human genome and their role in disease have received increasing attention. In particular, the “common disease, common variant” paradigm has been challenged for some diseases (McClellan et al. (2007); see for example Sebat et al. (2007) for a case study on autistic and apparently normal subjects). Undoubtedly, this change is due in part to recent technological advances, in particular in high density single nucleotide polymorphism (SNP) microarrays, which allow for the detection of these alterations. Besides copy number variations such as deletions and duplications, copy-neutral stretches of homozygosity can also be of scientific interest; uniparental disomy, for example, has been implicated in disease.

Copy number variations and loss of heterozygosity can arise through somatic and germline events. In this manuscript, we developed methods most applicable to abnormalities that are a consequence of germline events. The stochastic process as defined by our transition probabilities may well be too rigid for the analysis of data arising from a cancer sample, where microdeletions as well as the loss of an entire chromosomal arm might be present. Further, non-integer copy numbers do make sense in such samples due to the mix of normal and abnormal cells in the sample (i.e. mosaicism, see Ting et al. (2006) for an example), while we assume the copy numbers to be integers in our approach. While rare, non-integer copy numbers may occur even in “normal” genomes (this can occur throughout the body or in specific regions), and thus may pose a problem for our algorithm. In general, even if our method could be extended to allow for non-integer copy numbers (at least the HMM for copy numbers, since the definition of “genotype” is unclear in such a setting), the ability to pick up non-integer copy numbers obviously depends on the quality of the data, the length of the non-normal region, and the actual value of said copy number. For example, delineating a small mosaic region in a sample with 95% normal cells and 5% of cells with a hemizygous deletion would likely not be possible.
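The size of the signal in the mosaic example can be worked out with a short Python calculation, assuming that the measured copy number is simply the average over the cells in the sample and that the shift is expressed on the log2 scale relative to two copies.

```python
import math


def expected_log2_ratio(fraction_abnormal, abnormal_cn, normal_cn=2):
    """log2 of the average copy number in a cell mixture, relative to the normal copy number."""
    mean_cn = (1 - fraction_abnormal) * normal_cn + fraction_abnormal * abnormal_cn
    return math.log2(mean_cn / normal_cn)


shift_mosaic = expected_log2_ratio(0.05, abnormal_cn=1)  # 5% of cells carry the deletion
shift_full = expected_log2_ratio(1.0, abnormal_cn=1)     # every cell carries the deletion
```

With 5% of cells affected, the average copy number is 1.95 and the expected shift is about -0.037 on the log2 scale, compared with -1 for a hemizygous deletion present in every cell, which is why such a region would be very hard to separate from measurement noise.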

Our paper builds on a modular approach for analyzing SNP chip data, extending the functionality of statistical algorithms that pre-process probe-level data to produce SNP-level summaries of genotype and copy number. Notably, these approaches have mostly been developed for the Affymetrix platform (such as CRLMM for improved genotype estimates), but our ideas are portable to other high throughput platforms such as Illumina. In particular, the vanilla HMM only relies on genotype (GT^) and copy number (CN^) estimates without any confidence scores, which can be exported directly from the Beadstudio software (http://www.illumina.com/). With one notable (and very recent) exception suggested by Colella et al. (2007), previous approaches using HMMs have considered genotype and copy number separately, not simultaneously in a single unifying statistical model that allows for the detection of copy number changes as well as copy-neutral stretches of homozygosity in the genome. In this sense, this manuscript is not the first to propose such a unifying approach, although ours differs in several aspects from the Bayesian HMM of Colella et al. (2007). In particular, the incorporation of uncertainty estimates can be critical, for example in the detection of microdeletions. The investigation of one particular sample as discussed in this manuscript, however, does not allow for conclusive statements about how the detected alterations are associated with the phenotype. In particular, it has been well established that copy number variations and copy-neutral stretches of homozygous genotypes are prevalent in many phenotypically normal individuals. Identifying features that may be associated with a particular phenotype is better handled by statistical models for between-sample variation in studies with phenotypically normal and diseased populations. Such models reside in the next tier of our modular approach to the analysis of SNP chip data and are an extension of the ideas presented here.

In summary, we developed a HMM for SNP chips using the joint observation sequence of copy number (CN^) and genotype (GT^) estimates as input. We demonstrated that a HMM that uses both CN^ and GT^ can, for example, distinguish copy-neutral LOH from deletion-induced LOH. We also demonstrated how pre-processing algorithms that provide confidence scores of SNP-level summaries can be integrated into the emission probabilities of the HMM to control smoothing in a probabilistic framework, and showed that this can lead to much improved results. Specifically, confidence estimates allow smoothing to be more local or global depending on the uncertainty of the pointwise estimates. We demonstrated how high confidence scores helped in identifying a very small amplification otherwise missed (Figure 4, region E), while low confidence scores for CN^ and GT^ had the desirable effect of providing a more global smoothing (Figure 5). In the experimental data example in particular, this helped to reduce the number of regions identified as LOH by the vanilla HMM, and eliminated the (presumably spurious) indication of a small deletion and a small amplification. We believe that the ability to detect microdeletions and microamplifications could be of utmost importance to explain the genetic basis of many diseases. Undoubtedly, this ability will greatly depend on the number of markers investigated (such as the number of SNPs used on a particular platform) and the quality of the data produced (i.e. the precision of the genotype and copy number estimates), but also on how the uncertainty of the estimates is utilized. In this sense, we hope that our method and software provide a useful tool for the scientific community.

Acknowledgments

We gratefully acknowledge the support from and helpful discussions with Benilton Carvalho, Rafael Irizarry, and the members of the Pevsner Laboratory. RBS was supported by NSF grant DMS034211 and training grant 5T32HL007024 from the National Heart, Lung, and Blood Institute. GP was supported by NSF grant DMS034211. JP was supported by NIH grants HD046598 and HD24061. IR was supported by NIH grants R01 CA074841 and R01 GM083084.

References

  1. Affymetrix. Tech rep. Affymetrix, Inc; White Paper: 2006. Brlmm: an improved genotype calling method for the genechip human mapping 500k array set. [Google Scholar]
  2. Aggarwal A, Leong SH, Lee C, Kon OL, Tan P. Wavelet transformations of tumor expression profiles reveals a pervasive genome-wide imprinting of aneuploidy on the cancer transcriptome. Cancer Res. 2005;65(1):186–94. [PubMed] [Google Scholar]
  3. Aguirre AJ, Brennan C, Bailey G, Sinha R, Feng B, Leo C, Zhang Y, Zhang J, Gans JD, Bardeesy N, Cauwels C, Cordon-Cardo C, Redston MS, DePinho RA, Chin L. High-resolution characterization of the pancreatic adenocarcinoma genome. Proc Natl Acad Sci U S A. 2004;101(24):9067–72. doi: 10.1073/pnas.0402932101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Altug-Teber O, Dufke A, Poths S, Mau-Holzmann UA, Bastepe M, Colleaux L, Cormier-Daire V, Eg-germann T, Gillessen-Kaesbach G, Bonin M, Riess O. A rapid microarray based whole genome analysis for detection of uniparental disomy. Hum Mutat. 2005;26(2):153–9. doi: 10.1002/humu.20198. [DOI] [PubMed] [Google Scholar]
  5. Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellingho3 IK, Hofer MD, Descazeaud A, Rubin MA, Meyerson M, Wong WH, Sellers WR, Li C. Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays. PLoS Comput Biol. 2006;2(5):e41, 2. doi: 10.1371/journal.pcbi.0020041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Carvalho B, Speed TP, Irizarry RA. Tech rep. Johns Hopkins University; 2006. Exploration, normalization, and genotype calls of high density oligonucleotide SNP array data. [DOI] [PubMed] [Google Scholar]
  7. Chambers JM. Programming with data. Springer; New York: 1998. [Google Scholar]
  8. Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J. Quantisnp: an objective bayes hidden-markov model to detect and accurately map copy number variation using snp genotyping data. Nucleic Acids Res. 2007;35(6):2013–25. doi: 10.1093/nar/gkm076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Dempster A, Laird D, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38. [Google Scholar]
  10. Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, Dong S, Bartell D, Huang J, Chiles R, Yang G, mei Shen M, Kulp D, Kennedy GC, Mei R, Jones KW, Cawley S. Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics. 2005;21(9):1958–1963. doi: 10.1093/bioinformatics/bti275. [DOI] [PubMed] [Google Scholar]
  11. Dutt A, Beroukhim R. Single nucleotide polymorphism array analysis of cancer. Curr Opin Oncol. 2007;19(1):43–49. doi: 10.1097/CCO.0b013e328011a8c1. [DOI] [PubMed] [Google Scholar]
  12. Eichler EE, Nickerson DA, Altshuler D, Bowcock AM, Brooks LD, Carter NP, Church DM, Felsenfeld A, Guyer M, Lee C, Lupski JR, Mullikin JC, Pritchard JK, Sebat J, Sherry ST, Smith D, Valle D, Waterston RH. Completing the map of human genetic variation. Nature. 2007;447(7141):161–5. doi: 10.1038/447161a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Eilers PHC, de Menezes RX. Quantile smoothing of array cgh data. Bioinformatics. 2005;21(7):1146–53. doi: 10.1093/bioinformatics/bti148. [DOI] [PubMed] [Google Scholar]
  14. Engel E. A fascination with chromosome rescue in uniparental disomy: Mendelian recessive outlaws and imprinting copyrights infringements. Eur J Hum Genet. 2006;14(11):1158–69. doi: 10.1038/sj.ejhg.5201619. [DOI] [PubMed] [Google Scholar]
  15. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C. Copy number variation: new insights in genome diversity. Genome Res. 2006;16(8):949–61. doi: 10.1101/gr.3677206. [DOI] [PubMed] [Google Scholar]
  16. Fridlyand J, Snijders A, Pinkel D, Albertson D, Jain A. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis. 2004;90:132–153. [Google Scholar]
  17. Guha S, Li Y, Neuberg D. Bayesian hidden markov modeling of array cgh data. Berkeley Electronic Press; 2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Houseman EA, Coull BA, Betensky RA. Feature-specific penalized latent class analysis for genomic data. Biometrics. 2006;62(4):1062–70. doi: 10.1111/j.1541-0420.2006.00566.x. [DOI] [PubMed] [Google Scholar]
  19. Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005;6(2):211–26. doi: 10.1093/biostatistics/kxi004. [DOI] [PubMed] [Google Scholar]
  20. Hua J, Craig DW, Brun M, Webster J, Zismann V, Tembe W, Joshipura K, Huentelman MJ, Dougherty ER, Stephan DA. SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays. Bioinformatics. 2007;23(1):57–63. doi: 10.1093/bioinformatics/btl536. [DOI] [PubMed] [Google Scholar]
  21. Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, Shapero MH. CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics. 2006;7:83. doi: 10.1186/1471-2105-7-83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Huang T, Wu B, Lizardi P, Zhao H. Detection of dna copy number alterations using penalized least squares regression. Bioinformatics. 2005;21(20):3811–7. doi: 10.1093/bioinformatics/bti646. [DOI] [PubMed] [Google Scholar]
  23. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E. Analysis of array cgh data: from signal ratio to gain and loss of dna regions. Bioinformatics. 2004;20(18):3413–22. doi: 10.1093/bioinformatics/bth418. [DOI] [PubMed] [Google Scholar]
  24. Kennedy GC, Matsuzaki H, Dong S, min Liu W, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SPA, Jones KW. Large-scale genotyping of complex DNA. Nat Biotechnol. 2003;21(10):1233–1237. doi: 10.1038/nbt869. [DOI] [PubMed] [Google Scholar]
  25. Laframboise T, Harrington D, Weir BA. PLASQ: A Generalized Linear Model-Based Procedure to Determine Allelic Dosage in Cancer Cells from SNP Array Data. Biostatistics. 2006 doi: 10.1093/biostatistics/kxl012. [DOI] [PubMed] [Google Scholar]
  26. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005;21(19):3763–70. doi: 10.1093/bioinformatics/bti611. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Lai Y, Zhao H. A statistical method to detect chromosomal regions with dna copy number alterations using snp-array-based cgh data. Comput Biol Chem. 2005;29(1):47–54. doi: 10.1016/j.compbiolchem.2004.12.004. [DOI] [PubMed] [Google Scholar]
  28. Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C. dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics. 2004;20(8):1233–40. doi: 10.1093/bioinformatics/bth069. [DOI] [PubMed] [Google Scholar]
  29. McClellan JM, Susser E, King MC. Schizophrenia: a common disease caused by multiple rare alleles. Br J Psychiatry. 2007;190:194–9. doi: 10.1192/bjp.bp.106.025585. [DOI] [PubMed] [Google Scholar]
  30. Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65(14):6071–6079. doi: 10.1158/0008-5472.CAN-05-0465. [DOI] [PubMed] [Google Scholar]
  31. Newton MA, Gould MN, Rezniko3 CA, Haag JD. On the statistical analysis of allelic-loss data. Stat Med. 1998;17(13):1425–45. doi: 10.1002/(sici)1097-0258(19980715)17:13<1425::aid-sim861>3.0.co;2-v. [DOI] [PubMed] [Google Scholar]
  32. Ninomiya H, Nomura K, Satoh Y, Okumura S, Nakagawa K, Fujiwara M, Tsuchiya E, Ishikawa Y. Genetic instability in lung cancer: concurrent analysis of chromosomal, mini- and microsatellite instability and loss of heterozygosity. Br J Cancer. 2006;94(10):1485–91. doi: 10.1038/sj.bjc.6603121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5(4):557–72. doi: 10.1093/biostatistics/kxh008. [DOI] [PubMed] [Google Scholar]
  34. Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. doi: 10.1186/1471-2105-6-27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Rabbee N, Speed TP. A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics. 2006;22(1):7–12. doi: 10.1093/bioinformatics/bti741. [DOI] [PubMed] [Google Scholar]
  36. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257–286. [Google Scholar]
  37. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–54. doi: 10.1038/nature05329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Robinson WP. Mechanisms leading to uniparental disomy and their clinical consequences. Bioessays. 2000;22(5):452–459. doi: 10.1002/(SICI)1521-1878(200005)22:5<452::AID-BIES7>3.0.CO;2-K. [DOI] [PubMed] [Google Scholar]
  39. Scharpf RB, Ting JC, Pevsner J, Ruczinski I. SNPchip: R classes and methods for SNP array data. Bioinformatics. 2007;23(5):627–628. doi: 10.1093/bioinformatics/btl638. [DOI] [PubMed] [Google Scholar]
  40. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J, Spence SJ, Lee AT, Puura K, Lehtimaki T, Ledbetter D, Gregersen PK, Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC, Skuse D, Geschwind DH, Gilliam TC, Ye K, Wigler M. Strong association of de novo copy number mutations with autism. Science. 2007;316(5823):445–9. doi: 10.1126/science.1138659. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22(14):e431–e439. doi: 10.1093/bioinformatics/btl238. [DOI] [PubMed] [Google Scholar]
  42. Shaw-Smith C, Redon R, Rickman L, Rio M, Willatt L, Fiegler H, Firth H, Sanlaville D, Winter R, Colleaux L, Bobrow M, Carter NP. Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features. J Med Genet. 2004;41(4):241–248. doi: 10.1136/jmg.2003.017731. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Szatmari P, Paterson AD, Zwaigenbaum L, Roberts W, Brian J, Liu XQ, Vincent JB, Skaug JL, Thompson AP, Senman L, Feuk L, Qian C, Bryson SE, Jones MB, Marshall CR, Scherer SW, Vieland VJ, Bartlett C, Mangin LV, Goedken R, Segre A, Pericak-Vance MA, Cuccaro ML, Gilbert JR, Wright HH, Abramson RK, Betancur C, Bourgeron T, Gillberg C, Leboyer M, Buxbaum JD, Davis KL, Hollander E, Silverman JM, Hallmayer J, Lotspeich L, Sutcliffe JS, Haines JL, Folstein SE, Piven J, Wassink TH, Sheffield V, Geschwind DH, Bucan M, Brown WT, Cantor RM, Constantino JN, Gilliam TC, Herbert M, Lajonchere C, Ledbetter DH, Lese-Martin C, Miller J, Nelson S, Samango-Sprouse CA, Spence S, State M, Tanzi RE, Coon H, Dawson G, Devlin B, Estes A, Flodman P, Klei L, McMahon WM, Minshew N, Munson J, Korvatska E, Rodier PM, Schellenberg GD, Smith M, Spence MA, Stodgell C, Tepper PG, Wijsman EM, Yu CE, Roge B, Mantoulan C, Wittemeyer K, Poustka A, Felder B, Klauck SM, Schuster C, Poustka F, Bolte S, Feineis-Matthews S, Herbrecht E, Schmotzer G, Tsiantis J, Papanikolaou K, Maestrini E, Bacchelli E, Blasi F, Carone S, Toma C, Van Engeland H, de Jonge M, Kemner C, Koop F, Langemeijer M, Hijimans C, Staal WG, Baird G, Bolton PF, Rutter ML, Weisblatt E, Green J, Aldred C, Wilkinson JA, Pickles A, Le Couteur A, Berney T, McConachie H, Bailey AJ, Francis K, Honeyman G, Hutchinson A, Parr JR, Wallace S, Monaco AP, Barnby G, Kobayashi K, Lamb JA, Sousa I, Sykes N, Cook EH, Guter SJ, Leventhal BL, Salt J, Lord C, Corsello C, Hus V, Weeks DE, Volkmar F, Tauber M, Fombonne E, Shih A. Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nat Genet. 2007;39(3):319–28. doi: 10.1038/ng1985. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Ting J, Ye Y, Thomas G, Ruczinski I, Pevsner J. Analysis and visualization of chromosomal abnormalities in SNP data with SNPscan. BMC Bioinformatics. 2006;7(1):25. doi: 10.1186/1471-2105-7-25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657–663. doi: 10.1093/bioinformatics/btl646. [DOI] [PubMed] [Google Scholar]
  46. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory. 1967;13(2):260–269. [Google Scholar]
  47. Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R. A method for calling gains and losses in array CGH data. Biostatistics. 2005;6(1):45–58. doi: 10.1093/biostatistics/kxh017. [DOI] [PubMed] [Google Scholar]
  48. Wang W, Carvalho B, Miller N, Pevsner J, Chakravarti A, Irizarry RA. Estimating genome-wide copy number using allele specific mixture models. Tech rep. Johns Hopkins University; 2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005;21(22):4084–91. doi: 10.1093/bioinformatics/bti677. [DOI] [PubMed] [Google Scholar]
  50. Zhao X, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004;64(9):3060–71. doi: 10.1158/0008-5472.can-03-3308. [DOI] [PubMed] [Google Scholar]
  51. Zhou X, Mok SC, Chen Z, Li Y, Wong DTW. Concurrent analysis of loss of heterozygosity (LOH) and copy number abnormality (CNA) for oral premalignancy progression using the Affymetrix 10K SNP mapping array. Hum Genet. 2004;115(4):327–30. doi: 10.1007/s00439-004-1163-1. [DOI] [PubMed] [Google Scholar]
  52. Zhou X, Rao NP, Cole SW, Mok SC, Chen Z, Wong DT. Progress in concurrent analysis of loss of heterozygosity and comparative genomic hybridization utilizing high density single nucleotide polymorphism arrays. Cancer Genet Cytogenet. 2005;159(1):53–7. doi: 10.1016/j.cancergencyto.2004.09.014. [DOI] [PubMed] [Google Scholar]
  53. Zlotogora J. Parents of children with autosomal recessive diseases are not always carriers of the respective mutant alleles. Hum Genet. 2004;114(6):521–6. doi: 10.1007/s00439-004-1105-y. [DOI] [PubMed] [Google Scholar]
