Author manuscript; available in PMC: 2015 Jan 1.
Published in final edited form as: Pac Symp Biocomput. 2014:3–14.

TUMOR HAPLOTYPE ASSEMBLY ALGORITHMS FOR CANCER GENOMICS

DEREK AGUIAR, WENDY SW WONG ‡,*, SORIN ISTRAIL †,*
PMCID: PMC4051221  NIHMSID: NIHMS544366  PMID: 24297529

Abstract

The growing availability of inexpensive high-throughput sequence data is enabling researchers to sequence tumor populations within a single individual at high coverage. But, cancer genome sequence evolution and mutational phenomena like driver mutations and gene fusions are difficult to investigate without first reconstructing tumor haplotype sequences. Haplotype assembly of single individual tumor populations is an exceedingly difficult task complicated by tumor haplotype heterogeneity, tumor or normal cell sequence contamination, polyploidy, and complex patterns of variation. While computational and experimental haplotype phasing of diploid genomes has seen much progress in recent years, haplotype assembly in cancer genomes remains uncharted territory.

In this work, we describe HapCompass-Tumor, a computational modeling and algorithmic framework for haplotype assembly of copy number variable cancer genomes containing haplotypes at different frequencies and complex variation. We extend our polyploid haplotype assembly model and present novel algorithms for (1) modeling complex variations, including copy number changes, as varying numbers of disjoint paths in an associated graph, (2) handling variable haplotype frequencies and contamination, and (3) computing tumor haplotypes using simple cycles of the compass graph, which constrain the space of haplotype assembly solutions. The model and algorithm are implemented in the software package HapCompass-Tumor, which is available for download from http://www.brown.edu/Research/Istrail_Lab/.

Keywords: haplotype assembly, haplotype phasing, tumor haplotypes

1. Introduction

Cancer is the worldwide leading cause of death and the second leading cause of death in the United States. Despite the tremendous amount of effort and resources spent on cancer research, our knowledge of the disease pathology is limited and the outlooks for certain types of cancer are usually dire. The commercialization of high-throughput sequencing platforms in the last decade has accelerated the growth of cancer genomics research dramatically. Since the first whole genome tumor sample was sequenced in 2008,1 there have been hundreds of studies on numerous cancer types.25 One of the fundamental computational challenges common to many of these studies is to separate the true driver mutation signal from the biological noise (e.g. passenger mutations) and experimental noise (e.g. sequencing errors). While it is possible to map sequence reads from tumor samples to a reference genome and call genomic variants, it is exceedingly difficult to determine the parental chromosome of origin for each variant allele – that is, the variant's phase. But, the chromosomal sequence of alleles, or haplotype, is important for elucidating genomic events critical to the understanding of cancer like gene fusions or driver mutations.

A theory for carcinogenesis formulated by Knudson in 1971 demonstrates the importance of haplotype phase in cancer.6 In the two-hit hypothesis, Knudson suggested that in order to cause cancer, at least two “hits” have to take place. The first “hit” is usually an inherited mutation, and the second “hit” is a somatic mutation in the same gene or a different gene in the same pathway occurring later in life and out of phase with the first mutation. Having the ability to reconstruct tumor haplotypes would enable the discovery of such compound heterozygous relationships between variants and enhance our ability to identify driver mutations.

The computational problem of haplotype assembly aims to compute the sequence of co-inherited variant alleles for each chromosome given a set of aligned sequence reads and variants.7,8 Haplotype assembly of diploid genomes has been addressed by many researchers9,10 and several haplotype assembly algorithms for diploid genomes are available for use.11,12 However, the methodologies for diploid haplotype assembly are unable to model polyploid genomes or complex copy number aberrations (CNA). Recently, we developed HapCompass-Polyploidy, the first modeling and algorithm for haplotype assembly in genomes with more than two sets of homologous chromosomes (polyploidy).13 The HapCompass-Polyploidy algorithm assembles pairs of variants in polyploid genomes and then produces a haplotype assembly consistent with the pairwise variant phasings.

Cancer genomes have many similarities with polyploid genomes but present additional complexities that current methodologies do not model. Sequencing reads sampled from cancer patients exhibit a mixture of normal diploid cells and heavily rearranged, aneuploid cells. This introduces two major complexities into the haplotype assembly model: (1) heavily rearranged or translocated chromosomes will exhibit changes in copy number and (2) the heterogeneous nature of tumor samples requires reconstruction of more than two haplotypes each with a sample frequency which biases sequence read coverage.

Before these complexities can be modeled, the spectrum of variation must be inferred. While early cancer research was focused on small variants such as single nucleotide variants (SNV) and indels in a single gene or a small set of genes, advances in technology have enabled us to study large structural variants such as CNAs and large chromosomal rearrangements in tumor genomes. Several recent studies on multiple tumor genomes have found the important role of these large structural variants in tumor development.3,4,14,15 In general, detection of cancer variation with sequencing data involves detecting those variants that are supported in the tumor genome but not found in the normal genome. The algorithms can be largely divided into three categories determined by the variant type they are trying to detect, i.e. small variants (SNVs and indels), CNAs and complex structural variants (translocations, duplications, and inversions).

Strelka jointly models the normal sample as a mixture of germline variation with noise, and the tumor sample as a mixture of the normal sample with somatic mutations, in a Bayesian framework.16 VarScan 2 also uses the sequence reads from tumor and normal cells simultaneously, but uses a one tailed Fisher's exact test to determine whether the variants are somatic, normal, or loss of heterozygosity.17 Control-FREEC not only uses the coverage information but also the read count frequencies to estimate CNAs in tumor samples.18 Control-FREEC also normalizes the tumor read depths by GC content and mappability and hence a normal genome is not required, although it could also be used for normalization.

Detection of large structural variations is often made possible by exploiting the properties of paired-end sequence reads. For example, the insert sizes of reads that are mapped to both sides of a large deletion would appear to have much larger insert sizes than the rest of the population. CREST first looks for a cluster of soft-clipped reads that exhibit evidence of a break point for a structural variant, and then locates the other break point by scanning the location neighboring the paired read.19 However, the accuracy of these methods can be seriously affected when there is contamination in the samples. Cibulskis et al developed a Bayesian model to estimate the level of cross-individual contamination in each sample.20 Contamination may also exist within an individual; tumor tissue can be contaminated with normal DNA and vice versa. Both incorrect variant calling as well as sequence contamination represent sources of complexity and errors for haplotype assembly.

In this work, we leverage the existing literature and tools for cancer genome variant inference and build on the polyploid HapCompass model to construct the first methodology for cancer genome haplotype assembly. In Section 2 we provide the necessary details of the HapCompass polyploid model and extensions for cancer genome haplotype assembly. The modeling section is followed by Section 3 which describes the HapCompass-Tumor algorithm and Section 4 which evaluates the implementation of the algorithm on cancer genome data. Finally, Sections 5 and 6 present a discussion of alternative models of cancer genome haplotype assembly, limitations and extensions to our model, future work, and conclusions.

2. Modeling

Let k be an integer representing the number of unique tumor haplotypes in a sample of tumor tissue. Because the tumor is actively evolving, this k may vary for independent samples of the same tumor. We assume that each sequence read is sampled from a single haploid fragment generated from one of the k haplotypes; this property enables the building of haplotype phase relationships between alleles in sequence reads that contain two or more heterozygous variants (homozygous variants do not provide phase information for assembly). The phase-informative sequence reads and variants are modeled with two graph structures termed the compass graph, GC, and chain graph, Gh. These data structures are described in Aguiar et al. 2013 but their definitions are repeated here in order to present the novel aspects of the model for tumor genomes.13

The compass graph GC(VC, EC) has a vertex v ∈ VC for each variant and an edge (vi, vj) ∈ EC if variants vi and vj are contained within the same sequence read. Edges (vi, vj) are annotated with the most likely haplotype phasing between variants vi and vj given the set of reads that contain both vi and vj (Figure 1).
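The compass graph definition can be sketched in a few lines of Python. This is a minimal illustration, not the HapCompass implementation: it assumes each read is a dict mapping a heterozygous variant index to the observed allele, and the function name `build_compass_graph` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_compass_graph(reads):
    """Return {(vi, vj): [(allele_i, allele_j), ...]} with vi < vj.

    Each read is a dict {variant_index: allele}; this representation is an
    assumption made for illustration only.
    """
    edges = defaultdict(list)
    for read in reads:
        # Only reads covering two or more variants carry phase information.
        for vi, vj in combinations(sorted(read), 2):
            edges[(vi, vj)].append((read[vi], read[vj]))
    return dict(edges)

reads = [{0: 0, 1: 0}, {0: 1, 1: 1}, {1: 1, 2: 0}, {3: 1}]
graph = build_compass_graph(reads)
```

The read covering only variant 3 contributes no edge, and each edge keeps the allele pairs observed on it so that a phasing can later be chosen for that edge.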

Fig. 1. An example tumor sample GC with three unique haplotypes. Vertices are variants and edges show pairwise haplotype assemblies.

2.1. Phasing edges of GC

Given the probability of sequencing error, $s_e$, and a set of reads overlapping the two variants, $r_1, \ldots, r_n$, the likelihood of a particular phasing $p_p$ from the set $P$ of all phasings between the two variants can be computed as in Equation 1. Edges of GC are phased by choosing the $p_p$ that maximizes this likelihood.

$$L(p_p \mid s_e, r_1, r_2, \ldots, r_n) = \frac{P(r_1 \mid s_e, p_p) \cdots P(r_n \mid s_e, p_p)}{\sum_{i=1}^{|P|} P(r_1, r_2, \ldots, r_n \mid s_e, p_i)} \qquad (1)$$
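As a hedged illustration of Equation 1, the sketch below scores candidate phasings of a single edge. The per-read model is an assumption: a read is drawn from one of the haplotypes with equal probability, and each allele is read correctly with probability 1 − se. The names `read_prob` and `phasing_likelihoods` are hypothetical.

```python
from math import prod

def read_prob(read, phasing, se):
    """P(read | se, phasing), haplotypes in equal proportion (assumed model)."""
    per_hap = [
        prod((1 - se) if a == b else se for a, b in zip(read, hap))
        for hap in phasing
    ]
    return sum(per_hap) / len(phasing)

def phasing_likelihoods(reads, phasings, se):
    """Normalized likelihood of each candidate phasing, as in Equation 1."""
    # Reads are treated as independent, so the joint is a product over reads.
    joint = [prod(read_prob(r, p, se) for r in reads) for p in phasings]
    total = sum(joint)
    return [j / total for j in joint]

reads = ["00", "11", "00", "11", "01"]
phasings = [("00", "11"), ("01", "10")]
likelihoods = phasing_likelihoods(reads, phasings, se=0.01)
best = phasings[likelihoods.index(max(likelihoods))]
```

With four reads supporting 00/11 and one supporting 01/10, the phasing ("00", "11") receives almost all of the normalized likelihood.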

Equation 1 models haplotypes that are in equal proportion, which may not be true for heterogeneous tumor samples. Thus, the likelihood must be modified to accommodate the different frequencies of haplotypes. Consider the normal haplotype contamination that is often present in tumor sequence samples. Contamination may be modeled by jointly assembling the k tumor haplotypes with two low-frequency normal haplotypes. Therefore, the probability of a haplotype $h$ with frequency $f_h$ in the phased haplotypes of a pair of variants can be expressed as $p(h \mid s_e, p) = \sum_{h \in p} f_h F(s_e, p, h)$, where $F$ is a function that takes the sequencing error probability $s_e$, the set $p$ of all haplotypes for the two-variant phasing, and the particular haplotype $h$, and computes the probability of generating a read containing haplotype $h$.

For example, assume the three haplotypes 00, 00, and 11 exist between two variants and one of the 00 haplotypes was considered contamination at frequency 10%. If the other two haplotypes were in equal proportions, then

$$P(00 \mid s_e, \{00, 00, 11\}) \propto F(s_e, \{00, 00, 11\}, 00) = (1 - s_e)^2 \cdot 0.1 + (1 - s_e)^2 \cdot 0.45 + (s_e)^2 \cdot 0.45 \qquad (2)$$
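The worked example in Equation 2 can be checked numerically. The sketch below is illustrative only: `read_prob_weighted` is a hypothetical helper that sums, over the haplotypes in the phasing, the haplotype frequency times the probability of observing the read from that haplotype under independent per-allele errors.

```python
def read_prob_weighted(read, haplotypes, freqs, se):
    """Frequency-weighted probability of observing `read` (assumed model)."""
    total = 0.0
    for hap, freq in zip(haplotypes, freqs):
        p = 1.0
        for a, b in zip(read, hap):
            # A matching allele is read correctly; a mismatch needs an error.
            p *= (1 - se) if a == b else se
        total += freq * p
    return total

se = 0.01
# One contaminating 00 haplotype at 10%, the other two haplotypes at 45% each.
value = read_prob_weighted("00", ["00", "00", "11"], [0.1, 0.45, 0.45], se)
expected = (1 - se) ** 2 * 0.1 + (1 - se) ** 2 * 0.45 + se ** 2 * 0.45
```

Here `value` agrees with the right-hand side of Equation 2 up to floating-point rounding.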

The number of unique phasings of an edge depends on the number of unique tumor haplotypes k and the allele content of the variant pair. Let the number of 1 alleles for variants vi and vj be l(vi) and l(vj), respectively. Then, the number of phasings of an edge is upper bounded by $\min\left(\binom{k}{l(v_i)}, \binom{k}{l(v_j)}\right)$. This is a bound and not an equality because some phasings may be repeated in this enumeration.
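A small Python sketch makes the bound concrete and shows why it need not be tight. Both function names are hypothetical, and the enumeration treats a phasing as an unordered multiset of two-allele haplotypes, which is an assumption about how repeated phasings are counted.

```python
from itertools import permutations
from math import comb

def phasing_bound(k, ones_i, ones_j):
    """Upper bound min(C(k, l(vi)), C(k, l(vj))) on the phasings of an edge."""
    return min(comb(k, ones_i), comb(k, ones_j))

def distinct_phasings(k, ones_i, ones_j):
    """Enumerate phasings as unordered multisets of (allele_i, allele_j)."""
    alleles_i = [1] * ones_i + [0] * (k - ones_i)
    alleles_j = [1] * ones_j + [0] * (k - ones_j)
    found = set()
    for perm in permutations(alleles_j):
        # Sorting removes the arbitrary ordering of the k haplotypes.
        found.add(tuple(sorted(zip(alleles_i, perm))))
    return found

bound = phasing_bound(3, 1, 2)
phasings = distinct_phasings(3, 1, 2)
```

For k = 3 with one 1 allele at vi and two at vj, the bound is 3 but only two distinct phasings exist: {00, 01, 11} and {01, 01, 10}.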

2.2. Chain graph

Haplotype phasings of the edges of GC can be extended to paths. Because two adjacent edges share a variant, haplotypes with the same allele can be merged on the shared vertex. If two paths in GC of length i and j vertices are merged, the new phasing will have i + j – 1 variants.
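Merging on a shared vertex can be sketched as follows. This is an illustrative brute-force version, not the HapCompass procedure; it assumes haplotypes are allele strings, that the left phasing ends at the shared variant, and that the right phasing starts at it. The name `merge_phasings` is hypothetical.

```python
from itertools import permutations

def merge_phasings(left, right):
    """Merge two phasings that share a variant.

    `left` haplotypes end at the shared variant and `right` haplotypes start
    at it; each is a list of k allele strings. Returns the set of distinct
    merged phasings, each a sorted tuple of haplotype strings.
    """
    merged = set()
    for perm in permutations(right):
        # Haplotypes may only be joined when they agree on the shared allele.
        if all(l[-1] == r[0] for l, r in zip(left, perm)):
            merged.add(tuple(sorted(l + r[1:] for l, r in zip(left, perm))))
    return merged

diploid = merge_phasings(["01", "10"], ["11", "00"])
triploid = merge_phasings(["00", "01", "11"], ["00", "10", "11"])
```

Merging two 2-variant phasings yields 2 + 2 − 1 = 3 variants per haplotype. The diploid example has a single merge, while the triploid example has two, because two haplotypes carry the same allele at the shared variant.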

For paths or trees in GC, there is exactly one consistent haplotype phasing, with respect to the edge phasings along the path or tree, for genomes with k = 2, and at least one for genomes with k > 2. In contrast, simple cycles in GC may be either conflicting or non-conflicting depending on how many phasings are consistent with the cycle. A conflicting cycle does not have a consistent phasing while a non-conflicting cycle has at least one. The chain graph Gh is constructed for each simple cycle to determine its conflicting state.13

The chain graph Gh(Vh, Eh) is constructed for a path or simple cycle c = ((v1, v2), ..., (vs–1, vs), (vs, v1)) in the compass graph GC. We introduce k haplotype vertices corresponding to the phasing for each edge (vi, vj) in the path or cycle. Vertices in Gh created from adjacent edges of GC share a variant; edges connect vertices in Gh if they share a variant and allele. Then, source nodes s1, ..., sk are arbitrarily assigned to vertices at level 1 and sink nodes t1, ..., tk are assigned to vertices at level s if the level s vertex shares an allele with the level 1 vertex. Vertices are annotated with ti if there exists at least one si to ti path, which is computed by a depth-first search from each source. Gh can be described as a trellis graph in which the vertices can be divided into levels; each level in this case corresponds to an edge of GC. Trellis graphs have a wide range of applications, including communication network topology and survivability, encryption, encoding and decoding, and are a central data structure in Markov models.

2.3. Disjoint si–ti paths in the trellis graph Gh

We now present new results on the theoretical properties of this graph and extensions to phasing the entire compass graph. A valid phasing of a path of compass graph edges e1,2, ..., es–1,s is defined as k vertex-disjoint paths from level 1 to level s in the corresponding Gh. A valid phasing of a cycle of compass graph edges e1,2, ..., es,1 is defined as k vertex-disjoint paths from each source si to its corresponding sink ti in the corresponding Gh. There always exists at least one phasing for paths of GC by definition of Gh; cycles may not exhibit a valid phasing (Lemma 2.1).

Lemma 2.1. There exists at least one valid phasing of k haplotypes for a cycle c if and only if there exists a valid matching between sink node annotation and chain graph nodes at each level of Gh.

Proof. If: Adjacent edges share a variant and thus the number of x alleles at level i must equal the number of x alleles at level i + 1, where x is any allele of the shared variant. If there is a matching at levels i and i + 1, then there must exist an edge between valid haplotype phase nodes because they share a common allele (adjacent levels). One can extend a valid haplotype phasing path from level i to i + 1 using the edge generated by the shared allele. Only-if: Assume one level does not have a valid matching; then, either (1) at least two haplotypes share a phased haplotype node or (2) at least one phased haplotype node contains no sink node annotation. Case (1): multiple haplotype paths must share a phased haplotype node, which breaks the vertex disjointness condition. Case (2): each level has exactly k nodes, each of which must be taken once. If one or more phased haplotype nodes contain no sink annotation, then at least one phased haplotype node must be shared by 2 or more haplotype paths, which breaks vertex disjointness.

We will use this property of Gh later in the computation of the tumor haplotype phasing.
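To make the notion of conflicting and non-conflicting cycles concrete, the following sketch enumerates valid cycle phasings by brute force. It is not the matching-based test of Lemma 2.1 and its running time is exponential in k and the cycle length; it only assumes that each level lists k two-allele haplotype strings and that consecutive levels share a variant. The name `cycle_phasings` is hypothetical.

```python
from itertools import permutations

def cycle_phasings(levels):
    """Enumerate valid phasings of a simple cycle by brute force.

    `levels[i]` lists the k two-allele haplotypes phased on the i-th cycle
    edge; the second allele of level i and the first allele of level i + 1
    belong to the shared variant, and the last level wraps to the first.
    """
    results = set()

    def extend(paths, depth):
        if depth == len(levels):
            # Close the cycle: each path must return to its starting allele.
            if all(p[-1] == p[0] for p in paths):
                results.add(tuple(sorted(p[:-1] for p in paths)))
            return
        for perm in set(permutations(levels[depth])):
            # Every level node is used exactly once (vertex-disjoint paths).
            if all(p[-1] == node[0] for p, node in zip(paths, perm)):
                extend([p + node[1] for p, node in zip(paths, perm)], depth + 1)

    extend(list(levels[0]), 1)
    return results

consistent = cycle_phasings([["00", "11"], ["00", "11"], ["00", "11"]])
conflicting = cycle_phasings([["00", "11"], ["00", "11"], ["01", "10"]])
```

The first cycle admits the phasing {000, 111}; in the second, the last edge contradicts the other two, so no set of k disjoint paths closes the cycle and the result is empty.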

2.4. Copy number aberrations and translocations in Gh

The chain graph and disjoint paths framework accommodates modeling the types of variation typical of tumor genomes (Figure 2). CNAs insert or remove large genomic regions. Genomic deletions are modeled as an edge connecting the variants flanking the deletion breakpoint. In this case, the model still expects the computation of k disjoint paths spanning the deletion. Large insertions of genetic material can be modeled as the addition of a temporary path in between or potentially overlapping vertices of Gh. The number of disjoint paths in this case changes to k + 1. Translocations may be modeled in Gh by combining deletions and insertions.

Fig. 2. Deletions and insertions are modeled with disjoint paths. The green edge models a deletion which effectively removes the deleted variants in the chain graph. The blue node insertion adds an extra path in Gh.

2.5. Disjoint subgraphs in the general chain graph

The general chain graph Gg is our final graph structure for representing the overall phasing of tumor genomes. Because there may be many matchings at each level of Gh, haplotype assembly of non-conflicting cycles in GC will yield a set of potential phasings. The haplotype phasings of Gh constrain the haplotype assembly to include one of the k disjoint path solutions.

Gg is built from the conflict-free spanning tree cycle basis of GC. The vertices of Gg are constructed in a similar manner as Gh; each edge (vi, vj) of GC generates a vertex for each haplotype in the phasing of (vi, vj). Each Gh constructed from a non-conflicting cycle of GC defines a set of edge adjacencies; these adjacencies are represented in Gg. Therefore, if two edges are adjacent in a Gh, then they are also adjacent in Gg. Because of Lemma 2.1, we can determine the number of disjoint path solutions passing through adjacent levels i and j by simply computing the valid extensions of matchings from level i to j. We assume each of the $l$ valid extensions of the sets of matchings at adjacent levels $e_i$ and $e_j$ is equally likely. Then, the weight of a particular extension, $w(e_i) w(e_j) / l$, where $w(e_i)$ is the score or likelihood of edge $e_i$, is added to the edges of Gh (and Gg).
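The weighting rule can be sketched as a simple accumulation. The representation is assumed for illustration: an extension is a list of edges between haplotype nodes of two adjacent levels, the node labels are arbitrary strings, and `add_extension_weights` is a hypothetical helper.

```python
from collections import defaultdict

def add_extension_weights(gg_weights, extensions, w_i, w_j):
    """Add w(e_i) * w(e_j) / l to every edge used by each of l extensions."""
    share = w_i * w_j / len(extensions)
    for extension in extensions:
        for edge in extension:
            gg_weights[edge] += share
    return gg_weights

# Two equally likely extensions between the levels of edges e_i and e_j.
extensions = [
    [("a0", "b0"), ("a1", "b1"), ("a2", "b2")],
    [("a0", "b0"), ("a1", "b2"), ("a2", "b1")],
]
weights = add_extension_weights(defaultdict(float), extensions, w_i=1.0, w_j=1.0)
```

With both edge likelihoods equal to 1, the edge used by both extensions accumulates weight 1.0 and each edge used by only one extension accumulates 0.5.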

However, unlike Gh, Gg is not necessarily a trellis graph if the cycles in the basis do not agree on the ordering of edge adjacencies (Figure 3). If Gg were a tree, finding a phasing could be modeled as packing disjoint Steiner trees or disjoint spanning trees. Instead, we model the computation of the tumor haplotype assembly as the k-maximum weight node-disjoint spanning tree problem. That is, we compute a set of k node-disjoint (within levels) spanning trees in Gg whose total weight is maximum over all k node-disjoint spanning trees and includes every vertex in Gg.

Fig. 3. (Left) An example tumor genome GC with three non-conflicting cycles. Dashed lines represent edges not in the spanning tree of GC. The inclusion of each non-tree edge creates a cycle in the cycle basis of GC. The two inner cycles ((v0, v1), (v1, v3), (v3, v0)) and ((v0, v2), (v2, v3), (v3, v0)) create the red-edge adjacencies in Gg (right). Computing the haplotype assembly of a tree (Gg with just the red edges) is simple. However, if the blue non-tree edge is added, the edge adjacency ((v0, v1), (v0, v2)) is included in Gg, creating a cycle.

3. HapCompass-Tumor Algorithm

HapCompass-Tumor optimizes the minimum weighted edge removal (MWER) problem. MWER aims to compute a set of edges L of minimum weight, whose removal resolves all conflicting cycles of GC. After all conflicting cycles have been removed, each non-conflicting cycle's Gh is added to Gg. Gg represents the constrained solution space by incorporating the valid haplotype assemblies on subsets of variants computed from each non-conflicting Gh (Algorithm 1).

Algorithm 1. HapCompass-Tumor

input : Sequence reads, variant calls, and number of distinct haplotypes k
output: k haplotypes
GC ← spanning tree cycle basis
CC ← set of conflicting simple cycles with respect to GC
for cC ∈ CC do
        Remove edge with smallest likelihood in cC
        Reconstruct GC
end
Compute Gg
CN ← set of non-conflicting simple cycles with respect to GC
for cN ∈ CN do
        Compute Gh with respect to cN
        Compute matchings at each level of Gh
        Compute disjoint paths of Gh
        Increase the weight of each edge e between levels shared by Gh and Gg in Gg
        proportional to the number of disjoint paths using edge e and the likelihood of
        each edge (Equations 1 and 2)
end
Compute a maximum weight spanning tree of the adjacencies in Gg
Output the haplotype assembly computed from the spanning tree of Gg

The final step involves computing k spanning trees in Gg which are node disjoint with respect to haplotype level vertices. Adjacencies between levels in Gg correspond to matchings between the haplotype nodes (Figure 4 right). So, HapCompass-Tumor computes k disjoint spanning trees corresponding to the k tumor haplotypes. We have implemented two algorithms inspired by Kruskal's and Prim's algorithms for computing maximum spanning trees. The principal difference between the two algorithms in the context of HapCompass is that the Kruskal-like algorithm focuses on constructing disjoint trees by including strong phasings on the same haplotype (edges of Gg), while the Prim-like algorithm phases all haplotypes between two levels at a time (vertices of Gg).
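For reference, the textbook Kruskal procedure for a single maximum weight spanning tree is sketched below. This is the standard algorithm that inspired the Kruskal-like variant, not the HapCompass-Tumor step itself, which additionally keeps k trees node disjoint within levels. The function name and the `(weight, u, v)` edge format are assumptions.

```python
def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm on edges given as (weight, u, v) tuples."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Visit the heaviest edges first to maximize total weight.
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

edges = [(1.0, 0, 1), (0.5, 1, 2), (0.9, 0, 2), (0.2, 2, 3)]
tree = maximum_spanning_tree(4, edges)
```

In this toy graph the edge of weight 0.5 is skipped because its endpoints are already connected through heavier edges.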

Fig. 4. (Left) Chain graphs (Gh) from the compass graph in Figure 1. The levels corresponding to edges in GC are denoted by black (non-tree edges) and blue (spanning tree edges) lettering above the vertices. In this example, the edge phasing probabilities in GC are all 1. So, an edge connecting level i to level j which is in b disjoint path solutions will receive a weight of b/d if there are d unique disjoint path solutions from level i to level j. The weights of edges calculated from disjoint si–ti paths in each Gh are added to Gg (right).

We illustrate the modeling and algorithm with a series of examples. Let the compass graph GC of a tumor sample with three unique haplotypes be shown in Figure 1. Then, if (v0, v3), (v2, v1), and (v3, v2) are the non-tree edges of GC, the chain graphs in Figure 4 (left) are constructed. Figure 4 (right) shows the Gg updated after the disjoint paths and weights of edges in Gh are computed and distributed to Gg.

4. Results

We implemented HapCompass-Tumor and evaluated its performance on simulated tumor haplotypes. In these experiments we use insert size as a proxy for the computed haplotype length. It has been shown that the dominant factor in producing long haplotype assemblies is the length between the read pairs.13,21 Briefly, if the length between two variants is x and the insert size is y, then a sequence read can never span the two variants if x > y.

4.1. Dependence on insert size and error rates

Using the sequence for the BRCA1 breast cancer susceptibility gene, we simulated three hypervariable tumor haplotypes. Distances between variants were normally distributed, N(500, 50). The following procedure was repeated 250 times for each data point in Figure 5. Given the set of variants, which remained fixed for each experiment, a random phasing was computed that is consistent with the allele distributions. We then sampled 10000 phase-informative simulated reads from the true haplotypes and computed the average edit distance between assembled and true haplotypes. We compared the distance of haplotype assemblies for the randomly generated triploid BRCA1 genes while varying sequence read insert size, standard deviation of insert size, and single base substitution error rate.
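One way such a distance between assembled and true haplotypes might be computed is sketched below. It is an assumption-laden illustration: it uses Hamming distance on equal-length allele strings as a stand-in for edit distance, and it takes the minimum over all pairings because assembled haplotypes carry no labels. The function names are hypothetical.

```python
from itertools import permutations

def hamming(a, b):
    """Number of positions at which two equal-length allele strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assembly_distance(assembled, truth):
    """Smallest total Hamming distance over all pairings of haplotypes."""
    return min(
        sum(hamming(a, t) for a, t in zip(assembled, perm))
        for perm in permutations(truth)
    )

truth = ["0011", "1100", "0101"]
assembled = ["1100", "0111", "0011"]
distance = assembly_distance(assembled, truth)
```

Trying every pairing is only practical for small k, such as the three haplotypes simulated here.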

Fig. 5. (Left) The average edit distance between haplotypes and the simulated true haplotypes is calculated with a fixed coverage and varying insert sizes, error rates (error), and standard deviations (std). (Right) Haplotype assembly accuracy is plotted as a function of the number of tumor haplotypes in the sample.

Figure 5 (left) demonstrates several interesting trends. First, as the insert size is increased the haplotype assemblies become more accurate. Second, the more variable the insert length, the more accurate the haplotype assembly. A hyper variable insert length appears to have a similar effect as increasing the insert size. These findings confirm patterns observed in conventional diploid haplotype assembly. Finally, while the error rate does affect haplotype assembly accuracy, as long as the error rate is less than 0.2%, the haplotype assemblies are similar in quality. This phenomenon is likely caused by the constant coverage coupled with uncertainty in phasing the edges of GC. When the coverage is fixed and the insert sizes are short, haplotype assemblies are smaller but more accurate. Conversely, when error rates reach a threshold where edge phasings are no longer accurately called, the haplotype assembly quality suffers.

4.2. Cancer genome heterogeneity

We also compared the accuracy of haplotype assembly in terms of tumor genome heterogeneity (Figure 5 right). Sequencing parameters were fixed to produce insert sizes between 500 and 2500, short insert size standard deviations, 10000 sequence reads, and no errors. Each data point contains the average of 250 haplotype assembly edit distances. The more unique tumor haplotypes in the sample the less accurate the solution. The increasing edit distance with 5 unique haplotypes between insert sizes 2000 and 2500 is likely an effect of the rising uncertainty of edge phasings when coverage is kept fixed and more edges are being generated in GC.

4.3. NA12878

We simulated paired tumor sequence reads and their mappings with Enhanced Artificial Genome Engine (EAGLE) developed by Illumina Cambridge Ltd (personal communications). The sequencing parameters were set to model paired-end Illumina data with 101bp read lengths and a mixture of long (length=N(60000, 1412)) and short (empirical distribution from 2 × 101 runs, with median size ~ 300bp) fragment sizes. The variants simulated include SNV and indels called in NA12878 by the Genome in a Bottle Consortium22 and the HCC1187 tumor sample (downloaded from Illumina's Basespace23). Variants were combined then randomly divided into two sets for each homologous chromosome, with 30X coverage for the first chromosome and 15X coverage for the second to simulate tumor genome amplification. Sequence reads were mapped to their simulated location after single base mismatches were introduced according to empirical error rates.

We evaluated HapCompass-Tumor on all autosomes of the EAGLE simulated data and longer reads simulated using HapCompass. The reads simulated from HapCompass include medium (200bp) and long (2000bp) read lengths with error rates of 2% and 5% respectively to model the higher error rates associated with long-read high-throughput sequence technologies. We used the number of allele bit flips required to map the sequence reads to the assembled haplotypes as the evaluation metric. Table 1 shows the results for HapCompass-Tumor using the Kruskal-like and Prim-like algorithms for resolving Gg. Additionally, we implemented a scoring scheme that scores pairs of vertices with more diversity in haplotype sequence higher (termed Diverse in Table 1). This scheme is designed to limit uninformative pairs of vertices in the spanning tree of the compass graph GC.
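The bit-flip evaluation metric can be illustrated with a short sketch. The details are assumptions: each read is a dict from variant index to allele character, each read is assigned to the haplotype requiring the fewest flips, and the error is the number of flips divided by the number of read alleles. The name `flip_error` is hypothetical.

```python
def flip_error(reads, haplotypes):
    """Proportion of read alleles flipped to map each read to a haplotype."""
    flips = 0
    alleles = 0
    for read in reads:  # read: {variant_index: allele}
        # Assign the read to whichever haplotype needs the fewest flips.
        flips += min(
            sum(hap[v] != a for v, a in read.items()) for hap in haplotypes
        )
        alleles += len(read)
    return flips / alleles

haplotypes = ["0011", "1100"]
reads = [
    {0: "0", 1: "0"},
    {1: "1", 2: "0", 3: "0"},
    {2: "1", 3: "0"},
]
error = flip_error(reads, haplotypes)
```

Only the third read disagrees with both haplotypes, by one allele, so one of the seven read alleles must be flipped.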

Table 1. The proportion of incorrectly mapped alleles (error) by Gg resolution algorithm. Sequence data was simulated for 1000 Genomes Project individual NA12878 using EAGLE to simulate Illumina reads and HapCompass to simulate reads with medium (200bp, 2% error rate) and long (2000bp, 5% error rate) read lengths.

Gg resolution   | error (autosomes, EAGLE) | error (chr20, 200bp) | error (chr20, 2000bp)
Kruskal         | 0.002658                 | 0.02079              | 0.04626
Kruskal Diverse | 0.002659                 | 0.02071              | 0.04679
Prim            | 0.002659                 | 0.02789              | 0.05639
Prim Diverse    | 0.002659                 | 0.02631              | 0.05867

Table 1 demonstrates that the accuracy of the haplotype assembly depends minimally on the selection of algorithm when using Illumina-like sequencing parameters. However, as the read length increases, the Kruskal-like algorithm becomes favorable.

5. Discussion

Opportunities exist to extend HapCompass-Tumor to address some of the limitations in the current model. First, HapCompass-Tumor only computes a single solution when the compass graph model allows computation of suboptimal solutions. Phase extension in Gg is deterministic but many highly probable suboptimal solutions may exist. As long as the number of alternative disjoint paths is bounded by a low degree polynomial, we can carry these partial solutions to the assembly step and report multiple haplotype assemblies.

Second, incorporating a priori knowledge of haplotype distributions from population samples or long read lengths would improve the assembly. For example, we assumed each valid haplotype phasing for a cycle in GC is equally likely. However, this assumption can be easily modified to accommodate known haplotype likelihoods in the area (e.g. linkage disequilibrium). Consider a collection of valid disjoint paths for a cycle in GC; if the probability of both phasings is 1 and the edge extension has i distinct matchings, then each matching is given a weight of 1/i. If, however, one of the haplotypes in an extension is never observed in the population, HapCompass-Tumor could penalize the extension.

A related application of HapCompass-Tumor is in cancer panomics. Much attention in cancer research has been focused on allelic specific expression (ASE). Studies have shown that germline ASE is associated with cancer risk;24,25 and somatic ASE is associated with tumor development.26 ASE in cancer was found not only correlated with CNAs,26 but also with allelic specific methylation (ASM).27 Existing algorithms for detecting ASE with RNA-seq and detecting ASM with Bisulfite-Seq do not usually make use of phased genotype information.26,28 We therefore propose using the phased haplotypes from whole genome sequencing of tumor samples as a reference for RNA-seq and Bisulfite-seq alignment when such data is available.

Finally, the viral quasispecies reconstruction (VQR) problem aims to compute the spectrum of viral quasispecies haplotypes from the sequence reads of a heterogeneous viral sample. The problems of haplotype assembly and VQR are similar but the research literature is largely independent due to the inability of haplotype assembly algorithms to model more than two sets of homologous haplotypes. However, it is possible to model VQR with HapCompass-Tumor by leaving the number of haplotypes in the sample (k) as an unknown parameter. Two possible approaches include inferring the number of quasispecies a priori and then performing haplotype assembly with k unique haplotypes, or computing assemblies for a number of different k and comparing the quasispecies solutions. However, using a general haplotype assembly tool for VQR does not take advantage of two critical properties of most viral genomes: (1) the phylogenetic relationships between mutations are known for well-studied viral genomes, especially those under selective pressures from treatment, and (2) the genomes are many orders of magnitude smaller than those of eukaryotes.

6. Conclusions

In this work, we developed algorithms and models for tumor genome assembly building on our existing haplotype assembly framework HapCompass. We demonstrated how to model tumor haplotype heterogeneity and haplotypes containing CNAs and translocations. The HapCompass-Tumor algorithm was presented using the combined evidence of cycles in GC and disjoint paths in Gh to inform which haplotype assemblies in Gg are probable. Finally, we evaluated the HapCompass-Tumor algorithm on simulated cancer data showing that, while the accuracy is a function of many parameters including the level of cancer genome heterogeneity, we are still able to produce accurate haplotype assemblies. HapCompass-Tumor is available for download from http://www.brown.edu/Research/Istrail_Lab/.

Acknowledgements

We thank Lilian Janin and Anthony Cox at Illumina Cambridge Ltd for sharing and helping us with the EAGLE simulator. This work was supported by the National Science Foundation [1048831 and 1321000 to S.I.].
