Briefings in Bioinformatics. 2014 Feb 19;16(2):232–241. doi: 10.1093/bib/bbu002

An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples

Vinod Kumar Yadav, Subhajyoti De
PMCID: PMC4794615  PMID: 24562872

Abstract

Solid tumor samples typically contain multiple distinct clonal populations of cancer cells, as well as contaminating stromal and immune cells. The majority of cancer genomics and transcriptomics studies do not explicitly consider genetic heterogeneity and impurity, and draw inferences based on mixed populations of cells. Deconvolution of genomic data from heterogeneous samples provides a powerful tool to address this limitation. We discuss several computational tools that enable deconvolution of genomic and transcriptomic data from heterogeneous samples, and we present a systematic comparative assessment of these tools. If properly used, these tools have the potential to complement single-cell genomics and immunoFISH analyses, and to provide novel insights into tumor heterogeneity.

Keywords: tumor purity and heterogeneity, mixed cell population, deconvolution, software

INTRODUCTION

Normal solid tissues of humans and other eukaryotes comprise several different types of cells. These cell types have distinct transcriptomic profiles [1], and they might also carry lineage-specific somatic mutations (e.g. in the case of somatic mosaicism [2]). Compared with normal solid tissue, solid tumors show even greater structural complexity owing to the presence of distinct clonal populations of tumor cells, contamination by normal stromal and immune cells (Figure 1), and a gross breakdown of tissue-level architecture [3]. Clonal heterogeneity presents complex problems when analysing tumor samples. For instance, we found that within a breast tumor sample, two clonal populations of cells accumulated different sets of driver mutations [4]; detection of either clone in isolation could bias pathological classification, prognosis and potentially treatment. Similar observations have been reported by other studies [3, 5, 6]. Heterogeneity and impurity in tumor samples pose considerable challenges for downstream genomic and transcriptomic analyses, and increase the risk of incorrect inferences [1, 6, 7]. For instance, using genomic data derived from a mixed cell population, it is difficult to ascertain whether two cancer gene mutations were present in the same tumor cells or arose in two distinct clonal populations within the same tumor. Similarly, if a gene is expressed at different levels in two distinct populations of cells in the same sample, this might not be apparent from tissue-averaged gene expression data. Until recently, the majority of genome and transcriptome studies used mixed populations of cells as the starting material, thereby overlooking the issues arising from tumor heterogeneity.

Figure 1:

Estimation of purity and clonality using deconvolution of genomic and transcriptomic data from heterogeneous tumor tissue samples.

Single-cell genomic and transcriptomic studies have highlighted the limitations of naïve tissue-averaged analyses [5, 6, 8]. Multiple single-cell-based techniques exist to observe variation at the resolution of individual cells. Laser-capture microdissection and single-cell genomic analysis are currently used to analyse relatively pure populations of cells and to avoid the problems arising from heterogeneity. If cells in the sample are in suspension and present suitable biomarkers, cell-sorting methods can be used to isolate the cells of interest. For instance, using a combination of flow-sorted nuclei, whole-genome amplification and sequencing, Navin et al. showed patterns of clonal hierarchy in tumor samples [5]. Recently, Moignard et al. used single-cell gene expression analysis to characterize the transcriptional network of hematopoietic stem and progenitor cells [9]. Bumgarner et al. used single-cell analysis to identify the contribution of non-coding RNA to clonal heterogeneity through modulation of transcription factor recruitment [10]. Using immunoFISH and immunofluorescence techniques, we have identified major evolutionary pathways in breast cancer (BRCA) [4]. These experimental approaches have been summarized elsewhere [3]. Nevertheless, single-cell-based techniques are currently expensive, low throughput and require specialized resources.

An alternative way to address the limitation discussed above is deconvolution of genomic and transcriptomic data from mixed cell populations. Using the statistical properties of mixed populations, it is possible to probabilistically reconstruct the proportions of different cell populations and their genomic and transcriptomic profiles (Figure 1). Several computational methods have been developed for deconvolution of heterogeneous tissue samples, initially using microarray data and, more recently, sequencing-based genomic and transcriptomic datasets. Despite their utility, until recently these methods were not routinely used [1]. Here we review these methods, in particular those published in the last few years (Table 1). Below, we first outline the statistical properties of mixed cell populations, and then discuss the different approaches adopted by these methods (Table 2) and their broad applications.

Table 1:

A summary of the current, and commonly available software for deconvolution of mixed tissue genomic and transcriptomic data, and their sources

Software | Year published | Reference | Source
CompMix | 2004 | Ghosh [15] | http://www.sph.umich.edu/~ghoshd/COMPBIO/COMPMIX/
dChipSNP | 2008 | Li et al. [34] | http://biosun1.harvard.edu/complab/dchip/snp/
ISOLATE | 2009 | Quon and Morris [27] | http://morrislab.med.utoronto.ca/software
Dsection | 2010 | Erkkila et al. [19] | http://informatics.systemsbiology.net/DSection
csSAM | 2010 | Shen-Orr et al. [20] | http://www.nature.com/nmeth/journal/v7/n4/extref/nmeth.1439-S2.zip
mixture_estimation.R | 2010 | Clarke et al. [23] | http://medicine.med.miami.edu/statistical-theory/jennifer-clarke
ASCAT | 2010 | Loo et al. [35] | http://www.ifi.uio.no/bioinf/Projects/ASCAT
PERT | 2012 | Qiao et al. [21] | http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002838
JointSNVMix1, JointSNVMix2 | 2012 | Roth et al. [37] | http://compbio.bccrc.ca/software/jointsnvmix/
PurityEst | 2012 | Su et al. [36] | http://odin.mdacc.tmc.edu/~xsu1/PurityEst.html
ABSOLUTE | 2012 | Carter et al. [32] | https://confluence.broadinstitute.org/display/CGATools/ABSOLUTE
CNAnorm | 2012 | Gusnanto et al. [41] | http://www.precancer.leeds.ac.uk/cnanorm
DeMix | 2013 | Ahn et al. [11] | http://odin.mdacc.tmc.edu/~wwang7/DeMix.html
PurBayes | 2013 | Larson and Fridley [38] | http://cran.r-project.org/package=PurBayes
DeconRNASeq | 2013 | Gong and Szustakowski [28] | http://master.bioconductor.org/packages/release/bioc/html/DeconRNASeq.html
TEMT | 2013 | Li and Xie [30] | https://github.com/uci-cbcl/TEMT
ESTIMATE | 2013 | Yoshihara et al. [31] | https://sourceforge.net/projects/estimateproject/
MuTect | 2013 | Cibulskis et al. [43] | http://www.broadinstitute.org/cancer/cga/mutect
TrAp | 2013 | Strino et al. [47] | http://sourceforge.net/projects/klugerlab/files/TrAp/
N/A | 2013 | Seo et al. [49] | http://www.broadinstitute.org/
ExPANdS | 2013 | Andor et al. [40] | http://cran.r-project.org/web/packages/expands
Virmid | 2013 | Kim et al. [42] | http://sourceforge.net/projects/virmid/
THetA | 2013 | Oesper et al. [39] | http://compbio.cs.brown.edu/software/

The software are sorted according to their year of publication.

Table 2:

A description of the methods adopted by the commonly available software for deconvolution of mixed tissue genomic and transcriptomic data

Software | Genomic data used | Compatible with | Methods | Platform | Input data required | Key features
CompMix [15] | RNA | Microarray | Hierarchical mixture model | R based | Expression and proportion data required | Determine differentially expressed genes in mixed cell populations
Dsection [19] | RNA | Microarray | Bayesian model | Web-based and MATLAB | Expression and proportion data required | (i) Estimating cell-type proportions of heterogeneous tissue samples; (ii) estimating replication variance; and (iii) identifying differential expression across cell types
csSAM [20] | RNA | Microarray | Linear regression-based model | - | Expression profile of mixed tissue samples | Identify cell type-specific average expression profiles from mixed tissue samples
PERT [21] | RNA | Microarray | Perturbation model | Octave | Expression data from mixed cell type and expression profile of each homogeneous cell type | Estimate proportion of each individual cell type in mixed sample
mixture_estimation.R [23] | RNA | Microarray | Variation of electronic subtraction method [22] | R based | Expression profile of mixed tissue samples | Estimate proportion and expression of each component in a mixed tissue sample
DeMix [11] | RNA | Microarray | Linear mixture model | C and R based | Normalized gene expression profile of mixed tissue sample | Estimate tumor proportion and tumor-specific expression
ISOLATE [27] | RNA | Microarray and HTS | Unsupervised classification model | MATLAB | Expression data for tumor/control samples and proportion data | Identify site of origin and characterize mixture composition of the secondary tumor
DeconRNASeq [28] | RNA | Microarray and HTS | Globally optimized non-negative decomposition algorithm | R based | Expression data from multiple tissues, signature of individual tissue and proportion data required | Estimate cell type proportions in mixed population
TEMT [30] | RNA | HTS | Probabilistic model including position and sequence-specific biases | Python | RNA-seq sequencing data from pure tissue and mixed tissue | Estimate transcript abundances of a single cell type from heterogeneous tissue sample
ESTIMATE [31] | RNA | HTS | Gene signature (ssGSEA) based model | R based | Expression data in Gene Set Enrichment Analysis (GSEA) gct format | In addition to tumor purity estimation, it also calculates the fraction of stromal cells and immune cells in the tumor
ABSOLUTE [32] | DNA | Microarray and HTS | Gaussian mixture model | R based | Copy number data in segmentation file | Estimate tumor purity and ploidy in cancer sample
dChipSNP [34] | DNA | Microarray | Hidden Markov model | Visual C++ | SNP array data for tumor and paired normal sample | Estimate copy number status and also normal stromal cell contamination
ASCAT [35] | DNA | Microarray | Analytical optimization method | R based | SNP array data with Log R and B-Allele frequency information | Identify allele-specific copy number after adjusting for tumor ploidy and non-aberrant cell admixture
PurityEst [36] | DNA | HTS | Aggregated allelic proportion-based model | Perl script | Mutation information in GFF format with allelic information of heterozygous loci with somatic mutations in a tumor and matched normal tissue sample | Estimate tumor purity based on mutant allele fractions in a mixture of a tumor clone and a normal clone
JointSNVMix1, JointSNVMix2 [37] | DNA | HTS | Probabilistic graphical model | Python | Sequence data from tumor/normal pairs | Analyse tumor and normal genome jointly to more accurately classify germline and somatic mutations
PurBayes [38] | DNA | HTS | Bayesian model | R based | Tumor mutation allele frequency data | Estimate tumor purity and detect intra-tumor heterogeneity
THetA [39] | DNA | HTS | Explicit probabilistic model | Python | Copy number data in interval count file format | Estimate tumor purity and clonal/subclonal copy number aberrations
ExPANdS [40] | DNA | HTS | Probability distributions model | R based and MATLAB | Somatic mutations and copy number data required | Predicts (i) number of clonal expansions; (ii) size of the resulting subpopulations in the tumor bulk; (iii) mutations specific to each subpopulation and tumor purity
CNAnorm [41] | DNA | HTS | Analytical optimization method | R based | Sequencing data of tumor and normal samples in bam format | Estimate copy number in tumor sample after correcting for normal cell contamination and adjusting for tumor ploidy
Virmid [42] | DNA | HTS | Probabilistic model and maximum likelihood estimator | Java based | Disease and normal sequencing data in bam format | Estimate proportion of control sample in a (mixed) disease sample
MuTect [43] | DNA | HTS | Bayesian model | Java based | Tumor and normal sequencing data | Identification of somatic point mutations in heterogeneous cancer samples
TrAp [47] | DNA | HTS | Linear mixture model with evolutionary framework | Java based | Tumor karyotypes and somatic hypermutation datasets | Infer subpopulation composition, abundance and evolutionary paths in a tumor sample
Seo et al. [49] | RNA | Microarray | Linear mixture model | - | Disease-associated variants and expression of heterogeneous normal tissue | Perform large-scale eQTL studies in heterogeneous tissues

STATISTICAL PROPERTIES OF MIXED CELL POPULATION

Suppose there are n distinct populations of cells (S: S1, S2, … Sn), which occupy fractions p1, p2, … pn of the cell population in the tissue sample, respectively. Let qij denote the abundance (e.g. expression level) of the j-th locus (e.g. gene) in the i-th cell population. Then the averaged abundance Qj of the j-th locus in the mixed sample is

$$Q_j = p_1 q_{1j} + p_2 q_{2j} + \dots + p_n q_{nj} + E, \quad \text{and} \quad p_1 + p_2 + \dots + p_n = 1$$

where E is the error term. A special case of this scenario is the contamination of normal stromal cells in a tumor sample (S: T, N), such that tumor and normal cells occupy pT and pN fractions, respectively.

$$Q_j = p_T q_{Tj} + p_N q_{Nj} + E, \quad \text{and} \quad p_T + p_N = 1$$

These equations are solvable when certain assumptions are met and boundary conditions are satisfied. There can be additional constraints while solving these equations. For instance, while estimating contamination of normal stromal cells, one might have matched (or unmatched) tumor and normal samples with (or without) reference genes—the scenario with unmatched tumor and normal samples without reference genes being the most challenging one [11]. Several numerical and statistical methods (e.g. linear mixture models, Bayesian models, probabilistic graphical models) are available to estimate the proportions of clonal cell populations in a sample, and also infer the expression profiles of those cells.
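As a minimal sketch of the two-component case above, assuming that reference profiles for pure tumor and pure normal cells are available and that the error term is small, the tumor fraction can be estimated by ordinary least squares and the tumor-specific profile recovered by rearranging the mixture equation. The function names and the toy abundance values are hypothetical and chosen only for illustration; this is not the procedure of any particular tool reviewed here.

```python
def estimate_tumor_fraction(mixed, tumor_ref, normal_ref):
    """Least-squares estimate of p_T in Q_j = p_T*q_Tj + (1 - p_T)*q_Nj + E."""
    # Rearranged: (Q_j - q_Nj) = p_T * (q_Tj - q_Nj) + E, a regression through the origin.
    numerator = sum((q - n) * (t - n) for q, t, n in zip(mixed, tumor_ref, normal_ref))
    denominator = sum((t - n) ** 2 for t, n in zip(tumor_ref, normal_ref))
    if denominator == 0:
        raise ValueError("tumor and normal reference profiles are identical")
    # Clip to [0, 1] because p_T is a proportion.
    return min(1.0, max(0.0, numerator / denominator))


def deconvolve_tumor_profile(mixed, normal_ref, p_tumor):
    """Solve the mixture equation for q_Tj given Q_j, q_Nj and p_T (error term ignored)."""
    if not 0.0 < p_tumor <= 1.0:
        raise ValueError("p_tumor must be in (0, 1]")
    return [(q - (1.0 - p_tumor) * n) / p_tumor for q, n in zip(mixed, normal_ref)]


# Hypothetical abundances for four loci in pure tumor and pure normal cells.
tumor_ref = [10.0, 2.0, 8.0, 1.0]
normal_ref = [4.0, 6.0, 8.0, 3.0]
true_p_tumor = 0.7
# Noise-free mixed sample built from the mixture equation.
mixed = [true_p_tumor * t + (1.0 - true_p_tumor) * n for t, n in zip(tumor_ref, normal_ref)]

p_hat = estimate_tumor_fraction(mixed, tumor_ref, normal_ref)
tumor_hat = deconvolve_tumor_profile(mixed, normal_ref, p_hat)
```

With real data the error term is not negligible and reference profiles are often unavailable, which is why the tools discussed below rely on richer statistical models.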

SOFTWARE

Here, we first describe some of the very early deconvolution methods, before focusing on the latest tools available for analysing microarray- and high-throughput sequencing (HTS)-based transcriptome data. We then review the methods that are specific to genomic data (e.g. aCGH, genome and exome sequencing), and highlight the emerging tools for studying tumor heterogeneity and clonal evolution using deconvolved tumor genomics data, together with their key features. We also perform a systematic assessment of these tools and compare their performance (Table 2). Finally, we mention novel approaches that extend these concepts to genome-wide association studies focusing on cancer.

Transcriptome deconvolution—microarray-based approaches

Attempts to quantify tissue heterogeneity from mixed cell populations using computational approaches are not new. In the early 2000s, several studies, including those by Venet et al. [12], Tureci et al. [13] and Lu et al. [14], outlined some of the first models to deconvolve microarray data from mixed cell populations. Subsequently, several studies proposed more refined methods for deconvolution of gene expression, often based on known tissue proportions or strong prior information on the proportions [15–18]. Two popular models that were developed subsequently are Dsection [19] and csSAM [20]. Whereas Dsection adopted a probabilistic approach, csSAM was based on linear regression, and its accuracy benefited from expression variation between samples. PERT, another relatively recent method developed by Qiao et al., used reference profiles from all tissue components and allowed for adjustments in tissue-specific expression levels from the reference profiles [21].

A particularly challenging scenario is when the proportions of the component cell populations are not known. In 2007 Gosink et al. developed an early prototype of a subtraction algorithm to address this scenario [22]. Subsequently, Clarke et al. proposed a more refined statistical approach to expression deconvolution from mixed tissue samples in which the proportion of each component cell type is unknown [23]. Their method estimates the proportion of each component in a mixed tissue sample; this estimate can be used to calculate gene expression from each component. This method directly improved the method of Gosink et al. [22], and did not deconvolve individual gene expressions.

Afterwards, several groups attempted to develop efficient software based on other approaches, including hidden Markov models and non-negative matrix factorization. The multinomial hidden Markov model-based method proposed by Roy et al. was able to extract information from heterogeneous data types, did not require a priori identification of gene expression in pure cell types, and was less affected by missing data than the linear regression-based models [24]. At least two studies implemented non-negative matrix factorization-based approaches [25, 26]. Gaujoux and Seoighe found that the semi-supervised version of the approach outperforms the unsupervised technique [25]. However, only limited independent assessments of these different methods have been undertaken so far.

DeMix is a recent and powerful statistical method for deconvolving mixed cancer transcriptomes developed by Ahn et al. which predicts the likely proportion of tumor and stromal cell samples using a linear mixture model [11]. The advantage of this method is that it explicitly considers all possible scenarios, and thus can be applied broadly. For instance, Ahn et al. outline dedicated strategies for determining the proportions of tumor and normal cells using data from (i) matched tumor and normal samples, with reference genes; (ii) matched tumor and normal samples, without reference genes; (iii) unmatched tumor and normal samples, with reference genes; and (iv) the most difficult scenario: unmatched tumor and normal samples, without reference genes. DeMix has performed well with both simulated and real biological datasets [11].

Transcriptome deconvolution—HTS-based approaches

HTS data are rapidly superseding microarrays. Several recent methods have been proposed to deconvolve mixed populations of cells in a heterogeneous tissue sample using HTS data. ISOLATE, developed by Quon and Morris, is one of the first software tools to deconvolve HTS data [27]. The authors went on to demonstrate the utility of such approaches in clinical samples. This software can also predict the potential site of origin of a tumor, filter out the effect of sample heterogeneity, and identify differentially expressed genes [27].

In 2013 alone, several more deconvolution techniques were proposed. Two popular and recent methods are DeconRNASeq and TEMT. DeconRNASeq [28] is an R package for deconvolution of heterogeneous tissues based on mRNA expression data, which adopts a globally optimized non-negative decomposition algorithm through quadratic programming for estimating the mixing proportions of distinct tissue types in next-generation sequencing or microarray data. It is based on earlier work by Gong et al. [29]. TEMT is a probabilistic model-based approach, proposed by Li and Xie, to estimate the transcript abundances of multiple cell types from RNA-seq data of heterogeneous tissue samples [30]. Their method incorporates positional and sequence-specific biases, and implements an efficient expectation maximization algorithm.
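To illustrate the kind of constrained problem described above, the following sketch estimates mixing proportions by minimising the squared error between the mixed profile and a weighted sum of signature profiles, subject to non-negative proportions that sum to one. It uses projected gradient descent as a simple stand-in for a quadratic programming solver; it is not the DeconRNASeq implementation, and the function names, the signature matrix and the proportions are all hypothetical.

```python
def project_to_simplex(values):
    """Euclidean projection onto {p : p_i >= 0, sum(p) = 1}."""
    ordered = sorted(values, reverse=True)
    running_sum = 0.0
    shift = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / k
        # Keep the shift from the last index at which the component stays positive.
        if value - candidate > 0:
            shift = candidate
    return [max(v - shift, 0.0) for v in values]


def estimate_proportions(mixed, signatures, n_iter=2000):
    """Minimise sum_j (Q_j - sum_i p_i * q_ij)^2 over the probability simplex.

    signatures[i][j] is q_ij (abundance of locus j in cell type i); mixed[j] is Q_j.
    """
    n_types = len(signatures)
    n_loci = len(mixed)
    p = [1.0 / n_types] * n_types
    # Conservative step size: the gradient's Lipschitz constant is at most
    # twice the sum of squared signature entries.
    step = 1.0 / (2.0 * sum(q * q for row in signatures for q in row))
    for _ in range(n_iter):
        fitted = [sum(p[i] * signatures[i][j] for i in range(n_types)) for j in range(n_loci)]
        residual = [f - q for f, q in zip(fitted, mixed)]
        gradient = [2.0 * sum(r * s for r, s in zip(residual, signatures[i]))
                    for i in range(n_types)]
        p = project_to_simplex([p_i - step * g for p_i, g in zip(p, gradient)])
    return p


# Hypothetical signature profiles for three cell types across four genes.
signatures = [
    [10.0, 1.0, 1.0, 2.0],
    [1.0, 8.0, 2.0, 1.0],
    [2.0, 1.0, 9.0, 3.0],
]
true_proportions = [0.5, 0.3, 0.2]
mixed = [sum(p * row[j] for p, row in zip(true_proportions, signatures)) for j in range(4)]

estimated = estimate_proportions(mixed, signatures)
```

Projected gradient descent is chosen here only because it needs no external solver; a dedicated quadratic programming routine would normally be preferable for large signature matrices.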

Only recently, Yoshihara et al. from the Broad Institute have proposed a gene-signature-based integrated framework called ESTIMATE, to infer tumor purity and stromal and immune cell admixture from expression data [31]. This method works with both microarray and HTS data, and also takes advantage of publicly available datasets. The authors applied the method to the Cancer Genome Atlas data, and compared their predictions with an independent estimate (ABSOLUTE [32], which is based on copy number data). There was significant agreement between the two estimates (correlation coefficient: 0.69 – 0.58). This method is already being used in the Cancer Genome Atlas and several other projects.

Deconvolution of genomic data—array-based approach

There have been parallel efforts to estimate tumor purity and clonality based on genomic data. Jacobs has highlighted the challenges associated with the analysis of genomic data from tumor samples with stromal contamination (especially formalin-fixed, paraffin-embedded samples) [33]. dChipSNP, developed in 2004 by Cheng Li’s group, was one of the early popular methods to estimate copy number status and normal stromal cell contamination using hidden Markov model-based major copy proportion (MCP) analysis of oligonucleotide SNP array data [34]. Since then, several methods have been proposed. ASCAT [35], a recent algorithm, uses analytical optimization to identify allele-specific copy number after adjusting for tumor ploidy and non-aberrant cell admixture. The ABSOLUTE algorithm [32], mentioned above, infers tumor composition using aCGH-based copy number data, but it can also use segmented copy number data derived from whole-genome or exome sequencing. It uses haplotype phasing, which allows greater resolution of small differences between homologous copy ratios.

Deconvolution of genomic data—sequencing-based approach

In 2012, Su et al. developed a novel algorithm, PurityEst, to infer the tumor purity level from the allelic differential representation of heterozygous loci with somatic mutations in a human tumor sample with a matched normal tissue sample using HTS data [36]. Su et al. applied their method to several prostate cancer samples, and benchmarked the accuracy of their predictions using an independent estimate for the same samples. The concordance between the two estimates was excellent (correlation coefficient: 0.91), suggesting that PurityEst estimates were biologically relevant. Concurrently, Roth et al. introduced two novel probabilistic graphical models, JointSNVMix1 and JointSNVMix2, for jointly analysing paired tumor-normal digital allelic count data from NGS experiments [37].
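The intuition behind allele-fraction-based purity estimation can be sketched under strong simplifying assumptions: every somatic mutation considered is heterozygous, present in all tumor cells, and located in a copy-neutral diploid region. Under these assumptions the expected mutant allele fraction is half the tumor purity, so purity is roughly twice the pooled mutant allele fraction. This is an illustrative simplification, not the PurityEst algorithm; the function name and read counts are hypothetical.

```python
def purity_from_allele_counts(counts):
    """Estimate tumor purity from (mutant_reads, total_reads) pairs.

    Assumes clonal, heterozygous somatic mutations in diploid regions, so the
    expected mutant allele fraction equals purity / 2.
    """
    mutant_reads = sum(mutant for mutant, _ in counts)
    total_reads = sum(total for _, total in counts)
    if total_reads == 0:
        raise ValueError("no reads supplied")
    # Pool reads across loci, then double the pooled allele fraction.
    return min(1.0, 2.0 * mutant_reads / total_reads)


# Hypothetical read counts at three somatic mutation sites.
counts = [(30, 100), (28, 80), (42, 120)]
purity = purity_from_allele_counts(counts)  # pooled fraction 100/300, so purity is about 0.67
```

Copy number changes and subclonal mutations violate these assumptions, which is one reason why the published methods model allele fractions more carefully.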

PurBayes is a novel Bayesian method developed by Larson and Fridley to estimate tumor purity and detect intratumor heterogeneity based on next-generation sequencing data of paired tumor-normal tissue samples [38]. This method uses finite mixture modeling methods. PurBayes can simultaneously estimate tumor purity and subclonality; this feature can facilitate inferring the tumor composition, evolution and also identification of potential founder events. In simulated datasets, both PurityEst and PurBayes performed comparably, but the ability of PurBayes to detect clonal composition as well is an advantage.

Based on an explicit probabilistic model, Oesper et al. developed the Tumor Heterogeneity Analysis (THetA) algorithm [39], which uses copy number data to formulate and solve the maximum likelihood mixture decomposition problem (MLMDP), and then determines the proportions of normal cells and any number of tumor subpopulations in the sample. THetA was applied to three previously sequenced BRCA genomes and their matched normal samples; two samples were sequenced at a depth of approximately 40× and one at approximately 188× coverage. The tumor purity estimates from THetA for the two 40× samples were 59% and 76%, whereas for the higher-coverage sample the estimated tumor purity was 65.7%, which is very close to the values predicted by ASCAT (66.0%) and ABSOLUTE (65%) but lower than the histopathological estimate (70%).

To address cellular subpopulation dynamics within human tumors, Andor et al. presented Expanding Ploidy and Allele-frequency on Nested Subpopulations (ExPANdS) [40], a method that characterizes coexisting subpopulations in a tumor using copy number and allele frequencies derived from exome or whole-genome sequencing data. They assessed the performance of ExPANdS on one whole-genome sequenced BRCA sample and applied the tool to 118 TCGA glioblastoma multiforme samples. Based on the results, Andor et al. claimed that ExPANdS is superior to ABSOLUTE in predicting the tumor purity of highly heterogeneous tumor samples [40].

Recent advances in sequencing techniques and the availability of exome and whole-genome sequence data for tumor and normal samples in the public domain have encouraged the development of several novel algorithms for tackling sample impurity in downstream analyses (e.g. identification of single nucleotide variants, or SNVs). CNAnorm [41] estimates copy number in tumor samples after correcting for normal cell contamination and adjusting for tumor ploidy. Virmid [42], developed by Kim et al. and based on a novel probabilistic model, predicts the level of impurity in a sample, and then uses it for improved detection of somatic variation in the same sample. The authors demonstrated the power of their model by performing extensive analyses on simulated data and on exome sequencing data from 15 BRCA samples. More recently, Cibulskis et al. from the Broad Institute proposed MuTect, a powerful tool for detection of somatic mutations in heterogeneous cancer samples [43]. They benchmarked the software using real data, and compared their predictions with those made by several other methods (e.g. JointSNVMix [37], SomaticSniper [44] and Strelka [45]). MuTect has excellent specificity and sensitivity (even for mutations with allelic fractions of ∼0.1), making it an attractive tool for investigating subclones and their evolution in cancer exome and genome sequencing data.

Identification of clonal structure and patterns of tumor evolution

Deconvolution of genomic and transcriptomic data is already providing interesting insights into the biology of heterogeneous tumors. We have shown that the genetic and epigenetic heterogeneity present in tumor samples correlates with pathological subtypes and known biomarkers [4, 46]. It is possible to examine the relationship between different clones, and also to reconstruct the potential evolutionary trajectories taken by those clones during tumorigenesis [7]. Recently, Strino et al. developed a method called TrAp that deconvolves mixed subpopulations using exome sequencing data, and uses that information to reconstruct the evolutionary relationships of the subpopulations [47]. Because the authors used simulated data to test their model, it remains to be seen how well the model predicts tumor evolution using real biological data. Using a related approach, Landau et al. were able to identify the likely temporal order of driver mutations in chronic lymphocytic leukemia [48]. The latest advances in genomic approaches and concepts for studying clonal evolution in cancer have been discussed elsewhere [7].

A COMPARATIVE ANALYSIS BETWEEN DIFFERENT SOFTWARE FOR ESTIMATING TUMOR PURITY

We assessed the performance of several of the tools described above that estimate tumor purity using cancer genomic and transcriptomic data. We restricted our analyses to tools that use HTS data as input and provide quantitative tumor purity estimates. We chose to compare five recently published popular tools (THetA, ESTIMATE, ExPANdS, PurBayes and ABSOLUTE) on five randomly selected BRCA samples from the Cancer Genome Atlas data portal (TCGA; https://tcga-data.nci.nih.gov/tcga/), which had histopathological purity estimates available. Of these, THetA, ExPANdS, PurBayes and ABSOLUTE required genomic data and ESTIMATE required transcriptome data as input. THetA and ABSOLUTE were copy number based, whereas ExPANdS and PurBayes required allelic frequency information as input. We ran these tools using default parameters, and compared the predicted tumor purity with the histopathological purity estimates provided by the TCGA (Figure 2). We also calculated the root mean square deviation (RMSD) for each tool from the differences between the expected value (tumor purity calculated by a given tool) and the observed value (histopathological estimate) across these five tumor samples, and ranked the tools accordingly.

$$\mathrm{RMSD}_{\mathrm{tool}} = \sqrt{\frac{\sum_{\mathrm{sample}=1}^{5} \left(\text{calculated purity} - \text{histopathological purity}\right)^{2}}{5}}$$
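The RMSD calculation for a single tool can be sketched as follows. The purity values are hypothetical placeholders, not the estimates obtained in this study, and the function name is chosen only for illustration.

```python
import math


def rmsd(calculated, histopathological):
    """Root mean square deviation between tool-based and histopathological purity."""
    if len(calculated) != len(histopathological) or not calculated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_differences = [(c - h) ** 2 for c, h in zip(calculated, histopathological)]
    return math.sqrt(sum(squared_differences) / len(calculated))


# Hypothetical purity estimates for five samples (fractions between 0 and 1).
calculated_purity = [0.60, 0.75, 0.80, 0.55, 0.70]
histopathological_purity = [0.70, 0.70, 0.85, 0.60, 0.80]

rmsd_tool = rmsd(calculated_purity, histopathological_purity)
```

A lower RMSD indicates closer agreement with the histopathological estimates, which is the criterion used to rank the tools.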

Figure 2:

Tumor purity estimation for five BRCA samples using five different software and comparison of the predicted estimates with histopathological tumor purity estimates provided by the Cancer Genome Atlas. The RMSD values between predicted and histopathological estimates are provided for each software.

Interestingly, the estimated tumor purity differed systematically from the histopathological tumor purity for all five software tools. PurBayes over-estimated tumor purity for every sample analysed, whereas THetA and ABSOLUTE under-estimated it for all samples. ESTIMATE over-estimated tumor purity in one case and under-estimated it in the remaining four samples. ExPANdS over-estimated in three cases and under-estimated in two. Our findings are consistent with those reported by Oesper et al. [39]. The discrepancy between genomic- or transcriptomic-based and histopathology-based estimates might be partly due to subjective biases in histopathological examination, or to differences in genetic content between the tissue sections used for histopathological evaluation and those used for genomic analysis (using HTS or microarray) [31]. Tumor purity over-estimation by PurBayes might be due to outlier observations, especially the presence of point mutations in genomic regions affected by CNVs and structural rearrangements [38]. Nevertheless, based on the RMSD values, ExPANdS performed better than the other four tools (Figure 2) in our analysis.

Deconvolution of genomics data in cancer population genetics

The methods to deconvolve genomic and transcriptomic data have been extended to population genetics studies as well. Typically, eQTL-based analyses in human primary tissue samples are complicated by the heterogeneous nature of human tissue. In 2013, Seo et al. outlined a computational method to infer the fractions of different cell types in normal human breast tissue samples [49], and then applied their method to known BRCA risk loci and tested for association. Such methods are expected to evaluate eQTL-expression relationships in GWAS more accurately than existing ones.

OUTLOOK

Genetic and non-genetic heterogeneity are becoming hallmarks of cancer genomes, with critical therapeutic significance [3, 50–52]. Increasing evidence suggests that it is important to consider the contributions of different cell populations to the averaged genome or transcriptome of the bulk sample [1]. The newly reported statistical models provide powerful tools to account for tissue-level heterogeneity and perform refined analyses. Their power lies in the simplicity of implementing these statistical packages, the use of widely available microarray or NGS data, and the biological insights they bring. Several recent studies report the extent of genetic heterogeneity in tumor samples and its potential implications. For instance, using the above-mentioned approaches, multiple Pan-Cancer genomics studies have estimated tumor heterogeneity and purity, and identified the likely early driver events during tumorigenesis [53, 54]. Furthermore, it has been demonstrated that the extent of heterogeneity and the presence of a subclonal driver mutation are independent risk factors for rapid disease progression [48].

In many cases these methods require greater sequencing depth in the input data than is routinely available, but with the falling cost of sequencing, higher depth of coverage for tumor samples (60–120X) is becoming common [7]. Estimates from these approaches have the potential to complement insights obtained from single-cell genomics [6, 7, 55]. Single-cell genomics techniques are complex, expensive and require specialized resources, but the field is advancing rapidly, and these issues might not be of concern in the near future [6]. Whereas single-cell genomics provides insights at the resolution of individual cells, deconvolution approaches offer estimates at tissue-level resolution. A combination of the two is expected to bring powerful synergy and novel biological insights in cancer and other diseases.

Key Points.

  • Tumor purity and clonal heterogeneity are major confounding factors when analysing cancer genomic data.

  • Several different methods to deconvolve genomic and transcriptomic data from mixed cell populations have been proposed in the last 10 years, with mixed success.

  • Recently, several excellent software tools to deconvolve high-throughput genomic and transcriptomic data from heterogeneous tissue samples have been published, yielding new biological insights.

  • If properly used, these methods have the potential to complement single-cell-based techniques, and provide novel insights into tumor heterogeneity.

FUNDING

This work was supported in part by grants from the American Cancer Society (ACS IRG 57-001-53), Lung Cancer Colorado Fund (ST 63401169), and United Against Lung Cancer grant (84-6000555). The comparative analysis results published here are in part based upon data generated by The Cancer Genome Atlas pilot project [56] established by the NCI and NHGRI (dbGAP accession ID: phs000178.v8.p7). Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov/.

Biographies

Vinod Kumar Yadav is a Postdoctoral Research Fellow at the University of Colorado School of Medicine. His research interests include cancer bioinformatics and computational method development.

Subhajyoti De is an Assistant Professor at the University of Colorado School of Medicine. His research interests include cancer genomics, heterogeneous biological data integration and computational method development.

References

  • 1. Zhao Y, Simon R. Gene expression deconvolution in clinical samples. Genome Med. 2010;2:93. doi: 10.1186/gm214.
  • 2. De S. Somatic mosaicism in healthy human tissues. Trends Genet. 2011;27:217–23. doi: 10.1016/j.tig.2011.03.002.
  • 3. Marusyk A, Almendro V, Polyak K. Intra-tumour heterogeneity: a looking glass for cancer? Nat Rev Cancer. 2012;12:323–34. doi: 10.1038/nrc3261.
  • 4. Martins FC, De S, Almendro V, et al. Evolutionary pathways in BRCA1-associated breast tumors. Cancer Discov. 2012;2:503–11. doi: 10.1158/2159-8290.CD-11-0325.
  • 5. Navin N, Kendall J, Troge J, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472:90–4. doi: 10.1038/nature09807.
  • 6. Navin N, Hicks J. Future medical applications of single-cell sequencing in cancer. Genome Med. 2011;3:31. doi: 10.1186/gm247.
  • 7. Ding L, Raphael BJ, Chen F, et al. Advances for studying clonal evolution in cancer. Cancer Lett. 2013;340:212–9. doi: 10.1016/j.canlet.2012.12.028.
  • 8. Wu AR, Neff NF, Kalisky T, et al. Quantitative assessment of single-cell RNA-sequencing methods. Nat Methods. 2014;11:41–6. doi: 10.1038/nmeth.2694.
  • 9. Moignard V, Macaulay IC, Swiers G, et al. Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis. Nat Cell Biol. 2013;15:363–72. doi: 10.1038/ncb2709.
  • 10. Bumgarner SL, Neuert G, Voight BF, et al. Single-cell analysis reveals that noncoding RNAs contribute to clonal heterogeneity by modulating transcription factor recruitment. Mol Cell. 2012;45:470–82. doi: 10.1016/j.molcel.2011.11.029.
  • 11. Ahn J, Yuan Y, Parmigiani G, et al. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics. 2013;29:1865–71. doi: 10.1093/bioinformatics/btt301.
  • 12. Venet D, Pecasse F, Maenhaut C, et al. Separation of samples into their constituents using gene expression data. Bioinformatics. 2001;17(Suppl. 1):S279–87. doi: 10.1093/bioinformatics/17.suppl_1.s279.
  • 13. Tureci O, Ding J, Hilton H, et al. Computational dissection of tissue contamination for identification of colon cancer-specific expression profiles. FASEB J. 2003;17:376–85. doi: 10.1096/fj.02-0478com.
  • 14. Lu P, Nakorchevskiy A, Marcotte EM. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc Natl Acad Sci USA. 2003;100:10370–5. doi: 10.1073/pnas.1832361100.
  • 15. Ghosh D. Mixture models for assessing differential expression in complex tissues using microarray data. Bioinformatics. 2004;20:1663–9. doi: 10.1093/bioinformatics/bth139.
  • 16. Stuart RO, Wachsman W, Berry CC, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proc Natl Acad Sci USA. 2004;101:615–20. doi: 10.1073/pnas.2536479100.
  • 17. Lahdesmaki H, Shmulevich L, Dunmire V, et al. In silico microdissection of microarray data from heterogeneous cell populations. BMC Bioinformatics. 2005;6:54. doi: 10.1186/1471-2105-6-54.
  • 18. Wang M, Master SR, Chodosh LA. Computational expression deconvolution in a complex mammalian organ. BMC Bioinformatics. 2006;7:328. doi: 10.1186/1471-2105-7-328.
  • 19. Erkkila T, Lehmusvaara S, Ruusuvuori P, et al. Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics. 2010;26:2571–7. doi: 10.1093/bioinformatics/btq406.
  • 20. Shen-Orr SS, Tibshirani R, Khatri P, et al. Cell type-specific gene expression differences in complex tissues. Nat Methods. 2010;7:287–9. doi: 10.1038/nmeth.1439.
  • 21. Qiao W, Quon G, Csaszar E, et al. PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol. 2012;8:e1002838. doi: 10.1371/journal.pcbi.1002838.
  • 22. Gosink MM, Petrie HT, Tsinoremas NF. Electronically subtracting expression patterns from a mixed cell population. Bioinformatics. 2007;23:3328–34. doi: 10.1093/bioinformatics/btm508.
  • 23. Clarke J, Seo P, Clarke B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics. 2010;26:1043–9. doi: 10.1093/bioinformatics/btq097.
  • 24. Roy S, Lane T, Allen C, et al. A hidden-state Markov model for cell population deconvolution. J Comput Biol. 2006;13:1749–74. doi: 10.1089/cmb.2006.13.1749.
  • 25. Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. Infect Genet Evol. 2012;12:913–21. doi: 10.1016/j.meegid.2011.08.014.
  • 26. Repsilber D, Kern S, Telaar A, et al. Biomarker discovery in heterogeneous tissue samples – taking the in-silico deconfounding approach. BMC Bioinformatics. 2010;11:27. doi: 10.1186/1471-2105-11-27.
  • 27. Quon G, Morris Q. ISOLATE: a computational strategy for identifying the primary origin of cancers using high-throughput sequencing. Bioinformatics. 2009;25:2882–9. doi: 10.1093/bioinformatics/btp378.
  • 28. Gong T, Szustakowski JD. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics. 2013;29:1083–5. doi: 10.1093/bioinformatics/btt090.
  • 29. Gong T, Hartmann N, Kohane IS, et al. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS One. 2011;6:e27156. doi: 10.1371/journal.pone.0027156.
  • 30. Li Y, Xie X. A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues. BMC Bioinformatics. 2013;14(Suppl. 5):S11. doi: 10.1186/1471-2105-14-S5-S11.
  • 31. Yoshihara K, Shahmoradgoli M, Martinez E, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612. doi: 10.1038/ncomms3612.
  • 32. Carter SL, Cibulskis K, Helman E, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30:413–21. doi: 10.1038/nbt.2203.
  • 33. Jacobs S. Data analysis considerations for detecting copy number changes in formalin-fixed, paraffin-embedded tissues. Cold Spring Harb Protoc. 2012;2012:1203–9. doi: 10.1101/pdb.ip071761.
  • 34. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics. 2008;9:204. doi: 10.1186/1471-2105-9-204.
  • 35. Van Loo P, Nordgard SH, Lingjaerde OC, et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci USA. 2010;107:16910–5. doi: 10.1073/pnas.1009843107.
  • 36. Su X, Zhang L, Zhang J, et al. PurityEst: estimating purity of human tumor samples using next-generation sequencing data. Bioinformatics. 2012;28:2265–6. doi: 10.1093/bioinformatics/bts365.
  • 37. Roth A, Ding J, Morin R, et al. JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics. 2012;28:907–13. doi: 10.1093/bioinformatics/bts053.
  • 38. Larson NB, Fridley BL. PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data. Bioinformatics. 2013;29:1888–9. doi: 10.1093/bioinformatics/btt293.
  • 39. Oesper L, Mahmoody A, Raphael BJ. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 2013;14:R80. doi: 10.1186/gb-2013-14-7-r80.
  • 40. Andor N, Harness JV, Muller S, et al. EXPANDS: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics. 2014;30:50–60. doi: 10.1093/bioinformatics/btt622.
  • 41. Gusnanto A, Wood HM, Pawitan Y, et al. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012;28:40–7. doi: 10.1093/bioinformatics/btr593.
  • 42. Kim S, Jeong K, Bhutani K, et al. Virmid: accurate detection of somatic mutations with sample impurity inference. Genome Biol. 2013;14:R90. doi: 10.1186/gb-2013-14-8-r90.
  • 43. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31:213–9. doi: 10.1038/nbt.2514.
  • 44. Larson DE, Harris CC, Chen K, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28:311–7. doi: 10.1093/bioinformatics/btr665.
  • 45. Saunders CT, Wong WS, Swamy S, et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28:1811–7. doi: 10.1093/bioinformatics/bts271.
  • 46. De S, Shaknovich R, Riester M, et al. Aberration in DNA methylation in B-cell lymphomas has a complex origin and increases with disease severity. PLoS Genet. 2013;9:e1003137. doi: 10.1371/journal.pgen.1003137.
  • 47. Strino F, Parisi F, Micsinai M, et al. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Res. 2013;41:e165. doi: 10.1093/nar/gkt641.
  • 48. Landau DA, Carter SL, Stojanov P, et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell. 2013;152:714–26. doi: 10.1016/j.cell.2013.01.019.
  • 49. Seo JH, Li Q, Fatima A, et al. Deconvoluting complex tissues for expression quantitative trait locus-based analyses. Philos Trans R Soc Lond B Biol Sci. 2013;368:20120363. doi: 10.1098/rstb.2012.0363.
  • 50. Bedard PL, Hansen AR, Ratain MJ, et al. Tumour heterogeneity in the clinic. Nature. 2013;501:355–64. doi: 10.1038/nature12627.
  • 51. Burrell RA, McGranahan N, Bartek J, et al. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501:338–45. doi: 10.1038/nature12625.
  • 52. Meacham CE, Morrison SJ. Tumour heterogeneity and cancer cell plasticity. Nature. 2013;501:328–37. doi: 10.1038/nature12624.
  • 53. Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–8. doi: 10.1038/nature12213.
  • 54. Kandoth C, McLellan MD, Vandin F, et al. Mutational landscape and significance across 12 major cancer types. Nature. 2013;502:333–9. doi: 10.1038/nature12634.
  • 55. Dewey FE, Pan S, Wheeler MT, et al. DNA sequencing: clinical applications of new DNA sequencing technologies. Circulation. 2012;125:931–44. doi: 10.1161/CIRCULATIONAHA.110.972828.
  • 56. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. doi: 10.1038/nature11412.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
