A Statistical Method for Association Analysis of Cell Type Compositions

Licai Huang; Paul Little; Jeroen R Huyghe; Qian Shi; Tabitha A Harrison; Greg Yothers; Thomas J George; Ulrike Peters; Andrew T Chan; Polly A Newcomb; Wei Sun

doi:10.1007/s12561-020-09293-0

Author manuscript; available in PMC: 2022 Dec 1.

Published in final edited form as: Stat Biosci. 2021 Sep 15;13(3):373–385. doi: 10.1007/s12561-020-09293-0

PMCID: PMC8735261 NIHMSID: NIHMS1629420 PMID: 35003378

Abstract

Gene expression data are often collected from tissue samples that are composed of multiple cell types. Studies of cell type composition based on gene expression data from tissue samples have recently attracted increasing research interest and led to the development of new methods for cell type composition estimation. This new information on cell type composition can be associated with individual characteristics (e.g., genetic variants) or clinical outcomes (e.g., survival time). Such association analysis can be conducted for each cell type separately, followed by multiple testing correction. An alternative approach is to evaluate the association using the composition of all the cell types, thus aggregating association signals across cell types. A key challenge of this approach is to account for the dependence across cell types. We propose a new method to quantify the distances between cell types while accounting for their dependencies, and use this information for association analysis. We demonstrate our method in two applied examples: assessing the association of immune cell type composition in tumor samples of colorectal cancer patients with survival time and with SNP genotypes. We found that immune cell composition has prognostic value, and that our distance metric leads to more accurate survival time prediction than other distance metrics that ignore cell type dependencies. In addition, survival time-associated SNPs are enriched among the SNPs associated with immune cell composition.

Keywords: cell type composition, genome-wide associations, survival time

1. Introduction

Variation of cell type composition across tissue samples can explain a substantial proportion of gene expression variation. For example, a recent study showed that more than 88% of gene expression variation in human brains may be explained by the variation of cell type compositions [1]. Many methods have been developed to estimate cell type composition of a tissue sample based on gene expression data from this sample and an external reference of cell type-specific gene expression [2, 3]. While assessing associations of these estimates with individual characteristics (e.g., genetic and environmental factors) or clinical outcomes (e.g., survival time) is of interest, it is a challenging problem due to the compositional nature of the data. For example, one cannot modify the proportion of one cell type without altering the proportion of at least one other cell type [4, 5].

Association analysis of cell type composition can be conducted for each cell type separately or by a global test of all cell types. A global test can be more powerful because it aggregates multiple weak associations. The methods for global tests can be divided into two groups: regression-based methods [6, 7, 8] and distance-based association methods [9, 10, 11, 12, 13, 14]. The regression approaches use log-transformed compositions as covariates, and thus zero components have to be handled by an ad hoc solution, such as replacement by a small constant. Therefore, regression approaches are more appropriate when most cell type proportion estimates are larger than zero. Since zero cell type proportions are often encountered, we focus on the distance-based methods.

Distance-based association methods evaluate whether two samples have similar covariate values if they have similar cell type compositions. A key question is how to define the distance between the cell type compositions of two samples. Similar questions have been studied for compositional data from microbiome studies, where the compositions are the proportions of sequence reads from predefined operational taxonomic units of the microbiome. Microbiome studies show that a distance measure that accounts for phylogenetic relationships among bacterial species can be much more powerful than those ignoring such evolutionary information [15]. Analogously, association analysis using cell type compositions could gain power by accounting for cell lineage. Although the overall cell lineage structure may be known, the distance between two cell types in the cell lineage tree is unknown. The main contribution of this paper is to construct cell lineage trees with quantification of the length of each tree branch. We refer to such a lineage tree as a quantitative lineage tree (QLT). This method allows for the definition of distance between two samples based on cell type compositions and the QLT. We demonstrate two applications of these distance metrics: studying the association of cell type composition with survival time and with SNP genotypes.

2. Method

2.1. Define distance between two samples based on cell type composition

We constructed a tree of immune cell types and cell states (Figure 1) based on the known cell lineage [16] with two modifications. The first modification is that we set mast cells to be a sibling node of three myeloblasts (neutrophils, eosinophils, and monocytes), whereas in the known cell lineage tree mast cells are a sibling of the myeloblast. The second modification is that we set NK cells to be a sibling node of two small lymphocytes (T cells and B cells), whereas in the known cell lineage tree NK cells are a sibling node of small lymphocytes. These changes are supported by the cell type-specific gene expression data collected by Newman et al. (2015) [2]. For example, based on gene expression data, T cells and B cells are not more similar to each other than T cells and NK cells are. See Supplementary Materials Section A for details.

We denote the l-th branch of this cell lineage tree by Pa_l → Ch_l, where Pa_l and Ch_l are the starting (parent) and ending (child) cell types of this branch, respectively. We define the length of the l-th branch, denoted by b_l, as the minimum distance between cell type Ch_l and all of its siblings. The rationale is that the similarity between Ch_l and its siblings is due to their connections to Pa_l, and thus a shorter distance among siblings implies a shorter distance between Ch_l and Pa_l. An alternative and likely more accurate approach is to quantify b_l by the similarity between the gene expression of Pa_l and Ch_l. However, this is often infeasible because gene expression data are usually not available for parent cell types. To define the distance between two sibling cell types k and m, we use the average of the pairwise Euclidean distances between the expression profiles of purified samples of cell types k and m. For example, suppose there are n_k samples of cell type k and n_m samples of cell type m. Then there are n_k × n_m distances between these two cell types. The distance between cell type k and cell type m is the average across these n_k × n_m distances.
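The branch length definition can be illustrated with a minimal Python sketch. It computes the average pairwise Euclidean distance between two cell types and sets each branch length to the minimum distance between a child and its siblings. The function names (`cell_type_distance`, `branch_lengths`), the input layout, and the toy expression vectors are assumptions made for illustration and are not taken from the published source code.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cell_type_distance(samples_k, samples_m):
    """Average of all n_k * n_m pairwise distances between purified samples."""
    dists = [euclidean(x, y) for x, y in product(samples_k, samples_m)]
    return sum(dists) / len(dists)


def branch_lengths(profiles, sibling_groups):
    """Length of the branch ending at each child: minimum distance to a sibling.

    profiles: dict mapping cell type -> list of expression vectors
    sibling_groups: list of lists; each inner list holds children of one parent
    (assumed to contain at least two cell types).
    """
    lengths = {}
    for group in sibling_groups:
        for child in group:
            lengths[child] = min(
                cell_type_distance(profiles[child], profiles[other])
                for other in group
                if other != child
            )
    return lengths


# Toy data (hypothetical): two genes, three sibling cell types.
profiles = {
    "T": [[0.0, 0.0], [0.0, 2.0]],
    "B": [[3.0, 0.0]],
    "NK": [[0.0, 1.0]],
}
b = branch_lengths(profiles, [["T", "B", "NK"]])
```

In this toy example T and NK are each other's closest sibling, so both receive the same branch length, while B receives the smaller of its distances to T and NK.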

Given the tree specification, we can calculate distances across samples based on their cell type compositions. A popular metric for the distance between compositions of bacterial species is UniFrac [17], which defines the distance between two samples as the Fraction of a microbiome phylogenetic tree that is Unique to either sample. A branch of a phylogenetic tree is unique to either sample if it leads to species that exist in one but not both samples. Applying similar ideas to cell type compositions, we can define the distance between two samples i and j by

d_{i j} = \frac{\sum_{l = 1}^{L} b_{l} | I (p_{i l} > 0) - I (p_{j l} > 0) |}{\sum_{l = 1}^{L} b_{l} (p_{i l} + p_{j l})},

where l indexes the L branches, b_l is the length of the l-th branch, I(·) is the indicator function, and p_il is the proportion of the “child” cell type of branch l in sample i. For a non-leaf node, p_il is the sum of the cell type proportions of this node and all of its descendants. This UniFrac distance is based on the presence/absence of cell types and does not use the quantitative information in the cell type proportions, and thus loses useful information. A generalized UniFrac distance [15] has been proposed to overcome this limitation:

d_{i j} = \frac{\sum_{l = 1}^{L} b_{l} {(p_{i l} + p_{j l})}^{α} | \frac{(p_{i l} - p_{j l})}{(p_{i l} + p_{j l})} |}{\sum_{l = 1}^{L} b_{l} {(p_{i l} + p_{j l})}^{α}},

(1)

where α, which varies from 0 to 1, is a tuning parameter that adjusts the contribution from cell types with high proportions. We use the generalized UniFrac distance in our analysis.
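Equation (1) can be illustrated with a short Python sketch that first aggregates cell type proportions up the tree (so that each branch carries the proportion of its child node plus all descendants) and then evaluates the generalized UniFrac distance. The tree encoding as a child-to-parent dictionary, the function names, and the toy tree are assumptions for illustration. Skipping branches that are absent from both samples is a convention adopted here to avoid dividing zero by zero; the text does not state how this case is handled.

```python
def branch_proportions(parent, node_props):
    """p_l for every non-root node: its own proportion plus its descendants'.

    parent: dict mapping each non-root node to its parent (root is not a key)
    node_props: dict mapping nodes (typically leaves) to estimated proportions
    """
    totals = {node: 0.0 for node in parent}
    for node, prop in node_props.items():
        current = node
        while current in parent:  # walk up until the root is reached
            totals[current] += prop
            current = parent[current]
    return totals


def generalized_unifrac(p_i, p_j, b, alpha=0.5):
    """Generalized UniFrac distance of Equation (1) for two samples."""
    numerator = 0.0
    denominator = 0.0
    for branch, length in b.items():
        total = p_i[branch] + p_j[branch]
        if total == 0:  # branch absent from both samples (assumed convention)
            continue
        weight = length * total ** alpha
        numerator += weight * abs(p_i[branch] - p_j[branch]) / total
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0


# Toy tree (hypothetical): root -> {Lymphoid, Myeloid}; Lymphoid -> {T, B}.
parent = {"Lymphoid": "root", "Myeloid": "root",
          "T": "Lymphoid", "B": "Lymphoid", "Mono": "Myeloid"}
b = {node: 1.0 for node in parent}  # unit branch lengths for simplicity

p_i = branch_proportions(parent, {"T": 0.5, "B": 0.5, "Mono": 0.0})
p_j = branch_proportions(parent, {"T": 0.5, "B": 0.0, "Mono": 0.5})

d_alpha1 = generalized_unifrac(p_i, p_j, b, alpha=1.0)
d_alpha0 = generalized_unifrac(p_i, p_j, b, alpha=0.0)
```

With α = 1 the weights equal the summed proportions, so abundant lineages dominate; with α = 0 every branch present in either sample is weighted by its length only.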

2.2. Survival Association analysis using distance matrix

We performed kernel-based survival analysis using the R package coxme [18, 19], which models survival time using a mixed-effects Cox proportional hazards model:

λ (t) = λ_{0} (t) \exp (X β + Z b), b ~ N (0, K),

(2)

where λ₀(t) is the baseline hazard function, X and Z are the design matrices for the fixed and random effects, respectively, β is the vector of fixed-effect coefficients, b is the vector of random effects, and K is a covariance matrix, or a kernel matrix that can be defined based on a distance matrix.

We compare the results using our kernel/distance metric versus two alternatives that ignore the dependencies across cell types. The first one is the correlation between the cell type compositions of two samples:

K (i, j) = (\mathrm{cor} (ρ_{i}, ρ_{j}) + 1) / 2,

(3)

where the transformation maps the correlation to the range from 0 to 1. The second one is a naive Euclidean distance that ignores the cell lineage. Specifically, let the cell type composition of the i-th sample be ρ_i; then the Euclidean distance between samples i and j is defined as d_ij = ∥ρ_i − ρ_j∥².
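The correlation kernel of Equation (3) can be sketched in a few lines of Python. The helper names and the toy compositions are assumptions for illustration; the Pearson correlation is written out explicitly so that the example needs only the standard library. A composition with identical proportions for all cell types has zero variance and would need separate handling.

```python
import math


def pearson(x, y):
    """Pearson correlation between two composition vectors of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_kernel(compositions):
    """K(i, j) = (cor(rho_i, rho_j) + 1) / 2 for all pairs of samples."""
    n = len(compositions)
    return [
        [(pearson(compositions[i], compositions[j]) + 1) / 2 for j in range(n)]
        for i in range(n)
    ]


# Toy compositions (hypothetical): three samples, three cell types.
rho = [
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
]
K = correlation_kernel(rho)
```

Samples 1 and 2 have nearly proportional compositions and obtain a kernel value close to 1, whereas samples 1 and 3 are nearly reversed and obtain a value close to 0.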

Given a distance definition D = {d_ij} (either the naive Euclidean distance or our cell lineage-aware distance), we consider a few choices of kernels. The exponential kernel is defined as:

K (i, j) = \exp (- d_{i j}) .

(4)

Following earlier works of Wang et al. [20], we also consider a Gaussian kernel defined as:

K (i, j) = \frac{1}{ε_{i j} \sqrt{2 π}} \exp (- \frac{d_{i j}^{2}}{2 ε_{i j}^{2}}),

(5)

where ε_ij is calculated as

ε_{i j} = \frac{σ (μ_{i} + μ_{j})}{2}, \quad \text{and} \quad μ_{i} = \frac{\sum_{m \in \mathrm{KNN} (D, i, k)} d_{i m}}{k} .

(6)

KNN(D, i, k) represents the k patients who are the closest neighbors of patient i according to the distances in D, so μ_i is the average distance from patient i to these neighbors. We set k = 20, 30, or 40, and σ = 0.2, 0.25, or 0.5. The results remain very similar across the different values of k, and we present only the results with k = 40.
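The exponential kernel of Equation (4) and the Gaussian kernel of Equations (5) and (6) can be sketched as follows. The function names and the toy distance matrix are assumptions for illustration. The sketch also assumes that a sample is not counted among its own nearest neighbors, which the text does not specify, and it requires k to be at most n − 1 and the neighbor distances to be positive so that ε_ij is nonzero.

```python
import math


def knn_mean_distance(D, i, k):
    """mu_i: mean distance from sample i to its k nearest neighbors.

    Sample i itself is excluded from the neighbor set (an assumption).
    """
    others = sorted(D[i][j] for j in range(len(D)) if j != i)
    return sum(others[:k]) / k


def exponential_kernel(D):
    """K(i, j) = exp(-d_ij), Equation (4)."""
    return [[math.exp(-d) for d in row] for row in D]


def gaussian_kernel(D, k, sigma):
    """Gaussian kernel with a pair-specific bandwidth, Equations (5) and (6)."""
    n = len(D)
    mu = [knn_mean_distance(D, i, k) for i in range(n)]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = sigma * (mu[i] + mu[j]) / 2
            K[i][j] = math.exp(-D[i][j] ** 2 / (2 * eps ** 2)) / (
                eps * math.sqrt(2 * math.pi)
            )
    return K


# Toy distance matrix (hypothetical): four samples located at 0, 1, 2, 3.
D = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
K_exp = exponential_kernel(D)
K_gauss = gaussian_kernel(D, k=2, sigma=0.5)
```

Because the bandwidth ε_ij adapts to the local neighborhood of both samples, pairs in sparse regions of the sample space are compared on a wider scale than pairs in dense regions.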

2.3. Genome-Wide Association Studies (GWAS) of cell type composition

Given a cell type composition-based distance matrix across samples, we assess the association between cell type composition and genetic variants using the MicrobiomeGWAS method [21]. This method assesses whether samples with similar genotypes also have smaller cell type composition-based distances. Traditionally, the significance of this test is evaluated through permutations, which is computationally infeasible for GWAS. The MicrobiomeGWAS method derives an asymptotic distribution for the test statistic and is thus computationally efficient.

We conducted our analysis using 9 different distance measurements. The correlation-based distance is defined as 1 − (corr + 1)/2, where corr is the standard Pearson correlation. We consider four distance measurements based on generalized UniFrac distances. One is the generalized UniFrac distance itself, denoted by d_ij for the distance between samples i and j. The other three versions resemble Gaussian kernels: $d_{i j}^{2} / (2 ε_{i j}^{2})$, where ε_ij is defined in equation (6), and we consider three situations with k = 40 and σ = 0.2, 0.25, or 0.5. Finally, we also consider four distance measurements based on Euclidean distance, obtained by replacing the d_ij defined by UniFrac distance with the Euclidean distance; in other words, the standard Euclidean distance plus three kernel versions with k = 40 and σ = 0.2, 0.25, or 0.5.

3. Results

3.1. Data Preparation

We used the RNA-seq data and survival time data from 450 colon cancer patients from The Cancer Genome Atlas (TCGA) for our analysis. We removed genes for which the third quartile of gene expression, as measured by read count across all patients, was below 20. A total of 17,986 genes passed this filter. Let T_ij be the read count for the j-th gene in the i-th patient. Gene expression was normalized by log[(T_ij + 1)/(q_{i,0.75} L_j)], where q_{i,0.75} is the 75th percentile of read counts for the i-th patient, and L_j is the gene length of the j-th gene. Then we employed CIBERSORT [2] to infer the proportions of 22 immune cell types for all patients (Supplementary Figures S3–S4). The reference gene expression data of the 22 immune cell types were downloaded from the CIBERSORT website: https://cibersort.stanford.edu/. The original CIBERSORT reference includes 547 genes. After taking the intersection with the genes with sufficient expression in our data, we ended up with 434 genes for our analysis. The processed TCGA gene expression data (including the gene list we used) and the source code can be found at https://github.com/Sun-lab/act/blob/master/TCGA_COAD/.
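The gene filter and the normalization can be illustrated with the Python sketch below. The function names, the linear-interpolation definition of the 75th percentile, and the toy counts are assumptions for illustration. The sketch computes q_{i,0.75} over all genes of a patient before filtering and uses the natural logarithm; the text does not state either choice.

```python
import math


def percentile_75(values):
    """75th percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = 0.75 * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def filter_and_normalize(counts, gene_lengths, min_q75=20):
    """Drop low-count genes, then compute log[(T_ij + 1) / (q_i * L_j)].

    counts[i][j]: read count T_ij of gene j in patient i
    gene_lengths[j]: length L_j of gene j
    Returns the indices of retained genes and the normalized matrix.
    """
    n_genes = len(gene_lengths)
    keep = [
        j for j in range(n_genes)
        if percentile_75([row[j] for row in counts]) >= min_q75
    ]
    normalized = []
    for row in counts:
        q = percentile_75(row)  # per-patient 75th percentile (must be > 0)
        normalized.append(
            [math.log((row[j] + 1) / (q * gene_lengths[j])) for j in keep]
        )
    return keep, normalized


# Toy data (hypothetical): three patients, three genes.
counts = [
    [100, 5, 40],
    [200, 0, 60],
    [300, 10, 80],
]
gene_lengths = [1000, 2000, 500]
keep, expr = filter_and_normalize(counts, gene_lengths)
```

In this toy example the second gene is removed because its third quartile across patients is below 20, and the remaining two genes are normalized for sequencing depth and gene length.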

The genotype data of TCGA colon cancer patients were obtained from Affymetrix SNP6 germline genotype calls. After standard GWAS QC, the data were phased using SHAPEIT2 and imputed to the 1000 Genomes Project phase 3 reference panel. We adjusted for age at diagnosis, sex, and genotype PCs in our GWAS. We restricted our analysis to cancer patients of European descent and SNPs with minor allele frequency (MAF) > 0.05. A total of 318 patients and 6,123,277 SNPs were included in the analysis.

3.2. Survival Association

Single cell type.

Using Cox regression analyses, we first assessed the association between overall survival time and the log-ratio of each immune cell type abundance relative to that of the reference group (Macrophages M0). Since some cell types are absent in most of the samples, we restricted our analysis to cell types that were present (proportion larger than zero) in at least 1/5 of the patients (Supplementary Table S1). Two Cox regression models were applied to each cell type, one adjusted for age at diagnosis and sex, and another additionally adjusted for tumor stage. P-values were adjusted for multiple testing using the Holm-Bonferroni method [22]. We observed that the log ratio of Macrophages M1 to Macrophages M0, denoted by log(M1/M0), was significantly associated with survival time when adjusted for age and sex (p-value: 0.001), though the association became weaker after adjusting for tumor stage (p-value: 0.05) (Supplementary Tables S2 and S3).
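The Holm-Bonferroni step-down adjustment used for the single cell type tests can be sketched in Python as follows. The function name and the toy p-values are assumptions for illustration and do not correspond to the values in the supplementary tables.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy p-values (hypothetical) for three cell types.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
```

Here the second test keeps the adjusted value of the third test, because the step-down procedure never lets a larger raw p-value receive a smaller adjusted p-value.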

Joint analysis of 22 cell types.

Next, we sought to predict survival time using a mixed effect model in which the similarity across individuals (kernel function) is defined using the composition of all 22 cell types (see Section 2.2 for details). We compared the performance of this method with that of survival time prediction using log(M1/M0) as the predictor. We randomly split the 450 patients into training and testing sets, with 70% of the patients in the training set and 30% in the testing set. For survival prediction using log(M1/M0), we trained our model using Cox regression. For survival prediction using the composition of all cell types, we trained our model by the coxme method [18, 19]. Then we evaluated the performance of the different methods by the C-index in the testing data. We repeated this procedure 100 times and summarized the performance of each method using the C-indexes from the 100 testing sets (Figure 2, Supplementary Fig S5). The log(M1/M0) predictor and the correlation kernel have the lowest median C-index, below 0.55 (Figure 2). The simple exponential kernel also has limited performance. Gaussian kernels perform better than the correlation or exponential kernels. Given a particular configuration of the Gaussian kernel, the generalized UniFrac distance consistently outperformed the Euclidean distance, implying that incorporating the relatedness information of immune cells improves the prognostic value (Figure 2).
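The evaluation scheme can be illustrated with a minimal Python sketch of a random 70/30 split and of Harrell's C-index for right-censored data. The function names and the toy data are assumptions for illustration; model fitting (Cox regression or coxme) is not reproduced here, and the risk scores are taken as given. Pairs with tied survival times are ignored in this simplified version.

```python
import random


def train_test_split(n, train_fraction, rng):
    """Randomly split indices 0..n-1 into training and testing sets."""
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]


def concordance_index(times, events, risk):
    """Harrell's C-index.

    times: observed follow-up times
    events: 1 if death was observed, 0 if censored
    risk: predicted risk scores (higher means shorter expected survival)
    A pair (i, j) is comparable when i had an event before time j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Toy data (hypothetical): four patients, the third one censored.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])

train_idx, test_idx = train_test_split(450, 0.7, random.Random(1))
```

A C-index of 0.5 corresponds to random ordering, so the values below 0.55 reported for the log(M1/M0) predictor and the correlation kernel indicate little predictive value.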

Fig. 2 — Boxplots of C-statistics for survival time prediction using mixed effect Cox-model with the covariance/kernel defined by different methods: Gaussian kernel (GK), exponential kernel, and correlation kernel. For GK(k, σ), k and σ indicate the number of closest neighbors and a scale parameter, respectively (see Equation (6)).

A tuning parameter α is needed to define the generalized UniFrac distance. We found that the results are very consistent for α in the range (0.3, 0.6) (Supplementary Fig S5); hence we just show the results for α = 0.5 in Figure 2. There are two tuning parameters for the Gaussian kernel, σ and k. We found that the results are not sensitive to the choice of k, and that σ = 0.25 gives the best survival prediction (Supplementary Fig S5).

Joint analysis of 12 cell types.

Since our method borrows information across cell types, we expect it to perform better when there are more cell types. To evaluate this conjecture, we collapsed the 22 cell types into 12 cell types. Specifically, we combined “B naive”, “B memory” and “Plasma cells” into one cell type referred to as B cell. “CD4+ T others” corresponds to “CD4+ naive”, “CD4+ memory resting” and “T cells memory activated”. The proportion of NK is the total proportion of “NK resting” and “NK activated”. “Dendritic” contains activated and resting dendritic cells. “Macrophage” includes “M0”, “M1” and “M2”, and “Mast” combines activated and resting mast cells. We then defined a QLT by simplifying the tree in Figure 1 (Supplementary Fig S6). Based on this new tree, we conducted survival time prediction and found that the performance is indeed much worse than when using all 22 cell types (Figure 3).
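Collapsing cell types amounts to summing the estimated proportions within each group. The Python sketch below uses only the groups named in the text; the labels for the dendritic and mast cell states, the "CD8+ T" label in the toy sample, and the function name are hypothetical, and cell types not listed in any group pass through unchanged.

```python
# Groups as named in the text; member labels for the dendritic and mast
# cell states are hypothetical placeholders.
GROUPS = {
    "B cell": ["B naive", "B memory", "Plasma cells"],
    "CD4+ T others": ["CD4+ naive", "CD4+ memory resting",
                      "T cells memory activated"],
    "NK": ["NK resting", "NK activated"],
    "Dendritic": ["Dendritic resting", "Dendritic activated"],
    "Macrophage": ["M0", "M1", "M2"],
    "Mast": ["Mast resting", "Mast activated"],
}


def collapse(proportions, groups=GROUPS):
    """Sum the proportions of the member cell types of each group."""
    member_to_group = {
        member: group for group, members in groups.items() for member in members
    }
    collapsed = {}
    for cell_type, prop in proportions.items():
        key = member_to_group.get(cell_type, cell_type)  # ungrouped: unchanged
        collapsed[key] = collapsed.get(key, 0.0) + prop
    return collapsed


# Toy sample (hypothetical proportions summing to 1).
sample = {
    "B naive": 0.10, "B memory": 0.05, "Plasma cells": 0.05,
    "M0": 0.30, "M1": 0.10, "M2": 0.20,
    "CD8+ T": 0.20,
}
collapsed = collapse(sample)
```

The collapsed proportions still sum to 1, so the same distance definitions apply to the simplified tree.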

Fig. 3 — Boxplots of C-statistics for 9 types of kernel functions: correlation, Gaussian kernels (GK) and exponential kernels (exp(−D)) defined based on Euclidean distance and generalized UniFrac distance with α = 0.5. The distance or correlation were calculated using the composition of 12 cell types.

Distinguish patients with long vs. short survival time.

We also consider a simplified task: distinguishing patients with long versus short survival time, e.g., survival time longer than 1 year, 2 years, or 5 years. Due to censoring, some patients cannot be included in such analysis. Given a distance matrix across individuals, e.g., the one defined by UniFrac distance based on the QLT of 22 or 12 cell types, we can test its association with the binary survival outcome by the permutational multivariate analysis of variance (PERMANOVA) test [13]. We also applied a support vector machine (SVM) to classify the patients into the two groups. To assess the significance of the SVM results, we recorded the prediction accuracy, permuted the long/short survival time labels 1,000 times, recorded the SVM classification accuracy for each permuted data set, and then calculated the permutation p-value as the proportion of permutations with higher accuracy than that obtained with the un-permuted data. We found no significant results from either PERMANOVA or SVM (Supplementary Table S4). This may be due in part to the limited sample size, particularly because some samples are excluded in the dichotomization due to censoring.
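The permutation scheme for the classification accuracy can be sketched in Python as follows. The function `permutation_pvalue` and the scoring callback are assumptions for illustration. In particular, the toy scoring function below is a fixed threshold rule standing in for the SVM, which is not reproduced here; any function that returns a classification accuracy for a given label vector could be passed instead.

```python
import random


def permutation_pvalue(score_fn, labels, n_perm=1000, seed=0):
    """Proportion of label permutations with accuracy above the observed one.

    score_fn: callable mapping a list of labels to a classification accuracy
    labels: observed long/short survival labels
    """
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    higher = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_fn(shuffled) > observed:
            higher += 1
    return observed, higher / n_perm


# Toy data (hypothetical): one feature that separates the two groups.
features = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def threshold_accuracy(lab):
    """Stand-in for the SVM: accuracy of the rule 'feature > 0.5'."""
    predictions = [1 if x > 0.5 else 0 for x in features]
    return sum(p == y for p, y in zip(predictions, lab)) / len(lab)


observed, p_value = permutation_pvalue(threshold_accuracy, labels)
```

Because the definition counts only permutations with strictly higher accuracy, a perfectly separable toy data set yields a permutation p-value of zero; with real data the reported value is bounded below by the resolution 1/1,000 only if ties or the observed data are counted, which is a design choice the text leaves open.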

3.3. Genetic Association

Using TCGA data, we conducted GWAS for the proportions of the 16 cell types that have zero proportions in fewer than 80% of the samples, as listed in Supplementary Table S2. As in the survival association analysis, we used the log ratio of the proportion of one cell type vs. Macrophages M0 as covariate. There were five associations with a p-value smaller than 5 × 10⁻⁸ (Supplementary Table S5) and, after multiple testing correction for the 16 cell types, one significant finding remained: the association between the relative abundance of resting NK cells and rs7831750 (chr8:11248049). rs7831750 is located within a long intergenic non-protein coding RNA, LINC00529, which has very low expression in almost all tissues except testis (https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=ENST00000443854.1).

We also performed GWAS of the composition of all 22 cell types using MicrobiomeGWAS [21] and 9 different distance measurements (see Section 2.3 for more details). There were no genome-wide significant findings; see Supplementary Figures S16–S17 for two examples of Manhattan plots. The lack of genome-wide significant results is likely due to the limited sample size of the TCGA data. However, we conjecture that true GWAS signals may still have relatively small p-values, even though they do not pass the genome-wide significance threshold. An indirect way to assess this is to check whether a certain group of SNPs is more likely to have small p-values than expected by chance. For the top 20,000 SNPs with the smallest p-values in each cell type composition GWAS (either single cell type GWAS or all-cell-type GWAS), we examined the corresponding p-values from a survival GWAS of colorectal cancer patients of European descent [23] (Supplementary Materials Section B.7). This study includes patients from two clinical trials: N0147 (n=2489) [24] and C08 (n=1700) [25]. Our hypothesis was that SNPs with the smallest p-values from the cell type composition GWAS tend to have small p-values in the survival GWAS.

For some distance measurements, the survival GWAS p-values for the top SNPs from the all-cell-type GWAS are indeed skewed towards smaller values (Figure 4). For example, among the top 20,000 SNPs from the all-cell-type GWAS with generalized UniFrac distance (with α = 0.5), 16,874 had results from the survival GWAS, with 1,207 having a p-value < 0.05 (expected 844; one-sided binomial p-value 6.9×10⁻³⁴).
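The enrichment calculation can be illustrated with a Python sketch of a one-sided binomial test, using the counts quoted above for the generalized UniFrac distance. The function name is an assumption for illustration, and the tail is taken as P(X ≥ observed); the text does not state whether the observed count itself is included in the tail. The sum is carried out in log space because the individual terms are far below ordinary floating-point ranges of interest.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        + x * math.log(p) + (n - x) * math.log1p(-p)
        for x in range(k, n + 1)
    ]
    top = max(log_terms)
    return math.exp(top) * sum(math.exp(t - top) for t in log_terms)


# Counts quoted in the text for the generalized UniFrac distance.
n_snps = 16874     # top SNPs that also have survival GWAS results
n_observed = 1207  # of these, SNPs with survival p-value < 0.05
expected = round(0.05 * n_snps)
p_enrichment = binom_upper_tail(n_observed, n_snps, 0.05)
```

Under the null hypothesis that the two sets of p-values are unrelated, 5% of the overlapping SNPs are expected to have a survival p-value below 0.05, which is the expected count reported in Table 1. The test also treats SNPs as independent, so linkage disequilibrium among nearby SNPs would make the nominal p-value optimistic.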

Fig. 4 — Histogram of survival associated p-value for significant SNPs in two GWAS: Left panel: using generalized UniFrac distance with α= 0.5; Right panel: generalized UniFrac distance transformed by Gaussian kernel, adjusted for σ = 0.25 and k = 40.

Among the 9 distance measurements we consider, the top 20,000 SNPs from six of them are enriched for SNPs with survival GWAS p-values below 0.05, and the most significant over-representation is observed using the generalized UniFrac distance (Table 1). In contrast, there is no such strong enrichment for the top SNPs from single cell type GWAS (Supplementary Figures S19–S22, Supplementary Table S6).

Table 1.

Over-representation of SNPs associated with survival time among the top 20,000 SNPs associated with cell type composition. We consider 9 different distance metrics when assessing the association between SNP genotypes and cell type composition. The number of SNPs evaluated is smaller than 20,000 because only the SNPs that also have survival p-values are included. GUniFrac: generalized UniFrac distance. GK: Gaussian kernel. See Section 2.3 for more details on the distance/kernel definitions. The observed number of SNPs is the number of SNPs with a survival time association p-value smaller than 0.05. The expected number of SNPs is the total number of SNPs multiplied by 0.05, rounded to the nearest integer. The binomial p-value is derived from a one-sided test.

Distance definition	Total # of SNPs	Observed # of SNPs	Expected # of SNPs	Binomial p-value
Correlation	16937	921	847	5.12 × 10⁻³
GUniFrac	16874	1207	844	6.90 × 10⁻³⁴
GUniFrac GK 40–0.2	16744	1119	837	9.18 × 10⁻²²
GUniFrac GK 40–0.25	16744	1119	837	2.32 × 10⁻²²
GUniFrac GK 40–0.5	16745	1118	837	1.27 × 10⁻²¹
Euclidean	17077	1197	854	2.64 × 10⁻³⁰
Euclidean GK 40–0.2	16885	732	844	1
Euclidean GK 40–0.25	16885	728	844	1
Euclidean GK 40–0.5	16886	732	844	1

4. Discussions

In this paper, we develop a global test to assess associations with the proportions of all cell types. To the best of our knowledge, our method is the first global test for this purpose. The key contribution of this work is to define the distance across tissue samples based on cell type composition while accounting for dependencies among cell types. We combine a cell lineage tree and cell type-specific gene expression to define a quantitative lineage tree (QLT), and then derive a distance metric based on the QLT. Our application shows that cell type composition can be used to predict cancer patient survival time. Our distance metric leads to more accurate prediction than other distance metrics or using a single cell type. We also show that the SNPs associated with survival time are enriched among the SNPs associated with the composition of all cell types. In contrast, there is no such enrichment among the SNPs associated with each individual cell type. Therefore, we conclude that the global test of cell type composition association using our proposed distance metric outperforms both global tests using naive distance metrics that ignore cell type dependencies and the local test of each individual cell type. Our method is a timely development. With increasing interest in characterizing tumor tissue samples by immune cell quantity and composition, the application of our method in cancer studies may further enhance the biological understanding of cancer prognosis, treatment response, and resistance.

Although our application focuses on colon cancer, our method can be applied to other cancer types as well as other tissue types. We focus on immune cells in our study because the cell lineage tree is well established for immune cells and cell type-specific gene expression data are available. We are optimistic that, with the advent of many large-scale projects such as the Human Cell Atlas [26, 27], information on cell lineage and cell type-specific gene expression will become available for more cell types. It is worth noting that some cell types/states may be better described as a continuous spectrum rather than as discrete cell types. For example, CD8+ T cells may reside on a continuous spectrum from activated to exhausted [28]. Our method is not yet designed for such instances, though this would be an interesting direction for future work.

In this work, we estimated cell type composition using CIBERSORT [2], which employs support vector regression to estimate the cell type composition of a tissue sample based on a reference of cell type-specific gene expression. Here we used the LM22 reference provided by CIBERSORT, which consists of gene expression profiles of 22 immune cell types from blood. Although some earlier works demonstrated its application to estimating immune cell composition in tumor samples [29], it is more desirable to use reference gene expression data collected from tumor samples instead of blood. We hope this limitation can be overcome by the accumulation of cell type-specific gene expression data in the near future.

Our data analysis workflow, log files, and part of our results can be accessed at https://github.com/Sun-lab/act.

Supplementary Material

NIHMS1629420-supplement-12561_2020_9293_MOESM1_ESM.pdf (7.9 MB, PDF)

Acknowledgements

This work is supported in part by NIH grants R01 CA176272, R01 GM105785, R01CA222833, and R21CA224026.

Footnotes

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

Contributor Information

Licai Huang, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Paul Little, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Jeroen R. Huyghe, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Qian Shi, Department of Health Sciences Research, Mayo Clinic, Rochester, MN.

Tabitha A. Harrison, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Greg Yothers, Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA.

Thomas J. George, Department of Medicine, University of Florida Health Cancer Center, Gainesville, FL.

Ulrike Peters, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Andrew T. Chan, Massachusetts General Hospital and Harvard Medical School, Boston, MA.

Polly A. Newcomb, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Wei Sun, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

References

1. Wang D, Liu S, Warrell J, Won H, Shi X, Navarro FC, Clarke D, Gu M, Emani P, Yang YT, et al., Comprehensive functional genomic resource and integrative model for the human brain, Science 362(6420), eaat8464 (2018)
2. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA, Robust enumeration of cell subsets from tissue expression profiles, Nature Methods 12(5), 453 (2015). PMCID: PMC4739640
3. Li B, Severson E, Pignon JC, Zhao H, Li T, Novak J, Jiang P, Shen H, Aster JC, Rodig S, et al., Comprehensive analyses of tumor immunity: implications for cancer immunotherapy, Genome Biology 17(1), 174 (2016)
4. Pearson K, Mathematical contributions to the theory of evolution. iii. regression, heredity, and panmixia, Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 187, 253 (1896)
5. Aitchison J, Egozcue JJ, Compositional data analysis: where are we and where should we be heading?, Mathematical Geology 37(7), 829 (2005)
6. Aitchison J, Bacon-shone J, Log contrast models for experiments with mixtures, Biometrika 71(2), 323 (1984)
7. Lin W, Shi P, Feng R, Li H, Variable selection in regression with compositional covariates, Biometrika 101(4), 785 (2014)
8. Shi P, Zhang A, Li H, Regression analysis for microbiome compositional data, The Annals of Applied Statistics 10(2), 1019 (2016)
9. Zhao N, Chen J, Carroll IM, Ringel-Kulka T, Epstein MP, Zhou H, Zhou JJ, Ringel Y, Li H, Wu MC, Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test, The American Journal of Human Genetics 96(5), 797 (2015)
10. Wu C, Chen J, Kim J, Pan W, An adaptive association test for microbiome data, Genome Medicine 8(1), 56 (2016)
11. Tang ZZ, Chen G, Alekseyenko AV, PERMANOVA-S: association test for microbial community composition that accommodates confounders and multiple distances, Bioinformatics 32(17), 2618 (2016)
12. Tang ZZ, Chen G, Alekseyenko AV, Li H, A general framework for association analysis of microbial communities on a taxonomic tree, Bioinformatics 33(9), 1278 (2016)
13. Anderson MJ, Permutation tests for univariate or multivariate analysis of variance and regression, Canadian journal of fisheries and aquatic sciences 58(3), 626 (2001)
14. Pan W, Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing, Genetic Epidemiology 35(4), 211 (2011)
15. Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, Collman RG, Bushman FD, Li H, Associating microbiome composition with environmental covariates using generalized UniFrac distances, Bioinformatics 28(16), 2106 (2012)
16. Parham P, The immune system (Garland Science, 2014)
17. Lozupone C, Knight R, UniFrac: a new phylogenetic method for comparing microbial communities, Applied and environmental microbiology 71(12), 8228 (2005)
18. Therneau TM, Grambsch PM, Pankratz VS, Penalized survival models and frailty, Journal of computational and graphical statistics 12(1), 156 (2003)
19. Therneau TM, coxme: Mixed Effects Cox Models (2018). URL https://CRAN.R-project.org/package=coxme. R package version 2.2-10
20. Wang B, Zhu J, Pierson E, Batzoglou S, Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning, Nature Methods 14(4), 414 (2017)
21. Hua X, Song L, Yu G, Goedert JJ, Abnet CC, Landi MT, Shi J, Microbiomegwas: a tool for identifying host genetic variants associated with microbiome composition, BioRxiv p. 031187 (2015)
22. Holm S, A simple sequentially rejective multiple test procedure, Scandinavian journal of statistics pp. 65–70 (1979)
23. Penney K, Banbury BL, Shi Q, Allegra CJ, Alberts SR, Peters U, Yothers G, Sinicrope FA, Sun W, Nair S, Harrison TA, Goldberg RM, Lucas PC, Colangelo LH, Atkins JN, Newcomb PA, Chan AT, Genome-wide association with survival in stage ii-iii colon cancer clinical trials (ncctg n0147, alliance for clinical trials in oncology; nsabp c-08, nrg oncology), Journal of Clinical Oncology 36(15_suppl), 3582 (2018). DOI 10.1200/JCO.2018.36.15_suppl.3582
24. Alberts SR, Sinicrope FA, Grothey A, N0147: a randomized phase iii trial of oxaliplatin plus 5-fluorouracil/leucovorin with or without cetuximab after curative resection of stage iii colon cancer, Clinical colorectal cancer 5(3), 211 (2005)
25. Allegra CJ, Yothers G, O’Connell MJ, Sharif S, Petrelli NJ, Colangelo LH, Atkins JN, Seay TE, Fehrenbacher L, Goldberg RM, et al., Phase iii trial assessing bevacizumab in stages ii and iii carcinoma of the colon: results of nsabp protocol c-08, Journal of Clinical Oncology 29(1), 11 (2011)
26.Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. , Science forum: the human cell atlas, eLife 6, e27041 (2017) [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Rozenblatt-Rosen O, Stubbington MJ, Regev A, Teichmann SA, The human cell atlas: from vision to reality, Nature News 550(7677), 451 (2017) [DOI] [PubMed] [Google Scholar]
28.Yi JS, Cox MA, Zajac AJ, T-cell exhaustion: characteristics, causes and conversion, Immunology 129(4), 474 (2010) [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Chen B, Khodadoust MS, Liu CL, Newman AM, Alizadeh AA, in Cancer Systems Biology (Springer, 2018), pp. 243–259 [DOI] [PMC free article] [PubMed] [Google Scholar]
