This is the manual for phyloMDA, an R package for phylogeny-aware microbiome data analysis. phyloMDA takes as input a count matrix, a metadata matrix, and a phylogenetic tree. It consists of three modules: multivariate modeling of microbial counts, extraction of relative abundances from counts, and regression with relative abundances as predictors.
We illustrate statistical methods for analyzing tree structured microbiome data by applying them to the COMBO dataset from a cross-sectional study on relating dietary habits to gut microbiome [1]. The preprocessed data set consists of a matrix that relates abundances of 62 OTUs to 98 samples, a phylogenetic tree that reflects the evolutionary relationship of these OTUs, and metadata that provide information about the samples such as BMI and measurements of dietary nutrients. We will be working with the phyloseq
object.
Load Packages
library(MGLM);
library(plyr);
library(caper);
library(genlasso);
library(magrittr);
library(foreach);
library(ape);
library(miLineage);
library(ggplot2);
library(dplyr);
library(readxl);
library(methods);
library(BiocManager);
library(phyloseq);
library(ggtree);
library(adaANCOM);
library(phyloMDA);
Load Example Data
(phyloseq.obj <- combo.phyloseq.obj)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 62 taxa and 98 samples ]
## sample_data() Sample Data: [ 98 samples by 231 sample variables ]
## tax_table() Taxonomy Table: [ 62 taxa by 9 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 62 tips and 61 internal nodes ]
Plot the phylogenetic tree
library(ggtree);
ggtree(phyloseq.obj@phy_tree, aes(), layout="fan", size=0.3, ladderize=F) +
geom_tiplab(size=2, hjust=-0.05, align = F) + theme_tree()
Heatmap of microbial counts
plot_heatmap(phyloseq.obj, taxa.order = taxa_names(phyloseq.obj))
Calculate the proportion of zeros
otu_tab <- t(phyloseq.obj@otu_table@.Data)
n <- nrow(otu_tab) # sample size
K <- ncol(otu_tab) # number of taxa
zp <- sum(otu_tab==0)/n/K
round(zp, 4) # zero proportion
## [1] 0.6141
View the metadata
library(DT); # an R interface to the JavaScript library DataTables
library(magrittr); # chaining commands %>%
metadata <- sample_data(phyloseq.obj)
ex_nut <- metadata[, 18:20] %>% round(., 4)
cbind(metadata[, 1:6], ex_nut) %>% datatable() # here only list 3 nutrients
Plot the sum of counts across samples and across taxa
library(ggpubr) # creating and customizing ggplot2- based publication ready plots
theme_sets <- theme_bw() + theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"))
pb1 <- plot_bar(phyloseq.obj) + geom_bar(stat="identity", position="stack") +
labs(title='Sum of counts') + theme_sets +
theme(axis.text.x=element_text(face="bold",size=8,angle=90))
pb2 <- plot_bar(phyloseq.obj, 'OTU',fill="phylum") +
geom_bar(stat="identity", position="stack") + labs(title='Sum of counts')+ theme_sets +
theme(axis.text.x=element_text(face="bold",size=8,angle=90))
ggarrange(pb1, pb2, nrow=2, ncol=1)
Visualize the over-dispersion
ggplot(data=NULL, aes(x=colMeans(otu_tab), y=apply(otu_tab, 2, sd))) +
geom_abline(slope = 1, intercept = 0) + geom_point() +
labs(title='Over-dispersion', x='Mean', y='SD') + theme_sets
One distribution for describing bacterial counts is the multinomial (MN) distribution. We fit this distribution by maximum likelihood. The estimates of probabilities are sample proportions.
View sample proportions
sample_prop <- colMeans(otu_tab/rowSums(otu_tab))
ggplot(data=NULL, aes(x=names(sample_prop), y=sample_prop, size=sample_prop)) +
theme_sets +
theme(axis.text.x=element_text(face="bold",size=6,angle=90), legend.position = 'none') +
geom_vline(xintercept = names(sample_prop), linetype="dotted", color='lightgrey') +
geom_point() +
labs(x='OTU', y='Sample proportion')
In the multinomial-logit regression model, the probabilities of the MN distribution are related to a vector of predictors via the logit linear transformation [2].
Load Packages
library(MGLM) # A package for multivariate response generalized linear models
Simple regression with a single predictor
sodium <- metadata$sodium
reg_mn <- MGLMreg(otu_tab ~ 1 + sodium, dist = 'MN') # MGLMfit fits various multivariate discrete distributions
reg_mn@test # overall test
## wald value Pr(>wald)
## (Intercept) 1322196.89 0
## sodium 22716.23 0
reg_mn@logL # fitted simple MN regression loglikelihood
## [1] -388327.3
We can also use MGLM
to fit the MN distribution.
fit_mn <- MGLMfit(data=otu_tab, dist = 'MN')
fit_mn@logL # fitted MN loglikelihood
## [1] -401364.8
In the presence of over-dispersion, the Dirichlet-multinomial (DM) distribution is a popular alternative to the MN distribution. It is a compound MN distribution with weights from the Dirichlet distribution. DM is MN augmented with one additional parameter (known as the over-dispersion parameter). The function MGLMfit can also fit this distribution.
fit_dm <- MGLMfit(data=otu_tab, dist='DM')
fit_dm@logL # fitted DM loglikelihood
## [1] -13845.53
fit_dm@LRTpvalue # LRT MN vs DM
## [1] 0
To run a regression, the log link is used to relate the parameters to the predictors.
reg_dm <- MGLMreg(otu_tab~1+sodium, dist='DM')
round(reg_dm@test, 4)
## wald value Pr(>wald)
## (Intercept) 11851.5553 0.0000
## sodium 81.4292 0.0496
reg_dm@logL
## [1] -13805.07
Rather than simple or multiple regression, we can also run penalized regression, which is preferred when the number of predictors (here, nutrient measurements) is comparable to or larger than the sample size. For illustration, we carry out a penaltized DM regression using the first 20 nutrients. The tuning parameters are typically chosen by an information criterion or cross-validation, here we use BIC.
# MGLMsparsereg fits sparse regression in multivariate generalized linear models.
# Other penalty types than “sweep” are allowed
X <- metadata %>% as.matrix %>% .[,18:37] %>% apply(., 2, as.numeric)
spreg_dm <- MGLMtune(formula = otu_tab~1+X, dist = 'DM', penalty = 'sweep', ngridpt = 20, penidx = c(F, rep(T, 20)))
show(spreg_dm)
## Call: MGLMtune(formula = otu_tab ~ 1 + X, dist = "DM", penalty = "sweep",
## ngridpt = 20, penidx = c(F, rep(T, 20)))
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -13820.67
## BIC: 27953.12
## AIC: 27777.34
## Degrees of freedom: 68
## Lambda: 37.85639
## Number of grid points: 20
Tuning via BIC
path_res <- spreg_dm@path
ggplot(data=path_res, aes(x=Lambda, y=BIC)) +
geom_point() + geom_line() + theme_sets + labs(x='lambda') +
geom_vline(xintercept = spreg_dm@select@lambda, color='red', linetype="dotted")
View estimated coefficients
spreg_coef <- coef(spreg_dm) %>% t
colnames(spreg_coef) <- c('Intercept', colnames(X))
datatable(round(spreg_coef[, 1:11], 4)) # show the first 10 components
To take into account the tree information, Wang, T. and Zhao, H. proposed Dirichlet-tree multinomial (DTM) distribution, defined to be the product of DM distributions that factorize over the phylogenetic tree [3]. Zhou, C, Zhao, H., and Wang T. extended DTM to Zero-inflated DTM (ZIDTM) [4] by the same way. Since the distributions placed on different internal nodes are independent, estimation of the parameters of the DTM model can be carried out separately.
tree <- phy_tree(phyloseq.obj)
fit_dtm <- MGLMdtmFit(otu_tab, tree)
(logL_dtm <- Extract_logL(fit_dtm)) # DTM loglikelihood
## [1] -13160.69
pchisq(2*(logL_dtm - fit_dm@logL), df=61*2-62, lower.tail = F)
## [1] 7.642585e-247
fit_zidtm <- ZIdtmFit(otu_tab, tree)
(logL_zidtm <- Extract_logL(fit_zidtm)) # ZIDTM loglikelihood
## [1] -13178.28
pchisq(2*(logL_zidtm - logL_dtm), df=61*2-62, lower.tail = T)
## [1] 0
reg_dtm <- MGLMdtmReg(otu_tab, sodium, tree)
Extract_logL(reg_dtm) # DTM regression loglikelihood
## [1] -13132.97
reg_zidtm <- ZIdtmReg(otu_tab, X.mean=sodium, tree=tree) # test results for ZIDTM
We can also perform penalized DTM regression by running penalized DM in a node-wise manner.
sreg_dtm <- MGLMdtmSparseReg(otu_tab,X, tree, lambda=Inf, penalty="sweep")
Extract_logL(sreg_dtm)
## [1] -13224.1
Tuning via BIC one each node
Extract_BIC(sreg_dtm)
## [1] 26956.14
View estimated coefficients
spreg_dtmcoef <- lapply(sreg_dtm, function(x) coef(x)) %>% do.call(cbind, .) %>% t
colnames(spreg_dtmcoef) <- c('Intercept', colnames(X))
datatable(round(spreg_dtmcoef[,1:11], 4)) # show the first 10 components
We also implement tuning version for lambda.
sreg_dtm_tune <- MGLMdtmTune(otu_tab, X, tree, penalty="sweep")
Extract_logL(sreg_dtm_tune) # not run
Microbiome data are often normalized prior to downstream analysis. Two widely-used normalization approaches are rarefying and total sum scaling [5]. Liu, T. et al. proposed an empirical Bayes approach for transforming raw counts into relative abundances [6].
Assuming a multinomial distribution for microbial counts and a Dirichlet prior for the proportions, the marginal distribution of each count vector is DM, with the same set of parameters as the prior. We estimate these parameters from OTU counts across samples by maximum likelihood. The empirical Bayes method normalizes the data using the estimated posterior mean.
eBay.comps <- eBay_comps(otu_tab, prior="Dirichlet")
datatable(round(eBay.comps[,1:10],4))
We can further incorporate phylogeny into the normalization process by assuming a Dirichlet-tree prior [6]. The resulting marginal distribution of each count vector is then DTM.
eBay.tree.comps <- eBay_comps(otu_tab, prior="Dirichlet-tree", tree=tree)
datatable(round(eBay.tree.comps[, 1:10], 5))
Microbiome data are compositional and should be treated as compositions. However, since compositions are constrained by the simplex, the analysis of compositional data using traditional tools can be misleading [7]. One way to address this issue is to use ratio transformations and then take the logarithm of these ratios, known as log-ratios. Often, the additive log-ratio and centered log-ratio transformations are used.
comps_alr <- alr_trans(eBay.comps)
comps_clr <- clr_trans(eBay.comps)
When the compositions are extracted from microbial count data, downstream analyses such as diversity estimation, and compositionally aware data analysis are easily achievable. Here, we present two types of diversity estimation: alpha diversity and beta diversity.
library(vegan)
meta_sam <- metadata %>% .[.$bmicat1norm2ow3ob!=2, ]
eBay.comps_sub <- eBay.comps[as.character(meta_sam$pid),]
Shannon.Wiener <- diversity(eBay.comps_sub, index = "shannon")
Simpson <- diversity(eBay.comps_sub, index = "simpson") #Simpson
Inverse.Simpson <- diversity(eBay.comps_sub, index = "inv") #Inverse Simpson
group <- c()
group[which(meta_sam$bmicat1norm2ow3ob==1)] <- "normal"
group[which(meta_sam$bmicat1norm2ow3ob==3)] <- "obese"
library(ggpubr) ###
div.sha <- data.frame(shannon = Shannon.Wiener,group= group)
pic1 <- ggboxplot(div.sha,
x="group",
y="shannon",
color="group",
palette = "jama",legend = "right",
add = "jitter",xlab="Group",ylab="alpha diversity",title="Shannon")+
theme(plot.title = element_text(hjust = 0.5))
simp.sha <- data.frame(simpson = Simpson,group= group)
pic2 <- ggboxplot(simp.sha,
x="group",
y="simpson",
color="group",
palette = "jama", legend = "right",
add = "jitter",xlab="Group",ylab="alpha diversity",title="Simpson")+
theme(plot.title = element_text(hjust = 0.5))
isimp.sha <- data.frame(isimpson = Inverse.Simpson,group= group)
pic3 <- ggboxplot(isimp.sha,
x="group",
y="isimpson",
color="group",
palette = "jama",legend = "right",
add = "jitter",xlab="Group",ylab="alpha diversity",title="Inverse.Simpson")+
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text = element_text(size = 13), axis.title = element_text(size = 15), legend.text = element_text(size = 11))
g123 <- ggarrange(pic1,pic2,pic3,nrow=1,common.legend = TRUE)
g123
df_pca <- prcomp(eBay.comps_sub)
df_pcs <-data.frame(df_pca$x, group = group)
percentage<-round(df_pca$sdev / sum(df_pca$sdev) * 100,2)
percentage<-paste(colnames(df_pcs),"(", paste(as.character(percentage), "%", ")", sep=""))
df_pcs$group <- as.factor(group)
library(RColorBrewer)
col <- brewer.pal(8,"Accent")[c(2,3)]
ggplot(df_pcs,aes(x=PC1,y=PC2,color=group,shape=group)) + geom_point(size=2)+
xlab(percentage[1]) + ylab(percentage[2]) +
scale_fill_manual(values=col)+
scale_colour_manual(values=col)+
scale_shape_manual(values = c(2, 9))+
theme_classic()
To illustrate the usefulness of the empirical Bayes method, we apply it to the COMBO dataset for differential abundance testing. We categorize the covariate BMI as normal weight, overweight, and obese, and focus on the normal weight and obese individuals. There are 70 individuals and 62 OTUs.
meta_sam <- metadata %>% .[.$bmicat1norm2ow3ob!=2, ]
comps_clr_sub <- comps_clr[as.character(meta_sam$pid), ]
comps_test <- apply(comps_clr_sub, 2, function(x) { t.test(x[meta_sam$bmicat1norm2ow3ob==1],
x[meta_sam$bmicat1norm2ow3ob==3])$p.value })
dif.otus <- names(which(p.adjust(comps_test, 'fdr')<0.05))
tax.dif.otus <- tax_table(phyloseq.obj)[dif.otus, ]
# colnames(tax.dif.otus) <- "taxanomy"
datatable(tax.dif.otus)
We also compared eBay to other differential abundance testing methods, such as MetagenomeSeq, DESeq2, ANCOM, rarefying, total sum scaling (tss).
Load Packages
library(metagenomeSeq);
library(DESeq2);
source('/Users/tiantian/Downloads/ancom_functions.r')
library(UpSetR);
metagenomeSeq
otu_tab_sub <- otu_tab[rownames(meta_sam), ]
phenotypeData <- AnnotatedDataFrame(meta_sam)
obj_meta <- newMRexperiment(t(otu_tab_sub),phenoData=phenotypeData) %>% cumNorm(., p = 0.5)
mod_meta <- model.matrix(~meta_sam$bmicat1norm2ow3ob,data = environment(obj_meta))
res_meta <- fitFeatureModel(obj = obj_meta, mod = mod_meta)%>% MRcoefs(.,number=ncol(otu_tab_sub))
res_meta.final <- rownames(res_meta)[which(res_meta$adjPvalues<0.05)]
#####MetagenomeSeq failed to detect any OTU###
DESeq2
group <- meta_sam$bmicat1norm2ow3ob
deseq_data <- DESeqDataSetFromMatrix(t(otu_tab_sub+1), DataFrame(group), ~ group)
res_deseq <- estimateSizeFactors(deseq_data) %>%estimateDispersions %>%nbinomWaldTest %>% results(.)
res_deseq.final <- rownames(res_deseq)[which(res_deseq$padj<0.05)]
ANCOM
res_ancom <- ANCOM(OTUdat=data.frame(otu_tab_sub,Group=group),
sig=0.05, multcorr=3, tau=0.02, theta=0.1, repeated=FALSE)$detected
res_ancom <- gsub('X', '', res_ancom)
rarefying
otu <- otu_table(as.matrix(t(otu_tab_sub)), taxa_are_rows=TRUE)
sample_data <- sample_data(meta_sam)
phyloseq.obj.sub <- phyloseq(otu, sample_data)
phyloseq.obj.rare <- rarefy_even_depth(phyloseq.obj.sub, sample.size = min(sample_sums(phyloseq.obj.sub))*0.9 ,rngseed = 2018)
res_rare <- apply(otu_table(phyloseq.obj.rare), 1, function(x) {
t.test(x[meta_sam$bmicat1norm2ow3ob==1], x[meta_sam$bmicat1norm2ow3ob==3])$p.value
})
res_rare.final <- names(which(p.adjust(res_rare, 'BH') < 0.05))
#####rerafying failed to detect any OTU###
tss
otu_tab_rep <- otu_tab_sub
otu_tab_rep[otu_tab_rep==0] <- 0.5
otu_tab_tss <- otu_tab_rep/rowSums(otu_tab_rep)
tss_clr <- apply(otu_tab_tss, 1, function(x){log2(x) - mean(log2(x))})
res_tss <- apply(tss_clr, 1, function(x) { t.test(x[meta_sam$bmicat1norm2ow3ob==1],
x[meta_sam$bmicat1norm2ow3ob==3])$p.value })
res_tss.final <- names(which(p.adjust(res_tss, 'BH') < 0.05))
none
res_raw <- apply(otu_tab_sub, 2, function(x) {t.test(x[meta_sam$bmicat1norm2ow3ob==1], x[meta_sam$bmicat1norm2ow3ob==3])$p.value })
res_raw.final <- names(which(p.adjust(res_raw, 'BH') < 0.05))
The results are shown as follows:
dataForUpSetPlot <- list("none-t"=res_raw.final,
"tss-t"=res_tss.final,
"rarefying-t"=res_rare.final,
"metagenomeSeq"=res_meta.final,
"eBay-t"=dif.otus,
"eBay-tree-t"=dif.otus,
"DESeq2"=res_deseq.final,
"ANCOM"=res_ancom)
setsBarColors <-c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3", 'brown2', 'pink',"dodgerblue1")
res <- upset(fromList(dataForUpSetPlot),
nsets=length(dataForUpSetPlot),
keep.order = TRUE,
point.size = 1,
line.size = 0.5,
number.angles = 0,
shade.alpha=1,
matrix.dot.alpha=1,
text.scale = c(1.5, 1.2, 1.2, 1, 1.5, 1),
matrix.color="black",
main.bar.color = 'black',
mainbar.y.label = 'Intersection Size',
sets.bar.color=setsBarColors)
res
In some machine learning applications, microbiome relative abundance data are used as predictors or inputs that have some influence on one or more outputs [8]. This section is concerned with regression problems where the predictors are compositional data and the response is univariate.
To remove the unit-sum constraint, Aitchison, J. and Bacon-Shone, J. proposed a linear log-contrast model [9]. Lin, W. et al. proposed constrained lasso for fitting this model in high dimensions [10]. To select a sparse collection of log-contrasts, Bates, S. and Tibshirani, R. introduced the all-pairs log-ratio model [11]. They proposed a two-step procedure for finding a highly sparse solution, which is implemented in the R logratiolasso
package. To illustrate these two methods, consider the regression of BMI on the estimated microbiome compositions (see Estimating microbiome compositions).
Load Packages
library(logratiolasso) # An R implementation of the constrained lasso and two-stage logratio lasso.
source("/Users/tiantian/Downloads/plot_onTree.R")
y <- metadata$bmi
x <- eBay.comps %>% log2 # log of estimated microbiome compositions
centered_y <- y - mean(y)
centered_x <- scale(x, center=T, scale=F)
# constrained lasso
constrain_lasso <- glmnet.constr(centered_x, centered_y, family = 'gaussian')
set.seed(10)
cv_constrain_lasso <- cv.glmnet.constr(constrain_lasso, centered_x, centered_y)
# str(cv_constrain_lasso),
# two-stage log-ratio lasso
set.seed(10)
cv_ts_lasso <- cv_two_stage(centered_x, centered_y, k_max = 7) # str(cv_ts_lasso)
## [1] "Starting CV fold 1"
## [1] "Starting CV fold 2"
## [1] "Starting CV fold 3"
## [1] "Starting CV fold 4"
## [1] "Starting CV fold 5"
## [1] "Starting CV fold 6"
## [1] "Starting CV fold 7"
## [1] "Starting CV fold 8"
## [1] "Starting CV fold 9"
## [1] "Starting CV fold 10"
Plots of the CV estimates of error
gl1 <- ggplot(data=NULL, aes(x=cv_constrain_lasso$lambda, cv_constrain_lasso$cvm)) +
geom_point() + geom_line() + labs(title='constrained lasso', x='lambda', y='CV error')+ theme_sets +
geom_vline(xintercept = cv_constrain_lasso$lambda[which.min(cv_constrain_lasso$cvm)],
color='red', linetype="dotted")
gl2 <- ggplot(data=NULL, aes(x=cv_ts_lasso$lambda, y=cv_ts_lasso$mse[cv_ts_lasso$k_min,])) +
geom_point() + geom_line() + labs(title='two-stage log-ratio lasso', x='lambda', y='CV error') +
theme_sets +
geom_vline(xintercept = cv_ts_lasso$lambda[cv_ts_lasso$lambda_min], color='red',
linetype="dotted")
gl12 <- ggarrange(gl1, gl2, nrow=1)
gl12
View estimated coefficients
beta12 <- data.frame(b1=cv_constrain_lasso$beta, b2=cv_ts_lasso$beta_min)
colnames(beta12) <- c('constrained lasso', 'two-stage log-ratio lasso')
datatable(round(beta12[rowSums(beta12)!=0, ], 4))
Since a composition carries only relative information, subcompositions, which preserve ratio relationships, are fundamental objects of investigation in compositional data analysis. In the regression setting, Wang, T. and Zhao, H. proposed a multiscale method, Tree-guided Automatic Subcomposition Selection Operator (TASSO), for selecting subcompositions at subtree levels [12]. The criterion for TASSO contains two penalty terms corresponding to leaf nodes and internal nodes, respectively. The TASSO solution is obtained by solving a generalized lasso problem [13].
Load Packages
library(genlasso)
# constrained lasso (genlasso)
fit1_tasso <- TASSO(y,eBay.comps,tree=NULL)
# TASSO
fit2_tasso <- TASSO(y, eBay.comps,tree)
#str(fit2_tasso)
Tuning via BIC
# BIC for constrained lasso and TASSO
fit1_tasso_IC <- ICgenlasso(fit1_tasso)
fit2_tasso_IC <- ICgenlasso(fit2_tasso)
gf1 <- ggplot(data=fit1_tasso_IC, aes(x=lambda, y=BIC)) + geom_line() +
geom_point() + theme_sets + labs(title='constrained lasso (genlasso)') +
geom_vline(xintercept = fit1_tasso_IC$lambda[which.min(fit1_tasso_IC$BIC)], color='red', linetype='dotted')
gf2 <- ggplot(data=fit2_tasso_IC, aes(x=lambda, y=BIC)) +
geom_point() + geom_line() + theme_sets + labs(title='TASSO') +
geom_vline(xintercept = fit2_tasso_IC$lambda[which.min(fit2_tasso_IC$BIC)], color='red', linetype='dotted')
gf12 <- ggarrange(gf1, gf2, nrow=1)
gf12
View estimated coefficients
fit1_beta <- fit1_tasso$beta[, which.min(fit1_tasso_IC$BIC)]
fit1_beta <- round(fit1_beta - mean(fit1_beta) , 8)
fit2_beta <- fit2_tasso$beta[, which.min(fit2_tasso_IC$BIC)]
fit2_beta <- round(fit2_beta - mean(fit2_beta) , 8)
tasso_beta12 <- data.frame(b1=beta12[,1], b2=fit1_beta, b3=beta12[,2], b4=fit2_beta)
colnames(tasso_beta12) <- c('constrained lasso', 'constrained lasso (genlasso)', 'two-stage log-ratio lasso', 'TASSO')
rownames(tasso_beta12) <- taxa_names(phyloseq.obj)
tasso_beta12_nonzero <- tasso_beta12[rowSums(tasso_beta12)!=0, ]
datatable(round(tasso_beta12_nonzero, 4))
Visualize the TASSO results on the tree
plot_onTree(fit2_tasso, tree=tree, model = 'TASSO')
So far compositions are assumed to lie in a strictly positive simplex. The main reason is that we cannot take the logarithm of zero in log-contrasts or log-ratios. To take into account the compositional nature, high dimensionality, and phylogeny of microbiome data, Wang, T. and Zhao, H. introduced the concept of variable fusion and proposed a multiscale dimension reduction method [14]. Instead of using linear log-contrast model, they used linear model. For increased interpretability, Wang, T. and Zhao, H. proposed tree-guided variable fusion to harness a predictive microbial signature made of a set of multi-level taxa. They constructed two weighted fused lasso penalties that encode the tree topology. Again, the optimization problem has a generalized lasso formulation.
fit_tflasso1 <- TreeFusedlasso(y, eBay.comps, tree)
fit_tflasso1
##
## Call:
## genlasso(y = centered_Y, X = centered_X, D = D.mat.w)
##
## Output:
## Path model with 153 total steps.
# fit_tflasso2 <- TreeFusedlasso(y, eBay.comps, tree, type = 2)
# fit_tflasso2
Tuning via BIC
tgfused_lasso_IC1 <- ICgenlasso(fit_tflasso1)
# tgfused_lasso_IC2 <- ICgenlasso(fit_tflasso2)
# datatable(round(tgfused_lasso_IC2, 5))
gf3 <- ggplot(tgfused_lasso_IC1, aes(x=lambda, y=BIC)) +
geom_point() + geom_line() + theme_sets + labs(title='tree-guided fused lasso penalty 1') +
geom_vline(xintercept = tgfused_lasso_IC1$lambda[which.min(tgfused_lasso_IC1$BIC)], color='red', linetype='dotted')
# gf4 <- ggplot(tgfused_lasso_IC2, aes(x=lambda, y=BIC)) +
# geom_point() + geom_line() + theme_sets + labs(title='tree-guided fused lasso penalty 2') +
# geom_vline(xintercept = tgfused_lasso_IC2$lambda[which.min(tgfused_lasso_IC2$BIC)], color='red', linetype='dotted')
# gf34 <- ggarrange(gf3, gf4, nrow=1, ncol=2)
gf3 # both the biggest lambda get the smallest BIC
Visualize the results on the tree
plot_onTree(fit_tflasso1, tree=tree, model = 'TreeFusedlasso')
[1] : Wu, G.D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S.A., et al.: Linking long-term dietary patterns with gut microbial enterotypes. Science 334(6052), 105–108 (2011)
[2] : Kim, Juhyun, et al. “MGLM: An R package for multivariate categorical data analysis.” The R Journal 10.1 (2018): 73.
[3] : Wang, T., Zhao, H.: A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics 73(3), 792–801 (2017)
[4] : Zhou, C., Zhao, H., and Wang, T. (2021). Transformation and differential abundance analysis of microbiome data incorporating phylogeny. Bioinformatics 37(24), 4652-4660 (2021)
[5] : McMurdie, P.J., Holmes, S.: Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Computational Biology 10(4), e1003531 (2014)
[6] : Liu, T., Zhao, H. & Wang, T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 21, 225 (2020).
[7] : Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., Egozcue, J.J.: Microbiome datasets are compositional: And this is not optional. Frontiers in Microbiology 8, 1–6 (2017)
[8] : Knights, D., Parfrey, L.W., Zaneveld, J., Lozupone, C., Knight, R.: Human-associated microbial signatures: Examining their predictive value. Cell Host & Microbe 10(4), 292–296 (2011)
[9] : Aitchison, J., Bacon-Shone, J.: Log contrast models for experiments with mixtures. Biometrika 71(2), 323–330 (1984)
[10] : Lin, W., Shi, P., Feng, R., Li, H.: Variable selection in regression with compositional covariates. Biometrika 104(4), 785–797 (2014)
[11] : Bates, S., Tibshirani, R.: Log-ratio lasso: Scalable, sparse estimation for log-ratio models. Biometrics 75 (2), 613–624 (2019)
[12] : Wang, T., Zhao, H.: Structured subcomposition selection in regression and its application to microbiome data analysis. The Annals of Applied Statistics 11(2), 771–791 (2017)
[13] : Tibshirani, R.J., Taylor, J.: The solution path of the generalized lasso. The Annals of Statistics 39(3), 1335–1371 (2011)
[14] : Wang, T., Zhao, H.: Constructing predictive microbial signatures at multiple taxonomic levels. Journal of the American Statistical Association 112(519), 1022–1031 (2017)
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] logratiolasso_0.1.0 UpSetR_1.4.0
## [3] DESeq2_1.32.0 SummarizedExperiment_1.22.0
## [5] MatrixGenerics_1.4.3 matrixStats_0.61.0
## [7] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
## [9] IRanges_2.26.0 S4Vectors_0.30.2
## [11] metagenomeSeq_1.34.0 glmnet_4.1-3
## [13] limma_3.48.3 Biobase_2.52.0
## [15] BiocGenerics_0.38.0 RColorBrewer_1.1-2
## [17] vegan_2.5-7 lattice_0.20-45
## [19] permute_0.9-7 ggpubr_0.4.0
## [21] DT_0.21 phyloMDA_0.1.0
## [23] adaANCOM_0.2.1 ggtree_3.0.4
## [25] phyloseq_1.36.0 BiocManager_1.30.16
## [27] readxl_1.3.1 dplyr_1.0.8
## [29] ggplot2_3.3.5 miLineage_2.1
## [31] geepack_1.3.3 data.table_1.14.2
## [33] foreach_1.5.2 magrittr_2.0.2
## [35] genlasso_1.5 igraph_1.2.11
## [37] Matrix_1.4-1 caper_1.0.1
## [39] mvtnorm_1.1-3 MASS_7.3-55
## [41] ape_5.6-1 plyr_1.8.6
## [43] MGLM_0.2.0
##
## loaded via a namespace (and not attached):
## [1] backports_1.4.1 lazyeval_0.2.2 splines_4.1.0
## [4] BiocParallel_1.26.2 crosstalk_1.2.0 digest_0.6.29
## [7] yulab.utils_0.0.4 htmltools_0.5.2 fansi_1.0.2
## [10] memoise_2.0.1 cluster_2.1.2 annotate_1.70.0
## [13] Biostrings_2.60.2 colorspace_2.0-3 blob_1.2.2
## [16] xfun_0.29 crayon_1.5.0 RCurl_1.98-1.6
## [19] jsonlite_1.8.0 genefilter_1.74.1 survival_3.3-0
## [22] iterators_1.0.14 glue_1.6.2 gtable_0.3.0
## [25] zlibbioc_1.38.0 XVector_0.32.0 DelayedArray_0.18.0
## [28] car_3.0-12 Rhdf5lib_1.14.2 shape_1.4.6
## [31] abind_1.4-5 scales_1.1.1 DBI_1.1.2
## [34] rstatix_0.7.0 Rcpp_1.0.8.3 xtable_1.8-4
## [37] gridGraphics_0.5-1 tidytree_0.3.8 bit_4.0.4
## [40] httr_1.4.2 htmlwidgets_1.5.4 gplots_3.1.1
## [43] ellipsis_0.3.2 XML_3.99-0.9 pkgconfig_2.0.3
## [46] farver_2.1.0 sass_0.4.0 locfit_1.5-9.4
## [49] utf8_1.2.2 AnnotationDbi_1.54.1 ggplotify_0.1.0
## [52] tidyselect_1.1.2 labeling_0.4.2 rlang_1.0.1
## [55] reshape2_1.4.4 cachem_1.0.6 munsell_0.5.0
## [58] cellranger_1.1.0 tools_4.1.0 cli_3.2.0
## [61] RSQLite_2.2.10 generics_0.1.2 ade4_1.7-18
## [64] broom_0.7.12 evaluate_0.15 biomformat_1.20.0
## [67] stringr_1.4.0 fastmap_1.1.0 yaml_2.3.5
## [70] bit64_4.0.5 knitr_1.37 caTools_1.18.2
## [73] purrr_0.3.4 KEGGREST_1.32.0 nlme_3.1-155
## [76] aplot_0.1.2 compiler_4.1.0 rstudioapi_0.13
## [79] png_0.1-7 ggsignif_0.6.3 treeio_1.16.2
## [82] geneplotter_1.70.0 tibble_3.1.6 bslib_0.3.1
## [85] stringi_1.7.6 highr_0.9 ggsci_2.9
## [88] multtest_2.48.0 vctrs_0.3.8 pillar_1.7.0
## [91] lifecycle_1.0.1 rhdf5filters_1.4.0 jquerylib_0.1.4
## [94] cowplot_1.1.1 bitops_1.0-7 patchwork_1.1.1
## [97] R6_2.5.1 KernSmooth_2.23-20 gridExtra_2.3
## [100] codetools_0.2-18 exactRankTests_0.8-34 gtools_3.9.2
## [103] assertthat_0.2.1 Wrench_1.10.0 rhdf5_2.36.0
## [106] withr_2.5.0 GenomeInfoDbData_1.2.6 mgcv_1.8-39
## [109] grid_4.1.0 ggfun_0.0.5 tidyr_1.2.0
## [112] rmarkdown_2.12 carData_3.0-5