Abstract
Motivation
Nonparametric multivariate analysis has been widely used to identify variables associated with a dissimilarity matrix and to quantify their contribution. For very large studies and many explanatory variables, existing software packages (e.g. adonis and adonis2 in vegan) are computationally intensive when conducting sequential multivariate analysis with permutations or bootstrapping. Moreover, for subjects sampled under a complex sampling design, sampling weights must be incorporated to derive an unbiased estimate.
Results
We implemented an R function, fast.adonis, to overcome these computational challenges in large-scale studies. fast.adonis generates results consistent with adonis/adonis2 but is much faster. For complex sampling studies, fast.adonis integrates sampling weights algebraically to mimic the source population; thus, the analysis can be completed quickly without requiring a large amount of memory.
Availability and implementation
fast.adonis is implemented using R and is publicly available at https://github.com/jennylsl/fast.adonis.
Supplementary information
Supplementary data are available at Bioinformatics Advances online.
Nonparametric multivariate analysis (Anderson, 2001; McArdle and Anderson, 2001) based on a dissimilarity matrix has been widely used to analyze human microbiome data. This analysis quantifies the overall contribution of explanatory variables (R2, coefficient of multiple determination), individually or collectively, in explaining the variation in the dissimilarity matrix [e.g. the UniFrac distance matrix (Lozupone and Knight, 2005)]. Statistical significance is then assessed using permutations, and confidence intervals (CIs) are obtained using bootstrap sampling. The functions adonis and adonis2 in the R package vegan (Oksanen et al., 2020) are most commonly used for human microbiome data. While useful, they are computationally intensive for analyzing large-scale studies with thousands or tens of thousands of subjects and many explanatory variables (McDonald et al., 2018), particularly when performing sequential multivariate analysis (SMA) with permutations and bootstrapping. Moreover, we often perform many SMAs with the variables entered in different orders because the R2 attributed to an individual variable depends on the order.
A more complicated problem is to analyze microbiome data from a study using complex sampling. For example, using a nested case-cohort design in a cohort study with N subjects, suppose that we have microbiome data for n subjects, each of whom is sampled with a probability based on the fraction of the subjects selected from their sampling stratum. To derive an unbiased estimate of R2 that reflects the source cohort population, we have to explicitly incorporate sampling weights (Korn and Graubard, 1999).
We first describe fitting multiple multivariate models simultaneously for subjects from a natural population. For a study with n samples, we have an n × n dissimilarity matrix $D = (d_{ij})$ and an n × p design matrix X for p explanatory variables with the first column $\mathbf{1}_n$, a vector of ones. We define an n × n matrix A with $a_{ij} = -\frac{1}{2} d_{ij}^2$. The Gower’s centered matrix (Gower, 1966; Gower and Legendre, 1986) is defined as $G = (I - K/n)\,A\,(I - K/n)$, where K is an n × n matrix with $k_{ij} = 1$ and I is an n × n identity matrix. Let $H = X(X^{T}X)^{-1}X^{T}$ be the idempotent hat matrix. McArdle and Anderson (2001) showed that
$$\mathrm{tr}(G) = \mathrm{tr}(HGH) + \mathrm{tr}\left[(I-H)\,G\,(I-H)\right]. \tag{1}$$
Based on this partitioning, $R^2 = \mathrm{tr}(HGH)/\mathrm{tr}(G)$ is interpreted as the fraction of variance in matrix D explained by the p variables. In marginal multivariate analysis, we fit one model for each individual variable to derive $R^2$. In SMA for p variables with a given order, we calculate $R^2_k$ for the first k variables, and the differences $R^2_k - R^2_{k-1}$ are then calculated as the incremental contribution of each individual variable. Consequently, the SMA results depend on the order of the variables.
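As an illustration of the partitioning, the following minimal Python/NumPy sketch computes R2 directly from the definitions, taking a_ij = -d_ij^2/2, G = (I - K/n)A(I - K/n) and H = X(X^T X)^{-1}X^T. It is not the fast.adonis implementation (which is written in R); the function names `gower_center`, `r2_direct` and `sequential_r2` and the simulated data are assumptions made for this example only. It forms full n × n products, so it is only practical for small n.

```python
import numpy as np

def gower_center(D):
    """Gower's centered matrix G = (I - K/n) A (I - K/n), with a_ij = -d_ij^2 / 2."""
    n = D.shape[0]
    A = -0.5 * D ** 2
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ A @ C

def r2_direct(D, X):
    """R^2 = tr(HGH) / tr(G), with hat matrix H = X (X^T X)^{-1} X^T."""
    G = gower_center(D)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.trace(H @ G @ H) / np.trace(G)

def sequential_r2(D, X):
    """Incremental R^2 of each non-intercept column of X, in column order."""
    cumulative = [r2_direct(D, X[:, :k]) for k in range(1, X.shape[1] + 1)]
    # The intercept-only model has R^2 = 0, so the differences are R^2_k - R^2_{k-1}.
    return np.diff(cumulative)

# Simulated example (hypothetical data): Euclidean distances between 40 samples.
rng = np.random.default_rng(0)
n = 40
Y = rng.normal(size=(n, 5))
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
Y[:, 0] += 2 * x1                      # make x1 associated with the distances
D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
X = np.column_stack([np.ones(n), x1, x2])

total = r2_direct(D, X)
increments = sequential_r2(D, X)       # contributions of x1, then x2 given x1
```

The increments sum to the R2 of the full model, and swapping the columns for x1 and x2 generally changes the individual increments but not their sum.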
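In Supplementary Note, we show that $\mathrm{tr}(G) = -\frac{1}{n}\sum_{i,j} a_{ij}$ and $\mathrm{tr}(HGH) = \mathrm{tr}(HA) - \frac{1}{n}\sum_{i,j} a_{ij}$. Thus, to calculate $R^2$, it remains to calculate tr(HA). Let $V = X(X^{T}X)^{-1}$ and $U = X^{T}A$. Here, V is an n × p matrix and U is a p × n matrix. Thus, $\mathrm{tr}(HA) = \mathrm{tr}(VU) = \sum_{i=1}^{n}\sum_{j=1}^{p} v_{ij}u_{ji}$. In Supplementary Note, we show that the computational complexity for tr(HA) is $O(n^2 p)$ when $n \gg p$, which is much less than the $O(n^3)$ required for the multiplication of n × n matrices.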
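The reduction of R2 to tr(HA) can be checked from the standard definitions. The short derivation below is a sketch worked out from those definitions rather than a quotation of the Supplementary Note. It assumes a_ij = -d_ij^2/2 with d_ii = 0, G = PAP with P = I - K/n, H = X(X^T X)^{-1}X^T, and that the first column of X is a vector of ones, so that HK = KH = K. The symbols P and \bar{a} are introduced here for convenience.

```latex
\begin{aligned}
\bar{a} &:= \tfrac{1}{n}\sum_{i,j} a_{ij}, \qquad P := I - K/n, \qquad P^2 = P,\\[4pt]
\mathrm{tr}(G) &= \mathrm{tr}(PAP) = \mathrm{tr}(PA)
   = \mathrm{tr}(A) - \tfrac{1}{n}\,\mathrm{tr}(KA) = -\bar{a}
   \quad (\text{since } a_{ii} = 0),\\[4pt]
PHP &= H - \tfrac{1}{n}KH - \tfrac{1}{n}HK + \tfrac{1}{n^2}KHK
    = H - \tfrac{1}{n}K
   \quad (\text{since } HK = KH = K,\; K^2 = nK),\\[4pt]
\mathrm{tr}(HGH) &= \mathrm{tr}(HG) = \mathrm{tr}(PHP\,A)
   = \mathrm{tr}(HA) - \bar{a},\\[4pt]
R^2 &= \frac{\mathrm{tr}(HGH)}{\mathrm{tr}(G)}
    = \frac{\mathrm{tr}(HA) - \bar{a}}{-\bar{a}} .
\end{aligned}
```

Only tr(HA) depends on the design matrix, which is why the cost of fitting a model is dominated by that single trace.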
Now, we consider simultaneously fitting multivariate models for M subsets of the p variables with a given order (denoted as $S_1, \ldots, S_M$). We first calculate $U = X^{T}A$ for all p variables. For any subset $S_m$, we do not need to calculate $U_m$ individually; instead, we extract the corresponding rows of U. For a subset $S_m$ with q variables, computing $U_m$ has a complexity of $O(n^2 q)$ and is the most computationally intensive step for fitting the model when $n \gg q$. Thus, this extension is suitable for fitting many models simultaneously, including SMAs.
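The idea of sharing one expensive product across many models can be sketched as follows in Python/NumPy. This is an illustrative reading of the approach, not the fast.adonis source: it assumes that the shared quantity is U = X^T A with a_ij = -d_ij^2/2, that each subset includes the intercept column, and that R2 = (tr(HA) - ā)/(-ā) with ā = (1/n) Σ a_ij. The function name `fast_r2_subsets` and the simulated data are hypothetical.

```python
import numpy as np

def fast_r2_subsets(D, X, subsets):
    """R^2 for several column subsets of X, reusing U = X^T A for all of them.

    Each subset is a list of column indices of X that includes the intercept
    column (all ones); the selected columns must have full column rank.
    """
    n = D.shape[0]
    A = -0.5 * D ** 2
    mean_a = A.sum() / n                # (1/n) * sum_ij a_ij
    tr_G = -mean_a                      # valid because diag(A) = 0
    U = X.T @ A                         # p x n, computed once: O(n^2 p)
    r2 = []
    for S in subsets:
        Xs = X[:, S]                                    # n x q
        Vs = np.linalg.solve(Xs.T @ Xs, Xs.T).T         # n x q: Xs (Xs^T Xs)^{-1}
        tr_HA = np.sum(Vs * U[S, :].T)                  # tr(V U), H never formed
        r2.append((tr_HA - mean_a) / tr_G)
    return np.array(r2)

# Simulated example (hypothetical data): one SMA as a sequence of nested subsets.
rng = np.random.default_rng(1)
n, p = 60, 4
Y = rng.normal(size=(n, 6))
D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

nested = [list(range(k)) for k in range(1, p + 1)]   # first k columns of X
cumulative = fast_r2_subsets(D, X, nested)
increments = np.diff(cumulative)                      # incremental R^2 per variable
```

Several SMAs with different variable orders can be evaluated in one call by concatenating their nested subsets, since every subset only indexes rows of the same U.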
We compared the computational time of fast.adonis and adonis2 for performing one SMA with 1000 permutations on a MacBook Pro with a 2.3 GHz Intel Core i9 CPU and 16 gigabytes of memory. No comparisons were made with adonis because of its memory requirement at n = 10 000 and p = 20. For n = 10 000 and p > 20, fast.adonis is about 50–100 times faster than adonis2 (Fig. 1A). At n = 20 000, adonis2 did not run successfully in some of the settings considered. Next, we compared the computing time for performing 10 SMAs for the same set of p variables included in different orders (Fig. 1B). fast.adonis performed the 10 SMAs simultaneously, while adonis2 performed them serially.
We used fast.adonis to analyze the data from the American Gut Project (McDonald et al., 2018) with 111 variables (p = 559 after expanding categorical variables) and n = 7096 subjects. We evaluated the marginal R2 of these variables and performed SMA with variables ordered by the marginal R2 values. Analyses were performed for the weighted UniFrac dissimilarity matrix with 1000 permutations and 1000 bootstrap samples. Results for the top 30 variables are shown in Figure 1C. Consistent with the original publication (McDonald et al., 2018), technical factors were most associated with the dissimilarity matrix, followed by nutrition variables. For most nutrition variables, the R2 from sequential analyses was much lower than that from marginal analyses. Because p = 559, this analysis took fast.adonis 8.8 h (1 SMA, 1000 permutations and 1000 bootstrap samples); adonis2 was not able to finish the analysis.
Next, we extend the algorithm to complex sampling studies. In the complex sampling design used to obtain a subcohort for a case-cohort study, the subjects in the cohort are partitioned into multiple strata, and subjects are randomly selected from each stratum with a stratum-specific sampling fraction. As an example, we have recently characterized the oral microbiome of n = 2487 subjects from the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO) cohort (n = 37 263 eligible individuals with oral wash specimens) to prospectively investigate the association between the oral microbiome and the risk of multiple cancers, including lung cancer (Vogtmann et al., 2022). To select a referent subcohort for comparison to the cases, 24 strata were created based on age, sex and smoking variables, and stratum-specific sampling fractions were determined based on the number of site-specific cancer cases in each stratum. To derive an unbiased estimate of R2 in the dissimilarity matrix for many demographic and lifestyle factors, we extended the algorithm to explicitly incorporate the sampling weights (Supplementary Note), following the philosophy of survey data analyses (Korn and Graubard, 1999).
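To make the notion of a sampling weight concrete, the sketch below computes inverse-probability weights in Python under a simplifying assumption: subjects are drawn by simple random sampling within each stratum, so the sampling fraction of stratum h is n_h/N_h and the weight is N_h/n_h. The function name `sampling_weights` and the toy strata are hypothetical. How the weights enter the R2 calculation is described in the authors' Supplementary Note and is not reproduced here.

```python
from collections import Counter

def sampling_weights(cohort_strata, sampled_strata):
    """Inverse-probability weights from stratum-specific sampling fractions.

    cohort_strata : stratum label of every subject in the full cohort
    sampled_strata: stratum label of every sampled subject
    Returns one weight per sampled subject, N_h / n_h for its stratum h.
    """
    N_h = Counter(cohort_strata)     # stratum sizes in the cohort
    n_h = Counter(sampled_strata)    # stratum sizes in the sample
    return [N_h[h] / n_h[h] for h in sampled_strata]

# Toy example: stratum A is sampled at 10%, stratum B at 50%.
cohort = ["A"] * 100 + ["B"] * 50
sampled = ["A"] * 10 + ["B"] * 25
weights = sampling_weights(cohort, sampled)
```

Each sampled subject in stratum A then stands for 10 cohort members and each subject in stratum B for 2, so the weights sum to the cohort size.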
We analyzed the PLCO data for p = 29 dummy variables generated from seven categorical variables (Vogtmann et al., 2022): age, race, sex, education, smoking history, alcohol consumption and body mass index (Fig. 1D). For each analysis, we used within-stratum bootstrapping, independently resampling the stratum-specific samples with replacement, to derive the CI for R2. Together, these seven variables explained 4.70% (95% CI = 3.84–5.68%) of the variance in the weighted UniFrac dissimilarity matrix, while smoking history alone explained 1.70% (95% CI = 1.20–2.22%) of the variance, conditioning on age, sex and education. Results based on marginal analyses and sequential analyses were similar in these data. The analysis with 1000 bootstrap resamples took 3.8 min.
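Within-stratum bootstrapping can be sketched in a few lines of Python. The example below resamples subject indices with replacement separately in each stratum, so every bootstrap sample keeps the original stratum sizes, and then forms a percentile interval. The percentile interval is an assumption of this sketch (the text does not state which interval type was used), and the names `stratified_bootstrap_indices`, `percentile_ci` and `bootstrap_ci` are hypothetical. In the setting of the paper, the statistic would be R2 recomputed on the rows and columns of the dissimilarity matrix selected by the resampled indices; a simple mean is used here to keep the example self-contained.

```python
import random
from collections import defaultdict

def stratified_bootstrap_indices(strata, rng):
    """Resample subject indices with replacement, independently within each stratum."""
    by_stratum = defaultdict(list)
    for i, h in enumerate(strata):
        by_stratum[h].append(i)
    resample = []
    for members in by_stratum.values():
        resample.extend(rng.choices(members, k=len(members)))
    return resample

def percentile_ci(values, level=0.95):
    """Nearest-rank percentile interval from bootstrap replicates."""
    s = sorted(values)
    lo = s[round((1 - level) / 2 * (len(s) - 1))]
    hi = s[round((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

def bootstrap_ci(statistic, strata, n_boot=1000, seed=0):
    """CI for statistic(indices) under within-stratum resampling."""
    rng = random.Random(seed)
    reps = [statistic(stratified_bootstrap_indices(strata, rng))
            for _ in range(n_boot)]
    return percentile_ci(reps)

# Toy example: two strata of unequal size and a simple mean as the statistic.
strata = ["a"] * 5 + ["b"] * 3
y = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0]
ci = bootstrap_ci(lambda idx: sum(y[i] for i in idx) / len(idx), strata)
```

Because resampling never crosses strata, the stratum-specific sampling weights remain valid for every bootstrap sample.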
In summary, we developed fast.adonis to efficiently fit multivariate models to microbiome data from large-scale studies. fast.adonis can efficiently fit many multivariate models simultaneously, making it useful for identifying important factors by forward selection (Blanchet et al., 2008). While the results in this manuscript were obtained using a single core, fast.adonis has an option for parallel computation. Moreover, fast.adonis can analyze data from studies with complex sampling, which require analyses using sampling weights. When the interest centers on testing the association between one variable and a distance matrix, one can rely on the asymptotic P-value instead of permutations (Chen and Zhang, 2021).
Acknowledgements
This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov).
Funding
The work was supported by the NIH Intramural Research Program.
Conflict of Interest: none declared.
Data availability
The American Gut Project metadata were from the Qiita study (https://qiita.ucsd.edu/study/description/10317) and distance matrices were from https://journals.asm.org/doi/10.1128/mSystems.00031-18. Microbiome sequencing data for the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial will be made available at the Sequence Read Archive (SRA) under project number PRJNA801882 with limited metadata (https://www.ncbi.nlm.nih.gov/sra/). For complete metadata, a data application will need to be approved by PLCO (www.cdas.cancer.gov).
Contributor Information
Shilan Li, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA; Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA.
Emily Vogtmann, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA.
Barry I Graubard, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA.
Mitchell H Gail, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA.
Christian C Abnet, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA.
Jianxin Shi, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA.
References
- Anderson M.J. (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol., 26, 32–46.
- Blanchet F.G. et al. (2008) Forward selection of explanatory variables. Ecology, 89, 2623–2632.
- Chen J., Zhang X. (2021) D-MANOVA: fast distance-based multivariate analysis of variance for large-scale microbiome association studies. Bioinformatics, 38, 286–288.
- Gower J.C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–338.
- Gower J.C., Legendre P. (1986) Metric and Euclidean properties of dissimilarity coefficients. J. Classif., 3, 5–48.
- Korn E.L., Graubard B.I. (1999) Analysis of Health Surveys. Wiley Series in Probability and Statistics Survey Methodology Section. John Wiley & Sons, New York.
- Lozupone C., Knight R. (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol., 71, 8228–8235.
- McArdle B.H., Anderson M.J. (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290–297.
- McDonald D. et al.; The American Gut Consortium (2018) American gut: an open platform for citizen science microbiome research. mSystems, 3, e00031-18.
- Oksanen J. et al. (2020) Vegan: community ecology package. R Package Version 2.5-7. https://CRAN.R-project.org/package=vegan (20 December 2021, date last accessed).
- Vogtmann E. et al. (2022) The human oral microbiome and risk of lung cancer: an analysis of three prospective cohort studies. Submitted.