Hypothesis testing for phylogenetic composition: a minimum-cost flow perspective

Shulei Wang; T Tony Cai; Hongzhe Li

doi:10.1093/biomet/asaa061

Biometrika. 2020 Jul 11;108(1):17–36. doi: 10.1093/biomet/asaa061


PMCID: PMC7937037 PMID: 33716568

Summary

Quantitative comparison of microbial composition from different populations is a fundamental task in various microbiome studies. We consider two-sample testing for microbial compositional data by leveraging phylogenetic information. Motivated by existing phylogenetic distances, we take a minimum-cost flow perspective to study such testing problems. We first show that multivariate analysis of variance with permutation using phylogenetic distances, one of the most commonly used methods in practice, is essentially a sum-of-squares type of test and has better power for dense alternatives. However, empirical evidence from real datasets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, we propose a new maximum type test, detector of active flow on a tree, and investigate its properties. We show that the proposed method is particularly powerful against sparse phylogenetic composition difference and enjoys certain optimality. The practical merit of the proposed method is demonstrated by simulation studies and an application to a human intestinal biopsy microbiome dataset on patients with ulcerative colitis.

Keywords: Microbiome, Metagenomics, Phylogenetic tree, Sparse alternative, Wasserstein distance

1. Introduction

High-throughput sequencing technologies make it possible to survey the microbiome communities from multiple samples, resulting in a need for statistical methods to quantitatively compare samples from different populations/experiments. Testing whether two groups of samples have the same microbiome composition is a key step in deciphering the quantitative difference between populations and identifying the dysbiotic components. In this paper we consider two-sample testing for the means of relative abundance from two populations. Although the problem is mainly motivated by microbiome and metagenomic data analysis, as a general problem it also arises in other high-throughput sequencing data, e.g., single-cell RNA sequencing data.

The microbial community from one sample is usually represented by discrete distributions with the relative abundance of microbe species organized in taxonomy, or operational taxonomic units in some applications. To assess the quantitative difference between groups of samples, various methods have been proposed for the taxonomic compositional data, including global two-sample tests (Zhao et al., 2015; Cao et al., 2017) and differential abundance tests (Robinson et al., 2010; Wagner et al., 2011; Love et al., 2014; Mandal et al., 2015). These methods, however, neglect the degree of similarity between microbe species, due to the fact that the analysis units in these methods, microbe species, are implicitly assumed to be equally distinct (Fukuyama, 2017). Furthermore, the classification of microbes by contemporary microbial taxonomy is coarse, which results in loss of power to detect subtle difference in a higher resolution (Washburne et al., 2018). To alleviate these issues, the phylogeny of the bacterial species is usually incorporated into the analyses of microbiome data (Fukuyama, 2017; Washburne et al., 2018).

In order to capture the phylogenetic compositional difference of the microbes between populations, one of the most widely used two-sample testing methods is multivariate analysis of variance with permutation, permanova, equipped with phylogenetic distance (McArdle & Anderson, 2001; Anderson, 2014; Xia & Sun, 2017). In microbiome data analysis, the popular choices of phylogenetic distances include the unweighted or weighted UniFrac distances (Lozupone & Knight, 2005; Lozupone et al., 2007) and their $L_p$ Zolotarev-type generalized variants (Evans & Matsen, 2012). Through studying these phylogenetic composition distances, we show that they are closely related to a minimum-cost flow problem for the underlying phylogenetic tree, and the phylogenetic composition difference between samples can be fully characterized by the optimal flow at each edge. Motivated by this observation, we consider the optimal flow at each edge as the analysis unit, instead of each microbe species. The main goal of the present paper is to study the problem of two-sample testing on a phylogenetic tree from this minimum-cost flow perspective.

We first investigate permanova equipped with the $L_2$ Zolotarev-type phylogenetic distance. Due to its flexibility and ease of computation, permanova using phylogenetic distance has been applied in a wide range of microbiome studies (Smith et al., 2015; Chen et al., 2016; Wu et al., 2016), but it still lacks theoretical justification. Following the minimum-cost flow perspective, we show that permanova is essentially a sum-of-squares type of test, which has been widely used to test the difference between the means of two populations in high-dimensional problems (Bai & Saranadasa, 1996; Srivastava & Du, 2008; Chen & Qin, 2010). We establish its asymptotic normality under the null hypothesis, and show that its power is indeed determined by the phylogenetic distance between the group means. It is known that sum-of-squares type tests are effective in detecting dense alternatives, but not powerful against sparse alternatives (Cai et al., 2014; Chen et al., 2019). However, in most microbiome studies, only a small fraction of taxa may have different mean abundances (Cao et al., 2017), resulting in optimal flows that are active, i.e., nonzero, on only a small number of edges. Moreover, permanova, as a global method, is not able to identify the specific locations of the significant differences even when the null hypothesis is rejected. Therefore, there is a need for a more powerful and interpretable test to detect sparse phylogenetic composition difference between two populations.

To fill this need we introduce a new test, detector of active flow on the tree, dafot, to detect the sparse phylogenetic composition difference between two populations. To detect sparse signals, the maximum type statistics are usually adopted in various settings because of their simplicity, effectiveness and optimality (see, e.g., Dümbgen & Spokoiny, 2001; Arias-Castro et al., 2005; Jeng et al., 2010; Arias-Castro et al., 2011; Cai et al., 2014, and references therein). Motivated by this, we construct dafot as the maximum of the standardized statistics for optimal flow at each edge. When the null hypothesis is rejected by dafot, it is also able to identify the edges that the active optimal flows lie on. Thus, different from permanova, dafot can not only detect the difference, it is also able to identify the branches of the phylogenetic tree that show difference in relative abundance between the populations. It is shown that dafot is the minimax optimal test against sparse alternatives, and the optimal detection boundary of phylogenetic composition difference relies on both the structure of phylogenetic tree and heteroskedastic variance of microbe species. The practical merits of dafot are further demonstrated through a real data example. The method is implemented in the R package DAFOT available from CRAN (R Development Core Team, 2021).

Transformation of compositional data is often employed in order to account for the compositional nature of the data. For example, the centred log-ratio transformation is one of the commonly used transformation methods for analysis of compositional data (Aitchison, 1982). To account for possible data transformation, we introduce the $\phi$-generalized optimal flow for any given strictly increasing transformation function $\phi$ of the relative abundances. The original form of optimal flow corresponds to the special case where $\phi$ is the identity, $\phi(x) = x$. Another special case of $\phi$-generalized optimal flow is the difference of balance between populations (Egozcue & Pawlowsky-Glahn, 2016; Rivera-Pinto et al., 2018), when $\phi(x) = \log x$, equivalent to adopting the centred log-ratio transformation. After introducing this new concept, we show that all the methodology and theory discussed previously can be generalized accordingly.

2. A hierarchical model for microbiome count data and phylogenetic distance

The human microbiome can be quantified using 16S rRNA sequencing or shotgun metagenomic sequencing. Such 16S rRNA gene sequences of the bacterial genomes or the sequencing of evolutionarily conserved universal marker genes can be used to construct the phylogenetic tree of the bacterial species. The microbe species and their ancestors are usually organized in such a phylogenetic tree based on their evolutionary relationships. Let $T = (V, E)$ be the phylogenetic tree of microbe species. Here, $V$ is the collection of microbe species and their ancestors, and $E$ represents the collection of edges of the phylogenetic tree $T$. For any $e \in E$, $L_e$ is the corresponding branch length. We assume the phylogenetic tree is rooted at $\rho \in V$, which can be seen as the common ancestor of all microbe species. For any pair of nodes $v_1, v_2 \in V$, the unique shortest path between them is denoted by $[v_1, v_2]$ and the corresponding distance between them is defined as

$$d(v_1, v_2) = \sum_{e \in [v_1, v_2]} L_e. \qquad (1)$$

The dissimilarity between two microbe species $v_1$ and $v_2$ can thus be quantified by $d(v_1, v_2)$. The height of the tree is then defined as the maximum of the distances between the root and the other nodes of the tree, $H = \max_{v \in V} d(\rho, v)$.
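The path distance in (1) and the tree height can be illustrated with a short sketch. The toy tree below, its node names and its branch lengths are hypothetical and chosen only for illustration; each edge is identified by its child node.

```python
# Hypothetical toy tree: child -> (parent, branch length); "rho" is the root.
PARENT = {
    "a": ("rho", 1.0),
    "b": ("rho", 2.0),
    "c": ("a", 0.5),
    "d": ("a", 1.5),
}

def path_to_root(node):
    """Return the edges (named by their child node) from `node` up to the root."""
    edges = []
    while node in PARENT:
        edges.append(node)
        node = PARENT[node][0]
    return edges

def tree_distance(u, v):
    """Sum branch lengths over the unique path between u and v, as in (1)."""
    up_u, up_v = set(path_to_root(u)), set(path_to_root(v))
    # Edges shared by both root paths cancel; the symmetric difference is the path.
    return sum(PARENT[e][1] for e in up_u ^ up_v)

def tree_height():
    """Maximum distance from the root to any other node."""
    return max(tree_distance("rho", v) for v in PARENT)
```

For this toy tree, the path between the sibling leaves `c` and `d` uses only their two pendant edges, whereas the path between `c` and `b` passes through the root.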

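The relative abundance of a microbial community can be represented by a discrete distribution on the nodes of the tree $T$. More specifically, write all possible discrete distributions on $V$ as

$$\mathcal{P} = \Big\{ P = (p_v)_{v \in V} : p_v \ge 0, \ \sum_{v \in V} p_v = 1 \Big\}.$$

Here, $p_v$ is the relative abundance of microbial species $v$ and $\mathcal{P}$ is a simplex of dimension $|V| - 1$, where $|V|$ is the number of elements in $V$. Suppose there are two populations of interest on $\mathcal{P}$, e.g., treated and control groups. These two populations can be represented by two probability distributions on $\mathcal{P}$, $\pi_1$ and $\pi_2$, respectively. We are interested in comparing the mean of the relative abundance between these two populations,

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2, \qquad (2)$$

where $\mu_k$ is the mean of $\pi_k$,

$$\mu_k = E_{P \sim \pi_k}(P), \quad k = 1, 2.$$

The covariance matrix of $\pi_k$ is defined similarly:

$$\Sigma_k = E_{P \sim \pi_k}\{(P - \mu_k)(P - \mu_k)^{\mathrm{T}}\}, \quad k = 1, 2.$$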

To test the mean equality hypothesis, $n_1$ and $n_2$ samples are drawn from each of the two populations,

$$P_{k,1}, \ldots, P_{k,n_k} \sim \pi_k, \quad k = 1, 2.$$

However, the true relative abundance of each sample, $P_{k,i}$, $i = 1, \ldots, n_k$, $k = 1, 2$, is unknown in practice. Sequencing of the microbial DNAs is used to assess the relative abundance of the microbes in the sample. In microbiome studies, the sequencing read data can be modelled by a Poisson distribution. To be specific, the number of sequencing reads that can be assigned to species $v$ from the $i$th sample of the $k$th group is assumed to follow a Poisson distribution,

$$X_{k,i,v} \sim \mathrm{Poisson}(N_{k,i}\, p_{k,i,v}),$$

where $N_{k,i}$ is the total number of reads in the $i$th sample of the $k$th group and $p_{k,i,v}$ is the relative abundance of microbe species $v$ in sample $P_{k,i}$. Thus, the reads count is assumed to be drawn from the hierarchical model

$$X_{k,i,v} \mid P_{k,i} \sim \mathrm{Poisson}(N_{k,i}\, p_{k,i,v}), \qquad P_{k,i} \sim \pi_k,$$

for any $v \in V$, $i = 1, \ldots, n_k$, $k = 1, 2$. The goal of this paper is to test the hypothesis in (2) based on the count data $X_{k,i,v}$.

Following this hierarchical model, the empirical distribution of each $P_{k,i}$ is written as $\hat P_{k,i} = (X_{k,i,v}/N_{k,i})_{v \in V}$. Due to the hierarchical structure of the model, the covariance matrix of the empirical distribution is

$$\operatorname{cov}(\hat P_{k,i}) = \Sigma_k + \frac{1}{N_{k,i}} \operatorname{diag}(\mu_k).$$

Here, $\operatorname{diag}(\cdot)$ represents the diagonal matrix of a vector. It is clear that the covariance matrix of the empirical distribution depends on both the mean and covariance matrix of $\pi_k$, and the variance of the empirical distribution is inflated because of the sequencing steps. The difference between $\Sigma_k$ and $\operatorname{cov}(\hat P_{k,i})$ vanishes when $N_{k,i}$ goes to infinity. For simplicity of analysis, in the rest of the paper we always assume the total number of reads in each sample to be equal, i.e., $N_{k,i} = N$ for any $i$, $k$. For brevity, we write when and . We also assume that there exists a constant $c$ such that

$$c \le \frac{n_1}{n_1 + n_2} \le 1 - c. \qquad (3)$$

This assumption implies that the proportion of the samples from either population does not vanish. We write $n = n_1 + n_2$.

In microbiome studies, a phylogenetic distance that reflects the evolutionary relationships among microbe species is often used in defining the distance between two microbial communities. Examples include unweighted and weighted UniFrac distance (Lozupone & Knight, 2005; Lozupone et al., 2007). As shown in Evans & Matsen (2012), the weighted UniFrac distance is a plug-in estimator of the Wasserstein distance of probability masses on the tree, which can be generalized to $L_p$ Zolotarev-type variants; for brevity, we call them Zolotarev-type phylogenetic distance. In the present paper we focus on the Zolotarev-type phylogenetic distance

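$$D(P, Q) = \Big\{ \sum_{e \in E} L_e (P_e - Q_e)^2 \Big\}^{1/2}, \qquad (4)$$

where $P_e$ is the total probability of all descendants of edge $e$,

$$P_e = \sum_{v \in \tau(e)} p_v,$$

where $P = (p_v)_{v \in V}$ and $\tau(e)$ is the subtree below edge $e$: $\tau(e) = \{v \in V : e \in [\rho, v]\}$. We also use the following notation: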
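A minimal sketch of how the edge masses $P_e$ and a Zolotarev-type distance might be computed is given below. The toy tree is hypothetical, and the general form $(\sum_e L_e |P_e - Q_e|^p)^{1/p}$ used here is an assumption made for illustration; with `power=1` it reduces to the weighted UniFrac distance.

```python
# Hypothetical toy tree: child -> (parent, branch length); "rho" is the root.
PARENT = {
    "a": ("rho", 1.0),
    "b": ("rho", 2.0),
    "c": ("a", 0.5),
    "d": ("a", 1.5),
}

def edge_masses(p):
    """Total mass of all nodes at or below each edge (edge named by its child)."""
    mass = {e: 0.0 for e in PARENT}
    for node, weight in p.items():
        # Walk up to the root, crediting every edge on the way.
        while node in PARENT:
            mass[node] += weight
            node = PARENT[node][0]
    return mass

def zolotarev_distance(p, q, power=2):
    """Assumed form: (sum_e L_e * |P_e - Q_e|**power) ** (1 / power)."""
    P, Q = edge_masses(p), edge_masses(q)
    total = sum(PARENT[e][1] * abs(P[e] - Q[e]) ** power for e in PARENT)
    return total ** (1.0 / power)

# Two compositions that differ only between the sibling leaves c and d.
p = {"c": 0.5, "b": 0.5}
q = {"d": 0.5, "b": 0.5}
unifrac = zolotarev_distance(p, q, power=1)
l2_distance = zolotarev_distance(p, q, power=2)
```

In this example only the two pendant edges above `c` and `d` carry a nonzero difference, so the distance ignores the shared ancestry above node `a`.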

3. A minimum-cost flow perspective for two-sample testing

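The phylogenetic distance between two discrete distributions is closely related to optimal transport theory (Evans & Matsen, 2012). To be specific, the weighted UniFrac/Wasserstein distance between $P = (p_v)_{v \in V}$ and $Q = (q_v)_{v \in V}$ is equal to the solution of the following optimal transport problem:

$$\min_{\gamma \ge 0} \sum_{u, v \in V} d(u, v)\, \gamma(u, v) \quad \text{subject to} \quad \sum_{v \in V} \gamma(u, v) = p_u, \quad \sum_{u \in V} \gamma(u, v) = q_v. \qquad (5)$$

Here, the objective function is the total cost of transport $\gamma$, and $d(u, v)$ is the cost per unit transported from $u$ to $v$.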

Different from the general optimal transport problem, the distance $d(\cdot, \cdot)$ in (1) is defined as the geodesic distance along the path on tree $T$. Therefore, this optimal transport problem can be naturally cast as a minimum-cost flow problem on a network, as illustrated in Fig. 1. The tree $T$ can be seen as a special network, and there is a source with capacity $p_v$ and a sink with capacity $q_v$ at each node $v$. Then, the optimization problem in (5) aims to find a way with the minimum cost of sending an amount of flow from sources to sinks through the network $T$. In particular, for any given transport $\gamma$, define the flows on each edge as

$$F_e^{+} = \sum_{u \in \tau(e),\, v \notin \tau(e)} \gamma(u, v), \qquad F_e^{-} = \sum_{u \notin \tau(e),\, v \in \tau(e)} \gamma(u, v),$$

where $\tau(e)$ denotes the set of nodes below edge $e$.

Fig. 1. An illustration of the minimum-cost flow problem on the tree. The blue bars are the sources and the red bars are the sinks.

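Here, $F_e^{+}$ is the flow through edge $e$ towards the root $\rho$, and $F_e^{-}$ is the flow through edge $e$ in the opposite direction. Then, the optimization problem in (5) can be reformulated as

$$\min_{F^{+},\, F^{-} \ge 0} \sum_{e \in E} L_e (F_e^{+} + F_e^{-}) \quad \text{subject to} \quad F_e^{+} - F_e^{-} = P_e - Q_e, \quad e \in E, \qquad (6)$$

where $P_e$ and $Q_e$ are the total masses of $P$ and $Q$ below edge $e$. Although the optimal transport $\gamma$ in (5) might not be unique, the optimal flow in (6) is unique and has the closed-form solution

$$F_e^{+} = \max(P_e - Q_e, 0), \qquad F_e^{-} = \max(Q_e - P_e, 0).$$

The optimal net flow on each edge is then defined as $F_e = F_e^{+} - F_e^{-} = P_e - Q_e$. The weighted UniFrac distance and corresponding Zolotarev-type variants can thus be seen as the weighted norm of optimal flows. It is clear from the above discussion that the phylogenetic composition difference between $P$ and $Q$ can be fully characterized by the optimal flow $(F_e)_{e \in E}$. If we write $\Delta_e = \mu_{1,e} - \mu_{2,e}$, where $\mu_{k,e}$ is the total mass of $\mu_k$ below edge $e$, the hypothesis in (2) can be rewritten in the equivalent form

$$H_0: \Delta_e = 0 \ \text{for all } e \in E \quad \text{versus} \quad H_1: \Delta_e \neq 0 \ \text{for some } e \in E.$$

This suggests that the basic unit to quantify phylogenetic composition difference should be the optimal flow at each edge $e \in E$. The Zolotarev-type phylogenetic distance can be seen as the weighted norm of these optimal flows.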
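The relation between a transport plan and the flows it induces on the edges can be sketched as follows. The toy tree and both transport plans are hypothetical. The sketch illustrates that two plans with the same marginals share the same net flow on every edge, while only the plan without opposing flows on an edge attains the minimum cost.

```python
# Hypothetical toy tree: child -> (parent, branch length); "rho" is the root.
PARENT = {"a": ("rho", 1.0), "b": ("rho", 2.0), "c": ("a", 0.5), "d": ("a", 1.5)}

def root_path(node):
    """Set of edges (named by their child node) between `node` and the root."""
    edges = set()
    while node in PARENT:
        edges.add(node)
        node = PARENT[node][0]
    return edges

def edge_flows(plan):
    """Map a transport plan {(u, v): mass} to flows towards and away from the root."""
    up = {e: 0.0 for e in PARENT}
    down = {e: 0.0 for e in PARENT}
    for (u, v), mass in plan.items():
        from_u, to_v = root_path(u), root_path(v)
        for e in from_u - to_v:   # edges climbed when leaving u
            up[e] += mass
        for e in to_v - from_u:   # edges descended when reaching v
            down[e] += mass
    return up, down

def flow_cost(up, down):
    """Total cost: branch length times the total flow carried by each edge."""
    return sum(PARENT[e][1] * (up[e] + down[e]) for e in PARENT)

def net_flow(up, down):
    return {e: up[e] - down[e] for e in PARENT}

# Both plans move P = {c: 0.5, b: 0.5} to Q = {d: 0.5, b: 0.5}.
direct = {("c", "d"): 0.5, ("b", "b"): 0.5}
detour = {("c", "b"): 0.5, ("b", "d"): 0.5}
direct_up, direct_down = edge_flows(direct)
detour_up, detour_down = edge_flows(detour)
```

The detour plan sends mass in both directions along the edges above `a` and `b`, which raises its cost without changing the net flow.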

4. Permutational multivariate analysis of variance

4.1. Introduction

To incorporate phylogenetic tree information in comparing two populations, one of the most commonly used two-sample tests is permanova equipped with some phylogenetic distance (McArdle & Anderson, 2001; Anderson, 2014). Specifically, let $D$ be the phylogenetic distance defined in (4). The average empirical distances within and between groups are defined as

and

Similar to analysis of variance, permanova defines the total sum-of-squares as

and the within-group sum-of-squares and between-group sum-of-squares as

The pseudo $F$-statistic for two-sample testing is then defined as the normalized ratio of the between-group sum-of-squares to the within-group sum-of-squares,

To evaluate the significance of the $F$-statistic, its $p$-value is calculated by permutations. To be more specific, the samples are permuted randomly $B$ times and the $F$-statistic is calculated on each of these permuted datasets, denoted $F_1, \ldots, F_B$. Then, the estimated $p$-value is

One implicit assumption of a valid permutation test is the exchangeability of samples under the null hypothesis. Thus, the hypothesis required by a valid permutation test is

$$H_0: \pi_1 = \pi_2.$$

Compared with the mean equality hypothesis in (2), this is a more restrictive hypothesis. In the next section we present another way to estimate the $p$-value of permanova based on the asymptotic results.
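The permutation procedure can be sketched as follows. The sums-of-squares below follow the standard permanova definitions of Anderson (2014), and the $p$-value uses the common convention of counting the observed statistic among the permutations; both are assumptions made for illustration and may differ in normalization from the definitions used in this paper.

```python
import random

def pseudo_f(dist, labels):
    """Standard permanova pseudo F from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        pairs = sum(dist[i][j] ** 2 for a, i in enumerate(idx) for j in idx[a + 1:])
        ss_within += pairs / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """Share of label permutations whose pseudo F is at least the observed one."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Assumed convention: count the observed statistic as one permutation.
    return (hits + 1) / (n_perm + 1)

# Hypothetical one-dimensional example with two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
dist = [[abs(x - y) for y in points] for x in points]
f_obs = pseudo_f(dist, labels)
p_val = permutation_p_value(dist, labels)
```

With a Euclidean distance, as here, the pseudo $F$ coincides with the classical analysis-of-variance $F$ statistic, which is one way to sanity-check an implementation.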

4.2. Properties

We investigate the properties of the pseudo $F$-statistic under the $L_2$ Zolotarev-type phylogenetic distance in (4). A simple calculation decomposes it into two parts:

In particular, the second term is asymptotically equal to

This shows that this quantity is a scaled version of the difference between the average within-group distance and the across-group distance. In other words, in terms of two-sample testing, it is asymptotically equivalent to the following energy distance statistic (Székely & Rizzo, 2005; Sejdinovic et al., 2013):

Because the distance in (4) is of negative type, the energy distance statistic can also be written as a kernel-based test statistic (Sejdinovic et al., 2013),

where the kernel of $P$ and $Q$ is defined as . The kernel form of permanova suggests

Therefore, the $F$ statistic in permanova is a reasonable statistic for testing the phylogenetic composition difference in hypothesis (2), as the mean of the kernel statistic depends only on $\mu_1$ and $\mu_2$.

Clearly, the behaviour of $F$ depends on both the covariance matrix of the empirical distributions and the tree structure $T$. The structure of the tree can be expressed as a transformation matrix , where if and if . We assume

(7)

where or . Such a moment assumption is a common condition in high-dimensional statistics, e.g., condition (3.6) in Chen & Qin (2010). Condition (7) is true when the eigenvalues of both and are bounded. Besides the moment assumption, another assumption we make is

(8)

where and are empirical distributions drawn from the first or second population, and is a sequence of numbers such that . This is a fairly weak condition. For example, it is a trivial assumption when the tree is not too high, i.e., , because .

The following theorem shows the asymptotic behaviour of the permanova statistic $F$.

Theorem 1.

Under the null hypothesis, i.e., $\mu_1 = \mu_2$, and assumptions (3), (7) and (8),

where can be written as

Furthermore, if and assumptions (3), (7) and (8) hold, the test is consistent when

(9)

This theorem suggests that permanova is a consistent test if the phylogenetic distance between the means of two populations is large enough. As can be written explicitly as

the power of permanova depends on both the number of samples $n$ and the number of reads per sample $N$. The power becomes larger if we increase either $n$ or $N$. However, (9) also suggests that a larger number of samples is a more efficient way to increase power than a larger number of reads per sample. This theorem also suggests that the $p$-value can be calculated based on the asymptotic distribution instead of conducting permutations. For instance, can be estimated based on a similar $U$-statistic, in Chen & Qin (2010). Then, the $p$-value can be calculated by . This way of calculating the $p$-value does not require $\pi_1 = \pi_2$ under the null hypothesis. In practice, we recommend the permutation test when $n$ is not large and the asymptotic critical value of is large.

4.3. Sparse setting

As we have seen in previous sections, permanova is a sum-of-squares type statistic. However, the interesting setting in practice, e.g., in microbiome studies, is a sparse case where only a small number of microbe species may have different relative mean abundance (Cao et al., 2017). This suggests that only a small fraction of the optimal flows are active, i.e., nonzero. To investigate the performance of permanova under this sparse setting, we consider a simple case where there is active optimal flow on one edge, denoted by $e^*$. As suggested by Theorem 1, the condition for a consistent permanova test is

On the other hand, we consider an oracle test that has knowledge of the active flow location $e^*$. Since the location of $e^*$ is known, we consider a two-sample $t$-test for the optimal flow on $e^*$,

(10)

where

With the central limit theorem, we know that the oracle test is a consistent test if

A comparison of the two detection boundaries indicates that the oracle test is able to detect a much smaller difference between two groups of samples than permanova. This naturally leads to the question of whether it is possible to develop a more powerful test under the sparse flow setting.

5. Active optimal flow detection

5.1. Detector of active flow on the tree

As shown in § 4, the two-sample $t$-test could improve the power to detect the difference between two populations when the location of the active optimal flow is known. In practice, the location information is usually unknown. To address this issue we consider the maximum of the two-sample $t$-tests at each edge,

where the statistic at each edge is defined in the same way as (10). The use of this maximum type statistic for detecting sparse signals is very common in a wide range of applications, leading to the construction of rate-optimal tests for many problems (Dümbgen & Spokoiny, 2001; Arias-Castro et al., 2005; Arias-Castro et al., 2011; Jeng et al., 2010; Cai et al., 2014; Cao et al., 2017; Wang et al., 2021).

To evaluate the statistical significance of the dafot statistic, one still needs to choose an appropriate critical value for it. However, it is difficult to derive the asymptotic distribution of the statistic due to the complex dependency structure among the edge statistics. To overcome this problem, one can adopt a resampling method to assign an appropriate critical value. In particular, a common resampling method to choose a critical value is the permutation test as in permanova (Good, 2013; Anderson, 2014). Although the permutation test requires $\pi_1 = \pi_2$ under the null hypothesis as we discussed before, its performance is robust when the sample size is small.

We propose another resampling method, the bootstrap, to choose the critical value for the dafot statistic in order to avoid the condition $\pi_1 = \pi_2$. Let $\tilde P_{k,i} = \hat P_{k,i} - \bar P_k + \bar P$ for $i = 1, \ldots, n_k$, $k = 1, 2$, where $\bar P_k$ is the mean of $\hat P_{k,i}$ within each group and $\bar P$ is the mean of the combined samples. We randomly draw samples with replacement from each group and then calculate the statistic with the shifted data $\tilde P_{k,i}$. This procedure is repeated $B$ times and the corresponding statistics are recorded. Finally, the approximated critical value is chosen as a quantile of the empirical distribution of these bootstrap statistics, or the $p$-value is calculated as

We then make a decision on the null hypothesis based on the critical value or the $p$-value.

When the null hypothesis is rejected, the dafot statistic also provides a natural way to identify the set of edges whose optimal flow is not zero. To be specific, we consider the set of edges whose statistics exceed the critical value as the active edges identified. Due to the construction of the critical value, the familywise error rate of active edge identification is naturally controlled at the nominal significance level. The identified edge indicates that all microbe species below this edge, as a whole object, are differentially abundant between the two populations. In this way, the active edges are identified as a microbial signature associated with the difference of the two populations.
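The whole procedure (maximum statistic, bootstrap on shifted data, active edge identification) can be sketched as below. This is not the DAFOT package implementation: the Welch-type form of the edge statistic, the $p$-value convention and all function names are assumptions made for illustration. Each sample is represented by a dictionary mapping an edge to the empirical mass below it.

```python
import random
from statistics import mean, variance

def edge_t_stats(g1, g2):
    """Assumed Welch-type two-sample t statistic for the mass below each edge."""
    stats = {}
    for e in g1[0]:
        x = [s[e] for s in g1]
        y = [s[e] for s in g2]
        se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
        stats[e] = (mean(x) - mean(y)) / se if se > 0 else 0.0
    return stats

def max_stat(g1, g2):
    """Maximum absolute edge statistic."""
    return max(abs(t) for t in edge_t_stats(g1, g2).values())

def shift_to_pooled_mean(group, pooled):
    """Recentre a group so that its mean equals the pooled mean (imposes the null)."""
    centre = {e: mean(s[e] for s in group) for e in pooled}
    return [{e: s[e] - centre[e] + pooled[e] for e in pooled} for s in group]

def bootstrap_dafot(g1, g2, n_boot=500, alpha=0.05, seed=0):
    rng = random.Random(seed)
    pooled = {e: mean(s[e] for s in g1 + g2) for e in g1[0]}
    s1 = shift_to_pooled_mean(g1, pooled)
    s2 = shift_to_pooled_mean(g2, pooled)
    # Resample each shifted group with replacement and recompute the maximum.
    boot = sorted(
        max_stat(rng.choices(s1, k=len(s1)), rng.choices(s2, k=len(s2)))
        for _ in range(n_boot)
    )
    critical = boot[min(n_boot - 1, int((1 - alpha) * n_boot))]
    observed = edge_t_stats(g1, g2)
    obs_max = max(abs(t) for t in observed.values())
    p_value = (1 + sum(b >= obs_max for b in boot)) / (n_boot + 1)
    # Edges whose statistic exceeds the bootstrap critical value.
    active = sorted(e for e, t in observed.items() if abs(t) > critical)
    return p_value, critical, active

# Hypothetical data: the mass below edge "c" differs between groups, "b" does not.
data_rng = random.Random(1)
group1 = [{"b": 0.3 + data_rng.uniform(-0.05, 0.05),
           "c": 0.5 + data_rng.uniform(-0.05, 0.05)} for _ in range(10)]
group2 = [{"b": 0.3 + data_rng.uniform(-0.05, 0.05),
           "c": 0.1 + data_rng.uniform(-0.05, 0.05)} for _ in range(10)]
p_value, critical, active = bootstrap_dafot(group1, group2)
```

Because a single critical value is compared against every edge statistic, the same threshold serves both the global test and the edge-level identification.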

5.2. Asymptotic behaviour and optimality of dafot

We now investigate the behaviour of the dafot statistic under the null and alternative hypotheses. The main difficulty in studying its behaviour is the strong and complex dependence among the different edge statistics. This complexity of the dependency structure mainly comes from two sources: the highly overlapping structure of the subtrees and the unknown heteroskedastic variance/covariance structure among the different microbe species. Our investigation shows that the complexity of the dependence structure among the edge statistics can be characterized by a single quantity that depends on both the tree structure and the heteroskedastic variance.

Since each edge statistic is defined on the subtree $\tau(e)$ below edge $e$, the asymptotic behaviour of the dafot statistic clearly depends on the effective number of subtrees. We still assume that each group of samples has no vanishing proportion, i.e., (3). Let be the element of the th row and th column of . To decouple the complex dependence structure among the edge statistics, we provide the following proposition.

Proposition 1.

For any given , let be a subset of edges of the tree such that . Then, can be decomposed into a collection of disjoint paths

where and are nodes of the tree and is the number of disjoint paths. In addition, any two edges from different paths in the above decomposition do not share any descendants.

Intuitively, for any $e$ and $e'$ from the same path defined above, the subtrees $\tau(e)$ and $\tau(e')$ are highly overlapped, and thus the edge statistics defined on them are expected to behave similarly. On the other hand, for edges from different paths, the corresponding subtrees are distinct in that they do not share any descendants. In general, the number of disjoint paths in the above proposition characterizes an effective number of subtrees. Motivated by this observation, we define as the number of different paths in for , and as the sum of the ,

Clearly, is an integer between and , and depends on the structure of the tree and the distributions $\pi_1$ and $\pi_2$. For example, when the tree is a fully balanced binary tree, and $\pi_1$ and $\pi_2$ are Dirichlet distributions with equal parameters for the leaves, . If $\pi_1$ and $\pi_2$ are Dirichlet distributions with equal parameters for all nodes of a chain tree, i.e., a one-dimensional lattice, then . More specific examples can be found in the Supplementary Material. We later show that the asymptotic behaviour of the dafot statistic can be fully characterized by this quantity.

When two subtrees highly overlap, we expect the correlation between the edge statistics on them to be very strong. We shall assume there exist constants and such that

(11)

where , and are a pair of edges such that . Here, represents the correlation between two random variables. This is not a strong condition. For instance, (11) is satisfied when the relative abundances of different microbe species are mutually independent or drawn from a Dirichlet distribution. Furthermore, we also assume that is a sub-Gaussian distribution, i.e., there exist constants and such that

(12)

where is the variance of , the element of the th row and the th column of . The following theorem shows the asymptotic behaviour of the dafot statistic under the null hypothesis.

Theorem 2.

Suppose . Under the null hypothesis and assumptions (3), (11) and (12), we have

(13)

Theorem 2 suggests that the amplitude of the dafot statistic is when the null hypothesis is true, where plays the same role as the number of variables when all the components are almost independent (Cai et al., 2014; Cao et al., 2017). In other words, although the dafot statistic is constructed from different subtrees, it is equivalent to taking a maximum of roughly independent variables because of the high dependency among the edge statistics. For the one active flow example discussed in the previous section, Theorem 2 suggests that dafot is a consistent test when

A comparison between the above and the oracle test’s detection boundary suggests that the price we pay for the unknown location of $e^*$ is . With the same argument as for permanova, we know that either a larger number of samples or a larger number of reads per sample can increase the power of dafot. More detailed discussions are given in § 6.

We next turn to the power analysis of the test based on the dafot statistic from a minimax point of view (Ingster, 1993a,b,c; Ingster & Suslina, 2012). The parameter space of the null hypothesis is

and the parameter space of the alternative hypothesis is

The worst-case risk of any given test is then defined as

We say a test is consistent for separating and if , and and are separable if there exists a consistent test for them. On the other hand, a test is powerless for separating and if , and and are inseparable if . Here, is a parameter to control the distance between and . Clearly, if is smaller, and are more difficult to separate. Let be the test that rejects the null hypothesis if and only if for some . The following theorem characterizes the power of and the separability of and .

Theorem 3.

Consider testing and by . Suppose , and (3), (11) and (12) hold. Then there exist constants and such that

This theorem shows that the optimal rate of detection boundary between and is

This optimal rate suggests that the difficulty of this problem is mainly determined by a single quantity, which relies on both the tree structure and the heteroskedastic variance structure. This theorem also suggests that dafot is minimax rate optimal.

6. Numerical experiments

6.1. Simulation studies

We investigate the properties of permanova and dafot using simulated datasets. We choose a phylogenetic tree of bacteria species within the class Gammaproteobacteria as the underlying tree $T$. This tree is extracted from the reference tree of the Greengenes 16S rRNA database version 13.8 clustered at 85% similarity by the R package metagenomeFeatures (DeSantis et al., 2006), and has a total of 247 leaves, denoted by , and 246 internal nodes, denoted by . Figure 2 shows the structure of the tree, with each leaf labelled with a number.

Fig. 2. Phylogenetic tree of bacteria within the class *Gammaproteobacteria* used in the simulation studies, with each leaf node labelled with a number. The edges with active optimal flow in the numerical examples are shown in red.

To simulate the numbers of reads on this tree $T$, we adopt the Dirichlet-multinomial distribution. More specifically, the true relative abundance , where and , is drawn from some Dirichlet distribution, i.e., follows a Dirichlet distribution indexed by ,

For each sample, the reads are then drawn from a multinomial distribution with respect to the true relative abundance. Under this model, the mean of the relative abundance for the $k$th group can be written as

Thus, under the null hypothesis, we assume if and if . Under the alternative hypothesis, we perturb and consider two different scenarios: (A) the difference is at nodes and , i.e., and ; (B) the difference is at clades and , i.e., if and if . The parameter is specified later. In particular, the edges with active optimal flow under scenarios (A) and (B) are shown in red in Fig. 2.
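A minimal sketch of drawing one sample of read counts from a Dirichlet-multinomial model is shown below. The parameter vector and sequencing depth are hypothetical and are not the settings used in the simulations; the Dirichlet draw uses the standard construction from normalized gamma variables.

```python
import random

def dirichlet(alpha, rng):
    """Draw a composition by normalizing independent gamma variables."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def dirichlet_multinomial(alpha, depth, rng):
    """Draw a true composition, then `depth` reads assigned according to it."""
    p = dirichlet(alpha, rng)
    counts = [0] * len(alpha)
    for idx in rng.choices(range(len(alpha)), weights=p, k=depth):
        counts[idx] += 1
    return counts

# Hypothetical parameters: three taxa, 200 reads per sample, 2000 samples.
rng = random.Random(0)
alpha = [2.0, 1.0, 1.0]
samples = [dirichlet_multinomial(alpha, 200, rng) for _ in range(2000)]
mean_first = sum(s[0] for s in samples) / (2000 * 200)
```

Under a Dirichlet distribution the mean relative abundance of each taxon is its parameter divided by the sum of all parameters, so `mean_first` should be close to 0.5 here.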

The first set of simulation experiments is designed to compare the performance of dafot and permanova under scenario (A). To be specific, we compare four different methods: dafot; dafot after centred log-ratio transformation (dafot-log); permanova equipped with the weighted UniFrac distance (permanova-L1); and permanova equipped with the $L_2$ Zolotarev-type phylogenetic distance (permanova-L2). To make the comparisons fair, the critical values for all tests are chosen by permutations at a significance level of . To investigate the effect of the sample size $n$, the sequence depth $N$ and the signal strength , we chose , and , and , and , , , , and in the simulation experiments. The experiment is repeated 100 times for each combination of the , and . The performance of the tests is evaluated by the power of the test, i.e., the probability of rejecting the null hypothesis, which can be estimated by the proportion of null hypothesis rejections among the 100 simulation experiments.

The results are summarized in Fig. 3. These results show that the Type I error is under control when the null hypothesis is true, Inline graphic , and the power of dafot is larger than permanova when the alternative hypothesis is true (). Figure 3 implies that the observed effects of , and on the power of the tests are consistent with the theoretical results. We observe a similar improved power of dafot over permanova for scenario (B) when the active flows connect two clades; see the Supplementary Material for details.

Figure 3. Power comparisons between dafot and permanova under scenario (A), where the difference is at nodes and . dafot, the proposed method based on proportions; dafot-log, the proposed method based on log-proportions; permanova-L1, permanova with the $L_1$ Zolotarev-type phylogenetic distance, i.e., the weighted UniFrac distance; permanova-L2, permanova with the $L_2$ Zolotarev-type phylogenetic distance.

The sequence count data in real microbiome studies are usually very sparse, i.e., they contain many zeros. The next set of simulation experiments is designed to assess the performance of dafot and permanova in the presence of many zeros. More concretely, we set Inline graphic for under scenario (A) and for under scenario (B). In other words, the probability on nearly of the nodes is zero in scenario (A), and the probability on nearly of the nodes is zero in scenario (B). In order to avoid undefined values of , zero counts are replaced by in dafot-log (Aitchison, 2003; Lin et al., 2014). The sequence depth of each sample is drawn uniformly between Inline graphic and instead of being fixed as in the previous experiments. The sample size and the difference between populations are varied in the same way as in the previous simulation experiments. We present the results based on 100 runs for each combination of and in the Supplementary Material. A comparison suggests that dafot, permanova-L1 and permanova-L2 are relatively robust to the presence of many zeros; however, the power of dafot-log is affected by these zeros.
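The zero-replacement step that precedes the log-ratio transformation can be illustrated with a short sketch. The replacement value used in the paper is not recoverable from this text, so the default `pseudocount=0.5` below is purely an assumption, as is the function name; the centred log-ratio itself is the standard definition (log-proportions minus their per-sample mean).

```python
import numpy as np


def clr_with_pseudocount(counts, pseudocount=0.5):
    """Centred log-ratio transform of a samples-by-taxa count matrix.

    Zero counts are replaced by `pseudocount` (an assumed value, not taken
    from the paper) so that the logarithm is defined everywhere.
    """
    counts = np.asarray(counts, dtype=float)
    filled = np.where(counts == 0, pseudocount, counts)
    props = filled / filled.sum(axis=1, keepdims=True)
    log_props = np.log(props)
    # Subtract the per-sample mean log-proportion so each row sums to zero.
    return log_props - log_props.mean(axis=1, keepdims=True)
```

Because every zero is mapped to the same small value, samples with many zeros get many identical, strongly negative coordinates after the transform, which is one plausible reason a log-based test could be more sensitive to sparsity than a proportion-based one.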

We further compare the performance of dafot and permanova over a wide range of sparsity levels. Specifically, we adopt a similar setting to scenario (A) and choose Inline graphic , and . To assess the effect of sparsity, we randomly choose leaves in each simulation experiment and set for the first leaves and for the last leaves; is chosen to be , , , , , and . The results based on 100 simulation experiments are summarized in Fig. 4. These results show that dafot outperforms permanova when fewer leaves are perturbed. When the signal becomes denser, permanova-L1 can gain more power than dafot.

Figure 4. Power comparisons between dafot and permanova for different sparsity levels.

The final set of simulation experiments aims to evaluate the performance of edge identification by dafot and dafot-log. In particular, we consider scenario (A) with Inline graphic and vary and as in the previous simulation experiments. For each combination of and , we repeat the experiment 100 times and record the active edges detected by the two methods. The average number of false positive edges, the probability of making at least one Type I error, and the average number of true positive edges are summarized in Table 1, showing that the familywise error rate is under control regardless of signal strength, and that the two active edges can be identified successfully when the signal is strong enough. In Fig. 3 and Table 1, dafot-log performs better than dafot, as the log transformation is suitable for nonzero data.

Table 1.

The edge identification performance of dafot and dafot-log



dafot	afp	0.09	0.06	0.08	0.13	0.10	0.08
	fwer	0.04	0.06	0.06	0.09	0.10	0.06
	atp	0.11	0.15	0.56	0.78	1.49	1.66
dafot-log	afp	0.08	0.05	0.08	0.13	0.13	0.36
	fwer	0.04	0.05	0.07	0.11	0.05	0.21
	atp	0.35	0.50	0.90	1.35	1.81	1.94

afp, the average number of false positive edges; fwer, the probability of making at least one Type I error; atp, the average number of true positive edges.
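The three summaries reported in Table 1 can be computed from the per-run sets of detected edges as in the sketch below. The function name and the representation of edges as hashable labels are assumptions made for illustration; only the definitions of afp, fwer and atp follow the table legend.

```python
def edge_identification_summary(detected_per_run, true_edges):
    """Summarize edge identification over repeated simulation runs.

    detected_per_run: iterable of collections of detected edge labels, one per run.
    true_edges: collection of the truly active edge labels.
    """
    true_edges = set(true_edges)
    runs = [set(d) for d in detected_per_run]
    n_runs = len(runs)
    false_pos = [len(d - true_edges) for d in runs]
    true_pos = [len(d & true_edges) for d in runs]
    return {
        "afp": sum(false_pos) / n_runs,                   # average number of false positives
        "fwer": sum(fp > 0 for fp in false_pos) / n_runs,  # share of runs with any false positive
        "atp": sum(true_pos) / n_runs,                    # average number of true positives
    }
```

With two truly active edges, as in scenario (A), atp lies between 0 and 2, which matches the range of the atp rows in Table 1.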

6.2. Analysis of ulcerative colitis disease microbiome data

To further demonstrate the performance of dafot, we apply the method to a 16S rRNA dataset of 47 human intestinal biopsy samples collected at the University of Pennsylvania. These samples are divided into three groups: A, 18 control samples (control); B, 14 samples from patients with ulcerative colitis who did not receive treatment (unexposed); C, 15 samples from patients with ulcerative colitis who received treatment (exposed). To compare the microbiome communities of these groups, the raw sequence reads of each sample are placed into a reference phylogenetic tree from the Greengenes 16S rRNA database version 13.8 with a 99% similarity by using sepp (Mirarab et al., 2012; Janssen et al., 2018). The reference phylogenetic tree is then trimmed by keeping all nodes related to the operational taxonomic units observed in the samples. The trimmed phylogenetic tree, which has 7980 edges, is shown in Fig. 5(a). Figure 5(b) shows the empirical optimal flow between groups A and B on each edge, i.e., the difference of probability on subtrees indexed by edges, Inline graphic . The edges are ordered as in the R class phylo. It is clear from Fig. 5(b) that most of the empirical active flows are small, indicating that sparse flow on the tree is a reasonable assumption. In addition, we estimate the effective number of subtrees for pairwise comparison: Inline graphic for groups A and B; for groups A and C; for groups B and C. The estimation of is based on the reference phylogenetic tree structure and estimated variance at each node .

Figure 5. (a) The reference phylogenetic tree used in the analysis of intestinal biopsy samples. The branches in red are the edge identified as active and the subtree below it, which is also shown enlarged. (b) The empirical optimal flow between group A and group B on each edge.
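The per-edge quantity plotted in Fig. 5(b), the difference between two groups in the probability carried by the subtree below each edge, can be sketched as follows. The tree encoding (a parent array with `-1` at the root, each edge indexed by its child node), the sign convention (group A minus group B) and the function names are assumptions made for this illustration and are not taken from the dafot package.

```python
import numpy as np


def subtree_mass(parent, node_mass):
    """Total mass of each node's subtree, given the mass placed on each node."""
    parent = np.asarray(parent)
    mass = np.asarray(node_mass, dtype=float).copy()
    # Depth of each node, found by walking up to the root.
    depth = np.zeros(len(parent), dtype=int)
    for v in range(len(parent)):
        u = v
        while parent[u] >= 0:
            u = parent[u]
            depth[v] += 1
    # Visit the deepest nodes first so every child is complete before its parent.
    for v in np.argsort(-depth):
        if parent[v] >= 0:
            mass[parent[v]] += mass[v]
    return mass


def empirical_edge_flow(parent, group_a, group_b):
    """Difference in mean subtree mass (group A minus group B) for every edge."""
    mean_a = np.mean(np.asarray(group_a, dtype=float), axis=0)
    mean_b = np.mean(np.asarray(group_b, dtype=float), axis=0)
    diff = subtree_mass(parent, mean_a) - subtree_mass(parent, mean_b)
    return {v: diff[v] for v in range(len(parent)) if parent[v] >= 0}
```

In this encoding a shift of mass between two sibling leaves changes the flow only on the two edges directly above them, since the subtree totals higher up are unchanged; this is the sense in which a local compositional difference gives a sparse flow.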

To test the phylogenetic composition difference between the groups, we apply the same four two-sample testing methods as in the first simulation experiment. The resulting $p$-values, estimated by a permutation test with 1000 permutations, are summarized in Table 2. None of the methods identifies a significant phylogenetic composition difference between groups C and A or groups C and B, . However, dafot alone indicates an overall difference in intestinal biopsy microbiome composition between groups A and B, with $p = 0.007$, while the other methods do not detect such a difference.

Table 2.

The $p$-values for comparing different groups using dafot and permanova based on 1000 permutations

	dafot	dafot-log	permanova-L1	permanova-L2
Group A vs B	0.007	0.256	0.378	0.460
Group A vs C	0.147	0.495	0.305	0.270
Group B vs C	0.648	0.832	0.639	0.677

Besides the overall difference in microbiome composition between groups A and B, dafot also identifies that this difference is due to the active flow on one edge. The subtree indexed by this edge is shown in Fig. 5(a), coloured red in the original phylogenetic tree and enlarged in a side panel. A total of 31 operational taxonomic units are placed on this subtree, 18 of which are annotated as the Ruminococcaceae family and Oscillospira genus, and 13 of which are annotated as the Ruminococcaceae family and unknown genus. Figure 6(a) shows a boxplot of the combined relative abundance of these 31 operational taxonomic units, indicating that the relative abundance on this subtree is decreased in ulcerative colitis patients, but partially increases after treatment. This is consistent with previous findings that the proportions of the Oscillospira genus and the Ruminococcaceae family in the gut microbiota decrease in inflammatory bowel disease patients (Morgan et al., 2012; Konikoff & Gophna, 2016; Santoru et al., 2017). For comparison, a boxplot of the combined relative abundance of all operational taxonomic units assigned to the Oscillospira genus is shown in Fig. 6(b); the pattern of relative abundance seen in Fig. 6(a) is no longer as clear. This suggests that the finer classification provided by microbial phylogeny can give more power to detect subtle differences between populations than standard taxonomic classification (Washburne et al., 2018).

Figure 6. Boxplots of relative abundance (a) on operational taxonomic units placed on the detected subtree in Fig. 5(a) and (b) on all operational taxonomic units assigned to the *Oscillospira* genus. The red dots show the raw relative abundance.

Acknowledgement

This research was supported by the National Institutes of Health.

Contributor Information

Shulei Wang, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.

T Tony Cai, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.

Hongzhe Li, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A.

Supplementary material

Supplementary material available at Biometrika online includes proofs of all theorems, additional simulation results and information on generalized optimal flow. The software package dafot is available at https://cran.r-project.org/web/packages/DAFOT/index.html.

References

Aitchison, J. (1982). The statistical analysis of compositional data. J. R. Statist. Soc. B 44, 139–60.
Aitchison, J. (2003). The Statistical Analysis of Compositional Data. Caldwell, NJ: Blackburn Press.
Anderson, M. J. (2014). Permutational multivariate analysis of variance (PERMANOVA). Wiley Statsref, doi: 10.1002/9781118445112.stat07841.
Arias-Castro, E., Candes, E. J. & Durand, A. (2011). Detection of an anomalous cluster in a network. Ann. Statist. 39, 278–304.
Arias-Castro, E., Donoho, D. L. & Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Info. Theory 51, 2402–25.
Bai, Z. & Saranadasa, H. (1996). Effect of high dimension: By an example of a two-sample problem. Statist. Sinica 6, 311–29.
Cai, T. T., Liu, W. & Xia, Y. (2014). Two-sample test of high-dimensional means under dependence. J. R. Statist. Soc. B 76, 349–72.
Cao, Y., Lin, W. & Li, H. (2017). Two-sample tests of high-dimensional means for compositional data. Biometrika 105, 115–32.
Chen, J., Ryu, E., Hathcock, M., Ballman, K., Chia, N., Olson, J. E. & Nelson, H. (2016). Impact of demographics on human gut microbial diversity in a US midwest population. PeerJ 4, e1514.
Chen, S., Li, J. & Zhong, P. (2019). Two-sample and ANOVA tests for high-dimensional means. Ann. Statist. 47, 1443–74.
Chen, S. & Qin, Y. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38, 808–35.
DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P. & Andersen, G. L. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–72.
Dümbgen, L. & Spokoiny, V. G. (2001). Multiscale testing of qualitative hypotheses. Ann. Statist. 29, 124–52.
Egozcue, J. J. & Pawlowsky-Glahn, V. (2016). Changing the reference measure in the simplex and its weighting effects. Austrian J. Statist. 45, 25–44.
Evans, S. & Matsen, F. (2012). The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. J. R. Statist. Soc. B 74, 569–92.
Fukuyama, J. (2017). Adaptive gPCA: A method for structured dimensionality reduction. arXiv: 1702.00501.
Good, P. (2013). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer Science & Business Media.
Ingster, Y. I. (1993a). Asymptotically minimax hypothesis testing for nonparametric alternatives I. Math. Meth. Statist. 2, 85–114.
Ingster, Y. I. (1993b). Asymptotically minimax hypothesis testing for nonparametric alternatives II. Math. Meth. Statist. 2, 171–89.
Ingster, Y. I. (1993c). Asymptotically minimax hypothesis testing for nonparametric alternatives III. Math. Meth. Statist. 2, 249–68.
Ingster, Y. I. & Suslina, I. A. (2012). Nonparametric Goodness-of-Fit Testing under Gaussian Models. New York: Springer Science & Business Media.
Janssen, S., McDonald, D., Gonzalez, A., Navas-Molina, J. A., Jiang, L., Xu, Z., Winker, K., Kado, D. M., Orwoll, E., Manary, M. et al. (2018). Phylogenetic placement of exact amplicon sequences improves associations with clinical information. MSystems 3, e00021.
Jeng, X. J., Cai, T. T. & Li, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. J. Am. Statist. Assoc. 105, 1156–66.
Konikoff, T. & Gophna, U. (2016). Oscillospira: A central, enigmatic component of the human gut microbiota. Trends Microbiol. 24, 523–4.
Lin, W., Shi, P., Feng, R. & Li, H. (2014). Variable selection in regression with compositional covariates. Biometrika 101, 785–97.
Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550.
Lozupone, C., Hamady, M., Kelley, S. & Knight, R. (2007). Quantitative and qualitative diversity measures lead to different insights into factors that structure microbial communities. Appl. Envir. Microbiol. 73, 1576–85.
Lozupone, C. & Knight, R. (2005). UniFrac: A new phylogenetic method for comparing microbial communities. Appl. Envir. Microbiol. 71, 8228–35.
Mandal, S., Van Treuren, W., White, R. A., Eggesbø, M., Knight, R. & Peddada, S. D. (2015). Analysis of composition of microbiomes: A novel method for studying microbial composition. Microbial Ecol. Health Dis. 26, 27663.
McArdle, B. H. & Anderson, M. J. (2001). Fitting multivariate models to community data: A comment on distance-based redundancy analysis. Ecology 82, 290–7.
Mirarab, S., Nguyen, N. & Warnow, T. (2012). SEPP: SATé-enabled phylogenetic placement. In Pac. Symp. Biocomput., 247–58.
Morgan, X. C., Tickle, T. L., Sokol, H., Gevers, D., Devaney, K. L., Ward, D. V., Reyes, J. A., Shah, S. A., Leleiko, N., Snapper, S. B. et al. (2012). Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79.
R Development Core Team (2021). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
Rivera-Pinto, J., Egozcue, J., Pawlowsky-Glahn, V., Paredes, R., Noguera-Julian, M. & Calle, M. (2018). Balances: A new perspective for microbiome analysis. MSystems 3, e00053.
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. (2010). EdgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–40.
Santoru, M. L., Piras, C., Murgia, A., Palmas, V., Camboni, T., Liggi, S., Ibba, I., Lai, M. A., Orrù, S., Blois, S. et al. (2017). Cross-sectional evaluation of the gut-microbiome metabolome axis in an Italian cohort of IBD patients. Scientific Rep. 7, 9523.
Sejdinovic, D., Sriperumbudur, B., Gretton, A. & Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41, 2263–91.
Smith, C. C., Snowberg, L. K., Caporaso, J. G., Knight, R. & Bolnick, D. I. (2015). Dietary input of microbes and host genetic variation shape among-population differences in stickleback gut microbiota. ISME J. 9, 2515.
Srivastava, M. S. & Du, M. (2008). A test for the mean vector with fewer observations than the dimension. J. Multi. Anal. 99, 386–402.
Székely, G. J. & Rizzo, M. L. (2005). A new test for multivariate normality. J. Multi. Anal. 93, 58–80.
Wagner, B. D., Robertson, C. E. & Harris, J. K. (2011). Application of two-part statistics for comparison of sequence variant counts. PLOS One 6, e20296.
Wang, S., Fan, J., Pocock, G., Arena, E. T., Eliceiri, K. W. & Yuan, M. (2021). Structured correlation detection with application to colocalization analysis in dual-channel fluorescence microscopic imaging. Statist. Sinica 31, 333–60.
Washburne, A. D., Morton, J. T., Sanders, J., McDonald, D., Zhu, Q., Oliverio, A. M. & Knight, R. (2018). Methods for phylogenetic analysis of microbiome data. Nature Microbiol. 3, 652–61.
Wu, G. D., et al. (2016). Comparative metabolomics in vegans and omnivores reveal constraints on diet-dependent gut microbiota metabolite production. Gut 65, 63–72.
Xia, Y. & Sun, J. (2017). Hypothesis testing and statistical analysis of microbiome. Genes Dis. 4, 138–48.
Zhao, N., Chen, J., Carroll, I. M., Ringel-Kulka, T., Epstein, M. P., Zhou, H., Zhou, J. J., Ringel, Y., Li, H. & Wu, M. C. (2015). Testing in microbiome-profiling studies with MiRKAT, the microbiome regression-based kernel association test. Am. J. Hum. Genet. 96, 797–807.
