Abstract
Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y = ΘΠ+ Z, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This paper studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0–1 loss, and partial recovery, as quantified by the normalized Kendall’s tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the non-responders of the IBD patients after 8 weeks of treatment.
Keywords: Kendall’s tau, Microbiome growth dynamics, Minimax lower bound, Sorting
1. INTRODUCTION
1.1. A Motivation Example from Microbiome Studies
The statistical problem considered in this paper is motivated by the problem of estimating the bacterial growth dynamics based on shotgun metagenomics data (Myhrvold et al. 2015; Abel et al. 2015; Korem et al. 2015; Brown et al. 2016). The growth dynamics of microbial populations reflect their physiological states and drive variation in microbial compositions, and thus provide an important summary feature of the microbes in a given community. One way of studying such communities is through shotgun metagenomic sequencing, which involves direct DNA sequencing of all the microbial genomes in a given microbial community. Korem et al. (2015) presented the first paper on quantifying the bacterial growth dynamics based on shotgun metagenomics data, where the uneven sequencing read coverage resulting from bidirectional DNA replication provides information on the rates of microbial DNA replication. For bacterial species with known complete genome sequences, Korem et al. (2015) proposed to use the peak-to-trough ratio (PTR) of read coverages to quantify the bacterial growth dynamics after aligning the sequencing reads to the complete genome sequences.
However, in many applications, it is important to quantify the bacterial growth dynamics based on genome assemblies for bacterial species with unknown genomes. These genome assemblies may represent new bacterial species that have not been seen or sequenced before. The genome assembly of a bacterial species consists of a collection of contigs (called a bin) constructed based on the overlapping of the sequencing reads (Li et al. 2015; Wu et al. 2014). Compared to the complete genome, the assembled bins are more fragmented and often contain errors or contamination. The noisy read coverage data due to intraspecific variations, interspecific/intraspecific repeated sequences, limited sequencing depths and the inability of binning algorithms to correctly cluster all the contigs further complicate the estimation of growth dynamics based on read coverages of the contigs. Besides these noisy count data, one key difficulty in estimating the growth dynamics based on contig counts is that the accurate locations of the contigs on the original genome are unknown. It is therefore not feasible to measure the microbial growth rate directly using the peak-to-trough coverage ratio for the assembled genomes (Brown et al. 2016; Gao and Li 2018).
Brown et al. (2016) presented the first method (called iRep) of estimating the bacterial growth dynamics based on genome assemblies, where the contigs are ordered based on the GC-adjusted counts for each sample separately. However, due to noise in the count data, such an ordering method often leads to wrong ordering of the contigs and therefore inaccurate estimates of the growth dynamics. Gao and Li (2018) developed a computational algorithm, DEMIC, to accurately compare growth dynamics of a given assembled species existing in multiple samples by taking advantage of highly fragmented contigs assembled in typical metagenomics studies. One key step of DEMIC is to apply a principal components analysis (PCA)-based method to recover the true ordering of the contigs along the underlying unknown bacterial complete genomes. Gao and Li (2018) reported excellent empirical performance of DEMIC over existing methods. The goal of this paper is to provide a rigorous statistical framework to study the problem of optimal permutation recovery in a permuted monotone matrix model.
1.2. A Permuted Monotone Matrix Model
For a given genome assembly with p contigs, DEMIC first obtains the read coverage for each sliding window of size 5000 bps, denoted by Xijk for the ith sample, jth contig and kth window. In order to account for the GC-content of the kth window, Gao and Li (2018) considered the following mixed-effects model,
$$\log X_{ijk} = \alpha + \beta\,\mathrm{GC}_{jk} + W_{ij} + e_{ijk},$$
where GCjk is the centred GC count of the kth window of the jth contig, Wij is the sample- and contig- specific random intercept, α is the intercept, β is the regression coefficient, and eijk is the random error. This model is fitted for each contig to obtain the best linear unbiased predictor of Wij, which is used as the GC-adjusted log-read count Yij for the ith sample and jth contig. Here Yij can be regarded as average read coverage over non-overlapping windows of a contig and is approximately normally distributed.
Let Y be the GC-adjusted log-contig count matrix of n samples and p contigs of a genome assembly with Yij as its entries. Given this, we consider the following permuted monotone matrix model:
$$Y = \Theta\Pi + Z, \tag{1}$$
where Θ ∈ ℝ^{n×p} is an unknown nonnegative signal matrix with nondecreasing rows, Z ∈ ℝ^{n×p} is a zero-mean noise matrix, and Π ∈ {0,1}^{p×p} is a permutation matrix corresponding to some permutation π from the symmetric group S_p. That is, after a suitable permutation of the columns of Y, all the rows of the mean matrix are nondecreasing sequences. In microbiome applications, Θ is the matrix of true log-coverage of n samples over p contigs along the circular genome of the bacterium, which is generally hypothesized to have nondecreasing rows. Π represents a permutation due to the unknown locations of the contigs relative to the replication origin. Throughout this paper, we denote the parameter space
$$\mathcal{D} = \{(\Theta, \pi) \in \mathbb{R}^{n\times p} \times S_p : 0 \le \theta_{i,j} \le \theta_{i,j+1} \text{ for all } 1 \le i \le n,\ 1 \le j \le p-1\}.$$
The focus of this paper is to optimally estimate the permutation π from the noisy observation Y.
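To make model (1) concrete, the following minimal NumPy sketch simulates a noisy permuted monotone matrix. It assumes the convention that column j of ΘΠ is column π(j) of Θ (the text does not fix this convention explicitly), uses 0-based indices, and draws Gaussian noise. The function name `simulate_permuted_monotone` and all numeric settings are illustrative only.

```python
import numpy as np

def simulate_permuted_monotone(theta, pi, sigma, rng):
    """Return Y = Theta Pi + Z, where column j of Y is column pi[j] of Theta plus noise."""
    n, p = theta.shape
    return theta[:, pi] + sigma * rng.standard_normal((n, p))

rng = np.random.default_rng(0)
n, p = 4, 6
# Nonnegative, nondecreasing rows: cumulative sums of nonnegative increments.
theta = np.cumsum(rng.uniform(0.0, 1.0, size=(n, p)), axis=1)
pi = rng.permutation(p)                      # unknown permutation (0-based)
Y = simulate_permuted_monotone(theta, pi, sigma=0.1, rng=rng)

# Undoing the permutation (argsort gives the inverse) restores monotone mean rows.
Y_noiseless = simulate_permuted_monotone(theta, pi, sigma=0.0, rng=rng)
print(np.allclose(Y_noiseless[:, np.argsort(pi)], theta))
```

In practice only Y is observed, so the inverse permutation used in the last line is exactly what has to be estimated.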
1.3. Related Problems and Other Applications
The permutation recovery problem under permuted monotone matrix model bears some similarity to other problems studied in machine learning literature, including the feature matching between two sets of observations (Collier and Dalalyan 2016) and linear regression model with permuted data, where the correspondences between the response and the predictors are unknown (Pananjady, Wainwright, and Courtade 2016; Slawski and Ben-David 2017; Pananjady, Wainwright, and Courtade 2017). More recently, Flammarion, Mao, and Rigollet (2019) considered the problem of statistical seriation, which has a close affinity to our model (1). However, the focus of Flammarion, Mao, and Rigollet (2019) is to optimally estimate the signal matrix Θ rather than the underlying permutation.
Model (1) can be thought of as a natural extension of the shape-constrained matrix denoising model studied in the isotonic regression literature. Specifically, under Model (1) with known Π = Ip, risk bounds and the minimax rate-optimal estimator for Θ under the Frobenius norm were obtained in Chatterjee, Guntuboyina, and Sen (2015) for n = 1 and later in Chatterjee, Guntuboyina, and Sen (2018) for general n > 1. Using the idea of optimal transport, a minimax optimal estimator of the underlying signals was obtained by Rigollet and Weed (2018). However, their goal is not to recover the underlying permutation.
Besides the microbiome applications, the permuted monotone matrix model is generic and has other applications. For instance, the problem of permutation recovery is usually equivalent to statistical ranking/sorting from noisy observations, which arises commonly in finance (Currie and Pandher 2011), sports analytics (Deshpande and Jensen 2016), and recommendation systems (Rendle et al. 2009). Specifically, in the latter case, the task of tag recommendation is to provide a user with a personalized ranked list of tags for a specific item. Under the permuted monotone matrix model, we can treat the entries of Y, say Yij, as an indicator of the jth tag being related to the ith item by a given customer, and Θ as a probability matrix characterizing the customer’s tagging preferences across multiple items. As a result, recovering the underlying permutation provides a solution to the tag recommendation problem.
1.4. Main Contributions and Organization
In this paper, we investigate the problem of permutation recovery in the permuted monotone matrix model (1) and propose an estimator that relies on a certain invariance property of the singular subspace of monotone matrices. The properties of the proposed method in terms of both exact and partial recovery are studied in detail. In particular, we obtain the regions of the signal-to-noise ratio (defined later as Γ/σ) in which exact or partial recovery is achievable (Figure 1). For both exact and partial permutation recovery, we obtain matching minimax lower bounds and establish the minimax rate-optimality of the proposed method over a wide range of the parameter space (Figure 1). For partial recovery, the proof of the lower bound relies on a version of Fano’s lemma and the sphere packing of the symmetric group equipped with the Kendall’s tau metric.
Fig. 1.
A graphical illustration of the main result obtained in this paper about the regions of the signal-to-noise ratio Γ /σ that correspond to exact/partial recovery, and the region with proved minimax optimality.
The rest of this paper is organized as follows. After a brief introduction of notation and definitions, we present in Section 2 the proposed permutation estimator. The theoretical properties of the proposed method are studied, first under a more illustrative linear growth model in Section 3 and then under a general growth model in Section 4. Section 5 provides results on minimax lower bounds and the optimality of the proposed estimator. In Section 6, we evaluate the methods using simulated data as well as synthetic and real microbiome datasets, and compare them with other methods. In Section 7, we discuss some implications and extensions of the methods. Finally, the proofs of our main results are given in Section 8.
1.5. Notation and Definitions
Throughout, we define a permutation π as a bijection from the set {1,2,..., p} onto itself. For simplicity, we denote π = (π(1), π(2),...,π(p)). All permutations of the set {1,2,..., p} form a symmetric group, equipped with the function composition operation ∘, denoted as S_p. For any π ∈ S_p, we denote π^{−1} as its group inverse, so that π∘π^{−1} = π^{−1}∘π = id, and denote rev(π) = (π(p), π(p−1),..., π(1)). In particular, we may use π and its corresponding permutation matrix Π interchangeably, depending on the context. For a vector a = (a1,...,an)⊤ ∈ ℝ^n, we define the ℓp norm ‖a‖p = (Σ_{i=1}^n |ai|^p)^{1/p}, and the ℓ∞ norm ‖a‖∞ = max_{1≤i≤n} |ai|. For a matrix A ∈ ℝ^{p1×p2}, we denote A.i as its i-th column, Ai. as its i-th row, and denote its (ordered) singular values as λ1(A) ≥ λ2(A) ≥ ... ≥ λ_{min{p1,p2}}(A) ≥ 0. Furthermore, for sequences {an} and {bn}, we write an = o(bn) if lim_{n→∞} an/bn = 0, and write an = O(bn), an ≲ bn or bn ≳ an if there exists a constant C such that an ≤ Cbn for all n. We write an ≍ bn if an ≲ bn and an ≳ bn. For a finite set A, we denote |A| as its cardinality. We use the logical symbols ∧ and ∨ to represent “and” and “or,” respectively. Lastly, C, C0, C1,... are constants that may vary from place to place.
2. PERMUTATION RECOVERY VIA BEST LINEAR PROJECTION
In the following, we first make some key observations about the connection between the underlying permutation π and the column linear projections of the observed matrix Y, which motivate our construction of the proposed estimator.
2.1. Linear Projection
Given the observed noisy matrix Y, we consider the class of linear projection statistics of the form w⊤Y, where w ∈ ℝ^n and ||w||2 = 1. Intuitively, by projecting each column of Y onto the subspace generated by w, the components of w⊤Y (hereafter referred to as “projection scores”) would quantify the relative position of the columns of Y, so that their order statistics can be used to recover the original orders of the columns of Θ. To fix ideas, we define the following ranking operator.
Definition 1 (Ranking Operator). The ranking operator r: ℝ^p → S_p is defined such that for any vector x ∈ ℝ^p, r(x) is the vector of ranks for the components of x in increasing order. Whenever there are ties, increasing orders are assigned from left to right.
For example, given a vector x = (2,5,1,6,2)⊤, we have r(x) = (2,4,1,5,3)⊤. The following proposition concerning the invariance property of the column spacing of Θ is the key to our construction of the minimax optimal estimator.
Proposition 1. Suppose . For any nonnegative unit vector , we have
| (2) |
Apparently, under the noiseless setting, any nonnegative unit vector would lead to the exact recovery of the underlying permutation as in this case the relative orders of the columns are exactly coded by the relative magnitudes of the projection scores w⊤Y = w⊤ΘΠ. However, with the noisy observations, w⊤Y = w⊤ΘΠ+ w⊤Z so that the relative orders of the columns are only partially preserved by the noisy projection scores w⊤Y, up to some random perturbations.
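The noiseless case can be checked numerically. The sketch below implements the ranking operator of Definition 1 with NumPy (a stable sort resolves ties from left to right) and verifies that an arbitrary nonnegative unit vector w recovers the permutation from w⊤ΘΠ when the rows of Θ are strictly increasing. The helper name `rank_operator`, the random Θ, and the convention that column j of ΘΠ is column π(j) of Θ are assumptions made for illustration.

```python
import numpy as np

def rank_operator(x):
    """1-based ranks in increasing order; ties are ranked from left to right."""
    order = np.argsort(np.asarray(x), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

rng = np.random.default_rng(1)
n, p = 5, 8
theta = np.cumsum(rng.uniform(0.1, 1.0, size=(n, p)), axis=1)  # strictly increasing rows
pi = rng.permutation(p) + 1                                     # true permutation, 1-based
theta_pi = theta[:, pi - 1]                                     # noiseless observation

w = rng.uniform(size=n)
w /= np.linalg.norm(w)                  # arbitrary nonnegative unit vector
recovered = rank_operator(w @ theta_pi)
print(np.array_equal(recovered, pi))
```

With noise added to `theta_pi`, the same ranks are only approximately preserved, which is why the choice of w matters.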
Consequently, the best linear projection vector w0 would correspond to the case where w0⊤ΘΠ has the most separated components, so that their relative orders are most immune to the random noise. Specifically, since for any given w ∈ ℝ^n, the i-th component of w⊤ΘΠ has the expression w⊤ΘΠei, where {e1,...,ep} is the canonical basis of the Euclidean space ℝ^p, we define
$$w_0 = \arg\max_{w\in\mathbb{R}^n,\ \|w\|_2=1}\ \sum_{1\le i<j\le p} \big(w^\top\Theta\Pi e_i - w^\top\Theta\Pi e_j\big)^2,$$
which maximizes the pairwise distances of the components under the squared distance. Now since w0 relies on the unknown ΘΠ and is not computable from the data, we substitute ΘΠ by its sample/noisy counterpart Y and define our data-driven best linear projection vector as
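$$\hat{w} = \arg\max_{w\in\mathbb{R}^n,\ \|w\|_2=1}\ \sum_{1\le i<j\le p} \big(w^\top Y e_i - w^\top Y e_j\big)^2, \tag{3}$$
which is actually the first eigenvector of the symmetric matrix
$$A = \sum_{1\le i<j\le p} Y(e_i-e_j)(e_i-e_j)^\top Y^\top, \tag{4}$$
and can be immediately solved by performing an eigen-decomposition on A. Once ŵ is obtained, we define our proposed permutation estimator as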
$$\hat{\pi} = r(\hat{w}^\top Y). \tag{5}$$
Intuitively, the projection vector ŵ assigns different weights to the rows of Y so that more weight is given to the rows whose elements are better separated and therefore more informative in distinguishing the columns of Y or Θ.
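As a hedged illustration of the estimator, the sketch below assumes that the matrix in (4) is the sum over column pairs i < j of Y(e_i − e_j)(e_i − e_j)⊤Y⊤, which is what the pairwise squared-distance criterion suggests. Because Σ_{i<j}(e_i − e_j)(e_i − e_j)⊤ = pI − 11⊤, this matrix equals p times the Gram matrix of the row-centered Y, so it can be formed without looping over pairs. The function names, the linear test signal, and the noise level are hypothetical choices, not taken from the paper.

```python
import numpy as np

def rank_operator(x):
    """1-based ranks in increasing order; ties are ranked from left to right."""
    order = np.argsort(np.asarray(x), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def projection_matrix(Y):
    """A = sum_{i<j} Y (e_i - e_j)(e_i - e_j)^T Y^T = p * Y Y^T - (Y 1)(Y 1)^T."""
    p = Y.shape[1]
    row_sums = Y.sum(axis=1)
    return p * (Y @ Y.T) - np.outer(row_sums, row_sums)

def best_linear_projection(Y):
    """Return (w_hat, pi_hat): leading eigenvector of A and the ranks of w_hat^T Y."""
    _, eigvecs = np.linalg.eigh(projection_matrix(Y))   # eigenvalues in ascending order
    w_hat = eigvecs[:, -1]
    return w_hat, rank_operator(w_hat @ Y)

rng = np.random.default_rng(0)
n, p = 20, 30
theta = np.outer(rng.uniform(0.5, 1.0, n), np.arange(1, p + 1)) + rng.uniform(1, 3, (n, 1))
pi = rng.permutation(p) + 1                              # true permutation, 1-based
Y = theta[:, pi - 1] + 0.05 * rng.standard_normal((n, p))
w_hat, pi_hat = best_linear_projection(Y)
# The sign of w_hat is arbitrary, so pi_hat may be the reversed ranking p + 1 - pi.
recovered = np.array_equal(pi_hat, pi) or np.array_equal(p + 1 - pi_hat, pi)
print(recovered)
```

Using `numpy.linalg.eigh` is a design choice that exploits the symmetry of A; taking the first left singular vector of the row-centered Y would give the same direction.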
2.2. Evaluation Criteria
The main focus of this paper is to investigate the theoretical properties of our proposed estimator (5) under various loss measures and parameter spaces. For any given estimator π̂, we first consider the 0–1 loss
$$\ell_{0\text{-}1}(\hat\pi, \pi) = 1\{\hat\pi \ne \pi\},$$
with the corresponding risk E ℓ0-1(π̂, π) = P(π̂ ≠ π). The 0–1 loss is used to evaluate the exact recovery, which can be a strong requirement for practical applications. As an alternative, we also consider the more flexible partial recovery, where the loss function is given by the normalized Kendall’s tau distance (Kendall 1938) defined as
$$\tau_K(\hat\pi, \pi) = \frac{\sum_{1\le i<j\le p} 1\{(\hat\pi(i)-\hat\pi(j))(\pi(i)-\pi(j)) < 0\}}{p(p-1)/2}. \tag{6}$$
Technically, for two permutations π1 and π2, the set of discordant pairs is defined as
$$\{(i,j): i<j,\ [\pi_1(i)<\pi_1(j) \wedge \pi_2(i)>\pi_2(j)] \vee [\pi_1(i)>\pi_1(j) \wedge \pi_2(i)<\pi_2(j)]\},$$
so that the numerator in (6) is equal to the cardinality of this set with π1 = π̂ and π2 = π, which, in fact, is also the minimum number of pairwise adjacent transpositions converting π1^{−1} into π2^{−1} (Diaconis 1988). The denominator ensures that τK (π1, π2) ∈ [0,1], where τK (π1, π2) = 0 corresponds to π1 = π2.
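The two loss functions can be sketched directly from their verbal definitions. The code below counts discordant pairs by brute force and assumes the normalizing constant is the total number of pairs, p(p − 1)/2, which is what makes the distance lie in [0, 1]. The function names are illustrative.

```python
from itertools import combinations
import numpy as np

def zero_one_loss(pi_hat, pi):
    """1 if the two permutations differ anywhere, else 0."""
    return int(not np.array_equal(pi_hat, pi))

def discordant_pairs(pi1, pi2):
    """Pairs (i, j), i < j, that pi1 and pi2 order in opposite ways."""
    return [(i, j) for i, j in combinations(range(len(pi1)), 2)
            if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) < 0]

def kendall_tau_distance(pi1, pi2):
    """Number of discordant pairs divided by p(p-1)/2 (assumed normalization)."""
    p = len(pi1)
    return len(discordant_pairs(pi1, pi2)) / (p * (p - 1) / 2)

identity = [1, 2, 3, 4, 5]
pi = [2, 4, 1, 5, 3]
print(zero_one_loss(pi, identity))                      # 1
print(kendall_tau_distance(pi, identity))               # 0.4
print(kendall_tau_distance(identity[::-1], identity))   # 1.0
```

The brute-force count costs O(p²) operations, which is adequate for the values of p used in the simulations; a merge-sort based inversion count would be preferable for very large p.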
3. A LINEAR GROWTH MODEL
We start with a simpler case where the pair (Θ,π) is from the subspace
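$$\mathcal{D}_L = \{(\Theta,\pi) \in \mathbb{R}^{n\times p}\times S_p : \theta_{ij} = a_i\eta_j + b_i,\ a_i, b_i \ge 0 \text{ for } 1\le i\le n,\ 0\le \eta_1\le\cdots\le\eta_p\}. \tag{7}$$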
In other words, each row of Θ has a linear growth pattern with possibly different intercepts and different slopes. In the context of bacterial growth dynamics, this model is sometimes referred to as the Cooper-Helmstetter model (Cooper and Helmstetter 1968; Bremer and Churchward 1977), which associates the copy number of genes with their relative distances to the replication origin. Specifically, ai is the ratio of genome replication time to doubling time, which can be used to quantify the bacterial growth dynamics for the ith sample, ηj is related to the distance from the replication origin for the jth contig, and bi is related to the read counts at the replication origin and the sequencing depth. If the bacterium is non-dividing in sample i, ai is zero.
For the linear growth model (7), there are two key quantities that are relevant to permutation recovery.
Definition 2. For any , we define
| (8) |
as the local minimal signal gap of Θ, and define
| (9) |
as the global signal strength of Θ, where .
Intuitively, both quantities involve the set {ηj : 1 ≤ j ≤ p} and the ℓ2 norm of the vector a = (a1,...,an)⊤, which characterize the column spacings and the growth rates (slopes) of Θ, respectively. Throughout this paper, we assume (A1) the additive noise matrix Z = (zij) has i.i.d. entries zij ~ N(0, σ²). The Gaussian assumption simplifies our theoretical analysis, but it is not essential: all the theoretical results remain true if Z has independent sub-Gaussian entries with parameters bounded by σ². The following theorem provides conditions on Γ and Λ such that exact recovery of π can be obtained by π̂ in (5).
Theorem 1 (Exact Recovery, Linear). Suppose (A1) holds, and Θ satisfies
| (10) |
for some C0, C1 > 0. Then with probability at least 1 − O(p^{−c}) for some constant c > 0, up to a permutation reversion, we have π̂ = π.
Remark 1. Due to the non-identifiability between ŵ and −ŵ defined in (3), in Theorem 1, as well as all the other theoretical results concerning π̂, the statement is up to a possible reversion of π̂. For example, for permutation π = (2,4,1,5,3), its reversion would be rev(π) = (4,2,5,1,3). In fact, such indeterminacy can be avoided by noting that ai ≥ 0 for all i, but we will not pursue such a direction in this study as the practical interest only concerns the relative orders of the permuted elements.
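The sign indeterminacy can be illustrated with a short sketch. Replacing w by −w negates every projection score, so (absent ties) each rank r becomes p + 1 − r, which reproduces the example in Remark 1. The helper names `reversion` and `zero_one_loss_up_to_reversion` are hypothetical, and the score vector is made up for illustration.

```python
import numpy as np

def rank_operator(x):
    """1-based ranks in increasing order; ties are ranked from left to right."""
    order = np.argsort(np.asarray(x), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def reversion(ranks):
    """Ranks obtained after flipping the sign of the scores: r -> p + 1 - r."""
    ranks = np.asarray(ranks)
    return len(ranks) + 1 - ranks

def zero_one_loss_up_to_reversion(pi_hat, pi):
    """0 if pi_hat equals pi directly or after reversion, else 1."""
    pi_hat, pi = np.asarray(pi_hat), np.asarray(pi)
    return int(not (np.array_equal(pi_hat, pi) or np.array_equal(reversion(pi_hat), pi)))

scores = np.array([0.3, 1.2, -0.5, 2.0, 0.7])   # hypothetical projection scores
print(rank_operator(scores))     # [2 4 1 5 3]
print(rank_operator(-scores))    # [4 2 5 1 3]
print(zero_one_loss_up_to_reversion(rank_operator(-scores), rank_operator(scores)))  # 0
```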
Since Γ depends on a only through its ℓ2 norm ∥a∥2, the local minimal signal gap (MSG) condition allows for the presence of non-informative signals in the sense that some components of a can be 0. In contrast, the condition on Λ (GSS) depends on a trade-off between Γ and . On the one hand, when , the condition on Λ becomes , which is independent of Γ, and is minimax optimal for left singular subspace estimation (Cai and Zhang 2018). On the other hand, when , a stronger condition on Λ is imposed, as a compensation for small Γ.
In some cases, the GSS condition in (10) can be implied by the MSG condition. We summarize our results in the following proposition.
Proposition 2. Suppose Γ /σ>1/ p and the MSG condition hold. Then the GSS condition can be implied by either one of the following conditions
;
, and either .
We next turn to the partial recovery and study the rate of convergence of measured by the normalized Kendall’s tau distance under the linear growth model. In particular, we will assume an approximate uniform assignment of over some subinterval of [0,∞). In other words, the minimal element and maximal element of the set should have roughly the same magnitude, so that . This is equivalent to assuming that the contigs in genome assemblies are approximately uniformly spaced along the circular genome.
Theorem 2 (Partial Recovery, Linear). Suppose (A1) holds, , and Θ satisfies
there exists some C0 > 0 such that for all p > 0, and
for some C1 > 0.
Then, up to a permutation reversion,
for some c, c0, c1, c2 > 0.
Remark 2. The risk upper bound derived in the above theorem can be simplified as
for some c > 0. In the case of Γ /σ→∞, simple calculation yields when , whereas when . As a result, we also have
| (11) |
See Figure 2 for an illustration.
Fig. 2.
A graphical illustration of the risk upper bound for , as a function of signal-to-noise ratio Γ /σ.
In general, Theorem 2 shows that, even with a weaker condition on Γ that is below the requirement for the exact recovery, our proposed estimator is still able to obtain a partial recovery of π with an exponential rate of convergence if Γ /σ ≳1 and a polynomial rate of convergence if 1/ p < Γ /σ ≲1. As for Λ, the requirement is essentially the same as the exact recovery, except for an additional log p term, which is negligible in the exact recovery scenario.
Some implications about the practically preferable settings of n and p should be clarified. Firstly, although Theorem 1 implies that the difficulty of exact recovery increases as p grows (see also Table 1 from our simulations), our theory suggests a wide range of feasible choices for p. For example, if the underlying signals θij and the noise level σ² are of constant order, then we have and Λ ≍ np³, so the conditions of Theorem 1 imply that the exact recovery can be guaranteed as long as log p ≲ n. In other words, p is allowed to grow exponentially with n, which is in line with the modern high-dimensional setting. Secondly, Theorem 2 implies that, even if some conditions (such as MSG) for the exact recovery are not satisfied, one can still hope to partially recover the underlying permutation. In accordance with our theoretical result (11), our numerical results (Figure 4) show that, for the partial recovery, increasing p indeed reduces the overall risk of the proposed estimator. Finally, as to the sample size n, we argue that, without additional structural assumptions such as row-sparsity, it is very unlikely that including more samples will result in a worse estimate (see Table 1 and Figure 4 for numerical evidence).
Table 1.
The empirical risks of the estimators under the 0–1 loss based on 200 simulations for various combinations of the parameters (p, n, α). π̂: proposed method; Πmean: mean-based method; Πmax: max-based method.
| | S1 (σ² = 0.025) | | S2 (σ² = 0.1) | | S3 (σ² = 0.0075) | | S4 (σ² = 0.025) | |
| p = 75, n = 40 | α = 0.1 | α = 0.2 | α = 0.1 | α = 0.2 | α = 0.1 | α = 0.2 | α = 0.1 | α = 0.2 |
| π̂ | 0.775 | 0.575 | 0.415 | 0.000 | 0.025 | 0.020 | 0.025 | 0.000 |
| Πmean | 0.925 | 0.815 | 0.955 | 0.015 | 0.155 | 0.135 | 0.880 | 0.005 |
| Πmax | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.970 | 0.840 | 0.430 |
| n = 40, α = 0.1 | p = 60 | p = 90 | p = 60 | p = 90 | p = 60 | p = 90 | p = 60 | p = 90 |
| π̂ | 0.410 | 0.930 | 0.340 | 0.470 | 0.010 | 0.115 | 0.000 | 0.010 |
| Πmean | 0.720 | 0.985 | 0.910 | 0.980 | 0.070 | 0.245 | 0.775 | 0.900 |
| Πmax | 1.000 | 1.000 | 1.000 | 1.000 | 0.975 | 1.000 | 0.815 | 0.875 |
| p = 75, α = 0.1 | n = 40 | n = 60 | n = 40 | n = 60 | n = 40 | n = 60 | n = 40 | n = 60 |
| π̂ | 0.765 | 0.440 | 0.475 | 0.095 | 0.050 | 0.020 | 0.010 | 0.005 |
| Πmean | 0.920 | 0.645 | 0.940 | 0.700 | 0.175 | 0.045 | 0.900 | 0.905 |
| Πmax | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.855 | 0.820 |
Fig. 4.
Boxplots of the empirical normalized Kendall’s tau distance between the estimated and true permutations under different models. π̂: proposed estimator; Πmean: mean-based estimator; Πmax: max-based estimator.
4. A GENERAL GROWTH MODEL
In this section, we study permutation recovery over the general parameter space, where the growth pattern is not necessarily linear and is therefore more realistic given the noisy nature of shotgun metagenomic datasets (Boulund et al. 2018; Gao and Li 2018). The analysis relies on a deeper understanding of the relationship between row-monotone matrices and their leading singular vectors.
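Specifically, for any signal matrix Θ in the general parameter space, we define the row-centered matrix
$$\Theta' = \Theta\big(I_p - p^{-1}\mathbf{1}_p\mathbf{1}_p^\top\big), \tag{12}$$
whose singular value decomposition (SVD) is given by Θ′ = Σ_{i=1}^r λi(Θ′) u′i v′i⊤, with r ≤ min{n, p}. The following proposition is essential to our analysis of the general growth model.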
Proposition 3. Let Θ′ be defined as above, then its first right singular vector v′1 is a monotone vector, i.e., either v′11 ≤ v′12 ≤…≤ v′1p or v′11 ≥ v′12 ≥…≥ v′1p.
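Proposition 3 is easy to probe numerically. The sketch below assumes that the row-centering in (12) subtracts each row's mean, builds a random matrix with nonnegative, nondecreasing and nonlinear rows, and checks that the first right singular vector of the centered matrix is monotone up to floating-point tolerance. The random design and the tolerance are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 12
# Nonnegative rows with nondecreasing, nonlinear growth (cumulative sums of nonnegative steps).
theta = np.cumsum(rng.uniform(0.0, 1.0, size=(n, p)) ** 3, axis=1)
theta_centered = theta - theta.mean(axis=1, keepdims=True)   # subtract each row's mean

_, singular_values, vt = np.linalg.svd(theta_centered, full_matrices=False)
v1 = vt[0]                                                    # first right singular vector
steps = np.diff(v1)
is_monotone = bool(np.all(steps >= -1e-10) or np.all(steps <= 1e-10))
print(is_monotone)
```

The direction of monotonicity (increasing or decreasing) depends on the sign returned by the SVD routine, which mirrors the reversion ambiguity of the estimator.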
Together with Proposition 1, the above proposition justifies our construction of the permutation estimator using a PCA-based approach. To overcome the identifiability issue, we further assume λ1(Θ′) has multiplicity one. We first introduce several quantities that play key roles in permutation recovery over the general parameter space.
Definition 3. For any and the corresponding Θ′ defined as above, we define
as the local minimal signal gap, define
as the local maximal signal gap, and define
as the global signal strength of Θ.
In particular, the above definitions of Γ and Λ generalize the ones given earlier in the linear growth model, as these quantities coincide under the linear growth model (7). The following theorem concerns the exact permutation recovery with π̂ over the general parameter space.
Theorem 3 (Exact Recovery, General). Suppose (A1) holds, , and Θ satisfies and
for some C0, C1 > 0. Then with probability at least 1 − O(p^{−c}) for some constant c > 0, up to a permutation reversion, we have π̂ = π.
As in the case of linear growth model (Theorem 1), in Theorem 3, to guarantee exact recovery, we need the MSG condition . Unlike the linear growth model, here Γ only implicitly depends on the elements of Θ through its spectral quantities, which makes its interpretation less clear. To address this issue, we make the following observation that links the minimal singular vector gap in the definition of Γ to the elements of Θ.
Proposition 4. Let Θ′ in (12) be such that there exists a δ> 0 being the lower bound of the normalized minimum gap between any two entries in the same row, i.e.
Then the first singular vector of Θ′ satisfies .
Consequently, the implicit requirement that is large can be guaranteed when the normalized minimum distance is large. Our next theorem concerns the partial recovery over the general parameter space .
Theorem 4 (Partial Recovery, General). Suppose (A1) holds, , and Θ satisfies
there exists some C0 > 0 such that for all p > 0, and
for some C1 > 0.
Then, up to a permutation reversion,
for some c, c0, c1, c2 > 0.
Condition (i) of Theorem 4 parallels the one given in Theorem 2. It essentially requires an even spacing of the elements (the projected columns of Θ) whose ordering is to be tracked by π̂. In contrast, in both Theorems 3 and 4, the conditions on Λ are slightly more complicated than those in Theorems 1 and 2, as they further depend on the relative magnitude between Ξ/σ and . In particular, if , the conditions reduce to the ones required in the linear growth models. Interestingly, the risk upper bound obtained in Theorem 4 remains the same as in the linear growth model, which only depends on p and the signal-to-noise ratio Γ/σ.
5. MINIMAX LOWER BOUNDS AND OPTIMALITY
In this section, we establish the minimax lower bounds for both exact and partial recovery considered in previous sections, in relation to different levels of the signal-to-noise ratio Γ /σ. In the following theorem, we show the MSG condition for exact recovery is asymptotically sharp.
Theorem 5. Suppose (A1) holds. Let and . Then for any p ≥ 10, we have
where the infimum is over all the permutation estimators .
This theorem, along with Theorems 1 and 3, indicates that our proposed estimator is minimax rate-optimal over both the linear and the general parameter spaces in terms of the MSG condition on Γ. In light of Proposition 2, in some situations the MSG condition can be both necessary and sufficient for the exact recovery, which includes practically important cases such as n ≍ p, n < log p, etc. Using the information-theoretic language, we have therefore obtained both the achievability result, i.e., the existence of an algorithm or estimator that exactly recovers the signal with high probability, and the converse result, namely, an upper bound on the probability of exact recovery that applies to any estimator (Cullina and Kiyavash 2016). See Figure 3 for an illustration.
Fig. 3.
A graphical illustration of the achievability/converse result for exact recovery.
Our next theorem establishes a minimax lower bound for the expected rate of convergence for the partial recovery.
Theorem 6. Suppose (A1) holds, for some C, c > 0, and t/σ ≥ 2. Then there exist constants C1, C2 > 0 such that
Comparing the above minimax lower bound to the risk upper bounds obtained in Theorems 2 and 4, we conclude that our proposed estimator is minimax rate-optimal in terms of the partial recovery for both the linear growth model and the general growth model over the range whenever Γ/σ does not diminish (Figure 1). In particular, in Theorems 5 and 6, since the minimax lower bounds only concern the worst-case scenarios, the same lower bounds should hold for any parameter spaces whenever the same worst cases are included. Similarly, the assumption (A1) does not pose a restriction to the general applicability of such results.
6. NUMERICAL STUDIES
6.1. Simulation with Model-Generated Data
To demonstrate our theoretical results and compare with alternative methods, we generate data from model (1) with various configurations of the signal matrix Θ. We compare the empirical performance of our proposed estimator with the following alternatives:
πmean: Order the columns of Y by the magnitude of their column means;
πmax: Order the columns of Y by the magnitude of their column maximums.
We use both the 0–1 loss and the normalized Kendall’s tau distance in comparing these methods. Due to the identifiability issue, the performance of each estimator is evaluated up to a complete reversion of the permutation. For example, we use min{τK(π̂, π), τK(rev(π̂), π)} as the empirical Kendall’s tau distance. By symmetry, we set the underlying permutation π = id. The signal matrix is generated under the following four regimes:
S1(α, n, p): For any 1 ≤ j ≤ p, θij = log(1 + jαi +βi) where αi ~ Unif(α/2, α) for 1 ≤ i ≤ n/2, αi ~ Unif(0,0.01) for n/2 < i ≤ n, and βi ~ Unif(1,3) for all 1 ≤ i ≤ n;
S2(α, n, p): For any 1 ≤ j ≤ p, θij = jαi + βi where αi ~ Unif(α/2, α) for 1 ≤ i ≤ n/2, αi ~ Unif(0, α/10) for n/2 < i ≤ n, and βi ~ Unif(1,3) for all 1 ≤ i ≤ n;
S3(α, n, p): For any 1 ≤ j ≤ p, θij = log(1 + jαi + βi) where αi ~ Unif(α/2, α) for 1 ≤ i ≤ 3, αi ~ Unif(0,0.01) for 4 < i ≤ n, and βi ~ Unif(1,3) for all 1 ≤ i ≤ n;
S4(α, n, p): For any 1 ≤ j ≤ p, θij = jαi + βi where αi ~ Unif(α/2, α) for 1 ≤ i ≤ 3, αi ~ Unif(0, α/10) for 4 < i ≤ n, and βi ~ Unif(1,3) for all 1 ≤ i ≤ n.
Specifically, under each regime, the sample-specific “growth rate” parameter αi is randomly and uniformly generated either from the interval [α/2, α] or from an interval with much smaller values, namely, [0, α/10] in S2 and S4 and [0, 0.01] in S1 and S3. By construction, the four regimes consist of the nonlinear growth model where the signals spread out over many samples (S1) or concentrate at a few rows (S3), and the linear growth model where the signals spread out over many samples (S2) or concentrate at a few rows (S4). In particular, in accordance with our theory, for the supposedly “non-informative” samples, we allow the corresponding growth rates to be small but non-zero, which shows the flexibility of our proposed method. The entries of Z are drawn from i.i.d. centred normal distributions whose variance σ² will be given explicitly. In each setting, we evaluate the empirical performance of each method over a range of n, p or α. Each setting is repeated 200 times.
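The simulation design can be sketched as follows. This is a minimal, hedged illustration rather than the code behind Table 1: it assumes that S3 and S4 have exactly three informative rows (the index ranges in the regime definitions leave the fourth row unspecified), computes the proposed estimator from the leading left singular vector of the row-centered Y (which coincides with the first eigenvector of A if A is the pairwise-difference Gram matrix), and uses only a handful of repetitions. All function names are hypothetical.

```python
import numpy as np

def make_theta(regime, alpha, n, p, rng):
    """Signal matrix for regimes S1-S4 with pi = id."""
    k = n // 2 if regime in ("S1", "S2") else 3             # informative rows (3 assumed in S3/S4)
    small = 0.01 if regime in ("S1", "S3") else alpha / 10   # upper bound for the remaining rows
    a = np.concatenate([rng.uniform(alpha / 2, alpha, k), rng.uniform(0, small, n - k)])
    b = rng.uniform(1, 3, n)
    lin = np.outer(a, np.arange(1, p + 1)) + b[:, None]      # j * alpha_i + beta_i
    return np.log1p(lin) if regime in ("S1", "S3") else lin  # log(1 + j * alpha_i + beta_i)

def rank_operator(x):
    order = np.argsort(np.asarray(x), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def pi_proposed(Y):
    u = np.linalg.svd(Y - Y.mean(axis=1, keepdims=True), full_matrices=False)[0]
    return rank_operator(u[:, 0] @ Y)

def pi_mean(Y):
    return rank_operator(Y.mean(axis=0))

def pi_max(Y):
    return rank_operator(Y.max(axis=0))

def empirical_risk(estimator, regime, alpha, n, p, sigma2, reps, rng):
    """Empirical 0-1 risk, evaluated up to a complete reversion of the ranking."""
    truth, errors = np.arange(1, p + 1), 0
    for _ in range(reps):
        Y = make_theta(regime, alpha, n, p, rng) + np.sqrt(sigma2) * rng.standard_normal((n, p))
        est = estimator(Y)
        errors += not (np.array_equal(est, truth) or np.array_equal(p + 1 - est, truth))
    return errors / reps

rng = np.random.default_rng(0)
for name, est in [("proposed", pi_proposed), ("mean", pi_mean), ("max", pi_max)]:
    print(name, empirical_risk(est, "S4", 0.2, 40, 75, 0.025, 20, rng))
```

The printed risks depend on the random seed and on the assumptions above, so they should not be expected to match the entries of Table 1.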
For the exact recovery, Table 1 reports the empirical risks of the estimators under the 0–1 loss for various regimes and parameter combinations. The noise level σ² is chosen for each regime to better illustrate the differences in the empirical risks among the estimators. Consistent with our theory, the simulation results show that our proposed estimator has the smallest empirical risk over all the settings, and that the estimation risk decreases as we increase α or n, or decrease p.
For partial recovery, in Figure 4, we show boxplots of the empirical normalized Kendall’s tau distance between each estimator and the true permutation π. Again, our proposed method outperforms the alternatives in all the cases. As expected from our theory, under all four regimes, increasing p while keeping other parameters fixed results in a smaller estimation risk. As for the dependence on n, under S1 and S2, increasing n leads to a smaller risk as it is equivalent to increasing Γ, whereas under S3 and S4, the risk roughly remains the same across different values of n, as in these cases Γ does not change much.
To offer a more intuitive interpretation of why the proposed estimator performs better than the alternative methods, we assessed the weight vectors of our proposed estimator under each regime after 200 rounds of simulations (Figure 3 in the Supplementary Material). In comparison, the weight vector for πmean is simply the constant vector, which assigns equal weight to all the samples. On the other hand, since πmax cannot be written as a linear projection of Y for some weight vector w, and therefore does not belong to the class of linear projection estimators, we reported instead the pseudo-weight vector whose i-th component is the proportion of the p coordinates for which the i-th sample is used. In general, we found that πmax assigns larger weights to only a few samples among those with higher signal strength, and the weight vector for πmean fails to distinguish the informative samples from the non-informative ones. In contrast, the weight vectors for our proposed estimator automatically adapt to the varying signal strengths across the samples and assign larger weights to the samples with more significant signal changes. This also explains the interesting phenomenon in Figure 4 that, under two of the regimes, the proposed estimator and πmean perform better than πmax, whereas under the other two, the proposed estimator and πmax perform better. In summary, methods that are able to detect and assign larger weight to the more informative samples perform better than methods that are not. The proposed estimator combines the advantages of πmean and πmax in that it finds the best weights (projection scores) in a data-driven manner.
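The contrast between equal weights and data-driven weights can be illustrated with a small sketch of linear projection estimators. It assumes, following the description in the proof of Theorem 1, that the data-driven weight vector is the leading eigenvector of the row-centred data matrix times its transpose; the power iteration, the function names, and the toy linear-growth data are illustrative choices, not the authors' implementation.

```python
import math
import random


def ranks(values):
    """Rank (0-based) of each entry of `values` among all entries."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    out = [0] * len(values)
    for rank, j in enumerate(order):
        out[j] = rank
    return out


def center_rows(Y):
    """Subtract each row's mean from that row."""
    return [[y - sum(row) / len(row) for y in row] for row in Y]


def leading_weight_vector(Y, iters=100):
    """Leading eigenvector of T T^T (T = row-centred Y) by power iteration.

    The sign is fixed by the positive starting vector; in general the
    ordering is only identified up to reversal.
    """
    T = center_rows(Y)
    n, p = len(T), len(T[0])
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        v = [sum(T[i][j] * w[i] for i in range(n)) for j in range(p)]
        u = [sum(T[i][j] * v[j] for j in range(p)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        w = [x / norm for x in u]
    return w


def project_and_rank(Y, w):
    """Rank the columns of Y by their projection scores w^T Y."""
    n, p = len(Y), len(Y[0])
    return ranks([sum(w[i] * Y[i][j] for i in range(n)) for j in range(p)])


# Hypothetical toy data: linear growth with two informative samples.
rng = random.Random(0)
alpha = [1.0, 0.8, 0.05, 0.0]   # sample-specific growth rates
perm = [3, 0, 5, 1, 4, 2]       # true position of each observed column
Y = [[a * perm[j] + rng.gauss(0.0, 0.01) for j in range(len(perm))]
     for a in alpha]

w_hat = leading_weight_vector(Y)
w_mean = [1.0 / len(Y)] * len(Y)
order_hat = project_and_rank(Y, w_hat)
order_mean = project_and_rank(Y, w_mean)
print(order_hat, order_mean)
```

In this low-noise toy example both weightings recover the order, but the estimated weight vector concentrates on the two samples with large growth rates, which is the adaptivity described above.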
6.2. Evaluation Using Synthetic Metagenomic Data
We evaluate the empirical performance of our proposed method using the synthetic metagenomic sequencing dataset of Gao and Li (2018), which was generated by simulating sequencing reads from 45 bacterial genomes. Instead of estimating the PTRs, which was the focus of Gao and Li (2018), our goal is to recover the unknown relative orders of the contigs assembled in typical metagenomics studies. In addition to assisting the estimation of PTRs, such ordering of the contigs could be of independent interest for other applications, including genome assemblies based on shotgun metagenomics data.
Gao and Li (2018) presented a synthetic shotgun metagenomic sequencing dataset of a community of 45 phylogenetically related species from 15 genera of five different phyla with known RefSeq ID, taxonomy and replication origin (Gao, Luo, and Zhang 2013) (see Figure 2 in our Supplementary Material). To generate metagenomics reads, reference genome sequences of three randomly selected species in each genus were downloaded from NCBI. Read coverages were generated along the genome based on an exponential distribution with a specified peak-to-trough ratio, and the cumulative distribution function of read coverages along the genome was calculated. Sequencing reads were next generated using this cumulative distribution function and a random location of each read on the genome, until the total read number achieved a randomly assigned average coverage between 0.5 and 10 fold for the species in a sample. Sequencing errors including substitution, insertion and deletion were simulated in a position- and nucleotide-specific pattern according to a recent study on metagenomic sequencing error profiles of Illumina.
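The read-generation step can be illustrated by inverse-CDF sampling. The sketch below is a simplified reading of the description above, not the simulation code of Gao and Li (2018): it assumes that the read density decays exponentially with the relative distance x ∈ [0, 1] from the replication origin to the terminus, so that the density ratio between origin and terminus equals the specified PTR. All function names are hypothetical.

```python
import math
import random


def coverage_density(x, ptr):
    """Unnormalized read density at relative distance x from the origin."""
    return ptr ** (-x)


def inverse_cdf(u, ptr):
    """Map a uniform draw u to a position whose density is proportional
    to ptr ** (-x) on [0, 1]."""
    if ptr == 1.0:
        return u  # flat coverage: no replication signal
    return -math.log(1.0 - u * (1.0 - 1.0 / ptr)) / math.log(ptr)


def sample_read_positions(n_reads, ptr, seed=0):
    """Draw read start positions (relative distance from the origin)."""
    rng = random.Random(seed)
    return [inverse_cdf(rng.random(), ptr) for _ in range(n_reads)]


positions = sample_read_positions(20000, ptr=2.0)
near_origin = sum(1 for x in positions if x < 0.5) / len(positions)
print(near_origin)
```

With a PTR of 2, more than half of the sampled reads fall in the half of the replichore closer to the origin, which is the uneven coverage that the downstream methods exploit.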
For the final dataset, the average nucleotide identities (ANI) between species within each genus ranged from 66.6% to 91.2%. The probability of one species existing in each of the 50 simulated samples was set as 0.6, and a total of 1,336 average coverages and the corresponding PTRs were randomly and independently assigned. After the same processing and filtering steps and GC-adjustment step as in Gao and Li (2018), the final dataset included genome assemblies of 41 species. For each species, we obtained the permuted matrix of log-contig counts, with the number of samples ranging from 29 to 46, and the number of contigs ranging from 47 to 482.
Our proposed method was used to estimate the unknown orders of the contigs for each species and each sample. As a comparison, we also considered the iRep estimator proposed in Brown et al. (2016), where the contigs of a given species were ordered for each sample separately based on the read counts observed. We evaluate these methods by comparing the estimated contig orders to their true orders as measured by the normalized Kendall’s tau distance. To generalize our evaluation to diverse metagenomic datasets, we also evaluate the effect of sample size as well as contig numbers by randomly selecting subsets of samples or contigs from each dataset. The selection was made with replacement.
The results are summarized in Figure 5 by comparing the normalized Kendall’s tau distances. As n or p varies, our proposed estimator performs consistently better than iRep in recovering the true contig orders, which explains partially why the DEMIC algorithm worked better in estimating the bacterial growth dynamics. The results of our methods are not sensitive to the sample size and the number of contigs from the genome assemblies. Our estimator also shows smaller variability.
Fig. 5.
Boxplots of the normalized Kendall’s tau distance between the estimated contig orders and the true orders for different sample sizes n and different numbers of contigs p. The lighter boxes correspond to our proposed method and the darker boxes correspond to the iRep estimation method.
6.3. Analysis of a Real Microbiome Metagenomic Data Set
Finally, we complete our numerical studies by analyzing a real metagenomic dataset from the Pediatric Longitudinal Study of Elemental Diet and Stool Microbiome Composition (PLEASE) study, a prospective cohort study to investigate the treatment effects on the gut microbiome and reduction of inflammation in pediatric Crohn’s disease patients (Lewis et al. 2015). In particular, sequencing data from the fecal samples of 86 children with Crohn’s disease were obtained at baseline, 1 week and 8 weeks after anti-TNF or enteral diet treatment. In our analysis, the sequencing data at the 8th week after treatment were used to compare the bacterial growth dynamics for non-responders (n = 34) and responders (n = 47). The reads were downloaded from the NCBI short read archive (SRP057027) with the corresponding metadata. After the same coassembly, alignment and binning steps as in Gao and Li (2018), the DEMIC algorithm was applied to estimate the bacterial growth rate of a given species represented by a contig cluster (bin) for each sample. In particular, DEMIC applied our proposed method to the GC-adjusted contig coverage data to recover the original order of the contigs. After obtaining the ordered contigs, a simple linear regression was fitted to obtain estimates of the PTRs (ePTRs).
In order to compare the bacterial growth rates between responders and non-responders, our analysis focused on ePTRs of 8 contig clusters over subsets of the non-responders (n1) and the responders (n2) after 8 weeks of treatment with min{n1, n2} > 5. Other contig clusters were rare and only appeared in a few samples. For each contig cluster, we compare the ePTRs of the responders and non-responders using the Wilcoxon rank sum test (Table S.1 in Supplementary Material). The taxonomic annotations of these eight contig clusters were obtained by applying the BAT algorithm (von Meijenfeldt et al. 2019), which compares the metagenomic assembled bins to a taxonomy database. In Table S.1, we show the final taxonomic annotations for each bin to the finest possible resolution, with the lineage scores indicating the quality of each taxonomic classification.
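For readers unfamiliar with the test, the following is a minimal sketch of a two-sample Wilcoxon rank sum test using the normal approximation. It is not the code used for Table S.1: the implementation uses average ranks for ties without a tie correction of the variance, and the toy values are hypothetical, not ePTRs from the study. For groups as small as those considered here, an exact test from a statistics package would normally be preferred.

```python
import math


def midranks(values):
    """1-based ranks, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def rank_sum_test(x, y):
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    r = midranks(list(x) + list(y))
    w = sum(r[:n1])                      # rank sum of the first group
    u = w - n1 * (n1 + 1) / 2.0          # Mann-Whitney form of the statistic
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # no tie correction
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal tail
    return u, p_value


# Hypothetical toy values for two groups of samples.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
u_stat, p_val = rank_sum_test(group_a, group_b)
print(u_stat, p_val)
```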
Among the 8 contig clusters, bin.026 showed a significant difference in ePTRs between responders and non-responders after either anti-TNF or enteral diet treatment for 8 weeks (p = 0.0418), where the growth rate was higher in Crohn’s disease patients who did not respond to the treatment. The taxonomic classification (Table S.1) shows that this contig cluster belongs to the phylum Firmicutes and the order Clostridiales. Since the BAT algorithm was not able to classify this contig cluster below the order Clostridiales to a finer taxonomic level of known species, it may represent a new species that is important to the treatment outcome of Crohn’s disease patients.
7. DISCUSSION
In this paper, partial recovery was studied under the normalized Kendall’s tau distance. Another commonly used metric is the normalized Spearman’s footrule distance, which is based on the sum, over all coordinates, of the absolute differences between two permutations.
A celebrated result by Diaconis and Graham (1977) shows that the Kendall’s tau distance and the Spearman’s footrule distance are within a constant factor of each other, which means the two distances are equivalent. As a consequence, all the theoretical results presented in this paper concerning the Kendall’s tau distance also hold for the Spearman’s footrule distance without any change.
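The equivalence can be checked numerically on small cases. The sketch below uses the unnormalized versions of both distances, for which the Diaconis and Graham (1977) inequality reads K ≤ F ≤ 2K, where K is the number of discordant pairs and F is the footrule; the normalizing constants used in the paper are not reproduced here, and the function names are illustrative.

```python
from itertools import combinations, permutations


def kendall_tau(a, b):
    """Number of index pairs ordered differently by a and b (unnormalized)."""
    return sum(
        1
        for i, j in combinations(range(len(a)), 2)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )


def footrule(a, b):
    """Spearman's footrule: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Exhaustive check over all permutations of 5 elements against the identity
# (by right invariance it suffices to compare with the identity).
identity = tuple(range(5))
inequality_holds = all(
    kendall_tau(s, identity) <= footrule(s, identity) <= 2 * kendall_tau(s, identity)
    for s in permutations(range(5))
)
print(inequality_holds)  # True
```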
The minimax optimality of the proposed estimator was investigated in Section 5 by examining the asymptotic sharpness of the MSG condition for exact recovery, and by obtaining the matching minimax risk lower bound for partial recovery. There are a few issues that deserve further investigation. For both exact and partial recovery, it is unclear to what extent the GSS condition is necessary. In our risk analysis, the perturbation bound for the left singular subspace (Cai and Zhang 2018) was used. In fact, similar results can be obtained using the concentration bound for the linear functionals of singular vectors (Koltchinskii and Xia 2016). Nevertheless, it remains to show whether the GSS condition is also asymptotically sharp. In addition, in Theorem 6, the matching minimax lower bound was obtained only for nonvanishing Γ/σ. It remains to show whether the rate σ/(pΓ) is minimax optimal when Γ/σ → 0. The difficulty lies in finding a p^{1+δ}-sphere packing of the symmetric group equipped with the Kendall’s tau distance for any 0 < δ < 1, while the pairwise ℓ2 distances of the packing elements are also well controlled. Some initial steps have been made in the so-called rank modulation theory (Barg and Mazumdar 2010; Mazumdar, Barg, and Zemor 2013).
There are several related problems that are also of significant theoretical and practical interest. Firstly, although we used the Kendall’s tau distance or the equivalent Spearman’s footrule distance as the metric for partial recovery, other distances such as the Hamming distance, Spearman’s rank correlation distance, and Ulam’s distance have also been used as performance metrics for partial recovery in other permutation estimation problems (Göloğlu et al. 2015; Mukherjee 2016). It is therefore of interest to see how the proposed estimator performs under these losses. Secondly, our proposed estimator implicitly performs a (linear) dimension reduction and only uses the information contained in the first eigenvector of A in (4). A natural extension is to consider the eigen-subspace spanned by the first k eigenvectors and to estimate the permutation in a sequential manner.
The present paper focuses on the estimation of the permutation matrix Π. It is also of interest to estimate the underlying signal matrix Θ or some functionals of it. For example, in microbiome growth dynamics studies, it is of significant interest to estimate the peak-to-trough ratio for each sample k = 1, ..., n, which measures the microbial growth rate for the kth sample, and to identify the samples with a peak-to-trough ratio of 1. It is also interesting to identify the bacteria that show differential growth dynamics between disease and normal individuals. In addition, robust permutation recovery methods that can relax the Gaussian or sub-Gaussian assumption on the noise in the permuted monotone matrix model are needed. For example, in some applications, the columns of the noise matrix are not independent, or the variance levels across the noise matrix are not identical. In these cases, we argue that, as long as the marginal distributions of the noise matrix entries remain sub-Gaussian, the analytical framework of the current paper can still be applied, but with more effort to control the underlying heteroskedasticity. Toward this end, results from the recent work of Zhang, Cai, and Wu (2018) can be very useful, in terms of new technical tools that parallel the ones used in the current paper to analyse the homoskedastic PCA (cf. Lemmas 2 and 3). Finally, to account for non-informative samples, sparse PCA (Cai, Ma, and Wu 2013; Yuan and Zhang 2013) can be considered. These are interesting problems left for future research.
8. PROOFS OF THE MAIN THEOREMS
In this section, we prove Theorems 1 and 2 in detail and briefly sketch the proofs of Theorems 3 and 4. We also prove the minimax lower bounds in Theorems 5 and 6. Proofs of other results including the technical lemmas can be found in the online Supplementary Materials.
Proof of Theorem 1
Let X =Θ+Z. It follows that Y = XΠ. By right invariance of the 0–1 loss with respect to permutation composition, we have
Thus it suffices to study the risk . In fact,
(13)
which further reduces to obtaining an upper bound for . By definition, is the first eigenvector of . Simple calculation yields for any . So is also the first eigenvector of , where . Note that T admits the decomposition where and . In particular, where is the vector of row means of X. We denote ϕij = T.i − T.j = X.i − X.j and denote as the first eigenvector of the rank-one matrix Θ′Θ′⊤. Now following (13), we have
for some δ > 0. By definition, up to a change of sign for , we have . Then implies , where the first inequality follows from the Cauchy–Schwarz inequality and the second inequality used . Thus
(14)
The following lemmas provide upper bounds for the two probability events in the last expression.
Lemma 1. Under the conditions of Theorem 1, denote , then for any δ > 0, we have
(15)
for and some constants C1, C2, c > 0.
Lemma 2. Suppose for some C > 0, it follows that
for some C1, C2, c > 0.
Now since , we have for some C0 > 0. Set . It follows that δ= o(1). Combining Lemma 1 and Lemma 2, we have
(16)
for some C, c > 0. The rest of the analysis is divided into several cases.
Case 1. log p≲n.
In this case, we have . In addition, if , we have , where the last inequality follows from and . If instead , we have , where the last inequality follows from and δ= o(1). Hence, in Case 1, (16) can be bounded by O(p−c).
Case 2. log p≳n.
In this case, we have . In addition, since and δ= o(1), we have . This shows that, in Case 2, (16) can also be bounded by O(p−c).
As a result, it follows that, up to a change of sign for for some constant c > 0. □
Proof of Theorem 2
Firstly, by invariance property of Kendall’s tau distance, . It then follows
The summation in the last expression can be divided into two parts, namely, the consecutive differences and non-consecutive differences, i.e.,
In the following, we first show
(17)
so that
(18)
for some C, c> 0. Then we show that
(19)
Combining (18) and (19), we conclude that
which completes the proof, as the bound is trivial.
Proof of (17). Following the same argument as the proof of Theorem 1, we have for 1 ≤ i ≤ p −1 and where the second term can be bounded using Lemma 2. For the first term, by Lemma 1, for , we have .
Using the same argument as in the proof of Theorem 1, it holds that . Equation (17) then follows by using formula 7.1.13 of Abramowitz and Stegun (1965) that for t ≥ 0.
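For reference, formula 7.1.13 of Abramowitz and Stegun (1965) states that, for x ≥ 0, 1/(x + √(x² + 2)) < exp(x²) ∫ₓ^∞ exp(−t²) dt ≤ 1/(x + √(x² + 4/π)). Which side of the inequality is used above, and how it is rescaled to the Gaussian tail, cannot be recovered from the text here, so the following is only a numerical sanity check of the two-sided bound on a grid; the helper names are illustrative.

```python
import math


def scaled_tail(x):
    """exp(x^2) * integral from x to infinity of exp(-t^2) dt, via erfc."""
    return math.exp(x * x) * 0.5 * math.sqrt(math.pi) * math.erfc(x)


def lower_bound(x):
    return 1.0 / (x + math.sqrt(x * x + 2.0))


def upper_bound(x):
    return 1.0 / (x + math.sqrt(x * x + 4.0 / math.pi))


# Check the two-sided bound on a grid of x in [0, 5]; a small tolerance
# absorbs rounding at x = 0, where the upper bound holds with equality.
grid = [0.1 * k for k in range(51)]
bounds_hold = all(
    lower_bound(x) < scaled_tail(x) <= upper_bound(x) + 1e-12 for x in grid
)
print(bounds_hold)
```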
Proof of (19). We further divide the set of indices S = {(i, j) : 1 ≤ i < j ≤ p, j > i + 1} into two subsets and for some constant C > 0. Clearly, we have the decomposition
(20)
For the first term, by construction, it can be shown using the same argument (see supplementary materials) in Theorem 1 that
(21)
Now for the second term in (20), similar argument yields, for (i, j) ∈ S2,
Note that, on the one hand,
We have
For the rest of the proof, we assume , otherwise the set will vanish. Then
where the last inequality used monotonicity of the integrand. The integral in the last inequality, after a change of variable, can be bounded by an exponential integral Ei(Γ²/(2σ²)), which has an upper bound
so that . For T2, we have . Therefore,
(22)
On the other hand, note that
We have
Thus
(23)
Combining (22) and (23), we have
(24)
Proof of Theorem 3 and Theorem 4
Here we only provide a sketch of the proofs. We refer the readers to our Supplementary Material for detailed proofs. The proofs follow essentially from the same argument as the proofs of Theorem 1 and Theorem 2, respectively. However, in place of Lemma 2 used therein, we need the following lemma that provides a perturbation bound for the leading eigenvector of approximate rank-one matrices, which could be of independent interest.
Lemma 3. Suppose p≳n and for some C > 0. Let be the first left singular vector of Θ′, it follows that,
The proof of Lemma 3 is nontrivial and depends on a combination of the generic perturbation bound obtained by Cai and Zhang (2018) and new concentration inequalities for approximate rank-one matrices (see Supplementary Materials).
Proof of Theorem 5
The proof relies on the following lemma adapted from Tsybakov (2009).
Lemma 4. Assume that for some integer M ≥ 2 there exist distinct parameters θ0,..., θM from the parameter space Θ and mutually absolutely continuous probability measures P0,...,PM with for j = 0,1,...,M, defined on a common probability space such that the averaged K-L divergence . Then, for every measurable mapping , .
We construct a parameter set of M + 1 = p points as follows. We define p permutations from as the identity plus (p − 1) consecutive swaps, i.e., π0 = id, πk = (k, k + 1) for k = 1, ..., p − 1. The signal matrix Θ0 = aη⊤ where and , . In this way, we have . Let Pk denote the joint probability measure of Y under (Θ0, πk) for k = 0, 1, ..., p − 1, and let pk be the pdf of Pk. We have , for k = 1, ..., p − 1, where ϕμ is the pdf of the Gaussian distribution N(μ, σ²). Now we calculate the KL-divergence
Then, we have for . It follows from Lemma 4 that, as long as p ≥ 10. In addition, as . □
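The KL-divergence computation in this construction can be illustrated numerically. For Gaussian noise with common variance σ², the KL divergence between two such models is the squared Frobenius distance between the mean matrices divided by 2σ². For a rank-one signal aη⊤, swapping adjacent columns k and k + 1 changes exactly two columns, each by a(η_{k+1} − η_k), so the divergence equals ‖a‖²(η_{k+1} − η_k)²/σ². The sketch below checks this identity on toy values; the particular a, η and σ are hypothetical and are not the choices made in the proof.

```python
def permute_columns(theta, perm):
    """Column j of the result is column perm[j] of theta."""
    return [[row[perm[j]] for j in range(len(perm))] for row in theta]


def gaussian_kl(mean0, mean1, sigma):
    """KL divergence between N(mean0, sigma^2 I) and N(mean1, sigma^2 I)."""
    sq = sum(
        (x - y) ** 2
        for row0, row1 in zip(mean0, mean1)
        for x, y in zip(row0, row1)
    )
    return sq / (2.0 * sigma ** 2)


# Hypothetical toy values (not the ones used in the proof).
a = [1.0, 0.5, 0.25]          # row scales
eta = [0.0, 0.3, 0.6, 0.9]    # monotone column profile
sigma = 0.5
theta0 = [[ai * ej for ej in eta] for ai in a]

identity = [0, 1, 2, 3]
adjacent_swap = [0, 2, 1, 3]  # swap of the second and third columns

kl_numeric = gaussian_kl(
    permute_columns(theta0, identity),
    permute_columns(theta0, adjacent_swap),
    sigma,
)
kl_closed_form = sum(ai * ai for ai in a) * (eta[2] - eta[1]) ** 2 / sigma ** 2
print(kl_numeric, kl_closed_form)
```

Because the divergence depends only on the gap between adjacent entries of η, shrinking that gap makes the p hypotheses statistically hard to distinguish, which is the mechanism behind the lower bound.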
Proof of Theorem 6
The proof relies on the following lemma from Tsybakov (2009).
Lemma 5. Assume that M ≥ 2 and suppose that Θ contains elements θ0, θ1,..., θM such that: (i) d(θj, θk) ≥ 2s > 0 for any 0 ≤ j < k ≤ M; (ii) for any with 0 < α < 1/8 and for j = 0,1,...,M. Then
.
We also need the following sphere packing lemma proved by Mao, Weed, and Rigollet (2017), which is a direct consequence of the celebrated Varshamov–Gilbert bound.
Lemma 6. For any r < p / 2, there exists a subset of such that (i) , (ii) for any elements , we have , and (iii) for any , we have .
For t / σ ≥ 2, we set . Let π0 = id and be the elements of . The signal matrix Θ0 = aη⊤ where and . Let Pk be the joint probability measure of Y under (Θ0, πk) for , and let pk be the pdf of Pk. By Lemma 6, the KL-divergence
and therefore
Without loss of generality, we assume . By Lemma 5, it then follows that,
for some absolute constant C1 > 0. By Markov’s inequality, we have
.
The relationship follows from . The rate 1/p² follows by setting for some C2 > 0. □
ACKNOWLEDGEMENT
We would like to thank the Associate Editor and the anonymous referees for many helpful suggestions that significantly improved the paper. R. M. would also like to thank Rui Duan, Anru Zhang, Yuan Gao and Shulei Wang for stimulating discussions at various stages of this project.
Footnotes
SUPPLEMENTARY MATERIALS
In our online Supplementary Materials, we prove Theorems 3–4 and Propositions 1–4, as well as the technical lemmas. Some supplementary simulations, figures and tables are included in the appendix.
References
- Abel S, Zur Wiesch PA, Chang H-H, Davis BM, Lipsitch M, and Waldor MK (2015), “Sequence tag–based analysis of microbial population dynamics,” Nature Methods, 12, 223.
- Abramowitz M, and Stegun IA (1965), Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, vol. 55, Courier Corporation.
- Barg A, and Mazumdar A (2010), “Codes in permutations and error correction for rank modulation,” in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, IEEE, pp. 854–858.
- Boulund F, Pereira MB, Jonsson V, and Kristiansson E (2018), “Computational and statistical considerations in the analysis of metagenomic data,” in Metagenomics, Elsevier, pp. 81–102.
- Bremer H, and Churchward G (1977), “An examination of the Cooper-Helmstetter theory of DNA replication in bacteria and its underlying assumptions,” Journal of Theoretical Biology, 69, 645–654.
- Brown CT, Olm MR, Thomas BC, and Banfield JF (2016), “Measurement of bacterial replication rates in microbial communities,” Nature Biotechnology, 34, 1256.
- Cai TT, Ma Z, and Wu Y (2013), “Sparse PCA: Optimal rates and adaptive estimation,” The Annals of Statistics, 41, 3074–3110.
- Cai TT, and Zhang A (2018), “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, 46, 60–89.
- Chatterjee S, Guntuboyina A, and Sen B (2015), “On risk bounds in isotonic and other shape restricted regression problems,” The Annals of Statistics, 43, 1774–1800.
- Chatterjee S, Guntuboyina A, and Sen B (2018), “On matrix estimation under monotonicity constraints,” Bernoulli, 24, 1072–1100.
- Collier O, and Dalalyan AS (2016), “Minimax rates in permutation estimation for feature matching,” The Journal of Machine Learning Research, 17, 162–192.
- Cooper S, and Helmstetter CE (1968), “Chromosome replication and the division cycle of Escherichia coli B/r,” Journal of Molecular Biology, 31, 519–540.
- Cullina D, and Kiyavash N (2016), “Improved achievability and converse bounds for Erdös-Renyi graph matching,” in ACM SIGMETRICS Performance Evaluation Review, ACM, vol. 44, pp. 63–72.
- Currie RR, and Pandher GS (2011), “Finance journal rankings and tiers: An active scholar assessment methodology,” Journal of Banking & Finance, 35, 7–20.
- Deshpande SK, and Jensen ST (2016), “Estimating an NBA player’s impact on his team’s chances of winning,” Journal of Quantitative Analysis in Sports, 12, 51–72.
- Diaconis P (1988), Group Representations in Probability and Statistics, Institute of Mathematical Statistics Lecture Notes–Monograph Series (11).
- Diaconis P, and Graham RL (1977), “Spearman’s footrule as a measure of disarray,” Journal of the Royal Statistical Society. Series B (Methodological), 262–268.
- Flammarion N, Mao C, and Rigollet P (2019), “Optimal rates of statistical seriation,” Bernoulli, 25, 623–653.
- Gao F, Luo H, and Zhang C-T (2013), “DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes,” Nucleic Acids Research, 41, D90.
- Gao Y, and Li H (2018), “Quantifying and comparing bacterial growth dynamics in multiple metagenomic samples,” Nature Methods, 15, 1041–1044.
- Göloğlu F, Lember J, Riet A-E, and Skachek V (2015), “New bounds for permutation codes in Ulam metric,” in Information Theory (ISIT), 2015 IEEE International Symposium on, IEEE, pp. 1726–1730.
- Kendall MG (1938), “A new measure of rank correlation,” Biometrika, 30, 81–93.
- Koltchinskii V, and Xia D (2016), “Perturbation of linear forms of singular vectors under gaussian noise,” in High Dimensional Probability VII, Springer, pp. 397–423.
- Korem T, Zeevi D, Suez J, Weinberger A, Avnit-Sagi T, Pompan-Lotan M, Matot E, Jona G, Harmelin A, and Cohen N (2015), “Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples,” Science, aac4812.
- Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, Bittinger K, Bailey A, Friedman ES, Hoffmann C, et al. (2015), “Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease,” Cell Host & Microbe, 18, 489–500.
- Li D, Liu C, Luo R, Sadakane K, and Lam T (2015), “MEGAHIT: an ultrafast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph,” Bioinformatics, 15, 1674–1676.
- Mao C, Weed J, and Rigollet P (2017), “Minimax rates and efficient algorithms for noisy sorting,” arXiv preprint arXiv:1710.10388.
- Mazumdar A, Barg A, and Zemor G (2013), “Constructions of rank modulation codes,” IEEE Transactions on Information Theory, 59, 1018–1029.
- Mukherjee S (2016), “Estimation in exponential families on permutations,” The Annals of Statistics, 44, 853–875.
- Myhrvold C, Kotula JW, Hicks WM, Conway NJ, and Silver PA (2015), “A distributed cell division counter reveals growth dynamics in the gut microbiota,” Nature Communications, 6, 10039.
- Pananjady A, Wainwright MJ, and Courtade TA (2016), “Linear regression with an unknown permutation: Statistical and computational limits,” arXiv preprint arXiv:1608.02902.
- Pananjady A, Wainwright MJ, and Courtade TA (2017), “Denoising linear models with permuted data,” in Information Theory (ISIT), 2017 IEEE International Symposium on, IEEE, pp. 446–450.
- Rendle S, Balby Marinho L, Nanopoulos A, and Schmidt-Thieme L (2009), “Learning optimal ranking with tensor factorization for tag recommendation,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 727–736.
- Rigollet P, and Weed J (2018), “Uncoupled isotonic regression via minimum Wasserstein deconvolution,” arXiv preprint arXiv:1806.10648.
- Slawski M, and Ben-David E (2017), “Linear Regression with Sparsely Permuted Data,” arXiv preprint arXiv:1710.06030.
- Tsybakov AB (2009), Introduction to Nonparametric Estimation, Springer Series in Statistics. Springer, New York.
- von Meijenfeldt FB, Arkhipova K, Cambuy DD, Coutinho FH, and Dutilh BE (2019), “Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT,” bioRxiv, 530188.
- Wu Y, Tang Y, Tringe S, Simmons B, and Singer S (2014), “MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm,” Microbiome, 2, 26.
- Yuan X, and Zhang T (2013), “Truncated power method for sparse eigenvalue problems,” Journal of Machine Learning Research, 14, 899–925.
- Zhang A, Cai TT, and Wu Y (2018), “Heteroskedastic PCA: Algorithm, optimality, and applications,” arXiv preprint arXiv:1810.08316.