Optimal Permutation Recovery in Permuted Monotone Matrix Model

Rong Ma; T Tony Cai; Hongzhe Li

doi:10.1080/01621459.2020.1713794

Author manuscript; available in PMC: 2022 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2020 Feb 18;116(535):1358–1372. doi: 10.1080/01621459.2020.1713794

PMCID: PMC8612635 NIHMSID: NIHMS1578947 PMID: 34840367

Abstract

Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y = ΘΠ+ Z, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This paper studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0–1 loss, and partial recovery, as quantified by the normalized Kendall’s tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the non-responders of the IBD patients after 8 weeks of treatment.

Keywords: Kendall’s tau, Microbiome growth dynamics, Minimax lower bound, Sorting

1. INTRODUCTION

1.1. A Motivation Example from Microbiome Studies

The statistical problem considered in this paper is motivated by the problem of estimating bacterial growth dynamics from shotgun metagenomics data (Myhrvold et al. 2015; Abel et al. 2015; Korem et al. 2015; Brown et al. 2016). The growth dynamics of microbial populations reflect their physiological states and drive variation in microbial compositions, and thus provide an important summary feature of the microbes in a given community. One way of studying such communities is through shotgun metagenomic sequencing, which involves direct DNA sequencing of all the microbial genomes in a given community. Korem et al. (2015) presented the first paper on quantifying bacterial growth dynamics from shotgun metagenomics data, where the uneven sequencing read coverage resulting from bidirectional DNA replication provides information on the rates of microbial DNA replication. For bacterial species with known complete genome sequences, Korem et al. (2015) proposed to use the peak-to-trough ratio (PTR) of read coverages to quantify the bacterial growth dynamics after aligning the sequencing reads to the complete genome sequences.

However, in many applications, it is important to quantify the bacterial growth dynamics based on genome assemblies for bacterial species with unknown genomes. These genome assemblies may represent new bacterial species that have not been seen or sequenced before. The genome assembly of a bacterial species consists of a collection of contigs (called a bin) constructed from overlapping sequencing reads (Li et al. 2015; Wu et al. 2014). Compared to complete genomes, assembled bins are more fragmented and often contain errors or contamination. The read coverage data are noisy due to intraspecific variations, interspecific/intraspecific repeated sequences, limited sequencing depths and the inability of binning algorithms to correctly cluster all the contigs, which further complicates the estimation of growth dynamics based on read coverages of the contigs. Besides these noisy count data, one key difficulty in estimating the growth dynamics based on contig counts is that the accurate locations of the contigs on the original genome are unknown. It is therefore not feasible to measure the microbial growth rate directly using the peak-to-trough coverage ratio for the assembled genomes (Brown et al. 2016; Gao and Li 2018).

Brown et al. (2016) presented the first method (called iRep) of estimating the bacterial growth dynamics based on genome assemblies, where the contigs are ordered based on the GC-adjusted counts for each sample separately. However, due to noise in the count data, such an ordering method often leads to wrong ordering of the contigs and therefore inaccurate estimates of the growth dynamics. Gao and Li (2018) developed a computational algorithm, DEMIC, to accurately compare growth dynamics of a given assembled species existing in multiple samples by taking advantage of highly fragmented contigs assembled in typical metagenomics studies. One key step of DEMIC is to apply a principal components analysis (PCA)-based method to recover the true ordering of the contigs along the underlying unknown bacterial complete genomes. Gao and Li (2018) reported excellent empirical performance of DEMIC over existing methods. The goal of this paper is to provide a rigorous statistical framework to study the problem of optimal permutation recovery in a permuted monotone matrix model.

1.2. A Permuted Monotone Matrix Model

For a given genome assembly with p contigs, DEMIC first obtains the read coverage for each sliding window of size 5000 bp, denoted by X_ijk for the ith sample, jth contig and kth window. In order to account for the GC content of the kth window, Gao and Li (2018) considered the following mixed-effects model,

\log_{2} X_{i j k} = α + \mathrm{GC}_{j k} β + W_{i j} + e_{i j k},

where GC_jk is the centred GC count of the kth window of the jth contig, W_ij is the sample- and contig-specific random intercept, α is the intercept, β is the regression coefficient, and e_ijk is the random error. This model is fitted for each contig to obtain the best linear unbiased predictor of W_ij, which is used as the GC-adjusted log-read count Y_ij for the ith sample and jth contig. Here Y_ij can be regarded as the average read coverage over non-overlapping windows of a contig and is approximately normally distributed.

Let Y be the GC-adjusted log-contig count matrix of n samples and p contigs of a genome assembly with Y_ij as its entries. Given this, we consider the following permuted monotone matrix model:

Y = Θ Π + Z,

(1)

where $Θ \in ℝ^{n \times p}$ is an unknown nonnegative signal matrix with nondecreasing rows, $Z \in ℝ^{n \times p}$ is a zero-mean noise matrix, and $Π \in ℝ^{p \times p}$ is a permutation matrix corresponding to some permutation π from the symmetric group $S_{p}$ . That is, after a suitable permutation of the columns of Y, all the rows of the mean matrix are nondecreasing sequences. In microbiome applications, Θ is the matrix of true log-coverage of n samples over p contigs along the circular genome of the bacterium, which is generally hypothesized to have nondecreasing rows. Π represents a permutation due to the unknown locations of the contigs relative to the replication origin. Throughout this paper, we denote the parameter space

(Θ, π) \in D = \{Θ = (θ_{i j}) \in ℝ^{n \times p}, π \in S_{p} : 0 \leq θ_{i, j - 1} \leq θ_{i, j} < \infty \text{ for all } 1 \leq i \leq n, 2 \leq j \leq p\} .

The focus of this paper is to optimally estimate the permutation π from the noisy observation Y.

1.3. Related Problems and Other Applications

The permutation recovery problem under permuted monotone matrix model bears some similarity to other problems studied in machine learning literature, including the feature matching between two sets of observations (Collier and Dalalyan 2016) and linear regression model with permuted data, where the correspondences between the response and the predictors are unknown (Pananjady, Wainwright, and Courtade 2016; Slawski and Ben-David 2017; Pananjady, Wainwright, and Courtade 2017). More recently, Flammarion, Mao, and Rigollet (2019) considered the problem of statistical seriation, which has a close affinity to our model (1). However, the focus of Flammarion, Mao, and Rigollet (2019) is to optimally estimate the signal matrix Θ rather than the underlying permutation.

Model (1) can be thought of as a natural extension of the shape-constrained matrix denoising model studied in the isotonic regression literature. Specifically, under Model (1) with known Π = I_p, risk bounds and minimax rate-optimal estimators for Θ under the Frobenius norm were obtained in Chatterjee, Guntuboyina, and Sen (2015) for n = 1 and later in Chatterjee, Guntuboyina, and Sen (2018) for general n > 1. Using the idea of optimal transport, a minimax optimal estimator of the underlying signals was obtained by Rigollet and Weed (2018). However, their goal is not to recover the underlying permutation.

Besides the microbiome applications, the permuted monotone matrix model is generic and has other applications. For instance, the problem of permutation recovery is usually equivalent to statistical ranking/sorting from noisy observations, which arises commonly in finance (Currie and Pandher 2011), sport analytics (Deshpande and Jensen 2016), and recommendation systems (Rendle et al. 2009). Specifically, in the latter case, the task of tag recommendation is to provide a user with a personalized ranked list of tags for a specific item. Under the permuted monotone matrix model, we can treat the entries of Y, say Y_ij, as an indicator of the jth tag being related to the ith item by a given customer, and Θ as a probability matrix characterizing the customer’s tagging preferences across multiple items. As a result, recovering the underlying permutation provides a solution of a tag recommender.

1.4. Main Contributions and Organization

In this paper, we investigate the problem of permutation recovery in the permuted monotone matrix model (1) and propose an estimator that relies on a certain invariance property of the singular subspace of monotone matrices. The properties of the proposed method in terms of both exact and partial recovery are studied in detail. In particular, we obtain regions of the signal-to-noise ratio (defined later as Γ /σ) in which exact/partial recovery is achievable (Figure 1). For both exact and partial permutation recovery, we obtain matching minimax lower bounds and establish the minimax rate-optimality of the proposed method over a wide range of the parameter space (Figure 1). For partial recovery, the proof of the lower bound relies on a version of Fano’s lemma and the sphere packing of the symmetric group equipped with the Kendall’s tau metric.

Fig. 1 — A graphical illustration of the main result obtained in this paper about the regions of the signal-to-noise ratio Γ /σ that correspond to exact/partial recovery, and the region with proved minimax optimality.

The rest of this paper is organized as follows. After a brief introduction of notation and definitions, we present in Section 2 the proposed permutation estimator. The theoretical properties of the proposed method are studied, first under a more illustrative linear growth model in Section 3 and then under a general growth model in Section 4. Section 5 provides results on minimax lower bounds and the optimality of the proposed estimator. In Section 6, we evaluate the methods using simulated data as well as synthetic and real microbiome datasets, and compare them with other methods. In Section 7, we discuss some implications and extensions of the methods. Finally, the proofs of our main results are given in Section 8.

1.5. Notation and Definitions

Throughout, we define the permutation π as a bijection from the set {1,2,..., p} onto itself. For simplicity, we denote π = (π(1), π(2),...,π(p)). All permutations of the set {1,2,..., p} form a symmetric group, equipped with the function composition operation ∘, denoted as $S_{p}$ . For any $π \in S_{p}$ , we denote $π^{- 1} \in S_{p}$ as its group inverse, so that π∘π⁻¹ = π⁻¹∘π = id, and denote rev(π) = (π(p), π(p−1),..., π(1)). In particular, we may use π and its corresponding permutation matrix $Π \in ℝ^{p \times p}$ interchangeably, depending on the context. For a vector $a = {(a_{1}, \dots, a_{n})}^{⊤} \in ℝ^{n}$ , we define the ℓ_p norm $‖ a ‖_{p} = {(\sum_{i = 1}^{n} | a_{i} |^{p})}^{1 / p}$ , and the ℓ_∞ norm $‖ a ‖_{\infty} = \max_{1 \leq i \leq n} | a_{i} |$ . For a matrix $Θ \in ℝ^{p_{1} \times p_{2}}$ , we denote $Θ_{. i} \in ℝ^{p_{1}}$ as its i-th column, $Θ_{i .} \in ℝ^{p_{2}}$ as its i-th row, and denote its (ordered) singular values as $λ_{1} (Θ) \geq λ_{2} (Θ) \geq \dots \geq λ_{\min \{p_{1}, p_{2}\}} (Θ)$ . Furthermore, for sequences {a_n} and {b_n}, we write $a_{n} = o (b_{n})$ if $\lim_{n} a_{n} / b_{n} = 0$ , and write a_n = O(b_n), a_n ≲ b_n or b_n ≳ a_n if there exists a constant C such that a_n ≤ Cb_n for all n. We write a_n ≍ b_n if a_n ≲ b_n and a_n ≳ b_n. For a finite set A, we denote |A| as its cardinality. We use the logical symbols ∧ and ∨ to represent “and” and “or,” respectively. Lastly, C, C₀, C₁,... are constants that may vary from place to place.

2. PERMUTATION RECOVERY VIA BEST LINEAR PROJECTION

In the following, we first make some key observations about the connection between the underlying permutation π and the column linear projections of the observed matrix Y, which motivate our construction of the proposed estimator.

2.1. Linear Projection

Given the observed noisy matrix Y, we consider the class of linear projection statistics of the form $w^{⊤} Y \in ℝ^{p}$ where $w \in ℝ^{n}$ and ||w||₂ = 1. Intuitively, by projecting each column of Y onto the subspace generated by w, the components of w^⊤Y (hereafter referred to as “projection scores”) quantify the relative positions of the columns of Y, so that their order statistics can be used to recover the original order of the columns of Θ. To fix ideas, we define the following ranking operator.

Definition 1 (Ranking Operator). The ranking operator $r : ℝ^{p} \to S_{p}$ is defined such that for any vector $x \in ℝ^{p}, r (x)$ is the vector of ranks for components of x in increasing order. Whenever there are ties, increasing orders are assigned from left to right.
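Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function name `rank_operator` is hypothetical, ranks are 1-based as in the text, and the left-to-right tie rule is obtained from the stability of Python's built-in sort.

```python
def rank_operator(x):
    """Return the 1-based ranks of the entries of x in increasing order.

    Ties are broken from left to right, as in Definition 1.
    """
    # sorted() is stable, so equal values keep their left-to-right order.
    order = sorted(range(len(x)), key=lambda j: x[j])
    ranks = [0] * len(x)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


# A small vector with a tie between its first and last entries.
print(rank_operator([2, 5, 1, 6, 2]))  # [2, 4, 1, 5, 3]
```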

For example, given a vector x = (2,5,1,6,2)^⊤, we have $r (x) = (2, 4, 1, 5, 3)$ . The following proposition concerning the invariance property of the column spacing of Θ is the key to our construction of the minimax optimal estimator.

Proposition 1. Suppose $(Θ, π) \in D$ . For any nonnegative unit vector $w \in ℝ^{n}$ , we have

r (w^{⊤} Θ Π) = π^{- 1} .

(2)
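A quick numerical check of identity (2) in the noiseless case may help fix the conventions. The sketch below assumes that the permutation matrix has entries Π[i, π(i)] = 1, which is the convention under which (2) holds as stated; the variable names, the random seed and the way Θ is generated are illustrative choices, not taken from the paper.

```python
import numpy as np


def rank_operator(x):
    """1-based ranks in increasing order; ties broken from left to right."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks


rng = np.random.default_rng(0)
n, p = 4, 6

# Theta: nonnegative with strictly increasing rows (cumulative sums of positive steps).
theta = np.cumsum(rng.uniform(0.5, 1.5, size=(n, p)), axis=1)

# pi stored as a 0-based array; assumed convention: Pi[i, pi[i]] = 1.
pi = rng.permutation(p)
Pi = np.zeros((p, p))
Pi[np.arange(p), pi] = 1.0

# An arbitrary nonnegative unit vector w.
w = rng.uniform(0.0, 1.0, size=n)
w /= np.linalg.norm(w)

scores = w @ theta @ Pi                 # noiseless projection scores w^T Theta Pi
pi_inverse = np.argsort(pi)             # 0-based inverse permutation
recovered = rank_operator(scores) - 1   # shift ranks to 0-based indices

print(np.array_equal(recovered, pi_inverse))
```

Because the rows of this Θ are strictly increasing and w is nonnegative and nonzero, the projected scores have no ties, so the comparison is exact.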

Apparently, under the noiseless setting, any nonnegative unit vector would lead to the exact recovery of the underlying permutation as in this case the relative orders of the columns are exactly coded by the relative magnitudes of the projection scores w^⊤Y = w^⊤ΘΠ. However, with the noisy observations, w^⊤Y = w^⊤ΘΠ+ w^⊤Z so that the relative orders of the columns are only partially preserved by the noisy projection scores w^⊤Y, up to some random perturbations.

Consequently, the best linear projection vector w₀ would correspond to the case where $w_{0}^{⊤} Θ Π$ has the most separated components such that their relative orders are most immune to the random noise. Specifically, since for any given $w \in ℝ^{n}$ , the i-th component of w^⊤ΘΠ has the expression w^⊤ΘΠe_i where $\{e_{i}\}_{i = 1}^{p}$ is the canonical basis of the Euclidean space $ℝ^{p}$ , we define

w_{0} = \mathop{\mathrm{argmax}}_{w \in ℝ^{n}, \, ‖ w ‖_{2} = 1} \sum_{1 \leq i \neq j \leq p} {(w^{⊤} Θ Π e_{i} - w^{⊤} Θ Π e_{j})}^{2} = \mathop{\mathrm{argmax}}_{w \in ℝ^{n}, \, ‖ w ‖_{2} = 1} \sum_{i = 1}^{p} {(w^{⊤} Θ Π e_{i} - \frac{1}{p} \sum_{j = 1}^{p} w^{⊤} Θ Π e_{j})}^{2},

which maximizes the pairwise distances of the components under the squared distance. Now since w₀ relies on the unknown ΘΠ and is not computable from the data, we substitute ΘΠ by its sample/noisy counterpart Y and define our data-driven best linear projection vector as

\hat{w} = \mathop{\mathrm{argmax}}_{w \in ℝ^{n}, \, ‖ w ‖_{2} = 1} \sum_{i = 1}^{p} {(w^{⊤} Y e_{i} - \frac{1}{p} \sum_{j = 1}^{p} w^{⊤} Y e_{j})}^{2},

(3)

which is actually the first eigenvector of the symmetric matrix

A = Y \sum_{i = 1}^{p} (e_{i} - \frac{1}{p} \sum_{j = 1}^{p} e_{j}) {(e_{i} - \frac{1}{p} \sum_{j = 1}^{p} e_{j})}^{⊤} Y^{⊤},

(4)

and can be immediately solved by performing an eigen-decomposition on A. Once $\hat{w}$ is obtained, we define our proposed permutation estimator as

\hat{π} = {(r ({\hat{w}}^{⊤} Y))}^{- 1} .

(5)

Intuitively, the projection vector $\hat{w}$ assigns different weights to the rows of Y so that more weight is given to the rows whose elements are better separated and therefore more informative in distinguishing the columns of Y or Θ.
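The construction in (3)-(5) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the simulated linear-growth data, the noise level and the convention Π[i, π(i)] = 1 are all choices made here. The sketch uses the fact that the sum of outer products in (4) equals the centering matrix I − 11^⊤/p.

```python
import numpy as np


def rank_operator(x):
    """1-based ranks in increasing order; ties broken from left to right."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks


def best_linear_projection_estimator(Y):
    """Return (pi_hat, w_hat), with pi_hat a 1-based permutation, following (3)-(5)."""
    n, p = Y.shape
    # sum_i (e_i - mean)(e_i - mean)^T equals the centering matrix I - 11^T / p.
    centering = np.eye(p) - np.ones((p, p)) / p
    A = Y @ centering @ Y.T                  # matrix (4): n x n, symmetric
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    w_hat = eigvecs[:, -1]                   # leading eigenvector solves (3)
    ranks = rank_operator(w_hat @ Y)         # r(w_hat^T Y)
    pi_hat = np.argsort(ranks) + 1           # group inverse of the rank permutation
    return pi_hat, w_hat


# Simulated example (all values below are illustrative assumptions).
rng = np.random.default_rng(1)
n, p, sigma = 20, 15, 0.05
a = rng.uniform(0.5, 1.5, size=n)            # row slopes
b = rng.uniform(0.0, 1.0, size=n)            # row intercepts
eta = np.arange(1, p + 1, dtype=float)       # evenly spaced column positions
theta = np.outer(a, eta) + b[:, None]        # nondecreasing rows

pi = rng.permutation(p) + 1                  # true permutation, 1-based
# Column j of Theta Pi is column pi^{-1}(j) of Theta under the assumed convention.
Y = theta[:, np.argsort(pi)] + sigma * rng.standard_normal((n, p))

pi_hat, w_hat = best_linear_projection_estimator(Y)
# The sign of w_hat is not identified, so compare up to a reversal.
exact = np.array_equal(pi_hat, pi) or np.array_equal(pi_hat[::-1], pi)
print(exact)
```

Since an eigenvector is only determined up to sign, the comparison at the end accepts either the estimate or its reversal.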

2.2. Evaluation Criteria

The main focus of this paper is to investigate the theoretical properties of our proposed estimator (5) under various loss measures and parameter spaces. For any given estimator $\check{π}$ , we first consider the 0–1 loss

ℓ (\check{π}, π) = 1 \{\check{π} \neq π\},

with the corresponding risk $E ℓ (\check{π}, π) = P (\check{π} \neq π)$ . The 0–1 loss is used to evaluate exact recovery, which can be a strong requirement for practical applications. As an alternative, we also consider the more flexible partial recovery, where the loss function is given by the normalized Kendall’s tau distance (Kendall 1938) defined as

τ_{K} (π_{1}, π_{2}) = \frac{\# \text{ of discordant pairs between } π_{1} \text{ and } π_{2}}{\binom{p}{2}} .

(6)

Technically, for two permutations π₁ and π₂, the set of discordant pairs is defined as

G (π_{1}, π_{2}) = \{(i, j) : i < j, [π_{1} (i) < π_{1} (j) \land π_{2} (i) > π_{2} (j)] \lor [π_{1} (i) > π_{1} (j) \land π_{2} (i) < π_{2} (j)]\}

so that the numerator in (6) is equal to the cardinality $| G (π_{1}, π_{2}) |$ , which, in fact, is also the minimum number of pairwise adjacent transpositions converting $π_{1}^{- 1}$ into $π_{2}^{- 1}$ (Diaconis 1988). The denominator $\binom{p}{2}$ , the number of pairs i < j in {1,2,..., p}, ensures that τ_K (π₁, π₂) ∈ [0,1], where τ_K (π₁, π₂) = 0 corresponds to π₁ = π₂.
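As an illustration of the loss in (6), the following sketch counts discordant pairs directly and normalizes by the number of index pairs, so that the value lies in [0, 1]. The function name is hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def kendall_tau_distance(pi1, pi2):
    """Normalized Kendall's tau distance between two permutations of equal length."""
    m = len(pi1)
    # A pair (i, j) is discordant when the two permutations order it differently.
    discordant = sum(
        1
        for i, j in combinations(range(m), 2)
        if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) < 0
    )
    return discordant / (m * (m - 1) / 2)


identity = [1, 2, 3, 4, 5]
print(kendall_tau_distance(identity, identity))         # 0.0
print(kendall_tau_distance(identity, identity[::-1]))   # 1.0
print(kendall_tau_distance(identity, [2, 1, 3, 4, 5]))  # 0.1
```

The three calls cover the two extremes (identical and fully reversed permutations) and a single adjacent transposition, which contributes one discordant pair out of ten.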

3. A LINEAR GROWTH MODEL

We start with a simpler case where the pair (Θ,π) is from the subspace

D_{L} = \{(Θ, π) \in D : θ_{i j} = a_{i} η_{j} + b_{i}, \text{ where } a_{i}, b_{i} \geq 0 \text{ for } 1 \leq i \leq n, \; 0 \leq η_{j} \leq η_{j + 1} \text{ for } 1 \leq j \leq p - 1\} .

(7)

In other words, each row of Θ has a linear growth pattern with possibly different intercepts and different slopes. In the context of bacterial growth dynamics, this model is sometimes referred to as the Cooper-Helmstetter model (Cooper and Helmstetter 1968; Bremer and Churchward 1977), which associates the copy number of genes with their relative distances to the replication origin. Specifically, a_i is the ratio of genome replication time and doubling time, which can be used to quantify the bacterial growth dynamics for the ith sample, η_j is related to the distance from the replication origin for the jth contig, and b_i is related to the read counts at the replication origin and the sequencing depth. If the bacterium is non-dividing in sample i, a_i is zero.

For the linear growth model (7), there are two key quantities that are relevant to permutation recovery.

Definition 2. For any $Θ \in D_{L}$ , we define

Γ = {(\sum_{i = 1}^{n} a_{i}^{2})}^{1 / 2} \cdot \min_{1 \leq i < j \leq p} | η_{i} - η_{j} |

(8)

as the local minimal signal gap of Θ, and define

Λ = (\sum_{i = 1}^{n} a_{i}^{2}) \cdot \frac{1}{p} \sum_{1 \leq i < j \leq p} {(η_{i} - η_{j})}^{2} = (\sum_{i = 1}^{n} a_{i}^{2}) \cdot \sum_{j = 1}^{p} {(η_{j} - \bar{η})}^{2}

(9)

as the global signal strength of Θ, where $\bar{η} = \sum_{j = 1}^{p} η_{j} / p$ .
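To make (8) and (9) concrete, the sketch below evaluates both quantities for a toy choice of slopes a and positions η. The function name and the numbers are illustrative assumptions; the code uses the centered form of (9) and the fact that the minimum pairwise gap of a set of reals is attained by neighbours after sorting.

```python
import numpy as np


def linear_model_signal_quantities(a, eta):
    """Return (Gamma, Lambda) of Definition 2 for slopes a and positions eta."""
    a = np.asarray(a, dtype=float)
    eta = np.asarray(eta, dtype=float)
    a_norm_sq = np.sum(a ** 2)
    # The minimum over all pairs equals the minimum adjacent gap after sorting.
    min_gap = np.min(np.diff(np.sort(eta)))
    gamma = np.sqrt(a_norm_sq) * min_gap                 # local minimal signal gap (8)
    lam = a_norm_sq * np.sum((eta - eta.mean()) ** 2)    # global signal strength (9)
    return gamma, lam


a = [3.0, 4.0]          # ||a||_2 = 5
eta = [0.0, 1.0, 3.0]   # gaps 1 and 2
gamma, lam = linear_model_signal_quantities(a, eta)

# Cross-check the pairwise form of (9): (1/p) * sum_{i<j} (eta_i - eta_j)^2.
p = len(eta)
pairwise = sum((eta[i] - eta[j]) ** 2 for i in range(p) for j in range(i + 1, p)) / p
print(gamma, lam, 25.0 * pairwise)
```

For this toy input, Γ = 5 · 1 = 5 and Λ = 25 · 14/3, and the last printed value confirms that the pairwise and centered expressions in (9) agree.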

Intuitively, both quantities involve the set $\{| η_{j} - η_{i} |\}_{1 \leq i < j \leq p}$ and the ℓ₂ norm of the vector a = (a₁,...,a_n)^⊤, which characterize the column spacings and the growth rates (slopes) of Θ, respectively. Throughout this paper, we assume (A1) the additive noise matrix $Z \in ℝ^{n \times p}$ has i.i.d. entries z_ij ~ N(0,σ²). The Gaussian assumption simplifies our theoretical analysis, but it is not essential, because all the theoretical results remain true if Z has independent sub-Gaussian entries with parameters bounded by σ². The following theorem provides conditions on Γ and Λ such that exact recovery of π can be obtained by $\hat{π}$ in (5).

Theorem 1 (Exact Recovery, Linear). Suppose (A1) holds, $(Θ, π) \in D_{L}$ and Θ satisfies

Γ > C_{0} σ \sqrt{\log p}, \quad Λ > C_{1} σ^{2} (n \max \{σ^{2} n / Γ^{2}, 1\} + \sqrt{n p \max \{σ^{2} n / Γ^{2}, 1\}})

(10)

for some C₀, C₁ > 0. Then with probability at least $1 - O (p^{- c})$ for some constant c > 0, up to a permutation reversion, we have $\hat{π} = π$ .

Remark 1. Due to the non-identifiability between $\hat{w}$ and $- \hat{w}$ defined in (3), in Theorem 1, as well as in all the other theoretical results concerning $\hat{π}$ , the statement holds up to a possible reversion of $\hat{π}$ . For example, for the permutation π = (2,4,1,5,3), its reversion is rev(π) = (3,5,1,4,2). In fact, such indeterminacy can be avoided by noting that a_i ≥ 0 for all i, but we will not pursue this direction in this study, as the practical interest only concerns the relative orders of the permuted elements.
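The effect of the sign ambiguity can be seen on a toy example. The sketch below uses hypothetical projection scores with no ties, applies the ranking operator and the group inverse as in (5), and compares the result with what is obtained after flipping the sign of the scores; the helper names are illustrative.

```python
def rank_operator(x):
    """1-based ranks in increasing order; ties broken from left to right."""
    order = sorted(range(len(x)), key=lambda j: x[j])
    ranks = [0] * len(x)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


def inverse(perm):
    """Group inverse of a 1-based permutation given as a list."""
    inv = [0] * len(perm)
    for position, value in enumerate(perm, start=1):
        inv[value - 1] = position
    return inv


scores = [0.3, 1.7, -0.4, 2.2, 0.9]   # hypothetical values of w^T Y, no ties
pi_plus = inverse(rank_operator(scores))
pi_minus = inverse(rank_operator([-s for s in scores]))

print(pi_plus)                        # [3, 1, 5, 2, 4]
print(pi_minus)                       # [4, 2, 5, 1, 3]
print(pi_minus == pi_plus[::-1])      # True
```

When the scores are distinct, negating them maps each rank k to p + 1 − k, so the inverse permutation is read in reverse order, which is the reversion rev(·) from Section 1.5.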

Since Γ depends on a only through its ℓ₂ norm ∥a∥₂, the local minimal signal gap (MSG) condition $Γ \geq C σ \sqrt{\log p}$ allows for the presence of non-informative signals in the sense that some components of a can be 0. In contrast, the condition on Λ (the global signal strength, or GSS, condition) depends on a trade-off between Γ and $σ \sqrt{n}$ . On the one hand, when $Γ > σ \sqrt{n}$ , the condition on Λ becomes $Λ \geq σ^{2} (C_{0} n + C_{1} \sqrt{n p})$ , which is independent of Γ, and is minimax optimal for left singular subspace estimation (Cai and Zhang 2018). On the other hand, when $Γ < σ \sqrt{n}$ , a stronger condition on Λ is imposed, as compensation for the small Γ.

In some cases, the GSS condition in (10) can be implied by the MSG condition. We summarize our results in the following proposition.

Proposition 2. Suppose Γ/σ > 1/p and the MSG condition holds. Then the GSS condition is implied by either one of the following conditions:

(i) $Γ ≳ σ \sqrt{n}$ ;
(ii) $Γ ≲ σ \sqrt{n}$ , and either ${(σ^{4} n^{2} / Γ^{4})}^{1 / 3} ≲ p ≲ σ^{2} n^{2} / Γ^{2}$ or $p ≳ σ^{2} n^{2} / Γ^{2} + {(σ^{3} n / Γ^{3})}^{2 / 5}$ .

We next turn to partial recovery and study the rate of convergence of $\hat{π}$ measured by the normalized Kendall’s tau distance under the linear growth model. In particular, we will assume an approximately uniform assignment of $\{η_{j}\}_{j = 1}^{p}$ over some subinterval of [0,∞). In other words, the minimal element and maximal element of the set $\{| η_{j} - η_{j + 1} |\}_{j = 1}^{p - 1}$ should have roughly the same magnitude, so that $Γ = ‖ a ‖_{2} \cdot \min_{1 \leq j \leq p - 1} | η_{j} - η_{j + 1} | ≍ ‖ a ‖_{2} \cdot \max_{1 \leq j \leq p - 1} | η_{j} - η_{j + 1} |$ . This is equivalent to assuming that the contigs in genome assemblies are approximately uniformly spaced along the circular genome.

Theorem 2 (Partial Recovery, Linear). Suppose (A1) holds, $(Θ, π) \in D_{L}$ , and Θ satisfies

(i) there exists some C₀ > 0 such that $\max_{1 \leq j \leq p - 1} | η_{j} - η_{j + 1} | < C_{0} \min_{1 \leq j \leq p - 1} | η_{j} - η_{j + 1} |$ for all p > 0, and
(ii) $Λ > C_{1} σ^{2} (\max \{\frac{σ^{2} {(n + \log p)}^{2}}{Γ^{2}}, n\} + \sqrt{p} \max \{\frac{σ (n + \log p)}{Γ}, \sqrt{n}\})$ for some C₁ > 0.

Then, up to a permutation reversion,

E [τ_{K} (\hat{π}, π)] \leq 1 \land (\frac{c_{0} σ}{p Γ} \min \{1, e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})\} + \frac{c_{1} e^{- Γ^{2} / 2 σ^{2}}}{p (Γ / σ + \sqrt{8 / π})} + \frac{c_{2}}{p^{c + 2}})

for some c, c₀, c₁, c₂ > 0.

Remark 2. The risk upper bound derived in the above theorem can be simplified as

E [τ_{K} (\hat{π}, π)] ≲ \begin{cases} \frac{σ}{p Γ} \land 1 & \text{if } Γ / σ \to 0 \\ \frac{σ}{p Γ} e^{- Γ^{2} / 2 σ^{2}} + 1 / p^{c + 2} & \text{otherwise} \end{cases}

for some c > 0. In the case of Γ /σ→∞, a simple calculation yields $e^{- Γ^{2} / 2 σ^{2}} σ / (p Γ) + 1 / p^{c + 2} ≍ e^{- Γ^{2} / 2 σ^{2}} σ / (p Γ)$ when $Γ < σ \sqrt{2 (c + 1) \log p}$ , whereas $e^{- Γ^{2} / 2 σ^{2}} σ / (p Γ) + 1 / p^{c + 2} ≍ 1 / p^{c + 2}$ when $Γ \geq σ \sqrt{2 (c + 1) \log p}$ . As a result, we also have

E [τ_{K} (\hat{π}, π)] ≲ \begin{cases} 1 / p^{c + 2} & \text{if } Γ / σ \geq \sqrt{2 (c + 1) \log p} \\ \frac{σ}{p Γ} e^{- Γ^{2} / 2 σ^{2}} & \text{if } 1 ≲ Γ / σ < \sqrt{2 (c + 1) \log p} \\ \frac{σ}{p Γ} \land 1 & \text{if } Γ / σ ≲ 1 \end{cases} .

(11)

See Figure 2 for an illustration.

Fig. 2 — A graphical illustration of the risk upper bound for $E [τ_{K} (\hat{π}, π)]$ , as a function of signal-to-noise ratio Γ /σ.
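The threshold separating the first two regimes in (11) can be traced to a one-line comparison of the exponential factor with a power of p. The following is a sketch of that calculation under the notation of Remark 2, not a step taken from the proofs:

```latex
e^{- Γ^{2} / (2 σ^{2})} \leq p^{- (c + 1)}
\iff \frac{Γ^{2}}{2 σ^{2}} \geq (c + 1) \log p
\iff \frac{Γ}{σ} \geq \sqrt{2 (c + 1) \log p} .
```

Hence, when Γ/σ is at or above this threshold (and Γ ≥ σ), the exponential term satisfies $e^{- Γ^{2} / 2 σ^{2}} σ / (p Γ) \leq p^{- (c + 2)}$ , so the bound is governed by the polynomial term $1 / p^{c + 2}$ ; below the threshold, the exponential factor exceeds $p^{- (c + 1)}$ .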

In general, Theorem 2 shows that, even with a weaker condition on Γ that is below the requirement for the exact recovery, our proposed estimator $\hat{π}$ is still able to obtain a partial recovery of π with an exponential rate of convergence if Γ /σ ≳1 and a polynomial rate of convergence if 1/ p < Γ /σ ≲1. As for Λ, the requirement is essentially the same as the exact recovery, except for an additional log p term, which is negligible in the exact recovery scenario.

Some implications about the practically preferable settings of n and p should be clarified. Firstly, although Theorem 1 implies that the difficulty of exact recovery increases as p grows (see also Table 1 from our simulations), our theory suggests a wide range of feasible choices for p. For example, if the underlying signals θ_ij and the noise level σ² are of constant order, then we have $Γ ≍ \sqrt{n}$ and Λ ≍ np³, so the conditions of Theorem 1 imply that exact recovery can be guaranteed as long as log p ≲ n. In other words, p is allowed to grow exponentially with n, which is in line with the modern high-dimensional setting. Secondly, our Theorem 2 implies that, even if some conditions (such as MSG) for exact recovery are not satisfied, one can still hope to partially recover the underlying permutation. In accordance with our theoretical result (11), our numerical results (Figure 4) show that, for partial recovery, increasing p indeed reduces the overall risk of the proposed estimator. Finally, as to the sample size n, we argue that, without imposing additional structural assumptions such as row-sparsity, it is very unlikely that including more samples will result in a worse estimate (see Table 1 and Figure 4 for numerical evidence).

Table 1. The empirical risks of the estimators under the 0–1 loss based on 200 simulations for various combinations of the parameters (p, n, α). $\hat{π}$ : proposed method; Π_mean: mean-based method; Π_max: max-based method.

(a) p = 75, n = 40, varying α
Setting	S₁ (σ² = 0.025)		S₂ (σ² = 0.1)		S₃ (σ² = 0.0075)		S₄ (σ² = 0.025)
α	0.1	0.2	0.1	0.2	0.1	0.2	0.1	0.2
$\hat{π}$	0.775	0.575	0.415	0.000	0.025	0.020	0.025	0.000
Π_mean	0.925	0.815	0.955	0.015	0.155	0.135	0.880	0.005
Π_max	1.000	1.000	1.000	0.995	0.995	0.970	0.840	0.430

(b) n = 40, α = 0.1, varying p
Setting	S₁ (σ² = 0.025)		S₂ (σ² = 0.1)		S₃ (σ² = 0.0075)		S₄ (σ² = 0.025)
p	60	90	60	90	60	90	60	90
$\hat{π}$	0.410	0.930	0.340	0.470	0.010	0.115	0.000	0.010
Π_mean	0.720	0.985	0.910	0.980	0.070	0.245	0.775	0.900
Π_max	1.000	1.000	1.000	1.000	0.975	1.000	0.815	0.875

(c) p = 75, α = 0.1, varying n
Setting	S₁ (σ² = 0.025)		S₂ (σ² = 0.1)		S₃ (σ² = 0.0075)		S₄ (σ² = 0.025)
n	40	60	40	60	40	60	40	60
$\hat{π}$	0.765	0.440	0.475	0.095	0.050	0.020	0.010	0.005
Π_mean	0.920	0.645	0.940	0.700	0.175	0.045	0.900	0.905
Π_max	1.000	1.000	1.000	1.000	0.995	0.995	0.855	0.820


Fig. 4 — Boxplots of the empirical normalized Kendall’s tau distance between the estimated and true permutations under different models. $\hat{π}$ : proposed estimator; Π_mean: mean-based estimator; Π_max: max-based estimator.

4. A GENERAL GROWTH MODEL

In this section we study permutation recovery over the general parameter space $D$ , where the growth pattern is not necessarily linear and which is therefore more realistic given the noisy nature of shotgun metagenomic datasets (Boulund et al. 2018; Gao and Li 2018). The analysis relies on a deeper understanding of the relationship between row-monotone matrices and their leading singular vectors.

Specifically, for any $Θ \in D$ , we define the row-centered matrix

Θ^{'} = Θ (I - p^{- 1} e e^{⊤}) \in ℝ^{n \times p}

(12)

where $e = {(1, \dots, 1)}^{⊤} \in ℝ^{p}$ is the all-ones vector, and whose singular value decomposition (SVD) is given by $Θ^{'} = \sum_{i = 1}^{r} λ_{i} (Θ^{'}) u_{i}^{'} v_{i}^{' ⊤}$ , with r ≤ min{n, p}. The following proposition is essential to our analysis of the general growth model.

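Proposition 3. Let Θ′ be defined as above. Then its first right singular vector $v_{1}^{'} = {(v_{11}^{'}, \dots, v_{1p}^{'})}^{⊤}$ is a monotone vector, i.e., either $v_{11}^{'} \leq v_{12}^{'} \leq \dots \leq v_{1p}^{'}$ or $v_{11}^{'} \geq v_{12}^{'} \geq \dots \geq v_{1p}^{'}$ .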
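Proposition 3 is easy to probe numerically. The sketch below builds a nonlinear but row-monotone Θ from cumulative sums of positive increments, centers its rows as in (12), and inspects the leading right singular vector returned by NumPy; the random seed, dimensions and tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 8

# Nonlinear but row-monotone signal: cumulative sums of positive increments.
theta = np.cumsum(rng.exponential(1.0, size=(n, p)), axis=1)

# Row-centering as in (12): Theta (I - ee^T / p) subtracts each row's mean.
theta_centered = theta - theta.mean(axis=1, keepdims=True)

# NumPy returns singular values in descending order; rows of vt are right singular vectors.
_, _, vt = np.linalg.svd(theta_centered, full_matrices=False)
v1 = vt[0]

steps = np.diff(v1)
tol = 1e-10
is_monotone = bool(np.all(steps >= -tol) or np.all(steps <= tol))
print(is_monotone)
```

Both directions are accepted in the check because the sign of a singular vector is arbitrary, which mirrors the "either ... or" in the statement.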

Together with Proposition 1, the above proposition justifies our construction of the permutation estimator $\hat{π}$ using a PCA-based approach. To overcome the identifiability issue, we further assume that λ₁(Θ′) has multiplicity one. We first introduce several quantities that play key roles in permutation recovery over $D$ .

Definition 3. For any $Θ \in D$ and the corresponding Θ′ defined as above, we define

Γ = \min_{1 \leq i < j \leq p} | u_{1}^{' ⊤} (Θ_{. i}^{'} - Θ_{. j}^{'}) | = λ_{1} (Θ^{'}) \min_{1 \leq i < j \leq p} | v_{1 i}^{'} - v_{1 j}^{'} |,

as the local minimal signal gap, define

Ξ = \max_{1 \leq i \leq p - 1} {‖ Θ_{. i}^{'} - Θ_{. i + 1}^{'} ‖}_{2} = \max_{1 \leq i \leq p - 1} {(\sum_{j = 1}^{r} λ_{j}^{2} (Θ^{'}) {| v_{j i}^{'} - v_{j, i + 1}^{'} |}^{2})}^{1 / 2},

as the local maximal signal gap, and define

Λ = λ_{1}^{2} (Θ^{'}) - λ_{2}^{2} (Θ^{'})

as the global signal strength of Θ.
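The three quantities of Definition 3 can be computed from a single SVD. The sketch below does so for a small linear-growth matrix, for which Θ′ has rank one and the values can be compared with (8) and (9) by hand; the function name and the toy numbers are assumptions made for illustration, and the adjacent-gap shortcut for Γ relies on the monotonicity of v′₁ from Proposition 3.

```python
import numpy as np


def general_signal_quantities(theta):
    """Return (Gamma, Xi, Lambda) of Definition 3 for a row-monotone theta (n x p)."""
    centered = theta - theta.mean(axis=1, keepdims=True)     # Theta' in (12)
    _, svals, vt = np.linalg.svd(centered, full_matrices=False)
    v1 = vt[0]
    # v1 is monotone (Proposition 3), so the minimum pairwise gap is between neighbours.
    gamma = svals[0] * np.min(np.abs(np.diff(v1)))
    # Largest Euclidean distance between adjacent columns of Theta'.
    xi = np.max(np.linalg.norm(np.diff(centered, axis=1), axis=0))
    second = svals[1] if len(svals) > 1 else 0.0
    lam = svals[0] ** 2 - second ** 2
    return gamma, xi, lam


# Linear growth example: theta_ij = a_i * eta_j + b_i.
a = np.array([3.0, 4.0])
b = np.array([1.0, 0.5])
eta = np.array([0.0, 1.0, 3.0])
theta = np.outer(a, eta) + b[:, None]

gamma, xi, lam = general_signal_quantities(theta)
print(round(gamma, 6), round(xi, 6), round(lam, 6))
```

For this input, ‖a‖₂ = 5, the adjacent gaps of η are 1 and 2, and Σ_j (η_j − η̄)² = 14/3, so the expected values are Γ = 5, Ξ = 10 and Λ = 25 · 14/3, matching (8) and (9) up to floating-point error.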

In particular, the above definitions of Γ and Λ generalize the ones given earlier in the linear growth model as these quantities coincide for $Θ \in D_{L}$ . The following theorem concerns the exact permutation recovery with $\hat{π}$ over $D$ .

Theorem 3 (Exact Recovery, General). Suppose (A1) holds, $n ≲ p$ , $(Θ, π) \in D$ , and Θ satisfies $Γ > C_{0} σ \sqrt{\log p}$ and

Λ > C_{1} σ^{2} [(n + \frac{Ξ^{2}}{σ^{2}}) \max \{\frac{(n + \log p) σ^{2}}{Γ^{2}}, 1\} + \sqrt{p} (\sqrt{n} + \frac{Ξ}{σ}) \max \{\frac{σ \sqrt{n + \log p}}{Γ}, 1\}]

for some C₀, C₁ > 0. Then with probability at least $1 - O (p^{- c})$ for some constant c > 0, up to a permutation reversion, we have $\hat{π} = π$ .

As in the case of linear growth model (Theorem 1), in Theorem 3, to guarantee exact recovery, we need the MSG condition $Γ > C_{0} σ \sqrt{\log p}$ . Unlike the linear growth model, here Γ only implicitly depends on the elements of Θ through its spectral quantities, which makes its interpretation less clear. To address this issue, we make the following observation that links the minimal singular vector gap $\min_{1 \leq i < j \leq p} | v_{1 i}^{'} - v_{1 j}^{'} |$ in the definition of Γ to the elements of Θ.

Proposition 4. Let Θ′ in (12) be such that there exists a δ> 0 being the lower bound of the normalized minimum gap between any two entries in the same row, i.e.

\min_{1 \leq k \leq n} \frac{| θ_{k, i}^{'} - θ_{k, j}^{'} |}{{‖ Θ_{k .}^{'} ‖}_{2}} \geq δ for some i \neq j .

Then the first singular vector $v_{1}^{'} \in ℝ^{p}$ of Θ′ satisfies $| v_{1, i}^{'} - v_{1, j}^{'} | \geq δ$ .

Consequently, the implicit requirement that $\min_{1 \leq i < j \leq p} | v_{1 i}^{'} - v_{1 j}^{'} |$ is large can be guaranteed when the normalized minimum distance $\min_{1 \leq i < j \leq p} \min_{1 \leq k \leq n} | θ_{k, i}^{'} - θ_{k, j}^{'} | / {‖ Θ_{k .}^{'} ‖}_{2}$ is large. Our next theorem concerns the partial recovery over the general parameter space $D$ .

Theorem 4 (Partial Recovery, General). Suppose (A1) holds, $n ≲ p$ , $(Θ, π) \in D$ , and Θ satisfies

(i) there exists some C₀ > 0 such that $\max_{1 \leq j \leq p - 1} | v_{1 j}^{'} - v_{1, j + 1}^{'} | < C_{0} \min_{1 \leq j \leq p - 1} | v_{1 j}^{'} - v_{1, j + 1}^{'} |$ for all p > 0, and
(ii) $Λ > C_{1} σ^{2} [\max \{\frac{{(n + \log p)}^{2} σ^{2}}{Γ^{2}}, n + \frac{Ξ^{2}}{σ^{2}}\} + \sqrt{p} \max \{\frac{σ (n + \log p)}{Γ}, \sqrt{n} + \frac{Ξ}{σ}\}]$ for some C₁ > 0.

Then, up to a permutation reversion,

E [τ_{K} (\hat{π}, π)] \leq 1 \land (\frac{c_{0} σ}{p Γ} \min \{1, e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})\} + \frac{c_{1} e^{- Γ^{2} / 2 σ^{2}}}{p (Γ / σ + \sqrt{8 / π})} + \frac{c_{2}}{p^{c + 2}})

for some c, c₀, c₁, c₂ > 0.

Condition (i) of Theorem 4 parallels the one given in Theorem 2. It essentially requires an even spacing of the elements (the projected columns of Θ) whose ordering is to be tracked by $\hat{π}$ . In contrast, in both Theorems 3 and 4, the conditions on Λ are slightly more complicated than those in Theorems 1 and 2, as they further depend on the relative magnitude between Ξ/σ and $\sqrt{n}$ . In particular, if $Ξ / σ ≲ \sqrt{n}$ , the conditions reduce to the ones required in the linear growth models. Interestingly, the risk upper bound obtained in Theorem 4 remains the same as in the linear growth model, which only depends on p and the signal-to-noise ratio Γ /σ.

5. MINIMAX LOWER BOUNDS AND OPTIMALITY

In this section, we establish the minimax lower bounds for both exact and partial recovery considered in previous sections, in relation to different levels of the signal-to-noise ratio Γ /σ. In the following theorem, we show the MSG condition for exact recovery is asymptotically sharp.

Theorem 5. Suppose (A1) holds. Let $D_{1} = D_{L} \cap {(Θ, π) : Γ \leq \frac{σ}{4} \sqrt{\log p}}$ and $D_{1}^{'} = D \cap {(Θ, π) : Γ \leq \frac{σ}{4} \sqrt{\log p}}$ . Then for any p ≥10, we have

\inf_{\hat{π}} \sup_{(Θ, π) \in D_{1}^{'}} P (\hat{π} \neq π) \geq \inf_{\hat{π}} \sup_{(Θ, π) \in D_{1}} P (\hat{π} \neq π) \geq 0.3,

where the infimum is over all the permutation estimators $\hat{π}$ .

This theorem, along with Theorem 1 and Theorem 3, indicates that our proposed estimator is minimax rate-optimal over $D_{L}$ and $D$ in terms of the MSG condition on Γ. In light of Proposition 2, in some situations the MSG condition can be both necessary and sufficient for exact recovery, which includes practically important cases such as n ≍ p, n < log p, etc. In information-theoretic language, we have therefore obtained both the achievability result, i.e., the existence of an algorithm or estimator that exactly recovers the signal with high probability, and the converse result, namely, an upper bound on the probability of exact recovery that applies to any estimator (Cullina and Kiyavash 2016). See Figure 3 for an illustration.

Fig. 3 — A graphical illustration of the achievability/converse result for exact recovery.

Our next theorem establishes a minimax lower bound for the expected rate of convergence for the partial recovery.

Theorem 6. Suppose (A1) holds, $D_{2} (t) = D_{L} \cap {(Θ, π) : c t \leq Γ \leq C t}, D_{2}^{'} (t) = D \cap {(Θ, π) : c t \leq Γ \leq C t}$ for some C, c > 0, and t/σ≥ 2. Then there exist constants C₁, C₂ > 0 such that

\inf_{\hat{π}} \sup_{(Θ, π) \in D_{2}^{'} (t)} E [τ_{K} (\hat{π}, π)] \geq \inf_{\hat{π}} \sup_{(Θ, π) \in D_{2} (t)} E [τ_{K} (\hat{π}, π)] \geq \frac{C_{1} σ}{p t} e^{- t^{2} / 2 σ^{2}} + \frac{C_{2}}{p^{2}} .

Comparing the above minimax lower bound to the risk upper bounds obtained in Theorems 2 and 4, we conclude that our proposed estimator $\hat{π}$ is minimax rate-optimal in terms of partial recovery for both the linear growth model and the general growth model over the range where Γ /σ does not vanish (Figure 1). In particular, in Theorems 5 and 6, since the minimax lower bounds only concern the worst-case scenarios, the same lower bounds should hold for any parameter space that includes the same worst cases. Similarly, the assumption (A1) does not pose a restriction to the general applicability of such results.

6. NUMERICAL STUDIES

6.1. Simulation with Model-Generated Data

To demonstrate our theoretical results and compare with alternative methods, we generate data from model (1) with various configurations of the signal matrix Θ. We compare the empirical performance of our proposed estimator $\hat{π}$ with the following alternatives:

π_mean : Order the columns of Y by the magnitude of their column means;
π_max : Order the columns of Y by the magnitude of their column maxima.
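The two baselines can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the reported simulations: the function names are hypothetical, Y is stored as a list of n rows of length p, and the output lists the column indices in increasing order of the statistic.

```python
def order_by(stat):
    """Column indices sorted by increasing value of stat (ties broken by index)."""
    return sorted(range(len(stat)), key=lambda j: (stat[j], j))


def pi_mean(Y):
    """Order the columns of Y by their column means."""
    n, p = len(Y), len(Y[0])
    return order_by([sum(Y[i][j] for i in range(n)) / n for j in range(p)])


def pi_max(Y):
    """Order the columns of Y by their column maxima."""
    n, p = len(Y), len(Y[0])
    return order_by([max(Y[i][j] for i in range(n)) for j in range(p)])
```

Both functions use a single summary per column, which is why neither can adapt to rows that carry more signal than others.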

We use both the 0–1 loss and the normalized Kendall’s tau distance in comparing these methods. Due to the identifiability issue, the performance of each estimator is evaluated up to a complete reversion of the permutation. For example, we use $\min {τ_{K} (\hat{π}, π), τ_{K} (\hat{π}, rev (π))}$ as the empirical Kendall’s tau distance. By symmetry, we set the underlying permutation π= id. The signal matrix $Θ = (θ_{i j}) \in ℝ^{n \times p}$ is generated under the following four regimes:

S₁(α, n, p): For any 1 ≤ j ≤ p, θ_ij = log(1 + jα_i +β_i) where α_i ~ Unif(α/2, α) for 1 ≤ i ≤ n/2, α_i ~ Unif(0,0.01) for n/2 < i ≤ n, and β_i ~ Unif(1,3) for all 1 ≤ i ≤ n;
S₂(α, n, p): For any 1 ≤ j ≤ p, θ_ij = jα_i + β_i where α_i ~ Unif(α/2, α) for 1 ≤ i ≤ n/2, α_i ~ Unif(0, α/10) for n/2 < i ≤ n, and β_i ~ Unif(1,3) for all 1 ≤ i ≤ n;
S₃(α, n, p): For any 1 ≤ j ≤ p, θ_ij = log(1 + jα_i + β_i) where α_i ~ Unif(α/2, α) for 1 ≤ i ≤ 3, α_i ~ Unif(0,0.01) for 4 < i ≤ n, and β_i ~ Unif(1,3) for all 1 ≤ i ≤ n;
S₄(α, n, p): For any 1 ≤ j ≤ p, θ_ij = jα_i + β_i where α_i ~ Unif(α/2, α) for 1 ≤ i ≤ 3, α_i ~ Unif(0, α/10) for 4 < i ≤ n, and β_i ~ Unif(1,3) for all 1 ≤ i ≤ n.

Specifically, under each regime, the sample-specific “growth rate” parameter α_i is drawn uniformly at random either from the interval [α/2, α] or from an interval with much smaller values, namely, [0, α/10] in $S_{2}$ and $S_{4}$ and [0,0.01] in $S_{1}$ and $S_{3}$ . By construction, the four regimes consist of the nonlinear growth model where the signals spread out over many samples ( $S_{1}$ ) or concentrate at a few rows ( $S_{3}$ ) and the linear growth model where the signals spread out over many samples (S₂) or concentrate at a few rows (S₄). In particular, in accordance with our theory, for the supposedly “non-informative” samples, we allow the corresponding growth rates to be small but non-zero, which shows the flexibility of our proposed method. The entries of Z are drawn i.i.d. from a centred normal distribution whose variance σ² will be given explicitly. In each setting, we evaluate the empirical performance of each method over a range of n, p or α. Each setting is repeated 200 times.
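The data-generating step described above can be sketched as follows. This is a minimal illustration rather than the simulation code behind the reported results. The function names and the `regime` argument (1 to 4, standing for S₁ to S₄) are hypothetical, and in regimes 3 and 4 every row after the third is treated as a weak-signal row, which is an assumption about how the index ranges are meant to be read.

```python
import math
import random


def signal_matrix(regime, alpha, n, p, rng):
    """Draw Theta (n rows, p columns) under regime 1..4 (S_1..S_4)."""
    # Regimes 1, 2: strong growth rates on the first n/2 rows.
    # Regimes 3, 4: strong growth rates on the first 3 rows only
    # (assumption: all remaining rows are weak-signal rows).
    n_strong = n // 2 if regime in (1, 2) else 3
    log_link = regime in (1, 3)          # S_1 and S_3 are the nonlinear models
    weak_upper = 0.01 if log_link else alpha / 10
    theta = []
    for i in range(n):
        if i < n_strong:
            a = rng.uniform(alpha / 2, alpha)
        else:
            a = rng.uniform(0.0, weak_upper)
        b = rng.uniform(1.0, 3.0)
        row = [j * a + b for j in range(1, p + 1)]
        if log_link:
            row = [math.log(1.0 + v) for v in row]
        theta.append(row)
    return theta


def add_noise(theta, sigma, rng):
    """Return Theta + Z with i.i.d. N(0, sigma^2) entries in Z."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in theta]
```

With π = id, as in the simulations, the observed matrix is simply `add_noise(signal_matrix(...), sigma, rng)`; every row of the signal matrix is nondecreasing by construction.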

For exact recovery, Table 1 reports the empirical risks of the estimators under the 0–1 loss for various regimes and parameter combinations. The noise level σ² is chosen for each regime to better illustrate the differences in the empirical risks among the estimators. Consistent with our theory, our proposed estimator has the smallest empirical risk over all the settings, and the estimation risk decreases as we increase α or n, or decrease p.

For partial recovery, in Figure 4, we show boxplots of the empirical normalized Kendall’s tau distance between each estimator and the true permutation π. Again, our proposed method outperforms the alternatives in all the cases. As expected from our theory, under all four regimes, increasing p while keeping other parameters fixed results in smaller estimation risk. As for the dependence on n, under $S_{1}$ and $S_{2}$ , increasing n leads to smaller risk as it is equivalent to increasing Γ, whereas under $S_{3}$ and $S_{4}$ , the risk remains roughly the same across different n’s, as in these cases Γ does not change much.
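The two losses, evaluated up to a complete reversal, can be sketched as below. The sketch takes the normalized Kendall’s tau distance to be the fraction of discordant pairs among the p(p − 1)/2 pairs, and represents rev(π) by mirroring the values of π; both the function names and this representation of rev are assumptions made for illustration. For π = id, as used in the simulations, the reversal is simply the decreasing sequence.

```python
def kendall_tau(pi1, pi2):
    """Normalized Kendall's tau distance: fraction of discordant pairs."""
    p = len(pi1)
    discordant = sum(
        1
        for i in range(p)
        for j in range(i + 1, p)
        if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) < 0
    )
    return 2.0 * discordant / (p * (p - 1))


def zero_one_loss(pi_hat, pi):
    """0-1 loss: 0 if the two permutations agree everywhere, 1 otherwise."""
    return 0 if list(pi_hat) == list(pi) else 1


def reverse(pi):
    """Mirror a permutation of 0..p-1 (assumed form of rev)."""
    p = len(pi)
    return [p - 1 - v for v in pi]


def up_to_reversal(loss, pi_hat, pi):
    """Evaluate a loss against both pi and its reversal, keeping the smaller."""
    return min(loss(pi_hat, pi), loss(pi_hat, reverse(pi)))
```

For example, `up_to_reversal(kendall_tau, pi_hat, list(range(p)))` gives the empirical distance used for one simulated dataset.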

To offer a more intuitive interpretation of why $\hat{π}$ performs better than the alternative methods, we assessed the weight vectors $\hat{w}$ of our proposed estimator $\hat{π}$ under each regime after 200 rounds of simulations (Figure 3 in Supplementary Material). In comparison, the weight vector for π_mean is simply $(1 / \sqrt{n}, \dots, 1 / \sqrt{n})$ , which assigns equal weight to all the samples. On the other hand, since π_max cannot be written in the form of ${(r (w^{⊤} Y))}^{- 1}$ for some weight vector w and therefore does not belong to the class of linear projection estimators, we reported instead the pseudo-weight vector $\tilde{w} \in ℝ^{n}$ where the i-th component is the proportion that the i-th sample is used among the p coordinates. In general, we found that $\tilde{w} \in ℝ^{n}$ assigns larger weights to only a few samples among those with higher signal strength, and the weight vector for π_mean fails to distinguish the informative samples from the non-informative ones. In contrast, the weight vectors $\hat{w}$ for our proposed estimator $\hat{π}$ automatically adapt to the varying signal strengths across the samples and assign larger weights to the samples with more significant signal changes. This also explains the interesting phenomenon in Figure 4 that, under the regimes $S_{1}$ and $S_{2}$ , $\hat{π}$ and π_mean perform better than π_max, whereas under $S_{3}$ and $S_{4}$ , $\hat{π}$ and π_max perform better. In summary, methods that are able to detect and assign larger weight to the more informative samples perform better than methods that are not. Evidently, $\hat{π}$ combines the advantages of π_mean and π_max in that it finds the best weights (projection scores) in a data-driven manner.
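The weight vectors compared above can be computed along the following lines. This is a hedged sketch, not the authors' implementation: the function names are hypothetical, power iteration is swapped in for a full eigendecomposition to keep the example dependency-free, and the estimator is represented by the 0-based ranks of the projected columns, whose sign (and hence direction) is arbitrary.

```python
import random


def top_eigenvector(A, iters=500, seed=0):
    """Leading eigenvector of a symmetric PSD matrix A by power iteration."""
    n = len(A)
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = sum(x * x for x in w) ** 0.5
    w = [x / norm for x in w]
    for _ in range(iters):
        v = [sum(A[i][k] * w[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:  # A annihilates w (e.g. A is zero); keep current vector
            break
        w = [x / norm for x in v]
    return w


def projection_weights(Y):
    """First eigenvector of A = Y (I - ee^T/p)(I - ee^T/p) Y^T."""
    n, p = len(Y), len(Y[0])
    # Right-multiplying by (I - ee^T/p) subtracts each row's mean.
    T = [[y - sum(row) / p for y in row] for row in Y]
    A = [[sum(T[i][j] * T[k][j] for j in range(p)) for k in range(n)]
         for i in range(n)]
    return top_eigenvector(A)


def projection_ranks(Y):
    """0-based ranks of the projected columns w^T Y (defined up to reversal)."""
    w = projection_weights(Y)
    n, p = len(Y), len(Y[0])
    scores = [sum(w[k] * Y[k][j] for k in range(n)) for j in range(p)]
    order = sorted(range(p), key=lambda j: scores[j])
    ranks = [0] * p
    for r, j in enumerate(order):
        ranks[j] = r
    return ranks


def max_pseudo_weights(Y):
    """Share of the p columns whose maximum is attained in each row."""
    n, p = len(Y), len(Y[0])
    counts = [0] * n
    for j in range(p):
        col = [Y[i][j] for i in range(n)]
        counts[col.index(max(col))] += 1  # first row attaining the maximum
    return [c / p for c in counts]
```

In a noiseless rank-one example the weights returned by `projection_weights` are proportional to the row slopes, which is the adaptivity the comparison above refers to.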

6.2. Evaluation Using Synthetic Metagenomic Data

We evaluate the empirical performance of our proposed method using the synthetic metagenomic sequencing dataset of Gao and Li (2018), which was generated by simulating sequencing reads from 45 bacterial genomes. Instead of estimating the PTRs, which was the focus of Gao and Li (2018), our goal is to recover the unknown relative orders of the contigs assembled in typical metagenomics studies. In addition to assisting the estimation of PTRs, such ordering of the contigs could be of independent interest for other applications, including genome assemblies based on shotgun metagenomics data.

Gao and Li (2018) presented a synthetic shotgun metagenomic sequencing dataset of a community of 45 phylogenetically related species from 15 genera of five different phyla with known RefSeq ID, taxonomy and replication origin (Gao, Luo, and Zhang 2013) (see Figure 2 in our Supplementary Material). To generate metagenomics reads, reference genome sequences of randomly selected three species in each genus were downloaded from NCBI. Read coverages were generated along the genome based on an exponential distribution with a specified peak-to-trough ratio and a function of accumulative distribution of read coverages along the genome was calculated. Sequencing reads were next generated using the above accumulative distribution function and a random location of each read on the genome, until the total read number achieved a randomly assigned average coverage between 0.5 and 10 folds for the species in a sample. Sequencing errors including substitution, insertion and deletion were simulated in a position- and nucleotide-specific pattern according to a recent study on metagenomic sequencing error profiles of Illumina.

For the final dataset, the average nucleotide identities (ANI) between species within each genus ranged from 66.6% to 91.2%. The probability of one species existing in each of the 50 simulated samples was set as 0.6, and a total of 1,336 average coverages and the corresponding PTRs were randomly and independently assigned. After the same processing and filtering steps and GC-adjustment step as in Gao and Li (2018), the final dataset included genome assemblies of 41 species. For each species, we obtained the permuted matrix of log-contig counts, with the number of samples ranging from 29 to 46, and the number of contigs ranging from 47 to 482.

Our proposed method $(\hat{π})$ was used to estimate the unknown orders of the contigs for each species and each sample. As a comparison, we also considered the iRep estimator proposed in Brown et al. (2016), where the contigs of a given species were ordered for each sample separately based on the read counts observed. We evaluate these methods by comparing the estimated contig orders to their true orders as measured by the normalized Kendall’s tau distance. To generalize our evaluation to diverse metagenomic datasets, we also evaluate the effect of sample size as well as contig numbers by randomly selecting subsets of samples or contigs from each dataset. The selection was made with replacement.

The results are summarized in Figure 5 by comparing the normalized Kendall’s tau distances. As n or p varies, our proposed estimator performs consistently better than iRep in recovering the true contig orders, which explains partially why the DEMIC algorithm worked better in estimating the bacterial growth dynamics. The results of our methods are not sensitive to the sample size and the number of contigs from the genome assemblies. Our estimator also shows smaller variability.

Fig. 5 — Boxplots of the normalized Kendall’s tau distance between the estimated contig orders and the true orders for different sample sizes n and different numbers of contigs p. The lighter ones correspond to our proposed method and the darker ones correspond to the iRep estimation method.

6.3. Analysis of a Real Microbiome Metagenomic Data Set

Finally, we complete our numerical studies by analyzing a real metagenomic dataset from the Pediatric Longitudinal Study of Elemental Diet and Stool Microbiome Composition (PLEASE) study, a prospective cohort study to investigate the treatment effects on the gut microbiome and reduction of inflammation in pediatric Crohn’s disease patients (Lewis et al. 2015). In particular, sequencing data from the fecal samples of 86 Crohn’s disease children were obtained at baseline, 1 week and 8 weeks after antiTNF or enteral diet treatment. In our analysis, the sequencing data at the 8th week after treatment was used to compare the bacterial growth dynamics for non-responders (n = 34) and responders (n = 47). The reads were downloaded from NCBI short read archive (SRP057027) with the corresponding metadata. After the same coassembly, alignment and binning steps as in Gao and Li (2018), the DEMIC algorithm was applied to estimate the bacterial growth rate of a given species represented by a contig cluster (bin) for each sample. In particular, DEMIC applied our proposed method to the GC-adjusted contig coverage data to recover the original order of the contigs. After obtaining the ordered contigs, a simple linear regression was fitted to obtain estimates of the PTRs (ePTRs).

In order to compare the bacterial growth rates between responders and non-responders, our analysis focused on ePTRs of 8 contig clusters over subsets of the non-responders (n₁) and the responders (n₂) after 8 weeks of treatment with min{n₁, n₂} > 5. Other contig clusters were rare and only appeared in a few samples. For each contig cluster, we compare the ePTRs of the responders and non-responders using the Wilcoxon rank sum test (Table S.1 in Supplementary Material). The taxonomic annotations of these eight contig clusters were obtained by applying the BAT algorithm (von Meijenfeldt et al. 2019), which compares the metagenomic assembled bins to a taxonomy database. In Table S.1, we show the final taxonomic annotations for each bin to the finest possible resolution, with the lineage scores indicating the quality of each taxonomic classification.

Among the 8 contig clusters, bin.026 showed a significant difference in ePTRs between responders and non-responders after either antiTNF or enteral diet treatment for 8 weeks (p=0.0418), where the growth rate was higher in Crohn’s disease patients who did not respond to the treatment. The taxonomic classification (Table S.1) shows that this contig cluster belongs to the phylum Firmicutes and the order Clostridiales. Since the BAT algorithm was not able to classify the order Clostridiales to a finer taxonomic level of known species, this contig cluster may represent a new species that is important to the treatment outcome of Crohn’s disease patients.

7. DISCUSSION

In this paper, partial recovery was studied under the normalized Kendall’s tau distance. Another commonly used metric is the normalized Spearman’s footrule distance defined by

ρ (π_{1}, π_{2}) = \frac{2}{p (p - 1)} \sum_{i = 1}^{p} | π_{1} (i) - π_{2} (i) |, π_{1}, π_{2} \in S_{p} .

A celebrated result by Diaconis and Graham (1977) shows that $τ_{K} (π_{1}, π_{2}) \leq ρ (π_{1}, π_{2}) \leq 2 τ_{K} (π_{1}, π_{2})$ , which means the two distances are equivalent. As a consequence, all the theoretical results presented in this paper concerning the Kendall’s tau distance also hold for the Spearman’s footrule distance without any change.
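The two-sided inequality can be checked by brute force for small p. The sketch below enumerates every permutation of {0, ..., p − 1} and compares both normalized distances to the identity; the function names are hypothetical, and the check is a numerical illustration only, not a proof.

```python
from itertools import permutations


def kendall_tau(pi1, pi2):
    """Normalized Kendall's tau distance: fraction of discordant pairs."""
    p = len(pi1)
    discordant = sum(
        1
        for i in range(p)
        for j in range(i + 1, p)
        if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) < 0
    )
    return 2.0 * discordant / (p * (p - 1))


def footrule(pi1, pi2):
    """Normalized Spearman's footrule distance."""
    p = len(pi1)
    return 2.0 * sum(abs(a - b) for a, b in zip(pi1, pi2)) / (p * (p - 1))


def check_equivalence(p, tol=1e-12):
    """True if tau <= rho <= 2 * tau for every permutation against the identity."""
    identity = list(range(p))
    for s in permutations(identity):
        tau = kendall_tau(s, identity)
        rho = footrule(s, identity)
        if not (tau <= rho + tol and rho <= 2.0 * tau + tol):
            return False
    return True
```

Comparing against the identity is enough because both distances are right-invariant, the same property used in the proofs below.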

The minimax optimality of the proposed estimator $\hat{π}$ was investigated in Section 5 by examining the asymptotic sharpness of the MSG condition for exact recovery, and by obtaining the matching minimax risk lower bound for partial recovery. There are a few issues that deserve further investigation. For both exact and partial recovery, it is unclear to what extent the GSS condition is necessary. In our risk analysis, the perturbation bound for the left singular subspace (Cai and Zhang 2018) was used. In fact, similar results can be obtained using the concentration bound for the linear functionals of singular vectors (Koltchinskii and Xia 2016). Nevertheless, it remains to show whether the GSS condition is also asymptotically sharp. In addition, in Theorem 6, the matching minimax lower bound was obtained only for nonvanishing Γ /σ. It remains to show whether the rate σ/(pΓ) is minimax optimal when Γ /σ → 0. The difficulty lies in finding a $p^{1 + δ}$-sphere packing of the group $S_{p}$ equipped with the Kendall’s tau distance for any 0 <δ<1, while the pairwise ℓ₂ distances of the packing elements are also well controlled. Some initial steps have been made in the so-called rank modulation theory (Barg and Mazumdar 2010; Mazumdar, Barg, and Zemor 2013).

There are several related problems that are also of significant theoretical and practical interest. Firstly, although we used the Kendall’s tau distance or the equivalent Spearman’s footrule distance as the metric for partial recovery, other distances such as the Hamming distance, Spearman’s rank correlation distance, and Ulam’s distance have also been used as the performance metrics for partial recovery in other permutation estimation problems (Göloğlu et al. 2015; Mukherjee 2016). It is therefore of interest to see how $\hat{π}$ performs under these losses. Secondly, our proposed estimator $\hat{π}$ implicitly performs a (linear) dimension reduction and only uses the information contained in the first eigenvector of A in (4). A natural extension is to consider the eigen-subspace spanned by the first k eigenvectors and to estimate the permutation in a sequential manner.

The present paper focuses on the estimation of the permutation matrix Π. It is also of interest to estimate the underlying signal matrix Θ or some functionals of it. For example, in microbiome growth dynamics studies, it is of significant interest to estimate the peak-to-trough ratio $\exp (θ_{k p} - θ_{k 1})$ for k =1,...,n, which measures the microbial growth rate for the kth sample, and to identify the samples with peak-to-trough ratio of 1. It is also interesting to identify the bacteria that show differential growth dynamics between disease and normal individuals. Moreover, robust permutation recovery methods that can relax the Gaussian or sub-Gaussian assumption of the noise in the permuted monotone matrix model are needed. For example, in some applications, the columns of the noise matrix are not independent, or the variance levels across the noise matrix are not identical. In these cases, we argue that, as long as the marginal distributions of the noise matrix entries remain sub-Gaussian, the analytical framework of the current paper can still be applied, but with more effort to control the underlying heteroskedasticity. Toward this end, results from the recent work of Zhang, Cai, and Wu (2018) can be very useful, in terms of the new technical tools that parallel the ones used in the current paper to analyse the homoskedastic PCA (cf. Lemma 2 and 3). Finally, to account for non-informative samples, sparse PCA (Cai, Ma, and Wu 2013; Yuan and Zhang 2013) can be considered. These are interesting problems left for future research.

8. PROOFS OF THE MAIN THEOREMS

In this section, we prove Theorems 1 and 2 in detail and briefly sketch the proofs of Theorems 3 and 4. We also prove the minimax lower bounds in Theorems 5 and 6. Proofs of other results including the technical lemmas can be found in the online Supplementary Materials.

Proof of Theorem 1

Let X =Θ+Z. It follows that Y = XΠ. By right invariance of the 0–1 loss with respect to permutation composition, we have

ℓ ({(r ({\hat{w}}^{⊤} Y))}^{- 1}, π) = ℓ ({(r ({\hat{w}}^{⊤} X Π))}^{- 1}, π) = ℓ ({(r ({\hat{w}}^{⊤} X))}^{- 1} \circ π, π) = ℓ ({(r ({\hat{w}}^{⊤} X))}^{- 1}, i d) .

Thus it suffices to study the risk $E ℓ ({(r ({\hat{w}}^{⊤} X))}^{- 1}, i d) = P ({(r ({\hat{w}}^{⊤} X))}^{- 1} \neq i d)$ . In fact,

P ({(r ({\hat{w}}^{⊤} X))}^{- 1} \neq i d) \leq P (\cup_{i = 1}^{p - 1} {\sum_{k = 1}^{n} {\hat{w}}_{k} X_{k i} \geq \sum_{k = 1}^{n} {\hat{w}}_{k} X_{k, i + 1}}) \leq \sum_{i = 1}^{p - 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} X_{k i} \geq \sum_{k = 1}^{n} {\hat{w}}_{k} X_{k, i + 1}) = \sum_{i = 1}^{p - 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k, i + 1}) \geq 0),

(13)

which further reduces to obtaining an upper bound for $P_{i} = P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k, i + 1}) \geq 0)$ . By definition, $\hat{w}$ is the first eigenvector of $A = Y (I - \frac{1}{p} e e^{⊤}) (I - \frac{1}{p} e e^{⊤}) Y^{⊤}$ . Simple calculation yields $Π (I - \frac{1}{p} e e^{⊤}) (I - \frac{1}{p} e e^{⊤}) Π^{⊤} = (I - \frac{1}{p} e e^{⊤}) (I - \frac{1}{p} e e^{⊤})$ for any $Π \in S_{p}$ . So $\hat{w}$ is also the first eigenvector of $A = X (I - \frac{1}{p} e e^{⊤}) (I - \frac{1}{p} e e^{⊤}) X^{⊤} \equiv T T^{⊤}$ , where $T \in ℝ^{n \times p}$ . Note that T admits the decomposition $T = Θ^{'} + E \in ℝ^{n \times p}$ where $E_{i j} \sim N (0, (p - 1) σ^{2} / p)$ and $Θ^{'} = a η^{' ⊤}, η_{j}^{'} = η_{j} - \frac{1}{p} \sum_{i = 1}^{p} η_{i}$ . In particular, $T_{. i} = X_{. i} - {\bar{X}}_{r o w}$ where ${\bar{X}}_{r o w} = p^{- 1} \sum_{i = 1}^{p} X_{. i} \in ℝ^{n}$ is the vector of row means of X. We denote ϕ_ij = T_.i − T_.j = X_.i − X_.j and denote $w = a / ‖ a ‖_{2} \in ℝ^{n}$ as the first eigenvector of the rank-one matrix Θ′Θ′^⊤. Now following (13), we have

P_{i} = P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k, i + 1}) \geq 0) = P (w^{⊤} ϕ_{i, i + 1} + {(\hat{w} - w)}^{⊤} ϕ_{i, i + 1} \geq 0) = P (w^{⊤} ϕ_{i, i + 1} + {(\hat{w} - w)}^{⊤} ϕ_{i, i + 1} \geq 0, | 1 - {({\hat{w}}^{⊤} w)}^{2} | \leq δ) + P (w^{⊤} ϕ_{i, i + 1} + {(\hat{w} - w)}^{⊤} ϕ_{i, i + 1} \geq 0, | 1 - {({\hat{w}}^{⊤} w)}^{2} | > δ)

for some δ > 0. By definition, up to a change of sign for $\hat{w}$ , we have $0 \leq {\hat{w}}^{⊤} w \leq 1$ . Then $| 1 - {({\hat{w}}^{⊤} w)}^{2} | \leq δ$ implies $| {(\hat{w} - w)}^{⊤} ϕ_{i, i + 1} | \leq ‖ \hat{w} - w ‖_{2} {‖ ϕ_{i, i + 1} ‖}_{2} \leq \sqrt{2 δ} {‖ ϕ_{i, i + 1} ‖}_{2}$ , where the first inequality follows from Cauchy-Schwarz and the second inequality uses $‖ \hat{w} - w ‖_{2} = \sqrt{2 (1 - {\hat{w}}^{⊤} w)} \leq \sqrt{2 (1 - {({\hat{w}}^{⊤} w)}^{2})}$ . Thus

P_{i} \leq P (w^{⊤} ϕ_{i, i + 1} \geq - \sqrt{2 δ} {‖ ϕ_{i, i + 1} ‖}_{2}) + P (| 1 - {({\hat{w}}^{⊤} w)}^{2} | > δ) .

(14)

The following lemmas provide upper bounds for the two probability events in the last expression.

Lemma 1. Under the conditions of Theorem 1, denote $Γ_{i} = ‖ a ‖_{2} (η_{i} - η_{i + 1})$ , then for any δ > 0, we have

P (w^{⊤} ϕ_{i, i + 1} \geq - \sqrt{2 δ} {‖ ϕ_{i, i + 1} ‖}_{2}) \leq Φ (C_{1} \sqrt{δ} Ψ_{i}^{1 / 2} + \frac{Γ_{i}}{σ}) + \frac{C_{2}}{p^{c}}

(15)

for $Ψ_{i} = {(\sqrt{n} + \sqrt{\log p})}^{2} + \frac{Γ_{i}^{2}}{σ^{2}} + \frac{| Γ_{i} |}{σ} \sqrt{\log p}$ and some constants C₁, C₂, c > 0.

Lemma 2. Suppose $λ_{1}^{2} (Θ^{'}) \geq C σ^{2} (n + \sqrt{n p})$ for some C > 0, it follows that

P (| 1 - {({\hat{w}}^{⊤} w)}^{2} | \leq C_{1} \frac{σ^{2} (λ_{1}^{2} (Θ^{'}) + σ^{2} p) (n + \log p)}{λ_{1}^{4} (Θ^{'})}) \geq 1 - \frac{C_{2}}{p^{c}}

for some C₁, C₂, c > 0.

Now since $\frac{1}{p} \sum_{1 \leq i < j \leq p} {(η_{i} - η_{j})}^{2} = \sum_{j = 1}^{p} {(η_{j} - \frac{1}{p} \sum_{i = 1}^{p} η_{i})}^{2} = \sum_{j = 1}^{p} {(η_{j}^{'})}^{2}$ , we have $λ_{1}^{2} (Θ^{'}) = Λ > C_{0} σ^{2} (n + \sqrt{n p})$ for some C₀ > 0. Set $δ = C_{0} σ^{2} \frac{(λ_{1}^{2} (Θ^{'}) + σ^{2} p) (n + \log p)}{λ_{1}^{4} (Θ^{'})}$ . It follows that δ= o(1). Combining Lemma 1 and Lemma 2, we have

P_{i} \leq Φ (C \sqrt{δ} {[{(\sqrt{n} + \sqrt{\log p})}^{2} + \frac{Γ_{i}^{2}}{σ^{2}} + \frac{| Γ_{i} |}{σ} \sqrt{\log p}]}^{1 / 2} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}}

(16)

for some C, c > 0. The rest of the analysis is divided into several cases.

Case 1. log p≲n.

In this case, we have $P_{i} \leq Φ (C \sqrt{δ} {[n + \frac{Γ_{i}^{2}}{σ^{2}} + \frac{| Γ_{i} |}{σ} \sqrt{\log p}]}^{1 / 2} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}}$ . In addition, if $| Γ_{i} | / σ ≲ \sqrt{n}$ , we have $P_{i} \leq Φ (C \sqrt{δ n} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}} \leq \frac{C^{'}}{p^{c}}$ , where the last inequality follows from $\sqrt{\log p} ≲ Γ / σ \leq | Γ_{i} | / σ ≲ \sqrt{n}$ and $Λ ≳ σ^{2} n (\frac{σ^{2} n}{Γ^{2}} + \frac{σ \sqrt{p}}{Γ})$ . If instead $| Γ_{i} | / σ ≳ \sqrt{n}$ , we have $P_{i} \leq Φ (C \sqrt{δ} \frac{| Γ_{i} |}{σ} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}} \leq \frac{C^{'}}{p^{c}}$ , where the last inequality follows from $| Γ_{i} | / σ ≳ \sqrt{n} ≳ \sqrt{\log p}$ and δ= o(1). Hence, in Case 1, (16) can be bounded by O(p^−c).

Case 2. log p≳n.

In this case, we have $P_{i} \leq Φ (C \sqrt{δ} {[\log p + \frac{Γ_{i}^{2}}{σ^{2}} + \frac{| Γ_{i} |}{σ} \sqrt{\log p}]}^{1 / 2} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}}$ . In addition, since $| Γ_{i} | \geq Γ ≳ σ \sqrt{\log p}$ and δ= o(1), we have $P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k, i + 1}) \geq 0) \leq Φ (\frac{C \sqrt{δ}}{σ} | Γ_{i} | + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}} \leq \frac{C^{'}}{p^{c}}$ . This shows that, in Case 2, (16) can also be bounded by O(p^−c).

As a result, it follows that, up to a change of sign for $\hat{w}, P ({(r ({\hat{w}}^{⊤} X))}^{- 1} \neq i d) = O (p^{- c})$ for some constant c > 0. □

Proof of Theorem 2

Firstly, by invariance property of Kendall’s tau distance, $E [τ_{K} (\hat{π}, π)] = E [τ_{K} ({(r ({\hat{w}}^{⊤} X))}^{- 1}, i d)] = E [τ_{K} ((r ({\hat{w}}^{⊤} X)), i d)]$ . It then follows

E [τ_{K} (\hat{π}, π)] = \frac{2}{p (p - 1)} \sum_{i < j} P ({[r ({\hat{w}}^{⊤} X)]}_{i} \geq {[r ({\hat{w}}^{⊤} X)]}_{j}) = \frac{2}{p (p - 1)} \sum_{i < j} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) .

The summation in the last expression can be divided into two parts, namely, the consecutive differences and non-consecutive differences, i.e.,

\sum_{i < j} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) = \sum_{(i, j) : j = i + 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) + \sum_{(i, j) : j > i + 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) .

In the following, we first show

P_{i} = P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k, i + 1}) \geq 0) \leq \frac{c e^{- Γ^{2} / 2 σ^{2}}}{Γ / σ + \sqrt{Γ^{2} / σ^{2} + 8 / π}} + \frac{C}{p^{c}}

(17)

so that

\sum_{(i, j) : j = i + 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq \frac{c p e^{- Γ^{2} / 2 σ^{2}}}{Γ / σ + \sqrt{Γ^{2} / σ^{2} + 8 / π}} + \frac{C}{p^{c}} .

(18)

for some C, c> 0. Then we show that

\sum_{(i, j) : j > i + 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq C \frac{p σ}{Γ} \min {1, e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})} + \frac{C}{p^{c}} .

(19)

Combining (18) and (19), we conclude that

E [τ_{K} (\hat{π}, π)] \leq \frac{C σ}{p Γ} \min {1, e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})} + \frac{C e^{- Γ^{2} / 2 σ^{2}}}{p (Γ / σ + \sqrt{8 / π})} + \frac{C}{p^{c + 2}},

which completes the proof, as the bound $E [τ_{K} (\hat{π}, π)] \leq 1$ is trivial.

Proof of (17). Following the same argument as the proof of Theorem 1, we have for 1 ≤ i ≤ p −1 and $δ = \frac{σ^{2} (n + \log p) (λ_{1}^{2} + σ^{2} p)}{λ_{1}^{4}}$ , $P_{i} \leq P (w^{⊤} ϕ_{i, i + 1} \geq - \sqrt{2 δ} {‖ ϕ_{i, i + 1} ‖}_{2}) + P (| 1 - {({\hat{w}}^{⊤} w)}^{2} | > δ)$ , where the second term can be bounded using Lemma 2. For the first term, by Lemma 1, for $λ_{1}^{4} (Θ^{'}) \geq σ^{2} (λ_{1}^{2} (Θ^{'}) + σ^{2} p) (\log p + n)$ , we have $P_{i} \leq Φ (C \sqrt{δ} {[{(\sqrt{n} + \sqrt{\log p})}^{2} + \frac{Γ_{i}^{2}}{σ^{2}} + \frac{| Γ_{i} |}{σ} \sqrt{\log p}]}^{1 / 2} + \frac{Γ_{i}}{σ}) + \frac{C}{p^{c}}$ .

Using the same argument as in the proof of Theorem 1, it holds that $P_{i} \leq Φ (\frac{Γ_{i}}{σ}) + \frac{C}{p^{c}}$ . Equation (17) then follows by using formula 7.1.13 of Abramowitz and Stegun (1965), which gives $Φ (- t) \leq \frac{2}{t + \sqrt{t^{2} + 8 / π}} ϕ (t)$ for t ≥ 0.
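The Gaussian tail bound invoked in this step is easy to check numerically. The sketch below uses only the standard library; the function names are hypothetical and the check is illustrative, covering a handful of values of t rather than the whole half-line.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def tail_bound(t):
    """Upper bound 2 * phi(t) / (t + sqrt(t^2 + 8/pi)) on Phi(-t), for t >= 0."""
    return 2.0 * normal_pdf(t) / (t + math.sqrt(t * t + 8.0 / math.pi))


def bound_holds(ts, tol=1e-15):
    """True if Phi(-t) does not exceed the bound at every t in ts."""
    return all(normal_cdf(-t) <= tail_bound(t) + tol for t in ts)
```

At t = 0 both sides equal 1/2, and the gap stays small as t grows, which is why the bound yields the sharp exponential rate in (17).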

Proof of (19). For the set of indices S = {(i, j) :1 ≤ i < j ≤ p, j > i +1}, we further divide it into two subsets $S_{1} = {(i, j) : 1 \leq i < j \leq p, j > i + ⌊ σ \sqrt{C \log p} / Γ ⌋}$ and $S_{2} = {(i, j) : 1 \leq i < j \leq p, i + 1 < j \leq i + ⌊ σ \sqrt{C \log p} / Γ ⌋}$ for some constant C > 0. Apparently we have the decomposition

\sum_{(i, j) : j > i + 1} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) = \sum_{(i, j) \in S_{1}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) + \sum_{(i, j) \in S_{2}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0)

(20)

For the first term, by construction, it can be shown using the same argument (see supplementary materials) in Theorem 1 that

\sum_{(i, j) \in S_{1}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq \frac{C | S_{1} |}{p^{c}} \leq \frac{C}{p^{c_{0}}} .

(21)

Now for the second term in (20), similar argument yields, for (i, j) ∈ S₂,

P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq \frac{c \exp (- | i - j |^{2} Γ^{2} / (2 σ^{2}))}{| i - j | Γ / σ + \sqrt{| i - j |^{2} Γ^{2} / σ^{2} + 8 / π}} + \frac{C}{p^{c}} .

Note that, on the one hand,

\frac{e^{- | i - j |^{2} Γ^{2} / (2 σ^{2})}}{| i - j | Γ / σ + \sqrt{| i - j |^{2} Γ^{2} / σ^{2} + 8 / π}} \leq \frac{σ e^{- | i - j |^{2} Γ^{2} / 2 σ^{2}}}{| i - j | Γ} .

We have

\sum_{(i, j) \in S_{2}} \frac{σ e^{- | i - j |^{2} Γ^{2} / 2 σ^{2}}}{| i - j | Γ} = \frac{σ}{Γ} \sum_{k = 2}^{p \land ⌊ σ \sqrt{\log p} / Γ ⌋} (\frac{e^{- k^{2} Γ^{2} / 2 σ^{2}} p}{k} - e^{- k^{2} Γ^{2} / 2 σ^{2}}) = T_{1} - T_{2} .
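The equality above rests on a counting step: for each gap k, exactly p − k pairs (i, j) with 1 ≤ i < j ≤ p satisfy j − i = k. Writing K for the upper summation limit in the display above (a shorthand introduced here only for brevity), the step can be spelled out as:

```latex
\sum_{(i,j) \in S_2} \frac{\sigma e^{-|i-j|^2 \Gamma^2 / 2\sigma^2}}{|i-j| \Gamma}
  = \frac{\sigma}{\Gamma} \sum_{k=2}^{K} (p - k) \frac{e^{-k^2 \Gamma^2 / 2\sigma^2}}{k}
  = \underbrace{\frac{\sigma p}{\Gamma} \sum_{k=2}^{K} \frac{e^{-k^2 \Gamma^2 / 2\sigma^2}}{k}}_{T_1}
  - \underbrace{\frac{\sigma}{\Gamma} \sum_{k=2}^{K} e^{-k^2 \Gamma^2 / 2\sigma^2}}_{T_2}.
```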

For the rest of the proof, we assume $Γ \leq σ \sqrt{\log p} / 2$ , otherwise the set $S_{2}$ will vanish. Then

T_{1} = \frac{σ p}{Γ} \sum_{k = 2}^{p \land ⌊ σ \sqrt{\log p} / Γ ⌋} \frac{e^{- k^{2} Γ^{2} / 2 σ^{2}}}{k} \leq \frac{σ p}{Γ} \int_{1}^{p \land σ \sqrt{\log p} / Γ} \frac{e^{- x^{2} Γ^{2} / 2 σ^{2}}}{x} d x

where the last inequality used monotonicity of the integrand. The integral in the last inequality, after change of variable, can be bounded by an exponential integral Ei(Γ² / 2σ² ), which has an upper bound

\int_{1}^{p \land σ \sqrt{\log p} / Γ} \frac{e^{- x^{2} Γ^{2} / 2 σ^{2}}}{x} d x = \frac{1}{2} \int_{Γ^{2} / 2 σ^{2}}^{(p^{2} Γ^{2} / 2 σ^{2}) \land (\log p / 2)} \frac{e^{- t}}{t} d t \leq \frac{1}{2} \int_{Γ^{2} / 2 σ^{2}}^{\infty} \frac{e^{- t}}{t} d t \leq e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})

so that $T_{1} \leq \frac{σ p}{Γ} e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}})$ . For T₂, we have $T_{2} = \frac{σ}{Γ} \sum_{k = 2}^{p \land ⌊ σ \sqrt{\log p} / Γ ⌋} e^{- k^{2} Γ^{2} / 2 σ^{2}} \geq \frac{σ}{Γ} e^{- 2 Γ^{2} / σ^{2}}$ . Therefore,

\sum_{(i, j) \in S_{2}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq \frac{C σ p}{Γ} e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}}) + \frac{C}{p^{c}} .

(22)
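The exponential-integral bound used for T₁ above can be checked numerically. The following is a minimal sketch, not part of the original proof: the helper names `exp_integral_e1` and `upper_bound` are hypothetical, the integral is approximated by Simpson's rule after the substitution t = z·e^s, and z stands for Γ² / 2σ².

```python
import math


def exp_integral_e1(z, n=4000):
    """Approximate E_1(z) = int_z^inf e^{-t} / t dt for z > 0.

    Substituting t = z * e^s gives int_0^inf exp(-z * e^s) ds, a smooth
    integrand that decays double-exponentially. We truncate where z * e^s
    reaches about 60 and apply composite Simpson's rule (n must be even).
    """
    s_max = max(1.0, math.log(60.0 / z))
    h = s_max / n

    def f(s):
        return math.exp(-z * math.exp(s))

    total = f(0.0) + f(s_max)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0


def upper_bound(z):
    """Right-hand side of the bound: e^{-z} * log(1 + 1/z)."""
    return math.exp(-z) * math.log(1.0 + 1.0 / z)


if __name__ == "__main__":
    for ratio in (0.1, 0.5, 1.0, 2.0):  # ratio plays the role of Gamma / sigma
        z = ratio ** 2 / 2.0
        print(ratio, 0.5 * exp_integral_e1(z), upper_bound(z))
```

For each ratio the second printed value (half the exponential integral) should not exceed the third (the closed-form bound), which is the inequality used to control T₁.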

On the other hand, note that

\frac{e^{- | i - j |^{2} Γ^{2} / (2 σ^{2})}}{| i - j | Γ / σ + \sqrt{| i - j |^{2} Γ^{2} / σ^{2} + 8 / π}} \leq c e^{- | i - j |^{2} Γ^{2} / (2 σ^{2})} .

We have

\sum_{(i, j) \in S_{2}} e^{- | i - j |^{2} Γ^{2} / (2 σ^{2})} = \sum_{k = 2}^{p \land ⌊ σ \sqrt{\log p} / Γ ⌋} p e^{- k^{2} Γ^{2} / (2 σ^{2})} - \sum_{k = 2}^{p \land ⌊ σ \sqrt{\log p} / Γ ⌋} k e^{- k^{2} Γ^{2} / (2 σ^{2})} \leq p \int_{1}^{\infty} e^{- k^{2} Γ^{2} / 2 σ^{2}} d k - 2 e^{- 2 Γ^{2} / σ^{2}} \leq C p σ / Γ .

Thus

\sum_{(i, j) \in S_{2}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq C p σ / Γ + \frac{C^{'}}{p^{c}} .

(23)
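For completeness, the Gaussian-integral step behind (23) can be written out. The explicit constant $\sqrt{π / 2}$ below is our own bookkeeping and is absorbed into C in the text:

\int_{1}^{\infty} e^{- x^{2} Γ^{2} / 2 σ^{2}} d x \leq \int_{0}^{\infty} e^{- x^{2} Γ^{2} / 2 σ^{2}} d x = \frac{σ}{Γ} \sqrt{\frac{π}{2}}, \qquad \text{so that} \qquad p \int_{1}^{\infty} e^{- x^{2} Γ^{2} / 2 σ^{2}} d x \leq \sqrt{\frac{π}{2}} \cdot \frac{p σ}{Γ} .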

Combining (22) and (23), we have

\sum_{(i, j) \in S_{2}} P (\sum_{k = 1}^{n} {\hat{w}}_{k} (X_{k i} - X_{k j}) \geq 0) \leq C \frac{p σ}{Γ} \min \{ 1, e^{- Γ^{2} / 2 σ^{2}} \log (1 + \frac{2 σ^{2}}{Γ^{2}}) \} + \frac{C}{p^{c}} .

(24)

Combining (21) and (24), we have (19). □

Proof of Theorem 3 and Theorem 4

Here we only provide a sketch of the proofs. We refer the readers to our Supplementary Material for detailed proofs. The proofs follow essentially from the same argument as the proofs of Theorem 1 and Theorem 2, respectively. However, in place of Lemma 2 used therein, we need the following lemma that provides a perturbation bound for the leading eigenvector of approximate rank-one matrices, which could be of independent interest.

Lemma 3. Suppose p≳n and $λ_{1}^{2} (Θ^{'}) \geq λ_{2}^{2} (Θ^{'}) + C σ^{2} (n + \sqrt{n p})$ for some C > 0. Let $w = u_{1}^{'}$ be the first left singular vector of Θ′. Then

P (| 1 - {({\hat{w}}^{⊤} w)}^{2} | \leq \frac{C σ^{2} (λ_{1}^{2} (Θ^{'}) + σ^{2} p) (n + \log p)}{{(λ_{1}^{2} (Θ^{'}) - λ_{2}^{2} (Θ^{'}))}^{2}}) \geq 1 - \frac{C}{p^{c}} .

The proof of Lemma 3 is nontrivial and relies on a combination of the generic perturbation bound obtained by Cai and Zhang (2018) and new concentration inequalities for approximate rank-one matrices (see Supplementary Materials).

Proof of Theorem 5

The proof relies on the following lemma adapted from Tsybakov (2009).

Lemma 4. Assume that for some integer M ≥ 2 there exist distinct parameters θ₀,..., θ_M from the parameter space Θ and mutually absolutely continuous probability measures P₀,...,P_M with $P_{j} = P_{θ_{j}}$ for j = 0,1,...,M, defined on a common probability space $(Ω, F)$ such that the averaged K-L divergence $\frac{1}{M} \sum_{j = 1}^{M} D (P_{j}, P_{0}) \leq \frac{1}{8} \log M$ . Then, for every measurable mapping $\hat{θ} : Ω \to Θ$ , $\max_{j = 0, \dots, M} P_{j} (\hat{θ} \neq θ_{j}) \geq \frac{\sqrt{M}}{\sqrt{M} + 1} (\frac{3}{4} - \frac{1}{2 \sqrt{\log M}})$ .

We construct a set of M + 1 = p hypotheses as follows. We define p permutations in $S_{p}$ as the identity plus (p − 1) adjacent transpositions, i.e., π₀ = id and π_k = (k, k + 1) for k = 1,..., p − 1. The signal matrix is Θ₀ = aη^⊤, where $a = {(1, \dots, 1)}^{⊤} \in ℝ^{n}$ , $η = {(0, δ, \dots, (p - 1) δ)}^{⊤} \in ℝ^{p}$ , and $δ = \frac{σ}{4} \sqrt{\log p / n}$ . In this way, we have $Γ = ‖ a ‖_{2} \cdot \min_{1 \leq j \leq p - 1} | η_{j} - η_{j + 1} | = \sqrt{n} δ = \frac{σ}{4} \sqrt{\log p}$ . Let P_k denote the joint probability measure of Y under (Θ₀, π_k) for k = 0,1,..., p − 1, and let p_k be the pdf of P_k. Then $p_{0} (x) = \prod_{i = 1}^{n} \prod_{j = 1}^{p} ϕ_{η_{j}} (x_{i j})$ and $p_{k} (x) = \prod_{i = 1}^{n} \prod_{j = 1}^{p} ϕ_{η_{π_{k} (j)}} (x_{i j})$ for k = 1,..., p − 1, where ϕ_μ is the pdf of the Gaussian distribution N(μ,σ²). Now we calculate the KL-divergence

D (P_{k}, P_{0}) = \int \log (\frac{p_{k} (x)}{p_{0} (x)}) p_{0} (x) d x = \int \frac{n}{2 σ^{2}} \sum_{j = 1}^{p} [{(x_{1 j} - η_{π_{k} (j)})}^{2} - {(x_{1 j} - η_{j})}^{2}] p_{0} (x) d x = \frac{n δ^{2}}{σ^{2}} = \frac{\log p}{16} .
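The value log p / 16 can be reproduced directly from the construction. The sketch below is illustrative only: `kl_swap` is a hypothetical helper, and it relies on the standard fact that the KL divergence between equal-variance Gaussian product measures equals the sum of squared mean differences divided by 2σ².

```python
import math


def kl_swap(n, p, sigma, k):
    """KL divergence between P_k and P_0 for the adjacent swap pi_k = (k, k+1).

    Uses the closed form for equal-variance Gaussians: the sum of squared
    mean differences over all n * p entries, divided by 2 * sigma^2.
    """
    delta = (sigma / 4.0) * math.sqrt(math.log(p) / n)
    eta = [j * delta for j in range(p)]          # eta = (0, delta, ..., (p-1) delta)
    perm = list(range(p))
    perm[k - 1], perm[k] = perm[k], perm[k - 1]  # swap columns k and k+1 (1-based)
    squared_shift = sum((eta[perm[j]] - eta[j]) ** 2 for j in range(p))
    return n * squared_shift / (2.0 * sigma ** 2)  # n identical rows


if __name__ == "__main__":
    n, p, sigma = 7, 50, 1.3
    print(kl_swap(n, p, sigma, k=10), math.log(p) / 16.0)
```

Only the two swapped columns contribute, each with squared shift δ², which gives nδ² / σ² regardless of k.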

Then, for p ≥ 10, we have $\frac{1}{p - 1} \sum_{k = 1}^{p - 1} D (P_{k}, P_{0}) = \frac{\log p}{16} \leq \frac{1}{8} \log (p - 1)$ . It follows from Lemma 4 with M = p − 1 that $\inf_{\hat{π}} \sup_{(π, Θ) \in D_{1}} P (\hat{π} \neq π) \geq \inf_{\hat{π}} \max_{j = 0, \dots, p - 1} P_{j} (\hat{π} \neq π_{j}) \geq 0.3$ as long as p ≥ 10. In addition, $\inf_{\hat{π}} \sup_{(π, Θ) \in D_{1}^{'}} P (\hat{π} \neq π) \geq \inf_{\hat{π}} \sup_{(π, Θ) \in D_{1}} P (\hat{π} \neq π)$ as $D_{1} \subset D_{1}^{'}$ . □
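The constant 0.3 and the threshold p ≥ 10 can be checked by evaluating the lower bound of Lemma 4 at M = p − 1. The following sketch uses hypothetical helper names and simply evaluates the two expressions appearing in the proof.

```python
import math


def lemma4_lower_bound(M):
    """Evaluate sqrt(M) / (sqrt(M) + 1) * (3/4 - 1 / (2 * sqrt(log M)))."""
    root = math.sqrt(M)
    return root / (root + 1.0) * (0.75 - 1.0 / (2.0 * math.sqrt(math.log(M))))


def kl_condition_holds(p):
    """Check (log p) / 16 <= (1/8) * log(p - 1), the averaged-KL condition."""
    return math.log(p) / 16.0 <= math.log(p - 1) / 8.0


if __name__ == "__main__":
    for p in (9, 10, 100, 10000):
        print(p, kl_condition_holds(p), lemma4_lower_bound(p - 1))
```

Both factors of the bound are increasing in M, so once the bound reaches 0.3 at M = 9 it stays above 0.3 for all larger p.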

Proof of Theorem 6

The proof relies on the following lemma from Tsybakov (2009).

Lemma 5. Assume that M ≥ 2 and suppose that Θ contains elements θ₀, θ₁,..., θ_M such that: (i) d(θ_j, θ_k) ≥ 2s > 0 for any 0 ≤ j < k ≤ M; (ii) $P_{j} \ll P_{0}$ for any j = 1,..., M, and $\frac{1}{M} \sum_{j = 1}^{M} D (P_{j}, P_{0}) \leq α \log M$ with 0 < α < 1/8 and $P_{j} = P_{θ_{j}}$ for j = 0,1,...,M. Then

\inf_{\hat{θ}} \sup_{θ \in Θ} P_{θ} (d (\hat{θ}, θ) \geq s) \geq \frac{\sqrt{M}}{1 + \sqrt{M}} (1 - 2 α - \sqrt{\frac{2 α}{\log M}}) > 0 .

We also need the following sphere packing lemma proved by Mao, Weed, and Rigollet (2017), which is a direct consequence of the celebrated Varshamov-Gilbert bound.

Lemma 6. For any r < p / 2, there exists a subset $Q_{r}$ of $S_{p}$ such that (i) $\log | Q_{r} | \geq \frac{r}{5} \log (p / r)$ , (ii) for any elements $π_{1}, π_{2} \in Q_{r}$ , we have $\binom{p}{2} \cdot τ_{K} (π_{1}, π_{2}) \geq r$ , and (iii) for any $π \in Q_{r}$ , we have $‖ π - i d ‖_{2}^{2} \leq 2 r$ .
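To make the two quantities in Lemma 6 concrete, the sketch below computes them for small permutations. It assumes the usual definition of the normalized Kendall's tau distance as the fraction of pairs that the two permutations order differently (the paper's own definition is given in Section 2.2 and is not reproduced here); both function names are hypothetical.

```python
from itertools import combinations
from math import comb


def kendall_tau_distance(pi1, pi2):
    """Fraction of index pairs (i, j) that pi1 and pi2 order differently."""
    p = len(pi1)
    discordant = sum(
        1
        for i, j in combinations(range(p), 2)
        if (pi1[i] - pi1[j]) * (pi2[i] - pi2[j]) < 0
    )
    return discordant / comb(p, 2)


def squared_l2_to_identity(pi):
    """Squared Euclidean distance between pi and the identity permutation."""
    return sum((value - index) ** 2 for index, value in enumerate(pi))


if __name__ == "__main__":
    p = 6
    identity = list(range(p))
    swap = identity[:]
    swap[2], swap[3] = swap[3], swap[2]  # one adjacent transposition
    print(kendall_tau_distance(identity, swap) * comb(p, 2))
    print(squared_l2_to_identity(swap))
```

Under this definition a single adjacent transposition creates one discordant pair and moves two entries by one position each.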

For t / σ ≥ 2, we set $r = \frac{p σ}{t} e^{- t^{2} / 2 σ^{2}} < p / 2$ . Let π₀ = id and $π_{1}, \dots, π_{| Q_{r} |}$ be the elements of $Q_{r}$ . The signal matrix is Θ₀ = aη^⊤, where $a = {(1 / \sqrt{320 n}, \dots, 1 / \sqrt{320 n})}^{⊤} \in ℝ^{n}$ and $η = {(t, \dots, p t)}^{⊤} \in ℝ^{p}$ . Let P_k be the joint probability measure of Y under (Θ₀, π_k) for $k = 0, 1, \dots, | Q_{r} |$ , and let p_k be the pdf of P_k. By Lemma 6, the KL-divergence satisfies

D (P_{k}, P_{0}) = \int \log (\frac{p_{k} (x)}{p_{0} (x)}) p_{0} (x) d x = \frac{t^{2}}{320 σ^{2}} {‖ π_{k} - i d ‖}_{2}^{2} \leq \frac{p t}{160 σ} e^{- t^{2} / 2 σ^{2}} ,

and therefore

\frac{1}{| Q_{r} |} \sum_{k = 1}^{| Q_{r} |} D (P_{k}, P_{0}) \leq \frac{p t}{160 σ} e^{- t^{2} / 2 σ^{2}} \leq \frac{p σ}{80 t} e^{- t^{2} / 2 σ^{2}} \log (\frac{t}{σ} e^{t^{2} / 2 σ^{2}}) \leq \frac{1}{16} \log | Q_{r} | .
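The chain of inequalities above can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: `theorem6_chain` is a hypothetical helper, and since |Q_r| itself is not available in closed form, the last term uses only the lower bound log |Q_r| ≥ (r / 5) log (p / r) from Lemma 6.

```python
import math


def theorem6_chain(p, t, sigma):
    """Return (r, kl_upper, middle, packing_term) for the display above.

    kl_upper     = p t / (160 sigma) * exp(-t^2 / (2 sigma^2))
    middle       = p sigma / (80 t) * exp(-t^2 / (2 sigma^2)) * log((t / sigma) * exp(t^2 / (2 sigma^2)))
    packing_term = (1 / 16) * (r / 5) * log(p / r), a lower bound on (1 / 16) * log |Q_r|
    """
    x = t / sigma
    decay = math.exp(-x * x / 2.0)
    r = p * decay / x
    kl_upper = p * x * decay / 160.0
    middle = p * decay / (80.0 * x) * math.log(x / decay)
    packing_term = (r / 5.0) * math.log(p / r) / 16.0
    return r, kl_upper, middle, packing_term


if __name__ == "__main__":
    for ratio in (2.0, 3.0, 4.0):  # ratio plays the role of t / sigma
        print(ratio, theorem6_chain(p=1000, t=ratio, sigma=1.0))
```

With this choice of r one has p / r = (t / σ) e^{t² / 2σ²}, so the middle term and the packing term coincide, and the first inequality reduces to log (t / σ) ≥ 0.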

Without loss of generality, we assume $| Q_{r} | \geq 2$ . By Lemma 5, it then follows that,

\inf_{\hat{π}} \sup_{(Θ, π) \in D_{2} (t)} P (τ_{K} (\hat{π}, π) \geq \frac{σ}{2 p t} e^{- t^{2} / 2 σ^{2}}) \geq C_{1},

for some absolute constant C₁ > 0. By Markov’s inequality, we have

\inf_{\hat{π}} \sup_{(Θ, π) \in D_{2} (t)} E [τ_{K} (\hat{π}, π)] \geq \frac{σ}{2 p t} e^{- t^{2} / 2 σ^{2}} \inf_{\hat{π}} \sup_{(Θ, π) \in D_{2} (t)} P (τ_{K} (\hat{π}, π) \geq \frac{σ}{2 p t} e^{- t^{2} / 2 σ^{2}}) \geq \frac{C_{1} σ}{p t} e^{- t^{2} / 2 σ^{2}} .

The relationship $\inf_{\hat{π}} \sup_{(Θ, π) \in D_{2}^{'} (t)} E [τ_{K} (\hat{π}, π)] \geq \inf_{\hat{π}} \sup_{(Θ, π) \in D_{2} (t)} E [τ_{K} (\hat{π}, π)]$ follows from $D_{L} \subset D$ . The rate 1 / p² follows by setting $t = C_{2} σ \sqrt{\log p}$ for some C₂ > 0. □

Supplementary Material

Supp 1

NIHMS1578947-supplement-Supp_1.zip (3.4 MB, zip)

Supp 2

NIHMS1578947-supplement-Supp_2.pdf (5 MB, pdf)

ACKNOWLEDGEMENT

We would like to thank the Associate Editor and the anonymous referees for many helpful suggestions that significantly improved the paper. R. M. would also like to thank Rui Duan, Anru Zhang, Yuan Gao and Shulei Wang for stimulating discussions at various stages of this project.

Footnotes

SUPPLEMENTARY MATERIALS

In our online Supplementary Materials, we prove Theorems 3–4 and Propositions 1–4, as well as the technical lemmas. Some supplementary simulations, figures and tables are also included there.

References

Abel S, Zur Wiesch PA, Chang H-H, Davis BM, Lipsitch M, and Waldor MK (2015), “Sequence tag–based analysis of microbial population dynamics,” Nature Methods, 12, 223.
Abramowitz M, and Stegun IA (1965), Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, vol. 55, Courier Corporation.
Barg A, and Mazumdar A (2010), “Codes in permutations and error correction for rank modulation,” in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, IEEE, pp. 854–858.
Boulund F, Pereira MB, Jonsson V, and Kristiansson E (2018), “Computational and statistical considerations in the analysis of metagenomic data,” in Metagenomics, Elsevier, pp. 81–102.
Bremer H, and Churchward G (1977), “An examination of the Cooper-Helmstetter theory of DNA replication in bacteria and its underlying assumptions,” Journal of Theoretical Biology, 69, 645–654.
Brown CT, Olm MR, Thomas BC, and Banfield JF (2016), “Measurement of bacterial replication rates in microbial communities,” Nature Biotechnology, 34, 1256.
Cai TT, Ma Z, and Wu Y (2013), “Sparse PCA: Optimal rates and adaptive estimation,” The Annals of Statistics, 41, 3074–3110.
Cai TT, and Zhang A (2018), “Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics,” The Annals of Statistics, 46, 60–89.
Chatterjee S, Guntuboyina A, and Sen B (2015), “On risk bounds in isotonic and other shape restricted regression problems,” The Annals of Statistics, 43, 1774–1800.
——— (2018), “On matrix estimation under monotonicity constraints,” Bernoulli, 24, 1072–1100.
Collier O, and Dalalyan AS (2016), “Minimax rates in permutation estimation for feature matching,” The Journal of Machine Learning Research, 17, 162–192.
Cooper S, and Helmstetter CE (1968), “Chromosome replication and the division cycle of Escherichia coli B/r,” Journal of Molecular Biology, 31, 519–540.
Cullina D, and Kiyavash N (2016), “Improved achievability and converse bounds for Erdös-Renyi graph matching,” in ACM SIGMETRICS Performance Evaluation Review, ACM, vol. 44, pp. 63–72.
Currie RR, and Pandher GS (2011), “Finance journal rankings and tiers: An active scholar assessment methodology,” Journal of Banking & Finance, 35, 7–20.
Deshpande SK, and Jensen ST (2016), “Estimating an NBA player’s impact on his team’s chances of winning,” Journal of Quantitative Analysis in Sports, 12, 51–72.
Diaconis P (1988), Group Representations in Probability and Statistics, Institute of Mathematical Statistics Lecture Notes–Monograph Series (11).
Diaconis P, and Graham RL (1977), “Spearman’s footrule as a measure of disarray,” Journal of the Royal Statistical Society. Series B (Methodological), 262–268.
Flammarion N, Mao C, and Rigollet P (2019), “Optimal rates of statistical seriation,” Bernoulli, 25, 623–653.
Gao F, Luo H, and Zhang C-T (2013), “DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes,” Nucleic Acids Research, 41, D90.
Gao Y, and Li H (2018), “Quantifying and comparing bacterial growth dynamics in multiple metagenomic samples,” Nature Methods, 15, 1041–1044.
Göloğlu F, Lember J, Riet A-E, and Skachek V (2015), “New bounds for permutation codes in Ulam metric,” in Information Theory (ISIT), 2015 IEEE International Symposium on, IEEE, pp. 1726–1730.
Kendall MG (1938), “A new measure of rank correlation,” Biometrika, 30, 81–93.
Koltchinskii V, and Xia D (2016), “Perturbation of linear forms of singular vectors under gaussian noise,” in High Dimensional Probability VII, Springer, pp. 397–423.
Korem T, Zeevi D, Suez J, Weinberger A, Avnit-Sagi T, Pompan-Lotan M, Matot E, Jona G, Harmelin A, and Cohen N (2015), “Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples,” Science, aac4812.
Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, Bittinger K, Bailey A, Friedman ES, Hoffmann C, et al. (2015), “Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease,” Cell Host & Microbe, 18, 489–500.
Li D, Liu C, Luo R, Sadakane K, and Lam T (2015), “MEGAHIT: an ultrafast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph,” Bioinformatics, 15, 1674–1676.
Mao C, Weed J, and Rigollet P (2017), “Minimax rates and efficient algorithms for noisy sorting,” arXiv preprint arXiv:1710.10388.
Mazumdar A, Barg A, and Zemor G (2013), “Constructions of rank modulation codes,” IEEE Transactions on Information Theory, 59, 1018–1029.
Mukherjee S (2016), “Estimation in exponential families on permutations,” The Annals of Statistics, 44, 853–875.
Myhrvold C, Kotula JW, Hicks WM, Conway NJ, and Silver PA (2015), “A distributed cell division counter reveals growth dynamics in the gut microbiota,” Nature Communications, 6, 10039.
Pananjady A, Wainwright MJ, and Courtade TA (2016), “Linear regression with an unknown permutation: Statistical and computational limits,” arXiv preprint arXiv:1608.02902.
——— (2017), “Denoising linear models with permuted data,” in Information Theory (ISIT), 2017 IEEE International Symposium on, IEEE, pp. 446–450.
Rendle S, Balby Marinho L, Nanopoulos A, and Schmidt-Thieme L (2009), “Learning optimal ranking with tensor factorization for tag recommendation,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 727–736.
Rigollet P, and Weed J (2018), “Uncoupled isotonic regression via minimum Wasserstein deconvolution,” arXiv preprint arXiv:1806.10648.
Slawski M, and Ben-David E (2017), “Linear Regression with Sparsely Permuted Data,” arXiv preprint arXiv:1710.06030.
Tsybakov AB (2009), Introduction to Nonparametric Estimation, Springer Series in Statistics. Springer, New York.
von Meijenfeldt FB, Arkhipova K, Cambuy DD, Coutinho FH, and Dutilh BE (2019), “Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT,” bioRxiv, 530188.
Wu Y, Tang Y, Tringe S, Simmons B, and Singer S (2014), “MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm,” Microbiome, 2, 26.
Yuan X, and Zhang T (2013), “Truncated power method for sparse eigenvalue problems,” Journal of Machine Learning Research, 14, 899–925.
Zhang A, Cai TT, and Wu Y (2018), “Heteroskedastic PCA: Algorithm, optimality, and applications,” arXiv preprint arXiv:1810.08316.
