Biostatistics (Oxford, England). 2017 Jan 8;18(3):422–433. doi: 10.1093/biostatistics/kxw053

Structural zeros in high-dimensional data with applications to microbiome studies

Abhishek Kaul, Ori Davidov, Shyamal D Peddada
PMCID: PMC5862388  PMID: 28065879

Summary

This paper is motivated by the recent interest in the analysis of high-dimensional microbiome data. A key feature of these data is the presence of “structural zeros” which are microbes missing from an observation vector due to an underlying biological process and not due to error in measurement. Typical notions of missingness are unable to model these structural zeros. We define a general framework which allows for structural zeros in the model and propose methods of estimating sparse high-dimensional covariance and precision matrices under this setup. We establish error bounds in the spectral and Frobenius norms for the proposed estimators and empirically verify them with a simulation study. The proposed methodology is illustrated by applying it to the global gut microbiome data of Yatsunenko and others (2012. Human gut microbiome viewed across age and geography. Nature 486, 222–227). Using our methodology we classify subjects according to the geographical location on the basis of their gut microbiome.

Keywords: Classification, High dimension, Microbiome data, Missing data, Sparsity

1. Introduction

With the advancement of high-throughput technologies it is now common to encounter high-dimensional data where the number of parameters $d$ exceeds the sample size $n$. One of many such examples is the human microbiome data obtained by the 16S rRNA sequencing technology. The resulting data, known as operational taxonomic units (OTUs), represent counts of thousands of microbial taxa (Mandal and others, 2015). In this setting it is often of interest to investigate relationships among the microbes to understand their effects on health outcomes. These relationships can in turn be used to predict the health status of an individual based on his/her microbial composition.

Many such objectives can be achieved via the estimation of the covariance matrix $\Sigma$ or its inverse, the precision matrix $\Omega$, which characterize the dependence and the conditional dependence structure between variables, respectively. In the high-dimensional setting, estimation of $\Sigma$ and $\Omega$ has been discussed extensively, and the existing literature can be broadly classified into two categories. The first approach involves estimation of the precision matrix by exploiting its natural sparsity in comparison to the covariance matrix (Friedman and others, 2008; Rothman, Bickel and others, 2009; Cai and others, 2011). A limitation of this approach is that it does not apply to low-rank matrices $\Sigma$, since the precision matrix does not exist in this case. The second popular approach is to estimate $\Sigma$ by assuming that $\Sigma$ is itself sparse. One of several methods for this purpose is to threshold each element of the sample covariance matrix (Bickel and Levina, 2008; Rothman, Levina and others, 2009).

The current literature assumes the availability of independent and identically distributed (i.i.d.) copies of the vector Inline graphic whose distribution is Gaussian or more generally sub-Gaussian with Inline graphic and Inline graphic as the Inline graphic-dimensional mean vector and covariance matrix, respectively. Note that a real-valued random variable (r.v.) Inline graphic is said to be sub-Gaussian if there exists a Inline graphic such that for every Inline graphic one has Inline graphic

In contrast to typical high-dimensional data, not all variables (i.e. microbes) are observed in a sample of microbiome data. Thus if Inline graphic represents a vector of abundances of Inline graphic microbes in a specimen, then not all components of Inline graphic may be observed. We refer to this missingness as structural zeros; it is due to the underlying biology and not due to error in measurement or values below the minimum detection level. For example, it is known that the bacterial genus Bacteroides is prevalent in the human gut when the diet is high in protein and fat, whereas it may be absent otherwise, e.g. under a carbohydrate-rich diet. The total abundance of such bacteria is coded as 0 counts in the observational vector Inline graphic

The missing structure required to model structural zeros is more general than typical notions of missingness in the literature. More precisely, in the classical notions of missingness, such as missing completely at random or missing at random (MAR), it is assumed that in place of Inline graphic we observe a surrogate vector Inline graphic where Inline graphic represents a component-wise product and Inline graphic is a Inline graphic-dimensional vector of independent Bernoulli r.v.’s. In effect, not all components of Inline graphic are observed in Inline graphic. For example, Inline graphic corresponds to the case where the first two components of Inline graphic are not observed in Inline graphic with Inline graphic In this example, although Inline graphic and Inline graphic are absent in Inline graphic they still influence the distribution of the remaining components Inline graphic through the underlying dependence structure of Inline graphic and are only hidden by the corresponding multiplicative Bernoulli noise vector Inline graphic In contrast, for the case of structural zeros the observed vector itself is Inline graphic i.e. the first two components are truly absent from the observation and thus the missing components should not influence the distribution of the remaining components. It should also be noted from this example that in the latter case, implementing classical methods of imputation under such conditions would be a logical error due to the definition of a structural zero.

In this paper we introduce a framework which allows for structural zeros in the model and discuss consistent methods of estimating sparse high-dimensional covariance and precision matrices. We establish consistency in estimation of the proposed methodology and empirically support it with a simulation study. We also apply our methodology to classify observations to geographical locations based on the global human gut microbiome data of Yatsunenko and others (2012).

Some work related to ours in the literature is that of Kurland and Heagerty (2005), who provide a regression setting for the analysis of longitudinal data truncated by deaths. They make a similar distinction between zeros in the response variable due to an individual dropping out of the study and zeros due to death; in this context the zero due to death can be described as a structural zero in our definition. Estimation of covariance and precision matrices in the traditional missing values setting has also been discussed in the literature (Loh and Wainwright, 2012; Lounici, 2014; Kaul and others, 2016). As noted above and as shall become more apparent in the following sections, our model allows for a more general notion of missingness while assuming weaker conditions in comparison to typical notions of missingness.

2. Notations and framework

For any matrix Inline graphic define the Inline graphicInline graphic and Inline graphic norms as Inline graphicInline graphic and Inline graphic, respectively, where Inline graphic represents the usual Euclidean norm of a vector. We shall also require the matrix Inline graphic norm, Inline graphic and the elementwise Inline graphic norm, Inline graphic The notation Inline graphic indicates the matrix Inline graphic is positive definite and symbols Inline graphicInline graphic, and Inline graphic represent generic constants which may change according to the context but are independent of any model parameters. The notation Inline graphic represents that Inline graphic for some constant Inline graphic and Inline graphic large enough. For any set of indices Inline graphic its cardinality is denoted by Inline graphic and for a subset Inline graphic, Inline graphic denotes the vector of components of Inline graphic with indices in Inline graphic. Finally, we partition a Inline graphic matrix Inline graphic as follows:

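$$\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AA^c} \\ \Sigma_{A^cA} & \Sigma_{A^cA^c} \end{pmatrix}, \quad \text{where } A^c \text{ denotes the complement of } A. \tag{2.1}$$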

We begin by describing a framework that characterizes structural zeros. As stated in the previous section, these structural zeros represent components that are biologically absent in the specimen. Thus, the framework should allow for the distribution of the specimen to be determined by only the observed components. For this purpose, let $S$ be the sample space of possible configurations of missing components, defined as follows.

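$$S = \{(1,\ldots,1),\ (0,1,\ldots,1),\ (1,0,\ldots,1),\ \ldots,\ (1,\ldots,1,0),\ (0,0,1,\ldots,1),\ (0,1,0,\ldots,1),\ \ldots,\ (1,\ldots,0,0),\ \ldots,\ (0,0,\ldots,1),\ (0,0,\ldots,1,0),\ \ldots,\ (1,0,\ldots,0)\}. \tag{2.2}$$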

Here 0 and 1 correspond to the cases where a component is unobserved or observed in the sample, respectively. We represent each of the above $2^d - 1$ events of the sample space by Configuration (Config.) $(j)$, $1 \le j \le 2^d - 1$, in the order written in (2.2). For example, Config. $(1)$ is the case where all components are observed and Config. $(2^d - 1)$ corresponds to the configuration where only the first component is observed. Assume that for the $i$th sample, the missing structure is generated by independent r.v.’s $M_i$, $1 \le i \le n$, with sample space described in (2.2).

In many applications, including the analysis of microbiome data, it may be unreasonable to assume that the missingness is generated by identically distributed r.v.’s since this distribution function may be influenced by factors or covariates such as geographical location, age, race, and gender of the subject. We allow for this flexibility by defining the distribution of the r.v.’s Inline graphicInline graphic as follows,

$$P(M_i \text{ is in Config. } (j)) = \delta_i^{(j)}, \quad 0 \le \delta_i^{(j)} \le 1, \quad 1 \le j \le 2^d - 1. \tag{2.3}$$

Under this general definition each sample may have a different probability of each configuration. This feature of allowing the missingness to be acting independently but not identically is reminiscent of the MAR structure. We now proceed to define the conditional distribution of the observed components of a specimen.

Let Inline graphic and Inline graphic be a Inline graphic-dimensional vector and positive definite matrix, respectively. For the Inline graphicth subject, with missing configuration given by the r.v. Inline graphic, we denote the observed components by the index set

$$A_i = \{j : M_{ij} = 1\}. \tag{2.4}$$

Note that the index set Inline graphic is a random set which is determined by the r.v. Inline graphic Now assume that conditioned on Inline graphic the components of Inline graphic with indices in the index set Inline graphic jointly follow a Gaussian distribution with mean and covariance being the corresponding sub-vector of Inline graphic and sub-matrix of Inline graphic, respectively, i.e. for any Inline graphic

$$P(X_{A_i} \le x_{A_i} \mid M_i) = \Phi_{A_i}(x_{A_i}), \tag{2.5}$$

where Inline graphic represents the Gaussian distribution function with mean Inline graphic and covariance matrix Inline graphic For example let Inline graphic, then the observed vector is Inline graphic with the conditional distribution of the observed components as Inline graphic This completes the description of the distributional structure assumed in this paper. Under this definition, the zero components of an observation do not influence the distribution of the nonzero components, thus characterizing what we refer to as structural zeros.
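The distributional structure of (2.2)–(2.5) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `sample_structural_zeros` and the parameter `presence_prob` are hypothetical, and the missing configuration is assumed here to be generated by independent Bernoulli draws per component, which is only one special case of (2.3).

```python
import numpy as np

def sample_structural_zeros(n, mu, sigma, presence_prob, rng):
    """Draw n observations; absent components are stored as exact zeros."""
    d = len(mu)
    X = np.zeros((n, d))
    M = np.zeros((n, d), dtype=bool)
    for i in range(n):
        # Draw the configuration M_i (assumption: independent Bernoulli
        # components). The all-zero pattern is not in the sample space
        # (2.2), so it is redrawn.
        mask = rng.random(d) < presence_prob
        while not mask.any():
            mask = rng.random(d) < presence_prob
        A = np.flatnonzero(mask)  # index set A_i of (2.4)
        # Conditional on M_i, the observed block is Gaussian with the
        # corresponding sub-vector and sub-matrix, as in (2.5).
        X[i, A] = rng.multivariate_normal(mu[A], sigma[np.ix_(A, A)])
        M[i] = mask
    return X, M
```

Note that the absent components never enter the Gaussian draw, which is the sense in which the zeros do not influence the distribution of the nonzero components.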

To proceed further we shall require the following definitions. For all Inline graphic let Inline graphic and Inline graphic be the number of subjects where Inline graphicth component is observed and the number of subjects where both Inline graphicth and Inline graphicth components are observed, respectively. Note that these are random quantities determined by the random set Inline graphic which in turn is determined by the r.v.’s Inline graphicInline graphic Also define for each observation Inline graphic, and the indices Inline graphic the collections,

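$$C_i(l) = \{1 \le j \le 2^d - 1 : \text{component } l \text{ is present in Config. } (j) \text{ in r.v. } M_i\},$$
$$C_i(l,m) = \{1 \le j \le 2^d - 1 : \text{components } l \text{ and } m \text{ are present in Config. } (j) \text{ in r.v. } M_i\}.$$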

In the sequel we make the following additional assumption over the missing structure.

  • (A1) There exists a constant Inline graphic such that for any Inline graphic
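    $$\text{(i)}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{j \in C_i(l)} \delta_i^{(j)} = \delta^{(l)} > \delta_{\min} \quad \text{and} \quad \text{(ii)}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{j \in C_i(l,m)} \delta_i^{(j)} = \delta^{(l,m)} > \delta_{\min}.$$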

The condition (A1) is a mild assumption on the missing structure. If the r.v.’s Inline graphicInline graphic are assumed to be i.i.d., then (A1)(i) requires that each component is present in an observation with a nonzero probability, i.e. Inline graphic and (A1) (ii) requires that every pair of components are present in each observation vector with a nonzero probability, i.e. Inline graphic
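Condition (A1) is stated in terms of configuration probabilities, which are not observed. As a rough diagnostic, one may look at the observed fractions of samples in which a component, or a pair of components, is present; their expectations are the quantities $\delta^{(l)}$ and $\delta^{(l,m)}$ of (A1). The sketch below assumes a hypothetical 0/1 indicator matrix `M` with one row per sample, and the function name `presence_fractions` is illustrative.

```python
import numpy as np

def presence_fractions(M):
    """Fractions of samples where each component / pair is observed."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    single = M.sum(axis=0) / n   # |n(l)| / n for each component l
    pair = (M.T @ M) / n         # |n(l, m)| / n for each pair (l, m)
    return single, pair
```

A pair fraction equal to zero indicates that the corresponding covariance entry cannot be estimated from co-observed samples.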

3. Methodology

In this section we discuss methodologies to estimate the covariance and precision matrices. First we provide an $\ell_1$ minimization approach to estimate the precision matrix $\Omega$. Then we describe a generalized thresholding procedure to estimate the covariance matrix $\Sigma$. We shall provide error bounds for estimates obtained by these methods that hold with asymptotic probability 1, unconditional on the missing structure. These error bounds shall also allow for the dimension $d$ to increase exponentially with the sample size, thus allowing for high dimensions. To describe our methodology we require the following definitions. For each Inline graphic let

$$\hat{\mu}_l = \frac{1}{|n(l)|}\sum_{i \in n(l)} X_{il}, \quad 1 \le l \le d \tag{3.1}$$

and define a “re-normalized sample covariance” matrix Inline graphic as follows,

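$$\hat{\sigma}_{lm} = \sum_{i \in n(l,m)} (X_{il} - \hat{\mu}_l)(X_{im} - \hat{\mu}_m) \big/ |n(l,m)| \quad \text{and} \quad \hat{\Sigma} = [\hat{\sigma}_{lm}]_{l,m=1,\ldots,d}. \tag{3.2}$$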
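The estimators (3.1) and (3.2) can be computed with two matrix products. The following is a minimal sketch, assuming a data matrix `X` with zeros at absent components and a 0/1 indicator matrix `M` of the same shape (both names are illustrative). It further assumes every component is observed at least once, and it sets the entry to zero for pairs that are never observed together, which is the convention the analysis of Section 5.1 adopts.

```python
import numpy as np

def renormalized_covariance(X, M):
    """Re-normalized sample mean (3.1) and covariance (3.2)."""
    M = np.asarray(M, dtype=float)
    n_l = M.sum(axis=0)                  # |n(l)|, assumed positive
    n_lm = M.T @ M                       # |n(l, m)|
    mu_hat = (X * M).sum(axis=0) / n_l   # average over observed samples only
    # Centre, then zero out absent components so they drop out of the sums.
    centered = (X - mu_hat) * M
    cross = centered.T @ centered        # sums over i in n(l, m)
    sigma_hat = np.divide(cross, n_lm, out=np.zeros_like(cross), where=n_lm > 0)
    return mu_hat, sigma_hat
```

Each entry is normalized by its own count of co-observed samples rather than by $n$, which is what distinguishes this estimate from the usual sample covariance computed on zero-filled data.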

The matrix Inline graphic forms an initial estimator for obtaining consistent estimates of the covariance matrix and the precision matrices in the high-dimensional setting. The following lemma provides an approximation result between Inline graphic and Inline graphic in the Inline graphic norm and shall be key to providing convergence rates of the estimators to follow later in this section.

Lemma 3.1.

Suppose the observations Inline graphicInline graphic follow the distribution (2.5) and that the missing structure satisfies condition (A1). In addition assume that the variance components of the covariance matrix Inline graphic are bounded above, i.e. Inline graphicInline graphic for a constant Inline graphic Then with probability at least Inline graphic

$$\|\hat{\Sigma} - \Sigma\|_{\infty} \le c_0 \frac{\sigma_x^2}{\delta_{\min}} \sqrt{\frac{\log d}{n}}, \tag{3.3}$$

for some universal constant Inline graphic

To appreciate this result note that the re-normalized sample covariance matrix Inline graphic is defined through the r.v.’s Inline graphicInline graphic whose distribution is in turn defined conditionally of the missing structure Inline graphic However, Lemma 3.1 provides an unconditional probability bound on the desired random quantity with only a mild assumption (A1) on the missing structure. The proof of this result relies on the observation that Inline graphicInline graphic is a sum of independent r.v.’s, i.e. Inline graphic where Inline graphic for every Inline graphic Here Inline graphic represents the indicator function. This observation allows the applicability of the Hoeffding’s inequality (Hoeffding, 1963) in combination with conditional expectation arguments. The details of the proof are provided in the supplementary material available at Biostatistics online. To proceed with the estimation of Inline graphic and Inline graphic we require these matrices to belong to the following class of approximately sparse matrices.

  • (A2) Assume that the covariance and precision matrices belong to the following classes of matrices, respectively: define for Inline graphic
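    $$\text{(i)}\ \mathcal{M}(q, s_0(d), K) = \Big\{\Sigma : \sigma_{ii} \le K,\ \max_{1 \le i \le d} \sum_{j=1}^{d} |\sigma_{ij}|^q \le s_0(d)\Big\} \quad \text{and}$$
    $$\text{(ii)}\ \mathcal{U}(q, s_0(d), K) = \Big\{\Omega : \Omega \succ 0,\ \|\Omega\|_{L_1} \le K,\ \max_{1 \le i \le d} \sum_{j=1}^{d} |\omega_{ij}|^q \le s_0(d)\Big\}.$$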

The quantity $s_0(d)$ is allowed to depend on $d$ and thus is not an explicit restriction on sparsity. Two examples of matrices which satisfy these restrictions are as follows. First, a p-diagonal matrix satisfies this condition with any Inline graphic and Inline graphic Second, an Inline graphic covariance matrix where Inline graphic satisfies the restriction with Inline graphic for some constant Inline graphic

3.1 Estimation of the precision matrix Inline graphic

The problem of estimation of the precision matrix $\Omega$ has received wide attention in the literature. Several solutions have been proposed in the context of i.i.d. sub-Gaussian observations, and one such solution is the penalized likelihood method, which exploits the i.i.d. Gaussian structure of observations (Friedman and others, 2008; Ravikumar and others, 2011). In our setup, however, the observations Inline graphicInline graphic are no longer identically distributed, and hence such likelihood-based approaches are no longer feasible. This problem can be overcome by adapting penalized moment-based approaches, such as the “Clime” method of Cai and others (2011), to our setting. Such moment-based approaches to the estimation of high-dimensional precision matrices do not rely on an explicit likelihood function; instead they only require probabilistic bounds on the quantity Inline graphic, and thus Lemma 3.1 forms the connecting link between such approaches and the distributional structure of Section 2 without requiring the observations Inline graphicInline graphic to be identically distributed.

Let Inline graphic be the solution of the following convex program,

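$$\min \|\Omega\|_1 \quad \text{subject to} \quad |\hat{\Sigma}\Omega - I|_{\infty} \le \lambda_{\Omega}, \quad \Omega \in \mathbb{R}^{d \times d}, \tag{3.4}$$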

with a suitable choice of Inline graphic Here Inline graphic represents the identity matrix and Inline graphic as defined in (3.2). Since the solution Inline graphic may not be symmetric in general, the final estimate Inline graphic is obtained by symmetrizing Inline graphic by choosing the smaller of Inline graphic and Inline graphic in the final estimate Inline graphic i.e. Inline graphic with Inline graphic

The estimator Inline graphic is not guaranteed to be positive definite; however, the following theorem shows that it converges to a positive definite limit with asymptotic probability 1. To ensure positive definiteness, in practice one may add a small constant to the diagonal elements of this matrix, or project the matrix onto the cone of positive definite matrices.

Theorem 3.1.

Suppose Inline graphicInline graphic follow the distribution (2.5) and that the missing structure satisfies condition (A1). Also assume that Inline graphic and the regularizer is chosen Inline graphic then the following bounds hold with probability at least Inline graphic

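$$\text{(i)}\ \|\hat{\Omega} - \Omega\|_{\infty} = O\Big[\sqrt{\tfrac{\log d}{n}}\Big], \quad \text{(ii)}\ \|\hat{\Omega} - \Omega\|_{2} = O\Big[s_0(d)\Big(\sqrt{\tfrac{\log d}{n}}\Big)^{1-q}\Big], \quad \text{and} \quad \text{(iii)}\ \frac{1}{d}\|\hat{\Omega} - \Omega\|_{F}^{2} = O\Big[s_0(d)\Big(\sqrt{\tfrac{\log d}{n}}\Big)^{2-q}\Big].$$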

This methodology was introduced by Cai and others (2011) under the standard i.i.d. Gaussian setup, where it is implemented using the sample covariance matrix Inline graphic as the initial estimate. The error bounds of Theorem 3.1 follow from the deterministic arguments of Cai and others (2011) on the event where the inequality (3.3) holds; a sketch of this proof is provided in the supplementary material available at Biostatistics online.
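The convex program (3.4) and the symmetrization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it relies on the column-by-column decomposition of the program noted by Cai and others (2011), casts each column as a linear program by splitting it into positive and negative parts, and solves it with `scipy.optimize.linprog`. The function names `clime_raw` and `symmetrize` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def clime_raw(sigma_hat, lam):
    """Column-by-column solution of (3.4); the result may be non-symmetric."""
    d = sigma_hat.shape[0]
    omega = np.zeros((d, d))
    # Write each column as beta = u - v with u, v >= 0, so that the
    # l1 norm of beta is sum(u + v) at the optimum.
    cost = np.ones(2 * d)
    A_ub = np.block([[sigma_hat, -sigma_hat], [-sigma_hat, sigma_hat]])
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0
        # Encodes -lam <= sigma_hat @ beta - e_j <= lam elementwise.
        b_ub = np.concatenate([lam + e_j, lam - e_j])
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
        if not res.success:
            raise RuntimeError(f"LP for column {j} failed: {res.message}")
        omega[:, j] = res.x[:d] - res.x[d:]
    return omega

def symmetrize(omega):
    """For each pair (i, j), keep the entry of smaller magnitude."""
    return np.where(np.abs(omega) <= np.abs(omega.T), omega, omega.T)
```

In the setting of this paper, `sigma_hat` would be the re-normalized estimate of (3.2); passing the usual sample covariance instead corresponds to the standard procedure that ignores structural zeros.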

3.2 Estimation of the covariance matrix Inline graphic

In this sub-section we discuss thresholding methods of estimating the covariance matrix Inline graphic in our setting with structural zeros. These methods were first proposed in the standard i.i.d. setting by Wu and Pourahmadi (2003) and have also been studied by Bickel and Levina (2008) and Levina and others (2008) amongst several others. Although assuming sparsity of the covariance matrix is a stronger assumption than assuming the same of the precision matrix, thresholding carries very little computational burden in comparison to penalized methods such as the one described in the previous sub-section. Thus this procedure forms an attractive alternative to estimating the precision matrix, especially in very high-dimensional problems and real-time applications. We adopt this methodology in our setup with structural zeros as follows.

Let Inline graphic be a generalized thresholding operator as defined by Rothman, Levina and others (2009). We restate the definition for the convenience of readers. A function Inline graphic satisfying

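$$\text{(i)}\ |s_{\lambda}(x)| \le |x|, \quad \text{(ii)}\ s_{\lambda}(x) = 0 \ \text{for}\ |x| \le \lambda, \quad \text{and} \quad \text{(iii)}\ |s_{\lambda}(x) - x| \le \lambda \tag{3.5}$$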

is said to be a generalized thresholding operator. In view of this definition, the covariance matrix Inline graphic can be estimated by elementwise thresholding as, Inline graphic The two common examples of these operators are the hard- and soft-thresholding operators defined as,

$$s_{\lambda}^{h}(x) = x\,\mathbf{1}(|x| > \lambda), \qquad s_{\lambda}^{s}(x) = \operatorname{sign}(x)\,(|x| - \lambda)_{+}, \tag{3.6}$$

respectively. The soft-thresholding operator can also be defined as Inline graphic and has been studied by various authors, including Donoho and others (1995) and Tibshirani (1996). The hard-thresholding operator has been investigated by Bickel and Levina (2008) among others. Additional examples of thresholding operators include the SCAD of Fan and Li (2001) and the adaptive Lasso of Zou (2006). The following result provides the consistency of this estimator.

Theorem 3.2.

Suppose Inline graphicInline graphic follow the distribution (2.5) and that the missing structure satisfies condition (A1). Also assume that Inline graphic satisfies condition (3.5). Then uniformly on Inline graphic choosing the regularizer Inline graphic we obtain

$$\|s_{\lambda}(\hat{\Sigma}) - \Sigma\|_{2} = O\Big[s_0(d)\Big(\sqrt{\tfrac{\log d}{n}}\Big)^{1-q}\Big], \tag{3.7}$$

with probability at least Inline graphic

In the standard i.i.d. setting, Rothman, Levina and others (2009) introduced this generalized thresholding methodology based on usual sample covariance matrix Inline graphic The proof of this theorem relies on deterministic arguments of Rothman, Levina and others (2009) on the set where the bound provided in Lemma 3.1 holds and a sketch is provided in the supplementary material available at Biostatistics online.
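The hard- and soft-thresholding operators of (3.6), applied elementwise to an initial covariance estimate, can be written directly with NumPy. This is a minimal sketch; the function names are illustrative, and every entry of the input matrix is thresholded, the diagonal included.

```python
import numpy as np

def hard_threshold(S, lam):
    """Elementwise hard thresholding: keep entries with |x| > lam."""
    return S * (np.abs(S) > lam)

def soft_threshold(S, lam):
    """Elementwise soft thresholding: shrink every entry toward zero by lam."""
    return np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
```

Both functions satisfy the three conditions of (3.5): the output is no larger in magnitude than the input, it vanishes when the input magnitude is at most the threshold, and it differs from the input by at most the threshold.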

4. Simulation study

In this section we numerically illustrate that the methodology of Section 3 provides consistent estimates of $\Sigma$ and $\Omega$. We confirm that our method provides a significant improvement over the typical method of using the sample covariance Inline graphic as the initial estimate in the methods of Sections 3.1 and 3.2. Note that using the sample covariance matrix ignores the presence of these structural zeros.

4.1 Simulation setup and results

The missingness is generated by r.v.’s Inline graphic where each Inline graphicInline graphicInline graphic Here Inline graphic denotes the probability of the Inline graphicth component being a structural zero. These Inline graphic’s Inline graphic are chosen uniformly between Inline graphic. Finally, for each Inline graphic the components of Inline graphic with indices in the set Inline graphic defined in (2.4) are assumed to be zero-mean Gaussian r.v.’s with the covariance being the corresponding sub-block of the matrix Inline graphic

We set the covariance matrix as Inline graphic where Inline graphic and all entries Inline graphic 0.01 are set to zero. Under this setting we estimate $\Sigma$ as described in Section 3.2 by the hard- and soft-thresholding operations of (3.6) on the initial re-normalized (RN) estimate (3.2). We shall refer to these estimates as “RN-hard” and “RN-soft,” respectively. Similarly, we estimate $\Omega$ as described in Section 3.1 and refer to the estimates obtained as “RN-clime.” We also illustrate that our method provides a significant improvement over the standard method of using the sample covariance Inline graphic as the initial estimate for the hard/soft thresholding and clime procedures. We refer to these as “S-hard,” “S-soft,” and “S-clime,” respectively.

We repeat simulations on Inline graphic independent data sets that are generated under the following settings. (i) Inline graphic is fixed at Inline graphic and Inline graphic is allowed to vary from Inline graphic to Inline graphic with increments of Inline graphic leading to 101 independent data sets; and (ii) Inline graphic is allowed to vary from Inline graphic to Inline graphic with increments of Inline graphic and Inline graphic is allowed to vary from Inline graphic to Inline graphic with increments of Inline graphic leading to 112 independent data sets.

Figures 1 and 2 illustrate the results of our simulation. Here each dot (triangle) represents Frobenius norm of estimation error for each independent data set. To measure the average performance over the simulated models, nonparametric regression lines are fit via the Loess method with its smoothing parameter set at Inline graphic The solid line represents the average performance of our methods and the dotted line represents the average performance of the standard method which ignores the presence of structural zeros. For the soft-/hard-thresholding case, the tuning parameter Inline graphic is chosen via cross-validation under the Frobenius loss, and for the “clime” method, the tuning parameter is chosen by cross-validating with the loss function Inline graphic

Fig. 1.

Comparisons of Frobenius norms of error in estimation due to soft (left) and hard (right) thresholding, x-axis: n, y-axis: Inline graphic results at d=75.

Fig. 2.

Left panel: Comparison of Frobenius norms of error in estimation due to soft thresholding the renormalized sample covariance at scaled sample size, x-axis: Inline graphicInline graphicInline graphic; y-axis: Inline graphic Right panel: Comparison of spectral norms of error in estimation due to Clime procedure on the renormalized sample covariance, Inline graphic and Clime procedure on the standard sample covariance Inline graphic Results at Inline graphic.

Results of thresholding procedures are provided in Figure 1. This figure plots Inline graphic for different sample sizes Inline graphic for the case Inline graphic From this figure, it is clear that the error corresponding to the proposed estimator under both soft and hard thresholding decreases with sample size faster than that of the standard estimator, which ignores the structural zeros. The left panel of Figure 2 plots the errors for different values of scaled sample sizes Inline graphic for the soft-thresholding operator. This figure seems to confirm the result of Theorem 3.2 regarding the rate of convergence of the estimator. Similarly, the right panel of Figure 2 describes the results of the simulation for the “clime” methodology described in Section 3.1.

5. Analysis of global human gut microbiome data

In this section we analyze the global human gut microbiome data of Yatsunenko and others (2012) available at the repository MG-RAST (http://metagenomics.anl.gov/) under accession numbers qiime:621 for fecal microbiome shotgun sequencing data sets. Here we estimate the precision matrix Inline graphic and use this estimate to classify observations according to geographical locations. Further details on the availability of data have been provided in the supplementary material available at Biostatistics online.

The data consist of microbial OTU counts obtained from individuals of the United States (US), Venezuela (VE), and Malawi (MA). Our analysis is based on the “genus” level of bacterial taxonomy. We subdivide the data into two age categories, “under 2 years” and “at least 2 years.” This stratification is done since it is known that the microbial composition of infants changes drastically when they switch over from breast milk (or formula milk) to solid food (Lozupone and others, 2013). The sample sizes in the two strata (under 2 years, at least 2 years) for US, VE and MA samples are (70, 225), (15, 74) and (32, 72), respectively. After several pre-processing steps of the raw OTU data we obtain, for each age group and each location, a matrix of observations with Inline graphic microbes and inherent structural zeros. Each row of these matrices is assumed to follow the conditional Gaussian distribution of (2.5). Note that for the “under 2 years” category for the pair VE–MA, the total number of observations is Inline graphic and thus we have a high-dimensional scenario. The pre-processing of data is described in detail in the supplementary material available at Biostatistics online. For each pair of locations, we randomly split the data, using five-sixths as the training set and the remaining one-sixth as the test set. The training set is used to estimate the common precision matrix using the procedure described in Section 3.

5.1 Tuning parameter

The regularizer Inline graphic is chosen via 5-fold cross-validation within the combined training data for each pair of locations. The loss function used to evaluate cross-validation error is Inline graphic Also, in the construction of Inline graphic in (3.2) if a pair Inline graphic does not occur then we set the pairwise covariance to zero.

5.2 Classification of subjects to geographical locations

In this section we exploit the assumed conditional Gaussian structure of observations to classify subjects of the test set to one of two geographical locations by using estimates of the corresponding precision matrices. We perform pairwise classifications of samples into (i) US and MA, (ii) US and VE, and finally (iii) VE and MA. Let Inline graphicInline graphic be the estimated Inline graphic-dimensional mean vector obtained as in (3.1) for each of the two populations under consideration, and Inline graphic be the common precision matrix of the two locations.

Let Inline graphic represent the observation to be classified and let Inline graphic denote the collection of indices of the non-zero components of Inline graphic. For location Inline graphic, we implement the following classification rule: let

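$$\delta_r(X_A) = X_A^{T}\,\hat{\Sigma}_{AA}^{-1}\,\hat{\mu}_{rA} - \frac{1}{2}\,\hat{\mu}_{rA}^{T}\,\hat{\Sigma}_{AA}^{-1}\,\hat{\mu}_{rA} \tag{5.1}$$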

denote the linear discriminant function where an observation Inline graphic is classified into population 1 if Inline graphic, otherwise classified into population 2. The percentage of correctly classified observations from the test set is computed and the above process is repeated 20 times to obtain average correct classification percentages. The results for the two age groups and for every pair of locations are summarized in Table 1.
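The discriminant function (5.1) restricted to the non-zero components of an observation can be sketched as below. Two points are assumptions made for illustration only: the decision rule is taken to be "assign to population 1 when its discriminant value is the larger one", and $\hat{\Sigma}_{AA}^{-1}$ is computed by solving a linear system with the sub-block of a covariance estimate `sigma_hat`. The function names are hypothetical.

```python
import numpy as np

def discriminant(x, mu_r, sigma_hat):
    """Linear discriminant (5.1) on the non-zero components of x."""
    A = np.flatnonzero(x != 0)            # indices of observed components
    S_AA = sigma_hat[np.ix_(A, A)]
    w = np.linalg.solve(S_AA, mu_r[A])    # S_AA^{-1} mu_{rA}
    return x[A] @ w - 0.5 * (mu_r[A] @ w)

def classify(x, mu_1, mu_2, sigma_hat):
    """Assumed rule: population 1 if its discriminant is larger, else 2."""
    d1 = discriminant(x, mu_1, sigma_hat)
    d2 = discriminant(x, mu_2, sigma_hat)
    return 1 if d1 > d2 else 2
```

Because the index set is recomputed for every observation, components that are structural zeros in a test sample play no role in its classification.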

Table 1.

Percentages of correct classification between locations

                   US–MA                  US–VE                  VE–MA
Age / % correct    US     MA     Total    US     VE     Total    VE     MA     Total
Age < 2 years      87.5   80.9   85.6     79.6   81.2   79.8     76.1   60.5   69.4
Age ≥ 2 years      92.8   95.0   93.3     80.9   73.3   73.3     62.1   57.5   60.0

The classification results depend on the two populations under consideration. The trends in the correct classification rates (column “Total” in Table 1) are the same in both age categories, the best being US–MA and the worst VE–MA. A possible reason for this is that the microbial compositions of US and MA differ more than those of the other two pairs. This is illustrated in Figure 3, which plots the largest 25 absolute differences in the signal-to-noise ratios (SNR) for each of the three pairs of locations for the “at least 2 years” age group. More precisely, for each Inline graphic, Inline graphic, we plot Inline graphicInline graphic, and Inline graphic This decreasing trend in the absolute difference of SNRs is a possible reason for the lower correct classification rate between MA and VE than between US and MA.

Fig. 3.

Comparison of absolute difference in SNR for each pair of locations in descending order for the top 25 microbes. Results for the “at least 2 years” age category.

6. Discussion

New technologies such as 16S rRNA sequencing have yielded high-dimensional data with characteristics that cannot be modeled by standard i.i.d. formulations of multivariate data. In this paper we describe one such characteristic, namely “structural zeros,” which are encountered in microbiome studies. We propose a conditional Gaussian distributional structure that characterizes these zeros and provide methods to estimate covariance and precision matrices in this context. We show that in spite of the distribution being conditional, it is indeed possible to obtain results that are unconditional. As future work, we believe that the conditional Gaussian distributional structure proposed in this paper can be used to carry forward the work of Kurland and Heagerty (2005) in the high-dimensional setting where the covariates are subjected to structural zeros.

Acknowledgement

We thank a reviewer and the associate editor for several useful suggestions that helped improve this manuscript. Conflict of Interest: None declared.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

Intramural Research Program of the NIH, NIEHS (Z01 ES101744-04) to S.D.P. and A.K.; Israeli Science Foundation (1256/13) to O.D.

References

  1. Bickel, P. and Levina, E. (2008). Covariance regularization by thresholding. Annals of Statistics 36, 2577–2604.
  2. Cai, T., Liu, W. and Luo, X. (2011). A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106, 594–607.
  3. Donoho, D. L., Johnstone, I. M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B 57, 301–369.
  4. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
  5. Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
  6. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.
  7. Kaul, A., Koul, H. L., Chawla, A. and Lahiri, S. N. (2016). Two stage non-penalized corrected least squares for high dimensional linear models with measurement error or missing covariates. arXiv preprint arXiv:1605.03154v1.
  8. Kurland, B. F. and Heagerty, P. J. (2005). Directly parameterized regression conditioning on being alive: analysis of longitudinal data truncated by deaths. Biostatistics 6, 241–258.
  9. Levina, E., Rothman, A. and Zhu, J. (2008). Sparse estimation of large covariance matrices via a nested Lasso penalty. Annals of Applied Statistics 1, 245–263.
  10. Loh, P. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. Annals of Statistics 40, 1637–1664.
  11. Lounici, K. (2014). High dimensional covariance matrix estimation with missing observations. Bernoulli 20, 1029–1058.
  12. Lozupone, C. A., Stombaugh, J., Gonzalez, A., Ackerman, G., Wendel, D., Vazquez-Baeza, Y., Jansson, J. K., Gordon, J. I. and Knight, R. (2013). Meta-analyses of studies of the human microbiota. Genome Research 23(10), 1704–1714.
  13. Mandal, S., Van Treuren, W., White, R. A., Eggesbo, M., Knight, R. and Peddada, S. (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease 26, 1651–2235.
  14. Ravikumar, P., Wainwright, M., Raskutti, G. and Yu, B. (2011). High dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics 5, 935–980.
  15. Rothman, A., Bickel, P., Levina, E. and Zhu, J. (2009). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.
  16. Rothman, A., Levina, E. and Zhu, J. (2009). Generalized thresholding of large covariance matrices. Journal of the American Statistical Association 104, 177–186.
  17. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
  18. Wu, W. B. and Pourahmadi, M. (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90, 831–844.
  19. Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P. and others (2012). Human gut microbiome viewed across age and geography. Nature 486, 222–227.
  20. Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
