Biometrika. 2019 Sep 16;106(4):823–840. doi: 10.1093/biomet/asz037

Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data

Chris McKennan, Dan Nicolae
PMCID: PMC6845853  PMID: 31754283

Summary

An important phenomenon in high-throughput biological data is the presence of unobserved covariates that can have a significant impact on the measured response. When these covariates are also correlated with the covariate of interest, ignoring or improperly estimating them can lead to inaccurate estimates of and spurious inference on the corresponding coefficients of interest in a multivariate linear model. We first prove that existing methods to account for these unobserved covariates often inflate Type I error for the null hypothesis that a given coefficient of interest is zero. We then provide alternative estimators for the coefficients of interest that correct the inflation, and prove that our estimators are asymptotically equivalent to the ordinary least squares estimators obtained when every covariate is observed. Lastly, we use previously published DNA methylation data to show that our method can more accurately estimate the direct effect of asthma on DNA methylation levels compared to existing methods, the latter of which likely fail to recover and account for latent cell type heterogeneity.

Keywords: Batch effect, Cell type heterogeneity, Confounding, High-dimensional factor analysis, Unobserved covariates, Unwanted variation

1. Introduction

High-throughput genetic, DNA methylation, metabolomic and proteomic data are often influenced by unobserved covariates that are difficult or impossible to record (Johnson et al., 2007; Leek et al., 2010; Houseman et al., 2012). Suppose we observe data Inline graphic, where the number of genomic units, Inline graphic, is on the order of or larger than the sample size, Inline graphic. For example, in most DNA methylation data, the number of studied methylation sites, Inline graphic, is between Inline graphic and Inline graphic and Inline graphic. Assume the true model for Inline graphic is

graphic file with name Equation1.gif (1)

where Inline graphic, Inline graphic contains the covariates of interest and Inline graphic contains the Inline graphic unobserved covariates. Our goal is to estimate and perform inference on the coefficients of interest, Inline graphic.

Under model (1), the naive ordinary least squares estimate of Inline graphic,

graphic file with name Equation2.gif

is biased by Inline graphic, where Inline graphic is the ordinary least squares coefficient estimate for the regression of Inline graphic on to Inline graphic. The bias induced by Inline graphic and Inline graphic is often consequential in biological data. For example, in DNA methylation studies where disease status is the covariate of interest, DNA methylation Inline graphic depends on the latent cellular heterogeneity of the Inline graphic samples (Jaffe & Irizarry, 2014), and cellular heterogeneity often depends on disease status Inline graphic (Fahy, 2002; Stein et al., 2016). Ignoring unobserved covariates Inline graphic when analysing these types of data can therefore drastically affect the interpretation of results.
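The bias described above can be checked numerically. The following is a minimal sketch, not taken from the article: it assumes model (1) has the form Y = B Xᵀ + L Cᵀ + E with Y of size p × n, and the names `B`, `L`, `Omega` and all dimensions are hypothetical choices made for illustration. It shows that the naive estimate splits exactly into the truth, a bias term built from the loadings and the regression of the latent covariates on the covariate of interest, and a noise term.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, d, K = 200, 40, 1, 2          # hypothetical sizes, chosen only for illustration

X = rng.normal(size=(n, d))          # covariate of interest
C = 0.8 * X @ rng.normal(size=(d, K)) + rng.normal(size=(n, K))  # latent covariates correlated with X
B = rng.normal(size=(p, d))          # coefficients of interest
L = rng.normal(size=(p, K))          # latent-factor loadings
E = rng.normal(size=(p, n))          # residuals

Y = B @ X.T + L @ C.T + E            # p x n data matrix (assumed form of model (1))

XtX_inv = np.linalg.inv(X.T @ X)
B_naive = Y @ X @ XtX_inv            # naive OLS that ignores C
Omega = XtX_inv @ X.T @ C            # d x K: OLS coefficients from regressing C on X

# The naive estimate decomposes exactly into truth + bias + noise.
bias = L @ Omega.T
noise = E @ X @ XtX_inv
assert np.allclose(B_naive, B + bias + noise)
print("mean |bias| :", np.abs(bias).mean())
print("mean |noise|:", np.abs(noise).mean())
```

The bias term vanishes only when the latent covariates are orthogonal to the covariate of interest, which is why correlation between the two is the problematic case.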

There have been a number of methods proposed to estimate and correct for the latent factors Inline graphic in model (1) (Leek & Storey, 2008; Gagnon-Bartsch & Speed, 2012; Sun et al., 2012; Gagnon-Bartsch et al., 2013; Houseman et al., 2014; Lee et al., 2017). While these methods perform well on selected datasets, they either do not have the requisite theory to justify downstream inference on Inline graphic (Leek & Storey, 2008; Sun et al., 2012; Houseman et al., 2014; Lee et al., 2017) or they require the practitioner to have prior knowledge regarding which coefficients Inline graphic are zero (Gagnon-Bartsch & Speed, 2012; Gagnon-Bartsch et al., 2013).

Recently, Fan & Han (2017) and Wang et al. (2017) proposed methods that first compute Inline graphic, an estimate of Inline graphic, from Inline graphic, where Inline graphic is the orthogonal projection matrix on to the orthogonal complement of Inline graphic. They then estimate Inline graphic by regressing Inline graphic on to Inline graphic, and finally estimate Inline graphic by subtracting the estimated bias Inline graphic from Inline graphic. The advantage of this estimation paradigm is obvious: it decouples the estimation of Inline graphic and Inline graphic without requiring the practitioner to have prior knowledge regarding which coefficients Inline graphic are zero. These articles are quite remarkable because, when their assumptions hold, the authors prove that they can perform inference on Inline graphic that is as accurate as when Inline graphic is known. However, it has been observed that these methods tend to inflate test statistics and cause anticonservative inference in both simulated and real data (van Iterson et al., 2017).

One source of the discrepancy between theory and practice is that the aforementioned articles assume that all Inline graphic of the nonzero eigenvalues of Inline graphic are on the order of the number of samples, Inline graphic, and are overtly larger than the average residual variance Inline graphic. If these assumptions were valid, there would be an unambiguous gap between the Inline graphicth and Inline graphicth eigenvalues of Inline graphic. However, this rarely occurs in practice (Cangelosi & Goriely, 2007; Owen & Wang, 2016; Wang et al., 2017). When these eigenvalue assumptions are violated, we show that previous methods’ techniques to estimate Inline graphic from the regression of Inline graphic onto Inline graphic are sensitive to the error in the estimated design matrix Inline graphic, which causes inaccurate estimates of Inline graphic. In practice, some of the nonzero eigenvalues of Inline graphic will not be large if the sample size is not sufficiently large, if some of the Inline graphic latent covariates do not influence the response of every genomic unit, or if some of the latent covariates are correlated with the covariate of interest Inline graphic, since this will dampen Inline graphic. The latter is common in DNA methylation data because unobserved cellular heterogeneity is often correlated with Inline graphic (Jaffe & Irizarry, 2014).

The purpose of this article is to fill the described gap in the literature by studying the unobserved covariate problem when some or all of the Inline graphic nonzero eigenvalues of Inline graphic are not exceedingly large. We prove that when the eigenvalues fall below a certain threshold, then for fixed Inline graphic, previous methods have a propensity to inflate Type I error when testing the null hypothesis Inline graphic, and even tend to falsely reject Inline graphic when using the conservative Bonferroni correction. We then provide alternative estimators for Inline graphic and prove that when Inline graphic is suitably sparse, our estimators are asymptotically equivalent to the ordinary least squares estimators obtained using the design matrix Inline graphic, regardless of the size of the eigenvalues of Inline graphic. We lastly use simulated data and real DNA methylation data from Nicodemus-Johnson et al. (2016) to show that latent covariates with ostensibly small effects can be detrimental to inference if not properly accounted for, and that our method can better account for latent covariates than the leading competitors.

2. The model, our estimation procedure and intuition

2.1. Notation

For any integer Inline graphic, we define Inline graphic. For any matrix Inline graphic, we define Inline graphic and Inline graphic to be the orthogonal projection matrices that project vectors on to the image of Inline graphic and the orthogonal complement of Inline graphic, respectively, and Inline graphic, Inline graphic and Inline graphic to be the Inline graphicth row, Inline graphicth column and Inline graphic element of Inline graphic. Lastly, we define Inline graphic to be the vectors of all ones and all zeros and use the notation Inline graphic if the random variables, or matrices, Inline graphic and Inline graphic have the same distribution.

2.2. A model for the data

Let Inline graphic be the observed data, where Inline graphic is an observation at genomic unit Inline graphic in sample Inline graphic. Let Inline graphic be an observed, full rank matrix containing the covariates of interest and define Inline graphic to be their corresponding coefficients across all Inline graphic genomic units. We also define an additional covariate matrix Inline graphic and let Inline graphic be its corresponding coefficient. We assume that Inline graphic is unobserved, but Inline graphic is known. Evidently, Inline graphic is rarely known in true data applications. While we acknowledge that estimating Inline graphic is a challenging problem, there is a large body of work devoted to estimating it (Leek & Storey, 2008; Onatski, 2010; Gagnon-Bartsch & Speed, 2012; Owen & Wang, 2016; McKennan & Nicolae, 2018). We discuss how different values of Inline graphic affect our downstream estimates in § 4. We assume (1) is the true model for Inline graphic, and we define

graphic file with name Equation3.gif (2)

We also define

graphic file with name Equation4.gif (3)

to be the ordinary least squares coefficient estimates and residuals from the regression of Inline graphic on to Inline graphic, respectively. We have not assumed an explicit relationship between Inline graphic and Inline graphic, because one can always decompose Inline graphic as

graphic file with name Equation5.gif

A more general model for Inline graphic would be Inline graphic, where Inline graphic contains observed nuisance covariates, like the intercept or technical covariates, whose coefficients Inline graphic are not of interest. We can get back to model (1) by multiplying Inline graphic on the right by a matrix whose columns form an orthonormal basis for the null space of Inline graphic. Therefore, we work exclusively with model (1) and assume any observed nuisance factors have already been rotated out, as they would be in ordinary least squares.
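As a rough illustration of rotating out nuisance covariates, the sketch below builds an orthonormal basis for the null space of the transposed nuisance design with a complete QR decomposition and checks that ordinary least squares on the rotated data reproduces the coefficients on the covariates of interest from the full regression. The names `Z`, `Q` and the dimensions are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, d, r = 50, 30, 2, 3            # hypothetical sizes

X = rng.normal(size=(n, d))
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r - 1))])  # intercept + technical covariates
Y = rng.normal(size=(p, n))

# Complete QR of Z: the last n - r columns of Q_full span the orthogonal
# complement of col(Z), i.e. the null space of Z^T.
Q_full, _ = np.linalg.qr(Z, mode="complete")
Q = Q_full[:, r:]                    # n x (n - r), orthonormal columns

Y_rot = Y @ Q                        # p x (n - r)
X_rot = Q.T @ X                      # (n - r) x d

# OLS on the rotated data ...
B_rot = Y_rot @ X_rot @ np.linalg.inv(X_rot.T @ X_rot)

# ... matches the X-coefficients from OLS on the full design [X, Z].
D = np.column_stack([X, Z])
B_full = (Y @ D @ np.linalg.inv(D.T @ D))[:, :d]
print(np.allclose(B_rot, B_full))
```

The rotation reduces the effective sample size from n to n − r, which is the usual degrees-of-freedom cost of adjusting for observed nuisance covariates.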

2.3. Estimating Inline graphic when Inline graphic is unobserved

We break Inline graphic into two independent pieces using a technique proposed in Sun et al. (2012):

graphic file with name Equation6.gif (4)
graphic file with name Equation7.gif (5)

where Inline graphic and Inline graphic are independent because Inline graphic and Inline graphic. The matrix Inline graphic is the ordinary least squares estimate of Inline graphic that ignores Inline graphic, and the rows of Inline graphic lie on an (Inline graphic)-dimensional linear subspace of Inline graphic. We now describe how to use Inline graphic and Inline graphic to derive the ordinary least squares estimates of Inline graphic when Inline graphic is observed. This will provide a template for estimating Inline graphic when Inline graphic is unobserved.

Algorithm 1

(Ordinary least squares when Inline graphic is observed) Let Inline graphic, Inline graphic, Inline graphic and Inline graphic be given. Our goal is to use ordinary least squares to estimate and perform inference on Inline graphic, the rows of Inline graphic.

  • (a) Set Inline graphic. Use Inline graphic to estimate Inline graphic and Inline graphic as
    graphic file with name Equation8.gif
    where Inline graphic is the Inline graphicth row of Inline graphic.
  • (b) Set Inline graphic.

  • (c) Define the ordinary least squares estimate of Inline graphic to be
    graphic file with name Equation9.gif (6)
    where Inline graphic and Inline graphic are the Inline graphicth rows of Inline graphic and Inline graphic, respectively.
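The three steps of Algorithm 1 can be sketched as follows. Because the displayed formulas are not available in this copy of the text, the sketch relies on the standard partitioned-regression identities rather than on the article's exact expressions: the variable names, the dimensions, the assumed form Y = B Xᵀ + L Cᵀ + E of model (1), and the n − d − K denominator for the residual variance are all assumptions. The final lines compare the result with a single regression on the full design.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, d, K = 100, 60, 1, 2           # hypothetical sizes

X = rng.normal(size=(n, d))
C = 0.5 * X @ rng.normal(size=(d, K)) + rng.normal(size=(n, K))
Y = (rng.normal(size=(p, d)) @ X.T + rng.normal(size=(p, K)) @ C.T
     + rng.normal(size=(p, n)))

# Split Y into the part along X and the part orthogonal to X.
Q_full, _ = np.linalg.qr(X, mode="complete")
Q_X = Q_full[:, d:]                              # basis for the orthogonal complement of col(X)
Y1 = Y @ X @ np.linalg.inv(X.T @ X)              # naive OLS estimate, p x d
Y2 = Y @ Q_X                                     # p x (n - d)

# (a) Loadings and residual variances from the part orthogonal to X.
C2 = Q_X.T @ C                                   # (n - d) x K
L_hat = Y2 @ C2 @ np.linalg.inv(C2.T @ C2)       # p x K
resid = Y2 - L_hat @ C2.T
sigma2_hat = (resid ** 2).sum(axis=1) / (n - d - K)   # assumed denominator

# (b) Regression of C on X.
Omega_hat = np.linalg.inv(X.T @ X) @ X.T @ C     # d x K

# (c) Subtract the estimated bias from the naive estimate.
B_hat = Y1 - L_hat @ Omega_hat.T

# Reference: one-shot OLS on the full design [X, C].
D = np.column_stack([X, C])
coef = Y @ D @ np.linalg.inv(D.T @ D)
rss = ((Y - coef @ D.T) ** 2).sum(axis=1)
print(np.allclose(B_hat, coef[:, :d]), np.allclose(L_hat, coef[:, d:]))
```

Writing ordinary least squares in this decoupled form is what makes it a usable template: each step can be replaced by an estimate when the latent covariates are not observed.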

It is straightforward to derive the asymptotic properties of the estimators defined in Algorithm 1. In Step (a), Inline graphic as Inline graphic and Inline graphic. Since Inline graphic is independent of Inline graphic, both of these estimates are independent of Inline graphic. This implies that the asymptotic distribution of Inline graphic is

graphic file with name Equation10.gif

as Inline graphic, where Inline graphic.

A property of the ordinary least squares estimate Inline graphic is

graphic file with name Equation11.gif

That is, Inline graphic depends only on the column space Inline graphic, meaning we may replace Inline graphic with Inline graphic as input in Algorithm 1 for any invertible matrix Inline graphic. In particular, we may choose Inline graphic so that Inline graphic. This parametrization of Inline graphic, and therefore Inline graphic, is convenient because it suggests that a reasonable estimate of Inline graphic when Inline graphic is unobserved is a scalar multiple of the first Inline graphic right singular vectors of Inline graphic. Using this intuition, we now present our method to estimate and perform inference on Inline graphic when Inline graphic is unobserved. This is described in Algorithm 2, which mimics the three steps of Algorithm 1.
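The idea that scaled right singular vectors can stand in for the unobserved design can be illustrated with a small simulation. This is a sketch under favourable, assumed conditions: the loadings are strong, the sizes are hypothetical, and the scaling by the square root of the number of columns is an arbitrary choice because the article's exact scalar is not recoverable here. Agreement between column spaces is measured with the cosines of the principal angles.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, K = 2000, 50, 2                # m plays the role of n - d; sizes are hypothetical

C2 = rng.normal(size=(m, K))         # latent design after projecting out X (unobserved in practice)
L = rng.normal(size=(p, K))          # strong loadings, so the factors are easy to find
Y2 = L @ C2.T + rng.normal(size=(p, m))

# First K right singular vectors of Y2, rescaled by sqrt(m) (assumed scaling).
_, _, Vt = np.linalg.svd(Y2, full_matrices=False)
C2_hat = np.sqrt(m) * Vt[:K].T       # m x K

# Cosines of the principal angles between col(C2) and col(C2_hat);
# values near 1 mean the two column spaces nearly coincide.
Qa, _ = np.linalg.qr(C2)
Qb, _ = np.linalg.qr(C2_hat)
cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
print(cosines)
```

Only the column space matters for the downstream estimate, so the comparison is made between subspaces rather than between individual columns.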

Algorithm 2

(Estimation and inference when Inline graphic is unobserved). Let Inline graphic, Inline graphic, Inline graphic and Inline graphic be given. Our goal is to estimate and perform inference on Inline graphic, the rows of Inline graphic.

  • (a) Let Inline graphic be the singular value decomposition of Inline graphic where Inline graphic and Inline graphic. Define Inline graphic, where Inline graphic is the Inline graphicth column of Inline graphic. Estimate Inline graphic and Inline graphic as
    graphic file with name Equation12.gif (7)
    graphic file with name Equation13.gif (8)
  • (b) Define Inline graphic and
    graphic file with name Equation14.gif (9)
    where Inline graphic is the Inline graphicth column of Inline graphic. Estimate Inline graphic as
    graphic file with name Equation15.gif (10)
  • (c) Estimate Inline graphic as
    graphic file with name Equation16.gif (11)

Just like the estimates Inline graphic and Inline graphic, Inline graphic and Inline graphic defined in (7) and (8) are independent of Inline graphic. To perform inference on Inline graphic, we assume

Algorithm 2

2.4. Intuition regarding Step (b) of Algorithm 2

The estimates of Inline graphic (Inline graphic) and Inline graphic in Step (a) of Algorithm 2 are similar to those used in Sun et al. (2012), Gagnon-Bartsch et al. (2013), Lee et al. (2017) and Wang et al. (2017). However, the estimate of Inline graphic in Step (b) is different from those used in previous methods. Recall from (4) that Inline graphic. If Inline graphic is sufficiently sparse, Sun et al. (2012), Gagnon-Bartsch et al. (2013), Lee et al. (2017) and Wang et al. (2017) propose using variations of the following estimator to recover Inline graphic:

graphic file with name Equation18.gif (12)

That is, they ignore the uncertainty in Inline graphic when regressing Inline graphic on to Inline graphic. To see why this is imprudent, let Inline graphic be the residual and suppose for the sake of argument that Inline graphic. Then the regression coefficients from the regression Inline graphic should be very close to 0, since Inline graphic is independent of Inline graphic. In other words, existing estimates of Inline graphic are shrunk towards 0. We quantify the shrinkage exactly in § 3.3 and use that result to derive an inflation term, Inline graphic. We then use Inline graphic to inflate the shrunken estimate Inline graphic, which allows us to better estimate Inline graphic in Step (c) of Algorithm 2.
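The shrinkage towards zero is an instance of the classical attenuation that occurs when a regressor is measured with error. The sketch below is a generic illustration with a single latent factor and made-up numbers, not the article's estimator or its inflation term: it regresses a response on the true loadings and on a noisy version of them, and compares the second coefficient with the textbook attenuation factor.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5000                             # hypothetical number of genomic units

ell = rng.normal(size=p)             # true loadings (the regressor we would like to have)
omega = 0.7                          # true coefficient to recover
y1 = omega * ell + 0.5 * rng.normal(size=p)   # response, one entry per genomic unit

noise_sd = 1.0
ell_hat = ell + noise_sd * rng.normal(size=p) # estimated loadings carry estimation error

ols_true = (ell @ y1) / (ell @ ell)               # regression on the true loadings
ols_noisy = (ell_hat @ y1) / (ell_hat @ ell_hat)  # regression on the noisy loadings

# Classical attenuation factor: var(signal) / (var(signal) + var(error)).
shrink = 1.0 / (1.0 + noise_sd ** 2)
print(ols_true, ols_noisy, omega * shrink)
```

When the error variance is small relative to the spread of the loadings the factor is close to one and the shrinkage is negligible, which matches the intuition that the correction matters most when the data are not strongly informative for the latent factors.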

The importance of the inflation term Inline graphic in (10) is related to how informative the data are for Inline graphic. The estimate Inline graphic (Inline graphic) defined in (9) is the Inline graphicth largest eigenvalue of Inline graphic, and can therefore be viewed as an estimate of Inline graphic, the Inline graphicth largest eigenvalue of Inline graphic. The eigenvalue Inline graphic is also the Inline graphicth largest eigenvalue of Inline graphic. When Inline graphic is sufficiently large for all Inline graphic, we say that the data are strongly informative for the latent factors Inline graphic. Under this regime, Inline graphic will tend to dominate Inline graphic, an estimate of the constant Inline graphic defined in (2), meaning Inline graphic will be negligible. In this case it suffices to use Inline graphic or other previously proposed estimates of Inline graphic in place of Inline graphic in (11). On the other hand, we say the data are only moderately informative for Inline graphic if one or more of Inline graphic is not large. This can occur if the sample size Inline graphic is not large enough, if some of the columns of Inline graphic do not affect the expression or methylation of all Inline graphic genomic units, or if Inline graphic is correlated with the columns of Inline graphic, since this will dampen Inline graphic. In these cases, Inline graphic will be moderate to large. In fact, we prove in § 3.3 and show with simulation and a real data example in § 4 that existing methods that ignore the shrinkage in their estimates of Inline graphic are not amenable to inference. We define the informativeness of the data for Inline graphic precisely in Definition 1 in § 3.3.

3. Theoretical results

3.1. Assumptions

In all of our assumptions and theoretical results, we assume model (1) holds, Inline graphic and Inline graphic are as defined in (4) and (5), and Inline graphic.

Assumption 1.

  • (a) Let Inline graphic be an observed, nonrandom matrix such that Inline graphic.

  • (b) Let Inline graphic be an unobserved, nonrandom matrix with Inline graphic nonzero singular values, where Inline graphic is a known constant.

  • (c) For some constant Inline graphic, Inline graphic for all Inline graphic.

Under (a) and (b), Inline graphic, Inline graphic and Inline graphic (Inline graphic) are identifiable. The choice to treat Inline graphic as nonrandom is to illustrate that ignoring this term tends to bias estimates of Inline graphic. However, all of our results in §§ 3.2–3.4 can be extended to the case when Inline graphic is a random variable using results from the Supplementary Material. Item (c) is a standard assumption in the high-dimensional factor analysis literature (Bai & Li, 2012; Wang et al., 2017). We next place assumptions on Inline graphic.

Assumption 2.

Let Inline graphic and Inline graphic be a constant and let:

  • (a) Inline graphic for all Inline graphic;

  • (b) Inline graphic has Inline graphic nonzero eigenvalues Inline graphic such that Inline graphic and Inline graphic for all Inline graphic, where Inline graphic;

  • (c) Inline graphic be a nondecreasing function of Inline graphic such that Inline graphic and Inline graphic as Inline graphic.

The quantity Inline graphic is identifiable because Inline graphic is identifiable, and (a) is equivalent to Inline graphic for all Inline graphic if Inline graphic. We comment on this further after we state Proposition 1 below. The assumptions on Inline graphic in (b) are weaker than those considered in previous work that provides inferential guarantees, which focused on the case when Inline graphic (Bai & Li, 2012; Fan & Han, 2017; Wang et al., 2017). Lee et al. (2017) do allow Inline graphic, provided Inline graphic and Inline graphic as Inline graphic. However, they only prove the consistency of their estimates of Inline graphic. In fact, we show in § 3.3 that inference with their method, as well as other existing methods, is fallacious if Inline graphic for some Inline graphic. The assumptions on Inline graphic in (c) are the same as those used by Wang et al. (2017), who only consider the case Inline graphic. We next place assumptions on the parameters of Inline graphic.

Assumption 3.

Let Inline graphic be a constant.

  • (a) Let Inline graphic for all Inline graphic as Inline graphic.

  • (b) Let Inline graphic for all Inline graphic and Inline graphic.

  • (c) Let Inline graphic be any matrix such that Inline graphic for some Inline graphic. Then for Inline graphic and Inline graphic, Inline graphic.

Item (a) is the same sparsity as assumed in Wang et al. (2017). Item (c) is justifiable because we prove that Inline graphic and Inline graphic are identifiable under Assumptions 1, 2 and 3(a) in Proposition 1 below, and Proposition S1 in § S2.1 of the Supplementary Material.

In DNA methylation data with Inline graphicInline graphic, Inline graphic and in the previously unexplored regime Inline graphic, Assumption 3(a) restricts the number of genomic units with nonzero coefficient of interest to be on the order of hundreds to thousands, which is common in many studies (Liu et al., 2018; Morales et al., 2016; Yang et al., 2017; Zhang et al., 2018). We also show through simulations that we can egregiously violate Assumption 3(a) and still perform accurate inference on Inline graphic. We now state a proposition regarding the identifiability of Inline graphic and Inline graphic.

Proposition 1.

Let Inline graphic, suppose Assumptions 1 and 2 hold and define the parameter space

Proposition 1. (13)

Then Inline graphic is nonempty and if Inline graphic, then Inline graphic and Inline graphic for some Inline graphic. If Assumptions 1, 2 and 3(a) hold, then there exists a constant Inline graphic such that Inline graphic is identifiable and

Proposition 1. (14)

is nonempty for all Inline graphic. Further, if Inline graphic, then Inline graphic and Inline graphic for some Inline graphic for all Inline graphic.

The condition that Inline graphic is a classic constraint to identify the components of factor models (Bai & Li, 2012). If Inline graphic, Assumption 2(a) becomes Inline graphic for all Inline graphic, and if Inline graphic, Inline graphic. While we prove it is unnecessary to assume a particular parametrization of Inline graphic and Inline graphic to estimate and perform inference on Inline graphic using Algorithm 2, we use the parameter spaces Inline graphic and Inline graphic in the statements of theoretical results regarding the accuracy of estimates of Inline graphic and Inline graphic, respectively, in §§ 3.2–3.4.

3.2. Asymptotic properties of the estimates from Step (a) of Algorithm 2

We start by illustrating the asymptotic properties of Inline graphic (Inline graphic) and Inline graphic defined in (7) and (8).

Lemma 1.

Suppose Assumptions 1 and 2 hold and Inline graphic. Then, for Inline graphic defined in (2),

Lemma 1. (15)
Lemma 1. (16)

Lemma 2.

Suppose Assumptions 1 and 2 hold and Inline graphic. Then, for Inline graphic defined in (9),

Lemma 2. (17)

Let Inline graphic and Inline graphic be as defined in (13) and Step (a) of Algorithm 2, respectively. If we also assume that Inline graphic and the K diagonal elements of Inline graphic are nonnegative, then, for Inline graphic,

Lemma 2. (18)

Remark 1.

The identifiability constraints, that Inline graphic and Inline graphic has nonnegative diagonal elements, are equivalent to the IC3 constraint used in Bai & Li (2012) to identify the components of factor models.

Remark 2.

When Inline graphic is observed and Inline graphic, (17) and (18) hold for the ordinary least squares estimator Inline graphic defined in Step (a) of Algorithm 1.

Lemmas 1 and 2 show that Inline graphic and Inline graphic have the same asymptotic properties as Inline graphic and Inline graphic, the ordinary least squares estimates of Inline graphic and Inline graphic defined in Algorithm 1. However, (17) states that the estimates of Inline graphic are biased by Inline graphic, which we show below is the primary reason why previously proposed methods often return inflated test statistics.

3.3. Previous estimates of Inline graphic in Step (b) of Algorithm 2 inflate test statistics

Existing methods that use the estimation paradigm outlined in Algorithm 2 ignore the uncertainty in Inline graphic, and use variations of Inline graphic to estimate Inline graphic. We show in Proposition 2 and Corollary 1 below that these methods tend to underestimate Inline graphic, which can lead to spurious inference on Inline graphic.

Proposition 2.

Suppose Assumptions 1, 2 and 3 hold with Inline graphic, Inline graphic and Inline graphic, where Inline graphic was defined in (14). In addition, suppose the diagonal elements of Inline graphic are nonnegative and Inline graphic for some constant Inline graphic. If we estimate Inline graphic as Inline graphic defined in (12), then

Proposition 2. (19)

Corollary 1.

Fix some Inline graphic and let Inline graphic be a small constant. In addition to the assumptions of Proposition 2, suppose Inline graphic and the following hold:

  • (i) We replace Inline graphic with Inline graphic in (11) and estimate Inline graphic as Inline graphic.

  • (ii) There exists some constant Inline graphic such that, Inline graphic, where Inline graphic is the Inline graphicth element of Inline graphic.

Define Inline graphic to be the Inline graphicth z-score and let Inline graphic be any significance level. Then for Inline graphic, the Inline graphic quantile of the standard normal distribution, there exists a constant Inline graphic such that, as Inline graphic,

Corollary 1.

where Inline graphic is the Bonferroni threshold at a level Inline graphic.

Remark 3.

Gagnon-Bartsch et al. (2013) used Inline graphic to estimate Inline graphic, but Lee et al. (2017) and Wang et al. (2017) used slightly different estimators. We prove analogous versions of Proposition 2 and Corollary 1 for the estimators used by Lee et al. (2017) and Wang et al. (2017) in the Supplementary Material.

Remark 4.

The assumption that Inline graphic made in Proposition 2 and Corollary 1 requires the eigenvalues be on the same order of magnitude. It is a standard assumption made by previous authors who use versions of Algorithm 2 to estimate Inline graphic (Lee et al., 2017; Wang et al., 2017). In Remark 6, after the statement of Theorem 2, we discuss how to extend it to allow Inline graphic to diverge.

When Condition (ii) in the statement of Corollary 1 does not hold, it implies that the bias Inline graphic in Inline graphic is minor, or the largest components of Inline graphic load on to the columns of Inline graphic corresponding to the largest eigenvalues Inline graphic, which are the components least affected by the shrinkage in Proposition 2. The shrinkage in Inline graphic will have less of an impact on inference in these cases. If Inline graphic, Condition (ii) can be replaced with Inline graphic for some constant Inline graphic.

The results of Proposition 2 and Corollary 1 show that ignoring the uncertainty in Inline graphic when estimating Inline graphic can lead to inflated test statistics and Type I errors if Inline graphic is not small enough, even if one uses the conservative Bonferroni threshold. We therefore define the informativeness of the data for Inline graphic in terms of the magnitude of Inline graphic in relation to Inline graphic.

Definition 1

(Informativeness of the data for Inline graphic). The data Inline graphic are strongly informative for Inline graphic if Inline graphic as Inline graphic, and moderately informative for Inline graphic if there exists a constant Inline graphic such that Inline graphic for all Inline graphic.

Corollary 1 shows that existing methods risk performing anticonservative inference when the data are only moderately informative for Inline graphic. We next show that our shrinkage-corrected estimate of Inline graphic in (10) begets estimates of Inline graphic that are asymptotically equivalent to the corresponding ordinary least squares estimates obtained when Inline graphic is known, even when the data are only moderately informative for Inline graphic.

3.4. Estimates of Inline graphic from Algorithms 1 and 2 are asymptotically equivalent

We first prove that our shrinkage-corrected estimate of Inline graphic, Inline graphic, corrects the aforementioned shrinkage present in existing methods’ estimates of Inline graphic.

Lemma 3.

Suppose Assumptions 1, 2 and 3 hold and Inline graphic. Further, assume the diagonal entries of Inline graphic are nonnegative and Inline graphic, where Inline graphic was defined in the statement of Proposition 2. If Inline graphic is defined as in (10) and Inline graphic, then

Lemma 3. (20)

We use this result to prove that inference with Inline graphic (Inline graphic) is asymptotically equivalent to the ordinary least squares estimator obtained when Inline graphic is known.

Theorem 1.

Let Inline graphic and suppose Assumptions 1, 2 and 3 hold with Inline graphic and Inline graphic. Then inference with Inline graphic is asymptotically equivalent to inference with Inline graphic in the following sense:

Theorem 1. (21)
Theorem 1. (22)

The estimates Inline graphic, Inline graphic and Inline graphic are defined in (6), (10) and (11), and Inline graphic.

In some real experimental data, the largest eigenvalue Inline graphic may be substantially larger than the smallest eigenvalue Inline graphic. We therefore extend Theorem 1 to relax the assumption that the Inline graphic are the same order of magnitude in the following theorem.

Theorem 2.

Let Inline graphic, suppose Assumptions 1, 2 and 3 hold and assume Inline graphic. Define Inline graphic to be the Inline graphicth left singular vector of Inline graphic (Inline graphic). If Inline graphic for some constant Inline graphic for all Inline graphic, then (21) and (22) hold.

Remark 5.

Under Assumptions 1 and 2, Inline graphic is identifiable for all Inline graphic. If Assumptions 1 and 2 hold and Inline graphic, Inline graphic for all Inline graphic.

Remark 6.

Proposition 2 and Corollary 1 can be extended to accommodate data where Inline graphic diverges by replacing the condition that Inline graphic with Inline graphic for all Inline graphic.

The condition on Inline graphic (Inline graphic) is quite general, as it can be shown to hold in probability when Inline graphic and Inline graphic (Inline graphic) for any distributions Inline graphic and Inline graphic with compact support, such that Inline graphic has eigenvalues bounded away from 0 with high probability. We refer the reader to the Supplementary Material for more detail.

3.5. Inference on the relationship between Inline graphic and Inline graphic

One may be interested in understanding the origin of Inline graphic. For example, if components of Inline graphic were large, it would be informative to know if this were due to random experimental variation, or if some of the columns of Inline graphic truly depended on Inline graphic. To incorporate this type of inference, we state the following theorem that allows Inline graphic, and therefore Inline graphic, to be treated as a random variable.

Theorem 3.

Let Inline graphic be a constant. In addition to Assumptions 11, 11 and 3(b), suppose the following hold:

  • (i) Inline graphic and Inline graphic is a nonrandom matrix such that Inline graphic, where Inline graphic is known;

  • (ii) let Inline graphic. Then Inline graphic and Inline graphic for all Inline graphic, Inline graphic for all Inline graphic and Inline graphic for all Inline graphic;

  • (iii) Inline graphic is a nondecreasing function of Inline graphic such that Inline graphic, Inline graphic as Inline graphic and Inline graphic for all Inline graphic;

  • (iv) Inline graphic, where Inline graphic is nonrandom and Inline graphic has independent and identically distributed rows Inline graphic that are independent of Inline graphic such that Inline graphic, Inline graphic and Inline graphic for all Inline graphic and Inline graphic.

Let Inline graphic be the standard Wishart distribution in Inline graphic dimensions with Inline graphic degrees of freedom. If the null hypothesis Inline graphic is true and Inline graphic, then

Theorem 3.

where Inline graphic is defined in (10) and Inline graphic. If Inline graphic, Inline graphic.

Remark 7.

Under the definition of Inline graphic in (iv), Inline graphic and Inline graphic.

4. Simulations and data analysis

4.1. Simulation study

In this section we use simulations to compare the performance of our shrinkage-corrected method defined by Algorithm 2 with that of methods proposed in Leek & Storey (2008), Gagnon-Bartsch & Speed (2012), Gagnon-Bartsch et al. (2013), Lee et al. (2017) and Wang et al. (2017), as well as the ordinary least squares estimator when Inline graphic is known and when it is ignored. We do not include results from Fan & Han (2017) or Houseman et al. (2014), because these methods perform similarly to those proposed in Lee et al. (2017) and Wang et al. (2017). In all of our simulations, we set Inline graphic, Inline graphic and Inline graphic to mimic DNA methylation data where Inline graphic ranges from Inline graphic to Inline graphic, although our results are nearly identical for Inline graphic on the order of gene expression data (Inline graphic). We set Inline graphic and assigned 50 samples to the treatment group and the rest to the control group so that Inline graphic. We then set the eigenvalues Inline graphic so that Inline graphic, Inline graphic and, for the others,

graphic file with name Equation31.gif

For a predefined value of Inline graphic we simulated Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic according to

graphic file with name Equation32.gif (23)

where Inline graphic was chosen so that Inline graphic and Inline graphic is the Inline graphic-distribution with four degrees of freedom. We then set the observed data to be Inline graphic. Although our theory from § 3 assumes the residuals Inline graphic are normally distributed, we simulated Inline graphic-distributed data to mimic real data with heavy tails. The values used for Inline graphic and Inline graphic (Inline graphic) are given in Table 1. We show additional simulation results where we simulate Inline graphic according to Inline graphic in the Supplementary Material.
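
As a rough illustration of the heavy-tailed residuals described above, the following Python sketch draws Student-t variates with four degrees of freedom using only the standard library. The function names and the rescaling to unit variance are assumptions made for illustration; the scaling constant actually used in (23) is not reproduced here.

```python
import math
import random


def draw_t(df, rng):
    """Draw one Student-t variate as Z / sqrt(V / df).

    Z is standard normal and V is chi-squared with df degrees of freedom.
    """
    z = rng.gauss(0.0, 1.0)
    # A chi-squared(df) variate is a Gamma variate with shape df/2 and scale 2.
    v = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(v / df)


def draw_unit_variance_t4(n, seed=0):
    """Return n draws from a t-distribution with 4 degrees of freedom.

    ASSUMPTION: the draws are rescaled to have variance 1. This choice is
    illustrative and is not taken from the paper.
    """
    rng = random.Random(seed)
    # Var(t_df) = df / (df - 2) for df > 2, so t_4 has variance 2.
    scale = math.sqrt((4 - 2) / 4)
    return [scale * draw_t(4, rng) for _ in range(n)]


residuals = draw_unit_variance_t4(1000, seed=1)
```

Because the fourth moment of this distribution is infinite, sample variances computed from such draws fluctuate far more than they would for normal residuals, which is the kind of departure from normality the simulations are meant to probe.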

Table 1.

The Inline graphic and Inline graphic values Inline graphic used to simulate Inline graphic

Factor no. (Inline graphic) Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic 1 0.78 0.60 0.5 0.5 0.5 0.5 0.5 0.5 0.5
Inline graphic 0 0 0 0.13 0.48 0.85 0.89 0.92 0.94 0.96
Inline graphic 98.0 58.9 35.4 21.3 12.8 3.8 2.7 1.9 1.4 1.0

We set the parameter Inline graphic used to simulate Inline graphic, where Inline graphic in (23), to be one of two values:

graphic file with name Equation33.gif

with the scalar Inline graphic chosen so that Inline graphic explained 30% of the variability in group status Inline graphic, on average. The choice of 30% was not arbitrary, as we estimated that over 30% of the variance in group status was explained by Inline graphic in our data application in § 4.2.

As simulated, the eigenvalues Inline graphic are large enough that the shrinkage terms Inline graphic (Inline graphic) from (19) in Proposition 2 are negligible. This implies that when Inline graphic, Inline graphic will likely be a suitable estimate of Inline graphic, since Inline graphic will correctly estimate the largest and most important components of Inline graphic, Inline graphic. The anticonservative nature of Inline graphic implied by Corollary 1 does not apply when Inline graphic because Condition (ii) of Corollary 1 will generally not hold. We would therefore expect our shrinkage-corrected method defined by Algorithm 2 to perform similarly to previous methods that ignore the shrinkage in their estimates of Inline graphic in this simulation scenario. However, when Inline graphic, Inline graphic will not recover the largest and most consequential components of Inline graphic, Inline graphic, because of the substantial shrinkage caused by the relatively small eigenvalues Inline graphic. In this case, Corollary 1 and Remark 6 suggest that ignoring the shrinkage will lead to anticonservative inference on Inline graphic, whereas Theorems 1 and 2 imply that our shrinkage-corrected method will be asymptotically equivalent to ordinary least squares when Inline graphic is observed.

We simulated 100 datasets with Inline graphic and another 100 with Inline graphic. We found that we could perform the best inference on Inline graphic with each method by performing ordinary least squares with the design matrix Inline graphic, where Inline graphic was Inline graphic if Inline graphic was known, or was estimated with any one of the six methods described above. Our shrinkage-corrected estimate of Inline graphic was Inline graphic, where Inline graphic was defined in Step (a) of Algorithm 2. We describe how the other five methods estimate Inline graphic below. We compared the ordinary least squares Inline graphic-statistics from all the methods to a Inline graphic-distribution with Inline graphic degrees of freedom to compute Inline graphic-values for the null hypotheses Inline graphic (Inline graphic). We then judged the performance of each method by comparing their true false discovery proportion at a nominal 20% false discovery rate, estimated using Inline graphic-values (Storey, 2001), because this is the inference method popular among biologists.
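
The evaluation criterion described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure implemented in the qvalue package: Storey's method estimates the proportion of true nulls from the data, whereas the sketch fixes that proportion at 1, which makes the q-values coincide with Benjamini-Hochberg adjusted p-values. The function names and the toy inputs are hypothetical.

```python
def q_values(p_values):
    """Step-up q-values with the null proportion fixed at 1.

    ASSUMPTION: pi0 = 1, a conservative simplification of Storey (2001);
    the result equals the Benjamini-Hochberg adjusted p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the q-values
    # are monotone in the p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def false_discovery_proportion(q, is_null, threshold=0.2):
    """Fraction of rejected units whose coefficient is truly zero.

    Returns 0.0 when nothing is rejected.
    """
    rejected = [i for i, qi in enumerate(q) if qi <= threshold]
    if not rejected:
        return 0.0
    n_false = sum(1 for i in rejected if is_null[i])
    return n_false / len(rejected)


# Hypothetical toy example: six units, the last four of which are true nulls.
toy_p = [0.001, 0.01, 0.03, 0.2, 0.5, 0.9]
toy_null = [False, False, True, True, True, True]
toy_fdp = false_discovery_proportion(q_values(toy_p), toy_null, threshold=0.2)
```

In a simulation the truth indicator is known, so the true false discovery proportion can be computed for each dataset and compared with the nominal 20% level.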

Figure 1 provides the simulation results. We see that our shrinkage-corrected method is able to control the false discovery rate both when Inline graphic is known to be 10 and when we drastically overestimate it to be 20. Further, our method’s power to detect units with nonzero Inline graphic at this nominal 20% false discovery rate threshold was 13.6% when Inline graphic and 12.8% when Inline graphic, compared with 13.6% when Inline graphic was known. The power of all three methods was the same for both values of Inline graphic. This is exactly what one would expect from Theorems 1 and 2, which prove that inference with our shrinkage-corrected estimator is asymptotically equivalent to that with ordinary least squares when Inline graphic is known. This equivalence was also manifested when we overtly violated Assumption 3(a) and simulated Inline graphic; see the Supplementary Material for more detail.

Fig. 1.

The false discovery proportion, FDP, for each method at a Inline graphic-value threshold of 0.2 in simulations when (a) Inline graphic and (b) Inline graphic. BC is our shrinkage-corrected method defined in Algorithm 2 and Inline graphic is the number of factors used to estimate Inline graphic. CATE-RR, dSVA, IRW-SVA, RUV-2 and RUV-4 are the methods proposed in Wang et al. (2017), Lee et al. (2017), Leek & Storey (2008), Gagnon-Bartsch & Speed (2012) and Gagnon-Bartsch et al. (2013), respectively. These five methods were all applied with Inline graphic. Inference with None was performed using the design matrix Inline graphic.

It is also informative to study the performance of the other five methods, as this can be important to practitioners deciding which method to apply to their data.

The methods of Wang et al. (2017), CATE-RR, and Lee et al. (2017), dSVA, estimate Inline graphic as Inline graphic and Inline graphic, respectively, where their estimates of Inline graphic, Inline graphic and Inline graphic, are nearly identical to Inline graphic as defined in Step (a) of Algorithm 2. However, their estimates of Inline graphic, Inline graphic and Inline graphic, ignore the shrinkage described in Proposition 2. We would therefore expect them to introduce more Type I errors when Inline graphic. Both CATE-RR and dSVA’s false discovery proportion estimates were closer to nominal values when Inline graphic, since any rejection region was likely to have more genomic units with nonzero coefficients of interest.

The method of Leek & Storey (2008), IRW-SVA, estimates Inline graphic by performing a factor analysis on Inline graphic, where Inline graphic is an estimate of Inline graphic (Inline graphic), by iteratively estimating Inline graphic and Inline graphic. Since the first iteration assumes Inline graphic, Inline graphic tends to be small if the marginal correlation between Inline graphic and Inline graphic is large, which occurs if Inline graphic is large. Therefore, the latent factors that influence Inline graphic will be different from those of Inline graphic if the latent factors with the largest effects are also correlated with Inline graphic. This explains why IRW-SVA performs poorly when Inline graphic. Unfortunately, there is no theory that states when IRW-SVA is expected to accurately recover Inline graphic.

Both RUV-2 (Gagnon-Bartsch & Speed, 2012) and RUV-4 (Gagnon-Bartsch et al., 2013) assume the practitioner has prior knowledge of a subset Inline graphic of control genomic units where Inline graphic for all Inline graphic. We selected Inline graphic control units uniformly at random from the set of all genomic units with Inline graphic across all simulations, because simulations in Wang et al. (2017) use 30 control units when Inline graphic. RUV-2 estimates Inline graphic via factor analysis using only data from genomic units in Inline graphic, whereas RUV-4 first estimates Inline graphic and Inline graphic as Inline graphic and Inline graphic defined in Step (a) of Algorithm 2, and then estimates Inline graphic as Inline graphic. Here, Inline graphic and Inline graphic are the submatrices of Inline graphic and Inline graphic restricted to the rows in Inline graphic. The RUV-4 estimate of Inline graphic is then Inline graphic. The obvious caveat for RUV-2 and RUV-4 is that the practitioner must have a list of units whose coefficients of interest are zero and whose expression or methylation carries the latent factor signature, i.e., the first Inline graphic eigenvalues of Inline graphic must be suitably large. For example, the large variability in RUV-2’s false discovery proportion when Inline graphic is because the Inline graphic control units were not sufficient to capture the latent factor signature in many simulations.

4.2. Data application

In order to demonstrate the importance of using our shrinkage-corrected estimator, we applied our method to reanalyse data from Nicodemus-Johnson et al. (2016), which studied the correlation between adult asthma and DNA methylation in lung epithelial cells. The authors collected endobronchial brushings from 74 adult patients with a current doctor’s diagnosis of asthma and 41 healthy adults, and quantified their DNA methylation at Inline graphic methylation sites, also referred to as CpGs, using the Infinium Human Methylation 450K Bead Chip (Dedeurwaerder et al., 2011). Nicodemus-Johnson et al. (2016) then used ordinary least squares to regress the methylation at each of the Inline graphic sites on to the mean model subspace that included asthma status, age, ethnicity, sex and smoking status to estimate the effect due to asthma, Inline graphic. They found 40 892 CpGs that were differentially methylated between asthmatics and healthy patients at a nominal false discovery rate of 5%.

We investigated whether or not the strong association between DNA methylation and asthma status was in part due to unobserved covariates. In particular, lung cell composition may differ between asthmatics and nonasthmatics, with asthmatic patients generally having a greater proportion of airway goblet cells that excrete mucus (Rogers, 2002; Bai & Knight, 2005). We therefore reanalysed these data to account for latent covariates with our shrinkage-corrected method defined by Algorithm 2, and compared the results to those obtained using the methods proposed in Leek & Storey (2008), Lee et al. (2017) and Wang et al. (2017). We could not apply the methods proposed in Gagnon-Bartsch & Speed (2012) and Gagnon-Bartsch et al. (2013) because we did not have access to control CpGs. We first used bi-crossvalidation (Owen & Wang, 2016) to estimate that there were Inline graphic latent factors in these data, and subsequently estimated Inline graphic using the four different methods. We then computed Inline graphic-values for the null hypotheses Inline graphic (Inline graphic) using ordinary least squares with the design matrix Inline graphic, where Inline graphic was asthma status and Inline graphic contained the observed nuisance covariates age, ethnicity, sex and smoking status. The total number of asthma-related CpGs returned by each method as a function of Inline graphic-value cut-offs (Storey et al., 2015), as well as the uncorrected and shrinkage-corrected estimates of Inline graphic, are given in Fig. 2. At a Inline graphic-value threshold of 20%, our method identifies 10 324 asthma-related CpGs, while the methods proposed in Leek & Storey (2008), Lee et al. (2017) and Wang et al. (2017) ostensibly identify 32 952, 29 415 and 22 545 asthma-related CpGs, respectively. These numbers changed only slightly when we let Inline graphic be as high as 7.

Fig. 2.

Results from our analysis of lung DNA methylation data from Nicodemus-Johnson et al. (2016). (a) The number of asthma-related CpGs at a given Inline graphic-value cut-off using our shrinkage-corrected estimator (solid line), as well as the estimators proposed in Lee et al. (2017) (dot-dashed line), Wang et al. (2017) (dotted line) and Leek & Storey (2008) (dashed line). (b) The Inline graphic components of Inline graphic (Inline graphic) and Inline graphic (Inline graphic) as a function of Inline graphic. The dashed line is the 0.95 quantile of the Inline graphic distribution, where Inline graphic is defined such that Inline graphic converges to a chi-squared random variable with Inline graphic degrees of freedom under the null hypothesis from Theorem 3.

We estimated that approximately 36% of the variance in asthma status was explained by Inline graphic, which, using Theorem 3, corresponds to a Inline graphic-value for the null hypothesis Inline graphic of Inline graphic. Moreover, assuming Inline graphic, the largest component of Inline graphic appeared to load on to the third column of Inline graphic, where Inline graphic. Since this was much smaller than Inline graphic and we estimated Inline graphic at over 40% of the studied CpGs Inline graphic, Proposition 2, Corollary 1 and simulations suggest that the methods proposed in Lee et al. (2017) and Wang et al. (2017) are likely underestimating the fraction of CpGs with Inline graphic at any nominal Inline graphic-value threshold. It is likely the case that Inline graphic, the third largest eigenvalue of Inline graphic, was small even though the third factor explained a significant portion of the variability in methylation levels because its strong correlation with asthma status dampened Inline graphic.

We next sought to determine if differences in lung cell composition between asthmatic and healthy patients were responsible for some of the correlation between asthma status and the latent factors, since understanding the origin of the latent covariates could help practitioners determine which method is most appropriate for their data. To do so, we fit a topic model with Inline graphic topics on the same individuals’ gene expression data, which has been shown to cluster bulk RNA-seq samples by tissue and cell type (Taddy, 2012; Dey et al., 2017). We then used the Inline graphic-dimensional factor whose corresponding loading was the largest on the MUC5AC gene as a proxy for the proportion of goblet cells in each sample, as MUC5AC is a unique identifier for goblet cells (Zuhdi Alimam et al., 2000). Just as one would expect, asthmatics tended to have a higher proportion of estimated goblet cells than healthy controls, and we rejected the null hypothesis that asthmatics and healthy controls had the same mean estimated goblet cell proportion at the significance level of Inline graphic. This indicates that cell composition is presumably driving much of the observed correlation between methylation levels and asthma status in Nicodemus-Johnson et al. (2016), as well as the results from the reanalysis with the methods proposed in Lee et al. (2017) and Wang et al. (2017).

These conclusions also help to explain why the method proposed in Leek & Storey (2008) is likely underestimating the number of false discoveries. We estimated that Inline graphic in these data, which is precisely what one would expect if cellular heterogeneity were among the unobserved factors, since changes in methylation help drive cellular differentiation. And since we have already shown that Inline graphic is correlated with Inline graphic, the method proposed in Leek & Storey (2008) would not be expected to control the false discovery rate, as the simulations in § 4.1 showed exactly this when Inline graphic was large for many genomic units Inline graphic.

5. Discussion

The prevalence of unobserved covariates in high-throughput omic data has precipitated the development of methods that account for unobserved factors Inline graphic in downstream inference. While these methods perform well when the data are strongly informative for Inline graphic, they are not amenable to inference when the data are only moderately informative for Inline graphic. On the other hand, we prove that inference using estimates from our shrinkage-corrected method in Algorithm 2 is asymptotically equivalent to ordinary least squares when Inline graphic is observed.

Our method is not a cure-all for inference with unobserved covariates. For example, Assumption 3(a) restricts the number of units with nonzero main effect in DNA methylation data to be on the order of hundreds to thousands when the data are only moderately informative for Inline graphic. Even though simulations show we can potentially relax this number substantially to tens or even hundreds of thousands in practice, it raises the question as to whether practitioners should spend time and money to measure nuisance variables like cellular heterogeneity, or estimate them directly from the data. If the practitioner is concerned that Inline graphic is correlated with Inline graphic, but has reason to believe Inline graphic is sparse, our theory suggests the effort should be spent collecting more samples. However, if Inline graphic is correlated with Inline graphic and Inline graphic is dense, it may be worthwhile to attempt to measure some of the latent factors with other technologies. We are currently working with the authors of Nicodemus-Johnson et al. (2016) to use external sources of information to potentially better account for cellular heterogeneity in their data.

Acknowledgement

We thank Carole Ober and Michelle Stein for comments that have substantially improved this manuscript. The research was supported in part by the National Institutes of Health.

Supplementary material

Supplementary material available at Biometrika online includes additional simulation results and proofs of all the propositions, lemmas and theorems presented in this paper. An R package implementing our method, together with instructions and code to reproduce the simulations from § 4.1, is available from https://github.com/chrismckennan/BCconf.

References

  1. Bai, J. & Li, K. (2012). Statistical analysis of factor models of high dimension. Ann. Statist. 40, 436–65.
  2. Bai, T. R. & Knight, D. A. (2005). Structural changes in the airways in asthma: observations and consequences. Clin. Sci. 108, 463–77.
  3. Cangelosi, R. & Goriely, A. (2007). Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct 2, 2.
  4. Dedeurwaerder, S., Defrance, M., Calonne, E., Denis, H., Sotiriou, C. & Fuks, F. (2011). Evaluation of the Infinium Methylation 450K technology. Epigenomics 3, 771–84.
  5. Dey, K. K., Hsiao, C. J. & Stephens, M. (2017). Visualizing the structure of RNA-seq expression data using grade of membership models. PLOS Genetics 13, e1006599.
  6. Fahy, J. V. (2002). Goblet cell and mucin gene abnormalities in asthma. Chest 122, 320S–26S.
  7. Fan, J. & Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. J. R. Statist. Soc. B 79, 1143–64.
  8. Gagnon-Bartsch, J. A., Jacob, L. & Speed, T. P. (2013). Removing unwanted variation from high dimensional data with negative controls. Tech. rep. 820, UC Berkeley.
  9. Gagnon-Bartsch, J. A. & Speed, T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–52.
  10. Houseman, E. A., Accomando, W. P., Koestler, D. C., Christensen, B. C., Marsit, C. J., Nelson, H. H., Wiencke, J. K. & Kelsey, K. T. (2012). DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, 86.
  11. Houseman, E. A., Molitor, J. & Marsit, C. J. (2014). Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics 30, 1431–9.
  12. Jaffe, A. E. & Irizarry, R. A. (2014). Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 15, R31.
  13. Johnson, W. E., Li, C. & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–27.
  14. Lee, S., Sun, W., Wright, F. A. & Zou, F. (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika 104, 303–16.
  15. Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K. & Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–9.
  16. Leek, J. T. & Storey, J. D. (2008). A general framework for multiple testing dependence. Proc. Nat. Acad. Sci. 105, 18718–23.
  17. Liu, C., Marioni, R. E., Hedman, Å. K., Pfeiffer, L., Tsai, P. C., Reynolds, L. M., Just, A. C., Duan, Q., Boer, C. G., Tanaka, T., et al. (2018). A DNA methylation biomarker of alcohol consumption. Molec. Psychiatry 23, 422–33.
  18. McKennan, C. & Nicolae, D. (2018). Estimating and accounting for unobserved covariates in high dimensional correlated data. arXiv:1808.05895v1.
  19. Morales, E., Vilahur, N., Salas, L. A., Motta, V., Fernandez, M. F., Murcia, M., Llop, S., Tardon, A., Fernandez-Tardon, G., Santa-Marina, L., et al. (2016). Genome-wide DNA methylation study in human placenta identifies novel loci associated with maternal smoking during pregnancy. Int. J. Epidemiol. 45, 1644–55.
  20. Nicodemus-Johnson, J., Myers, R. A., Sakabe, N. J., Sobreira, D. R., Hogarth, D. K., Naureckas, E. T., Sperling, A. I., Solway, J., White, S. R., Nobrega, M. A., et al. (2016). DNA methylation in lung cells is associated with asthma endotypes and genetic risk. JCI Insight 1, e90151.
  21. Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. Rev. Econom. Statist. 92, 1004–16.
  22. Owen, A. B. & Wang, J. (2016). Bi-cross-validation for factor analysis. Statist. Sci. 31, 119–39.
  23. Rogers, D. F. (2002). Airway goblet cell hyperplasia in asthma: Hypersecretory and anti-inflammatory? Clin. Experim. Allergy 32, 1124–7.
  24. Stein, M. M., Hrusch, C. L., Gozdz, J., Igartua, C., Pivniouk, V., Murray, S. E., Ledford, J. G., Marques Dos Santos, M., Anderson, R. L., Metwali, N., et al. (2016). Innate immunity and asthma risk in Amish and Hutterite farm children. New Engl. J. Med. 375, 411–21.
  25. Storey, J. D. (2001). A direct approach to false discovery rates. J. R. Statist. Soc. B 63, 479–98.
  26. Storey, J. D., Bass, A. J., Dabney, A. & Robinson, D. (2015). qvalue: Q-value Estimation for False Discovery Rate Control. R package version 2.10.0. http://github.com/jdstorey/qvalue [last accessed 14 June 2019].
  27. Sun, Y., Zhang, N. R. & Owen, A. B. (2012). Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Statist. 6, 1664–88.
  28. Taddy, M. (2012). On estimation and selection for topic models. Proc. Mach. Learn. Res. 22, 1184–93.
  29. van Iterson, M., van Zwet, E. W. & Heijmans, B. T. (2017). Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biol. 18, 19.
  30. Wang, J., Zhao, Q., Hastie, T. & Owen, A. B. (2017). Confounder adjustment in multiple hypothesis testing. Ann. Statist. 45, 1863–94.
  31. Yang, I. V., Pedersen, B. S., Liu, A. H., O’Connor, G. T., Pillai, D., Kattan, M., Misiak, R. T., Gruchalla, R., Szefler, S. J., Khurana Hershey, G. K., et al. (2017). The nasal methylome and childhood atopic asthma. J. Allergy Clin. Immunol. 139, 1478–88.
  32. Zhang, X., Biagini Myers, J. M., Burleson, J., Ulm, A., Bryan, K. S., Chen, X., Weirauch, M. T., Baker, T. A., Butsch Kovacic, M. S. & Ji, H. (2018). Nasal DNA methylation is associated with childhood asthma. Epigenomics 10, 629–41.
  33. Zuhdi Alimam, M., Piazza, F. M., Selby, D. M., Letwin, N., Huang, L. & Rose, M. C. (2000). Muc-5/5ac mucin messenger RNA and protein expression is a marker of goblet cell metaplasia in murine airways. Am. J. Respir. Cell Molec. Biol. 22, 253–60.

Articles from Biometrika are provided here courtesy of Oxford University Press
