Estimating and accounting for unobserved covariates in high-dimensional correlated data

Chris McKennan; Dan Nicolae

doi:10.1080/01621459.2020.1769635

. Author manuscript; available in PMC: 2023 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2020 Jun 30;117(537):225–236. doi: 10.1080/01621459.2020.1769635

Estimating and accounting for unobserved covariates in high-dimensional correlated data

Chris McKennan ¹, Dan Nicolae ²

PMCID: PMC9126075 NIHMSID: NIHMS1641111 PMID: 35615339

Abstract

Many high dimensional and high-throughput biological datasets have complex sample correlation structures, which include longitudinal and multiple tissue data, as well as data with multiple treatment conditions or related individuals. These data, as well as nearly all high-throughput ‘omic’ data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obfuscate estimation of and inference on the effects of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of and estimate latent confounding factors present in high dimensional data with correlated or nonexchangeable residuals. We demonstrate each method’s superior performance compared to other state of the art methods by analyzing simulated multi-tissue gene expression data and identifying sex-associated DNA methylation sites in a real, longitudinal twin study.

Keywords: Batch effects, unwanted variation, confounding, correlation, multi-tissue, cell-type heterogeneity

1. Introduction

The development of high-throughput technologies has provided biologists with a cornucopia of genetic, proteomic and metabolomic data that can help elucidate the genetic components of phenotypes and the mediation of environmental exposures. Many of these ‘omic’ data have complex sample-correlation structure, which includes longitudinal data (Baumgart et al., 2016; McKennan et al., 2018), multi-tissue data (GTEx Consortium, 2017), data with multiple treatment conditions (Knowles et al., 2018) and data with related individuals (Martino et al., 2013; Tung et al., 2015). Longitudinal studies have even been cited as a critical area of future DNA methylation research to assess the stability of methylation marks over time (Breton et al., 2017; Martin and Fry, 2018), so it is crucial that suitable methods exist to analyze these data. An important feature of these high-throughput data is the presence of unmeasured factors that influence the measured response, which include technical factors like batch variables and biological factors like cell composition (Leek et al., 2010; Jaffe and Irizarry, 2014). When unaccounted for, these factors can bias test statistics, reduce power and lead to irreproducible results (Yao et al., 2012; Peixoto et al., 2015).

There have been a number of methods developed by the statistical community to estimate and correct for latent factors in high throughput biological data (Buja and Eyuboglu, 1992; Leek and Storey, 2008; Gagnon-Bartsch and Speed, 2012; Sun et al., 2012; Gagnon-Bartsch et al., 2013; Houseman et al., 2014; Owen and Wang, 2016; Fan and Han, 2017; Lee et al., 2017; Wang et al., 2017; McKennan and Nicolae, 2019). However, these methods make the critical assumption that conditional on both the observed and unobserved covariates, samples are independent with homogeneous residual variances, and tend to perform poorly when these assumptions are violated. The goal of this paper is to provide a provably accurate method to both choose the number of and estimate latent factors from the measured correlated data so that downstream inference on the effects of interest is as accurate as when all latent factors are observed. To the best of our knowledge, this is the first method to estimate and correct for latent covariates in high throughput biological data with correlated samples.

We use DNA methylation quantified in p ≈ 8 × 10⁵ methylation sites (CpGs) in n / 2 = 183 unrelated individuals at birth and age 7 from McKennan et al. (2018) as a motivating data example, although we analyze other methylation data with a more complex covariance structure in Section 6. The aim of that study was to jointly model methylation at birth and age 7 to determine if the effects due to ancestry on methylation levels changed or remained constant over time. If we ignore observed nuisance variables like the intercept, a reasonable model relating ancestry $A \in ℝ^{n / 2}$ to methylation levels at CpG g at birth, $y_{g, 0} \in ℝ^{n / 2}$ , and age 7, $y_{g, 7} \in ℝ^{n / 2}$ , in the presence of unobserved covariates $C \in ℝ^{n \times K}$ is

y_{g} = {(y_{g, 0}^{T} y_{g, 7}^{T})}^{T} = X β_{g} + {C^{l}}_{g} + e_{g}, e_{g} ~ N_{n} (0, V_{g}) (g = 1, \dots, p) V_{g} = v_{g, 1} B_{1} + v_{g, 2} B_{2} + v_{g, 3} B_{3} (g = 1, \dots, p),

(1)

where X = A ⊗ A, β_g = (β_g,0 β_g,7)^T and β_g,0, β_g,7 are the effects of interest. $B_{1} \in ℝ^{n \times n}$ is a partition matrix that groups samples by individuals and captures the within-individual variability. $B_{2} \in ℝ^{n \times n}$ and $B_{3} \in ℝ^{n \times n}$ capture the potential difference in residual variance at birth and age 7 and are diagonal matrices with ones in the first n / 2 and second n / 2 diagonal entries, respectively. In order to avoid overestimating the residual variance or biasing our estimates for β_g, we must first estimate C and its latent dimension K. One approach would be to use methods from the economics literature that allow for dependent residuals in their theoretical arguments (Bai, 2009; Li et al., 2016; Lu and Su, 2016; Su and Ju, 2018). However, their assumptions on the homogeneity of the effects β₁, …, β_p are too restrictive to have any application in genetic data. Another strategy would be to use existing methods designed for genetic data to estimate the factors at each age separately. However, this reduces the sample size by 50% and risks underestimating K, and subsequently biasing estimates for β_g. To avoid these biases, a second option would be to estimate K and C using all n samples. However, naively estimating K with methods commonly applied to genetic data with independent and identically distributed residuals, like parallel analysis (Buja and Eyuboglu, 1992) and bi-cross validation (Owen and Wang, 2016), will typically overestimate K to be on the order of the sample size, as they will be unable to distinguish the low dimensional factors C from the high dimensional random effect. Even if K were known, the correlation among the residuals obfuscates existing estimates for C.

Our proposed method uses all of the available data to estimate K and C, and subsequently the effects of interest β₁, …, β_p, in data whose genomic unit-specific covariance can be written as a linear combination of known matrices. This covariance structure is quite general and includes longitudinal, multi-tissue, multi-treatment and twin studies, as well as studies with individuals related through a kinship matrix.

The remainder of the paper is organized as followed. We describe the data and set up the problem in Section 2 and present a detailed description of our method in Section 3. We prove its efficacy in Section 4 by proving its estimate for K is consistent, its estimate for the column space of C is nearly as accurate as when samples are independent (i.e. V_g in (1) is a multiple of the identity) and that inference on the effects of interest is asymptotically equivalent to that when C is known. In Section 5 we analyze simulated multi-tissue gene expression data with a complex, gene-dependent correlation structure that demonstrates our method’s superior performance in both choosing K and estimating C. We lastly apply our method to a longitudinal DNA methylation dataset with measurements made on pairs of twins to identify CpGs with sex-dependent methylation levels in Section 6. An R package called CorrConf that implements our method to estimate K and C is freely available on GitHub. The proofs of all theorems are in the Supplement.

2. Notation and problem set-up

2.1. Notation

We define $1_{n} \in ℝ^{n}$ to be the vector of all ones, $I_{n} \in ℝ^{n \times n}$ to be the identity matrix, [n] = {1, …, n} and [x]_i to be the ith element of $x \in ℝ^{n}$ . For $M \in ℝ^{n \times m}$ , we let [M]_ij be the (i, j) element of M, and define P_M and $P_{M}^{⊥}$ to be the orthogonal projections matrices onto $Im (M) = {M v : v \in ℝ^{m}}$ and $M^{⊥} = {v \in ℝ^{n} : M^{T} v = 0}$ . If d = dim (M^⊥) > 0, we define $Q_{M} \in ℝ^{n \times d}$ to be a matrix whose columns form an orthonormal basis for M^⊥, i.e. $P_{M}^{⊥} = Q_{M} Q_{M}^{T}$ . Lastly, we use the relation $X \overset{d}{=} y$ if the two random variables, vectors or matrices X and Y have the same distribution.

2.2. A description of and generative probability model for the data

Let $y_{g} \in ℝ^{n}$ be the measured expression or methylation of genomic unit g ∈ [p] in n potentially correlated samples, and let $X \in ℝ^{n \times d}$ be the covariates of interest. When latent factors $C \in ℝ^{n \times K}$ influence the data y_g, we assume y_g is generated as

y_{g} = X β_{g} + C_{g}^{l} + e_{g}, e_{g} ~ N_{n} (0, V_{g}) (g = 1, \dots, p)

(2a)

V_{g} = v_{g, 1} B_{1} + \dots + v_{g, b} B_{b} (g = 1, \dots, p)

(2b)

y_{p \times n} = {[y_{1} \dots y_{p}]}^{T} = β_{p \times d} X_{n \times d}^{T} + L_{p \times K} C_{n \times K}^{T} + E_{p \times n} .

(2c)

The gth rows of β, L and E are β_g, ℓ_g and e_g, respectively, e₁,…, e_p are independent, and our goal is to estimate and perform inference on β₁, …, β_p. We assume C is a random matrix that is independent of E, and whose mean may be a function of X in our theoretical results in Section 4. Therefore, accounting for C can be interpreted as a way to alleviate both the dependence across genomic units and potential biases when estimating and performing inference on β₁, …, β_p. We postpone discussing the distributional assumptions on C until Section 4 because, other than C being independent of E, they play no part in the methodology presented in Section 3.

The observed matrices $B_{1}, \dots, B_{b} \in ℝ^{n \times n}$ describe how the n samples are related, and Model (2b) is a typical model for the variance in mixed effect models in high throughput biological data (Martino et al., 2013; Tung et al., 2015; Chen et al., 2017; Knowles et al., 2018; McKennan et al., 2018). When b = 1 and B₁ = I_n, Model (2) reduces to the model typically used to estimate β₁, …, β_p in high throughput data with independent samples (Leek and Storey, 2008; Sun et al., 2012; Lee et al., 2017; Wang et al., 2017; Gerard and Stephens, 2018; McKennan and Nicolae, 2019). We assume the unknown variance multipliers v_g = (v_g,1 … v_g,b)^T lie in the convex polytope

Θ = {x \in ℝ^{b} : A_{E} x = 0, A_{I} x \geq 0},

(3)

where the matrices $A_{E} \in ℝ^{N_{ε} \times b}$ and $A_{I} \in ℝ^{N τ \times b}$ are the equality and inequality constraints on v_g. For example, we may know that certain multipliers must be larger than others, or that the sum of two sets of multipliers must be equal to ensure the marginal variances are the same.

In many applications, there are other observed covariates $Z \in ℝ^{n \times r}$ that may influence the response y_g but whose effects are not of interest. In that case, the model for y_g would be

y_{g} = X β_{g} + Z ω_{g} + C^{l}_{g} + {se}_{g},

(4)

and we can get back to Model (2) by multiplying y_g by $Q_{z}^{T}$ , provided r does not grow with n:

Q_{Z}^{T} y_{g} = (Q_{Z}^{T} X) β_{g} + {(Q_{Z}^{T} C)}^{l}_{g} + {\bar{e}}_{g}, {\bar{e}}_{g} \sim N_{n - r} (0, {\bar{V}}_{g}) {\bar{V}}_{g} = v_{g, 1} (Q_{z}^{T} B_{1} Q_{z}) + \dots + v_{g, b} (Q_{z}^{T} B_{b} Q_{z}) .

Therefore, we assume that the only observed covariates are contained in X.

Evidently, C in Model (2) is not identifiable. However, we show in Section 4 that conditional on C, Im (C) is identifiable when β satisfies a modest sparsity assumption. Therefore, our primary goal is to develop a procedure to estimate and perform inference on β₁, …, β_p that only relies on an accurate estimate for Im (C). Our secondary goal is to simply estimate Im (C) when there are no covariates of interest, i.e. d = 0. Since accomplishing the former implies we can achieve the latter, we only consider the case when d ≥ 1.

We estimate Im (C) by separately computing, and then combining, estimates for the parts of Im (C) in Im (X) and X^⊥. Let $G \in ℝ^{n \times n}$ be a positive definite matrix. We estimate the part of Im (C) in Im (X) using the variation in y₁, …, y_p explainable by X:

y_{g_{1}} = {(X^{T} G^{- 1} X)}^{- 1} X^{T} G^{- 1} y_{g} = β_{g} + Ω_{g}^{l} + e_{g_{1}} \in ℝ^{d} (g = 1, \dots, p)

(5a)

Ω = {(X^{T} G^{- 1} X)}^{- 1} X^{T} G^{- 1} C \in ℝ^{d \times K}

(5b)

Y_{1} = {[\begin{array}{l} y_{1_{1}} & \dots & y_{p_{1}} \end{array}]}^{T} = β + L Ω^{T} + E_{1} \in ℝ_{p \times d} .

(5c)

Here $y_{g_{1}}$ is the naive weighted least squares estimator for β_g that ignores C and Ω is the weighted least squares regression coefficient for the regression of C onto X. We show how to choose the appropriate G in Section 3.4. Next, we use the variation in y₁, …, y_p explainable by X^⊥ to estimate the part of Im (C) in X^⊥ :

y_{g_{2}} = Q_{X}^{T} y_{g} = C_{⊥} l_{g} + e_{g_{2}} \in ℝ^{n - d}, e_{g_{2}} ~ N_{n - d} (0, Q_{X}^{T} V_{g} Q_{X}) (g = 1, \dots, p)

(6a)

C_{⊥} = Q_{X}^{T} C \in ℝ^{(n - d) \times K}

(6b)

Y_{2} = {[\begin{array}{l} y_{1_{2}} & \dots & y_{p_{2}} \end{array}]}^{T} = L C_{⊥}^{T} + E_{2} \in ℝ^{p \times (n - d)} .

(6c)

Here C_⊥ and $y_{g_{2}}$ lie in the space orthogonal to X, where the latter no longer depends on β_g. Algorithm 1 below, which we call CBCV-CorrConf, provides a cursory description of how we use the objects defined in (5) and (6) to estimate Im (C) and subsequently β₁,…,β_p.

Algorithm 1 (CBCV-CorrConf).

For $k \in {0, 1, \dots, K_{\max}}$ , use Y₂ to obtain estimates $L \in ℝ^{p \times k}$ and $C_{⊥} \in ℝ^{(n - d) \times k}$ using a factor analysis procedure we call ICaSE (Iterative Correlation and Subspace Estimation) (Section 3.2).
Use Y₂ and a cross-validation procedure we call CBCV (Correlated Bi-Cross Validation) to define $\hat{K}$ , the estimate for dim {Im (C)} (Section 3.3).
Define G and estimate Ω using Y₁ and the estimators $L \in ℝ^{p \times \hat{K}}$ , $C_{⊥} \in ℝ^{(n - d) \times \hat{K}}$ from steps 1 and 2. Let the estimate for C be $C = X Ω + G Q_{X} {(Q_{X}^{T} G Q_{X})}^{- 1} C_{⊥} \in ℝ^{n \times \hat{K}}$ . For $\hat{K}$ given, we call this method of estimating C CorrConf (Section 3.4).
Use the design matrix [X C] to estimate β₁ , …, β_p with generalized least sqaures.

The estimates for Im (C_⊥) and Im (C) are then Im (C_⊥) and Im (C), where the estimate for C in step 3 follows from the fact that $C = X Ω + G Q_{X} {(Q_{X}^{T} G Q_{X})}^{- 1} C_{⊥}$ for any positive definite G. We show in Section 4 that our estimator in Step 4 only depends on Im (C). If b = 1, B₁ = I_n and K were known, steps 1 and 3 with G = I_n are similar to methods used by Sun et al. (2012); Wang et al. (2017); Lee et al. (2017); McKennan and Nicolae (2019).

3. Estimating the factor dimension K and Im (C)

3.1. A computationally tractable model for the data

The generative model assumed in (2) is not conducive to estimating Im (C), since it requires jointly estimating Im (C) and all p covariance matrices V₁,…,V_p. Instead, we use a simpler, but incorrect, model that assumes $V_{1} = \dots = V_{p} = δ^{2} V$ to estimate K, L and Im (C):

Y = β X^{T} + L C^{T} + E, E ~ M N_{p \times n} (0, δ^{2} I_{p}, V), \det (Q_{X}^{T} V Q_{X}) = 1,

(7)

where we introduce δ² so that V is scale-invariant. We define δ² in terms of the determinant for reasons discussed in Section 3.3.

Let $δ_{*}^{2}$ and V_* be the values of δ² and V that, conditional on C, minimize the KL-divergence between the data generating model in (2) and the incorrect model in (7). Then

δ_{*}^{2} V_{*} = p^{- 1} (V_{1} + \dots + V_{p}), \det (Q_{X}^{T} V_{*} Q_{X}) = 1.

(8)

In Sections 3.2, 3.3 and 3.4 we show that, quite remarkably, we only need to estimate V_* and not V₁, …,V_p to get accurate estimates for Im (C).

Nevertheless, without any assumptions on V_*, this is a challenging, if not an intractable, problem. However, since $V_{g} = v_{g, 1} B_{1} + \dots + v_{g, b} B_{b}$ for all g = 1,…, p and B₁,…, B_b are known, then $V_{*} = τ_{*_{1}} B_{1} + \dots + τ_{*_{b}} B_{b}$ by (8), where

τ_{*} = {(τ_{*_{1}}^{\dots} τ_{*_{b}})}^{T} = {(δ_{*}^{2} p)}^{- 1} (v_{1} + \dots + v_{p}), δ_{*}^{2} = \det {p^{- 1} Q_{X}^{T} (V_{1} + \dots + V_{p}) Q_{X}}^{1 /(n − d)} .

Therefore, we need only estimate τ_* to estimate V_*, and subsequently Im (C).

3.2. ICaSE: an iterative factor analysis to estimate Im (C_⊥)

Here we present the algorithm ICaSE, a method to estimate Im (C_⊥), $δ_{*}^{2}$ and τ_* using Y₂. Recall from (6) that the mean of Y₂ is $L C_{⊥}^{T}$ and is not dependent on X. Let $W_{*} = Q_{X}^{T} V_{*} Q_{X}$ , which implies $W_{*} = τ_{*_{1}} Q_{X}^{T} B_{1} Q_{X} + \dots + τ_{*_{b}} Q_{X}^{T} B_{b} Q_{X}$ and $δ_{*}^{2} W_{*} = p^{- 1} Q_{X}^{T} (V_{1} + \dots + V_{p}) Q$ . Therefore,

S_{2} = p^{- 1} Y_{2}^{T} Y_{2} \approx C_{⊥} (p^{- 1} L^{T} L) C_{⊥}^{T} + p^{- 1} E_{2}^{T} E_{2} \approx C_{⊥} (p^{- 1} L^{T} L) C_{⊥}^{T} + δ_{*}^{2} W_{*} .

Since W_* is not a multiple of the identity, the span of the first K eigenvectors of S₂ will not be an accurate estimate for Im (C_⊥), in general. If W_* were known, we could instead first estimate $Im (W_{*}^{- 1 / 2} C_{⊥})$ as the span of the first K eigenvectors of $W_{*}^{- 1 / 2} S_{2} W_{*}^{- 1 / 2}$ , and then rescale the estimate by $W_{*}^{- 1 / 2}$ to estimate Im (C_⊥). On the other hand, if Im (C_⊥) were known, one could easily estimate $δ_{*}^{2}$ and τ_* using restricted maximum likelihood. ICaSE iterates between these two steps to estimate Im (C_⊥), $δ_{*}^{2}$ and τ_*.

It remains to show how to find a starting point for $δ_{*}^{2}$ and τ_* that avoids incorporating signal from the random effect into the estimate for Im (C_⊥), as imprudent starting points often beget biased estimates for K and Im (C_⊥). We Y₂ due to C_⊥ from the random effect by employing a separate the variation in “warm start” technique often used to solve penalized regression problems. First, we initialize our estimates for $δ_{*}^{2}$ and τ_* assuming K = 0 (i.e. $L C_{⊥}^{T} = 0$ ). We then use the estimate for τ_* when we assume dim {Im (C_⊥)} = k − 1 as the starting point when we assume dim {Im (C_⊥)} = k. This ensures that we attribute as much variability as possible to the random effect and only ascribe variability to the latent covariates if the observed signal is not amenable to the model for the variance. Algorithm 2 enumerates these steps in ICaSE.

Algorithm 2 (ICaSE).

Let $W (x) = \sum_{j = 1}^{b} {[x]}_{j} Q_{X}^{T} B_{j} Q_{X}$ for $x \in ℝ^{b}$ .

For k = 0, estimate $δ_{*}^{2}$ and τ_* by maximum likelihood using the model $Y_{2} ~ M N_{p \times (n - d)} {0, δ^{2} I_{p}, W (τ)}$ under the restriction that det {W (τ)} = 1 and δ² τ ∈ Θ.
Let W = W (τ). For k ≥ 1 and given estimates ${\hat{δ}}^{2}$ and τ for $δ_{*}^{2}$ and τ_*, estimate W^−1/2 C_⊥ as the first k right singular vectors of Y₂ W^−1/2. Re-scale the estimate by (n − d)^1/2 W^1/2 to get an estimate for $C_{⊥}, C_{⊥} \in ℝ^{(n - d) \times k}$ .
Given C_⊥, obtain ${\hat{δ}}^{2}$ and τ by restricted maximum likelihood using the model
$Y_{2} ~ M N_{p \times (n - d)} {L C_{⊥}^{T}, δ^{2} I_{p}, W (τ)} {where}^{\det {W (τ)} = 1} and δ^{2} τ \in Θ .$
Iterate between steps 1 and 2 and stop on step 1 of the second iteration. Repeat this for k = 1, …, K_max, using τ, ${\hat{δ}}^{2}$ obtained when dim {Im (C_⊥)} = k − 1 as the starting point for step 1 when dim {Im (C_⊥)} = k.

Remark 1.

If b = 1 and $B_{1} = I_{n}, C_{⊥} \in ℝ^{(n - d) \times k}$ is a scalar multiple of the first k right singular vectors of Y₂ for each k ∈ [K_max].

Evidently, iterating between steps 1 and 2 is not explicitly maximizing an objective function. However, Theorem 1 in Section 4 proves that ICaSE’s estimates for Im (C_⊥) are as accurate as those obtained from principal component analysis when samples are independent.

3.3. Correlated Bi-Cross Validation to estimate K

Here we use Y₂ and a cross-validation paradigm developed in Owen and Wang (2016) to estimate K = dim {Im (C)}. Besides having substantially shorter computation time and providing a less variable estimate for K, what differentiates our method from the one developed in Owen and Wang (2016) is we provide an approach amenable to dependent data, while their method is only valid for independent data.

Our primary concern is ensuring our estimates for K are not biased by the correlations in the data (Hastie et al., 2009). An equivalent concern is we want to avoid including factors arising from the random effect in our estimate for K. To describe our procedure, we assume for notational convenience that $Y_{p \times n} = L_{p \times K} C_{n \times K}^{T} + E_{p \times n}$ where the rows e₁, …, e_p of E are independent and e_g ~ N_n (0,V_g) for all g ∈ [p]. Algorithm 3 provides an outline of the algorithm we use to estimate K, which we call Correlated Bi-Cross Validation (CBCV).

Algorithm 3 (CBCV).

Randomly partition the rows of Y (i.e. genes) into F = O (1) folds and let f ∈ [F]. Without loss of generality, assume

Y = [\begin{matrix} Y_{(- f)} \\ Y_{f} \end{matrix}] = [\begin{matrix} L_{(- f)} C^{T} \\ L_{f} C^{T} \end{matrix}] + [\begin{matrix} E_{(- f)} \\ E_{f} \end{matrix}]

where $Y_{(- f)} \in ℝ^{p_{(- f)} \times n}$ and $Y_{f} \in ℝ^{p_{f} \times n}$ are the training and test sets.

Let $k \in {0, 1, \dots, K_{\max}}$ . Obtain $C \in ℝ^{n \times k}$ and V_{(− f)} from Y_{(− f)} using Algorithm 2.
Let ${\bar{Y}}_{f} = Y_{f} V_{(- f)}^{- 1 / 2}$ and $\bar{C} = V_{(- f)}^{- 1 / 2} C$ . Define the loss for this fold, dimension pair as the leave-one-out cross validation loss:
$L_{f} (k) = \sum_{i = 1}^{n} {‖ {\bar{y}}_{f, i} - L_{f, (- i)} {\hat{\bar{c}}}_{i} ‖}_{2}^{2},$ (9)
where ${\bar{y}}_{f, i} \in ℝ^{p_{f}}$ and ${\hat{\bar{c}}}_{i} \in ℝ^{k}$ are the ith columns of ${\bar{Y}}_{f}$ and ${\bar{C}}^{T}$ , respectively. L_{f,(− i)} is the ordinary least squares regression coefficient from the regression of ${\bar{Y}}_{f, (- i)}$ onto ${\bar{C}}_{(- 1)}$ , where ${\bar{Y}}_{f, (- i)}$ and ${\bar{C}}_{(- i)}^{T}$ are submatrices of ${\bar{Y}}_{f}$ and ${\bar{C}}^{T}$ with the ith columns removed.
Repeat this for folds f = 1,…, F and k = 0,1,…, K max and set $\hat{K}$ to be

\hat{K} = \underset{k \in {0, 1, \dots, K_{\max}}}{\arg m in} {\sum_{f = 1}^{F} L_{f} (k)} .

(10)

Since (9) is the leave one out cross validation squared loss using the design matrix $\bar{C}$ and the scaled data ${\bar{Y}}_{f}$ , it only depends on Im (C) and V_{(− f)}.

Scaling by $V_{(- f)}^{- 1 / 2}$ in (9) is sensible because it places more importance on correctly estimating the portion of C not captured by the model for the residual variance. However, unless proper care is taken, this scaled loss function would underestimate K simply because the estimated residual variance is larger for underspecified models. The restriction that $\det {V_{(- f)}} = 1$ alleviates this issue by making the loss function scale-invariant.

To understand why CBCV gives accurate estimates for K, consider the expected leave-one-out squared error for a particular fold, dimension pair, where ${\bar{c}}_{i}$ is the ith row of $V_{(- f)}^{- 1 / 2} C$ :

E {L_{f} (k) ∣ Y_{(- f)}} = \underset{I}{\underset{︸}{\sum_{i = 1}^{n} E {{‖ L_{f} {\bar{c}}_{i} - L_{f, (- i)} {\hat{\bar{c}}}_{i} ‖}_{2}^{2} ∣ Y_{(- f)}}}} + \underset{I I}{\underset{︸}{p_{f} δ_{f^{*}}^{2} Tr {V_{(- f)}^{- 1} V_{f^{*}}}}} + \underset{I I I}{\underset{︸}{cross - term}} δ_{f^{*}} V_{f^{*}} = p_{f}^{- 1} \sum_{g \in fold f} V_{g}, \det (V_{f^{*}}) = 1.

In standard cross validation, V_{(− f)} = I_n, meaning the residual variance term (II) would be constant for all k = 0,1,…, K_max and the cross-term (III) would be 0, since L_{f,(− i)} would be independent of the ith column of Y_f. The minimizer of the squared bias term (I) would then also minimize the expected cross validated error. However, since we must account for the correlation between samples, we now need to ensure that I + II is minimized when $\hat{K} = K$ and that the cross-term does not contribute to the loss. The restriction that det {V_{(− f)} } = 1 helps ensure this is the case, since ${(n p_{f})}^{- 1} I I \geq δ_{f^{*}}^{2}$ , where the inequality holds with equality if and only if V_{(− f)} = V_{f *} by Jensen’s Inequality. Since an accurate estimate of Im (C) begets an accurate estimate of V_{(− f)*} ≈_{f *}, the minimizer of I + II, and also I + II + III if III ≈ 0, should be close to K. We make this rigorous and prove the consistency of $\hat{K}$ in Theorem 2 in Section 4.

3.4. De-biasing estimates for the main effect in correlated data

Recall from (5b) that for some positive definite $G \in ℝ^{n \times n}$ , $Ω = {(X^{T} G^{- 1} X)}^{- 1} X^{T} G^{- 1} C$ , which helps quantify the proportion of the variability of C explainable by X. While C_⊥ enables one to estimate ℓ_g and V_g, Ω allows us to estimate the effect of X on y_g while controlling for the latent factors C. This helps make results reproducible.

To estimate Ω, we must first specify a suitable weighting matrix G. Section 3.1 suggests that a reasonable choice would be $G = \sum_{j = 1}^{b} {[τ]}_{j} B_{j} = V$ , where τ is computed using Algorithm 2. From here on out, we assume K is known and define $Y_{g_{1}}$ , Ω and Y₁ in (5) by setting G = V.

Our strategy for estimating Ω is to regress Y₁ onto the estimate for L obtainable after completing Algorithm 2, where for $W = Q_{X}^{T} V Q_{X}, L = Y_{2} W^{- 1} C_{⊥} {(C_{⊥}^{T} W^{- 1} C_{⊥})}^{- 1}$ . One possible estimator for Ω is the ordinary least squares estimate $Ω^{naive} = Y_{1}^{T} L {(L^{T} L)}^{- 1}$ . When samples are independent, this strategy is similar to those used in Sun et al. (2012); Houseman et al. (2014); Lee et al. (2017); Wang et al. (2017); Fan and Han (2017), although the exact estimators differ slightly. However, McKennan and Nicolae (2019) showed that Ω^naive tends to underestimate Ω when samples are independent because L is a noisy estimate for the design matrix L. Therefore, we use the following de-biased estimate of Ω, which is analogous to the estimator used in McKennan and Nicolae (2019) when samples are assumed independent:

Ω = Y_{1}^{T} L {L^{T} L - p {\hat{δ}}^{2} (C_{⊥}^{T} W^{- 1} C + 1)}^{- 1}} .

(11)

The $p {\hat{δ}}^{2} {(C_{⊥}^{T} W^{- 1} C_{⊥})}^{- 1}$ term in (11) removes the bias in L^T L and reduces to the bias correction used in McKennan and Nicolae (2019) when b = 1 and B₁ = I_n.

Lastly, we can express C as $C = X Ω + V Q_{X} W^{- 1} C_{⊥}$ . Therefore, our estimate for C is

C = X Ω + V Q_{X} W^{- 1} C_{⊥} .

(12)

We subsequently estimate V_g and β_g for all g ∈ [p] with restricted maximum likelihood followed by generalized least squares using the design matrix [X C].

4. Theoretical justification of CBCV-CorrConf

For all theoretical results, we assume d, b, K_max = O (1), and y₁,…, y_p are generated by (2), where e₁,…, e_p are independent. The latter assumption is based off of the repeated observation in experimental genetic data that, for the purposes of marginal testing, it is sufficient to assume the dependence between genomic units is driven by a relatively small number of latent variables (Leek and Storey, 2007; Maksimovic et al., 2015; Gerard and Stephens, 2018). It is also a common assumption among the literature that motivates their models with high throughput biological data (Leek and Storey, 2008; Sun et al., 2012; Lee et al., 2017; Wang et al., 2017; Dobriban and Owen, 2019). We also provide simulation evidence in Section S1.3 of the Supplement that suggests our method is robust to dependence among e₁,…, e_p.

Assumption 1.

$X \in ℝ^{n \times d}$ and $B_{1}, \dots, B_{b} \in ℝ^{n \times n}$ are observed, non-random matrices. Let c₁ > 1 be a constant. Define $M \in ℝ^{b \times b}$ to be [M]_ij = n⁻¹Tr (B_i B_j) for all i, j ∈ [b]. For any matrix D, let D_{(− i)} be the sub-matrix of D with the ith row removed. Assume:

The random matrix C is independent of E, where $E {{({[C]}_{i k})}^{2}} < \infty$ for all i ∈ [n] and k ∈ [K]. Further, $\lim_{n \to \infty} n^{- 1} X^{T} X = Σ_{X} ≻ 0$ , $M ≽ c_{1}^{- 1} I_{b}$ , $B_{j} = B_{j}^{T}$ , ${‖ B_{j} ‖}_{2} \leq c_{1}$ for all j ∈ [b] and $L \in Θ_{L} = {L \in ℝ p \times K : L_{(- g)} c o n t a i n s t w o d i s t i n c t rank K s u b m a t r i c e s f o r a l l g \in [p]} .$
Let $W_{*} = Q_{X}^{T} V_{*} Q_{X}$ and $Ψ = E (n^{- 1} C_{⊥}^{T} W_{*}^{- 1} C_{⊥})$ . Then $_{g}^{l T} Ψ_{g}^{l} \leq c_{1}$ for all g ∈ [p], and the first K eigenvalues λ₁ > … >λ_K > λ_K+1 = 0 of np⁻¹ L Ψ L^T satisfy $λ_{1}, \dots, λ_{K} \in [c_{1}^{- 1}, n c_{1}]$ .
$V_{g} ≽ c_{1}^{- 1} I_{n}$ , $| v_{g, j} | \leq c_{1}$ for all g ∈ [p] and j ∈ [b]. Further, ${‖ n^{- 1} C_{⊥}^{T} W_{*}^{- 1} C_{⊥} - Ψ ‖}_{2} = o_{P} (1)$ as n, p → ∞ and $‖ Ψ ‖_{2}, {‖ Ψ^{- 1} ‖}_{2} \leq c_{1}$ .
$1 - λ_{k + 1} / λ_{k} \geq c_{1}^{- 1}$ for all k ∈ [K], λ₁ / λ_K ≤ c₁ and one of the following holds:
$λ_{K} \geq c_{1}^{- 1} n O R λ_{K} \to \infty but λ_{K} / n \to 0 as n, p \to \infty O R λ_{K} \leq c_{1} .$
p is a non-decreasing function of n such that n / p, n^3/2 / (p λ_K) → 0 as n → ∞.

Remark 2.

We prove V₁,…,V_p ≻ 0 and L Ψ L^T are identifiable under Assumption 11 in Proposition S1 in Section S3.1 in the Supplement. The set Θ_L is a classic way to identify parameters in factor models when $V_{g} = σ_{g}^{2} I_{n}$ for all g ∈ [p] (Anderson and Rubin, 1956; Wang et al., 2017).

Remark 3.

The assumptions on C imply $E (y_{g}) = X β_{g} + E {(C)}_{g}^{l}$ and $Cov ({[y_{g}]}_{i}, {[y_{h}]}_{j}) =_{g}^{l T} Ψ_{i j}^{l} + {[V_{g}]}_{i j} I (g = h)$ , where Ψ_ij is the covariance between the ith and jth rows of C.

The condition on M ensures v₁,…, v_p are identifiable, and one can easily verify the condition on ║B_j║₂ in data from McKennan et al. (2018) and Martino et al. (2013) discussed in Sections 1 and 6, as well as in the simulated multi-tissue data from Section 5. We provide additional data examples to illustrate the ubiquity of the assumption on ║B_j║₂ in Section S4.1 of the Supplement. The eigenvalues λ₁,…, λ_K quantify the average variation in y₁,…, y_p due to C that can be distinguished from that due to X, and we give general sufficient conditions under which ${‖ n^{- 1} C_{⊥}^{T} W_{*}^{- 1} C_{⊥} - Ψ ‖}_{2} = o_{P} (1)$ in Lemma S1 in the Supplement. Item 5 is standard when $V_{g} = σ_{g}^{2} I_{n}$ for all g ∈ [p] (Wang et al., 2017; McKennan and Nicolae, 2019).

Assumption 2.

Let c² > 1 be a constant not dependent on n or p. The estimates ${\hat{δ}}^{2}$ and τ from step 2 in Algorithm 2 are such that ${\hat{δ}}^{2} τ$ lies in the convex set

Θ_{*} = Θ \cap {x \in R^{b} : ‖ x ‖_{2} \leq b c_{2} c_{1}, \sum_{j = 1}^{b} {[x]}_{j} B_{j} - {(c_{2} c_{1})}^{- 1} I_{n} ≻ 0},

where c₁ was defined in Assumption 1.

This makes the residual variance parameter space compact and is analogous to Assumption D in Bai and Li (2012) and Assumption 2 in Wang et al. (2017). It is also a standard assumption in likelihood theory (Wald, 1949; Ferguson, 1996; Douc et al., 2004). We now state Theorem 1.

Theorem 1 (Accuracy of ICaSE).

Suppose Assumptions 1 and 2 hold and we apply Algorithm 2 for k = 1,…, K_max , where K ≤ K max. Then the estimates for $δ_{*}^{2}$ , τ_* and C_⊥ satisfy

| {\hat{δ}}^{2} - δ_{*}^{2} |, {‖ τ - τ_{*} ‖}_{2} = O_{P} (n^{- 1}), k \geq K

{‖ P_{C_{⊥}} - P_{C_{⊥}} ‖}_{F}^{2} = O_{P} {n / (λ_{K} p) + {(p λ_{K})}^{- 1 / 2} + {(n λ_{K})}^{- 1}}, k = K \geq 1

{‖ P_{C_{⊥}} - P_{C_{⊥}} P_{C_{⊥}} ‖}_{F}^{2} = O_{P} {n / (λ_{K} p) + {(p λ_{K})}^{- 1 / 2} + {(n λ_{K})}^{- 1}}, k \geq K \geq 1

where ${\hat{δ}}^{2}$ and τ depend on k. Further, ${‖ P_{C_{⊥}} - P_{C_{⊥}} ‖}_{F}^{2}, {‖ P_{C_{⊥}} - P_{C_{⊥}} P_{C_{⊥}} ‖}_{F}^{2}$ and the estimates ${\hat{δ}}^{2}$ , τ are invariant of the choice of Q_X.

Theorem 1 implies that Im (C_⊥) is estimated well when k = K and its column space is approximately a subspace of Im (C_⊥) when we overestimate K. This result is quite remarkable because besides the additional factor ${(n λ_{K})}^{- 1} + {(p λ_{K})}^{- 1 / 2} < < n^{- 1 / 2}$ , this is the same rate obtained from principal components analysis when the samples are independent (McKennan and Nicolae, 2019). Theorem 1 makes no assumption on the starting points for ${\hat{δ}}^{2}$ and τ (other than the REML estimates are estimated in Θ_*), which proves the efficacy of our warm-start technique.

Theorem 2 (Consistency of CBCV).

Let $\hat{K}$ be as defined in (10). Suppose the assumptions of Theorem 1 hold, $λ_{K} \geq δ_{*}^{2} + ϵ$ for some constant ϵ > 0 and we sample Q_X uniformly over the space of all orthonormal bases for X^⊥. Then $\lim_{n, p \to \infty} P (\hat{K} = K) = 1$ .

Remark 4.

Let $Q_{*} \in ℝ^{n \times (n - d)}$ be a non-random matrix such that $P_{X}^{-} = Q_{*} Q_{*}^{T}$ . Then Q_X = Q_* M, where $M \in ℝ^{(n - d) \times (n - d)}$ is independent of C and E, and is sampled uniformly over the space of unitary matrices. This guarantees the maximum leverage score of $W_{*}^{- 1 / 2} C_{⊥}$ is o_P (1) as n → ∞, and is a common technique to uniformize the leverage scores of a matrix (Mahoney, 2011).

This theorem shows that recovering K is easier when the sample size n is larger, since λ₁,…, λ_K tend to grow with n. This is why it is advisable to use all of the available samples to estimate K, rather than partitioning the data to alleviate correlation, as discussed in Section 1.

The requirement that $λ_{K} \geq δ_{*}^{2} + ϵ$ is tight, as Owen and Wang (2016) demonstrate that $δ_{*}^{2}$ is the lower limit of detection when b = 1 and B₁ = I_n. This lower limit of detection is smaller than parallel analysis’, which often fails to recover moderate to small factors (Owen and Wang, 2016). We discuss this in more detail after the statement of Theorem S2 in Section S4 of the Supplement. Theorem 2 is also, to the best of our knowledge, the first result proving that bi-cross validation-like methods are consistent. The authors of Owen and Perry (2009); Owen and Wang (2016) only showed their estimates for K minimize the expected loss when samples are independent.

Assumption 3.

$s = p^{- 1} \sum_{g = 1}^{p} I (β_{g} \neq 0) = o (n^{- 3 / 2} λ_{K})$ , $\max_{g \in [p]} {‖ β_{g} ‖}_{2} \leq c_{3}$ for some constant c₃ > 0 , and for $S = {(X^{T} V_{*}^{- 1} X)}^{- 1} X^{T} V_{*}^{- 1} C$ , ${‖ S Ψ^{- 1} S^{T} ‖}_{2} = O_{p} (1)$ as n, p→ ∞.

Remark 5.

We prove β is identifiable under Assumptions 11–2, 15 and the sparsity assumption on s in Proposition S2 in the Supplement. Other than the conditions on $n^{- 1} C_{⊥} W_{*}^{- 1} C_{⊥}$ and S in Assumptions 1 and 3, we make no assumptions about the relationship between C and X.

This provides an explicit relationship between the maximum allowable sparsity on the main effect and the informativeness of the data Y for estimating C: the more signal in $L C_{⊥}^{T}$ , the part of LC^T unequivocally distinguishable from βX^T, the less stringent Assumption 3 becomes. This sparsity condition is consistent with many modern genetic studies (Liu et al., 2016; Morales et al., 2016; Yang et al., 2017; Zhang et al., 2018), and is the same sparsity assumed in existing theoretical work when b = 1 and B₁ = I_n (Wang et al., 2017; McKennan and Nicolae, 2019). We show through simulation in Section 5 that we can accurately recover C even when this assumption is violated.

Theorem 3 (Inference on β)

Let g ∈ [p]. Suppose Assumptions 1, 2 and 3 hold and we estimate C according to (12). Let ${\hat{v}}_{g} \in Θ_{*}$ and β_g be the restricted maximum likelihood (REML) and generalized least squares (GLS) estimates for v_g and β_g using the design matrix [X C]. Then

{‖ V_{g} - V_{g} ‖}_{2} = o_{P} (1), V_{g} = \sum_{j = 1}^{b} {[{\hat{v}}_{g}]}_{j} B_{j}

M_{g}^{- 1 / 2} {(β_{g} - β_{g})}^{d} = Z + o_{P} (1), M_{g} = {(X^{T} V_{g}^{- 1} X)}^{- 1} + Ω_{g} {C_{⊥}^{T} {(Q_{X}^{T} V_{g} Q_{X})}^{- 1} C_{⊥}}^{- 1} Ω_{g}^{T}

as n → ∞ , where Z ~ N_d (0, I_d) and $Ω_{g} = {(X^{T} V_{g}^{- 1} X)}^{- 1} X^{T} V_{g}^{- 1} C$ . Further, ${‖ M_{g} M_{g}^{- 1} - I_{d} ‖}_{2} = o_{P} (1)$ as n → ∞, where M_g is the variance of the GLS estimate for β_g when C and V_g are known.

Remark 6.

The REML and GLS estimates V g, β_g and M_g only depend on C through Im (C). We prove that conditional on C, Im (C) and M_g are identifiable for all g ∈ [p] in Proposition S3 in Section S5.2 in the Supplement.

5. Simulated multi-tissue gene expression data analysis

5.1. Simulation setup and parameters

We simulated the expression of p = 15, 000 genes from 50 individuals across three tissues with a complicated tissue-by-tissue correlation structure to compare our method against other state of the art methods designed to estimate K and β₁,…, β_p. We include comparisons of our method to estimate C_⊥ in Section S1.1 in the Supplement. We first randomly chose 25 individuals to be in the treatment group and set X ∈ {0,1}ⁿ to be the treatment status for the n = 150 samples and Z = 1₅₀ ⊗ I₃ to be the tissue-specific intercept. For each g ∈ [p], we let V_g = I₅₀ ⊗ M_g be the gene-specific covariance matrix for gene g and simulated $M_{g} \in ℝ^{3 \times 3}$ in each dataset so the average correlation matrix across all p genes was as given in Table 1.

Table 1.

The correlation matrix corresponding to the covariance matrix $\lim_{t \to \infty} t^{- 1} \sum_{g = 1}^{t} M_{g}$ .

	Tissue 1	Tissue 2	Tissue 3
Tissue 1	1	0.72	0.58
Tissue 2	0.72	1	0.80
Tissue 3	0.58	0.80	1

Open in a new tab

Specifically, we assumed tissues two and three were more similar to one another than to the first and for each g ∈ [p], set M_g to be

M_{g} = Cov {{(\begin{array}{l} ϵ & ϵ & ϵ \\ g 1 & g 2 & g 3 \end{array})}^{T}}

ϵ_{g 1} = α_{g 1} + ξ_{g 1}, ϵ_{g 2} = ϕ_{g 2} α_{g 1} + α_{g 2} + ξ_{g 2}, ϵ_{g 3} = ϕ_{g 3} α_{g 1} + ρ_{g 3} α_{g 2} + ξ_{g 3}

α_{g 1} ~ (0, v_{g 1}^{2}), α_{g 2} ~ (0, v_{g 2}^{2}), ξ_{g j} ~ (0, σ_{g j}^{2}) (j = 1, 2, 3) .

The constants $v_{g 1}^{2}, ϕ_{g 2}$ , $v_{g 2}^{2}, ϕ_{g 3}, ρ_{g 3}$ and $σ_{g j}^{2}$ were simulated from Gamma distributions with means 0.8, 1.25, 0.4, 0.75, 1 and 0.2, respectively, each with coefficient of variation equal to 0.2. We subsequently re-scaled these parameters so that $δ_{*}^{2} = 1$ . This complex tissue-by-tissue covariance is amenable to the variance model assumed in (2b), since

V_{g} = \sum_{r = 1}^{3} \sum_{s = r}^{3} v_{g r s} {I_{50} \otimes (a_{r s} a_{r s}^{T})} = \sum_{r = 1}^{3} \sum_{s = r}^{3} v_{g r s} B_{r s}

(13)

where $a_{r s} \in ℝ^{3}$ has a 1 in the rth and sth coordinates and 0 everywhere else and

v_{g 12}, v_{g 13}, v_{g 23} \geq 0, v_{g 11} + v_{g 12} + v_{g 13}, v_{g 22} + v_{g 12} + v_{g 23}, v_{g 33} + v_{g 13} + v_{g 23} \geq 0.

We next simulated data with K = 10 latent factors. The parameters X, Z, K and α, defined below, were fixed across all simulations. For $W_{*} = Q_{[X Z]}^{T} V_{*} Q_{[X Z]}$ , we simulated $y_{g} \in ℝ^{n}$ , the expression of gene g in the n (individual, tissue) pairs, as follows:

y_{g} = X β_{g} + C_{g}^{l} + {(3 / 4)}^{1 / 2} V_{g}^{1 / 2} {\tilde{e}}_{g}, {[{\tilde{e}}_{g}]}_{i} ~ T_{s} (g = 1, \dots, p; i = 1, \dots, n) β_{g} ~ 0.8 δ_{0} + 0.2 N_{1} (0, {0.4}^{2}) (g = 1, \dots, p) C = α X 1_{K}^{T} + Ξ {(n^{- 1} Ξ^{T} Q_{[X Z]} W_{*}^{- 1} Q_{[X Z]}^{T} Ξ)}^{- 1 / 2}, Ξ ~ M N_{n \times K} (0, I_{n}, I_{K}) {[\begin{array}{l} l \\ g \end{array}]}_{k} ~ π_{k} δ_{0} + (1 - π_{k}) N_{1} (0, η_{k}^{2}) (g = 1, \dots, p; k = 1, \dots, K) .

(14)

Here δ₀ is the point mass at 0 and T₈ is the t-distribution with eight degrees of freedom, and was chosen to emulate data with heavy tails. We chose to simulate a non-sparse main effect β to show that we can violate Assumption 3 and still do inference that is just as accurate as when C is known. The re-scaling of Ξ was simply to make the diagonal elements of n p⁻¹ L^T L approximately equal to the eigenvalues λ1,…, λ10 defined in Assumption 1. This re-scaling also caused λ1,…, λ₁₀ to shrink by a factor of 1.5, on average, thus making it harder to recover K and C. The constant α was chosen so that $P_{V_{*}^{^{- 1 / 2}} z}^{⊥} V_{*}^{- 1 / 2} C$ explained approximately 30% of the variability in $P_{V_{*}^{^{- 1 / 2}} z}^{⊥} V_{*}^{- 1 / 2} X$ , on average. We defined α by scaling X, C and Z by $V_{*}^{- 1 / 2}$ because this is what one would do in generalized least squares when the covariance of the observations is V_*. The values for π_k, η_k and the resulting λ₁,…,λ₁₀ are provided in Table 2.

Table 2.

The π_k and η_k values used to simulated L and the resulting average λ_k (^{k = 1, …, 10}).

Factor number (k)	1	2	3	4	5	6	7	8	9	10
π _k	0	0.45	0.60	0.71	0.79	0.85	0.90	0.92	0.94	0.96
η _k	1	0.4	0.4	0.4	0.4	0.4	0.4	0.4	0.4	0.5
λ _k	143	12.7	9.5	6.6	4.8	3.3	2.5	1.8	1.4	1.4

Open in a new tab

For each simulated dataset, we let X be the covariate of interest and the tissue specific intercepts, Z, be nuisance covariates and estimated K using CBCV with F = 3 folds, C_⊥ and V_* using ICaSE and C using (12). We then estimated V_g and β_g for each g ∈ [p] using restricted maximum likelihood and generalized least squares, respectively. We include additional results for data simulated such that conditional on C, expression across genes was correlated in Section S1.3 of the Supplement.

5.2. Comparison of estimators for K and β

Since, as far as we are aware, our methods for estimating K and C are the first methods that account for correlation among samples in high throughput biological data, we compared our estimates for K and β with state of the art methods designed for high throughput biological data with independent samples. We first assessed our estimates for K in 100 simulated datasets and compared them with three widely used methods: the method proposed in Bai and Ng (2002), parallel analysis (Buja and Eyuboglu, 1992) and bi-cross validation (Owen and Wang, 2016). The results are given in Figure 1. We could not compare our method to the deterministic version of parallel analysis designed for data with independent samples proposed in Dobriban and Owen (2019), since the authors did not provide R code to implement their method. However, their simulations show that their method tends to select at least as many factors as parallel analysis.

Fig. 1 — Estimates for K = 10 in 100 simulated datasets using our proposed method (CBCV), the method proposed in Bai and Ng (2002) (Bai & Ng), parallel analysis (PA) and bi-cross validation (BCV).

The fact that CBCV recovers K in all simulated datasets is quite remarkable given that we simulate data with heavy tails and $λ_{K} \approx δ_{*}^{2}$ (see Theorem 2). We discuss the behavior of CBCV-CorrConf when λ_K is smaller than $δ_{*}^{2}$ in Sections S1.3 and S4.2 of the Supplement. The method proposed in Bai and Ng (2002) severely underestimates K because it is only able to recover latent factors with overtly large effects (Owen and Wang, 2016). On the other hand, bi-cross validation and parallel analysis overestimate K because both methods are treating the high dimensional random effect as part of the low dimensional effect $L C_{⊥}^{T}$ . We discuss possible adjustments to ameliorate bi-cross validation’s and parallel analysis’ estimates for K in these simulated data in Section S1.2 of the Supplement. Suffice it to say that these adjustments either did not change the estimates, or caused them to consistently underestimate K.

We lastly estimated β via generalized least squares with the design matrix M = [X Z C], where C was estimated with CBCV followed by CorrConf (our estimator specified in (12)), BCconf (McKennan and Nicolae, 2019), Cate-RR (Wang et al., 2017), dSVA (Lee et al., 2017), IRW-SVA (Leek and Storey, 2008), RUV-2 (Gagnon-Bartsch and Speed, 2012), RUV-4 (Gagnon-Bartsch et al., 2013) and when C was known. BCconf, Cate-RR, dSVA, IRW-SVA, RUV-2 and RUV-4 were all applied assuming K = 10 was known. In order to make the estimation of V₁,…,V_p computationally tractable, we modeled the simulated residuals e₁,…, e_p as

e_{g} ~ N_{n} (0, v_{g} V) (g = 1, \dots, p), V = \sum_{r = 1}^{3} \sum_{s = r}^{3} τ_{r s} B_{r s},

where B_rs was defined in (13), and estimated v₁,…, v_p and T_rs with restricted maximum likelihood for each of the eight methods using the estimated design matrix M. We then computed the P value for the null hypothesis β_g = 0 for all g ∈ [p] by comparing the t-statistics to a t-distribution with n – 4 − K = n − 14 degrees of freedom, used these P values as input into q-value (Storey, 2001) to control the false discovery rate and deemed a gene as being differentially expressed across the two treatment conditions if its q-value was no greater than 0.2. We chose q-value to control the false discovery rate because this is the software most popular among biologists. Figure 2 plots the true false discovery proportion in 100 simulated datasets among genes with a q-value less than or equal to 0.2 for each of the eight methods. The false discovery proportion when we completely ignored C was uniformly greater than that of IRW-SVA’s.

Fig. 2 — The false discovery proportion (FDP) among genes with a q-value no greater than 0.2 in 100 simulated datasets. We randomly chose 90 genes with β_g = 0 as control genes in RUV-2 and RUV-4. The FDP when we ignored C was uniformly greater than IRW-SVA’s.

The performance of our method, CorrConf, as the statement of Theorem 3 suggests, is nearly indistinguishable from the generalized least squares estimator when C is known, with nearly identical power. On the other hand, the other six methods tend to introduce more type I errors because their estimates for C do not account for the dependence between the residuals and therefore fail to recover Ω. When K is underestimated, these six methods perform even worse than when K = 10 is known, and we show in Section S1.3 in the Supplement that these methods still fail to control the false discovery rate and, due to the reduction in residual degrees of freedom, exhibit a decrease in power when K is overestimated to the extent suggested by bi-cross validation and parallel analysis in Figure 1.

6. Sex-specific DNA methylation in a longitudinal twin study

We next applied our method to identify sex-specific DNA methylation patterns from a longitudinal twin study using data previously published in Martino et al. (2013). The authors measured the DNA methylation of 10 monozygotic (MZ) and 5 dizygotic (DZ) Australian twin pairs (all DZ twins were both male or both female) at birth and 18 months on the Infinium HumanMethylation450 BeadChip platform in buccal epithelium, a relatively homogeneous tissue. After probe and sample quality filtering and data-normalization, the authors were left with p = 330, 168 methylation sites (CpGs) whose methylation was quantified in 29 male and 24 female (n = 53) samples as the difference between log-methylated and log-unmethylated probe intensity (see Martino et al. (2013) for all pre-processing steps). We then used our proposed method (CBCV-CorrConf), BCconf, Cate-RR, dSVA and IRW-SVA to identify CpGs whose methylation differs in males and females, which we refer to as sex-associated CpGs. We subsequently validated each method’s findings using sex-associated CpGs identified at birth in previous studies with substantially larger sample sizes. We did not compare our method with RUV-2 or RUV-4, since we did not have access to control CpGs.

We first show that we can write the covariance of the 53 observations at each CpG as a linear combination of six matrices. Let y_{m, t,a} be the measured DNA methylation at an arbitrary CpG for twin t ∈{1, 2}, from mother m ∈ [15] at age a ∈{0,18}, where samples with different mothers were assumed to be independent and twin 1 and twin 2 from the same mother were assumed to be exchangeable. A preliminary analysis showed that the correlation between MZ and DZ twins’ methylomes was approximately the same at both ages, which is consistent with the observation that methylation patterns are in large part determined by environmental exposures (Galanter et al., 2017; Martin and Fry, 2018). Therefore, the 4 × 4 covariance matrix for y_m = (y_m,1,0 y_m,1,18 y_m,2,0 y_m,2,18)^T completely determined the n × n covariance matrix for each CpG. We show in Section S2 of the Supplement that by only making assumptions on the pairwise covariances, one can write Cov (y_m) as

Cov (y_{m}) = v_{α} 1_{4} 1_{4}^{T} + v_{η} (1_{2} 1_{2}^{T} \oplus 1_{2} 1_{2}^{T}) + v_{ϕ, 0} a_{0} a_{0}^{T} + v_{ϕ, 18} a_{18} a_{18}^{T} + v_{0} diag (a_{0}) + v_{18} diag (a_{18})

(15)

where $a_{0}^{T} = (1, 0, 1, 0)$ , $a_{18}^{T} = (0, 1, 0, 1)$ . The variance multipliers are such that for t₁ ≠ t₂,

v_{α} = Cov (y_{m, t_{1}, 0}, y_{m, t_{2}, 18})

v_{α} + v_{η} = Cov (y_{m, t_{1}, 0}, y_{m, t_{1}, 18}), v_{α} + v_{ϕ, a} = Cov (y_{m, t_{1}, a}, y_{m, t_{2}, a}) (a = 0, 18)

v_{α} + v_{η} + v_{ϕ, a} + v_{a} = Var (y_{m, t_{1}, a}) (a = 0, 18)

and lie in a convex polytope defined in (3). We derive these variance multiplier relationships in Section S2 of the Supplement.

Since there was no evidence that the difference in methylation between males and females changed from birth to 18 months, we assumed the methylation at each CpG was a linear combination of the subject’s age (birth or 18 months), sex and other unobserved factors to be estimated, where age was a nuisance covariate and sex was the phenotype of interest. We used CBCV with F = 5 folds and estimated K to be 2, and subsequently estimated V_* and C with ICaSE and CorrConf. Our estimates for the six average variance multipliers were all greater than 0, the average residual variance at 18 months was 25% larger than that at birth and the correlation between methylation for twins at 18 months was nearly 20% larger than that at birth. This indicated this set of twin’ s methylomes tended to converge over the first 18 months of life, which is consistent with previous observations that one’s methylome reflects one’s environmental exposure history (Galanter et al., 2017; Martin and Fry, 2018).

We next computed each of the other four method’s estimates for C by first using each method’s default software to choose K: bi-cross validation, the default for Cate-RR and BCconf, or parallel analysis, the default for dSVA and IRW-SVA. These methods estimated K to be 4 and 15, respectively. The fact that both estimated the latent factor dimension to be greater than CBCV’s estimate of 2 is not surprising, as these methods tend to overestimate K when samples are correlated. We subsequently estimated C with each of the four methods using their software’s default settings.

We lastly estimated the effect due to sex on methylation and corresponding q-values to control the false discovery rate, exactly as we did for the simulated data in Section 5. We deemed a CpG a sex-associated CpG if its q-value in that method was no greater than 0.2. Since we did not know the ground truth, we used sex-associated CpGs identified at birth in Yousefi et al. (2015) and Maschietto et al. (2017) as a validation set to help judge the veracity of each method’s findings. Yousefi et al. (2015) and Maschietto et al. (2017) measured DNA methylation in umbilical cord blood on the Infinium HumanMethylation450 BeadChip platform in children born to 111 unrelated Brazilian and 71 unrelated Mexican American mothers, respectively. The authors of both studies measured and corrected for cord blood cellular composition and identified 2,355 and 1,928 sex-associated CpGs that were also among the 330,168 CpGs studied in Martino et al. (2013). Table 3 gives the fraction of sex-associated CpGs identified using C estimated with our method (CBCV-CorrConf), along with the other four methods, that are also among the 3,532 sex-associated CpGs identified in Maschietto et al. (2017) or Yousefi et al. (2015). Since the other four methods are not designed for dependent data, we discuss two adjustments designed to alleviate potential biases in their estimates for both K and C in Section S2 of the Supplement. The overlaps with the modified methods were at least as poor as those presented in Table 3.

Table 3.

The fraction of sex-associated CpGs identified using data from Martino et al. (2013) that are also one of the 3,532 sex-associated CpGs confidently identified in Yousefi et al. (2015) or Maschietto et al. (2017).

CBCV-CorrConf (K = 2)	BCconf (K = 4)	Cate-RR (K = 4)	dSVA (K = 15)	IRW-SVA (K = 15)
38% (278/729)	23% (424/1839)	20% (474/2404)	19% (487/2517)	28% (341/1240)

Open in a new tab

While it may be the case that most of BCconf, Cate-RR, dSVA and IRW-SVA are actual sex-associated CpGs, the results in Table 3 mirror the trends observed in Figure 2. That is, while these four methods nominally identify more sex-associated CpGs, we are less confident in their results because their estimates for C reduce the residual variance but likely do not suitably account for the variation in sex explainable by C, which makes their results less reproducible.

These results also highlight the importance of the choice of K. Estimating K with CBCV, and cross-validation in general, tends to yield more reproducible results because we only include a latent factor if prediction performs suitably well on new, held-out data. When we applied all five methods with K = 2, BCconf, Cate-RR and dSVA performed similarly with overlaps no greater than 30%. However, 272 out of IRW-SVA’s 662 sex-associated CpGs (41%) were in the validation set, which is nearly identical to CorrConf’s results in Table 3. Similarly, when we set K = 4 for all methods, dSVA performed nearly identically to Cate-RR, whereas CorrConf and IRW-SVA had overlaps of 27% and 26%, respectively, and both ostensibly identified approximately 1,500 sex-associated CpGs. This similarity arises because unlike BCconf, Cate-RR and dSVA, IRW-SVA circumvents estimating Ω with a noisy estimate for L and estimates C by performing factor analysis directly on Y, restricted to the genomic units that show little marginal correlation with X. In fact, we use the proof of Theorem 1 in Sections S1.4 and S2 in the Supplement to show that under certain conditions on C, V_* and λ_K, which appear to be satisfied in these data, IRW-SVA can accurately recover C even when residuals are correlated. We believe K = 2 is the most appropriate choice of K for this dataset because CorrConf’s estimate for C appears to explain enough of the variance in methylation to achieve reasonable power, while also accurately recovering Ω to control for false discoveries and ensuring that the results are reproducible.

7. Discussion

To the best of our knowledge, we have developed the first method to account for latent factors in high throughput biological data with correlated samples. We proved that our estimate for K is consistent and that our estimate for Im (C) is accurate enough so that inference on the main effects β₁,…, β_p is just as accurate as when C is known. We also demonstrated our method’s finite sample properties by analyzing complex, multi-tissue simulated gene expression data, and used a real longitudinal DNA methylation data from a twin study to show our method tends to give more reproducible results compared to existing methods.

Our proposed procedure is certainly not a panacea for data with arbitrary correlation structure, and relies on the residual variance V_g being a linear combination of known matrices. Data with more complex, non-linear sample correlation structure may not be amenable to (2b), since a linear combination of p non-linear functions will not necessarily have an apriori known functional form. This could be an interesting area of future research.

Supplementary Material

Supp 1

NIHMS1641111-supplement-Supp_1.zip^{(167.7MB, zip)}

Acknowledgments

* The authors gratefully acknowledge the NIH for their support of this research, which was funded by NIH grants R01-HL129735 and R01-MH101820

Contributor Information

Chris McKennan, Department of Statistics, University of Pittsburgh.

Dan Nicolae, Department of Statistics, University of Chicago.

References

Anderson TW and Rubin H (1956), Statistical inference in factor analysis, in ‘Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 5: Contributions to Econometrics, Industrial Research, and Psychometry’, University of California Press, Berkeley, California, pp. 111–150. [Google Scholar]
Bai J (2009), ‘Panel data models with interactive fixed effects’, Econometrica 77(4), 1229–1279. [Google Scholar]
Bai J and Li K (2012), ‘Statistical analysis of factor models of high dimension’, The Annals of Statistics 40(1), 436–465. [Google Scholar]
Bai J and Ng S (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221. [Google Scholar]
Baumgart M, Priebe S, Groth M, Hartmann N, Menzel U, Pandolfini L, Koch P, Felder M, Ristow M, Englert C, Guthke R, Platzer M and Cellerino A (2016), ‘Longitudinal RNA-Seq analysis of vertebrate aging identifies mitochondrial complex I as a small-molecule-sensitive modifier of lifespan’, Cell Systems 2(2), 122–132. [DOI] [PubMed] [Google Scholar]
Breton CV, Marsit CJ, Faustman E, Nadeau K, Goodrich JM, Dolinoy DC, Herbstman J, Holland N, LaSalle JM, Schmidt R, Yousefi P, Perera F, Joubert BR, Wiemels J, Taylor M, Yang IV, Chen R, Hew KM, Freeland DMH, Miller R and Murphy SK (2017), ‘Small-magnitude effect sizes in epigenetic end points are important in children’s environmental health studies: The children’s environmental health and disease prevention research center’s epigenetics working group’, Environmental Health Perspectives 125(4), 511–526. [DOI] [PMC free article] [PubMed] [Google Scholar]
Buja A and Eyuboglu N (1992), ‘Remarks on parallel analysis’, Multivariate Behavioral Research 27(4), 509–540. [DOI] [PubMed] [Google Scholar]
Chen LS, Wang J, Wang X and Wang P (2017), ‘A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments’, The Annals of Applied Statistics 11(1), 114–138. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dobriban E and Owen AB (2019), ‘Deterministic parallel analysis: an improved method for selecting factors and principal components’, Journal of the Royal Statistical Society Series B 81(1), 163–183. [Google Scholar]
Douc R, Moulines É and Rydén T (2004), ‘Asymptotic properties of the maximum likelihood estimator in autoregressive models with markov regime’, The Annals of Statistics 32(5), 2254–2304. [Google Scholar]
Fan J and Han X (2017), ‘Estimation of the false discovery proportion with unknown dependence’, Journal of the Royal Statistical Society: Series B 79(4), 1143–1164. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ferguson T (1996), A Course in Large Sample Theory, Chapman & Hall. [Google Scholar]
Gagnon-Bartsch JA, Jacob L and Speed TP (2013), Removing unwanted variation from high dimensional data with negative controls, Technical report, UC Berkeley. [Google Scholar]
Gagnon-Bartsch JA and Speed TP (2012), ‘Using control genes to correct for unwanted variation in microarray data’, Biostatistics 13(3), 539–552. [DOI] [PMC free article] [PubMed] [Google Scholar]
Galanter JM, Gignoux CR, Oh SS, Torgerson D, Pino-Yanes M, Thakur N, Eng C, Hu D, Huntsman S, Farber HJ, Avila PC, Brigino-Buenaventura E, LeNoir MA, Meade K, Serebrisky D, Rodríguez-Cintrón W, Kumar R, Rodríguez-Santana JR, Seibold MA, Borrell LN, Burchard EG and Zaitlen N (2017), ‘Differential methylation between ethnic sub-groups reflects the effect of genetic ancestry and environmental exposures’, eLife 6, e20532. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gerard D and Stephens M (2018), ‘Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation’, Biostatistics. [DOI] [PMC free article] [PubMed] [Google Scholar]
Consortium GTEx (2017), ‘Genetic effects on gene expression across human tissues’, Nature 550. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hastie T, Tibshirani R and Friedman J (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second edition), Springer. [Google Scholar]
Houseman EA, Molitor J and Marsit CJ (2014), ‘Reference-free cell mixture adjustments in analysis of DNA methylation data’, Bioinformatics 30(10), 1431–1439. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jaffe AE and Irizarry RA (2014), ‘Accounting for cellular heterogeneity is critical in epigenome-wide association studies’, Genome Biology 15(31). [DOI] [PMC free article] [PubMed] [Google Scholar]
Knowles DA, Burrows CK, Blischak JD, Patterson KM, Serie DJ, Norton N, Ober C, Pritchard JK, Gilad Y and McVean G (2018), ‘ Determining the genetic basis of anthracycline-cardiotoxicity by response QTL mapping in induced cardiomyocytes’, eLife 7, e33480. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lee S, Sun W, Wright FA and Zou F (2017), ‘An improved and explicit surrogate variable analysis procedure by coefficient adjustment’, Biometrika 104(2), 303–316. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K and Irizarry RA (2010), ‘Tackling the widespread and critical impact of batch effects in high-throughput data’, Nature reviews. Genetics 11(10), 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leek JT and Storey JD (2007), ‘Capturing heterogeneity in gene expression studies by surrogate variable analysis’, PLOS Genetics 3(9), 1724–1735. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leek JT and Storey JD (2008), ‘A general framework for multiple testing dependence’, Proceedings of the National Academy of Sciences 105(48), 18718–18723. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li D, Qian J and Su L (2016), ‘Panel data models with interactive fixed effects and multiple structural breaks’, Journal of the American Statistical Association 111(516), 1804–1819. [Google Scholar]
Liu C, Marioni RE, Hedman Å, Pfeiffer L, Tsai P-C, Reynolds LM, Just AC, Duan Q, Boer CG, Tanaka T, Elks CE, Aslibekyan S, Brody JA, Kühnel B, Herder C, Almli LM, Zhi D, Wang Y, Huan T, Yao C, Mendelson MM, Joehanes R, Liang L, Love S-A, Guan W, Shah S, McRae AF, Kretschmer A, Prokisch H, Strauch K, Peters A, Visscher PM, Wray NR, Guo X, Wiggins KL, Smith AK, Binder EB, Ressler KJ, Irvin MR, Absher DM, Hernandez D, Ferrucci L, Bandinelli S, Lohman K, Ding J, Trevisi L, Gustafsson S, Sandling JH, Stolk L, Uitterlinden AG, Yet I, Castillo-Fernandez JE, Spector TD, Schwartz JD, Vokonas P, Lind L, Li Y, Fornage M, Arnett DK, Wareham NJ, Sotoodehnia N, Ong KK, van Meurs JBJ, Conneely KN, Baccarelli AA, Deary IJ, Bell JT, North KE, Liu Y, Waldenberger M, London SJ, Ingelsson E and Levy D (2016), ‘A DNA methylation biomarker of alcohol consumption’, Molecular Psychiatry 23, 422 EP –. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lu X and Su L (2016), ‘Shrinkage estimation of dynamic panel data models with interactive fixed effects’, Journal of Econometrics 190(1), 148–175. [Google Scholar]
Mahoney MW (2011), Randomized Algorithms for Matrices and Data, now.
Maksimovic J, Gagnon-Bartsch JA, Speed TP and Oshlack A (2015), ‘ Removing unwanted variation in a differential methylation analysis of illumina humanmethylation450 array data’, Nucleic acids research 43(16), e106–e106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Martin EM and Fry RC (2018), ‘Environmental influences on the epigenome: Exposure- associated DNA methylation in human populations’, Annual Review of Public Health 39(1), 309–333. [DOI] [PubMed] [Google Scholar]
Martino D, Loke YJ, Gordon L, Ollikainen M, Cruickshank MN, Saffery R and Craig JM (2013), ‘Longitudinal, genome-scale analysis of DNA methylation in twins from birth to 18 months of age reveals rapid epigenetic change in early life and pair-specific effects of discordance’, Genome Biology 14(5), R42. [DOI] [PMC free article] [PubMed] [Google Scholar]
Maschietto M, Bastos LC, Tahira AC, Bastos EP, Euclydes VLV, Brentani A, Fink G, de Baumont A, Felipe-Silva A, Francisco RPV, Gouveia G, Grisi SJFE, Escobar AMU, Moreira-Filho CA, Polanczyk GV, Miguel EC and Brentani H (2017), ‘Sex differences in DNA methylation of the cord blood are related to sex-bias psychiatric diseases’, Scientific Reports 7. [DOI] [PMC free article] [PubMed] [Google Scholar]
McKennan C, Naughton K, Stanhope C, Kattan M, O’Connor G, Sandel M, Visness C, Wood R, Bacharier L, Beigelman A, Lovisky-Desir S, Togias A, Gern J, Nicolae D and Ober C (2018), ‘Longitudinal studies at birth and age 7 reveal strong effects of genetic variation on ancestry-associated DNA methylation patterns in blood cells from ethnically admixed children’, bioRxiv. [Google Scholar]
McKennan C and Nicolae D (2019), ‘Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data’, Biometrika 106(4), 823–840. [DOI] [PMC free article] [PubMed] [Google Scholar]
Morales E, Vilahur N, Salas LA, Motta V, Fernandez MF, Murcia M, Llop S, Tardon A, Fernandez-Tardon G, Santa-Marina L, Gallastegui M, Bollati V, Estivill X, Olea N, Sunyer J and Bustamante M (2016), ‘ Genome-wide DNA methylation study in human placenta identifies novel loci associated with maternal smoking during pregnancy’, International Journal of Epidemiology 45(5), 1644–1655. [DOI] [PubMed] [Google Scholar]
Owen AB and Perry PO (2009), ‘Bi-cross-validation of the SVD and the nonnegative matrix factorization’, Annals of Applied Statistics 3(2), 564–594. [Google Scholar]
Owen AB and Wang J (2016), ‘Bi-cross-validation for factor analysis’, Statistical Science 31(1), 119–139. [Google Scholar]
Peixoto L, Risso D, Poplawski SG, Wimmer ME, Speed TP, Wood MA and Abel T (2015), ‘How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets’, Nucleic Acids Research 43(16), 7664–7674. [DOI] [PMC free article] [PubMed] [Google Scholar]
Storey JD (2001), ‘A direct approach to false discovery rates’, Journal of the Royal Statistics Society Series B 63(3), 479–498. [Google Scholar]
Su L and Ju G (2018), ‘Identifying latent grouped patterns in panel data models with interactive fixed effects’, Journal of Econometrics 206(2), 554–573. [Google Scholar]
Sun Y, Zhang NR and Owen AB (2012), ‘Multiple hypothesis testing adjusted for latent variables, with an application to the agemap gene expression data’, The Annals of Applied Statistics 6(4), 1664–1668. [Google Scholar]
Tung J, Zhou X, Alberts SC, Stephens M, Gilad Y and Dermitzakis ET (2015), ‘The genetic architecture of gene expression levels in wild baboons’, eLife 4, e04729. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wald A (1949), ‘Note on the consistency of the maximum likelihood estimate’, The Annals of Mathematical Statistics 20(4), 595–601. [Google Scholar]
Wang J, Zhao Q, Hastie T and Owen AB (2017), ‘Confounder adjustment in multiple hypothesis testing’, The Annals of Statistics 45(5), 1863–1894. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang IV, Pedersen BS, Liu AH, O’Connor GT, Pillai D, Kattan M, Misiak RT, Gruchalla R, Szefler SJ, Khurana Hershey GK, Kercsmar C, Richards A, Stevens AD, Kolakowski CA, Makhija M, Sorkness CA, Krouse RZ, Visness C, Davidson EJ, Hennessy CE, Martin RJ, Togias A, Busse WW and Schwartz DA (2017), ‘The nasal methylome and childhood atopic asthma’, Journal of Allergy and Clinical Immunology 139(5), 1478–1488. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yao C, Li H, Shen X, He Z, He L and Guo Z (2012), ‘Reproducibility and concordance of differential dna methylation and gene expression in cancer’, PLOS ONE 7(1), e29686–. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yousefi P, Huen K, Davé V, Barcellos L, Eskenazi B and Holland N (2015), ‘Sex differences in DNA methylation assessed by 450K beadchip in newborns’, BMC Genomics 16(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang X, Biagini Myers JM, Burleson J, Ulm A, Bryan KS, Chen X, Weirauch MT, Baker TA, Butsch Kovacic MS and Ji H (2018), ‘Nasal DNA methylation is associated with childhood asthma’, Epigenomics 10(5), 629–641. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp 1

NIHMS1641111-supplement-Supp_1.zip^{(167.7MB, zip)}

[R1] Anderson TW and Rubin H (1956), Statistical inference in factor analysis, in ‘Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 5: Contributions to Econometrics, Industrial Research, and Psychometry’, University of California Press, Berkeley, California, pp. 111–150. [Google Scholar]

[R2] Bai J (2009), ‘Panel data models with interactive fixed effects’, Econometrica 77(4), 1229–1279. [Google Scholar]

[R3] Bai J and Li K (2012), ‘Statistical analysis of factor models of high dimension’, The Annals of Statistics 40(1), 436–465. [Google Scholar]

[R4] Bai J and Ng S (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221. [Google Scholar]

[R5] Baumgart M, Priebe S, Groth M, Hartmann N, Menzel U, Pandolfini L, Koch P, Felder M, Ristow M, Englert C, Guthke R, Platzer M and Cellerino A (2016), ‘Longitudinal RNA-Seq analysis of vertebrate aging identifies mitochondrial complex I as a small-molecule-sensitive modifier of lifespan’, Cell Systems 2(2), 122–132. [DOI] [PubMed] [Google Scholar]

[R6] Breton CV, Marsit CJ, Faustman E, Nadeau K, Goodrich JM, Dolinoy DC, Herbstman J, Holland N, LaSalle JM, Schmidt R, Yousefi P, Perera F, Joubert BR, Wiemels J, Taylor M, Yang IV, Chen R, Hew KM, Freeland DMH, Miller R and Murphy SK (2017), ‘Small-magnitude effect sizes in epigenetic end points are important in children’s environmental health studies: The children’s environmental health and disease prevention research center’s epigenetics working group’, Environmental Health Perspectives 125(4), 511–526. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Buja A and Eyuboglu N (1992), ‘Remarks on parallel analysis’, Multivariate Behavioral Research 27(4), 509–540. [DOI] [PubMed] [Google Scholar]

[R8] Chen LS, Wang J, Wang X and Wang P (2017), ‘A mixed-effects model for incomplete data from labeling-based quantitative proteomics experiments’, The Annals of Applied Statistics 11(1), 114–138. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Dobriban E and Owen AB (2019), ‘Deterministic parallel analysis: an improved method for selecting factors and principal components’, Journal of the Royal Statistical Society Series B 81(1), 163–183. [Google Scholar]

[R10] Douc R, Moulines É and Rydén T (2004), ‘Asymptotic properties of the maximum likelihood estimator in autoregressive models with markov regime’, The Annals of Statistics 32(5), 2254–2304. [Google Scholar]

[R11] Fan J and Han X (2017), ‘Estimation of the false discovery proportion with unknown dependence’, Journal of the Royal Statistical Society: Series B 79(4), 1143–1164. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Ferguson T (1996), A Course in Large Sample Theory, Chapman & Hall. [Google Scholar]

[R13] Gagnon-Bartsch JA, Jacob L and Speed TP (2013), Removing unwanted variation from high dimensional data with negative controls, Technical report, UC Berkeley. [Google Scholar]

[R14] Gagnon-Bartsch JA and Speed TP (2012), ‘Using control genes to correct for unwanted variation in microarray data’, Biostatistics 13(3), 539–552. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Galanter JM, Gignoux CR, Oh SS, Torgerson D, Pino-Yanes M, Thakur N, Eng C, Hu D, Huntsman S, Farber HJ, Avila PC, Brigino-Buenaventura E, LeNoir MA, Meade K, Serebrisky D, Rodríguez-Cintrón W, Kumar R, Rodríguez-Santana JR, Seibold MA, Borrell LN, Burchard EG and Zaitlen N (2017), ‘Differential methylation between ethnic sub-groups reflects the effect of genetic ancestry and environmental exposures’, eLife 6, e20532. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Gerard D and Stephens M (2018), ‘Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation’, Biostatistics. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Consortium GTEx (2017), ‘Genetic effects on gene expression across human tissues’, Nature 550. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Hastie T, Tibshirani R and Friedman J (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second edition), Springer. [Google Scholar]

[R19] Houseman EA, Molitor J and Marsit CJ (2014), ‘Reference-free cell mixture adjustments in analysis of DNA methylation data’, Bioinformatics 30(10), 1431–1439. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Jaffe AE and Irizarry RA (2014), ‘Accounting for cellular heterogeneity is critical in epigenome-wide association studies’, Genome Biology 15(31). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Knowles DA, Burrows CK, Blischak JD, Patterson KM, Serie DJ, Norton N, Ober C, Pritchard JK, Gilad Y and McVean G (2018), ‘ Determining the genetic basis of anthracycline-cardiotoxicity by response QTL mapping in induced cardiomyocytes’, eLife 7, e33480. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Lee S, Sun W, Wright FA and Zou F (2017), ‘An improved and explicit surrogate variable analysis procedure by coefficient adjustment’, Biometrika 104(2), 303–316. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K and Irizarry RA (2010), ‘Tackling the widespread and critical impact of batch effects in high-throughput data’, Nature reviews. Genetics 11(10), 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Leek JT and Storey JD (2007), ‘Capturing heterogeneity in gene expression studies by surrogate variable analysis’, PLOS Genetics 3(9), 1724–1735. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Leek JT and Storey JD (2008), ‘A general framework for multiple testing dependence’, Proceedings of the National Academy of Sciences 105(48), 18718–18723. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Li D, Qian J and Su L (2016), ‘Panel data models with interactive fixed effects and multiple structural breaks’, Journal of the American Statistical Association 111(516), 1804–1819. [Google Scholar]

[R27] Liu C, Marioni RE, Hedman Å, Pfeiffer L, Tsai P-C, Reynolds LM, Just AC, Duan Q, Boer CG, Tanaka T, Elks CE, Aslibekyan S, Brody JA, Kühnel B, Herder C, Almli LM, Zhi D, Wang Y, Huan T, Yao C, Mendelson MM, Joehanes R, Liang L, Love S-A, Guan W, Shah S, McRae AF, Kretschmer A, Prokisch H, Strauch K, Peters A, Visscher PM, Wray NR, Guo X, Wiggins KL, Smith AK, Binder EB, Ressler KJ, Irvin MR, Absher DM, Hernandez D, Ferrucci L, Bandinelli S, Lohman K, Ding J, Trevisi L, Gustafsson S, Sandling JH, Stolk L, Uitterlinden AG, Yet I, Castillo-Fernandez JE, Spector TD, Schwartz JD, Vokonas P, Lind L, Li Y, Fornage M, Arnett DK, Wareham NJ, Sotoodehnia N, Ong KK, van Meurs JBJ, Conneely KN, Baccarelli AA, Deary IJ, Bell JT, North KE, Liu Y, Waldenberger M, London SJ, Ingelsson E and Levy D (2016), ‘A DNA methylation biomarker of alcohol consumption’, Molecular Psychiatry 23, 422 EP –. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Lu X and Su L (2016), ‘Shrinkage estimation of dynamic panel data models with interactive fixed effects’, Journal of Econometrics 190(1), 148–175. [Google Scholar]

[R29] Mahoney MW (2011), Randomized Algorithms for Matrices and Data, now.

[R30] Maksimovic J, Gagnon-Bartsch JA, Speed TP and Oshlack A (2015), ‘ Removing unwanted variation in a differential methylation analysis of illumina humanmethylation450 array data’, Nucleic acids research 43(16), e106–e106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] Martin EM and Fry RC (2018), ‘Environmental influences on the epigenome: Exposure- associated DNA methylation in human populations’, Annual Review of Public Health 39(1), 309–333. [DOI] [PubMed] [Google Scholar]

[R32] Martino D, Loke YJ, Gordon L, Ollikainen M, Cruickshank MN, Saffery R and Craig JM (2013), ‘Longitudinal, genome-scale analysis of DNA methylation in twins from birth to 18 months of age reveals rapid epigenetic change in early life and pair-specific effects of discordance’, Genome Biology 14(5), R42. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] Maschietto M, Bastos LC, Tahira AC, Bastos EP, Euclydes VLV, Brentani A, Fink G, de Baumont A, Felipe-Silva A, Francisco RPV, Gouveia G, Grisi SJFE, Escobar AMU, Moreira-Filho CA, Polanczyk GV, Miguel EC and Brentani H (2017), ‘Sex differences in DNA methylation of the cord blood are related to sex-bias psychiatric diseases’, Scientific Reports 7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] McKennan C, Naughton K, Stanhope C, Kattan M, O’Connor G, Sandel M, Visness C, Wood R, Bacharier L, Beigelman A, Lovisky-Desir S, Togias A, Gern J, Nicolae D and Ober C (2018), ‘Longitudinal studies at birth and age 7 reveal strong effects of genetic variation on ancestry-associated DNA methylation patterns in blood cells from ethnically admixed children’, bioRxiv. [Google Scholar]

[R35] McKennan C and Nicolae D (2019), ‘Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data’, Biometrika 106(4), 823–840. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] Morales E, Vilahur N, Salas LA, Motta V, Fernandez MF, Murcia M, Llop S, Tardon A, Fernandez-Tardon G, Santa-Marina L, Gallastegui M, Bollati V, Estivill X, Olea N, Sunyer J and Bustamante M (2016), ‘ Genome-wide DNA methylation study in human placenta identifies novel loci associated with maternal smoking during pregnancy’, International Journal of Epidemiology 45(5), 1644–1655. [DOI] [PubMed] [Google Scholar]

[R37] Owen AB and Perry PO (2009), ‘Bi-cross-validation of the SVD and the nonnegative matrix factorization’, Annals of Applied Statistics 3(2), 564–594. [Google Scholar]

[R38] Owen AB and Wang J (2016), ‘Bi-cross-validation for factor analysis’, Statistical Science 31(1), 119–139. [Google Scholar]

[R39] Peixoto L, Risso D, Poplawski SG, Wimmer ME, Speed TP, Wood MA and Abel T (2015), ‘How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets’, Nucleic Acids Research 43(16), 7664–7674. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] Storey JD (2001), ‘A direct approach to false discovery rates’, Journal of the Royal Statistics Society Series B 63(3), 479–498. [Google Scholar]

[R41] Su L and Ju G (2018), ‘Identifying latent grouped patterns in panel data models with interactive fixed effects’, Journal of Econometrics 206(2), 554–573. [Google Scholar]

[R42] Sun Y, Zhang NR and Owen AB (2012), ‘Multiple hypothesis testing adjusted for latent variables, with an application to the agemap gene expression data’, The Annals of Applied Statistics 6(4), 1664–1668. [Google Scholar]

[R43] Tung J, Zhou X, Alberts SC, Stephens M, Gilad Y and Dermitzakis ET (2015), ‘The genetic architecture of gene expression levels in wild baboons’, eLife 4, e04729. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] Wald A (1949), ‘Note on the consistency of the maximum likelihood estimate’, The Annals of Mathematical Statistics 20(4), 595–601. [Google Scholar]

[R45] Wang J, Zhao Q, Hastie T and Owen AB (2017), ‘Confounder adjustment in multiple hypothesis testing’, The Annals of Statistics 45(5), 1863–1894. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] Yang IV, Pedersen BS, Liu AH, O’Connor GT, Pillai D, Kattan M, Misiak RT, Gruchalla R, Szefler SJ, Khurana Hershey GK, Kercsmar C, Richards A, Stevens AD, Kolakowski CA, Makhija M, Sorkness CA, Krouse RZ, Visness C, Davidson EJ, Hennessy CE, Martin RJ, Togias A, Busse WW and Schwartz DA (2017), ‘The nasal methylome and childhood atopic asthma’, Journal of Allergy and Clinical Immunology 139(5), 1478–1488. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] Yao C, Li H, Shen X, He Z, He L and Guo Z (2012), ‘Reproducibility and concordance of differential dna methylation and gene expression in cancer’, PLOS ONE 7(1), e29686–. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] Yousefi P, Huen K, Davé V, Barcellos L, Eskenazi B and Holland N (2015), ‘Sex differences in DNA methylation assessed by 450K beadchip in newborns’, BMC Genomics 16(1). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] Zhang X, Biagini Myers JM, Burleson J, Ulm A, Bryan KS, Chen X, Weirauch MT, Baker TA, Butsch Kovacic MS and Ji H (2018), ‘Nasal DNA methylation is associated with childhood asthma’, Epigenomics 10(5), 629–641. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Estimating and accounting for unobserved covariates in high-dimensional correlated data

Chris McKennan

Dan Nicolae

Abstract

1. Introduction

2. Notation and problem set-up

2.1. Notation

2.2. A description of and generative probability model for the data

Algorithm 1 (CBCV-CorrConf).

3. Estimating the factor dimension K and Im (C)

3.1. A computationally tractable model for the data

3.2. ICaSE: an iterative factor analysis to estimate Im (C⊥)

Algorithm 2 (ICaSE).

Remark 1.

3.3. Correlated Bi-Cross Validation to estimate K

Algorithm 3 (CBCV).

3.4. De-biasing estimates for the main effect in correlated data

4. Theoretical justification of CBCV-CorrConf

Assumption 1.

Remark 2.

Remark 3.

Assumption 2.

Theorem 1 (Accuracy of ICaSE).

Theorem 2 (Consistency of CBCV).

Remark 4.

Assumption 3.

Remark 5.

Theorem 3 (Inference on β)

Remark 6.

5. Simulated multi-tissue gene expression data analysis

5.1. Simulation setup and parameters

Table 1.

Table 2.

5.2. Comparison of estimators for K and β

Fig. 1.

Fig. 2.

6. Sex-specific DNA methylation in a longitudinal twin study

Table 3.

7. Discussion

Supplementary Material

Acknowledgments

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

3.2. ICaSE: an iterative factor analysis to estimate Im (C_⊥)