Author manuscript; available in PMC: 2019 Aug 22.
Published in final edited form as: Ann Stat. 2017 Oct 31;45(5):1863–1894. doi: 10.1214/16-AOS1511

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING

Jingshu Wang *, Qingyuan Zhao *, Trevor Hastie †,1, Art B Owen †,2

Abstract

We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664–1688], which correspond to two different identification conditions in the framework: the first requires a set of “negative controls” that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini–Hochberg procedure when the sample size is reasonably large.

Keywords: Empirical null, surrogate variable analysis, unwanted variation, batch effect, robust regression, Primary 62J15, secondary 62H25

1. Introduction.

Multiple hypothesis testing has become an important statistical problem for many scientific fields, where tens of thousands of tests are typically performed simultaneously. Traditionally, the tests are assumed to be independent of each other, so the false discovery rate (FDR) can be easily controlled by, for example, the Benjamini–Hochberg procedure [8]. Recent years have witnessed an extensive investigation of multiple hypothesis testing under dependence, ranging from permutation tests [33, 60], positive dependence [9], weak dependence [14, 56], accuracy calculation under dependence [18, 44] to mixture models [19, 57] and latent factor models [20, 21, 35]. Many of these works provide theoretical guarantees for FDR control under the assumption that the individual test statistics are marginally valid, even though they may be correlated.

In this paper, we investigate a more challenging setting. The test statistics may be correlated with each other due to latent factors, and those latent factors may also be correlated with the variable of interest. As a result, the test statistics are not only correlated but also confounded. We use the phrase “confounding” to emphasize that these latent factors can significantly bias the individual p-values; this problem is therefore fundamentally different from the literature in the previous paragraph and poses an immediate threat to the reproducibility of the discoveries. Many confounder adjustment methods have already been proposed for multiple testing over the last decade [25, 39, 49, 59]. Our goal is to unify these methods in the same framework and study their statistical properties.

The confounding problem.

We start with three real data examples to illustrate the confounding problem. The first microarray dataset [Figure 1(a)] is used by Singh et al. [55] to identify candidate genes associated with a chronic lung disease called emphysema. The second [Figure 1(b) and (d)] and third [Figure 1(c)] datasets are used by Gagnon-Bartsch, Jacob and Speed [25] to study the performance of various confounder adjustment methods. For each dataset, we plot the histogram of t-statistics of a simple linear model that regresses the gene expression on the variable of interest (disease status for the first dataset and gender for the second and third). These statistics are commonly used in genome-wide association studies (GWAS) to find potentially interesting genes. See Section 6.2.1 for more details on these datasets.

Fig. 1.

Dataset 1 is the COPD dataset [55]. Dataset 2 and 3 are from [25]. Histograms of regression t-statistics in three microarray studies show clear departure from the theoretical null distribution N(0, 1). The mean and standard deviation of the normal approximation are obtained from the median and median absolute deviation of the statistics. See Section 6.2 for the empirical distributions after confounder adjustment.

The histograms of t-statistics in Figure 1 clearly depart from the approximate theoretical null distribution N(0, 1). The bulk of the test statistics can be skewed [Figure 1(a) and (b)], overdispersed [Figure 1(a)], underdispersed [Figure 1(b) and (d)] or noncentered [Figure 1(c)]. In these cases, neither the theoretical null N(0, 1) nor even the empirical null shown in the histograms looks appropriate for measuring significance. Schwartzman [52] proved that a largely overdispersed histogram like Figure 1(a) cannot be explained by correlation alone, and is possibly due to the presence of confounding factors. For a preview, the reader can find the histograms after our confounder adjustment in Figure 3 at the end of this paper. The p-values of our test of confounding (Section 3.3.2) in Table 2 indicate that all three datasets suffer from confounding latent factors.

Fig. 3.

Histograms of z-statistics after confounder adjustment (without calibration) using the number of confounders r selected by bi-cross-validation.

Table 2.

Summary of the adjusted z-statistics. The first group of columns contains summary statistics of the z-statistics before the empirical calibration. The second group contains performance metrics after the empirical calibration of Remark 3.2, including the total number of genes with p-value less than 0.01 (#sig.), the number of genes on the X/Y chromosomes with p-value less than 0.01 (X/Y), the number of the 100 most significant genes that are on the X/Y chromosomes (top 100), and the p-value of the test of confounding in Section 3.3.2. The bold row corresponds to the r selected by BCV (Figure 3).

(a) Dataset 1 (n = 143, p = 54,675). Primary variable: severity of COPD
r mean median sd mad skewness medc. #sig. p-value
0 −0.16   0.024 2.65 2.57 −0.104 −0.091 164 NA
1 −0.45 −0.39 2.85 2.52 −0.25   0.00074 1162 0.0057
2   0.012 −0.039 1.35 1.33   0.139   0.042 542 <1e–10
3   0.014 −0.05 1.43 1.41   0.169   0.048 552 <1e–10
5 −0.029 −0.11 1.52 1.48   0.236   0.057 647 <1e–10
7 −0.1 −0.14 1.42 1.35   0.109   0.027 837 <1e–10
10 −0.06 −0.085 1.13 1.12   0.103   0.022 506 <1e–10
20 −0.083 −0.095 1.2 1.19   0.0604   0.0095 479 <1e–10
33 −0.099 −0.11 1.33 1.3   0.0727   0.0056 579 <1e–10
40 −0.1 −0.12 1.43 1.4   0.0775   0.0072 585 <1e–10
50 −0.16 −0.17 1.58 1.53   0.0528   0.0032 678 <1e–10
(b) Dataset 2 (n = 84, p = 12,600). Primary variable: gender
r mean median sd mad skewness medc. #sig. X/Y top 100 p-value
0   0.11   0.043 0.36 0.237 2.99   0.2 1036 58 11 NA
1 −0.44 −0.47 1.06 1.04 0.688   0.035 108 20 20 0.74
2 −0.14 −0.15 1.15 1.13 0.601   0.015 113 21 21 0.31
3   0.013   0.012 1.13 1.08 0.795 −0.01 168 34 28 0.03
5   0.044   0.019 1.18 1.08 0.878   0.017 238 32 27 0.0083
7   0.03   0.012 1.26 1.15 0.784   0.0062 269 35 25 0.006
10   0.023   0.00066 1.36 1.24 0.661   0.011 270 38 27 0.019
15   0.049   0.022 1.46 1.31 0.584   0.012 296 36 29 0.00082
20   0.029 −0.0009 1.53 1.36 0.502   0.019 314 36 28 7.2e–07
25   0.048   0.012 1.68 1.48 0.452   0.026 354 37 27 1.1e–06
30   0.026   0.012 1.82 1.61 0.436   0.0068 337 40 27 8.7e–08
40   0.061   0.046 2.07 1.79 0.642   0.0028 363 41 27 7.7e–10
(c) Dataset 3 (n = 31, p = 22,283). Primary variable: gender
r mean median sd mad skewness medc. #sig. X/Y top 100 p-value
0 −1.8 −1.8 0.599 0.513 −3.46   0.082 418 39 20 NA
1 −0.55 −0.56 1.09 1.01 −1.53   0.01 261 29 23 0.00024
2 −0.2 −0.22 1.2 1.11 −0.99   0.014 320 38 22 0.00014
3 −0.096 −0.12 1.27 1.18 −0.844   0.017 311 42 25 0.00014
5 −0.33 −0.32 1.31 1.22 −1.29 −0.011 305 35 23 2.1e–07
7 −0.37 −0.36 1.46 1.36 −0.855 −0.0099 300 38 23 4.0e–07
11 −0.13 −0.12 1.51 1.36 −0.601 −0.0051 432 48 31 1.8e–09
15 −0.12 −0.13 1.83 1.62 −0.341   0.013 492 54 25 2.3e–08
20 −0.13 −0.14 2.61 2.23 −0.327   0.0045 613 50 26 4.0e–06

Other common sources of confounding in gene expression profiling include systematic ancestry differences [49], environmental changes [22, 27] and surgical manipulation [41]. See [36] for a survey. In many studies, especially for observational clinical research and human expression data, the latent factors, either genetic or technical, are confounded with primary variables of interest due to the observational nature of the studies and heterogeneity of samples [50, 51]. Similar confounding problems also occur in other high-dimensional datasets such as brain imaging [53] and metabonomics [15].

Previous methods.

As early as [1], principal component analysis has been suggested to estimate the confounding factors. This approach can work reasonably well if the confounders clearly stand out. For example, in population genetics, [49] proposed a procedure called EIGENSTRAT that removes the largest few principal components from their SNP genotype data, claiming that they closely resemble ancestry differences. In gene expression data, however, it is often unrealistic to assume that the leading principal components always represent the confounding factors. The largest principal component may also correlate with the primary effects of interest. Therefore, directly removing it can result in loss of statistical power.

More recently, an emerging literature considers the confounding problem in similar statistical settings and a variety of methods have been proposed for confounder adjustment [24–26, 38, 39, 59]. These statistical methods are shown to work better than the EIGENSTRAT procedure for gene expression data. However, little is known about their theoretical properties. Indeed, the authors did not focus on model identifiability and relied on heuristic calculations to derive their estimators. In this paper, we address the identifiability problem, rederive the estimators in [25, 59] in a more principled way and provide theoretical guarantees for them.

Before describing the modeling framework, we want to clarify our terminology. The confounding factors or confounders considered in the present paper are referred to by different names in the literature, such as “surrogate variables” [38], “latent factors” [24], “batch effects” [37], “unwanted variation” [26] and “latent effects” [59]. We believe they are all describing the same phenomenon: that there exist some unobserved variables that correlate with both the primary variable(s) of interest and the outcome variables (e.g., gene expression). This problem is generally known as confounding [23, 32]. A famous example is Simpson’s paradox. The term “confounding” has multiple meanings in the literature. We use the meaning from [28]: “a mixing of effects of extraneous factors (called confounders) with the effect of interest.”

Statistical model of confounding.

Most of the confounder adjustment methods mentioned above are built around the following model:

$$Y = X\beta^T + Z\Gamma^T + E. \tag{1.1}$$

Here, Y is an n × p matrix of observed outcomes (e.g., gene expression); X is an n × 1 observed primary variable of interest (e.g., treatment-control, phenotype, health trait); Z is an n × r latent confounding factor matrix; E is often assumed to be a Gaussian noise matrix. The p × 1 vector β contains the primary effects we want to estimate.

Model (1.1) is very general for multiple testing dependence. Leek and Storey [39], Proposition 1, suggest that multiple hypothesis tests based on linear regression can always be represented by (1.1) using sufficiently many factors. However, equation (1.1) itself is not enough to model confounded tests. To elucidate the concept of confounding, we need to characterize the relationship between the latent variables Z and the primary variable X. To be more specific, we assume the regression of Z on X also follows a linear relationship:

$$Z = X\alpha^T + W, \tag{1.2}$$

where W is an n × r random noise matrix independent of X and E, and the r × 1 vector α characterizes the extent of confounding in the data. Plugging (1.2) into (1.1), the linear regression of Y on X gives an unbiased estimate of the marginal effects

$$\tau = \beta + \Gamma\alpha. \tag{1.3}$$

When α ≠ 0, τ is not the same as β by (1.3). In this case, the data (X, Y) are confounded by Z. Since the confounding factors Z are data artifacts in this model, the statistical inference of β is much more interesting than that of τ. See Section 5.2 for more discussion on the marginal and the direct effects.

Following LEAPP [59], we use a QR decomposition to decouple the estimation of Γ from β. The inference procedure splits into the following two steps:

Step 1. By regressing out X in (1.1), Γ is the loading matrix in a factor analysis model and can be efficiently estimated by maximum likelihood.

Step 2. Equation (1.3) can be viewed as a linear regression of the marginal effects τ on the factor loadings Γ. To estimate α and β, we replace τ by its observed value and Γ by its estimate in Step 1.

As mentioned before, other existing confounder adjustment methods including SVA [39] and RUV-4 [25] can be unified in this two-step statistical procedure. See Section 5.3 for a detailed discussion of these methods.
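
To make the two-step procedure concrete, here is a minimal sketch in Python (our illustration, not the authors' implementation). PCA stands in for the quasi-maximum-likelihood factor analysis described later in Section 3.1, and a Huber M-estimator stands in for the robust regression used by LEAPP; the function name and tuning constant are ours.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_adjustment(Y, X, r):
    """Y: n x p outcome matrix, X: length-n primary variable, r: number of latent factors."""
    n, p = Y.shape
    X = X.reshape(n, 1)
    tau_hat = np.linalg.lstsq(X, Y, rcond=None)[0].ravel()   # marginal effects, length p
    # Step 1: factor-analyze the part of Y not explained by X to get the loadings Gamma.
    resid = Y - X @ tau_hat[None, :]
    Gamma_hat = np.linalg.svd(resid, full_matrices=False)[2][:r].T   # p x r PCA loadings
    # Step 2: view tau_hat = beta + Gamma alpha as a regression with sparse "outliers" beta;
    # fit alpha robustly (Huber loss) and take beta_hat as the residual of that fit.
    def huber_obj(alpha, c=1.345):
        res = tau_hat - Gamma_hat @ alpha
        return np.sum(np.where(np.abs(res) <= c, 0.5 * res ** 2, c * np.abs(res) - 0.5 * c ** 2))
    alpha_hat = minimize(huber_obj, np.zeros(r)).x
    beta_hat = tau_hat - Gamma_hat @ alpha_hat
    return beta_hat, Gamma_hat, alpha_hat
```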

Contributions.

Our first contribution in Section 2 is to establish identifiability for the confounded multiple testing model. In the first step of estimating the factor loadings Γ, identifiability is well studied in classical multivariate statistics. However, the second step of estimating the effects β is not identifiable without additional constraints. We consider two different sufficient conditions for global identifiability. The first condition assumes the researcher has a “negative control” variable set for which there should be no direct effect. This negative control set often serves as a quality control precaution in microarray studies [26], but it can also be used to adjust for the confounding factors. The second identification condition assumes at least half of the true effects are zero, that is, the true alternative hypotheses are sparse. These two identification conditions correspond to the approaches of RUV-4 [25] and LEAPP [59], respectively.

Our second contribution in Section 3 is to derive valid and efficient statistical methods under these identification conditions in the second step. In order to estimate the effects, it is essential to estimate the coefficients α relating the primary variable to the confounders. Under the two different identification conditions, we study two different regression methods which are analytically tractable and equally well performing alternatives to RUV-4 and LEAPP. For the negative control (NC) scenario, α^NC and β^NC are obtained by generalized least squares using the negative controls. For the sparsity scenario, α^RR and β^RR are obtained by using a simpler and more analytically tractable robust regression (RR) than the one used in LEAPP.

When the factors are strong (as large as the noise magnitude), for both scenarios we find that the resulting estimators of β are asymptotically as efficient as the oracle estimator which is allowed to observe the confounding factors. It is surprising that no essential loss of efficiency is incurred by searching for the confounding variables. Our asymptotic analysis relies on some recent theoretical results for factor analysis due to [3]. The asymptotic regime we consider has both n, the number of observations, and p, the number of outcome variables (e.g., genes), going to infinity. The most important condition that we require for asymptotic efficiency in the negative control scenario is that the number of negative controls increases to infinity; in the sparsity scenario, we need the L₁ norm of the effects to satisfy ‖β‖₁√n/p → 0. The fact that p ≫ n in many multiple hypothesis testing problems plays an important role in these asymptotics.

Next, in Section 3, we show that the asymptotic z-statistics based on the efficient estimators of β can control the type I error. This is not a trivial corollary from the asymptotic distribution of the test statistics because the size of β is growing and the z-statistics are weakly correlated. Proving FDR control is more technically demanding and is beyond the scope of this paper. Instead, we use numerical simulations to study the empirical performance (including FDR) of our tests. We also give a significance test of confounding (null hypothesis α = 0) in Section 3. This test can help the experimenter to determine if there is any hidden confounder in the design or the experiment process.

In Section 4, we generalize the confounder adjustment model to include multiple primary variables and multiple nuisance covariates. We show the statistical methods and theory for the single primary variable regression problem (1.1) can be smoothly extended to the multiple regression problem.

Outline.

Section 2 introduces the model and describes the two identification conditions. Section 3 studies the statistical inference. Section 4 extends our framework to a linear model with multiple primary variables and multiple known controlling covariates. Section 5 discusses our theoretical analysis in the context of previous literature, including the existing procedures for debiasing the confounders and existing theoretical results of multiple hypothesis testing under dependence (but no confounding). Section 6 studies the empirical behavior of our estimators in simulations and real data examples. Technical proofs of the results are provided in the supplementary material [62].

To help the reader follow this paper and compare our methods and theory with existing approaches, Table 1 summarizes some related publications with more detailed discussion in Section 5.

Table 1.

Selected literature in multiple hypothesis testing under dependence. The categorization is partially subjective as some authors do not use exactly the same terminology

The columns of the table distinguish whether the noise is independent or correlated conditional on the latent factors; the rows distinguish whether latent factors are absent, present but unconfounded, or confounded with the primary variable.

Positive or weak dependence (no latent factors): Benjamini and Yekutieli [9]; Storey, Taylor and Siegmund [56]; Clarke and Hall [14].

Unconfounding factors: Friguet, Kloareg and Causeur [24]; Desai and Storey [16]; Fan, Han and Gu [21]; Lan and Du [35]. Discussed in Sections 5.1 and 5.2.

Confounding factors, independent noise: Leek and Storey [38, 39]; Gagnon-Bartsch and Speed [26]; Sun, Zhang and Owen [59]. Studied in Sections 2–4 and discussed in Section 5.3.

Confounding factors, correlated noise: discussed in Section 5.4 (future research).

Notation.

Throughout the article, we use bold upper-case letters for matrices and lower-case letters for vectors. We use Latin letters for random variables and Greek letters for model parameters. Subscripts of matrices are used to indicate row(s) whenever possible. For example, if C is a set of indices, then Γ_C is the corresponding rows of Γ. The L₀ norm of a vector is defined as the number of nonzero entries: ‖β‖₀ = |{1 ≤ j ≤ p : β_j ≠ 0}|. A random matrix E ∈ ℝ^{n×p} is said to follow a matrix normal distribution with mean M ∈ ℝ^{n×p}, row covariance U ∈ ℝ^{n×n} and column covariance V ∈ ℝ^{p×p}, abbreviated as E ~ MN(M, U, V), if the vectorization of E by column follows the multivariate normal distribution vec(E) ~ N(vec(M), V ⊗ U). When U = I_n, this means the rows of E are i.i.d. N(0, V). We use the usual notation in asymptotic statistics that a random variable is O_p(1) if it is bounded in probability, and o_p(1) if it converges to 0 in probability. Bold symbols O_p(1) or o_p(1) mean each entry of the vector is O_p(1) or o_p(1).

2. The model.

2.1. Linear model with confounders.

We consider a single primary variable of interest in this section. It is common to add intercepts and known confounder effects (such as lab and batch effects) in the regression model. This extension to multiple linear regression does not change the main theoretical results in this paper and is discussed in Section 4.

For simplicity, all the variables in this section are assumed to have mean 0 marginally. Our model is built on equation (1.1) that is already widely used in the existing literature and we rewrite it here:

$$Y_{n\times p} = X_{n\times 1}\,\beta_{p\times 1}^T + Z_{n\times r}\,\Gamma_{p\times r}^T + E_{n\times p}. \tag{2.1a}$$

As mentioned earlier, it is also crucial to model the dependence of the confounders Z and the primary variable X. We assume a linear relationship as in (1.2)

$$Z = X\alpha^T + W, \tag{2.1b}$$

and in addition some distributional assumptions on X, W and the noise matrix E

$$X_i \overset{\mathrm{i.i.d.}}{\sim} \text{mean } 0,\ \text{variance } 1, \qquad i = 1, \dots, n, \tag{2.1c}$$
$$W \sim \mathrm{MN}(0, I_n, I_r), \qquad W \perp X, \tag{2.1d}$$
$$E \sim \mathrm{MN}(0, I_n, \Sigma), \qquad E \perp (X, Z). \tag{2.1e}$$

The parameters in the model (2.1a)–(2.1e) are β ∈ ℝ^{p×1}, the primary effects we are most interested in; Γ ∈ ℝ^{p×r}, the influence of the confounding factors on the outcomes; α ∈ ℝ^{r×1}, the association of the primary variable with the confounding factors; and Σ ∈ ℝ^{p×p}, the noise covariance matrix. We assume Σ is diagonal, Σ = diag(σ₁², …, σ_p²), so the noise for different outcome variables is independent. We discuss possible ways to relax this independence assumption in Section 5.4.

In (2.1c), X_i is not required to be Gaussian or even continuous. For example, a binary or categorical variable after normalization also meets this assumption. As mentioned in Section 1, the parameter vector α measures how severely the data are confounded. For a more intuitive interpretation, consider an oracle procedure of estimating β when the confounders Z in (2.1a) are observed. The best linear unbiased estimator in this case is the ordinary least squares (β̂_j^OLS, Γ̂_j^OLS), whose variance is σ_j² Var((X_i, Z_i))⁻¹/n. Using (2.1b) and (2.1d), it is easy to show that Var(β̂_j^OLS) = (1 + ‖α‖₂²)σ_j²/n and Cov(β̂_j^OLS, β̂_k^OLS) = 0 for j ≠ k. In summary,

$$\operatorname{Var}(\hat\beta^{\mathrm{OLS}}) = \frac{1}{n}\,(1 + \|\alpha\|_2^2)\,\Sigma. \tag{2.2}$$

Notice that in the unconfounded linear model in which Z = 0, the variance of the OLS estimator of β is Σ/n. Therefore, 1 + ‖α‖₂² represents the relative loss of efficiency when we add observed variables Z to the regression that are correlated with X. In Section 3.2, we show that the oracle efficiency (2.2) can be asymptotically achieved even when Z is unobserved.
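
The oracle variance formula (2.2) is easy to verify numerically. The short Monte Carlo sketch below (ours, not from the paper) regresses a single outcome on X and the observed Z under model (2.1a)–(2.1e) and compares n times the empirical variance of β̂_j with 1 + ‖α‖₂² (here σ_j² = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, sigma_j, n_rep = 200, 3, 1.0, 5000
alpha = np.full(r, 1 / np.sqrt(r))              # ||alpha||_2^2 = 1
gamma_j = rng.standard_normal(r)                # fixed loadings of one outcome j
beta_j = 0.5                                    # fixed primary effect of that outcome
est = np.empty(n_rep)
for b in range(n_rep):
    X = rng.choice([-1.0, 1.0], size=n)         # centered, variance 1
    Z = np.outer(X, alpha) + rng.standard_normal((n, r))   # model (2.1b), (2.1d)
    y = beta_j * X + Z @ gamma_j + sigma_j * rng.standard_normal(n)
    D = np.column_stack([X, Z])                 # oracle design: X and the observed Z
    est[b] = np.linalg.lstsq(D, y, rcond=None)[0][0]
print(n * est.var(), 1 + alpha @ alpha)         # both close to 2 when sigma_j^2 = 1
```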

Let θ = (α, β, Γ, Σ) be all the parameters and Θ be the parameter space. Without any constraint, the model (2.1a)–(2.1e) is unidentifiable. In Sections 2.3 and 2.4, we show how to restrict Θ to ensure identifiability.

2.2. Rotation.

Following [59], we introduce a transformation of the data to make the identification issues clearer. Consider the Householder rotation matrix Q^T ∈ ℝ^{n×n} such that Q^T X = ‖X‖₂ e₁ = (‖X‖₂, 0, 0, …, 0)^T. Left-multiplying Y by Q^T, we get Ỹ = Q^T Y = ‖X‖₂ e₁ β^T + Z̃Γ^T + Ẽ, where

$$\tilde Z = Q^T Z = Q^T(X\alpha^T + W) = \|X\|_2\, e_1 \alpha^T + \tilde W, \tag{2.3}$$

and W̃ = Q^T W =_d W, Ẽ = Q^T E =_d E (equalities in distribution, since Q depends only on X). As a consequence, the first row and the remaining rows of Ỹ are

$$\tilde Y_1 = \|X\|_2\,\beta^T + \tilde Z_1\Gamma^T + \tilde E_1 \sim N\bigl(\|X\|_2(\beta + \Gamma\alpha)^T,\ \Gamma\Gamma^T + \Sigma\bigr), \tag{2.4}$$
$$\tilde Y_{-1} = \tilde Z_{-1}\Gamma^T + \tilde E_{-1} \sim \mathrm{MN}(0,\ I_{n-1},\ \Gamma\Gamma^T + \Sigma). \tag{2.5}$$

Here, Ỹ₁ is a 1 × p vector, Ỹ₋₁ is an (n − 1) × p matrix, and the distributions are conditional on X.

The parameters α and β only appear in (2.4), so their inference (Step 2 in our procedure) can be completely separated from the inference of Γ and Σ (Step 1 in our procedure). In fact, Ỹ₁ ⊥ Ỹ₋₁ given X because Ẽ₁ ⊥ Ẽ₋₁, so the two steps use mutually independent information. This in turn greatly simplifies the theoretical analysis.

We intentionally use the symbol Q to resemble the QR decomposition of X. In Section 4, we show how to use the QR decomposition to separate the primary effects from confounder and nuisance effects when X has multiple columns. Using the same notation, we discuss how SVA and RUV decouple the problem in a slightly different manner in Section 5.3.1.
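
The rotation is straightforward to compute in practice. The following sketch (our illustration, with placeholder data) builds Q from a full QR decomposition of X so that Q^T X = ‖X‖₂ e₁, and splits Q^T Y into the two pieces used in (2.4) and (2.5).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200
X = rng.choice([-1.0, 1.0], size=(n, 1))
Y = rng.standard_normal((n, p))                 # placeholder outcome matrix

Q, R = np.linalg.qr(X, mode='complete')         # Q is an n x n orthogonal matrix
if R[0, 0] < 0:                                 # fix the sign so that Q^T X = ||X||_2 e_1
    Q[:, 0] *= -1
Y_tilde = Q.T @ Y
Y1 = Y_tilde[0]                                 # first row: carries beta and alpha, eq. (2.4)
Ym1 = Y_tilde[1:]                               # remaining rows: factor analysis for Gamma, Sigma, eq. (2.5)
assert np.allclose(Q.T @ X, np.linalg.norm(X) * np.eye(n)[:, [0]])
```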

2.3. Identifiability of Γ.

Equation (2.5) is just the exploratory factor analysis model, thus Γ can be easily identified up to some rotation under some mild conditions. Here, we assume a classical sufficient condition for the identification of Γ ([2], Theorem 5.1).

Lemma 2.1.

Let Θ = Θ0 be the parameter space such that:

  1. If any row of Γ is deleted, there remain two disjoint submatrices of Γ of rank r, and

  2. Γ^TΣ⁻¹Γ/p is diagonal and the diagonal elements are distinct, positive and arranged in decreasing order.

Then Γ and Σ are identifiable in the model (2.1a)–(2.1e).

In Lemma 2.1, condition (1) requires that p ≥ 2r + 1. Condition (1) identifies Γ up to a rotation, which is sufficient to identify β. To see this, we can reparameterize Γ and α as ΓU and U^Tα for an r × r orthogonal matrix U. This reparameterization does not change the distribution of Ỹ₁ in (2.4) if β remains the same. Condition (2) identifies the rotation uniquely but is not necessary for our theoretical analysis in later sections.

2.4. Identifiability of β.

The parameters β and α cannot be identified from (2.4) because they contain p + r parameters in total while Ỹ₁ is a vector of length p. If we write P_Γ and P_Γ^⊥ for the projections onto the column space of Γ and its orthogonal complement, so that β = P_Γβ + P_Γ^⊥β, it is impossible to identify P_Γβ from (2.4).

This suggests that we should further restrict the parameter space Θ. We will reduce the degrees of freedom by restricting at least r entries of β to equal 0. We consider two different sufficient conditions to identify β:

Negative control Θ₁ = {(α, β, Γ, Σ) : β_C = 0, rank(Γ_C) = r} for a known negative control set C with |C| ≥ r.

Sparsity Θ₂(s) = {(α, β, Γ, Σ) : ‖β‖₀ ≤ ⌊(p − s)/2⌋, rank(Γ_C) = r for all C ⊆ {1, …, p} with |C| = s} for some r ≤ s ≤ p.

Proposition 2.1.

If Θ = Θ₀ ∩ Θ₁ or Θ = Θ₀ ∩ Θ₂(s) for some r ≤ s ≤ p, the parameters θ = (α, β, Γ, Σ) in the model (2.1a)–(2.1e) are identifiable.

Proof.

Since Θ ⊆ Θ₀, we know from Lemma 2.1 that Γ and Σ are identifiable. Now consider two combinations of parameters θ⁽¹⁾ = (α⁽¹⁾, β⁽¹⁾, Γ, Σ) and θ⁽²⁾ = (α⁽²⁾, β⁽²⁾, Γ, Σ), both in the space Θ and inducing the same distribution in the model (2.1a)–(2.1e), that is, β⁽¹⁾ + Γα⁽¹⁾ = β⁽²⁾ + Γα⁽²⁾.

Let C be the set of indices such that β_C⁽¹⁾ = β_C⁽²⁾ = 0. If Θ = Θ₀ ∩ Θ₁, we already know |C| ≥ r. If Θ = Θ₀ ∩ Θ₂(s), then |C| ≥ s as well, because both β⁽¹⁾ and β⁽²⁾ have at most ⌊(p − s)/2⌋ nonzero entries. In either case, β⁽¹⁾ + Γα⁽¹⁾ = β⁽²⁾ + Γα⁽²⁾ restricted to C implies Γ_Cα⁽¹⁾ = Γ_Cα⁽²⁾. The conditions in Θ₁ and Θ₂ ensure that Γ_C has full column rank, so α⁽¹⁾ = α⁽²⁾, and hence β⁽¹⁾ = β⁽²⁾. □

Remark 2.1.

The condition (2) in Lemma 2.1 that uniquely identifies Γ is not necessary for the identification of β. This is because for any set C with |C| ≥ r and any orthogonal matrix U ∈ ℝ^{r×r}, we always have rank(Γ_CU) = rank(Γ_C). Therefore, Γ only needs to be identified up to a rotation.

Remark 2.2.

Almost all dense matrices Γ ∈ ℝ^{p×r} satisfy the conditions. However, for Θ₂(s) the sparsity of Γ allowed depends on the sparsity of β; the condition in Θ₂(s) rules out Γ that is too sparse. In this case, one may consider using confirmatory factor analysis instead of exploratory factor analysis to model the relationship between confounders and outcomes. For some recent identification results in confirmatory factor analysis, see [29, 34].

Remark 2.3.

The maximum allowed ‖β‖₀ in Θ₂, ⌊(p − r)/2⌋, is exactly the maximum breakdown point of a robust regression with p observations and r predictors [42]. Indeed, we use a standard robust regression method to estimate β in this case in Section 3.2.2.

Remark 2.4.

To the best of our knowledge, the only existing literature that explicitly addresses the identifiability issue is [58], Chapter 4.2, where the author gives sufficient conditions for local identifiability of β by viewing (2.1a) as a “sparse plus low rank” matrix decomposition problem. See [13], Section 3.3, for a more general discussion of the local and global identifiability for this problem. Local identifiability refers to identifiability of the parameters in a neighborhood of the true values. In contrast, the conditions in Proposition 2.1 ensure that β is globally identifiable in the restricted parameter space.

3. Statistical inference.

As mentioned earlier in Section 1, the statistical inference consists of two steps: the factor analysis (Section 3.1) and the linear regression (Section 3.2).

3.1. Inference for Γ and Σ.

The most popular approaches for factor analysis are principal component analysis (PCA) and maximum likelihood (ML). Bai and Ng [6] derived a class of estimators of r by principal component analysis using various information criteria. The estimators are consistent under Assumption 3 in this section and some additional technical assumptions in [6]. For this reason, we assume the number of confounding factors r is known in this section. See [45], Section 3, for a comprehensive literature review of choosing r in practice.

We are most interested in the asymptotic behavior of factor analysis when both n, p → ∞. In this case, PCA cannot consistently estimate the noise variance Σ [3]. For theoretical analysis, we use the quasi maximum likelihood estimate in [3] to get Γ̂ and Σ̂. This estimator is called “quasi”-MLE because it treats the factors Z̃₋₁ as fixed quantities. Since the confounders Z in our model (2.1a)–(2.1e) are random variables, we introduce a rotation matrix R ∈ ℝ^{r×r} and let Z̃₋₁⁽⁰⁾ = Z̃₋₁(R⁻¹)^T and Γ⁽⁰⁾ = ΓR be the target factors and factor loadings that are studied in [3].

To make Z̃₋₁⁽⁰⁾ and Γ⁽⁰⁾ identifiable, [3] consider five different identification conditions. However, the parameter of interest in model (2.1a)–(2.1e) is β instead of Γ or Γ⁽⁰⁾. As we have discussed in Section 2.4, we only need the column space of Γ to estimate β, which gives us some flexibility in choosing the identification condition. In our theoretical analysis, we use the third condition (IC3) in [3], which imposes the constraints that (n − 1)⁻¹(Z̃₋₁⁽⁰⁾)^T Z̃₋₁⁽⁰⁾ = I_r and p⁻¹(Γ⁽⁰⁾)^T Σ⁻¹ Γ⁽⁰⁾ is diagonal. Therefore, the rotation matrix R satisfies RR^T = (n − 1)⁻¹ Z̃₋₁^T Z̃₋₁.

The quasi-log-likelihood being maximized in [3] is

$$-\frac{1}{2p}\log\det\bigl(\Gamma^{(0)}(\Gamma^{(0)})^T + \Sigma\bigr) - \frac{1}{2p}\operatorname{tr}\bigl\{S\bigl[\Gamma^{(0)}(\Gamma^{(0)})^T + \Sigma\bigr]^{-1}\bigr\}, \tag{3.1}$$

where S is the sample covariance matrix of Ỹ₋₁.

The theoretical results in this section rely heavily on recent findings in [3], which are based on the following three assumptions.

Assumption 1.

The noise matrix E follows the matrix normal distribution E ~ MN(0, In, Σ) and Σ is a diagonal matrix.

Assumption 2.

There exists a positive constant D such that ‖Γ_j‖₂ ≤ D and D⁻² ≤ σ_j² ≤ D² for all j, and the estimated variances satisfy σ̂_j² ∈ [D⁻², D²] for all j.

Assumption 3.

The limits lim_{p→∞} p⁻¹Γ^TΣ⁻¹Γ and lim_{p→∞} p⁻¹∑_{j=1}^p σ_j⁻⁴(Γ_j ⊗ Γ_j)(Γ_j^T ⊗ Γ_j^T) exist and are positive definite matrices.

Lemma 3.1 (Bai and Li [3]).

Under Assumptions 1–3, the maximizer (Γ̂, Σ̂) of the quasi-log-likelihood (3.1) satisfies

$$\sqrt{n}\,(\hat\Gamma_j - \Gamma_j^{(0)}) \to_d N(0,\ \sigma_j^2 I_r) \quad\text{and}\quad \sqrt{n}\,(\hat\sigma_j^2 - \sigma_j^2) \to_d N(0,\ 2\sigma_j^4).$$

In the supplementary material [62], we prove some strengthened technical results of Lemma 3.1 that are used in the proof of subsequent theorems.

Remark 3.1.

Assumption 2 is Assumption D from [3]. It requires that the diagonal elements of the quasi-MLE Σ^ be uniformly bounded away from zero and infinity. We would prefer boundedness to be a consequence of some assumptions on the distribution of the data, but at present we are unaware of any other results like Lemma 3.1 which do not use this assumption. In practice, the quasi-likelihood problem (3.1) is commonly solved by the Expectation–Maximization (EM) algorithm. Similar to [3, 4], we do not find it necessary to impose an upper or lower bound for the parameters in the EM algorithm in the numerical experiments.
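
For readers who want a concrete version of the EM iterations mentioned in the remark, the following is a generic maximum likelihood factor analysis sketch for the model (2.5) with diagonal Σ (our code, not the exact quasi-ML routine of [3]); the PCA starting values and the number of iterations are arbitrary choices.

```python
import numpy as np

def factor_em(Ym1, r, n_iter=200):
    """EM for rows of Ym1 that are i.i.d. N(0, Gamma Gamma^T + Sigma) with diagonal Sigma."""
    m, p = Ym1.shape
    s_diag = (Ym1 ** 2).mean(axis=0)                          # diagonal of the sample covariance
    Gamma = np.linalg.svd(Ym1, full_matrices=False)[2][:r].T  # PCA start, p x r
    sigma2 = np.full(p, 1.0)
    for _ in range(n_iter):
        # E-step: posterior mean and covariance of the factors given each row.
        Gs = Gamma / sigma2[:, None]                          # Sigma^{-1} Gamma
        M = np.linalg.inv(np.eye(r) + Gamma.T @ Gs)           # posterior covariance
        Ez = Ym1 @ Gs @ M                                     # m x r posterior means
        Ezz = m * M + Ez.T @ Ez                               # sum of posterior second moments
        # M-step: update the loadings and the diagonal noise variances.
        YtEz = Ym1.T @ Ez                                     # p x r
        Gamma = YtEz @ np.linalg.inv(Ezz)
        sigma2 = np.clip(s_diag - np.einsum('jk,jk->j', Gamma, YtEz) / m, 1e-8, None)
    return Gamma, sigma2
```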

3.2. Inference for α and β.

The estimation of α and β is based on the first row of the rotated outcome, Ỹ₁ in (2.4), which can be rewritten as

$$\tilde Y_1^T/\|X\|_2 = \beta + \Gamma\bigl(\alpha + \tilde W_1^T/\|X\|_2\bigr) + \tilde E_1^T/\|X\|_2, \tag{3.2}$$

where W̃₁^T ~ N(0, I_r) is from (2.3) and is independent of Ẽ₁^T ~ N(0, Σ). Note that Ỹ₁/‖X‖₂ is proportional to the sample covariance between Y and X. All the methods described in this section first try to find a good estimator α̂. They then use β̂ = Ỹ₁^T/‖X‖₂ − Γ̂α̂ to estimate β.

To reduce variance, we choose to estimate (3.2) conditional on W̃₁. Also, to use the results in Lemma 3.1, we replace Γ by Γ⁽⁰⁾. Then we can rewrite (3.2) as

$$\tilde Y_1^T/\|X\|_2 = \beta + \Gamma^{(0)}\alpha^{(0)} + \tilde E_1^T/\|X\|_2, \tag{3.3}$$

where Γ⁽⁰⁾ = ΓR and α⁽⁰⁾ = R⁻¹(α + W̃₁^T/‖X‖₂). Notice that the random rotation R depends only on Z̃₋₁ and is thus independent of Ỹ₁. In the proofs of the results in this section, we first consider the estimation of β for fixed W̃₁, R and X, and then show that the asymptotic distribution of β̂ does not depend on W̃₁, R or X, and thus also holds unconditionally.

3.2.1. Negative control scenario.

If we know a set C such that β_C = 0 (so Θ ⊆ Θ₁), then Ỹ₁ can be correspondingly separated into two parts:

$$\tilde Y_{1,C}^T/\|X\|_2 = \Gamma_C^{(0)}\alpha^{(0)} + \tilde E_{1,C}^T/\|X\|_2, \quad\text{and}\quad \tilde Y_{1,-C}^T/\|X\|_2 = \beta_{-C} + \Gamma_{-C}^{(0)}\alpha^{(0)} + \tilde E_{1,-C}^T/\|X\|_2. \tag{3.4}$$

The number of negative controls |C| may grow as p → ∞. We impose an additional assumption on the latent factors of the negative controls.

Assumption 4.

lim_{p→∞} |C|⁻¹ Γ_C^T Σ_C⁻¹ Γ_C exists and is positive definite.

We consider the following negative control (NC) estimator, where α⁽⁰⁾ is estimated by generalized least squares:

$$\hat\alpha^{\mathrm{NC}} = \bigl(\hat\Gamma_C^T\hat\Sigma_C^{-1}\hat\Gamma_C\bigr)^{-1}\hat\Gamma_C^T\hat\Sigma_C^{-1}\,\tilde Y_{1,C}^T/\|X\|_2 \quad\text{and} \tag{3.5}$$
$$\hat\beta_{-C}^{\mathrm{NC}} = \tilde Y_{1,-C}^T/\|X\|_2 - \hat\Gamma_{-C}\hat\alpha^{\mathrm{NC}}. \tag{3.6}$$

This estimator matches the RUV-4 estimator of [25] except that it uses quasi-maximum likelihood estimates of Σ and Γ instead of PCA, and generalized least squares instead of ordinary least squares regression. The details are in Section 5.3.2.
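
A direct implementation of (3.5)–(3.6), including the diagonal of the finite-sample variance correction Δ that appears in Theorem 3.1 below, might look as follows (a sketch under our naming conventions, not the RUV-4 code).

```python
import numpy as np

def nc_estimator(Y1, X, Gamma_hat, sigma2_hat, C):
    """Y1: first rotated row of Y (length p); C: indices of the negative controls (beta_C = 0)."""
    y = Y1 / np.linalg.norm(X)                        # Y1^T / ||X||_2
    Gc, sc = Gamma_hat[C], sigma2_hat[C]
    A = Gc.T @ (Gc / sc[:, None])                     # Gamma_C^T Sigma_C^{-1} Gamma_C
    alpha_nc = np.linalg.solve(A, Gc.T @ (y[C] / sc))         # GLS step (3.5)
    keep = np.setdiff1d(np.arange(len(y)), C)
    beta_nc = y[keep] - Gamma_hat[keep] @ alpha_nc            # step (3.6)
    # Diagonal of the correction term Delta from Theorem 3.1 below.
    delta = np.einsum('jk,jk->j', Gamma_hat[keep] @ np.linalg.inv(A), Gamma_hat[keep])
    var_beta = (1 + alpha_nc @ alpha_nc) * (sigma2_hat[keep] + delta) / np.linalg.norm(X) ** 2
    return beta_nc, alpha_nc, var_beta, keep
```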

Our goal is to show consistency and derive the asymptotic variance of β̂_{-C}^NC. Let Σ_C denote the noise covariance matrix of the variables in C. We then have:

Theorem 3.1.

Under Assumptions 1–4, if n, p → ∞ and (log p)²/n → 0, then for any fixed index set S of finite cardinality with S ∩ C = ∅, we have

$$\sqrt{n}\,(\hat\beta_S^{\mathrm{NC}} - \beta_S) \to_d N\bigl(0,\ (1 + \|\alpha\|_2^2)(\Sigma_S + \Delta_S)\bigr), \tag{3.7}$$

where Δ_S = Γ_S(Γ_C^TΣ_C⁻¹Γ_C)⁻¹Γ_S^T.

If, in addition, |C| → ∞, then the minimum eigenvalue of Γ_C^TΣ_C⁻¹Γ_C goes to infinity by Assumption 4, so the maximum entry of Δ_S goes to 0. Therefore, in this case,

$$\sqrt{n}\,(\hat\beta_S^{\mathrm{NC}} - \beta_S) \to_d N\bigl(0,\ (1 + \|\alpha\|_2^2)\Sigma_S\bigr). \tag{3.8}$$

The asymptotic variance in (3.8) is the same as the variance of the oracle least squares in (2.2). Comparable oracle efficiency statements can be found in the econometrics literature [7, 63]. This is also the variance used implicitly in RUV-4, as it treats the estimated Z as given when deriving test statistics for β. When the number of negative controls is not too large, say |C| = 30, the correction term Δ_S is nontrivial and gives a more accurate estimate of the variance of β̂^NC. See Section 6.1 for more simulation results.

3.2.2. Sparsity scenario.

When the nonzero entries of β are sparse but their positions are unknown (so Θ ⊆ Θ₂), the estimation of α and β from Ỹ₁^T/‖X‖₂ = β + Γ⁽⁰⁾α⁽⁰⁾ + Ẽ₁^T/‖X‖₂ can be cast as a robust regression by viewing Ỹ₁^T as the observations and Γ⁽⁰⁾ as the design matrix. The nonzero entries in β correspond to outliers in this linear regression.

The problem here has two nontrivial differences compared to classical robust regression. First, we expect some entries of β to be nonzero, and our goal is to make inference on the outliers; second, we do not observe the design matrix Γ⁽⁰⁾ but only have its estimator Γ̂. In fact, if β = 0 and Γ⁽⁰⁾ is observed, the ordinary least squares estimator of α⁽⁰⁾ is unbiased and has variance of order 1/(np), because the noise in (3.2) has variance of order 1/n and there are p observations. Our main conclusion is that α⁽⁰⁾ can still be estimated very accurately despite these two technical difficulties.

Given a robust loss function ρ, we consider the following estimator:

$$\hat\alpha^{\mathrm{RR}} = \arg\min_{\alpha}\ \sum_{j=1}^p \rho\!\left(\frac{\tilde Y_{1j}/\|X\|_2 - \hat\Gamma_j^T\alpha}{\hat\sigma_j}\right) \quad\text{and} \tag{3.9}$$
$$\hat\beta^{\mathrm{RR}} = \tilde Y_1^T/\|X\|_2 - \hat\Gamma\hat\alpha^{\mathrm{RR}}. \tag{3.10}$$

For a broad class of loss functions ρ, estimating α by (3.9) is equivalent to

$$(\hat\alpha^{\mathrm{RR}}, \tilde\beta) = \arg\min_{\alpha,\beta}\ \sum_{j=1}^p \frac{1}{\hat\sigma_j^2}\bigl(\tilde Y_{1j}/\|X\|_2 - \beta_j - \hat\Gamma_j^T\alpha\bigr)^2 + P_\lambda(\beta), \tag{3.11}$$

where P_λ(β) is a penalty that promotes sparsity of β [54]. However, β̂^RR is not identical to β̃, which is a sparse vector and does not have an asymptotic normal distribution. The LEAPP algorithm [59] uses the form (3.11). Replacing it by the robust regression (3.9) and (3.10) allows us to derive significance tests of H₀ⱼ : βⱼ = 0.
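
The M-estimation in (3.9)–(3.10) can be computed by iteratively reweighted least squares (IRLS). A minimal sketch with the Huber ψ (our illustration, not the LEAPP or Θ-IPOD code) is given below; the tuning constant and iteration count are arbitrary.

```python
import numpy as np

def rr_estimator(Y1, X, Gamma_hat, sigma2_hat, c=1.345, n_iter=100):
    y = Y1 / np.linalg.norm(X)                        # length-p "responses"
    s = np.sqrt(sigma2_hat)
    G, z = Gamma_hat / s[:, None], y / s              # standardize by the noise scale
    alpha = np.linalg.lstsq(G, z, rcond=None)[0]      # ordinary least squares start
    for _ in range(n_iter):
        res = z - G @ alpha
        w = np.minimum(1.0, c / np.maximum(np.abs(res), 1e-12))   # Huber weights psi(r)/r
        WG = G * w[:, None]
        alpha = np.linalg.solve(G.T @ WG, WG.T @ z)   # weighted least squares update
    beta_rr = y - Gamma_hat @ alpha                   # step (3.10)
    return beta_rr, alpha
```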

We assume a smooth loss ρ for the theoretical analysis.

Assumption 5.

The loss function ρ : ℝ → [0, ∞) satisfies ρ(0) = 0. The function ρ(x) is nonincreasing for x ≤ 0 and nondecreasing for x > 0. The derivative ψ = ρ′ exists and |ψ| ≤ D for some D < ∞. Furthermore, ρ is strongly convex in a neighborhood of 0.

A sufficient condition for the local strong convexity is that ψ′ exists and is positive in a neighborhood of 0. The next theorem establishes the consistency of β̂^RR.

Theorem 3.2.

Under Assumptions 1–3 and 5, if n, p → ∞, (log p)²/n → 0 and ‖β‖₁/p → 0, then α̂^RR →_p α. As a consequence, for any j, β̂_j^RR →_p β_j.

To derive the asymptotic distribution, we consider the estimating equation corresponding to (3.9). Taking the derivative of (3.9), α̂^RR satisfies

$$\Psi_{p,\hat\Gamma,\hat\Sigma}(\hat\alpha^{\mathrm{RR}}) = \frac{1}{p}\sum_{j=1}^p \psi\!\left(\frac{\tilde Y_{1j}/\|X\|_2 - \hat\Gamma_j^T\hat\alpha^{\mathrm{RR}}}{\hat\sigma_j}\right)\hat\Gamma_j/\hat\sigma_j = 0. \tag{3.12}$$

The next assumption is used to control the higher order term in a Taylor expansion of Ψ.

Assumption 6.

The first two derivatives of ψ exist and both |ψ′ (x)| ≤ D and |ψ″ (x)| ≤ D hold at all x for some D < ∞.

Examples of loss functions ρ that satisfy Assumptions 5 and 6 include smoothed Huber loss and Tukey’s bisquare.

The next theorem gives the asymptotic distribution of β^RR when the nonzero entries of β are sparse enough. The asymptotic variance of β^RR is, again, the oracle variance in (2.2).

Theorem 3.3.

Under Assumptions 1–3, 5 and 6, if n, p → ∞ with (log p)²/n → 0 and ‖β‖₁√n/p → 0, then

$$\sqrt{n}\,(\hat\beta_S^{\mathrm{RR}} - \beta_S) \to_d N\bigl(0,\ (1 + \|\alpha\|_2^2)\Sigma_S\bigr)$$

for any fixed index set S with finite cardinality.

If n/p → 0, then a sufficient condition for ‖β‖₁√n/p → 0 in Theorem 3.3 is ‖β‖₁ = O(√p). If instead n/p → c ∈ (0, ∞), then ‖β‖₁ = o(√p) suffices.

3.3. Hypothesis testing.

In this section, we construct significance tests for β and α based on the asymptotic normal distributions in the previous section.

3.3.1. Test of the primary effects.

We consider the asymptotic test for H0j : βj = 0, j = 1,…, p resulting from the asymptotic distributions of β^j derived in Theorems 3.1 and 3.3:

$$t_j = \frac{\|X\|_2\,\hat\beta_j}{\hat\sigma_j\sqrt{1 + \|\hat\alpha\|_2^2}}, \qquad j = 1, \dots, p. \tag{3.13}$$

Here, we require |C| → ∞ for the NC estimator. The null hypothesis H₀ⱼ is rejected at level α if |t_j| > z_{α/2} = Φ⁻¹(1 − α/2) as usual, where Φ is the cumulative distribution function of the standard normal. Note that here we slightly abuse notation and use α to denote the significance level; this should not be confused with the model parameter α.
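
Given estimates β̂, α̂ and σ̂² from either scenario, the statistics (3.13) and a Benjamini–Hochberg step are straightforward to compute. The sketch below is ours; the function names are hypothetical and the level q = 0.2 mirrors the simulations of Section 6.

```python
import numpy as np
from scipy.stats import norm

def confounder_adjusted_tests(beta_hat, alpha_hat, sigma2_hat, X):
    """z-statistics of (3.13) and two-sided p-values."""
    t = np.linalg.norm(X) * beta_hat / (np.sqrt(sigma2_hat) * np.sqrt(1 + alpha_hat @ alpha_hat))
    return t, 2 * norm.sf(np.abs(t))

def benjamini_hochberg(pvals, q=0.2):
    """Boolean vector of rejections by the BH procedure at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```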

The next theorem shows that the overall type-I error and the family-wise error rate (FWER) can be asymptotically controlled by using the test statistics tj, j = 1,…, p.

Theorem 3.4.

Let N_p = {j : β_j = 0, j = 1, …, p} be the set of true null hypotheses. Under the assumptions of Theorem 3.1 or Theorem 3.3, with |C| → ∞ in the NC scenario, as n, p, |N_p| → ∞,

$$\frac{1}{|N_p|}\sum_{j\in N_p} I(|t_j| > z_{\alpha/2}) \to_p \alpha \quad\text{and} \tag{3.14}$$
$$\limsup\ P\Bigl(\sum_{j\in N_p} I(|t_j| > z_{\alpha/(2p)}) \ge 1\Bigr) \le \alpha. \tag{3.15}$$

Although each individual test is asymptotically valid because t_j →_d N(0, 1), Theorem 3.4 is not a trivial corollary of the asymptotic normal distributions in Theorems 3.1 and 3.3, because the statistics t_j, j = 1, …, p, are not independent in finite samples. The proof of Theorem 3.4 investigates how the dependence of the test statistics diminishes as n, p → ∞ and requires a careful analysis of the convergence of β̂ in Theorem 3.3. Proving FDR control with our test statistics is more cumbersome. In Section 6, we show that FDR is usually well controlled in simulations by the Benjamini–Hochberg procedure when the sample size is large enough.

Remark 3.2.

We find a calibration technique in [59] very useful to improve the type I error and FDR control for finite sample size. Because the asymptotic variance used in (3.13) is the variance of an oracle OLS estimator, when the sample size is not sufficiently large, the variance of β^RR should be slightly larger than this oracle variance. To correct for this inflation, one can use median absolute deviation (MAD) with customary scaling to match the standard deviation for a Gaussian distribution to estimate the empirical standard error of tj, j = 1,…, p and divide tj by the estimated standard error. The performance of this empirical calibration is studied in the simulations in Section 6.1.
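
A minimal version of this calibration (our sketch) computes the MAD of the statistics, rescales it to match a Gaussian standard deviation, and divides the statistics by it.

```python
import numpy as np

def mad_calibrate(t):
    """Divide the z-statistics by their MAD, scaled to match a Gaussian standard deviation."""
    scale = 1.4826 * np.median(np.abs(t - np.median(t)))
    return t / scale
```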

3.3.2. Test of confounding.

We also consider a significance test for H0,α : α = 0, under which the latent factors are not confounding.

Theorem 3.5.

Suppose the assumptions of Theorem 3.1 or Theorem 3.3 hold, with |C| → ∞ in the NC scenario. Under the null hypothesis that α = 0, for α̂ = α̂^NC in (3.5) or α̂ = α̂^RR in (3.9), we have

$$n\,\hat\alpha^T\hat\alpha \to_d \chi_r^2,$$

where χ_r² is the chi-square distribution with r degrees of freedom.

Therefore, the null hypothesis H_{0,α} : α = 0 is rejected if nα̂^Tα̂ > χ²_{r,α}, where χ²_{r,α} is the upper-α quantile of χ_r². This test, combined with exploratory factor analysis, can be used as a diagnostic tool for practitioners to check whether the data gathering process involves any confounding factors that could bias the multiple hypothesis testing.
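
The test is simple to carry out once α̂ and r are available; a short sketch (ours, with a hypothetical function name) follows.

```python
import numpy as np
from scipy.stats import chi2

def confounding_test(alpha_hat, n):
    """Test statistic and p-value for H0: alpha = 0 (Theorem 3.5)."""
    stat = n * float(alpha_hat @ alpha_hat)
    return stat, chi2.sf(stat, df=len(alpha_hat))
```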

4. Extension to multiple regression.

In Sections 2 and 3, we assume that there is only one primary variable X and all the random variables X, Y and Z have mean 0. In practice, there may be several predictors, or we may want to include an intercept term in the regression model. Here, we develop a multiple regression extension to the original model (2.1a)–(2.1e).

Suppose we observe in total d = d0 + d1 random predictors that can be separated into two groups:

  1. X0: n × d0 nuisance covariates that we would like to include in the regression model, and

  2. X1: n × d1 primary variables whose effects we want to study.

For example, the intercept term can be included in X₀ as an n × 1 vector of ones (i.e., a random variable with mean 1 and variance 0).

Leek and Storey [39] consider the case d0 = 0 and d1 ≥ 1 for SVA and [59] consider the case d0 ≥ 0 and d1 = 1 for LEAPP. Here, we study the confounder adjusted multiple regression in full generality, for any d0 ≥ 0 and d1 ≥ 1. Our model is

$$Y = X_0B_0^T + X_1B_1^T + Z\Gamma^T + E, \tag{4.1a}$$
$$(X_{0i},\ X_{1i}) \text{ are i.i.d. with } E\bigl[(X_{0i},\ X_{1i})(X_{0i},\ X_{1i})^T\bigr] = \Sigma_X, \tag{4.1b}$$
$$Z \mid (X_0, X_1) \sim \mathrm{MN}(X_0A_0^T + X_1A_1^T,\ I_n,\ I_r), \quad\text{and} \tag{4.1c}$$
$$E \perp (X_0, X_1, Z), \qquad E \sim \mathrm{MN}(0,\ I_n,\ \Sigma). \tag{4.1d}$$

The model does not specify means for X_{0i} and X_{1i}; we do not need them. The parameters in this model are, for i = 0 or 1, B_i ∈ ℝ^{p×d_i}, Γ ∈ ℝ^{p×r}, Σ_X ∈ ℝ^{d×d}, and A_i ∈ ℝ^{r×d_i}. The parameters A and B are the matrix versions of α and β in the model (2.1a)–(2.1e). Additionally, we assume Σ_X is invertible. To clarify our purpose, we are primarily interested in estimating and testing for the significance of B₁.

For the multiple regression model (4.1), we again consider the rotation matrix Q^T that is given by the QR decomposition (X₀ X₁) = QU, where Q ∈ ℝ^{n×n} is an orthogonal matrix and U is an upper triangular matrix of size n × d. Therefore, we have

$$Q^T(X_0\ X_1) = U = \begin{pmatrix} U_{00} & U_{01} \\ 0 & U_{11} \\ 0 & 0 \end{pmatrix},$$

where U₀₀ is a d₀ × d₀ upper triangular matrix and U₁₁ is a d₁ × d₁ upper triangular matrix. Now let the rotated Y be

$$\tilde Y = Q^TY = \begin{pmatrix} \tilde Y_0 \\ \tilde Y_1 \\ \tilde Y_{-1} \end{pmatrix}, \tag{4.2}$$

where Ỹ₀ is d₀ × p, Ỹ₁ is d₁ × p and Ỹ₋₁ is (n − d) × p. We can then partition the model into three parts: conditional on both X₀ and X₁ (hence U),

$$\tilde Y_0 = U_{00}B_0^T + U_{01}B_1^T + \tilde Z_0\Gamma^T + \tilde E_0, \tag{4.3}$$
$$\tilde Y_1 = U_{11}B_1^T + \tilde Z_1\Gamma^T + \tilde E_1 \sim \mathrm{MN}\bigl(U_{11}(B_1 + \Gamma A_1)^T,\ I_{d_1},\ \Gamma\Gamma^T + \Sigma\bigr), \tag{4.4}$$
$$\tilde Y_{-1} = \tilde Z_{-1}\Gamma^T + \tilde E_{-1} \sim \mathrm{MN}(0,\ I_{n-d},\ \Gamma\Gamma^T + \Sigma), \tag{4.5}$$

where Z̃ = Q^TZ and Ẽ = Q^TE =_d E. Equation (4.3) involves the nuisance parameters B₀ and is discarded according to the ancillarity principle. Equation (4.4) is the multivariate extension of (2.4) that is used to estimate B₁, and equation (4.5) plays the same role as (2.5) in estimating Γ and Σ.

We consider the asymptotics when n, p → ∞ and d, r are fixed and known. Since d is fixed, the estimation of Γ is not different from the simple regression case and we can use the maximum likelihood factor analysis described in Section 3.1. Under Assumptions 1–3, the precision results of Γ̂ and Σ̂ in Lemma 3.1 still hold.

Let $\Sigma_X^{-1} = \Omega = \begin{pmatrix}\Omega_{00} & \Omega_{01}\\ \Omega_{10} & \Omega_{11}\end{pmatrix}$. In the proofs of Theorems 3.1 and 3.3, we consider a fixed sequence of X such that ‖X‖₂/√n → 1. Similarly, we have the following lemma in the multiple regression scenario:

Lemma 4.1.

As n → ∞, (1/n)U₁₁^TU₁₁ →_{a.s.} Ω₁₁⁻¹.

Similar to (3.2), we can rewrite (4.4) as

$$\tilde Y_1^TU_{11}^{-T} = B_1 + \Gamma\bigl(A_1 + \tilde W_1^TU_{11}^{-T}\bigr) + \tilde E_1^TU_{11}^{-T},$$

where W̃₁ ~ MN(0, I_{d₁}, I_r) is independent of Ẽ₁. As in Section 3.2, we derive statistical properties of the estimator of B₁ for a fixed sequence of X, W̃₁ and Z, which also hold unconditionally. For simplicity, we assume that the negative controls are a known set of variables C with B_{1,C} = 0. We can then estimate each column of A₁ by applying the negative control (NC) or robust regression (RR) method discussed in Sections 3.2.1 and 3.2.2 to the corresponding column of Ỹ₁^TU₁₁^{-T}, and then estimate B₁ by

$$\tilde B_1 = \tilde Y_1^TU_{11}^{-T} - \hat\Gamma\hat A_1.$$

Notice that Ẽ₁^TU₁₁^{-T} ~ MN(0, Σ, U₁₁⁻¹U₁₁^{-T}). Thus, the “samples” in the robust regression, which are actually the p outcome variables in the original problem, are still independent within each column. Though the estimates of the columns of A₁ may be correlated, we will show that this correlation does not affect inference on B₁. As a result, we still obtain asymptotic results similar to Theorem 3.3 for the multiple regression model (4.1).

Theorem 4.1.

Under Assumptions 1–6, if n, p → ∞ with (log p)²/n → 0 and ‖vec(B₁)‖₁√n/p → 0, then for any fixed index set S with finite cardinality |S|,

$$\sqrt{n}\,(\hat B_{1,S}^{\mathrm{NC}} - B_{1,S}) \to_d \mathrm{MN}\bigl(0_{|S|\times d_1},\ \Sigma_S + \Delta_S,\ \Omega_{11} + A_1^TA_1\bigr) \quad\text{and} \tag{4.6}$$
$$\sqrt{n}\,(\hat B_{1,S}^{\mathrm{RR}} - B_{1,S}) \to_d \mathrm{MN}\bigl(0_{|S|\times d_1},\ \Sigma_S,\ \Omega_{11} + A_1^TA_1\bigr), \tag{4.7}$$

where ΔS is defined in Theorem 3.1.

As for the asymptotic efficiency of this estimator, we again compare it to the oracle OLS estimator of B₁ which observes the confounding variables Z in (4.1). In the multiple regression model, we claim that B̂₁^RR still reaches the oracle asymptotic efficiency. In fact, let B = (B₀ B₁ Γ). The oracle OLS estimator of B, B̂^OLS, is unbiased and its vectorization has variance V⁻¹ ⊗ Σ/n, where

$$V = \begin{pmatrix} \Sigma_X & \Sigma_XA^T \\ A\Sigma_X & I_r + A\Sigma_XA^T \end{pmatrix} \quad\text{for } A = (A_0\ A_1).$$

By the block-wise matrix inversion formula, the top left d × d block of V⁻¹ is Σ_X⁻¹ + A^TA. The variance of B̂₁^OLS only depends on the bottom right d₁ × d₁ sub-block of this d × d block, which is simply Ω₁₁ + A₁^TA₁. Therefore, B̂₁^OLS is unbiased and its vectorization has variance (Ω₁₁ + A₁^TA₁) ⊗ Σ/n, matching the asymptotic variance of B̂₁^RR in Theorem 4.1.

5. Discussion.

5.1. Confounding versus unconfounding.

The issue of multiple testing dependence arises because Z in the true model (1.1) is unobserved. We have focused on the case where Z is confounded with the primary variable. Some similar results were obtained earlier for the unconfounded case, corresponding to α = 0 in our notation. For example, [35] used a factor model to improve the efficiency of significance tests of the regression intercepts. Jin [31], Li and Zhong [40] developed more powerful procedures for testing β while still controlling FDR under unconfounded dependence.

In another related work, Fan, Han and Gu [21] imposed a factor structure on the unconfounded test statistics, whereas this paper and the articles discussed later in Section 5.3 assume a factor structure on the raw data. Fan, Han and Gu [21] used an approximate factor model to accurately estimate the false discovery proportion. Their correction procedure also includes a step of robust regression. Nevertheless, it is often difficult to interpret the factor structure of the test statistics. In comparison, the latent variables Z in our model (2.1a)–(2.1e), whether confounding or not, can be interpreted as batch effects, laboratory conditions, or other systematic bias. Such problems are widely observed in genetics studies (see, e.g., the review article [37]).

As a final remark, some of the models and methods developed in the context of unconfounded hypothesis testing may be useful for confounded problems as well. For example, the relationship between Z and X need not be linear as in (1.2). In certain applications, it may be more appropriate to use a time-series model [57] or a mixture model [19].

5.2. Marginal effects versus direct effects.

In Section 1, we switched our interest from the marginal effects τ in (1.3) to the direct effects β. We believe that they are usually more scientifically meaningful and interpretable than the marginal effects. For instance, if the treated (control) samples are analyzed by machine A (machine B), and machine A outputs higher values than machine B, we certainly do not want to include the effects of this machine-to-machine variation on the outcome measurements.

When model (2.1a)–(2.1e) is interpreted as a “structural equations model” [11], β is indeed the causal effect of X on Y [46]. In this paper, we do not make such structural assumptions about the data generating process. Instead, we use (2.1a)–(2.1e) to describe the screening procedure commonly applied in high throughput data analysis. The model (2.1a)–(2.1e) also describes how we think the marginal effects can be confounded, and hence different from the more meaningful direct effects β. Additionally, the asymptotic setting in this paper is quite different from that in the traditional structural equations model.

5.3. Comparison with existing confounder adjustment methods.

We discuss in more detail how previous methods of confounder adjustment, namely SVA [38, 39], RUV-4 [25, 26] and LEAPP [59], fit in the framework (2.1a)–(2.1e). See [47] for an alternative approach of bilinear regression with latent factors that is also motivated by high-throughput data analysis.

5.3.1. SVA.

There are two versions of SVA: the reduced subset SVA (subset-SVA) of [38] and the iteratively reweighted SVA (IRW-SVA) of [39]. Both of them can be interpreted as the two-step statistical procedure in the framework (2.1a)–(2.1e). In the first step, SVA estimates the confounding factors by applying PCA to the residual matrix (I − H_X)Y, where H_X = X(X^TX)⁻¹X^T is the projection matrix of X. In contrast, we applied factor analysis to the rotated residual matrix (Q^TY)₋₁, where Q comes from the QR decomposition of X in Section 4. To see why these two approaches lead to the same estimate of Γ, we introduce the block form Q = (Q₁ Q₂), where Q₁ ∈ ℝ^{n×d} and Q₂ ∈ ℝ^{n×(n−d)}. It is easy to show that (Q^TY)₋₁ = Q₂^TY and (I − H_X)Y = Q₂Q₂^TY. Thus, our rotated matrix (Q^TY)₋₁ is obtained from the residual matrix by left-multiplying it by Q₂^T (because Q₂^TQ₂ = I_{n−d}). Because (Q₂^TY)^TQ₂^TY = (Q₂Q₂^TY)^TQ₂Q₂^TY, the matrices (Q^TY)₋₁ and (I − H_X)Y have the same sample covariance matrix, so they yield the same factor loading estimate under PCA and also under MLE. The main advantage of using the rotated matrix is theoretical: the rotated residual matrix has independent rows.
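
This equivalence is easy to check numerically; the sketch below (ours, on random placeholder data) verifies that Q₂^TY and (I − H_X)Y have identical sample covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 30, 8, 1
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, p))
Q, _ = np.linalg.qr(X, mode='complete')
Q2 = Q[:, d:]                                   # last n - d columns of Q
H = X @ np.linalg.solve(X.T @ X, X.T)           # projection onto the column space of X
R1, R2 = Q2.T @ Y, (np.eye(n) - H) @ Y          # rotated vs. ordinary residual matrices
assert np.allclose(R1.T @ R1, R2.T @ R2)        # identical sample covariance matrices
```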

Because SVA does not assume an explicit relationship between the primary variable X and the confounders Z, it cannot use the regression (3.2) to estimate α (which is not even defined) and β. Instead, the two SVA algorithms try to reconstruct the surrogate variables, which are essentially the confounders Z in our framework. Assuming the true primary effect β is sparse, the subset-SVA algorithm finds the outcome variables that have the smallest marginal correlation with X and uses their principal scores as Z. Then it computes the p-values by F-tests comparing the linear regression models with and without Z. This procedure can easily fail because, in the presence of confounding factors, a small marginal correlation does not imply the absence of a real effect of X. For example, most of the marginal effects in the gender study in Figure 1(b) are very small, but after confounder adjustment we find some are indeed significant (see Section 6.2).

The IRW-SVA algorithm modifies subset-SVA by iteratively choosing the subset. At each step, IRW-SVA gives a weight to each outcome variable based on how likely it is that β_j = 0 given the current estimate of the surrogate variables. The weights are then used in a weighted PCA algorithm to update the estimated surrogate variables. IRW-SVA may be related to our robust regression estimator in Section 3.2.2 in the sense that an M-estimator is commonly solved by Iteratively Reweighted Least Squares (IRLS) and the weights also represent how likely a data point is to be an outlier. However, unlike IRLS, the iteratively reweighted PCA algorithm is not even guaranteed to converge. Some previous articles [25, 59] and our experiments in Section 6.1 and the supplementary material [62] show that SVA is outperformed by the NC and RR estimators in most confounded examples.

5.3.2. RUV.

Gagnon-Bartsch, Jacob and Speed [25] derived the RUV-4 estimator of β via a sequence of heuristic calculations. In Section 3.2.1, we derived an analytically more tractable estimator β̂^NC which is essentially the same as RUV-4, with the only difference being that we use MLE instead of PCA to estimate the factors and GLS instead of OLS in (3.5). To see why the two estimators coincide: in the first step, RUV-4 uses the residual matrix to estimate Γ and Z, which yields the same estimate as using the rotated matrix (Section 5.3.1). In the second step, RUV-4 estimates β via a regression of Y on X and the reconstructed confounders Ẑ. This is equivalent to using ordinary least squares (OLS) to estimate α in (3.4). Based on more heuristic calculations, the authors claim that the RUV-4 estimator has approximately the oracle variance. We rigorously prove this statement in Theorem 3.1 when the number of negative controls is large and give a finite sample correction when the negative controls are few. In Section 6.1, we show this correction is very useful for controlling the type I error and FDR in simulations.

5.3.3. LEAPP.

We follow the two-step procedure and robust regression framework of LEAPP [59] in this paper, so the test statistics t_j^RR are very similar to the test statistics in LEAPP. The difference is that LEAPP uses the Θ-IPOD algorithm of [54] for outlier detection, which is robust against outliers at leverage points but is not easy to analyze; indeed, [59] replaced it by the Dantzig selector in their theoretical appendix. The classical M-estimator, although not robust to leverage points [64], allows us to study the theoretical properties more easily. In practice, LEAPP and the RR estimator usually produce very similar results; see Section 6.1 for a numerical comparison.

5.4. Inference when Σ is nondiagonal.

Our analysis is based on the assumption that the noise covariance matrix Σ is diagonal, though in many applications, the researcher might suspect that the outcome variables Y in model (2.1a)–(2.1e) are still correlated after conditioning on the latent factors. Typical examples include gene regulatory networks [17] and cross-sectional panel data [48], where the variable dependence sometimes cannot be fully explained by the latent factors or may simply require too many of them. Bai and Li [5] extend the theoretical results in [3] to approximate factor models allowing for weakly correlated noise. Approximate factor models have also been discussed in [20].

6. Numerical experiments.

6.1. Simulations.

We have provided theoretical guarantees of confounder adjusting methods in various settings and the asymptotic regime of n, p → ∞ (e.g., Theorems 3.1–3.4 and 4.1). Now we use numerical simulations to verify these results and further study the finite sample properties of our estimators and test statistics.

The simulation data are generated from the single primary variable model (2.1a)–(2.1e). More specifically, X_i is a centered binary variable with (X_i + 1)/2 ~ i.i.d. Bernoulli(0.5), and Y_i, Z_i are generated according to (2.1a)–(2.1e).

For the parameters in the model, the noise variances are generated by σ_j² ~ i.i.d. InvGamma(3, 2), j = 1, …, p, so that E(σ_j²) = Var(σ_j²) = 1. We set each α_k = ‖α‖₂/√r equally for k = 1, 2, …, r, where ‖α‖₂² is set to 1, so the proportion of the variance of X_i explained by the confounding factors is R² = 50%. (Additional results for R² = 5% and 0 are in the supplementary material [62].) The primary effect β has independent components β_j taking the values 3√((1 + ‖α‖₂²)/n) and 0 with probability π = 0.05 and 1 − π = 0.95, respectively, so the nonzero effects are sparse and have effect size 3. This implies that the oracle estimator has power approximately P(N(3, 1) > z_{0.025}) = 0.85 to detect the signals at a significance level of 0.05. We set the number of latent factors r to be either 2 or 10. For the latent factor loading matrix Γ, we take Γ = Γ̃D, where Γ̃ is a p × r orthogonal matrix sampled uniformly from the Stiefel manifold V_r(ℝ^p), the set of all p × r orthogonal matrices. Based on Assumption 3, we set the latent factor strength D = √p · diag(d₁, …, d_r), where d_k = 3 − 2(k − 1)/(r − 1), so d₁ to d_r are evenly spaced between 3 and 1. As the number of factors r can be easily estimated in this strong factor setting (more discussion can be found in [45]), we assume that the number r of factors is known to all of the algorithms in this simulation.

We set p = 5000 and n = 100 or 500 to mimic the data size of many genetic studies. For the negative control scenario, we choose |C| = 30 negative controls at random from the zero positions of β. We expect the negative control methods to perform better with a larger value of |C| and worse with a smaller value. The choice |C| = 30 is around the size of the spike-in control set in many microarray experiments [26]. For the loss function in the sparsity scenario, we use Tukey’s bisquare, which is optimized via IRLS with an ordinary least squares fit as the starting value of the coefficients. Finally, each of the four combinations of n and r is randomly repeated 100 times.
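
For reproducibility of the general design (this is our sketch, not the authors' code), the following generates data from this setting; the exact scaling of the nonzero effects, 3√((1 + ‖α‖₂²)/n), is our reading of the stated effect size and oracle power.

```python
import numpy as np

def simulate(n=100, p=5000, r=2, pi=0.05, seed=0):
    rng = np.random.default_rng(seed)
    sigma2 = 1 / rng.gamma(shape=3, scale=0.5, size=p)         # InvGamma(3, 2): mean = var = 1
    alpha = np.full(r, 1 / np.sqrt(r))                         # ||alpha||_2^2 = 1 (R^2 = 50%)
    Gamma_tilde = np.linalg.qr(rng.standard_normal((p, r)))[0] # ~uniform p x r orthogonal matrix
    d = 3 - 2 * np.arange(r) / max(r - 1, 1)                   # factor strengths from 3 down to 1
    Gamma = np.sqrt(p) * Gamma_tilde * d
    beta = np.where(rng.random(p) < pi, 3 * np.sqrt((1 + alpha @ alpha) / n), 0.0)
    X = rng.choice([-1.0, 1.0], size=n)                        # centered binary primary variable
    Z = np.outer(X, alpha) + rng.standard_normal((n, r))       # confounded latent factors
    E = rng.standard_normal((n, p)) * np.sqrt(sigma2)          # heteroscedastic independent noise
    Y = np.outer(X, beta) + Z @ Gamma.T + E
    return Y, X, Z, beta, Gamma, sigma2
```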

We compare the performance of nine different approaches. There are two baseline methods: the "naive" method estimates β by a linear regression of Y on just the observed primary variable X and calculates p-values using classical t-tests, while the "oracle" method regresses Y on both X and the confounding variables Z as described in Section 2.1. There are three methods in the RUV-4/negative controls family: the RUV-4 method [25]; our "NC" method, which computes test statistics using $\hat{\beta}^{\mathrm{NC}}$ and its variance estimate $(1+\|\hat{\alpha}\|_2^2)(\hat{\Sigma}+\hat{\Delta})$; and our "NC-ASY" method, which uses the same $\hat{\beta}^{\mathrm{NC}}$ but estimates its variance by $(1+\|\hat{\alpha}\|_2^2)\hat{\Sigma}$. We compare four methods in the SVA/LEAPP/sparsity family: "IRW-SVA" [39]; "LEAPP" [59]; the "LEAPP(RR)" method, which is our RR estimator using M-estimation at the robustness stage and computes the test statistics using (3.13); and the "LEAPP(RR-MAD)" method, which uses the median absolute deviation (MAD) of the test statistics in (3.13) to calibrate them (see Section 3.3).

To measure the performance of these methods, we report the type I error (Theorem 3.4), power, false discovery proportion (FDP) and the precision among the hypotheses with the 100 smallest p-values, over the 100 simulations. For both the type I error and power, we set the significance level to 0.05. For FDP, we use the Benjamini–Hochberg procedure with the FDR controlled at 0.2. These metrics are plotted in Figure 2 under the different settings of n and r.
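For concreteness, the four metrics can be computed from a vector of p-values and the indicator of true nonnulls as in the following sketch (the function name and interface are hypothetical).

```python
import numpy as np

def evaluation_metrics(pvals, is_nonnull, alpha=0.05, fdr_level=0.2, top=100):
    """Type I error, power, BH false discovery proportion and top-`top` precision."""
    pvals = np.asarray(pvals, dtype=float)
    is_nonnull = np.asarray(is_nonnull, dtype=bool)
    m = len(pvals)

    # Type I error: fraction of true nulls rejected at level alpha
    type1 = np.mean(pvals[~is_nonnull] <= alpha)
    # Power: fraction of true nonnulls rejected at level alpha
    power = np.mean(pvals[is_nonnull] <= alpha)

    # Benjamini-Hochberg at the given FDR level
    order = np.argsort(pvals)
    thresh = fdr_level * np.arange(1, m + 1) / m
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    # False discovery proportion among the BH rejections
    fdp = np.sum(rejected & ~is_nonnull) / max(np.sum(rejected), 1)

    # Precision among the hypotheses with the `top` smallest p-values
    precision_top = np.mean(is_nonnull[order[:top]])

    return dict(type1=type1, power=power, fdp=fdp, precision_top=precision_top)
```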

Fig. 2.

Comparison of the performance of nine different approaches (from left to right): naive regression ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction (NC) in (3.7), negative control with asymptotic oracle variance (NC-ASY) in (3.8), RUV-4, robust regression [LEAPP(RR)], robust regression with calibration [LEAPP(RR-MAD)], LEAPP and oracle regression which observes the confounders (Oracle). The error bars are one standard deviation over 100 repeated simulations. The three dashed horizontal lines, from bottom to top, are the nominal significance level, the FDR level and the oracle power, respectively.

First, from Figure 2, we see that the oracle method attains exactly the specified type I error and FDP, while the naive method and SVA fail drastically. SVA performs better than the naive method in terms of the precision of the smallest 100 p-values, but is still much worse than the other methods. Next, for the negative control scenario, as we only have $|\mathcal{C}| = 30$ negative controls, ignoring the inflated variance term $\Delta_{\mathcal{S}}$ in Theorem 3.1 leads to overdispersed test statistics, which is why the type I error and FDP of both NC-ASY and RUV-4 are much larger than the nominal level. By contrast, the NC method correctly controls the type I error and FDP by accounting for the variance inflation, though as expected it loses some power compared with the oracle. For the sparsity scenario, the LEAPP(RR) method performs as the asymptotic theory predicts when n = 500, while for n = 100 the p-values appear somewhat too small. This is not surprising because the asymptotic oracle variance in Theorem 3.3 can be optimistic when the sample size is not sufficiently large, as discussed in Remark 3.2. On the other hand, the methods that use empirical calibration for the variance of the test statistics, namely the original LEAPP and LEAPP(RR-MAD), control both the FDP and type I error for data of small sample size in our simulations. The price of the finite sample calibration is that it tends to be slightly conservative, resulting in some loss of power.

In conclusion, the simulation results are consistent with our theoretical guarantees when p is as large as 5000 and n is as large as 500. When n is small, the variance of the test statistics in the sparsity scenario is larger than the asymptotic variance, and empirical calibration (such as MAD) can be used to adjust for the difference.

6.2. Real data examples.

In this section, we return to the three motivating real data examples in Section 1. The main goal here is to demonstrate a practical procedure for confounder adjustment and to show that our asymptotic results are reasonably accurate in real data. We also provide the necessary tools to carry out the procedure in an open-source R package, cate (available on CRAN).

6.2.1. The datasets.

First, we briefly describe the three datasets. The first dataset [55] aims to identify candidate genes associated with the extent of emphysema and can be downloaded from the GEO database (Series GSE22148). We preprocessed the data using the standard Robust Multi-array Average (RMA) approach [30]. The primary variable of interest is the severity (moderate or severe) of Chronic Obstructive Pulmonary Disease (COPD). The dataset also includes the age, gender, batch and date for the 143 sampled patients, which serve as nuisance covariates.

The second and third datasets are taken from [25], where they were used to compare RUV methods with other methods such as SVA and LEAPP. The original scientific studies are [61] and [10], respectively. The primary variable of interest is gender in both datasets, though the original objective in [10] was to identify genes associated with Alzheimer's disease. Gagnon-Bartsch, Jacob and Speed [25] switched the primary variable to gender in order to have a gold standard: the differentially expressed genes should mostly come from, or relate to, the X or Y chromosome. We follow their suggestion and use this standard to study the performance of our RR estimator. In addition, as the first COPD dataset also contains gender information for the samples, we use gender as the primary variable for the COPD data as a supplementary analysis.

Finally, we note that the second dataset contains repeated samples from the same individuals, but the individual labels are not available. We suspect that these individual effects act as strong latent factors that cause the atypical concentration of the histograms in Figure 1(b) and Figure 1(d). This suggests that a latent factor model is necessary for this dataset.

6.2.2. Confounder adjustment.

Recall that without the confounder adjustment, the distribution of the regression t-statistics in these datasets can be skewed, noncentered, underdispersed or overdispersed as shown in Figure 1. The adjustment method used here is the maximum likelihood factor analysis described in Section 3.1 followed by the robust regression (RR) method with Tukey’s bisquare loss described in Section 3.2.2. Since the true number of confounders is unknown, we increase r from 1 to n/2 and study the empirical performance. We report the results without empirical calibration for illustrative purposes, though in practice we suggest using calibration for better control of type I errors and FDP.
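A schematic sketch of this two-step adjustment (a hypothetical adjust_confounders helper) is given below. It uses scikit-learn's EM-based factor analysis in place of the quasi maximum likelihood routine of Section 3.1 and statsmodels' RLM with Tukey's bisquare for the robust regression, and it glosses over the exact scalings and the variance formula (3.13), so the resulting z-statistics should be read as approximate.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import FactorAnalysis

def adjust_confounders(Y, X, r):
    """Schematic two-step adjustment: factor analysis of residuals + robust regression.

    Y : (n, p) outcomes, X : (n,) centered single primary variable, r : number of factors.
    Returns approximate adjusted effect estimates and z-statistics.
    """
    n, p = Y.shape
    X = np.asarray(X, dtype=float)

    # Step 1: per-outcome OLS of Y on X (no intercept; X is assumed centered)
    xtx = X @ X
    beta_ols = Y.T @ X / xtx                      # (p,) naive effect estimates
    resid = Y - np.outer(X, beta_ols)

    # Step 2: factor analysis of the residuals gives loadings and noise variances
    fa = FactorAnalysis(n_components=r).fit(resid)
    Gamma_hat = fa.components_.T                  # (p, r) estimated loadings
    sigma2_hat = fa.noise_variance_               # (p,) estimated sigma_j^2

    # Step 3: robust regression of the scaled naive statistics on the loadings;
    # the robust fit absorbs the confounding component, the residuals carry the
    # (sparse) primary effects.
    t_naive = beta_ols * np.sqrt(xtx) / np.sqrt(sigma2_hat)
    G_scaled = Gamma_hat / np.sqrt(sigma2_hat)[:, None]
    rlm = sm.RLM(t_naive, G_scaled, M=sm.robust.norms.TukeyBiweight()).fit()
    z_adj = t_naive - G_scaled @ rlm.params       # approximate adjusted z-statistics
    beta_adj = z_adj * np.sqrt(sigma2_hat) / np.sqrt(xtx)
    return beta_adj, z_adj
```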

In Table 2 and Figure 3, we present the results after confounder adjustment for the three datasets. We report two groups of summary statistics in Table 2. The first group consists of summary statistics of all the z-statistics computed using (3.13), including the mean, median, standard deviation, median absolute deviation (scaled for consistency with the normal distribution), skewness and the medcouple. The medcouple [12] is a robust measure of skewness: after subtracting the median observation, some positive and some negative values remain; for any pair of values $x_1 \geq 0$ and $x_2 \leq 0$ with $x_1 + |x_2| > 0$, one can compute $(x_1 - |x_2|)/(x_1 + |x_2|)$, and the medcouple is the median of all those ratios. The second group contains performance metrics that evaluate the effectiveness of the confounder adjustment. See the caption of Table 2 for more detail.
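The first group of summary statistics can be computed from the vector of adjusted z-statistics with standard tools, for example as in the sketch below. This is an independent illustration rather than the code behind Table 2; the MAD is scaled by 1.4826 for consistency at the normal distribution, and the medcouple implementation from statsmodels is quadratic in the number of tests, so it can be slow for very large p.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import medcouple

def z_summary(z):
    """Summary statistics of adjusted z-statistics (cf. the first group in Table 2)."""
    z = np.asarray(z, dtype=float)
    mad = 1.4826 * np.median(np.abs(z - np.median(z)))  # consistent at the normal
    return {
        "mean": float(np.mean(z)),
        "median": float(np.median(z)),
        "sd": float(np.std(z, ddof=1)),
        "mad": float(mad),
        "skewness": float(stats.skew(z)),
        "medcouple": float(medcouple(z)),   # robust skewness measure of [12]
    }
```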

In all three datasets, the z-statistics become more centered at 0 and less skewed once we include a few confounders in the model. Though the standard deviation (SD) suggests an overdispersed variance, the overdispersion goes away once we add the MAD calibration, as the SD and MAD have similar values. The similarity between the SD and MAD values also indicates that the majority of statistics after confounder adjustment are approximately normally distributed. Note that the medcouple values shrink towards zero after adjustment, suggesting that the remaining skewness arises from only a small fraction of the genes, which is in accordance with our assumption that the primary effects are sparse.

In practice, some latent factors may be too weak to meet Assumption 3 (i.e., $d_h \asymp \sqrt{p}$), making it difficult to choose an appropriate r. A practical way to pick the number of confounders r in the presence of heteroscedastic noise, which we investigate here, is the bi-cross-validation (BCV) method of [45], which uses randomly held-out submatrices to estimate the mean squared error of reconstructing the factor loading matrix. It is shown in [45] that BCV outperforms many existing methods in recovering the latent signal matrix and the number of factors r, especially in high-dimensional settings (n, p → ∞). In Figure 3, we demonstrate the performance of BCV on these three datasets. The r selected by BCV is, respectively, 33, 25 and 11 [Figure 3(a), (c) and (e)], and all three choices result in the presumed shape of the z-statistics distribution [Figure 3(b), (d) and (f)]. For the second and third datasets, where we have a gold standard, the r selected by BCV has near-optimal performance in selecting genes on the X/Y chromosomes [columns 3 and 4 in Table 2(b) and (c)]. Another method we applied, proposed by [43] and based on the empirical distribution of eigenvalues, estimates r as 2, 9 and 3, respectively, for the three datasets. Table 3 of [25] reports the corresponding "top 100" values on the second and third datasets: 26 for LEAPP, 28 for RUV-4 and 27 for SVA in the second dataset, and 27 for LEAPP, 31 for RUV-4 and 26 for SVA in the third. Notice that the precision of the top 100 significant genes is relatively stable once r exceeds a certain number. Intuitively, the factor analysis is applied to the residuals of Y on X, and the overestimated factors have very small eigenvalues, so they usually do not change $\hat{\beta}$ much. See also [25] for more discussion on the robustness of the negative control estimator to overestimating r.
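To convey the flavor of BCV, the sketch below implements the basic held-out-block idea, reconstructing a held-out corner of the residual matrix from a truncated SVD of the remaining block. The actual method of [45] additionally accounts for heteroscedastic noise and differs in detail, so this should be read only as a simplified illustration under those assumptions.

```python
import numpy as np

def bcv_choose_r(resid, r_max, n_splits=10, hold_frac=0.5, seed=0):
    """Simplified bi-cross-validation for the number of factors.

    resid : (n, p) residual matrix; r_max : largest candidate rank.
    Returns the candidate r with the smallest average held-out reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, p = resid.shape
    errors = np.zeros(r_max + 1)
    for _ in range(n_splits):
        rows, cols = rng.permutation(n), rng.permutation(p)
        n0, p0 = int(n * hold_frac), int(p * hold_frac)
        A = resid[np.ix_(rows[:n0], cols[:p0])]   # held-out block
        B = resid[np.ix_(rows[:n0], cols[p0:])]
        C = resid[np.ix_(rows[n0:], cols[:p0])]
        D = resid[np.ix_(rows[n0:], cols[p0:])]
        U, s, Vt = np.linalg.svd(D, full_matrices=False)
        for k in range(r_max + 1):
            if k == 0:
                A_hat = np.zeros_like(A)
            else:
                # rank-k pseudo-inverse of D, then predict the held-out block
                D_pinv_k = (Vt[:k].T / s[:k]) @ U[:, :k].T
                A_hat = B @ D_pinv_k @ C
            errors[k] += np.sum((A - A_hat) ** 2)
    return int(np.argmin(errors))
```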

Lastly, we point out that both the small sample sizes of the datasets and the presence of weak factors can result in an overdispersed variance of the test statistics. The BCV plots indicate the presence of many weak factors in the first two datasets. In the third dataset, the sample size n is only 31, so the adjustment result is not ideal. Nevertheless, the empirical performance (e.g., the number of X/Y genes in the top 100) suggests it is still beneficial to adjust for the confounders.


Acknowledgments.

The authors thank Bhaswar Bhattacharya, Murat Erdogdu, Jian Li, Weijie Su and Yunting Sun for helpful discussion.

J. Wang and Q. Zhao contributed equally to this paper. This paper was completed when Jingshu Wang and Qingyuan Zhao were Ph.D. candidates at Stanford University.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Confounder adjustment in multiple hypothesis testing” (DOI: 10.1214/16-AOS1511SUPP; .pdf). We provide detailed proof for the theoretical results in this paper and some additional numerical results.

Contributor Information

Jingshu Wang, Email: jingshuw@wharton.upenn.edu.

Qingyuan Zhao, Email: qyzhao@wharton.upenn.edu.

Trevor Hastie, Email: hastie@stanford.edu.

Art B. Owen, Email: owen@stanford.edu.

REFERENCES

[1] Alter O, Brown PO and Botstein D (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97 10101–10106.
[2] Anderson TW and Rubin H (1956). Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, Vol. V 111–150. Univ. California Press, Berkeley and Los Angeles. MR0084943
[3] Bai J and Li K (2012). Statistical analysis of factor models of high dimension. Ann. Statist. 40 436–465. MR3014313
[4] Bai J and Li K (2014). Theory and methods of panel data models with interactive effects. Ann. Statist. 42 142–170. MR3178459
[5] Bai J and Li K (2016). Maximum likelihood estimation and inference for approximate factor models of high dimension. Rev. Econ. Stat. 98 298–309.
[6] Bai J and Ng S (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221. MR1926259
[7] Bai J and Ng S (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74 1133–1150. MR2238213
[8] Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
[9] Benjamini Y and Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188. MR1869245
[10] Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR and Landfield PW (2004). Incipient Alzheimer's disease: Microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc. Natl. Acad. Sci. USA 101 2173–2178.
[11] Bollen KA (1989). Structural Equations with Latent Variables. Wiley, New York. MR0996025
[12] Brys G, Hubert M and Struyf A (2004). A robust measure of skewness. J. Comput. Graph. Statist. 13 996–1017. MR2109062
[13] Chandrasekaran V, Parrilo PA and Willsky AS (2012). Latent variable graphical model selection via convex optimization. Ann. Statist. 40 1935–1967. MR3059067
[14] Clarke S and Hall P (2009). Robustness of multiple testing procedures against dependence. Ann. Statist. 37 332–358. MR2488354
[15] Craig A, Cloarec O, Holmes E, Nicholson JK and Lindon JC (2006). Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Anal. Chem. 78 2262–2267.
[16] Desai KH and Storey JD (2012). Cross-dimensional inference of dependent high-dimensional data. J. Amer. Statist. Assoc. 107 135–151. MR2949347
[17] De La Fuente A, Bing N, Hoeschele I and Mendes P (2004). Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20 3565–3574.
[18] Efron B (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103. MR2293302
[19] Efron B (2010). Correlated z-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. 105 1042–1055. MR2752597
[20] Fan J and Han X (2013). Estimation of false discovery proportion with unknown dependence. Available at arXiv:1305.7007.
[21] Fan J, Han X and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019–1035. MR3010887
[22] Fare TL, Coffey EM, Dai H, He YD, Kessler DA, Kilian KA, Koch JE, Leproust E, Marton MJ, Meyer MR et al. (2003). Effects of atmospheric ozone on microarray data quality. Anal. Chem. 75 4672–4675.
[23] Fisher RA (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.
[24] Friguet C, Kloareg M and Causeur D (2009). A factor model approach to multiple testing under dependence. J. Amer. Statist. Assoc. 104 1406–1415. MR2750571
[25] Gagnon-Bartsch J, Jacob L and Speed TP (2013). Removing unwanted variation from high dimensional data with negative controls. Technical Report 820, Dept. Statistics, Univ. California, Berkeley, Berkeley, CA.
[26] Gagnon-Bartsch JA and Speed TP (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13 539–552.
[27] Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D and Brown PO (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11 4241–4257.
[28] Greenland S, Robins JM and Pearl J (1999). Confounding and collapsibility in causal inference. Statist. Sci. 14 29–46.
[29] Grzebyk M, Wild P and Chouanière D (2004). On identification of multi-factor models with correlated residuals. Biometrika 91 141–151. MR2050465
[30] Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4 249–264.
[31] Jin J (2012). Comment: "Estimating false discovery proportion under arbitrary covariance dependence." J. Amer. Statist. Assoc. 107 1042–1045. MR3010891
[32] Kish L (1959). Some statistical problems in research design. Am. Sociol. Rev. 24 328–338.
[33] Korn EL, Troendle JF, McShane LM and Simon R (2004). Controlling the number of false discoveries: Application to high-dimensional genomic data. J. Statist. Plann. Inference 124 379–398. MR2080371
[34] Kuroki M and Pearl J (2014). Measurement bias and effect restoration in causal inference. Biometrika 101 423–437. MR3215357
[35] Lan W and Du L (2014). A factor-adjusted multiple testing procedure with application to mutual fund selection. Available at arXiv:1407.5515.
[36] Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, Weiss-Solís DY, Duque R, Bersini H and Nowé A (2013). Batch effect removal methods for microarray gene expression data integration: A survey. Brief. Bioinform. 14 469–490.
[37] Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K and Irizarry RA (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11 733–739.
[38] Leek JT and Storey JD (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3 1724–1735.
[39] Leek JT and Storey JD (2008). A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. USA 105 18718–18723.
[40] Li J and Zhong P-S (2016). A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence. Ann. Statist. To appear.
[41] Lin DW, Coleman IM, Hawley S, Huang CY, Dumpit R, Gifford D, Kezele P, Hung H, Knudsen BS, Kristal AR et al. (2006). Influence of surgical manipulation on prostate gene expression: Implications for molecular correlates of treatment effects and disease prognosis. J. Clin. Oncol. 24 3763–3770.
[42] Maronna RA, Martin RD and Yohai VJ (2006). Robust Statistics: Theory and Methods. Wiley, Chichester. MR2238141
[43] Onatski A (2010). Determining the number of factors from empirical distribution of eigenvalues. Rev. Econ. Stat. 92 1004–1016.
[44] Owen AB (2005). Variance of the number of false discoveries. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 411–426. MR2155346
[45] Owen AB and Wang J (2016). Bi-cross-validation for factor analysis. Statist. Sci. 31 119–139. MR3458596
[46] Pearl J (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, Cambridge. MR2548166
[47] Perry PO and Pillai NS (2013). Degrees of freedom for combining regression with factor analysis. Preprint, available at arXiv:1310.7269.
[48] Pesaran MH (2004). General diagnostic tests for cross section dependence in panels. Cambridge Working Papers in Economics No. 0435.
[49] Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA and Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38 904–909.
[50] Ransohoff DF (2005). Bias as a threat to the validity of cancer molecular-marker research. Nat. Rev. Cancer 5 142–149.
[51] Rhodes DR and Chinnaiyan AM (2005). Integrative analysis of the cancer transcriptome. Nat. Genet. 37 S31–S37.
[52] Schwartzman A (2010). Comment: "Correlated z-values and the accuracy of large-scale statistical estimates." J. Amer. Statist. Assoc. 105 1059–1063. MR2752600
[53] Schwartzman A, Dougherty RF and Taylor JE (2008). False discovery rate analysis of brain diffusion direction maps. Ann. Appl. Stat. 2 153–175. MR2415598
[54] She Y and Owen AB (2011). Outlier detection using nonconvex penalized regression. J. Amer. Statist. Assoc. 106 626–639. MR2847975
[55] Singh D, Fox SM, Tal-Singer R, Plumb J, Bates S, Broad P, Riley JH and Celli B (2011). Induced sputum genes associated with spirometric and radiological disease severity in COPD ex-smokers. Thorax 66 489–495.
[56] Storey JD, Taylor JE and Siegmund D (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 187–205. MR2035766
[57] Sun W and Cai TT (2009). Large-scale multiple testing under dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 393–424. MR2649603
[58] Sun Y (2011). On latent systemic effects in multiple hypotheses. Ph.D. thesis, Stanford University.
[59] Sun Y, Zhang NR and Owen AB (2012). Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6 1664–1688. MR3058679
[60] Tusher VG, Tibshirani R and Chu G (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98 5116–5121.
[61] Vawter MP, Evans S, Choudary P, Tomita H, Meador-Woodruff J, Molnar M, Li J, Lopez JF, Myers R, Cox D et al. (2004). Gender-specific gene expression in post-mortem human brain: Localization to sex chromosomes. Neuropsychopharmacology 29 373–384.
[62] Wang J, Zhao Q, Hastie T and Owen AB (2017). Supplement to "Confounder adjustment in multiple hypothesis testing." DOI: 10.1214/16-AOS1511SUPP.
[63] Wang S, Cui G and Li K (2015). Factor-augmented regression models with structural change. Econom. Lett. 130 124–127. MR3336182
[64] Yohai VJ (1987). High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15 642–656. MR0888431
