Simultaneous dimension reduction and adjustment for confounding variation

Zhixiang Lin; Can Yang; Ying Zhu; John Duchi; Yao Fu; Yong Wang; Bai Jiang; Mahdi Zamanighomi; Xuming Xu; Mingfeng Li; Nenad Sestan; Hongyu Zhao; Wing Hung Wong

doi:10.1073/pnas.1617317113

2016 Dec 7;113(51):14662–14667. doi: 10.1073/pnas.1617317113

PMCID: PMC5187682 PMID: 27930330

Significance

With the advancement in high-throughput technologies, analyzing high-dimensional data has become a common task. Dimension reduction methods have been applied to visualize and identify dominant patterns in high-dimensional data. Confounding factors, commonly observed in high-throughput biological experiments, can affect the performance of these methods, and other downstream analysis. Here, we develop a method by coupling dimension reduction with the adjustment for confounder effects. Our method is able to capture the underlying patterns, as demonstrated by a human brain exon array dataset, a model organism ENCODE RNA sequencing dataset, and simulations.

Keywords: dimension reduction, confounding variation, transcriptome

Abstract

Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.

Dimension reduction methods, such as multidimensional scaling (MDS) and principal component analysis (PCA), are commonly applied in high-throughput biological datasets to visualize data in a low-dimensional space, identify dominant patterns, and extract relevant features (1–6). MDS aims to place each sample in a lower-dimensional space such that the between-sample distances are preserved as much as possible (7). PCA seeks the linear combinations of the original variables such that the derived variables capture maximal variance (8). One advantage of PCA is that the principal components (PCs) are more interpretable by checking the loadings of the variables.

Confounding factors, either biological or technical in origin, are commonly observed in high-throughput biological experiments. Various methods have been proposed to estimate the confounding variation, for example, regression models on known confounding factors (9) and factor models and surrogate vector analysis for unobserved confounding factors (10–15). However, limited work has been done in the context of dimension reduction. Confounding variation can affect PC-based visualization of the data points because it may obscure the desired biological variation, and it can also affect the loading of the variables in the PCs.

Here we extend PCA to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We introduce a class of penalty functions in PCA, which encourages the PCs to be invariant to the confounding variation. We demonstrate the performance of AC-PCA through its application to a human brain development exon array dataset (4), a model organism ENCODE (modENCODE) RNA sequencing (RNA-Seq) dataset (16, 17), and simulated data. We also implemented AC-PCA with sparsity constraints to enable variable/gene selection and better interpretation of the PCs.

Results

AC-PCA in a General Form.

Let $X$ denote the $N \times p$ data matrix, where $N$ is the number of observations and $p$ is the number of variables/genes. $X$ is centered by column. Let $x_{(i)}$ denote the $i$th observation. Let $v$ denote a $p$-dimensional vector and $t_{i} = x_{(i)} \cdot v$ denote the projection induced by $v$. The quantity $\sum_{i = 1}^{N} t_{i}^{2} = \sum_{i = 1}^{N} {(x_{(i)} \cdot v)}^{2} = v^{T} X^{T} X v$ is proportional to the total variation after the projection, and classical PCA seeks the $v$ that maximizes it. The dimension extracted this way can be misleading if there is confounding variation (Results, A Motivating Example). Let $Y$ denote an $N \times l$ confounder matrix, representing $l$ confounders and centered by column, and let $K = Y Y^{T}$. We choose $Y$ so that $v^{T} X^{T} K X v$ represents the confounding variation in $t$. Because we are not interested in the subspace exhibiting the confounding variation, this suggests the following modification of PCA:

$$\begin{aligned} &\underset{v \in \mathbb{R}^{p}}{\text{maximize}} && v^{T} X^{T} X v - λ v^{T} X^{T} K X v \\ &\text{subject to} && {\| v \|}_{2}^{2} \leq 1, \end{aligned} \tag{1}$$

where the tuning parameter $λ \geq 0$ controls the strength of regularization. If $λ = 0$, this is classical PCA; when $λ$ is large enough, the projection $X v$ is restricted to be orthogonal to the columns of $Y$. Let $Z = X^{T} X - λ X^{T} K X$; problem 1 can then be solved directly by eigendecomposition of $Z$. The confounder matrix $Y$ is user-defined, depending on the data structure and the assumptions on the confounding variation. We provide several examples below on how to choose $Y$. Methods for choosing $λ$ and for assessing the statistical significance of the extracted dimensions are presented in Materials and Methods.
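As a minimal sketch of problem 1 (not the authors' R or MATLAB implementation), the leading directions can be read off the eigendecomposition of $Z$. The function name `ac_pca`, the toy data, and the choice of NumPy are assumptions made for illustration only.

```python
import numpy as np

def ac_pca(X, Y, lam, n_components=2):
    """Leading eigenpairs of Z = X^T X - lam * X^T K X, with K = Y Y^T."""
    Xc = X - X.mean(axis=0)               # center columns of X
    Yc = Y - Y.mean(axis=0)               # center columns of Y
    XtY = Xc.T @ Yc                       # p x l; X^T K X = (X^T Y)(X^T Y)^T
    Z = Xc.T @ Xc - lam * (XtY @ XtY.T)
    evals, evecs = np.linalg.eigh(Z)      # Z is symmetric; eigenvalues ascending
    top = np.argsort(evals)[::-1][:n_components]
    return evals[top], evecs[:, top]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))              # toy data: N = 30, p = 8
Y = rng.normal(size=(30, 2))              # toy confounders: l = 2

evals_pca, V_pca = ac_pca(X, Y, lam=0.0)  # lam = 0 reduces to classical PCA
evals_big, V_big = ac_pca(X, Y, lam=1e6)  # large lam: X v nearly orthogonal to Y
```

Forming $X^{T} Y$ first avoids building the $N \times N$ matrix $K$ explicitly, which is a convenience of the linear kernel and would not carry over to other kernels.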

In PCA, the loadings for the variables are typically nonzero. For better interpretation of the PCs, a sparse solution for $v$ can be achieved by adding an $ℓ_{1}$ constraint:

$$\begin{aligned} &\underset{v \in \mathbb{R}^{p}}{\text{maximize}} && v^{T} X^{T} X v \\ &\text{subject to} && v^{T} X^{T} K X v \leq c_{1}, \quad {\| v \|}_{1} \leq c_{2}, \quad {\| v \|}_{2}^{2} \leq 1, \end{aligned} \tag{2}$$

where $c_{1}$ is a constant depending on $λ$ and $c_{2}$ is the sparsity parameter.

A Motivating Example.

To motivate the problem, we first conducted PCA on a subset of samples from the human brain exon array dataset (4) (Fig. 1). The samples come from 10 brain regions in six donors and each donor tends to form a cluster. There are no clear patterns among the regions in the first 20 PCs (SI Appendix, Fig. S1). The variation of individual donors makes it challenging to extract the variation across brain regions. This dataset will be discussed in more detail later.

For simplicity, we assume that there are no missing data. Let $X^{(i)}$ represent the $b \times p$ matrix for the gene expression levels of donor $i$ , where $b$ is the number of brain regions and $p$ is the number of genes. By stacking the rows of $X^{(1)}, \dots, X^{(n)}$ , we obtain the $N \times p$ data matrix $X$ , where $N = n \times b$ . We propose the following objective function to adjust for the variation of donors:

$$\begin{aligned} &\underset{v \in \mathbb{R}^{p}}{\text{maximize}} && v^{T} X^{T} X v - \frac{λ}{n} \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} v^{T} {(X^{(j)} - X^{(i)})}^{T} (X^{(j)} - X^{(i)}) v \\ &\text{subject to} && {\| v \|}_{2}^{2} \leq 1 . \end{aligned} \tag{3}$$

Formula 3 is a special case of formula 1 (SI Appendix). The penalty term in formula 3 encourages the projection of the same region across donors to be similar. There are missing samples in the brain data and formula 3 can be modified to handle missing values (SI Appendix). More implementation details are discussed in Materials and Methods. The penalty can artificially induce the appearance of clusters. We provide guidelines on how to detect overcorrection in SI Appendix.
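One way to see that formula 3 fits the general form, offered here as an illustrative sketch rather than the construction used in SI Appendix: with complete data stacked donor by donor, the pairwise penalty equals $X^{T} K X$ for $K = C_{n} \otimes I_{b}$, where $C_{n} = I_{n} - \frac{1}{n} 1 1^{T}$ is the centering matrix. Because this $K$ is symmetric and idempotent, $Y = K$ gives $K = Y Y^{T}$ with column-centered $Y$. The helper names below are hypothetical.

```python
import numpy as np

def donor_penalty_explicit(X_blocks):
    """(1/n) * sum over donor pairs i<j of (X^(j) - X^(i))^T (X^(j) - X^(i))."""
    n = len(X_blocks)
    p = X_blocks[0].shape[1]
    P = np.zeros((p, p))
    for i in range(n - 1):
        for j in range(i + 1, n):
            D = X_blocks[j] - X_blocks[i]
            P += D.T @ D
    return P / n

def donor_kernel(n, b):
    """K = C_n kron I_b for rows ordered donor by donor, regions in a fixed order."""
    C = np.eye(n) - np.ones((n, n)) / n
    return np.kron(C, np.eye(b))

rng = np.random.default_rng(1)
n, b, p = 4, 3, 5                                  # toy sizes
X_blocks = [rng.normal(size=(b, p)) for _ in range(n)]
X = np.vstack(X_blocks)                            # N = n * b rows

K = donor_kernel(n, b)
P_explicit = donor_penalty_explicit(X_blocks)
P_kernel = X.T @ K @ X                             # same matrix, general form
```

The equivalence relies on every donor having every region in the same row order; it does not cover the missing-sample case discussed above.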

The Analysis of Variance Interpretation.

Assume that the observations can be divided into $K$ groups (here $K$ denotes the number of groups, not the kernel matrix in formula 1). Let $t_{j k}$ denote the projection (induced by $v$) for the $j$th observation in group $k$, $j = 1, \dots, n_{k}$. Let ${\bar{t}}_{\cdot k}$ and ${\bar{t}}_{\cdot \cdot}$ denote the group mean and grand mean, respectively. The total sum of squares is $SS_{T} = \sum_{k = 1}^{K} \sum_{j = 1}^{n_{k}} {(t_{j k} - {\bar{t}}_{\cdot \cdot})}^{2}$, the between-groups sum of squares is $SS_{B} = \sum_{k = 1}^{K} n_{k} {({\bar{t}}_{\cdot k} - {\bar{t}}_{\cdot \cdot})}^{2}$, and the remaining sum of squares is denoted by $SS_{R}$. We have $SS_{T} = SS_{B} + SS_{R}$. Consider the samples for a region $r$ in the brain data. Let the donor labels represent the groups, so we have $K = n$ and $n_{k} = 1$, $\forall k$, because there are no replicates for a donor. Dropping the index $j$, $SS_{B}^{(r)} = \sum_{k = 1}^{n} {(t_{r k} - {\bar{t}}_{r, \cdot})}^{2}$. It can be shown that $n \, SS_{B}^{(r)} = \sum_{k = 1}^{n - 1} \sum_{l = k + 1}^{n} {(t_{r k} - t_{r l})}^{2}$. For samples in all of the regions, $SS_{T} = \sum_{r = 1}^{b} \sum_{k = 1}^{n} {(t_{r k} - {\bar{t}}_{\cdot \cdot})}^{2} = \sum_{r = 1}^{b} \sum_{k = 1}^{n} {(t_{r k})}^{2}$ because $X$ is centered. Let $SS_{B}^{*} \equiv \sum_{r = 1}^{b} SS_{B}^{(r)}$. The objective function in formula 3 can be rewritten as $SS (λ) = SS_{T} - λ SS_{B}^{*}$, so it penalizes the donor-to-donor variation in the projected data. In general, the $Y$ matrix in formula 1 can be designed such that the penalty term represents the between-groups sum of squares, when the between-groups variation is undesirable.
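The pairwise identity stated above can be checked in a few lines. The following is a short worked derivation for a fixed region $r$, writing $t_{k}$ for $t_{r k}$ and $\bar{t}$ for ${\bar{t}}_{r, \cdot}$; it uses only the definitions in this section.

$$\begin{aligned} \sum_{k = 1}^{n - 1} \sum_{l = k + 1}^{n} {(t_{k} - t_{l})}^{2} &= \frac{1}{2} \sum_{k = 1}^{n} \sum_{l = 1}^{n} {\left[ (t_{k} - \bar{t}) - (t_{l} - \bar{t}) \right]}^{2} \\ &= \frac{1}{2} \left[ n \sum_{k = 1}^{n} {(t_{k} - \bar{t})}^{2} + n \sum_{l = 1}^{n} {(t_{l} - \bar{t})}^{2} - 2 \sum_{k = 1}^{n} (t_{k} - \bar{t}) \sum_{l = 1}^{n} (t_{l} - \bar{t}) \right] \\ &= n \sum_{k = 1}^{n} {(t_{k} - \bar{t})}^{2} = n \, SS_{B}^{(r)} . \end{aligned}$$

The cross term vanishes because deviations from the mean sum to zero. Summing over regions and dividing by $n$ shows that the penalty in formula 3 equals $λ SS_{B}^{*}$.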

Simulations.

We evaluated AC-PCA in simulations mimicking the brain data, with $b = 10$, $p = 400$, and $n = 5$. We assumed that $X^{(i)} = Ω + Γ^{(i)} + ϵ^{(i)}$, where $Ω$ is the low-rank component shared among donors, $Γ^{(i)}$ is the donor-specific component, and $ϵ^{(i)}$ is Gaussian noise. The donor-specific components $Γ^{(i)}$ are unknown, and our goal is to capture the shared component $Ω$. Consider performing PCA on the pooled samples from multiple donors. When $Γ^{(i)} = 0$ for all donors, the first several PCs can capture $Ω$. When $Γ^{(i)} \neq 0$, the PCs can be affected by the donor-specific components, as we have seen in the brain data. We further assumed that $Γ^{(i)} = Λ_{1}^{(i)} + Λ_{2}^{(i)}$: The donor effect is the same in all regions within a donor in $Λ_{1}^{(i)}$, whereas it differs across regions in $Λ_{2}^{(i)}$. $Λ_{1}^{(i)}$ causes samples from the same donor to cluster, whereas $Λ_{2}^{(i)}$ allows for more complicated donor effects, for which we considered two settings (Materials and Methods): (i) only a subset of regions is affected (three random regions in each donor) and (ii) the latent structure in $Λ_{2}^{(i)}$ is correlated with that in $Ω$. The results for one representative run and for 100 runs are shown in Fig. 2 and SI Appendix, Fig. S2. We compared AC-PCA with ComBat (9) and SVA (10, 14), where PCA was implemented after removing the confounder effects (SI Appendix). Compared with ComBat and SVA, both the projected data (“PC”) and the loading of genes (“PC loading”) in AC-PCA tend to be more correlated with those in the shared component $Ω$. To see how the other methods may fail, we calculated the correlations of the first two PCs with PC1 in $Λ_{1}$ and $Λ_{2}$ (SI Appendix, Figs. S3 and S4). ComBat adjusts for $Λ_{1}$ well but cannot adjust for $Λ_{2}$. This is expected because ComBat assumes that the donor effect is similar in all regions within a donor when donor labels are used for the adjustment. Compared with ComBat, SVA adjusts for $Λ_{2}$ better but not as well for $Λ_{1}$. AC-PCA adjusts for both $Λ_{1}$ and $Λ_{2}$ well. Simulations for $Γ^{(i)} = Λ_{1}^{(i)}$ or $Λ_{2}^{(i)}$ alone and other settings are provided in SI Appendix.

Fig. 2. — Comparison of AC-PCA with PCA, ComBat (9), and SVA (10, 14) on simulated data. (A and C) Settings 1 and 2, one representative run. Each color represents a donor. (B) Setting 1, correlation with PCA on the shared component $Ω$ . The correlations of the first two PCs in $Ω$ with the matched PCs in the three methods were calculated, and the distribution for 100 runs is shown. We calculated the Pearson’s correlation for the projected data (“PC”) and the Spearman’s rank correlation for the loading of genes (“PC loading”). The dot in the violin plot indicates the median of the distribution. (D) Setting 1, distribution of the distance between the left-out sample and the retained samples, the first two PCs.

In addition to identifying dominant patterns in the data, PCA has been used to detect abnormal samples, potentially caused by mislabeling. To see whether AC-PCA can detect a mislabeled sample, we performed leave-one-out cross-validation (CV): (i) We performed AC-PCA on the retained samples and used the eigenvectors to calculate the projection for the left-out sample, (ii) we calculated the distance of the left-out sample with each of the retained samples, and (iii) we iterated i and ii through all of the samples. In the first two PCs, the left-out sample tends to be closer to the retained samples with the same region label (Fig. 2D and SI Appendix, Figs. S5 and S6). The two distributions, same region label vs. different region labels, are well separated in the first two PCs, especially for PC1. For a left-out sample, by comparing its distances to samples that have the same region label vs. different region labels in the retained set, we are likely to identify whether it is mislabeled.
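Steps i to iii can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the penalty is built from region labels alone (each retained sample is compared with its region mean, which handles the left-out sample without special cases), distances are Euclidean over the first `n_components` PCs, and all function names and the toy data are hypothetical.

```python
import numpy as np

def region_kernel(regions):
    """K with v^T X^T K X v = sum of squared deviations from each region's mean."""
    regions = np.asarray(regions)
    same = (regions[:, None] == regions[None, :]).astype(float)
    return np.eye(len(regions)) - same / same.sum(axis=1, keepdims=True)

def ac_pca_loadings(X, K, lam, n_components=2):
    """Top eigenvectors of X^T X - lam * X^T K X (X already centered)."""
    Z = X.T @ X - lam * (X.T @ K @ X)
    _, evecs = np.linalg.eigh(Z)                 # eigenvalues ascending
    return evecs[:, ::-1][:, :n_components]

def loo_distances(X, regions, lam, n_components=2):
    """Distances from each left-out sample to retained samples, split by label match."""
    regions = np.asarray(regions)
    same_d, diff_d = [], []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        mean = X[keep].mean(axis=0)              # center using retained samples only
        Xk = X[keep] - mean
        V = ac_pca_loadings(Xk, region_kernel(regions[keep]), lam, n_components)
        d = np.linalg.norm(Xk @ V - (X[i] - mean) @ V, axis=1)
        same = regions[keep] == regions[i]
        same_d.extend(d[same])
        diff_d.extend(d[~same])
    return np.array(same_d), np.array(diff_d)

rng = np.random.default_rng(0)
n, b, p = 4, 5, 30                               # toy sizes, not the paper's
shared = 3 * np.outer(np.linspace(-1, 1, b), rng.normal(size=p))
X = np.vstack([shared + rng.normal(size=(1, p)) + 0.3 * rng.normal(size=(b, p))
               for _ in range(n)])               # shared signal + donor shift + noise
regions = np.tile(np.arange(b), n)
same_d, diff_d = loo_distances(X, regions, lam=5.0)
```

Comparing the two returned distributions (for example with histograms) mirrors the same-label versus different-label comparison described above.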

Application to the Human Brain Exon Array Data.

The human brain exon array dataset (4) includes the transcriptomes of 16 brain regions comprising 11 areas of the neocortex and 5 other regions. In the analysis, we used samples from $10$ regions in the neocortex. Primary visual cortex (V1C) was excluded from the analysis because the distinct nature of this area relative to other neocortical regions tended to compress the other 10 regions into a single cluster. We sorted the donors by age and defined nine time windows by grouping samples from every six donors (SI Appendix, Table S1). Samples within a time window are relatively homogeneous in time, except for window 4, in which the donor’s effect is likely driven by age (SI Appendix, Fig. S7). Samples in window 5 were used for demonstration in Fig. 1. In window 5, when we applied formula 3 to adjust for the confounding effects from individual donors, samples from the same neocortical region tended to cluster together, and we were able to recover the anatomical structure of neocortex (Fig. 3 A and B). ComBat and SVA were able to remove some donors’ effect, because samples no longer cluster by donors after the adjustment. However, no clear interregional patterns were identified by the two methods. One limitation of our method is that the age effect is not distinguished from the donor’s effect. It is challenging to distinguish between the two effects because the donor labels are highly confounded with age (SI Appendix, Fig. S7).

We compared the eigenvalues in the brain data versus permutation by shuffling the region labels in each donor (Fig. 3C). The first three PCs are likely to be significant. When we shuffled all samples across donors, the trend was similar (SI Appendix, Fig. S8). We also compared the variance explained by the PCs in the brain data versus permutation, and the trend was similar (SI Appendix, Fig. S9). Parallel evidence for the significance of the PCs is obtained through CV. In addition to the leave-one-out CV presented in the simulation section, we considered leave-one-donor-out CV, where all samples within a donor are left out. We iterated through all donors and calculated the distance between all pairs of samples: one from the left-out donor and the other from the retained samples. The CV result is consistent with the eigenvalue and variance results, because the left-out samples tend to be closer to the retained samples with the same region label in the first three PCs (SI Appendix, Figs. S10 and S11). When we combined the first three PCs, the Euclidean distances for pairs of samples with the same region label versus different region labels tended to be more separated, compared with those for each PC individually (Fig. 3D and SI Appendix, Figs. S10–S12). In fact, if we predict the region label of a test sample by assigning it to the $k$ closest region clusters in the training set based on the first three PCs, the prediction accuracies are 32% ($k = 1$), 65% ($k = 2$), and 80% ($k = 3$) for leave one donor out and 43% ($k = 1$), 67% ($k = 2$), and 85% ($k = 3$) for leave one sample out, whereas the expected accuracies for a random guess are 10% ($k = 1$), 20% ($k = 2$), and 30% ($k = 3$). Because the confounding effect of the left-out donor cannot induce any bias in the clustering of regions, this result shows that the penalty in AC-PCA enables the learning of dimensions with significantly reduced confounder influence. In simulation settings 1 and 2, the three criteria (eigenvalue, variance, and CV) give consistent estimates for the number of significant PCs, which equals the number of latent factors in $Ω$ (SI Appendix, Figs. S5 and S6).

The visualization results for the other windows are shown in SI Appendix, Figs. S13 and S14. In summary, all methods performed reasonably well in windows 1 to 3; in windows 4 to 9, no clear interregional patterns were identified in ComBat and SVA, whereas the patterns identified by AC-PCA tend to agree with the regions’ physical locations. When PCA is performed separately for each donor and hemisphere, the gross structure tends to be consistent within a time window, but the pattern is distorted by a high level of noise (SI Appendix, Fig. S15). Next, we explored the temporal dynamics of the PCs in the brain data (SI Appendix, Fig. S16). The pattern is similar from windows $1$ to $5$ , with PC1 representing the frontal-to-temporal gradient, which follows the contour of developing cortex (5), and PC2 representing the dorsal-to-ventral gradient. Starting from window $6$ , these two components reversed order. The interregional variation explained by the first two PCs decreases close to birth (window $4$ ) and then increases in later time windows (SI Appendix, Fig. S17), similar to the “hourglass” pattern previously reported based on cross-species comparison and differential expression (19, 20).

We then implemented AC-PCA with sparsity constraints to select genes associated with the PCs. The number of genes with nonzero loadings is shown in Fig. 3E, along with the interregional variation explained in the regular PCs. Interestingly, the trends tend to be consistent: When the regular PC explains more variation, more genes are selected in the corresponding sparse PC. To produce more stringent and comparable gene lists, we chose the sparsity parameter such that $200$ genes are selected in each window. The overlap of gene lists across windows is moderate (SI Appendix, Fig. S18) and, as expected, the overlap with the first window decreases over time. The overlap between adjacent windows tends to be larger in later time windows, indicating that interregional differences become stable. Genes with the largest loadings demonstrate interesting spatial patterns (Fig. 3F and SI Appendix, Fig. S19). In windows 1 and 3, the top genes in PC1 follow the frontal-to-temporal gradient, whereas in PC2 they tend to follow the dorsal-to-ventral gradient. A brief overview of the functions of these genes is shown in SI Appendix, Table S2. We compared the gene-level results between SVA and AC-PCA (SI Appendix, Fig. S19). In the later time windows, especially windows 5, 6, and 7, AC-PCA tends to select genes with larger interregional variation.

Finally, we demonstrate the functional conservation of the $200$ genes selected in PC1 and PC2. These genes tend to have low dN/dS scores for human versus macaque comparison, even lower than the complete list of all essential genes (Fig. 3G). In the human versus mouse comparison, we observed a similar trend (SI Appendix, Fig. S20). Parallel to the cross-species conservation, we also observed that these genes tend to have low heterozygosity scores, a measure of functional conservation in human (SI Appendix, Fig. S21).

Application to the modENCODE RNA-Seq Data.

The modENCODE project generates the transcriptional landscapes for model organisms during development (16, 17). In the analysis, we used the time-course RNA-Seq data for fly and worm embryonic development. For the fly data, samples were taken in 12 time windows during embryonic development: 0 to 2, 2 to 4, $\dots$ , 22 to 24 h; for the worm data, samples were taken every 30 min during embryonic development: $0, 0.5, \dots, 4, 5, \dots, 12$ h, where the sample from 4.5 h is missing, resulting in $24$ samples.

We first conducted PCA on fly and worm separately, as shown in Fig. 4A. Although the temporal patterns share some similarity, the projections for fly and worm are different. The genes with top loadings in fly have different temporal dynamics in worm, especially for PC2 (Fig. 4B). We also conducted PCA on fly and worm jointly, which reveals the variations of different species (Fig. 4C).

Let $X^{(f)}$ and $X^{(w)}$ represent the data matrices for fly and worm, respectively. Let $X_{t}^{(f)}$ denote the fly data for time window $t$ and $X_{t}^{(w)}$ the worm data for time point $t$. $X_{10}^{(w)}$ is missing. Let $X$ represent the data matrix for both species, obtained by stacking the rows of $X^{(f)}$ and $X^{(w)}$. We propose the following objective function to adjust for the variation of species:

$$\begin{aligned} &\underset{v \in \mathbb{R}^{p}}{\text{maximize}} && v^{T} X^{T} X v - λ \sum_{t = 1}^{12} v^{T} {(X_{t}^{(f)} - f (X^{(w)}, t))}^{T} (X_{t}^{(f)} - f (X^{(w)}, t)) v \\ &\text{subject to} && {\| v \|}_{2}^{2} \leq 1, \end{aligned} \tag{4}$$

where

$$f (X^{(w)}, t) = \begin{cases} \frac{1}{2} (X_{2 t - 1}^{(w)} + X_{2 t + 1}^{(w)}), & \text{if } t = 5, \\ \frac{1}{3} (X_{2 t - 1}^{(w)} + X_{2 t}^{(w)} + X_{2 t + 1}^{(w)}), & \text{otherwise.} \end{cases}$$

Formula 4 is a special case of the general form (SI Appendix). To incorporate the difference in the length of the embryonic stages, we shrink the projection of $X_{t}^{(f)}$ toward the mean of $X_{2 t - 1}^{(w)}$ , $X_{2 t}^{(w)}$ , and $X_{2 t + 1}^{(w)}$ after the projection. Li et al. (21) aligned fly and worm development based on stage-associated genes using the same dataset. We did not implement their alignment results, to keep the analysis unsupervised.
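The alignment function $f$ and the penalty matrix in formula 4 can be sketched as below. This is an illustrative reading of the formula, not the published code: worm time points are indexed 1 to 25 with index 10 (4.5 h) absent, the worm data are held in a dictionary keyed by that index, and the names `worm_reference` and `species_penalty` are hypothetical. Averaging over whichever of the three time points are available reproduces both branches of $f$.

```python
import numpy as np

MISSING = {10}   # worm time point 10 (4.5 h) has no sample

def worm_reference(Xw, t):
    """f(X^(w), t): mean of the available worm time points 2t-1, 2t, 2t+1."""
    idx = [k for k in (2 * t - 1, 2 * t, 2 * t + 1) if k not in MISSING]
    return np.mean([Xw[k] for k in idx], axis=0)

def species_penalty(Xf, Xw):
    """Sum over t = 1..12 of D_t^T D_t, with D_t = X_t^(f) - f(X^(w), t)."""
    D = np.vstack([Xf[t - 1] - worm_reference(Xw, t) for t in range(1, 13)])
    return D.T @ D                     # p x p matrix multiplying lam in formula 4

# Toy data: worm expression at time point k is the constant vector k.
p = 3
Xw = {k: np.full(p, float(k)) for k in range(1, 26) if k not in MISSING}
Xf = np.zeros((12, p))                 # 12 fly time windows
P = species_penalty(Xf, Xw)
```

With these toy inputs, `worm_reference(Xw, 5)` averages time points 9 and 11 only, whereas every other window averages three consecutive time points.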

Formula 4 is able to capture the shared variation between fly and worm (Fig. 4C). The selected genes tend to have consistent and smooth temporal patterns in both species (Fig. 4D). PCA on fly and worm jointly cannot capture the direction of PC2 in AC-PCA, in which the gene expression levels peak in the middle embryonic stage. The other PCs in PCA are shown in SI Appendix, Fig. S22. After using ComBat with the species labels for the adjustment, PCA still cannot capture that direction (SI Appendix, Fig. S23).

Other Applications.

Two additional simulation examples are provided in SI Appendix, where we implemented other formulations of AC-PCA.

Discussion

Confounding variation can affect the performance of dimension reduction methods, and hence the visualization and interpretation of the results. In this study, we have proposed a general class of penalty functions in PCA for simultaneous dimension reduction and adjustment for confounding variation. In formula 1, we used a linear kernel on $Y$ (i.e., $Y Y^{T}$), and other kernels can be used as well. The additive property of kernels enables a further generalization, through the combination of different kernels, to adjust for multiple types of confounder effects simultaneously.

The application of AC-PCA is not limited to transcriptome datasets. Dimension reduction methods have been applied to other types of genomics data for various purposes, such as feature extraction for methylation prediction (22), classifying yeast mutants using metabolic footprinting (23), classifying immune cells using DNA methylome (24), and others. AC-PCA is applicable to these datasets to capture the desired variation, adjust for potential confounders, and select the relevant features. AC-PCA can serve as an exploratory tool and be combined with other methods. For example, the extracted features can be used in regression models.

Materials and Methods

AC-PCA Adjusting for Variations of Individual Donors.

We treated the left and right hemispheres from the same donor as two different individuals when implementing formula 3. We implemented all methods (AC-PCA, PCA, ComBat, and SVA) separately for each time window. To formulate the penalty term in formula 3, we need to know the source of confounding variation. However, the donor labels are not necessarily required. Only the brain region labels (i.e., labels for the primary variables of interest) are needed, because we are penalizing every pair of samples in the same brain region (SI Appendix, SI Materials and Methods). The connection of formula 3 with canonical correlation analysis is shown in SI Appendix, SI Materials and Methods.

AC-PCA with Sparse Loading.

See SI Appendix, SI Materials and Methods.

Multiple PCs.

See SI Appendix, SI Materials and Methods.

Tuning $λ$ .

When $λ$ increases from 0, the ratio $R (λ) = v^{T} X^{T} K X v / v^{T} X^{T} X v$ will tend to decrease (SI Appendix, Fig. S24). If $v^{T} X^{T} K X v$ has the interpretation of the between-groups sum of squares, we choose the smallest $λ$ such that $R (λ) \leq 0.05$ in the PCs that we are interested in, so the confounding variation is “small” compared with the total variation in the projected data. In other designs of $K$, $R (λ)$ can be greater than 1 and we choose the smallest $λ$ such that $R (λ) \leq 0.05 R (λ = 0)$. We notice that the overall patterns captured by AC-PCA are robust to a wide range of values of $λ$ in simulations and real data analysis (SI Appendix, Fig. S25). In the human brain dataset, we fixed $λ = 5$ for a better comparison over the time windows.
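The rule $R (λ) \leq 0.05$ can be sketched as a scan over an increasing grid. This is a sketch under assumptions, not the authors' tuning procedure: it looks at the first PC only, the grid and the toy donor-by-region data are arbitrary, $K$ is the donor kernel $C_{n} \otimes I_{b}$ for complete data, and the function names are hypothetical.

```python
import numpy as np

def confounding_ratio(X, K, lam):
    """R(lam) = v^T X^T K X v / v^T X^T X v for the leading AC-PCA direction v."""
    A = X.T @ X
    B = X.T @ K @ X
    _, evecs = np.linalg.eigh(A - lam * B)
    v = evecs[:, -1]                      # eigenvector of the largest eigenvalue
    return (v @ B @ v) / (v @ A @ v)

def smallest_lambda(X, K, grid, threshold=0.05):
    """First lam on an increasing grid with R(lam) <= threshold, else None."""
    for lam in grid:
        if confounding_ratio(X, K, lam) <= threshold:
            return lam
    return None

rng = np.random.default_rng(0)
n, b, p = 4, 5, 40                        # toy sizes, not the paper's
shared = np.outer(np.linspace(-1, 1, b), rng.normal(size=p))
X = np.vstack([shared + rng.normal(size=(1, p)) + 0.5 * rng.normal(size=(b, p))
               for _ in range(n)])        # shared signal + donor shift + noise
X = X - X.mean(axis=0)                    # center columns
K = np.kron(np.eye(n) - np.ones((n, n)) / n, np.eye(b))

grid = np.logspace(-2, 4, 25)
lam_star = smallest_lambda(X, K, grid)
```

For designs of $K$ where $R$ can exceed 1, the same scan applies with `threshold` set to 0.05 times `confounding_ratio(X, K, 0.0)`.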

Significance of the PCs.

For a fixed $λ$, we evaluate the significance of the PCs by checking whether the eigenvalues and the variance explained by the PCs are significant. To achieve this, we compare the values in the original data vs. permutation, where we permute the rows in $X$ and keep $Y$ the same.
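A minimal sketch of the permutation comparison for the leading eigenvalue follows. The text only states that observed values are compared with permuted ones; summarizing the comparison as an empirical p-value with a +1 correction is an assumption added here, as are the function names and the toy data.

```python
import numpy as np

def top_eigenvalue(X, K, lam):
    """Largest eigenvalue of X^T X - lam * X^T K X after column centering."""
    Xc = X - X.mean(axis=0)
    Z = Xc.T @ Xc - lam * (Xc.T @ K @ Xc)
    return np.linalg.eigvalsh(Z)[-1]      # eigvalsh returns ascending eigenvalues

def permutation_pvalue(X, K, lam, n_perm=200, seed=0):
    """Permute the rows of X, keep K (hence Y) fixed, and compare eigenvalues."""
    rng = np.random.default_rng(seed)
    observed = top_eigenvalue(X, K, lam)
    null = np.array([top_eigenvalue(X[rng.permutation(len(X))], K, lam)
                     for _ in range(n_perm)])
    pval = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return pval, observed, null

rng = np.random.default_rng(0)
n, b, p = 4, 5, 30                        # toy sizes, not the paper's
shared = 3 * np.outer(np.linspace(-1, 1, b), rng.normal(size=p))
X = np.vstack([shared + rng.normal(size=(1, p)) + 0.3 * rng.normal(size=(b, p))
               for _ in range(n)])
K = np.kron(np.eye(n) - np.ones((n, n)) / n, np.eye(b))

pval, observed, null = permutation_pvalue(X, K, lam=5.0, n_perm=50)
```

Shuffling region labels within each donor, as done for Fig. 3C, would replace the single global permutation with one permutation per donor block.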

Tuning $c_{1}$ and $c_{2}$ .

See SI Appendix, SI Materials and Methods.

Simulations.

In setting 1, we considered $n = 5$, $b = 10$, and $p = 400$. For the $i$th individual, $X^{(i)} = Ω + α Γ^{(i)} + ϵ^{(i)}$. The shared component is $Ω = W h$, where $W = (w_{1} \; w_{2})$ is a $b \times 2$ matrix representing the latent structure of the shared variation. For visualization purposes, we assumed that it is smooth and has rank 2. Let $μ = (1, \dots, b)'$; $w_{1}$ is $μ$ normalized to have mean 0 and variance 1, and $w_{2} \sim N (0, 0.25 \cdot Σ)$, where $Σ_{i j} = \exp (- {(w_{i 1} - w_{j 1})}^{2} / 4)$. $h$ is a $2 \times p$ matrix whose rows are generated from $N (0, I_{p})$. The donor-specific component is $Γ^{(i)} = Λ_{1}^{(i)} + Λ_{2}^{(i)}$, where $Λ_{1}^{(i)} = \mathbf{1} r_{i}$ and $Λ_{2}^{(i)} = B_{i} s_{i}$. Here $\mathbf{1}$ is a $b \times 1$ matrix of all 1s, and $B_{i}$ is a $b \times 1$ matrix in which three entries are generated from Uniform[0, 2] and the other entries are set to 0. $r_{i}$ and $s_{i}$ are $1 \times p$ matrices, generated from $N (0, I_{p})$. $α$ is a scalar indicating the strength of confounding variation, and we set $α = 2.5$. The entries in $ϵ^{(i)}$ are generated from $N (0, 0.25)$. In setting 2, the only difference from setting 1 is the term $B_{i}$: We first set $B_{i}$ equal to $w_{1}$ and then randomly pick and shuffle three entries. We implemented formula 3 for settings 1 and 2. More simulation results are shown in SI Appendix.
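The data-generating process for setting 1 can be sketched as follows. Several details are assumptions where the description leaves room: $N (0, 0.25)$ is read as variance 0.25 (standard deviation 0.5), $w_{1}$ is scaled with the sample variance, the three nonzero entries of $B_{i}$ are placed at random positions, and the function name `simulate_setting1` is hypothetical.

```python
import numpy as np

def simulate_setting1(n=5, b=10, p=400, alpha=2.5, seed=0):
    """Return the stacked (n*b) x p data matrix and the shared component Omega."""
    rng = np.random.default_rng(seed)
    mu = np.arange(1, b + 1, dtype=float)
    w1 = (mu - mu.mean()) / mu.std(ddof=1)             # mean 0, sample variance 1
    Sigma = np.exp(-(w1[:, None] - w1[None, :]) ** 2 / 4)
    w2 = rng.multivariate_normal(np.zeros(b), 0.25 * Sigma)
    W = np.column_stack([w1, w2])                      # b x 2 latent structure
    h = rng.normal(size=(2, p))                        # rows from N(0, I_p)
    Omega = W @ h                                      # shared rank-2 component
    blocks = []
    for _ in range(n):
        r = rng.normal(size=(1, p))                    # r_i
        s = rng.normal(size=(1, p))                    # s_i
        B = np.zeros((b, 1))
        hit = rng.choice(b, size=3, replace=False)     # three affected regions
        B[hit, 0] = rng.uniform(0, 2, size=3)
        Gamma = np.ones((b, 1)) @ r + B @ s            # Lambda_1 + Lambda_2
        eps = rng.normal(scale=0.5, size=(b, p))       # variance 0.25
        blocks.append(Omega + alpha * Gamma + eps)
    return np.vstack(blocks), Omega

X, Omega = simulate_setting1()
```

Setting 2 would differ only in how `B` is built (starting from `w1` and shuffling three entries), so the rest of the function could be reused unchanged.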

Data Preprocessing.

See SI Appendix, SI Materials and Methods.

Acknowledgments

We thank Yixuan Qiu for the RSpectra package, Angel Rubio and Joey Arthur for useful discussions, and Matthew W. State for the partial financial support of Z.L. This work was partially supported by National Science Foundation Grant DMS-1106738 and National Institutes of Health Grants R01 GM59507 and P01 CA154295 (to Z.L. and H.Z.); National Institutes of Health Grants R01 HG007834 and R01 GM109836 (to Z.L., Y.W., B.J., M.Z., and W.H.W.); National Science Funding of China Grant 61501389; Hong Kong Research Grant Council Grants 22302815 and 12316116; Hong Kong Baptist University Grants FRG2/14-15/069 and FRG2/15-16/011 (to C.Y.); and National Institutes of Health Grants P50 MH106934 and U01 MH103339 (to Y.Z., X.X., M.L., and N.S.). All computations were performed at the Yale University Biomedical High Performance Computing Center.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1617317113/-/DCSupplemental.

References

1. Sharov AA, et al. Transcriptome analysis of mouse stem cells and early embryos. PLoS Biol. 2003;1(3):E74. doi: 10.1371/journal.pbio.0000074.
2. Ringnér M. What is principal component analysis? Nat Biotechnol. 2008;26(3):303–304. doi: 10.1038/nbt0308-303.
3. Giordano TJ, et al. Molecular classification and prognostication of adrenocortical tumors by transcriptome profiling. Clin Cancer Res. 2009;15(2):668–676. doi: 10.1158/1078-0432.CCR-08-1067.
4. Kang HJ, et al. Spatio-temporal transcriptome of the human brain. Nature. 2011;478(7370):483–489. doi: 10.1038/nature10523.
5. Miller JA, et al. Transcriptional landscape of the prenatal human brain. Nature. 2014;508(7495):199–206. doi: 10.1038/nature13185.
6. Darmanis S, et al. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci USA. 2015;112(23):7285–7290. doi: 10.1073/pnas.1507125112.
7. Kruskal JB, Wish M. Multidimensional Scaling. Quantitative Applications in the Social Sciences, Vol 11. SAGE; Thousand Oaks, CA: 1978.
8. Jolliffe I. Principal Component Analysis. Wiley; New York: 2002.
9. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–127. doi: 10.1093/biostatistics/kxj037.
10. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3(9):1724–1735. doi: 10.1371/journal.pgen.0030161.
11. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13(3):539–552. doi: 10.1093/biostatistics/kxr034.
12. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882–883. doi: 10.1093/bioinformatics/bts034.
13. Yang C, Wang L, Zhang S, Zhao H. Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics. 2013;29(8):1026–1034. doi: 10.1093/bioinformatics/btt075.
14. Parker HS, Bravo HC, Leek JT. Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ. 2014;2:e561. doi: 10.7717/peerj.561.
15. Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32(9):896–902. doi: 10.1038/nbt.2931.
16. Celniker SE, et al. Unlocking the secrets of the genome. Nature. 2009;459(7249):927–930. doi: 10.1038/459927a.
17. Gerstein MB, et al. Comparative analysis of the transcriptome across distant species. Nature. 2014;512(7515):445–448. doi: 10.1038/nature13424.
18. Zhang R, Lin Y. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 2009;37(Suppl 1):D455–D458. doi: 10.1093/nar/gkn858.
19. Pletikos M, et al. Temporal specification and bilaterality of human neocortical topographic gene expression. Neuron. 2014;81(2):321–332. doi: 10.1016/j.neuron.2013.11.018.
20. Lin Z, et al. A Markov random field-based approach to characterizing human brain development using spatial–temporal transcriptome data. Ann Appl Stat. 2015;9(1):429–451. doi: 10.1214/14-AOAS802.
21. Li JJ, Huang H, Bickel PJ, Brenner SE. Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome Res. 2014;24(7):1086–1101. doi: 10.1101/gr.170100.113.
22. Das R, et al. Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci USA. 2006;103(28):10713–10716. doi: 10.1073/pnas.0602949103.
23. Allen J, et al. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nat Biotechnol. 2003;21(6):692–696. doi: 10.1038/nbt823.
24.Kulis M, et al. Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Nat Genet. 2012;44(11):1236–1242. doi: 10.1038/ng.2443. [DOI] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplementary File

pnas.1617317113.sapp.pdf (5.2 MB, PDF)
