Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2022 Jul 1.
Published in final edited form as: Electron J Stat. 2021 Sep 14;15(2):4192–4235. doi: 10.1214/21-ejs1887

Principal regression for high dimensional covariance matrices

Yi Zhao 1, Brian Caffo 2, Xi Luo 3; Alzheimer’s Disease Neuroimaging Initiative
PMCID: PMC9248851  NIHMSID: NIHMS1765239  PMID: 35782590

Abstract

This manuscript presents an approach to perform generalized linear regression with multiple high dimensional covariance matrices as the outcome. In many areas of study, such as resting-state functional magnetic resonance imaging (fMRI) studies, this type of regression can be utilized to characterize variation in the covariance matrices across units. Model parameters are estimated by maximizing a likelihood formulation of a generalized linear model, conditioning on a well-conditioned linear shrinkage estimator for multiple covariance matrices, where the shrinkage coefficients are proposed to be shared across matrices. Theoretical studies demonstrate that the proposed covariance matrix estimator is optimal achieving the uniformly minimum quadratic loss asymptotically among all linear combinations of the identity matrix and the sample covariance matrix. Under certain regularity conditions, the proposed estimator of the model parameters is consistent. The superior performance of the proposed approach over existing methods is illustrated through simulation studies. Implemented to a resting-state fMRI study acquired from the Alzheimer’s Disease Neuroimaging Initiative, the proposed approach identified a brain network within which functional connectivity is significantly associated with Apolipoprotein E ε4, a strong genetic marker for Alzheimer’s disease.

Keywords: Covariance matrix estimation, generalized linear regression, heteroscedasticity, shrinkage estimator

MSC2020 subject classifications: Primary 62J99, secondary 62H99

1. Introduction

In this manuscript, we study a regression problem with covariance matrices as the outcome under a high dimensional setting. Suppose yitp is a p-dimensional random vector, which is the tth acquisition from subject i, for t = 1, …, Ti and i = 1, …, n, where Ti is the number of observations of subject i and n is the number of subjects. Let Tmax = maxi Ti. The term “high dimensionality” refers to the scenario when Tmaxp and p increases to infinity. The data, yit, are assumed to follow a normal distribution with covariance matrix Σi. Here, without loss of generality, it is assumed that the distribution mean is zero as the study interest focuses on the covariance matrices. Let xiq denote the q-dimensional covariates of interest acquired from subject i. For the covariance matrices, we assume the following regression model. For i = 1, …, n, the data heteroscedasticity satisfies the following generalized linear regression model with a logarithmic link function,

log(γΣiγ)=xiβ, (1.1)

where γp is a linear projection and βq is the model coefficient. In xi, the first element is set to one to include the intercept term. Using a logarithmic link function, it is guaranteed that Σi’s are positive semi-definite. The goal is to estimate γ and β using the observed data {(yi1,,yiTi),xi}i=1n. In Model (1.1), γ is an unknown linear projection to be estimated such that the characteristic of the covariance matrices can be best captured by the covariates of interest.

One application of such a regression problem is to analyze covariate associated variations in brain coactivation in a functional magnetic resonance imaging (fMRI) study, where covariance/correlation matrices of the fMRI signals are generally utilized to reveal the coactivation patterns. Characterizing these patterns with population/individual covariates is of great interest in neuroimaging studies [34, 41]. Another example is the study of financial equities data. Considering a pool of stock values, covariance matrices over a period of time capture the comovement or synchronicity of the stocks. Firm and market-level information, such as industry type, firm’s cash flow, stock size, and book-to-market ratio, plays an essential role in determining the synchronicity. Quantifying such association is an important topic in financial theory [43].

Assuming Tmin = mini Ti > p and p is fixed, Zhao et al. [41] first studied Model (1.1) and proposed to estimate γ and β through a likelihood-based approach minimizing the negative log-likelihood function in the projection space. One sufficient condition to solve the likelihood-based criterion is that the sample covariance matrices are positive definite. Thus, the likelihood estimator is ill-posed when Tmax < p as the sample covariance matrices are rank-deficient. Additionally, it has been shown that when p increases, the sample covariance matrix performs poorly and can lead to invalid conclusions. For example, the largest eigenvalue of the sample covariance matrix is not a consistent estimator, and the eigenvectors can be nearly orthogonal to the truth [22]. To circumvent difficulties raised by the high dimensionality, one solution is to impose structural assumptions, such as bandable covariance matrices, sparse covariance matrices, spiked covariance matrices, covariances with a tensor product structure, and latent graphical models [see a review of 6, and references therein]. Based on structural assumptions, many regularization-based methods have been developed. However, most of these methods produce covariance estimates that may not always be positive definite (numerically), and this creates subsequent numerical convergence issues when the quadratic product with Σi is negative in (1.1). Moreover, most regularization methods can be computationally expensive on finding the solution and may require searching over different regularization parameters, not to mention the computational costs increase multiplicatively when computing over multiple covariance matrices. Further research is also needed to evaluate what structural assumptions are most appropriate for fMRI data. Another class of high-dimensional covariance matrix estimator is the shrinkage estimator. Daniels and Kass [11] considered two shrinkage estimators of the covariance matrix, a correlation shrinkage and a rotation shrinkage, offering a compromise between completely unstructured and structured estimators to improve the robustness. Ledoit and Wolf [24] introduced a well-conditioned estimator of the covariance matrix, which is an optimal linear combination of the identity matrix and the sample covariance matrix under squared error loss. This estimator is guaranteed to be positive definite and is easy to compute based on a simple and explicit formula. These advantages make it desirable for formulating the proposed estimator. Instead of a linear combination, Ledoit and Wolf [25] extended this work to nonlinear transformations of the sample eigenvalues and presented a way of finding the transformation that is asymptotically equivalent to the oracle linear combination. Based on Tyler’s robust M-estimator [37] and the linear shrinkage estimator [24], Chen et al. [8] and Pascal et al. [28], in parallel, introduced robust estimators of covariance matrices for elliptical distributed samples.

To model multiple covariance matrices, procedures include regression-type approaches [1, 9, 21, 14, 43]; (common) principal component analysis related methods [13, 5, 20, 15]; and methods based on other types of matrix decomposition, such as the Cholesky decomposition [30]. Among these, Fox and Dunson [14] introduced a scalable nonparametric covariance regression model applying low-rank approximation. Franks and Hoff [15] generalized a Bayesian hierarchical model studying the heterogeneity in the covariance matrices to high dimensional settings. Assuming that the ideal covariance structure exists in the eigenspace of the data covariance matrix, Chen et al. [7] introduced a regression-based approach to remove the scanner effects in covariance achieving the goal of harmonization. Compared to the above-mentioned approaches, Model (1.1) offers higher flexibility in modeling the relationship with the covariates. For example, x can be either continuous or categorical, and one can easily include interactions and/or polynomials of the covariates.

In the high dimensional setting considered in this study, γ and β, as well as n covariance matrices, will be estimated under Model (1.1). It is well known that the eigenvalues of the sample covariance matrix are more dispersed than the truth [24]. The class of linear combinations of the identity and sample covariance matrix corrects this dispersion issue by shrinking towards the identity matrix. The choice of the identity matrix can also be interpreted as a prior without strong structural assumptions or prior knowledge. Interestingly, it will be shown that estimating each covariance matrix separately, such as using the shrinkage estimator proposed in Ledoit and Wolf [24], leads to suboptimal estimation accuracy for γ, β, and Σi’s. Thus, we propose a linear shrinkage estimator of all the covariance matrices jointly, of which the shrinkage coefficients are shared across matrices. In addition, it is shown that the proposed shrinkage estimator leads to a consistent estimator of model coefficients. We first replace the sample covariance formulation with the proposed shrinkage estimator, and then estimate (γ, β) through maximizing a plug-in likelihood evaluated at the shrinkage estimator. In fMRI studies, shrinkage is also a popular technique to improve the reliability of subject-level functional connectivity captured by the covariance matrix. In the technique, when estimating individual covariance matrix, population-level information is borrowed as prior knowledge [38, 31, 27, 32, 29].

The framework proposed in this manuscript has three major contributions.

  1. This paper first studies a joint shrinkage estimator for multiple high dimensional covariance matrices, generalizing the linear shrinkage estimator for a single covariance matrix [24]. We show that the latter approach is suboptimal compared to the proposed joint covariance shrinkage estimator, where the shrinkage coefficients are shared across multiple matrices. Within this class of shrinkage estimators, we believe that this is among the first attempts to analyze the variations of a large number of covariance matrices associated with covariates in a regression setting under certain model assumptions.

  2. The proposed shrinkage estimator of the covariance matrices is well conditioned and has uniformly minimum quadratic risk asymptotically among all linear combinations (Theorem 3.3).

  3. Under certain regularity conditions, the proposed approach achieves consistent estimators of the parameters in Model (1.1) (Proposition 3.1).

The rest of the paper is organized as the following. Section 2 introduces the proposed shrinkage estimator of the covariance matrices and the pseudo-likelihood based method of estimating γ and β. Section 3 studies the asymptotic properties. In Section 4, the superior performance of the proposed approach over existing methods is demonstrated through simulation studies. Section 5 articulates an application to a resting-state fMRI data set acquired from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Section 6 concludes this paper with discussions. Technical proofs are collected in the appendix.

2. Methods

Considering the regression model (1.1), it is proposed to estimate the parameters by solving the following optimization problem.

minimize(β,γ)(β,γ)=12i=1nTi{xiβ+γΣ^iγexp(xiβ)},such that  γHγ=1, (2.1)

where Σ^i is an estimator of the covariance matrix Σi to be discussed later, which is positive definite, for i = 1, …, n; and H is a positive definite matrix in p×p, which is set to be the average of Σ^i’s, that is H=i=1nTiΣ^i/i=1nTi. It is essential to impose a constraint on γ, otherwise the objective function of (2.1) is minimized at γ = 0 with fixed β. When Σ^i=Si=t=1Tiyityit/Ti (i.e., the sample covariance matrix), which is the proposal in Zhao et al. [41], it is equivalent to minimize the negative log-likelihood function of {γ yit}i,t assuming the data are normally distributed. However, when Tmax = maxi Ti < p, problem (2.1) is ill-posed as Si’s are rank-deficient. Thus, the goal of this manuscript is to propose a well-conditioned estimator of Σi that yields optimal properties. To achieve this, a covariate-dependent linear shrinkage estimator, denoted as Σi*, is proposed, which yields the minimum expected squared loss under regression model (1.1), where the expectation is taken over the sample covariance matrix Si.

minimize(μ,ρ)1ni=1nE{γΣi*γexp(xiβ)}2,such that  Σi*=ρμI+(1ρ)Si,  for i=1,,n. (2.2)

The following theorem gives the solution to (2.2).

Theorem 2.1. For given (γ, β), the solution to optimization problem (2.2) is

Σi*=ψ2δ2μI+ϕ2δ2Si,  for i=1,,n, (2.3)

and the minimum value is

1ni=1nE{γΣi*γexp(xiβ)}2=ϕ2ψ2δ2, (2.4)

where

μ=1n(γγ)i=1nexp(xiβ),ϕ2=1ni=1nϕi2,ψ2=1ni=1nψi2,δ2=1ni=1nδi2,ϕi2={μ(γγ)exp(xiβ)}2,ψi2=E{γSiγexp(xiβ)}2,δi2=E{γSiγμ(γγ)}2;

and Lemma 2.1 shows that ψ22 + ϕ22 = 1.

Lemma 2.1. Fori ∈ {1, …, n}, δi2=ϕi2+ψi2, and thus δ2 = ϕ2 + ψ2.

According to Theorem 2.1, parameters ϕi2, ψi2 and δi2 are expected values as the objective is to minimize the expected squared loss. Thus, one cannot replace Σ^i with i* in (2.1) and solve for solution using the data. For implementation in practice, the following sample counterparts are used to compute (2.3) and thus Σ^i in (2.1). Let

δ^i2={γSiγμ(γγ)}2,ψ^i2=1Ti{γSiγexp(xiβ)}2,ϕ^i2=δ^i2ψ^i2,δ^2=1ni=1nδ^i2,ψ^2=1ni=1nmin(ψ^i2,δ^i2),ϕ^2=1ni=1nϕ^i2,

and

Si*=ψ^2δ^2μI+ϕ^2δ^2Si,  for i=1,,n. (2.5)

In Section 3, we show that Si* is a consistent estimator of i* and is uniformly optimal asymptotically among all the linear combinations of the sample covariance matrices and the identity matrix regarding the quadratic risk. The objective function (β, γ) is an approximation of the negative log-likelihood function if replacing Σ^i with the proposed shrinkage estimator Si*. Thus, optimizing (2.1) can be considered as a pseudo-likelihood approach under the normality assumption.

The proof of Theorem 2.1 and Lemma 2.1 is presented in Appendix Section A.1. Formulation (2.2) introduces a shrinkage estimator of the covariance matrix, where the shrinkage is shared across subjects and is optimal under the squared error loss. For each subject, i* is a linear combination of the sample covariance matrix Si and the identity matrix. The weighting parameters, ρ and μ, are population level parameters that are shared across subjects. This is equivalent to imposing a linear shrinkage on the sample eigenvalues. Assuming γ is a common eigenvector of all the covariance matrices, μ is the average eigenvalue corresponding to γ. The level of shrinkage is determined by the leverage between the accuracy of Si’s and the variation in the eigenvalues. If Si’s are accurate or the errors are small relative to the variation in the eigenvalues, less shrinkage will be imposed; otherwise, if Si’s are inaccurate and the errors are comparable or even higher than the eigenvalue variability, the sample covariance matrices will be shrank more.

Algorithm 1 summarizes the optimization procedure. As problem (2.1) is nonconvex, a series of random initializations of (γ, β) is considered and the one that achieves the minimum value of the objective function is the estimate. The initial values of γ can be set as the eigenvectors of the average sample covariance matrices, S¯=i=1nTiSi/i=1nTi; and the initial values of β is the corresponding solution to (2.1) by replacing Σ^i with a well-conditioned estimator, such as the estimator proposed in Ledoit and Wolf [24]. When p<i=1nTi, S¯ is of full rank, and the sample eigenvectors are consistent estimators assuming all the covariance matrices have the same eigendecomposition. Step 3 in the algorithm updates the covariance matrix estimators with a global shrinkage parameter. In Section 4, through simulation studies, we show that it improves the performance in estimating the covariance matrices and β with lower bias and higher stability. The details of updating γ and β in Step 4 can be found in Algorithm 1 in Zhao et al. [41].

graphic file with name nihms-1765239-f0001.jpg

For higher-order components, one can first remove the identified components and use the new data to estimate the next with an additional orthogonality constraint, that is, the new component is orthogonal to the identified ones. Different from Algorithm 2 in Zhao et al. [41], there is no need to include a rank-completion step as Si* is introduced to render the rank-deficiency issue. To determine the number of components, the metric of average deviation from diagonality is adopted [41]. Let Γ(k)p×k denote the first k estimated components, the average deviation from diagonality is defined as

DfD(Γ(k))=i=1n(det{diag(Γ(k)Si*Γ(k))}det(Γ(k)Si*Γ(k)))Ti/iT, (2.6)

where diag(A) is a diagonal matrix of the diagonal elements in a square matrix A, and det(A) is the determinant of A. If Γ(k) is a common diagonalization of Si*’s, that is, Γ(k)Si*Γ(k) is a diagonal matrix, for ∀ i = 1, …, n, then DfD(Γ(k)) = 1. In practice, k can be chosen before DfD increases far away from one or before a sudden jump occurs.

3. Asymptotic Properties

In this section, we study the asymptotic properties of the proposed estimators. For i = 1, …, n, it is assumed that Σi has the eigendecomposition of Σi=ΠiΛiΠi, where Λi = diag{λi1, …, λip} is a diagonal matrix and Πi = (πi1, …, πip) is an orthonormal rotation matrix; {λi1, …, λip} are the eigenvalues and the columns of Πi are the corresponding eigenvectors. Let Zi = YiΠi, where Yi=(yi1,,yiTi)Ti×p is the data matrix of subject i. Under the normality assumption, the columns of Zi = (zitj)t,j are uncorrelated, and the rows, zit=(zi1,,zip)p for t = 1, …, Ti, are normally distributed with mean zero and covariance matrix Λi. The following assumptions are imposed.

Assumption A1 There exists a constant C1 independent of Tmax such that p/TmaxC1, where Tmax = maxi Ti.

Assumption A2 Let N=i=1nTi, p/N → ∞ as n, Tmin → ∞, where Tmin = mini Ti.

Assumption A3 There exists a constant C2 independent of Tmin and Tmax such that j=1pE(zi1j8)/pC2, for ∀ i ∈ {1, …, n}.

Assumption A4 Let Q denote the set of all the quadruples that are made of four distinct integers between 1 and p, for ∀ i ∈ {1, …, n},

limTip2Ti2(j,k,l,m)Q{Cov(zi1jzi1k,zi1lzi1m)}2|Q|=0, (3.1)

where |Q| is the cardinality of set Q.

Assumption A5 All the covariance matrices share the same set of eigenvectors, i.e., Πi = Π, for i = 1, …, n. For each Σi, there exists (at least) a column, indexed by ji, such that γ = πiji and Model (1.1) is satisfied.

Assumption A1 allows the data dimension, p, to be greater than the (maximum) number of observations, Tmax, and to grow at the same rate as Tmax does. This is a common regularity condition for shrinkage estimators [24]. Assumption A2 guarantees that the average sample covariance matrix S¯=i=1nTiSi/N utilized in the initial step of Algorithm 1 is positive definite. Together with Assumption A5, the eigenvectors of S¯ are consistent estimators of Π [2]. Assumptions A3 and A4 regulate zit on higher-order moments, which is equivalent to imposing restrictions on the higher-order moments of yit. When the data are assumed to be normally distributed, both A3 and A4 are satisfied. Assumption A5 assumes that all the covariance matrices share the same eigenspace, though the ordering of the eigenvectors may differ. When p/Tmin → 0, Zhao et al. [41] relaxed this assumption to partial common diagonalization and demonstrated the method robustness through numerical examples. Studying the asymptotic properties under the relaxation is difficult and not available in existing literature, especially when p > Tmax.

Taking the eigenvectors of S¯ as the initial values of γ, the following proposition demonstrates the consistency of the proposed estimator.

Proposition 3.1. Under Assumptions A1–A5, as n,Tmin → ∞, the estimator of γ and β obtained by Algorithm 1 are asymptotically consistent.

To prove Proposition 3.1, we first study the asymptotic properties of Si* and show that Si* is the optimal linear shrinkage estimator of the covariance matrix under the squared loss. This is accomplished under the assumption that γ is given. As the initialization of γ is already a consistent estimator, the consistency of the solution after iteration follows. For β, it is firstly shown that the association between the shrinkage estimator, Σi*, and the covariates is the same as the covariance matrix, Σi, does (Lemma 3.3). Thus, it is equivalent to optimize problems (2.1) and (2.2) to solve for β, and the solution is a consistent estimator of β based on the pseudo-likelihood theory [16]. In the iteration step of Algorithm 1, Si* improves the estimation of the covariance matrices with lower squared loss, and in consequence, improves the estimation of γ and β. In Section 4, the improvement is demonstrated through simulation studies.

In Section 2, the optimization problem (2.2) introduces a linear combination of the sample covariance matrix and the identity matrix, Σi*, that achieves the minimum expected squared error. From Theorem 2.1, the solution has population-level parameters. Thus, the sample counterpart, Si*, is introduced. The following Lemma 3.1 first shows that asymptotically, the weighting parameters in Σi* are well-behaved. Lemma 3.2 demonstrates that the corresponding sample counterpart of the weighting parameters are consistent estimators. Theorem 3.1 demonstrates that Si* performs as well as Σi* does asymptotically.

Lemma 3.1. For given (γ, β), let Tmin = mini Ti, as Tmin → ∞, μ, ϕ2, ψ2 and δ2 are bounded.

Lemma 3.2. For given (γ, β), as Tmin → ∞,

  1. E(δ^i2δi2)20, for i = 1, , n, and thus E(δ^2δ2)20;

  2. E(ψ^i2ψi2)20, for i = 1, , n, and thus E(ψ^2ψ2)20;

  3. E(ϕ^i2ϕi2)20, for i = 1, , n, and thus E(ϕ^2ϕ2)20.

Theorem 3.1. Fori ∈ {1, …, n}, Si* is a consistent estimator of Σi*, that is, as Tmin = mini Ti → ∞,

ESi*Σi*20. (3.2)

Thus, the asymptotic expected loss of Si* and Σi* are identical, that is,

E{γSi*γexp(xiβ)}2E{γΣi*γexp(xiβ)}20. (3.3)

Next, we show that Si* uniformly achieves the minimum quadratic risk asymptotically over all linear combinations of the sample covariance matrix and the identity matrix. For given (γ, β), let Σi** denote the solution to the following optimization problem,

minimizeρ1,ρ21ni=1n{γΣi**γexp(xiβ)}2, such that  Σi**=ρ1I+ρ2Si,  for i=1,,n (3.4)

Theorem 3.2. Si* is a consistent estimator of Σi**, that is, as Tmin = mini Ti → ∞, for i = 1, …, n,

ESi*Σi**20. (3.5)

Then, Si* has the same asymptotic expected loss as Σi** does, that is,

E{γSi*γexp(xiβ)}2E{γΣi**γexp(xiβ)}20. (3.6)

Theorem 3.3. Assume (γ, β) is given. With a fixed n+, for any sequence of linear combinations {Σ^i}i=1n of the identity matrix and the sample covariance matrix, where the combination coefficients are constant over i ∈ {1, …, n}, the estimator Si* verifies:

limTinfTiT[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γSi*γexp(xiβ)}2]0. (3.7)

In addition, every sequence of {Σ^i}i=1n. that performs as well as {Si*}i=1n identical to {Si*}i=1n in the limit:

limT[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γSi*γexp(xiβ)}2]=0 (3.8)
EΣ^iSi*20, for i=1,,n. (3.9)

The difference between Σi** and Σi* is that Σi** minimizes the squared loss instead of the expected loss, while asymptotically they are equivalent (Theorems 3.1 and 3.2). Theorem 3.3 presents the main result that, with a fixed sample size n, the proposed shrinkage estimator {Si*}i=1n achieves the uniformly minimum (average) quadratic risk asymptotically among all linear combinations of the identity matrix and the sample covariance matrix. Here, “average” implies an average over the subjects, and “asymptotically” refers to that the number of observations within each subject increases to infinity. Therefore, Si* is asymptotically optimal. In addition, it is guaranteed that Si* is positive definite (see a discussion in Appendix Section A.8). Thus, there exits unique solution to the optimization problem (2.1).

Next, we study the asymptotic properties of the model coefficient estimator. Let β^ denote the solution to the optimization problem (2.1).

Lemma 3.3. For given γ, assume the linear shrinkage estimator, Σi*, satisfies

E(γΣi*γ)=exp(xiβ*), for i=1,,n, (3.10)

then

β*=β. (3.11)

Theorem 3.4. For given γ, assume Assumptions A1–A5 are satisfied, β^ is a consistent estimator of β as n, Tmin → ∞, where Tmin = mini Ti.

Lemma 3.3 implies that under the rotation γ, the expectation of the shrinkage estimator, Σi*, has the same association with the covariates as the true covariance matrix, Σi, does. Si* is a consistent estimator of Σi* and is positive definite. This substantiates the choice of Si* replacing the sample covariance matrix Si in the optimization problem. Theorem 3.4 states the consistency of β^.

4. Simulation Study

4.1. γ is known

In this section, we focus on examining the performance of the proposed method in estimating the covariance matrices and model coefficients by assuming the projection γ is known. Three methods are compared. (i) Estimate each individual covariance matrix using the estimator proposed in Ledoit and Wolf [24] and replace Σ^i with it in the optimization problem (2.1). We denote this approach as LW-CAP (Ledoit and Wolf based Covariate Assisted Principal regression), where the shrinkage is estimated on each individual covariance matrix. (ii) Estimate the covariance matrices using the proposed shrinkage estimator Si* in (2.5). We denote this approach as CS-CAP (Covariate dependent Shrinkage CAP), where the shrinkage parameters are assumed to be shared across subjects. (iii) Estimate each individual covariance matrix using the sample covariance matrix and plug into the optimization problem (2.1). This is the CAP approach proposed in Zhao et al. [41], which is only applicable when Tmin = mini Ti > p.

The covariance matrices are generated using the eigendecomposition Σi = ΠΛiΠ, where Π = (π1, …, πp) is an orthonormal matrix in p×p and Λi = diag{λi1, …, λip} is a diagonal matrix with the diagonal elements to be the eigenvalues, for i = 1, …, n. In Λi, the diagonal elements are exponentially decaying, where eigenvalues of the second and the fourth dimension (D2 and D4) satisfy the log-linear model in (1.1). We consider a case with a single predictor X (thus q = 2), which is generated from a Bernoulli distribution with probability 0.5 to be one. For D2, the coefficient β1 = −1; and for D4, β1 = 1. For the rest dimensions, λij, for i = 1, …, n, is generated from a log-normal distribution, where the mean of the corresponding normal distribution decreases from 5 to −1 over j. Cases when p = 20, 50, 100 are considered.

We first compare the three approaches, LW-CAP, CS-CAP and CAP, under sample sizes n = 50 and Ti = T = 50 for all i and present the result in Table 1. In the estimation, for dimension j, γ is set to be πj. In Table 1, we present the bias and the mean squared error (MSE) in estimating the eigenvalues and the model coefficient in D2 and D4. From the table, for both the eigenvalues and β1, CS-CAP yields lower estimation bias and MSE than LW-CAP does. When p < T, CS-CAP achieves a similar estimation bias as the CAP approach does in estimating the covariance matrices, while the MSE is slightly lower. For the estimation of β1, CS-CAP yields slightly lower bias. As the dimension p increases, the bias and MSE of eigenvalue estimates from LW-CAP increase; while the bias and MSE of the estimates from CS-CAP are similar at all p settings. This demonstrates the superiority of the proposed estimator in estimating the covariance matrices. Figure 1 presents the estimation bias and MSE of CS-CAP estimator at various levels of T when fixing n = 50 when p = 20. From the figure, as the number of observations within each subject increases, the estimates converge to the truth.

Table 1.

Bias and mean squared error (MSE) in estimating the eigenvalues of the covariance matrices and bias, MSE, and coverage probability (CP) in estimating β1 coefficient with sample sizes n = 50 and Ti = T = 50, for i = 1, …, n, when γ is known.

Method λ^ij β^1
Bias MSE Bias MSE CP
p = 20 D2 LW-CAP −6.520 225.360 0.053 0.006 0.795
CS-CAP −1.175 204.686 0.001 0.004 0.935
CAP −1.175 206.117 −0.003 0.004 0.935
D4 LW-CAP −7.422 277.888 −0.040 0.005 0.860
CS-CAP −1.223 249.881 0.001 0.004 0.905
CAP −1.223 251.595 0.005 0.004 0.910
p = 50 D2 LW-CAP −7.975 244.326 0.028 0.004 0.915
CS-CAP −1.428 202.141 0.008 0.003 0.935
CAP - - - - -
D4 LW-CAP −8.641 295.221 −0.012 0.004 0.915
CS-CAP −1.242 248.254 0.001 0.004 0.925
CAP - - - - -
p = 100 D2 LW-CAP −8.924 260.268 0.010 0.004 0.915
CS-CAP −0.973 203.151 −0.001 0.003 0.930
CAP - - - - -
D4 LW-CAP −10.487 331.864 −0.011 0.003 0.940
CS-CAP −1.705 245.754 −0.007 0.003 0.940
CAP - - - - -

Fig 1.

Fig 1.

Bias and mean squared error (MSE) in estimating the eigenvalues of the covariance matrices and bias, MSE, and coverage probability in estimating β1 coefficient using CS-CAP with the number of subjects n = 50 at various numbers of observations from each subject with p = 20 when γ is known.

4.2. γ is unknown

In this section, we evaluate the performance of the CS-CAP approach when γ is unknown and estimated by solving optimization problem (2.1) using Algorithm 1. The data are generated following the same procedure as in Section 4.1. To evaluate the performance in estimating the projection γ, we consider a similarity metric measured by |γ^,γ|, where 〈·, ·〉 is the inner product of two vectors and γ^ denotes the estimate of γ. When this metric is one, the two vectors are identical (up to sign flipping); and when this metric is zero, the two vectors are orthogonal. Case where p = 100 is studied. The performance of the CS-CAP approach is firstly compared to the LW-CAP approach with sample sizes n = 100 and Ti = T = 100. The results are presented in Table 2. From the table, the CS-CAP approach improves the performance with much lower MSE in estimating the eigenvalues, and lower MSE and higher coverage probability (CP) in estimating the β coefficient. After iterations, the CS-CAP approach yields an estimate of the projection with much higher similarity to the truth. To further examine the performance of the CS-CAP approach under finite sample size, combinations of sample sizes n = 50, 100, 500, 1000 and Ti = T = 50, 100, 500, 1000 are considered. Figure 2 presents the performance in estimating the second dimension (D2), including the bias, the MSE and the CP of β^1, the MSE of λ^ij, and the similarity of γ^ to the eigenvector of D2 (Appendix Section B.1 presents the results of the fourth dimension, D4). From the figure, as n, T → ∞, all estimates converge to the truth.

Table 2.

Bias, mean squared error (MSE), and coverage probability (CP) from 500 bootstrap samples in estimating the β1 coefficient, the similarity of γ^ to πj and the standard error (SE), and the MSE in estimating the eigenvalues λ^ij, for j = 2, 4. Data dimension p = 100, sample size n = 100 and Ti = T = 100.

Method β^1 λ^ λ^ij
Bias MSE CP |γ^,πj| (SE) MSE
D2 LW-CAP −0.027 0.002 0.782 0.653 (0.033) 1812.091
CS-CAP −0.023 0.001 0.855 0.931 (0.012) 173.225
D4 LW-CAP 0.018 0.002 0.770 0.666 (0.027) 2186.265
CS-CAP 0.019 0.001 0.845 0.926 (0.011) 231.856

Fig 2.

Fig 2.

Estimation performance of CS-CAP in estimating the second dimension (D2) when γ is unknown. For β^1, (a) bias, (b) mean squared error (MSE) and (c) coverage probability (CP) are presented, where CP is obtained from 500 bootstrap samples. For the eigenvalues λ^ij, (d) MSE is presented. For γ^, (e) similarity to π2 is presented. Data dimension p = 100. Sample sizes vary from n = 50, 100, 500, 100 and Ti = T = 50, 100, 500, 1000.

In Appendix Section B.2, the robustness of the proposed method to model misspecification is examined. Two types of model misspecification are considered, model misspecification in β and model misspecification in γ. When the log-linear model is misspecified, the proposed approach can correctly identify the linear projections under certain scenarios, while the estimate of model coefficients is biased. The proposed approach is robust to the setting that the eigenvectors of the covariance matrices are partially common, while it will not work when the eigenvectors are completely unique to each covariance matrix.

5. The Alzheimer’s Disease Neuroimaging Initiative Study

Data used in this study are obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).

We apply the proposed approach to ADNI resting-state functional MRI (fMRI) data acquired at the baseline screening. AD is an irreversible neurodegenerative disease that destroys memory and related brain functions causing problems in cognition and behavior. Apolipoprotein E ε4 (APOE-ε4) has been consistently identified as a strong genetic risk factor for AD. With an increasing number of APOE-ε4 alleles, the lifetime risk of developing AD increases, and the age of onset decreases [10]. Thus, APOE-ε4 is generally treated as a potential therapeutic target [33]. In AD studies, resting-state fMRI is another emerging biomarker for diagnosis [23]. It is important to articulate the genetic impact on brain functional architecture. In this study, n = 194 subjects diagnosed with either MCI or AD are analyzed. Resting-state fMRI data collected at the initial screening are preprocessed. Time courses are extracted from p = 75 brain regions, including 60 cortical and 15 subcortical regions grouped into 10 functional modules, using the Harvard-Oxford Atlas in FSL [35]. For each time course, a subsample is taken with an effective sample size of T = 67 to remove the temporal dependence. The resulting data, denoted as yit (for i = 1, …, n and t = 1, …, T), are assumed to follow a multivariate normal distribution with mean zero and covariance Σi. The off-diagonal elements in Σi represent the pairwise functional connectivity between brain regions and Σi represents the brain functional architecture of subject i. In the regression model, APOE-ε4, sex, and age are entered as the covariates (xi’s). The validity of model assumptions is discussed in Appendix Section C.1.

The CS-CAP approach is applied to identify brain subnetworks within which the functional connectivity demonstrates a significant association with APOE-ε4. Using the deviation from diagonality criterion, CS-CAP identifies three components denoted as C1, C2, and C3. The model coefficients and 95% bootstrap confidence interval from 500 bootstrap samples are presented in Table 3. From the table, C3 is significantly associated with APOE-ε4 and age; C1 and C2 are significantly associated with sex and age. To better interpret C3, a fused lasso regression [36] is employed to sparsify the loading profile, similarly as in the sparse principal component analysis proposed in Zou et al. [42]. The fused lasso regularization is defined based on the modular information to impose local smoothness and consistency [19, 40]. Figure 3(a) presents the sparse loading profile colored by the corresponding functional module, and Figure 3(b) is the river plot illustrating the loading configuration. In C3, all regions with negative loadings are subcortical regions. Contributions to positive loadings are from regions in the default mode network (DMN), the ventral- and dorsal-attention networks, and the somato-motor network. Figure 3(c) presents these regions on a brain map. C3 is negatively associated with APOE-ε4 indicating that functional connectivity between regions in the same sign among APOE-ε4 carriers is lower, while connectivity between regions in the opposite signs among APOE-ε4 carriers is higher. The findings are in line with existing knowledge about AD. Compared to APOE-ε4 non-carriers, more functional connectivity between the left hippocampus and the insular/prefrontal cortex while more functional disconnection of the hippocampus has been observed in APOE-ε4 carriers [12]. Alterations in DMN connectivity in cognitively normal APOE-ε4 carriers have been reported across all age groups [3]. Increased connectivity in the limbic system, including the hippocampus, the amygdala, and the thalamus, has been detected in individuals with memory impairment [18, 17], though the effect of APOE-ε4 carriage lacks consensus [3]. It was shown that the limbic hyperconnectivity is positively associated with the memory performance, suggesting the preservation of brain function due to increased connectivity in the medial temporal lobe pathology [17].

Table 3.

Model coefficient estimate and 95% bootstrap confidence interval using the PS-CAP approach. The intervals are obtained over 500 bootstrap samples.

APOE-ε4 Sex Age
C1 0.012 (−0.031, 0.263) −0.431 (−0.636, −0.230) −0.227 (−0.319, −0.129)
C2 0.049 (−0.191, 0.309) −0.544 (−0.867, −0.186) −0.232 (−0.383, −0.066)
03 −0.156 (−0.270, −0.045) −0.061 (−0.201, 0.075) −0.241 (−0.328,−0.172)

Fig 3.

Fig 3.

(a)The sparsified loading profile, (b) the module river plot, and (c) regions with nonzero loadings in a brain map of C3. In (a) and (b), the figure and the legend are colored by brain functional modules. In (c), the brain maps are colored by the loading weights.

6. Discussion

In this study, we introduce an approach to perform linear regression with multiple high dimensional covariance matrices as the outcome. A linear shrinkage estimator of the covariance matrix is firstly introduced, where the shrinkage coefficients are shared parameters across subjects. It is shown that the proposed estimator is optimal achieving the uniformly minimum quadratic loss asymptotically among all linear combinations of the identity matrix and the sample covariance matrix. Replacing the sample covariance matrices with the proposed well-conditioned estimator in the likelihood function, the linear projection parameter and the model coefficient are shown to be consistently estimated. Through simulation studies, the proposed approach demonstrates superior performance in estimating the covariance matrices and the model coefficients with lower estimation bias and variation over the existing methods. Applying to a resting-state fMRI data set acquired from the ADNI study, the findings are consistent with existing knowledge about AD.

The proposed framework extends the proposal in Zhao et al. [41] to high dimensional scenario. When p is small, the proposed shrinkage estimator demonstrates lower squared loss than the sample covariance matrix as suggested in both theoretical results and simulation studies. Different from the linear shrinkage estimator introduced in Ledoit and Wolf [24], which was proposed for a single covariance matrix estimation, the shrinkage coefficients considered in this study are population level parameters shared across subjects. This is superior than the individual shrinkage as the proposed one leverages the accuracy of the sample covariance matrix and the variability in the eigenvalues across subjects.

In this study, the asymptotic properties are studied under the assumption that the covariance matrices have the same eigendecomposition. We leave the study of the consistency relaxing this assumption to future research. The proposed shrinkage estimator is optimal with respect to a squared risk. However, this may overshrink the small eigenvalues [11]. Other types of loss function, such as the Stein’s loss, will be considered in the future. In the ADNI application, we included an ad hoc procedure to select important brain regions for interpretation. A next-step research is to include the regularization on γ into the optimization or to introduce an efficient approach to draw inference on the loadings (such as a bootstrap sampling procedure).

Acknowledgments

Zhao was partially supported by NIH grant U54AG065181 and P30AG010133; Caffo by NIH grant R01EB029977 and P41EB031771; and Luo by NIH grant R01EB022911. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI, National Institutes of Health Grant U01 AG024904 and Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Appendix A: Theory and Proof

A.1. Proof of Theorem 2.1 and Lemma 2.1

Proof. Given (γ, β), E(γSiγ)=γΣiγ=exp(xiβ). For the objective function in (2.2), under the constraint that Σi*=ρμI+(1ρ)Si, we have

f(μ,ρ)=1ni=1nE{γΣi*γexp(xiβ)}2=1ni=1n[ρ2{μ(γγ)exp(xiβ)}2+(1ρ)2E{γSiγexp(xiβ)}2].

In order to minimize the objective function, as the objective function is convex, derivatives are firstly taken over μ and ρ.

For μ,

fμ=ρ21ni=1n2{μ(γγ)exp(xiβ)}(γγ)=0,μ=1n(γγ)i=1nexp(xiβ).

For ρ, let ϕi2={μ(γγ)exp(xiβ)}2 and ψi2=E{γSiγexp(xiβ)}2,

fρ=2ρ(1ni=1nϕi2)2(1ρ)(1ni=1nψi2)=0,ρ=i=1nψi2k=1nϕi2+i=1nψi2.

Let δi2=E{γSiγμ(γγ)}2, then δi2=ϕi2+ψi2. Let ϕ2=i=1nϕi2/n, ψ2=i=1nψi2/n, and δ2=i=1nδi2/n (thus, δ2 = ϕ2 + ψ2), the optimizer of problem (2.2) is

Σi*=ψ2δ2μI+ϕ2δ2Si,i=1,,n.

The minimum value of the function is

1ni=1nE{γΣi*γexp(xiβ)}2=1ni=1nE{ψ2δ2μγγ+ϕ2δ2γSiγψ2+ϕ2δ2exp(xiβ)}2=1ni=1n[E{ψ2δ2μγγψ2δ2exp(xiβ)}2+E{ϕ2δ2γSiγϕ2δ2exp(xiβ)}2]=1ni=1n(ψ4δ4ϕi2+ϕ4δ4ψi2)=ψ4ϕ2+ϕ4ψ2δ4=ϕ2ψ2δ2.

A.2. Proof of Proposition 3.1

Proof. Under Assumptions A2 and A5, the eigenvectors of S¯ are consistent estimators of Π. Replace γ with its estimate in Theorems 3.13.3 and Theorem 3.4, the consistency of β follows. □

A.3. Proof of Lemma 3.1

Proof. (1) For μ,

μ=1n(γγ)i=1nexp(xiβ)=1ni=1nγΣiγγγ1ni=1nΣi22.

Under Assumption A2,

1ni=1nΣi22=1ni=1nΛi221ni=1nΛiF2=1ni=1n{1pj=1pE(zi1j2)2}=1ni=1n{1pj=1pE(zi1j4)}1ni=1n1pj=1pE(zi1j)81ni=1nC2=C2,

where ∥·∥F is the Frobenius norm of a matrix.

(2) For ϕ2, upper limits of ϕi2 is derived first.

ϕi2={μ(γγ)exp(xiβ)}2μ2(γγ)2+{exp(xiβ)}2=μ2(γγ)2+(γΣiγ)2(μ2+Σi24)(γγ)2.

From the above derivation, we have

μ2C2, and Σi22=Λi22ΛiF2C2.

Since γ is given, without loss of generality, assume that ∥γ2 = 1, i.e., γγ = 1. Then,

ϕi22C2(γγ)=2C2.

Thus,

ϕ2=1ni=1nϕi22C2.

(3) For ψ2, analogously, ψi2 is considered first.

ψi2=E{γSiγexp(xiβ)}2=E{γ(SiΣi)γ}2(γγ)2ESiΣi22ESiΣiF2=1pj=1pk=1pE{(1Tit=1Tiyitjyitkσijk)2}=1pj=1pk=1pE{(1Tit=1Tizitjzitkλijk)2}=1pj=1pk=1pVar(1Tit=1Tizitjzitk)=1pj=1pk=1p1TiVar(zi1jzi1k)1pTij=1pk=1pE(zi1j2zi1k2)1pTij=1pk=1pEzi1j4Ezi1k4pTi(1pj=1pEzi1j4)2pTi(1pj=1pEzi1j4)pTi1pj=1pEzi1j8C1C2

Thus, for ψ2,

ψ2=1ni=1nψi21ni=1n(γγ)2C1C2=C1C2.

(4) Finally, for δ2,

δ2=ϕ2+ψ22C2+C1C2.

A.4. Proof of Lemma 3.2

Proof. In the proof of Lemma 3.2, here, it is assumed that γ is a column of Πi indexed by ji, for i = 1, …, n (Assumption A4).

(i) First, we prove the consistency of δ^i2.

δ^i2δi2={γSiγμ(γγ)}2E{γSiγμ(γγ)}2={(γSiγ)2E(γSiγ)2}2μ(γγ){(γSiγ)E(γSiγ)}

Under Assumption A4,

γSiγ=1Tit=1Tiγyityitγ=1Tit=1Tizitji2.
(γSiγ)2=1Ti2(t=1Tizitji2)2=1Ti2t=1Tizitji4+1Ti2tszitji2zisji2.
E(γSiγ)2=1Ti2TiEzi1ji4+1TiTi(Ti1)(Ezitji2)2=1TiEzi1ji4+Ti(Ti1)Ti2(γΣiγ)2.

For ∀ ϵ > 0,

{|(γSiγ)E(γSiγ)|ϵ}1ϵ2Var(γSiγ)=1ϵ2[E(γSiγ)2{E(γSiγ)}2]=1ϵ2{1Ti2Ezi1ji4+Ti(Ti1)Ti2(γΣiγ)2(γΣiγ)2}Ti0.
E(γSiγ)4=1Ti4E(t=1Tizitji2)4=1Ti4{tEzitji8+2tsEzitji4zisji4+2uE(ziuji4tszitji2zisji2)+uvtsE(zitji2zisji2ziuji2zivji2)}=1Ti4{TiEzi1ji8+2Ti(Ti1)(Ezi1ji4)2+2Ti2(Ti1)Ezi1ji4(Ezi1ji2)2+Ti2(Ti1)2(Ezi1ji2)4}.
{E(γSiγ)2}2=1Ti2(Ezi1ji4)2+2Ti(Ti1)Ti3Ezi1ji4(γΣiγ)2+Ti2(Ti1)2Ti4(γΣiγ)4.

For ∀ ϵ > 0,

{|(γSiγ)2E(γSiγ)2|ϵ}1ϵ2Var(γSiγ)2=1ϵ2[E(γSiγ)4{E(γSiγ)2}2]=1ϵ2{1Ti3Ezi1ji8+Ti2Ti3(Ezi1ji4)2}Ti0.

Therefore, as Tmin = mini Ti → ∞,

E(δ^i2δi2)20, for i=1,,n, and E(δ^2δ2)20.

(ii) Secondly, prove the consistency of ψ^i2, for i = 1, …, n.

ψ^i2ψi2=1Ti{γSiγexp(xiβ)}2E{γSiγexp(xiβ)}2.
E{γSiγexp(xiβ)}2=E{1Titzitji2exp(xiβ)}2=1Ti2tVar(zitji2)=1TiVar(zi1ji2).
ψ^i2ψi2=1Ti[{γSiγexp(xiβ)}2Var(zi1ji2)]=1Ti[(γSiγ)2Ezi1ji42exp(xiβ){γSiγexp(xiβ)}].

From above derivation and the fact that E(γSiγ)=γΣiγ=exp(xiβ), as Ti → ∞, for ∀ ϵ > 0,

{|(γSiγ)E(γSiγ)|ϵ}0.

As both (γ Siγ)2 and Ezi1ji4 are bounded, then, as Tmin = mini Ti → ∞,

E(ψ^i2ψi2)20, for i=1,,n.

Let ψ˜i2=min(ψ^i2,δ^i2).

ψ˜i2ψi2=min(ψ^i2,δ^i2)ψi2ψ^i2ψi2|ψ^i2ψi2|max(|ψ^i2ψi2|,|δ^i2δi2|).

δi2=ϕi2+ψi2ψi2, then

ψ˜i2ψi2=min(ψ^i2,δ^i2)ψi2=min(ψ^i2ψi2,δ^i2ψi2)min(ψ^i2ψi2,δ^i2δi2)min(|ψ^iψi2|,|δ^i2δi2|)max(|ψ^iψi2|,|δ^i2δi2|).
E(ψ˜i2ψi2)2E{max(|ψ^iψi2|,|δ^i2δi2|)2}E(ψ^i2ψi2)2+E(δ^i2δi2)2.

Therefore, as Tmin = mini Ti → ∞,

E(ψ˜i2ψi2)20, for i=1,,n, and E(ψ^2ψ2)20.

(iii) Lastly, ϕ^i2=δ^i2ψ^i2. The consistency of ϕ^i2 (for i = 1, …, n) and ϕ^2 are straightforward. □

A.5. Proof of Theorem 3.1

In order to prove Theorem 3.1, the following lemma is firstly introduced. This lemma is also used to prove Lemma A.2 in the next section.

Lemma A.1. If ai2 is a sequence of nonnegative random variables (implicitly indexed by Ti) whose expectations converge to zero, for i = 1, …, n, and κ1, κ2 are two nonrandom scalars, and

ai2δ^ik1δiκ22(δ^i2+δi2)a.s.,

then, as Tmin = mini Ti → ∞,

E(ai2δ^iκ1δiκ2)0.

Analogously, if a2 is a sequence of nonnegative random variables (implicitly indexed by Tmin = mini Ti) whose expectations converge to zero, and κ1, κ2 are two nonrandom scalars, and

a2δ^κ1δκ22(δ^2+δ2)  a.s., 

then, as Tmin = mini Ti → ∞,

E(a2δ^κ1δκ2)0.

Proof. For a fixed ϵ > 0, let Ti denote the set of indices Ti such that δi2ϵ/8. In Lemma 3.2, it is proved that E(δ^i2δi2)20. Thus, there exists an integer Ti1 such that ∀ TiTi1,

E|δ^i2δi2|ϵ/4.

For ∀ TiTi1 in the set Ti,

E(ai2δ^iκ1δiκ2)2(Eδ^i2+δi2)2(E|δ^i2δi2|+2δi2)2(ϵ4+2×ϵ8)=ϵ.

Consider the complementary of set Ti, since Eai20, there exists an integer Ti2 such that, ∀ TiTi2,

Ea2ϵκ1+κ2+124κ1+3κ2+1.

δi2 is bounded by 2C2+C1C2. Then, there exists an integer Ti3 such that, for ∀ TiTi3

(|δ^i2δi2|ϵ16)4ϵ16(2C2+C1C2)+ϵ.

Let 1{·} denote the indicator function. For ∀ Ti ≥ max(Ti2, Ti3) outside the set Ti, then

E(ai2δ^iκ1δiκ2)=E(ai2δ^iκ1δiκ21{δ^i2ϵ/16})+E(ai2δ^iκ1δiκ21{δ^i2>ϵ/16})E{2(δ^i2+δi2)1{δ^i2ϵ/16}}+(16ϵ)κ1(8ϵ)κ2E(ai21{δ^i2>ϵ/16})2{(2C2+C1C2)+ϵ16}(|δ^i2δi2|ϵ16)+(16ϵ)κ1(8ϵ)κ2E(ai2)2{(2C2+C1C2)+ϵ16}4ϵ16(2C2+C1C2)+ϵ+(16ϵ)κ1(8ϵ)κ2ϵκ1+κ2+124κ1+3κ2+1ϵ.

Bringing together the results inside and outside the set Ti, for ∀ Ti ≥ max(Ti1, Ti2, Ti3),

E(ai2δ^iκ1δiκ2)ϵ.

The proof of the second part follows the same strategy. □

Now, we prove Theorem 3.1.

Proof. We first prove that Si* is a consistent estimator of Σi*.

Si*Σi*2=maxγ0γ(Si*Σi*)γ2γγ=maxγ01γγ(ϕ^2δ^2ϕ2δ2)(γSiγμγγ)2=maxγ01γγ(ϕ^2δ^2ϕ2δ2)2δ^i2.
1ni=1nSi*Σi*2=maxγ01γγ(ϕ^2δ2ϕ2δ^2)2δ^4δ41ni=1nδ^i2=maxγ01γγ(ϕ^2δ2ϕ2δ^2)2δ^2δ4.

Using the fact that ϕ2δ2 and ϕ^2δ^2,

(ϕ^2δ2ϕ2δ^2)2δ^2δ4δ^22(δ^2+δ2).

In Lemma 3.2, it is shown that E(ϕ^2ϕ2)2 and E(δ^2δ2)2 converge to zero. In addition, Lemma 3.1 shows that ϕ2 and δ2 are bounded. Thus,

E(ϕ^2δ2ϕ2δ^2)2=E{(ϕ^2ϕ2)δ2ϕ2(δ^2δ2)}2δ4E(ϕ^2ϕ2)2+ϕ4E(δ^2δ2)20.

Let a2=(ϕ^2δ2ϕ2δ^2)2, κ1 = 2 and κ2 = 4, then Ea20, and using Lemma A.1,

E(ϕ^2δ2ϕ2δ^2)2δ^2δ40.

Thus,

1ni=1nESi*Σi*20.

And therefore, for ∀ i,

ESi*Σi*20.

For the second statement,

E|Si*Σi2Σi*Σi2|=E|Si*Σi*,Si*+Σi*2Σi|ESi*Σi*2ESi*+Σi*2Σi20.

Therefor,

E{γSi*γexp(xiβ)}2E{γΣi*γexp(xiβ)}20.

A.6. Proof of Theorem 3.2

Before proving Theorem 3.2, we first provide the solution to the optimization problem (3.4). Let

f(ρ1,ρ2)=1ni=1n{γ(ρ1I+ρ2Si)γexp(xiβ)}2.
fρ1=1ni=1n2(γγ){ρ1γγ+ρ2γSiγexp(xiβ)}=0
fρ2=1ni=1n2(γSiγ){ρ1(γγ)+ρ2(γSiγ)exp(xiβ)}=0.
ρ2=i(γSiγ)exp(xiβ)/n(iγSiγ/n)(iexp(xiβ)/n)i(γSiγ)2/n(iγSiγ/n)2
ρ1=1γγ{1ni=1nexp(xiβ)1ni=1nρ2(γSiγ)}=1γγ{(iγSiγ/n)(i(γSiγ)exp(xiβ)/n)i(γSiγ)2/n(iγSiγ/n)2(iexp(xiβ)/n)(i(γSiγ)2/n)i(γSiγ)2/n(iγSiγ/n)2}.

In order to prove Theorem 3.2, the following lemma is introduced.

Lemma A.2. For given (γ, β), let Tmin = mini Ti, as Tmin → ∞, fori ∈ {1, …, n},

E(|ϕ^i2ψ^i2δ^i2ϕi2ψi2δi2|)0.

Then, as n, Tmin → ∞,

E(|ϕ^2ψ^2δ^2ϕ2ψ2δ2|)0.

Proof.

ϕ^i2ψ^i2δ^i2ϕi2ψi2δi2=ϕ^i2ψ^i2δi2ϕi2ψi2δ^i2δ^i2δi2.

Let ai2=|ϕ^i2ψ^i2δi2ϕi2ψi2δ^i2|, κ1 = 2 and κ2 = 2. First need to verify the assumptions in Lemma A.1.

|ϕ^i2ψ^i2δ^i2ϕi2ψi2δi2|ϕ^i2ψ^i2δ^i2+ϕi2ψi2δi2ϕ^i2+ϕi2δ^i2+δi22(δ^i2+δi2),  a.s.. 

Furthermore,

E(|ϕ^i2ψ^i2δi2ϕi2ψi2δ^i2|)=E{|(ϕ^i2ψ^i2ϕi2ψi2)δi2ϕi2ψi2(δ^i2δi2)|}=E{|(ϕ^i2ϕi2)(ψ^i2ψi2)δi2+ϕi2(ψ^i2ψi2)δi2+(ϕ^i2ϕi2)ψi2δi2ϕi2ψi2(δ^i2δi2)|}E(ϕ^i2ϕi2)2E(ψ^i2ψi2)2δi2+ϕi2E|ψ^i2ψi2|δi2+E|ϕ^i2ϕi2|ψi2δi2ϕi2ψi2E|δ^i2δi2|.

The right-hand side converges to zero. Therefore, Eai20, conditions in Lemma A.1 are satisfied. Therefore,

E|ϕ^i2ψ^i2δ^i2ϕi2ψi2δi2|0.

Analogously, it can be shown that

E|ϕ^2ψ^2δ^2ϕ2ψ2δ2|0.

Next, we prove Theorem

Proof. Let αi = (γΣiγ)(γSiγ) − (γγ)}2 and α=i=1nαi/n. E(αi)=exp2(xiβ)μ2(γγ)2, then

Eα=1ni=1nexp2(xiβ)μ2(γγ)=ϕ2.

First, need to prove that αϕ2 converges to zero in quadratic mean.

Var(αi)=Var{(γΣiγ)(γSiγ)μ2(γγ)2}=Var{(γΣiγ)(γSiγ)}+Var{μ2(γγ)2}2Cov{(γΣiγ)(γSiγ),μ2(γγ)2}=Var{(γΣiγ)(γSiγ)}.
(γΣiγ)(γSiγ)=λiji(1Tit=1Tizitji2).
Var{(γΣiγ)(γSiγ)}=Var{1Tit=1Tiλijizitji2}=1TiVar(λijizi1ji2)1TiE(λijizi1ji2)21TiEλiji2zi1ji41Ti(Ezi1ji2)2Ezi1ji41Ti(Ezi1ji4)21TiEzi1ji8C2Ti.
Var(α)=1n2i=1nVar(αi)C2n2i=1n1Ti0, as Tmin=min iTi.

This proves that αϕ2 converges to 0 in quadratic mean. In the following, we prove that Si* is a consistent estimator of Σi**.

Si*=ψ^2δ^2μI+ϕ^2δ^2Si=δ^2ψ^2δ^2μI+ϕ^2δ^2Si=μI+ϕ^2δ^2(SiμI).Σi**=ρ1I+ρ2Si=(ρ1+ρ2μ)I+ρ2(SiμI).
1ni=1nSi*Σi**2=1ni=1n(μρ1ρ2μ)I+(ϕ^2δ^2ρ2)(SiμI)2=1ni=1n{maxγ01γγ(μρ1ρ2μ)(γγ)+(ϕ^2δ^2ρ2)(γSiγμ(γγ))2}maxγ0{(μρ1ρ2μ)2(γγ)+1γγ(ϕ^2δ^2ρ2)2δ^i2+2(μρ1ρ2μ)(ϕ^2δ^2ρ2)(1ni=1nγSiγμ(γγ))}.(μρ1ρ2μ)2=(iγSiγ/niexp(xiβ)/n)2(γγ)2{(iγSiγ/n)2i(γSiγ)2/n}2.{(iγSiγ/n)(iexp(xiβ)/n)i(γSiγ)exp(xiβ)/n}2(γγ)2{(iγSiγ/n)2i(γSiγ)2/n}2
E{1niγSiγ1niexp(xiβ)}2=1n2iE{γSiγexp(xiβ)}2+1n2iiE{γSiγexp(xiβ)}{γSiγexp(xiβ)}.
E{γSiγexp(xiβ)}2=E{γSiγE(γSiγ)}2=Var(γSiγ)=1TiEzi1ji4+Ti(Ti1)Ti2(γΣiγ)2(γΣiγ)2Ti0.

It is assumed that the samples/subjects are independent, therefore,

E{γSiγexp(xiβ)}{γSiγexp(xiβ)}=0.

Thus,

E{1niγSiγ1niexp(xiβ)}20, as Tmin.
E{(1niγSiγ)(1niexp(xiβ))1ni(γSiγ)exp(xiβ)}2E(1niγSiγ)2(1niexp(xiβ))2+E{1ni(γSiγ)exp(xiβ)}2.
E(1niγSiγ)2=1n2iE(γSiγ)2+1n2iiE(γSiγ)(γSiγ)=1n2i{1Ti2Ezitji4+1Ti2tsEzitji2zisji2}+1n2ii(1Ti2t=1TiEzitji2)(1Ti2t=1TiEzitji2)=1n2i{1TiEzi1ji4+Ti(Ti1)Ti2(γΣiγ)2}+1n2ii(1Ti(γΣiγ))(1Ti(γΣiγ))Tmin1n2i(γΣiγ)2.
E{1ni(γSiγ)exp(xiβ)}2=1n2iiE(γSiγ)2exp2(xiβ)+1n2iiE(γSiγ)exp(xiβ)E(γSiγ)exp(xiβ)=1n2i{1TiEzitji4+Ti(Ti1)Ti2(γΣiγ)2}(γΣiγ)2+1n2ii(γΣiγ)2(γΣiγ)2Tmin1n2i(γΣiγ)4+1n2ii(γΣiγ)2(γΣiγ)2.
E{(1niγSiγ)(1niexp(xiβ))1ni(γSiγ)exp(xiβ)}2E(1niγSiγ)2(1niexp(xiβ))2+E{1ni(γSiγ)exp(xiβ)}2Tmin1n2i(γΣiγ)2(1ni(γΣiγ))2+1n2i(γΣiγ)4+1n2ii(γΣiγ)2(γΣiγ)2.

The above quantity on the right is bounded by a constant from above. Therefore, as Tmin → ∞,

(μρ1ρ2μ)20.
(ϕ^2δ^2ρ2)2=(ϕ^2δ^2ϕ2δ^2)2+(ϕ2δ^2αδ^2)2+(αδ^2ρ2)2.

Since δ^4 is bounded,

E(ϕ^2ϕ2)20E(ϕ^2δ^2ϕ2δ^2)20.
E(ϕ2α)20E(ϕ2δ^2αδ^2)20.

Let ρ2=ρ2(1)/ρ2(2), where

ρ2(1)=1ni(γSiγ)exp(xiβ)(1niγSiγ)(1niexp(xiβ)),ρ2(2)=1ni(γSiγ)2(1niγSiγ)2.
E(αρ2(1))2=(1niexp(xiβ))2E{1ni(γSiγ)1niexp(xiβ)}20.
δ^2=1ni=1n{γSiγμ(γγ)}2=1ni=1n(γSiγ)22(1ni=1nγSiγ)(1ni=1nexp(xiβ))+(1ni=1nexp(xiβ))2.

It can be concluded that as Tmin → ∞,

E(δ^ρ22)2=E{1ni=1n(γSiγ)1ni=1nexp(xiβ)}20,

and

E(ϕ^2δ^2ρ2)20.E{1ni=1nSi*Σi**2}0,ESi*Σi**20.

This implies that

E{γSi*γexp(xiβ)}2E{γΣi**γexp(xiβ)}20.

A.7. Proof of Theorem 3.3

Proof. For the first statement,

limTmininfTiTmin[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γSi*γexp(xiβ)}2]inf[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γΣi**γexp(xiβ)}2]+lim[1ni=1nE{γΣi**γexp(xiβ)}21ni=1nE{γSi*γexp(xiβ)}2].

By Theorem 3.2, the second term on the right converges to zero, and the first term is ≥ 0 by the definition of Σi**.

For the second statement,

limTmin[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γSi*γexp(xiβ)}2]=0limTmin[1ni=1nE{γΣ^iγexp(xiβ)}21ni=1nE{γΣi**γexp(xiβ)}2]=0limTminE{γΣ^iγexp(xiβ)}2E{γΣi**γexp(xiβ)}2=0limTminEγΣ^iγγΣi**γ2=0limTminEγΣ^iγγSi*γ2=0limTminEΣ^iSi*2=0.

This finishes the proof of this theorem.□

A.8. Si* is well-conditioned

In this section, we show that the proposed estimator Si* is well-conditioned and thus, invertible. This is achieved by two steps: for i = 1, …, n, (1) prove that the largest eigenvalue of Si* is bounded in probability; (2) prove that the smallest eigenvalue of Si* is bounded away from zero in probability. The proof follows the same strategy as in Ledoit and Wolf [24], but considers the case with multiple covariance matrices.

The covariance matrix Σi has the eigendecomposition as Σi=ΠiΛiΠi. Let Ui=Λi1/2Yi. Denote λmax(A) and λmin(A) as the maximum and minimum eigenvalue of a matrix A, respectively.

λmax(Si*)=λmax(ψ^2δ^2μI+ϕ^2δ^2Si)=ψ^2δ^2μ+ϕ^2δ^2λmax(Si).
μ=1ni=1nexp(xiβ)=1ni=1nλijimax iλmax(Λi).
λmax(Si)=λmax(1TiΛ1/2UiUiΛi1/2)λmax(1TiUiUi)λmax(Λi)λmax(1TiUiUi)max iλmax(Λi).

Assume that p/Tmax converges to a limit, denoted as c. Based on Assumption A1, cC1. Based on the results in Yin et al. [39], as Tmin = mini Ti → ∞, for i = 1, …, n,

lim  λmax(1TiUiUi)=(1+c)2,  a.s.

This implies that

{λmax(Si*)(1+c)2maxiλmax(Λi)}1,

and

{λmax(Si*)(1+C1)2maxiλmax(Λi)}1.

Therefore, if p/Tmax converges to a constant, the largest eigenvalue of Si* is bounded in probability. If p/Tmax has no limit, under Assumption A1, there exists a subsequence such that p/Tmax converges. Along this sequence, the largest eigenvalue of Si* is bounded in probability. This is true for any converging sequence, and in addition, the upper bound is independent of the particular subsequence. As a result, it holds for the whole sequence.

Next, we show that the smallest eigenvalue of Si* is bounded away from zero in probability. Analogously, we have

λmin(Si)=λmin(1TiΛ1/2UiUiΛi1/2)λmin(1TiUiUi)λmin(Λi)λmin(1TiUiUi)miniλmin(Λi).

First, assume p/Tmax converges to a constant c. If c ∈ (0, 1), based on the results in Bai and Yin [4],

lim  λmin(1TiUiUi)=(1c)2,   a.s.

Assume c ≤ 1 − κ for some κ ∈ (0, 1). One can conclude that

{λmin(Si*)(11κ)2miniλmin(Λi)}1.

When c > 1 − κ, we propose to identify a lower bound from the following

λmin(Si*)=λmin(ψ^2δ^2μI+ϕ^2δ^2Si)ψ^2δ^2μ.

Compare the right-hand side in the above to it population counterpart,

ψ^2δ^2μψ2δ2μ=μ{ψ^2ψ2δ2+ψ^2(1δ^21δ2)}.

From Lemmas 3.1 and 3.2, we can show that the above converges to zero in probability. First, consider ψ2=i=1nψi2/n, where ψi2=E{γ(SiΣi)γ}2. From the proof of Lemma 3.1,

ESiΣi2=1pTij=1pk=1pE(zi1j2zi1k2)1pTij=1pk=1pλijk2=pTi{1p2j=1pk=1pE(zi1j2zi1k2)}1pTij=1pλijj2.

As Tmin → ∞, the second term on the right-hand side converges to zero. For ϵ > 0, there exists a constant M > 0 such that when Tmin > M, j=1pλijj2/(pTi)<ϵ. Thus, ψi2(1κ)ϵ and ψ2 ≥ (1 − κ) − ϵ.

λmin(Si*)ψ^2δ^2μ=ψ2δ2μ+(ψ^2δ^2μψ2δ2μ)ψ2δ2μϵψ22C2+C1C2ϵ(1κ)ϵ2C2+C1C2ϵ.

For a choice of ϵ, we have

{λmin(Si*)1κ2(2C2+C1C2)}1.

Therefore, for both c ≤ 1 − κ and c > 1 − κ, the smallest eigenvalue of Si* is bounded away from zero. Analogous to the proof of the largest eigenvalue, for the case that p/Tmax does not have a limit, we can also have the conclusion for the whole sequence. Since both the largest and the smallest eigenvalues are bounded, Si* is well-conditioned and invertible.

A.9. Proof of Lemma 3.3 and Theorem 3.4

We first prove Lemma 3.3.

Proof.

\[
\mathrm{E}(\gamma^{\top}\Sigma_{i}^{*}\gamma)=\frac{\psi^{2}}{\delta^{2}}\mu(\gamma^{\top}\gamma)+\frac{\phi^{2}}{\delta^{2}}\mathrm{E}(\gamma^{\top}S_{i}\gamma)=\frac{\psi^{2}}{\delta^{2}}\mu(\gamma^{\top}\gamma)+\frac{\phi^{2}}{\delta^{2}}\exp(x_{i}^{\top}\beta)=\exp(x_{i}^{\top}\beta^{*}),
\]
\[
\frac{\sum_{i}\exp(x_{i}^{\top}\beta^{*})/n}{\sum_{i}\exp(x_{i}^{\top}\beta)/n}=\frac{\psi^{2}\mu(\gamma^{\top}\gamma)/\delta^{2}}{\sum_{i}\exp(x_{i}^{\top}\beta)/n}+\frac{\phi^{2}}{\delta^{2}}=\frac{\psi^{2}}{\delta^{2}}+\frac{\phi^{2}}{\delta^{2}}=1,
\]
\[
\frac{1}{n}\sum_{i=1}^{n}\exp(x_{i}^{\top}\beta^{*})=\frac{1}{n}\sum_{i=1}^{n}\exp(x_{i}^{\top}\beta).
\]

Therefore,

β*=β.

Next, we prove that the proposed estimator β̂ is a consistent estimator of β (Theorem 3.4).

Proof. Using the consistency of the pseudo maximum likelihood estimator [16] and the conclusion in Lemma 3.3, β̂ is a consistent estimator of β. □
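For intuition about the pseudo-likelihood step, the sketch below estimates β for a fixed projection by minimizing a working normal log-likelihood of the form Σi Ti{xi⊤β + exp(−xi⊤β) γ⊤Si*γ} (up to constants and a factor of one half); the simulated projected variances, the coefficient values, and the optimizer call are illustrative, and the paper's exact estimating procedure may differ in details.

```python
# A minimal sketch of a pseudo-likelihood fit of beta given a fixed projection:
# conditioning on projected variances v_i (stand-ins for gamma' S_i* gamma), the
# working normal likelihood gives the objective sum_i T * {x_i'beta + exp(-x_i'beta) * v_i}.
# Data generation and the BFGS call are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, q, T = 200, 2, 100
beta_true = np.array([0.5, 1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
proj_var = np.array([                                    # stand-ins for gamma' S_i* gamma
    np.mean(rng.normal(scale=np.exp(X[i] @ beta_true / 2), size=T) ** 2)
    for i in range(n)
])

def neg_loglik(beta):
    eta = X @ beta
    return np.sum(T * (eta + np.exp(-eta) * proj_var))

beta_hat = minimize(neg_loglik, x0=np.zeros(q), method="BFGS").x
print("beta_hat =", np.round(beta_hat, 3))   # approaches beta_true as n, T grow
```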

Appendix B: Additional Simulation Results

B.1. γ unknown

Here, we present the performance of estimating the fourth dimension (D4) when γ is unknown (Figure B.1). From the figure, as n and T increase, the estimates of the covariance matrices, the projection, and the model coefficient converge to the truth.

B.2. Model misspecification

B.2.1. Model misspecification in β

In this section, we examine the performance of the proposed approach when the log-linear model (1.1) is misspecified. For illustration, we consider the case where the data dimension is p = 20, the sample size is n = 100, and Ti = T = 100. Two scenarios are considered. In the first scenario, the true model has two correlated covariates generated from a bivariate normal distribution with mean zero, standard deviation 0.5, and correlation 0.2:

\[
\log(\lambda_{ij})=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}. \tag{B.1}
\]

In D2, |β1| = |β2| and in D4, |β1| = 2|β2|. Under the misspecified case, the second covariate, xi2, is ignored. Table B.1 presents the results using the proposed approach.
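A hedged sketch of this first scenario is given below: two correlated covariates drive the log eigenvalue as in (B.1), and the misspecified fit drops xi2, producing the omitted-variable bias reported in Table B.1. The eigenstructure, coefficient values, and the simple per-component least-squares fit are illustrative assumptions, not the full PS-CAP procedure.

```python
# A hedged sketch of the first misspecification scenario: two correlated
# covariates drive the log eigenvalue as in (B.1); the misspecified fit drops x_i2.
# Coefficient values and the simple least-squares fit are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, T = 100, 100
sd, rho = 0.5, 0.2
cov_x = sd**2 * np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov_x, size=n)      # (x_i1, x_i2)

beta0, beta1, beta2 = 0.0, 1.0, 0.5                           # illustrative coefficient values
log_lam = beta0 + beta1 * X[:, 0] + beta2 * X[:, 1]           # model (B.1) for one component

# Observed projected variance for each subject (target eigenvalue + sampling noise).
y_var = np.array([
    np.mean(rng.normal(scale=np.exp(log_lam[i] / 2), size=T) ** 2)
    for i in range(n)
])

# Misspecified fit: regress the log variance on x_i1 only.
Z_mis = np.column_stack([np.ones(n), X[:, 0]])
b_mis = np.linalg.lstsq(Z_mis, np.log(y_var), rcond=None)[0]
print("misspecified beta1 estimate:", round(b_mis[1], 3))     # biased relative to beta1
```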

Fig B.1.

Estimation performance of PS-CAP in estimating the fourth dimension (D4) when γ is unknown. For β̂1, (a) bias, (b) mean squared error (MSE), and (c) coverage probability (CP) are presented, where CP is obtained from 500 bootstrap samples. For the eigenvalues λ̂ij, (d) MSE is presented. For γ̂, (e) the similarity to π4 is presented. Data dimension p = 100. Sample sizes vary over n = 50, 100, 500, 1000 and Ti = T = 50, 100, 500, 1000.

In the second scenario, the following log-linear model for the eigenvalues, which includes an interaction between the covariates, is considered:

\[
\log(\lambda_{ij})=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}(x_{i1}\times x_{i2}), \tag{B.2}
\]

where xi1 is generated from a Bernoulli distribution with probability 0.5 of being one and xi2 is generated from a normal distribution with mean zero and standard deviation 0.5. Table B.2 presents the estimation results using the proposed method. Under the misspecified case, the interaction between the two covariates is ignored; thus, the table contains no estimate of β3.

From both tables, under either the correctly specified or the misspecified model, the proposed approach correctly identifies the components related to the covariates. Under the misspecified model, the estimate of β is biased.
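The following sketch mimics the second scenario: the log eigenvalue follows (B.2) with a Bernoulli covariate, a normal covariate, and their interaction, and the misspecified fit omits the interaction term. The coefficient values and the simple least-squares fits are illustrative, not the paper's procedure.

```python
# A hedged sketch of the second scenario (B.2): a Bernoulli covariate, a normal
# covariate, and their interaction drive the log eigenvalue; the misspecified
# fit omits the interaction. Coefficient values and fits are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, T = 100, 100
x1 = rng.binomial(1, 0.5, size=n)
x2 = rng.normal(scale=0.5, size=n)
beta = np.array([0.0, 1.0, 0.5, 0.8])                      # illustrative (beta0, beta1, beta2, beta3)
log_lam = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x1 * x2

y_var = np.array([
    np.mean(rng.normal(scale=np.exp(log_lam[i] / 2), size=T) ** 2) for i in range(n)
])

Z_full = np.column_stack([np.ones(n), x1, x2, x1 * x2])
Z_mis = Z_full[:, :3]                                      # interaction ignored
b_full = np.linalg.lstsq(Z_full, np.log(y_var), rcond=None)[0]
b_mis = np.linalg.lstsq(Z_mis, np.log(y_var), rcond=None)[0]
print("full fit:", np.round(b_full, 3), " misspecified fit:", np.round(b_mis, 3))
```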

B.2.2. Model misspecification in γ

In this section, we discuss the robustness of the proposed approach to violations of the assumption that all the covariance matrices share the same eigenspace. One advantage of the proposed shrinkage estimator is that it does not change the eigenvectors relative to the sample covariance matrix. In Section E.6 of the supplementary materials of Zhao et al. [41], performance under a partial common diagonalization assumption was examined through a simulation study in which two eigencomponents are shared across subjects and the rest are unique to each subject. That method correctly identifies the shared component that is related to the covariates. Since the approach proposed in this study preserves the eigenstructure, under partial common diagonalization it will also correctly identify the common component that is related to the covariates.

Here, we also consider a case in which each covariance matrix has a unique eigenspace; that is, the covariance matrices are generated using the eigendecomposition Σi = ΠiΛiΠi⊤, where Πi is an orthonormal matrix in ℝp×p, for i = 1, …, n. The remaining simulation settings are the same as in Section 4.2. Table B.3 presents the results when p = 20, n = 100, and Ti = T = 100. For the estimation of γ, we compare with the average of the subject-specific eigenvectors (after scaling to unit 2-norm). For D2, though the correlation between the estimated γ and this average is 0.915, the estimate of β1 is off. For D4, both the estimate of γ and the estimate of β1 are far from the truth. Therefore, the assumption of partial common diagonalization is essential for the proposed framework.
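For concreteness, the snippet below sketches the comparison metric used in Table B.3: subject-specific eigenvectors of the targeted component are sign-aligned, averaged, rescaled to unit 2-norm, and compared with an estimated projection through the absolute inner product. The random eigenbases, the placeholder γ̂, and the component index are purely illustrative.

```python
# A hedged sketch of the Table B.3 comparison metric: average the subject-specific
# eigenvectors of the targeted component (after sign alignment), rescale to unit
# 2-norm, and compute |<gamma_hat, pi_bar>|. All inputs here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(5)
n, p, j = 100, 20, 3                                  # j: index of the targeted component
Pis = [np.linalg.qr(rng.standard_normal((p, p)))[0] for _ in range(n)]  # subject eigenbases

# Align signs before averaging (eigenvectors are defined only up to sign).
ref = Pis[0][:, j]
vecs = np.array([Pi[:, j] * np.sign(Pi[:, j] @ ref) for Pi in Pis])
pi_bar = vecs.mean(axis=0)
pi_bar /= np.linalg.norm(pi_bar)                      # scale to unit 2-norm

gamma_hat = rng.standard_normal(p)                    # placeholder for the estimated projection
gamma_hat /= np.linalg.norm(gamma_hat)
similarity = abs(gamma_hat @ pi_bar)                  # |<gamma_hat, pi_bar_j>|
print("similarity:", round(similarity, 3))
```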

Appendix C: Additional Analysis of the ADNI Study

C.1. Validity of model assumptions

In resting-state fMRI studies, the output data are generally considered normally distributed. Within each time course, the data are temporally correlated up to at most lag two [26]. Thus, we subsample the data to remove the temporal correlation. Figure C.1 presents the normal Q-Q plot and the histogram of the data extracted from one brain region of one subject. From the figure, the marginal distribution is close to normal; thus, the normality assumption is reasonably satisfied.
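The sketch below illustrates this preprocessing step on a synthetic time course whose correlation vanishes beyond lag two: keeping every third time point removes the short-range dependence, and the retained values can then be checked for normality. The MA(2) coefficients and the Shapiro-Wilk check are illustrative choices, not the paper's pipeline.

```python
# A hedged sketch of the subsampling step: for a series with correlation of at
# most lag two (here a synthetic MA(2) process), keeping every third time point
# yields approximately independent values, which can then be checked for normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T = 600
e = rng.normal(size=T + 2)
y = e[2:] + 0.4 * e[1:-1] + 0.2 * e[:-2]      # MA(2): correlation vanishes beyond lag 2
y_sub = y[::3]                                # keep every third acquisition
lag1 = np.corrcoef(y_sub[:-1], y_sub[1:])[0, 1]
print("lag-1 autocorrelation after subsampling:", round(lag1, 3))
print("Shapiro-Wilk normality p-value:", round(stats.shapiro(y_sub).pvalue, 3))
```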

In Section 3, five assumptions are imposed to achieve estimation consistency of the parameters. By setting C1 = 2, Assumption A1 is satisfied. In the fMRI dataset, the total number of observations across subjects (i.e., N = T1 + ⋯ + Tn) can increase faster than the number of variables (p); thus, Assumption A2 can be satisfied. Under the normality assumption, the eighth-order moment exists, and Assumptions A3 and A4 are valid. Assumption A5 concerns the population eigenvalues. We cannot easily assess this assumption using sample covariance matrices because of their large bias under the high-dimensional setting [22]; we can therefore provide only an empirical examination, whose results should be interpreted with caution due to this bias. To empirically assess the validity of Assumption A5, we first calculate the average sample covariance matrix and then compare its eigenvectors with the eigenvectors of each individual's sample covariance matrix. When the correlation between two eigenvectors is greater than 0.5, we declare a high similarity, allowing for variability and bias in the sample eigenvectors. About 67% of the eigenvectors show high similarity across subjects. Since the individual sample covariance matrices are rank-deficient, the eigenvectors are not unique; with about 67% overlap, the assumption of a common eigenstructure is partially satisfied. In addition, as discussed in Section B.2.2, a partially common eigenstructure does not impact the identification of the common components that are related to the covariates. The proposed approach identifies three components based on the metric of average deviation from diagonality, suggesting that these three components commonly diagonalize the covariance matrices. Finally, the assumption that the log-linear model is correctly specified is challenging to validate using data alone; the current model is specified based on domain knowledge and the study interest of AD research.
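The following snippet sketches the empirical check of Assumption A5 described above: it compares the eigenvectors of the average sample covariance matrix with each subject's sample eigenvectors, flags a pair as highly similar when the absolute inner product (a stand-in for the correlation used in the text) exceeds 0.5, and reports the overall proportion. The synthetic data share an exact common population eigenbasis, so only sampling variability drives mismatches; the matching rule is an illustrative assumption.

```python
# A hedged sketch of the empirical assessment of a common eigenstructure: compare
# eigenvectors of the average sample covariance with each subject's sample
# eigenvectors and report the proportion with absolute inner product > 0.5.
import numpy as np

rng = np.random.default_rng(7)
n, p, T = 30, 40, 60
Pi = np.linalg.qr(rng.standard_normal((p, p)))[0]       # common population eigenbasis
S_list = []
for _ in range(n):
    lam = rng.uniform(0.5, 3.0, size=p)
    Y = rng.standard_normal((T, p)) @ (Pi * np.sqrt(lam)).T   # rows ~ N(0, Pi diag(lam) Pi')
    S_list.append(Y.T @ Y / T)

S_bar = np.mean(S_list, axis=0)
V_bar = np.linalg.eigh(S_bar)[1]                        # eigenvectors of the average covariance

high_sim = []
for S in S_list:
    V = np.linalg.eigh(S)[1]
    # For each average eigenvector, find its best-matching subject eigenvector.
    sims = np.abs(V_bar.T @ V).max(axis=1)
    high_sim.append(sims > 0.5)
print("proportion of highly similar eigenvectors:", round(float(np.mean(high_sim)), 2))
```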

Table B.1.

Bias and mean squared error (MSE) in estimating β, and the similarity of γ̂ to πj with the standard error (SE), for j = 2, 4, under the misspecified and correctly specified versions of model (B.1). Data dimension p = 20, sample size n = 100, and Ti = T = 100.

                          β̂1              β̂2               γ̂
                          Bias     MSE    Bias     MSE     |⟨γ̂, πj⟩| (SE)
D2   Misspecified         0.105    0.014  -        -       0.994 (0.003)
     Correctly specified  0.002    0.001  <0.001   0.001   0.993 (0.003)
D4   Misspecified         −0.081   0.008  -        -       0.991 (0.004)
     Correctly specified  −0.015   0.001  0.008    0.001   0.983 (0.009)

Table B.2.

Bias and mean squared error (MSE) in estimating β, and the similarity of γ̂ to πj with the standard error (SE), for j = 2, 4, under the misspecified and correctly specified versions of model (B.2). Data dimension p = 20, sample size n = 100, and Ti = T = 100.

                          β̂1              β̂2               β̂3              γ̂
                          Bias     MSE    Bias     MSE     Bias     MSE    |⟨γ̂, πj⟩| (SE)
D2   Misspecified         <0.001   0.002  −0.252   0.066   -        -      0.993 (0.003)
     Correctly specified  0.002    0.001  −0.002   0.002   −0.001   0.003  0.993 (0.003)
D4   Misspecified         −0.010   0.001  −0.113   0.014   -        -      0.987 (0.006)
     Correctly specified  −0.010   0.001  0.015    0.002   −0.005   0.003  0.988 (0.006)

Table B.3.

Bias and mean squared error (MSE) in estimating β1, and the similarity of γ̂ to the average of πij (denoted π̄j) with the standard error (SE), for j = 2, 4, when each covariance matrix has a unique eigenspace. Data dimension p = 20, sample size n = 100, and Ti = T = 100.

          β̂1                              γ̂
     Truth    Bias      MSE     |⟨γ̂, π̄j⟩| (SE)
D2   −1       0.980     0.971   0.915 (0.031)
D4   1        −0.523    0.293   0.571 (0.060)

Fig C.1.

Normal Q-Q plot and histogram of the data extracted from one brain region of one subject.

Contributor Information

Yi Zhao, Department of Biostatistics and Health Data Science, Indiana University School of Medicine.

Brian Caffo, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health.

Xi Luo, Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston.

References

[1] Anderson T (1973). Asymptotically efficient estimation of covariance matrices with linear structure. The Annals of Statistics 1 135–141.
[2] Anderson TW (1963). Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics 34 122–148.
[3] Badhwar A, Tam A, Dansereau C, Orban P, Hoffstaedter F and Bellec P (2017). Resting-state network dysfunction in Alzheimer’s disease: a systematic review and meta-analysis. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring 8 73–85.
[4] Bai Z and Yin Y (1993). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability 1275–1294.
[5] Boik RJ (2002). Spectral models for covariance matrices. Biometrika 89 159–182.
[6] Cai TT, Ren Z and Zhou HH (2016). Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics 10 1–59.
[7] Chen AA, Beer JC, Tustison NJ, Cook PA, Shinohara RT and Shou H (2020). Removal of scanner effects in covariance improves multivariate pattern analysis in neuroimaging data. bioRxiv 858415.
[8] Chen Y, Wiesel A and Hero AO (2011). Robust shrinkage estimation of high-dimensional covariance matrices. IEEE Transactions on Signal Processing 59 4097–4107.
[9] Chiu TY, Leonard T and Tsui K-W (1996). The matrix-logarithmic covariance model. Journal of the American Statistical Association 91 198–210.
[10] Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, Small G, Roses AD, Haines J and Pericak-Vance MA (1993). Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer’s disease in late onset families. Science 261 921–923.
[11] Daniels MJ and Kass RE (2001). Shrinkage estimators for covariance matrices. Biometrics 57 1173–1184.
[12] De Marco M and Venneri A (2017). ApoE-dependent differences in functional connectivity support memory performance in early-stage Alzheimer’s disease (P4.094). Neurology 88.
[13] Flury BN (1984). Common principal components in k groups. Journal of the American Statistical Association 79 892–898.
[14] Fox EB and Dunson DB (2015). Bayesian nonparametric covariance regression. Journal of Machine Learning Research 16 2501–2542.
[15] Franks AM and Hoff P (2019). Shared subspace models for multi-group covariance estimation. Journal of Machine Learning Research 20 1–37.
[16] Gong G and Samaniego FJ (1981). Pseudo maximum likelihood estimation: theory and applications. The Annals of Statistics 861–869.
[17] Gour N, Felician O, Didic M, Koric L, Gueriot C, Chanoine V, Confort-Gouny S, Guye M, Ceccaldi M and Ranjeva JP (2014). Functional connectivity changes differ in early and late-onset Alzheimer’s disease. Human Brain Mapping 35 2978–2994.
[18] Gour N, Ranjeva J-P, Ceccaldi M, Confort-Gouny S, Barbeau E, Soulier E, Guye M, Didic M and Felician O (2011). Basal functional connectivity within the anterior temporal network is associated with performance on declarative memory tasks. NeuroImage 58 687–697.
[19] Grosenick L, Klingenberg B, Katovich K, Knutson B and Taylor JE (2013). Interpretable whole-brain prediction analysis with GraphNet. NeuroImage 72 304–321.
[20] Hoff PD (2009). A hierarchical eigenmodel for pooled covariance estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 971–992.
[21] Hoff PD and Niu X (2012). A covariance regression model. Statistica Sinica 22 729–753.
[22] Johnstone IM and Lu AY (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104.
[23] Koch W, Teipel S, Mueller S, Benninghoff J, Wagner M, Bokde AL, Hampel H, Coates U, Reiser M and Meindl T (2012). Diagnostic power of default mode network resting state fMRI in the detection of Alzheimer’s disease. Neurobiology of Aging 33 466–478.
[24] Ledoit O and Wolf M (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88 365–411.
[25] Ledoit O and Wolf M (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics 40 1024–1060.
[26] Lindquist MA (2008). The statistical analysis of fMRI data. Statistical Science 23 439–464.
[27] Mejia AF, Nebel MB, Barber AD, Choe AS, Pekar JJ, Caffo BS and Lindquist MA (2018). Improved estimation of subject-level functional connectivity using full and partial correlation with empirical Bayes shrinkage. NeuroImage 172 478–491.
[28] Pascal F, Chitour Y and Quek Y (2014). Generalized robust shrinkage estimator and its application to STAP detection problem. IEEE Transactions on Signal Processing 62 5640–5651.
[29] Pervaiz U, Vidaurre D, Woolrich MW and Smith SM (2020). Optimising network modelling methods for fMRI. NeuroImage 211 116604.
[30] Pourahmadi M, Daniels MJ and Park T (2007). Simultaneous modelling of the Cholesky decomposition of several covariance matrices. Journal of Multivariate Analysis 98 568–587.
[31] Rahim M, Thirion B and Varoquaux G (2017). Population-shrinkage of covariance to estimate better brain functional connectivity. In International Conference on Medical Image Computing and Computer-Assisted Intervention 460–468. Springer.
[32] Rahim M, Thirion B and Varoquaux G (2019). Population shrinkage of covariance (PoSCE) for better individual brain functional-connectivity estimation. Medical Image Analysis 54 138–148.
[33] Safieh M, Korczyn AD and Michaelson DM (2019). ApoE4: an emerging therapeutic target for Alzheimer’s disease. BMC Medicine 17 1–17.
[34] Seiler C and Holmes S (2017). Multivariate heteroscedasticity models for functional brain connectivity. Frontiers in Neuroscience 11 696.
[35] Smith SM, Jenkinson M, Woolrich MW, Beckmann CF, Behrens TE, Johansen-Berg H, Bannister PR, De Luca M, Drobnjak I, Flitney DE et al. (2004). Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage 23 S208–S219.
[36] Tibshirani R, Saunders M, Rosset S, Zhu J and Knight K (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 91–108.
[37] Tyler DE (1987). A distribution-free M-estimator of multivariate scatter. The Annals of Statistics 234–251.
[38] Varoquaux G, Gramfort A, Poline J-B and Thirion B (2010). Brain covariance selection: better individual functional connectivity models using population prior. In Advances in Neural Information Processing Systems 2334–2342.
[39] Yin Y-Q, Bai Z-D and Krishnaiah PR (1988). On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields 78 509–521.
[40] Zhao Y, Lindquist MA and Caffo BS (2020). Sparse principal component based high-dimensional mediation analysis. Computational Statistics & Data Analysis 142 106835.
[41] Zhao Y, Wang B, Mostofsky SH, Caffo BS and Luo X (2021). Covariate assisted principal regression for covariance matrix outcomes. Biostatistics 22 629–645.
[42] Zou H, Hastie T and Tibshirani R (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 265–286.
[43] Zou T, Lan W, Wang H and Tsai C-L (2017). Covariance regression analysis. Journal of the American Statistical Association 112 266–281.
