Author manuscript; available in PMC: 2019 May 23.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2018 Jun 25;80(5):899–926. doi: 10.1111/rssb.12279

An imputation–regularized optimization algorithm for high dimensional missing data problems and beyond

Faming Liang 1, Bochao Jia 2, Jingnan Xue 3, Qizhai Li 4, Ye Luo 5
PMCID: PMC6533005  NIHMSID: NIHMS996644  PMID: 31130816

Summary.

Missing data are frequently encountered in high dimensional problems, but they are usually difficult to deal with by using standard algorithms, such as the expectation-maximization algorithm and its variants. To tackle this difficulty, some problem-specific algorithms have been developed in the literature, but a general algorithm is still lacking. This work fills the gap: we propose a general algorithm for high dimensional missing data problems. The algorithm works by iterating between an imputation step and a regularized optimization step. At the imputation step, the missing data are imputed conditionally on the observed data and the current estimates of parameters and, at the regularized optimization step, a consistent estimate is found via the regularization approach for the minimizer of a Kullback–Leibler divergence defined on the pseudocomplete data. For high dimensional problems, the consistent estimate can be found under sparsity constraints. The consistency of the averaged estimate for the true parameter can be established under quite general conditions. The algorithm is illustrated by using high dimensional Gaussian graphical models, high dimensional variable selection and a random-coefficient model.

Keywords: Expectation-maximization algorithm, Gaussian graphical model, Gibbs sampler, Imputation consistency, Random-coefficient model, Variable selection

1. Introduction

Missing data are frequently encountered in both low and high dimensional data, where low and high refer to whether the number of variables is smaller or larger than the sample size respectively. For example, microarray data are usually considered as high dimensional, where the number of genes can be much larger than the number of samples. Missing values can appear in microarray data for various reasons such as scratches on slides, spotting problems and experimental errors. In some microarray experiments, missing values can occur for more than 90% of the genes (Ouyang et al., 2004). Simply deleting the samples or genes for which missing values occur can lead to a significant loss of data information. How to deal with missing data has been a long-standing problem in statistics.

For low dimensional problems, the missing data can be dealt with by using the expectation-maximization (EM) algorithm (Dempster et al., 1977) or its variants. Let $X^{\mathrm{obs}} = (X_1^{\mathrm{obs}}, X_2^{\mathrm{obs}}, \ldots, X_n^{\mathrm{obs}})$ denote the observed incomplete data, where n denotes the sample size. Let $X^{\mathrm{mis}} = (X_1^{\mathrm{mis}}, X_2^{\mathrm{mis}}, \ldots, X_n^{\mathrm{mis}})$ denote the missing data, and let $X = (X^{\mathrm{obs}}, X^{\mathrm{mis}})$ denote the complete data. Let θ denote the vector of unknown parameters, and let $f(X \mid \theta)$ denote the likelihood function of the complete data. Then the maximum likelihood estimate of θ can be determined by maximizing the marginal likelihood of the observed data,

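$$f(X^{\mathrm{obs}} \mid \theta) = \int f(X^{\mathrm{obs}}, x^{\mathrm{mis}} \mid \theta)\, h(x^{\mathrm{mis}} \mid \theta, X^{\mathrm{obs}})\, dx^{\mathrm{mis}}, \tag{1}$$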

where $h(x^{\mathrm{mis}} \mid \theta, X^{\mathrm{obs}})$ denotes the predictive density of the missing data. The EM algorithm seeks to maximize equation (1) by iterating between the following two steps.

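  1. E-step: calculate the expected value of the log-likelihood function with respect to the predictive distribution of the missing data given the current estimate $\theta^{(t)}$, i.e.
    $$Q(\theta \mid \theta^{(t)}) = \int \log\{f(X^{\mathrm{obs}}, x^{\mathrm{mis}} \mid \theta)\}\, h(x^{\mathrm{mis}} \mid \theta^{(t)}, X^{\mathrm{obs}})\, dx^{\mathrm{mis}}.$$
  2. M-step: find a value of θ that maximizes the quantity $Q(\theta \mid \theta^{(t)})$, i.e. set
    $$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}).$$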

Dempster et al. (1977) showed that the marginal likelihood value increases with each iteration and, under fairly general conditions, it converges to a local or global maximum of the marginal likelihood. A rigorous study for the convergence is given in Wu (1983).
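The two steps can be made concrete on a toy problem. The sketch below is not taken from the paper: it assumes exponentially distributed lifetimes that are right-censored at a hypothetical cut-off, so that the unobserved excess lifetime plays the role of the missing data. In that setting the E-step has a closed form through the memoryless property, and the EM iterates can be checked against the closed-form maximizer of the observed-data likelihood. The function name `em_censored_exponential`, the cut-off and the sample size are illustrative choices.

```python
import random


def em_censored_exponential(times, censored, n_iter=200, rate=1.0):
    """EM for the rate of an exponential sample with right-censoring.

    times[i] is the recorded time; censored[i] is True when the true
    lifetime is only known to exceed times[i] (the missing part).
    """
    n = len(times)
    for _ in range(n_iter):
        # E-step: by memorylessness, E[X | X > c, rate] = c + 1 / rate, so
        # Q(rate | rate_t) = n * log(rate) - rate * expected_total.
        expected_total = sum(
            t + (1.0 / rate if c else 0.0) for t, c in zip(times, censored)
        )
        # M-step: Q is maximized at rate = n / expected_total.
        rate = n / expected_total
    return rate


rng = random.Random(0)
cutoff = 1.5  # hypothetical censoring time
lifetimes = [rng.expovariate(1.0) for _ in range(200)]
censored = [x > cutoff for x in lifetimes]
times = [min(x, cutoff) for x in lifetimes]

rate_em = em_censored_exponential(times, censored)
# Closed-form maximizer of the observed-data likelihood, for comparison:
# number of uncensored lifetimes divided by the total recorded time.
rate_mle = (len(times) - sum(censored)) / sum(times)
print(rate_em, rate_mle)
```

In this toy model the EM map is a contraction whose rate equals the fraction of censored observations, which gives a simple view of how the amount of missing information governs the speed of convergence.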

Both the E- and the M-steps of the algorithm can be quite complicated or even intractable. Meng and Rubin (1993) found that, in many cases, the M-step is relatively simple when conditioned on some function of the parameters under estimation. Motivated by this observation, they introduced the expectation-conditional maximization algorithm, which replaces the M-step by a number of computationally simpler conditional maximization steps. Later, the EM algorithm was further accelerated by some other variants, such as the ‘expectation-conditional maximization either’ algorithm (Liu and Rubin, 1994; He and Liu, 2012) and the parameter expansion EM algorithm (Liu et al., 1998). For the case in which the E-step is analytically intractable, Wei and Tanner (1990) introduced the Monte Carlo EM algorithm, which simulates multiple missing values from the predictive distribution $h(x^{\mathrm{mis}} \mid \theta^{(t)}, X^{\mathrm{obs}})$ at the (t + 1)th iteration, and then maximizes the approximate conditional expectation of the complete-data log-likelihood function

$$\hat{Q}(\theta \mid \theta^{(t)}) = \frac{1}{m} \sum_{j=1}^{m} \log\{f(X^{\mathrm{obs}}, X_j^{\mathrm{mis}} \mid \theta)\},$$

where $X_1^{\mathrm{mis}}, \ldots, X_m^{\mathrm{mis}}$ denote the missing values simulated from $h(x^{\mathrm{mis}} \mid \theta^{(t)}, X^{\mathrm{obs}})$. When the dimension of $X^{\mathrm{mis}}$ is high, the Monte Carlo approximation can be quite expensive. An alternative algorithm to deal with the intractable E-step is the stochastic EM (SEM) algorithm (Celeux and Diebolt, 1985). In this algorithm, the E-step is replaced by an imputation step, where the missing data are imputed with plausible values conditioned on the observed data and the current parameter estimate. At the M-step, the parameters are estimated by maximizing the likelihood function of the pseudocomplete data. Unlike the deterministic EM algorithm, the imputation step and M-step of the SEM algorithm generate a Markov chain which converges to a stationary distribution whose mean is close to the maximum likelihood estimate and whose variance reflects the loss of information due to the missing data (Nielsen, 2000).
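The contrast between the SEM algorithm and the deterministic EM algorithm can be seen on a small example. The following sketch is an illustration only, under the assumption of exponential lifetimes right-censored at a hypothetical cut-off (a model chosen for its closed forms, not one used in the paper). The imputation step draws each censored lifetime from its predictive distribution, the M-step maximizes the pseudocomplete-data likelihood, and the post-burn-in average of the resulting Markov chain is compared with the maximum likelihood estimate. All names and tuning values are hypothetical.

```python
import random


def sem_censored_exponential(times, censored, n_iter=2000, burn_in=200, seed=1):
    """Stochastic EM for an exponential rate with right-censored lifetimes."""
    rng = random.Random(seed)
    n = len(times)
    rate = 1.0  # arbitrary starting value
    draws = []
    for it in range(n_iter):
        # Imputation step: draw each censored lifetime from its predictive
        # distribution, X | X > c ~ c + Exponential(rate).
        completed = [
            t + (rng.expovariate(rate) if c else 0.0)
            for t, c in zip(times, censored)
        ]
        # M-step: maximum likelihood estimate from the pseudocomplete data.
        rate = n / sum(completed)
        if it >= burn_in:
            draws.append(rate)
    # The iterates form a Markov chain; report the post-burn-in average.
    return sum(draws) / len(draws)


data_rng = random.Random(0)
cutoff = 1.5  # hypothetical censoring time
lifetimes = [data_rng.expovariate(1.0) for _ in range(200)]
censored = [x > cutoff for x in lifetimes]
times = [min(x, cutoff) for x in lifetimes]

rate_sem = sem_censored_exponential(times, censored)
rate_mle = (len(times) - sum(censored)) / sum(times)
print(rate_sem, rate_mle)
```

Individual iterates fluctuate around the maximum likelihood estimate, and it is their average that is used as the point estimate.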

Although EM and its variants work well for low dimensional problems (see McLachlan and Krishnan (2008) for an overview), essentially they fail for high dimensional problems. For high dimensional problems the maximum likelihood estimate can be non-unique or inconsistent. To address this issue, some problem-specific algorithms have been proposed; see, for example, misgLasso (Städler and Bühlmann, 2012), misPALasso (Städler et al., 2014) and matrix completion algorithms (Cai et al., 2010; Mazumder et al., 2010). misgLasso was specifically designed for estimating Gaussian graphical models (GGMs) in the presence of missing data. Similarly to misgLasso, misPALasso also deals with multivariate Gaussian data in the presence of missing data. The matrix completion algorithm deals with large incomplete matrices, learning a low rank approximation for a large-scale matrix with missing entries. However, a general algorithm for high dimensional missing data problems is still lacking.

This work fills the gap: we propose a general algorithm for dealing with high dimensional missing data problems. The algorithm proposed consists of two steps: an imputation step and a regularized optimization (RO) step, and is therefore called an imputation-regularized optimization (IRO) algorithm. The imputation step imputes the missing data with plausible values conditioned on the observed data and the current estimate of the parameters. The RO step finds a consistent estimate for the minimizer of the Kullback–Leibler divergence defined on the pseudocomplete data. For high dimensional problems, a certain type of regularization must be imposed on the parameter space for finding a consistent estimate. Like the SEM algorithm, the IRO algorithm generates a Markov chain which converges to a stationary distribution. Under mild conditions, we show that the mean of the stationary distribution converges to the true value of the parameters in probability as the sample size becomes large. For low dimensional problems, the SEM algorithm can be viewed as a special case of the IRO algorithm. The IRO algorithm has strong implications for big data computing: on the basis of it, we propose a general strategy to improve Bayesian computation for big data problems. The IRO algorithm also facilitates data integration from multiple sources, which plays an important role in big data analysis. An R package accompanying this paper has been distributed to the public via the Comprehensive R Archive Network and is available from https://cran.r-project.org/web/packages/IROmiss/index.html.

The remainder of this paper is organized as follows. Section 2 describes the IRO algorithm with the theoretical development deferred to Appendix A. Section 3 applies the IRO algorithm to high dimensional GGMs. Section 4 applies the IRO algorithm to high dimensional variable selection. Section 5 applies the IRO algorithm to a random-coefficient model and discusses its potential use for big data problems. Section 6 concludes the paper with a brief discussion.

2. The imputation–regularized optimization algorithm

2.1. The algorithm

Let $X_1, \ldots, X_n$ denote a random sample drawn from the distribution $f(x \mid \theta)$ (also denoted by $f_\theta(x)$ depending on convenience), where θ is a vector of parameters. Let $X_i = (X_i^{\mathrm{obs}}, X_i^{\mathrm{mis}})$, $i = 1, \ldots, n$, where $X_i^{\mathrm{obs}}$ is observed and $X_i^{\mathrm{mis}}$ is missing. Let $X = (X_1, \ldots, X_n)$, $X^{\mathrm{obs}} = (X_1^{\mathrm{obs}}, \ldots, X_n^{\mathrm{obs}})$ and $X^{\mathrm{mis}} = (X_1^{\mathrm{mis}}, \ldots, X_n^{\mathrm{mis}})$. To indicate the dependence of the dimension of θ on the sample size n, we also write θ as $\theta_n$ and denote by $\theta_n^{(t)}$ the estimate of θ that is obtained at iteration t of the IRO algorithm. The IRO algorithm starts with an initial guess $\theta_n^{(0)}$ and then iterates between the imputation (I-) and RO steps.

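  1. I-step: draw $\tilde{X}^{\mathrm{mis}}$ from the predictive distribution $h(x^{\mathrm{mis}} \mid X^{\mathrm{obs}}, \theta_n^{(t)})$ given $X^{\mathrm{obs}}$ and the current estimate $\theta_n^{(t)}$.

  2. RO step: on the basis of the pseudocomplete data $\tilde{X} = (X^{\mathrm{obs}}, \tilde{X}^{\mathrm{mis}})$, find an updated estimate $\theta_n^{(t+1)}$ which forms a consistent estimate of
    $$\theta_*^{(t)} = \arg\max_{\theta} E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}], \tag{2}$$
    where $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}] = \int \log\{f(x^{\mathrm{obs}}, \tilde{x}^{\mathrm{mis}} \mid \theta)\}\, f(x^{\mathrm{obs}} \mid \theta^*)\, h(\tilde{x}^{\mathrm{mis}} \mid x^{\mathrm{obs}}, \theta_n^{(t)})\, dx^{\mathrm{obs}}\, d\tilde{x}^{\mathrm{mis}}$, $\theta^*$ denotes the true value of the parameters and $f(x^{\mathrm{obs}} \mid \theta^*)$ denotes the marginal density function of $x^{\mathrm{obs}}$.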

The RO step finds the minimizer of the Kullback–Leibler divergence from $f(x^{\mathrm{obs}}, \tilde{x}^{\mathrm{mis}} \mid \theta)$ to the joint density $f(x^{\mathrm{obs}} \mid \theta^*)\, h(\tilde{x}^{\mathrm{mis}} \mid x^{\mathrm{obs}}, \theta_n^{(t)})$. Intuitively, each RO step provides a drift that moves the current estimate $\theta_n^{(t)}$ towards $\theta^*$, and convergence eventually occurs as $n \to \infty$. For a finite value of n, $\theta_n^{(t)}$ will jump around $\theta^*$ after convergence because of the randomness that is introduced in imputation. To find a consistent estimate of $\theta_*^{(t)}$, an ideal objective function is $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}]$, but this cannot be directly used as $\theta^*$ is unknown. In practice, we can estimate the ideal objective function by the empirical mean of the log-likelihood function, i.e. $\sum_{i=1}^{n} \log\{f(x_i^{\mathrm{obs}}, \tilde{x}_i^{\mathrm{mis}} \mid \theta)\}/n$. Further, to accommodate the high dimensional case, for which the number of parameters can be much greater than the sample size, we impose a regularization term on the parameters, i.e. we propose to estimate $\theta_*^{(t)}$ by

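$$\theta_{n,p}^{(t+1)} = \arg\max_{\theta} \left[ \frac{1}{n} \sum_{i=1}^{n} \log\{f(x_i^{\mathrm{obs}}, \tilde{x}_i^{\mathrm{mis}} \mid \theta)\} - P_\lambda(\theta) \right], \tag{3}$$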

where the $\tilde{x}_i^{\mathrm{mis}}$s are drawn from $h(\tilde{x}^{\mathrm{mis}} \mid x_i^{\mathrm{obs}}, \theta_n^{(t)})$, $P_\lambda(\theta)$ denotes the regularization or penalty function and λ denotes the regularization parameter. For this reason, we call the algorithm proposed an IRO algorithm.
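To make the alternation between the I-step and the penalized update in equation (3) concrete, the sketch below runs it on a deliberately simple toy model that is assumed here for illustration and is not one of the paper's examples: each row is drawn from $N_p(\theta, I)$ with entries missing completely at random, and $P_\lambda(\theta) = \lambda \lVert \theta \rVert_1$. Under these assumptions the predictive distribution of a missing entry is $N(\theta_j, 1)$ and the maximizer of equation (3) is the soft-thresholded column mean of the pseudocomplete data. The names `iro_sparse_mean` and `soft_threshold`, the value of λ and the missingness rate are all hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Maximizer of -(z - theta)**2 / 2 - lam * abs(theta) over theta."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def iro_sparse_mean(rows, lam, n_iter=300, burn_in=100, seed=0):
    """rows: list of length-p lists, with None marking a missing entry."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    theta = [0.0] * p  # initial guess theta^(0)
    total = [0.0] * p
    for it in range(n_iter):
        new_theta = []
        for j in range(p):
            # I-step: draw missing entries of column j from N(theta_j, 1).
            col = [
                r[j] if r[j] is not None else rng.gauss(theta[j], 1.0)
                for r in rows
            ]
            # RO step: L1-penalized maximizer for the pseudocomplete column.
            new_theta.append(soft_threshold(sum(col) / n, lam))
        theta = new_theta
        if it >= burn_in:
            total = [a + b for a, b in zip(total, theta)]
    # Average of the post-burn-in iterates of the Markov chain.
    return [a / (n_iter - burn_in) for a in total]


data_rng = random.Random(42)
true_theta = [3.0, 0.0, 0.0, -2.0, 0.0]  # hypothetical sparse truth
rows = [
    [None if data_rng.random() < 0.2 else data_rng.gauss(m, 1.0) for m in true_theta]
    for _ in range(100)
]
theta_avg = iro_sparse_mean(rows, lam=0.3)
print([round(v, 2) for v in theta_avg])
```

The non-zero coordinates are shrunk towards zero by the penalty, as with any lasso-type estimate, so the example illustrates the mechanics of the iteration rather than a tuned choice of λ.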

Under mild conditions, theorem 1 in Appendix A establishes a uniform law of large numbers for the empirical mean of the log-likelihood function, i.e., as $n \to \infty$, $\sum_{i=1}^{n} \log\{f(x_i^{\mathrm{obs}}, \tilde{x}_i^{\mathrm{mis}} \mid \theta)\}/n$ will converge to $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}]$ uniformly over the space of θ and the space of $\theta_n^{(t)}$. With appropriate regularization functions and regularity conditions for $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}]$, corollary 1 in Appendix A shows that $\theta_{n,p}^{(t+1)}$ is uniformly consistent for $\theta_*^{(t)}$ for all t = 1, 2, …, T, where T denotes the total number of iterations. Further, corollary 2 in Appendix A shows that, other than the $\theta_{n,p}^{(t+1)}$s, any sequence of consistent estimates $\theta_{n,g}^{(t+1)}$ can be used in the IRO algorithm, as long as $\theta_{n,g}^{(t+1)}$ is sufficiently accurate for each t pointwise and the log-likelihood function of the pseudocomplete data satisfies certain regularity conditions. For obtaining such consistent estimates, different methods can be used, which might not explicitly maximize the objective function defined in equation (3). However, following from Jensen’s inequality, they do implicitly and asymptotically maximize the objective function that is defined in equation (3). Corollary 2 greatly facilitates the use of the IRO algorithm.

Like the SEM algorithm, by simulating new independent missing values at each iteration, the sequence of estimates $\{\theta_n^{(t)}\}$ forms a time homogeneous Markov chain. The imputed values at different iterations also form a Markov chain. The two Markov chains are interleaved and share many properties, such as irreducibility, aperiodicity and ergodicity. Refer to Nielsen (2000) for more discussion on this issue. Theorem 3 and theorem 4 in Appendix A prove that the Markov chain $\{\theta_n^{(t)}\}$ has a stationary distribution and, furthermore, the mean of the stationary distribution forms a consistent estimate of $\theta^*$.

We note that many of the assumptions that we made for proving the convergence and consistency of the IRO algorithm are quite regular. For example, we assumed that $\log\{f_\theta(\tilde{x})\}$ is a continuous function of θ for each $\tilde{x} \in \mathcal{X}$ and a measurable function of $\tilde{x}$ for each θ. Since we aim to address the missing data issue for a wide range of problems and it is difficult to specify the structure for each problem, we incorporate the assumptions about the parameters and structure of the problem in a metric entropy condition; see condition 2 in Appendix A. As discussed in remark 1 of Appendix A, this condition allows p to grow with n at a polynomial rate $O(n^{\gamma})$ for some constant $0 < \gamma < \infty$ and allows the number of non-zero elements in θ to grow with n at a rate of $O(n^{\alpha})$ for some $0 < \alpha < 1/2$. These rates seem a little more restrictive than the exponential rate, i.e. $\log(p) = n^{b}$ for some constant $0 < b < 1$, conventionally sought in high dimensional regression. In the literature (see for example Raskutti et al. (2011)), the metric entropy condition has been used in studying the rate of convergence of the estimation for high dimensional regression.

Regarding conditions on missing data, we note that the IRO algorithm essentially works with any missing data mechanism, as long as the predictive distribution $h(x^{\mathrm{mis}} \mid X^{\mathrm{obs}}, \theta_n^{(t)})$ is available, well behaved and unchanged with the sample size n. Our current theory rules out the case that the missing data mechanism changes as the sample size increases, e.g. the missingness rate increases because of increased wear and tear on measurement instruments or fatigue among data subjects that are measured later in the study. Condition 3 constrains the behaviour of $h(x^{\mathrm{mis}} \mid X^{\mathrm{obs}}, \theta_n^{(t)})$ via some moment conditions on the log-likelihood function of the pseudocomplete data, and it implies that a high missingness rate may hurt the performance of the method.

Regarding the implementation of the algorithm, we note that the RO step is the key. However, because of the specificity of each problem, it is neither easy to specify a general regularization function nor to specify a general optimization procedure for obtaining the consistent estimates. The IRO algorithm just provides a general road map for dealing with high dimensional missing data problems, which leaves the user plenty of freedom for choosing appropriate regularization functions and optimization procedures for different problems. In this paper, the IRO algorithm is applied to learning high dimensional GGMs with missing data. How the RO step is performed for this problem will be detailed in Section 3.

Regarding the regularization function, we note that, in general, it can be specified in two ways depending on the problem under consideration. The first way is conventional, i.e. employing some $l_0$-, $l_1$- or $l_2$-penalties. In this way, we recommend using the same penalty function as if there were no missing data. The second way is via sure screening, which first reduces the space of $\theta_*^{(t)}$ to a low dimensional subspace and then finds a consistent estimate of $\theta_*^{(t)}$ in the low dimensional subspace by using a conventional statistical method, such as maximum likelihood, moment estimation or even penalized optimization. This is equivalent to imposing a binary-type penalty function on the original space, for which the solutions in the low dimensional subspace receive a zero penalty and those outside the subspace receive a penalty of ∞. Such a binary-type penalty function satisfies condition 1 in Appendix A imposed for regularization functions. For low dimensional problems, the consistent estimator of $\theta_*^{(t)}$ can be obtained by solving equation (3) with a zero penalty function. In this sense, the SEM algorithm can be viewed as a special case of the IRO algorithm.

Finally, we note that, as for other Markov chains, a good initial value will accelerate the convergence of the IRO simulation. There are many ways to specify initial values for the IRO algorithm. In most examples of this paper, we started the simulation with an I-step, where all missing values are filled by the median of the corresponding variable. This method is simple and usually works well when the missingness rate is not high. However, when we perceive that such a constant filling method does not work well, we may start the simulation with an RO step. In this case, the initial estimate $\theta_n^{(0)}$ may be obtained on the basis of the complete samples (i.e. those without missing entries) only.
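The two initialization strategies just described are simple to express in code. The following is a minimal sketch, assuming that the data are stored as a list of rows with `None` marking a missing entry; that storage convention and the function names `median_fill` and `complete_cases` are choices made for this illustration only.

```python
from statistics import median


def median_fill(rows):
    """Replace None entries by the median of the observed values in that column."""
    p = len(rows[0])
    medians = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if r[j] is None else r[j] for j in range(p)] for r in rows]


def complete_cases(rows):
    """Keep only the rows without missing entries (for starting with an RO step)."""
    return [r for r in rows if all(v is not None for v in r)]


data = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [None, 7.0, 9.0],
    [4.0, 6.0, 6.0],
]
filled = median_fill(data)      # start with an I-step: constant filling
complete = complete_cases(data)  # start with an RO step: complete samples only
print(filled)
print(complete)
```

With a high missingness rate the set of complete samples can be very small or empty, in which case the constant filling start is the only one of the two that remains available.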

2.2. An extension of the imputation-regularized optimization algorithm

As mentioned previously, the RO step is the key to the implementation of the IRO algorithm. We found that for many problems, similarly to the expectation–conditional maximization algorithm (Meng and Rubin, 1993), a consistent estimate of $\theta_*^{(t)}$ can be easily obtained with a number of conditional RO steps, i.e. we can partition the parameter θ into a number of blocks and then find a consistent estimator for each block conditioned on the current estimates of other blocks. Note that, for many problems, the partitioning of θ is natural; see, for example, the examples that are studied in Sections 4 and 5.

Suppose that $\theta = (\theta^{(1)}, \ldots, \theta^{(k)})$ has been partitioned into k blocks. The imputation–conditional regularized optimization (ICRO) algorithm can be described as follows.

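  1. I-step: draw $\tilde{X}^{\mathrm{mis}}$ from the conditional distribution $h(x^{\mathrm{mis}} \mid X^{\mathrm{obs}}, \theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)})$ given $X^{\mathrm{obs}}$ and the current estimate $\theta_n^{(t)} = (\theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)})$.

  2. CRO step: on the basis of the pseudocomplete data $\tilde{X} = (X^{\mathrm{obs}}, \tilde{X}^{\mathrm{mis}})$, do the following.
    1. Conditioned on $(\theta_n^{(t,2)}, \ldots, \theta_n^{(t,k)})$, find $\theta_n^{(t+1,1)}$ which forms a consistent estimate of
      $$\theta_*^{(t,1)} = \arg\max_{\theta^{(t,1)}} E_{\theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)}}[\log\{f(\tilde{x} \mid \theta^{(t,1)}, \theta_n^{(t,2)}, \ldots, \theta_n^{(t,k)})\}],$$
      where the expectation is taken with respect to the joint distribution of $\tilde{x} = (x^{\mathrm{obs}}, \tilde{x}^{\mathrm{mis}})$ and the subscript of E gives the current estimate of θ.
    2. Conditioned on $(\theta_n^{(t+1,1)}, \theta_n^{(t,3)}, \ldots, \theta_n^{(t,k)})$, find $\theta_n^{(t+1,2)}$ which forms a consistent estimate of
      $$\theta_*^{(t,2)} = \arg\max_{\theta^{(t,2)}} E_{\theta_n^{(t+1,1)}, \theta_n^{(t,2)}, \theta_n^{(t,3)}, \ldots, \theta_n^{(t,k)}}[\log\{f(\tilde{x} \mid \theta_n^{(t+1,1)}, \theta^{(t,2)}, \theta_n^{(t,3)}, \ldots, \theta_n^{(t,k)})\}].$$
    ⋮
    k. Conditioned on $(\theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)})$, find $\theta_n^{(t+1,k)}$ which forms a consistent estimate of
      $$\theta_*^{(t,k)} = \arg\max_{\theta^{(t,k)}} E_{\theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)}, \theta_n^{(t,k)}}[\log\{f(\tilde{x} \mid \theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)}, \theta^{(t,k)})\}].$$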

Similarly to equation (3), the consistent estimate of each block can be obtained by maximizing a regularized conditional likelihood function, i.e. setting

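$$\theta_{n,p}^{(t+1,i)} = \arg\max_{\theta^{(t,i)}} \left[ \frac{1}{n} \sum_{j=1}^{n} \log\{f(x_j^{\mathrm{obs}}, \tilde{x}_j^{\mathrm{mis}} \mid \theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,i-1)}, \theta^{(t,i)}, \theta_n^{(t,i+1)}, \ldots, \theta_n^{(t,k)})\} - P_{\lambda_i}(\theta^{(t,i)}) \right], \tag{4}$$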

for i = 1, 2, …, k, where $P_{\lambda_i}(\cdot)$ denotes the regularization or penalty function that is used for block i. Similarly to the IRO algorithm, it is easy to see that the sequence $\{(\theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)})\}$ also forms a Markov chain. The convergence of the Markov chain can be studied under similar conditions to those of the IRO algorithm. Theorem 5 and theorem 6 in Appendix A show that the Markov chain $\{\theta_n^{(t)}\}$ has a stationary distribution and the mean of the stationary distribution forms a consistent estimate of $\theta^*$.
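The blockwise updates of equation (4) can be illustrated with a two-block toy model. The sketch below is an assumption-laden illustration rather than one of the paper's applications: rows are taken to be $N_p(\theta, \sigma^2 I)$ with entries missing completely at random, block 1 is the mean vector with an $l_1$-penalty and block 2 is the common variance with a zero penalty. Under these assumptions, the block 1 update is a soft-thresholded column mean with threshold $\lambda \sigma^2$ and the block 2 update is the mean squared residual. The function names, the value of λ and the simulated data are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Maximizer of -(z - theta)**2 / 2 - lam * abs(theta) over theta."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def icro_mean_variance(rows, lam, n_iter=300, burn_in=100, seed=0):
    """rows: list of length-p lists, with None marking a missing entry."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    theta, sigma2 = [0.0] * p, 1.0  # initial guess for both blocks
    theta_sum, sigma2_sum = [0.0] * p, 0.0
    for it in range(n_iter):
        sd = sigma2 ** 0.5
        # I-step: impute conditionally on the current estimates of both blocks.
        filled = [
            [x if x is not None else rng.gauss(theta[j], sd)
             for j, x in enumerate(row)]
            for row in rows
        ]
        # CRO step, block 1: update theta given the current sigma2.
        means = [sum(row[j] for row in filled) / n for j in range(p)]
        theta = [soft_threshold(m, lam * sigma2) for m in means]
        # CRO step, block 2: update sigma2 given the new theta (zero penalty).
        sigma2 = sum(
            (row[j] - theta[j]) ** 2 for row in filled for j in range(p)
        ) / (n * p)
        if it >= burn_in:
            theta_sum = [a + b for a, b in zip(theta_sum, theta)]
            sigma2_sum += sigma2
    kept = n_iter - burn_in
    return [a / kept for a in theta_sum], sigma2_sum / kept


data_rng = random.Random(7)
true_theta = [3.0, 0.0, 0.0, -2.0, 0.0]  # hypothetical truth, sigma = 1.5
rows = [
    [None if data_rng.random() < 0.2 else data_rng.gauss(m, 1.5) for m in true_theta]
    for _ in range(100)
]
theta_avg, sigma2_avg = icro_mean_variance(rows, lam=0.05)
print([round(v, 2) for v in theta_avg], round(sigma2_avg, 2))
```

Each block uses its own penalty, which mirrors the block-specific $P_{\lambda_i}$ in equation (4).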

Like the IRO algorithm, the ICRO algorithm just provides a road map for dealing with high dimensional missing data problems. How to partition the parameter vector θ into different blocks and how to obtain a consistent estimate for each block are problem dependent. In this paper, the ICRO algorithm is applied to the problem of high dimensional variable selection with missing data and the problem of random-coefficient linear model estimation. How the CRO step is performed for the two problems will be detailed in Sections 4 and 5 respectively.

3. Learning high dimensional Gaussian graphical models in the presence of missing data

GGMs have often been used in learning gene regulatory networks from microarray data; see, for example, Dobra et al. (2004) and Friedman et al. (2008). As mentioned in Section 1, missing values can appear in microarray data for many reasons. To deal with missing values in microarray data, many imputation methods, such as singular value decomposition imputation (Troyanskaya et al., 2001), least squares imputation (Bo et al., 2004) and Bayesian principal component analysis (BPCA) imputation (Oba et al., 2003), have been proposed. Since these methods impute the missing values independently of the models under consideration, they are often ineffective. Moreover, statistical inference based on the ‘one-time’ imputed data is potentially biased, because uncertainty in the missing values cannot be properly accounted for. In this section, we apply the IRO algorithm to handle missing values for microarray data. The IRO algorithm iteratively imputes missing values on the basis of the updated parameter estimate. Therefore, it overcomes the weakness of the one-time imputation methods and improves the accuracy of statistical inference. For a thorough study, the algorithm proposed is also compared with the regression-tree-based multiple-imputation method (Burgette and Reiter, 2010), and the numerical results indicate the superiority of the IRO algorithm.

Let $X = (x_1, \ldots, x_n)^T$ denote a microarray data set of n samples and p genes, where $x_i$ is assumed to follow a multivariate Gaussian distribution $N_p(\mu, \Sigma)$. According to the theory of GGMs, estimating a GGM is equivalent to identifying non-zero elements of the concentration matrix, i.e. the inverse of the covariance matrix Σ, or to identifying non-zero partial correlation coefficients for different pairs of genes. In recent years, several methods have been proposed to estimate high dimensional GGMs, e.g. the graphical lasso (Yuan and Lin, 2007; Friedman et al., 2008), nodewise regression (Meinshausen and Bühlmann, 2006) and ψ-learning (Liang et al., 2015). However, none of them can be directly applied in the presence of missing data.

3.1. The imputation–regularized optimization algorithm

To apply the IRO algorithm to learn GGMs in the presence of missing data, we choose the ψ-learning algorithm as the estimation procedure that is used in the RO step. For GGMs, θ corresponds to the concentration matrix, which can be uniquely determined from the network structure by using the algorithm that is given in Hastie et al. (2009), page 634. Under mild conditions, Liang et al. (2015) showed that the ψ-learning algorithm provides a consistent estimator for Gaussian graphical networks. Refer to the on-line supplementary material for a brief review of the algorithm. The ψ-learning algorithm belongs to the class of sure-screening-based algorithms. It first restricts the solution space of the GGM to a low dimensional subspace via correlation screening and then conducts GGM estimation via covariance selection (Dempster, 1972), which, by nature, is a maximum likelihood estimation method. Following from corollary 2 in Appendix A, such an estimation procedure can be used in the IRO algorithm. The resulting estimates implicitly and asymptotically maximize the objective function that is defined in equation (3). The corresponding penalty function is of binary type, which takes a value of 0 in the restricted low dimensional subspace and ∞ otherwise. Other than the ψ-learning algorithm, nodewise regression and the graphical lasso can also be used here, which both produce a consistent estimate for the Gaussian graphical network.

The Gaussian graphical network specifies the dependence between different genes, according to which missing values can be imputed. For convenience, we let $A = (a_{jk})$ denote the adjacency matrix of a Gaussian graphical network, where $a_{jk} = 1$ if an edge exists between node j and node k and $a_{jk} = 0$ otherwise. For microarray data, a node corresponds to a gene. Let $x_{ij}$ denote a missing entry, and let $\omega(j) = \{k: a_{jk} = 1\}$ denote the neighbourhood of node j. According to the faithfulness property of GGMs, conditionally on the neighbouring genes in ω(j), gene j is independent of all other genes. Therefore, $x_{ij}$ can be imputed conditionally on the expression values of the neighbouring genes. Mathematically, we have

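$$\begin{pmatrix} x_{ij} \\ x_{i\omega} \end{pmatrix} \sim N\left\{ \begin{pmatrix} \mu_j \\ \mu_\omega \end{pmatrix}, \begin{pmatrix} \sigma_j^2 & \Sigma_{j\omega} \\ \Sigma_{j\omega}^T & \Sigma_{\omega\omega} \end{pmatrix} \right\}, \tag{5}$$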

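where $x_{i\omega} = \{x_{ik}: k \in \omega(j)\}$, and $\mu_j$, $\mu_\omega$, $\sigma_j^2$, $\Sigma_{j\omega}$ and $\Sigma_{\omega\omega}$ denote the corresponding mean and variance components. The mean and variance of $x_{ij}$ conditionally on $x_{i\omega}$ are thus given by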

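$$\mu_{ij|\omega} = \mu_j + \Sigma_{j\omega} \Sigma_{\omega\omega}^{-1} (x_{i\omega} - \mu_\omega), \qquad \sigma_{ij|\omega}^2 = \sigma_j^2 - \Sigma_{j\omega} \Sigma_{\omega\omega}^{-1} \Sigma_{j\omega}^T. \tag{6}$$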

As shown in Liang et al. (2015), for each gene, the neighbourhood size can be upper bounded by $\lceil n/\log(n) \rceil$, where $\lceil z \rceil$ denotes the smallest integer that is not smaller than z. Hence, in practice, $\sigma_j^2$, $\Sigma_{j\omega}$ and $\Sigma_{\omega\omega}$ can be directly estimated from the data. Let $s_j^2$, $S_{j\omega}$, $S_{\omega\omega}$, $\bar{x}_j$ and $\bar{x}_\omega$ denote the respective sample estimates of $\sigma_j^2$, $\Sigma_{j\omega}$, $\Sigma_{\omega\omega}$, $\mu_j$ and $\mu_\omega$. Then, at each iteration, $x_{ij}$ can be imputed by sampling from the distribution

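$$x_{ij} \mid x_{i\omega} \sim N\{\bar{x}_j + S_{j\omega} S_{\omega\omega}^{-1} (x_{i\omega} - \bar{x}_\omega),\ s_j^2 - S_{j\omega} S_{\omega\omega}^{-1} S_{j\omega}^T\}. \tag{7}$$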

In this way, exact evaluation of the concentration matrix can be skipped. In summary, we have the following IRO algorithm for learning GGMs in the presence of missing data.

  1. (Initialization): fill each missing entry by the median of the corresponding variable, and then iterate between the RO and I-steps.

  2. (RO step): apply the ψ-learning algorithm to learn the structure of the Gaussian graphical network.

  3. (I-step): impute missing values according to distribution (7) based on the network learned in the RO step.
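The I-step of this summary, i.e. sampling from distribution (7), can be sketched as follows. The code is an illustration rather than the implementation in the accompanying R package: it assumes that the sample mean vector `xbar` and sample covariance matrix `S` have already been computed from the current pseudocomplete data, that the neighbourhood `nbrs` comes from whatever structure estimate the RO step produced, and that the neighbours' values in the row are available (observed or previously imputed). A small Gaussian elimination routine is included only to keep the example free of external dependencies; all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [list(a[r]) + [b[r]] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * k
    for r in reversed(range(k)):
        s = sum(m[r][c] * w[c] for c in range(r + 1, k))
        w[r] = (m[r][k] - s) / m[r][r]
    return w


def conditional_moments(xbar, S, j, nbrs, row):
    """Mean and variance in distribution (7) for entry j of one sample."""
    if not nbrs:  # isolated node: fall back to the marginal moments
        return xbar[j], S[j][j]
    S_ww = [[S[a][b] for b in nbrs] for a in nbrs]
    S_jw = [S[j][b] for b in nbrs]
    w = solve(S_ww, S_jw)  # w = S_ww^{-1} S_jw^T, using symmetry of S_ww
    mean = xbar[j] + sum(wk * (row[b] - xbar[b]) for wk, b in zip(w, nbrs))
    var = S[j][j] - sum(wk * s for wk, s in zip(w, S_jw))
    return mean, var


def impute_entry(xbar, S, j, nbrs, row, rng):
    """Draw one imputed value for the missing entry j of the given row."""
    mean, var = conditional_moments(xbar, S, j, nbrs, row)
    return rng.gauss(mean, max(var, 0.0) ** 0.5)


# Hypothetical three-gene example in which gene 0 has neighbours 1 and 2.
xbar = [0.0, 0.0, 0.0]
S = [[2.0, 1.0, 0.5], [1.0, 2.0, 1.0], [0.5, 1.0, 2.0]]
row = [None, 1.0, 1.0]
mean, var = conditional_moments(xbar, S, 0, [1, 2], row)
draw = impute_entry(xbar, S, 0, [1, 2], row, random.Random(0))
print(mean, var, draw)
```

Only the sub-blocks of `S` indexed by the neighbourhood enter the computation, which is why the concentration matrix never has to be evaluated in full.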

The IRO algorithm outputs a series of Gaussian graphical networks. To integrate or average these networks into a single network we adopt the ψ-score averaging approach that was suggested by Liang et al. (2015). Let $(\psi_{ij}^{(t)})$ denote the ψ-scores at iteration t, which are obtained from ψ-partial-correlation coefficients via Fisher’s transformation. Let $\bar{\psi}_{ij} = \sum_{t=1}^{T} \psi_{ij}^{(t)}/T$, $i, j = 1, 2, \ldots, p$, $i \neq j$, denote the averaged ψ-score for gene i and gene j. Then the averaged network can be obtained by applying a multiple-hypothesis approach to threshold the averaged ψ-scores; if an averaged ψ-score is greater than the threshold value, we set the corresponding element of the adjacency matrix to 1 and to 0 otherwise. The multiple-hypothesis test can be done by using the method of Liang and Zhang (2008), which can be viewed as a generalized empirical Bayesian method (Efron, 2004). The level of significance of the multiple-hypothesis test can be specified in terms of Storey’s q-value (Storey, 2002). In this paper, we set it to 0.05.

3.2. A simulated example

We consider an auto-regressive process of order 2 with the concentration matrix given by

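$$C_{i,j} = \begin{cases} 0.5, & \text{if } |j - i| = 1,\ i = 2, \ldots, p - 1, \\ 0.25, & \text{if } |j - i| = 2,\ i = 3, \ldots, p - 2, \\ 1, & \text{if } i = j,\ i = 1, \ldots, p, \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$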

This example has been used by many researchers to illustrate different GGM methods; see, for example, Yuan and Lin (2007), Mazumder and Hastie (2012) and Liang et al. (2015). In this paper, we generated multiple data sets with n = 200 and different values of p = 100, 200, 300, 400. For each combination of (n, p), we generated 10 data sets independently, and, for each data set, we randomly deleted 10% of the observations as missing values. To evaluate the performance of the IRO algorithm, precision-recall curves were drawn by varying the threshold value of ψ-scores, where the precision is the fraction of true edges among the retrieved edges, and the recall is the fraction of true edges that have been retrieved over the total number of true edges.
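The precision and recall defined above can be computed for any threshold once edge scores and the true edge set are available. The sketch below is illustrative only. It assumes that the concentration matrix in equation (8) is read as a symmetric banded matrix with non-zero entries at lags 1 and 2, so that the true edges are the pairs at those lags, and it uses made-up scores in place of ψ-scores. Returning a precision of 1 when no edge is retrieved is a convention adopted here, not something stated in the paper.

```python
def ar2_true_edges(p):
    """Edges (i, j), i < j, implied by the banded pattern of equation (8).

    Assumes a symmetric concentration matrix with non-zero off-diagonal
    entries at lags 1 and 2 (0-based node labels).
    """
    return {(i, j) for i in range(p) for j in range(i + 1, min(i + 3, p))}


def precision_recall(scores, true_edges, threshold):
    """scores: dict mapping an edge (i, j), i < j, to its score."""
    retrieved = {e for e, s in scores.items() if s > threshold}
    hits = len(retrieved & true_edges)
    # Convention: precision is 1.0 when nothing is retrieved.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(true_edges)
    return precision, recall


p = 5
true_edges = ar2_true_edges(p)

# Hypothetical scores: strong for lag 1, weaker for lag 2, zero elsewhere,
# plus one false positive between the first and last nodes.
scores = {}
for i in range(p):
    for j in range(i + 1, p):
        lag = j - i
        scores[(i, j)] = 0.5 if lag == 1 else 0.25 if lag == 2 else 0.0
scores[(0, 4)] = 0.3

# Sweeping the threshold traces out points on the precision-recall curve.
curve = [precision_recall(scores, true_edges, t) for t in (0.4, 0.2)]
print(curve)
```

A full curve is obtained by sweeping the threshold over all distinct score values, and the area under it gives a summary of the kind reported for each method.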

For each data set, the IRO algorithm was run for 50 iterations. Fig. 1 shows the resulting precision-recall curves for one data set with p = 400. For comparison, Fig. 1 also includes the precision-recall curves that are produced by misgLasso and those produced by the ψ-learning algorithm with the missing values imputed by the median filling, BPCA and regression tree (with multiple imputation) methods. The misgLasso algorithm is a combination of the gLasso and EM algorithms, which integrates out the missing data as in the EM algorithm (see, for example, Städler and Bühlmann (2012)) and then learns the GGM by using the gLasso algorithm. The misgLasso algorithm has been implemented in the R package spaceExt (He, 2011). The BPCA method has been implemented in the R package pcaMethods (Stacklies et al., 2007) and the regression tree method has been implemented in the R package MICE (van Buuren and Groothuis-Oudshoorn, 2011). In the regression tree method, for each data set, the missing values were imputed 20 times and, as for the IRO-Ave curve, the MIRegTree curve that is reported in Fig. 1 was obtained with the ψ-scores averaged over the 20 pseudocomplete data sets. Refer to Fig. 1 of the on-line supplementary material for the precision-recall curves for other values of p.

Fig. 1. Precision–recall curves resulting from various imputation methods for one simulated data set with p = 400. The curves shown are those obtained with complete data (‘true’); produced by the misgLasso algorithm; obtained with the ψ-scores generated in the last IRO iteration; obtained with the ψ-scores averaged over the last 20 IRO iterations; obtained with missing values imputed by the median filling method; obtained with missing data imputed by the BPCA method; and obtained with missing data imputed by the regression tree (multiple-imputation) method.

Table 1 compares the averaged areas (over 10 data sets) under the precision-recall curves produced by various methods. The comparison indicates that IRO-Ave outperforms all others for this example. It is interesting that, although IRO-Last is also based on one-time imputation, it outperforms the median filling, regression tree (with multiple imputation) and BPCA methods for p = 200, 300 and 400, and is comparable with them at p = 100. The misgLasso algorithm does not work well for this example. This inferiority is due not to the EM algorithm but to the gLasso algorithm, which itself does not work well for this example. This is consistent with Liang et al. (2015), where it was shown that the ψ-learning algorithm works much better than gLasso for the complete-data version of this example.

Table 1.

Average areas (over 10 data sets) under the precision–recall curves resulting from various imputation methods

p     Results for the following methods:
      misgLasso       Median          BPCA            MIRegTree       IRO-Last        IRO-Ave         True
100   0.678 (0.006)   0.882 (0.007)   0.874 (0.006)   0.877 (0.005)   0.877 (0.007)   0.904 (0.006)   0.949 (0.006)
200   0.633 (0.005)   0.856 (0.004)   0.855 (0.004)   0.844 (0.006)   0.887 (0.004)   0.902 (0.003)   0.941 (0.002)
300   0.599 (0.004)   0.830 (0.003)   0.833 (0.003)   0.807 (0.006)   0.869 (0.003)   0.901 (0.002)   0.936 (0.002)
400   0.580 (0.003)   0.824 (0.003)   0.824 (0.003)   0.799 (0.005)   0.868 (0.003)   0.900 (0.002)   0.932 (0.001)

Numbers in parentheses denote the standard deviation of the average.

3.3. Yeast cell expression data

Using deoxyribonucleic acid microarrays, Gasch et al. (2000) measured the changes in gene expression levels in the yeast Saccharomyces cerevisiae as cells responded to different environmental changes, such as temperature shocks, hydrogen peroxide and nitrogen source depletion. The whole data set is available from http://genome-www.stanford.edu/yeast-stress/, consisting of 173 samples and 6152 known or predicted yeast genes and having a missingness rate of 3.01%.

In this example, our goal is to study the regulatory relationships between yeast genes. We considered the 1000 genes whose expression levels have the largest variation over the samples. The missingness rate for the subset data is also 3.01%. On the basis of the subset data, the IRO algorithm produced an approximately scale-free gene regulatory network, which is reported in the on-line supplementary material. A gene is called a hub gene if it has strong connectivity to other genes. From the network, a few hub genes can be identified, and the top three of them are ARB1, HXT5 and XBP1. We find that the top three hub genes are all included in the transcriptional factor (TF)-gene direct regulatory network (Yang et al., 2014), which is available from the yeast transcriptional regulatory pathway database at http://cosbi3.ee.ncku.edu.tw/YTRP. The yeast transcriptional regulatory pathway database has deposited all possible transcriptional regulatory pathways for each TF-gene regulatory pair inferred from the TF perturbation experiment and can, to some extent, be considered as the ‘ground truth’ for gene regulatory relationships. For comparison, we have applied the BPCA, median filling, regression tree (with multiple imputation) and misgLasso methods to this example, with the same top three rule employed in identifying hub genes. The misgLasso algorithm always aborted for this data set, which may be due to some computational issues in the package. Among the top three hub genes identified by the BPCA method, only one gene is included in the TF-gene direct regulatory network from the yeast transcriptional regulatory pathway database. For the median filling and regression tree methods, this number is 0 and 1 respectively. This comparison indicates the superiority of the IRO algorithm over the competing methods in learning true gene regulatory relationships for the data set. Refer to the supplementary material for details.

4. High dimensional variable selection in presence of missing data

This problem is also motivated by microarray data analysis, but the goal has been shifted to selection of genes that are relevant to a particular phenotype. To be more general, we let $Y = (Y_1, \ldots, Y_n)^T$ denote the response vector for n observations, and let $X = (X_1, \ldots, X_n)^T$ denote the matrix of covariates, where each $X_i$ is a p-dimensional vector and p can be much larger than n (also known as small n–large p). The response variable and covariates are linked through the regression

$$Y = (1_n, X)\beta + \epsilon, \tag{9}$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ denotes the vector of regression coefficients, and $\epsilon \sim N(0, \sigma_\epsilon^2 I_n)$ denotes the vector of random errors.

Variable selection for model (9) with complete data has been extensively studied in the recent literature. Methods have been developed from both frequentist and Bayesian perspectives; see, for example, Tibshirani (1996), Johnson and Rossell (2012) and Song and Liang (2015a). For incomplete data, Garcia et al. (2010) proposed to conduct variable selection by maximizing the penalized likelihood function of the incomplete data. However, when p is large and the covariates Xi are generally correlated, the incomplete-data likelihood function can be intractable, which causes their method to fail. Zhao and Long (2016) showed through numerical studies that, for high dimensional data, the standard multiple-imputation approach performs poorly, whereas the imputation method based on the Bayesian lasso often works better. However, since the Bayesian lasso tends to overshrink the non-zero regression coefficients, its consistency in variable selection is difficult to justify when p is much greater than n (Castillo et al., 2015). Quite recently, Long and Johnson (2015) proposed to combine Bayesian lasso imputation and stability selection (Meinshausen and Bühlmann, 2010). Again, the consistency of this method is difficult to justify because of the inconsistency of the Bayesian lasso.

4.1. The imputation–conditional regularized optimization algorithm

In what follows, we consider a general setting of model (9), where the covariates follow a multivariate Gaussian distribution $X \sim N(\mu, \Sigma)$. Under this setting, the parameter vector θ consists of three natural blocks: β, $\sigma_\epsilon^2$ and the concentration matrix $C = \Sigma^{-1}$. Since n has been assumed to be smaller than p, we further assume sparsity for both the regression coefficients β and the concentration matrix C.

To apply the ICRO algorithm to this problem, we employ the sure independence screening (SIS)-minimax concave penalty (MCP) algorithm for estimating β, i.e. the variables are first subject to an SIS procedure, and then the surviving variables are selected by using the MCP method (Zhang, 2010). This algorithm has been implemented in the R package SIS, and the resulting estimate maximizes the regularized conditional likelihood function as defined in expression (4), where the regularization function is given by the MCP penalty (Zhang, 2010) in the restricted subspace and ∞ otherwise. Given the estimate of β, $\sigma_\epsilon^2$ can be estimated by $\hat\sigma_\epsilon^2 = \sum_{i=1}^n \hat\epsilon_i^2 / (n - |\hat\beta| - 1)$, where $\hat\epsilon_i$ denotes the residual of sample i, and $|\hat\beta|$ denotes the number of non-zero elements that are included in the estimate $\hat\beta$. Given the consistency of $\hat\beta$, the consistency of $\hat\sigma_\epsilon^2$ can be easily justified, for which the corresponding penalty function is 0. To estimate the concentration matrix C, we employ the ψ-learning algorithm. As mentioned previously, the ψ-learning algorithm provides a consistent estimate for the Gaussian graphical network, based on which a consistent estimate of the concentration matrix can be uniquely determined by the algorithm given in Hastie et al. (2009), page 634. As discussed in Section 3, the ψ-learning algorithm implicitly and asymptotically maximizes the regularized conditional likelihood function as defined in expression (4) with a binary-type penalty function. Note that the SIS-MCP algorithm does not make use of the dependence between the covariates. Given the structure of the ICRO algorithm, some other variable selection algorithms which make use of the dependence between the covariates, e.g. Yu and Liu (2016), can also be applied here.
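The residual variance estimate described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it is assumed here that the coefficient vector passed in contains the slope coefficients only, so that the intercept accounts for the extra degree of freedom that is subtracted in the denominator.

```python
def sigma2_hat(y, fitted, slopes_hat):
    """Residual variance estimate: sum of squared residuals / (n - |beta_hat| - 1).

    Assumption: `slopes_hat` excludes the intercept, and |beta_hat| is the
    number of its non-zero entries.
    """
    n = len(y)
    n_nonzero = sum(1 for b in slopes_hat if b != 0.0)
    df = n - n_nonzero - 1
    if df <= 0:
        raise ValueError("too many selected variables for the sample size")
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / df


# Five observations, one selected variable: the denominator is 5 - 1 - 1 = 3.
print(sigma2_hat([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], [2.0, 0.0, 0.0]))
```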

Next, we consider the imputation step. Suppose that the value of $x_{hk}$ is missing in X. Section 4 of the on-line supplementary material presents the conditional distributions of $X_{hk}$ given Y and the remaining elements of X under various scenarios. On the basis of these conditional distributions, $x_{hk}$ can be easily imputed by sampling from the respective sampled conditional distributions. Here the sampled conditional distribution refers to the distribution with its population parameters replaced by their respective estimates calculated from samples. For example, the $\beta_i$ are replaced by their SIS-MCP estimates and $\sigma_\epsilon^2$ is replaced by $\hat\sigma_\epsilon^2$. In summary, the ICRO algorithm works as follows.

  1. (Initialization): fill each missing entry of X by the median of the corresponding variable, and then iterate between the CRO and I-steps.

  2. (CRO-step):
    1. apply the SIS-MCP algorithm to estimate the regression coefficients β;
    2. estimate $\sigma_\epsilon^2$ conditionally on the estimate of β;
    3. apply the ψ-learning algorithm to learn the structure of the Gaussian graphical network.
  3. (I-step): impute missing values according to the conditional distributions (given in the on-line supplementary material) on the basis of the regression model and network structure learned in the CRO step.
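The three steps above can be sketched as a generic loop. This is a structural sketch only, not the authors' implementation: the functions `cro_step` and `i_step` are hypothetical placeholders that the user must supply (for instance wrappers around an SIS-MCP fit, the variance estimate and ψ-learning, and a sampler for the conditional distributions in the supplementary material), and missing entries are assumed to be coded as `None`.

```python
import statistics


def median_fill(X):
    """Return a copy of X (a list of rows) with None entries replaced by column medians."""
    n_cols = len(X[0])
    medians = []
    for k in range(n_cols):
        observed = [row[k] for row in X if row[k] is not None]
        medians.append(statistics.median(observed))
    return [[medians[k] if v is None else v for k, v in enumerate(row)] for row in X]


def icro(X, y, cro_step, i_step, n_iter=30):
    """Iterate between a CRO step and an I-step, starting from median filling.

    cro_step(X_filled, y) -> params            (hypothetical: estimates all blocks)
    i_step(X_filled, y, h, k, params) -> value (hypothetical: draws entry (h, k))
    """
    missing = [(h, k) for h, row in enumerate(X) for k, v in enumerate(row) if v is None]
    X_filled = median_fill(X)               # initialization
    history = []
    for _ in range(n_iter):
        params = cro_step(X_filled, y)      # CRO step
        history.append(params)
        for h, k in missing:                # I-step, one missing entry at a time
            X_filled[h][k] = i_step(X_filled, y, h, k, params)
    return history, X_filled
```

The list `history` keeps the parameter estimates from every iteration, so that estimates from the later iterations can be averaged afterwards.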

4.2. A simulated example

The data sets were simulated from model (9) with n = 100 and either p = 200 or p = 500. The covariates X were generated under two settings:

  1. the covariates are mutually independent, where $x_i \sim N(0, 2I_n)$ for i = 1, …, n;

  2. the covariates are generated according to the concentration matrix (8).

For both settings, we set $(\beta_0, \beta_1, \ldots, \beta_5) = (1, 1, 2, -1.5, -2.5, 5)$ and $\beta_6 = \cdots = \beta_p = 0$, and the random error $\epsilon \sim N(0, I_n)$. For each pair (n, p), we simulated 10 data sets independently. For each data set, we considered two missingness rates, randomly deleting 5% and 10% of the entries of X to create missing data. The performance of the various methods was measured by using three criteria:

$$\mathrm{err}_\beta^2 = \|\hat\beta - \beta\|^2,$$
$$\mathrm{fsr} = |s \setminus s^*| / |s|,$$
$$\mathrm{nsr} = |s^* \setminus s| / |s^*|,$$

where ‘‖ · ‖’ denotes the Euclidean norm, $\hat\beta$ denotes the estimate of β, $s^*$ denotes the set of true covariates and s denotes the set of selected covariates.
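The independent-covariate setting, the random deletion of entries and the three criteria can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: the value 2 is read as the variance of each covariate entry, and an exact fraction of the entries is deleted (rather than deleting each entry independently with the given probability).

```python
import random


def simulate_independent(n, p, beta, rng):
    """Simulate model (9) with i.i.d. N(0, 2) covariate entries and N(0, 1) errors.

    beta = [beta_0, beta_1, ..., beta_p], with beta_0 the intercept.
    """
    sd_x = 2 ** 0.5  # assumption: 2 is the variance, so the sd is sqrt(2)
    X = [[rng.gauss(0.0, sd_x) for _ in range(p)] for _ in range(n)]
    y = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y


def delete_at_random(X, rate, rng):
    """Return a copy of X with a fraction `rate` of its entries set to None."""
    n, p = len(X), len(X[0])
    cells = [(h, k) for h in range(n) for k in range(p)]
    drop = set(rng.sample(cells, round(rate * n * p)))
    return [[None if (h, k) in drop else v for k, v in enumerate(row)]
            for h, row in enumerate(X)]


def err2(beta_hat, beta):
    """Squared Euclidean distance between the estimate and the truth."""
    return sum((a - b) ** 2 for a, b in zip(beta_hat, beta))


def fsr(selected, true):
    """Share of selected covariates that are not true covariates."""
    selected, true = set(selected), set(true)
    return len(selected - true) / len(selected) if selected else 0.0


def nsr(selected, true):
    """Share of true covariates that were not selected."""
    selected, true = set(selected), set(true)
    return len(true - selected) / len(true)


rng = random.Random(0)
beta = [1, 1, 2, -1.5, -2.5, 5] + [0] * 195   # p = 200
X, y = simulate_independent(100, 200, beta, rng)
X_mis = delete_at_random(X, 0.05, rng)        # 5% missingness
```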

The ICRO algorithm was first applied to this example, with the results summarized in Tables 2 and 3. For each data set, the algorithm was run for 30 iterations. For variable selection, we kept only the variables that appeared five or more times in the last 10 iterations. For estimation of β, we averaged the estimates of β that were obtained in the last 10 iterations. For comparison, we also tried the median filling, BPCA and regression tree (with multiple imputation) methods. As explained previously, the median filling method fills each missing value by the median of the corresponding variable, and BPCA imputes the missing values on the basis of the principal component regression. The regression tree method performs multiple imputation via sequential regression trees: as in ICRO, we kept only the variables that appeared five or more times in 10 imputations for variable selection, and averaged the estimates of β that were obtained in all 10 imputations for estimation of β.
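The selection and averaging rule described above can be sketched as follows. The function name is hypothetical, and it is assumed that a variable 'appears' in an iteration when its estimated coefficient is non-zero and that the average is taken coordinate-wise over all variables.

```python
def summarize_last_iterations(beta_path, last=10, min_count=5):
    """Select variables and average estimates over the last `last` iterations.

    beta_path: list of coefficient vectors, one per iteration.
    Returns (indices appearing at least `min_count` times, averaged coefficients).
    """
    tail = beta_path[-last:]
    p = len(tail[0])
    counts = [sum(1 for b in tail if b[j] != 0.0) for j in range(p)]
    selected = [j for j in range(p) if counts[j] >= min_count]
    beta_avg = [sum(b[j] for b in tail) / len(tail) for j in range(p)]
    return selected, beta_avg


# 30 iterations, 3 variables: variable 0 is always selected,
# variable 1 is selected in only 4 of the last 10 iterations.
path = [[1.0, 0.0, 0.0] for _ in range(26)] + [[1.0, 2.0, 0.0] for _ in range(4)]
selected, beta_avg = summarize_last_iterations(path)
print(selected, beta_avg)
```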

Table 2.

Comparison of the ICRO algorithm with the median filling, BPCA and regression tree (with multiple imputation) methods for high dimensional variable selection with independent covariates

p    MR (%)  Criterion  BPCA            Median          MIRegTree       ICRO            True
200  5       err_β^2    0.257 (0.267)   0.262 (0.261)   0.333 (0.08)    0.042 (0.041)   0.046 (0.048)
             fsr        0.119 (0.143)   0.082 (0.092)   0.075 (0.038)   0 (0)           0 (0)
             nsr        0 (0)           0 (0)           0 (0)           0 (0)           0 (0)
200  10      err_β^2    0.903 (0.396)   0.856 (0.421)   0.369 (0.189)   0.065 (0.087)   0.046 (0.048)
             fsr        0.310 (0.159)   0.308 (0.178)   0.057 (0.031)   0 (0)           0 (0)
             nsr        0 (0)           0 (0)           0.016 (0.016)   0 (0)           0 (0)
500  5       err_β^2    0.339 (0.214)   0.350 (0.206)   0.254 (0.121)   0.029 (0.034)   0.027 (0.023)
             fsr        0.249 (0.225)   0.266 (0.237)   0.142 (0.117)   0 (0)           0 (0)
             nsr        0 (0)           0 (0)           0 (0)           0 (0)           0 (0)
500  10      err_β^2    1.532 (1.071)   1.354 (0.895)   1.075 (0.149)   0.044 (0.022)   0.027 (0.023)
             fsr        0.470 (0.265)   0.420 (0.255)   0.083 (0.037)   0 (0)           0 (0)
             nsr        0.033 (0.070)   0.017 (0.053)   0.083 (0.037)   0 (0)           0 (0)

MR denotes the missingness rate and true denotes the results obtained by the MCP method from the complete data. The values in the table are obtained by averaging over 10 independent data sets with the standard deviation reported in the parentheses.

Table 3.

Comparison of the ICRO algorithm with the median filling, BPCA and regression tree (with multiple imputation) methods for high dimensional variable selection with dependent covariates

p    MR (%)  Criterion  BPCA            Median          MIRegTree       ICRO            True
200  5       err_β^2    0.580 (0.413)   0.548 (0.140)   0.537 (0.269)   0.118 (0.097)   0.071 (0.050)
             fsr        0.262 (0.204)   0.263 (0.200)   0.076 (0.033)   0 (0)           0 (0)
             nsr        0.017 (0.052)   0.017 (0.052)   0.083 (0.027)   0 (0)           0 (0)
200  10      err_β^2    1.604 (0.666)   1.575 (0.974)   1.550 (0.462)   0.424 (0.461)   0.071 (0.050)
             fsr        0.247 (0.229)   0.273 (0.238)   0.031 (0.021)   0 (0)           0 (0)
             nsr        0.100 (0.086)   0.083 (0.088)   0.167 (0.025)   0.033 (0.070)   0 (0)
500  5       err_β^2    0.669 (0.366)   0.717 (0.358)   0.555 (0.134)   0.172 (0.195)   0.096 (0.083)
             fsr        0.262 (0.202)   0.289 (0.236)   0.083 (0.037)   0 (0)           0 (0)
             nsr        0.017 (0.053)   0.017 (0.053)   0.167 (0.144)   0 (0)           0 (0)
500  10      err_β^2    2.752 (2.306)   2.896 (2.601)   1.433 (1.151)   0.578 (0.587)   0.096 (0.083)
             fsr        0.297 (0.230)   0.327 (0.224)   0.083 (0.037)   0 (0)           0 (0)
             nsr        0.133 (0.070)   0.133 (0.070)   0.167 (0.125)   0.050 (0.081)   0 (0)

MR denotes the missingness rate, and true denotes the results obtained by the MCP method from the complete data.

The comparison indicates that the ICRO algorithm works extremely well for this example. For the case of independent covariates, its results are almost as good as those obtained from the complete data. In both cases, the ICRO algorithm significantly outperforms the other imputation methods.

As an extension of this example, we also considered a case in which the missing data are generated under the mechanism of missingness at random. To mimic this mechanism, we deleted each entry of X with a probability proportional to the absolute value of the mean of its conditional distribution (see equation (2) of the on-line supplementary material); i.e. the propensity for each data point to be missing is not related to the missing data but is related to some of the observed data. Table 2 of the supplementary material summarizes the results from the median filling, BPCA, regression tree (with multiple imputation) and ICRO methods. The comparison indicates that, under this data missingness mechanism, ICRO can still work well and significantly outperform the competitors.

4.3. Expression data for Bardet–Biedl syndrome

We analysed one real gene expression data set about Bardet-Biedl syndrome (Scheetz et al., 2006). The complete data set contains 120 samples, where the expression level of the gene TRIM32 works as the response variable and the expression levels of 200 other genes work as the predictors. The data set is available in the R package flare. We generated 10 incomplete data sets from the complete data set by randomly deleting 5% of the observations. For each incomplete data set, we ran the ICRO algorithm for 30 iterations and averaged the estimates of β obtained in the last 10 iterations as the final estimate. For comparison, the median filling, BPCA and regression tree (with multiple imputation) methods were also applied to this example. Table 4 summarizes the estimation errors of β^ (with respect to βc, the estimate of β from the complete data) produced by the four methods for 10 incomplete data sets.

Table 4.

Estimation errors of β^ (with respect to βc) produced by the ICRO, median filling, BPCA and regression tree (with multiple imputation) methods for the Bardet–Biedl syndrome example

Method errβ2 sd
BPCA 0.428 0.091
Median 0.397 0.086
MIRegTree 0.289 0.042
ICRO 0.187 0.040

$\mathrm{err}_\beta^2$ is calculated by averaging $\|\hat\beta - \beta_c\|^2$ over 10 incomplete data sets, and sd represents the standard deviation of $\mathrm{err}_\beta^2$.

We have also explored the results of variable selection. The complete-data model selects five variables: v.153, v.180, v.185, v.87 and v.200. For the ICRO, median filling, BPCA and regression tree models, we count the selection frequency of each variable for the 10 incomplete data sets. For the ICRO and regression tree methods, the top five variables in selection frequency are v.153, v.185, v.180, v.87 and v.200, which are the same (ignoring the order) as for the complete-data model. For the median filling models, the top five variables are v.153, v.185, v.62, v.200 and v.54. For the BPCA models, the top five variables are v.153, v.87, v.185, v.62 and v.200. The results of both β-estimation and variable selection indicate the superiority of ICRO over the other imputation methods.

5. A random-coefficient linear model

To illustrate the use of the ICRO algorithm further, we consider a random-coefficient linear model. Such a model often arises, for instance, in recommendation systems where customers rate different items, e.g. products or services. Specifically, we simulate the data from the model

$$y_{ij} = x_{ij}^T\beta + z_i^T\lambda_i + w_j^T\gamma_j + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2), \quad \lambda_i \sim N(0, \Lambda), \quad \gamma_j \sim N(0, \Gamma), \tag{10}$$

where yij represents the response for customer i on item j. Assume that there are I customers and each customer responds to J items. Thus, the data set consists of a total of n = IJ observations. The vector xij represents the covariates that characterize the customers and items, e.g. how and how long the customer has purchased the item, zi represents customer-specific covariates such as gender, education and demographics, and wj represents item-specific covariates, e.g. the manufacturer and category of the item. The vector λi represents the customer-specific (random) coefficients and γj represents the item-specific (random) coefficients. This model can be easily extended to the case where each customer responds to only a subset of items. For this model, we treat the random coefficients λi and γj as missing data and are interested in estimation of β. For simplicity, we assume that β is low dimensional, although the whole data set can be big when I and/or J become large. Under this assumption, the ICRO algorithm is essentially reduced to the SEM algorithm for this example. Instead of using the ICRO algorithm in this straightforward way, we propose to use it under the Bayesian framework. This extends the applications of the ICRO algorithm to Bayesian computation.

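To conduct Bayesian analysis for the model, we assume the following semiconjugate priors:

$$\beta \sim N(\mu_\beta, \Sigma_\beta), \quad \sigma^2 \sim \mathrm{IG}(a, b), \quad \Lambda \sim \mathrm{IW}(\rho_\Lambda, R_\Lambda), \quad \Gamma \sim \mathrm{IW}(\rho_\Gamma, R_\Gamma), \tag{11}$$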

where IG(·, ·) denotes the inverse gamma distribution, IW(·, ·) denotes the inverse Wishart distribution and μβ, Σβ, a, b, ρΛ, RΛ, ρΓ and RΓ are hyperparameters to be specified by the user. Each of these priors is individually conjugate to the normal likelihood function, given the other parameters, although the joint prior is not conjugate. The resulting full conditional posterior distributions and their modes are derived in section 5 of the on-line supplementary material. Since, under the low dimensional setting, the mode of the full conditional posterior distribution provides a consistent estimator for the corresponding parameter, the ICRO algorithm can work as follows.

  1. (Initialization): initialize the λis, γjs and all parameters by some random numbers.

  2. (CRO step): estimate the parameters β, Λ, Γ and σ2 by the mode of their respective full conditional posterior distributions, for which the log-prior density plays the role of the penalty function that is imposed in expression (4).

  3. (I-step): impute the values of the λis and γjs according to their respective full conditional posterior distributions.
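For orientation, the conditional modes that are used in the CRO step take the following standard conjugate forms. This is a sketch under assumed parameterizations, which are stated in the comments and may differ from those used in the on-line supplementary material; the symbols $r_{ij}$, $e_{ij}$, $m_\beta$, $V_\beta$ and $d_\lambda$ are introduced here for illustration only, and I in the last line denotes the number of customers.

```latex
% Assumed parameterizations (not taken from the paper):
%   IG(a, b):    density proportional to (\sigma^2)^{-a-1} \exp(-b/\sigma^2)
%   IW(\rho, R): density proportional to |\Lambda|^{-(\rho + d + 1)/2} \exp\{-\mathrm{tr}(R\Lambda^{-1})/2\}
% Notation: r_{ij} = y_{ij} - z_i^T\lambda_i - w_j^T\gamma_j,  e_{ij} = r_{ij} - x_{ij}^T\beta,
%           n = IJ,  d_\lambda = \dim(\lambda_i).
\begin{align*}
\beta \mid \cdot &\sim N(m_\beta, V_\beta), \qquad
  V_\beta = \Big(\Sigma_\beta^{-1} + \sigma^{-2}\sum_{i,j} x_{ij}x_{ij}^T\Big)^{-1}, \qquad
  m_\beta = V_\beta\Big(\Sigma_\beta^{-1}\mu_\beta + \sigma^{-2}\sum_{i,j} x_{ij}r_{ij}\Big), \\
\sigma^2 \mid \cdot &\sim \mathrm{IG}\Big(a + \tfrac{n}{2},\; b + \tfrac{1}{2}\sum_{i,j} e_{ij}^2\Big), \qquad
  \text{mode} = \frac{b + \tfrac{1}{2}\sum_{i,j} e_{ij}^2}{a + n/2 + 1}, \\
\Lambda \mid \cdot &\sim \mathrm{IW}\Big(\rho_\Lambda + I,\; R_\Lambda + \sum_{i=1}^{I}\lambda_i\lambda_i^T\Big), \qquad
  \text{mode} = \frac{R_\Lambda + \sum_{i=1}^{I}\lambda_i\lambda_i^T}{\rho_\Lambda + I + d_\lambda + 1}.
\end{align*}
```

The mode for β is its conditional mean $m_\beta$, and the expressions for Γ are analogous to those for Λ, with the $\gamma_j$ and J in place of the $\lambda_i$ and I.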

Under the Bayesian framework, the ICRO algorithm works in a similar way to the Gibbs sampler except that it replaces posterior samples of the parameters by their respective full conditional posterior modes. Also, in this case, it is reduced to a hybrid of data augmentation (Tanner and Wong, 1987) and iterative conditional modes (Besag, 1974) by using imputation for the missing data and conditional modes for the parameters. However, the ICRO algorithm offers more: its consistency step allows it to conduct parameter estimation based on subsamples only, and this can create great savings in computation for big data problems. For high dimensional problems, if the choice of prior distributions ensures posterior consistency, then the above algorithm can still be employed.

Fig. 3 of the on-line supplementary material compares the sampling paths and auto-correlations of the ICRO and Gibbs samples. As expected, the comparison shows that the ICRO algorithm can converge faster than the Gibbs sampler and, in addition, the samples that are generated by the ICRO algorithm tend to have smaller variations than those generated by the Gibbs sampler. Note that the accuracy of the ICRO estimates is achieved at the price of sacrificing the variance information that is contained in the posterior samples. As pointed out by Nielsen (2000), the variance of the ICRO samples reflects the loss of information due to the missing data. However, for the random-coefficient example, the variance information can be obtained from the full conditional posterior distributions (given in the on-line supplementary material) by simply plugging the parameter estimates into their variances. This observation suggests a general strategy to improve simulations of the Gibbs sampler: at each iteration, we need to draw samples for only the components for which the posterior variance is not analytically available and is also of interest to us, and the other components can be replaced by the modes of the respective full conditional posterior distributions. As mentioned above, the mode can be found with a subset of samples, which can be much cheaper than sampling from the full data conditional posterior. We expect that the strategy proposed can significantly facilitate Bayesian computation for big data problems. A further study of this proposed strategy will be reported elsewhere.

6. Discussion

In this paper, we have proposed the IRO algorithm, as a general algorithm for dealing with high dimensional missing data problems. Under quite general conditions, we show that the IRO algorithm can lead to a consistent estimate for the parameters. We have also extended the IRO algorithm to the case of multiple block parameters, which leads to the ICRO algorithm. We illustrate the proposed algorithms by using high dimensional GGMs, high dimensional variable selection and a random-coefficient model.

Like the EM algorithm for low dimensional data, we expect that the IRO or ICRO algorithm can have many applications for high dimensional data. With the IRO or ICRO algorithm, many problems can be much simplified, e.g. variable selection for high dimensional mixture regression (Khalili and Chen, 2007) and variable selection for high dimensional mixed effect models (Fan and Li, 2012). For the former, the group index of each sample can be treated as missing data, and then the IRO algorithm can be applied: the I-step assigns the samples to different groups and the RO step conducts variable selection for each group separately. For the latter, at each iteration of the IRO algorithm, it is reduced to variable selection for a high dimensional regression with fixed designs given the imputed random effects.

To assess the convergence of IRO simulations, we recommend the Gelman-Rubin statistic (Gelman and Rubin, 1992). Since the IRO algorithm usually converges very fast, we do not recommend a long run. Our experience shows that 20 iterations have often been sufficiently long to produce a stable parameter estimate. Because of the Markov chain Monte Carlo nature of IRO simulations, we recommend the averaging method for parameter estimation and allow a burn-in period before collecting the samples for averaging. In Section 3.2, we reported the results based on the last iteration, which is just to compare with other one-time imputation methods. Theoretically, the variation in the IRO estimates that are collected at different iterations reflects the missing data information. Hence, in addition to parameter estimates, one might report their variance over iterations. However, as this information is often not of interest to us, we choose not to report them in the paper.
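A minimal sketch of the Gelman–Rubin statistic for one scalar quantity is given below. It assumes that several independent IRO runs of equal length are available and that the same scalar summary (for instance one component of the parameter estimate) has been recorded at each retained iteration of each run; the function name is hypothetical and this is the basic form of the statistic, without later refinements.

```python
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for m chains of equal length n."""
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    within = sum(statistics.variance(c) for c in chains) / len(chains)  # W
    between = n * statistics.variance(means)                            # B
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


# Two runs that sit in very different regions give a value far above 1.
print(gelman_rubin([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]))
```

Values close to 1 suggest that the runs are sampling from similar regions; a commonly used rule of thumb treats values above about 1.1 as a sign that longer runs are needed.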

When a large amount of noise is brought into the system through missing data, the IRO or ICRO algorithm may fail to work, as implied by condition 3 in Appendix A. When the sample size is small, consistent estimators may not be adequately accurate. Our numerical experience shows that, in this case, the IRO or ICRO algorithm might not significantly outperform one-time imputation methods, such as median filling and BPCA, as there is not much information to use for improving imputation during iteration. In general, when the sample size increases, the IRO or ICRO algorithm can significantly outperform one-time imputation methods.

Regarding statistical inference for parameters, we note that, theoretically, it can be done in general scenarios, not limited to the constraint that the posterior distribution of the parameters has a closed form. For example, for high dimensional regression, if the lasso algorithm is employed as the consistent estimator in the IRO algorithm, then the inference for θ, e.g. constructing confidence intervals for each component of θ, can be done by using the desparsified method (Zhang and Zhang, 2014; van de Geer et al., 2014). Let $V(x^{obs}, \tilde{x}_t^{mis}, \theta_n^{(t)})$ denote an uncertainty assessment statistic that is obtained at iteration t. Assume that $V(x^{obs}, x^{mis}, \theta)$ is a Lipschitz function with respect to θ and is integrable. Then, by corollary 3 or corollary 4 in Appendix A (depending on whether IRO or ICRO is being used), we shall be able to obtain an uncertainty assessment for θ by averaging $V(x^{obs}, \tilde{x}_t^{mis}, \theta_n^{(t)})$ along the IRO or ICRO chain. However, how to obtain the uncertainty assessment statistic $V(x^{obs}, \tilde{x}_t^{mis}, \theta_n^{(t)})$ for general high dimensional problems is beyond the scope of this paper.

As a variant of the ICRO algorithm, we note that the I-step can be replaced by an E-step if it is available. In this case, the RO step maximizes the objective function

$$\max_\theta E_{\theta^*}[Q(x^{obs}, \theta \mid \theta_n^{(t)})], \tag{12}$$

where the Q-function is as defined in the EM algorithm, and $E_{\theta^*}$ denotes expectation with respect to the true distribution $\pi(x^{obs} \mid \theta^*)$. Suppose that θ has been partitioned into a few blocks $\theta = (\theta^{(1)}, \ldots, \theta^{(k)})$. To solve problem (12), the ICRO algorithm is reduced to a block-wise consistency algorithm, which finds consistent estimates iteratively for the parameters of each block conditioned on the current estimates of the parameters of other blocks. The block-wise consistency algorithm is closely related to, but more flexible than, the co-ordinate ascent algorithm (Tseng, 2001; Tseng and Yun, 2009). The co-ordinate ascent algorithm iteratively finds the exact maximizer for each block conditioned on the current estimates of the parameters of other blocks. Under appropriate conditions, such as contraction as in condition 11’ in Appendix A and uniform consistency of the $\theta_n^{(t)}$s as shown in theorem 2, we shall be able to show that the paths of the two algorithms will converge to the same point. This will be explored elsewhere.
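To make the comparison concrete, the following toy sketch runs the co-ordinate ascent algorithm on a concave quadratic function with two scalar blocks. The objective function and the function name are hypothetical and chosen only for illustration; each update is the exact maximizer over one block with the other block held fixed. In the block-wise consistency algorithm, each exact maximizer would be replaced by a consistent estimate of it.

```python
def coordinate_ascent(n_sweeps=30):
    """Maximize f(a, b) = -(a^2 + b^2 + a*b) + 3a + 3b by exact block updates.

    Setting df/da = -2a - b + 3 = 0 gives a = (3 - b) / 2, and symmetrically for b.
    The joint maximizer is (a, b) = (1, 1).
    """
    a, b = 0.0, 0.0
    for _ in range(n_sweeps):
        a = (3.0 - b) / 2.0  # exact maximizer over a with b held fixed
        b = (3.0 - a) / 2.0  # exact maximizer over b with the updated a held fixed
    return a, b


print(coordinate_ascent())
```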

The ICRO algorithm has strong implications for big data computing. On the basis of the ICRO algorithm, we have proposed a general strategy to improve Bayesian computation under big data scenarios, i.e. we can replace posterior samples by posterior modes in Gibbs iterations to accelerate simulations, where the posterior modes can be calculated with a subset of samples. In addition, the IRO or ICRO algorithm facilitates data integration from multiple sources when missing data are present. With the IRO or ICRO algorithm, the problem of data integration for incomplete data is converted into a problem for complete data and thus many of the existing meta-analysis methods can be conveniently applied for inference. This is very important for big data analysis.

Acknowledgements

Liang’s research was supported in part by grants DMS-1612924, DMS/NIGMS R01-GM117597 and NIGMS R01-GM126089. The authors thank the Joint Editor, Associate Editor and three referees for their constructive comments which have led to a significant improvement in this paper.

Appendix A

A.1. Proof of consistency of θn(t+1)

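Define $\Theta_n$ as the parameter space of θ, where the subscript n indicates the dependence of the dimension of θ on the sample size n. Where no confusion can arise, we shall refer to $\Theta_n$ as Θ. In addition, we let $\Theta_n^T = \{\theta_n^{(1)}, \ldots, \theta_n^{(T)}\}$ denote a path of $\theta_n$ in the IRO algorithm, which can be considered as an arbitrary subset of $\Theta_n$ with T elements (replicates are allowed). Let $\tilde{x} = (x^{obs}, \tilde{x}^{mis})$ and define

$$\begin{aligned}
G_n(\theta \mid \theta_n^{(t)}) &= E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde{x})\}] = \int \log\{f_\theta(\tilde{x})\}\, f(x^{obs} \mid \theta^*)\, h(\tilde{x}^{mis} \mid \theta_n^{(t)})\, d\tilde{x}, \\
\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) &= \frac{1}{n}\sum_{i=1}^n \log\{f(x_i^{obs}, \tilde{x}_i^{mis} \mid \theta)\}, \\
\tilde{G}_n(\theta \mid \theta_n^{(t)}) &= \frac{1}{n}\sum_{i=1}^n \int \log\{f(x_i^{obs}, \tilde{x}^{mis} \mid \theta)\}\, h(\tilde{x}^{mis} \mid x_i^{obs}, \theta_n^{(t)})\, d\tilde{x}^{mis} := \frac{1}{n}\sum_{i=1}^n q(x_i^{obs}).
\end{aligned} \tag{13}$$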

Our first goal is to show that the following uniform law of large numbers holds for any T such that log(T) = o(n):

$$\sup_{\theta_n^{(t)} \in \Theta_n^T}\, \sup_{\theta \in \Theta_n} |\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})| \to_p 0, \tag{14}$$

where $\to_p$ denotes convergence in probability. To achieve this goal, we need the following conditions.

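Condition 1. $\log\{f_\theta(\tilde{x})\}$ is a continuous function of θ for each $\tilde{x} \in \mathcal{X}$ and a measurable function of $\tilde{x}$ for each θ.

Condition 2 (conditions for Glivenko–Cantelli theorem).

  1. There is a function $m_n(\tilde{x})$ such that $\sup_{\theta \in \Theta_n,\, \tilde{x} \in \mathcal{X}} |\log\{f_\theta(\tilde{x})\}| \le m_n(\tilde{x})$.

  2. Define $\tilde{m}_n(x^{obs}, \theta_n^{(t)}) = \int m_n(\tilde{x})\, h(\tilde{x}^{mis} \mid x^{obs}, \theta_n^{(t)})\, d\tilde{x}^{mis}$. Assume that there exists $m_n^*(x^{obs})$ such that $0 \le \tilde{m}_n(x^{obs}, \theta_n^{(t)}) \le m_n^*(x^{obs})$ for all $\theta_n^{(t)}$, $E[m_n^*(x^{obs})] < \infty$ and $\sup_{n \in \mathbb{Z}^+} E[m_n^*(x^{obs}) 1\{m_n^*(x^{obs}) \ge \zeta\}] \to 0$ as $\zeta \to \infty$. In addition, $\sup_{n \ge 1} \sup_{x \in \mathcal{X},\, \theta \in \Theta_n} |\int m_n(\tilde{x}) 1\{m_n(\tilde{x}) > \zeta\}\, h(\tilde{x}^{mis} \mid x, \theta)\, d\tilde{x}^{mis}| \to 0$ as $\zeta \to \infty$.

  3. Define $\mathcal{F}_n = \{\int \log\{f(x^{obs}, \tilde{x}^{mis} \mid \theta)\}\, h(\tilde{x}^{mis} \mid x^{obs}, \theta_n^{(t)})\, d\tilde{x}^{mis} \mid \theta, \theta_n^{(t)} \in \Theta_n\}$ and $\mathcal{G}_{n,M} = \{q 1\{m_n^*(x^{obs}) \le M\} \mid q \in \mathcal{F}_n\}$. Suppose that, for every ϵ and M > 0, the metric entropy $\log[N\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}] = o_P^*(n)$, where $\mathbb{P}_n$ is the empirical measure of $x^{obs}$, and $N\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}$ is the covering number with respect to the $L_1(\mathbb{P}_n)$-norm.

Condition 3 (conditions for imputed data). Define $Z_{t,i} = \log\{f(x_i^{obs}, \tilde{x}_i^{mis} \mid \theta)\} - \int \log\{f(x_i^{obs}, \tilde{x}^{mis} \mid \theta)\}\, h(\tilde{x}^{mis} \mid x_i^{obs}, \theta_n^{(t)})\, d\tilde{x}^{mis}$. Suppose that, for any θ and $\theta_n^{(t)} \in \Theta_n$, $E|Z_{t,i}|^m \le m!\, M_b^{m-2} v_i/2$ for every $m \ge 2$ and some constants $M_b > 0$ and $v_i = O(1)$, i.e. the $Z_{t,i}$ are subexponential random variables.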

Theorem 1. Assume conditions 1–3; then law (14) holds for any T such that log(T) = o(n).

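Proof. By the definitions in expression (13), we have the decomposition

$$\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)}) = \{\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})\} + \{\tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})\}, \tag{15}$$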

which consists of two terms; the first term comes from imputation of missing data, and the second term comes from the observed data.

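First, we show that the second term of equation (15) converges to 0 uniformly, following the proof of theorem 2.4.3 of van der Vaart and Wellner (1996). By the symmetrization lemma 2.3.1 of van der Vaart and Wellner (1996), measurability of the class $\mathcal{F}_n$ and Fubini’s theorem,

$$\begin{aligned}
E^*\Big[\sup_{\theta, \theta_n^{(t)} \in \Theta_n} |\tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})|\Big]
&\le 2E_{x^{obs}}\Big[E_\epsilon\Big[\sup_{q(x) \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^n \epsilon_i q(x_i^{obs})\Big]\Big] \\
&\le 2E_{x^{obs}}\Big[E_\epsilon\Big[\sup_{q(x) \in \mathcal{G}_{n,M}} \frac{1}{n}\sum_{i=1}^n \epsilon_i q(x_i^{obs})\Big] + 2E^*[m_n^*(x^{obs}) 1\{m_n^*(x^{obs}) > M\}]\Big],
\end{aligned}$$

where the $\epsilon_i$ are independent and identically distributed Rademacher random variables with $P(\epsilon_i = 1) = P(\epsilon_i = -1) = \tfrac{1}{2}$, and $E^*$ denotes the outer expectation.

By condition 2, $2E^*[m_n^*(x^{obs}) 1\{m_n^*(x^{obs}) > M\}] \to 0$ for sufficiently large M. To prove convergence in mean, it suffices to show that the first term converges to 0 for fixed M. Fix $x_1^{obs}, \ldots, x_n^{obs}$, and let H be an ϵ-net in $L_1(\mathbb{P}_n)$ over $\mathcal{G}_{n,M}$; then

$$E_\epsilon\Big[\sup_{q(x) \in \mathcal{G}_{n,M}} \frac{1}{n}\sum_{i=1}^n \epsilon_i q(x_i^{obs})\Big] \le E_\epsilon\Big[\sup_{q(x) \in H} \frac{1}{n}\sum_{i=1}^n \epsilon_i q(x_i^{obs})\Big] + \epsilon.$$

The cardinality of H can be chosen equal to $N\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}$. Bounding the $L_1$-norm on the right-hand side by the Orlicz norm $\psi_2$ and using the maximal inequality (lemma 2.2.2 of van der Vaart and Wellner (1996)) and Hoeffding’s inequality, it can be shown that

$$E_\epsilon\Big[\sup_{q(x) \in \mathcal{G}_{n,M}} \frac{1}{n}\sum_{i=1}^n \epsilon_i q(x_i^{obs})\Big] \le K\sqrt{1 + \log[N\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}]}\,\sqrt{6/n}\, M + \epsilon \to_{P^*} \epsilon, \tag{16}$$

where K is a constant, and $P^*$ denotes outer probability. It has been shown that the left-hand side of inequality (16) converges to 0 in probability. Since it is bounded by M, its expectation with respect to $x_1^{obs}, \ldots, x_n^{obs}$ converges to 0 by the dominated convergence theorem.

This concludes the proof that $\sup_{\theta_n^{(t)} \in \Theta_n} \sup_{\theta \in \Theta_n} |\tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})| \to 0$ in mean. Further, by Markov’s inequality, we conclude that

$$\sup_{\theta_n^{(t)} \in \Theta_n}\, \sup_{\theta \in \Theta_n} |\tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})| \to_p 0. \tag{17}$$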

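To establish the uniform convergence of the first term of equation (15), we fix $x_1^{obs}, \ldots, x_n^{obs}$. By condition 3, $n\{\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})\} = Z_{t,1} + Z_{t,2} + \cdots + Z_{t,n}$. By Bernstein’s inequality,

$$P\{n|\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})| > z\} = P(|Z_{t,1} + Z_{t,2} + \cdots + Z_{t,n}| > z) \le 2\exp\Big(-\frac{1}{2}\,\frac{z^2}{v + M_b z}\Big),$$

for $v \ge v_1 + \cdots + v_n$. By lemma 2.2.10 of van der Vaart and Wellner (1996), for the Orlicz norm $\psi_1$, we have

$$\Big\|\sup_{\theta \in \Theta_n,\, t = 1, 2, \ldots, T} n\{\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})\}\Big\|_{\psi_1} \le \epsilon + K\big(M_b \log[1 + TN\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}]\big) + \sqrt{v}\,\sqrt{\log[1 + TN\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}]},$$

for a constant K and any ϵ > 0. By part 3 of condition 2 (the metric entropy condition) and the condition log(T) = o(n),

$$\Big\|\sup_{\theta \in \Theta_n,\, t = 1, 2, \ldots, T} \{\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})\}\Big\|_{\psi_1} \le \epsilon + K\big(M_b \log[1 + TN\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}]/n\big) + \sqrt{v/n}\,\sqrt{\log[1 + TN\{\epsilon, \mathcal{G}_{n,M}, L_1(\mathbb{P}_n)\}]/n} \to_{P^*} \epsilon.$$

Therefore,

$$\sup_{\theta \in \Theta_n,\, t = 1, 2, \ldots, T} |\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})| \to_p 0. \tag{18}$$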

The theorem can then be concluded by combining results (17) and (18).

Remark 1 (on the metric entropy condition). Assume that all elements in $\cup_{n\ge 1}\mathcal{F}_n$ are uniformly Lipschitz with respect to the $L_1$-norm. Then the metric entropy $\log[N\{\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n)\}]$ can be measured on the basis of the parameter space $\Theta$. Since the functions in $\mathcal{G}_{n,M}$ are all bounded, the corresponding parameter space $\Theta_{n,M}$ can be contained in an $L_1$-ball by the continuity of $\log\{f(\tilde x|\theta)\}$ in $\theta$. Further, we assume that the diameter of the $L_1$-ball or the space $\Theta_{n,M}$ grows at a rate of $O(n^{\alpha})$ for some $0\le\alpha<1/2$; then $\log[N\{\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n)\}]=O\{n^{2\alpha}\log(p)\}$ holds, which allows $p$ to grow at a polynomial rate of $O(n^{\gamma})$ for some constant $0<\gamma<\infty$. The increased diameter accounts for the conventional assumption that the size of the true model grows with the sample size $n$. Refer to Vershynin (2015) for more discussion on this issue. Similar conditions on metric entropy have been used in the literature of high dimensional statistics. For example, Raskutti et al. (2011) studied minimax rates of estimation for high dimensional linear regression over $L_q$-balls.

Remark 2 (on the condition of T). Since the imputation step draws random data at each iteration $t$, there is no way to show uniform convergence of $\theta_n^{(t+1)}$ to $\theta_*^{(t)}$ over all possible $\theta_n^{(t)}\in\Theta_n$. However, we can prove that the consistency results hold for any sequence of $\theta_n^{(1)},\ldots,\theta_n^{(T)}$ with $T$ being not too large compared with $\exp(n)$. This is enough for theorems 3–6. To justify this, we may consider the case that the dimension of $\theta_n$ grows with $n$ at a rate of $p=O(n^{\gamma})$ for a constant $\gamma>0$, say $\gamma=5$. Then it is easy to see that, when $n>13$, the ratio $T/p$ has an order of

$$O\{\exp(n)/p\}=O[\exp\{n-\gamma\log(n)\}]\succ O\{\exp(0.1n)\}\succ O(p^{100}),$$

which implies that essentially there is no constraint on the setting of T. For Markov chain Monte Carlo simulations, the number of iterations is often set to a low order polynomial of p for a given set of observations.

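For any $\theta_n^{(t)}\in\Theta_n^T$, we define $\theta_n^{(t+1)}$. We would like to establish the uniform consistency of $\theta_n^{(t+1)}$ with respect to $t$, i.e.

$$\sup_{t\in\{1,2,\ldots,T\}}\|\theta_n^{(t+1)}-\theta_*^{(t)}\|\to_p 0,\quad\text{as } n\to\infty. \qquad (19)$$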

To achieve this goal, we assume the following condition.

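Condition 4. For each $t=1,2,\ldots,T$, $G_n(\theta|\theta_n^{(t)})$ has a unique maximum at $\theta_*^{(t)}$; for any $\epsilon>0$, $\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}G_n(\theta|\theta_n^{(t)})$ exists, where $B_t(\epsilon)=\{\theta\in\Theta_n:\|\theta-\theta_*^{(t)}\|<\epsilon\}$. Let

$$\delta_t=G_n(\theta_*^{(t)}|\theta_n^{(t)})-\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}G_n(\theta|\theta_n^{(t)}),$$

$\delta=\min_{t\in\{1,2,\ldots,T\}}\delta_t>0$.

The existence of $\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}G_n(\theta|\theta_n^{(t)})$ can be easily satisfied if $\Theta_n$ is restricted to a compact set, which implies that $\Theta_n\setminus B_t(\epsilon)$ is also a compact set and thus the supremum is achievable. This condition can also be satisfied by assuming that $\Theta_n$ is convex and, for each $t$, $\theta_*^{(t)}$ is in the interior of $\Theta_n$ and $G_n(\theta|\theta_n^{(t)})$ is concave in $\theta$.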

Theorem 2. Assume that conditions 1–4 hold; then the maximum pseudocomplete-data likelihood estimate $\theta_n^{(t+1)}$ is uniformly consistent to $\theta_*^{(t)}$ over $t=1,2,\ldots,T$, i.e. expression (19) holds.

Proof. Since both $\hat G_n(\theta|\tilde x,\theta_n^{(t)})$ and $G_n(\theta|\theta_n^{(t)})$ are continuous in $\theta$ as implied by the continuity of $\log\{f_\theta(\tilde x)\}$, the remaining part of the proof follows from lemma 1 by setting the penalty function $P_{\lambda_n}(\theta)=0$ for all $\theta\in\Theta_n$.

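Lemma 1. Consider a sequence of functions $Q_t(\theta,X_n)$ for $t=1,2,\ldots,T$. Suppose that the following conditions are satisfied.

Condition 5. For each $t$, $Q_t(\theta,X_n)$ is continuous in $\theta$ and there is a function $Q_t^*(\theta)$, which is continuous in $\theta$ and uniquely maximized at $\theta_*^{(t)}$.

Condition 6. For any $\epsilon>0$, $\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}Q_t^*(\theta)$ exists, where $B_t(\epsilon)=\{\theta:\|\theta-\theta_*^{(t)}\|<\epsilon\}$; let $\delta_t=Q_t^*(\theta_*^{(t)})-\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}Q_t^*(\theta)$, $\delta=\min_{t\in\{1,2,\ldots,T\}}\delta_t>0$.

Condition 7. $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in\Theta_n}|Q_t(\theta,X_n)-Q_t^*(\theta)|\to_p 0$ as $n\to\infty$.

Condition 8. The penalty function $P_{\lambda_n}(\theta)$ is non-negative and converges to 0 uniformly over the set $\{\theta_*^{(t)}:t=1,2,\ldots,T\}$ as $n\to\infty$, where $\lambda_n$ is a regularization parameter and its value can depend on the sample size $n$. Let $\theta_n^{(t)}=\arg\max_{\theta\in\Theta_n}\{Q_t(\theta,X_n)-P_{\lambda_n}(\theta)\}$. Then uniform convergence holds, i.e. $\sup_{t\in\{1,2,\ldots,T\}}\|\theta_n^{(t)}-\theta_*^{(t)}\|\to_p 0$.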

Proof. Consider the two events

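  (a) $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in\Theta_n\setminus B_t(\epsilon)}|Q_t(\theta,X_n)-Q_t^*(\theta)|<\delta/2$ and

  (b) $\sup_{t\in\{1,2,\ldots,T\}}\sup_{\theta\in\Theta_n}|Q_t(\theta,X_n)-Q_t^*(\theta)|<\delta/2$.

From event (a), we can deduce that, for any $t\in\{1,2,\ldots,T\}$ and any $\theta\in\Theta_n\setminus B_t(\epsilon)$, $Q_t(\theta,X_n)<Q_t^*(\theta)+\delta/2\le Q_t^*(\theta_*^{(t)})-\delta_t+\delta/2\le Q_t^*(\theta_*^{(t)})-\delta/2$. Therefore, $Q_t(\theta_*^{(t)},X_n)-P_{\lambda_n}(\theta_*^{(t)})<Q_t^*(\theta_*^{(t)})-\delta/2-o(1)$ by condition 8.

From event (b), we can deduce that, for any $t\in\{1,2,\ldots,T\}$ and any $\theta\in B_t(\epsilon)$, $Q_t(\theta,X_n)>Q_t^*(\theta)-\delta/2$ and $Q_t(\theta_*^{(t)},X_n)>Q_t^*(\theta_*^{(t)})-\delta/2$. Therefore, $Q_t(\theta_*^{(t)},X_n)-P_{\lambda_n}(\theta_*^{(t)})>Q_t^*(\theta_*^{(t)})-\delta/2-o(1)$ by condition 8.

If both events hold simultaneously, then we must have $\theta_n^{(t)}\in B_t(\epsilon)$ for all $t\in\{1,2,\ldots,T\}$ as $n\to\infty$. By condition 7, the probability that both events hold tends to 1. Therefore,

$$P\{\theta_n^{(t)}\in B_t(\epsilon)\text{ for all } t=1,2,\ldots,T\}\to 1,$$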

which concludes the lemma. □

Theorem 2 establishes the consistency of $\theta_n^{(t+1)}$ with respect to $\theta_*^{(t)}$ for each $t=1,2,\ldots,T$. However, in the small n–large p scenario, $\theta_n^{(t+1)}$ is not well defined. For this reason, a sparsity constraint needs to be imposed on $\theta$. For example, we can apply a regularization method to obtain an estimate of $\theta_*^{(t)}$, i.e. we can define

$$\theta_{n,p}^{(t+1)}=\arg\max_{\theta\in\Theta_n}\{\hat G_n(\theta|\tilde x,\theta_n^{(t)})-P_{\lambda_n}(\theta)\}, \qquad (20)$$

where the penalty function $P_{\lambda_n}(\theta)$ constrains the sparsity of the solution. Make the following assumption.

Condition 9. The penalty function $P_{\lambda_n}(\theta)$ is non-negative, ensures the existence of $\theta_{n,p}^{(t+1)}$ for all $n\in\mathbb{N}$ and $t=1,2,\ldots,T$ and converges to 0 uniformly over the set $\{\theta_*^{(t)}:t=1,2,\ldots,T\}$ as $n\to\infty$.

Corollary 1. If conditions 1–4 and 9 hold, then the regularization estimator $\theta_{n,p}^{(t+1)}$ in definition (20) is uniformly consistent to $\theta_*^{(t)}$ over $t=1,2,\ldots,T$, i.e. $\sup_{t\in\{1,2,\ldots,T\}}\|\theta_{n,p}^{(t+1)}-\theta_*^{(t)}\|\to_p 0$ as $n\to\infty$.

The proof of corollary 1 follows the proof of lemma 1 directly.
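The alternation between an imputation step and a regularized optimization step as in definition (20) can be sketched on a deliberately simple toy model. The Python code below is a minimal sketch under strong assumptions and is not the authors' implementation: the coordinates are taken as independent $N(\theta_j,1)$ variables, entries are deleted completely at random, and the penalty is a hard-thresholding (L0-type) penalty, for which the penalized maximizer on the pseudocomplete data is a thresholded column mean. All function names, the threshold choice and the burn-in length are hypothetical.

```python
import math
import random


def hard_threshold(m, lam):
    """Maximizer of -(theta - m)^2 / 2 - (lam^2 / 2) * 1{theta != 0}."""
    return m if abs(m) > lam else 0.0


def iro_toy(x, lam, n_iter=300, burn_in=100, seed=0):
    """Toy IRO loop for x_ij ~ N(theta_j, 1); None marks a missing entry."""
    rng = random.Random(seed)
    n, p = len(x), len(x[0])
    theta = [0.0] * p                  # starting value
    running = [0.0] * p                # sum of post-burn-in iterates
    for t in range(n_iter):
        col_sum = [0.0] * p
        for row in x:
            for j, val in enumerate(row):
                # I-step: impute a missing entry from N(theta_j, 1) at the current theta
                col_sum[j] += rng.gauss(theta[j], 1.0) if val is None else val
        # RO-step: penalized maximizer on the pseudocomplete data
        theta = [hard_threshold(s / n, lam) for s in col_sum]
        if t >= burn_in:
            running = [r + th for r, th in zip(running, theta)]
    return [r / (n_iter - burn_in) for r in running]


def make_data(theta_true, n, miss_rate, seed=1):
    """Simulate rows and delete entries completely at random (an assumption)."""
    rng = random.Random(seed)
    return [[None if rng.random() < miss_rate else rng.gauss(th, 1.0)
             for th in theta_true] for _ in range(n)]


theta_true = [2.0, 0.0, 0.0, -1.5, 0.0]
data = make_data(theta_true, n=400, miss_rate=0.3)
lam = math.sqrt(2.0 * math.log(len(theta_true)) / 400)   # hypothetical threshold
estimate = iro_toy(data, lam)                             # averaged iterates
print([round(e, 2) for e in estimate])
```

The returned value is the average of the iterates after burn-in, mirroring the averaged estimate discussed in the paper; in this toy model the imputed values carry no extra information, so the sketch only illustrates the mechanics of the two steps.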

Take the high dimensional regression as an example. If we allow $p$ to grow with $n$ at the rate $p=O(n^{\gamma})$ for some constant $\gamma>0$, allow the size of $\beta_*^{(t)}$ for all $t$ to grow with $n$ at the rate $O(n^{\alpha})$ for some constant $0<\alpha<1/2$, choose $\lambda_n=O[\sqrt{\{\log(p)/n\}}]$ and set $P_{\lambda_n}(\theta)=\lambda_n\sum_{i=1}^p c_{\lambda_n}(|\theta_i|)$, where $c_{\lambda_n}(\cdot)$ is set in the form of smoothly clipped absolute deviation (Fan and Li, 2001) or MCP (Zhang, 2010) penalties, then condition 9 is satisfied. For both the smoothly clipped absolute deviation and MCP penalties, $c_{\lambda_n}(|\theta_i|)$ equals 0 if $\theta_i=0$ and is bounded by a constant otherwise. Similarly, if the beta-min assumption holds, i.e. there is a constant $\beta_{\min}>0$ such that $\min_{j\in S^*}|\beta_j^*|\ge\beta_{\min}$, where $S^*=\{j:\beta_j^*\ne 0\}$ denotes the index set of non-zero regression coefficients, then the reciprocal lasso penalty (Song and Liang, 2015b) also satisfies condition 8. Note that, if $\Theta_n=\mathbb{R}^p$, the lasso penalty does not satisfy condition 9, because it is unbounded. This explains why the lasso estimate is biased even as $n\to\infty$. However, if $\Theta_n$ is restricted to a bounded space, then the lasso penalty also satisfies condition 9.
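To make the boundedness remark concrete, the following Python sketch evaluates the smoothly clipped absolute deviation and MCP penalties in their commonly stated closed forms next to the lasso penalty. The function names and the default shape parameters (a = 3.7, gamma = 3.0) are assumptions made for this example and are not taken from the text; each penalty is written as a single function of $|\theta|$, which is assumed here to play the role of $\lambda_n c_{\lambda_n}(|\theta_i|)$.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) evaluated at a scalar theta."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2        # constant beyond a * lam


def mcp_penalty(theta, lam, gamma=3.0):
    """Minimax concave penalty of Zhang (2010) evaluated at a scalar theta."""
    t = abs(theta)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)
    return gamma * lam * lam / 2          # constant beyond gamma * lam


def lasso_penalty(theta, lam):
    """Lasso penalty, which grows linearly in |theta| without bound."""
    return lam * abs(theta)


lam = 0.5
for theta in (0.0, 0.25, 1.0, 10.0, 100.0):
    print(theta, scad_penalty(theta, lam), mcp_penalty(theta, lam),
          lasso_penalty(theta, lam))
```

For a fixed $\lambda$ the first two penalties level off once $|\theta|$ exceeds $a\lambda$ or $\gamma\lambda$, and their plateau shrinks with $\lambda^2$, whereas the lasso penalty keeps growing linearly in $|\theta|$.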

Alternatively to regularization methods, one may first restrict the space of θ*(t) to some low dimensional subspace through sure screening and then find a consistent estimate in the subspace by using conventional statistical methods, such as maximum likelihood, moment estimation or even regularization. Both the ψ-learning (Liang et al., 2015) and the SIS (Fan and Lv, 2008; Fan and Song, 2010) methods belong to this class. For ψ-learning, after correlation screening (based on the pseudocomplete data), the remaining network structure estimation procedure is essentially the same as the covariance selection method (Dempster, 1972) which, by nature, is a maximum likelihood estimation method. It is interesting that the sure-screening-based methods can be viewed as a special subclass of regularization methods, for which the solutions in the low dimensional subspace receive a zero penalty, and those outside the subspace receive a penalty of ∞. It is easy to see that such a binary-type penalty function satisfies condition 9.

Both the regularization and the sure-screening-based methods are constructive. In what follows, we give a proof for the use of general consistent estimation procedures in the IRO algorithm. Let $\theta_{n,g}^{(t+1)}$ denote the estimate of $\theta_*^{(t)}$ produced by such a general consistent estimation procedure at iteration $t+1$. Corollary 2 shows that, if $\theta_{n,g}^{(t+1)}$ is sufficiently accurate for each $t$ (pointwise) and the log-likelihood function of the pseudocomplete data satisfies some moment conditions, then the estimation procedure can be used in the IRO algorithm. Therefore, by its maximum likelihood estimation nature in the subspace, the use of the ψ-learning algorithm in the IRO algorithm can also be justified by corollary 2.

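Condition 10 (conditions for general consistent estimate $\theta_{n,g}^{(t)}$). Assume that, for each $t=1,2,\ldots,T$, $\|\theta_{n,g}^{(t+1)}-\theta_*^{(t)}\|=O_p(1/\sqrt{n})$ (pointwise) and the Hessian matrix $\partial^2G_n(\theta|\tilde x,\theta_n^{(t)})/\partial\theta\,\partial\theta^{\mathrm{T}}$ is bounded in a neighbourhood of $\theta_*^{(t)}$; let

$$Z_{t,i}=\log\{f(x_i^{\mathrm{obs}},\tilde x_i^{\mathrm{mis}}|\theta_{n,g}^{(t+1)})\}-\int\log\{f(x^{\mathrm{obs}},\tilde x^{\mathrm{mis}}|\theta_{n,g}^{(t+1)})\}f(x^{\mathrm{obs}}|\theta^*)h(\tilde x^{\mathrm{mis}}|x_i^{\mathrm{obs}},\theta_n^{(t)})\,d\tilde x^{\mathrm{mis}}\,dx^{\mathrm{obs}};$$

then $E|Z_{t,i}|^m\le m!\,\tilde M_b^{m-2}\tilde v_i/2$ for every $m\ge 2$ and some constants $\tilde M_b>0$ and $\tilde v_i=O(1)$.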

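Corollary 2. Assume conditions 1–4 and 10. Then $\theta_{n,g}^{(t+1)}$ is uniformly consistent to $\theta_*^{(t)}$ over $t=1,2,\ldots,T$, i.e. $\sup_{t\in\{1,2,\ldots,T\}}\|\theta_{n,g}^{(t+1)}-\theta_*^{(t)}\|\to_p 0$ as $n\to\infty$.

Proof. Applying Taylor series expansion to $G_n(\theta|\theta_n^{(t)})$ at $\theta_*^{(t)}$, we obtain $G_n(\theta_{n,g}^{(t+1)}|\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})=O_p(1/n)$, following from condition 10 and condition 4 that $G_n(\theta|\theta_n^{(t)})$ is maximized at $\theta_*^{(t)}$. Therefore,

$$n\{\hat G_n(\theta_{n,g}^{(t+1)}|\tilde x,\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})\}=Z_{t,1}+\cdots+Z_{t,n}+n\{G_n(\theta_{n,g}^{(t+1)}|\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})\}=Z_{t,1}+\cdots+Z_{t,n}+\epsilon_n,$$

where $\epsilon_n=O_p(1)$, and

$$P\{n|\hat G_n(\theta_{n,g}^{(t+1)}|\tilde x,\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})|>nz\}\le P(|Z_{t,1}+\cdots+Z_{t,n}|>nz-|\epsilon_n|). \qquad (21)$$

By Bernstein’s inequality,

$$P(|Z_{t,1}+\cdots+Z_{t,n}|>nz-|\epsilon_n|)\le 2\exp\left\{-\frac{1}{2}\,\frac{(z-|\epsilon_n|/n)^2}{\tilde v+\tilde M_b'(z-|\epsilon_n|/n)}\right\}, \qquad (22)$$

for $\tilde v\ge(\tilde v_1+\cdots+\tilde v_n)/n^2$ and $\tilde M_b'=\tilde M_b/n$. Applying Taylor series expansion to the right-hand side of inequality (22) at $z$ and combining with inequality (21) lead to

$$P\{|\hat G_n(\theta_{n,g}^{(t+1)}|\tilde x,\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})|>z\}\le K\exp\left(-\frac{1}{2}\,\frac{z^2}{\tilde v+\tilde M_b'z}\right), \qquad (23)$$

where $K=2+(3/\tilde M_b')O_p(1/n)=2+(3/\tilde M_b)O_p(1)$, since the derivative $|d\{z^2/(\tilde v+\tilde M_b'z)\}/dz|\le 3/\tilde M_b'$.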

As in the proof of theorem 1, by applying lemma 2.2.10 of van der Vaart and Wellner (1996), we can prove

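$$\sup_{\theta_n^{(t)}\in\Theta_n,\,t\in\{1,2,\ldots,T\}}|\hat G_n(\theta_{n,g}^{(t+1)}|\tilde x,\theta_n^{(t)})-G_n(\theta_*^{(t)}|\theta_n^{(t)})|\to_p 0. \qquad (24)$$

As implied by the proof of lemma 2.2.10 of van der Vaart and Wellner (1996), result (24) holds for a general constant $K$ in inequality (22). Then, by condition 4, we must have the uniform convergence that $\theta_{n,g}^{(t+1)}\in B_t(\epsilon)$ for all $t$ as $n\to\infty$, where $B_t(\epsilon)$ is as defined in condition 4. This statement can be proved by contradiction as follows.

Assume that $\theta_{n,g}^{(i+1)}\notin B_i(\epsilon)$ for some $i\in\{1,2,\ldots,T\}$. By the uniform convergence that was established in theorem 1, $|\hat G_n(\theta_{n,g}^{(i+1)}|\tilde x,\theta_n^{(i)})-G_n(\theta_{n,g}^{(i+1)}|\theta_n^{(i)})|=o_p(1)$. Further, by condition 4 and the assumption that $\theta_{n,g}^{(i+1)}\notin B_i(\epsilon)$,

$$|\hat G_n(\theta_{n,g}^{(i+1)}|\tilde x,\theta_n^{(i)})-G_n(\theta_*^{(i)}|\theta_n^{(i)})|\ge|G_n(\theta_{n,g}^{(i+1)}|\theta_n^{(i)})-G_n(\theta_*^{(i)}|\theta_n^{(i)})|-|\hat G_n(\theta_{n,g}^{(i+1)}|\tilde x,\theta_n^{(i)})-G_n(\theta_{n,g}^{(i+1)}|\theta_n^{(i)})|\ge\delta-o_p(1),$$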

which contradicts the uniform convergence that was established in result (24). This concludes the proof. □
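The Taylor step that opens the proof of corollary 2 can be written out as the following worked equation. It is a sketch under stated assumptions: $G_n(\theta|\theta_n^{(t)})$ is taken to be twice differentiable near $\theta_*^{(t)}$, its gradient is taken to vanish at the interior maximizer $\theta_*^{(t)}$, the distance $\|\theta_{n,g}^{(t+1)}-\theta_*^{(t)}\|$ is taken to be of order $O_p(n^{-1/2})$, and the intermediate point $\bar\theta$ and the Hessian symbol $H$ are notation introduced here for illustration.

```latex
\begin{aligned}
G_n(\theta_{n,g}^{(t+1)} \mid \theta_n^{(t)}) - G_n(\theta_*^{(t)} \mid \theta_n^{(t)})
 &= \tfrac{1}{2}\,(\theta_{n,g}^{(t+1)} - \theta_*^{(t)})^{\mathrm{T}}
    H(\bar\theta)\,(\theta_{n,g}^{(t+1)} - \theta_*^{(t)}) \\
 &= O_p\bigl(\|\theta_{n,g}^{(t+1)} - \theta_*^{(t)}\|^{2}\bigr)
  = O_p(1/n),
\end{aligned}
\qquad
H(\bar\theta) = \left.\frac{\partial^{2} G_n(\theta \mid \theta_n^{(t)})}
                       {\partial\theta\,\partial\theta^{\mathrm{T}}}\right|_{\theta=\bar\theta}.
```

The first-order term is absent because the gradient is zero at the maximizer, and the bounded Hessian turns the squared distance into the $O_p(1/n)$ rate; multiplying by $n$ then gives a remainder of order $O_p(1)$.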

Remark 3 (on the accuracy of the $\theta_{n,g}^{(t)}$s). Condition 10 restricts the consistent estimates to those having a distance to the true parameter point of the order $O_p(1/\sqrt{n})$. Such a condition can be satisfied by some estimation procedures in the low dimensional subspace, e.g. maximum likelihood, for which both the variance and the bias are often of the order $O(1/n)$ (Firth, 1993) and therefore the root-mean-squared error is of the order $O(1/\sqrt{n})$. To make the result of corollary 2 more general to include more estimation procedures, we can relax this order to $\|\theta_{n,g}^{(t+1)}-\theta_*^{(t)}\|=O_p(n^{-1/4})$, if we would like to relax the order of $T$ to $\log(T)=o(\sqrt{n})$ and the order of metric entropy to $\log[N\{\epsilon,\mathcal{G}_{n,M},L_1(\mathbb{P}_n)\}]=o_p^*(\sqrt{n})$. As mentioned in remarks 1 and 2, both the order of $T$ and the order of metric entropy are technical conditions and relaxing them to the order of $O(\sqrt{n})$ will not restrict much the applications of the IRO algorithm. The proof for this relaxation is straightforward, following the proof of corollary 2.

A.2. Proof of ergodicity of the Markov chain $\{\theta_n^{(t)}\}$

Although the IRO algorithm is different from the SEM algorithm in the $\theta_n^{(t)}$ updating step, the Markov chains $\{\theta_n^{(t)}\}$ that are induced by the two algorithms share some similar properties as well as similar proofs. The following two lemmas, lemma 2 and lemma 3, can be proved in the same way as in Nielsen (2000), and thus the proofs have been omitted.

Lemma 2. The Markov chain $\{\theta_n^{(t)}\}$ is irreducible and aperiodic.

Lemma 3. If condition 1 holds, then the Markov chain has the weak Feller property, and any compact subset of $\Theta$ is small.

If condition 1 holds and $\Theta_n$ is restricted to a compact set, then the Markov chain $\{\theta_n^{(t)}\}$ is ergodic. Here we would like to establish the ergodicity of the Markov chain $\{\theta_n^{(t)}\}$ under a more general scenario $\Theta_n=\mathbb{R}^p$. This can be done by verifying a drift condition. Similarly to Nielsen (2000), we choose the negative log-likelihood function of the observed data as the drift function, motivated by the drift in the EM algorithm towards high density areas.

Theorem 3. If conditions 1–3 hold, then $\{\theta_n^{(t)}\}$ is almost surely ergodic for sufficiently large $n$.

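Proof. Let $v(\theta)=C-(1/n)\log f\{x_1^{\mathrm{obs}},\ldots,x_n^{\mathrm{obs}}|\theta\}$, where $C$ denotes a constant such that $v(\theta)\ge 0$ for all $\theta\in\Theta_n$. Since $v(\theta)$ is non-negative, it can be used to build the drift condition. Define

$$\begin{aligned}\Delta v(\theta)&=E_h[v(\theta_n^{(t+1)})-v(\theta_n^{(t)})]=E_h\left[\frac{1}{n}\log\{f(x^{\mathrm{obs}}|\theta_n^{(t)})\}-\frac{1}{n}\log\{f(x^{\mathrm{obs}}|\theta_n^{(t+1)})\}\right]\\&=E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t)})\}-\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}\right]-E_h\left[\frac{1}{n}\log\{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t)})\}-\frac{1}{n}\log\{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t+1)})\}\right]\\&=I+II,\end{aligned}$$

where $E_h$ refers to the expectation with respect to the predictive distribution $h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t)})$.

First, we consider the negative of part I, which can be decomposed as

$$\begin{aligned}-I&=E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t)})\}\right]\\&=E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]\right]+E_h\left[\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]-\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t)})\}\right],\end{aligned}$$

where the function $M(\theta)$ is defined by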

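$$M(\theta)=\arg\max_{\theta'}E_\theta[\log\{f(\tilde x|\theta')\}]=\arg\max_{\theta'}\int\log\{f(x^{\mathrm{obs}},\tilde x^{\mathrm{mis}}|\theta')\}f(x^{\mathrm{obs}}|\theta^*)h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta)\,d\tilde x^{\mathrm{mis}}\,dx^{\mathrm{obs}}. \qquad (25)$$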

By the uniform law of large numbers that was established in theorem 1, we have

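$$\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]\to_p G_n(\theta_n^{(t+1)}|\theta_n^{(t)})-G_n\{M(\theta_n^{(t)})|\theta_n^{(t)}\}.$$

From theorem 2, we have $\|\theta_n^{(t+1)}-M(\theta_n^{(t)})\|\to_p 0$. Further, by the continuity of $G_n(\theta|\theta')$ with respect to $\theta$, we have $G_n(\theta_n^{(t+1)}|\theta_n^{(t)})-G_n\{M(\theta_n^{(t)})|\theta_n^{(t)}\}\to_p 0$ and thus $(1/n)\log\{f(\tilde x|\theta_n^{(t+1)})\}-(1/n)\log[f\{\tilde x|M(\theta_n^{(t)})\}]\to_p 0$. Then, by the boundedness of $\log\{f(\tilde x|\theta)\}$ (condition 2) and the dominated convergence theorem,

$$E\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]\right]\to 0, \qquad (26)$$

where the expectation is with respect to the joint density function of $\tilde x=(x^{\mathrm{obs}},\tilde x^{\mathrm{mis}})$. Note that, for any $\theta\in\Theta_n$, we have

$$E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta)\}\right]=\frac{1}{n}\sum_{i=1}^nE_h[\log\{f(\tilde x_i|\theta)\}]\triangleq\frac{1}{n}\sum_{i=1}^ng(x_i^{\mathrm{obs}}), \qquad (27)$$

where the $g(x_i^{\mathrm{obs}})$ are mutually independent, but not necessarily identically distributed because of the presence of missing data. Then, by result (26), condition 2 and Kolmogorov’s strong law of large numbers, we have

$$E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]\right]\to 0,\quad\text{almost surely}, \qquad (28)$$

as $n\to\infty$. Therefore, there is a constant $c>0$ and a large number $N$ such that

$$-c<E_h\left[\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t+1)})\}-\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]\right]<c,\quad\text{almost surely}, \qquad (29)$$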

for any $n>N$ and any $t>0$. With a similar argument to that for equation (27), by invoking Kolmogorov’s strong law of large numbers, it can be shown that there is a constant $\delta>0$ such that

$$E_h\left[\frac{1}{n}\log[f\{\tilde x|M(\theta_n^{(t)})\}]-\frac{1}{n}\log\{f(\tilde x|\theta_n^{(t)})\}\right]\ge\delta,\quad\text{almost surely}, \qquad (30)$$

for any t>0 as n → ∞. Combining expressions (29) and (30), we have −cδ < I <c holds almost surely for sufficiently large n.

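Next, by Jensen’s inequality, we have

$$II=E_h\left[\frac{1}{n}\log\{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t+1)})\}-\frac{1}{n}\log\{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t)})\}\right]\le\frac{1}{n}\log\left(E_h\left[\frac{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t+1)})}{h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t)})}\right]\right)=\frac{1}{n}\log\left\{\int h(\tilde x^{\mathrm{mis}}|x^{\mathrm{obs}},\theta_n^{(t+1)})\,d\tilde x^{\mathrm{mis}}\right\}=0.$$

Combining the results of I and II, we have that $\Delta v(\theta)<c$ almost surely for all $\theta\in\Theta_n$. Choose $b$ as a positive number less than $c+\delta$ and $D$ as a compact set including $\{\theta\in\Theta_n:\Delta v(\theta)\in[-b,c)\}$. In summary, we have

$$\Delta v(\theta)\le\begin{cases}c,&\theta\in D,\\-b,&\theta\in\Theta_n\setminus D,\end{cases}$$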

almost surely. Hence, the strict drift condition V2 (Meyn and Tweedie (2009), page 263) is almost surely satisfied.

Since $(\theta_n^{(t)})_{t\ge 0}$ also has the weak Feller property (see lemma 3), we can further conclude that an invariant probability measure $\pi$ almost surely exists for this Markov chain (Meyn and Tweedie (2009), theorem 12.3.4). Since $(\theta_n^{(t)})_{t\ge 0}$ is irreducible (shown in lemma 2), $D$ is a compact set and thus a small set (shown in lemma 3), and the drift condition V2 is stronger than the drift condition V1 (Meyn and Tweedie (2009), page 189), we can show that $(\theta_n^{(t)})_{t\ge 0}$ is Harris recurrent (Meyn and Tweedie (2009), theorem 9.1.8). Since $(\theta_n^{(t)})_{t\ge 0}$ is irreducible and has an invariant probability measure $\pi$, it is also a positive chain (Meyn and Tweedie (2009), page 230). Therefore, it is a positive Harris recurrent chain (Meyn and Tweedie (2009), page 231). Finally, since $(\theta_n^{(t)})_{t\ge 0}$ is aperiodic (shown in lemma 2) and positive Harris recurrent, we can conclude that it is almost surely ergodic (Meyn and Tweedie (2009), theorem 13.3.3).

A.3. Proof of consistency of the imputation–regularized optimization estimator

To prove the consistency of the IRO estimator, we consider the mapping that is defined in equation (25). For the RO step, we have $\theta_*^{(t)}=M(\theta_n^{(t)})$. Also, $\theta^*$, the true value of $\theta_n$, is a fixed point of the mapping. Further, to show that the mean of the stationary distribution of the Markov chain forms a consistent estimate of $\theta^*$, we make the following assumption.

Condition 11. The mapping $M(\theta)$ is differentiable. Let $\lambda_n(\theta)$ be the largest singular value of $\partial M(\theta)/\partial\theta$. There is a number $\lambda^*<1$ such that $\lambda_n(\theta)\le\lambda^*$ for all $\theta\in\Theta_n$ for sufficiently large $n$ and almost every $x^{\mathrm{obs}}$-sequence.

Remark 4 (on contraction mapping). Condition 11 directly implies that

$$\|M(\theta_n^{(t)})-\theta^*\|=\|M(\theta_n^{(t)})-M(\theta^*)\|\le\lambda^*\|\theta_n^{(t)}-\theta^*\|, \qquad (31)$$

i.e. the mapping is a contraction. We note that a continuous application of the mapping, i.e. setting $\theta_n^{(t+1)}=\theta_*^{(t)}=M(\theta_n^{(t)})$ for all $t$, leads to a monotone increase in the expectation $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde x)\}]$. Since $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde x)\}]$ attains its maximum at $E_{\theta^*}[\log\{f_{\theta^*}(\tilde x)\}]$, it is reasonable to assume that $M(\theta_n^{(t)})$ is closer to $\theta^*$ than $\theta_n^{(t)}$. This condition should hold for sufficiently large $n$, at which the $\theta_*^{(t)}$ and $\theta^*$ are all unique as assumed in condition 4. We note that a similar contraction condition has been used in analysis of the SEM algorithm (proposition 3, Nielsen (2000)). Some other conditions can potentially be specified based on fixed point theory (see, for example, Khamsi and Kirk (2001)).

Theorem 4. Assume conditions 1–4 and 11, and $\sup_{n,t}E\|\theta_n^{(t)}\|<\infty$. Then, for sufficiently large $n$, sufficiently large $t$ and almost every $x^{\mathrm{obs}}$-sequence, $\|\theta_n^{(t)}-\theta^*\|=o_p(1)$. Furthermore, the sample average of the Markov chain forms a consistent estimate of $\theta^*$, i.e. $\|(1/T)\sum_{t=1}^T\theta_n^{(t)}-\theta^*\|=o_p(1)$, as $n\to\infty$ and $T\to\infty$.

Proof. By theorem 3, the Markov chain $\{\theta_n^{(t)}\}$ converges to a stationary distribution. For simplicity, we suppress the superscript $t$, let $\theta_n$ denote the current sample and let $\theta_n'$ denote the next iteration sample. Therefore, $\|\theta_n'-\theta^*\|\le\|\theta_n'-M(\theta_n)\|+\|M(\theta_n)-\theta^*\|\le\|\theta_n'-M(\theta_n)\|+\lambda^*\|\theta_n-\theta^*\|$, where the last inequality follows from result (31). Taking expectation on both sides leads to

$$E\|\theta_n'-\theta^*\|\le E\|\theta_n'-M(\theta_n)\|+\lambda^*E\|\theta_n-\theta^*\|\le\frac{1}{1-\lambda^*}E\|\theta_n'-M(\theta_n)\|=\frac{1}{1-\lambda^*}o(1)=o(1), \qquad (32)$$

where the second inequality follows from the stationarity of the Markov chain, and the first equality follows from theorem 2 and the existence of $E\|\theta_n\|$. Finally, by Markov’s inequality, we conclude the consistency of $\theta_n^{(t)}$ as an estimator of $\theta^*$.

By result (32), we have $\|E(\theta_n)-\theta^*\|\le E\|\theta_n-\theta^*\|=o(1)$, which implies that the mean of the stationary distribution of $\{\theta_n^{(t)}\}$ converges to $\theta^*$ for sufficiently large $n$. Further, by the ergodicity of the Markov chain $\{\theta_n^{(t)}\}$, we conclude the proof. □
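The role of the contraction in results (31) and (32) can be illustrated with a scalar toy chain. In the Python sketch below, the linear map $M(\theta)=\theta^*+\lambda^*(\theta-\theta^*)$, the Gaussian perturbation standing in for $\theta_n'-M(\theta_n)$, and all names and numerical values are assumptions chosen for this example; they are not part of the IRO algorithm itself.

```python
import math
import random


def simulate_chain(theta_star, lam_star, noise_sd, n_iter, theta0, seed=0):
    """Iterate theta' = M(theta) + noise, M(theta) = theta* + lam* (theta - theta*)."""
    rng = random.Random(seed)
    theta, path = theta0, []
    for _ in range(n_iter):
        m_theta = theta_star + lam_star * (theta - theta_star)  # contraction step
        theta = m_theta + rng.gauss(0.0, noise_sd)              # adds theta' - M(theta)
        path.append(theta)
    return path


theta_star, lam_star, noise_sd = 1.0, 0.5, 0.1
path = simulate_chain(theta_star, lam_star, noise_sd, n_iter=5000, theta0=5.0)
kept = path[100:]                                    # discard a short burn-in
average = sum(kept) / len(kept)                      # sample-average estimate
mean_abs_dev = sum(abs(th - theta_star) for th in kept) / len(kept)
# E|N(0, sd^2)| = sd * sqrt(2 / pi), so this mirrors the right-hand side of (32)
bound = noise_sd * math.sqrt(2.0 / math.pi) / (1.0 - lam_star)
print(average, mean_abs_dev, bound)
```

The empirical mean absolute deviation from $\theta^*$ should stay below $E\|\theta_n'-M(\theta_n)\|/(1-\lambda^*)$, and the sample average should sit close to $\theta^*$ even though the chain starts far away.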

Corollary 3. Assume conditions 1–4 and 11, $\sup_{n,t}E\|\theta_n^{(t)}\|<\infty$ and $h(\theta)$ is a Lipschitz function on $\Theta_n$. Then for sufficiently large $n$, sufficiently large $t$ and almost every $x^{\mathrm{obs}}$-sequence, $\|h(\theta_n^{(t)})-h(\theta^*)\|=o_p(1)$. Furthermore, $\|(1/T)\sum_{t=1}^Th(\theta_n^{(t)})-h(\theta^*)\|=o_p(1)$, as $n\to\infty$ and $T\to\infty$.

Proof. The proof follows from the definition of a Lipschitz function and the proof of theorem 4. □

A.4. Proof of ergodicity of the Markov chain for the imputation–conditional regularized optimization algorithm

Theorem 5. If conditions 1–3 hold, the Markov chain $\{(\theta_n^{(t,1)},\ldots,\theta_n^{(t,k)})\}$ is almost surely ergodic for sufficiently large $n$.

Theorem 5 can be proved in a similar way to theorem 3 with the detail given in the on-line supplementary material.

A.5. Proof of consistency of the imputation–conditional regularized optimization estimator

Condition 11’. Let $M_i$ denote the mapping of the $i$th part of the CRO step, i.e. $\theta_*^{(t,i)}=M_i(\theta_n^{(t+1,1)},\ldots,\theta_n^{(t+1,i-1)},\theta_n^{(t,i)},\ldots,\theta_n^{(t,k)})$. Let $M=M_k\circ M_{k-1}\circ\cdots\circ M_1$ denote the joint mapping of $M_1,\ldots,M_k$. Let $\lambda_n(\theta)$ denote the largest singular value of $\partial M(\theta)/\partial\theta$. There is a number $\lambda^*<1$ such that $\lambda_n(\theta)\le\lambda^*$ for all $\theta\in\Theta_n$, all sufficiently large $n$ and almost every $x^{\mathrm{obs}}$-sequence.

This condition is reasonable: it is easy to see that a continuous application of the mapping $M$, i.e. applying the $M_i$s in a circular manner, leads to a monotone increase in the function $E_{\theta_n^{(t)}}[\log\{f_\theta(\tilde X)\}]$. Similarly to theorem 4, we can prove the following theorem with the detail given in the on-line supplementary material.

Theorem 6. Assume conditions 1–4 and 11’, and $\sup_{n,t}E\|\theta_n^{(t)}\|<\infty$. Then for sufficiently large $n$, sufficiently large $t$ and almost every $x^{\mathrm{obs}}$-sequence, $\|\theta_n^{(t)}-\theta^*\|=o_p(1)$. Furthermore, the sample average of the Markov chain also forms a consistent estimate of $\theta^*$, i.e. $\|(1/T)\sum_{t=1}^T\theta_n^{(t)}-\theta^*\|=o_p(1)$, as $n\to\infty$ and $T\to\infty$.

Corollary 4. Assume conditions 1–4 and 11’, $\sup_{n,t}E\|\theta_n^{(t)}\|<\infty$ and $h(\theta)$ is a Lipschitz function on $\Theta_n$. Then for sufficiently large $n$, sufficiently large $t$ and almost every $x^{\mathrm{obs}}$-sequence, $\|h(\theta_n^{(t)})-h(\theta^*)\|=o_p(1)$. Furthermore, $\|(1/T)\sum_{t=1}^Th(\theta_n^{(t)})-h(\theta^*)\|=o_p(1)$, as $n\to\infty$ and $T\to\infty$.

Proof. The proof follows from the definition of a Lipschitz function and the proof of theorem 6.

Footnotes

Supporting information

Additional ‘supporting information’ may be found in the on-line version of this article:

‘Supplementary material for “An imputation-consistency algorithm for high-dimensional missing data problems and beyond”’.

Contributor Information

Faming Liang, Purdue University, West Lafayette, USA.

Bochao Jia, University of Florida, Gainesville, USA.

Jingnan Xue, Houzz, Palo Alto, USA.

Qizhai Li, Chinese Academy of Sciences, Beijing, People’s Republic of China.

Ye Luo, University of Florida, Gainesville, USA.

References

  1. Besag J (1974) Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Statist. Soc. B, 36, 192–236.
  2. Bo TH, Dysvik B and Jonassen I (2004) LSimpute: accurate estimation of missing values in microarray data with least square methods. Nucleic Acids Res, 32, article e34.
  3. Burgette LF and Reiter JP (2010) Multiple imputation for missing data via sequential regression trees. Am. J. Epidem, 172, 1070–1076.
  4. van Buuren S and Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J. Statist. Softwr, 45, no. 3.
  5. Cai J-F, Candes E and Shen Z (2010) A singular value thresholding algorithm for matrix completion. SIAM J. Optimizn, 20, 1956–1982.
  6. Castillo I, Schmidt-Hieber J and van der Vaart AW (2015) Bayesian linear regression with sparse priors. Ann. Statist, 43, 1986–2018.
  7. Celeux G and Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computnl Statist. Q, 2, 73–82.
  8. Dempster AP (1972) Covariance selection. Biometrics, 28, 157–175.
  9. Dempster AP, Laird NM and Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1–38.
  10. Dobra A, Hans C, Jones B, Nevins JR, Yao G and West M (2004) Sparse graphical models for exploring gene expression data. J. Multiv. Anal, 90, 196–212.
  11. Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Am. Statist. Ass, 99, 96–104.
  12. Fan J and Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass, 96, 1348–1360.
  13. Fan Y and Li R (2012) Variable selection in linear mixed effects models. Ann. Statist, 40, 2043–2068.
  14. Fan J and Lv J (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70, 849–911.
  15. Fan J and Song R (2010) Sure independence screening in generalized linear model with NP-dimensionality. Ann. Statist, 38, 3567–3604.
  16. Firth D (1993) Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
  17. Friedman J, Hastie T and Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.
  18. Garcia RI, Ibrahim JG and Zhu H (2010) Variable selection for regression models with missing data. Statist. Sin, 20, 149–165.
  19. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D and Brown PO (2000) Genomic expression programs in the response of yeast cells to environmental changes. Molec. Biol. Cell, 11, 4241–4257.
  20. van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist, 42, 1166–1202.
  21. Gelman A and Rubin DB (1992) Inference from iterative simulation using multiple sequences. Statist. Sci, 7, 457–472.
  22. Hastie T, Tibshirani R and Friedman J (2009) The Elements of Statistical Learning, 2nd edn. New York: Springer.
  23. He S (2011) Extension of SPACE: R package ‘SpaceExt’. (Available from http://cran.r-project.org/web/packages/spaceExt.)
  24. He Y and Liu C (2012) The dynamic ‘expectation-conditional maximization either’ algorithm. J. R. Statist. Soc. B, 74, 313–336.
  25. Johnson VE and Rossell D (2012) Bayesian model selection in high-dimensional settings. J. Am. Statist. Ass, 107, 649–660.
  26. Khalili A and Chen J (2007) Variable selection in finite mixture of regression models. J. Am. Statist. Ass, 102, 1025–1038.
  27. Khamsi MA and Kirk WA (2000) An Introduction to Metric Spaces and Fixed Point Theory. Hoboken: Wiley.
  28. Liang F, Song Q and Qiu P (2015) An equivalent measure of partial correlation coefficients for high-dimensional Gaussian graphical models. J. Am. Statist. Ass, 110, 1248–1265.
  29. Liang F and Zhang J (2008) Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika, 95, 961–977.
  30. Liu C and Rubin DB (1994) The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81, 633–648.
  31. Liu C, Rubin DB and Wu YN (1998) Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85, 755–770.
  32. Long Q and Johnson BA (2015) Variable selection in the presence of missing data: resampling and imputation. Biostatistics, 16, 596–610.
  33. Mazumder R and Hastie T (2012) The graphical lasso: new insights and alternatives. Electron. J. Statist, 6, 2125–2149.
  34. Mazumder R, Hastie T and Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res, 99, 2287–2322.
  35. McLachlan GJ and Krishnan T (2008) The EM Algorithm and Extensions, 2nd edn. Hoboken: Wiley.
  36. Meinshausen N and Bühlmann P (2006) High-dimensional graphs and variable selection with the Lasso. Ann. Statist, 34, 1436–1462.
  37. Meinshausen N and Bühlmann P (2010) Stability selection (with discussion). J. R. Statist. Soc. B, 72, 417–473.
  38. Meng X-L and Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267–278.
  39. Meyn S and Tweedie RL (2009) Markov Chains and Stochastic Stability, 2nd edn. Cambridge: Cambridge University Press.
  40. Nielsen SF (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 6, 457–489.
  41. Oba S, Sato M, Takemasa I, Monden M, Matsubara K-I and Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19, 2088–2096.
  42. Ouyang M, Welsh WJ and Georgopoulos P (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics, 20, 917–923.
  43. Raskutti G, Wainwright MJ and Yu B (2011) Minimax rates of estimation for high-dimensional linear regression over lq-balls. IEEE Trans. Inform. Theory, 57, 6976–6994.
  44. Scheetz TE, Kim K-Y, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, Di Bona GF, Huang J, Casavant TL, Sheffield VC and Stone EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natn. Acad. Sci. USA, 103, 14429–14434.
  45. Song Q and Liang F (2015a) A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. J. R. Statist. Soc. B, 77, 947–972.
  46. Song Q and Liang F (2015b) High dimensional variable selection with reciprocal L1-regularization. J. Am. Statist. Ass, 110, 1607–1620.
  47. Stacklies W, Redestig H, Scholz M, Walther D and Selbig J (2007) pcaMethods—a Bioconductor package providing PCA methods for incomplete data. Bioinformatics, 23, 1164–1167.
  48. Städler N and Bühlmann P (2012) Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statist. Comput, 22, 219–235.
  49. Städler N, Stekhoven DJ and Bühlmann P (2014) Pattern alternating maximization algorithm for missing data in high-dimensional problems. J. Mach. Learn. Res, 15, 1903–1928.
  50. Storey JD (2002) A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, 479–498.
  51. Tanner MA and Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J. Am. Statist. Ass, 82, 528–540.
  52. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
  53. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D and Altman R (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
  54. Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optimizn Theory Appl, 109, 475–494.
  55. Tseng P and Yun S (2009) A coordinate gradient descent method for nonsmooth separable minimization. Math. Progrmmng, 117, 387–423.
  56. van der Vaart AW and Wellner JA (1996) Weak Convergence and Empirical Processes. New York: Springer.
  57. Vershynin R (2015) Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance (ed. Pfander G), pp. 3–66. Cham: Birkhäuser.
  58. Wei GCG and Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Statist. Ass, 85, 699–704.
  59. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann. Statist, 11, 95–103.
  60. Yu G and Liu Y (2016) Sparse regression incorporating graphical structure among predictors. J. Am. Statist. Ass, 111, 707–720.
  61. Yuan M and Lin Y (2007) Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.
  62. Zhang C-H (2010) Nearly unbiased variable selection under minimax concave penalty. Ann. Statist, 38, 894–942.
  63. Zhang C-H and Zhang SS (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76, 217–242.
  64. Zhao Y and Long Q (2016) Multiple imputation in the presence of high-dimensional data. Statist. Meth. Med. Res, 25, 2021–2035.
