Published in final edited form as: Transactions on Machine Learning Research, September 2023. https://openreview.net/forum?id=oud7Ny0KQy

RIFLE: Imputation and Robust Inference from Low Order Marginals

Sina Baharlouei 1, Kelechi Ogudu 1, Sze-chuan Suen 1, Meisam Razaviyayn 1

Abstract

The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for regression and classification in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE to several state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for imputation and inference in the presence of missing values. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available at https://github.com/optimization-for-data-driven-science/RIFLE.

1. Introduction

Machine learning algorithms have shown promise when applied to various problems, including healthcare, finance, social data analysis, image processing, and speech recognition. However, this success has mainly relied on the availability of large-scale, high-quality datasets, which may be scarce in many practical problems, especially in medical and health applications (Pedersen et al., 2017; Sterne et al., 2009; Beaulieu-Jones et al., 2018). Moreover, many experiments and datasets in such applications suffer from small sample sizes. Although each individual study may contain only a small number of data points, an increasingly large number of datasets are publicly available. To fully and effectively utilize information on related research questions from diverse datasets, information across various datasets (e.g., different questionnaires from multiple hospitals with overlapping questions) must be combined in a reliable fashion.

After integrating data from different studies, the obtained dataset can contain large blocks of missing values, since the studies may not share the same features (Figure 1).

Figure 1:

Consider the problem of predicting the trait $y$ from the feature vector $(x_1, \ldots, x_{100})$. Suppose that we have access to three datasets: the first dataset includes measurements of $(x_1, x_2, \ldots, x_{40}, y)$ for $n_1$ individuals. The second dataset collects data from another $n_2$ individuals by measuring $(x_{30}, \ldots, x_{80})$, with no measurements of the target variable $y$; and the third dataset contains measurements of the variables $(x_{70}, \ldots, x_{100}, y)$ for $n_3$ individuals. How should one learn the predictor $\hat{y} = h(x_1, \ldots, x_{100})$ from these three datasets?

There are three general approaches for handling missing values in statistical inference (classification and regression) tasks. A naïve method is to remove the rows containing missing entries. However, such an approach is not viable when the percentage of missing values in a dataset is high. For instance, as demonstrated in Figure 1, the entire dataset would be discarded if we eliminated the rows with at least one missing entry.

The most common methodology for handling missing values in a learning task is to impute them in a pre-processing stage. The general idea behind data imputation is that the missing values can be predicted using the available data entries and correlated features. Imputation algorithms cover a wide range of methods, including imputing missing entries with the column mean or median (Little & Rubin, 2019, Chapter 3), least-squares and linear regression-based methods (Raghunathan et al., 2001; Kim et al., 2005; Zhang et al., 2008; Cai et al., 2006; Buuren & Groothuis-Oudshoorn, 2010), matrix completion and expectation maximization approaches (Dempster et al., 1977; Ghahramani & Jordan, 1994; Honaker et al., 2011), KNN-based methods (Troyanskaya et al., 2001), tree-based methods (Stekhoven & Bühlmann, 2012; Xia et al., 2017), and methods using different neural network structures. Appendix A presents a comprehensive review of these methods.

Imputation allows practitioners to run standard statistical algorithms requiring complete data. However, the prediction model's performance can be highly reliant on the accuracy of the imputer: high error rates in the prediction of missing values can lead to catastrophic performance of the downstream statistical methods executed on the imputed data.

Another class of methods for inference in the presence of missing values relies on robust optimization over the uncertainty sets on missing entries. Shivaswamy et al. (2006) and Xu et al. (2009) adopt robust optimization to learn the parameters of a support vector machine model. They consider uncertainty sets for the missing entries in the dataset and solve a min-max problem over those sets. The obtained classifiers are robust to the uncertainty of missing entries within the uncertainty regions. In contrast to the imputation-based approaches, the robust classification formulation does not carry the imputation error to the classification phase. However, finding appropriate intervals for each missing entry is challenging, and it is unclear how to determine the uncertainty range in many real datasets. Moreover, their proposed algorithms are limited to the SVM classifier.

In this paper, we propose RIFLE (Robust InFerence via Low-order moment Estimations) for the direct inference of a target variable based on a set of features containing missing values. The proposed framework does not require the data to be imputed in a pre-processing stage. However, it can also be used as a pre-processing tool for imputing data. The main idea of the proposed framework is to estimate the first and second-order moments of the data and their confidence intervals by bootstrapping on the available data matrix entries. Then, RIFLE finds the optimal parameters of the statistical model for the worst-case distribution with the low-order moments (mean and variance) within the estimated confidence intervals (See Figure 2). Compared to Shivaswamy et al. (2006); Xu et al. (2009), we estimate uncertainty regions for the low-order marginals using the Bootstrap technique. Furthermore, our framework is not restricted to any particular machine learning model, such as support vector machines (Xu et al., 2009).

Figure 2:

Prediction of the target variable without imputation. RIFLE estimates confidence intervals for low-order (first and second-order) marginals from the input data containing missing values. Then, it solves a distributionally robust problem over the set of all distributions whose low-order marginals are within the estimated confidence intervals.

Contributions:

Our main contributions are as follows:

  1. We present a distributionally robust optimization framework over the low-order marginals of the training data distribution for inference in the presence of missing values. The proposed framework does not require data imputation as a pre-processing stage. In Sections 3 and 4, we specialize the framework to ridge regression and classification models, respectively, as two case studies. The proposed framework provides a novel strategy for inference in the presence of missing data, especially for datasets with large proportions of missing values.

  2. We provide theoretical convergence guarantees and the iteration complexity analysis of the presented algorithms for robust formulations of ridge linear regression and normal discriminant analysis. Moreover, we show the consistency of the prediction under mild assumptions and analyze the asymptotic statistical properties of the solutions found by the algorithms.

  3. While the robust inference framework is primarily designed for direct statistical inference in the presence of missing values without performing data imputation, it can also be adopted as an imputation tool. To demonstrate the quality of the proposed imputer, we compare its performance with several widely used imputation packages, such as MICE (Buuren & Groothuis-Oudshoorn, 2010), Amelia (Honaker et al., 2011), MissForest (Stekhoven & Bühlmann, 2012), KNN-Imputer (Troyanskaya et al., 2001), MIDA (Gondara & Wang, 2018), and GAIN (Yoon et al., 2018), on real and synthetic datasets. Generally speaking, our method outperforms all of the mentioned packages when the number of missing entries is large.

2. Robust Inference via Estimating Low-order Moments

RIFLE is based on a distributionally robust optimization (DRO) framework over low-order marginals. Assume that $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ follows a joint probability distribution $P$. A standard approach for predicting the target variable $y$ given the input vector $x$ is to find the parameter $\theta$ that minimizes the population risk with respect to a given loss function $\ell$:

$$\min_{\theta} \; \mathbb{E}_{(x, y)\sim P}\left[\ell(x, y; \theta)\right]. \tag{1}$$

Since the underlying distribution of data is rarely available in practice, the above problem cannot be directly solved. The most common approach for approximating (1) is to minimize the empirical risk with respect to $n$ given i.i.d. samples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from the joint distribution $P$:

$$\min_{\theta} \; \frac{1}{n}\sum_{i=1}^{n}\ell(x_i, y_i; \theta).$$

The above empirical risk formulation assumes that all entries of $x_i$ and $y_i$ are available. Thus, to utilize the empirical risk minimization (ERM) framework in the presence of missing values, one can either remove or impute the missing data points in a pre-processing stage. Training via robust optimization is a natural alternative in the presence of missing data. Shivaswamy et al. (2006); Xu et al. (2009) suggest the following optimization problem, which minimizes the loss function for the worst-case scenario over uncertainty sets defined per data point:

$$\min_{\theta}\max_{\{\delta_i \in \mathcal{N}_i\}_{i=1}^{n}} \; \frac{1}{n}\sum_{i=1}^{n}\ell(x_i - \delta_i, y_i; \theta), \tag{2}$$

where $\mathcal{N}_i$ represents the uncertainty region of data point $i$. Shivaswamy et al. (2006) obtain the uncertainty sets by assuming a known distribution on the missing entries of the dataset. The main issue with their approach is that the constraints defined on data points are totally uncorrelated. Xu et al. (2009), on the other hand, define $\mathcal{N}_i$ as a "box" constraint around data point $i$ such that the perturbations can be linearly correlated. For this specific case, they show that solving the corresponding robust optimization problem is equivalent to minimizing a regularized reformulation of the original loss function. Such an approach has several limitations: first, it can only handle a few special cases (the SVM loss with linearly correlated perturbations on data points). Furthermore, Xu et al. (2009) is primarily designed for handling outliers and contaminated data; thus, they do not offer any mechanism for the initial estimation of $x_i$ when several entries of the vector are missing. In this work, we instead take a distributionally robust approach by considering uncertainty on the data distribution instead of defining an uncertainty set for each data point. In particular, we aim to fit the best parameters of a statistical learning model for the worst-case distribution in a given uncertainty set by solving the following problem:

$$\min_{\theta}\max_{P \in \mathcal{P}} \; \mathbb{E}_{(x, y)\sim P}\left[\ell(x, y; \theta)\right], \tag{3}$$

where $\mathcal{P}$ is an uncertainty set over the underlying distribution of data. A key observation is that defining the uncertainty set $\mathcal{P}$ in (3) is easier and computationally more efficient than defining the uncertainty sets $\{\mathcal{N}_i\}_{i=1}^{n}$ in (2). In particular, the uncertainty set $\mathcal{P}$ can be obtained naturally by estimating low-order moments of the data distribution using only the available entries. To explain this idea and to simplify notation, let $z = (x, y)$, $\bar{\mu}^z \triangleq \mathbb{E}[z]$, and $\bar{C}^z \triangleq \mathbb{E}[zz^T]$. While $\bar{\mu}^z$ and $\bar{C}^z$ are typically not known exactly, one can estimate them (within certain confidence intervals) from the available data by simply ignoring missing entries (assuming the missing-value pattern is completely at random, i.e., MCAR). Moreover, we can estimate the confidence intervals via bootstrapping. In particular, we can estimate $\mu_{\min}^z$, $\mu_{\max}^z$, $C_{\min}^z$, and $C_{\max}^z$ from data such that $\mu_{\min}^z \leq \bar{\mu}^z \leq \mu_{\max}^z$ and $C_{\min}^z \leq \bar{C}^z \leq C_{\max}^z$ with high probability (where the inequalities for matrices and vectors denote component-wise relations). In Appendix B, we show how a bootstrapping strategy can be used to obtain the confidence intervals described above. Given these estimated confidence intervals, (3) can be reformulated as

$$\min_{\theta}\max_{P} \; \mathbb{E}_{P}\left[\ell(z; \theta)\right] \quad \text{s.t.} \quad \mu_{\min}^z \leq \mathbb{E}_P[z] \leq \mu_{\max}^z, \quad C_{\min}^z \leq \mathbb{E}_P[zz^T] \leq C_{\max}^z. \tag{4}$$

Gao & Kleywegt (2017) utilize distributionally robust optimization of the form (3) over the set of positive semi-definite (PSD) cones for robust inference under uncertainty. While their formulation considers $\ell_2$ balls for the constraints on the low-order moments of the data, we use constraints that are computationally more natural in the presence of missing entries when combined with bootstrapping. Furthermore, while their method can be applied to general convex losses, it relies on the ellipsoid method and the existence of oracles for performing its steps, which is not applicable to modern high-dimensional problems. Moreover, they assume concavity in the data (the existence of an oracle that returns the worst-case data points), which is practically unavailable even for convex loss functions (including the linear regression and normal discriminant analysis models studied in our work).

In Section 3, we study the proposed distributionally robust framework described in (4) for ridge linear regression. We design efficient, provably convergent first-order algorithms to solve the problem and show how the algorithms can be used for both inference and imputation in the presence of missing values. Further, in Appendix F, we study the proposed distributionally robust framework for classification problems under the normality assumption on the features. In particular, we show how Framework (4) can be specialized to robust normal discriminant analysis in the presence of missing values.

3. Robust Linear Regression in the Presence of Missing Values

Let us specialize our framework to the ridge linear regression model. In the absence of missing data, ridge regression finds the optimal regression parameter $\theta$ by solving

$$\min_{\theta} \; \|X\theta - y\|_2^2 + \lambda\|\theta\|_2^2,$$

or equivalently by solving:

$$\min_{\theta} \; \theta^T X^T X \theta - 2\theta^T X^T y + \lambda\|\theta\|_2^2. \tag{5}$$

Thus, having the second-order moments of the data, $C = X^T X$ and $b = X^T y$, is sufficient for finding the optimal solution. In other words, it suffices to compute the inner product of any two columns $a_i, a_j$ of $X$, and the inner product of any column $a_i$ of $X$ with the vector $y$. Since the matrix $X$ and the vector $y$ are not fully observed due to missing values, one can use the available data (see (24) for details) to compute the point estimators $C^0$ and $b^0$. These point estimators can be highly inaccurate, especially when the number of non-missing rows for two given columns is small. In addition, if the pattern of missing entries does not follow the MCAR assumption, the point estimators are not unbiased estimators of $C$ and $b$.
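To make the role of these moments concrete, the following minimal Python sketch (our own illustration, not the released RIFLE code; names and values are ours) solves the ridge problem using only $C = X^T X$ and $b = X^T y$:

import numpy as np

# With complete data, the ridge solution depends on X and y only through
# C = X^T X and b = X^T y (a minimal sketch; function name is ours).
def ridge_from_moments(C, b, lam):
    """Solve min_theta  theta^T C theta - 2 b^T theta + lam * ||theta||^2."""
    d = C.shape[0]
    return np.linalg.solve(C + lam * np.eye(d), b)

# Example with synthetic complete data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=100)
theta = ridge_from_moments(X.T @ X, X.T @ y, lam=1.0)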

3.1. A Distributionally Robust Formulation of Linear Regression

As we mentioned above, to solve the linear regression problem we only need to estimate the second-order moments of the data ($X^T X$ and $X^T y$). Thus, the distributionally robust formulation described in (4) is equivalent to the following optimization problem for the linear regression model:

$$\min_{\theta}\max_{C, b} \; \theta^T C \theta - 2 b^T \theta + \lambda\|\theta\|_2^2 \quad \text{s.t.} \quad C^0 - c\Delta \leq C \leq C^0 + c\Delta, \quad b^0 - c\delta \leq b \leq b^0 + c\delta, \quad C \succeq 0, \tag{6}$$

where the last constraint guarantees that the covariance matrix is positive semi-definite. We discuss the procedure for estimating the confidence intervals ($b^0$, $C^0$, $\delta$, and $\Delta$) in Appendix B.

3.2. RIFLE for Ridge Linear Regression

Since the objective function in (6) is convex in $\theta$ (ridge regression) and concave in $b$ and $C$ (linear), the minimization and maximization sub-problems are interchangeable (Sion, 1958). Thus, we can equivalently rewrite Problem (6) as:

$$\max_{C, b} \; g(C, b) \quad \text{s.t.} \quad C^0 - c\Delta \leq C \leq C^0 + c\Delta, \quad b^0 - c\delta \leq b \leq b^0 + c\delta, \quad C \succeq 0, \tag{7}$$

where $g(C, b) = \min_{\theta} \theta^T C \theta - 2 b^T \theta + \lambda\|\theta\|_2^2$. The function $g$ can be computed in closed form for any pair $(C, b)$ by setting $\theta = (C + \lambda I)^{-1} b$. Thus, using Danskin's theorem (Danskin, 2012), we can apply projected gradient ascent to $g$ to find an optimal solution of (7), as described in Algorithm 1. At each iteration of the algorithm, we first perform one step of projected gradient ascent on the matrix $C$ and the vector $b$; then we update $\theta$ in closed form for the obtained $C$ and $b$. We initialize $C$ and $b$ using entrywise point estimation on the available rows (see Equation (24) in Appendix B). The projection of $b$ onto the box constraint $b^0 - c\delta \leq b \leq b^0 + c\delta$ can be done entrywise and has the following closed form:

$$\Pi_{\delta}(b_i) = \begin{cases} b_i & \text{if } b_i^0 - c\delta_i \leq b_i \leq b_i^0 + c\delta_i, \\ b_i^0 - c\delta_i & \text{if } b_i < b_i^0 - c\delta_i, \\ b_i^0 + c\delta_i & \text{if } b_i > b_i^0 + c\delta_i. \end{cases}$$
Algorithm 1 RIFLE for Ridge Linear Regression in the Presence of Missing Values
1: Input: $C^0, b^0, \Delta, \delta, T$
2: Initialize: $C = C^0$, $b = b^0$.
3: for $i = 1, \ldots, T$ do
4:   Update $C = \Pi_{\Delta}^{+}\left[C + \alpha\theta\theta^T\right]$
5:   Update $b = \Pi_{\delta}\left(b - 2\alpha\theta\right)$
6:   Set $\theta = (C + \lambda I)^{-1} b$
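A minimal Python sketch of Algorithm 1 is given below (our own illustration, assuming NumPy; the PSD projection is omitted, matching the relaxed implementation discussed after Theorem 1, and the step size and iteration count are illustrative):

import numpy as np

# Sketch of Algorithm 1: projected gradient ascent on (C, b) with a
# closed-form update of theta. The projection onto the PSD cone is dropped,
# as in the relaxed implementation described in the text.
def rifle_ridge(C0, b0, Delta, delta, lam=1.0, c=1.0, alpha=0.01, T=1000):
    C, b = C0.copy(), b0.copy()
    d = C0.shape[0]
    theta = np.linalg.solve(C + lam * np.eye(d), b)
    for _ in range(T):
        # Gradient ascent steps followed by entrywise box projections.
        C = np.clip(C + alpha * np.outer(theta, theta), C0 - c * Delta, C0 + c * Delta)
        b = np.clip(b - 2.0 * alpha * theta, b0 - c * delta, b0 + c * delta)
        # Closed-form minimizer over theta for the current (C, b).
        theta = np.linalg.solve(C + lam * np.eye(d), b)
    return theta, C, b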

Theorem 1. Let $(\tilde{\theta}, \tilde{C}, \tilde{b})$ be the optimal solution of (6), let $\theta^*(b, C) = \arg\min_{\theta} \theta^T C \theta - 2 b^T \theta + \lambda\|\theta\|_2^2$, and let $D = \|C^0 - \tilde{C}\|_F^2 + \|b^0 - \tilde{b}\|_2^2$. Assume that $\|\theta^*(b, C)\| \leq \tau$ for any given $b$ and $C$ within the uncertainty (constraint) sets described in (6). Then Algorithm 1 computes an $\epsilon$-optimal solution of the objective function in (7) in $\mathcal{O}\left(\frac{D(\tau+1)^2}{\lambda\epsilon}\right)$ iterations.

Proof. The proof is relegated to Appendix H.

In Appendix C, we show how using Nesterov's acceleration method can improve the convergence rate of Algorithm 1 to $\mathcal{O}\left(\sqrt{\frac{D(\tau+1)^2}{\epsilon\lambda}}\right)$. A technical issue with Algorithm 1 and its accelerated version presented in Appendix C is that the projection of $C$ onto the intersection of the box constraints and the set of positive semidefinite matrices ($\Pi_{\Delta}^{+}[C]$) is challenging and cannot be done in closed form. In the implementation of Algorithm 1, we relax the problem by removing the PSD constraint on $C$, avoiding this complexity and the time-consuming singular value decomposition at each iteration. This relaxation does not drastically change the algorithm's performance, as our experiments in Section 5 show. A more systematic approach is to write the dual of the maximization problem and handle the resulting constrained minimization problem with the Alternating Direction Method of Multipliers (ADMM). The detailed procedure for such an approach can be found in Appendix D. All of these algorithms provably converge to optimal points of Problem (6). In addition to the theoretical convergence, we numerically evaluate the convergence of the resulting algorithms in Appendix K. Further, the proposed algorithms are consistent, as discussed in Appendix J.

3.3. Performance Guarantees for RIFLE

Thus far, we have discussed how to efficiently solve the robust linear regression problem in the presence of missing values. A natural question in this context concerns the statistical performance of the obtained optimal solution on unseen test data points. Theorem 2 answers this question from two perspectives: assuming that the missing values are distributed completely at random, our estimators are consistent. Moreover, for the finite-sample case, Theorem 2 part (b) states that, with a proper choice of confidence intervals, the test loss of the obtained solution is bounded by its training loss with high probability. Note that the results regarding the performance of the robust estimator hold in general for the MCAR missing pattern. However, in Section 5 we perform several experiments on datasets with MNAR patterns to show how RIFLE works in practice on such datasets.

Theorem 2. Assume the data domain is bounded and that the missing pattern of the data follows MCAR. Let $X \in \mathbb{R}^{n\times d}$, $y \in \mathbb{R}^{n}$ be the training data drawn i.i.d. from the ground-truth distribution $P$ with low-order moments $C^*$ and $b^*$. Further, assume that each entry of $X$ and $y$ is missing with probability $p < 1$. Let $(\tilde{\theta}_n, \tilde{C}_n, \tilde{b}_n)$ be the solution of Problem (6).

(a). Consistency of the Covariance Estimator:

As the number of data points goes to infinity, the estimated low-order marginals converge to the ground-truth values, almost surely. More precisely,

$$\lim_{n\to\infty} \tilde{C}_n = \mathbb{E}_P\left[xx^T\right] \quad \text{a.s.}, \tag{8}$$
$$\lim_{n\to\infty} \tilde{b}_n = \mathbb{E}_P\left[xy\right] \quad \text{a.s.} \tag{9}$$

(b). Defining

$$\mathcal{L}_{\text{train}}(\tilde{\theta}_n) = \tilde{\theta}_n^T \tilde{C}_n \tilde{\theta}_n - 2\tilde{b}_n^T \tilde{\theta}_n + \lambda\|\tilde{\theta}_n\|_2^2,$$
$$\mathcal{L}_{\text{test}}(\tilde{\theta}_n) = \tilde{\theta}_n^T C^* \tilde{\theta}_n - 2{b^*}^T \tilde{\theta}_n + \lambda\|\tilde{\theta}_n\|_2^2,$$

where $C^* = \mathbb{E}_{(x,y)\sim P}[xx^T]$ and $b^* = \mathbb{E}_{(x,y)\sim P}[xy]$ are the ground-truth second-order moments. Given $V = \max_{i,j}\operatorname{Var}(X_i X_j)$ (the maximum variance of pairwise feature products), with probability at least $1 - \frac{d^2 V^2}{2c^2\Delta^2 n(1-p)}$, we have:

$$\mathcal{L}_{\text{test}}(\tilde{\theta}) \leq \mathcal{L}_{\text{train}}(\tilde{\theta}), \tag{10}$$

where $\Delta = \min_{i,j}\{\Delta_{ij}\}$ and $c$ is the hyper-parameter controlling the size of the confidence intervals, as presented in (6).

Proof. The proof is relegated to Appendix H.

3.4. Imputation of Missing Values and Going Beyond Linear Regression

RIFLE can be used for imputing missing data. To this end, we impute the different features of a given dataset independently. More precisely, to impute each feature containing missing values, we treat it as the target variable $y$ and the rest of the features as the input $X$ in our methodology. Then, we train a model to predict the feature $y$ given $X$ via Algorithm 1 (or its ADMM version, Algorithm 7, in the appendix). Let the obtained optimal solutions be $C$, $b$, and $\theta$. For a given missing entry, we can use $\theta$ only if all other features in the row of that missing entry are available. However, that is usually not the case in practice, as each row can contain more than one missing entry. Therefore, one can learn a separate model for each missing pattern in the dataset. Let us clarify this point through the example in Figure 1. In this example, we have three different missing patterns (one for each dataset). For missing entries in Dataset 1, the first forty features are available. Let $r_j$ denote the vector of the first 40 features in row $j$. Assume that we aim to impute entry $i \in \{41, \ldots, 100\}$ in row $j$, denoted by $x_{ji}$. To this end, we restrict $X$ to the first 40 features. Moreover, we consider $y = x_i$ as the target variable. Then, we run Algorithm 1 on $X$ and $y$ to obtain the optimal $C$, $b_i$, and $\theta_i$. Consequently, we impute $x_{ji}$ as follows:

$$x_{ji} = r_j^T \theta_i$$

We can use the same methodology for imputing the missing entries of each feature for the missing patterns in Dataset 2 and Dataset 3. While this approach is reasonable for the missing patterns observed in Figure 1, in many practical problems different rows can have distinct missing patterns. Thus, in the worst case, Algorithm 1 must be executed once per missing entry. Such an approach is computationally expensive and might be infeasible for large-scale datasets containing many missing entries. Alternatively, one can run Algorithm 1 only once to obtain $C$ and $b$ (regarded as the "worst-case/pessimistic" estimation of the moments). Then, to impute each missing entry, $C$ and $b$ are restricted to the features available in that missing entry's row. Having the restricted $C$ and $b$, the regressor $\theta$ can be obtained in closed form (line 6 in Algorithm 1). In this approach, we run Algorithm 1 once and find the optimal $\theta$ for each missing entry based on the estimated $C$ and $b$. This approach can lead to sub-optimal solutions compared to the former approach, but it is much faster and more scalable.
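As a rough illustration of this scalable variant (our own sketch, not the package's API), each missing entry can be imputed from the sub-blocks of the worst-case moments restricted to the observed features of its row:

import numpy as np

# Sketch of the scalable imputation variant: Algorithm 1 is run once per target
# feature to obtain worst-case moments C (d x d) and b (d,); each missing entry
# is then imputed from the sub-blocks restricted to that row's observed features.
def impute_entry(row, observed_idx, C, b, lam=1.0):
    C_sub = C[np.ix_(observed_idx, observed_idx)]
    b_sub = b[observed_idx]
    theta = np.linalg.solve(C_sub + lam * np.eye(len(observed_idx)), b_sub)
    return row[observed_idx] @ theta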

Beyond Linear Regression:

While the developed methods are primarily designed for ridge linear regression, one can apply non-linear transformations (kernels) to obtain models beyond linear. In Appendix E, we show how to extend the developed algorithms to quadratic models. The RIFLE framework applied to the quadratically transformed data is called QRIFLE.

4. Robust Classification Framework

In this section, we study the proposed framework in (4) for classification tasks in the presence of missing values. Since the target variable $y \in \mathcal{Y} = \{1, \ldots, M\}$ takes discrete values in classification tasks, we consider the uncertainty sets over the data's first- and second-order marginals given each target value (label) separately. Therefore, the distributionally robust classification over low-order marginals can be described as:

$$\min_{w}\max_{P} \; \mathbb{E}_P\left[\ell(x, y, w)\right] \quad \text{s.t.} \quad \mu_{\min, y} \leq \mathbb{E}_P[x \mid y] \leq \mu_{\max, y} \;\; \forall y \in \mathcal{Y}, \quad \Sigma_{\min, y} \leq \mathbb{E}_P[xx^T \mid y] \leq \Sigma_{\max, y} \;\; \forall y \in \mathcal{Y} \tag{11}$$

where $\mu_{\min}$, $\mu_{\max}$, $\Sigma_{\min}$, and $\Sigma_{\max}$ are the estimated confidence intervals for the first- and second-order moments of the data distribution. Unlike the robust linear regression task in Section 3, the evaluation of the objective function in (11) might depend on higher-order marginals (beyond second-order) due to the nonlinearity of the loss function. As a result, Problem (11) is in general a non-convex, non-concave, intractable min-max optimization problem. For the sake of computational tractability, we restrict the distribution in the inner maximization problem to the set of normal distributions. In the following subsection, we specialize (11) to quadratic discriminant analysis as a case study. The methodology can be extended to other popular classification algorithms, such as support vector machines and multi-layer neural networks.

4.1. Robust Quadratic Discriminant Analysis

Learning a logistic regression model on datasets containing missing values has been studied extensively in the literature (Fung & Wrobel, 1989; Abonazel & Ibrahim, 2018). Besides deleting missing values and imputation-based approaches, Fung & Wrobel (1989) model the logistic regression task in the presence of missing values as a linear discriminant analysis problem, where the underlying assumption is that the predictors follow a normal distribution conditional on the labels. Mathematically speaking, they assume that the data points assigned to a specific label follow a Gaussian distribution, i.e., $x \mid y = i \sim \mathcal{N}(\mu_i, \Sigma)$. They use the available data to estimate the parameters of each Gaussian distribution. Therefore, the parameters of the logistic regression model can be assigned based on the estimated parameters of the Gaussian distributions for the different classes. Similar to the linear regression case, the estimations of the means and covariances are unbiased only when the data satisfies the MCAR condition. Moreover, when the number of data points in the dataset is small, the variance of the estimations can be very high. Thus, to train a logistic regression model that is robust to the percentage and different types of missing values, we specialize the general robust classification framework formulated in Equation (11) to the logistic regression model. Instead of considering a common covariance matrix for the conditional distributions of $x$ given the labels $y$ (linear discriminant analysis), we consider a more general case where each conditional distribution has its own covariance matrix (quadratic discriminant analysis). Assume that $x \mid y \sim \mathcal{N}(\mu_y, \Sigma_y)$ for $y = 0, 1$. We aim to find the optimal solution of the following problem:

$$\begin{aligned}
\min_{w}\max_{\mu_0, \mu_1, \Sigma_0, \Sigma_1} \quad & \mathbb{E}_{x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)}\left[-\log(\sigma(w^T x))\right] P(y=1) + \mathbb{E}_{x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)}\left[-\log(1-\sigma(w^T x))\right] P(y=0) \\
\text{s.t.} \quad & \mu_{\min}^0 \leq \mu_0 \leq \mu_{\max}^0, \quad \mu_{\min}^1 \leq \mu_1 \leq \mu_{\max}^1, \\
& \Sigma_{\min}^0 \leq \Sigma_0 \leq \Sigma_{\max}^0, \quad \Sigma_{\min}^1 \leq \Sigma_1 \leq \Sigma_{\max}^1 \tag{12}
\end{aligned}$$

where $\sigma(x) = \frac{1}{1+\exp(-x)}$ is the sigmoid function.

To solve Problem (12), we first focus on the scenario where the target variable has no missing values. In this case, each data point contributes to the estimation of either $(\mu_1, \Sigma_1)$ or $(\mu_0, \Sigma_0)$, depending on its label. Similar to the robust linear regression case, we can apply Algorithm 4 to estimate the confidence intervals for $\mu_i, \Sigma_i$ using the data points whose target variable equals $i$ ($y = i$).

The objective function is convex in $w$ since the logistic regression loss is convex and the expectation of the loss can be seen as a weighted summation, which preserves convexity. Thus, fixing $\mu$ and $\Sigma$, the outer minimization problem can be solved with respect to $w$ using standard first-order methods such as gradient descent.

Although the robust reformulation of logistic regression stated in (12) is convex in $w$ and concave in $\mu_0$ and $\mu_1$, the inner maximization problem is intractable with respect to $\Sigma_0$ and $\Sigma_1$. We approximate Problem (12) in the following manner:

$$\begin{aligned}
\min_{w}\max_{\mu_0, \Sigma_0, \mu_1, \Sigma_1} \quad & \pi_1\,\mathbb{E}_{x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)}\left[-\log(\sigma(w^T x))\right] + \pi_0\,\mathbb{E}_{x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)}\left[-\log(1-\sigma(w^T x))\right] \\
\text{s.t.} \quad & \mu_{\min}^0 \leq \mu_0 \leq \mu_{\max}^0, \quad \mu_{\min}^1 \leq \mu_1 \leq \mu_{\max}^1, \\
& \Sigma_0 \in \{\Sigma_0^1, \Sigma_0^2, \ldots, \Sigma_0^k\}, \quad \Sigma_1 \in \{\Sigma_1^1, \Sigma_1^2, \ldots, \Sigma_1^k\}, \tag{13}
\end{aligned}$$

where $\pi_1 = P(y=1)$ and $\pi_0 = P(y=0)$. To compute the optimal $\mu_0$ and $\mu_1$, we solve:

$$\max_{\mu_1} \; \mathbb{E}_{x \sim \mathcal{N}(\mu_1, \Sigma_1)}\left[-\log(\sigma(w^T x))\right] \quad \text{s.t.} \quad \mu_{\min} \leq \mu_1 \leq \mu_{\max} \tag{14}$$

Theorem 3. Let $a[i]$ be the $i$-th element of the vector $a$. The optimal solution of Problem (14) has the following form:

$$\mu_1[i] = \begin{cases} \mu_{\max}[i], & \text{if } w[i] \leq 0, \\ \mu_{\min}[i], & \text{if } w[i] > 0. \end{cases} \tag{15}$$
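For illustration (our own sketch, not part of the paper's released code), the coordinate-wise rule in (15) can be written in one line of NumPy:

import numpy as np

# Coordinate-wise worst-case mean from Equation (15): use mu_max where
# w[i] <= 0 and mu_min where w[i] > 0 (helper name is ours).
def worst_case_mu1(w, mu_min, mu_max):
    return np.where(w > 0, mu_min, mu_max)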

Note that we relaxed (12) by taking the maximization problem over a finite set of $\Sigma$ estimations. We estimate each $\Sigma$ by bootstrapping on the available data using Algorithm 4. Define $f_i(w)$ as:

$$f_i(w) = \pi_1\,\mathbb{E}_{x \sim \mathcal{N}(\mu_1, \Sigma_1^i)}\left[-\log(\sigma(w^T x))\right] \tag{16}$$

Similarly, we can define:

$$g_i(w) = \pi_0\,\mathbb{E}_{x \sim \mathcal{N}(\mu_0, \Sigma_0^i)}\left[-\log(1-\sigma(w^T x))\right] \tag{17}$$

Since the maximization problem is over a finite set, we can rewrite Problem (13) as:

$$\min_{w}\max_{i, j \in \{1, \ldots, k\}} f_i(w) + g_j(w) \;=\; \min_{w}\max_{p_1, \ldots, p_k, q_1, \ldots, q_k} \; \sum_{i=1}^{k} p_i f_i(w) + \sum_{j=1}^{k} q_j g_j(w) \quad \text{s.t.} \quad \sum_{i=1}^{k} p_i = 1, \; p_i \geq 0, \; \sum_{j=1}^{k} q_j = 1, \; q_j \geq 0 \tag{18}$$

Since the maximum of several functions is not necessarily smooth (differentiable), we add a quadratic regularization term to the maximization problem, which accelerates the convergence rate (Nouiehed et al., 2019), as follows:

$$\min_{w}\max_{p_1, \ldots, p_k, q_1, \ldots, q_k} \; \sum_{i=1}^{k} p_i f_i(w) - \delta\sum_{i=1}^{k} p_i^2 + \sum_{j=1}^{k} q_j g_j(w) - \delta\sum_{j=1}^{k} q_j^2 \quad \text{s.t.} \quad \sum_{i=1}^{k} p_i = 1, \; p_i \geq 0, \; \sum_{j=1}^{k} q_j = 1, \; q_j \geq 0 \tag{19}$$

First, we show how to solve the inner maximization problem. Note that the $p_i$'s and $q_j$'s are independent. We show how to find the optimal $p_i$'s; optimizing with respect to the $q_j$'s is similar. Since the maximization problem is a constrained quadratic program, we can write the Lagrangian function as follows:

$$\max_{p_1, \ldots, p_k} \; \sum_{i=1}^{k} p_i f_i(w) - \delta\sum_{i=1}^{k} p_i^2 - \lambda\left(\sum_{i=1}^{k} p_i - 1\right) \quad \text{s.t.} \quad p_i \geq 0 \tag{20}$$

Having the optimal $\lambda$, the above problem has a closed-form solution with respect to each $p_i$, which can be written as:

$$p_i = \left[\frac{-\lambda + f_i}{2\delta}\right]_{+}$$

Since $p_i$ is a non-increasing function of $\lambda$, we can find the optimal value of $\lambda$ using a bisection scheme. Algorithm 2 demonstrates how to find an $\epsilon$-optimal $\lambda$ and $p_i$'s efficiently using the bisection idea.

Algorithm 2 Finding the optimal $\lambda$ and $p_i$'s using the bisection idea
1: Initialize: $\lambda_{\text{low}} = 0$, $\lambda_{\text{high}} = \max_i f_i$, $p_i = 0 \;\; \forall i \in \{1, 2, \ldots, k\}$.
2: while $\left|\sum_{i=1}^{k} p_i - 1\right| > \epsilon$ do
3:   $\lambda = \frac{\lambda_{\text{low}} + \lambda_{\text{high}}}{2}$
4:   Set $p_i = \left[\frac{-\lambda + f_i}{2\delta}\right]_{+} \;\; \forall i \in \{1, 2, \ldots, k\}$
5:   if $\sum_{i=1}^{k} p_i < 1$ then
6:     $\lambda_{\text{high}} = \lambda$
7:   else
8:     $\lambda_{\text{low}} = \lambda$
9: return $\lambda, p_1, p_2, \ldots, p_k$.
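A minimal Python sketch of Algorithm 2 follows (our own transcription; the tolerance, iteration cap, and function name are ours). It assumes, as in the algorithm's initialization, that the weights sum to at least one at $\lambda = 0$:

import numpy as np

# Sketch of Algorithm 2: bisection on lambda so that the weights
# p_i = max((f_i - lambda) / (2 * delta), 0) sum to one.
def simplex_weights(f, delta, eps=1e-8, max_iter=200):
    f = np.asarray(f, dtype=float)
    lam_low, lam_high = 0.0, float(f.max())
    p = np.zeros_like(f)
    for _ in range(max_iter):
        lam = 0.5 * (lam_low + lam_high)
        p = np.maximum((f - lam) / (2.0 * delta), 0.0)
        s = p.sum()
        if abs(s - 1.0) <= eps:
            break
        if s < 1.0:
            lam_high = lam   # lambda too large: the weights are too small
        else:
            lam_low = lam    # lambda too small: the weights are too large
    return lam, p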

Remark 4. An alternative method for finding the optimal $\lambda$ and $p_i$'s is to first sort the $f_i$ values in $\mathcal{O}(k\log k)$ time, and then find the smallest $f_i$ such that setting $\lambda = f_i$ makes the sum of the $p_i$'s larger than 1 (let $j$ be the index of that value). Without loss of generality, assume that $f_1 \leq \cdots \leq f_k$. Then, $\sum_{i=j}^{k} \frac{-\lambda + f_i}{2\delta} = 1$, which has a closed-form solution with respect to $\lambda$.

To update $w$, we need to solve the following optimization problem:

$$\min_{w} \; \sum_{i=1}^{k} p_i f_i(w) + \sum_{j=1}^{k} q_j g_j(w), \tag{21}$$

Similar to the standard statistical learning framework, we solve the following empirical risk minimization problem by applying gradient descent to $w$ on a finite data sample. Define $\hat{f}_i$ as follows:

$$\hat{f}_i(w) = \frac{\pi_1}{n}\sum_{t=1}^{n}\left[-\log(\sigma(w^T x_t))\right], \tag{22}$$

where $x_1, \ldots, x_n$ are generated from the distribution $\mathcal{N}(\mu_1, \Sigma_1^i)$. The empirical risk minimization problem can be written as follows:

$$\min_{w} \; \sum_{i=1}^{k} p_i \hat{f}_i(w) + \sum_{j=1}^{k} q_j \hat{g}_j(w), \tag{23}$$

Algorithm 3 summarizes the robust quadratic discriminant analysis method for the case where the labels of all data points are available. Theorem 5 establishes that the gradient descent algorithm applied to (23) converges to an $\epsilon$-optimal solution in $\mathcal{O}\left(\frac{k}{\epsilon}\log\left(\frac{M}{\epsilon}\right)\right)$ iterations.

Algorithm 3 Robust Quadratic Discriminant Analysis in the Presence of Missing Values
1: Input: $X_0, X_1$: matrices of data points with labels 0 and 1, respectively; $T$: number of iterations; $\alpha$: step size.
2: Estimate $\mu_{\min}^0$ and $\mu_{\max}^0$ using the available entries of $X_0$.
3: Estimate $\mu_{\min}^1$ and $\mu_{\max}^1$ using the available entries of $X_1$.
4: Estimate $\Sigma_0^1, \ldots, \Sigma_0^k$ using the bootstrap estimator on the available data of $X_0$.
5: Estimate $\Sigma_1^1, \ldots, \Sigma_1^k$ using the bootstrap estimator on the available data of $X_1$.
6: for $i = 1, \ldots, T$ do
7:   Compute $\mu_1$ and $\mu_0$ by Equation (15).
8:   Find the optimal $p_1, \ldots, p_k$ and $q_1, \ldots, q_k$ using Algorithm 2.
9:   $w = w - \alpha \nabla_w\left(\sum_{i=1}^{k} p_i \hat{f}_i(w) + \sum_{j=1}^{k} q_j \hat{g}_j(w)\right)$
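The following Python sketch puts the pieces of Algorithm 3 together under our own naming, with Monte Carlo approximation of the Gaussian expectations; it reuses the hypothetical simplex_weights helper from the Algorithm 2 sketch, and the step size, sample sizes, and regularization value are illustrative:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Sketch of Algorithm 3. mu_bounds[c] = (mu_min, mu_max) for class c;
# Sigmas[c] = list of k bootstrap covariance estimates for class c;
# pi = (P(y=0), P(y=1)). Expectations are approximated by Monte Carlo sampling.
def robust_qda(mu_bounds, Sigmas, pi, T=200, alpha=0.1, delta=0.1, n_mc=500, seed=0):
    rng = np.random.default_rng(seed)
    d = mu_bounds[0][0].shape[0]
    w = np.zeros(d)
    for _ in range(T):
        # Worst-case class means (Equation (15) and its mirrored rule for class 0).
        mu1 = np.where(w > 0, mu_bounds[1][0], mu_bounds[1][1])
        mu0 = np.where(w > 0, mu_bounds[0][1], mu_bounds[0][0])
        f_vals, g_vals, grads_f, grads_g = [], [], [], []
        for S1, S0 in zip(Sigmas[1], Sigmas[0]):
            x1 = rng.multivariate_normal(mu1, S1, size=n_mc)
            x0 = rng.multivariate_normal(mu0, S0, size=n_mc)
            f_vals.append(pi[1] * np.mean(-np.log(sigmoid(x1 @ w) + 1e-12)))
            g_vals.append(pi[0] * np.mean(-np.log(1.0 - sigmoid(x0 @ w) + 1e-12)))
            grads_f.append(pi[1] * np.mean((sigmoid(x1 @ w) - 1.0)[:, None] * x1, axis=0))
            grads_g.append(pi[0] * np.mean(sigmoid(x0 @ w)[:, None] * x0, axis=0))
        _, p = simplex_weights(f_vals, delta)  # from the Algorithm 2 sketch
        _, q = simplex_weights(g_vals, delta)
        grad = sum(pk * gf for pk, gf in zip(p, grads_f)) + \
               sum(qk * gg for qk, gg in zip(q, grads_g))
        w = w - alpha * grad                   # gradient step of line 9
    return w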

Theorem 5. Assume that $M = \max_i f_i$. The gradient descent algorithm requires $\mathcal{O}\left(\frac{k}{\epsilon}\log\left(\frac{M}{\epsilon}\right)\right)$ gradient evaluations to converge to an $\epsilon$-optimal saddle point of the optimization problem (23).

In Appendix F, we extend the methodology to the case where y contains missing entries.

5. Experiments

In this section, we evaluate RIFLE's performance on a diverse set of inference tasks in the presence of missing values. We compare RIFLE's performance to several state-of-the-art approaches for data imputation on synthetic and real-world datasets. The experiments are designed so that the sensitivity of the model to factors such as the number of samples, the data dimension, and the type and proportion of missing values can be evaluated. The description of all datasets used in the experiments can be found in Appendix I.

5.1. Evaluation Metrics

We need access to the ground-truth values of the missing entries to evaluate RIFLE and other state-of-the-art imputation approaches. Hence, we artificially mask a proportion of available data entries and predict them with different imputation methods. A method performs better than others if the predicted missing entries are closer to the ground-truth values. To measure the performance of RIFLE and the existing approaches on a regression task for a given test dataset consisting of N data points, we use normalized root mean squared error (NRMSE), defined as:

$$\text{NRMSE} = \sqrt{\frac{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

where $y_i$, $\hat{y}_i$, and $\bar{y}$ represent the true value of the $i$-th data point, the predicted value of the $i$-th data point, and the average of the true values of the data points, respectively. In all experiments, the generated missing entries follow either a missing completely at random (MCAR) or a missing not at random (MNAR) pattern. A discussion of the procedure for generating these patterns can be found in Appendix G.
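A direct Python transcription of this metric (our own helper, assuming the sqrt form above):

import numpy as np

# NRMSE: root mean squared error normalized by the standard deviation of the
# ground-truth values, as defined above.
def nrmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.mean((y_true - y_pred) ** 2)
    den = np.mean((y_true - y_true.mean()) ** 2)
    return np.sqrt(num / den)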

5.2. Tuning Hyper-parameters of RIFLE

The hyper-parameter $c$ in (7) controls the robustness of the model by adjusting the size of the confidence intervals. This parameter is tuned by cross-validation over the set {0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 50, 100}, and the value with the lowest NRMSE is chosen. The default value in the implementation is $c = 1$, since it consistently performs well across different experiments. Furthermore, $\lambda$, the hyper-parameter of the ridge regression regularizer, is tuned over the set {0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50} using 20% of the data as a validation set. To tune $K$, the number of bootstrap samples for estimating the confidence intervals, we tried 10, 20, 50, and 100; no significant difference in test performance was observed among these values.

Furthermore, we tune the hyper-parameters of the competing packages as follows. For KNN-Imputer (Troyanskaya et al., 2001), we try {2, 10, 20, 50} for the number of neighbors ($K$) and pick the one with the highest performance. For MICE (Buuren & Groothuis-Oudshoorn, 2010) and Amelia (Honaker et al., 2011), we generate 5 different imputed datasets and pick the one with the highest performance on the test data. MissForest has multiple hyper-parameters. We keep the criterion as "MSE" since our evaluation measure is NRMSE. Moreover, we tune the number of iterations and the number of estimators (number of trees) by checking values from {5, 10, 20} and {50, 100, 200}, respectively. We do not change the network structures of MIDA (Gondara & Wang, 2018) and GAIN (Yoon et al., 2018), and their default versions are used for imputing the datasets.

5.3. RIFLE Consistency

In Theorem 2 Part (a), we showed that RIFLE is consistent. In Figure 3, we investigate the consistency of RIFLE on synthetic datasets with different proportions of missing values. The synthetic data has 50 input features following a jointly normal distribution with a mean vector whose entries are randomly chosen from the interval (−100, 100). Moreover, the covariance matrix equals $\Sigma = SS^T$, where the elements of $S$ are randomly picked from (−1, 1) and the dimension of $S$ is 50 × 20. The target variable is a linear function of the input features plus zero-mean normal noise with a standard deviation of 0.01. As depicted in Figure 3, RIFLE requires fewer samples to recover the ground-truth parameters of the model compared to MissForest, KNN Imputer, Expectation Maximization (Dempster et al., 1977), and MICE. Amelia performs notably well since the predictors follow a joint normal distribution and the underlying model is linear. Note that by increasing the number of samples, the NRMSE of our framework converges to 0.01, which is the standard deviation of the zero-mean Gaussian noise added to each target value (the dashed line).

Figure 3:

Comparing the consistency of RIFLE, MissForest, KNN Imputer, MICE, Amelia, and Expectation Maximization methods on a synthetic dataset containing 40% of missing values.

5.4. Data Imputation via RIFLE

As explained in Section 3, while the primary goal of RIFLE is to learn a robust regression model in the presence of missing values, it can also be used as an imputation tool. We run RIFLE and several state-of-the-art approaches on five datasets from the UCI repository (Dua & Graff, 2017) (the Spam, Housing, Clouds, Breast Cancer, and Parkinson datasets) with different proportions of MCAR missing values (the description of the datasets can be found in Appendix I). Then, we compute the NRMSE of the imputed entries. Table 1 shows the performance of RIFLE compared to other approaches on these datasets, where the proportion of missing values is relatively high ($\frac{n(1-p)}{d} \sim \mathcal{O}(1)$). RIFLE outperforms these methods in almost all cases and performs slightly better than MissForest, which uses a highly non-linear model (random forest) to impute missing values.

Table 1:

Performance comparison of RIFLE, QRIFLE (Quadratic RIFLE), and state-of-the-art methods on several UCI datasets. We applied the imputation methods to three different missing-value proportions for each dataset. The best imputer is highlighted in bold, and the second-best imputer is underlined. Each experiment is repeated 5 times, and the average and standard deviation of the performances are reported.

Dataset Name RIFLE QRIFLE MICE Amelia GAIN MissForest MIDA EM
Spam (30%) 0.87 ±0.009 0.82 ±0.009 1.23 ±0.012 1.26 ±0.007 0.91 ±0.005 0.90 ±0.013 0.97 ±0.008 0.94 ± 0.004
Spam (50%) 0.90 ±0.013 0.86 ±0.014 1.29 ±0.018 1.33 ±0.024 0.93 ±0.015 0.92 ±0.011 0.99 ±0.011 0.97 ± 0.008
Spam (70%) 0.92 ±0.017 0.91 ±0.019 1.32 ±0.028 1.37 ±0.032 0.97 ±0.014 0.95 ±0.016 0.99 ±0.018 0.98 ± 0.017
Housing (30%) 0.86 ±0.015 0.89 ±0.018 1.03 ±0.024 1.02 ±0.016 0.82 ±0.015 0.84 ±0.018 0.93 ±0.025 0.95 ± 0.011
Housing (50%) 0.88 ±0.021 0.90 ±0.024 1.14 ±0.029 1.09 ±0.027 0.88 ±0.019 0.88 ±0.018 0.98 ±0.029 0.96 ± 0.016
Housing (70%) 0.92 ±0.026 0.95 ±0.028 1.22 ±0.036 1.18 ±0.038 0.95 ±0.027 0.93 ±0.024 1.02 ±0.037 0.98 ± 0.017
Clouds (30%) 0.81 ±0.018 0.79 ±0.019 0.98 ±0.024 1.04 ±0.027 0.76 ±0.021 0.71 ±0.011 0.83 ±0.022 0.86 ± 0.013
Clouds (50%) 0.84 ±0.026 0.84 ±0.028 1.10 ±0.041 1.13 ±0.046 0.82 ±0.027 0.75 ±0.023 0.88 ±0.033 0.89 ± 0.018
Clouds (70%) 0.87 ±0.029 0.90 ±0.033 1.16 ±0.044 1.19 ±0.048 0.89 ±0.035 0.81 ±0.031 0.93 ±0.044 0.92 ± 0.023
Breast Cancer (30%) 0.52 ±0.023 0.54 ±0.027 0.74 ±0.031 0.81 ±0.032 0.58 ±0.024 0.55 ±0.016 0.70 ±0.026 0.67 ± 0.014
Breast Cancer (50%) 0.56 ±0.026 0.59 ±0.027 0.79 ±0.029 0.85 ±0.033 0.64 ±0.025 0.59 ±0.022 0.76 ±0.035 0.69 ± 0.022
Breast Cancer (70%) 0.59 ±0.031 0.65 ±0.034 0.86 ±0.042 0.92 ±0.044 0.70 ±0.037 0.63 ±0.028 0.82 ±0.035 0.67 ± 0.014
Parkinson (30%) 0.57 ±0.016 0.55 ±0.016 0.71 ±0.019 0.67 ±0.021 0.53 ±0.015 0.54 ±0.010 0.62 ±0.017 0.64 ± 0.011
Parkinson (50%) 0.62 ±0.022 0.64 ±0.025 0.77 ±0.029 0.74 ±0.034 0.61 ±0.022 0.65 ±0.014 0.71 ±0.027 0.69 ± 0.022
Parkinson (70%) 0.67 ±0.027 0.74 ±0.033 0.85 ±0.038 0.82 ±0.037 0.69 ±0.031 0.73 ±0.022 0.78 ±0.038 0.75 ± 0.029

5.5. Sensitivity of RIFLE to the Number of Samples and Proportion of Missing Values

In this section, we analyze the sensitivity of RIFLE and other state-of-the-art approaches to the number of samples and the proportion of missing values. In the experiment in Figure 4, for each of four real datasets from the UCI Repository (Dua & Graff, 2017) (Spam, Parkinson, Wave Energy Converter, and Breast Cancer; descriptions in Appendix I), we create five versions containing 40%, 50%, 60%, 70%, and 80% MCAR missing values, respectively. Given a feature in a dataset containing missing values, we say an imputer wins that feature if its imputation error in terms of NRMSE is less than that of the other imputers. Figure 4 reports the number of features won by each imputer on the datasets described above. As we observe, the number of wins for RIFLE increases as we increase the proportion of missing values. This observation shows that, in general, the sensitivity of RIFLE as an imputer to the proportion of missing values is lower than that of MissForest and MICE.

Figure 4:

Performance comparison of RIFLE, MICE, and MissForest on four UCI datasets: Parkinson, Spam, Wave Energy Converter, and Breast Cancer. For each dataset, we count the number of features for which each method outperforms the others.

Figure 4 does not show how an increase in the proportion of missing values changes the NRMSE of the imputers. Next, we analyze the sensitivity of RIFLE and several imputers to changes in the missing-value proportion. Fixing the proportion of missing values, we generate 10 random datasets containing missing values in random locations on the Drive dataset (the description of the datasets is available in Appendix I). We impute the missing values of each dataset with RIFLE, MissForest, Mean Imputation, and MICE. Figure 5 shows the average and standard deviation of these four imputers' performances for different proportions of missing values (10% to 90%) on the Drive dataset; for each proportion, we select 400 data points and report the average NRMSE of the imputed entries. Finally, in Figure 6, we evaluate RIFLE and other methods on the BlogFeedback dataset (see Appendix I) containing 40% missing values. The results show that RIFLE's performance is less sensitive to a decrease in the number of samples.

Figure 5:

Sensitivity of RIFLE, MissForest, Amelia, KNN Imputer, MIDA, and Mean Imputer to the percentage of missing values on the Drive dataset. Increasing the percentage of missing value entries degrades the benchmarks’ performance compared to RIFLE. KNN-imputer implementation cannot be executed on datasets containing 80% (or more) missing entries. Moreover, Amelia and MIDA do not converge to a solution when the percentage of missing value entries is higher than 70%.

Figure 6:

Sensitivity of RIFLE, MissForest, MICE, Amelia, Mean Imputer, KNN Imputer, and MIDA to the number of samples for the imputation of the Blog Feedback dataset containing 40% MCAR missing values. When the number of samples is limited, RIFLE outperforms the other methods, and its performance is very close to that of the non-linear imputer MissForest for larger sample sizes.

5.6. Performance Comparison on Real Datasets

In this section, we compare the performance of RIFLE to several state-of-the-art approaches, including MICE (Buuren & Groothuis-Oudshoorn, 2010), Amelia (Honaker et al., 2011), MissForest (Stekhoven & Bühlmann, 2012), KNN Imputer (Troyanskaya et al., 2001), and MIDA (Gondara & Wang, 2018). There are two primary ways to do this. One approach to predicting a continuous target variable in a dataset with many missing values is to first impute the missing data with a state-of-the-art package and then run linear regression. An alternative approach is to learn the target variable directly, as discussed in Section 3.

Table 2 compares the performance of mean imputation, MICE, MIDA, MissForest, and KNN to that of RIFLE on three datasets: NHANES, Blog Feedback, and Superconductivity. Both the Blog Feedback and Superconductivity datasets contain 30% MNAR missing values generated by Algorithm 9, with 10000 and 20000 training samples, respectively. The description of the NHANES data and its distribution of missing values can be found in Appendix I.

Table 2:

Normalized RMSE of RIFLE and several state-of-the-art methods on the Superconductivity, Blog Feedback, and NHANES datasets. The first two datasets contain 30% missing-not-at-random (MNAR) missing values in the training phase, generated by Algorithm 9. Each method is applied 5 times to each dataset, and the result is reported as the average performance ± the standard deviation across experiments in terms of NRMSE.

Methods Superconductivity Blog Feedback NHANES
Regression on Complete Data 0.4601 0.7432 0.6287
RIFLE 0.4873 ± 0.0036 0.8326 ± 0.0085 0.6304 ± 0.0027
Mean Imputer + Regression 0.6114 ± 0.0006 0.9235 ± 0.0003 0.6329 ± 0.0008
MICE + Regression 0.5078 ± 0.0124 0.8507 ± 0.0325 0.6612 ± 0.0282
EM + Regression 0.5172 ± 0.0162 0.8631 ± 0.0117 0.6392 ± 0.0122
MIDA Imputer + Regression 0.5213 ± 0.0274 0.8394 ± 0.0342 0.6542 ± 0.0164
MissForest 0.4925 ± 0.0073 0.8191 ± 0.0083 0.6365 ± 0.0094
KNN Imputer 0.5438 ± 0.0193 0.8828 ± 0.0124 0.6427 ± 0.0135

Efficiency of RIFLE:

We run RIFLE for 1000 iterations with a step size of 0.01 in the above experiments. At each iteration, the main operation is to find the optimal $\theta$ for the given $b$ and $C$. The average runtime of each method on each dataset is reported in Table 5 in Appendix L. The main reason for the time efficiency of RIFLE compared to MICE, MissForest, MIDA, and KNN Imputer is that it directly predicts the target variable without imputing all missing entries.

Since MICE and MIDA cannot predict values during the test phase without data imputation, we use them in a pre-processing stage to impute the data and then apply linear regression to the imputed dataset. On the other hand, RIFLE, KNN Imputer, and MissForest can predict the target variable without imputing the training dataset. Table 2 shows that RIFLE outperforms all other state-of-the-art approaches on the three mentioned datasets. In particular, RIFLE outperforms MissForest, even though the underlying model RIFLE uses is simpler (linear) compared to the nonlinear random forest model utilized by MissForest.

5.6.1. Performance of RIFLE on Classification Tasks

In Section 4, we discussed how to specialize RIFLE to robust normal discriminant analysis in the presence of missing values. Since the maximization problem over the second moments of the data ($\Sigma$) is intractable, we solve the maximization problem over a set of $k$ covariance matrices estimated by bootstrap sampling. To investigate the effect of the choice of $k$ on the performance of the robust classifier, we train robust normal discriminant analysis models for different values of $k$ on two training datasets (Avila and Magic) containing 40% MCAR missing values. The description of the datasets can be found in Appendix I. For $k = 1$, there is no maximization problem, and the method is thus equivalent to the classifier proposed in Fung & Wrobel (1989). As shown in Figure 7, increasing the number of covariance estimations generally enhances the accuracy of the classifier in the test phase. However, as shown in Theorem 5, the time required to complete the training phase grows linearly with the number of covariance estimations.

Figure 7:

Effect of the number of covariance estimations on the performance (left) and run time (right) of robust LDA on the Avila and Magic datasets. Increasing the number of covariance estimations ($k$) improves the model's accuracy on the test data but increases the training time.

5.6.2. Comparison of Robust Linear Regression and Robust QDA

An alternative approach to the robust QDA presented in Section 4 is to apply the robust linear regression algorithm (Section 3) and map the solutions to the classes by thresholding (positive values map to label 1 and negative values to label −1).

Table 4 compares the performance of the two approaches, alongside several baseline imputers, on datasets with different feature types. As demonstrated in the table, when all features are continuous, quadratic discriminant analysis performs better. This indicates that the QDA model relies heavily on the normality assumption, while robust linear regression handles categorical features better than robust QDA.

Table 4:

Accuracy of RIFLE, Robust QDA, MissForest, MICE, KNN-Imputer, and Expectation Maximization (EM) on different discrete, mixed, and continuous datasets. Robust QDA can perform better than the other methods when the input features are continuous and the target variable is discrete. However, RIFLE results in higher accuracy in the mixed and discrete settings.

Accuracy of Methods
Dataset Feature Type RIFLE Robust QDA MissForest MICE KNN Imputer EM
Glass Identification Continuous 67.12% ± 1.84% 69.54% ± 1.97% 65.76% ± 1.49% 62.48% ± 2.45% 60.37% ± 1.12% 68.21% ± 0.94%
Annealing Mixed 63.41% ± 2.44% 59.51% ± 2.21% 64.91% ± 1.35% 60.66% ± 1.59% 57.44% ± 1.44% 59.43% ± 1.29%
Abalone Mixed 68.41% ± 0.74% 63.27% ± 0.76% 69.40% ± 0.42% 63.12% ± 0.98% 62.43% ± 0.38% 62.91% ± 0.37%
Lymphography Discrete 66.32% ± 1.05% 58.15% ± 1.21% 66.11% ± 0.94% 55.73% ± 1.24% 57.39% ± 0.88% 59.55% ± 0.68%
Adult Discrete 72.42% ± 0.06% 60.36% ± 0.08% 70.34% ± 0.03% 63.30% ± 0.14% 60.14% ± 0.00% 60.69% ± 0.01%

Limitations and Future Directions:

The proposed framework for robust regression in the presence of missing values is limited to linear models. While in Appendix E, we use polynomial kernels to apply non-linear transformations on the data, such an approach can potentially increase the number of missing values in the kernel space generated by the composition of the original features. A future direction is to develop efficient algorithms for non-linear regression models such as multi-layer neural networks, decision tree regressors, gradient boosting regressors, and support vector regression models. In the case of robust classification, the methodology is extendable to any loss beyond quadratic discriminant analysis. Unlike the regression case, a limitation of the proposed method for robust classification is its reliance on the Gaussianity assumption of data distribution (conditioned on each data label). A natural extension is to assume the underlying data distribution follows a mixture of Gaussian distributions.

Conclusion:

In this paper, we proposed a distributionally robust optimization framework over the set of distributions whose low-order marginals lie within estimated confidence intervals, for inference and imputation in the presence of missing values. We developed algorithms for regression and classification with convergence guarantees. The method's performance is evaluated on synthetic and real datasets with different numbers of samples, dimensions, missing-value proportions, and types of missing values. In most experiments, RIFLE consistently outperforms the other existing methods.

Table 3:

Sensitivity of Linear Discriminant Analysis, Robust LDA (Common Covariance Matrices), and Robust QDA (Different Covariance matrices for two groups) to the number of training samples.

Number of Training Data Points LDA Robust LDA Robust QDA
50 52.38% ± 3.91% 62.14% ± 1.78% 61.36% ± 1.62%
100 61.24% ± 1.89% 68.46% ± 1.04% 70.07% ± 0.95%
200 73.49% ± 0.97% 73.35% ± 0.67% 73.51% ± 0.52%

Acknowledgments

This work was supported by the NIH/NSF Grant 1R01LM013315-01, the NSF CAREER Award CCF-2144985, and the AFOSR Young Investigator Program Award FA9550-22-1-0192.

A A Review of Missing Value Imputation Methods in the Literature

The fundamental idea behind many data imputation approaches is that missing values can be predicted based on the available data of other data points and correlated features. One of the most straightforward imputation techniques is to replace missing values with the mean (or median) of that feature calculated from the available data; see Little & Rubin (2019, Chapter 3). However, this naïve approach ignores the correlation between features and does not preserve the variance of features. Another class of imputers has been developed based on least-squares methods (Raghunathan et al., 2001; Kim et al., 2005; Zhang et al., 2008; Cai et al., 2006). Raghunathan et al. (2001) learn a linear model with multivariate Gaussian noise for the feature with the fewest missing entries, and repeat the same procedure on the updated data to impute the feature with the next-fewest missing entries, until all features are completely imputed. One drawback of this approach is that the error from the imputation of previous features can propagate to subsequent features. To impute the entries of a given feature in a dataset, Kim et al. (2005) learn several univariate regression models that consider that feature as the response, and then take the average of these predictions as the final imputed value. This approach fails to capture correlations involving more than two features.

Many more complex algorithms have been developed for imputation, although many are sensitive to initial assumptions and may not converge. For instance, KNN-Imputer imputes a missing feature of a data point by taking the mean value of the K closest complete data points (Troyanskaya et al., 2001). MissForest, on the other hand, imputes the missing values of each feature by learning a random forest classifier using the other features of the training data (Stekhoven & Bühlmann, 2012). MissForest does not need to assume that all features are continuous (Honaker et al., 2011) or categorical (Schafer, 1997). However, neither KNN-Imputer nor MissForest guarantees statistical or computational convergence. Moreover, when the proportion of missing values is high, both are likely to suffer a severe drop in performance, as demonstrated in Section 5. Expectation maximization (EM) is another popular approach that learns the parameters of a prior distribution on the data using the available values (Dempster et al., 1977); see also Ghahramani & Jordan (1994) and Honaker et al. (2011). The EM algorithm is also used in Amelia, which fits a jointly normal distribution to the data using EM and the bootstrap technique (Honaker et al., 2011). While Amelia demonstrates superior performance on datasets following a normal distribution, it is highly sensitive to violations of the normality assumption (as discussed in Bertsimas et al. (2017)). Ghahramani & Jordan (1994) adopt the EM algorithm to learn a joint Bernoulli distribution for the categorical data and a joint Gaussian distribution for the continuous variables independently. While these algorithms can be viewed as inference methods based on low-order moment estimates, they do not consider the uncertainty in such estimates. By contrast, our framework utilizes robust optimization to account for the uncertainty around the estimated moments. Moreover, unlike some of the algorithms mentioned above, our optimization procedure for imputation and prediction is guaranteed to converge.

Another popular method for data imputation is multiple imputation by chained equations (MICE). MICE learns a parametric distribution for each feature conditional on the remaining features. For instance, it may assume that the current target variable is a linear function of the other features with zero-mean Gaussian noise. Each feature can have its own distinct distribution and parameters (e.g., Poisson regression, logistic regression). Based on the learned parameters of the conditional distributions, MICE can generate one or more imputed datasets (Buuren & Groothuis-Oudshoorn, 2010). More recently, several neural network-based imputers have been proposed. GAIN (Generative Adversarial Imputation Network) learns a generative adversarial network based on the available data and then imputes the missing values using the trained generator (Yoon et al., 2018). One advantage of GAIN over other existing GAN imputers is that it does not need a complete dataset during the training phase. MIDA (Multiple Imputation using Denoising Autoencoders) is an auto-encoder-based approach that trains a denoising auto-encoder on the available data, treating the missing entries as noise (Gondara & Wang, 2018). Similar to other neural network-based methods, these algorithms suffer from their black-box nature: they are challenging to interpret or explain, making them less suitable for mission-critical healthcare applications. In addition, no statistical or computational guarantees are provided for these algorithms.

Bertsimas et al. (2017) formulate the imputation task as a constrained optimization problem in which the constraints are determined by the underlying predictive model, such as KNN (k-nearest neighbors), SVM (support vector machines), or decision trees. Their general framework is non-convex, and the authors relax the optimization problem for each choice of cost function; the relaxed problem is then optimized with first-order and block coordinate descent methods. They demonstrate the convergence and accuracy of their algorithm numerically, but a theoretical analysis guaranteeing convergence is absent from their work.

B Estimating Confidence Intervals of Low-order Moments

In this section, we explain the methodology for estimating confidence intervals for $\mathbb{E}[z_i]$ and $\mathbb{E}[z_i z_j]$. Let $\mathbf{X} \in \tilde{\mathbb{R}}^{n \times d}$ and $\mathbf{y}$ be the data matrix and the target variables of the $n$ given data points, respectively, where $\tilde{\mathbb{R}} = \mathbb{R} \cup \{*\}$ and the symbol $*$ represents a missing entry. Moreover, let $\mathbf{a}_i$ denote the $i$-th column (feature) of the matrix $\mathbf{X}$. We define:

$$\tilde{a}_i(k) = \begin{cases} a_i(k) & \text{if } a_i(k) \neq * \\ 0 & \text{if } a_i(k) = * \end{cases}$$

Thus, $\tilde{a}_i$ is obtained by replacing the missing values of $a_i$ with 0. We estimate confidence intervals for the mean and covariance of the features using multiple bootstrap samples of the available data. Let $C_0[i][j]$ and $\Delta[i][j]$ be the center and the radius of the confidence interval for $C[i][j]$, respectively. We compute the center of the confidence interval for $C[i][j]$ as follows:

$$C_0[i][j] = \frac{1}{m_{ij}} \tilde{a}_i^T \tilde{a}_j \tag{24}$$

where $m_i = |\{k : a_i(k) \neq *\}|$ and $m_{ij} = |\{k : a_i(k) \neq *, \; a_j(k) \neq *\}|$. This estimator is obtained from the rows where both features are available. More precisely, let $M$ be the mask of the input data matrix $\mathbf{X}$, defined as:

$$M_{ij} = \begin{cases} 0 & \text{if } X_{ij} \text{ is missing}, \\ 1 & \text{otherwise}. \end{cases}$$

Note that $m_{ij} = (M^T M)_{ij}$ is the number of rows in the dataset where both features $i$ and $j$ are available. To estimate the confidence interval for $C_{ij}$, we use Algorithm 4. First, we draw $K$ samples of size $N = m_{ij}$ from the rows where both features are available; each sample is obtained by applying a bootstrap sampler (sampling with replacement) to those $m_{ij}$ rows. Then, we compute the second-order moment of the two features within each sample.
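As an illustration of these quantities, the following minimal numpy sketch (our own illustration, not the released RIFLE code) builds the mask $M$, the pairwise counts $m_{ij} = (M^T M)_{ij}$, and the centers $C_0[i][j]$, assuming missing entries are encoded as `np.nan`:

```python
import numpy as np

# Toy data matrix with missing entries encoded as np.nan.
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, np.nan],
              [4.0, 6.0, 7.0]])

M = (~np.isnan(X)).astype(float)   # M_ij = 1 if X_ij is observed, 0 otherwise
m = M.T @ M                        # m[i, j] = number of rows where features i and j are both observed

X_tilde = np.nan_to_num(X, nan=0.0)            # replace missing values with 0, as in the definition of a~_i
C0 = (X_tilde.T @ X_tilde) / np.maximum(m, 1)  # center of the confidence interval for C[i][j], Eq. (24)
```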

To find the radius of the confidence interval for a given pair $(i, j)$ of features, we draw $K$ bootstrap samples from the rows where both features $i$ and $j$ are available and compute the second-order moment estimate $C_0[i][j]$ on each bootstrap sample. The standard deviation of these estimates determines the radius of the corresponding confidence interval. Algorithm 4 summarizes the steps for computing the confidence-interval radius $\Delta_{ij}$ for the $ij$-th entry of the covariance matrix. The confidence intervals for $\mu$ can be computed similarly. Having $C_0$ and $\Delta$, the confidence intervals for the matrix $C$ are computed as follows:

$$C^{\min} = C_0 - c\Delta, \qquad C^{\max} = C_0 + c\Delta,$$

Computing $b^{\min}$ and $b^{\max}$ is done in the same manner. The hyper-parameter $c$ controls the robustness of the model by tuning the length of the confidence intervals. A larger $c$ corresponds to wider confidence intervals and, thus, a more robust estimator. On the other hand, very large values of $c$ lead to overly wide confidence intervals that can adversely affect the performance of the trained model.

Algorithm 4 Estimating Confidence Interval Length Δij for Feature i and Feature j.
1: Input: $K$: number of bootstrap estimations
2: for $t = 1, \ldots, K$ do
3:   Pick $n$ samples with replacement from the rows where both the $i$-th and $j$-th features are available.
4:   Let $(\hat{X}_i^1, \hat{X}_j^1), \ldots, (\hat{X}_i^n, \hat{X}_j^n)$ be the $i$-th and $j$-th features of the selected samples.
5:   $C_t = \frac{1}{n} \sum_{r=1}^{n} \hat{X}_i^r \hat{X}_j^r$
6: $\Delta_{ij} = \mathrm{std}(C_1, C_2, \ldots, C_K)$

Remark 6. Since the confidence intervals for different entries of the covariance matrix are computed independently of each other, they can be computed in parallel. In particular, if $\gamma$ cores are available, $d/\gamma$ features (columns of the covariance matrix) can be assigned to each of the available cores.
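A minimal sketch of Algorithm 4 in numpy is given below (our own illustration, not the released RIFLE code; the function name, the default number of bootstrap samples $K$, and the nan-encoding of missing entries are assumptions):

```python
import numpy as np

def bootstrap_ci_radius(X, i, j, K=30, rng=None):
    """Sketch of Algorithm 4: estimate the radius Delta_ij of the confidence
    interval for C[i][j] by bootstrapping the rows where both features i and j
    are observed (missing entries encoded as np.nan)."""
    rng = np.random.default_rng(rng)
    rows = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])    # rows with both features available
    xi, xj = X[rows, i], X[rows, j]
    n = len(xi)
    estimates = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)              # sampling with replacement
        estimates.append(np.mean(xi[idx] * xj[idx]))  # second-order moment on the bootstrap sample
    return np.std(estimates)                          # Delta_ij = std(C_1, ..., C_K)
```

As noted in Remark 6, calls to such a routine for different pairs $(i, j)$ are independent and can be parallelized across cores.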

C Solving Robust Ridge Regression with the Optimal Convergence Rate

The convergence of Algorithm 1 to the optimal solution of Problem (6) can be slow in practice, since the algorithm requires a matrix inversion to update $\theta$ and a projection onto the box constraints on $C$ and $b$ at each iteration. While the minimization over $\theta$ is solved in closed form, we can speed up the maximization by applying Nesterov's acceleration method to the function $g(b, C)$ in (7). Since $g$ is the pointwise minimum of functions that are linear in $(b, C)$, it is concave, and its gradient with respect to $C$ and $b$ can be computed using Danskin's theorem. Algorithm 5 describes the steps for optimizing Problem (7) with Nesterov's acceleration method.

Algorithm 5 Applying Nesterov's Acceleration Method to Robust Linear Regression
1: Input: $C_0, b_0, \Delta, \delta, T$
2: Initialize: $C_1 = C_0$, $b_1 = b_0$, $\gamma_0 = 0$, $\gamma_1 = 1$.
3: for $i = 1, \ldots, T$ do
4:   $\gamma_{i+1} = \frac{1 + \sqrt{1 + 4\gamma_i^2}}{2}$
5:   $Y_C^i = C_i + \frac{\gamma_i - 1}{\gamma_{i+1}} (C_i - C_{i-1})$
6:   $C_{i+1} = \Pi_{\Delta}^{+}\!\left(Y_C^i + \frac{1}{L}\, \theta\theta^T\right)$
7:   $Y_b^i = b_i + \frac{\gamma_i - 1}{\gamma_{i+1}} (b_i - b_{i-1})$
8:   $b_{i+1} = \Pi_{\delta}\!\left(Y_b^i - \frac{2\theta}{L}\right)$
9:   Set $\theta = (C_{i+1} + \lambda I)^{-1} b_{i+1}$
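For illustration, the following numpy sketch implements the accelerated ascent iteration above under simplifying assumptions (our own sketch: only the box part of the projection $\Pi_{\Delta}^{+}$ is applied and the PSD part is omitted, $\lambda > 0$ is assumed so that $C + \lambda I$ is invertible, and the Lipschitz constant $L$ is treated as a user-supplied value):

```python
import numpy as np

def nesterov_robust_ridge(C0, b0, Delta, delta, lam, L, T=200):
    """Sketch of Algorithm 5 with box projections only (PSD projection omitted)."""
    Cmin, Cmax = C0 - Delta, C0 + Delta
    bmin, bmax = b0 - delta, b0 + delta
    C_prev, b_prev = C0.copy(), b0.copy()
    C, b = C0.copy(), b0.copy()
    gamma = 1.0
    theta = np.linalg.solve(C + lam * np.eye(len(b0)), b)        # closed-form minimizer in theta
    for _ in range(T):
        gamma_next = (1.0 + np.sqrt(1.0 + 4.0 * gamma**2)) / 2.0
        YC = C + (gamma - 1.0) / gamma_next * (C - C_prev)        # momentum step for C
        Yb = b + (gamma - 1.0) / gamma_next * (b - b_prev)        # momentum step for b
        C_prev, b_prev = C, b
        C = np.clip(YC + np.outer(theta, theta) / L, Cmin, Cmax)  # gradient ascent + box projection
        b = np.clip(Yb - 2.0 * theta / L, bmin, bmax)
        theta = np.linalg.solve(C + lam * np.eye(len(b0)), b)     # re-solve the inner minimization
        gamma = gamma_next
    return theta, C, b
```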

Theorem 7. Let $(\tilde{\theta}, \tilde{C}, \tilde{b})$ be the optimal solution of (6) and $D = \|C_0 - \tilde{C}\|_F^2 + \|b_0 - \tilde{b}\|_2^2$. Assume that, for any $b$ and $C$ within the uncertainty sets described in (6), $\|\theta^*(b, C)\| \leq \tau$. Then, Algorithm 5 computes an $\epsilon$-optimal solution of the objective function in $\mathcal{O}\!\left(\sqrt{\frac{D(\tau+1)^2}{\lambda \epsilon}}\right)$ iterations.

Proof. The proof is relegated to Appendix H.

D Solving the Dual Problem of the Robust Ridge Linear Regression via ADMM

The Alternating Direction Method of Multipliers (ADMM) is a popular algorithm for efficiently solving linearly constrained optimization problems (Gabay & Mercier, 1976; Hong et al., 2016). It has been extensively applied to large-scale optimization problems in machine learning and statistical inference in recent years (Assländer et al., 2018; Zhang et al., 2018). Consider the following optimization problem, consisting of two blocks of variables $w$ and $z$ that are linearly coupled:

$$\min_{w, z} \; f(w) + g(z) \quad \text{s.t.} \quad Aw + Bz = c. \tag{25}$$

The augmented Lagrangian of the above problem can be written as:

$$\mathcal{L}_{\rho}(w, z; \lambda) = f(w) + g(z) + \langle Aw + Bz - c, \lambda \rangle + \frac{\rho}{2} \|Aw + Bz - c\|^2. \tag{26}$$

The ADMM scheme updates the primal and dual variables iteratively, as presented in Algorithm 6.

Algorithm 6 General ADMM Algorithm
1: for $t = 1, \ldots, T$ do
2:   $w^{t+1} = \arg\min_{w} \; f(w) + \langle Aw + Bz^t - c, \lambda^t \rangle + \frac{\rho}{2} \|Aw + Bz^t - c\|^2$
3:   $z^{t+1} = \arg\min_{z} \; g(z) + \langle Aw^{t+1} + Bz - c, \lambda^t \rangle + \frac{\rho}{2} \|Aw^{t+1} + Bz - c\|^2$
4:   $\lambda^{t+1} = \lambda^t + \rho\,(Aw^{t+1} + Bz^{t+1} - c)$
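As a toy instance of Algorithm 6 (our own illustration, unrelated to the RIFLE objective), consider the consensus problem $\min_{w,z} \tfrac{1}{2}\|w - a\|^2 + \tfrac{1}{2}\|z - b\|^2$ s.t. $w - z = 0$, for which both sub-problems have closed forms:

```python
import numpy as np

# Consensus ADMM: f(w) = 0.5*||w - a||^2, g(z) = 0.5*||z - b||^2, A = I, B = -I, c = 0.
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
rho = 1.0
w = z = lam = np.zeros(2)
for _ in range(100):
    w = (a - lam + rho * z) / (1.0 + rho)   # argmin_w of the augmented Lagrangian
    z = (b + lam + rho * w) / (1.0 + rho)   # argmin_z of the augmented Lagrangian
    lam = lam + rho * (w - z)               # dual ascent on the multiplier
print(w, z)  # both converge to (a + b) / 2 = [2.0, 0.5]
```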

As mentioned earlier, the simultaneous projection of $C$ onto the set of positive semi-definite matrices and the box constraint $C^{\min} \leq C \leq C^{\max}$ in Algorithm 1 and Algorithm 5 is computationally expensive. Moreover, careful step-size tuning is necessary to avoid inconsistency and guarantee convergence of those algorithms.

An alternative approach for solving Problem (6), which avoids removing the PSD constraint (as is done in the implementations of Algorithm 1 and Algorithm 5), is to solve the dual of the inner maximization problem. Since the maximization problem is concave with respect to $C$ and $b$, and the relative interior of its feasible set is non-empty, the duality gap is zero. Hence, instead of solving the inner maximization problem, we can solve its dual, which is a minimization problem. Theorem 8 describes the dual of the inner maximization problem in (6). Thus, Problem (6) can be alternatively formulated as a minimization problem rather than a min-max problem. We can solve such a constrained minimization problem efficiently via the ADMM algorithm. As we will show, the ADMM algorithm applied to the dual problem does not require step-size tuning or simultaneous projection onto the box and positive semi-definite (PSD) constraints.

Theorem 8. (Dual Problem) The inner maximization problem described in (6) can be equivalently formulated as:

$$\begin{aligned}
\min_{A, B, d, e, H} \quad & -\langle b^{\min}, d \rangle + \langle b^{\max}, e \rangle - \langle C^{\min}, A \rangle + \langle C^{\max}, B \rangle + \lambda \|\theta\|^2 \\
\text{s.t.} \quad & -\theta\theta^T - A + B - H = 0, \quad 2\theta - d + e = 0, \\
& A, B, d, e \geq 0, \quad H \succeq 0.
\end{aligned}$$

Therefore, Problem (6) can be alternatively written as:

$$\begin{aligned}
\min_{\theta, A, B, d, e, H} \quad & -\langle b^{\min}, d \rangle + \langle b^{\max}, e \rangle - \langle C^{\min}, A \rangle + \langle C^{\max}, B \rangle + \lambda \|\theta\|^2 \\
\text{s.t.} \quad & -\theta\theta^T - A + B - H = 0, \quad 2\theta - d + e = 0, \\
& A, B, d, e \geq 0, \quad H \succeq 0.
\end{aligned} \tag{27}$$

Proof. The proof is relegated to Appendix H.

To apply the ADMM method to the dual problem, we need to divide the optimization variables into two blocks as in (25), such that both sub-problems in Algorithm 6 can be solved efficiently. To do so, we first introduce the auxiliary variables $d', e', \theta', A'$, and $B'$ in the dual problem. Also, let $G = H + \theta\theta^T$.

Therefore, Problem (27) is equivalent to:

$$\begin{aligned}
\min_{\theta, \theta', A, A', B, B', d, d', e, e', G} \quad & -\langle b^{\min}, d \rangle + \langle b^{\max}, e \rangle - \langle C^{\min}, A \rangle + \langle C^{\max}, B \rangle + \lambda \|\theta\|^2 \\
\text{s.t.} \quad & B - A = G, \quad 2\theta - d + e = 0, \\
& A = A', \;\; B = B', \;\; d = d', \;\; e = e', \;\; \theta = \theta', \\
& A', B', d', e' \geq 0, \quad G \succeq \theta'\theta'^T.
\end{aligned} \tag{28}$$

Since handling both constraints on $\theta$ in Problem (27) is difficult, we replace $\theta$ with $\theta'$ in the first constraint (so that the PSD constraint becomes $G \succeq \theta'\theta'^T$). Moreover, the non-negativity constraints on $A, B, d$, and $e$ are exchanged with non-negativity constraints on $A', B', d'$, and $e'$. For simplicity of presentation, let $c_1^t = b^{\min} - \mu_d^t + \rho d'^t + \eta^t$, $c_2^t = b^{\max} - \mu_e^t + \rho e'^t + \eta^t$, $c_3^t = \mu_\theta^t + \rho\theta'^t - 2\eta^t$, $D_1^t = \rho A'^t - \rho G^t + \Gamma^t - M_A^t + C^{\min}$, and $D_2^t = \rho B'^t + \rho G^t - \Gamma^t - M_B^t - C^{\max}$. Algorithm 7 describes the ADMM algorithm applied to Problem (28).

Corollary 9. If the feasible set of Problem (6) has a non-empty interior, then Algorithm 7 converges to an $\epsilon$-optimal solution of Problem (28) in $\mathcal{O}\!\left(\frac{1}{\epsilon}\right)$ iterations.

Proof. Since the inner maximization problem in (6) is concave and the interior of its feasible set is non-empty, the duality gap is zero by Slater's condition. Thus, according to Theorem 6.1 in He & Yuan (2015), Algorithm 7 converges to an optimal solution of the primal-dual problem, and the sequence of constraint residuals converges to zero, at a rate of $\mathcal{O}(1/t)$, which corresponds to the $\mathcal{O}(1/\epsilon)$ iteration complexity stated above.

Remark 10. The optimal solution obtained from the ADMM algorithm can differ from the one given by Algorithm 1, because the positive semi-definite constraint on $C$ is removed in the latter. We illustrate the difference between the solutions of the two algorithms with the following example: first, we generate a small positive semi-definite matrix $C$ and the matrix of confidence-interval radii $\Delta$ as follows:

$$C = \begin{bmatrix} 97 & 40 & 92 \\ 40 & 17 & 38 \\ 92 & 38 & 88 \end{bmatrix}, \qquad \Delta = \begin{bmatrix} 0.2 & 0.3 & 0.2 \\ 0.3 & 0.1 & 0.2 \\ 0.1 & 0.3 & 0.1 \end{bmatrix}.$$

Moreover, let $b$ and $\delta$ be generated as follows:

$$b = \begin{bmatrix} 6.65 \\ 8.97 \\ 5.40 \end{bmatrix}, \qquad \delta = \begin{bmatrix} 0.1 \\ 0.2 \\ 0.2 \end{bmatrix}.$$

Initializing both algorithms with a random matrix within $[C^{\min}, C^{\max}] = [C - \Delta, C + \Delta]$ and a random vector within $[b^{\min}, b^{\max}] = [b - \delta, b + \delta]$, the ADMM algorithm returns a solution different from that of Algorithm 1. The difference in the performance of the two algorithms at test time can also be observed in the experiments on synthetic datasets depicted in Figure 3, especially when the number of samples is small.

Algorithm 7 Applying ADMM to the Dual Reformulation of Robust Linear Regression
1: Given: $b^{\min}, b^{\max}, C^{\min}, C^{\max}, \lambda, \rho$
2: Initialize: $C_1 = C_0$, $b_1 = b_0$, $\gamma_0 = 0$, $\gamma_1 = 1$.
3: for $t = 0, \ldots, T$ do
4:   $\theta^{t+1} = \frac{1}{6\lambda + 7\rho}\left(2c_1^t - 2c_2^t - 3c_3^t\right)$
5:   $d^{t+1} = \frac{1}{6\lambda + 7\rho}\left(\frac{6\rho + 4\lambda}{\rho}\, c_1^t + \frac{4\rho + 4\lambda}{\rho}\, c_2^t + 2 c_3^t\right)$
6:   $e^{t+1} = \frac{2}{6\lambda + 7\rho}\left(\frac{\rho + 2\lambda}{\rho}\, c_1^t + \frac{3\rho + 2\lambda}{\rho}\, c_2^t - c_3^t\right)$
7:   $A'^{t+1} = \max\!\left(A^t + \frac{M_A^t}{\rho},\, 0\right)$
8:   $B'^{t+1} = \max\!\left(B^t + \frac{M_B^t}{\rho},\, 0\right)$
9:   $G^{t+1} = \left[B^t - A^t + \frac{\Gamma^t}{\rho} - \theta^t\theta^{tT}\right]_+ + \theta^t\theta^{tT}$
10:  $d'^{t+1} = \max\!\left(d^t + \frac{\mu_d^t}{\rho},\, 0\right)$
11:  $e'^{t+1} = \max\!\left(e^t + \frac{\mu_e^t}{\rho},\, 0\right)$
12:  $\theta'^{t+1} = \arg\min_{\theta'} \; \frac{\rho}{2}\|\theta^{t+1} - \theta'\|^2 + \langle \mu_\theta^t, \theta^{t+1} - \theta' \rangle \quad \text{s.t.} \quad G^{t+1} \succeq \theta'\theta'^T$
13:  $A^{t+1} = \frac{1}{3\rho}\left(2 D_1^t + D_2^t\right)$
14:  $B^{t+1} = \frac{1}{3\rho}\left(D_1^t + 2 D_2^t\right)$
15:  $M_A^{t+1} = M_A^t + \rho\,(A^{t+1} - A'^{t+1})$
16:  $M_B^{t+1} = M_B^t + \rho\,(B^{t+1} - B'^{t+1})$
17:  $\mu_d^{t+1} = \mu_d^t + \rho\,(d^{t+1} - d'^{t+1})$
18:  $\mu_e^{t+1} = \mu_e^t + \rho\,(e^{t+1} - e'^{t+1})$
19:  $\mu_\theta^{t+1} = \mu_\theta^t + \rho\,(\theta^{t+1} - \theta'^{t+1})$
20:  $\eta^{t+1} = \eta^t + \rho\,(2\theta^{t+1} - d^{t+1} + e^{t+1})$
21:  $\Gamma^{t+1} = \Gamma^t + \rho\,(B^{t+1} - A^{t+1} - G^{t+1})$

Now, we show how to apply the ADMM scheme to Problem (28) to obtain Algorithm 7. As discussed earlier, we consider two separate blocks of variables, $w = (\theta, d, e, G, B, A)$ and $z = (d', e', \theta', B', A')$. Assigning the multipliers $\Gamma, \eta, M_A, M_B, \mu_d, \mu_e$, and $\mu_\theta$ to the constraints of Problem (28) in order, we can write the corresponding augmented Lagrangian function as:

$$\begin{aligned}
\min_{\substack{\theta, \theta', A, A', B, B', \\ d, d', e, e', G}} \quad & -\langle b^{\min}, d \rangle + \langle b^{\max}, e \rangle - \langle C^{\min}, A \rangle + \langle C^{\max}, B \rangle + \lambda\|\theta\|^2 \\
& + \langle A - A', M_A \rangle + \frac{\rho}{2}\|A - A'\|_F^2 + \langle B - B', M_B \rangle + \frac{\rho}{2}\|B - B'\|_F^2 \\
& + \langle d - d', \mu_d \rangle + \frac{\rho}{2}\|d - d'\|^2 + \langle e - e', \mu_e \rangle + \frac{\rho}{2}\|e - e'\|^2 \\
& + \langle \theta - \theta', \mu_\theta \rangle + \frac{\rho}{2}\|\theta - \theta'\|^2 + \langle 2\theta - d + e, \eta \rangle + \frac{\rho}{2}\|2\theta - d + e\|^2 \\
& + \langle B - A - G, \Gamma \rangle + \frac{\rho}{2}\|B - A - G\|_F^2 \\
\text{s.t.} \quad & A', B', d', e' \geq 0, \quad G \succeq \theta'\theta'^T, 
\end{aligned} \tag{29}$$

At each iteration of the ADMM algorithm, the parameters of one block are fixed, and the optimization problem is solved with respect to the parameters of the other block. For simplicity of presentation, let $c_1^t = \rho\theta'^t - \mu_\theta^t - 2\eta^t$, $c_2^t = \rho d'^t - \mu_d^t - b^{\min} + \eta^t$, $c_3^t = \rho e'^t - \mu_e^t + b^{\max} - \eta^t$, $D_1^t = \rho A'^t - \rho G^t + \Gamma^t - M_A^t + C^{\min}$, and $D_2^t = \rho B'^t + \rho G^t - \Gamma^t - M_B^t - C^{\max}$.

Two of the sub-problems are non-trivial because they contain positive semi-definite constraints. The sub-problem with respect to $G$ can be written as:

$$\min_{G} \; \langle B^t - A^t - G, \Gamma^t \rangle + \frac{\rho}{2}\left\|B^t - A^t - G\right\|_F^2 \quad \text{s.t.} \quad G \succeq \theta^t\theta^{tT}, \tag{30}$$

By completing the square and changing the variable to $G' = G - \theta^t\theta^{tT}$, we equivalently need to solve the following problem:

$$\min_{G'} \; \frac{\rho}{2}\left\|G' - \left(B^t - A^t - \theta^t\theta^{tT} + \frac{\Gamma^t}{\rho}\right)\right\|_F^2 \quad \text{s.t.} \quad G' \succeq 0, \tag{31}$$

Thus, $G'^{\,*} = \left[B^t - A^t + \frac{\Gamma^t}{\rho} - \theta^t\theta^{tT}\right]_+$, where $[A]_+$ denotes the projection of $A$ onto the set of PSD matrices, which can be computed by setting the negative eigenvalues in the eigenvalue decomposition of $A$ to zero.
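A short numpy sketch of the $[A]_+$ operator (our own illustration) is:

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by zeroing its negative eigenvalues."""
    A = (A + A.T) / 2.0                    # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(A)   # eigendecomposition of the symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)  # drop the negative part of the spectrum
    return (eigvecs * eigvals) @ eigvecs.T
```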

The other non-trivial sub-problem in Algorithm 7 is the minimization with respect to $\theta'$ (Line 12). By completing the square, it can be equivalently formulated as:

$$\min_{\theta'} \; \left\|\theta' - \left(\theta^{t+1} + \frac{\mu_\theta^t}{\rho}\right)\right\|_2^2 \quad \text{s.t.} \quad G^{t+1} \succeq \theta'\theta'^T, \tag{32}$$

Let $G = U\Lambda U^T$ be the eigenvalue decomposition of the matrix $G$, where $\Lambda$ is a diagonal matrix containing the eigenvalues of $G$. Set $\alpha = \theta^{t+1} + \frac{\mu_\theta^t}{\rho}$. Since $U^T U = I$, we have:

$$\left\|U^T\theta' - U^T\alpha\right\|_2^2 = \left\|\theta' - \alpha\right\|_2^2$$

Set β=UTθ, then Problem (32) can be reformulated as:

$$\min_{\beta} \; \left\|\beta - U^T\alpha\right\|_2^2 \quad \text{s.t.} \quad \beta\beta^T \preceq \Lambda. \tag{33}$$

Note that the constraint of the above optimization problem is equivalent to the following:

$$\beta\beta^T \preceq \Lambda \;\Longleftrightarrow\; \begin{bmatrix} 1 & \beta^T \\ \beta & \Lambda \end{bmatrix} \succeq 0 \;\Longleftrightarrow\; \beta^T\Lambda^{-1}\beta \leq 1 \;\Longleftrightarrow\; \sum_{i=1}^{n} \frac{\beta_i^2}{\lambda_i} \leq 1,$$

where $\lambda_i = \Lambda_{ii}$. Since the block matrix is symmetric, by the Schur complement it is positive semi-definite if and only if $\Lambda$ is positive semi-definite and $1 - \beta^T\Lambda^{-1}\beta \geq 0$ (the third inequality above).

Setting $\gamma = U^T\alpha$, we can write Problem (33) as:

$$\min_{\beta} \; \|\beta - \gamma\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} \frac{\beta_i^2}{\lambda_i} \leq 1, \tag{34}$$

It can be easily shown that the optimal solution has the form $\beta_i^* = \frac{\gamma_i}{1 + \mu^*/\lambda_i}$, where $\mu^*$ is the optimal Lagrange multiplier corresponding to the constraint of Problem (34). The optimal Lagrange multiplier can be obtained by a bisection procedure similar to Algorithm 2. Having $\beta^*$, the optimal $\theta'$ can be computed by solving the linear system $U^T\theta' = \beta^*$.
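A sketch of this bisection step (our own illustration, assuming all eigenvalues $\lambda_i$ are strictly positive) is:

```python
import numpy as np

def project_ellipsoid(gamma, lam_eig, tol=1e-10):
    """Minimize ||beta - gamma||^2 subject to sum_i beta_i^2 / lam_eig[i] <= 1
    by bisection on the Lagrange multiplier mu (lam_eig assumed strictly positive)."""
    def beta(mu):
        return gamma / (1.0 + mu / lam_eig)       # stationarity: beta_i = gamma_i / (1 + mu/lambda_i)
    def constraint(mu):
        b = beta(mu)
        return np.sum(b**2 / lam_eig)
    if constraint(0.0) <= 1.0:                    # unconstrained minimizer is already feasible
        return beta(0.0)
    lo, hi = 0.0, 1.0
    while constraint(hi) > 1.0:                   # grow the bracket until feasible
        hi *= 2.0
    while hi - lo > tol:                          # bisection on mu
        mid = (lo + hi) / 2.0
        if constraint(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return beta(hi)
```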

E Quadratic RIFLE: Using Kernels to Go Beyond Linearity

A natural extension of RIFLE to non-linear models is to transform the original data via kernels and then apply RIFLE to the transformed data. To this end, we apply polynomial kernels to the original data, which adds polynomial transformations of the features and their interactions. A drawback of this approach is that if the original data contains $d$ features and the order of the polynomial kernel is $t$, the number of features in the transformed data is $\mathcal{O}(d^t)$, which drastically increases the runtime of prediction/imputation. Thus, we only consider $t = 2$, which yields a dataset containing the pairwise interactions of the original features. We call the RIFLE algorithm applied to the data transformed by the quadratic kernel Quadratic RIFLE (QRIFLE); a sketch of the corresponding feature map is shown below. Table 1 reports the performance of QRIFLE alongside RIFLE and other state-of-the-art approaches. Moreover, we applied QRIFLE to a regression task where the relationship between the predictors and the target variable is quadratic (Figure 8). We observe that QRIFLE works better than RIFLE when the percentage of missing values is not high.
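A minimal sketch of this degree-2 feature map (our own illustration, not the released QRIFLE code) is given below; note how missing entries propagate into the interaction terms:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_expand(X):
    """Degree-2 expansion: original features plus all squares and pairwise products.
    Missing entries (np.nan) propagate: x_i * x_j is missing whenever either factor
    is missing, which is why interaction terms have more missing values."""
    n, d = X.shape
    pairs = list(combinations_with_replacement(range(d), 2))
    X_quad = np.empty((n, d + len(pairs)))
    X_quad[:, :d] = X                          # keep the original features
    for k, (i, j) in enumerate(pairs):
        X_quad[:, d + k] = X[:, i] * X[:, j]   # nan * anything = nan
    return X_quad
```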

E.1 Performance of RIFLE and QRIFLE on Synthetic Non-linear Data

A natural question is how RIFLE performs when the underlying model is non-linear. To evaluate RIFLE and other methods, we generated jointly normal data similarly to the experiment in Figure 9. Here, we have 5000 data points, and the data dimension is $d = 5$. The target variable has the following quadratic relationship with the input features:

$$y = x_1^2 + 3x_3^2 - 6x_5^2 - 0.9\,x_1 x_4 + 9\,x_2 x_3 + 3.2\,x_4 x_5 - 1.7\,x_2 x_5 - 5\,x_1^2 x_3 + 7 x_4 + 4.6$$
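For concreteness, the following numpy sketch reproduces this data-generating process under stated assumptions (our own reconstruction: the covariance of the jointly normal features and the random seed are not specified above and are chosen here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 5
A = rng.standard_normal((d, d))
cov = A @ A.T + np.eye(d)                       # assumed covariance (not specified in the text)
X = rng.multivariate_normal(np.zeros(d), cov, size=n)
x1, x2, x3, x4, x5 = X.T
y = (x1**2 + 3*x3**2 - 6*x5**2 - 0.9*x1*x4 + 9*x2*x3
     + 3.2*x4*x5 - 1.7*x2*x5 - 5*x1**2*x3 + 7*x4 + 4.6)
```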

Figure 8: Performance of RIFLE, QRIFLE, MissForest, Amelia, KNN-Imputer, MICE, and Expectation Maximization versus the percentage of missing values on quadratic artificial datasets.

We evaluated the performance of KNN-Imputer (Troyanskaya et al., 2001), MICE (Buuren & Groothuis-Oudshoorn, 2010), Amelia (Honaker et al., 2011), MissForest (Stekhoven & Bühlmann, 2012), and Expectation Maximization (Dempster et al., 1977) alongside RIFLE and QRIFLE. QRIFLE is RIFLE applied to the original data transformed by a polynomial kernel of degree 2. Although QRIFLE can learn quadratic models, the proportion of missing values in the new features (interaction terms) is higher than in the original data, since an interaction term is observed only when both of its factors are observed: if, on average, 50% of the entries are missing in the original features, roughly 75% of the entries of the interaction terms are missing. Moreover, the computational complexity increases, since QRIFLE uses $\mathcal{O}(d^2)$ features instead of $d$. Figure 8 shows the performance of the aforementioned methods on the artificial data with 5000 samples and different percentages of missing values. We generated 5 artificial datasets for each missing-value percentage and ran each method 5 times on these datasets; Figure 8 reports the average performance of each method. For small percentages of missing values, QRIFLE performs better than the other approaches. However, as the percentage of missing values increases, the performance of QRIFLE drops, and RIFLE works much better than QRIFLE.

F Robust Quadratic Discriminant Analysis (Presence of Missing Values in the Target Feature)

In Section 4 we formalized robust quadratic discriminant analysis assuming the target variable is fully available. In this appendix, we study Problem (12) when the target variable contains missing values.

If the target variable contains missing values, the proposed algorithm for solving the optimization problem (13) cannot exploit the data points whose target value is unavailable. However, such points can contain valuable statistical information about the underlying data distribution. Thus, we apply an Expectation Maximization (EM) procedure to the dataset as follows:

Assume a dataset consisting of $n + m$ samples. Let $(X_1, y_1), \ldots, (X_n, y_n)$ be the $n$ samples whose target variable is available, and let $(X_{n+1}, z_1), \ldots, (X_{n+m}, z_m)$ be the $m$ samples whose labels are missing. Similar to the previous case, we assume:

$$X_i \mid z_i = j \;\sim\; \mathcal{N}(\mu_j, \Sigma_j), \qquad j = 0, 1.$$

Thus, the probability of observing a data point xi can be written as:

$$P(X_i = x_i) = \pi_0 P(X_i = x_i \mid z_i = 0) + \pi_1 P(X_i = x_i \mid z_i = 1) = \pi_0\, \mathcal{N}(x_i; \mu_0, \Sigma_0) + \pi_1\, \mathcal{N}(x_i; \mu_1, \Sigma_1)$$

The log-likelihood function can be formulated as follows:

$$\ell(\mu_0, \Sigma_0, \mu_1, \Sigma_1) = \sum_{i=1}^{n+m} \log\!\big(\pi_0\, \mathcal{N}(x_i; \mu_0, \Sigma_0) + \pi_1\, \mathcal{N}(x_i; \mu_1, \Sigma_1)\big)$$

We apply the Expectation Maximization procedure to jointly update $\Sigma_0, \Sigma_1, \mu_0, \mu_1$, and the $z_i$'s. Note that the posterior distribution of $z_i$ can be written as:

$$P(Z_i = t \mid X_i = x_i) = \frac{P(X_i = x_i \mid Z_i = t)\, P(Z_i = t)}{P(X_i = x_i)} = \frac{\pi_t\, \mathcal{N}(x_i; \mu_t, \Sigma_t)}{P(X_i = x_i)}$$

In the E-step, we update the $z_i$ values by comparing the posterior probabilities of the two possible labels. Precisely, we assign label 1 to $Z_i$ if and only if:

$$\pi_1\, \mathcal{N}(x_i; \mu_1, \Sigma_1) > \pi_0\, \mathcal{N}(x_i; \mu_0, \Sigma_0)$$

In the M-step, we estimate $\Sigma_0, \Sigma_1, \mu_0, \mu_1, \pi_0$, and $\pi_1$ with the $z_i$ values fixed. Since in the M-step all labels are assigned (both the available $y_i$'s and the $z_i$'s estimated in the E-step), these parameters can be updated as follows, where $\mathcal{S}_c$ denotes the set of samples currently assigned label $c$ and $\mathcal{T}_j$ the set of samples for which feature $j$ is available:

$$\mu_1[j] = \frac{1}{|\mathcal{S}_1 \cap \mathcal{T}_j|} \sum_{i \in \mathcal{S}_1 \cap \mathcal{T}_j} x_i[j] \tag{35}$$
$$\mu_0[j] = \frac{1}{|\mathcal{S}_0 \cap \mathcal{T}_j|} \sum_{i \in \mathcal{S}_0 \cap \mathcal{T}_j} x_i[j] \tag{36}$$
$$\Sigma_1[i][j] = \frac{1}{|\mathcal{S}_1 \cap \mathcal{T}_i \cap \mathcal{T}_j|} \sum_{t \in \mathcal{S}_1 \cap \mathcal{T}_i \cap \mathcal{T}_j} x_t[i]\, x_t[j] \tag{37}$$
$$\Sigma_0[i][j] = \frac{1}{|\mathcal{S}_0 \cap \mathcal{T}_i \cap \mathcal{T}_j|} \sum_{t \in \mathcal{S}_0 \cap \mathcal{T}_i \cap \mathcal{T}_j} x_t[i]\, x_t[j] \tag{38}$$
$$\pi_1 = \frac{|\mathcal{S}_1|}{|\mathcal{S}_0| + |\mathcal{S}_1|} \tag{39}$$
$$\pi_0 = \frac{|\mathcal{S}_0|}{|\mathcal{S}_0| + |\mathcal{S}_1|} \tag{40}$$

We apply the M-step and E-step iteratively to obtain $\Sigma_1$ and $\Sigma_0$. Depending on the random initialization of the $z_i$'s, we can obtain different values for $\mu_0$, $\mu_1$, $\Sigma_0$, and $\Sigma_1$. Having these estimates, we apply Algorithm 3 to solve the robust normal discriminant analysis problem formulated in (13).

Algorithm 8 Expectation Maximization Procedure for Learning a Robust Normal Discriminant Analysis
1: Input: $T$: number of EM iterations; $k$: number of covariance estimations at each iteration.
2: Initialize: Set each missing label randomly to 0 or 1.
3: for $i = 1, \ldots, T$ do
4:   Estimate $k$ covariance matrices by sampling with replacement from the available entries.
5:   Find an optimal $w$ for Problem (18).
6:   Update the missing labels using the new $w$ obtained above.
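The following sketch illustrates the E/M iterations described above in the simplified case where all input features are observed (our own illustration; the full procedure additionally handles missing features through the sets $\mathcal{S}$ and $\mathcal{T}$, and Algorithm 8 further draws $k$ covariance estimates per iteration, which we omit here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_hard_labels(X, y, n_iter=20):
    """Hard-assignment EM: y holds 0/1 labels with np.nan for missing labels;
    missing labels are filled by comparing class posteriors pi_c * N(x; mu_c, Sigma_c)."""
    z = y.copy()
    rng = np.random.default_rng(0)
    z[np.isnan(z)] = rng.integers(0, 2, size=np.isnan(z).sum())   # random init of missing labels
    for _ in range(n_iter):
        # M-step: refit class means, covariances and priors using all (assigned) labels
        params = {}
        for c in (0, 1):
            Xc = X[z == c]
            params[c] = (Xc.mean(axis=0),
                         np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1]),  # small ridge for stability
                         len(Xc) / len(X))
        # E-step: reassign only the originally missing labels by comparing posteriors
        miss = np.isnan(y)
        scores = np.stack([params[c][2] * multivariate_normal.pdf(X[miss], params[c][0], params[c][1])
                           for c in (0, 1)])
        z[miss] = scores.argmax(axis=0)
    return z, params
```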

G Generating Missing Values Patterns in Numerical Experiments

In this appendix, we define the MCAR and MNAR patterns and discuss how to generate them in a given dataset. Formally, the distribution of missing values in a dataset follows a missing completely at random (MCAR) pattern if the probability of having a missing value for a given entry is constant, independent of the other available and missing entries. A dataset follows a missing at random (MAR) pattern if the missingness of each entry depends only on the available data of the other features. Finally, if the distribution of missing values follows neither an MCAR nor a MAR pattern, we call it missing not at random (MNAR).

To generate an MCAR pattern on a given dataset, we fix a constant probability $0 < p < 1$ and make each data entry unavailable with probability $p$. The generation of the MNAR pattern, on the other hand, is based on the idea that the farther the value of an entry is from the mean of its corresponding feature, the larger the probability that the entry is missing. Algorithm 9 describes the procedure for generating MNAR missing values for a given column of a dataset:

Algorithm 9 Generating MNAR Pattern for a Given Column of a Dataset
1: Input: $x_1, x_2, \ldots, x_n$: the entries of the current column of the dataset; $a, b$: hyper-parameters controlling the percentage of missing values
2: Initialize: Set $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2$.
3: for $i = 1, \ldots, n$ do
4:   $x_i = \frac{x_i - \mu}{\sigma}$
5:   $p_i = F(a x_i + b)$
6:   Set $x_i = *$ with probability $p_i$

Note that $F$ in the above algorithm is the cumulative distribution function of a standard Gaussian random variable, and $a$ and $b$ control the percentage of missing values in the given column: as $a$ and $b$ increase, more missing values are generated. Since the availability of each data entry depends on its value, the generated missing pattern is missing not at random (MNAR).
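A direct transcription of Algorithm 9 into numpy might look as follows (a sketch; the function name and the default values of $a$ and $b$ are our own choices):

```python
import numpy as np
from scipy.stats import norm

def make_mnar(x, a=1.0, b=-1.0, rng=None):
    """Mask entries of a single column with probability p_i = F(a*z_i + b),
    where z_i is the standardized value of x_i (Algorithm 9)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()               # standardize the column
    p = norm.cdf(a * z + b)                    # missingness probability via the Gaussian CDF
    out = x.copy()
    out[rng.random(len(x)) < p] = np.nan       # mask entry i with probability p_i
    return out

# For comparison, an MCAR mask with a fixed probability p would simply be:
#   out[rng.random(len(x)) < p] = np.nan
```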

H Proof of Lemmas and Theorems

In this appendix, we prove all lemmas and theorems presented in the article. First, we prove the following lemma that is useful in several convergence proofs:

Lemma 11. Let $\theta^*(b, C) = \arg\min_{\theta} \; \theta^T C\theta - 2b^T\theta + \lambda\|\theta\|^2$. Assume that for any given $b$ and $C$, $\|\theta^*(b, C)\| \leq \tau$. Then the gradient of the function $g(b, C) = \min_{\theta} \; \theta^T C\theta - 2b^T\theta + \lambda\|\theta\|^2$ used in Problem (7) is Lipschitz continuous with constant $L = \frac{2(\tau + 1)^2}{\lambda}$.

Proof. Since the problem is convex in $\theta$ and linear (hence concave) in $C$ and $b$, we have:

$$\min_{\theta} \max_{C, b} \; \theta^T C\theta - 2b^T\theta + \lambda\|\theta\|^2 \;=\; -\min_{C, b} \max_{\theta} \; -\theta^T C\theta + 2b^T\theta - \lambda\|\theta\|^2.$$

Define $h(\theta, C, b) \triangleq -\theta^T C\theta + 2b^T\theta - \lambda\|\theta\|^2$, and define $L_{11}$ and $L_{12}$ as follows:

$$\big\|\nabla_{b,C}\, h(\theta, b_1, C_1) - \nabla_{b,C}\, h(\theta, b_2, C_2)\big\| \leq L_{11}\, \big\|(C_1, b_1) - (C_2, b_2)\big\|,$$
$$\big\|\nabla_{\theta}\, h(\theta, b_1, C_1) - \nabla_{\theta}\, h(\theta, b_2, C_2)\big\| \leq L_{12}\, \big\|(C_1, b_1) - (C_2, b_2)\big\|.$$

$h(\theta, b, C)$ is linear (hence convex) in $C$ and $b$ and strongly concave with respect to $\theta$. According to Lemma 1 in Barazandeh & Razaviyayn (2020), $-g(b, C) = \max_{\theta} h(\theta, b, C)$ has a Lipschitz continuous gradient with constant

$$L_g = L_{-g} = L_{11} + \frac{L_{12}^2}{\sigma},$$

where $\sigma = 2\lambda$ is the strong-concavity modulus of $-\theta^T C\theta + 2b^T\theta - \lambda\|\theta\|^2$ with respect to $\theta$. Note that

$$\nabla_{b,C}\, h(\theta, b, C) = \big(2\theta, \, -\theta\theta^T\big), \qquad \nabla_{b,C}\, h(\theta, b_1, C_1) - \nabla_{b,C}\, h(\theta, b_2, C_2) = 0.$$

Thus, $L_{11} = 0$. On the other hand,

$$\begin{aligned}
\nabla_\theta h(\theta, b, C) &= -2C\theta + 2b - 2\lambda\theta, \\
\big\|\nabla_\theta h(\theta, b_1, C_1) - \nabla_\theta h(\theta, b_2, C_2)\big\|_2 &= \big\|-2(C_1 - C_2)\theta + 2(b_1 - b_2)\big\|_2 \\
&\leq 2\|C_1 - C_2\|_2\, \|\theta\|_2 + 2\|b_1 - b_2\|_2 \\
&\leq 2\|(C_1, b_1) - (C_2, b_2)\|_2\, \|\theta\|_2 + 2\|(C_1, b_1) - (C_2, b_2)\|_2 \\
&= \big(2\|\theta\|_2 + 2\big)\, \|(C_1, b_1) - (C_2, b_2)\|_2.
\end{aligned}$$

Therefore, $L_{12} = 2\max\|\theta\|_2 + 2$, which means $L_g = \frac{2(\max\|\theta\|_2 + 1)^2}{\lambda}$. Note that $\theta$ is computed exactly in Algorithm 1 and Algorithm 5 at each iteration. Thus, during the optimization procedure, the norm of $\theta$ is bounded by the maximum norm of $\theta^*(b, C)$ over the uncertainty sets:

$$\max \|\theta\|_2 \;\leq\; \max_{b, C} \|\theta^*(b, C)\| \;\leq\; \tau.$$

As a result, $L_g = \frac{2(\tau + 1)^2}{\lambda}$.

Proof of Theorem 1: Since the set of feasible solutions for $b$ and $C$ is compact and $g$ is concave with respect to $b$ and $C$, the projected gradient ascent algorithm converges to the global maximizer of $g$ in $T = \mathcal{O}\!\left(\frac{LD}{\epsilon}\right)$ iterations (Bubeck, 2014, Theorem 3.3), where $D = \|C_0 - C^*\|_F^2 + \|b_0 - b^*\|_2^2$ and $L$ is the Lipschitz constant of the gradient of $g$, which equals $\frac{2(\tau+1)^2}{\lambda}$ by Lemma 11.

Proof of Theorem 7: Algorithm 5 applies the projected Nesterov acceleration method to the concave function $g$. As proved in Nesterov (1983), the convergence rate of this method matches the lower bound of first-order oracles for general convex minimization (concave maximization) problems, which yields an iteration complexity of $\mathcal{O}\!\left(\sqrt{\frac{LD}{\epsilon}}\right)$. The Lipschitz constant $L$ appearing in this bound is computed in Lemma 11.

Proof of Theorem 8: First, note that if we multiply the objective function by −1, Problem (6) can be equivalently formulated as:

$$\max_{\theta} \min_{C, b} \; -\theta^T C\theta + 2b^T\theta - \lambda\|\theta\|_2^2 \quad \text{s.t.} \quad -C + C^{\min} \leq 0, \;\; C - C^{\max} \leq 0, \;\; -b + b^{\min} \leq 0, \;\; b - b^{\max} \leq 0, \;\; C \succeq 0 \tag{41}$$

If we assign A,B,d,e,H to the constraints respectively, then the Lagrangian function can be written as:

$$\mathcal{L}(C, b, A, B, d, e, H) = -\theta^T C\theta + 2b^T\theta + \langle A, -C + C^{\min}\rangle + \langle B, C - C^{\max}\rangle + \langle d, -b + b^{\min}\rangle + \langle e, b - b^{\max}\rangle - \langle C, H\rangle - \lambda\|\theta\|_2^2, \tag{42}$$

The dual problem is defined as:

$$\max_{A, B, d, e, H} \; \min_{C, b} \; \mathcal{L}(C, b, A, B, d, e, H) \tag{43}$$

The minimization of L takes the following form:

$$\min_{C, b} \; \langle C, \, -\theta\theta^T - A + B - H\rangle + \langle b, \, 2\theta - d + e\rangle - \lambda\|\theta\|_2^2 - \langle B, C^{\max}\rangle + \langle A, C^{\min}\rangle - \langle e, b^{\max}\rangle + \langle d, b^{\min}\rangle, \tag{44}$$

To avoid a value of $-\infty$ for the above minimization problem, the coefficients $-\theta\theta^T - A + B - H$ and $2\theta - d + e$ must be set to zero. Thus, the dual problem of (41) is formulated as:

$$\begin{aligned}
\max_{A, B, d, e, H} \quad & (b^{\min})^T d - (b^{\max})^T e + \langle C^{\min}, A\rangle - \langle C^{\max}, B\rangle - \lambda\|\theta\|_2^2 \\
\text{s.t.} \quad & -\theta\theta^T - A + B - H = 0, \quad 2\theta - d + e = 0, \quad A, B, d, e \geq 0, \quad H \succeq 0
\end{aligned} \tag{45}$$

Since the duality gap is zero, Problem (6) can be equivalently formulated as:

$$\begin{aligned}
\max_{\theta, A, B, d, e, H} \quad & (b^{\min})^T d - (b^{\max})^T e + \langle C^{\min}, A\rangle - \langle C^{\max}, B\rangle - \lambda\|\theta\|^2 \\
\text{s.t.} \quad & -\theta\theta^T - A + B - H = 0, \quad 2\theta - d + e = 0, \quad A, B, d, e \geq 0, \quad H \succeq 0.
\end{aligned} \tag{46}$$

We can multiply the objective function by −1 and change the maximization to minimization, which gives the dual problem described in (27).

Proof of Theorem 2:

(a) Let $\Delta^n$ be the estimated confidence-interval matrix obtained from $n$ samples. The first part of the theorem holds if $\Delta^n$ converges to 0 as $n$, the number of samples, goes to infinity (the same argument works for $b$ and $\delta$). Assume that $\{(x_{i1}, x_{i2})\}_{i=1}^n$ is an i.i.d. bootstrap sample over the data points for which both features $X_1$ and $X_2$ are available. Since the distribution of missing values is completely at random (MCAR), we have $\mathbb{E}[x_{i1} x_{i2}] = \mathbb{E}[X_1 X_2]$. Therefore, $\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n x_{i1} x_{i2}\right] = \mathbb{E}[X_1 X_2]$. Moreover, since the samples are drawn independently, $\mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^n x_{i1} x_{i2}\right] = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}[x_{i1} x_{i2}] = \frac{n}{n^2}\mathrm{Var}[X_1 X_2] = \frac{1}{n}\mathrm{Var}[X_1 X_2]$. Since the variance of the product of any two features is bounded, by the weak law of large numbers:

$$\lim_{n \to \infty} \Pr\!\left(\left|\frac{1}{n}\sum_{i=1}^n x_{i1} x_{i2} - \mathbb{E}[X_1 X_2]\right| \geq \epsilon\right) = 0$$

Therefore, for any given bootstrap sample of features $X_1$ and $X_2$, the estimate converges in probability to the ground-truth value. This means the size of the confidence interval $\Delta_{12}$ converges in probability to 0, and the estimator of $\mathbb{E}[X_1 X_2]$ is consistent by the definition of consistency. The same argument proves the consistency of the estimator for any pair of features $X_i$ and $X_j$.

(b) Fix two features $i$ and $j$. Let $(\hat{X}_i^1, \hat{X}_j^1), \ldots, (\hat{X}_i^m, \hat{X}_j^m)$ be $m = n(1-p)$ i.i.d. pairs sampled via bootstrap from the rows where both features $i$ and $j$ are available. Define $Z_t = \hat{X}_i^t \hat{X}_j^t$ (for simplicity, we do not make the dependence of $Z_t$ on $i$ and $j$ explicit in the notation), so that $C_0[i][j] = \frac{1}{m}\sum_{t=1}^m Z_t$. Note that $\mathbb{E}[Z_t] = \mathbb{E}[\hat{X}_i^t \hat{X}_j^t] = C[i][j]$. According to Chebyshev's inequality, we have:

$$\Pr\!\left[\left|\frac{1}{m}\sum_{t=1}^m \big(Z_t - \mathbb{E}[Z_t]\big)\right| \geq c\,\Delta_{ij}\right] \leq \frac{\mathrm{Var}\!\left(\frac{1}{m}\sum_{t=1}^m Z_t\right)}{c^2 \Delta_{ij}^2}$$

Note that the $Z_t$'s are i.i.d. samples; thus:

$$\mathrm{Var}\!\left(\frac{1}{m}\sum_{t=1}^m Z_t\right) = \frac{1}{m}\mathrm{Var}(Z_t) \leq \frac{1}{m}\max_{i,j}\mathrm{Var}\big(\hat{X}_i \hat{X}_j\big) = \frac{V}{m} = \frac{V}{n(1-p)}$$

Let $\Delta = \min_{i,j}\{\Delta_{ij}\}$. Then, based on the two inequalities above, we have:

$$\Pr\!\left[\big|C_0[i][j] - C[i][j]\big| \geq c\,\Delta\right] \leq \frac{V}{c^2 \Delta^2\, n(1-p)}$$

Using a union bound argument, with probability at least $1 - \frac{V d^2}{2 c^2 \Delta^2\, n(1-p)}$, we have $C_0 - c\Delta \leq C \leq C_0 + c\Delta$, which means the actual covariance matrix lies within the confidence intervals we have considered. In that case, for $(\tilde{\theta}, \tilde{b}, \tilde{C})$, we have:

$$\tilde{\theta}^T \tilde{C}\tilde{\theta} - 2\tilde{b}^T\tilde{\theta} \;=\; \max_{C, b} \; \tilde{\theta}^T C\tilde{\theta} - 2b^T\tilde{\theta} \;\geq\; \tilde{\theta}^T \bar{C}\tilde{\theta} - 2\bar{b}^T\tilde{\theta},$$

where the maximum is taken over the confidence region and $(\bar{C}, \bar{b})$ denotes the ground-truth moments, which lie within that region with the stated probability. This completes the proof.

Proof of Theorem 3: Since the objective function is convex with respect to $\mu_1$ and the constraint set on $\mu_1$ is closed and bounded (compact), an optimal solution exists on the boundary of the feasible set (note that the problem is a convex maximization). Therefore, each entry of $\mu_1$ should take either $\mu^{\min}[i]$ or $\mu^{\max}[i]$, which gives the solution provided in the theorem.

I Dataset Descriptions

In this section, we introduce the datasets used in Section 5 to evaluate the performance of RIFLE on regression and classification tasks. Except for NHANES, all other datasets contain no missing values; for those datasets, we generate MCAR and MNAR missing values artificially (for MNAR patterns, we apply Algorithm 9).

Datasets for Evaluating RIFLE on Regression and Imputation Tasks

  • NHANES: The percentage of missing values varies for different features of the NHANES dataset. There are two sources of missing values in NHANES data: Missing entries during data collection and missing entries resulting from merging different datasets in the NHANES collection. On average, approximately 20% of data is missing.

  • Super Conductivity1: The Super Conductivity dataset contains 21,263 samples describing superconductors, each with 81 continuous attributes. The assigned task is to predict the critical temperature from these 81 features. We use this dataset in the experiments summarized in Figure 10, Figure 11, and Table 2.

  • BlogFeedback2: The BlogFeedback dataset is a collection of 280 features extracted from HTML documents of blog posts. The assigned task is to predict the number of comments in the upcoming 24 hours based on the features of more than 60K training data points. The test dataset is fixed and originally separated from the training data. The dataset is used in the experiments described in Table 2.

  • Breast Cancer(Prognostic)3: The dataset consists of 34 features and 198 instances. Each record represents follow-up data for one breast cancer case collected in 1984. We have done several experiments to impute the MCAR missing values generated artificially with different proportions. The results are depicted in Table 1 and Figure 4.

  • Parkinson4: The dataset describes a range of biomedical voice recordings from 31 people, 23 of whom have Parkinson's disease (PD). The assigned task is to discriminate healthy people from those with PD. There are 193 records and 23 features in the dataset. The dataset is processed similarly to the Breast Cancer dataset and used in the same experiments.

  • Spam Base5: The dataset consists of 4601 instances and 57 attributes. The assigned classification task is to predict whether an email is spam. To evaluate different imputation methods, we randomly mask a proportion of the data entries and impute them with the different approaches. The results are depicted in Table 1 and Figure 4.

  • Boston Housing6: Boston Housing dataset contains 506 instances and 14 columns. We generate random missing entries with different proportions and impute them with RIFLE and several state-of-the-art approaches. The results are demonstrated in Table 1 and Figure 4.

  • Cloud7: The dataset has 1024 instances and 10 features extracted from cloud images. We use this dataset in the experiments depicted in Table 1, with 70% artificial MCAR missing values.

  • Wave Energy Converters8: We sample a subset of 3000 instances with 49 features from the original Wave Energy Converter dataset. We have executed several imputation methods on the dataset, and the results are shown in Figure 4.

  • Sensorless Drive Diagnosis9: The 49 continuous features in this dataset are extracted from electric current drive signals, and the associated classification task is to determine the condition of the device's motor. We choose different random samples of size 400 to run the imputation experiments in Figure 5.

Datasets for Evaluating Robust QDA on Classification Tasks

  • Avila10: The Avila dataset consists of 10 attributes extracted from 800 images of the "Avila Bible". The associated classification task is to match each pattern (an instance of the dataset) to a copyist. We apply 40% MCAR missing values (to both the input features and the target variable) to 10 different random samples of the dataset of size 1000. The average accuracy of the robust LDA method on the 10 samples is shown in Figure 7 for each value of k (the number of covariance estimations).

  • Magic Gamma Telescope11: The dataset consists of 11 continuous MC-generated features used to predict the type of event (signal or background). We used the same procedure as for the above dataset for the results depicted in Figure 7 (randomly sampling subsets of 1000 data points out of more than 19,000).

  • Glass Identification12: This dataset is composed of 10 continuous features and 214 instances. The assigned classification task is to predict the type of glass based on the materials used to make it. We apply 40% MCAR missing values to the dataset for the experiments reported in Table 4.

  • Annealing13: This dataset is a mix of categorical and numerical features (37 in total), and the associated task is to predict the class (out of 5 classes) of each instance (metals). The number of instances in this dataset is 798. We use 500 data points as the training data and the rest as the test data. We apply 40% MCAR missing values to both the input features and the target variable. The accuracy of different models is reported in Table 4.

  • Abalone14: This dataset consists of 4177 instances and 8 categorical and continuous features. The goal is to predict the age of an abalone based on physical measurements. The first 1000 samples are used as the training data and the rest as the test data. We applied the same pre-processing procedure as for the above dataset to generate missing values in the training data. The accuracy of different models is reported in Table 4.

  • Lymphography15: Lymphography is a categorical dataset containing 18 features and 148 data points obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. 100 data points are used as the training data; the rest are test data points (with no missing values). We applied the same pre-processing described for the above dataset to generate MCAR missing values.

  • Adult 16: The Adult dataset contains census information about individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50K annually. The train and test sets are two separate files consisting of 32,000 and 16,000 samples, respectively. We consider gender and race as the sensitive attributes (for the experiments involving one sensitive attribute, we chose gender). Learning a logistic regression model on the training dataset (without imposing fairness) shows that only 3 out of 14 features have larger weights than the gender attribute.

J Further Discussion on the Consistency of RIFLE

The three algorithms developed in Section 3 for solving robust ridge regression are all consistent. To show this, we generated a synthetic dataset with 50 input features following a jointly normal distribution. As observed in Figure 9, as the number of samples increases, the NRMSE of all three algorithms converges to 0.01, which is the standard deviation of the zero-mean Gaussian noise added to each target value (the dashed line). The same pattern is observed for different percentages of missing values.

Figure 9: Consistency of ADMM (Algorithm 7) and projected gradient ascent on the function g (Algorithm 1) on synthetic datasets with 40%, 60%, and 80% missing values.

K Numerical Experiments for Convergence of RIFLE Algorithms

We presented three algorithms for solving the robust linear regression problem formulated in (6): projected gradient ascent (Algorithm 1), the Nesterov acceleration method (Algorithm 5), and the Alternating Direction Method of Multipliers (ADMM) applied to the dual problem (Algorithm 7).

Figure 10: Convergence of the ADMM algorithm to the optimal solution of Problem (27) for different values of ρ. The left plot shows the objective value of Problem (27) per iteration (without considering the constraints), while the right plot shows the constraint violation per iteration. The constraint violation is measured by adding up the penalty terms of the augmented Lagrangian function formulated in Problem (29).

We established the convergence rates of the gradient ascent and Nesterov acceleration methods in Theorem 1 and Theorem 7, respectively. To investigate the convergence of the ADMM algorithm and its dependence on ρ, we run Algorithm 7 on the Super Conductivity dataset (described in Appendix I) with 30% MCAR missing values. Figure 10 demonstrates the convergence of the ADMM algorithm for multiple values of ρ on this dataset. As can be observed, decreasing the value of ρ accelerates the convergence of ADMM to the optimal value. Note that for ρ = 0.2, the objective value is smaller than the final value in the first few iterations; the reason is that in those iterations the solution is not yet feasible (as observed in the right plot). The final solution is the optimal feasible solution.

In the next experiment, we compare the three proposed algorithms in terms of the number of iterations required to reach a certain level of test accuracy on the Super Conductivity dataset. The number of training samples is 1000, with 40% MCAR missing values in both the input features and the target variable. The test dataset contains 2000 samples. As depicted in Figure 11, ADMM and Nesterov's method require fewer iterations than Algorithm 1 to reach an ϵ-optimal solution. However, the per-iteration cost of the ADMM algorithm (Algorithm 7) is higher than that of the Nesterov acceleration method and Algorithm 1.

Figure 11: Performance of the Nesterov acceleration method, projected gradient ascent, and ADMM on the Super Conductivity dataset versus the number of iterations.

L Execution Time Comparison of RIFLE and Other State-of-the-art Approaches

This section reports the average execution time of RIFLE and the other approaches presented in Table 2.

Table 5: Execution time of RIFLE and other state-of-the-art methods on three datasets.

| Method | Super Conductivity | Blog Feedback | NHANES |
| --- | --- | --- | --- |
| Regression on Complete Data | 0.3 sec | 0.7 sec | 0.4 sec |
| RIFLE | 87 sec | 471 sec | 125 sec |
| Mean Imputer + Regression | 0.4 sec | 0.9 sec | 0.5 sec |
| MICE + Regression | 112 sec | 573 sec | 294 sec |
| EM + Regression | 171 sec | 612 sec | 351 sec |
| MIDA Imputer + Regression | 245 sec | 726 sec | 599 sec |
| MissForest | 94 sec | 321 sec | 132 sec |
| KNN Imputer | 66 sec | 292 sec | 144 sec |

Footnotes

References

  1. Abonazel Mohamed Reda and Ibrahim Mohamed Gamal. On estimation methods for binary logistic regression model with missing values. International Journal of Mathematics and Computational Science, 4(3):79–85, 2018.
  2. Assländer Jakob, Cloos Martijn A, Knoll Florian, Sodickson Daniel K, Hennig Jürgen, and Lattanzi Riccardo. Low rank alternating direction method of multipliers reconstruction for MR fingerprinting. Magnetic Resonance in Medicine, 79(1):83–96, 2018.
  3. Barazandeh Babak and Razaviyayn Meisam. Solving non-convex non-differentiable min-max games using proximal gradient method. arXiv preprint arXiv:2003.08093, 2020.
  4. Beaulieu-Jones Brett K, Lavage Daniel R, Snyder John W, Moore Jason H, Pendergrass Sarah A, and Bauer Christopher R. Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Medical Informatics, 6(1):e11, 2018.
  5. Bertsimas Dimitris, Pawlowski Colin, and Zhuo Ying Daisy. From predictive methods to missing data imputation: an optimization approach. The Journal of Machine Learning Research, 18(1):7133–7171, 2017.
  6. Bubeck Sébastien. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980, 2014.
  7. van Buuren S and Groothuis-Oudshoorn Karin. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, pp. 1–68, 2010.
  8. Cai Zhipeng, Heydari Maysam, and Lin Guohui. Iterated local least squares microarray missing value imputation. Journal of Bioinformatics and Computational Biology, 4(05):935–957, 2006.
  9. Danskin John M. The theory of max-min and its application to weapons allocation problems, volume 5. Springer Science & Business Media, 2012.
  10. Dempster Arthur P, Laird Nan M, and Rubin Donald B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  11. Dua Dheeru and Graff Casey. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  12. Fung Karen Yuen and Wrobel Barbara A. The treatment of missing values in logistic regression. Biometrical Journal, 31(1):35–47, 1989.
  13. Gabay Daniel and Mercier Bertrand. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
  14. Gao Rui and Kleywegt Anton J. Distributionally robust stochastic optimization with dependence structure. arXiv preprint arXiv:1701.04200, 2017.
  15. Ghahramani Zoubin and Jordan Michael I. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, pp. 120–127, 1994.
  16. Gondara Lovedeep and Wang Ke. MIDA: Multiple imputation using denoising autoencoders. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 260–272. Springer, 2018.
  17. He Bingsheng and Yuan Xiaoming. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.
  18. Honaker James, King Gary, and Blackwell Matthew. Amelia II: A program for missing data. Journal of Statistical Software, 45(7):1–47, 2011.
  19. Hong Mingyi, Luo Zhi-Quan, and Razaviyayn Meisam. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
  20. Kim Hyunsoo, Golub Gene H, and Park Haesun. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21(2):187–198, 2005.
  21. Little Roderick JA and Rubin Donald B. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.
  22. Nesterov Yurii E. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.
  23. Nouiehed Maher, Sanjabi Maziar, Huang Tianjian, Lee Jason D, and Razaviyayn Meisam. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32:14934–14942, 2019.
  24. Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, et al. Missing data and multiple imputation in clinical epidemiological research. Clinical Epidemiology, 9:157–166, 2017.
  25. Raghunathan Trivellore E, Lepkowski James M, Van Hoewyk John, Solenberger Peter, et al. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1):85–96, 2001.
  26. Schafer Joseph L. Analysis of incomplete multivariate data. Chapman and Hall/CRC, 1997.
  27. Shivaswamy Pannagadatta K, Bhattacharyya Chiranjib, and Smola Alexander J. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7(Jul):1283–1314, 2006.
  28. Sion Maurice et al. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
  29. Stekhoven Daniel J and Bühlmann Peter. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2012.
  30. Sterne Jonathan AC, White Ian R, Carlin John B, Spratt Michael, Royston Patrick, Kenward Michael G, Wood Angela M, and Carpenter James R. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ, 338:b2393, 2009.
  31. Troyanskaya Olga, Cantor Michael, Sherlock Gavin, Brown Pat, Hastie Trevor, Tibshirani Robert, Botstein David, and Altman Russ B. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.
  32. Xia Jing, Zhang Shengyu, Cai Guolong, Li Li, Pan Qing, Yan Jing, and Ning Gangmin. Adjusted weight voting algorithm for random forests in handling missing values. Pattern Recognition, 69:52–60, 2017.
  33. Xu Huan, Caramanis Constantine, and Mannor Shie. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(7), 2009.
  34. Yoon Jinsung, Jordon James, and Van Der Schaar Mihaela. GAIN: Missing data imputation using generative adversarial nets. arXiv preprint arXiv:1806.02920, 2018.
  35. Zhang Tianyun, Ye Shaokai, Zhang Kaiqi, Tang Jian, Wen Wujie, Fardad Makan, and Wang Yanzhi. A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199, 2018.
  36. Zhang Xiaobai, Song Xiaofeng, Wang Huinan, and Zhang Huanping. Sequential local least squares imputation estimating missing value of microarray data. Computers in Biology and Medicine, 38(10):1112–1120, 2008.
