Skip to main content
BMC Medical Research Methodology logoLink to BMC Medical Research Methodology
. 2025 Feb 8;25:34. doi: 10.1186/s12874-025-02487-4

A generative model for evaluating missing data methods in large epidemiological cohorts

Lav Radosavljević 1, Stephen M Smith 2, Thomas E Nichols 1,
PMCID: PMC11806830  PMID: 39923001

Abstract

Background

The potential value of large scale datasets is constrained by the ubiquitous problem of missing data, arising in either a structured or unstructured fashion. When imputation methods are proposed for large scale data, one limitation is the simplicity of existing evaluation methods. Specifically, most evaluations create synthetic data with only a simple, unstructured missing data mechanism which does not resemble the missing data patterns found in real data. For example, in the UK Biobank missing data tends to appear in blocks, because non-participation in one of the sub-studies leads to missingness for all sub-study variables.

Methods

We propose a tool for generating mixed type missing data mimicking key properties of a given real large scale epidemiological data set with both structured and unstructured missingness while accounting for informative missingness. The process involves identifying sub-studies using hierarchical clustering of missingness patterns and modelling the dependence of inter-variable correlation and co-missingness patterns.

Results

On the UK Biobank brain imaging cohort, we identify several large blocks of missing data. We demonstrate the use of our tool for evaluating several imputation methods, showing modest accuracy of imputation overall, with iterative imputation having the best performance. We compare our evaluations based on synthetic data to an exemplar study which includes variable selection on a single real imputed dataset, finding only small differences between the imputation methods though with iterative imputation leading to the most informative selection of variables.

Conclusions

We have created a framework for simulating large scale data with that captures the complexities of the inter-variable dependence as well as structured and unstructured informative missingness. Evaluations using this framework highlight the immense challenge of data imputation in this setting and the need for improved missing data methods.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-025-02487-4.

Keywords: Structured missingness, Imputation, Missing data, UK Biobank, Neuroimaging, Multivariate modelling

Background

Missing data is common in epidemiological and health data research and presents formidable challenges for many analytical approaches. The causes of missing data vary, from being inherent to the study design, to elective non-participation, or simply faults in measurement. Much work has therefore been devoted to evaluating the performance of methods for handling missing data. The most common approaches to comparing imputation methods include simulating data and inducing missingness using a priori chosen mechanisms [1, 2]. Alternatively, artificial missingness is induced on real, complete data [35] or real missingness patterns are imposed on simulated data [6]. Simulation studies of this kind usually rely on quite restrictive assumptions that might not be reflective of large scale epidemiological cohorts such as UK Biobank (UKB). For example, while some studies induce missingness in a manner which lacks any clear structure [1, 4], in UKB missing data caused by non-participation in a sub-study/questionnaire comes in “blocks”. Specifically, if a subset of participants do not participate in an extension of the core study, then all of these subjects will have missing entries for the variables of this extension, and (when the rows and columns of the subjects-by-variables matrix are suitably reordered) this will form a solid block of missing data. In Mitra et. al (2023) [7], this type of “blocky” missingness is categorised as structured missingness (SM). Expanding on this way of thinking, we refer to missing data lacking structure as unstructured missingness (UM). Since UKB and similar datasets consist of many different sub-studies and questionnaires, this is a crucial feature to consider when evaluating the performance of imputation methods.

While there has been work done on the evaluation of existing methods on data with structured missingness and the development of new methods for handling such data, it has been common to use a non data-driven method for inducing structured missingness [2, 8, 9] or use real missingness patterns imposed on simulated data [6]. Assuming that SM is Missing Completely at Random (MCAR) is especially problematic in the case where it is caused by non-participation, since we know that participants often are disproportionately healthy [10], meaning that the data is not likely to be MCAR. Our aim is therefore to define a tool for generating synthetic data which has the same properties as a given data set. We want the pattern to satisfy the following three criteria:

  1. There is structured missingness, blocks of missingness caused by non-participation in sub-studies, as well as unstructured missingness that is not in blocks and is attributable to any other cause.

  2. Missingness is informative in the sense of MAR (Missing at Random), where there is a relationship between missingness in a given variable and the observed elements of other variables.

  3. There is an association between inter-variable correlation and inter-variable missingness similarity, typically where tightly correlated variables are more likely to be jointly missing.

With our framework for simulating such synthetic data, we evaluate the performance of several imputation methods. While the general motivation of our work is developing a tool for evaluating the accuracy of missing data methods for epidemiological data combined from different sub-studies, we are specifically interested in the problem of missing data when modelling relationships between brain imaging variables and health, demographic, behavioural and lifestyle variables in UK Biobank. We thus focus on the subset of 46471 subjects with imaging derived phenotype (IDP) data, and a collection of 23871 non-Imaging Derived Phenotypes (nIDPs) variables; these nIDP variables are a mixture of continuous and binary variables (some of the binary variables are 1-hot encodings of categorical variables). The illustrative study in the section  Illustrative example: variable selection for total grey matter volume on nIDP regression is an example of tasks that involve imputing missing nIDP data with the final goal of creating a model describing the association between total grey matter volume (which is an IDP) and several nIDPs.

UK Biobank

UK Biobank is a biomedical database and epidemiological cohort study combining, among other things, demographic, genetic, medical, dietary, imaging, biomarker and questionaire data from around half a million subjects residing across the United Kingdom [1116]. The brain imaging sub-cohort consists of roughly 50000 subjects and is our object of interest in this work. IDPs for the brain imaging cohort are variables extracted from Magnetic Resonance Imaging (MRI) that inform various aspects of brain function and structure. Example IDPs include cortical volume, total grey matter volume and white matter hyperintensity volume. nIDPs on the other hand are demographic, health, cognition and lifestyle related variables from diverse sources that are deemed interesting in neuroimaging research.

Methods

Terminology

Let n and d be the number of subjects and variables respectively, X be our n×d dataset and M be the n×d missingness matrix where Mij=1 if variable j is missing for subject i and Mij=0 if it is not missing. The following definitions and notation are central to our work:

  1. Variable-wise missingness pattern. For any variable j=1,2,...,d, the variable-wise missingness pattern for variable j is mjv=(M1j,M2j,...,Mnj)0,1n.

  2. Subject-wise missingness pattern. For any subject i=1,2,...,n, the subject-wise missingness pattern for subject i is mis=(Mi1,Mi2,...,Mid)0,1d.

  3. Variable-wise missingness distance. For any two variables j and j’, the variable-wise missingness distance between them is the proportion of discordant missingness indicators
    Djjv=1nk=1n1MkjMkj,
    where Dv=Djjv is the d×d variable-wise missingness distance matrix.
  4. Subject-wise missingness distance. For any two subjects i and i’ the subject-wise missingness distance between them is likewise
    Diis=1dk=1d1MikMik,
    where Ds is the n×n subject-wise missingness distance matrix.
  5. Structured missingness (SM). We call missingness that is caused by non-participation in a sub-study/questionnaire structured missingness, resulting in a subset of subjects having missing data for a set of variables. This is also sometimes called block-wise missingness, as when subjects and variables are suitably reordered, this will result in solid blocks of missing data in the data matrix.

  6. Unstructured missingness (UM). We call missingness that is not caused by non-participation in a study/questionnaire unstructured missingness. This type of missingness does not induce any blocks of missing data.

We now define the stochastic mechanisms that can give rise to missing data. Let x be a d-dimensional random vector drawn from the same distribution as the data in our data set and m be its corresponding subject-wise missingness pattern. Let further xobs(m) and xmiss(m) be the observed and unobserved parts of the random vector x respectively. We follow the terms in Rubin (1976) for different types of missingness:

  • Missing Completely at Random (MCAR)
    P(m|x)=P(m).
    This means that the missingness mask is completely independent from underlying data.
  • Missing at Random (MAR)
    P(m|x)=P(m|xobs(m)).
    This means that there exists some dependence between the missingness mask and the underlying data, but that this relationship can be described using only observed data, i.e., the relationship between m and x is determined exclusively by the observed part xobs(m). For example, this can mean that there exists a group of variables with no missingness which determine the missingness mask m.
  • Missing Not at Random (MNAR)
    P(m|x)P(m|xobs(m)).
    For this type of missingness, the relationship between the data and the missingness requires knowledge of data obscured by missingness. This is the most difficult setting to handle, assuming no prior knowledge of the mechanism by which missingness is induced, since it has been shown that for any MNAR model explaining missing data in a given data set, there exists an MAR model with equal evidence [17]. In other words, there can be no theoretical guarantees of correctness for MNAR models explaining missing data barring direct knowledge of the missingness mechanism.

Characterising types of missingness is crucial to our work since many methods of handling missing data, most notably Multivariate Imputation by Chained Equations (MICE), have theoretical guarantees under MCAR and MAR [18], while MNAR requires additional assumptions [19].

Parameters of the generative model

We assume that our data consists of C different sub-studies, where study c=0 is assumed to be a baseline study with no missingness while studies c=1,...,C-1 are follow up sub-studies with both structured and unstructured missingness. The following parameters define our generative model:

  • ρc,c, the distribution of between-variable correlations for all pairs of sub-studies c, c’. We assume a mixed data generative model [20], where data arise from a multivariate normal distribution with zero means and unit variances: for continuous variables these values are directly observed, while for binary variables the normal variate is latent and the data is obtained by thresholding continuous variables to 0/1. Therefore, ρc,c represents the correlation distribution of the underlying data prior to thresholding.

  • πc the rate of structured missingness for sub-study c.

  • (αc,βc), parameters governing the rate of unstructured missingness for each variable. We assume that the rate of unstructured missingness pjum is drawn from Beta(αc,βc) if feature j belongs to sub-study c.

  • Σcore. We assume that dcore variables from the baseline study c=0 determine all structured missingness through a logistic model. Σcore is the correlation matrix of these core variables. The core variables are assumed to all be continuous.

  • AUCc, the Area Under the Curve (AUC) score of the logistic model determining structured missingness for sub-study c.

Estimating parameters

We estimate the parameters of the model using the following procedure, also detailed in the flowchart in Fig. 1. To avoid confusion, we refer to c=0,1,...,C-1 as “clusters” from here on out rather than “sub-studies”, as a single cluster c may contain several similar sub-studies depending on the size of C that we choose.

  1. C clusters are identified using hierarchical agglomerative complete linkage clustering [21].

  2. The densities ρc,c are estimated using a histogram for each pair of clusters c, c’.

  3. We define a subject i to be structurally missing for a cluster c if at least 90% of the variables from c are missing for subject i. This will give us the vectors bcs0,1n where bc,is=1 if subject i is structurally missing for cluster c. This result also directly gives us πc, i.e., the probability of a subject having structured missingness for cluster c.

  4. Having identified all structured missingness, we estimate (αc,βc) using the method of moments on the remaining, unstructured missingness.

  5. We use LASSO Logistic Regression (LASSO-LR) [22] to simultaneously identify the core variables that determine structured missingness and AUCc, by fitting C-1 penalised logistic regression models that use the baseline study data X0 as predictors and the subject-wise structured missingness vectors bcs as outcomes. Specifically, AUCc is estimated using 5-fold cross validation. Note that the core variables are cluster specific and may or may not overlap for different clusters. We then estimate Σcore, the correlation matrix of the core variables for all clusters. The penalty term λc for each LASSO models can be chosen in multiple appropriate ways (see section  Calibrating the predictability of missingness).

Fig. 1.

Fig. 1

Flow chart of the data analysis pipeline

Generating synthetic data

The data is generated using a step-wise procedure as seen in Fig. 2. Since all continuous variables have unit variance correlation matrices and covariance matrices of continuous data are the same.

  1. Using Σcore, we simulate the cluster specific core variables Xcorec for clusters c=1,...,C-1, by drawing the full core variable data matrix Xcore from N(0,Σcore).

  2. Using a binary search procedure, we determine intercepts and coefficients of C-1 logistic models determining structured missingness, with Xcorec as covariates, such that the model’s AUC score and rate of positive cases will match AUCc and πc. All the coefficients of the logistic models are assumed to be equal. Using these models, we generate synthetic subject-wise structured missingness vectors bcs.

  3. We generate the rates of unstructured missingness pjum by drawing them independently from Beta(αc,βc) where variable j is in cluster c. Unstructured missingness is assumed to be MCAR and is induced for each subject i and variable j with probability pjum. By combining the generated structured and unstructured missingness, we obtain an n×d synthetic missingness indicator matrix M.

  4. We simulate the full d×d correlation matrix Σ by drawing its entries from ρc,c and, if necessary, projecting it to the nearest positive definite correlation matrix using Higham’s algorithm [23].

  5. Having the complete correlation matrix, we generate the rest of the data X, conditioned on the already simulated core variable data Xcore. To allow for binary variables, we threshold a subset of variables to become binary, corresponding to the same number of binary variables in each cluster.

  6. Finally the synthetic missingness mask M is imposed upon X to obtain the corresponding synthetic data set with missingness Xmiss.

Fig. 2.

Fig. 2

Flow chart of the synthetic data generation pipeline

This procedure generates synthetic datasets which satisfy the key criteria outlined in the introduction, including that the datasets fall under MAR. Crucially, we have access to the true mean vector and covariance matrix, as well as the underlying data obscured by missingness.

Calibrating the predictability of missingness

The choice of penalty term λc for each LASSO-LR model that predicts structured missingness can most easily be made by selecting the value of λc which minimises validation loss. This is, however, not always the option which is most faithful to the assumptions of our generative model. Since our generative model assumes that all core variables have equal importance in predicting structured missingness, we want to choose a value of λc which will minimise the number of core variables of low predictive importance, while having a validation loss that is close to that of the optimal value. This is an inevitably arbitrary feature of our generative model and we choose to select a reasonable value of λc through trial and error.

Simulation study

In order to demonstrate the use of our generative model, we conduct a simulation study on synthetic data mimicking the UK Biobank Brain nIDPs to evaluate the performance of three commonly used imputation methods on this data set. Our study tests the accuracy of imputation for missing data entries, as measured by Mean Squared Error (MSE) for continuous variables and Balanced Accuracy (BA) for binary variables. Accuracy is recorded by individual variable across B=20 synthetic data sets. Specifically, if we, for some variable j with nmiss missing values, have imputed values ximp and corresponding true values xtrue, MSE and BA are calculated according to the following formulae:

MSE=1nmissi(ximp,i-xtrue,i)2
BA=12i|xtrue,i=01{ximp,i=xtrue,i}i|xtrue,i=01+12i|xtrue,i=11{ximp,i=xtrue,i}i|xtrue,i=11.

Additionally, to illustrate the difficulty of imputing data with SM, we perform the same simulation study on copies of the synthetic datasets where missingness has been induced in an MCAR and completely unstructured manner. The data in each cluster is set to be missing using independent Bernoulli variables with probability equal to the total rate of missingness for the cluster.

The first imputation method is mean imputation, which serves as our benchmark method. The second is the matrix completion method SoftImpute [24], which assumes that there exists a low rank approximation of the data set. This method has a tuning parameter, the value of the low rank, which we vary as 5%, 15% and 30% of the full matrix rank. Both the mean imputation and SoftImpute methods are binarised to impute binary variables using 0.5 as the threshold, so imputed values greater than or equal to 0.5 are transformed to 1, while the rest are transformed to 0. The last method is called iterative imputation [25] or ICE, Iterative Imputation by Chained Equations [3], which uses the same iterative procedure as MICE, but does not include randomness in the imputed values and only creates a single imputed data set. As we are interested in evaluating the imputation accuracy of missing values themselves, we evaluate this method rather than MICE, since it follows the same iterative procedure, but does not add noise to the imputed values like MICE does. MICE is also impractical to use in this high dimensional setting with respect to memory use and computational time since it requires a large number of multiply imputed data sets. We choose to impute continuous values using Bayesian Ridge Regression [25] and binary values using Logistic Regression with a Ridge penalty. In high dimensional setting, iterative imputation requires us to choose a subset of k<<d variables that are used to impute each variable j. These variables are normally set to be the k variables with highest absolute correlation with j [2527], or the k variables with the most favorable missingness patterns [26]; all else being equal, we favour variables for imputing j which are observed the most often in rows where j is missing and therefore select variables using the rows of the matrix

V=MT(1n×d-M), 1

where Vjj is the number of times j’ is observed when j is missing.

We propose a third selection method which utilises correlation and missingness jointly, while being applicable to mixed data. It calculates a score Sjj which is proportional to the maximum expected reduction imputation error (MSE for continuous and misclassification rate for binary variables) under the assumption of MCAR and under the generative model described in [20] for joint continuous and binary data, i.e., an underlying multivariate normal distribution with thresholding for binary variables.

  • j and jare both continuous
    Sjj=Vjjρ2,
    where ρ is the Pearson correlation between variables j and j’.
  • j is continuous and jis binary
    Sjj=Vjjρ2,
    where ρ is the Pearson correlation between variables j and j’.
  • j is binary and jis continuous
    Sjj=Vjj[-D/ρbϕ(x)ΦD-ρbx1-ρb2dx+
    D/ρbϕ(x)1-ΦD-ρbx1-ρb2dx-maxp,1-p]
    if ρb>0 and
    Sjj=Vjj[-D/ρbϕ(x)1-ΦD-ρbx1-ρb2dx+
    D/ρbϕ(x)ΦD-ρbx1-ρb2dx-maxp,1-p]
    if ρb<0, where ϕ and Φ are the the probability density function and cumulative distribution function of the standard-normal distribution, ρ is the Pearson correlation between variables j and j’, p is the rate of positive cases for variable j, D=Φ-1(p) and
    ρb=ρp(1-p)ϕ(D).
    D is the threshold of the underlying standard-normal variable that determines the binary value of j and ρb is the correlation between this underlying variable and j’. The reduction in misclassification loss can be calculated directly using these quantities by assuming that we predict 0/1 depending on whether the median of the latent variable conditioned on the value of j’ is greater than D or not.
  • j and jare both binary
    Sjj=Vjj[pmaxP(x=1|x=1),1-P(x=1|x=1)+
    (1-p)maxP(x=1|x=0),1-P(x=1|x=0)-maxp,1-p],
    where p and p’ are the rates of positive cases for p and p’ respectively. Here, the reduction in misclassification loss is calculated directly from the 2×2 contingency table of j and j’, since we know the most likely outcome of variable j given the value of j’. This contingency table is calculated using p, p’ and ρ.

A formal proof of these results can be found in the supplementary material. We vary the tuning parameter k to be 10, 50 and 150.

Illustrative example: variable selection for total grey matter volume on nIDP regression

In order to demonstrate the validity of the conclusions drawn from our simulation study, we apply our imputation methods to an exemplary analytical task on real data and see if there is agreement between the results of the analysis and the conclusions drawn from the simulation study. We choose the task of selecting 15 nIDPs for an Ordinary Least Squares (OLS) model, modelling the association between log-transformed normalised total grey matter volume and the selected nIDPs. The total pool to select from is 15000 nIDPs (nIDPs with 0 variance or missingness above 40% are excluded). Not a single subject has all of these variables observed, so complete case analysis is not an option. Imputation is used here as a pre-processing step and LASSO-LR is used for variable selection. The outcome, i.e., the 15 variables that are selected varies depending on the imputation method. We compare four different approaches: using only complete variables (meaning only using variables that have no missingness), mean imputation, SoftImpute and iterative imputation. The tuning parameters for SoftImpute and iterative imputation are chosen based on their performance in the simulation study. The four approaches are evaluated by the relevance of the 15 selected variables, as measured by variance explained (R2) of each OLS model and the number of statistically significant variable coefficients. To ensure fair assessment of the R2 scores and coefficient p-values irrespective of missingness in the selected variables, we return missing data entries post selection and use the mice package in R [19] to create m=100 multiply imputed data sets which are used to obtain robust p-values and also to quantify the uncertainty in the R2 score. Pooled R2 scores with corresponding standard errors are obtained according to “Rubin’s rules” [28]. We use the following estimator for the standard error of R2 [29]:

se(R2)=4R2(1-R2)2(n-p-1)2(n<spanclass=convertEndash><spanclass=convertEndash>2-1</span></span>)(n+3),

where n is the number of observations and p the number of variables.

We also ensure that the baseline variables age, age squared, sex and Townsend deprivation index are included in the OLS model as potential nuisance variables, to avoid detecting spurious associations. Pipelines similar to the one described here are commonly used for association modelling tasks on UKB data [3032].

Results

Analysis pipeline

As seen in the section Estimating parameters, we need to select a value for the number of clusters C as a parameter of our analysis pipeline. This choice is jointly driven by the data itself as well as the need to select a small number of clusters to allow us to clearly illustrate our methodology. Figure 3 shows the dendrogram of the 100 last agglomerations in the hierarchical clustering of nIDPs by variable-wise missingness pattern. It can be seen from this dendrogram that C=4 will give us clusters which have a high between-cluster missingness distance relative to within-cluster distance. The choice of C=4 clusters also gives us reasonably sized clusters for our analysis, as seen in Table 1.

Fig. 3.

Fig. 3

Dendrogram of the 100 last agglomerations in the hierarchical clustering of nIDPs by variable-wise missingness pattern, where distance between merged clusters (y-axis) is the maximal variable-wise missingness distance between agglomerated clusters. We determine that C=4 clusters is an appropriate choice for illustrating the workings of our tool since it gives us reasonably sized clusters with a high between-cluster distance relative to within-cluster distance

Table 1.

Table of cluster sizes

Cluster # of binary variables # of continuous variables Total # of variables
c=0 13458 1227 14685
c=1 371 6434 6805
c=2 131 2012 2143
c=3 3 235 238

The share of each nIDP type by cluster is shown in Fig. 4. Cluster c=0 contains almost exclusively health and medical related nIDPs, cluster c=2 contains mostly lifestyle and environment related variables, cluster c=3 almost exclusively contains cognitive phenotype variables and cluster c=1 contains a mix of the remaining types of variables. This results shows that nIDPs of the same type tend to have similar variable-wise missingness patterns.

Fig. 4.

Fig. 4

Bar plots detailing the proportion of nIDP variable types in each cluster. Cluster c=0 contains almost exclusively health and medical related nIDPs, cluster c=2 contains mostly lifestyle and environment related variables, cluster c=3 almost exclusively contains cognitive phenotype variables and cluster c=1 contains a mix of the remaining types of variables. This shows that nIDPs of the same type tend to have similar variable-wise missingness patterns

Figure 5 plots the histograms of the proportions of variable-wise missing data, i.e., fraction of subjects missing for each variable in a cluster. As we can see, cluster c=0, the cluster that contains mostly health- and medical related variables, has almost no missingness. This is an expected result, as health records usually either contain too much missingness to be included in the first place or they have very little missingness as absence of data indicates abscence of recorded disease or diagnosis. Clusters c=2,3 have intermediate rates of missingness with cluster c=3 having lower rates of missingness as well as a lower variability of rates of missingness, while c=1 has very high rates of missingness. For this reason, we exclude cluster c=1 from our generative model, as its variables have too high rates of missingness to be interesting to use for imputation.

Fig. 5.

Fig. 5

Histograms of the proportions of missing data for variables in each cluster. Each entry in the histogram for cluster c is the proportion of missing data for a single variable belonging to cluster c. Cluster c=0, i.e., the cluster that contains mostly health- and medical related variables, has almost no missingness. Clusters c=2,3 have intermediate rates of missingness with cluster c=3 having lower rates of missingness as well as a lower variability in the same, while c=1 has very high rates of missingness

Figure 6 displays the subject-wise missingness histograms, the proportion of cluster-c variables missing for a given subject. The red line in each plot signifies the 90% threshold for structured missingness, meaning that subjects for which 90% or more of the features assigned to cluster c are missing are considered to have structured missingness for the variables in cluster c. We see that cluster c=3 has a much clearer separation between structured and unstructured missingness, whereas it is less clear for clusters c=1,2, likely due to higher rates of unstructured missingness as well as our approximation of C=4 leading to different small sub-studies being grouped together in the same cluster.

Fig. 6.

Fig. 6

Histograms detailing the proportion of variables assigned to cluster c that are missing, by subject. The red line in each plot signifies the 90% threshold for structured missingness, meaning that subjects for which 90% or more of the features assigned to cluster c are missing are considered to have structured missingness for the variables in cluster c. We see that the cluster c=3 has a much clearer separation between structured and unstructured missingness, whereas it is less clear for clusters c=1,2, likely due to higher rates of unstructured missingness as well as our approximation of C=4 leading to different small sub-studies being assigned to the same cluster

As discussed in the section Calibrating the predictability of missingness, the penalty terms λc need to be carefully chosen to not violate the assumptions of our generative model. Manual tuning arrived at a value of λc=exp(6) for both clusters c=2,3 which minimised both the total number of core variables as well as the proportion of binary variables, while having validation AUC scores close to those of the optimal values, as shown in Tables 2 and 3.

Table 2.

Table detailing the validation AUC for predicting structured missingness using variables from cluster c=0 and number of binary and continuous variables selected using LASSO Logistic Regression (LASSO-LR) for optimal values of λc

Cluster AUCc # of binary variables # of continuous variables dcore,c
c=2 0.72 48 575 623
c=3 0.90 271 948 1219

Table 3.

Table detailing the validation AUC for predicting structured missingness using variables from cluster c=0 and number of binary and continuous variables selected using LASSO logistic regression using λc=exp(6) for both clusters c=2,3

Cluster AUCc # of binary variables # of continuous variables dcore,c
c=2 0.71 5 62 67
c=3 0.86 5 78 83

The final results of the data analysis are summarised in Table 4. These results indicate that clusters c=2,3 have similar rates of structured missingness, while cluster c=2 has a much higher rate of unstructured missingness, as indicated by the values of αc and βc. It is also apparent that the variables in cluster c=3 have a more informative type of structured missingness as we can see by the higher value of AUCc.

Table 4.

Table of results summarising the data analysis step of our tool

Cluster AUCc dcore,c πc αc βc
c=2 0.71 67 0.26 6.0 4.6
c=3 0.86 83 0.29 0.44 4.4

These results indicate that clusters c=2,3 have similar rates of structured missingness, while cluster c=2 has a much higher rate of unstructured missingness, as indicated by the values of αc and βc. It is also apparent that the variables in cluster c=3 have a more informative type of structured missingness as we can see by the values of AUCc

Simulation study

Figures 7 and 8 plot the imputation accuracy by variable for datasets using the generative model as well as the unstructured equivalent described in the section Simulation study. The red line in each violin plot is the median of the best performing method in that comparison, i.e., the lowest median MSE and the highest median BA. To interpret the results, it should be noted that the continuous variables of the generative model all have zero mean and unit variance. Iterative imputation that uses Pearson correlation or the mixed score as its criterion for variable selection is the best performing method overall. It is also notable that the SoftImpute performs poorly for binary data and notably worse than the iterative imputation methods for continuous data. When comparing the performance between generative model versus completely unstructured missing data, we see that performance is better for the completely unstructured case, for both continuous and binary variables. This difference is particularly stark for cluster c=3, where there is a lot of structured missingness and very little unstructured missingness, highlighting the difficulty of imputation in this setting. Finally, it should be noted that we are rarely able to explain more than 20% of variance in the missing values and that this could mean that the choice of imputation method will not greatly impact the outcome of many analytical tasks, as the modest accuracy of imputation may not be enough to greatly alter the final outcome.

Fig. 7.

Fig. 7

Violin plots of imputation accuracy for continuous variables, over B=20 synthetic data sets. The red line in each violin plot is the median of the best performing method in that comparison, i.e., the lowest median MSE. All continuous variables have unit variance. It is clear that iterative imputation that uses Pearson correlation or the mixed score as its criterion for selecting k variables is the best performing method overall. Our tool estimates lower imputation accuracy than the unstructured approach, indicating that imputation is harder under SM

Fig. 8.

Fig. 8

Violin plots of imputation accuracy for binary variables, over B=20 synthetic data sets. The red line in each violin plot is the median of the best performing method in that comparison, i.e., the highest median BA. It is clear that iterative imputation that uses Pearson correlation or the mixed score as its criterion for selecting k variables is the best performing method overall. Our tool estimates lower imputation accuracy than the unstructured approach, indicating that imputation is harder under SM

Illustrative example

Table 5 lists the variance explained for the OLS model using the selected variables along with the number of variables in the model that ended up being statistically significant. The results show a modest difference between the three imputation methods, with iterative imputation having the most statistically significant variables and the best R2 score. The results when using only complete variables are worse with a considerably lower R2 score as well as fewer variables ending up statistically significant. We also see that the complete variables method selected many more binary variables and this is because the complete variables are mostly health record data, i.e., data assigned to cluster c=0, which is disproportionately binary as seen in Table 1. Meanwhile, the results for iterative imputation are the best, having the highest R2 score as well as the highest number of statistically significant variables. These results align well with our simulation study; a small difference in the final outcome of the analytical task for different methods caused due to the difficulty of imputing structurally missing data, but with iterative imputation clearly being the best alternative.

Table 5.

R2 scores for the OLS models using the selected variables along with the number of variables in the model that ended up being statistically significant

Method R2 of OLS se(R2) #Statistically significant in OLS # Binary vars. selected Rate of missingness in selected vars.
Complete Variables 0.475 0.0034 11/15 12/15 0.00
Mean 0.510 0.0033 12/15 1/15 0.09
SoftImpute 0.504 0.0034 12/15 1/15 0.13
Iterative Imputation 0.516 0.0033 14/15 2/15 0.10

The results show a modest difference between the three imputation methods, with iterative imputation having the most statistically significant variables and the best R2 score. The results when using only complete variables are worse with a considerably lower R2 score as well as fewer variables ending up statistically significant

Discussion

We have proposed a method for generating large-scale data with complex patterns of missing data that make imputation difficult. In particular, our data-driven simulation framework allows for highly informative missingness and joint missingness for variables that are strongly correlated. This ability to mimic the properties of large scale epidemiological datasets makes our tool useful for gaining insight into the performance of handling missing data for different analytical tasks. There are, however, limitations to our model which are important to note and represent potential future work on this topic. Our model assumes multivariate normality for all continuous features, which limits the generalisability of any conclusions drawn from simulation studies of analytical methods that are sensitive to non-gaussianity or strong outliers. In such scenarios, it is possible that conclusions drawn using our generative model would unduly favour linear methods over more complicated black-box methods that would fare better on real, non-Gaussian data. This could be solved by parameterising the model differently, allowing for more flexibility on the underlying multivariate distribution, or by using non-parametric methods.

Another potential limitation of our generative model is that we assume that the correlation structure of the data closely follows the missingness structure. This is because we assume that for any pair of clusters c, c’, the correlation between pairs of variables in c and c’ are drawn independently from some distribution ρc,c, i.e., we assume that there is no further covariance structure within or between clusters. We have found this to be approximately true for the nIDPs that we have been working with, but this might not be the case for other datasets. This could be solved by modelling missingness and correlation structure jointly in a way which allows for further complexity inside clusters.

In this paper we chose to use C=4 clusters of variables as an approximation of reality in order to be able to inspect the properties of these clusters separately. In all likelihood, the true number of sub-studies is higher, and more representative results could be obtained by choosing a higher figure. We deemed it necessary to use a lower number in order to demonstrate the inner workings of our tool. When allowing C to be higher, we found C=15 clusters with 100 or more variables present in the data set. These clusters all had a very clear separation between the structured and unstructured missingness, which bolsters our hypothesis that missingness in UKB data can be effectively modelled as we have suggested.

Conclusions

The results from the simulation study combined with the illustrative example show that there is room for improvement in the current missing data methodology to accommodate for this specific type of missing data, and that the final result of many analytical tasks on data from the UKB Brain Imaging cohort will vary little depending which commonly used imputation method is chosen. This is due to the difficulty of imputation in this setting. Even so, we have shown some advantage in using iterative imputation over matrix completion methods. While we do not propose new missing data methods here, our results highlight the need for developing methods that specifically account for structured missingness.

Supplementary Information

Acknowledgements

The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Abbreviations

SM

Structured missingness

UM

Unstructured missingness

UKB

UK Biobank

MRI

Magnetic resonance imaging

IDP

Imaging derived phenotype

nIDP

Non-imaging derived phenotype

MCAR

Missing completely at random

MAR

Missing at random

MNAR

Missing not at random

AUC

Area under the (reciever operating characteristic) curve

LASSO-LR

Least absolute shrinkage and selection operator logistic regression

MICE

Multivariate imputation by chained equations

ICE

Imputation by chained equations

MSE

Mean squared error

BA

Balanced accuracy

OLS

Ordinary least squares

Authors’ contributions

LR performed data analysis, simulation and the exemplar study. All authors contributed ideas throughout the project. All authors contributed to and revised the final manuscript.

Funding

LR is supported by the EPSRC Centre for Doctoral Training in Health Data Science (EP/S02428X/1).

The Wellcome Centre for Integrative Neuroimaging (WIN FMRIB) is supported by core funding from the Wellcome Trust (203139/Z/16/Z).

SS: Wellcome Trust Collaborative Award 215573/Z/19/Z.

Data availability

One example dataset generated using our method can be found here: https://www.kaggle.com/datasets/lrstats/example-dataset-generative-model. Further data can be made available upon reasonable request to authors.

Declarations

Ethics approval and consent to participate

The UK Biobank has received ethical approval from the North West Multi-Center Research Ethics Committee (11/NW/0382).

This research project received approval from the UKB under application number 8107.

This research project has adhered to the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Le Morvan M, Josse J, Moreau T, Scornet E, Varoquaux G. NeuMiss networks: differentiable programming for supervised learning with missing values. Adv Neural Inf Process Sys. 2020;33:5980–90. [Google Scholar]
  • 2.van Loon W, Fokkema M, de Vos F, Koini M, Schmidt R, de Rooij M. Imputation of missing values in multi-view data. Information Fusion. 2024:102524.
  • 3.Jarrett D, Cebere BC, Liu T, Curth A, van der Schaar M. Hyperimpute: Generalized iterative imputation with automatic model selection. In: International Conference on Machine Learning. Cambridge: PMLR; 2022. pp. 9916–37.
  • 4.Ghalebikesabi S, Cornish R, Holmes C, Kelly L. Deep generative missingness pattern-set mixture models. In: International Conference on Artificial Intelligence and Statistics. Cambridge: PMLR; 2021. pp. 3727–35.
  • 5.Beaulieu-Jones BK, Lavage DR, Snyder JW, Moore JH, Pendergrass SA, Bauer CR. Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Med Inform. 2018;6(1):e8960. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Gentry AE, Kirkpatrick RM, Peterson RE, Webb BT. Missingness adapted group informed clustered (MAGIC)-LASSO: a novel paradigm for phenotype prediction to improve power for genetic loci discovery. Front Genet. 2023;14:1162690. 10.3389/fgene.2023.1162690. [DOI] [PMC free article] [PubMed]
  • 7.Mitra R, McGough SF, Chakraborti T, Holmes C, Copping R, Hagenbuch N, et al. Learning from data with structured missingness. Nat Mach Intell. 2023;5(1):13–23. [Google Scholar]
  • 8.Jackson J, Mitra R, Hagenbuch N, McGough S, Harbron C. A Complete Characterisation of Structured Missingness. arXiv preprint arXiv:2307.02650. 2023.
  • 9.Little TD, Jorgensen TD, Lang KM, Moore EWG. On the joys of missing data. J Pediatr Psychol. 2014;39(2):151–62. [DOI] [PubMed] [Google Scholar]
  • 10.Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017;186(9):1026–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.UK Biobank. 2024. https://www.ukbiobank.ac.uk. Accessed 21 Aug 2024.
  • 12.Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. Genome-wide genetic data on 500,000 UK Biobank participants. BioRxiv. 2017;166298.
  • 13.Miller KL, Alfaro-Almagro F, Bangerter NK, Thomas DL, Yacoub E, Xu J, et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat Neurosci. 2016;19(11):1523–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Allen N, Sudlow C, Downey P, Peakman T, Danesh J, Elliott P, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy Technol. 2012;1(3):123–6. [Google Scholar]
  • 15.Julkunen H, Cichońska A, Tiainen M, Koskela H, Nybo K, Mäkelä V, et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat Commun. 2023;14(1):604. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Collins R. What makes UK Biobank special? Lancet. 2012;379(9822):1173–4. [DOI] [PubMed] [Google Scholar]
  • 17.Molenberghs G, Beunckens C, Sotto C, Kenward MG. Every missingness not at random model has a missingness at random counterpart with equal fit. J R Stat Soc Ser B Stat Methodol. 2008;70(2):371–88. [Google Scholar]
  • 18.Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011;20(1):40–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67. [Google Scholar]
  • 20.Demirtas H, Doganay B. Simultaneous generation of binary and normal data with specified marginal and association structures. J Biopharm Stat. 2012;22(2):223–36. [DOI] [PubMed] [Google Scholar]
  • 21.Sorensen TA. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biol Skar. 1948;5:1–34. [Google Scholar]
  • 22.Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88. [Google Scholar]
  • 23.Higham NJ. Computing the nearest correlation matrix-a problem from finance. IMA J Numer Anal. 2002;22(3):329–43. [Google Scholar]
  • 24.Mazumder R, Hastie T, Tibshirani R. Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res. 2010;11:2287–322. [PMC free article] [PubMed] [Google Scholar]
  • 25.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]
  • 26.Van Buuren S, Oudshoorn K. Flexible multivariate imputation by MICE. Leiden: TNO; 1999. [Google Scholar]
  • 27.Graham JW. Missing data analysis: Making it work in the real world. Ann Rev Psychol. 2009;60:549–76. [DOI] [PubMed] [Google Scholar]
  • 28.Rubin DB. Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Statistics. Hoboken: Wiley; 1987.
  • 29.Olkin I, Finn JD. Correlations redux. Psychol Bull. 1995;118(1):155. [Google Scholar]
  • 30.Madakkatel I, Zhou A, McDonnell M, Hyppönen E. Can we use machine learning to discover risk factors? Testing the proof of principle using data on 11,000 predictors and mortality in the UK Biobank. medRxiv. 2021;2021–05.
  • 31.Liu X, Morelli D, Littlejohns TJ, Clifton DA, Clifton L. Combining machine learning with Cox models to identify predictors for incident post-menopausal breast cancer in the UK Biobank. Sci Rep. 2023;13(1):9221. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Zhang Y, Chen SD, Deng YT, You J, He XY, Wu XR, et al. Identifying modifiable factors and their joint effect on dementia risk in the UK Biobank. Nat Hum Behav. 2023;7(7):1185–95. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

One example dataset generated using our method can be found here: https://www.kaggle.com/datasets/lrstats/example-dataset-generative-model. Further data can be made available upon reasonable request to authors.


Articles from BMC Medical Research Methodology are provided here courtesy of BMC

RESOURCES