BMC Medical Research Methodology. 2025 Feb 20;25:43. doi: 10.1186/s12874-025-02496-3

Conceptual framework as a guide to choose appropriate imputation method for missing values in a clinical structured dataset

Marziyeh Afkanpour 1,2, Diyana Tehrany Dehkordy 1,2, Mehri Momeni 1, Hamed Tabesh 1
PMCID: PMC11843774  PMID: 39979819

Abstract

Background

Missing data is a common challenge in structured datasets, and numerous methods are available for imputing these missing values. While all of these imputation methods address the issue of incomplete data, it is important to note that some methods perform better than others in terms of their effectiveness. A thorough data analysis can help a researcher identify a given dataset’s most appropriate imputation approach, leading to more reliable analytical results. The primary objective of this study is to develop a conceptual framework that integrates various data imputation methods.

Methods

This study was conducted in two main steps. First, we defined the conceptual components and their interrelationships by identifying and categorizing primary concepts through a secondary analysis of our previous systematic review, which examined 58 studies to uncover influential factors for selecting optimal imputation methods. Second, we analyzed the implementation process, focusing on the properties of missing values and selecting appropriate imputation techniques while verifying the underlying assumptions according to the estimand framework from the ICH E9(R1) Guideline to ensure unbiased estimates and enhance the credibility of our findings.

Results

The findings from the secondary analysis suggest that the primary concepts of the developed conceptual framework directly influence the selection of appropriate imputation methods.

Conclusions

This integrated structure will enable researchers to select the most suitable imputation method based on the specific characteristics and conditions of the dataset under investigation. By employing the appropriate imputation method, the study aims to enhance the overall quality and trustworthiness of the analytical outcomes derived from the research dataset.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-025-02496-3.

Keywords: Conceptual framework, Imputation methods, Missing values, Taxonomy of concepts, Clinical tabular dataset, Simulation study, Estimands framework

Introduction

Missing values are a major challenge in the data analysis process, particularly within the healthcare domain. In the clinical context, missing data can result from incomplete data collection [1], missed appointments or loss to follow-up [2], equipment or system failures [3], patient non-compliance [4], inconsistent data definitions or coding, and heterogeneous data sources, since integrating data from multiple, disparate sources can lead to inconsistencies and missing values [5]. Furthermore, fusing two tabular datasets yields a fused dataset in which missing values appear as blocks of missing data [6, 7].

The presence of missing values within a clinical dataset can significantly affect the analytical results and the validity of the study findings. Missing data can lead to biased parameter estimates [7–9], reduced statistical power [6, 9], inaccurate data visualizations [10], challenges in model selection and evaluation [11], decreased generalizability of the study results [12], and reduced data quality and reliability [13]. Furthermore, missing data can cause significant problems across different types of epidemiological studies. In interventional studies, such as clinical trials, missing data can result in biased estimates of treatment effects, reduce statistical power, and complicate the interpretation of study results. It can also introduce selection bias and compromise the trial’s internal validity [14, 15]. In observational studies, missing data may cause biased estimates of exposure-outcome associations and confounding effects, ultimately reducing the generalizability of the findings [16, 17]. In causal inference studies, missing data can undermine the assumptions necessary for valid conclusions, including the assumption of no unmeasured confounding and the consistency assumption [18]. In longitudinal studies, missing data can bias estimates of both within-subject and between-subject effects. It can also lead to incorrect inferences about the trajectory of outcomes over time [19]. Moreover, missing data can distort model coefficients, decrease prediction performance, and diminish the generalizability of prediction models. In the evaluation of diagnostic tests, missing data may result in inaccurate estimates of sensitivity, specificity, and other performance metrics [20]. Finally, in survival analysis, missing data, particularly in time-to-event data like censoring information, can lead to biased estimates of survival probabilities and hazard ratios [21].

Given these challenges, missing values in a clinical dataset must be addressed through appropriate methods and robust data management practices. Attention to missing values is therefore of utmost importance throughout the data analysis process.

The ICH E9(R1) Guideline [22] on estimands and sensitivity analysis emphasizes the importance of distinguishing between missing data and data that are not relevant due to intercurrent events. It presents several strategies for addressing intercurrent events, including treatment policy, hypothetical scenarios, composite variables, and principal stratification. Each strategy offers a unique perspective on interpreting treatment effects in the presence of missing data, thereby influencing the robustness of the conclusions. Sensitivity analysis plays a critical role in this context, as it assesses the robustness of the main estimators against deviations from underlying assumptions. By systematically varying these assumptions, researchers can evaluate how sensitive their findings are to potential biases or limitations in the data. Although the ICH E9(R1) framework focuses primarily on randomized clinical trials, its principles are also relevant to other study designs, including observational studies. This broader applicability highlights the framework’s importance in providing reliable estimates across different research contexts. By following the guidelines on estimands and sensitivity analysis, researchers can bolster the credibility of their findings, which in turn supports better regulatory decisions and enhances patient outcomes. Therefore, employing robust methods for managing missing data and performing sensitivity analyses is crucial for obtaining accurate estimates and for choosing appropriate imputation methods in clinical research settings.

One of the existing solutions for managing missing data is the use of data imputation methods. Researchers have developed various imputation methods that aim to estimate the missing values based on the available information within the dataset [11, 19]. These imputation methods range from simple approaches, such as mean or median imputation, to more sophisticated methods, including multiple imputation and model-based imputation [20, 21]. Therefore, there are numerous imputation methods available.
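Multiple imputation, mentioned above, produces several completed datasets, and the per-dataset results are conventionally combined with Rubin's rules. The following is a minimal sketch of that pooling step only, not a method taken from this study; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    w = mean(within_variances)       # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, total


# Hypothetical results from m = 3 imputed copies of one dataset.
pooled_estimate, pooled_variance = pool_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
```

The between-imputation term is what single imputation omits: it carries the extra uncertainty that arises because the missing values were estimated rather than observed.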

A notable consideration when employing data imputation methods is the careful examination of the characteristics of the missing values, including the mechanism, pattern, and ratio of missing-ness, alongside the investigation of the properties of the proposed dataset, such as sample size, data type, the role of missing values, the distribution of variables, correlations among them, and the overall study design.
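Two of the characteristics listed above, the ratio and the pattern of missingness, can be computed directly from a dataset. The sketch below assumes rows are dictionaries with `None` marking a missing cell, and it uses common textbook definitions of the univariate, monotone, and arbitrary patterns; the paper's own definitions are given in its Appendix 1, so these definitions and the variable names (`age`, `bp`, `hba1c`) are assumptions for illustration.

```python
def missing_ratio(rows, columns):
    """Fraction of missing (None) cells per column."""
    n = len(rows)
    return {c: sum(r[c] is None for r in rows) / n for c in columns}


def missing_pattern(rows, columns):
    """Classify the pattern as 'none', 'univariate', 'monotone', or 'arbitrary'."""
    counts = {c: sum(r[c] is None for r in rows) for c in columns}
    incomplete = [c for c in columns if counts[c] > 0]
    if not incomplete:
        return "none"
    if len(incomplete) == 1:
        return "univariate"
    # Monotone: with columns ordered from least to most missing, once a row
    # has a missing value every later column in that row is missing too.
    ordered = sorted(columns, key=lambda c: counts[c])
    for r in rows:
        seen_missing = False
        for c in ordered:
            if r[c] is None:
                seen_missing = True
            elif seen_missing:
                return "arbitrary"
    return "monotone"


# Hypothetical toy dataset.
columns = ["age", "bp", "hba1c"]
rows = [
    {"age": 50, "bp": 120, "hba1c": 6.1},
    {"age": 61, "bp": None, "hba1c": None},
    {"age": 47, "bp": 135, "hba1c": None},
    {"age": 58, "bp": 128, "hba1c": 7.0},
]
ratios = missing_ratio(rows, columns)
pattern = missing_pattern(rows, columns)
```

Identifying the mechanism (MCAR, MAR, MNAR) cannot be reduced to a count like this, because it depends on how missingness relates to observed and unobserved values.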

According to the estimand framework [22], it is essential to consider all of the aforementioned assumptions when imputing missing values. Failure to account for these assumptions can lead to significant issues in conducting sensitivity analyses. In particular, neglecting the missing data mechanism may lead to reduced power, biased statistical inference, and invalid conclusions about the intervention’s efficacy [13, 23]. Determining the mechanism of missing values helps to identify the most appropriate analysis method [24], so it is crucial to identify the mechanism of missing-ness in order to obtain a valid analysis with minimal bias. A single imputation method could be used if the missing values have a missing completely at random (MCAR) mechanism. If the missing-ness mechanism is missing at random (MAR), then multiple imputation is useful [25, 26]. When the mechanism is missing not at random (MNAR), more sophisticated approaches such as joint modeling, pattern mixture models, or the Markov Chain Monte Carlo (MCMC) imputation method should be used [8]. When the method is matched to the mechanism in this way, the analysis produces good estimates of the variability in the dataset. Moreover, ignoring the pattern of missing-ness may bias the results [26]; the missing data pattern is essential for selecting an appropriate imputation method to estimate missing values [27]. As a consequence, applying imputation methods without considering the specific properties of the missing data can result in biased analyses and unreliable conclusions.
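The mechanism-to-method guidance in the paragraph above can be written as a simple lookup. This is only a sketch of that paragraph's rule of thumb; the names `SUGGESTED_FAMILY` and `suggest_methods` are hypothetical, and the full framework considers more factors than the mechanism alone.

```python
# Method families as named in the text for each missingness mechanism.
SUGGESTED_FAMILY = {
    "MCAR": ["single imputation"],
    "MAR": ["multiple imputation"],
    "MNAR": ["joint modeling", "pattern mixture models", "MCMC imputation"],
}


def suggest_methods(mechanism):
    """Return the method families the text associates with a mechanism."""
    key = mechanism.strip().upper()
    if key not in SUGGESTED_FAMILY:
        raise ValueError(f"unknown missingness mechanism: {mechanism!r}")
    return SUGGESTED_FAMILY[key]


mnar_options = suggest_methods("mnar")
```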

Because data imputation methods are based on statistical and mathematical theories, they possess a strong inferential structure, and paying attention to this structure is crucial when using them. Many health researchers lack awareness or sufficient knowledge of these inferential structures, and this gap significantly affects the results and findings of studies that manage missing values with imputation methods. For instance, using mean imputation for missing values with a MAR mechanism will lead to underestimated results. Although the researcher has solved the problem of missing values, they may not be aware that the results are underestimated. On the other hand, identifying the type of missing value mechanism is a challenging task for a researcher without the necessary background knowledge of the relevant theories. Moreover, the abundance of data imputation methods may confuse the researcher in selecting the appropriate imputation method. Unfortunately, there is no guidance to direct health researchers in choosing the correct imputation method for conducting analytical processes in the face of missing values.
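One concrete way mean imputation distorts results is by shrinking variability: every imputed value sits exactly at the observed mean, so the spread of the completed data is smaller than the spread of the observed data. The small simulation below illustrates this under stated assumptions: values are deleted completely at random for simplicity (not under MAR), and the distribution parameters and the 30% deletion rate are arbitrary choices.

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Hypothetical complete variable, then delete about 30% of values at random.
complete = [random.gauss(100, 15) for _ in range(200)]
observed = [x if random.random() > 0.3 else None for x in complete]

seen = [x for x in observed if x is not None]
fill = mean(seen)  # mean imputation uses the observed mean for every gap
imputed = [fill if x is None else x for x in observed]

var_observed = pvariance(seen)
var_imputed = pvariance(imputed)  # smaller: same squared deviations, more points
```

The mean itself is unchanged by the imputation, which is why the problem is easy to overlook; standard errors and correlations computed from the completed data are what become too small.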

In this regard, studies such as [28] have introduced an analytical approach specifically for the data under investigation in the face of missing values, and have ultimately used simulation methods to prove the correctness of the introduced analytical approach. Additionally, in the study [16] a framework for imputing missing values in observational studies has been presented. Furthermore, in the study [15] a framework for dealing with missing values in randomized clinical trials has been introduced. In fact, in these studies, under specific conditions, a guide has been provided for dealing with missing values in the data analysis process. Therefore, given the importance of the structural characteristics of missing values in selecting the appropriate imputation method, the lack of background knowledge of health researchers in identifying the structural characteristics of missing values, and the absence of a practical guide to assist researchers in facilitating this process, the introduction of a structured guide to help health researchers in the analysis of missing values and the selection of the appropriate imputation method is a necessary endeavor.

Based on this, the objective of this study is to present a conceptual framework that facilitates, accelerates, and improves the precision of the analytical processes by which a health researcher selects the appropriate imputation method. In this conceptual framework, we have tried to simplify the analytical process as much as possible when dealing with missing values. This means that a researcher without background knowledge of the mathematical and statistical theories related to the missing value problem can use this framework to determine the appropriate imputation method for the missing values in the dataset under analysis.

A conceptual framework is a crucial component of research that describes the researcher’s understanding of the factors or variables involved in the study and their interrelationships [29, 30]. The purpose of a conceptual framework is twofold, to delineate the concepts under investigation using relevant literature and to elucidate the presumed connections among those concepts [31, 32]. The conceptual framework guides the methodological decisions and the elucidation of important findings, offering a distinct perspective on the research problem. It is informed by literature reviews, experiences, or experiments, and it may incorporate emergent ideas not yet grounded in the existing literature. By describing the conceptual framework, researchers can effectively communicate their assumptions, orientations, and overall understanding of the concepts being investigated to their readers. The conceptual framework may be presented through written descriptions and/or visual representations to clarify the phenomenon under study [30]. Jabareen defined a conceptual framework as a network or system of interlinked concepts that facilitate a comprehensive understanding of a phenomenon. The advantages of conceptual framework analysis include its flexibility, capacity for modification, and emphasis on understanding rather than prediction. The concepts that constitute a conceptual framework support one another, articulate their respective phenomena, and establish a framework-specific philosophical foundation [33]. In this study, we introduce a systematic and structured conceptual framework that outlines the process of selecting the appropriate imputation method for managing missing values in a clinically structured dataset.

Methods

This study was conducted in two main steps. The first step defined the conceptual components and their relationships: the main concepts were identified through a secondary analysis of our previous systematic review, then categorized and organized, and finally integrated to establish their interrelationships. The second step examined the implementation process, including analyzing the properties of the existing missing values and selecting the appropriate imputation method.

Identification of each fundamental concept constituting the framework structure was based on existing scientific evidence. These concepts were determined through a secondary analysis of the systematic review conducted by the researchers of this study [34]. Specifically, the 58 studies [28, 35–91] included in the systematic review were examined to identify new and influential factors in selecting the appropriate imputation method. The secondary analysis proceeded as follows. We first created a table based on the essential assumptions for selecting an appropriate imputation method. These assumptions were identified from our previous research and other relevant studies [34–38]. For each study, three experts independently extracted the relevant factors. Each expert evaluated which of the predetermined assumptions in the table were considered in selecting the imputation method for the specific study. Thus, the three experts recorded whether each specified item was present in each study during the re-analysis process. Subsequently, we reached a consensus based on the majority opinion regarding the extracted items for each study. In addition to the fundamental assumptions for choosing an appropriate imputation method, the imputation algorithms implemented in each study were also extracted. These algorithms were then regarded as the foundational models for each study. Consequently, each item in the table was identified during this process. After identifying the fundamental concepts in the proposed framework, all subordinates associated with each concept were categorized. Subsequently, the relationships between each of these concepts were defined. As a result, the taxonomy of each concept comprising the proposed conceptual framework was established.
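The consensus step described above (three independent experts, majority opinion per item) can be sketched as a simple vote count. The paper does not say how a three-way disagreement was resolved, so returning `None` in that case is an assumption, and the ratings shown are hypothetical.

```python
from collections import Counter


def consensus(votes):
    """Return the value chosen by a strict majority of raters, else None."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else None


# Hypothetical ratings by three experts for one study.
expert_ratings = {
    "mechanism": ["MAR", "MAR", "MCAR"],
    "pattern": ["Arbitrary", "Arbitrary", "Arbitrary"],
    "ratio": ["High", "Moderate", "Low"],
}
agreed = {item: consensus(votes) for item, votes in expert_ratings.items()}
```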
It is important to note that the previous systematic review aimed to introduce the most suitable imputation methods based on the structural characteristics of missing values. In the present study, it was re-analyzed to identify new factors related to the dataset characteristics. Furthermore, in our study, based on the ICH E9(R1) guideline [22], the appropriate imputation method is considered to be the estimand. The various data imputation methods can be categorized as estimators. The values estimated through the chosen imputation method, intended to replace the missing values in the dataset, represent these estimators. To identify the main estimator, or the specific imputation method relevant to our analysis, it is essential to assess the underlying assumptions regarding the characteristics of the missing data and the dataset itself. Therefore, conducting a sensitivity analysis, specifically a sensitivity estimator that aligns with the focus of our work, will be necessary to evaluate these assumptions.

Therefore, according to the estimand framework outlined in the ICH E9(R1) Guideline, it is essential to conduct sensitivity analyses as sensitivity estimators that include evaluating the assumptions related to imputation methods. One of the key sensitivity estimators should focus on verifying these assumptions, as they are foundational to selecting appropriate imputation techniques. By rigorously checking the assumptions underlying the imputation methods, we can ensure that the main estimate is derived in an unbiased manner with minimized error. This process is crucial for maintaining the integrity of the estimand, as it directly influences the reliability of the estimate findings.
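As an illustration of what varying an assumption can look like in practice, the sketch below uses a delta adjustment: missing outcomes are imputed at the observed mean shifted by an offset `delta`, and the estimate is recomputed over a range of offsets. This is a generic sensitivity technique swapped in for illustration, not the procedure used in this study; the function name and data are hypothetical.

```python
from statistics import mean


def mean_under_delta(values, delta):
    """Mean after imputing each missing entry as (observed mean + delta)."""
    seen = [v for v in values if v is not None]
    fill = mean(seen) + delta
    return mean([fill if v is None else v for v in values])


# Hypothetical outcome with two missing values.
outcome = [5.0, None, 7.0, 6.0, None]

# delta = 0 assumes the missing values resemble the observed ones;
# nonzero deltas probe departures from that assumption.
sweep = {delta: mean_under_delta(outcome, delta) for delta in (-2.0, 0.0, 2.0)}
```

If the conclusion holds across the plausible range of `delta`, it is robust to that particular departure from the imputation assumption; if it flips, the assumption needs closer scrutiny.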

Results

The identified concepts, serving as essential components of the proposed conceptual framework, were categorized into two groups: primary concepts and the final key concept. Figure 1 illustrates the main concepts associated with the characteristics of missing data and various properties of datasets. It outlines several critical factors, including the mechanisms and patterns of missingness, the ratio of missing values, data type, and the role of missing values in variables.

Fig. 1. Primary concepts of the conceptual framework

Figure 2 outlines various imputation methods, which serve as a crucial component of the proposed conceptual framework for addressing missing data. It categorizes these methods into traditional statistical methods and learning-based methods. Additionally, a hybrid method is highlighted, which combines algorithms from both categories.

Fig. 2. The final key concept of the conceptual framework. * Hybrid methods are characterized by the combination of two traditional statistical methods (e.g., Bayesian Gaussian Mixture Models (BGMM)), or a combination of two learning-based methods (e.g., WLI fuzzy clustering & Grey Neural Network (GNN)), or a combination of a traditional statistical method with a learning-based approach (e.g., rough set theory and Artificial Neural Network (ANN)).

Definitions and theoretical foundations for each of the primary concepts identified in the design of the proposed conceptual framework are presented in Appendix 1 (see Additional file 1). This section provides a more detailed explanation of the subcategories related to the two concepts of mechanism and pattern of missing values in a dataset, offering readers greater insight. Moreover, methods for identifying the type of missing value mechanism and the type of missing value pattern are introduced [92–103]. Furthermore, Appendix 2 (see Additional file 2) presents information regarding data imputation methods, which serve as the final key concept in the proposed conceptual framework.

The findings from the secondary analysis of the 58 relevant studies examined in our previous systematic review indicated a close relationship between each of the identified primary concepts (missing value mechanism, missing value pattern, ratio of missing-ness, study design, sample size, data type, role of missing values in variables, distribution of variables, and correlation between variables) and the selection of an appropriate imputation method. This means that each of the primary concepts introduced in the development of the conceptual framework serves as an essential prerequisite for choosing a suitable imputation method. These findings are presented in Table 1, where columns two to seven represent the primary concepts of the conceptual framework, and the last column denotes the imputation methods, which serve as the final key concept within the framework. Given the relatively large number of concepts and their subordinate elements within this conceptual framework, presenting the framework as a graph, diagram, or chart did not improve its readability and instead introduced complexity. Consequently, this conceptual framework is presented in Table 1 to ensure clarity in distinguishing between the primary concepts and the final key concept.

Table 1. Conceptual framework outlining the primary concepts and the final key concept of imputation methods

| Study | Mechanism | Pattern | Ratio | Sample size | Study design | Data type | Imputation methods |
|---|---|---|---|---|---|---|---|
| 1 [35] | MCAR, MAR | ---- | High, Moderate | ---- | ---- | Numeric | PMM |
| 2 [36] | ---- | Arbitrary | ---- | ---- | ---- | (**) | RM |
| 3 [84] | ---- | ---- | Low, Moderate | ---- | ---- | Numeric | RST & ANN |
| 4 [37] | MCAR | ---- | High, Moderate | ---- | ---- | Numeric | Extra Trees |
| 5 [38] | (***) | Monotone | Low | ---- | ---- | Numeric | RM |
| 6 [39] | (***) | Arbitrary | High, Moderate | Moderate | ---- | ---- | RM |
| 7 [40] | (***) | Arbitrary | (****) | ---- | Observational (survival study) | Numeric | SVM, ANN, DT, NN, NB, Logit |
| 8 [41] | (***) | ---- | Low, Moderate | ---- | Observational | Categorical | RF |
| 9 [28] | ---- | ---- | High | High | Observational | Numeric | KNN |
| 10 [42] | MAR, MNAR | ---- | High, Moderate | High | Observational | Numeric | MICE |
| 11 [43] | MNAR | Univariate | Moderate | ---- | Observational | (**) | MICE |
| 12 [44] | MCAR | Multivariate | (****) | High | Observational | (**) | GAN |
| 13 [45] | MAR | Arbitrary | (****) | Low | ---- | (**) | MICE & CART |
| 14 [46] | MAR, MNAR | ---- | Moderate, Low | | Clinical Trial | (**) | MICE |
| 15 [47] | MAR | Arbitrary | Moderate | High | Observational | (**) | RF |
| 16 [48] | MAR, MNAR | ---- | ---- | ---- | Clinical Trial | (**) | LR-A |
| 17 [49] | (***) | ---- | Moderate, Low | ---- | Clinical Trial | (**) | KNN |
| 18 [50] | (***) | ---- | Moderate, Low | Moderate | Clinical Trial | (**) | KNN |
| 19 [51] | MAR | Multivariate | High, Moderate | Low | Observational | (**) | RM |
| 20 [52] | MAR | Multivariate | (****) | ---- | Clinical Trial | Categorical | RM |
| 21 [53] | ---- | Multivariate | Low | Low | Clinical Trial | Numeric | RM |
| 22 [54] | (***) | ---- | High, Moderate | ---- | ---- | (**) | KNN |
| 23 [55] | MCAR, MNAR | ---- | Moderate | Moderate, Low | Observational | Categorical | RF, PCA, KNN |
| 24 [56] | MAR | Arbitrary | Moderate | Low | Clinical Trial | Numeric | MMI, FMI |
| 25 [57] | MAR | Arbitrary | Moderate, Low | High | Observational | (**) | GAN |
| 26 [58] | MAR | Arbitrary | (****) | ---- | Observational | (**) | PCA, SVM & FCM |
| 27 [59] | MNAR | Univariate | High | ---- | Observational | Numeric | Ensemble Learning |
| 28 [60] | MAR, MNAR | Univariate | Moderate | ---- | Clinical Trial | (**) | MICE |
| 29 [61] | MCAR | Arbitrary | Moderate, Low | ---- | Clinical Trial | Numeric | RM |
| 30 [62] | ---- | Multivariate | (****) | ---- | ---- | Numeric | SSA |
| 31 [63] | MCAR | Multivariate | Moderate | Moderate | Observational | (**) | PCA |
| 32 [64] | MAR | ---- | Moderate | High | ---- | (**) | PCA, FA |
| 33 [60] | ---- | Arbitrary | (****) | Low | ---- | ---- | IB-CI |
| 34 [66] | MCAR | ---- | High, Moderate | High, Low | ---- | (**) | MGP |
| 35 [67] | MCAR | ---- | High, Moderate | Moderate | Observational | Numeric | DMU |
| 36 [68] | ---- | ---- | (****) | ---- | ---- | (**) | AE |
| 37 [69] | ---- | ---- | Low | High, Moderate, Low | ---- | ---- | MICE |
| 38 [70] | ---- | ---- | (****) | High, Low | ---- | (**) | LR-A |
| 39 [71] | ---- | ---- | (****) | ---- | ---- | Categorical | GNN |
| 40 [72] | ---- | ---- | High, Moderate | ---- | ---- | ---- | GNN |
| 41 [73] | (***) | Univariate | High, Moderate | ---- | ---- | (**) | MCMC |
| 42 [74] | MAR | Multivariate | Moderate | Moderate | Observational | Numeric | JMI, CMI |
| 43 [75] | MAR | Multivariate | Moderate, Low | ---- | ---- | (**) | Interpolation |
| 44 [76] | (***) | ---- | High, Moderate | ---- | ---- | Numeric | MCMC, MICE, EM |
| 45 [79] | MCAR | Univariate | High, Moderate | | Observational | Numeric | JMI, CMI |
| 46 [78] | MNAR | Arbitrary | (****) | Moderate | Observational | Numeric | AE |
| 47 [77] | MNAR | Arbitrary | High, Moderate | High | Observational | (**) | BGMM |
| 48 [80] | ---- | ---- | Moderate | | Observational | Numeric | AE |
| 49 [81] | MCAR, MAR | ---- | Moderate | High | Observational | (**) | JM-MI, FCMI |
| 50 [82] | MCAR, MAR | ---- | Moderate, Low | ---- | Observational | Categorical | MICE |
| 51 [83] | ---- | ---- | Moderate | ---- | ---- | ---- | CLUSTIMP |
| 52 [85] | MAR, MNAR | Univariate, Multivariate | Moderate, Low | ---- | ---- | (**) | Matrix Completion Methods |
| 53 [86] | MNAR | Univariate | Moderate | ---- | ---- | Numeric | BCMI |
| 54 [87] | MAR, MNAR | ---- | High, Moderate | ---- | ---- | Numeric | SVM |
| 55 [88] | ---- | ---- | High | Low | Observational | (**) | AE |
| 56 [90] | MCAR, MAR | Univariate | Moderate, Low | High | ---- | (**) | CART, BART |
| 57 [89] | MAR | ---- | High | ---- | ---- | (**) | BNP |
| 58 [91] | MCAR, MAR | Arbitrary | Moderate | ---- | Clinical Trial | (**) | MISI |

Predictive Mean Matching (PMM), regression model (RM), Rough Set Theory (RST), Artificial Neural Network (ANN), Extremely Randomized Trees (Extra Trees), Support Vector Machine (SVM), Decision Tree (DT), K Nearest Neighbor (KNN), Naïve Bayesian (NB), Logistic Regression (Logit), Random Forest (RF), Multiple Imputation by Chain Equation (MICE), Generative Adversarial Network (GAN), Sequential Classification and Regression Tree (CART), Principal Component Analysis (PCA), Multilevel Multiple Imputation (MMI), Fixed-Effect Multiple Imputation (FMI), Fixed-Effect Multiple Imputation (FMH), Fuzzy C-Means (FCM) Clustering Method, Salp Swarm Algorithm (SSA), Factorial Analysis (FA), Instance-Based Cluster Imputation (IB-CI), Missing Gaussian Processes (MGP), Dynamic Model Updating (DMU), Auto Encoder (AE), Low-Rank Approximation–Based Imputation (LR-A), Markov Chain Monte Carlo (MCMC), Joint Multiple Imputation (JMI) and Conditional Multiple Imputation (CMI), Expectation-Maximization (EM), Bayesian Gaussian Mixture Models (BGMM), Joint Modelling Multiple Imputation (JM-MI), Full Conditional Multiple Imputation (FCMI), Cluster-Based Imputation Method (CLUSTIMP), Bias-Corrected Multiple Imputation (BCMI), Bayesian Additive Regression Trees (BART), Bayesian Non-Parametric (BNP) causal model, Multilevel and Stratified Imputation (MISI) Approach

----: The desired assumption is not considered in the proposed study

(*): Independent variables and outcome

(**): Numeric & Categorical

(***): MCAR, MAR, MNAR

(****): High, Moderate, Low

In addition to the items presented in Table 1, three further factors (the role of missing values in variables, the distribution of variables, and the correlation between variables) were also considered essential in selecting the appropriate imputation method based on the characteristics of the analyzed dataset. The bracketed numbers in this paragraph refer to the study numbers in Table 1. The findings regarding the role of missing values in variables indicate that, in 32 studies [1, 3, 6, 7, 9, 10, 12, 15, 19, 21, 22, 25, 26, 30–34, 37, 39, 42, 44–46, 48, 49, 52–57], these values served as predictors; in 3 studies [2, 4, 24], they were regarded as outcomes; and in 23 studies [5, 8, 11, 13, 14, 16–18, 20, 23, 27–29, 35, 36, 38, 40, 41, 43, 47, 50, 51, 58], they were recognized as both predictors and outcomes. Regarding the distribution of variables, 34 studies [5, 8, 10–15, 17–20, 22–25, 28, 29, 34, 35, 37, 39, 43, 44, 46, 47, 49–51, 53–56, 58] exhibited a multivariate normal distribution, while 4 studies [27, 36, 48, 57] showed a skewed distribution. The distribution of variables was not specified in the remaining studies. For the factor of correlation between variables, 6 studies [2, 9, 10, 15, 21, 37] reported high correlation, while 10 studies [13, 14, 17, 20, 22, 27, 30, 31, 34, 35] indicated moderate correlation. Low correlation was noted in 7 studies [24, 29, 32, 45, 48, 49, 51]. This factor was not examined in the other studies listed in Table 1.

As shown in Table 1, each row represents a distinct study. For each of these studies, items have been extracted as fundamental concepts that form the proposed conceptual framework. The items in the first three columns are derived from the characteristics of the missing values, specifically the mechanism, pattern, and ratio of missingness. The items in columns four through nine are determined based on the features of the dataset analyzed in each study. For instance, in Study 15, the missing values exhibit a Missing at Random (MAR) mechanism, an arbitrary pattern, and a moderate ratio of missingness. The dataset analyzed in this study is large and comes from an observational study. The data types are both numerical and categorical, with missing values occurring in the predictors. The variables follow a multivariate normal distribution, and there is a high degree of correlation among them. Based on these essential assumptions extracted from the study, the appropriate method for imputing the missing values is a learning-based approach, specifically the Random Forest algorithm. As another example from this table, consider Study 46. The assumptions made in this study indicate that the dataset analyzed has missing values characterized by a Missing Not at Random (MNAR) mechanism, an arbitrary pattern, and varying ratios of missingness: high, moderate, and low. The dataset size is moderate, and it also comes from an observational study. The data types are numerical, with missing values occurring in the predictors. The variables are described as following a multivariate normal distribution. Given these assumptions, the imputation method proposed for the missing values in this study is a learning-based method, specifically the Autoencoder algorithm.
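Reading Table 1 as a lookup can be sketched in code. The example below transcribes only the two rows discussed above (Studies 15 and 46), with the legend symbols "(**)" and "(****)" expanded into their listed values. The data layout and the function `candidate_methods` are hypothetical, and exact matching on every field is an assumption about how the table is meant to be read.

```python
# Two rows transcribed from Table 1 (RF = Random Forest, AE = Auto Encoder).
FRAMEWORK_ROWS = [
    {
        "study": 15,
        "mechanism": {"MAR"},
        "pattern": {"Arbitrary"},
        "ratio": {"Moderate"},
        "sample_size": {"High"},
        "design": {"Observational"},
        "data_type": {"Numeric", "Categorical"},
        "methods": ["RF"],
    },
    {
        "study": 46,
        "mechanism": {"MNAR"},
        "pattern": {"Arbitrary"},
        "ratio": {"High", "Moderate", "Low"},
        "sample_size": {"Moderate"},
        "design": {"Observational"},
        "data_type": {"Numeric"},
        "methods": ["AE"],
    },
]


def candidate_methods(profile, rows=FRAMEWORK_ROWS):
    """Return methods from rows whose recorded values cover every profile entry."""
    hits = []
    for row in rows:
        if all(value in row[key] for key, value in profile.items()):
            hits.extend(row["methods"])
    return hits


# Profile of a hypothetical dataset to be imputed.
profile = {
    "mechanism": "MNAR",
    "pattern": "Arbitrary",
    "ratio": "Low",
    "sample_size": "Moderate",
    "design": "Observational",
    "data_type": "Numeric",
}
suggested = candidate_methods(profile)
```

A partial profile, for example only `{"pattern": "Arbitrary"}`, returns the methods of every row consistent with it, which shows why the remaining characteristics are needed to narrow the choice.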

Table 1 presents the output from the secondary analysis conducted in this study. Through this analysis, we were able to identify all the foundational concepts of the proposed conceptual framework. To achieve the objective of the conceptual framework, which is to determine the appropriate imputation method based on the established assumptions, we also extracted the imputation methods implemented in each study. Consequently, to encompass all concepts related to the conceptual framework, we considered various imputation methods as the final key concepts within the structure of this framework.

The taxonomy of concepts illustrated in Fig. 3 depicts the fundamental components that comprise the structure of the proposed conceptual framework. This hierarchical arrangement provides a detailed overview of the categories of primary and final key concepts and their subordinate elements as a graph.

Fig. 3. Taxonomic representation of primary concepts, including their subordinate elements and the final key concept within the conceptual framework

As previously mentioned, because of the large number of primary concepts and their subordinate elements, this study presents the conceptual framework as a table. To give a graphical view without compromising the integrity of the designed framework, Figs. 4 and 5 show sample portions of it for two concepts: the mechanism and the pattern of missing values. Fig. 4 covers the mechanism aspect only and highlights the relationships between the mechanism’s sub-components and the various imputation methods. Fig. 5 covers the pattern aspect and shows the connections between the pattern’s sub-components and the different types of imputation methods.

Fig. 4. Connections between the mechanism aspect and various imputation methods within the conceptual framework

Fig. 5. Connections between the pattern aspect and various imputation methods within the conceptual framework

Given the broad nature of the designed framework, Figs. 4 and 5 illustrate parts of the framework related to the concepts of mechanism and pattern. Due to the numerous relationships among the concepts, we initially considered all studies that shared common characteristics for the concept of mechanism within a single scenario. Then, based on Table 1 and the findings from the reanalysis, we identified the imputation methods for each study within each scenario. Thus, Fig. 4 represents a portion of the designed framework for the concept of mechanism, highlighting the final key concept: the imputation method. Similarly, Fig. 5 demonstrates the implementation of this process for the concept of pattern.
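The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to draw Figs. 4 and 5: it uses the two rows of Table 1 that are spelled out in the text (Studies 15 and 46), and the dictionary keys and the function name `methods_by_concept` are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Two rows transcribed from the worked examples in the text (Studies 15 and 46).
# The dictionary keys are hypothetical names chosen for this sketch.
TABLE_ROWS = [
    {"study": 15, "mechanism": "MAR", "pattern": "arbitrary",
     "method": "Random Forest"},
    {"study": 46, "mechanism": "MNAR", "pattern": "arbitrary",
     "method": "Autoencoder"},
]


def methods_by_concept(rows, concept):
    """Group studies into scenarios by one concept and collect their methods."""
    scenarios = defaultdict(set)
    for row in rows:
        scenarios[row[concept]].add(row["method"])
    # Sort the method names so the result is deterministic.
    return {value: sorted(methods) for value, methods in scenarios.items()}


# Each key is a scenario; each value lists the methods linked to it.
by_mechanism = methods_by_concept(TABLE_ROWS, "mechanism")
by_pattern = methods_by_concept(TABLE_ROWS, "pattern")
```

With the full table as input, each key/value pair would correspond to one set of connections between a sub-component and the imputation methods, under the assumption that every study contributes exactly one method.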

Discussions

The issue of missing data was first introduced as a mathematical problem, and it gradually made its way into the field of medicine and healthcare. Managing missing data is crucial to obtaining unbiased and reliable analytical results for making accurate medical decisions. Nowadays, the use of imputation methods is considered one of the approaches to deal with missing values, allowing for the analysis to be conducted on a complete dataset. Proper handling of missing data is essential to ensure the validity and reliability of analytical findings, which in turn supports informed decision-making in the healthcare domain. By addressing the challenge of missing data, researchers and practitioners can enhance the quality and trustworthiness of the data-driven insights, ultimately leading to improved patient outcomes and more effective healthcare interventions.

Data imputation methods fall into several families, including traditional statistical methods and learning-based methods. A crucial consideration when using any of them is understanding the structure and characteristics of the missing values in a clinical dataset. Failure to account for the underlying assumptions of a particular imputation method can diminish the power of statistical tests and introduce bias, ultimately reducing the reliability of the results. Key factors to consider when selecting an appropriate imputation method include the mechanism of the missing values, the pattern of missing values, the proportion of missing values, the study design, and the dataset properties (sample size, data type, role of missing values in variables, distribution of variables, and correlation between variables). Careful consideration of these aspects is essential to choosing the right imputation method and ensuring the validity and trustworthiness of the analytical outcomes, supporting informed decision-making in the healthcare domain.

The imputation of missing data is a critical component of the data preprocessing phase, as it directly impacts the quality and reliability of the subsequent analytical processes. In practice, the process follows a sequence of steps: ensure the data are coded correctly; identify the missing data within each variable (locating the missing values in the dataset reveals their pattern); investigate the mechanism of missingness; check for associations between the missingness and the observed data; and decide how to handle the missing data. By addressing the challenge of missing values through appropriate imputation methods, researchers and practitioners can ensure the completeness of the dataset, reduce potential biases, enhance the statistical power and validity of the analytical findings, and facilitate the application of advanced data analysis methods.
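The exploratory steps listed above can be illustrated with a minimal Python sketch that uses only the standard library. The toy records, the variable names (`age`, `sbp`, `hba1c`), and the three helper functions are all hypothetical. The last helper is a simple descriptive check, not a formal test of the missingness mechanism.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 50, "sbp": 120, "hba1c": 5.6},
    {"age": 61, "sbp": None, "hba1c": 6.1},
    {"age": 47, "sbp": 118, "hba1c": None},
    {"age": 72, "sbp": None, "hba1c": None},
]


def missing_ratio(rows):
    """Share of missing entries per variable."""
    variables = rows[0].keys()
    return {v: sum(r[v] is None for r in rows) / len(rows) for v in variables}


def missing_patterns(rows):
    """Count each distinct row-wise missingness pattern (True = missing)."""
    variables = list(rows[0].keys())
    return Counter(tuple(r[v] is None for v in variables) for r in rows)


def observed_mean_by_missingness(rows, target, observed):
    """Mean of `observed` among rows where `target` is missing vs. present."""
    miss = [r[observed] for r in rows
            if r[target] is None and r[observed] is not None]
    pres = [r[observed] for r in rows
            if r[target] is not None and r[observed] is not None]
    return {"target_missing": mean(miss) if miss else None,
            "target_present": mean(pres) if pres else None}


ratios = missing_ratio(records)        # ratio of missingness per variable
patterns = missing_patterns(records)   # which variables are missing together
age_split = observed_mean_by_missingness(records, "sbp", "age")
```

A clear difference between the two means in `age_split` suggests that missingness in `sbp` is related to the observed `age` values, which argues against a completely random mechanism. It cannot by itself distinguish MAR from MNAR, because MNAR depends on the unobserved values.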

This study presents a conceptual framework as an integrated structure of the key underlying assumptions to be considered when employing various data imputation methods. The proposed framework aims to guide researchers and practitioners in the healthcare domain on the critical factors to be taken into account when choosing the most suitable imputation method for their specific dataset and research objectives. This approach helps to ensure the validity and reliability of the analytical outcomes, ultimately supporting informed decision-making in the healthcare context.

In our proposed conceptual framework, we explicitly referenced the importance of checking imputation assumptions as part of the sensitivity estimator in the estimand framework [22]. This step is integral to ensuring that the estimate accurately reflects the most appropriate imputation method and supports robust decision-making in the clinical context.

The conceptual framework developed in this study is based on a secondary analysis of our previous systematic review. Initially, the essential concepts that constitute this framework were identified and categorized. Next, the relationships between each of these concepts were established. Finally, the proposed framework was implemented to provide the most suitable data imputation method, taking into account the fundamental assumptions of each imputation approach. This structured approach helps researchers and practitioners make informed decisions when addressing the challenge of missing values in their datasets.

Our previous systematic review serves as a fundamental prerequisite for the current study, because designing a conceptual framework first requires a comprehensive review of the literature and theories related to the research question. This process allows the key concepts and the relationships between them to be identified. The results of the systematic review therefore provided essential input for the design of our conceptual framework. Specifically, the key influencing factors, which form the foundational concepts of the framework and are a crucial prerequisite for selecting the appropriate imputation method, were derived from our systematic review. The framework is built on studies that state the essential assumptions for selecting appropriate imputation methods and that evaluate the imputation techniques through simulation. Using these simulation studies as robust and validated evidence in the design supports the credibility and reliability of the framework. Each study presented in Table 1 can serve as a practical example of the work that has been conducted using our framework. Additionally, the studies referenced in [16] and [104] provide case studies that specifically design frameworks for observational studies and clinical trials [15]. These frameworks have been successfully implemented in real-world settings, demonstrating their applicability and effectiveness.

However, it cannot be claimed that the proposed framework is a comprehensive conceptual framework, as not all the primary concepts may have been identified and incorporated into the design and implementation structure. Therefore, it is recommended to conduct a systematic review to identify all the existing methods, which would help to further strengthen and expand the proposed framework, ensuring a more robust and inclusive tool for addressing the challenge of missing values in healthcare-related datasets and research endeavors.

In this study, we have designed and implemented only the foundation of a conceptual framework. This framework serves as a backbone throughout the imputation process for missing values. In the future, as more advanced imputation methods tailored to the characteristics of the missing values are introduced, the framework can be re-evaluated and modified. This iterative process keeps the proposed framework flexible and adaptable to emerging insights, and through it the framework can develop into a comprehensive conceptual framework for selecting the most suitable data imputation method. Moreover, we plan to develop a web application that allows researchers to input key parameters, including the missing data mechanism, patterns of missingness, the ratio of missingness, data types, and other critical assumptions. This application will provide users with the most appropriate imputation methods, thereby improving the practical application of our conceptual framework.
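As a rough illustration of how such a parameter-based lookup might work, the sketch below ranks candidate methods by how many of a user's stated assumptions match a row of the framework. It is not the planned application: the two rows are taken from the worked examples for Studies 15 and 46, while the field names, the function `suggest_methods`, and the simple match-count scoring rule are assumptions made for this example.

```python
# Two rows taken from the worked examples in the text (Studies 15 and 46).
# Field names and the scoring rule are assumptions made for this sketch.
FRAMEWORK_ROWS = [
    {"study": 15, "mechanism": "MAR", "pattern": "arbitrary",
     "sample_size": "large", "data_type": "mixed",
     "method": "Random Forest"},
    {"study": 46, "mechanism": "MNAR", "pattern": "arbitrary",
     "sample_size": "moderate", "data_type": "numerical",
     "method": "Autoencoder"},
]


def suggest_methods(query, rows=FRAMEWORK_ROWS):
    """Rank methods by how many of the user's stated assumptions a row shares."""
    scored = []
    for row in rows:
        score = sum(row.get(key) == value for key, value in query.items())
        if score > 0:
            scored.append((score, row["method"], row["study"]))
    # Highest match count first; ties are broken by study number.
    scored.sort(key=lambda item: (-item[0], item[2]))
    return [(method, study, score) for score, method, study in scored]


suggestions = suggest_methods(
    {"mechanism": "MNAR", "pattern": "arbitrary", "data_type": "numerical"}
)
```

Each suggestion carries the supporting study number and the match count, so a user can see which evidence a recommendation rests on and how closely the assumptions agree.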

Conclusions

This study presents a conceptual framework, based on scientific evidence, for selecting the most suitable imputation methods to address missing values in clinical datasets. The imputation methods recommended by this framework are expected to yield valid and reliable outcomes during the data preprocessing stage of modeling. The developed framework will streamline data preprocessing and support robust clinical decision-making. Additionally, it aims to assist researchers in systematically considering missing data and transparently reporting its potential impact on study results, ultimately enhancing confidence in, and the reproducibility of, research findings.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (631.5KB, pdf)
Supplementary Material 2 (162.6KB, pdf)

Acknowledgements

Not applicable.

Author contributions

M.A. participated in conceptualizing the study and designing its framework. She organized the relevant studies and extracted concepts related to the study’s objectives. Additionally, she conducted the analysis and interpretation of the findings, conceptualized visualizations for various components of the developed framework, and drafted and revised the manuscript. D.T.D. edited the manuscript. M.M. contributed to the review of proposed studies and edited the manuscript. H.T. contributed to the development of the methodology and conceptualization. He provided feedback on the manuscript draft and offered critical insights for interpreting the results. Additionally, he reviewed and revised the manuscript for significant intellectual content and supervised the project. All authors read and approved the final version of the manuscript and agree to be accountable for all aspects of the work.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data availability

No datasets were generated or analysed during the current study.

Declarations

Ethics approval and consent to participate

This study was assessed by the research council of Mashhad University of Medical Sciences (Reference Number: IR.MUMS.REC.1402.069). The study was approved because no identifying data have been reported.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338.
  • 2. Howe CJ, Cole SR, Lau B, Napravnik S, Eron JJ Jr. Selection bias due to loss to follow up in cohort studies. Epidemiology. 2016;27(1):91–7.
  • 3. Kahn MG, Callahan TJ, Barnard J, Bauck AE, Brown J, Davidson BN, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. Egems. 2016;4(1).
  • 4. Little RJ, D’agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012;367(14):1355–60.
  • 5. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144–51.
  • 6. Dong Y, Peng C-YJ. Principled missing data methods for researchers. SpringerPlus. 2013;2:1–17.
  • 7. Little RJ, Rubin DB. Statistical analysis with missing data. Wiley; 2019.
  • 8. Bennett DA. How can I deal with missing data in my study? Aust N Z J Public Health. 2001;25(5):464–9.
  • 9. Newman DA. Missing data: five practical guidelines. Organizational Res Methods. 2014;17(4):372–411.
  • 10. Wickham H, Wickham H. Data analysis. Springer; 2016.
  • 11. Harrell FE. Regression modeling strategies. R package version. 2012:6.2-0.
  • 12. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
  • 13. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods. 2002;7(2):147.
  • 14. Friedman LM, Furberg C, DeMets DL, Reboussin DM, Granger CB. Fundamentals of clinical trials. Springer; 2010.
  • 15. Staudt A, Freyer-Adam J, Ittermann T, Meyer C, Bischof G, John U, Baumann S. Sensitivity analyses for data missing at random versus missing not at random using latent growth modelling: a practical guide for randomised controlled trials. BMC Med Res Methodol. 2022;22(1):250.
  • 16. Lee KJ, Tilling KM, Cornish RP, Little RJ, Bell ML, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: the treatment and reporting of missing data in observational studies framework. J Clin Epidemiol. 2021;134:79–88.
  • 17. Rothman K. Modern epidemiology. Lippincott Williams & Wilkins; 2008.
  • 18. Hernan M, Robins J. Causal inference: what if. Boca Raton: Chapman & Hall/CRC; 2020.
  • 19. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. Wiley; 2012.
  • 20. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. Wiley; 2014.
  • 21. Kleinbaum DG, Klein M. Survival analysis: a self-learning text. Springer; 1996.
  • 22. International Council for Harmonisation. ICH Harmonised Guideline E9 (R1): estimands and sensitivity analysis in clinical trials. International Council for Harmonisation; 2017.
  • 23. Goldberg SB, Bolt DM, Davidson RJ. Data missing not at random in mobile health research: Assessment of the problem and a case for sensitivity analyses. J Med Internet Res. 2021;23(6):e26749.
  • 24. Fairclough DL. Design and analysis of quality of life studies in clinical trials. Chapman and Hall/CRC; 2010.
  • 25. Demirtas H. Flexible imputation of missing data. J Stat Softw. 2018;85:1–5.
  • 26. Fielding S, Fayers PM, Ramsay CR. Investigating the missing data mechanism in quality of life outcomes: a comparison of approaches. Health Qual Life Outcomes. 2009;7:1–10.
  • 27. Zhang Z. Missing data exploration: highlighting graphical presentation of missing pattern. Annals Translational Med. 2015;3(22).
  • 28. Batra S, Khurana R, Khan MZ, Boulila W, Koubaa A, Srivastava P. A pragmatic ensemble strategy for missing values imputation in health records. Entropy. 2022;24(4):533.
  • 29. Farrow R, Iniesto F, Weller M, Pitt R, Algers A, Bozkurt A, et al. GO-GN guide to conceptual frameworks. 2021.
  • 30. Luft JA, Jeong S, Idsardi R, Gardner G. Literature reviews, theoretical frameworks, and conceptual frameworks: an introduction for new biology education researchers. CBE Life Sci Educ. 2022;21(3):rm33.
  • 31. Anfara VA Jr, Mertz NT. Theoretical frameworks in qualitative research. Sage; 2014.
  • 32. Rocco TS, Plakhotnik MS. Literature reviews, conceptual frameworks, and theoretical frameworks: terms, functions, and distinctions. Hum Resour Dev Rev. 2009;8(1):120–30.
  • 33. Jabareen Y. Building a conceptual framework: philosophy, definitions, and procedure. Int J Qualitative Methods. 2009;8(4):49–62.
  • 34. Afkanpour M, Hosseinzadeh E, Tabesh H. Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review. BMC Med Res Methodol. 2024;24(1):188.
  • 35. Abassi RA, Msengwa AS. Classification of breast cancer recurrence based on imputed data: a simulation study. BioData Min. 2022;15(1):30.
  • 36. Ahmad A, Mohamed HH. The enhancement of linear regression algorithm in handling missing data for medical data set. 2006.
  • 37. Alabadla M, Sidi F, Ishak I, Ibrahim H, Affendey L, Hamdan H. ExtraImpute: a Novel Machine Learning Method for Missing Data Imputation. J Adv Inform Technol. 2022;13(5):470–6.
  • 38. Alade OA, Selamat A, Sallehuddin R. The effects of Missing Data characteristics on the choice of imputation techniques. Vietnam J Comput Sci. 2020;7(02):161–77.
  • 39. Algarni A, Ragab M, Alamri W, Mostafa SM. Towards improving Predictive Statistical Learning Model Accuracy by enhancing learning technique. Comput Syst Sci Eng. 2022;42(1):303–18.
  • 40. Almasinejad P, Golabpour A, Mollakhalili Meybodi MR, Mirzaie K, Khosravi A. A dynamic model for imputing missing medical data: a multiobjective particle swarm optimization algorithm. J Healthc Eng. 2021;2021(1):1203726.
  • 41. Alsaber A, Al-Herz A, Pan J, AL-Sultan AT, Mishra D, Group K. Handling missing data in a rheumatoid arthritis registry using random forest approach. Int J Rheum Dis. 2021;24(10):1282–93.
  • 42. Beaulieu-Jones BK, Lavage DR, Snyder JW, Moore JH, Pendergrass SA, Bauer CR. Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Med Inf. 2018;6(1):e8960.
  • 43. Beesley LJ, Taylor JM. Accounting for not-at-random missingness through imputation stacking. Stat Med. 2021;40(27):6118–32.
  • 44. Bernardini M, Doinychko A, Romeo L, Frontoni E, Amini M-R. A novel missing data imputation approach based on clinical conditional generative adversarial networks applied to EHR datasets. Comput Biol Med. 2023;163:107188.
  • 45. Burgette LF, Reiter JP. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol. 2010;172(9):1070–6.
  • 46. Carreras G, Miccinesi G, Wilcock A, Preston N, Nieboer D, Deliens L, et al. Missing not at random in end of life care studies: multiple imputation and sensitivity analysis on data from the ACTION study. BMC Med Res Methodol. 2021;21:1–12.
  • 47. Casiraghi E, Wong R, Hall M, Coleman B, Notaro M, Evans MD, et al. A method for comparing multiple imputation techniques: a case study on the US national COVID cohort collaborative. J Biomed Inform. 2023;139:104295.
  • 48. Chen J, Hunter S, Kisfalvi K, Lirio RA. A hybrid approach of handling missing data under different missing data mechanisms: VISIBLE 1 and VARSITY trials for ulcerative colitis. Contemp Clin Trials. 2021;100:106226.
  • 49. Cheng C-H, Chang J-R, Huang H-H. A novel weighted distance threshold method for handling medical missing values. Comput Biol Med. 2020;122:103824.
  • 50. Cheng C-H, Huang S-F. A novel clustering-based purity and distance imputation for handling medical data with missing values. Soft Comput. 2021;25(17):11781–801.
  • 51. Choi YJ, Nam CM, Kwak MJ. Multiple imputation technique applied to appropriateness ratings in cataract surgery. Yonsei Med J. 2004;45(5):829–37.
  • 52. Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epidemiol. 2003;56(1):28–37.
  • 53. Cleophas EP, Cleophas TJ. Clinical research: a novel approach to regression substitution for handling missing data. Am J Ther. 2013;20(5):514–9.
  • 54. Curioso I, Santos R, Ribeiro B, Carreiro A, Coelho P, Fragata J, Gamboa H. Addressing the curse of missing data in clinical contexts: a novel approach to correlation-based imputation. J King Saud University-Computer Inform Sci. 2023;35(6):101562.
  • 55. Dekermanjian JP, Shaddox E, Nandy D, Ghosh D, Kechris K. Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinformatics. 2022;23(1):179.
  • 56. DiazOrdaz K, Kenward M, Gomes M, Grieve R. Multiple imputation methods for bivariate outcomes in cluster randomised trials. Stat Med. 2016;35(20):3482–96.
  • 57. Dong W, Fong DYT, Yoon J-s, Wan EYF, Bedford LE, Tang EHM, Lam CLK. Generative adversarial networks for imputing missing data for big data clinical research. BMC Med Res Methodol. 2021;21:1–10.
  • 58. Dzulkalnine MF, Sallehuddin R. Missing data imputation with fuzzy feature selection for diabetes dataset. SN Appl Sci. 2019;1(4):362.
  • 59. Ferri P, Romero-Garcia N, Badenes R, Lora-Pablos D, Morales TG, De La Camara AG, et al. Extremely missing numerical data in Electronic Health records for machine learning can be managed through simple imputation methods considering informative missingness: a comparative of solutions in a COVID-19 mortality case study. Comput Methods Programs Biomed. 2023;242:107803.
  • 60. Galimard JE, Chevret S, Protopopescu C, Resche-Rigon M. A multiple imputation approach for MNAR mechanisms compatible with Heckman’s model. Stat Med. 2016;35(17):2907–20.
  • 61. Haliduola HN, Bretz F, Mansmann U. Missing data imputation using utility-based regression and sampling approaches. Comput Methods Programs Biomed. 2022;226:107172.
  • 62. Hassan GS, Ali NJ, Abdulsahib AK, Mohammed FJ, Gheni HM. A missing data imputation method based on salp swarm algorithm for diabetes disease. Bull Electr Eng Inf. 2023;12(3):1700–10.
  • 63. Hegde H, Shimpi N, Panny A, Glurich I, Christie P, Acharya A. MICE vs PPCA: missing data imputation in healthcare. Inf Med Unlocked. 2019;17:100275.
  • 64. Husson F, Josse J, Narasimhan B, Robin G. Imputation of mixed data with multilevel singular value decomposition. J Comput Graphical Stat. 2019;28(3):552–66.
  • 65. Ilango P, Vijayakumar K, Rajasekhara Babu M. Instance driven clustering for the imputation of missing data in KDD. Int J Communication Networks Distrib Syst. 2014;12(1):69–81.
  • 66. Jafrasteh B, Hernández-Lobato D, Lubián-López SP, Benavente-Fernández I. Gaussian processes for missing value imputation. Knowl Based Syst. 2023;273:110603.
  • 67. Jain R, Xu W. Dynamic model updating (DMU) approach for statistical learning model building with missing data. BMC Bioinformatics. 2021;22(1):221.
  • 68. Jolani S. Hierarchical imputation of systematically and sporadically missing data: an approximate bayesian approach using chained equations. Biom J. 2018;60(2):333–51.
  • 69. Kabir S, Farrokhvar L. Non-linear missing data imputation for healthcare data via index-aware autoencoders. Health Care Manag Sci. 2022;25(3):484–97.
  • 70. Khan SI, Hoque ASML. SICE: an improved missing data imputation technique. J big Data. 2020;7(1):37.
  • 71. Kim K-H, Kim K-J. Missing-data handling methods for lifelogs-based wellness index estimation: comparative analysis with panel data. JMIR Med Inf. 2020;8(12):e20597.
  • 72. Kuppusamy V, Paramasivam I. Integrating WLI fuzzy clustering with grey neural network for missing data imputation. Int J Intell Enterp. 2017;4(1–2):103–27.
  • 73. Kuppusamy V, Paramasivam I. Grey fuzzy neural network-based hybrid model for Missing Data Imputation in mixed database. Int J Intell Eng Syst. 2017;10(2).
  • 74. Lee JH, Huber JC Jr. Evaluation of multiple imputation with large proportions of missing data: how much is too much? Iran J Public Health. 2021;50(7):1372.
  • 75. Ma Y, Zhang W, Lyman S, Huang Y. The HCUP SID imputation project: improving statistical inferences for health disparities research by imputing missing race data. Health Serv Res. 2018;53(3):1870–89.
  • 76. Miao S-d, Li S-q, Zheng X-y, Wang R-t, Li J, Ding S-s, Ma J-f. Missing Data Interpolation of Alzheimer’s Disease Based on Column-by-column mixed Mode. Complexity. 2021;2021(1):3541516.
  • 77. Nadimi-Shahraki MH, Mohammadi S, Zamani H, Gandomi M, Gandomi AH. A hybrid imputation method for multi-pattern missing data: a case study on type II diabetes diagnosis. Electronics. 2021;10(24):3167.
  • 78. Nijman SWJ, Groenhof TKJ, Hoogland J, Bots ML, Brandjes M, Jacobs JJ, et al. Real-time imputation of missing predictor values improved the application of prediction models in daily practice. J Clin Epidemiol. 2021;134:22–34.
  • 79. Pereira RC, Abreu PH, Rodrigues PP. Partial multiple imputation with variational autoencoders: tackling not at randomness in healthcare data. IEEE J Biomedical Health Inf. 2022;26(8):4218–27.
  • 80. Pezoulas VC, Tachos NS, Olivotto I, Barlocco F, Fotiadis DI, editors. A smart Imputation Approach for Effective Quality Control Across Complex Clinical Data Structures. 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC); 2022: IEEE.
  • 81. Phung S, Kumar A, Kim J, editors. A deep learning technique for imputing missing healthcare data. 2019 41st annual international conference of the IEEE engineering in medicine and biology society (EMBC); 2019: IEEE.
  • 82. Quartagno M, Carpenter JR. Multiple imputation for discrete data: evaluation of the joint latent normal model. Biom J. 2019;61(4):1003–19.
  • 83. Rani P, Kumar R, Jain A. HIOC: a hybrid imputation method to predict missing values in medical datasets. Int J Intell Comput Cybernetics. 2021;14(4):598–616.
  • 84. Setiawan NA, Venkatachalam P, Ahmad Fadzil M. A knowledge discovery from incomplete coronary artery disease datasets using rough set. Int J Med Eng Inf. 2011;3(1):60–77.
  • 85. Shobha K, Savarimuthu N. Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data. J Ambient Intell Humaniz Comput. 2021;12(2):1771–81.
  • 86. Sportisse A, Boyer C, Josse J. Imputation and low-rank estimation with missing not at random data. Stat Comput. 2020;30(6):1629–43.
  • 87. Tomita H, Fujisawa H, Henmi M. A bias-corrected estimator in multiple imputation for missing data. Stat Med. 2018;37(23):3373–86.
  • 88. Wang G, Lu J, Choi K-S, Zhang G. A transfer-based additive LS-SVM classifier for handling missing data. IEEE Trans Cybernetics. 2018;50(2):739–52.
  • 89. Xu D, Daniels MJ, Winterstein AG. Sequential BART for imputation of missing covariates. Biostatistics. 2016;17(3):589–602.
  • 90. Xu D, Hu PJ-H, Huang T-S, Fang X, Hsu C-C. A deep learning–based, unsupervised method to impute missing values in electronic health records for improved patient management. J Biomed Inform. 2020;111:103576.
  • 91. Zang H, Kim HJ, Huang B, Szczesniak R. Bayesian causal inference for observational studies with missingness in covariates and outcomes. Biometrics. 2023;79(4):3624–36.
  • 92. Jamshidian M, Yuan K-H. Data-driven sensitivity analysis to detect missing data mechanism with applications to structural equation modelling. J Stat Comput Simul. 2013;83(7):1344–62.
  • 93. Jamshidian M, Yuan KH. Examining missing data mechanisms via homogeneity of parameters, homogeneity of distributions, and multivariate normality. Wiley Interdisciplinary Reviews: Comput Stat. 2014;6(1):56–73.
  • 94. Carpenter JR, Kenward MG. Missing data in randomised controlled trials: a practical guide. Health Technology Assessment Methodology Programme; 2007.
  • 95. Jamshidian M, Jalal S. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. Psychometrika. 2010;75(4):649–74.
  • 96. Kim KH, Bentler PM. Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika. 2002;67:609–23.
  • 97. Jamshidian M, Schott JR. Testing equality of covariance matrices when data are incomplete. Comput Stat Data Anal. 2007;51(9):4227–39.
  • 98. Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis. J Royal Stat Soc Ser C: Appl Stat. 1994;43(1):49–73.
  • 99. Hawkins DM. A new test for multivariate normality and homoscedasticity. Technometrics. 1981;23(1):105–10.
  • 100. Moorthy K, Saberi Mohamad M, Deris S. A review on missing value imputation algorithms for microarray gene expression data. Curr Bioinform. 2014;9(1):18–22.
  • 101. Sun Y, Li J, Xu Y, Zhang T, Wang X. Deep learning versus conventional methods for missing data imputation: a review and comparative study. Expert Syst Appl. 2023;227:120201.
  • 102. Alabadla M, Sidi F, Ishak I, Ibrahim H, Affendey LS, Ani ZC, et al. Systematic review of using machine learning in imputing missing values. IEEE Access. 2022;10:44483–502.
  • 103. Liu M, Li S, Yuan H, Ong MEH, Ning Y, Xie F, et al. Handling missing values in healthcare data: a systematic review of deep learning-based imputation techniques. Artif Intell Med. 2023;142:102587.
  • 104. Psychogyios K, Ilias L, Askounis D, editors. Comparison of missing data imputation methods using the Framingham heart study dataset. 2022 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI); 2022: IEEE.


Articles from BMC Medical Research Methodology are provided here courtesy of BMC
