PLOS One. 2025 Nov 20;20(11):e0335852. doi: 10.1371/journal.pone.0335852

Evaluation of imputation strategies for multi-centre studies: Application to a large clinical pathology dataset

Lucy Grigoroff 1,*, Reika Masuda 1, John Lindon 2, Janonna Kadyrov 1, Jeremy K Nicholson 1,2, Elaine Holmes 1,3, Julien Wist 1,3,4,*
Editor: Robin Haunschild
PMCID: PMC12633936  PMID: 41264609

Abstract

As part of a strategy for accommodating missing data in large heterogeneous datasets, two Random Forest-based (RF) imputation methods, missForest and MICE, were evaluated along with several strategies for navigating the inherently incomplete structure of the dataset. Background: A total of 3817 complete cases of clinical chemistry variables from a large-scale, multi-site preclinical longitudinal pathology study were used as an evaluation dataset. Three types of ‘missingness’ in various proportions were artificially introduced to compare imputation performance across different strategies, including variable inclusion and stratification. Results: MissForest was found to outperform MICE, being robust and capable of automatic variable selection. Stratification had minimal effect on missForest but severely deteriorated the performance of MICE. Conclusion: In general, storing and sharing datasets prior to any correction is good practice, so that imputation can be performed on merged data if necessary.

Introduction

Advances in high-throughput technologies have enabled the generation and accumulation of large-scale biological and chemical datasets that are increasingly applied in molecular epidemiology studies. Datasets encompassing diverse data types, including genome sequences, proteomic profiles, and compound libraries, can contribute to our understanding of complex biological phenomena, disease mechanisms, ecological community structures and drug discovery processes. However, missing values are common in most large-scale multiparametric studies, and this inherent incompleteness presents significant informatic and modelling challenges. Incompleteness results from diverse factors, including experimental limitations, data acquisition errors, or the sheer scale of data collection efforts, often carried out across multiple centres or over disjointed time periods. For example, in clinical studies, the lack of standardisation in diagnostic reporting and medical records may result in missing clinical annotations, particularly when data collection is dispersed over multiple centres. Further, the range of recorded clinical annotations depends on the research question, resulting in missing values when two or more studies are combined for analysis. Missing samples are also common in single-cohort studies; it is likely that some samples will not be collected or that one of the many experimental steps from sample preparation to analysis will fail, thus producing incomplete datasets. Whether from the experimental data or clinical annotations (metadata), missing values can introduce different types of bias in the subsequent analysis or prevent application of certain approaches [1,2].

The presence of missing values poses a critical impediment to downstream analyses, including statistical analysis, machine learning, and predictive modelling. Missing data can introduce biases, reduce statistical power, and compromise the validity of biological and chemical inferences. Therefore, the development of robust imputation strategies is paramount to address these gaps, ensuring the integrity and enhancing the value of the datasets. Effective imputation not only enhances the quality and completeness of the data but also facilitates more accurate and reliable analyses, leading to deeper insights into biological mechanisms and chemical interactions.

Given the complexity and heterogeneity of biological and chemical data, the design of imputation methods tailored for studies collected longitudinally in batches is desirable. As the volume and diversity of omics and chemical data continue to expand, there is a need for sophisticated imputation approaches that minimise reprocessing as new data are appended.

Several multivariate analysis techniques, including Principal Component Analysis (PCA) and Partial Least Squares-Discriminant Analysis (PLS-DA), require a dataset to be complete, i.e., they do not accept missing values. Furthermore, as entire rows (samples) and/or columns (variables) are excluded, the statistical power of the study is reduced [3–5], and valuable information is lost. To overcome this deletion, referred to as Complete Case Analysis (CCA), numerous imputation methods have been proposed that replace missing values with imputed ones.

There is no strong consensus as to which single imputation method is best or which performance metrics are most appropriate; the method adopted is highly dependent on the dataset, a priori knowledge and the causes behind missing values. If the reason for a data point being absent is independent of the data, it is referred to as Missing Completely At Random (MCAR). If the probability of a data point being absent depends on a known variable, it is referred to as Missing At Random (MAR). Where the probability of a point being absent is unexplained and does not fit the criteria for MCAR or MAR, the value is referred to as Missing Not At Random (MNAR). Some methods are more suitable for certain types of missingness, and some methods become ineffective once a certain proportion of missingness is reached [6–10]. Furthermore, not all methods are suitable for categorical and/or continuous variables. The choice of method is context dependent; sophisticated methods such as Parametric Multiple Imputation by Chained Equations (MICE) cannot account for variable interactions, non-linearities and non-normality without accurate and detailed specification of the hyperparameters or pre-imputation transformation of the data [7,11–20].
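The three mechanisms can be made concrete with a small sketch. The study generated missingness with the R package missMethods; the following is an illustrative Python equivalent (function names are our own), in which a fully observed driver column controls MAR and the value itself controls MNAR:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mcar(X, prop):
    """MCAR: every cell has the same probability of being removed."""
    X = X.copy()
    X[rng.random(X.shape) < prop] = np.nan
    return X

def make_mar(X, prop, driver_col=0, target_col=1):
    """MAR: removal in target_col depends on the fully observed driver_col."""
    X = X.copy()
    order = np.argsort(X[:, driver_col])
    n_miss = int(round(prop * X.shape[0]))
    X[order[-n_miss:], target_col] = np.nan  # drop rows with the largest driver values
    return X

def make_mnar(X, prop, target_col=1):
    """MNAR: removal depends on the value that goes missing."""
    X = X.copy()
    cutoff = np.quantile(X[:, target_col], 1 - prop)
    X[X[:, target_col] > cutoff, target_col] = np.nan
    return X
```

Each function leaves the input untouched and returns a copy, so the true values remain available for evaluating the imputation afterwards, as done in this study.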

Utilising a priori knowledge about covariates or experimental design is important to avoid introducing unnecessary bias into the imputed dataset, and consequently into future analysis [21]. This can be achieved in two ways: by appending this extra information to the dataset or by stratifying the dataset accordingly. Consider an outcome variable (column C in Fig 1) that contains, for example, a treatment or control status taking two values α and β. This outcome variable reflects the study’s experimental design and thus the known part of the inherent structure in the data. Including C as a feature (variable) is demonstrated in Fig 1a and 1b, while performing separate imputation models per stratum defined by C is depicted in Fig 1c [5,22]. While it is common practice to include all available features as input (Fig 1a), this can lead to overparameterization when the number of variables exceeds the number of samples (p > n) [23]. Therefore, some imputation methods, such as Random Forest (RF), include a feature selection procedure [24]. This feature selection may result in the exclusion of one or more columns of Y, as depicted in Fig 1b. The impact of including the outcome variable (column C in Fig 1) on the accuracy of the imputation has been discussed for machine learning methods such as MICE [25].

Fig 1. Schematic of different imputation strategies.

Fig 1

XOriginal stands for a block of independent variables (features) with missing values marked by an x. The Y block contains outcome parameters (metadata, or clinical annotations), with column C being the outcome variable of interest reflecting the experimental design with treatment status α or β. Block Ximputed is block X after imputation. In a) and b) imputation includes C as a variable, while in c) data for treatment status α and β are imputed separately and joined afterward.

A notable gap in the current literature is the comparison between the two strategies, stratified imputation (Fig 1c) and inclusion of outcome variables in the imputation process (Fig 1a and 1b), with or without feature selection (Fig 1b) [22,26–29]. In this study we set out to identify a suitable implementation of a practical and robust imputation method for datasets resulting from the analysis of multiple epidemiological studies or from datasets across multiple national databanks.

As an example, we evaluate different strategies to impute a large-scale chemical pathology dataset, using a variety of performance metrics for accuracy and precision, to select the method that best maintains data integrity. This dataset was generated by the COnsortium for MEtabonomic Toxicology (COMET), a large multi-centre preclinical toxicology study in a rat model, aimed at building augmented models of the toxicity of a range of compounds. Within this project, all the data were generated and recorded using a harmonised protocol across five different pharmaceutical companies [30–32]. The dataset consists of serum clinical chemistry measurements for a range of parameters typically used in acute toxicity studies aiming to assess the safety of drug candidates. Data were generated for 21 serum parameters measured for 107 chemical toxins (86) or physiological stressors (21) across a dataset of 7528 samples. This dataset contains inherent structure, such as samples being collected from different companies with different suppliers, for different toxins, and for high-dose, low-dose and control groups, and provides a good example of the challenges of joining multiple epidemiological studies for the construction of atlases of diseases and, more generally, large biological and chemical datasets.

Methods

Dataset

The COMET project, a preclinical rat toxicology study, resulted from a collaboration between five pharmaceutical companies and Imperial College London, UK, which studied a diverse list of toxins targeting various organs in the body. Each of the 107 studies (one for each toxin or physiological stressor) consisted of 30 male rats of the same strain in highly specified/controlled environments (temperature, humidity, diurnal variation, diet, etc.). The collection details, applied to all studies, are depicted in Fig 2.

Fig 2. Serum sample collection details for all COMET studies.

Fig 2

Each study had 30 rats assigned across control, low dose and high dose with only two serum samples per rat.

We selected the COMET project based on several key characteristics: a large sample size (7528 serum samples with 21 clinical chemistry variables per sample), the presence of inherent structure (data generated across five independent laboratories or Companies (denoted A, B, C, D and E)), and the inevitable missing data. Like most studies, some data are absent and the inputs from multi-centre sources require careful curation and correction for batch-to-batch variation. The full process of curation is captured in Fig 3.

Fig 3. Schematic of the order of processes to curate the serum clinical chemistry parameters.

Fig 3

Each hexagon refers to a key process that is further explained in a separate Figure.

The first steps in Fig 3 encompass removing failed experiments (defined as use of an incorrect animal species, such as mice, and first attempts of studies that required repetition), harmonising the labelling of metadata fields, and removing empty samples, mostly empty variables, and biologically impossible values (e.g., negative values). Following these steps, the sample size was reduced from 7528 to 5796 and the number of clinical chemistry parameters from 21 to 12.

The remaining 5796 samples were inspected for batch effects. Given the highly controlled experimental environment and the adoption of standard protocols, the control data should not present any pattern according to treatment, as they effectively form the baseline for a study. As per the experimental design (Fig 2), animals 1–10 are designated as controls and yield two samples each: a sample at 24 hours was obtained from all control rats, with a second sample obtained at either 48 or 168 hours. These control samples were inspected for batch differences. The serum controls for Urea Nitrogen are displayed for each study in Fig 4A, while all remaining variables are displayed in Figure S2E in S1 File. Median centring (a shift-based correction), calculated using the control data, was applied to all data to compensate for any systematic batch differences. The median centring used the difference between the Grand Median (GM, the median across all studies using the 24-hour control samples) and the Study Median (SM, the median per study for the controls inclusive of all timepoints). The corrected data are shown for Urea Nitrogen in Fig 4B. Similar results were obtained for the other variables (Figure S2G in S1 File).
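The shift correction described above can be sketched as follows (an illustrative Python/pandas sketch with hypothetical column names; the study's processing was done in R). Each study is shifted by the difference between the grand median of the pooled 24-hour controls and that study's own control median:

```python
import pandas as pd

def median_centre(df, value_col, study_col="study",
                  control_col="is_control", time_col="time_h"):
    """Shift-based batch correction: each study is shifted by
    (grand median - study median), where the grand median pools the 24 h
    control samples of all studies and the study median uses that study's
    controls at all timepoints."""
    controls = df[df[control_col]]
    grand_median = controls.loc[controls[time_col] == 24, value_col].median()
    study_medians = controls.groupby(study_col)[value_col].median()
    out = df.copy()
    out[value_col] = df[value_col] + (grand_median - df[study_col].map(study_medians))
    return out
```

Because the shift is estimated from controls only but applied to every sample in the study, between-study offsets are removed while within-study (e.g. treatment-related) differences are preserved.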

Fig 4. Urea Nitrogen values for control samples across all studies for A) raw data and B) Median Centered (batch corrected) data.

Fig 4

Colours alternate for ease of visualisation.

In addition to addressing batch effects, median centring maintained biologically relevant peaks in the non-control data. The remaining variables and non-control data, along with a comparison against a correction using the ratio of the SM to the GM, are displayed in Figures S2Fii and S2Fiii in S1 File, and the batch-corrected controls in Figure S2G. After median centring, the data were re-examined for spurious values, such as negative values, as detailed in the supplementary material. Post curation, the data remained at 5796 samples with minimal additional sparsity, as seen in Fig 5.

Fig 5. The sparsity of serum parameters pre- and post-batch correction.

Fig 5

The y-axis depicts the samples, of which there are 5796, while the lower x-axis indicates the variable and the number of values removed post batch correction. The upper x-axis denotes whether the data are pre- or post-batch correction.

Of the 5796 samples 3817 complete cases were identified to create the X block (Fig 1). This large number of complete cases makes it a valuable dataset for introducing artificial missingness while having true values available for evaluation.

The knowledge about the intrinsic structure in the data mainly comes from the experimental design (Y matrix in Fig 1): company, time point, dose group (control, low dose, high dose), target of toxicity, toxin and euthanasia group. Company was selected as the C column (outcome variable) and the remaining metadata were considered the Y block (Fig 1).

All three types of missingness (MCAR, MAR and MNAR) with missing proportions of 5, 10, 20, 30 and 40%, were introduced to the 3817 complete case serum clinical sample measurements resulting in a total of 15 datasets prepared for imputation. A maximum missing proportion of 40% was chosen since this was the threshold to retain a COMET clinical chemistry parameter before curation was completed.

Imputation

MICErf and missForest were applied to the 15 datasets with artificial missingness. The hyperparameter for the number of trees was set to 10 for both MICErf and missForest to reduce bias and computational time [13,14,17]. The hyperparameter for the number of MICErf iterations was set to the recommended five for reasonable computation time [8].

As seen in Table 1, imputation was applied to: the X block (serum clinical chemistry parameters) without inclusion of metadata (X), with only C (X + C) and with the entire Y block (X + Y + C). Two different uses of the outcome variable C were evaluated: inclusion of C as a feature (CVariable) and separate (stratified) imputation for each company (CStrata). The full list of combinations of variables and stratification applied is displayed in Table 1. Notably, YAll did not include information about the toxin (the “Toxin” field), which was completely excluded when applying missForest. This was because missForest limits the maximum number of levels per factor to 53, while there are 107 different toxins and stressors. The 11 imputation strategies described in Table 1 were applied to the 15 datasets, generated from five different levels of missingness (M = 5, 10, 20, 30 and 40%) and three different missingness types (MT = MAR, MCAR and MNAR) for the 3817 complete cases. These 15 datasets were repeated for two additional sample sizes (N = 50 and 500). Imputation for each dataset was repeated 20 times, with the metrics from Table S2 in S1 File recorded as the mean and standard deviation over the iterations. Results for the reduced sample sizes can be found in the supplementary material. All analysis was performed using R (version 4.2.2), with the software packages mice 3.16.0 [33] and missForest 1.5 [34] used for imputation. Missingness was generated using the package missMethods 0.4.0. As MICE produces multiple imputed datasets, the package sjmisc 2.8.9 [35] was used to merge them into a single result.

Table 1. Combinations of variable inclusion and stratification approaches where X is the clinical chemistry dataset that is missing values. C is the outcome variable, with CStrata representing separate imputation per group defined in the chosen variable and CVariable is including the outcome as a variable. YAll is the remaining metadata not used for stratification. YAll + Toxin is the same as YAll but with Toxin metadata now included.

MICErf                         MissForest
X                              X
X + CStrata                    X + CStrata
X + CVariable                  X + CVariable
X + YAll + CStrata             X + YAll + CStrata
X + YAll + CVariable           X + YAll + CVariable
X + YAll+Toxins + CVariable    —
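The difference between including C as a feature (CVariable) and imputing per stratum (CStrata) can be sketched as follows. This is an illustrative Python/pandas sketch in which a simple per-column-median imputer stands in for missForest or MICErf (which the authors ran in R); function and column names are our own:

```python
import pandas as pd

def impute_median(X):
    """Placeholder imputer: per-column median. In the paper this slot is
    filled by missForest or MICErf."""
    return X.fillna(X.median(numeric_only=True))

def impute_c_variable(X, c):
    """'X + CVariable': append the outcome C as (one-hot) features before
    imputing, then drop them again."""
    Xc = pd.concat([X, pd.get_dummies(c, prefix="C")], axis=1)
    return impute_median(Xc)[X.columns]

def impute_c_strata(X, c):
    """'X + CStrata': fit one imputation model per level of C, then
    recombine in the original row order."""
    parts = [impute_median(X[c == level]) for level in c.unique()]
    return pd.concat(parts).loc[X.index]
```

With a model that can exploit C, the two strategies can diverge noticeably: stratified imputation only ever sees same-company rows, whereas variable inclusion leaves the model free to down-weight C, which is one reason feature-selecting methods such as missForest are less sensitive to this choice.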

Performance metrics

Best practice guidelines for imputation are still evolving and the validity of the methods should be checked using multiple performance metrics [14,36]. The resulting complete-by-imputation datasets were assessed using internal performance metrics such as Normalised Root Mean Square Error (NRMSE) and Mean Absolute Error (MAE). Additionally, external performance metrics for Bias (using existing relationships) and distribution analysis were applied. Full details are provided in Table S2 in S1 File.
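As a sketch, the two internal metrics evaluated only at the artificially removed entries might be computed as follows (Python; the NRMSE definition here follows the missForest convention of normalising by the variance of the true values, which may differ in detail from the exact formula in Table S2):

```python
import numpy as np

def imputation_errors(X_true, X_imp, miss_mask):
    """MAE and NRMSE over the artificially removed entries only.
    NRMSE = sqrt(mean((imputed - true)^2) / var(true))."""
    t = X_true[miss_mask]
    p = X_imp[miss_mask]
    mae = np.mean(np.abs(p - t))
    nrmse = np.sqrt(np.mean((p - t) ** 2) / np.var(t))
    return mae, nrmse
```

Restricting the comparison to the masked entries is what makes these metrics "internal": they measure recovery of known values rather than any downstream analysis.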

Random Forest (RF) Machine Learning (ML) imputation methods were selected for evaluation. Their accuracy and maintenance of predictive ability (robustness) currently outperform traditional [14,17,37,38] and Deep Learning (DL) [12,37] imputation methods. ML-based imputation is more suitable for high-dimensional data with complex inherent structure and collinear variables [39], and the use of RF does not require the same detailed and accurate specification of parameters or pre-imputation transformation of the data to cope with variable interactions, non-linearities and non-normal distributions as non-RF methods such as parametric MICE [13–20,40]. RF-based methods also implement some degree of automated variable selection, as the algorithm can choose not to incorporate available features, or to extremely minimise their impact [37,40,41]. For these reasons, MICE based on RF (MICErf) [33] and missForest, in combination with stratification options, were given a comparative assessment of robustness, maintenance of predictive ability and accuracy on the COMET data, using the metrics in Table S2 in S1 File. To the best of our knowledge, no prior publications have addressed these considerations.

Results

Stratification

Unsupervised multivariate analysis using Principal Component Analysis (PCA) of the 3817 complete-case serum clinical measurements was conducted. The scores of the first two principal components are displayed in Fig 6, coloured according to the metadata variables. Fig 6 has been restricted to scores of ± 5 for visualisation purposes, capturing 90% of the data. While the clusters of samples by Company are not distinctly separated in Fig 6F, their presence, even with overlap, indicates grouping. Company was designated the outcome variable (C) for this study because any influence it exerts is undesirable and would require correction. In contrast, the variables time point, dose group, target of toxicity, toxin and euthanasia group are pertinent to the study's purpose and were not considered confounders. Hence, these remaining metadata variables constitute the remainder of our Y block.
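The PCA scores plotted here can be obtained, for example, from the SVD of the column-centred data matrix (an illustrative Python sketch, not the authors' R code):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of the leading principal components via SVD of the
    column-centred data matrix (equivalent to classical PCA)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # project onto the leading loadings
```

Colouring these score coordinates by each metadata variable in turn is what reveals (or rules out) the groupings discussed for Fig 6.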

Fig 6. Zoomed-in PCA scores plot of the 3817 complete cases for serum clinical chemistry parameters.

Fig 6

The range was restricted to ± 5 for visualisation purposes, highlighting groupings based on the metadata variables A) ‘Toxin’, B) ‘Euthanasia Subgroup’, C) ‘Time Point (hours)’, D) ‘Target of Toxicity’, E) ‘Dose Flag’ and F) ‘Company’ for the first two principal components.

Imputation method

A total of 11 different imputation methods/strategies described in Table 1 were tested on the fifteen datasets with various artificial missingness (MT = MAR, MCAR and MNAR at M = 5, 10, 20, 30 and 40%). Each imputation method was iterated 20 times, with the mean and standard deviation of the performance metrics recorded. The internal metrics, representing accuracy and precision, all followed the same trend, as shown in Fig 7 for the full MAE. The NRMSE and partial MAE can be found in the supplementary material.

Fig 7. Internal performance metric full MAE for varying proportions of missing data for all three types of missingness (MT = MCAR, MAR and MNAR) using imputation methods missForest and MICErf.

Fig 7

A) the full MAE values where the solid line and points represent the mean values averaged over 20 iterations, while the shadow denotes the confidence interval. B) The Bump Chart for method performance ranked 1st (best) to 11th (worst).

As has been observed with other types of data [17,38], the missForest algorithm consistently, if only slightly, outperformed MICErf in terms of accuracy and precision, having the lowest internal metrics under every type of missingness. The combinations of stratification and variable inclusion described in Table 1 had minimal impact when applying missForest: the full MAE was tightly clustered for the missForest points in Fig 7A and maintained the top five ranks in Fig 7B. However, stratification via separate imputation per company (CStrata) had a pronounced effect on MICErf, which was consistently ranked lowest and deviated significantly from all other methods.

As mentioned, the inclusion of the Y block and the inclusion or choice of CStrata or CVariable caused minimal disparity between the missForest results. Results for smaller sample sizes, as seen in Figures S4–S6 in S1 File, still designated missForest as the more robust of the two methods. Of note is the fact that stratified MICErf was unable to complete any imputation for a sample size of 50 and failed to impute one to two variables for a sample size of 500 for various missingness types and proportions.

Within the twelve serum clinical parameters, “AST” and “ALT”, alongside “Creatinine” and “Urea Nitrogen”, are biologically connected and known to hold linear relationships. These pairs had higher correlations than other variable pairs, with ALT versus AST having an R2 of 0.86 and Creatinine versus Urea Nitrogen an R2 of 0.72 (Figure S7 in S1 File). The complete list of correlations can be found in Figure S8 in S1 File. Using this prior knowledge, Fig 8 illustrates the degree of bias introduced by each imputation method in comparison to not imputing, as described in Table S1 in S1 File.
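One way to quantify such bias, as a sketch, is the change in the fitted slope of a known linear pair (e.g. AST versus ALT) after imputation, relative to the complete-case fit (illustrative Python; the exact metric used is defined in Table S1 in S1 File):

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x."""
    xc = x - x.mean()
    return np.dot(xc, y - y.mean()) / np.dot(xc, xc)

def slope_bias(x_cca, y_cca, x_imp, y_imp):
    """Bias probe: change in the fitted slope of a known linear
    relationship after imputation, relative to the complete-case slope."""
    return slope(x_imp, y_imp) - slope(x_cca, y_cca)
```

An imputation method that distorts the AST-ALT relationship will shift this slope away from the complete-case value, even if its pointwise MAE looks acceptable, which is why the study reports this external metric alongside the internal ones.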

Fig 8. External performance metric of imputation bias for varying proportions of missing data for all three types of missingness (MCAR, MAR and MNAR) using imputation methods missForest and MICErf.

Fig 8

The bias is measured with respect to the linear relationship between AST versus ALT and Creatinine versus Urea Nitrogen. Mean values after 20 iterations are represented by the solid line with points and standard deviation by the shadow.

Optimal methods in terms of bias oscillate between not imputing, missForest and MICErf depending on the type and proportion of missingness. Similar to the internal metric in Fig 7, CStrata often has a negative effect on MICErf. Although missForest does not always outperform MICErf, the majority of the unstratified missForest results either outperform or are on par with the MICErf results. The comparison between two sample sizes (N = 500 and 50) shown in Figures S9 and S10 in S1 File demonstrates that, the majority of the time, missForest still outperforms MICErf, further supporting the use of missForest when choosing to impute.

The distributions pre- and post-imputation for 40% missing values, found in the supplementary material (Figure S11 in S1 File), did not raise concerns, as they did not display any major deviations in distribution from the original (CCA) data distribution. We selected 40% as a worst-case scenario.

Discussion

With no clear difference in performance between the distributions in Figure S11 in S1 File, and from examination of the internal and external performance metrics displayed in Figs 7 and 8, missForest was identified as the most suitable imputation method for all missingness types and up to 40% missing data, confirming earlier reports on different types of data [17,38]. If choosing to impute with MICErf, stratification should be avoided due to its large negative effect on the internal metrics displayed in Fig 7. This holds true for our largest sample size of 3817 for all combinations of features and stratification methods. While missForest remains the optimal imputation method for the smaller sample sizes, unstratified MICErf is still a close competitor (Figures S3 and S4 in S1 File).

The imputation and data processing strategy presented here allowed the alignment and removal of systematic batch differences between studies. In regard to chemical pathology data, this is of value for biological interpretation since clinical chemistry variables, particularly electrolytes and enzymes are known to be prone to variation [42,43] and therefore, the ability to co-analyse data from studies adds statistical power to establishing expected ranges for these parameters. However, the evaluation of the imputation pipeline presented here has a broader relevance to large-scale biological and chemical datasets. The fact that missForest delivers similar results with both stratification and variable inclusion makes it a suitable candidate for large databanks where data are imputed sequentially, as more datasets are added.

It is important to note that our comparisons do not encompass all possible data transformations prior to imputation, nor do they consider fine-tuning hyperparameters, such as the number of trees. Optimal hyperparameters are intrinsically dependent on the dataset of interest, and their tuning lies beyond the scope of our current study but warrants further investigation. We recognise that by varying hyperparameters, the relative performance of MICErf and missForest may change substantially. Nevertheless, we illustrate the effectiveness of missForest, which operates well and robustly with minimal user intervention, aligning with our research objectives. Our recommendation of missForest is primarily based on its robustness to user error and its consistent performance across missingness types, proportions and sample sizes.

Conclusion

The results of the simulation of varying proportions and types of missingness in an exemplary large dataset support missForest as an ideal and robust imputation method. We recommend missForest as the method most robust to user error; if choosing to impute with MICErf, we recommend avoiding stratification due to its large impact on performance at all sample sizes. Further examination of the effect of fine-tuning hyperparameters on the relative performance of missForest and MICErf is required, and our recommendation is to apply the approach only in cases of up to 40% missingness. These results highlight, once more, the necessity of storing the raw data (prior to any processing) so that imputation or any further correction can be applied to the merged data if necessary.

Supporting information

S1 File. Supplementary figures Fig S1 to Fig S11.

(DOCX)

pone.0335852.s001.docx (12.4MB, docx)

Data Availability

Data are available at: https://zenodo.org/records/14987944.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Komamine M, Fujimura Y, Omiya M, Sato T. Dealing with missing data in laboratory test results used as a baseline covariate: results of multi-hospital cohort studies utilizing a database system contributing to MID-NET® in Japan. BMC Med Inform Decis Mak. 2023;23(1):242. doi: 10.1186/s12911-023-02345-7
2. Smirnova E, Zhong Y, Alsaadawi R, Ning X, Kress A, Kuiper J, et al. Missing data interpolation in integrative multi-cohort analysis with disparate covariate information. arXiv [stat.ME]. 2022.
3. Keun HC, Ebbels TMD, Bollard ME, Beckonert O, Antti H, Holmes E, et al. Geometric trajectory analysis of metabolic responses to toxicity can define treatment specific profiles. Chem Res Toxicol. 2004;17(5):579–87. doi: 10.1021/tx034212w
4. Dekermanjian JP, Shaddox E, Nandy D, Ghosh D, Kechris K. Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics. BMC Bioinform. 2022;23(1):179. doi: 10.1186/s12859-022-04659-1
5. Little RJA. Statistical analysis with missing data. Wiley & Sons, Limited; 2019.
6. van Buuren S. Flexible imputation of missing data. CRC Press, Taylor and Francis Group; 2018.
7. Osman MS, Abu-Mahfouz AM, Page PR. A survey on data imputation techniques: water distribution system as a use case. IEEE Access. 2018;6:63279–91. doi: 10.1109/access.2018.2877269
8. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393. doi: 10.1136/bmj.b2393
9. Jadhav A, Pramod D, Ramanathan K. Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell. 2019;33:913–33.
10. Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009;60:549–76. doi: 10.1146/annurev.psych.58.110405.085530
11. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9
12. Jäger S, Allhorn A, Bießmann F. A benchmark for data imputation methods. Front Big Data. 2021;4:693674. doi: 10.3389/fdata.2021.693674
13. Slade E, Naylor MG. A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Stat Med. 2020;39(8):1156–66. doi: 10.1002/sim.8468
14. Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol. 2014;179(6):764–74. doi: 10.1093/aje/kwt312
15. Tilling K, Williamson EJ, Spratt M, Sterne JAC, Carpenter JR. Appropriate inclusion of interactions was needed to avoid bias in multiple imputation. J Clin Epidemiol. 2016;80:107–15. doi: 10.1016/j.jclinepi.2016.07.004
16. Doove LL, Van Buuren S, Dusseldorp E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal. 2014;72:92–104. doi: 10.1016/j.csda.2013.10.025
17. Hong S, Lynn HS. Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol. 2020;20(1):199. doi: 10.1186/s12874-020-01080-1
18. Seaman SR, Bartlett JW, White IR. Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Med Res Methodol. 2012;12:46. doi: 10.1186/1471-2288-12-46
19. Strobl C, Malley J, Tutz G. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods. 2009;14(4):323–48. doi: 10.1037/a0016973
20. Burgette LF, Reiter JP. Multiple imputation for missing data via sequential regression trees. Am J Epidemiol. 2010;172(9):1070–6. doi: 10.1093/aje/kwq260
21. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30(4):377–99. doi: 10.1002/sim.4067
22. Sullivan TR, White IR, Salter AB, Ryan P, Lee KJ. Should multiple imputation be the method of choice for handling missing data in randomized trials? Stat Methods Med Res. 2018;27(9):2610–26. doi: 10.1177/0962280216683570
23. Tang F, Ishwaran H. Random forest missing data algorithms. Stat Anal Data Min. 2017;10(6):363–77. doi: 10.1002/sam.11348
24. Hapfelmeier A, Ulm K. Variable selection by Random Forests using data with missing values. Comput Stat Data Anal. 2014;80:129–39. doi: 10.1016/j.csda.2014.06.017
25. Moons KGM, Donders RART, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006;59(10):1092–101. doi: 10.1016/j.jclinepi.2006.01.009
26. Resche-Rigon M, White IR, Bartlett JW, Peters SAE, Thompson SG, PROG-IMT Study Group. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Stat Med. 2013;32(28):4890–905. doi: 10.1002/sim.5894
  • 25.Moons KGM, Donders RART, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol. 2006;59(10):1092–101. doi: 10.1016/j.jclinepi.2006.01.009 [DOI] [PubMed] [Google Scholar]
  • 26.Resche-Rigon M, White IR, Bartlett JW, Peters SAE, Thompson SG, PROG-IMT Study Group. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Stat Med. 2013;32(28):4890–905. doi: 10.1002/sim.5894 [DOI] [PubMed] [Google Scholar]
  • 27.Andridge R, Thompson KJ. Using the fraction of missing information to identify auxiliary variables for imputation procedures via proxy pattern‐mixture models. Int Stat Rev. 2015;83:472–92. [Google Scholar]
  • 28.Jolani S, Debray TPA, Koffijberg H, van Buuren S, Moons KGM. Imputation of systematically missing predictors in an individual participant data meta-analysis: a generalized approach using MICE. Stat Med. 2015;34(11):1841–63. doi: 10.1002/sim.6451 [DOI] [PubMed] [Google Scholar]
  • 29.Zhang J, Dashti SG, Carlin JB, Lee KJ, Moreno-Betancur M. Should multiple imputation be stratified by exposure group when estimating causal effects via outcome regression in observational studies? BMC Med Res Methodol. 2023;23(1):42. doi: 10.1186/s12874-023-01843-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Lindon JC, Nicholson JK, Holmes E, Antti H, Bollard ME, Keun H, et al. Contemporary issues in toxicology the role of metabonomics in toxicology and its evaluation by the COMET project. Toxicol Appl Pharmacol. 2003;187(3):137–46. doi: 10.1016/s0041-008x(02)00079-0 [DOI] [PubMed] [Google Scholar]
  • 31.Cantor GH, Beckonert O, Bollard ME, Keun HC, Ebbels TMD, Antti H, et al. Integrated histopathological and urinary metabonomic investigation of the pathogenesis of microcystin-LR toxicosis. Vet Pathol. 2013;50(1):159–71. doi: 10.1177/0300985812443839 [DOI] [PubMed] [Google Scholar]
  • 32.Ebbels TMD, Keun HC, Beckonert OP, Bollard ME, Lindon JC, Holmes E, et al. Prediction and classification of drug toxicity using probabilistic modeling of temporal metabolic data: the consortium on metabonomic toxicology screening approach. J Proteome Res. 2007;6(11):4407–22. doi: 10.1021/pr0703021 [DOI] [PubMed] [Google Scholar]
  • 33.van Buuren S, Groothuis-Oudshoorn K. Mice: Multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67. [Google Scholar]
  • 34.Stekhoven DJ, Bühlmann P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8. doi: 10.1093/bioinformatics/btr597 [DOI] [PubMed] [Google Scholar]
  • 35.Lüdecke D. sjmisc: Data and variable transformation functions. JOSS. 2018;3(26):754. doi: 10.21105/joss.00754 [DOI] [Google Scholar]
  • 36.Nguyen CD, Carlin JB, Lee KJ. Model checking in multiple imputation: an overview and case study. Emerg Themes Epidemiol. 2017;14:8. doi: 10.1186/s12982-017-0062-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Sun Y, Li J, Xu Y, Zhang T, Wang X. Deep learning versus conventional methods for missing data imputation: a review and comparative study. Expert Syst Appl. 2023;227:120201. doi: 10.1016/j.eswa.2023.120201 [DOI] [Google Scholar]
  • 38.Waljee AK, Mukherjee A, Singal AG, Zhang Y, Warren J, Balis U, et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 2013;3(8):e002847. doi: 10.1136/bmjopen-2013-002847 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Lenz M, Schulz A, Koeck T, Rapp S, Nagler M, Sauer M, et al. Missing value imputation in proximity extension assay-based targeted proteomics data. PLoS One. 2020;15(12):e0243487. doi: 10.1371/journal.pone.0243487 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Jin H, Jung S, Won S. missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data. Genes Genomics. 2022;44(6):651–8. doi: 10.1007/s13258-022-01247-8 [DOI] [PubMed] [Google Scholar]
  • 41.Rosado-Galindo H, Dávila-Padilla S. Tree-based missing value imputation using feature selection. J Data Sci. 2021;18(4):606–31. doi: 10.6339/jds.202010_18(4).0002 [DOI] [Google Scholar]
  • 42.Boehm O, Zur B, Koch A, Tran N, Freyenhagen R, Hartmann M, et al. Clinical chemistry reference database for Wistar rats and C57/BL6 mice. Biol Chem. 2007;388(5):547–54. doi: 10.1515/BC.2007.061 [DOI] [PubMed] [Google Scholar]
  • 43.Petterino C, Argentino-Storino A. Clinical chemistry and haematology historical data in control Sprague-Dawley rats from pre-clinical toxicity studies. Exp Toxicol Pathol. 2006;57(3):213–9. doi: 10.1016/j.etp.2005.10.002 [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

S1 File. Supplementary figures S1 to S11.

(DOCX)

pone.0335852.s001.docx (12.4MB, docx)

Data Availability Statement

Data are available at: https://zenodo.org/records/14987944.


Articles from PLOS One are provided here courtesy of PLOS
