Abstract
Exploratory factor analysis is a statistical method commonly used in psychological research to investigate latent variables and to develop questionnaires. Although such self-report questionnaires are prone to missing values, there is not much literature on this topic with regard to exploratory factor analysis—and especially the process of factor retention. Determining the correct number of factors is crucial for the analysis, yet little is known about how to deal with missingness in this process. Therefore, in a simulation study, six missing data methods (an expectation–maximization algorithm, predictive mean matching, Bayesian regression, random forest imputation, complete case analysis, and pairwise complete observations) were compared with respect to the accuracy of parallel analysis, which was chosen as the retention criterion. Data were simulated for correlated and uncorrelated factor structures with two, four, or six factors; 12, 24, or 48 variables; 250, 500, or 1,000 observations; and three different missing data mechanisms. Two different procedures for combining multiply imputed data sets were tested. The results showed that no missing data method was always superior, yet random forest imputation performed best for the majority of conditions—in particular when parallel analysis was applied to the averaged correlation matrix rather than to each imputed data set separately. Complete case analysis and pairwise complete observations were often inferior to multiple imputation.
Keywords: missing data, exploratory factor analysis, multiple imputation, factor retention
Introduction
Missing data can be a severe problem for the quality of statistical analyses. Survey data in particular are prone to high levels of missingness. In political surveys, for example, a substantial proportion of participants do not answer every question (King, Honaker, Joseph, & Scheve, 2001). This so-called item nonresponse can be caused by a lack of knowledge, comprehension problems, or unwillingness to answer (Shoemaker, Eichholz, & Skewes, 2002). Psychological research likewise relies heavily on questionnaire surveys and is accordingly affected by missing data. Within the questionnaire design process, exploratory factor analysis (EFA) is practically indispensable, and determining the number of factors might be one of the most important decisions a researcher has to make when carrying out an EFA. Nevertheless, researchers almost always seem to ignore missing data when dealing with factor retention in their EFA (e.g., Russell, 2002). This might be due partly to a lack of awareness and partly to the scarcity of literature on the problem. McNeish (2017) is one exception in which missingness was investigated in the context of EFA. However, the author focused on rather small sample sizes, which are sometimes used in psychological research but are not recommended for EFA (Fabrigar, Wegener, MacCallum, & Strahan, 1999; Goretzko, Pham, & Bühner, 2019), and evaluated only very few conditions (only one three-factor model was used in the data-generating process). Moreover, using the Kaiser (1960) criterion for factor retention is clearly outdated (Fabrigar et al., 1999; Goretzko et al., 2019), as McNeish (2017) admitted. Other articles evaluating multiple imputation in the context of EFA focus rather on the estimation of factor loadings (e.g., Lorenzo-Seva & Van Ginkel, 2016) or investigate principal component analyses (PCAs; e.g., Dray & Josse, 2015).
There are also methods developed to specifically deal with missing data in PCA (e.g., a regularized PCA approach by Josse & Husson, 2012), but those methods also need valid estimates of the dimensionality as Dray and Josse (2015) point out and call for improvements in the dimensionality assessment when data are missing.
Therefore, this article examines the impact of missing data on the factor retention process for a broad range of conditions. We compare different missing data methods (especially multiple imputation methods) and two possible methods for combining different solutions of multiply imputed data sets.
Missing Data Mechanisms
Generally, three types of missingness can be distinguished: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR), also called missing not at random (MNAR).1 MCAR means that the missingness is unrelated to other variables and is completely based on a random process. If the missingness in a particular variable is related to another variable, but the two variables are uncorrelated, the missing data mechanism is referred to as essential MCAR (Graham, 2012). MAR means that the missingness depends only on observed variables, so that the probability that an observation is missing can be modeled from the given data. In contrast, MNAR means that missingness depends either on variables that are not observed or on the variable with missing data itself. Self-reported income, for example, is often affected by missingness that depends on the amount of income itself, as item nonresponse increases at the tails of the income distribution (Frick & Grabka, 2005).
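The practical difference between these mechanisms can be illustrated with a small simulation. The following Python sketch (illustrative only; the variable names and the 30% missingness rate are arbitrary choices) induces MCAR and MAR-type missingness in a variable y that correlates with an observed covariate x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)             # fully observed covariate
y = 0.5 * x + rng.normal(size=n)   # variable that will receive missing values

# MCAR: every value of y has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR ("MARRIGHT"): the probability of missingness in y rises with the
# observed covariate x, so higher x-values are more often missing.
p_mar = 1 / (1 + np.exp(-(x - 0.85)))  # intercept tuned to roughly 30% missing
mar_mask = rng.random(n) < p_mar

y_mcar = np.where(mcar_mask, np.nan, y)
y_mar = np.where(mar_mask, np.nan, y)

# Under MCAR the observed mean of y stays unbiased; under MAR it is shifted
# downward, because y correlates with x and high-x rows are deleted more often.
print(np.nanmean(y_mcar), np.nanmean(y_mar))
```

Comparing the two observed means makes the bias of naive complete-data summaries under MAR immediately visible.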
Depending on the type of missingness, conducting complete case analyses or using pairwise complete observations can have adverse consequences. Estimates can be severely biased, correlations can be incoherent, and inference might be misleading. Moreover, statistical power decreases, as sample sizes are sometimes reduced drastically (Little & Rubin, 2002). Therefore, different imputation methods have been developed to address these problems and to allow for valid statistical inference when data are affected by missing values.
Multiple Imputation
The aim of imputation is to obtain valid inference from data with missing values (e.g., due to item nonresponse). Depending on the specific setting, different imputation methods might perform best, which means that they provide unbiased estimates with smaller standard errors and thus more powerful tests than other missing data methods. Several imputation methods contain stochastic components, so that each imputation (random draw from the predictive distribution) can lead to different data and thus to different (point) estimates. Applying so-called single imputation procedures, in which each missing value is imputed once, does not allow the additional variance component (introduced by the imputation itself) to be taken into account and often yields underestimated standard errors (and too narrow confidence intervals). For this reason, multiple imputation methods have become the state-of-the-art solution, generating two or more imputed values for each missing value, resulting in two or more imputed data sets (Little & Rubin, 2002).
The core idea of multiple imputation is to draw m times (for m imputed data sets) from the posterior predictive distribution to obtain m imputed values for each missing value, analyze the m resulting data sets, and combine the m estimates of the parameter of interest (e.g., a population mean) into a final estimate.
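For a scalar parameter such as a mean, this combination step follows Rubin's rules: average the m point estimates and add the between-imputation variance to the average within-imputation variance. A minimal sketch (the estimates and squared standard errors below are made-up numbers):

```python
import numpy as np

# Suppose m = 5 imputed data sets each produced a point estimate of the
# population mean together with its squared standard error.
est = np.array([10.1, 9.8, 10.4, 10.0, 9.9])           # per-data-set estimates
var_within = np.array([0.25, 0.24, 0.26, 0.25, 0.25])  # squared standard errors

m = len(est)
pooled = est.mean()              # combined point estimate
w = var_within.mean()            # within-imputation variance
b = est.var(ddof=1)              # between-imputation variance
total_var = w + (1 + 1 / m) * b  # Rubin's total variance
print(pooled, total_var)
```

The total variance exceeds the within-imputation variance alone, which is exactly the extra uncertainty that single imputation would ignore.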
Imputation Methods
One specific framework for imputation purposes is called multiple imputation by chained equations (MICE) or fully conditional specification (FCS). It is an iterative procedure in which the missing values of one variable are imputed given the currently imputed values of all other variables (e.g., Azur, Stuart, Frangakis, & Leaf, 2011).
Predictive Mean Matching
Predictive mean matching is one common imputation method within the FCS framework. It is based on a linear regression of the observed values of the variable to be imputed on the remaining variables.2 The estimated regression coefficients serve as the expected values of a multivariate normal distribution from which new coefficients are drawn. Predicted values are then computed for the observed cases using the estimated coefficients and for the missing cases using the drawn coefficients.
For each missing observation, the three observed cases with the closest predicted values (based on the absolute difference between the predicted values) are selected, and one of them is randomly chosen as the donor: The corresponding observed value is taken as the imputed value for the missing value (Vink, Frank, Pannekoek, & van Buuren, 2016).
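A stripped-down version of this matching step can be sketched as follows (illustrative: for simplicity, the Bayesian draw of the regression coefficients described above is omitted, so only the donor-matching logic is shown):

```python
import numpy as np

def pmm_impute(y, X, n_donors=3, rng=None):
    """Single predictive-mean-matching pass for the missing entries of y."""
    rng = rng or np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    # OLS fit on the observed part (parameter draws omitted in this sketch).
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    pred = Xd @ beta
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        # donors: observed cases with the closest predicted values
        dist = np.abs(pred[obs] - pred[i])
        donors = np.argsort(dist)[:n_donors]
        y_imp[i] = y[obs][rng.choice(donors)]   # borrow an observed value
    return y_imp
```

Because every imputed value is a real observed value, predictive mean matching never produces impossible values (e.g., negative ages), one of its main practical advantages.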
Regression Models
Another common method is Bayesian linear regression. It works quite similarly to predictive mean matching, but instead of finding the imputed values via the described "proximity" approach, the missing values are imputed directly via linear prediction (Yuan, 2000): The imputed value is the linear prediction plus a random component, obtained by multiplying a draw from a standard normal distribution by the estimated standard error of the regression.
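The prediction-plus-noise step can be sketched as follows (simplified: a fully Bayesian implementation would additionally draw the coefficients and the residual standard deviation from their posterior before predicting):

```python
import numpy as np

def regression_impute(y, X, rng=None):
    """Stochastic regression imputation: linear prediction plus normal noise."""
    rng = rng or np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    resid = y[obs] - Xd[obs] @ beta
    sigma = resid.std(ddof=Xd.shape[1])         # residual standard error
    y_imp = y.copy()
    miss = ~obs
    # imputed value = linear prediction + z * sigma, with z ~ N(0, 1)
    y_imp[miss] = Xd[miss] @ beta + rng.normal(size=miss.sum()) * sigma
    return y_imp
```

The added noise is what distinguishes this method from deterministic regression imputation, which would artificially shrink the variance of the imputed variable.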
Random Forest Imputation
In addition to these classical regression approaches, alternatives for more complex missingness patterns, such as random forest imputation, have emerged and can also be used within FCS. The random forest is based on bootstrap samples drawn from the empirical data. On each of these data sets, a regression (or classification) tree is built by recursive binary splitting until each terminal node contains fewer observations than a chosen threshold. At each node of a tree, mtry variables are randomly selected and the best possible split is performed based on one of these variables. Unlike single regression trees, the resulting trees are not pruned, since averaging over the solutions prevents overfitting. Because only mtry of all variables are considered for each possible split, the resulting trees can vary heavily, and averaging them provides more reliable results. Besides the maximal number of observations per terminal node or the depth of the trees, the number of trees and mtry can be tuned for prediction purposes (James, Witten, Hastie, & Tibshirani, 2013).
When using the random forest as a model for imputation, there is no strong need to tune these parameters, as valid inference is the main goal of imputation rather than accurately predicting the missing values (Little & Rubin, 2002). In general, mtry is often chosen as the square root of the number of variables or as one third of the number of variables (James et al., 2013); for regression tasks, the latter seems to be slightly favorable (Breiman, 1999). In the context of imputation, the number of trees can be smaller than for "real" prediction tasks, since a small forest can yield results as good as a much larger one (Shah, Bartlett, Carpenter, Nicholas, & Hemingway, 2014).
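The FCS loop with a random forest model can be sketched as follows (a simplified illustration using scikit-learn rather than mice: the actual mice.impute.rf method samples donors from the trees' terminal nodes, whereas this sketch imputes plain predictions; the settings of 10 trees and mtry equal to one third of the predictors mirror the suggestions above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fcs_rf_impute(data, n_iter=5, rng_seed=0):
    """One FCS pass with random-forest models: each variable with missing
    values is repeatedly re-imputed from the current completed data."""
    rng = np.random.default_rng(rng_seed)
    X = data.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):          # start with column-mean fills
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            rf = RandomForestRegressor(
                n_estimators=10,     # few trees suffice for imputation
                max_features=1 / 3,  # mtry = p/3 for regression
                random_state=int(rng.integers(1_000_000)),
            )
            rf.fit(others[~miss[:, j]], data[~miss[:, j], j])
            X[miss[:, j], j] = rf.predict(others[miss[:, j]])
    return X
```

Because the trees capture interactions and nonlinearities automatically, this model needs no explicit specification of such terms, which is why it is attractive for complex missingness patterns.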
Expectation–Maximization Algorithms: Amelia
Expectation–maximization (EM) algorithms provide an alternative to the FCS framework, assuming that the data follow a multivariate normal distribution. Further assuming MAR and a flat prior on the parameters (the mean vector and covariance matrix), the posterior distribution of the parameters is proportional to the likelihood of the observed data.
EM algorithms can be used to detect the mode of this posterior. The algorithm Amelia, in particular, combines bootstrapping with the classical EM algorithm to take draws from the posterior (Honaker, King, & Blackwell, 2011). Amelia estimates the sufficient statistics for the mean vector and covariance matrix of a multivariate normal distribution with a typical two-step EM process. During the expectation step, missing values are imputed given the current parameter estimates; during the maximization step, the parameters are reestimated using the currently imputed values (Honaker & King, 2010). Like the FCS approach, the Amelia algorithm returns m imputed data sets, using bootstrapped samples from the original data.
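The core of such an EM algorithm, without Amelia's bootstrap layer, can be sketched in a few lines (illustrative: missing entries are replaced by their conditional expectations in the E-step, and the sufficient statistics, including the conditional-variance correction, are recomputed in the M-step):

```python
import numpy as np

def em_mvnorm(data, n_iter=50):
    """EM estimates of the mean vector and covariance matrix of a
    multivariate normal distribution from data with missing entries."""
    X = data.copy()
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        bias = np.zeros((p, p))          # conditional-variance correction
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            if m.all():                  # fully missing row: use the marginals
                X[i] = mu
                bias += sigma
                continue
            o = ~m
            coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
            # E-step: conditional mean of the missing block given the observed
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            bias[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: recompute the sufficient statistics from the completed data
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False, bias=True) + bias / n
    return mu, sigma
```

Amelia additionally runs this procedure on bootstrapped samples and adds residual noise when writing out the m imputed data sets, which is what turns the point estimates above into proper multiple imputations.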
Other Missing Data Methods
There are several other methods used for missing data applications. One prominent example in the context of structural equation modeling is so-called full-information maximum likelihood, where partially complete data are integrated in the likelihood function so that information about the underlying marginals of incomplete variables can reduce bias (Enders & Bandalos, 2001). However, because Nassiri, Lovik, Molenberghs, and Verbeke (2018) demonstrated that full-information maximum likelihood often yields nonpositive definite covariance matrices, we decided not to include it in our study. Of course, there are further imputation models that can be used within the FCS framework or as alternative multiple imputation methods: hot-deck imputation (Andridge & Little, 2010), which, like predictive mean matching, borrows data from similar observations that serve as donors (yet without the regression model); more complex regression models, such as linear mixed models or splines, that might be useful for complex missing data mechanisms; or clustering-based imputation (Zhang, Zhang, Zhu, Qin, & Zhang, 2008), which uses a kernel-based method to generate data from observations similar to the respective observation with missingness.
Method
In this study, six different methods for dealing with missing values in the factor retention process were compared. Four multiple imputation methods (FCS with random forest, predictive mean matching, and linear regression, as well as Amelia) were compared with neglecting missingness, either by using pairwise complete observations or complete case analysis. Since we wanted to evaluate imputation methods that cover a broad range of concepts, we chose one similarity approach (predictive mean matching), one regression approach ([Bayesian] linear regression), one tree-based method (random forest imputation), and one EM algorithm (the Amelia algorithm). We chose predictive mean matching as the similarity-approach candidate for our study because it can be seen as a special case of hot-deck imputation (Laaksonen, 1998) and the similar clustering-based imputation approach is quite time-consuming in comparison. Furthermore, we decided to use a linear regression model because it matches the relatively simple missing data mechanisms that were assumed in our simulations, and the analysis model (EFA) also assumes linear relations.3 Choosing the random forest model within the FCS framework as the representative of the tree-based models was due to its comparably good expected performance relative to single decision trees or random forest implementations independent of FCS (e.g., missForest by Stekhoven & Bühlmann, 2011), as demonstrated by Shah et al. (2014), for example.
Two different approaches were tested to combine the results of the different imputed data sets. The factor retention criterion (here, parallel analysis, Horn, 1965, was chosen to determine the number of factors) was either applied to each of the imputed data sets, with the most frequent solution being selected, or applied to an averaged correlation matrix. The latter is quite similar to the approach of Nassiri et al. (2018). Choosing the best retention criterion can be challenging, as the criteria's performance varies under different conditions (e.g., Auerswald & Moshagen, 2019; van der Eijk & Rose, 2015). Parallel analysis (Horn, 1965) was chosen as it has become a "gold standard" (e.g., Goretzko et al., 2019) for this issue, with quite good performance under various conditions (e.g., Dinno, 2009).
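The logic of parallel analysis can be sketched as follows (a simplified, eigenvalue-based version; fa.parallel in psych additionally offers factor-analytic eigenvalues and quantile-based thresholds):

```python
import numpy as np

def parallel_analysis(data, n_sim=100, rng_seed=0):
    """Horn's parallel analysis: retain factors whose sample eigenvalues
    exceed the mean eigenvalues of random (uncorrelated) data."""
    rng = np.random.default_rng(rng_seed)
    n, p = data.shape
    eig_obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    eig_rand = np.zeros(p)
    for _ in range(n_sim):  # average eigenvalues of same-sized random data
        noise = rng.normal(size=(n, p))
        eig_rand += np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    eig_rand /= n_sim
    above = eig_obs > eig_rand
    # count the leading run of sample eigenvalues exceeding the random ones
    return int(np.argmin(above)) if not above.all() else p
```

Comparing against random-data eigenvalues rather than the fixed cutoff of 1 is what makes parallel analysis more robust than the Kaiser criterion.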
Data Conditions
Data were simulated for three different sample sizes (N = 250, 500, 1,000), three numbers of variables (12, 24, 48), three numbers of factors (two, four, six), orthogonal and correlated factors, and three missing data mechanisms (MCAR, MARRIGHT, MARCOMPL; see Data Manipulation section), assuming multivariate normality and a fixed proportion of missing values. Five hundred replications per condition were conducted. Both data simulation and analysis were performed with R, Version 3.5.2 (R Core Team, 2018).
Data Simulation
Data were simulated for true factor patterns with standardized primary loadings and small standardized secondary loadings. A true loading matrix with such loadings was used to determine the true correlation matrix of the variables: With the true loading matrix Λ and the predefined factor correlation matrix Φ, the model-implied covariance matrix is Σ = ΛΦΛ′ + Ψ. Since Ψ is usually assumed to be a diagonal matrix, Σ becomes a correlation matrix when the diagonal of Ψ is set to one minus the communalities, that is, Ψ = I − diag(ΛΦΛ′). Φ was specified separately for the two-, four-, and six-factor solutions. Σ was then used to simulate the sample data set for the analysis: Given Σ, a concrete sample was drawn and then manipulated with the respective missing data mechanism to generate the proportions of missing values. The samples were simulated with the mvtnorm package, Version 1.0.10 (Genz et al., 2018), using the mean vector, Σ, and N according to the experimental condition.
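Under illustrative values (primary loadings of .7 and a factor correlation of .3; the article's exact loading ranges and Φ matrices differ), this construction of Σ can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

k, items_per_factor = 2, 6                 # two factors, 12 variables
p = k * items_per_factor
loading = np.zeros((p, k))
for f in range(k):                         # simple structure: primary loadings only
    loading[f * items_per_factor:(f + 1) * items_per_factor, f] = 0.7
phi = np.array([[1.0, 0.3], [0.3, 1.0]])   # assumed factor correlation matrix

common = loading @ phi @ loading.T             # Lambda Phi Lambda'
sigma = common + np.diag(1 - np.diag(common))  # diagonal Psi makes diag(Sigma) = 1
X = rng.multivariate_normal(np.zeros(p), sigma, size=500)
```

The resulting X plays the role of one replication's complete sample, to which a missing data mechanism is then applied.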
Data Manipulation
Missing values were induced according to the given proportion of missingness and the missing data mechanism. In case of MCAR and MARRIGHT, the ampute function of the mice package, Version 3.4.0 (van Buuren & Groothuis-Oudshoorn, 2011) was used. The underlying procedure is based on the ideas of Brand (1999) and has been implemented by Schouten, Lugtig, and Vink (2018). There are several subtypes of the MAR mechanism, depending on which part of the marginal distribution is more likely to be affected by missing values. One can distinguish between MARRIGHT (higher values), MARMID (medium values), MARTAILS (extreme values), or MARLEFT (lower values; van Buuren, Brand, Groothuis-Oudshoorn, & Rubin, 2006). Figure 1 shows a possible representation of each of the mechanisms described.
Figure 1.
Different types of the missing at random mechanism.
For more complex missingness relations, MARCOMPL was designed. In each replication and for each variable, between five and all of the other variables were selected randomly to predict missingness. The coefficients were drawn from a uniform distribution, and prediction scores were calculated based on a common linear model. The randomly selected variables were either included in two-way interactions or entered into the linear predictor as a quadratic term. The resulting prediction scores were then standardized (z scores) and adjusted so that the expected value of an applied logistic regression, with these adjusted z scores as predictors, was equal to the intended proportion of missing values. Missing values were created by drawing from a Bernoulli distribution whose success probability was the respective outcome of the logistic regression, and setting all values to missing for which a "success" was sampled.
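This mechanism can be sketched as follows (illustrative: the coefficient range, the number of predictors, and the quadratic term are arbitrary choices, and the intercept-based calibration is only approximate, whereas the article adjusted the scores so that the expected rate matched exactly):

```python
import numpy as np

def marcompl_mask(X, target, rng):
    """MARCOMPL-style missingness for one column: a nonlinear score built
    from other columns drives a logistic model aimed at the target rate."""
    n, p = X.shape
    preds = rng.choice(p, size=rng.integers(2, p + 1), replace=False)
    coefs = rng.uniform(0.5, 1.5, size=len(preds))
    score = X[:, preds] @ coefs + X[:, preds[0]] ** 2  # add a quadratic term
    z = (score - score.mean()) / score.std()           # standardize the scores
    intercept = np.log(target / (1 - target))          # logit of the target rate
    prob = 1 / (1 + np.exp(-(z + intercept)))
    return rng.random(n) < prob                        # Bernoulli draws
```

Because the linear predictor involves nonlinear terms in other variables, the resulting missingness is harder to model for imputation methods that assume simple linear relations.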
Data Analysis
The resulting data sets with missing values were imputed with the presented imputation methods, namely, Amelia (Honaker et al., 2011) as well as FCS (with five iterations) with random forest (with a small number of trees, as suggested by Shah et al., 2014, and mtry set to one third of the variables, as suggested by Breiman, 1999, for regression), predictive mean matching, and Bayesian linear regression via the mice function of the mice package, Version 3.4.0 (van Buuren & Groothuis-Oudshoorn, 2011). Then parallel analysis was performed using the fa.parallel function in the package psych, Version 1.8.12 (Revelle, 2018). Mainly default settings were used for the imputation process (sequence of imputation from left to right, default tolerance for the Amelia algorithm, starting values for the EM algorithm obtained from the observed data with listwise deletion, and no priors specified for the sufficient statistics). Five imputed data sets (m = 5) seemed to be a good choice, as comparable studies (e.g., Dray & Josse, 2015; Lorenzo-Seva & Van Ginkel, 2016; Nassiri et al., 2018) also used small numbers of imputations. As we did not want to estimate standard errors for model parameters, more than five imputed data sets would have been an unnecessary computational burden.
As common rules for combining the imputed data sets, such as averaging the separately estimated parameters (see Little & Rubin, 2002), are not applicable to the factor retention process (since the estimate of the dimensionality has to be an integer4), we evaluated two possible options. Either the correlation matrices of each imputed data set were averaged to obtain a unique solution for each procedure and each initial data set (comparable with the approach of Nassiri et al., 2018), or the result of the parallel analysis for each imputed data set was collected and the most common factor solution for each initial data set was chosen (to be more precise, the mode of the distribution of suggested numbers of factors was selected). When more than one mode occurred and this decision rule was inconclusive, the lower mode was selected, as parallel analysis rather tends to extract too many factors (Crawford et al., 2010; Warne & Larsen, 2014).
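The two combination rules can be sketched as follows (the arguments stand for the per-imputation parallel-analysis results and correlation matrices; both names are placeholders):

```python
import numpy as np

def combine_mode(suggested):
    """Most frequent suggested number of factors across the imputed data
    sets; ties go to the lower mode, since parallel analysis tends to
    overextract."""
    values, counts = np.unique(suggested, return_counts=True)
    return int(values[counts == counts.max()].min())

def combine_cor(corr_matrices):
    """Element-wise average of the imputed data sets' correlation matrices,
    to which the retention criterion is then applied once."""
    return np.mean(corr_matrices, axis=0)
```

The first rule yields one integer per initial data set; the second defers the decision by feeding a single pooled correlation matrix into the retention criterion.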
Evaluation
The proposed number of factors was averaged over the replications of the same condition for each method. The accuracy and the proportions of under- and overfactoring (cases where the parallel analysis suggested fewer or more factors than the true number) were then compared among the six missing data methods (Amelia = em, random forest imputation = rf, Bayesian regression = reg, predictive mean matching = pmm, pairwise complete observations = pair, and complete case analysis = compl) and the two different approaches to combining the multiply imputed data sets.
Results
In Table 1, all conditions with the same number of true factors were averaged, and the average suggested number of factors as well as the proportion of correct solutions (the accuracy of the factor retention) are displayed.
Table 1.
Average Factor Solution and Accuracy for Data Based on Two, Four, and Six True Factors.
| Number of factors | Method | Mean estimate (mode) | Mean estimate (cor) | Accuracy (mode) | Accuracy (cor) |
|---|---|---|---|---|---|
| 2 | em | 2.45 | 2.16 | 0.67 | 0.87 |
| 2 | rf | 2.08 | 2.02 | 0.93 | 0.98 |
| 2 | pmm | 2.89 | 2.31 | 0.48 | 0.77 |
| 2 | reg | 2.95 | 2.31 | 0.47 | 0.76 |
| 2 | pair | 2.36 | NA | 0.75 | NA |
| 2 | compl | 2.14 | NA | 0.87 | NA |
| 4 | em | 4.15 | 4.04 | 0.83 | 0.91 |
| 4 | rf | 4.00 | 3.96 | 0.93 | 0.94 |
| 4 | pmm | 4.33 | 4.12 | 0.73 | 0.87 |
| 4 | reg | 4.36 | 4.12 | 0.72 | 0.86 |
| 4 | pair | 4.13 | NA | 0.86 | NA |
| 4 | compl | 3.69 | NA | 0.70 | NA |
| 6 | em | 5.59 | 5.44 | 0.53 | 0.59 |
| 6 | rf | 5.27 | 5.17 | 0.56 | 0.56 |
| 6 | pmm | 5.95 | 5.62 | 0.53 | 0.61 |
| 6 | reg | 6.00 | 5.61 | 0.52 | 0.60 |
| 6 | pair | 5.66 | NA | 0.63 | NA |
| 6 | compl | 4.47 | NA | 0.37 | NA |
Note. em = Amelia; rf = random forest imputation; reg = Bayesian regression; pmm = predictive mean matching; pair = pairwise complete observations; compl = complete case analysis. The subscript mode indicates the procedure using the most common factor solution across the imputed data sets. The subscript cor indicates the solution based on the averaged correlation matrix. As both pair and compl are based on one data set, no combination rule had to be used, and the respective solution was arbitrarily added in the “mode” column.
Averaging the correlation matrices (indicated by cor, e.g., em_cor for Amelia) over the imputed data sets yielded better results in general than applying the parallel analysis to each imputed data set and taking the most frequent solution (indicated by mode, e.g., pmm_mode). This can be observed for two- and four-factor solutions, whereas in the case of six factors, both approaches yielded quite similar results. For two and four factors, rf_cor yielded the highest accuracy (.98 and .94, respectively) and no substantial bias. All imputation methods within the cor approach achieved accuracies greater than .75 for two- and four-factor structures, while pmm_mode and reg_mode clearly failed to retain the correct number of factors in many cases with two factors (.48 and .47 accuracy).
In general, ignoring missingness was inferior in terms of correctly identified factors, yet pair performed comparably well for six factors (.63 accuracy on average, outperforming the other methods, although pmm_cor and reg_cor had similar accuracies) and in some of the other conditions, whereas complete case analysis was able to compete solely for two factors (.87 accuracy). Overall, the random forest imputation seemed to work best, although its performance worsened for six factors (both rf_cor and rf_mode with .56 accuracy). In these conditions, the estimated number of factors was 5.27 (mode) and 5.17 (cor) on average for rf, indicating underfactoring, a bias that was even more severe for compl.
As four more variables potentially affected the accuracy of the retention criterion for each missing data method, namely, the sample size, the number of variables, the factor intercorrelation, and the missing data mechanism, a more nuanced perspective on the performance might be necessary. However, the missing data mechanism had little to no effect on the accuracy; only compl performed substantially worse when the missing data mechanism was MARCOMPL (for more details, see Supplementary Table 1 available online). Figure 2 (more detailed figures can be found in Supplementary Figures 1–3) displays the accuracy of factor retention depending on the number of variables and the number of factors for orthogonal factors (averaged over sample sizes and missing data mechanisms).
Figure 2.
Accuracy for orthogonal structures averaged over sample sizes and missing data mechanisms.
Again, rf_cor showed the best performance, but achieved comparably low accuracy when the item-to-factor ratio was small (few variables combined with six factors), conditions in which every method achieved quite weak results averaged over all sample sizes and missing data mechanisms. When the sample size was high (N = 1,000), all methods showed satisfying results. When the missingness was induced by MARCOMPL, complete case analysis often performed worst, yet a tendency toward decreased accuracy under these conditions was present for the other methods as well. Amelia in combination with averaged correlation matrices (em_cor) provided quite good solutions for a broad range of conditions, but was almost always inferior to random forest imputation (in the large majority of cases, rf_cor had an accuracy greater than or equal to that of em_cor). The other imputation methods showed no improvement over complete case analysis and pairwise complete observations when using the most frequent factor solution (pmm_mode and reg_mode), and pair even yielded results comparable with pmm_cor and reg_cor.
In the case of correlated factors, the results were quite different (Figure 3; more detailed figures can be found in Supplementary Figures 4–6). Although higher sample sizes generally fostered higher accuracies and the error rates were higher for six-factor solutions, as in the orthogonal case, no missing data method was clearly superior. Once again, rf_cor provided the best results for most conditions with two or four factors, but often failed to determine the correct number of factors when six factors were present. Especially when the sample size was small, rf_cor then yielded a very low accuracy. The other methods struggled under these conditions as well, yet reg_mode, pmm_mode, and pair performed better and were among the best methods for six correlated factors. Complete case analysis again showed the worst accuracy for almost every condition.
Figure 3.
Accuracy for oblique structures averaged over sample sizes and missing data mechanisms.
Random forest imputation tended to underestimate the number of factors when six correlated factors were present. For example, under MARCOMPL with six correlated factors and a small sample, rf_cor suggested far too few factors on average and thus highly underestimated the true factor number. This bias became less severe as the sample size rose, and for two- and four-factor structures, rf_cor again performed quite well. Amelia showed similar patterns and also tended to retain too few factors under the respective conditions. The use of reg_mode or pmm_mode worked better under these conditions, yet these methods led to heavy overfactoring in other conditions. The pair method showed similar patterns (overfactoring in these conditions as well) and was particularly prone to overfactoring when MARCOMPL induced missingness. This tendency to extract too many factors can be found for orthogonal factor structures as well: Besides reg_mode and pmm_mode, pair, em_mode, pmm_cor, and partly reg_cor also led to solutions with too many factors, in some conditions to an extreme degree. This extreme overfactoring is by no means an exclusive problem of reg_mode, since pmm_mode and em_mode likewise suggested far too many factors on average under the same conditions.
Complete case analysis occasionally provided results that would lead to serious misinterpretations. In 10 conditions, the parallel analysis suggested zero factors at least once when compl was used. This means that in these cases, even though there were relations among the variables in the population (in the data-generating process), no relations were found in the sample due to the missing values. The respective conditions were based on the MARCOMPL assumption, which again demonstrates the weak performance of complete case analysis under this missing data mechanism.
Since in practice only one data set is analyzed, not only the bias of the discussed methods is of interest but also the variance of the estimation. Table 2 displays the variance of the suggested number of factors for each method, averaged over all conditions (for a more detailed table, see Supplementary Table 2). In general, complete case analysis yielded the most volatile solutions and averaging the correlation matrices over the different imputed data sets reduced the variance for every imputation method compared with the mode approach.
Table 2.
Variance of the Suggested Number of Factors Averaged Over All Conditions.
| Method | Variance |
|---|---|
| em_mode | 0.32 |
| rf_mode | 0.18 |
| pmm_mode | 0.46 |
| reg_mode | 0.47 |
| pair | 0.31 |
| compl | 0.51 |
| em_cor | 0.22 |
| rf_cor | 0.17 |
| pmm_cor | 0.30 |
| reg_cor | 0.30 |
For most conditions, complete case analysis had the highest variance, with the maximum occurring for an orthogonal structure under MARCOMPL. It is striking that conditions with MARCOMPL in particular induced a higher variance, on average, when complete case analysis was used. With increasing sample size, the variance tended to decrease, which corresponds to the higher proportions of correct solutions in the respective conditions. In addition, the variances within the cor approach were smaller than or equal to those of the mode approach in about three fourths of the cases. Exceptions were, for example, some conditions with oblique structures. Yet these differences appeared to be quite small, and all variances (except for complete case analysis, compl) remained at a low level.
Discussion
The present article evaluated the performance of six missing data methods in the context of the factor retention process in EFA. No missing data method clearly outperformed the others in all conditions. Although complex interactions of different variables (missing data mechanisms and data conditions in the data-generating process) hampered the interpretability of the results, it became clear that imputation methods tended to produce better results than ignoring the missing values using complete cases or pairwise complete observations.
As McNeish (2017) pointed out, no general rule has been established on how to combine multiply imputed data sets in the context of factor retention. Hence, this study can provide first insights on two possible procedures. While the mode approach (using the most frequent solution of the retention criterion across the imputed data sets) was superior in some cases where the true number of factors was comparably high (six factors) and factors were correlated,5 it yielded less accurate estimates than the cor approach for the majority of conditions. Although a main reason for multiple imputation (instead of single imputation) is the ability to take into account the additional variance component induced by the imputation procedure, as outlined by Little and Rubin (2002), this might not be of primary interest when determining the number of factors. In the factor retention process, a point estimate of the number of factors is almost always used, and estimation uncertainties are not taken into account. Therefore, the cor approach, which does not allow this additional variance component to be estimated with respect to the dimensionality estimate but promises higher accuracies, might be preferable to the mode approach. The averaged correlation matrix appeared to provide more stable solutions (smaller variances in almost all conditions compared with the mode approach) and thus might be worth pursuing. Since the suggested factor solutions generally varied more strongly when using the mode approach, researchers should increase the number of imputations for this method. However, the cor approach seems to be quite stable for five imputed data sets, especially when combined with random forest imputation, and was superior in almost every case, so it might be preferable anyway. Nassiri et al. (2018) also showed that their similar approach (averaging the covariance matrices of all imputed data sets) performed quite well for estimating the proportion of explained variance in EFA.
Overall, the imputation methods were superior to listwise deletion (compl) and partially to pairwise deletion (pair), which is consistent with the findings of McNeish (2017), who showed that both were inferior to pmm, especially when factor loadings were small. When comparing the four imputation methods, however, no method stood out and clearly outperformed the others. In fact, rf or pmm/reg seemed to be favorable depending on the condition.6 For most conditions, random forest imputation seemed to be the best method when combined with the cor procedure, yet under some conditions this approach could not retain the correct number of factors, while pmm and reg worked quite well. Further research should build on this finding and evaluate additional data conditions and missing data mechanisms to clarify when random forest imputation should be used and when, for example, predictive mean matching should be preferred.
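For readers who want to experiment with random forest imputation outside of R, a rough Python analogue of the FCS "rf" method in mice can be built from scikit-learn's `IterativeImputer`. The sketch below is an illustrative assumption, not the exact configuration used in the study: tree counts, iteration limits, and the use of `IterativeImputer` as the chained-equations engine are all our own choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, n_imputations=5, seed=0):
    """Return a list of imputed copies of X, each produced by chained
    equations with a random forest as the conditional model. Varying
    the random seed across runs yields distinct completed data sets."""
    X = np.asarray(X, dtype=float)
    imputed = []
    for i in range(n_imputations):
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=10,
                                            random_state=seed + i),
            max_iter=5,
            random_state=seed + i,
        )
        imputed.append(imputer.fit_transform(X))
    return imputed
```

The completed data sets returned here could then be pooled with either of the procedures discussed above before the retention criterion is applied.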
Although parallel analysis can be regarded as the de facto standard factor retention criterion, since it serves as the comparative benchmark in a variety of studies presenting new criteria (e.g., Braeken & Van Assen, 2017; Lorenzo-Seva, Timmerman, & Kiers, 2011; Ruscio & Roche, 2012), these studies also show that there are modern criteria with advantages in some data conditions. Hence, further studies could also combine different retention criteria with different imputation methods to evaluate possible interactions between the imputation method and the criterion used to determine the number of factors.
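As a point of reference, the core of Horn's (1965) parallel analysis can be sketched in a few lines. The PCA-based variant below (Python/NumPy; the function name is our own) retains factors whose observed eigenvalues exceed the mean eigenvalues of random normal data of the same dimensions; the study itself used the implementation in the R package psych, which offers additional options (e.g., quantile thresholds and factor-based eigenvalues).

```python
import numpy as np

def parallel_analysis(X, n_sims=100, seed=0):
    """Horn's parallel analysis, PCA-based variant: retain factors whose
    observed correlation-matrix eigenvalues exceed the mean eigenvalues
    obtained from uncorrelated random data of the same size."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rng = np.random.default_rng(seed)
    simulated = np.empty((n_sims, p))
    for s in range(n_sims):
        Z = rng.normal(size=(n, p))
        simulated[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = simulated.mean(axis=0)
    k = 0
    while k < p and observed[k] > threshold[k]:
        k += 1
    return k
```

Applied to the averaged correlation matrix of the cor approach, the eigenvalue comparison would be run once on the pooled matrix instead of on `X` directly.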
The choice of the correlations between the factors for the data-generating process was arbitrary. However, these correlations could have had an impact on the performance of the missing data methods we evaluated. The random forest imputation, for example, extracted fewer factors than included in the data-generating process when six factors were simulated and the factors were correlated. One could hypothesize that its performance will improve if the correlations are smaller, as they were for the conditions with two and four factors. Thus, the between-factor correlations could be systematically varied in further studies as well.
In this study, it was assumed that the missing data mechanism is MAR (or MCAR), which is essential for the presented imputation frameworks (mice and Amelia). However, MNAR cannot be ruled out in many cases, so these methods should not be adopted carelessly in real applications. In the context of EFA, though, where several indicators represent the same latent variable and therefore share common variance, it may be reasonable to assume that if one indicator is not fully observed, a combination of the other indicators of the same latent variable(s) can be used to predict the missing values, which is precisely the assumption of the MAR mechanism. Researchers can make use of sensitivity analyses as described by Zygmont and Smith (2014) to determine whether it is more appropriate to assume MCAR or MAR.
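A minimal MAR mechanism, in which the probability that a value is deleted depends only on another, fully observed variable, can be simulated as follows. This Python/NumPy sketch is a deliberately simple stand-in; Schouten, Lugtig, and Vink (2018) describe a far more general multivariate amputation procedure, and the function name and rank-based weighting here are our own illustrative choices.

```python
import numpy as np

def ampute_mar(X, target_col, driver_col, prob=0.3, seed=0):
    """Delete values in `target_col` with a probability that increases
    with the (fully observed) `driver_col`: a simple MAR mechanism,
    since missingness depends only on observed data. The expected
    overall missing rate in `target_col` is approximately `prob`
    (valid for prob <= 0.5)."""
    X = np.array(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Rank-transform the driver to [0, 1] so that higher driver values
    # get proportionally higher deletion probabilities.
    ranks = X[:, driver_col].argsort().argsort() / (len(X) - 1)
    missing = rng.random(len(X)) < 2 * prob * ranks
    X[missing, target_col] = np.nan
    return X
```

Running an imputation method on data amputed this way, and comparing the resulting factor solution with the complete-data solution, is one way to probe the MAR assumption in a simulation of one's own design.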
Conclusion
This study shows that imputation methods are superior to complete case analysis (and pairwise complete observations) in most data conditions. Random forest imputation (within the FCS framework) appears to be an effective method for many of these conditions, but predictive mean matching or Bayesian regression may be preferable for oblique factor structures with six factors. Further research should extend the current simulation design with additional data conditions and other retention criteria. In addition, it might be useful to focus on the development of new approaches tailored to the specific requirements of EFA. Josse and Husson (2012), for example, developed an EM algorithm with bootstrapped residuals to determine the dimensionality in PCA, a procedure that might be transferable to this context. For now, using one of the imputation methods can improve current research practice. Since the cor approach was superior to the mode approach, researchers should use this option to combine the multiply imputed data sets. In cases where it is reasonable to assume that the data are based on fewer than six factors, random forest imputation appears to be a trustworthy missing data method. In other cases, it may be necessary to compare the solutions of different missing data methods as a robustness check. Identical factor solutions would argue for a robust and trustworthy result, while variations among methods should be seen as a warning that missing data could adversely affect the interpretability of the analysis.
1. As both terms are used equivalently, we decided to refer to this mechanism as MNAR in the following.
2. are the observations of the covariates corresponding to , whereas is also fully observed and corresponds to .
3. The imputation model should resemble the analysis model and reflect its complexity.
4. Therefore, McNeish (2017) computed only one imputed data set, Lorenzo-Seva and Van Ginkel (2016) constrained the number of factors to be equal for all five imputed data sets, and Dray and Josse (2015) averaged the five imputed data sets and estimated the dimensionality based on this averaged data set.
5. When it was combined with predictive mean matching or Bayesian regression imputation.
6. The Amelia algorithm was constantly outperformed by the random forest imputation, so it is not considered further in our discussion.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: David Goretzko https://orcid.org/0000-0002-2730-6347
Supplemental Material: Supplemental material for this article is available online.
References
- Andridge R. R., Little R. J. (2010). A review of hot deck imputation for survey non-response. International Statistical Review, 78, 40-64. doi: 10.1111/j.1751-5823.2010.00103.x
- Auerswald M., Moshagen M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24, 468-491. doi: 10.1037/met0000200
- Azur M. J., Stuart E. A., Frangakis C., Leaf P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20, 40-49. doi: 10.1002/mpr.329
- Braeken J., Van Assen M. A. (2017). An empirical Kaiser criterion. Psychological Methods, 22, 450-466.
- Brand J. P. L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets (Unpublished doctoral dissertation). Erasmus University Rotterdam, Netherlands. Retrieved from https://core.ac.uk/download/pdf/18508128.pdf
- Breiman L. (1999). Random forest. Retrieved from http://machinelearning202.pbworks.com/w/file/fetch/60606349/breiman_randomforests.pdf
- Crawford A. V., Green S. B., Levy R., Lo W.-J., Scott L., Svetina D., Thompson M. S. (2010). Evaluation of parallel analysis methods for determining the number of factors. Educational and Psychological Measurement, 70, 885-901. doi: 10.1177/0013164410379332
- Dinno A. (2009). Exploring the sensitivity of Horn's parallel analysis to the distributional form of random data. Multivariate Behavioral Research, 44, 362-388. doi: 10.1080/00273170902938969
- Dray S., Josse J. (2015). Principal component analysis with missing values: A comparative survey of methods. Plant Ecology, 216, 657-667. doi: 10.1007/s11258-014-0406-z
- Enders C. K., Bandalos D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling, 8, 430-457. doi: 10.1207/S15328007SEM0803_5
- Fabrigar L. R., Wegener D. T., MacCallum R. C., Strahan E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272-299.
- Frick J. R., Grabka M. M. (2005). Item nonresponse on income questions in panel surveys: Incidence, imputation and the impact on inequality and mobility. Allgemeines Statistisches Archiv, 89, 49-61. doi: 10.1007/s101820500191
- Genz A., Bretz F., Miwa T., Mi X., Leisch F., Scheipl F., Hothorn T. (2018). mvtnorm: Multivariate normal and t distributions. Retrieved from https://CRAN.R-project.org/package=mvtnorm
- Goretzko D., Pham T. T. H., Bühner M. (2019). Exploratory factor analysis: Current use, methodological developments and recommendations for good practice. Current Psychology. Advance online publication. doi: 10.1007/s12144-019-00300-2
- Graham J. W. (2012). Missing data: Analysis and design. New York, NY: Springer.
- Honaker J., King G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54, 561-581. doi: 10.1111/j.1540-5907.2010.00447.x
- Honaker J., King G., Blackwell M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1-47.
- Horn J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185. doi: 10.1007/BF02289447
- James G., Witten D., Hastie T., Tibshirani R. (2013). An introduction to statistical learning. New York, NY: Springer.
- Josse J., Husson F. (2012). Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics & Data Analysis, 56, 1869-1879. doi: 10.1016/j.csda.2011.11.012
- Kaiser H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151.
- King G., Honaker J., Joseph A., Scheve K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95, 49-69.
- Laaksonen S. (1998). Regression-based nearest neighbour hot decking. In International workshop on household survey nonresponse (Vol. 4, pp. 285-298). Retrieved from https://nbn-resolving.org/urn:nbn:de:0168-ssoar-49726-1
- Little R. J. A., Rubin D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: Wiley.
- Lorenzo-Seva U., Timmerman M. E., Kiers H. A. L. (2011). The hull method for selecting the number of common factors. Multivariate Behavioral Research, 46, 340-364. doi: 10.1080/00273171.2011.564527
- Lorenzo-Seva U., Van Ginkel J. R. (2016). Multiple imputation of missing values in exploratory factor analysis of multidimensional scales: Estimating latent trait scores. Anales de Psicología/Annals of Psychology, 32, 596-608. doi: 10.6018/analesps.32.2.215161
- McNeish D. (2017). Exploratory factor analysis with small samples and missing data. Journal of Personality Assessment, 99, 637-652. doi: 10.1080/00223891.2016.1252382
- Nassiri V., Lovik A., Molenberghs G., Verbeke G. (2018). On using multiple imputation for exploratory factor analysis of incomplete data. Behavior Research Methods, 50, 501-517. doi: 10.3758/s13428-017-1013-4
- R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
- Revelle W. (2018). psych: Procedures for psychological, psychometric, and personality research. Evanston, IL: Northwestern University. Retrieved from https://CRAN.R-project.org/package=psych
- Ruscio J., Roche B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282-292.
- Russell D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629-1646.
- Schouten R. M., Lugtig P., Vink G. (2018). Generating missing values for simulation purposes: A multivariate amputation procedure. Journal of Statistical Computation and Simulation, 88, 2909-2930. doi: 10.1080/00949655.2018.1491577
- Shah A. D., Bartlett J. W., Carpenter J., Nicholas O., Hemingway H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. American Journal of Epidemiology, 179, 764-774. doi: 10.1093/aje/kwt312
- Shoemaker P. J., Eichholz M., Skewes E. A. (2002). Item nonresponse: Distinguishing between don't know and refuse. International Journal of Public Opinion Research, 14, 193-201. doi: 10.1093/ijpor/14.2.193
- Stekhoven D. J., Bühlmann P. (2011). MissForest: Non-parametric missing value imputation for mixed-type data. Bioinformatics, 28, 112-118. doi: 10.1093/bioinformatics/btr597
- van Buuren S., Brand J. P. L., Groothuis-Oudshoorn C. G., Rubin D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76, 1049-1064. doi: 10.1080/10629360600810434
- van Buuren S., Groothuis-Oudshoorn K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67. Retrieved from https://www.jstatsoft.org/v45/i03/
- van der Eijk C., Rose J. (2015). Risky business: Factor analysis of survey data—Assessing the probability of incorrect dimensionalisation. PLoS One, 10, e0118900. doi: 10.1371/journal.pone.0118900
- Vink G., Frank L. E., Pannekoek J., van Buuren S. (2016). Predictive mean matching imputation of semicontinuous variables. Statistica Neerlandica, 68, 61-90. doi: 10.1111/stan.12023
- Warne R. T., Larsen R. (2014). Evaluating a proposed modification of the Guttman rule for determining the number of factors in an exploratory factor analysis. Psychological Test and Assessment Modeling, 56, 104-123.
- Yuan Y. C. (2000, April). Multiple imputation for missing data: Concepts and new development. Retrieved from https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf
- Zhang S., Zhang J., Zhu X., Qin Y., Zhang C. (2008). Missing value imputation based on data clustering. In Transactions on computational science I (pp. 128-138). New York, NY: Springer. doi: 10.1007/978-3-540-79299-4_7
- Zygmont C., Smith M. R. (2014). Robust factor analysis in the presence of normality violations, missing data, and outliers: Empirical questions and possible solutions. Quantitative Methods for Psychology, 10, 40-55.