Exploratory structural equation modeling and the curse of dimensionality

Tra T Le; Jeroen K Vermunt; Nicola Ballhausen; Katrijn Van Deun

doi:10.3758/s13428-026-02960-y

2026 Mar 11;58(3):84. doi: 10.3758/s13428-026-02960-y

PMCID: PMC12979312 PMID: 41814057

Abstract

The next-generation approach to research in the behavioral sciences is based on intensive collections of data and complex models characterized by many parameters for a limited sample size. This introduces new challenges for traditional latent-variable methods, as they are found to fail or yield unstable solutions when the number of variables is large relative to the sample size. To tackle this issue, we propose a two-stage regularized approach for exploratory structural equation modeling. In the first stage, we introduce a novel (exploratory) approximate factor analysis technique that not only estimates the measurement model but also the factor scores; indeterminacy of the measurement model is addressed by imposing simple structure through regularizing techniques (LASSO penalty and cardinality constraint). The factor scores can then be used to estimate the structural model in the second stage. An extensive simulation shows that the proposed method outperforms other approaches in recovering the underlying simple structure of the measurement model in both low-dimension high-sample-size and high-dimension low-sample-size settings. The use of the method is demonstrated on two empirical datasets. An implementation of the proposed method in the R software is publicly available: https://github.com/trale97/regularizedESEM.

Keywords: Structural equation modeling, High-dimensional data, Exploratory factor analysis, Regularization

Research on human behavior and cognition involves studying non-observable constructs such as personality, intelligence, and well-being, which are typically measured through a set of observed indicators. In this setting, structural equation models (SEM) are particularly powerful for building explanatory models, as they model both the latent variables (so-called measurement model) and the relations between the latent variables (so-called structural model). SEM methods are known to work when the number of parameters to estimate is relatively small compared to the sample size. Yet, this condition of “low-dimension high-sample-size” is often not met by modern research designs that make use of technologically advanced measurement tools (e.g., wearable devices tracking one’s physiology and location, genetic sequencing tools, or digital assessment tools used in educational testing) and large data collection (e.g., supplementing questionnaire data with social media data). The resulting data may have more variables than cases, or be so-called high-dimensional, low-sample-size (HDLSS) data. Existing structural equation methods do not work (well) in this setting: the parameter estimates are known to be unreliable when sample sizes are small, and solutions may not even converge (Rosseel, 2020).

The need for SEM for high-dimensional data has been answered mainly in two ways. The first is based on adaptations of the common factor model, which views the observed variables as reflective indicators of the latent construct. In the common factor model, the loadings are the main parameters of interest and are key to studying the latent variables: loadings reflect the strength of association between the observed indicators and the unobservable construct. The common approach to obtaining estimates of the loadings requires inverting the observed variables’ (model-implied) covariance matrix, which is an ill-posed problem in the HDLSS setting. Building on the state-of-the-art in computational statistics and machine learning, attempts have been made to adapt such covariance-based SEM methods by adding penalties to the maximum-likelihood (ML) objective functions, such as Regularized SEM (RegSEM; Jacobucci et al., 2016), Penalized ML SEM (lslx; Huang, 2020; Huang et al., 2017), or Penalized SEM in Mplus (Asparouhov & Muthén, 2024). These approaches, however, often exhibit non-convergence issues when the number of parameters is large relative to the sample size and have high false-positive rates when sample sizes are small (i.e., nonzero loadings are obtained for variables that do not load on the latent construct) (Li & Jacobucci, 2022).

A second common approach to SEM in the setting of HDLSS data is component-based SEM, which represents constructs as weighted sums of observed variables (e.g., poverty index, obesity risk score). Typical methods are partial least-squares SEM (PLS-SEM, Hair et al., 2017), principal component analysis (PCA, Jolliffe et al., 2003), regularized generalized canonical correlation (RGCCA, Tenenhaus et al., 2017), and generalized structural component analysis (GSCA, Hwang & Takane, 2004). In these models, the weights used to form the component scores are the main parameters of interest. An often-named drawback of the component approach is that it is not appropriate for modeling latent variables (Rigdon, 2012; Rönkkö et al., 2015; Sarstedt et al., 2016). Another, not-so-well-known problem is that the component weights are extremely unstable, even with regularization, in the low-dimension, high-sample-size setting. In simulation studies where the composite scores were artificially constructed, zero weights were recovered for variables that were used to form the composite scores, and nonzero weights were recovered for variables that did not contribute to the composite scores (Guerra-Urzola et al., 2021; Park et al., 2024). Thus, no meaning can be attached to the size of the weights.

To overcome the aforementioned issues of the currently available SEM methods in HDLSS settings, we propose a third approach to SEM called Regularized Exploratory Structural Equation Modeling (Regularized ESEM). Two developments in the factor analysis (FA) and SEM literature are key to this proposed approach. First, the estimation of the measurement model is based on the framework of the approximate factor model (Chamberlain & Rothschild, 1982; Fan et al., 2021; Bai & Ng, 2023). This is because in the HDLSS setting, factors cannot be estimated with all the properties of common factor analysis; instead, some assumptions need to be relaxed. Second, inspired by Rosseel and Loh (2022), a two-stage approach is used: the factor scores obtained from the approximate factor model are used to estimate the path coefficients in the structural model. The two-stage approach affords applied researchers flexibility in an exploratory setting; that is, the structural model can be modified (e.g., by adding covariates or an outcome) without having to re-estimate the measurement part (Vermunt, 2010). This also helps prevent the meaning of the latent constructs from becoming dependent on the outcome or covariate considered.

The remainder of this paper is structured as follows: first, we will introduce the details of the proposed Regularized ESEM; second, an extensive simulation study will be conducted to evaluate its performance in comparison with other methods; third, the use of Regularized ESEM is demonstrated using two empirical applications, in which one is of an ultra-high-dimensional nature; lastly, the paper ends with a general discussion of the limitations and suggestions for future research. The proposed method was implemented in the R language for statistical computing (R Core Team, 2012). Both the implementation and the code used to generate the results in this paper can be found at https://github.com/trale97/rESEM.

Method

In this section, we will first introduce the notation and data, followed by the proposed models, their estimation, and model selection.

Data and notation

The following notations will be used in this paper: matrices and vectors are denoted by bold uppercase and bold lowercase letters, respectively; the transpose is indicated by the superscript $^{T}$ , and scalars by lowercase italics.

The data of interest is denoted by $y_{i}$, a $J \times 1$ vector containing the scores of person i on J observed variables. Throughout the paper, we work with standardized data, that is, variables are mean-centered and scaled to unit variance. $S(\cdot)$ is the soft-thresholding operator, defined as $S(x, λ) = \operatorname{sign}(x) {(|x| - λ)}_{+}$, with $\operatorname{sign}(x)$ being the sign of x and ${(|x| - λ)}_{+} = \max(0, |x| - λ)$.
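As a minimal illustration of the operator just defined, the following Python sketch implements $S(x, λ)$ for a single scalar. The function name `soft_threshold` is chosen here for illustration and is not taken from the authors' R implementation.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Soft-thresholding: S(x, lam) = sign(x) * max(0, |x| - lam)."""
    magnitude = max(0.0, abs(x) - lam)  # the (|x| - lam)_+ part
    if x > 0:
        return magnitude
    if x < 0:
        return -magnitude
    return 0.0


if __name__ == "__main__":
    # Values within lam of zero are set to zero; larger values are pulled toward zero by lam.
    for value in (-1.5, -0.2, 0.0, 0.4, 2.0):
        print(value, soft_threshold(value, 0.5))
```

Inputs with absolute value at most $λ$ are mapped to zero, which is what produces exact zeros in the LASSO-type loading updates discussed later.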

Model

Our proposed model is a structural equation model and hence consists of two parts: (1) the measurement model describing the relation between the observed indicator variables and the factors and (2) the structural model describing the relation between the factors (or between factors and observed variables that have no measurement model). First, the measurement model for Q factors is expressed as follows:

$$y = P η + ϵ \tag{1}$$

where $P$ is the $J \times Q$ loading matrix, $η$ is the $Q \times 1$ vector of factor scores and $ϵ$ is the $J \times 1$ vector of residuals. Here, no intercept is included because we assume standardized data. Model (1) is subject to two identification constraints: (1) $P$ is assumed to have simple structure, that is, not all observed variables load on all factors but some/many have zero loadings; (2) the factor scores are subject to a length restriction. Furthermore, $ϵ$ is assumed to have a mean of zero and to be uncorrelated with the factor scores. If we write $Cov(ϵ) = Φ$, then we have the model-implied covariance matrix $Σ = P Σ_{η} P^{T} + Φ$. Note that it is not possible to estimate $Φ$ as a full rank diagonal matrix in the high-dimensional setting (as in the classical case of a common factor model). Thus, we relax this assumption to allow for $ϵ$ to be weakly correlated. This type of model is known as the approximate factor model in the econometric literature (Bai & Ng, 2023; Chamberlain & Rothschild, 1982; Fan et al., 2021). Although it differs from the common factor model, the approximate factor model has been shown to produce consistent estimates of loadings and factors under (mild) assumptions in the HDLSS setting (Bai, 2003; Fan et al., 2021).
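The model-implied covariance matrix can be obtained from model (1) in one step. The short derivation below is a sketch that assumes $Σ_{η}$ denotes $Cov(η)$, which the text uses without an explicit definition; the cross-covariance terms vanish because the residuals are assumed to be uncorrelated with the factor scores.

```latex
\begin{aligned}
Σ = \operatorname{Cov}(y)
  &= \operatorname{Cov}(P η + ϵ) \\
  &= P \operatorname{Cov}(η) P^{T}
     + P \operatorname{Cov}(η, ϵ)
     + \operatorname{Cov}(ϵ, η) P^{T}
     + \operatorname{Cov}(ϵ) \\
  &= P Σ_{η} P^{T} + Φ .
\end{aligned}
```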

The structural model expresses the relations among the factors:

$$η = B η + ζ \tag{2}$$

with $B$ the $Q \times Q$ matrix of path coefficients among the factors and $ζ$ a $Q \times 1$ vector of residuals. Note that, to keep the notation simple, Eq. (2) also applies when the structural model includes observed variables instead of factors.

Stage 1: Estimating the measurement model

The approximate factor model in (1) can be estimated in several ways (e.g., maximum likelihood-based estimation (Bai & Li, 2016), covariance matrix estimation (Fan et al., 2011)). However, for our proposed method, we rely on a least-squares approach as this has been advocated in the high-dimensional and large-scale settings (see Fan et al. (2021)). Hence, we minimize the following objective function:

$$\begin{aligned} & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {\Big(y_{ij} - \sum_{q} η_{iq} p_{jq}\Big)}^{2} \\ \text{subject to} \quad & \sum_{i} η_{iq}^{2} = N \quad \forall q = 1, \ldots, Q, \\ & \text{and } P \text{ showing simple structure,} \end{aligned} \tag{3}$$

with $p_{jq}$ representing the loading of variable j on factor q and $η_{iq}$ the score of person i on factor q. The loadings are subject to simple structure; some or even many are equal to zero. Apart from sign and permutation invariance, the norm constraint on the factor scores together with the simple structure of the loadings introduces uniqueness$^{1}$. Furthermore, a simple structure facilitates interpretation, especially in settings with an (ultra-) large number of observed variables.

There are several options to formulate a simple structure as a mathematical objective. A direct way to do this is to impose a hard constraint on the number of nonzero loadings, that is, to look for a solution with at most C nonzero loadings; this is also known as a cardinality constraint or the best subset problem (Bertsimas et al., 2016a). The best subset problem is generally considered an NP-hard problem and thus is usually approximated by (convex) relaxations. One such popular relaxation is to add the LASSO penalty (Tibshirani, 1996) to the objective function:

$$\begin{aligned} \min \quad & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {\Big(y_{ij} - \sum_{q} η_{iq} p_{jq}\Big)}^{2} + λ \sum_{j = 1}^{J} \sum_{q = 1}^{Q} | p_{jq} | \\ \text{subject to} \quad & \sum_{i} η_{iq}^{2} = N \quad \forall q = 1, \ldots, Q, \end{aligned} \tag{4}$$

with $λ \geq 0$ , the tuning parameter of the LASSO. When $λ > 0$ , the penalty actively shrinks the loadings toward zero, and some elements to exactly zero. For high values of $λ$ , (almost) all loadings will become zero. Note that it is not easy to know beforehand how many elements will be zero for a particular value of $λ$ .

In what follows, we assume the number of factors Q and the value of the LASSO tuning parameter $λ$ are given. A strategy to determine these hyperparameters is discussed in Section “Model selection”. The optimization problem defined in (4) is a least-squares low-rank factorization problem with a convex penalty, which can be efficiently solved by alternating optimization (AO, Huang et al., 2016). After initialization, the factor scores and loadings are updated in turn, each assuming fixed values of the other. To update the factor scores subject to the norm constraint, the method of Lagrange multipliers is used. For the update of the loadings, a coordinate descent procedure (Friedman et al., 2010) is followed. Appendix A details the derivation of the conditional updates of factor scores and loadings, along with the pseudocode of the whole AO procedure. Note that this AO procedure results in a non-increasing sequence of loss values defined in (4) and converges to a stationary point. However, as the exploratory approximate factor analysis problem introduced in (4) is not convex, convergence to the global optimum is not guaranteed. We therefore recommend using multiple starting values to initialize the alternating procedure and retaining the solution with the lowest loss value as the final solution.
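To make the alternating scheme concrete, the following Python sketch works through the special case of a single factor (Q = 1), where both conditional updates of problem (4) have a closed form. It is an illustrative simplification under that assumption, not the authors' Algorithm 1 from Appendix A; all function names are hypothetical. With fixed loadings, the scores are proportional to $Yp$ and rescaled to satisfy the norm constraint; with fixed scores, each loading is a soft-thresholded inner product (threshold $λ/2$) divided by N.

```python
import math
import random


def soft_threshold(x, lam):
    """S(x, lam) = sign(x) * max(0, |x| - lam)."""
    return math.copysign(max(0.0, abs(x) - lam), x)


def penalized_loss(Y, eta, p, lam):
    """Squared residuals plus the LASSO penalty on the loadings (Q = 1)."""
    fit = sum((Y[i][j] - eta[i] * p[j]) ** 2
              for i in range(len(Y)) for j in range(len(p)))
    return fit + lam * sum(abs(v) for v in p)


def update_scores(Y, p, eta):
    """Scores given loadings: proportional to Y p, rescaled so sum(eta^2) = N."""
    n = len(Y)
    raw = [sum(Y[i][j] * p[j] for j in range(len(p))) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in raw))
    if norm == 0.0:  # all loadings are zero: the loss no longer depends on eta
        return eta
    return [math.sqrt(n) * v / norm for v in raw]


def update_loadings(Y, eta, lam):
    """Loadings given scores: soft-threshold eta'y_j at lam / 2, then divide by N."""
    n = len(Y)
    return [soft_threshold(sum(eta[i] * Y[i][j] for i in range(n)), lam / 2.0) / n
            for j in range(len(Y[0]))]


def fit_single_factor(Y, lam, n_iter=50, seed=0):
    """Alternate the two conditional updates from one random start."""
    rng = random.Random(seed)
    n = len(Y)
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(n / sum(v * v for v in eta))
    eta = [scale * v for v in eta]  # enforce the norm constraint at the start
    p = update_loadings(Y, eta, lam)
    history = [penalized_loss(Y, eta, p, lam)]
    for _ in range(n_iter):
        eta = update_scores(Y, p, eta)
        p = update_loadings(Y, eta, lam)
        history.append(penalized_loss(Y, eta, p, lam))
    return eta, p, history
```

Because each step is an exact conditional minimizer, the recorded loss values never increase. A single random start may still end in a poor stationary point (for instance, all loadings equal to zero when $λ$ is large), which is why running several starts with different `seed` values and keeping the lowest final loss is advisable.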

Orthogonal factors are a special case of model (1), and although our algorithm also applies to this case, we propose to use the algorithm presented by Adachi and Trendafilov (2016) (see also Shen and Huang (2008)). The reason is that this is a particularly elegant method that, exceptionally, allows solving the cardinality-constrained problem at very low computational cost. Furthermore, the use of a cardinality constraint gives users easy control over the number of nonzero loadings and avoids the shrinkage-to-zero bias of the LASSO approach (Bertsimas et al., 2016b). Hence, in the case of orthogonal factors, we solve the cardinality-constrained problem:

$$\begin{aligned} \min \quad & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {\Big(y_{ij} - \sum_{q} η_{iq} p_{jq}\Big)}^{2} \\ \text{subject to} \quad & \sum_{i} η_{iq}^{2} = N \quad \forall q = 1, \ldots, Q, \\ & \sum_{i} η_{iq} η_{it} = 0 \quad \forall q \neq t, \text{ and } Card (P) = C, \end{aligned} \tag{5}$$

where Card( $P$ ) denotes the number of nonzero coefficients in the loading matrix $P$ . Note that here, the factor scores are constrained to be uncorrelated and have unit variance. This implies that when the observed variables are standardized, the nonzero loadings $p_{jq}$ are equal to the correlation between item j and factor q. Under the constraint of orthogonal factors, the cardinality-constrained problem defined in (5) can be solved efficiently in an alternating optimization scheme. Because (5) is not convex, this algorithm is also subject to local optima. Furthermore, the cardinality constraint itself is not a convex constraint; thus, the convergence of the AO procedure is not guaranteed (Huang et al., 2016).
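The low cost of the cardinality-constrained problem can be illustrated for the loading step alone. With orthogonal scores satisfying the constraints in (5), the least-squares loss separates over the individual loadings, so the best loading matrix with C nonzero entries keeps the C largest (in absolute value) unconstrained estimates $\sum_{i} y_{ij} η_{iq} / N$ and sets the rest to zero. The Python sketch below shows only this step; it is not the authors' Algorithm 2, and the function name is hypothetical.

```python
def cardinality_loading_update(Y, eta, card):
    """Loadings for fixed orthogonal scores, keeping the `card` largest in absolute value."""
    n, n_vars, n_factors = len(Y), len(Y[0]), len(eta[0])
    # Unconstrained least-squares loadings; valid because eta' eta = N * I.
    A = [[sum(Y[i][j] * eta[i][q] for i in range(n)) / n
          for q in range(n_factors)]
         for j in range(n_vars)]
    # Rank all J * Q entries by absolute size and keep the top `card` positions.
    ranked = sorted(((abs(A[j][q]), j, q)
                     for j in range(n_vars) for q in range(n_factors)),
                    reverse=True)
    keep = {(j, q) for _, j, q in ranked[:card]}
    return [[A[j][q] if (j, q) in keep else 0.0 for q in range(n_factors)]
            for j in range(n_vars)]


if __name__ == "__main__":
    # Two exactly orthogonal score columns with sum of squares N = 4.
    eta = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    Y = [[0.9 * e1, 0.8 * e2, 0.1 * e1 + 0.05 * e2] for e1, e2 in eta]
    print(cardinality_loading_update(Y, eta, card=2))
```

In contrast to the LASSO update, the retained loadings are not shrunken, which matches the remark above about avoiding shrinkage bias.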

Stage 2: Estimating the structural model

In Stage 2, the factor loadings and factor scores obtained from Stage 1 can be used in several ways to study the structural relations among variables. The factor scores can directly be used with any suitable method for users’ purposes (e.g., multiple linear regression or path analysis); additional covariates can also be controlled for if desired. We present Algorithm 1 for solving the LASSO problem and Algorithm 2 for solving the cardinality-constrained problem in the special case of orthogonal factors in Appendix A and Appendix B, respectively.
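As a worked example of Stage 2, consider regressing a centered outcome on orthogonal factor scores by ordinary least squares. The symbols $H$ (the $N \times Q$ matrix with elements $η_{iq}$), $z$ (the $N \times 1$ centered outcome), and $β$ (the regression weights) are introduced here for illustration only. Under the constraints of problem (5), $H^{T} H = N I$, so the estimate reduces to scaled cross-products:

```latex
\hat{β} = (H^{T} H)^{-1} H^{T} z = \frac{1}{N} H^{T} z,
\qquad \text{so that} \qquad
\hat{β}_{q} = \frac{1}{N} \sum_{i = 1}^{N} η_{iq} z_{i}.
```

With correlated factors, as allowed in problem (4), $H^{T} H$ is no longer diagonal and the full inverse is needed.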

Model selection

The proposed estimation procedure for Stage 1 (exploratory approximate factor model) requires the following inputs: the number of factors Q and either the number of nonzero loadings C or the value of the tuning parameter $λ$. Proper values may be theory-based, but if such prior knowledge is lacking, a model selection strategy is needed. We propose a sequential strategy: first, the number of factors Q is determined based on either prior knowledge or (a combination of) various factor retention methods (e.g., parallel analysis (Horn, 1965), Kaiser–Guttman rule (Kaiser, 1960), scree test (Cattell, 1966), or a machine learning method (Goretzko & Bühner, 2020)); second, given Q, the algorithm searches through a sequence of candidate values for C or $λ$ and chooses the value for which the model yields the highest Index of Sparseness (IS). IS is an indicator developed to determine the level of sparseness in PCA (Gajjar et al., 2017; Trendafilov, 2014). In an extensive simulation study, Gu et al. (2019) showed that the IS resulted in the highest selection accuracy compared with other selection methods (i.e., BIC, cross-validation (CV), repeated double CV, Bolasso with CV, and stability selection). The formula for IS is as follows:

$$IS = PEV_{PCA} \times PEV_{rESEM} \times PS, \tag{6}$$

where $P E V_{PCA}, P E V_{rESEM}$ and PS are the proportion of explained variance using ordinary PCA, the proportion of explained variance using Regularized ESEM, and the proportion of zero loadings, respectively. The IS value increases with increasing $P E V_{PCA}$ , increasing $P E V_{rESEM}$ , and the proportion of zero loadings. Thus, IS balances goodness-of-fit with the complexity of the solution (sparser solutions being less complex). A suitable range of C is Q (at least one nonzero loading per factor) to JQ. $λ$ ranges between 0 and $λ_{\max}$ with $λ_{\max}$ being the smallest value of $λ$ that makes $(J - 1)$ elements become zero per factor.
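The selection rule can be sketched in a few lines of Python. The sketch assumes that the two proportions of explained variance have already been computed elsewhere (their formulas are not given in this section) and only derives PS from a loading matrix; the function names and the candidate format are hypothetical.

```python
def index_of_sparseness(pev_pca, pev_resem, loadings, tol=0.0):
    """IS = PEV_PCA * PEV_rESEM * PS, with PS the proportion of zero loadings."""
    n_total = sum(len(row) for row in loadings)
    n_zero = sum(1 for row in loadings for value in row if abs(value) <= tol)
    return pev_pca * pev_resem * (n_zero / n_total)


def select_by_is(pev_pca, candidates):
    """Return the candidate setting (e.g., a value of C or lambda) with the highest IS.

    `candidates` is a list of (setting, pev_resem, loadings) tuples.
    """
    return max(candidates,
               key=lambda c: index_of_sparseness(pev_pca, c[1], c[2]))[0]


if __name__ == "__main__":
    dense = [[0.6, 0.2], [0.1, 0.5], [0.4, 0.1]]    # no zero loadings: PS = 0
    sparse = [[0.6, 0.0], [0.0, 0.5], [0.4, 0.0]]   # three of six zero: PS = 0.5
    print(select_by_is(0.8, [(6, 0.75, dense), (3, 0.70, sparse)]))
```

A solution without any zero loading has PS = 0 and therefore IS = 0, so the criterion never selects a fully dense loading matrix when a sparse candidate with positive explained variance is available.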

Related methods

The objectives of our two proposed methods are: (a) dealing with HDLSS settings, (b) deriving simple structure of the loading matrix with exact zero loadings to get interpretable factors, and (c) exploring relationships among the factors or with observed covariates/outcome variables. This section reviews existing methods related to ours. Their key features are summarized in Table 3. They can be categorized into three main groups: penalized maximum likelihood (ML) methods, (penalized) least-squares methods, and Bayesian methods.

The methods that use a penalized ML approach are RegSEM (Regularized SEM, Jacobucci, 2017), lslx (Penalized ML SEM, Huang, 2020; Huang et al., 2017), and PSEM-Mplus (Penalized SEM in Mplus, Asparouhov & Muthén, 2024). They are all one-stage methods that add penalties to the ML discrepancy function with some fundamental differences. RegSEM and lslx both use similar types of shrinkage penalties to reduce the number of free parameters, stabilize estimation, and obtain interpretable solutions, with penalty parameters selected using information criteria or cross-validation. However, the two methods rely on different optimization algorithms, which results in different performance (Huang, 2020). In contrast, PSEM-Mplus primarily aims at estimating models that are unidentified under ordinary ML. For instance, to deal with rotational freedom in an EFA model, PSEM-Mplus can incorporate the Geomin rotation criterion as a penalty. When this penalty is lightly imposed, as recommended by Asparouhov and Muthén (2024, 2025), the resulting estimates will be highly similar to the rotation solution, and the log-likelihood fit is not sacrificed. Its penalty parameters are user-specified rather than tuned through data-driven methods. None of these penalized ML approaches were specifically proposed as solutions to high-dimensional data problems, and their performance has been evaluated only for complex models (i.e., models with many free parameters) in traditional low-dimensional settings.

Structured Factor Analysis (SFA, Cho & Hwang, 2023) is a two-stage least-squares method. In the first stage, SFA uses an alternating least-squares procedure to estimate the measurement parameters and factor scores. In the second stage, the estimated factor scores are used to estimate the structural parameters. SFA was proposed to avoid improper solutions (e.g., negative residual variances, non-positive-definite covariance matrices) that often occur in the traditional covariance-based SEM approach, and to derive a probability distribution for all feasible factor scores rather than obtaining a single unbiased estimate of factor scores. The latter makes it possible to draw probabilistic inferences about individuals’ true factor scores. Note that the method assumes a confirmatory measurement model rather than an exploratory one.

Regularized Integrated Generalized Structured Component Analysis (Regularized IGSCA, Cho et al., 2025) is a one-stage least-squares method, in which penalties (LASSO and Ridge) are added to the IGSCA objective function, separately for loadings and structural coefficients. IGSCA (Hwang et al., 2021) is a one-step framework that aims at simultaneously estimating common factors and components, allowing for hybrid models. The penalties are included to deal with multicollinearity problems (Ridge penalty) and to perform variable selection (LASSO penalty). Regularized IGSCA uses K-fold cross-validation to determine the tuning parameter that controls the penalty strength. The method was demonstrated in conventional SEM contexts only.

The Bayesian framework is an alternative approach for handling complex SEM models. Bayesian estimation, typically implemented via Markov chain Monte Carlo (MCMC), does not rely on large-sample asymptotics and can therefore provide stable estimates, valid uncertainty quantification, and improved performance in complex settings where the number of parameters is large relative to the sample size (Marcoulides et al., 2023) or high-dimensional settings through the use of shrinkage priors (van Erp, 2023). However, the MCMC procedures can be computationally intensive. Moreover, Bayesian estimation depends strongly on prior specification: diffuse default (noninformative) priors can lead to substantial bias in small samples (McNeish, 2016; Rosseel, 2020). Thus, both prior specification and model evaluation can be challenging and require careful judgment (Marcoulides, 2018; Muthén & Asparouhov, 2012; Van Erp et al., 2018). A full comparison of Bayesian SEM and frequentist regularization approaches is beyond the scope of this paper, and interested readers are referred to the relevant literature (Jacobucci & Grimm, 2018; van Erp, 2023).

In short, our proposed method relates to the above-mentioned approaches in two main ways: it adopts a two-stage least-squares framework and uses regularization to obtain simple structure in the loading matrix. However, important differences remain: our method applies a single type of penalty (LASSO) or a cardinality constraint only to the measurement model (unlike penalized ML approaches and Regularized IGSCA), and in doing so, sacrifices some of the fit to impose simple structure (to some degree) on the loading matrix. Furthermore, the model was developed such that solutions can be computed for very large data (in terms of the number of observed variables, sample size, or both), and it allows for a fully exploratory measurement model. The strength used to impose sparseness of the loadings is determined using a data-driven criterion (i.e., the Index of Sparseness).

Simulation study

A simulation study was conducted to evaluate the performance of our proposed method compared with related methods. We excluded Bayesian SEM to maintain a focus on frequentist approaches, and SFA because it is not applicable in exploratory settings. The six methods RegSEM, lslx, PSEM-Mplus, Regularized IGSCA, and the proposed approaches (cardinality-constrained and LASSO) were examined based on the following performance measures: recovery rate of the simple structure for the measurement model; absolute bias and variance for the estimated regression weights in the structural model.

Setup

The simulated data were generated based on the population model in Fig. 1. For the traditional low-dimensional settings, we varied the following:

Sample size N at 3 levels: 50, 100, 500.
Number of factors Q at 2 levels: 3, 5.
Number of items K per factor at 2 levels: 3, 5.
Cross-loadings at 2 levels: 0, 0.5.
Item reliability (i.e., proportion of explained variance in each item by the factors) at 2 levels: 0.3, 0.8.
Correlation r between factors at 2 levels: 0, 0.3.
Variance accounted for in the outcome $z$ by the factors (VAFz) at 2 levels: 0.5, 0.9.

For the high-dimensional settings, we fixed $N = 50, Q = 5$ and varied the number of items K per factor at 3 levels: 15, 30, and 100 (which means in total there are 75, 150, and 500 items for a sample size of 50). The other criteria were varied in the same manner. The regression weights were fixed at 0.1 for all conditions. We set the primary loadings as $P_{primary} = \sqrt{0.6}$ and the cross-loadings as $P_{cross} = 0.5$, which is a quite substantial cross-loading. The locations of the cross-loadings were specified in the same manner as in Rosseel and Loh (2022) and can be found in the detailed data generation procedure in Appendix D. All simulation design factors were crossed, and in each condition, 50 replicate data sets were generated, leading to 9600 low-dimensional datasets and 2400 high-dimensional datasets. In total, 12,000 datasets were generated and analyzed.
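The per-setting dataset counts follow directly from crossing the design factors. The short Python check below reproduces them from the numbers of levels listed above; the dictionary keys are descriptive labels chosen here, not identifiers from the authors' code.

```python
from math import prod

REPLICATES = 50

# Number of levels per design factor in the low-dimensional settings.
low_dim_levels = {
    "sample_size": 3, "n_factors": 2, "items_per_factor": 2,
    "cross_loadings": 2, "reliability": 2, "factor_correlation": 2, "vaf_z": 2,
}

# High-dimensional settings: N = 50 and Q = 5 are fixed, K has 3 levels.
high_dim_levels = {
    "items_per_factor": 3, "cross_loadings": 2, "reliability": 2,
    "factor_correlation": 2, "vaf_z": 2,
}

n_low = prod(low_dim_levels.values()) * REPLICATES    # 192 conditions * 50
n_high = prod(high_dim_levels.values()) * REPLICATES  # 48 conditions * 50
items_high = [5 * k for k in (15, 30, 100)]           # total items with Q = 5

print(n_low, n_high, items_high)
```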

Performance measures

The performance of all methods was compared using several measures that are defined as follows:

The recovery of the simple structure is assessed with the zero/nonzero recovery rate PL. It is the proportion of zero and nonzero true loadings $p_{jq}^{true}$ used to generate the data that are correctly recovered by the estimated loadings:
$$PL = \frac{\# \text{ of correctly identified nonzero loadings} + \# \text{ of correctly identified zero loadings}}{\# \text{ of loadings in } P_{true}} . \tag{7}$$
To assess the accuracy and precision of the estimated path coefficients in the structural model, we use the absolute bias and root mean squared error (RMSE).
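These measures are simple to compute; a minimal Python sketch follows. The recovery rate implements Eq. (7) directly. For the path coefficients, the exact definition of absolute bias is not spelled out here, so the sketch assumes it is the absolute difference between the mean estimate across replications and the true value; this is an assumption, as are the function names.

```python
import math


def recovery_rate(true_loadings, est_loadings, tol=0.0):
    """PL: share of positions whose zero/nonzero status is correctly recovered."""
    total = 0
    correct = 0
    for true_row, est_row in zip(true_loadings, est_loadings):
        for t, e in zip(true_row, est_row):
            total += 1
            if (abs(t) > tol) == (abs(e) > tol):
                correct += 1
    return correct / total


def absolute_bias(estimates, true_value):
    """Assumed definition: |mean of the estimates - true value|."""
    return abs(sum(estimates) / len(estimates) - true_value)


def rmse(estimates, true_value):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


if __name__ == "__main__":
    P_true = [[0.7, 0.0], [0.7, 0.0], [0.0, 0.7]]
    P_est = [[0.6, 0.0], [0.5, 0.1], [0.0, 0.4]]   # one false-positive loading
    print(recovery_rate(P_true, P_est))            # 5 of 6 positions correct
```

Note that PL only compares the positions of zero and nonzero loadings; the size of the estimated loadings does not enter the measure.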

Analyses

The number of factors Q was treated as known in this simulation study. It is important to point out that deciding the number of factors in latent variable methods is an important task, yet not a straightforward one, which has raised extensive discussions. However, this is beyond the scope of this paper, and interested readers are referred to the literature dedicated to this matter (e.g., Auerswald & Moshagen, 2019; Goretzko, 2025). All other parameters (e.g., penalty tuning parameters, number of nonzero loadings, etc.) were treated as unknown for all methods.

For RegSEM, we chose the LASSO penalty in the algorithm and tuned the penalty parameter using cross-validation, closely following the authors’ tutorial paper (Li et al., 2021). We used lslx with the default setting with the MCP penalty and tuned the (penalty and shape) parameters using the BIC. For PSEM-Mplus, the Geomin prior (as recommended in Asparouhov and Muthén (2024) for EFA/ESEM models) was used as the penalty function for all loadings: $p_{11}, \ldots, p_{JQ} \sim \mathrm{Geomin}(Q, v)$, where Q is the number of factors and v takes the smallest value among (0.1, 1, 10, 100) for which the model converges. Since using the Geomin prior boils down to the same solution that would result from a rotation technique, the estimated factor loadings are not exactly zero. Thus, those that did not differ significantly from zero were considered to be zero in calculating the proportion of correctly recovered (non)zero loadings (significance level $α = 0.05$). Lastly, Regularized IGSCA used the LASSO penalty with cross-validation to tune the penalty parameter. Both lslx and Regularized IGSCA failed to estimate the full model with measurement and structural models simultaneously; thus, we only used them to estimate the measurement model.

Our proposed methods require tuning the number of nonzero loadings (for the cardinality-constrained approach) and the penalty parameter (for the LASSO approach). To this end, the algorithm is run for a sequence of candidate values, and the value that maximizes the Index of Sparseness is chosen. We specified the range$^{2}$ for the cardinality as all positive integer values in $[3 \times Q, J \times Q]$, and 100 values of the penalty parameter in $[N \times 0.1, λ_{\max}]$, where $λ_{\max}$ is the maximum penalty so that each factor has three items. We also used the proposed method with oracle information to disentangle the algorithmic performance of our method from the model selection performance. Here, the LASSO approach requires setting the value of the tuning parameter $λ$ to reach the correct number of (non)zero loadings. This was done using a binary search procedure in which $λ$ is gradually adapted until the desired number of zero loadings is attained. That is, the model is estimated using a certain value of $λ$, which results in a number of zero loadings in the estimated loading matrix. Depending on whether the true loading matrix contains more or fewer zero loadings than the estimated one, $λ$ is increased or decreased, respectively. The cardinality-constrained approach is more straightforward: the number of (non)zero loadings was given as direct input. Note that the factor solutions are subject to permutation freedom and sign invariance. Thus, all possible permutations and sign configurations were considered, and the one yielding the highest Tucker’s congruence with the true factor loadings (Lorenzo-Seva & Ten Berge, 2006) was selected as the final solution.
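The binary search over $λ$ described above can be sketched as follows. The sketch assumes a user-supplied function `count_zeros(lam)` that fits the model for a given $λ$ and returns the number of zero loadings, and it assumes that this number does not decrease as $λ$ grows, which is typical for the LASSO but not guaranteed for a non-convex problem. Both names are hypothetical.

```python
def find_lambda(count_zeros, target_zeros, lam_low, lam_high, max_iter=50):
    """Bisect on lambda until count_zeros(lam) equals target_zeros.

    Returns the lambda found, or None if no match is reached in max_iter steps.
    """
    for _ in range(max_iter):
        lam = (lam_low + lam_high) / 2.0
        n_zero = count_zeros(lam)
        if n_zero == target_zeros:
            return lam
        if n_zero < target_zeros:
            lam_low = lam    # too few zeros: increase the penalty
        else:
            lam_high = lam   # too many zeros: decrease the penalty
    return None


if __name__ == "__main__":
    # Toy stand-in for a model fit: a loading becomes zero once lam reaches its size.
    sizes = [0.2, 0.5, 1.0, 3.0]

    def toy_count_zeros(lam):
        return sum(1 for s in sizes if s <= lam)

    lam = find_lambda(toy_count_zeros, target_zeros=2, lam_low=0.0, lam_high=4.0)
    print(lam, toy_count_zeros(lam))
```

In practice each call to `count_zeros` involves a full model fit, so the number of bisection steps directly determines the computational cost of the oracle LASSO analysis.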

Results

Interested readers can interactively inspect all simulation results with the R Shiny app at https://trale.shinyapps.io/rESEMextrasim/ (Le et al., 2025). It is worth noting that RegSEM struggled to converge and produce sensible results in even very simple settings (e.g., without cross-loadings, high reliability, etc.). Thus, we omitted RegSEM results from the simulation study.

Measurement model

We first discuss the results from the typical low-dimensional settings. Here, we report the extent to which each method recovered the underlying loadings structure. PSEM-Mplus and lslx both had non-convergence issues (i.e., no results were returned) in 4267 datasets and 37 datasets (out of 9600), respectively. Additionally, PSEM-Mplus produced improper results (i.e., negative residual variances) for 590 datasets. The proposed method (both the LASSO and cardinality-constrained approaches) converged in all datasets. Figures 2 and 3 show the zero/nonzero recovery rate of all methods across different conditions for orthogonal and correlated factors, respectively (for VAFz = 0.9 and five indicators per factor). As expected, the sample size, the number of factors, and item reliability affected all methods. That is, the recovery rate was higher for larger sample sizes and higher reliability, yet lower with a higher number of factors. The presence of cross-loadings proved to be challenging for all methods. Both of the proposed approaches, when using oracle information (i.e., the true number of (nonzero) loadings was used as input), had the highest rate of recovering the correct positions of zero and nonzero loadings. When the IS was used for model selection, the proposed cardinality-constrained approach still outperformed the other methods, followed by PSEM-Mplus and the proposed LASSO approach. However, it is important to note that the results for PSEM-Mplus were only based on 4743 datasets for which it converged with proper solutions, which is less than half of the total datasets. Regularized IGSCA had the lowest recovery rate.

Fig. 2 — Percentage of correctly identified zero and nonzero loadings for orthogonal factors. *rESEM-l1* denotes the proposed Regularized ESEM using the LASSO penalty. *rESEM-CC* denotes the proposed Regularized ESEM using the cardinality-constrained approach; *(IS)* denotes the proposed method using the IS for model selection, while *(oracle)* uses the true number of (non)zero loadings. Improper and non-convergent results from *PSEM-Mplus* were excluded

Fig. 3 — Percentage of correctly identified zero and nonzero loadings for correlated factors. *rESEM-l1* denotes the proposed Regularized ESEM using the LASSO penalty. *rESEM-CC* denotes the proposed Regularized ESEM using the cardinality-constrained approach; *(IS)* denotes the proposed method using the IS for model selection, while *(oracle)* uses the true number of (non)zero loadings. Improper and non-convergent results from *PSEM-Mplus* were excluded

Focusing on the proposed Regularized ESEM, we saw an interesting pattern: the cardinality-constrained approach had slightly higher recovery rates than the LASSO approach, even when the factors were correlated in the data-generating model. Using oracle information improved the recovery rate of the proposed method compared with the model selection procedure using the IS. The improvement was most noticeable for the proposed LASSO approach and in less ideal scenarios, namely, small sample sizes, low item reliability, and a larger number of factors. Furthermore, we recorded the percentage of datasets per condition for which the IS found the correct number of (non)zero loadings. The results are reported in Appendix E. The accuracy of the IS was affected negatively by a larger number of factors and the presence of (moderate) cross-loadings. This pattern was stronger for the LASSO approach than for the cardinality-constrained approach. Larger sample sizes seemed to improve the performance of the IS, especially in the case of the cardinality-constrained approach (i.e., the IS selected the correct cardinality for 100% of the datasets when $N = 500$). Note that this measure of the performance of the IS is a very strict criterion: any deviation between the true and selected number of (non)zero loadings, even by a single loading, is considered an incorrect selection. In high-dimensional settings (Table 6), the IS therefore appeared to perform worse because the larger number of loadings increases the chance of missing one. Both approaches in these analyses used a multi-start procedure; it is therefore of interest to see how many different starts actually resulted in the same final solution. To this end, the loss value of each of 100 random starts was recorded and compared with the loss value of the final solution. The results are reported in Appendix F. Averaged across all conditions, the cardinality-constrained approach had a higher number of starts with the same loss value as the final one than the LASSO approach (i.e., 76 vs. 47).
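
The multi-start comparison can be sketched as follows. This is a minimal illustration in Python under two assumptions that are not stated in the text: the final solution is taken to be the start with the lowest loss, and two loss values are treated as equal when they agree up to a small relative tolerance. The function name and the loss values are hypothetical.

```python
def starts_matching_final(losses, tol=1e-6):
    """Count the random starts whose loss equals the best loss up to a tolerance."""
    final = min(losses)  # assumption: the retained solution has the lowest loss
    scale = max(1.0, abs(final))
    return sum(1 for loss in losses if abs(loss - final) <= tol * scale)


# Illustrative loss values from five random starts.
losses = [12.5, 12.5000000001, 13.1, 12.5, 14.0]
n_same = starts_matching_final(losses)
```

A high count suggests that the algorithm reaches the same solution from most starting points, which is the sense in which the two approaches are compared above.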

Table 6.

Percentage of datasets per condition for which IS selected the correct number of (non)zero loadings (high-dimensional scenario)

Items per factor	Reliability	Factors correlation	Cross-loadings	rESEM-CC	rESEM-l1
15	0.3	0	yes	50	1
30	0.3	0	yes	31	2
100	0.3	0	yes	15	0
15	0.8	0	yes	69	38
30	0.8	0	yes	92	38
100	0.8	0	yes	97	28
15	0.3	0.3	yes	31	1
30	0.3	0.3	yes	19	1
100	0.3	0.3	yes	2	2
15	0.8	0.3	yes	83	41
30	0.8	0.3	yes	80	34
100	0.8	0.3	yes	80	29
15	0.3	0	no	50	3
30	0.3	0	no	47	0
100	0.3	0	no	9	0
15	0.8	0	no	100	78
30	0.8	0	no	100	65
100	0.8	0	no	100	62
15	0.3	0.3	no	19	1
30	0.3	0.3	no	6	2
100	0.3	0.3	no	0	1
15	0.8	0.3	no	100	69
30	0.8	0.3	no	100	68
100	0.8	0.3	no	100	53

Structural model

Figure 4 shows the absolute bias of the path coefficients across all conditions in the low-dimensional settings (for five items per factor with cross-loadings present). In the case of orthogonal factors, the two proposed approaches had almost indistinguishable bias. However, when the factors were correlated, the cardinality-constrained approach estimated the structural coefficients with much higher bias than the LASSO approach. This result is to be expected, given that the cardinality-constrained approach restricts the factors to be orthogonal. Thus, this approach is not safeguarded against local misspecification. Here, PSEM-Mplus appeared to be the most biased across almost all conditions. It was less biased than the cardinality-constrained approach only in a few scenarios with a high number of correlated factors and high item reliability. Again, note that more than half of the PSEM-Mplus results were excluded due to non-convergence issues and improper solutions.

Fig. 4 — Average absolute bias of the estimated coefficients for the structural model. Results for *rESEM-l1* and *rESEM-CC* were based on model selection using the IS. Improper and non-convergent results from *PSEM-Mplus* were excluded

High-dimensional settings

In the high-dimensional settings, lslx and IGSCA failed to produce any results. lslx could not estimate any model because the sample covariance matrix is singular in high-dimensional settings, which prevented it from generating starting values for the estimated latent covariance matrix. PSEM-Mplus again failed to return any results in more than half of the datasets (i.e., 1528 out of 2400 datasets). In particular, this non-convergence issue occurred in all datasets with 500 observed variables.

The results for the measurement model by the proposed approaches and PSEM-Mplus in the high-dimensional settings are displayed in Fig. 5. Here, the same patterns as in the low-dimensional settings were observed: the performance of all methods was negatively affected by lower reliability and the presence of cross-loadings; the cardinality-constrained approach outperformed the others in all conditions; and the LASSO approach was better than PSEM-Mplus only when item reliability was high and the oracle information was used. However, an important result to point out here is that while increasing the number of items per factor improved the recovery rates of all methods, PSEM-Mplus could not deal with situations with $K = 100$ (i.e., 500 items for a sample size of 50). Our proposed approaches did not encounter this problem. The stability of the estimated loadings in the high-dimensional settings was also evaluated by examining their (average) standard deviations (SD) across replications. The results in Fig. 6 suggest that the proposed cardinality-constrained approach had the lowest SD among the three methods. More importantly, the SD decreased as the number of items grew for both the proposed cardinality-constrained approach and PSEM-Mplus. This result confirms what has been established in the literature on the consistency of the approximate factor model in the high-dimensional setting (Daniele et al., 2025; Fan et al., 2021). However, this pattern did not seem to hold for the LASSO approach: its model selection procedure (using the IS to choose the sparsity-inducing penalty) was already unstable (as seen in the previous sections), and this instability could be exacerbated when there is a higher number of items.

Fig. 6 — Average standard deviation of the estimated loading matrix across replications in the high-dimensional settings. *rESEM-l1* denotes the proposed Regularized ESEM using the LASSO penalty. *rESEM-CC* denotes the proposed Regularized ESEM using the cardinality-constrained approach. Improper and non-convergent results from *PSEM-Mplus* were excluded

The results for the structural model followed the same trend as the low-dimensional results, and thus, we do not report them here. Interested readers can inspect the detailed results using the Shiny app (Le et al., 2025). To summarize, PSEM-Mplus estimated the structural coefficients with the highest bias across almost all conditions. Both the proposed approaches were the least biased, but again, the proposed cardinality-constrained approach was substantially biased when factors were correlated in the data-generating model. The method, however, did have the lowest SD out of the three methods in all conditions.

Summary

In short, the proposed method (both approaches) outperformed the other methods: the cardinality-constrained approach had the highest recovery rate of the zero/nonzero loadings regardless of the factors’ relationship, and the LASSO approach estimated the structural coefficients with the lowest bias (only compared with PSEM-Mplus). lslx and IGSCA failed to estimate complex models (i.e., models with both measurement and structural parts simultaneously) and also cannot deal with high-dimensional data. PSEM-Mplus and lslx both had non-convergence issues, with PSEM-Mplus having the worst convergence rate (i.e., it did not converge in more than 50% of the simulated datasets in both low- and high-dimensional settings).

It is important to point out that even though the cardinality-constrained approach outperformed the LASSO approach when estimating the measurement model for both orthogonal and correlated factors, this result did not hold for the structural part. That is, the cardinality-constrained approach produced much more biased structural coefficients than the LASSO approach when the factors were correlated in the population model. Thus, in an exploratory setting where no prior information about the factor correlations exists, it is advised to: first, use the cardinality-constrained approach to find the simple structure of the loadings; then, this simple structure can be used as a guide for the LASSO approach to tune the penalty parameter and estimate the factor scores.
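
The second step of this recommendation requires a penalty value that yields the cardinality found by the cardinality-constrained approach. As a hedged illustration only, the Python sketch below uses bisection for this search, which is a swapped-in technique and not the procedure used in this paper. It assumes that the number of nonzero loadings does not increase when the penalty grows, which need not hold exactly for a non-convex problem. The callable `count_nonzero` stands for refitting the model at a given penalty and is hypothetical; a toy thresholding rule replaces it here.

```python
def tune_penalty(count_nonzero, target, lam_lo, lam_hi, max_iter=50):
    """Bisection on the penalty, assuming count_nonzero(lam) is non-increasing in lam."""
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        n = count_nonzero(lam)
        if n == target:
            return lam
        if n > target:
            lam_lo = lam  # too many nonzero loadings: increase the penalty
        else:
            lam_hi = lam  # too few nonzero loadings: decrease the penalty
    return None  # target cardinality not reached (e.g., because of ties)


# Toy stand-in for a model fit: a loading survives when |a| > lam / 2.
toy_values = [0.9, -0.7, 0.4, 0.2, -0.1, 0.05]


def toy_count(lam):
    return sum(1 for a in toy_values if abs(a) > lam / 2)


lam = tune_penalty(toy_count, target=3, lam_lo=0.0, lam_hi=2.0)
```

With a real model, each call to `count_nonzero` would involve a full (multi-start) fit, so the number of bisection steps directly determines the computational cost.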

Empirical application

This section demonstrates the application of the two proposed methods using two empirical datasets: one is a traditional psychology questionnaire dataset while the other is an ultra-high-dimensional gene expression dataset (thousands of genes for just several dozen observations).

Habitual stress recovery data and pre-processing

The first dataset concerns habitual strategies for stress recovery and their association with well-being and health in aging. Data was collected in 2023. In total, 421 individuals participated in the survey. For our analysis, we only included participants between 18 and 35 years as the “young” group and those 60 years old and older as the “old” age group. This results in a dataset with 415 observations. Data also contain missing values, which were imputed using multiple imputation with the ‘mice’ package in R (van Buuren & Groothuis-Oudshoorn, 2011). Here, we demonstrate how our proposed methods can be used to find the latent stress recovery strategies from self-reported questionnaire items and obtain factor scores for these strategies. The factor scores can then be used to explore how one’s recovery strategies are linked with their well-being.

Measures

Outcome variable The survey measured the participants’ satisfaction with life using a questionnaire of five items on a seven-point scale (Diener et al., 1985). We obtained a sum score of these five items to create a single outcome variable.

Exogenous factors The exogenous factors were the different latent stress recovery strategies. These strategies were measured using five different questionnaires: the COPE Brief (Carver, 1997) with 28 items on a four-point scale, the Emotion Regulation Questionnaire (ERQ, Gross & John, 2003) with ten items on a seven-point scale, the Interpersonal Emotion Regulation in Close Relationship (IER, Horn, 2022) with 15 items on a five-point scale, the Ruminative Responses Scale (RSS, Treynor et al., 2003) with ten items on a four-point scale, and the Thought Control and Ability Questionnaire (TCAQ, Luciano et al., 2005) with 25 items on a five-point scale. For each questionnaire, we created sum scores for its subscales. In total, 26 sum scores were computed and used as observed variables for the measurement model. Variables were standardized prior to the analysis.

Control variables In this analysis, we controlled for participants’ education levels, gender, and age groups. Furthermore, we also examined whether the effects of stress recovery strategies differ between the young and old groups.

Results

Measurement model Using parallel analysis, we identified five exogenous factors from the five questionnaires measuring stress recovery strategies. Given this, we carried out the model selection procedure to determine the number of nonzero loadings. Ordinary PCA with five components accounted for 51.6% of the total variance. The maximum IS was achieved for a cardinality of 23 nonzero loadings using the cardinality-constrained approach (rESEM-CC), with five factors explaining 43.7% of the total variance (see Fig. 8 for the values of IS and proportion of explained variance as functions of the number of nonzero loadings). The penalty parameter of the LASSO approach (rESEM-l1) was tuned to reach the same number of nonzero loadings, which explained 44.6% of the total variance. Since we did not have prior knowledge, we chose to conduct the analysis using the LASSO approach to be able to explore the correlations of these five stress recovery strategies. The correlation matrix among the factors can be found in Appendix H. The final loading matrix is reported in Table 1.

Fig. 8 — Index of sparseness (IS) and proportion of explained variance (PEV) against cardinality level

Table 1.

Loading matrix using the LASSO approach

Questionnaire	Items	External support	Cognitive reappraisal	Humor	Brooding	Suppression
IER	Co-distraction	0.771	0	0	0	0
	Co-suppression	0	0	0	0	0.874
	Co-brooding	0	0	0	0.656	0
	Physical affection	0.807	0	0	0	0
	Co-reappraisal	0.782	0	0	0	0
	Positive humor	0.513	0	0.635	0	0
	Negative humor	0	0	0.844	0	0
ERQ	Cognitive reappraisal	0	0.649	0	0	0
	Expressive suppression	0	0	0	0	0.874
COPE Brief	Active coping	0	0.776	0	0	0
	Planning	0	0.710	0	0	0
	Positive reframing	0	0.644	0	0	0
	Acceptance	0	0	0	-0.501	0
	Humor	0	0	0.802	0	0
	Religion	0	0	0	0	0
	Using emotional support	0.749	0	0	0	0
	Using instrumental support	0.737	0	0	0	0
	Self-distraction	0	0	0	0	0
	Denial	0	0	0	0.367	0
	Venting	0	0	0	0.405	0
	Substance	0	0	0	0	0
	Behavioral disengagement	0	0	0	0	0
	Self-blame	0	0	0	0.566	0
TCAQ	Thought control	0	0	0	-0.815	0
RSS	Reflection	0	0	0	0.663	0
	Brooding	0	0	0	0.844	0

Structural model The factor scores of the five stress recovery strategies were used in a regression analysis with Life Satisfaction as the outcome variable, controlling for participants’ gender, education level, age group, and the interactions between the stress recovery strategies and age group. As shown in Table 2, the Brooding, External support, and Suppression strategies had significant effects on life satisfaction, controlling for the other predictors. Specifically, a higher level of using brooding and expressive suppression to deal with stress is associated with a lower level of life satisfaction. In contrast, individuals who tend to seek social support as a stress recovery strategy have a higher level of life satisfaction. Furthermore, the effect of using external support and expressive suppression is stronger for the young group than for the old group.

Table 2.

Structural model parameters

Exogenous factors	rESEM-l1
(intercept)	0.214
External support	0.185**
Humor	0.091
Brooding	-0.416**
Cognitive reappraisal	0.037
Suppression	-0.169*
Female	-0.127
Other gender	-0.417
Young	-0.157
Lower vocational education	0.461
Vocational education	0.017
University of applied sciences	-0.155
Research university	-0.166
External support * Young	0.318*
Humor * Young	-0.149
Brooding * Young	-0.001
Cognitive reappraisal * Young	0.053
Suppression * Young	0.314*

Note. The outcome variable is Life Satisfaction

$^{*} p < .05$. $^{**} p < .01$

The autism genetic data

To illustrate the proposed method as a summarization tool in high-dimensional settings, we use gene expression data from lymphoblastoid cells to distinguish different types of autism (Nishimura et al., 2007). The dataset consists of 43,893 genetic markers measured for each of 27 individuals, of whom 14 are in the control group, six are affected with autism caused by a fragile X mutation (FMR1-FM), and seven are affected with autism caused by a 15q11-q13 duplication (dup15q). All variables were standardized prior to the analyses. Three factors were chosen based on the original work of Nishimura et al. (2007).

Ordinary PCA with three components explained 32% of the total variance. The maximum IS was achieved for a cardinality of 37,172 nonzero loadings for the cardinality-constrained approach, with the three factors explaining 25% of the total variance. Again, the penalty parameter of the LASSO approach was tuned by trial and error. Figure 7 presents the scatterplot of the factor scores. As shown, the second factor clearly separates the control group from the two autism groups. This observation is in accordance with the result from Nishimura et al. (2007): the separation between the groups is the largest source of variation in the data. While the authors used a data-driven approach (namely, maximizing the F statistic in an analysis of variance) to select a subset of 293 relevant variables, which were subsequently used to construct genetic risk scores for autism, our method did not use such prior information and yet was still able to reveal the distinction between groups.

Fig. 7 — Scatterplot of the factor scores using cardinality constraint

Discussion

In this paper, we propose a two-stage regularized exploratory SEM method. The most important contribution of this method lies in the first stage, wherein we offer a unique exploratory approximate factor analysis that imposes simple structure on the loading matrix and can efficiently deal with high-dimensional data. The resulting measurement model can be used to indicate which variables load (i.e., have a nonzero loading) on each of the factors and can be followed by a confirmatory SEM/factor analysis. Additionally, the factor scores obtained in the first stage can be used in the second stage to directly investigate the structural relations among the variables in a manner appropriate to the users’ purposes. The method is rather easy to use, does converge, and comes with freely available code in R. It is important to note that the current paper does not propose a method to completely replace the traditional SEM method. Rather, our objectives are: (i) Regularized ESEM can efficiently deal with exploratory and HDLSS settings, in which all other methods fail; (ii) Regularized ESEM produces a simple structure for the measurement model with exact zero loadings to ease interpretation; and (iii) Regularized ESEM is offered as a user-friendly algorithm in the free open-source statistical software R.

The proposed method uses the LASSO penalty to achieve a simple structure for the loading matrix, but we also pointed out the (more elegant) cardinality-constrained algorithm for the special case of orthogonal factors. Note that due to the different assumptions about the factor relations (i.e., correlated versus orthogonal), the two approaches can lead to very different results. In particular, the cardinality-constrained approach is not robust against local misspecification: it produces highly biased estimates for the structural model when the factors are correlated in the population model. Thus, in an exploratory setting where no prior information is available, we recommend using the cardinality-constrained approach to determine the number of nonzero/zero loadings and their positions, since its model selection procedure is more accurate and straightforward, regardless of the factor relations in the structural model. This structure can then be used as a guide for the LASSO approach to select the penalty parameter and estimate the structural paths.

There are several limitations to the current paper. Firstly, the paper did not propose a specific method for determining the number of factors. This is a nuanced task, especially in the high-dimensional settings where little prior knowledge is available. It often requires inspecting different results from different methods of extracting the number of factors, instead of using one single method (Auerswald & Moshagen, 2019). Although factor retention was not addressed in this paper, it is an important direction for future work. For instance, a more detailed exploration of this issue in the context of regularized methods and high-dimensional settings would provide more insights and completeness for the proposed framework. Secondly, there are three sources of uncertainty carried from the first to the second stage of the proposed method: (i) model selection uncertainty; (ii) parameter estimation uncertainty in Stage 1 (i.e., factor scores are based on loadings, which are themselves estimates); and (iii) measurement error in the factor scores, which are not free of such error in the low-dimensional settings. A potential solution would be to draw inferences for Stage 2 using a type of parametric bootstrap method: data are sampled from a multivariate normal distribution using the correlation matrix of the original data. For each bootstrap sample, model selection is performed, and factor loadings and scores are estimated, which are then used to estimate the structural model. However, this remedy can be computationally intensive, and in the low-dimensional settings, it does not correct for the bias of the estimates caused by measurement errors. This issue of measurement error yielding biased estimates for the structural parameters in the low-dimensional setting might be resolved by accounting for unique factors (e.g., as done in IGSCA; Cho et al., 2025; Hwang et al., 2021) or using some bias-correction methods (for an overview of step-wise methods in latent variable modeling, see Vermunt, 2024).
Alternatively, one can use the nonzero/zero loadings structure obtained from the cardinality-constrained approach as prior information for a (semi)confirmatory method, such as in lavaan (Rosseel, 2012), Mplus (Muthén & Muthén, 1998-2017), lslx (Huang, 2020), and others, which would help reduce the number of parameters needed to be freely estimated in an exploratory model that often causes non-convergence issues for these methods (even in the low-dimensional case). This type of strategy is also known as EFA-based CFA (EFCA) in the literature and has been shown to estimate parameters with good accuracy and model fit (Nájera et al., 2023).
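
The parametric bootstrap mentioned among the limitations can be outlined as below. This is a rough Python sketch under stated assumptions, not an implementation from the paper: the correlation matrix `R` is assumed to be positive definite (which fails when there are more variables than observations, so the HDLSS case would need a different sampler), and `fit_two_stage` is a hypothetical callable that performs model selection, Stage 1, and Stage 2 on one bootstrap dataset.

```python
import math
import random


def cholesky(R):
    """Lower-triangular L with L L^T = R, for a positive definite matrix R."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L


def bootstrap_sample(R, n_obs, rng):
    """Draw n_obs rows from a multivariate normal with mean 0 and covariance R."""
    L = cholesky(R)
    J = len(R)
    rows = []
    for _ in range(n_obs):
        z = [rng.gauss(0.0, 1.0) for _ in range(J)]
        rows.append([sum(L[j][k] * z[k] for k in range(j + 1)) for j in range(J)])
    return rows


def parametric_bootstrap(R, n_obs, n_boot, fit_two_stage, seed=1):
    """Apply the (hypothetical) two-stage fit to each bootstrap dataset."""
    rng = random.Random(seed)
    return [fit_two_stage(bootstrap_sample(R, n_obs, rng)) for _ in range(n_boot)]


# Toy usage: the "fit" is just the mean of the first variable.
R = [[1.0, 0.5], [0.5, 1.0]]
L = cholesky(R)
estimates = parametric_bootstrap(
    R, n_obs=30, n_boot=20,
    fit_two_stage=lambda data: sum(row[0] for row in data) / len(data),
)
```

The spread of the returned estimates would then serve as a basis for inference on the Stage 2 coefficients, at the cost of repeating the whole procedure for every bootstrap sample.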

Apart from the aforementioned limitations, our work can be extended in several directions. For example, another topic of interest is the use of a cardinality constraint in more general settings. As seen in this paper, the cardinality constraint offers several advantages yet requires orthogonality of the factor scores. An algorithm that does not require this orthogonality constraint can be developed based on the numerical solution proposed by Adachi and Kiers (2017) in the regression context and Guerra-Urzola et al. (2023) in the sparse PCA context. Note that such an algorithm does not guarantee convergence and optimality, similarly to the proposed Regularized ESEM in this paper. A further topic of research is to examine the predictive power of the proposed method, which might be useful for applied researchers who are not only interested in explaining the underlying mechanisms but also in predicting certain outcomes. Another natural extension of the current method is to account for multi-group analyses, for which the latent factors can be compared across different groups of individuals.

Acknowledgements

This paper draws on the data from the Survey Assessment of Habitual Stress Recovery Strategies and Their Link to Well-Being and Health in Aging. Data collection was supported by a seed funding grant of the Herbert-Simon Research Institute, Tilburg University (grant holders: Dr. Nicola Ballhausen, Dr. Stefanie Duijndam, and Prof. Dr. Yvonne Brehmer).

Appendix A Algorithm: Regularized ESEM with the LASSO penalty

Here, we present the mathematical derivations and detailed description of the algorithm for the method using the LASSO penalty.

1. Update of the factor scores conditional upon fixed values for the loadings is based on the following objective:

\begin{matrix} \min_{η_{11}, \dots, η_{iq}, \dots, η_{NQ}} & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {(y_{ij} - \sum_{q} η_{iq} p_{jq})}^{2} + λ \sum_{j = 1}^{J} \sum_{q = 1}^{Q} {| p_{jq} |}_{1} \\ \text{subject to} & \sum_{i} η_{iq}^{2} = N \quad \forall q = 1, \dots, Q . \end{matrix}

Equation (8) can be rewritten in matrix form as follows:

\begin{matrix} \min_{η_{1}, \dots, η_{q}, \dots, η_{Q}} & \lVert Y - \sum_{q} η_{q} p_{q}^{T} \rVert^{2} + \sum_{q} λ {| p_{q} |}_{1} \\ \text{subject to} & η_{q}^{T} η_{q} = N \quad \forall q = 1, \dots, Q . \end{matrix}

Using the method of Lagrange Multipliers, we obtain the Lagrangian function

\begin{matrix} L & = \lVert Y - \sum_{q} η_{q} p_{q}^{T} \rVert^{2} - μ (η_{q}^{T} η_{q} - N) \\ & = \lVert Y - \sum_{t \neq q} η_{t} p_{t}^{T} - η_{q} p_{q}^{T} \rVert^{2} - μ (η_{q}^{T} η_{q} - N) \\ & = \lVert E_{q} - η_{q} p_{q}^{T} \rVert^{2} - μ (η_{q}^{T} η_{q} - N) \end{matrix}

Solving $\frac{\partial L}{\partial η_{q}} = 0$ and $\frac{\partial L}{\partial μ} = 0$ , we have $η_{q} = \frac{\sqrt{N} E_{q} p_{q}}{\sqrt{{(E_{q} p_{q})}^{T} (E_{q} p_{q})}}$ .
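
The closed-form update of a single factor score vector can be illustrated with a small Python sketch. Here `E_q` stands for the residual matrix $Y - \sum_{t \neq q} η_{t} p_{t}^{T}$ and `p_q` for the current loadings on factor $q$; the function name and the toy numbers are hypothetical, and the guard against a zero vector is an addition for illustration.

```python
import math


def update_factor_scores(E_q, p_q):
    """Return eta_q = sqrt(N) * E_q p_q / ||E_q p_q|| for an N x J residual matrix."""
    n = len(E_q)
    v = [sum(e * p for e, p in zip(row, p_q)) for row in E_q]  # E_q p_q
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("E_q p_q is the zero vector; the update is undefined")
    return [math.sqrt(n) * x / norm for x in v]


# Toy example with N = 3 observations and J = 2 variables.
E_q = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
p_q = [0.5, 0.0]
eta_q = update_factor_scores(E_q, p_q)
```

By construction, the updated scores satisfy the scaling constraint $η_{q}^{T} η_{q} = N$.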

2. To update the loading matrix $P$, a coordinate descent procedure is used. First note that the objective function is separable in the variables:

\begin{matrix} \min_{p_{11}, \dots, p_{jq}, \dots, p_{JQ}} & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {(y_{ij} - \sum_{q} η_{iq} p_{jq})}^{2} + λ \sum_{j = 1}^{J} \sum_{q = 1}^{Q} {| p_{jq} |}_{1} \\ & = \sum_{j = 1}^{J} (\sum_{i = 1}^{N} {(y_{ij} - \sum_{q} η_{iq} p_{jq})}^{2}) + λ \sum_{j = 1}^{J} \sum_{q = 1}^{Q} {| p_{jq} |}_{1}, \end{matrix}

implying that we can optimize per variable $j$; that is, for each $j$ we solve

\begin{matrix} \min_{p_{j 1}, \dots, p_{jq}, \dots, p_{jQ}} \sum_{i} {(y_{ij} - \sum_{q} η_{iq} p_{jq})}^{2} + \sum_{q} λ | p_{jq} | . \end{matrix}

Coordinate descent relies on a conditional optimization scheme where each of the $p_{jq}$, $q = 1, \dots, Q$, is estimated in turn within an iterative scheme. With some rewriting, the objective for obtaining the conditional estimate of $p_{jq}$ becomes

\begin{matrix} \min_{p_{jq}} & \sum_{i} {(y_{ij} - \sum_{t \neq q} η_{it} p_{jt} - η_{iq} p_{jq})}^{2} + \sum_{q} λ | p_{jq} | \\ & = \sum_{i} {(e_{ij}^{r} - η_{iq} p_{jq})}^{2} + \sum_{t \neq q} λ | p_{jt} | + λ | p_{jq} | \end{matrix}

where $e_{ij}^{r} = y_{ij} - \sum_{t \neq q} η_{it} p_{jt}$ is the residual after removing the contribution of the other factors. This objective is minimized by

\begin{matrix} p_{jq} = \frac{2 \sum_{i} e_{ij}^{r} η_{iq} - λ}{2 \sum_{i} η_{iq}^{2}} = \frac{2 \sum_{i} e_{ij}^{r} η_{iq} - λ}{2 N} & \text{if } \sum_{i} e_{ij}^{r} η_{iq} > \frac{λ}{2} \\ p_{jq} = \frac{2 \sum_{i} e_{ij}^{r} η_{iq} + λ}{2 \sum_{i} η_{iq}^{2}} = \frac{2 \sum_{i} e_{ij}^{r} η_{iq} + λ}{2 N} & \text{if } \sum_{i} e_{ij}^{r} η_{iq} < - \frac{λ}{2} \\ p_{jq} = 0 & \text{if } | \sum_{i} e_{ij}^{r} η_{iq} | \leq \frac{λ}{2} . \end{matrix}

The solution corresponds to the univariate soft thresholding operator derived in the context of the univariate LASSO regression problem (Friedman et al., 2010).
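
A minimal Python sketch of this conditional update is given below. It takes the partial residuals of item $j$ (the data with the contribution of the other factors removed) and the scores on factor $q$; the function name and the numbers are illustrative assumptions, not part of the accompanying R code.

```python
def update_loading(e_j, eta_q, lam):
    """Soft-thresholding update of p_jq given partial residuals e_j and scores eta_q."""
    cross = sum(e * h for e, h in zip(e_j, eta_q))  # sum_i e_ij * eta_iq
    denom = sum(h * h for h in eta_q)               # equals N under the scaling constraint
    if cross > lam / 2:
        return (2 * cross - lam) / (2 * denom)
    if cross < -lam / 2:
        return (2 * cross + lam) / (2 * denom)
    return 0.0  # the loading is set exactly to zero


# Toy example with N = 4, so that sum_i eta_iq^2 = 4.
eta_q = [1.0, -1.0, 1.0, -1.0]
e_j = [0.9, -0.7, 0.8, -1.0]
p_small_penalty = update_loading(e_j, eta_q, lam=2.0)
p_large_penalty = update_loading(e_j, eta_q, lam=8.0)
```

With the small penalty the loading is shrunk towards zero, whereas the large penalty exceeds the threshold and the loading becomes exactly zero.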

Using the estimated factor scores to obtain the path coefficients in the structural part with OLS (or other methods suitable for users’ purposes), we present the algorithm for rESEM-l1:

Algorithm 1 — Regularized ESEM with the LASSO penalty (rESEM-l1)

Appendix B Algorithm: Regularized ESEM with cardinality constraint

With orthogonal factors, finding the best subset of loadings to (5) is not an NP-hard problem but can be efficiently solved in an alternating optimization scheme in which the optimization criterion is decomposed into two parts:

\begin{matrix} \sum_{j = 1}^{J} \sum_{i = 1}^{N} {(y_{ij} - \sum_{q = 1}^{Q} η_{iq} p_{jq})}^{2} = & \sum_{j = 1}^{J} \sum_{i = 1}^{N} {(y_{ij} - \sum_{q = 1}^{Q} η_{iq} a_{jq})}^{2} \\ & + N \sum_{j = 1}^{J} \sum_{q = 1}^{Q} {(a_{jq} - p_{jq})}^{2} . \end{matrix}

with $a_{jq} = \frac{1}{N} \sum_{i} y_{ij} η_{iq}$. The decomposition in (12) holds if $\sum_{i} η_{iq}^{2} = N$ for all $q = 1, \dots, Q$ and $\sum_{i} η_{iq} η_{it} = 0$ for all $q \neq t$.

The first term does not involve the loadings $p_{jq}$ , while the second can be easily minimized under the cardinality constraint by keeping C loadings with the highest absolute value and setting the remaining loadings to zero. The main complexity of the procedure only involves sorting the loading matrix $P$ of size $J \times Q$ , and thus can scale up to a large number of variables. Even though this method has previously been introduced as unpenalized sparse principal component analysis, we would like to point out that, because of the cardinality constraint, the component scores cannot be written as a linear combination of the variables (unlike with PCA). Hence, here we have a sparse exploratory factor analysis method with reflective indicators: the loadings express the strength of the association of the score vectors with the observed variables.
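
The decomposition in (12) can be checked numerically on a small example. The Python sketch below uses hypothetical toy values: a score matrix `H` whose columns are orthogonal with squared norm $N$, an arbitrary data matrix `Y`, and an arbitrary loading matrix `P`.

```python
def sq_loss(Y, H, P):
    """Sum over i and j of (y_ij - sum_q eta_iq * p_jq)^2."""
    total = 0.0
    for i, y_row in enumerate(Y):
        for j, y in enumerate(y_row):
            fit = sum(H[i][q] * P[j][q] for q in range(len(P[j])))
            total += (y - fit) ** 2
    return total


N, J, Q = 4, 3, 2
# Columns of H are orthogonal and each has squared norm N = 4.
H = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
Y = [[0.5, -1.2, 0.3], [1.1, 0.4, -0.7], [-0.2, 0.9, 1.5], [0.8, -0.3, 0.1]]
P = [[0.6, 0.0], [0.0, 0.4], [0.3, -0.2]]

# a_jq = (1/N) * sum_i y_ij * eta_iq
A = [[sum(Y[i][j] * H[i][q] for i in range(N)) / N for q in range(Q)]
     for j in range(J)]

lhs = sq_loss(Y, H, P)
rhs = sq_loss(Y, H, A) + N * sum(
    (A[j][q] - P[j][q]) ** 2 for j in range(J) for q in range(Q)
)
```

Both sides agree up to rounding error, and only the second term on the right-hand side depends on `P`, which is why the constrained minimization reduces to a thresholding step.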

Equation (12) can be rewritten in matrix form as follows:

\begin{matrix} \lVert Y - H P^{T} \rVert^{2} = \lVert Y - H A^{T} \rVert^{2} + N \lVert A - P \rVert^{2}, \end{matrix}

with $A = \frac{1}{N} Y^{T} H$ ( $H$ is the factor score matrix). Then, $H$ is attained by minimizing the expanded loss function

\begin{matrix} \mathrm{tr}\, Y^{T} Y + \mathrm{tr}\, P H^{T} H P^{T} - 2\, \mathrm{tr}\, Y^{T} H P^{T} = N\, \mathrm{tr}\, S + N\, \mathrm{tr}\, P P^{T} - 2 N\, \mathrm{tr}\, A P^{T}, \end{matrix}

with $S = N^{- 1} Y^{T} Y$ . Hence, we have

\begin{matrix} H = \sqrt{N} {UV}^{T}, \end{matrix}

with $U$ and $V$ from the SVD of $\frac{1}{\sqrt{N}} YP = {USV}^{T}$ .
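
The step from the expanded loss to this update can be spelled out as follows. This is a sketch of the standard orthogonal Procrustes argument, assuming $N \geq Q$ and a thin SVD; in this block $S$ denotes the diagonal matrix of singular values from the SVD, not the covariance matrix defined earlier.

```latex
\begin{aligned}
\min_{H^{T}H = N I} \lVert Y - HP^{T} \rVert^{2}
  &\iff \max_{H^{T}H = N I} \operatorname{tr}(H^{T} Y P)
  && \text{(the other two terms are constant in } H\text{)} \\
\operatorname{tr}(H^{T} Y P)
  &= \sqrt{N}\, \operatorname{tr}(H^{T} U S V^{T})
  \leq N \operatorname{tr}(S)
  && \text{for all } H \text{ with } H^{T}H = N I \\
H = \sqrt{N}\, U V^{T}
  &\implies \operatorname{tr}(H^{T} Y P) = N \operatorname{tr}(V S V^{T}) = N \operatorname{tr}(S)
  && \text{(the bound is attained)} \\
H^{T} H &= N\, V U^{T} U V^{T} = N I
  && \text{(the constraint is satisfied)}
\end{aligned}
```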

For a given $H$ , minimizing the loss function (13) over constrained $P$ is equivalent to minimizing:

\begin{matrix} \lVert A - P \rVert^{2} = \sum_{(j, q) \in Z} a_{jq}^{2} + \sum_{(j, q) \in Z^{*}} {(a_{jq} - p_{jq})}^{2} \geq \sum_{(j, q) \in Z} a_{jq}^{2}, \end{matrix}

with $Z$ being the set of $k = J \times Q - C$ indexes $(j, q)$ for which $p_{jq} = 0$ and $Z^{*}$ being the set of indexes for which $p_{jq} \neq 0$. Thus, with $a_{| k |}^{2}$ denoting the $k$-th smallest value among the $a_{jq}^{2}$, (16) is minimized for $P = (p_{jq})$ being

\begin{matrix} p_{jq} = \left\{ \begin{matrix} 0, & \text{if } a_{jq}^{2} \leq a_{| k |}^{2} \\ a_{jq}, & \text{otherwise} \end{matrix} \right. \end{matrix}
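
The thresholding rule can be illustrated with a short Python sketch that keeps the $C$ entries of $A$ with the largest absolute value and sets the others to zero. The function name and the toy matrix are hypothetical; ties between equal absolute values are broken arbitrarily by position, which is an assumption of this sketch.

```python
def keep_largest(A, C):
    """Keep the C entries of A with the largest absolute value; zero the rest."""
    flat = [(abs(a), j, q) for j, row in enumerate(A) for q, a in enumerate(row)]
    flat.sort(reverse=True)  # sorting is the main cost of the update
    keep = {(j, q) for _, j, q in flat[:C]}
    return [[a if (j, q) in keep else 0.0 for q, a in enumerate(row)]
            for j, row in enumerate(A)]


# Toy example: J = 4 items, Q = 2 factors, cardinality C = 4.
A = [[0.8, 0.1], [-0.7, 0.2], [0.05, 0.6], [0.3, -0.9]]
P = keep_largest(A, C=4)
```

The retained loadings keep their unshrunken values from $A$, in contrast to the LASSO update, which shrinks the nonzero loadings towards zero.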

Using the estimated factor scores to obtain the path coefficients in the structural part with OLS (or other methods suitable for users’ purposes), we present the algorithm for rESEM-CC:

Algorithm 2 — Regularized ESEM with cardinality constraint (rESEM-CC)

Appendix C Summary of related methods

Table 3.

Summary of related methods

Method	Reference	Type	Penalty	Software
RegSEM	Jacobucci et al. (2016)	Penalized ML (one-stage)	$l_{1}, l_{2}$ , elastic net	R package
lslx	Huang et al. (2017); Huang (2020)	Penalized ML (one-stage)	$l_{1}, l_{2}$ , elastic net, SCAD, MCP (default)	R package
PSEM-Mplus	Asparouhov and Muthén (2024)	Penalized ML (one-stage)	Prior*	Mplus
SFA	Cho and Hwang (2023)	LS (two-stage)	No	MATLAB
Regularized IGSCA	Cho et al. (2025)	Penalized LS (one-stage)	$l_{1}, l_{2}$	MATLAB
Bayesian SEM		Bayesian (one-stage)	Shrinkage priors (e.g., normal, Laplace, horseshoe, spike-and-slab, etc.)	Various

Note. ML = Maximum-likelihood; LS = Least-squares; l1 = LASSO; l2 = Ridge; SCAD = smoothly clipped absolute deviation penalty; MCP = minimax concave penalty

Appendix D Data generation for simulation study

The setup and data generation procedure for the simulation study was inspired by Bollen et al. (2024); Rosseel and Loh (2022), which consists of the following steps:

Generate loadings $P_{J \times Q}$: The primary loadings were fixed at $\sqrt{0.6}$ (as in De Roover and Vermunt, 2019). The cross-loadings were fixed at 0.5, and their locations were determined as follows: for each factor $η_{q}$, take the middle item from the block of items that load primarily on this factor and assign this item a cross-loading of 0.5 on the next factor $η_{q + 1}$.
Generate factor scores $H_{N \times Q}$ from a multivariate normal distribution $MVN(0, Φ)$, where $Φ$ is the correlation matrix of the factors. In the case of orthogonal factors, $Φ = I_{Q \times Q}$.
Generate the residuals $E_{N \times J}$ from a multivariate normal distribution $MVN(0, Θ)$, where the diagonal elements of $Θ$ are the unique variances of the items. The diagonal of $Θ$ was chosen to match the desired level of item reliability.
Generate the observed variables $Y = H P^{T} + E$.
Generate the observed outcome variable $z = H β + E_{z}$, where each element of $β$ is fixed at 0.1 and $E_{z}$ was generated to match the desired level of VAFz (variance in $z$ accounted for by the factors).
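
The five steps above can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the authors' R code, and it rests on several assumptions that the text does not spell out: $Θ$ is taken to be diagonal; item reliability is defined as common variance divided by total item variance; VAFz is defined as the variance of $Hβ$ divided by the variance of $z$; the last factor receives no cross-loading because there is no "next" factor; and the default values `reliability=0.7` and `seed=0` are hypothetical placeholders. The function name `generate_data` is likewise hypothetical.

```python
import numpy as np

def generate_data(n, n_factors, items_per_factor, factor_corr=0.0,
                  cross_loadings=True, reliability=0.7, vaf_z=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q, m = n_factors, items_per_factor
    J = Q * m

    # Step 1: loadings P (J x Q), primary loadings sqrt(0.6).
    P = np.zeros((J, Q))
    for q in range(Q):
        P[q * m:(q + 1) * m, q] = np.sqrt(0.6)
        # Middle item of block q cross-loads 0.5 on factor q + 1
        # (assumption: no wrap-around for the last factor).
        if cross_loadings and q + 1 < Q:
            P[q * m + m // 2, q + 1] = 0.5

    # Step 2: factor scores H ~ MVN(0, Phi).
    Phi = np.full((Q, Q), float(factor_corr))
    np.fill_diagonal(Phi, 1.0)
    H = rng.multivariate_normal(np.zeros(Q), Phi, size=n)

    # Step 3: residuals E with diagonal Theta chosen so that
    # common / (common + unique) equals the requested reliability.
    common = np.einsum("jq,qr,jr->j", P, Phi, P)
    theta = common * (1.0 - reliability) / reliability
    E = rng.normal(size=(n, J)) * np.sqrt(theta)

    # Step 4: observed variables.
    Y = H @ P.T + E

    # Step 5: outcome z with all path coefficients 0.1 and residual
    # variance chosen so that var(H beta) / var(z) equals vaf_z.
    beta = np.full(Q, 0.1)
    signal_var = beta @ Phi @ beta
    resid_var = signal_var * (1.0 - vaf_z) / vaf_z
    z = H @ beta + rng.normal(scale=np.sqrt(resid_var), size=n)

    return Y, z, H, P

# Example: 3 factors, 3 items each, N = 50, orthogonal factors.
Y, z, H, P = generate_data(n=50, n_factors=3, items_per_factor=3)
print(Y.shape, z.shape, np.count_nonzero(P))
```

Returning `H` and `P` alongside the data makes it straightforward to compare estimated loadings and factor scores against the generating values.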

Appendix E Accuracy of Index of Sparseness as a model selection criterion

Table 4.

Percentage of datasets per condition for which IS selected the correct number of (non)zero loadings (for orthogonal factors and VAFz = 0.9)

Items per factor	Number of factors	Sample size	Cross-loadings	rESEM-CC	rESEM-l1
3	3	50	yes	78	24
5	3	50	yes	76	24
3	5	50	yes	73	20
5	5	50	yes	70	30
3	3	100	yes	88	30
5	3	100	yes	88	32
3	5	100	yes	80	34
5	5	100	yes	77	38
3	3	500	yes	100	49
5	3	500	yes	100	55
3	5	500	yes	100	48
5	5	500	yes	100	64
3	3	50	no	79	54
5	3	50	no	70	48
3	5	50	no	71	39
5	5	50	no	65	38
3	3	100	no	98	74
5	3	100	no	98	64
3	5	100	no	98	60
5	5	100	no	96	50
3	3	500	no	100	96
5	3	500	no	100	90
3	5	500	no	100	92
5	5	500	no	100	96

Note. The results for VAFz = 0.5 had a similar trend. Reported values are rounded to whole numbers

Table 5.

Percentage of datasets per condition for which IS selected the correct number of (non)zero loadings (for correlated factors and VAFz = 0.9)

Items per factor	Number of factors	Sample size	Cross-loadings	rESEM-CC	rESEM-l1
3	3	50	yes	80	30
5	3	50	yes	64	30
3	5	50	yes	68	24
5	5	50	yes	62	20
3	3	100	yes	88	34
5	3	100	yes	80	38
3	5	100	yes	72	40
5	5	100	yes	75	36
3	3	500	yes	100	54
5	3	500	yes	100	59
3	5	500	yes	91	50
5	5	500	yes	96	62
3	3	50	no	62	52
5	3	50	no	61	44
3	5	50	no	70	34
5	5	50	no	57	34
3	3	100	no	86	64
5	3	100	no	85	55
3	5	100	no	85	51
5	5	100	no	72	50
3	3	500	no	100	96
5	3	500	no	100	90
3	5	500	no	100	88
5	5	500	no	100	84

Note. The results for VAFz = 0.5 had a similar trend

Appendix F Number of starts resulting in the same final solution by the proposed Regularized ESEM

Table 7.

Number of starts resulting in the same final solution by Regularized ESEM out of 100 random starts

Number of factors	Sample size	Factor correlation	Cross-loadings	rESEM-CC	rESEM-l1
3	50	0	yes	76	48
5	50	0	yes	62	46
3	100	0	yes	84	47
5	100	0	yes	74	46
3	500	0	yes	93	47
5	500	0	yes	92	46
3	50	0.3	yes	72	48
5	50	0.3	yes	58	47
3	100	0.3	yes	80	48
5	100	0.3	yes	62	47
3	500	0.3	yes	94	49
5	500	0.3	yes	82	46
3	50	0	no	79	47
5	50	0	no	59	46
3	100	0	no	87	47
5	100	0	no	68	46
3	500	0	no	92	47
5	500	0	no	79	45
3	50	0.3	no	74	47
5	50	0.3	no	53	46
3	100	0.3	no	82	48
5	100	0.3	no	58	46
3	500	0.3	no	92	48
5	500	0.3	no	74	46
Total				76	47

Note. The results were averaged across 50 replications and across the design conditions not shown in the table.

Appendix G Additional plot for the stress recovery example

Appendix H Factor correlations obtained by the LASSO penalty approach rESEM-l1

Table 8.

Correlation matrix of the five factor scores obtained by the LASSO penalty approach

	External support	Suppression	Cognitive reappraisal	Humor	Brooding
External support	1.000	-0.434	0.233	-0.021	0.185
Suppression	-0.434	1.000	-0.119	-0.241	0.162
Cognitive reappraisal	0.233	-0.119	1.000	0.020	-0.151
Humor	-0.021	-0.241	0.020	1.000	-0.179
Brooding	0.185	0.162	-0.151	-0.179	1.000

Appendix I Additional plot for the gene expression example

Fig. 9 — Three-dimensional scatterplot of the correlated factor scores obtained from the LASSO approach rESEM-l1

Funding

This publication is part of the project SEM2.0 (with project number NWO 406.22.GO.022) of the Open Competition research program, which is financed by the Dutch Research Council (NWO).

Data Availability

The synthetic data and results reported in this manuscript can be generated using the code publicly available at https://github.com/trale97/regularizedESEM. This manuscript made use of two empirical data sets. The Habitual Stress Recovery Strategies data are not publicly available but can be obtained from the author (Dr. Nicola Ballhausen at n.m.ballhausen@tilburguniversity.edu) upon reasonable request. The Autism Genetic data (Nishimura et al., 2007) are available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7329.

Code availability

The implementation of the proposed method was done in R and can be found, along with the code to reproduce the results in this manuscript, at https://github.com/trale97/regularizedESEM.

Declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics approval

This manuscript used empirical data to demonstrate the proposed statistical method. The Habitual Stress Recovery Strategies data are part of the project A Survey Assessment of Habitual Stress Recovery Strategies and Their Link to Well-Being and Health in Aging, which was approved by the Ethics Review Board of the Tilburg School of Social and Behavioral Sciences, Tilburg University (Ethics approval code: TSB_RP103).

Consent to participate

Informed consent was obtained from all individual participants included in the Habitual Stress Recovery Strategies data set.

Consent for publication

The participants included in the Habitual Stress Recovery Strategies data set have consented to the use of their data in publications.

Open Practices Statements

The manuscript made use of simulated data and two empirical datasets. The code for data reproduction and analyses is available at https://github.com/trale97/regularizedESEM. One of the empirical datasets is not publicly available but can be obtained from the author (Dr. Nicola Ballhausen at n.m.ballhausen@tilburguniversity.edu) upon reasonable request. The code for its analysis is available at the same GitHub repository.

Footnotes

In exceptional cases, uniqueness is not guaranteed, e.g., in case of highly correlating factors.

This range ensures that the minimum cardinality is three items per factor and that, at the maximum cardinality, all loadings are nonzero.

The scatterplots resulting from the two methods are very similar; hence, we report only the one obtained with the cardinality constraint. The plot obtained with the LASSO penalty can be found in Appendix I.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Adachi, K., & Kiers, H. A. (2017). Sparse regression without using a penalty function. Retrieved from http://www.jfssa.jp/taikai/2017/table/program_detail/pdf/1-50/10009.pdf.
Adachi, K., & Trendafilov, N. T. (2016). Sparse principal component analysis subject to prespecified cardinality of loadings. Computational Statistics, 31, 1403–1427. 10.1007/s00180-015-0608-4
Asparouhov, T., & Muthén, B. (2024). Penalized structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 31(3), 429–454. 10.1080/10705511.2023.2263913
Asparouhov, T., & Muthén, B. (2025). Methodological advances with penalized structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 32(4), 688–716. 10.1080/10705511.2024.2425996
Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24(4), 468. 10.1037/met0000200
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1), 135–171. 10.1111/1468-0262.00392
Bai, J., & Li, K. (2016). Maximum likelihood estimation and inference for approximate factor models of high dimension. Review of Economics and Statistics, 98(2), 298–309. 10.1162/REST_a_00519
Bai, J., & Ng, S. (2023). Approximate factor models with weaker loadings. Journal of Econometrics, 235(2), 1893–1916. 10.1016/j.jeconom.2023.01.027
Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813–852. 10.1214/15-AOS1388
Bollen, K. A., Gates, K. M., & Luo, L. (2024). A model implied instrumental variable approach to exploratory factor analysis (miiv-efa). Psychometrika, 89(2), 687–716. 10.1007/s11336-024-09949-6
Carver, C. S. (1997). You want to measure coping but your protocol's too long: Consider the Brief COPE. International Journal of Behavioral Medicine, 4(1), 92–100. 10.1207/s15327558ijbm0401_6
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276. 10.1207/s15327906mbr0102_10
Chamberlain, G., & Rothschild, M. (1982). Arbitrage, factor structure, and mean-variance analysis on large asset markets. 10.3386/w0996
Cho, G., Choi, J. Y., Sarstedt, M., & Hwang, H. (2025). Regularized structural equation modeling with both factors and components. Structural Equation Modeling: A Multidisciplinary Journal, pp. 1–12.
Cho, G., & Hwang, H. (2023). Structured factor analysis: A data matrix-based alternative approach to structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 30(3), 364–377. 10.1080/10705511.2022.2126360
Daniele, M., Pohlmeier, W., & Zagidullina, A. (2025). A sparse approximate factor model for high-dimensional covariance matrix estimation and portfolio selection. Journal of Financial Econometrics, 23(1), nbae017. 10.1093/jjfinec/nbae017
De Roover, K., & Vermunt, J. K. (2019). On the exploratory road to unraveling factor loading non-invariance: A new multigroup rotation approach. Structural Equation Modeling: A Multidisciplinary Journal, 26(6), 905–923. 10.1080/10705511.2019.1590778
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49(1), 71–75. 10.1207/s15327752jpa4901_13
Fan, J., Liao, Y., & Mincheva, M. (2011). High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics, 39(6), 3320. 10.1214/11-AOS944
Fan, J., Wang, K., Zhong, Y., & Zhu, Z. (2021). Robust high dimensional factor models with applications to statistical machine learning. Statistical Science, 36(2), 303. 10.1214/20-sts785
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1. 10.18637/jss.v033.i01
Gajjar, S., Kulahci, M., & Palazoglu, A. (2017). Selection of non-zero loadings in sparse principal component analysis. Chemometrics and Intelligent Laboratory Systems, 162, 160–171. 10.1016/j.chemolab.2017.01.018
Goretzko, D. (2025). How many factors to retain in exploratory factor analysis? A critical overview of factor retention methods. Psychological Methods. 10.1037/met0000733
Goretzko, D., & Bühner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods, 25(6), 776. 10.1037/met0000262
Gross, J. J., & John, O. P. (2003). Individual differences in two emotion regulation processes: Implications for affect, relationships, and well-being. Journal of Personality and Social Psychology, 85(2), 348. 10.1037/0022-3514.85.2.348
Gu, Z., de Schipper, N. C., & Van Deun, K. (2019). Variable selection in the regularized simultaneous component analysis method for multi-source data integration. Scientific Reports, 9(1), 18608. 10.1038/s41598-019-54673-2
Guerra-Urzola, R., de Schipper, N. C., Tonne, A., Sijtsma, K., Vera, J. C., & Van Deun, K. (2023). Sparsifying the least-squares approach to PCA: Comparison of lasso and cardinality constraint. Advances in Data Analysis and Classification, 17(1), 269–286. 10.1007/s11634-022-00499-2
Guerra-Urzola, R., Van Deun, K., Vera, J. C., & Sijtsma, K. (2021). A guide for sparse PCA: Model comparison and applications. Psychometrika, 86(4), 893–919. 10.1007/s11336-021-09773-2
Hair, J. F., Hult, G. T. M., Ringle, C. M., Sarstedt, M., & Thiele, K. O. (2017). Mirror, mirror on the wall: A comparative evaluation of composite-based structural equation modeling methods. Journal of the Academy of Marketing Science, 45, 616–632. 10.1007/s11747-017-0517-x
Horn, A. B. (2022). Interpersonal emotion regulation in close relationships questionnaire-ier-cr. 10.31234/osf.io/kmxye
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185. 10.1007/BF02289447
Huang, P.-H. (2020). Lslx: Semi-confirmatory structural equation modeling via penalized likelihood. Journal of Statistical Software, 93(7), 1–37. 10.18637/jss.v093.i07
Huang, P.-H., Chen, H., & Weng, L.-J. (2017). A penalized likelihood method for structural equation modeling. Psychometrika, 82(2), 329–354. 10.1007/s11336-017-9566-9
Huang, K., Sidiropoulos, N. D., & Liavas, A. P. (2016). A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing, 64(19), 5052–5065.
Hwang, H., Cho, G., Jung, K., Falk, C. F., Flake, J. K., Jin, M. J., & Lee, S. H. (2021). An approach to structural equation modeling with both factors and components: Integrated generalized structured component analysis. Psychological Methods, 26(3), 273. 10.1037/met0000336
Hwang, H., & Takane, Y. (2004). Generalized structured component analysis. Psychometrika, 69(1), 81–99.
Jacobucci, R. (2017). Regsem: Regularized structural equation modeling. arXiv preprint arXiv:1703.08489. 10.48550/arXiv.1703.08489
Jacobucci, R., & Grimm, K. J. (2018). Comparison of frequentist and Bayesian regularization in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 639–649. 10.1080/10705511.2017.1410822
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(4), 555–566. 10.1080/10705511.2016.1154793
Jolliffe, I. T., Trendafilov, N. T., & Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3), 531–547.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151. 10.1177/001316446002000116
Le, T., Vermunt, J., & Van Deun, K. (2025). RegularizedESEM simulation study Shiny app (Version 1.0.0). 10.5281/zenodo.17849772
Li, X., & Jacobucci, R. (2022). Regularized structural equation modeling with stability selection. Psychological Methods, 27(4), 497. 10.1037/met0000389
Li, X., Jacobucci, R., & Ammerman, B. A. (2021). Tutorial on the use of the regsem package in R. Psych, 3(4), 579–592. 10.3390/psych3040038
Lorenzo-Seva, U., & Ten Berge, J. M. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2(2), 57–64. 10.1027/1614-2241.2.2.57
Luciano, J. V., Algarabel, S., Tomás, J. M., & Martínez, J. L. (2005). Development and validation of the thought control ability questionnaire. Personality and Individual Differences, 38(5), 997–1008. 10.1016/j.paid.2004.06.020
Marcoulides, K. M. (2018). Careful with those priors: A note on Bayesian estimation in two-parameter logistic item response theory models. Measurement: Interdisciplinary Research and Perspectives, 16(2), 92–99. 10.1080/15366367.2018.1437305
Marcoulides, K., Yuan, K., & Deng, L. (2023). Structural equation modeling with small samples and many variables. Handbook of structural equation modeling, pp. 525–542.
McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5), 750–773. 10.1080/10705511.2016.1186549
Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313. 10.1037/a0026802
Muthén, L., & Muthén, B. (1998–2017). Mplus user’s guide. Eighth edition. Muthén & Muthén, 10.
Nájera, P., Abad, F. J., & Sorrel, M. A. (2023). Is exploratory factor analysis always to be preferred? A systematic comparison of factor analytic techniques throughout the confirmatory–exploratory continuum. Psychological Methods. 10.1037/met0000579
Nishimura, Y., Martin, C. L., Vazquez-Lopez, A., Spence, S. J., Alvarez-Retuerto, A. I., Sigman, M., Steindler, C., Pellegrini, S., Schanen, N. C., Warren, S. T., et al. (2007). Genome-wide expression profiling of lymphoblastoid cell lines distinguishes different forms of autism and reveals shared pathways. Human Molecular Genetics, 16(14), 1682–1698. 10.1093/hmg/ddm116
Park, S., Ceulemans, E., & Van Deun, K. (2024). A critical assessment of sparse PCA (research): Why (one should acknowledge that) weights are not loadings. Behavior Research Methods, 56(3), 1413–1432. 10.3758/s13428-023-02099-0
Rigdon, E. E. (2012). Rethinking partial least squares path modeling: In praise of simple methods. Long Range Planning, 45(5–6), 341–358. 10.1016/j.lrp.2012.09.010
Rönkkö, M., McIntosh, C. N., & Antonakis, J. (2015). On the adoption of partial least squares in psychological research: Caveat emptor. Personality and Individual Differences, 87, 76–84. 10.1016/j.paid.2015.07.019
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. 10.18637/jss.v048.i02
Rosseel, Y. (2020). Small sample solutions for structural equation modeling. In Small sample size solutions (pp. 226–238). Routledge.
Rosseel, Y., & Loh, W. W. (2022). A structural after measurement approach to structural equation modeling. Psychological Methods. 10.1037/met0000503
Sarstedt, M., Hair, J. F., Ringle, C. M., Thiele, K. O., & Gudergan, S. P. (2016). Estimation issues with PLS and CBSEM: Where the bias lies! Journal of Business Research, 69(10), 3998–4010. 10.1016/j.jbusres.2016.06.007
Shen, H., & Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6), 1015–1034. 10.1016/j.jmva.2007.06.007
Tenenhaus, M., Tenenhaus, A., & Groenen, P. J. (2017). Regularized generalized canonical correlation analysis: A framework for sequential multiblock component methods. Psychometrika, 82, 737–777. 10.1007/s11336-017-9573-x
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 58(1), 267–288. 10.1111/j.2517-6161.1996.tb02080.x
Trendafilov, N. T. (2014). From simple structure to sparse components: A review. Computational Statistics, 29, 431–454.
Treynor, W., Gonzalez, R., & Nolen-Hoeksema, S. (2003). Rumination reconsidered: A psychometric analysis. Cognitive Therapy and Research, 27, 247–259. 10.1023/A:1023910315561
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. 10.18637/jss.v045.i03
Van Erp, S., Mulder, J., & Oberski, D. L. (2018). Prior sensitivity analysis in default Bayesian structural equation modeling. Psychological Methods, 23(2), 363. 10.1037/met0000162
van Erp, S. (2023). Bayesian regularized SEM: Current capabilities and constraints. Psych, 5(3), 814–835. 10.3390/psych5030054
Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4), 450–469. 10.1093/pan/mpq025
Vermunt, J. K. (2024). Stepwise latent variable modeling: An overview of approaches.

[CR1] Adachi, K., & Kiers, H. A. (2017). Sparse regression without using a penalty function. Retrieved from http://www.jfssa.jp/taikai/2017/table/program_detail/pdf/1-50/10009.pdf.

[CR2] Adachi, K., & Trendafilov, N. T. (2016). Sparse principal component analysis subject to prespecified cardinality of loadings. Computational Statistics,31, 1403–1427. 10.1007/s00180-015-0608-4 [Google Scholar]

[CR3] Asparouhov, T., & Muthén, B. (2024). Penalized structural equation models. Structural Equation Modeling: A Multidisciplinary Journal,31(3), 429–454. 10.1080/10705511.2023.2263913 [Google Scholar]

[CR4] Asparouhov, T., & Muthén, B. (2025). Methodological advances with penalized structural equation models. Structural Equation Modeling: A Multidisciplinary Journal,32(4), 688–716. 10.1080/10705511.2024.2425996 [Google Scholar]

[CR5] Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods,24(4), 468. 10.1037/met0000200 [DOI] [PubMed] [Google Scholar]

[CR6] Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica,71(1), 135–171. 10.1111/1468-0262.00392 [Google Scholar]

[CR7] Bai, J., & Li, K. (2016). Maximum likelihood estimation and inference for approximate factor models of high dimension. Review of Economics and Statistics,98(2), 298–309. 10.1162/REST_a_00519

[CR8] Bai, J., & Ng, S. (2023). Approximate factor models with weaker loadings. Journal of Econometrics,235(2), 1893–1916. 10.1016/j.jeconom.2023.01.027 [Google Scholar]

[CR9] Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics,44(2), 813–852. 10.1214/15-AOS1388 [Google Scholar]

[CR10] Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics,44(2), 813–852. 10.1214/15-AOS1388 [Google Scholar]

[CR11] Bollen, K. A., Gates, K. M., & Luo, L. (2024). A model implied instrumental variable approach to exploratory factor analysis (miiv-efa). psychometrika,89 (2), 687–716. 10.1007/s11336-024-09949-6

[CR12] Carver, C. S. (1997). You want to measure coping but your protocol’too long: Consider the brief cope. International Journal of Behavioral Medicine,4(1), 92–100. 10.1207/s15327558ijbm0401_6 [DOI] [PubMed] [Google Scholar]

[CR13] Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research,1(2), 245–276. 10.1207/s15327906mbr0102_10 [DOI] [PubMed] [Google Scholar]

[CR14] Chamberlain, G., & Rothschild, M. (1982). Arbitrage, factor structure, and mean-variance analysis on large asset markets. 10.3386/w0996

[CR15] Cho, G., Choi, J. Y., Sarstedt, M., & Hwang, H. (2025). Regularized structural equation modeling with both factors and components. Structural Equation Modeling: A Multidisciplinary Journal,pp. 1–12.

[CR16] Cho, G., & Hwang, H. (2023). Structured factor analysis: A data matrix-based alternative approach to structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal,30(3), 364–377. 10.1080/10705511.2022.2126360 [DOI]

[CR17] Daniele, M., Pohlmeier, W., & Zagidullina, A. (2025). A sparse approximate factor model for high-dimensional covariance matrix estimation and portfolio selection. Journal of Financial Econometrics,23 (1), nbae017. 10.1093/jjfinec/nbae017.

[CR18] De Roover, K., & Vermunt, J. K. (2019). On the exploratory road to unraveling factor loading non-invariance: A new multigroup rotation approach. Structural Equation Modeling: A Multidisciplinary Journal,26(6), 905–923. 10.1080/10705511.2019.1590778 [Google Scholar]

[CR19] Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment,49(1), 71–75. 10.1207/s15327752jpa4901_13 [DOI] [PubMed] [Google Scholar]

[CR20] Fan, J., Liao, Y., & Mincheva, M. (2011). High dimensional covariance matrix estimation in approximate factor models. Annals of Statistics,39(6), 3320. 10.1214/11-AOS944 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] Fan, J., Wang, K., Zhong, Y., & Zhu, Z. (2021). Robust high dimensional factor models with applications to statistical machine learning. Statistical science: a review journal of the Institute of Mathematical Statistics,36(2), 303. 10.1214/20-sts785 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software,33(1), 1. 10.18637/jss.v033.i01 [PMC free article] [PubMed] [Google Scholar]

[CR23] Gajjar, S., Kulahci, M., & Palazoglu, A. (2017). Selection of non-zero loadings in sparse principal component analysis. Chemometrics and Intelligent Laboratory Systems,162, 160–171. 10.1016/j.chemolab.2017.01.018 [Google Scholar]

[CR24] Goretzko, D. (2025). How many factors to retain in exploratory factor analysis? A critical overview of factor retention methods: Psychological Methods. 10.1037/met0000733 [Google Scholar]

[CR25] Goretzko, D., & Bühner, M. (2020). One model to rule them all? Using machine learning algorithms to determine the number of factors in exploratory factor analysis. Psychological Methods,25(6), 776. 10.1037/met0000262 [DOI] [PubMed] [Google Scholar]

[CR26] Gross, J. J., & John, O. P. (2003). Individual differences in two emotion regulation processes: Implications for affect, relationships, and well-being. Journal of Personality and Social Psychology,85(2), 348. 10.1037/0022-3514.85.2.348 [DOI] [PubMed] [Google Scholar]

[CR27] Gu, Z., de Schipper, N. C., & Van Deun, K. (2019). Variable selection in the regularized simultaneous component analysis method for multi-source data integration. Scientific Reports,9(1), 18608. 10.1038/s41598-019-54673-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] Guerra-Urzola, R., de Schipper, N. C., Tonne, A., Sijtsma, K., Vera, J. C., & Van Deun, K. (2023). Sparsifying the least-squares approach to PCA: Comparison of lasso and cardinality constraint. Advances in Data Analysis and Classification,17(1), 269–286. 10.1007/s11634-022-00499-2 [Google Scholar]

[CR29] Guerra-Urzola, R., Van Deun, K., Vera, J. C., & Sijtsma, K. (2021). A guide for sparse PCA: Model comparison and applications. psychometrika,86 (4), 893–919. 10.1007/s11336-021-09773-2

[CR30] Hair, J. F., Hult, G. T. M., Ringle, C. M., Sarstedt, M., & Thiele, K. O. (2017). Mirror, mirror on the wall: A comparative evaluation of composite-based structural equation modeling methods. Journal of the Academy of Marketing Science,45, 616–632. 10.1007/s11747-017-0517-x [Google Scholar]

[CR31] Horn, A. B. (2022). Interpersonal emotion regulation in close relationships questionnaire-ier-cr.10.31234/osf.io/kmxye [Google Scholar]

[CR32] Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika,30, 179–185. 10.1007/BF02289447 [DOI] [PubMed] [Google Scholar]

[CR33] Huang, P.-H. (2020). Lslx: Semi-confirmatory structural equation modeling via penalized likelihood. Journal of Statistical Software,93(7), 1–37. 10.18637/jss.v093.i07 [Google Scholar]

[CR34] Huang, P.-H., Chen, H., & Weng, L.-J. (2017). A penalized likelihood method for structural equation modeling. psychometrika,82 (2), 329–354. 10.1007/s11336-017-9566-9

[CR35] Huang, K., Sidiropoulos, N. D., & Liavas, A. P. (2016). A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing,64(19), 5052–5065. [Google Scholar]

[CR36] Hwang, H., Cho, G., Jung, K., Falk, C. F., Flake, J. K., Jin, M. J., & Lee, S. H. (2021). An approach to structural equation modeling with both factors and components: Integrated generalized structured component analysis. Psychological Methods,26(3), 273. 10.1037/met0000336 [DOI] [PubMed] [Google Scholar]

[CR37] Hwang, H., & Takane, Y. (2004). Generalized structured component analysis. Psychometrika,69(1), 81–99. [Google Scholar]

[CR38] Jacobucci, R. (2017). Regsem: Regularized structural equation modeling. arXiv preprint arXiv:1703.08489. 10.48550/arXiv.1703.08489

[CR39] Jacobucci, R., & Grimm, K. J. (2018). Comparison of frequentist and bayesian regularization in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal,25(4), 639–649. 10.1080/10705511.2017.1410822 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal,23(4), 555–566. 10.1080/10705511.2016.1154793 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] Jolliffe, I. T., Trendafilov, N. T., & Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics,12(3), 531–547. [Google Scholar]

[CR42] Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement,20(1), 141–151. 10.1177/001316446002000116 [Google Scholar]

[CR43] Le, T., Vermunt, J., & Van Deun, K. (2025). RegularizedESEM simulation study Shiny app (Version 1.0.0). 10.5281/zenodo.17849772

[CR44] Li, X., & Jacobucci, R. (2022). Regularized structural equation modeling with stability selection. Psychological Methods,27(4), 497. 10.1037/met0000389 [DOI] [PubMed] [Google Scholar]

[CR45] Li, X., Jacobucci, R., & Ammerman, B. A. (2021). Tutorial on the use of the regsem package in r. Psych,3(4), 579–592. 10.3390/psych3040038 [Google Scholar]

[CR46] Lorenzo-Seva, U., & Ten Berge, J. M. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology,2(2), 57–64. 10.1027/1614-2241.2.2.57 [Google Scholar]

[CR47] Luciano, J. V., Algarabel, S., Tomás, J. M., & Martınez, J. L. (2005). Development and validation of the thought control ability questionnaire. Personality and Individual Differences,38(5), 997–1008. 10.1016/j.paid.2004.06.020 [Google Scholar]

[CR48] Marcoulides, K. M. (2018). Careful with those priors: A note on Bayesian estimation in two-parameter logistic item response theory models. Measurement: Interdisciplinary Research and Perspectives, 16(2), 92–99. 10.1080/15366367.2018.1437305

[CR49] Marcoulides, K., Yuan, K., & Deng, L. (2023). Structural equation modeling with small samples and many variables. Handbook of structural equation modeling, pp. 525–542.

[CR50] McNeish, D. (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5), 750–773. 10.1080/10705511.2016.1186549

[CR51] Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313. 10.1037/a0026802

[CR52] Muthén, L., & Muthén, B. (1998–2017). Mplus user’s guide. Eighth edition. Muthén & Muthén.

[CR53] Nájera, P., Abad, F. J., & Sorrel, M. A. (2023). Is exploratory factor analysis always to be preferred? A systematic comparison of factor analytic techniques throughout the confirmatory–exploratory continuum. Psychological Methods. 10.1037/met0000579

[CR54] Nishimura, Y., Martin, C. L., Vazquez-Lopez, A., Spence, S. J., Alvarez-Retuerto, A. I., Sigman, M., Steindler, C., Pellegrini, S., Schanen, N. C., Warren, S. T., et al. (2007). Genome-wide expression profiling of lymphoblastoid cell lines distinguishes different forms of autism and reveals shared pathways. Human Molecular Genetics, 16(14), 1682–1698. 10.1093/hmg/ddm116

[CR55] Park, S., Ceulemans, E., & Van Deun, K. (2024). A critical assessment of sparse PCA (research): Why (one should acknowledge that) weights are not loadings. Behavior Research Methods, 56(3), 1413–1432. 10.3758/s13428-023-02099-0

[CR56] Rigdon, E. E. (2012). Rethinking partial least squares path modeling: In praise of simple methods. Long Range Planning, 45(5–6), 341–358. 10.1016/j.lrp.2012.09.010

[CR57] Rönkkö, M., McIntosh, C. N., & Antonakis, J. (2015). On the adoption of partial least squares in psychological research: Caveat emptor. Personality and Individual Differences, 87, 76–84. 10.1016/j.paid.2015.07.019

[CR58] Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. 10.18637/jss.v048.i02

[CR59] Rosseel, Y. (2020). Small sample solutions for structural equation modeling. In Small sample size solutions (pp. 226–238). Routledge.

[CR60] Rosseel, Y., & Loh, W. W. (2022). A structural after measurement approach to structural equation modeling. Psychological Methods. 10.1037/met0000503

[CR61] Sarstedt, M., Hair, J. F., Ringle, C. M., Thiele, K. O., & Gudergan, S. P. (2016). Estimation issues with PLS and CBSEM: Where the bias lies! Journal of Business Research, 69(10), 3998–4010. 10.1016/j.jbusres.2016.06.007

[CR62] Shen, H., & Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6), 1015–1034. 10.1016/j.jmva.2007.06.007

[CR63] Tenenhaus, M., Tenenhaus, A., & Groenen, P. J. (2017). Regularized generalized canonical correlation analysis: A framework for sequential multiblock component methods. Psychometrika, 82, 737–777. 10.1007/s11336-017-9573-x

[CR64] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 58(1), 267–288. 10.1111/j.2517-6161.1996.tb02080.x

[CR65] Trendafilov, N. T. (2014). From simple structure to sparse components: A review. Computational Statistics, 29, 431–454.

[CR66] Treynor, W., Gonzalez, R., & Nolen-Hoeksema, S. (2003). Rumination reconsidered: A psychometric analysis. Cognitive Therapy and Research, 27, 247–259. 10.1023/A:1023910315561

[CR67] van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67. 10.18637/jss.v045.i03

[CR68] Van Erp, S., Mulder, J., & Oberski, D. L. (2018). Prior sensitivity analysis in default Bayesian structural equation modeling. Psychological Methods, 23(2), 363. 10.1037/met0000162

[CR69] van Erp, S. (2023). Bayesian regularized SEM: Current capabilities and constraints. Psych, 5(3), 814–835. 10.3390/psych5030054

[CR70] Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4), 450–469. 10.1093/pan/mpq025

[CR71] Vermunt, J. K. (2024). Stepwise latent variable modeling: An overview of approaches.

Items per factor	Reliability	Factors correlation	Cross-loadings	rESEM-CC	rESEM-l1
15	0.3	0	yes	50	1
30	0.3	0	yes	31	2
100	0.3	0	yes	15	0
15	0.8	0	yes	69	38
30	0.8	0	yes	92	38
100	0.8	0	yes	97	28
15	0.3	0.3	yes	31	1
30	0.3	0.3	yes	19	1
100	0.3	0.3	yes	2	2
15	0.8	0.3	yes	83	41
30	0.8	0.3	yes	80	34
100	0.8	0.3	yes	80	29
15	0.3	0	no	50	3
30	0.3	0	no	47	0
100	0.3	0	no	9	0
15	0.8	0	no	100	78
30	0.8	0	no	100	65
100	0.8	0	no	100	62
15	0.3	0.3	no	19	1
30	0.3	0.3	no	6	2
100	0.3	0.3	no	0	1
15	0.8	0.3	no	100	69
30	0.8	0.3	no	100	68
100	0.8	0.3	no	100	53

Items per factor	Number of factors	Sample size	Cross-loadings	rESEM-CC	rESEM-l1
3	3	50	yes	78	24
5	3	50	yes	76	24
3	5	50	yes	73	20
5	5	50	yes	70	30
3	3	100	yes	88	30
5	3	100	yes	88	32
3	5	100	yes	80	34
5	5	100	yes	77	38
3	3	500	yes	100	49
5	3	500	yes	100	55
3	5	500	yes	100	48
5	5	500	yes	100	64
3	3	50	no	79	54
5	3	50	no	70	48
3	5	50	no	71	39
5	5	50	no	65	38
3	3	100	no	98	74
5	3	100	no	98	64
3	5	100	no	98	60
5	5	100	no	96	50
3	3	500	no	100	96
5	3	500	no	100	90
3	5	500	no	100	92
5	5	500	no	100	96

Items per factor	Number of factors	Sample size	Cross-loadings	rESEM-CC	rESEM-l1
3	3	50	yes	80	30
5	3	50	yes	64	30
3	5	50	yes	68	24
5	5	50	yes	62	20
3	3	100	yes	88	34
5	3	100	yes	80	38
3	5	100	yes	72	40
5	5	100	yes	75	36
3	3	500	yes	100	54
5	3	500	yes	100	59
3	5	500	yes	91	50
5	5	500	yes	96	62
3	3	50	no	62	52
5	3	50	no	61	44
3	5	50	no	70	34
5	5	50	no	57	34
3	3	100	no	86	64
5	3	100	no	85	55
3	5	100	no	85	51
5	5	100	no	72	50
3	3	500	no	100	96
5	3	500	no	100	90
3	5	500	no	100	88
5	5	500	no	100	84


