Abstract
Characterizing associations among multiple single-nucleotide polymorphisms (SNPs) within and across genes, and measures of disease progression or disease status will potentially offer new insight into disease etiology and disease progression. However, this presents a significant analytic challenge due to the existence of multiple potentially informative genetic loci, as well as environmental and demographic factors, and the generally uncharacterized and complex relationships among them. Latent variable modeling approaches offer a natural framework for analysis of data arising from these population-based genetic association investigations of complex diseases as they are well-suited to uncover simultaneous effects of multiple markers. In this manuscript we describe application and performance of two such latent variable methods, namely structural equation models (SEMs) and mixed effects models (MEMs), and highlight their theoretical overlap. The relative advantages of each paradigm are investigated through simulation studies and, finally, an application to data arising from a study of anti-retroviral-associated dyslipidemia in HIV-infected individuals is provided for illustration.
Keywords: Mixed effects models, Single-nucleotide polymorphisms (SNPs), Structural equation model (SEM)
1 Introduction
The increased availability of data on single-nucleotide polymorphisms (SNPs) has led to heighten interest in understanding how this genetic information correlates with measures of disease progression. One analytic challenge plaguing these genotype-trait association studies is the potential for multiple SNPs to be implicated in complex diseases. In this manuscript, we describe applications and performance of two latent variable paradigms, namely structural equation models (SEMs) and mixed effects models (MEMs), for addressing this challenge.
SEMs constitute a broad range of multivariate regression models that allow complex dependencies among multiple predictors and outcome variables and are widely used in economics, sociology and psychology (Pugesek et al., 2003; Rabe-Hesketh et al., 2004; Skrondal and Rabe-Hesketh, 2004). Several recent manuscripts extend the conventional measurement component of an SEM, conditional on latent variables, to the generalized linear model setting, rendering these models naturally conducive to continuous as well as categorical outcomes (Muthén, 1984; Muthén and Muthén, 2007; Skrondal and Rabe-Hesketh, 2005, 2004; Lee and Shi, 2001; Reboussin and Liang, 1998). Recent applications of SEMs to genetic data include those that aim to reconstruct the linkage disequilibrium structure among genes (Lee et al., 2007) as well as one study to characterize associations between multiple SNPs, smoking, gender and rheumatoid arthritis (Nock et al., 2007). MEMs, widely used to address correlations in repeated-measures and multi-level data (Laird and Ware, 1982), are an alternative latent variable modeling strategy that has been described for characterizing association between multiple SNPs, within and across genes, and a measured trait (Foulkes et al., 2005; Goeman et al., 2004; Foulkes and De Gruttola, 2002).
A growing body of literature exists on the methods for analyzing data arising from candidate gene association studies, including approaches targeted specifically at characterizing combinations of SNPs and their association with a measure of disease status or disease progression. Among these are most notably machine learning applications, including classification and regression trees (CART) (Zhang and Singer, 1999; Breiman et al., 1993), random forests (Bureau et al., 2005; Segal et al., 2004; Breiman, 2001), logic regression (Schwender and Ickstadt, 2008; Kooperberg and Ruczinski, 2005; Ruczinski et al., 2004, 2003; Kooperberg et al., 2001), lasso (Kooperberg et al., 2010; Wu et al., 2009; Tibshirani, 1996), elastic net (Kooperberg et al., 2010; Zou and Hastie, 2005) and Bayesian network (BN) analysis (Rodin and Boerwinkle, 2005; Pearl, 1988). The gains attributable to first-stage creation of meta-variables within these frameworks are also described, for example in Foulkes et al. (2004); Bastone et al. (2004) and Malovini et al. (2009). The former involves a first-stage, unsupervised clustering of individuals based solely on genotype data, followed by application of CART to characterize association, while the later involves a first-stage application of CART to identify clusters, followed by application of BN analysis to characterize association. The latent class approaches described herein similarly involve defining group indicators based on a collection of SNPs and, in turn, relating these to a measured trait for characterizing association; however, both the SEM and MEM approaches detailed below are distinct in that they involve fully parametric modeling of association and corresponding parameter estimation and testing. The present manuscript focuses on the overlap of these two specific latent class paradigms while additional details on several of the alternative approaches listed above, including discussion of their relative merits, can be found in Hastie et al. (2001); Gentleman et al. (2005); Schwender et al. (2008) and Foulkes (2009).
We begin by formalizing the SEM approach for genetic association studies and extend the research of Nock et al. (2007), to characterize broadly the performance under a range of underlying models of association (Section 2.1). Second, we present an extension of the MEM approach of Foulkes et al. (2005), for this setting that offers additional flexibility in defining the model of association through inclusion of cross-classified clusters (Section 2.2). We then highlight the theoretical overlap between SEMs and MEMs (Section 2.3) and explore, through simulation studies, the relative advantages of each approach (Section 3.1). Specifically, we focus on the flexibility and performance under model misspecification. Finally, we apply both approaches, as we as an alternative two-stage BN analysis, to data arising from a study of anti-retroviral therapy (ART)-associated dyslipidemia in HIV (Section 3.2).
2 Methods
2.1 Structural equation model for genetic association studies
We begin by describing how the SEM framework can be applied for analysis of data derived from genetic association studies, where the goal is to characterize associations between genotypes, within and across multiple genetic loci, and a single measure of disease progression or disease status. An extensive literature exists on SEMs, and correspondingly a variety of approaches to specifying the model have been described (Jöreskog, 1975; Bentler and Weeks, 1980; Muthén, 1984, 2002; Sánchez et al., 2005; Skrondal and Rabe-Hesketh, 2005; Muthén and Muthén, 2007). Here, we use the formulation given by Sánchez et al. (2005) and apply the measurement model described by Muthén (1984); Skrondal and Rabe-Hesketh (2005); and Muthén and Muthén (2007).
Let yi denote the trait under study, where
represents individuals. Further suppose
represents genotypes across S bi-allelic SNPs for individual i. Since each SNP is bi-allelic, we have
, where A and a represent the two possible nucleotides at site s and
. For simplicity of presentation, we drop the notational dependency of A and a on s. Finally, let
represent P measured covariates for individual i.
Similar to the approach described by Nock et al. (2007), we assume that each candidate gene has a corresponding latent variable, representing an unobservable effect of the corresponding gene on the trait. These latent variables are given for individual i by the vector
where
corresponds to gene k,
. The measurement component of an SEM relates the observed data components, Xi and yi, to the latent variables, Ui, while the structural component defines the relationship among the latent variables. These are formulated as follows:
| (1) |
| (2) |
where
and Γy are unknown parameters, and
. Here
is used to represent an appropriately defined link function, such as the probit or logit link for categorical and binary outcomes, respectively
| (3) |
where α, B are unknown parameters, the diagonal elements of B are identically equal to 0 and (I−B) is invertible (Sánchez et al., 2005). Here we assume
and
is independent of εi. In the genetic association study setting, interest is in characterizing the association between the latent variables, Ui, and the measured trait, given by yi. Formally, a test of association is given by a Wald test of the null hypothesis,
.
Notably, in the genetic association setting, where Xi represents SNPs as described above, many of the covariates represented in Zi will influence the trait yi but will not be directly predictive of Xi as described in Eq. (1). Covariates that are potentially relevant in this component of the measure model include surrogates for population substructure, such as race and ethnicity, as well as measures of exposure to environmental toxins, such as smoking status, that may result in oncogenic mutations within specific organ tissue. In this sense, Zi can be thought of as a partitioned matrix, given by
, where only the covariates represented in
are potentially predictive of Xi while the covariates given in both
and
are potentially predictive of yi. In turn, the element of Γx corresponding to
is identically equal to 0.
Visual path diagram representations of this model with one and two latent variables are given in Fig. 1A and B, respectively. Here, observed variables are represented by rectangles while latent variables are given by ovals. Dashed lines represent fixed, independent variables, whereas solid lines indicate dependent variables with corresponding distributional assumptions. Single-direction arrows represent causal relationships among variables while double-headed arrows represent non-zero correlations.
Figure 1.

Sample SEM path diagrams for genetic association studies. (A) Single latent variable SEM and (B) SEM with two latent variables.
2.2 Mixed effect model for genetic association studies
Distinct from the SEM setting, application of an MEM to SNP data is a staged approach that involves first assigning individuals to groups based on observed genotypes across multiple SNPs. These genotype group assignments are then treated as cluster indicators in the usual mixed modeling framework. While several approaches to the first-stage dimension reduction are tenable, such as hierarchical or K-means clustering (Hartigan, 1975), here we apply the simple deterministic approach of assigning individuals with identical multi-locus genotypes to the same genotype group, as described by Foulkes et al. (2004).
Again, we begin by letting
represent the multilocus genotype for individual i across S bi-allelic SNPs. Now suppose
represents the J groups resulting from assigning individuals with identical genotypes to the same group, that is
implies
. The MEM as described previously for this setting (Foulkes et al., 2005) can be formulated as follows:
| (4) |
where ν and Γ are again unknown parameters,
,
is an indicator for individual i belonging to genotype group gj,
is a vector of corresponding random effects of genotype groups on the trait under study,
for
,
and bj and εi are independent. A likelihood ratio test of the null hypothesis H0: σb=0 is applied to assess the presence of a genotype-trait association.
More generally, a grouping variable can be defined for each of multiple genes. To see this, suppose now that Xi represents a vector of S SNPs across K genes. We assume nk SNPs are measured within gene k, such that
. Now
represents the groups corresponding to gene k where Jk is the number of such groups. Notably, in the setting of three-level SNPs, we have
, while for binary SNPs,
. The MEM for such cross-classified data is then given by
| (5) |
where
is a vector of group membership indicators,
is defined as the genotype group random effects on yi for gene k,
and
. In this setting, a likelihood ratio test can again be applied to test the null hypothesis
for each gene k.
Visual representations of the MEMs for single- and multi-level clustering are given in Fig. 2A and B, using the same notation as described above for Fig. 1. Here, the broken lines indicate a deterministic relationship between SNPs – represented by
– and cluster assignments – represented by Ci. A few notable distinctions can be discerned by comparing Figs. 1 and 2. Most notably, in the SEM framework, we see that the SNP variables are treated as random, and modeled as a function of the latent variables, ui. In the MEM setting, on the other hand, these are treated as fixed and inform cluster assignments deterministically. Additionally, in the SEM setting, the latent variables – given by
and
– are person-specific and potentially correlated, while in the MEM framework the latent effects – given by
and
– correspond to genotype groups and are independent. Further discussion of theoretical overlap between the two modeling approaches is given in Section 2.3.
Figure 2.

Sample MEM path diagrams for genetic association studies. (A) Single level clustering MEM and (B) MEM with multi-level clustering.
2.3 A comparison of the SEM and MEM approaches
Both the SEM and MEM approaches, as formulated in Sections 2.1 and 2.2, involve modeling underlying latent variables that represent unobservable effects of genes on the trait under study. Indeed, it is straightforward to show that the SEM can be reduced to an MEM for this setting. To begin, consider the simple case of a single gene, and thus a single latent variable. We first omit the regression of Xi on the latent variable – Eq. (1) of the SEM – as the MEM treats the Xi as fixed. For the single gene setting, the models of Eqs. (2) and (3) reduce to
| (6) |
| (7) |
where
is now a scalar and, because there is only one assumed latent variable, B is identically equal to 0 and
. In order for this SEM to reduce to the MEM, we let λy equal the vector
and replace the individual level latent variable ui with the vector of random cluster effects
. Importantly, this is equivalent to making the assumption that the latent effects on the trait are the same for those individuals with the same observed genetic profile. Finally, we set α=0 and, together, these restrictions yield Eq. (4).
In the case of K=2 genes, we note that Eqs. (2) and (3) can be written as:
| (8) |
| (9) |
Now we replace
and
with the vectors
and
and replace
and
with
and
, respectively. Notably, as the lengths of
and
(as well as b1 and b2), given by J1 and J2, are not necessarily equal, these vectors need to be concatenated with vectors of 0 of appropriate length. That is, Eq. (8) is replaced with:
where, without loss of generality, we assume J1<J2 and
is defined as a vector of length (J2−J1) with all 0 elements. In order for this SEM to simplify to the MEM with two levels of clustering, we additionally need to assume α=0 and
. In other words, we must assume that the latent variables are uncorrelated and centered at 0.
In summary, and more generally for K>2 genes, we make the following three assumptions for the SEM to reduce to the MEM: (i) SNPs, represented by Xi, are fixed, so that the model of Eq. (1) is omitted; (ii) ΛyUi is given by
, that is individual level latent variables are the same for individuals within the same defined genotype group and (iii) α=B=0, that is latent variables across genes are mutually independent and centered at 0. In the MEM setting, the cluster random effects are assumed independent, although a correlation structure between bj and
could be imposed.
3 Applications
In the following sections we report the results of a simulation study and an application to a study of anti-retroviral-associated dyslipidemia in HIV. Restricted maximum likelihood is used to derive point estimates of parameters in the MEMs. A likelihood ratio test of
, comparing the full (mixed effects) and reduced (fixed effects only) model, is used to investigate the association between genotype groups and a measured trait. As this involves testing a parameter at a boundary, a mixture of a χ2 with 1 and 0 degrees of freedom is assumed for the resultant test statistic. All MEMs are fitted with the lme() function within the nlme package in R, Version 2.9.1. In the context of fitting SEMs, weighted least squares is applied to derive parameter estimates and a Wald test of the null hypothesis
, is reported. SEMs are fitted using Mplus Version 5.21.
3.1 Simulation studies
In this section we explore, through simulation studies, the relative practical performance of SEMs and MEMs under a range of underlying models of association. In each simulation, we generate 500 sets of data, each of size n=1000, for each combination of true parameter values. SEM data are simulated using the MONTECARLO Command in Mplus Version 5.21 (Muthén, 1984; Muthén and Muthén, 2007).
We begin by simulating data under an SEM with a single latent variable, according to Eqs. (1) and (6)–(7), where for simplicity of presentation, we let Γx=0. In scalar notation, this model can be rewritten as:
![]() |
where we assume
, xis is a binary SNP indicator,
,
,
and
. For identifiability, α is set to 0 and
is restricted to 1. Furthermore, we define a threshold model in Mplus such that
. It is straightforward to show that the covariance between any two SNPs is then given by ψ. Finally, we set
,
and
, while the values of the remaining parameters, λy,
,
, and ψ, vary as described in Table 1.
Table 1.
Simulation results under SEM with one latent variable.
| Data | True value | SEMs | MEMs | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Bias | CR | CI | Power | Bias | CR | CI | Power | ||
| 1 | ![]() |
0.00 | 0.95 | 0.23 | – | 0.13 | 0.24 | 0.20 | – |
![]() |
0.01 | 0.96 | 0.21 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.14 | – | 0.00 | 0.96 | 0.13 | – | |
![]() |
−0.02 | 0.94 | 0.73 | 1.00 | – | – | – | 1.00 | |
| 2 | ![]() |
0.00 | 0.95 | 0.23 | – | 0.18 | 0.06 | 0.21 | – |
![]() |
0.01 | 0.96 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.15 | – | 0.00 | 0.96 | 0.14 | – | |
![]() |
−0.01 | 0.95 | 0.37 | 1.00 | – | – | – | 1.00 | |
| 3 | ![]() |
0.00 | 0.94 | 0.22 | – | 0.19 | 0.03 | 0.21 | – |
![]() |
0.01 | 0.96 | 0.17 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.16 | – | 0.00 | 0.95 | 0.14 | – | |
![]() |
0.00 | 0.97 | 0.25 | 1.00 | – | – | – | 1.00 | |
| 4 | ![]() |
0.00 | 0.94 | 0.19 | – | 0.04 | 0.85 | 0.18 | – |
![]() |
0.01 | 0.95 | 0.22 | – | – | – | – | – | |
![]() |
0.00 | 0.96 | 0.13 | – | 0.00 | 0.96 | 0.13 | – | |
![]() |
−0.01 | 0.96 | 0.30 | 1.00 | – | – | – | 1.00 | |
| 5 | ![]() |
0.01 | 0.95 | 0.39 | – | 0.69 | 0.00 | 0.30 | – |
![]() |
0.00 | 0.96 | 0.18 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.20 | – | 0.00 | 0.95 | 0.16 | – | |
![]() |
−0.02 | 0.95 | 0.56 | 1.00 | – | – | – | 1.00 | |
(a) Median estimates are reported from 500 sets of data. (b) Absolute difference between Est. and true value. (c) Coverage rate, the percentage of confidence intervals (CIs) that cover true value among 500 CIs. For the CR of variance, the CIs that contain negative values are excluded for consideration. (d) Median length among the 500 length of CIs. (e) Wald test statistics of
is used to test the association between latent variable and measured trait. (f) Likelihood ratio test (LRT) of
is applied to investigate the association between genotype groups and measured trait. Here the approximate distribution of LRT under null hypothesis is 50:50 mixture of
and
.
The results of fitting both SEMs and MEMs to these data are provided in Table 1. In general, both approaches have high power for detecting association, but the SEM approach performs better in terms of bias and coverage for the variance component,
. As ψ and λy increase while
remains fixed, the absolute bias associated with
for the MEM increases and the corresponding coverage rate (CR) is low. Notably, the type-1 error rate (under the model in which λy=0) is 0.03 for both the SEM and MEM approaches.
Second, we simulate data according to an SEM with two uncorrelated latent variables. That is, we let the data arise from Eqs. (1) and (8)–(9) with B=0. In this case, our model can be written as:
![]() |
where
and
, that is we have 8 SNPs with 4 in each of two genes. We further assume
,
,
, all mutually independent and
. For identification, α1 and α2 are set to 0 and we restrict
. For the simulation, we set
,
,
, and define a threshold in Mplus such that
. Finally, the values of
,
, γy,
, ψ1, and ψ2 are assumed to vary as described in Table 2.
Table 2.
Simulation results under SEM with two uncorrelated latent variables.
| Data | True value | SEMs | MEMs | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Bias | CR | CI | Power | Bias | CR | CI | Power | ||
| 1 | ![]() |
−0.01 | 0.94 | 0.30 | – | 0.25 | 0.00 | 0.22 | – |
![]() |
0.00 | 0.95 | 0.21 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.21 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.15 | – | 0.00 | 0.95 | 0.14 | – | |
![]() |
0.00 | 0.96 | 0.81 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.96 | 0.82 | 1.00 | – | – | – | 1.00 | |
| 2 | ![]() |
0.00 | 0.94 | 0.28 | – | 0.34 | 0.00 | 0.24 | – |
![]() |
0.00 | 0.96 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.17 | – | 0.00 | 0.93 | 0.15 | – | |
![]() |
0.00 | 0.97 | 0.41 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.94 | 0.41 | 1.00 | – | – | – | 1.00 | |
| 3 | ![]() |
0.00 | 0.93 | 0.27 | – | 0.37 | 0.00 | 0.24 | – |
![]() |
0.00 | 0.94 | 0.18 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.17 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.18 | – | 0.00 | 0.94 | 0.15 | – | |
![]() |
0.00 | 0.97 | 0.27 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.93 | 0.27 | 1.00 | – | – | – | 1.00 | |
| 4 | ![]() |
0.00 | 0.94 | 0.61 | – | 1.37 | 0.00 | 0.42 | – |
![]() |
0.00 | 0.95 | 0.19 | – | – | – | – | – | |
![]() |
0.00 | 0.93 | 0.19 | – | – | – | – | – | |
![]() |
0.00 | 0.96 | 0.25 | – | 0.00 | 0.94 | 0.19 | – | |
![]() |
−0.01 | 0.96 | 0.67 | 1.00 | – | – | – | 1.00 | |
![]() |
−0.01 | 0.94 | 0.66 | 1.00 | – | – | – | 1.00 | |
| 5 | ![]() |
0.00 | 0.95 | 0.41 | – | 0.72 | 0.00 | 0.31 | – |
![]() |
0.00 | 0.95 | 0.22 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.18 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.20 | – | 0.00 | 0.95 | 0.16 | – | |
![]() |
0.00 | 0.97 | 0.41 | 1.00 | – | – | – | 1.00 | |
![]() |
−0.01 | 0.93 | 0.57 | 1.00 | – | – | – | 1.00 | |
The results of fitting the SEM and MEM to these data are given in Table 2. The results are similar to those we saw with a single latent variable, with more extreme biases associated with
using the MEM approach. The type-1 error rates in the SEM setting are 0.05 and 0.06 for
and
, respectively, while the type-1 error rates in the MEM setting are 0.04 and 0.05 for
and
, respectively.
Finally, we simulate data according to an SEM with two correlated latent variables, where the model is the same as described above for two uncorrelated latent variables, with the exception that it is assumed
. Data are simulated under two models specified by ψ12=0.1 and ψ12=0.2, where in both cases ψ1=ψ2=0.4. These models correspond to correlations between latent variables of 0.25 and 0.5, respectively. The results are reported in Table 3. Again power is high under both the SEM and MEM, and similar bias is observed under the MEM for
. The corresponding type-1 error rates are 0.05 and 0.06 for both SEM parameters,
and
, and 0.04 and 0.05 for the two MEM variance parameters.
Table 3.
Simulation results under SEM with two correlated latent variables.
| Data | True value | SEMs | MEMs | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Bias | CR | CI | Power | Bias | CR | CI | Power | ||
| 1 | ![]() |
0.00 | 0.93 | 0.28 | – | 0.37 | 0.00 | 0.24 | – |
![]() |
0.00 | 0.95 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.18 | – | 0.00 | 0.93 | 0.15 | – | |
![]() |
0.00 | 0.94 | 0.42 | 1.00 | – | – | – | 1.00 | |
![]() |
−0.01 | 0.93 | 0.42 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.96 | 0.09 | – | – | – | – | – | |
| 2 | ![]() |
0.00 | 0.93 | 0.28 | – | 0.40 | 0.00 | 0.25 | – |
![]() |
0.00 | 0.94 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.94 | 0.20 | – | – | – | – | – | |
![]() |
0.00 | 0.95 | 0.18 | – | 0.00 | 0.94 | 0.15 | – | |
![]() |
0.00 | 0.95 | 0.49 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.95 | 0.49 | 1.00 | – | – | – | 1.00 | |
![]() |
0.00 | 0.95 | 0.10 | – | – | – | – | – | |
Next we simulate data under an MEM with a single clustering variable, as described by Eq. (4). We again assume S=4 SNPs, each coded as binary indicators with
and minimal correlation induced by the assumption
and
for s=1,2,3. A single continuous covariate
is generated and we set ν=0. The remaining parameters,
,
and γ are varied across the simulations as given in Table 4. The results of fitting SEMs and MEMs to these data are reported in Table 4. In this setting, a test of λy=0 has reduced power for detecting association. The type-1 error rates are 0.05 and 0.03 for the SEM and MEM approaches, respectively.
Table 4.
Simulation results under MEM with single-level clustering.
| Data | True value | SEMs | MEMs | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Bias | CR | CI | Power | Bias | CR | CI | Power | ||
| 1 | ![]() |
0.12 | 0.42 | 0.21 | – | 0.00 | 0.96 | 0.18 | – |
![]() |
– | – | – | 0.65 | −0.01 | 0.94 | 0.33 | 1.00 | |
![]() |
0.00 | 0.93 | 0.13 | – | 0.00 | 0.94 | 0.13 | – | |
| 2 | ![]() |
0.22 | 0.23 | 0.25 | – | 0.00 | 0.94 | 0.18 | – |
![]() |
– | – | – | 0.76 | 0.00 | 0.95 | 0.65 | 1.00 | |
![]() |
0.00 | 0.94 | 0.15 | – | 0.00 | 0.95 | 0.13 | – | |
| 3 | ![]() |
0.33 | 0.17 | 0.28 | – | 0.00 | 0.94 | 0.18 | – |
![]() |
– | – | – | 0.77 | −0.02 | 0.95 | 0.94 | 1.00 | |
![]() |
0.00 | 0.95 | 0.15 | – | 0.00 | 0.94 | 0.12 | – | |
Finally, we generate data according to a two-level clustering MEM, as described by Eq. (5). Here we again assume that we observe 4 SNPs in each of 2 genes. The corresponding results of fitting SEMs and MEMs to these data are given in Table 5. Again the power for the SEM approach to detect association based on a test of λy is relatively small in all cases while the power for the MEM approach is reasonable (>90%) for
. The estimated type-1 error rates are 0.03 and 0.05 for the SEM parameters and 0.01 for both of the MEM parameters.
Table 5.
Simulation results under MEM with two-level clustering.
| Data | True value | SEMs | MEMs | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Bias | CR | CI | Power | Bias | CR | CI | Power | ||
| 1 | ![]() |
0.31 | 0.03 | 0.23 | – | 0.00 | 0.95 | 0.18 | – |
![]() |
– | – | – | 0.65 | −0.01 | 0.98 | 0.45 | 0.97 | |
![]() |
– | – | – | 0.68 | −0.01 | 0.96 | 0.44 | 0.97 | |
![]() |
0.00 | 0.94 | 0.15 | – | −0.01 | 0.94 | 0.25 | – | |
| 2 | ![]() |
0.59 | 0.00 | 0.29 | – | 0.00 | 0.94 | 0.18 | – |
![]() |
– | – | – | 0.78 | −0.01 | 0.95 | 0.79 | 0.99 | |
![]() |
– | – | – | 0.79 | −0.02 | 0.96 | 0.77 | 1.00 | |
![]() |
0.00 | 0.94 | 0.16 | – | 0.00 | 0.94 | 0.25 | – | |
| 3 | ![]() |
0.95 | 0.00 | 0.36 | – | 0.00 | 0.95 | 0.18 | – |
![]() |
– | – | – | 0.77 | −0.03 | 0.95 | 1.10 | 1.00 | |
![]() |
– | – | – | 0.77 | 0.00 | 0.95 | 1.11 | 0.99 | |
![]() |
0.00 | 0.95 | 0.18 | – | 0.00 | 0.94 | 0.25 | – | |
| 4 | ![]() |
0.47 | 0.00 | 0.27 | – | 0.00 | 0.97 | 0.18 | – |
![]() |
– | – | – | 0.73 | −0.01 | 0.97 | 0.48 | 0.97 | |
![]() |
– | – | – | 0.74 | −0.02 | 0.96 | 0.73 | 1.00 | |
![]() |
0.00 | 0.96 | 0.15 | – | 0.00 | 0.96 | 0.25 | – | |
3.2 Genetics of therapy-associated lipid abnormalities in HIV
In this section we apply the SEM and MEM frameworks to data arising from the New Works Concept Sheet (NWCS) 224 study, an investigation of genetic factors that contribute to anti-retroviral-associated dyslipidemia in HIV-1-infected individuals. This cross-sectional study is comprised of n=626 HIV-infected participants enrolled in 5 AIDS Clinical Trials Group (ACTG) trials who agreed to genetic testing. A complete discussion of the study design and patient demographics is given in Foulkes et al. (2006). Here we focus on 7 SNPs – rs1045642, rs2032582, rs2235035, rs11772987, rs10256836, rs9282564 and rs2157926 – within the ATP-binding cassette, sub-family B (MDR/TAP), member 1 (ABCB1) gene, a gene involved in transporting substrates, including Protease Inhibitors (PIs) across the cell membrane, and an additional 3 SNPs – rs2854117, rs4520 and rs2070669 – in apolipoprotein C-III (APOC3), a gene involved in inhibiting hepatic uptake of triglyceride-rich particles. All SNPs are treated as binary indicators for the presence of at least one variant allele. Interest is in characterizing association between these SNPs and high-density lipoprotein cholesterol (HDL-C). White/non-Hispanic, Hispanic and Black/non-Hispanic subjects (n=532) with complete data, including known drug exposure histories, are included in analysis.
The results of fitting unadjusted models are reported in Table 6. Here we consider three models: two single-gene models (that include either ABCB1 or APOC3) and one two-gene model (that includes both ABCB1 and APOC3). The SEM and MEM results are consistent with one another, suggesting that ABCB1 is associated with HDL-C, as measured by
(p=0.002 and p=0.003) in the SEM and
(p=0.02) in the MEM for both the single-gene and two-gene models. These effects are attenuated (and no longer statistically significant) in adjusted models and may represent spurious associations resulting from population-admixture, i.e. confounding by race/ethnicity and study site. Adjusted models also included PI exposure as a three-level factor – no current PI exposure; currently exposed to a non-RTV-containing PI regimen; and currently exposed to an RTV-containing PI regimen – gender, race/ethnicity and study.
Table 6.
Application to study of therapy-associated lipid abnormalities in HIV.
| ABCB1 – model | APOC3 – model | (ABCB1, APOC3) – model | ||||
|---|---|---|---|---|---|---|
| SEM | MEM | SEM | MEM | SEM | MEM | |
| Est (p-value) | Est (p-value) | Est (p-value) | Est (p-value) | Est (p-value) | Est (p-value) | |
![]() |
−0.06 (0.002) | – | 0.02 (0.53) | – | −0.06 (0.003); 0.01 (0.84) | – |
![]() |
0.78 | – | 0.18 | – | 0.78, 0.28 | – |
![]() |
– | 0.004 (0.02) | – | 0.00 (0.50) | – | 0.004 (0.02); 0.00 (0.50) |
![]() |
0.097 | 0.096 | 0.099 | 0.099 | 0.097 | 0.096 |
To further explore the results of this model fitting procedure, we consider
from the SEM and the empirical Bayes estimates of the random effects from the MEM. For illustration, we focus on the unadjusted model with the single-gene ABCB1. Based on the SEM, the relationship between the SNPs and the latent gene variable, ui, is estimated by
corresponding to rs1045642, rs2032582, rs2235035, rs11772987, rs10256836, rs9282564 and rs2157926, respectively. Associated p-values based on a Wald test are given by
where p-values of “0” are less than 1×10−8. Recall the first element of Λ is fixed at 1 for identifiability. These results suggest variant alleles at rs2032582 (
) and rs9282564 (
) are significantly positively associated with ui, while variant alleles at rs11772987 (
) and rs2157926 (
) are significantly negatively associated with ui. Further,
, suggesting an inverse relationship between ui and HDL-C.
In total there are 46 observed genotype groups and the corresponding empirical Bayes estimates based on the MEM range from −0.097 to 0.074, as illustrated in Figure 3. All corresponding 95% prediction intervals for the genotype groups cover zero, with the exception of the group with at least one variant allele at each of the three SNPs, rs1045642, rs2032582 and rs11772987 and homozygous for the common allele at all other SNPs. Membership to this group is inversely associated with HDL-C with a corresponding empirical Bayes estimate of −0.097 and a 95% prediction interval of (−0.185,−0.009).
Figure 3.

Empirical Bayes (EB) estimates of latent genotype group effects.
3.3 An alternative Bayesian network analysis framework
BN analysis is an alternative analysis framework that similarly aims to identify and characterize association among combinations of potential predictor variables and an observed trait. In this section, we briefly describe the application of one such approach, proposed by Malovini et al. (2009). This strategy is comprised of two stages: First, meta-variables are created based on fitting a classification or regression tree to the trait under study where SNPs and other potentially relevant variables are treated as potential predictor variables; and second, a BN is learned based on these meta-variables and the trait under study. Analysis is based on application of the rpart() and network() functions within the R rpart and deal packages, respectively.
The results of applying the approach of Malovini et al. (2009), to the SNPs within ABCB1 and APOC3 separately, and in combination, are provided in Fig. 4. To begin, we fitted a regression tree to the log transformed quantitative trait, HDL-C separately for each gene. For ABCB1, a single split is observed, as illustrated in Fig. 4A, where ABCB1.S7 corresponds to rs2157926, and a cost complexity parameter (cp) of 0.01 is applied for first-stage pruning. This constitutes the metavariable used for the second-stage BN analysis of ABCB1. For APOC3, no splits result in an increase of more than cp=0.01 in the overall R-squared value, and thus individual SNP variables for this gene are used in the second-stage BN analysis. The resulting directed acyclic graphs (DAGs) illustrated in Fig. 4B–D are consistent with the results presented in Table 6 based on the SEM and MEM analyses. Again, an association between ABCB1 and HDL-C is observed (Fig. 4B), while associations between SNPs within APOC3 and HDL-C are not detectable (Fig. 4C). The DAG based on both genes (Fig. 4D) additionally identifies an association between ABCB1 and the first SNP within APOC3.
Figure 4.

Bayesian networks of ABCB1, APOC3 and HDL-C. (A) Fitted regression tree using all SNPs within ABCB1 (coded as binary indicators for the presence of at least one variant allele) as potential predictor variables and log-transformed HDL-C as the outcome. (B) DAG based on ABCB1 metavariable and log-transformed HDLC. (C) DAGs with highest scores based on individual SNPs within APOC3 and log-transformed HDL-C. (D) DAG based on ABCB1 metavariable, individuals SNPs within APOC3 and log-transformed HDL-C.
Finally, as a case study, we applied the two-stage BN approach to a single randomly selected simulated data set from each of the scenarios (i.e. combinations of parameters) described in Tables 1–5. The results are presented in Fig. 5. In-line with the finding presented in Table 1 of consistently high power of the SEM and MEM under the SEM model with a single latent variable, Fig. 5A illustrates that all five corresponding DAGs identify association between the gene metavariable and the trait. Under the SEMs with two uncorrelated latent variables, the BN analysis consistently identifies at least one of the two gene metavariables; however, in 3 of the 5 cases only one gene is identified, as shown in Fig. 5B. The association between the two genes under the SEM with correlated latent variables is detected with higher correlation, as illustrated in Fig. 5C. Finally, the BN analysis appropriately identifies the gene metavariable for the MEM with single-level clustering, and, in 3 of the 4 cases, was able to identify both gene metavariables under the MEM with two-level clustering.
Figure 5.

Bayesian networks of simulated data. (A) DAGs based on data simulated according to a SEM with one latent variable where gene represents a metavariable based on a fitted regression tree. The 5 DAGs correspond to each combination of parameters given in Table 1. (B) DAGs based on data simulated according to a SEM with two uncorrelated latent variables where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 5 DAGs correspond to each combination of parameters given in Table 2. (C) DAGs based on data simulated according to a SEM with two correlated latent variables where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 2 DAGs correspond to each combination of parameters given in Table 3. (D) DAGs based on data simulated according to an MEM with single-level clustering where gene represents a metavariable based on a fitted regression tree. The 3 DAGs correspond to each combination of parameters given in Table 4. (E) DAGs based on data simulated according to an MEM with two-level clustering where gene1 and gene2 represent metavariables based on two separate fitted regression trees. The 4 DAGs correspond to each combination of parameters given in Table 5.
4 Discussion
In this manuscript we describe the application of two related latent variable modeling approaches, MEMs and SEMs, for identifying and characterizing genetic contributors to complex diseases. While these two frameworks have some important commonalities, several notable differences emerged during our investigation. These are highlighted by the assumptions listed in Section 2.3, under which the SEM reduces to the MEM as described for this setting. Importantly, in the context of an SEM, the test of association is based on a fixed effect coefficient (λy) relating the latent gene variable to the trait. In the MEM context, on the other hand, the test of association is based on a variance parameter (
) of the latent gene effects.
Interestingly, our simulation studies reveal that the performance of the two modeling approaches is comparable under the SEMs in terms of power and type-1 error rates; however, when the data arise from an MEM, power for the SEM approach is lower than the corresponding power for detecting association using the MEM approach. Also of note, when data are simulated under the SEM model, the estimates of the nuisance parameter, σe, under the MEM exhibit substantial bias. In these cases, the estimate of σb is also estimated poorly (results not shown). Notably, for Tables 1 and 2, var
and an estimate of this variance under the MEM is
. For example, for the first scenario in Table 1, we have
and the estimate under the MEM is
(results not shown). In general, these are not as closely aligned; however, there appears to be a tradeoff between the two parameters. In turn, estimation of γy depends on
. Further research on sensitivity to alternative underlying model specifications may further elucidate the relative merits of each approach. Specifically, SEMs may be more conducive to testing specific hypotheses involving multiple genes and their relationships to one another in a pathway to disease.
A notable limitation of both the MEM and SEM approaches is their potential inability to handle a large number of SNPs. In the context of the MEM, the number of genotype groups can become unwieldy as the number of SNPs increases, as described by Foulkes et al. (2008). Interestingly, inclusion of highly correlated SNPs in the MEM approach results in fewer genotype groups but does not otherwise effect model performance. Furthermore, our preliminary research suggests that iteratively sampling a subset of SNPs and fitting the MEM, and then combining results over the multiple samples (Efron, 1979, 1981; Good, 2005), performs reasonably well (results not shown) in terms of the overall power for detecting association in the setting of a large number of SNPs. The extension of the MEM approach to more than one clustering variable, as described in Section 2.2, also offers the advantage of reducing the total number of genotype groups. For example, if we observe r SNPs in one gene and s SNPs in a second gene, the original formulation of the MEM approach involves
genetic groups while the proposed extension involves only
genetic groups (across two clustering variables). Additional research is needed to evaluate the performance of this extended MEM approach with multiple SNPs across a larger number of genes.
We also presented the results of applying an alternative two-stage BN analysis approach to the NWCS224 data, as well as to randomly selected simulated data sets. In the real data example, we were unable to fit a regression tree with splits beyond the root node subject to the specified pruning criterion for one of the genes under study. In this case, the analysis reduced to fitting a BN to single SNP variables rather than metavariables as described by Malovini et al. (2009). On the other hand, for the ABCB1 gene, a metavariable did emerge, albeit based on a single SNP, and an association was detected. Importantly, the structure of association that CART is designed to detect, namely conditional associations (Foulkes, 2009), may explain why only a single SNP emerged while for the SEM four SNPs within this gene were identified as statistically relevant. Although our case studies of simulated data suggest that the BN analysis generally (though not always consistently) identified relevant genes, further investigation is needed to characterize the overall performance and type-1 error rates.
Finally, further research is needed to characterize the performance of both the SEM and MEM frameworks in the context of more complex structures of association and underlying genetic models. In a recent manuscript, we describe application of a mixture modeling approach that may be more appropriate than the single normal prior assumption on the random effects under a dominant or recessive genetic model (Au et al., 2010). While the inclusion of cross-classified clusters, as proposed in Section 2.2, allows for consideration of more SNPs through reduction in the number of genotype groups, this approach assumes an additive model of association. More generally, it is of interest to characterize interactions among genes in their effect on the trait under study, an area of on-going research.
Acknowledgments
The support for this research was provided by National Institutes of Health (NIH) individual research awards, NIH/NHLBI R01HL107196 and NIH/NIDDK R01DK021224.
Conflict of interest The authors have declared no conflict of interest.
Supplementary material
Detailed facts of importance to specialist readers are published as ”Supporting Information”. Such documents are peer-reviewed, but not copy-edited or typeset. They are made available as submitted by the authors.
References
- Au K, Lin R, Foulkes AS. Mixture modeling as an exploratory framework for genotype-trait associations. Journal of the Royal Statistical Society Series C. 2011;60:355–375. doi: 10.1111/j.1467-9876.2010.00750.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bastone L, Reilly M, Rader D, Foulkes A. MDR and PRP: a comparison of methods for high-order genotype–phenotype associations. Human Heredity. 2004;58:82–92. doi: 10.1159/000083029. [DOI] [PubMed] [Google Scholar]
- Bentler PM, Weeks DG. Linear structural equations with latent variables. Psychometrika. 1980;45:289–308. [Google Scholar]
- Breiman L. Random forests. Machine Learning. 2001;45:5–32. [Google Scholar]
- Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. New York, USA: Chapman and Hall/CRC; 1993. [Google Scholar]
- Bureau A, Dupuis J, Falls K, Lunetta K, Hayward B, Keith T, Van Eerdewegh P. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology. 2005;28:171–182. doi: 10.1002/gepi.20041. [DOI] [PubMed] [Google Scholar]
- Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics. 1979;7:1–26. [Google Scholar]
- Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika. 1981;68:589–599. [Google Scholar]
- Foulkes A. Applied Statistical Genetics with R: for Population-based Association Studies. NY: Springer; 2009. [Google Scholar]
- Foulkes AS, De Gruttola V. Characterizing the relationship between HIV-1 genotype and phenotype: prediction-based classification. Biometrics. 2002;58:145–156. doi: 10.1111/j.0006-341x.2002.00145.x. [DOI] [PubMed] [Google Scholar]
- Foulkes AS, De Gruttola V, Hertogs K. Combining genotype groups and recursive partitioning: an application to human immunodeficiency virus type 1 genetics data. Journal of the Royal Statistical Society Series C. 2004;53:311–323. [Google Scholar]
- Foulkes AS, Reilly M, Zhou L, Wolfe M, Rader DJ. Mixed modelling to characterize genotype–phenotype associations. Statistics in Medicine. 2005;24:775–789. doi: 10.1002/sim.1965. [DOI] [PubMed] [Google Scholar]
- Foulkes AS, Wohl DA, Frank I, Puleo E, Restine S, Wolfe ML, Dube MP, Tebas P, Reilly MP. Associations among race/ethnicity, ApoC-III genotypes, and lipids in HIV-1-infected individuals on antiretroviral therapy. PLoS Medicine. 2006;3:0337–0347. doi: 10.1371/journal.pmed.0030052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Foulkes AS, Yucel R, Li X. A likelihood-based approach to mixed modeling with ambiguity in cluster identifiers. Biostatistics. 2008;9:635–657. doi: 10.1093/biostatistics/kxm055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York, USA: Springer; 2005. [Google Scholar]
- Goeman J, van de Geer S, de Kort F, van Houwelingen H. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99. doi: 10.1093/bioinformatics/btg382. [DOI] [PubMed] [Google Scholar]
- Good PI. Permutation, Parametric, and Bootstrap Tests of Hypotheses. New York, USA: Springer; 2005. [Google Scholar]
- Hartigan JA. Clustering Algorithms. Ann Arbor: Wiley; 1975. [Google Scholar]
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, USA: Springer; 2001. [Google Scholar]
- Jöreskog KG. Goldberger AS, Duncan OD. Structural Equation Models in the Social Sciences. New York: 1973. A general method for estimating a linear structural equation system; pp. 85–112. [Google Scholar]
- Kooperberg C, LeBlanc M, Obenchain V. Risk prediction using genome-wide association studies. Genetic Epidemiology. 2010;34:643–52. doi: 10.1002/gepi.20509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology. 2005;28:157–170. doi: 10.1002/gepi.20042. [DOI] [PubMed] [Google Scholar]
- Kooperberg C, Ruczinski I, LeBlanc M, Hsu L. Sequence analysis using logic regression. Genetic Epidemiology. 2001;21:S626–S631. doi: 10.1002/gepi.2001.21.s1.s626. [DOI] [PubMed] [Google Scholar]
- Laird N, Ware J. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed] [Google Scholar]
- Lee S, Jhun M, Lee EK, Park T. Application of structural equation models to construct genetic networks using differentially expressed genes and single-nucleotide polymorphisms. BMC Proceedings. 2007;1:S76. doi: 10.1186/1753-6561-1-s1-s76. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee S, Shi J. Maximum likelihood estimation of two-level latent variable models with mixed continuous and polytomous data. Biometrics. 2001;57:787–794. doi: 10.1111/j.0006-341x.2001.00787.x. [DOI] [PubMed] [Google Scholar]
- Malovini A, Nuzzo A, Ferrazzi F, Puca A, Bellazzi R. Phenotype forecasting with SNPs data through gene-based Bayesian networks. BMC Bioinformatics. 2009;10:S7. doi: 10.1186/1471-2105-10-S2-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Muthén BO. A general structural equation model with dichotomous, ordered categorical and continuous latent indicators. Psychometrika. 1984;49:115–132. [Google Scholar]
- Muthén BO. Beyond SEM: general latent variable modeling. Behaviormetrika. 2002;29:81–117. [Google Scholar]
- Muthén LK, Muthén BO. Mplus User's Guide. Los Angeles, CA: Muthén & Muthén; 2007. [Google Scholar]
- Nock NL, Larkin EK, Morris NJ, Li Y, Stein CM. Modeling the complex gene x environment interplay in the simulated rheumatoid arthritis GAW15 data using latent variable structural equation modeling. BMC Proceedings. 2007;1:S118. doi: 10.1186/1753-6561-1-s1-s118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pearl J. Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann; 1988. [Google Scholar]
- Pugesek BH, Tomer A, Eye AV. Structural Equation Modeling: Applications in Ecological and Evolutionary Biology. Cambridge: Cambridge University Press; 2003. [Google Scholar]
- Rabe-Hesketh S, Skrondal A, Pickles A. Generalized multilevel structural equation modeling. Psychometrika. 2004;69:167–190. [Google Scholar]
- Reboussin B, Liang K. An estimating equations approach for the liscomp model. Psychometrika. 1998;63:165–182. [Google Scholar]
- Rodin A, Boerwinkle E. Mining genetic epidemiology data with Bayesian networks i: Bayesian networks and example application (plasma apoE levels) Bioinformatics. 2005;21:3273–3278. doi: 10.1093/bioinformatics/bti505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ruczinski I, Kooperberg C, LeBlanc M. Logic regression. Journal of Computational and Graphical Statistics. 2003;12:475–511. [Google Scholar]
- Ruczinski I, Kooperberg C, LeBlanc M. Exploring interactions in high dimensional genomic data: an overview of logic regression. Journal of Multivariate Analysis. 2004;90:178–195. [Google Scholar]
- Sánchez BN, Budtz-Jørgensen E, Ryan LM, Hu H. Structural equation models: a review with applications to environmental epidemiology. Journal of the American Statistical Association. 2005;100:1443–1455. [Google Scholar]
- Schwender H, Ickstadt K. Identification of SNP interactions using logic regression. Biostatistics. 2008;9:187–198. doi: 10.1093/biostatistics/kxm024. [DOI] [PubMed] [Google Scholar]
- Schwender H, Ickstadt K, Rahnenführer J. Classification with high-dimensional genetic data: assigning patients and genetic features to known classes. Biometrical Journal. 2008;50:911–926. doi: 10.1002/bimj.200810475. [DOI] [PubMed] [Google Scholar]
- Segal M, Barbour J, Grant R. Relating HIV-1 sequence variation to replication capacity via trees and forests. Statistical Applications in Genetics and Molecular Biology. 2004;3 doi: 10.2202/1544-6115.1031. Article 2. [DOI] [PubMed] [Google Scholar]
- Skrondal A, Rabe-Hesketh S. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Boca Raton: Chapman & Hall/CRC; 2004. [Google Scholar]
- Skrondal A, Rabe-Hesketh S. Encyclopedia of Statistics in Behavioral Science. London: Wiley; 2005. Structural equation modeling: categorical variables; pp. 1–8. [Google Scholar]
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288. [Google Scholar]
- Wu T, Chen Y, Hastie T, Sobel E, Lange K. Genome-wide association analysis by Lasso penalized logistic regression. Bioinformatics. 2009;25:714–21. doi: 10.1093/bioinformatics/btp041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang H, Singer B. Recursive Partitioning in the Health Sciences. New York, USA: Springer; 1999. [Google Scholar]
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B. 2005;67:301–320. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.































































































