Table 2.
Analytic Approach | Assumptions | Statistical Package/Criteria for Factor Determination | Exemplar | Advantages | Disadvantages |
---|---|---|---|---|---|
EFA | • Variables used should be metric • Sample size should be > 200 • Homogeneous sample • Correlations of at least 0.30 are required between the research variables • No outliers in the data | • SPSS • SAS • STATA • R • Eigenvalue or adjusted eigenvalue > 1, model fit statistics | Eastwood et al. (2013) explored 67 potential anginal symptoms in women to identify the latent variables that are representative of angina symptoms in Black vs. White women. Four anginal symptom clusters were found using EFA: chest/general malaise, upper body, abdominal discomfort, and typical triggers/relievers. | • Explores data to provide information about the number of factors required to represent the data • All measured variables are related to every latent variable • Determines factors based on covariance between variables • Determines clusters conceptually based on co-occurrence and relatedness | • Number of variables tied to sample size • Requires EFA-generated model or theory • Requires a priori N of factors or factor structures |
Confirmatory Factor Analysis | • Multivariate normality • Sample size > 200 • Correct a priori model specification • Data from a random sample | • SAS • STATA • Mplus • R • Model fit statistics: RMSEA, CFI, GFI, TLI, χ² | | • Accounts for measurement error • Is model-based • Confirms or rejects the measurement theory | • Requires EFA-generated model or theory • Requires a priori N of factors or factor structures |
PCA | • Variables should be metric • Sample size should be > 200 • Homogeneous sample • Correlations of at least 0.30 are required between the research variables • No outliers in the data | • SPSS • SAS • R • STATA • Eigenvalue > 1 | Jurgens et al. (2009) used PCA and a theoretical approach (Theory of Unpleasant Symptoms) to identify the multidimensional nature of symptoms in hospitalized patients with HF based on nine symptoms from the Minnesota Living with Heart Failure Questionnaire. Three unique symptom clusters were identified: acute volume overload, emotional, and chronic volume overload. Other example: Herr et al. (2015) | • Requires highly correlated variables • Orthogonally transforms correlated variables into a set of uncorrelated variables called principal components | • Assumes multivariate normal distribution of items • PCA minimizes distance between clusters that are not widely separated • Does not account for measurement error |
Hierarchical Methods: Agglomerative | • Spherically shaped clusters • Variables are uncorrelated within clusters | • SPSS • SAS • STATA • R • Can be represented by dendrogram | Lindgren et al. (2008) sought to determine symptom clusters in elderly patients experiencing ischemic coronary heart disease in the week before hospitalization. Three clusters (groups) were identified: (a) Classic Acute Coronary Syndrome, (b) Weary, and (c) Diffuse Symptoms. The most appropriate clusters were determined using a combination of dendrograms, fit indices, and the clinical judgment of the investigators. Other examples: Fukuoka, Lindgren, Rankin, Cooper, and Carroll (2007); Hertzog, Pozehl, and Duncan (2010); Lindgren et al. (2008); Lee et al. (2010); Moser et al. (2014); Song, Moser, Rayens, and Lennie (2010); Streur, Ratcliffe, Ball, Stewart, and Riegel (2017) | • Links clusters through four different methods (Single, Complete, Average, and Ward’s Method) • Joins new candidate for cluster membership to existing group based on highest level of similarity to existing group • Begins with each case being its own group, therefore manageable • Does not allow overlapping groups, people, or symptoms • Best with small data sets | • Requires calculation and storage of large similarity matrix • Makes only one pass through data; therefore, poor early partitioning can create problems • Can generate different solutions by reordering data in similarity matrix • Becomes unstable when cases are dropped |
k-means | • Variables should be on roughly the same scale because clustering is based on squared Euclidean distance • Each group has roughly the same error sum of squares (the variance/covariance matrix between objects within a group is equal) | • SPSS • SAS • STATA • R • Scree plot • Cluster reduction software | McSweeney, Cleves, Lefler, and Yang (2010) used k-means for a secondary analysis of prodromal and acute symptoms of myocardial infarction in women. The authors wanted to determine the naturally occurring homogeneous subgroups based on increasing symptom frequency and severity. | • Compensates for initial poor partitioning • Reduces data • Develops taxonomies | • Requires that you define the number of clusters in advance or that you run multiple solutions to identify the best solution • Does not permit overlapping clusters |
• Variance of the distribution of each attribute (variable) is spherical • All variables have the same variance • Each cluster has roughly equal number of observations |
• Repeatedly calculates to identify best partitioning • Based on a predetermined number of clusters • Not model-based • Multiple algorithms available to cluster the data |
||||
Latent Class Analysis (categorical variables); Latent Profile Analysis (continuous variables) | • The population is composed of different unobserved groups, or latent classes • Observed variables are conditionally independent within each class • No assumptions related to linearity, normal distribution, or homogeneity • Data level should be categorical or ordinal • Observations should be independent in each class | • Latent Gold • Mplus • LEM • MLLSA • SAS • STATA • R • Model-based • Formal criteria such as BIC, AIC, and R² entropy to make decisions related to number of clusters | Ryan et al. (2007) used latent class analysis to identify clusters of symptoms that represented AMI based on subject demographic characteristics, with the goal of informing clinical practice. Five distinct clusters were identified that included high, medium, and low probability symptoms. They also noted that age, race, and sex were statistically significant in predicting cluster membership. Other examples: DeVon et al. (2010); Riegel et al. (2010); Rosenfeld et al. (2015) | • Is model-based • Maximum likelihood estimation with fit statistics used to make decisions related to number of classes • Can incorporate covariates in the model • Used in conjunction with large data sets and to develop taxonomies • ONLY method that allows for the inclusion of covariates in the analyses | • Sparse cells and response patterns may lead to difficulties in model evaluation and identification |
Note. EFA = exploratory factor analysis; RMSEA = root mean square error of approximation; SPSS = Statistical Package for the Social Sciences; CFI = comparative fit index; GFI = Goodness-of-fit index; HF = heart failure; LEM = statistical software for latent class analysis; TLI = Tucker-Lewis Index; PCA = principal component analysis; AIC = Akaike information criterion; BIC = Bayesian information criterion; MLLSA = Maximum likelihood latent structure analysis; SAS = statistical software package; STATA = statistical software package.
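The k-means row notes that the number of clusters must be defined in advance, or that multiple solutions must be run and compared (for example with a scree plot). The sketch below illustrates that workflow under stated assumptions: the data are hypothetical two-variable symptom scores invented for illustration, the helper names (`kmeans`, `best_kmeans`) are not from any cited study or statistical package, and the algorithm shown is plain Lloyd-style k-means with several random starts, which is one common variant rather than the procedure any of the cited authors used.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means; returns (labels, centroids, error sum of squares)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # partition is stable, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse

def best_kmeans(points, k, n_starts=10):
    """Run several random starts and keep the solution with the lowest SSE."""
    runs = [kmeans(points, k, seed=s) for s in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Hypothetical (frequency, severity) scores for eight patients.
points = [
    (1.0, 1.2), (0.8, 1.0), (1.2, 0.9), (1.1, 1.1),
    (4.0, 4.2), (4.1, 3.9), (3.9, 4.0), (4.2, 4.1),
]

# Error sum of squares for k = 1..4: the values a scree plot would display.
sse_by_k = {k: best_kmeans(points, k)[2] for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(f"k={k}: SSE={sse:.3f}")
```

Plotting `sse_by_k` against k gives the scree plot; the point where the curve flattens is a common, though informal, guide to the number of clusters. Because each case receives exactly one label, the sketch also reflects the table's point that k-means does not permit overlapping clusters.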