Abstract
The use of cluster analysis in the nursing literature is limited to the creation of classifications of homogeneous groups and the discovery of new relationships. As such, it is important to provide clarity regarding its use and potential. The purpose of this article is to provide an introduction to distance-based, partitioning-based, and model-based cluster analysis methods commonly utilized in the nursing literature, provide a brief historical overview on the use of cluster analysis in nursing literature, and provide suggestions for future research. An electronic search included three bibliographic databases, PubMed, CINAHL and Web of Science. Key terms were cluster analysis and nursing. The use of cluster analysis in the nursing literature is increasing and expanding. The increased use of cluster analysis in the nursing literature is positioning this statistical method to result in insights that have the potential to change clinical practice.
Keywords: cluster analysis, nursing research, historical article
Grouping similar items together is a fundamental way in which humans impart order to daily life. For example, grocery stores group products by type, such as produce, dairy, meat, and dry goods. Libraries group books by genre, including fiction, nonfiction, mystery, and romance. Clustering is simply the process of partitioning individual items into successively smaller numbers of groups of similar items. Cluster analysis is an umbrella term for a variety of multivariate statistical techniques that allow for heuristic examination of data points as a method of pattern recognition. The overall objective of cluster analysis is to combine data points into homogeneous groups, referred to as clusters (Beckstead, 2002).
The impetus for the development of a systematic clustering method was the seminal publication, Principles of Numerical Taxonomy, by biologists Robert Sokal and Peter Sneath (Sokal & Sneath, 1963). The authors argued that patterns of observed differences and similarities could be used to understand the evolutionary process. Sokal and Sneath's interest in clustering methods coincided with the development and availability of high-speed computers, which sped up the previously time-consuming analysis of large matrices and allowed for a more robust use of cluster analysis in scientific research. The hard sciences of biology and chemistry, as well as the social sciences of anthropology, psychology, and political science, were early adopters of modern clustering methods, recognizing in the 1960s the applicability and value of cluster analysis to their respective fields of study (Aldenderfer & Blashfield, 1984). More recently, medicine has embraced cluster analysis as an investigational class of statistical methods that has the power to detect new relationships and subsequently influence clinical practice decisions. For example, new recommendations regarding corticosteroid use have emerged after the elucidation of asthma phenotypes using cluster analysis (Boudier et al., 2013; Haldar et al., 2008; Moore et al., 2010; Siroux et al., 2011). There has been a virtual paradigm shift in the recommended treatment of breast and ovarian cancers after the discovery of mutant BRCA1 and BRCA2 genes, discoveries facilitated by the use of cluster analysis methodologies (Sørlie et al., 2003; van ‘t Veer et al., 2002).
Nursing research articles using cluster analysis began to appear regularly in the literature in the 1990s (Online Supplementary Table 1). Since then, there has been an explosion in published nursing literature citing cluster analysis methodologies. A recent Web of Science citation report reveals approximately 140 articles in the nursing literature referencing cluster analysis published in 2015, an all-time high (Figure 1).
Figure 1.
Web of Science citation report for nursing and cluster analysis.
The purpose of this article is twofold. First, this article will provide a basic overview of cluster analysis methods and statistical considerations commonly utilized in the nursing literature. Second, this article will provide a brief historical overview of the use of cluster analysis in research conducted by nurses and/or published in nursing sources from the 1970s to current day.
Cluster Analysis
Purpose
One reason for the enduring popularity of cluster analysis techniques in the nursing literature is their versatility and applicability to a variety of research aims and questions. Cluster analysis can be completed as an independent analysis, such as in Hillhouse and Adler's (1997) published research identifying three distinct stress effect subtypes of nurses based on measured nursing stressors and burnout. Alternatively, cluster analysis techniques can be a preprocessing step for subsequent post hoc analysis. Such is the case in a study by Lindberg, Wikström, and Lindberg (2010) who first identified subgroups of hemodialysis patients, and then studied differences in interdialytic fluid intake and weight gain among the clusters. In general, however, there are four primary recognized uses of cluster analysis techniques: (a) create classifications of homogeneous groups, (b) discover new relationships and/or investigation of conceptual schemes, (c) hypothesis testing, and (d) confirmatory analysis of previously identified classifications (Aldenderfer & Blashfield, 1984; Table 1).
Table 1.
The Four Primary Uses of Cluster Analysis.
Author (Year) | Purpose of Study | Results of Study | Primary Use of Cluster Analysis |
---|---|---|---|
Hillhouse and Adler (1997) | Identify effects of stress on nurses via analysis of patterns of nursing stressors, nursing burnout, affective symptoms, and physical symptoms | Three homogeneous stress effect subgroups were identified | Create classifications of homogeneous groups |
Allred et al. (1994) | Understand the nursing practice environment via identification of factors that compose the environment; identification of the amount of complexity and predictability of the environment; identification of the amount of uncertainty in the environment; and understanding the relationship between the environment and these factors | Three unique nursing environments emerged. Post hoc analysis revealed that as the nursing environment increased in levels of complexity, change, and unpredictability, the uncertainty levels in nurses increased | Investigate conceptual schemes |
Gilbertson-White, Miaskowski, Lee, Dodd, West, and Cooper (2007) | Cluster stability testing to test the hypothesis that distinct patient subgroups have a predilection toward sickness behaviors | A stable four-cluster solution was identified, which was identical to a previously conducted study by the same researchers, thus supporting the hypothesis that certain patient types have predispositions for higher levels of sickness behaviors | Hypothesis testing |
Kuhn and Culhane (1998) | Identification of an expanded homelessness typology | A more complex model of homelessness emerged, with a three-cluster solution as opposed to the four-cluster solution that had been popular in the literature before the publication of the study | Confirmatory analysis |
Cluster Analysis Algorithms
There are different algorithms, or statistical techniques, for performing cluster analysis. The choice of method depends on many factors, including the research questions and aims of the study, the type of data collected (i.e., nominal, ordinal, interval, or ratio; Kaufman & Rousseau, 2005), the researcher's familiarity and expertise with a particular algorithm, and/or the size of the data set (Abbas, 2008). A detailed discussion of all cluster analysis algorithms is beyond the scope of this article. However, a brief review of some of the most popular methods used in the nursing literature (distance-based, partitioning-based, and model-based approaches) is appropriate to help understand the application of these methods and also the progression toward modern day approaches (Figure 2).
Figure 2.
Cluster analysis flowchart.
Distance-based algorithms
The goal of cluster analysis is to group similar items together into discrete clusters. To do this, a proximity measure of the distance between the two items is necessary. In cluster analysis, mathematical distance measures are used to determine the amount of dissimilarity between any two items (Rokach & Maimon, 2005). There are many proximity measures available, one of the most familiar being the Pearson correlation coefficient. Euclidean, or “straight-line” distance, based in part on the Pythagorean theorem (Jain, Murty, & Flynn, 1999), is also frequently referenced in distance-based cluster analysis. The Manhattan distance, which is based on the Cartesian coordinate system of strictly horizontal and/or vertical paths (Krause, 1986), is another proximity measure commonly referenced in the nursing literature. The choice of which proximity measure to use is influenced by many statistical considerations such as the type of data in the data set, the presence of multicollinearity among the variables, variable scale, or evidence of prior use of a particular proximity measure in a specific field of study (Tan, Steinbach, & Kumar, 2005).
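To make the proximity measures named above concrete, the following is a minimal sketch in plain Python. The function names and the two example patient profiles are hypothetical and chosen only for illustration; expressing the Pearson coefficient as a dissimilarity by taking 1 minus r is one common convention, assumed here rather than taken from the studies reviewed.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson r, so perfectly correlated profiles have distance 0."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Two hypothetical patients scored on three symptom scales
patient_a = [1.0, 2.0, 3.0]
patient_b = [4.0, 6.0, 8.0]

print(euclidean(patient_a, patient_b))         # square root of 50
print(manhattan(patient_a, patient_b))         # 3 + 4 + 5
print(pearson_distance(patient_a, patient_b))  # profiles rise together
```

The two profiles are far apart in Euclidean and Manhattan terms but have a correlation-based distance of zero, because their scores rise in the same pattern. This illustrates why the choice of proximity measure can change which items are judged similar.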
Hierarchical methods
Hierarchical methods historically have been among the most popular cluster analysis techniques in the literature across all disciplines. In a review of all articles using cluster analysis, published since 1978, nearly two thirds of those articles used the agglomerative hierarchical method (Manning, Raghavan, & Schütze, 2009). The review of the nursing literature conducted for this article identified agglomerative hierarchical methodology in one third of the research studies reporting a particular algorithm (Online Supplementary Table 1). Hierarchical methods are popular as (a) conceptually they are relatively easy to understand, being an iterative distance-based method that relies on sorting and linking rules (Aldenderfer & Blashfield, 1984) and (b) they are versatile and capable of handling many forms of data and linkage rules (Rokach & Maimon, 2005). The biggest disadvantage of hierarchical methods is the time complexity and computational storage capacity involved with the various algorithms, making large database analysis cumbersome (Rokach & Maimon, 2005).
Hierarchical clustering methods can be subdivided based on the linkage rule used in cluster formation (Rokach & Maimon, 2005). Linkage is the formula used to calculate the mathematical distance measure between data points and prospective clusters. There are many different linkage rules for the formation of clusters, each of which will result in a unique solution. Four linkage rules are commonly used in the published nursing literature: single-linkage, complete-linkage, average-linkage, and Ward's method (Table 2).
Table 2.
Cluster Distance Measures.
Linkage Type | Process | Advantage | Disadvantage |
---|---|---|---|
Single-linkage or nearest neighbor | Combines two clusters together that have the smallest amount of dissimilarity (distance) between the closest pair of data points belonging to the different clusters. | Less sensitive to outliers | Based completely on single links between individual data points and cluster, forms elongated chains |
Complete-linkage or furthest-neighbor | Combines the two clusters whose farthest pair of data points, one from each cluster, has the smallest dissimilarity (distance); that is, the merge minimizes the maximum pairwise distance between the clusters. | Compact, hyperspherical clusters composed of very similar data points | Vulnerable to outliers; tends to break large clusters apart; all clusters have the same diameter, so smaller clusters are merged with larger ones |
Average-linkage | Combines two clusters together after an average distance measure for all preexisting data points belonging to the different clusters is calculated. Clusters are combined together only if a predetermined mathematical threshold is obtained. | Less sensitive to outliers | Tendency to split elongated clusters in half; tail portions of clusters tend to merge with neighboring clusters |
Ward's method | Minimizes intracluster variance via calculation of the error sum of squares. Combinations are made that yield the smallest error sum of squares. | Clusters are relatively equal in size and shape | Cannot be used with binary variables; has a tendency to form globular clusters |
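The four linkage rules in Table 2 can be sketched as functions that each reduce two candidate clusters to a single number; at every step, the pair of clusters with the smallest value is merged. This is an illustrative sketch in plain Python, assuming Euclidean distance. The function names and the two small example clusters are hypothetical.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Mean of all pairwise distances between the two clusters."""
    distances = [math.dist(p, q) for p, q in product(c1, c2)]
    return sum(distances) / len(distances)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def ward_increase(c1, c2):
    """Increase in the error sum of squares if the two clusters were merged."""
    n1, n2 = len(c1), len(c2)
    return (n1 * n2) / (n1 + n2) * math.dist(centroid(c1), centroid(c2)) ** 2

# Two hypothetical clusters of two points each
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(ward_increase(cluster_a, cluster_b))
```

Because each rule scores the same pair of clusters differently, the order of merges, and therefore the final solution, can differ from one linkage rule to another.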
There are two types of hierarchical algorithms, agglomerative and the less common divisive (Manning, Raghavan, & Schütze, 2009). At the initial step in agglomerative hierarchical clustering, each data point is considered a singleton cluster. Next, the data undergo sorting, and data points are joined in a heuristic manner based on the linkage rule employed. Data are sorted and resorted, with additional data points linked to the subsequent clusters, until ultimately all data points are joined into one large cluster. Divisive hierarchical clustering works in the opposite fashion. In the first step in divisive hierarchical clustering, all items belong to one large cluster. The initial cluster is sorted and divided into two subclusters. The two resultant subclusters are successively divided into smaller and smaller subclusters until all items are singleton clusters.
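The agglomerative procedure described above can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not the implementation used by any statistical package: it assumes Euclidean distance, recomputes every between-cluster distance at each step, and uses hypothetical names and example points.

```python
import math

def agglomerate(points, linkage=min):
    """Naive agglomerative clustering; returns the merge history.

    `linkage` reduces the pairwise distances between two clusters to one
    number: min gives single-linkage, max gives complete-linkage.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        history.append((sorted(merged), d))  # members and merge height
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history

# Four hypothetical data points forming two visible pairs
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
for members, height in agglomerate(points):
    print(members, round(height, 2))
```

The merge history is the information a dendrogram displays: the two tight pairs join first at small heights, and the final merge into one cluster occurs at a much larger height. Comparing every pair of clusters at every step is also what makes hierarchical methods slow on large data sets.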
Partitioning-based algorithms
K-means/iterative
K-means clustering is another popular algorithm, in large part because of its simplicity and its efficiency with large data sets, which can be time-consuming to analyze with other algorithms (Abbas, 2008). K-means is considered to be a partitioning method because it finds all the clusters simultaneously, partitioning the data in one step rather than forcing them into a nested hierarchical structure (Jain, 2010). K-means clustering attempts to generate clusters that minimize the distance between individual data points and their cluster centroid, the point that represents the theoretical multidimensional mean of all the data points in a given group (Abbas, 2008). The most glaring difference between k-means clustering and agglomerative hierarchical methods is the requirement to determine the number of clusters prior to initiating clustering (Norušis, 2008). The process of k-means clustering is as follows. First, the researcher selects the number of clusters and estimates of the initial cluster centroids. Next, in a first pass through the data, the distances between each data point and the centroids are calculated. Data points are assigned to the cluster whose centroid is closest. New centroids are then computed, and distances between all data points and the new centroids are recalculated. This process continues until reassignment of data points ceases and the centroids remain stable.
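The steps above can be sketched as follows. This is a minimal illustration of the standard assign-and-update iteration (often called Lloyd's algorithm) in plain Python, assuming Euclidean distance; the function name, the example points, and the starting centroids are hypothetical.

```python
import math

def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and updating centroids."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return assignment, centroids

# Six hypothetical data points; the researcher supplies k = 2 starting centroids
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 3.0),
          (8.0, 8.0), (9.0, 9.0), (8.5, 10.0)]
labels, centers = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

Note that both the number of clusters and the starting centroids are inputs. With less clearly separated data, different starting values can lead to different final solutions, which is one reason the choice of k and of initial centroids deserves justification.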
While k-means clustering is conceptually easy to understand, several limitations restrict its use in the nursing literature. Most notably, there must be a defined mean, which eliminates the possibility of using this method with nominal level data (Magidson & Vermunt, 2002a). Also, the number of clusters, k, must be defined a priori, a limitation when analyzing new populations with limited theoretical background or understanding. K-means clustering is also sensitive to missing data and outliers and tends to result in clusters of uneven sizes (Rokach & Maimon, 2005).
Model-based algorithms
Latent class models/finite mixture models
Latent class modeling has become a preferred method of clustering since the 1980s, when implementable methods were developed for analyzing categorical latent variables by modeling the conditional relationships between outcome variables with log-linear models. Latent class models are unique among clustering algorithms in that they are a probabilistic, model-based attempt to describe unobservable, or latent, categorical variables. Because latent variables are unobservable, they are also not directly measurable (Collins & Lansa, 2010). Latent variables are measured indirectly via isolation of the association between two or more independent observable variables (Magidson & Vermunt, 2004). Measurable association between observable variables is believed to be the consequence of the presence of a latent variable and a certain amount of error. When the latent variable is categorical in nature and comprised of a set of clusters, these clusters are referred to as latent classes (Collins & Lansa, 2010). The goal of latent class modeling is to (a) determine the smallest number of latent classes that explains the relationship between the observed variables and (b) identify the patterns between the latent classes and the observed variables (Magidson & Vermunt, 2004).
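To illustrate the probabilistic nature of latent class models, the sketch below computes the probability that an individual belongs to each latent class given a pattern of yes/no responses. It assumes the usual local independence assumption (indicators are independent within a class) and uses made-up class sizes and item probabilities; in practice these parameters are estimated from the data by specialized software, which is not shown here.

```python
def class_posteriors(response, class_sizes, item_probs):
    """Posterior probability of each latent class for one response pattern.

    response    : list of 0/1 answers to the observed indicators
    class_sizes : prior probability of each latent class (sums to 1)
    item_probs  : item_probs[c][j] = P(indicator j = 1 | class c)
    """
    joint = []
    for size, probs in zip(class_sizes, item_probs):
        likelihood = 1.0
        for y, p in zip(response, probs):
            likelihood *= p if y == 1 else (1 - p)
        joint.append(size * likelihood)  # P(class) * P(pattern | class)
    total = sum(joint)  # marginal probability of the response pattern
    return [j / total for j in joint]

# Hypothetical two-class model with three yes/no symptom indicators
class_sizes = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],   # assumed "high symptom" class
              [0.2, 0.1, 0.3]]   # assumed "low symptom" class

posteriors = class_posteriors([1, 1, 1], class_sizes, item_probs)
print(posteriors)
```

Unlike distance-based methods, which assign each case to exactly one cluster, this approach yields a probability of membership in every class, so uncertainty about classification is expressed explicitly.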
Latent class modeling has many advantages over traditional clustering techniques, particularly for nurse researchers. First, there are no requirements of the traditional modeling assumptions of linearity, normality, and homogeneity, thus resulting in latent class models that are less subject to bias (Magidson & Vermunt, 2002b). Second, there are no limitations on variable type in latent class modeling. Modern advances in latent class modeling have expanded to include mixed scale variables, including nominal, making latent class modeling useful with a wide array of observed variables (Magidson & Vermunt, 2002b). The ability to apply mixed scale models is particularly important in nursing research where it is common to have data scaled at all levels of measurement. Third, latent class modeling is considered a person-oriented approach, not a variable-oriented approach, as is the case with other forms of model-based clustering algorithms (Collins & Lansa, 2010). The emphasis in latent class modeling is the individual and their individual pattern of relevant, unique characteristics. Individual comparisons against the entire data set identify meaningful and scientifically interesting patterns. Fourth, latent class analysis can work with missing data, unlike traditional distance-based clustering algorithms (Collins & Lansa, 2010). Finally, latent class analysis allows for the inclusion of covariates that can be used to build predictive models of class membership (Magidson & Vermunt, 2002b). The ability to predict and anticipate an individual's future health needs and provide preventive care is invaluable in nursing research.
Cluster Validation
Validation of the cluster solution is necessary for any study utilizing cluster analysis. Despite this, validation techniques are not well understood, and a discussion of validation technique is often missing in the published literature. One third of the articles identified for this article contained no discussion of validation techniques (Online Supplementary Table 1). Five of the most common validation techniques are introduced below.
The cophenetic correlation coefficient is similar to a correlation coefficient in regression in that it shows the goodness of fit of the clustering solution. Specifically, cophenetic correlations measure how faithfully a hierarchical clustering solution preserves the pairwise distances between the original “unclustered” data points (Aldenderfer & Blashfield, 1984). As with any correlation coefficient, values approaching 1 are indicators of high correlation or, in the case of cophenetic correlation, cluster fit. A limitation of the cophenetic correlation is that it can be used only with agglomerative hierarchical methods, a major limitation for modern users of latent class analysis (Halkidi, Batistakis, & Vazirgiannis, 2002). It is important to note that after the development of advanced statistical techniques, cophenetic correlations are considered to be a misleading indicator of cluster validity (Aldenderfer & Blashfield, 1984).
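As a rough illustration of how the cophenetic correlation is obtained, the sketch below builds a small hierarchical solution, records the height at which each pair of points first shares a cluster (the cophenetic distance), and correlates those heights with the original pairwise distances. It is written in plain Python, assumes Euclidean distance and single-linkage by default, and uses hypothetical names and data; statistical packages provide their own routines for this.

```python
import math
from itertools import combinations

def cophenetic_distances(points, linkage=min):
    """Height at which each pair of points first shares a cluster."""
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage rule
        d, a, b = min(
            (linkage(math.dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        for i in clusters[a]:
            for j in clusters[b]:
                coph[frozenset((i, j))] = d  # this pair joins at height d
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return coph

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def cophenetic_correlation(points, linkage=min):
    """Correlate original pairwise distances with cophenetic distances."""
    coph = cophenetic_distances(points, linkage)
    pairs = list(combinations(range(len(points)), 2))
    original = [math.dist(points[i], points[j]) for i, j in pairs]
    tree = [coph[frozenset(pair)] for pair in pairs]
    return pearson(original, tree)

# Four hypothetical data points forming two visible pairs
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
coefficient = cophenetic_correlation(points)
print(round(coefficient, 3))
```

For these clearly separated points the coefficient is close to 1, because the tree reproduces the original distances almost exactly. A high value only indicates that the hierarchy fits the distances, not that the clusters are meaningful.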
A second frequently used validation technique is statistical significance testing, via ANOVA or MANOVA, of differences between the clusters on the internal predictor variables that were used to create them. Statistical testing of internal variables is a popular approach in the nursing literature. Of the articles reviewed for this article that discuss validation techniques, 50% report use of significance testing of internal variables (Online Supplementary Table 1). However, the use of post hoc ANOVA or MANOVA testing of internal predictor variables is not recommended. The ANOVA or MANOVA F tests will be significant whether or not clusters exist in the data set under analysis (Aldenderfer & Blashfield, 1984). In cluster analysis, observable predictor variables have been used to discriminate among clusters. As a result, cases with similar values on the predictor variables are grouped together. Any tests of significance using these same predictor variables will invariably find significantly discrepant levels among the clusters.
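The circularity described above can be demonstrated with a small simulation. The sketch below, in plain Python, draws one variable from a single uniform population (so no true clusters exist), splits it at the median as a stand-in for a two-cluster solution on that variable, and computes a one-way ANOVA F statistic on the same variable. The data, seed, and function name are hypothetical and for illustration only.

```python
import random

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

rng = random.Random(0)
# One variable from a single uniform population: no true clusters exist
scores = sorted(rng.uniform(0, 1) for _ in range(200))
# A median split stands in for a two-cluster solution built on this variable
low, high = scores[:100], scores[100:]

f_stat = one_way_f([low, high])
print(round(f_stat, 1))
```

The resulting F statistic is far above the conventional critical value (roughly 3.9 for 1 and 198 degrees of freedom at the .05 level), even though the data contain no cluster structure. The "significant" difference arises only because the groups were formed from the very variable being tested.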
Replication of a cluster solution with different data sets is a third, more robust way to validate cluster accuracy (Aldenderfer & Blashfield, 1984). Using Karl Popper's idea of falsifiability, if a cluster solution fails to replicate with a different data set, then this is evidence to reject the solution (Popper, 1968). If a different data set does replicate the cluster solution, then this is evidence in support of cluster validity. Of the articles reviewed for this article, only 14 reported replication as the cluster validation technique of choice (Online Supplementary Table 1).
Statistical tests of significance on variables external to the clustering solution are considered to be among the most robust methods of cluster validation (Aldenderfer & Blashfield, 1984). This fourth approach is straightforward. Statistical testing, such as ANOVAs, of each cluster on variables not used in cluster assignment, directly tests the cluster solution against relevant criteria. Despite its strength, this approach is used less frequently than statistical testing of internal variables as the added expense and time in the collection of additional variables can be a limiting factor. Of the articles reviewed for this article that discuss validation techniques, 38% report use of significance testing of external variables (Online Supplementary Table 1).
Finally, Monte Carlo procedures can also be used for cluster validation (Halkidi et al., 2002). Monte Carlo procedures are complicated and involve the creation of multiple stochastic artificial data sets that match the characteristics of the actual data set. Each iteration is clustered using the same algorithm, and results are compared with the original data set for similarities and differences. Monte Carlo procedures are time-consuming and require considerable computer processing capabilities (Halkidi, Batistakis, & Vazirgiannis, 2001).
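One simple form of a Monte Carlo check can be sketched as follows. The example, in plain Python, clusters one-dimensional data into two groups, then repeats the same clustering on many artificial data sets drawn uniformly over the same range and of the same size, and reports how often structureless data cluster at least as tightly as the observed data. The uniform reference distribution, the within-cluster sum of squares as the comparison statistic, and all names and values are assumptions made for illustration, not a procedure taken from the studies reviewed.

```python
import random

def best_two_cluster_wss(values):
    """Smallest within-cluster sum of squares over all two-way splits of 1D data."""
    xs = sorted(values)

    def ss(group):
        mean = sum(group) / len(group)
        return sum((v - mean) ** 2 for v in group)

    return min(ss(xs[:k]) + ss(xs[k:]) for k in range(1, len(xs)))

def monte_carlo_p(values, n_sim=199, seed=0):
    """Share of structureless data sets that cluster at least as tightly."""
    rng = random.Random(seed)
    observed = best_two_cluster_wss(values)
    lo, hi = min(values), max(values)
    as_tight = 0
    for _ in range(n_sim):
        # Artificial data set: same size and range, but no built-in clusters
        null = [rng.uniform(lo, hi) for _ in values]
        if best_two_cluster_wss(null) <= observed:
            as_tight += 1
    return (as_tight + 1) / (n_sim + 1)

# Hypothetical observed data with two well-separated groups
observed = [0.1 * i for i in range(10)] + [10 + 0.1 * i for i in range(10)]
p_value = monte_carlo_p(observed)
print(p_value)
```

A small value suggests that the observed clusters are tighter than would be expected from data without structure. Even this toy version reclusters every simulated data set, which hints at why Monte Carlo validation becomes computationally demanding with realistic data and algorithms.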
Application to Nursing Research
Historical perspective
To gain perspective on the use of cluster analysis in research conducted by nurses and/or published in the nursing literature, an electronic search was conducted in three databases: CINAHL, PubMed, and Web of Science. Key terms were “cluster analysis” and “nursing,” with search restrictions limiting results to published English-language articles and human research, and with no date restrictions. In total, this focused search identified 253 potentially relevant studies for review (Figure 3). After completion of each database search, records of interest were exported to the RefWorks web-based bibliographic management system, where duplicates were removed. Removal of duplicate records resulted in 201 potentially relevant studies, which were screened for inclusion via title and abstract review using predetermined inclusion and exclusion criteria. Inclusion criteria were primary source material, human research, and studies published in English. Exclusion criteria were abstracts; secondary and tertiary source materials including review articles, meta-analyses, and practice guidelines; unpublished research including dissertations; research completed by nonnurses; articles published in nonnursing journals or research using nurses merely as study subjects; use of factor analytic methods as the clustering method; and/or animal research. Title and abstract screening resulted in the removal of 48 additional articles, leaving 153 articles retrieved for full article review. Full article review resulted in the removal of 17 additional articles, leaving 136 for inclusion in this review. A researcher-created structured form was used to facilitate data abstraction. Publication year, authors, cluster analysis method, research purpose, basic research findings, and validation technique were extracted from each article (Online Supplementary Table 1). An author-assigned organizational category is also listed in the extraction matrix.
Figure 3.
Literature search flow diagram.
1970s: The emergence of cluster analysis
One of the earliest applications of cluster analysis in the published nursing literature focused on defining nursing practice. In 1977, D. Hagan published “Cluster Analysis: An Empirical Methodology for Developing Homogeneous Patient Subjects for Research Purposes” (Hagan, 1977). Hagan's interests were in conceptually defining quality nursing care. He recognized the difficulty in identifying the target population, not by the traditionally accepted patient characteristics such as age, diagnosis, or nursing unit but by nursing care provided. Cluster analysis provided the methodological impetus to delineate an old concept, quality nursing care, in an updated and modern way. Hagan's exploratory cluster analysis identified homogeneous subgroups of patients based solely on the nursing care provided to the patient and patient acuity.
1990s: The reemergence of cluster analysis
More than a decade passed before cluster analysis was seen again in the published nursing literature. Stuifbergen (1990) used k-means clustering to identify profiles of families in which one parent was chronically ill, defining a four-cluster solution and description. Agglomerative hierarchical methods dominated the literature during the 1990s, cited as the method of analysis in 11 of the 14 cluster analysis research articles published during this time. The focus of researchers using cluster analysis during the 1990s was varied and included nursing management and administration, nurse typology, nurse practice environment, patient typology, and theory/construct development. The 1990s also saw the introduction of cluster analysis with a disease-specific focus. Vasiliadou, Karvountzis, Soumilas, Roumeliotis, and Theodosopoulou (1995) were among the first to use cluster analysis in a disease-specific analysis of occupation-related low back pain. However, their use of cluster analysis was limited to sorting hospital units into homogeneous groups. Johnson (1997) expanded the use of cluster analysis to develop a healing typology for chronic leg ulcers via factors that affect healing in venous and arterial disease, developing a three-cluster solution. The remainder of the decade saw published disease-specific literature on a wide variety of conditions: AIDS (Huba, Brief, Cherin, Panter, & Melchior, 1998), chronic pain (Hall-Lord, Larsson, & Steen, 1999), and urinary incontinence (Shimanouchi, Kamei, & Hayashi, 2000).
2000s: The expansion of cluster analysis
The start of the new millennium in the nursing literature saw the introduction of latent class analysis and program-specific approaches, such as SPSS's two-step clustering (SPSS Inc., 2001), which identifies groupings by first running a preclustering scan and then analyzing the data further with hierarchical methods. The number of published research articles utilizing cluster analysis more than doubled in the 2000s, with 55 published articles extracted for this review (Online Supplementary Table 1), indicating increasing recognition of the methods. However, no single method dominated during this decade. Cluster analysis was used for the first time in the nursing educational literature (Thorpe & Loo, 2003) and in instrument development (Lin, Cheng, Kuo, & Chou, 2009). Disease-specific focus was increasingly common during the first decade of the 21st century, with the oncology literature pioneering the innovative use of cluster analysis methods. Bender, Ergun, Rosenzweig, Cohen, and Sereika (2005) published “Symptom Clusters in Breast Cancer Across 3 Phases of the Disease,” signaling the use of cluster analysis in the development of symptom clusters. Few nursing specialties have embraced cluster analysis methods as readily as oncology. One third (n = 14) of the 42 extracted disease-specific cluster analysis research studies for this article focus on oncologic issues (Online Supplementary Table 1).
The cardiovascular literature was the next to embrace the use of cluster analysis, incorporating the idea of symptom clusters from the oncology literature into cardiac disorders. Ryan et al. (2006) were the first to apply cluster analysis to symptoms of acute coronary syndrome, identifying five clusters of acute myocardial infarction symptoms. Acute coronary syndrome continues to be the focus of study in the cardiovascular cluster analysis nursing literature with multiple publications in the current decade (McSweeney, Cleves, Zhao, Lefler, & Yang, 2010; Riegel et al., 2010; Rosenfeld et al., 2015).
2010s: The surge of cluster analysis
The number of published articles continues to increase in the current decade. Sixty nursing research articles from the current decade were identified for this article, representing almost 45% of the total sample of extracted articles since 1977 (Online Supplementary Table 1). Latent class analysis is becoming more common in the nursing literature. However, agglomerative hierarchical methods continue to prevail, referenced in almost 45% of the published articles during the first half of the current decade. While the oncology literature continues to predominate in the second decade of the 21st century, the disease-specific focus of other nursing specialties continues to expand and now includes nephrology (Lindberg et al., 2010), neurology (Buijck et al., 2012), and autoimmune diseases (Shahrbanian, Duquette, Kuspinar, & Mayo, 2015). With increasing focus on hospital readmission, cardiovascular research has expanded to include heart failure during the most recent decade (Hertzog, Pozehl, & Duncan, 2010; Lee et al., 2010). In addition, cluster analysis has been used for the first time to examine issues specific to advanced practice nursing, thus expanding cluster analysis to the full spectrum of nursing care (Ghosh, Sterns, Drew, & Hamera, 2011; Kulczycki, Qu, Bosarge, & Shewchuk, 2010).
Implications for the future
To date, cluster analysis in the nursing literature primarily focuses on identification of subgroups of various populations. A recommended future direction would be for researchers to further develop and refine their results by applying them to outcomes and interventional studies. As an example, Haldar et al. (2008) not only discriminated between characteristics of specific phenotypes of refractory asthma patients, but they also used data from the randomized controlled trial arm of their study to identify a steroid management strategy that reduced exacerbation frequency 10-fold among the obese, symptom-predominant, noneosinophilic phenotype. This two-stage research design has potential in many challenging patient populations with chronic illness familiar to nursing, such as diabetes or epilepsy. In the review of the literature for this article, the authors did not identify research using cluster analysis in the diabetic population. Identification of subtypes of brittle type 2 diabetes mellitus patients could lay the foundation for the development of targeted interventions to improve glycemic control and long-term outcomes in this challenging population.
In addition, longitudinal research examining cluster stability over time as a validation of previously identified clusters is encouraged. Once validity and stability are established, the next step would be the development of targeted clinical interventions aimed at shifting cluster membership. The ability to define interventions that can alter an individual's chance of morbidity or symptom burden in a positive manner should be the ultimate, long-term goal of cluster analysis in nursing research. As nursing science continues to evolve and progress, robust clustering techniques have the potential to produce transformative research that can improve the health outcomes and quality of life of a wide variety of patient populations.
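One way to quantify cluster stability over time is to compare the cluster assignments of the same participants at two time points. The sketch below uses the adjusted Rand index, a standard chance-corrected agreement measure. It is offered as one possible approach rather than a method prescribed by the studies reviewed here, and the variable names `wave1` and `wave2` are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two cluster assignments.

    Both sequences must list the same participants in the same order.
    Returns 1.0 for identical partitions (regardless of how clusters are
    numbered) and values near 0 for agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same participants")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two participants are required")

    # Count participants in each (cluster at time 1, cluster at time 2) cell.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # largest attainable agreement
    if maximum == expected:
        # Degenerate case, e.g. both partitions place everyone in one cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical assignments for six participants at two time points.
wave1 = [0, 0, 0, 1, 1, 1]
wave2 = [1, 1, 1, 0, 0, 0]  # same grouping, different cluster numbering
stable = adjusted_rand_index(wave1, wave2)
shifted = adjusted_rand_index(wave1, [0, 0, 1, 1, 1, 1])  # one participant moved
```

Because the index ignores how clusters are numbered, it can be applied even when the clustering is rerun independently at each time point. A low value indicates that membership has shifted, which may reflect unstable clusters or, after an intervention, the intended change in membership.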
Discussion
The major findings of this analysis are that (a) cluster analysis is being used increasingly in the nursing literature; (b) commonly used methodological approaches in nursing research are agglomerative hierarchical, k-means, and latent class analysis; and (c) the use of cluster analysis in the nursing literature primarily focuses on homogeneous groupings of patients, nurses, nursing units, caregivers, and nursing students. As a result of expanded use, coupled with increasingly statistically robust methods, cluster analysis in nursing has been useful in the discovery of substantive differences and patterns among various groups of interest in nursing research and practice.
Cluster analysis is one of a variety of statistical methods used to identify and classify items into groups. The use of cluster analysis in the nursing literature spans five decades and is increasing in acceptance and popularity, with no evidence of slowing. In the nursing literature, the items frequently grouped are patients, nurses, nursing units, caregivers, and nursing students. Advances in methodology have broadened the use of cluster analysis to an increasingly diverse array of research questions. Despite increased acceptance of the methods, the use of cluster analysis in the nursing literature is limited to the creation of classifications of homogeneous groups and the discovery of new relationships. Future research using cluster analysis techniques to inform intervention studies should be encouraged, especially in research focusing on nursing practice, clinical decision making, and implementation of evidence-based practice.
Acknowledgments
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material: The online supplements are available at http://journals.sagepub.com/doi/full/10.1177/0193945917707705.
References
- Abbas OA. Comparisons between data clustering algorithms. International Arab Journal of Information Technology. 2008;5:320–325.
- Aldenderfer MS, Blashfield RK, editors. SAGE University paper series on quantitative applications in the social sciences, series no 07-044. 6th. Newbury Park, CA: SAGE; 1984. Cluster analysis.
- Allred CA, Michel Y, Arford PH, Carter V, Veitch JS, Dring R, et al. Finch NJ. Environmental uncertainty: Implications for practice model redesign. Nursing Economic$. 1994;12:318–326.
- Beckstead J. Using hierarchical cluster analysis in nursing research. Western Journal of Nursing Research. 2002;24:307–319. doi: 10.1177/01939450222045923.
- Bender CM, Ergun FS, Rosenzweig MQ, Cohen SM, Sereika SM. Symptom clusters in breast cancer across 3 phases of the disease. Cancer Nursing. 2005;28:219–225. doi: 10.1097/00002820-200505000-00011.
- Boudier A, Curjuric I, Basagaña X, Hazgui H, Anto JM, Bousquet J, et al. Sunyer J. Ten-year follow-up of cluster-based asthma phenotypes in adults. A pooled analysis of three cohorts. American Journal of Respiratory & Critical Care Medicine. 2013;188:550–560. doi: 10.1164/rccm.201301-0156OC.
- Buijck BI, Zuidema SU, Eijk MS, Bor H, Gerritsen DL, Koopmans RT. Is patient-grouping on basis of condition on admission indicative for discharge destination in geriatric stroke patients after rehabilitation in skilled nursing facilities? The results of a cluster analysis. BMC Health Services Research. 2012;12:443. doi: 10.1186/1472-6963-12-443.
- Collins LM, Lanza ST. Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. Hoboken, NJ: John Wiley; 2010.
- Ghosh D, Sterns AA, Drew BL, Hamera E. Geospatial study of psychiatric mental health-advanced practice registered nurses (PMH-APRNs) in the United States. Psychiatric Services. 2011;62:1506–1509. doi: 10.1176/appi.ps.000532011.
- Gilbertson-White S, Miaskowski C, Lee K, Dodd M, West C, Cooper B. The stability of patient subgroups identified using cluster analysis. Oncology Nursing Forum. 2007;34(1):202–203.
- Hagan DE. Cluster analysis: An empirical methodology for developing homogeneous patient subjects for research purposes. Communicating Nursing Research. 1977;9:395–407.
- Haldar P, Pavord ID, Shaw DE, Berry MA, Thomas M, Brightling CE, et al. Green RH. Cluster analysis and clinical asthma phenotypes. American Journal of Respiratory & Critical Care Medicine. 2008;178:218–224. doi: 10.1164/rccm.200711-1754OC.
- Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001;17:107–145. doi: 10.1023/A:1012801612483.
- Halkidi M, Batistakis Y, Vazirgiannis M. Cluster validity methods: Part I. SIGMOD Record. 2002;31(2):40–45. doi: 10.1145/565117.565124.
- Hall-Lord ML, Larsson G, Steen B. Chronic pain and distress in older people: A cluster analysis. International Journal of Nursing Practice. 1999;5:78–85. doi: 10.1046/j.1440-172x.1999.00157.x.
- Hertzog MA, Pozehl B, Duncan K. Cluster analysis of symptom occurrence to identify subgroups of heart failure patients: A pilot study. The Journal of Cardiovascular Nursing. 2010;25:273–283. doi: 10.1097/JCN.0b013e3181cfbb6c.
- Hillhouse JJ, Adler CM. Investigating stress effect patterns in hospital staff nurses: Results of a cluster analysis. Social Science & Medicine. 1997;45:1781–1788. doi: 10.1016/S0277-9536(97)00109-3.
- Huba GJ, Brief DE, Cherin DA, Panter AT, Melchior LA. A typology of service patterns in end-stage AIDS care: Relationships to the transprofessional model. Home Health Care Services Quarterly. 1998;17:73–92. doi: 10.1300/J027v17n01_05.
- Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010;31:651–666. doi: 10.1016/j.patrec.2009.09.011.
- Jain AK, Murty M, Flynn P. Data clustering: A review. ACM Computing Surveys (CSUR). 1999;31:264–323. doi: 10.1145/331499.331504.
- Johnson M. Using cluster analysis to develop a healing typology in vascular ulcers. Journal of Vascular Nursing. 1997;15(2):45–49. doi: 10.1016/S1062-0303(97)90000-5.
- Kaufman L, Rousseeuw PJ. Finding groups in data: An introduction to cluster analysis. Hoboken, NJ: John Wiley; 2005.
- Krause EF. Taxicab geometry: An adventure in non-Euclidean geometry. New York, NY: Dover; 1986.
- Kulczycki A, Qu H, Bosarge PM, Shewchuk RM. Examining the diverse perspectives of nurse practitioners regarding obstacles to diaphragm prescription: A latent class analysis. Journal of Women's Health. 2010;19:1355–1361. doi: 10.1089/jwh.2009.1730.
- Lee KS, Song EK, Lennie TA, Frazier SK, Chung ML, Heo S, et al. Moser DK. Symptom clusters in men and women with heart failure and their impact on cardiac event-free survival. The Journal of Cardiovascular Nursing. 2010;25:263–272. doi: 10.1097/JCN.0b013e3181cfbb88.
- Lin C, Cheng C, Kuo S, Chou F. Development of a Chinese short form of the prenatal self-evaluation questionnaire. Journal of Clinical Nursing. 2009;18:659–666. doi: 10.1111/j.1365-2702.2007.02201.x.
- Lindberg M, Wikström B, Lindberg P. Subgroups of haemodialysis patients in relation to fluid intake restrictions: A cluster analytical approach. Journal of Clinical Nursing. 2010;19:2997–3005. doi: 10.1111/j.1365-2702.2010.03372.x.
- Magidson J, Vermunt JK. Latent class models for clustering: A comparison with K-means. Canadian Journal of Marketing Research. 2002a;20:36–43.
- Magidson J, Vermunt JK. A nontechnical introduction to latent class models. Belmont, MA: Statistical Innovations Inc; 2002b.
- Magidson J, Vermunt JK. Latent class models: The SAGE handbook of quantitative methodology for the social sciences. Thousand Oaks, CA: SAGE; 2004.
- Manning CD, Raghavan P, Schütze H. Hierarchical clustering. In: Manning CD, Raghavan P, Schütze H, editors. An introduction to information retrieval. Cambridge, UK: Cambridge University Press; 2009. pp. 377–401.
- McSweeney JC, Cleves MA, Zhao W, Lefler LL, Yang S. Cluster analysis of women's prodromal and acute myocardial infarction symptoms by race and other characteristics. The Journal of Cardiovascular Nursing. 2010;25:311–322. doi: 10.1097/JCN.0b013e3181cfba15.
- Moore WC, Meyers DA, Wenzel SE, Teague WG, Li H, Li X, et al. Bleecker ER. Identification of asthma phenotypes using cluster analysis in the severe asthma research program. American Journal of Respiratory and Critical Care Medicine. 2010;181:315–323. doi: 10.1164/rccm.200906-0896OC.
- Norušis MJ. Cluster analysis. In: Norušis MJ, editor. SPSS statistics 17.0: Advanced statistical procedures companion. Upper Saddle River, NJ: Prentice Hall; 2008. pp. 361–391.
- Popper KR. Conjectures and refutations: The growth of scientific knowledge. 5th. New York, NY: Routledge Classics; 1968.
- Riegel B, Hanlon AL, McKinley S, Moser DK, Meischke H, Doering LV, et al. Dracup K. Differences in mortality in acute coronary syndrome symptom clusters. American Heart Journal. 2010;159:392–398. doi: 10.1016/j.ahj.2010.01.003.
- Rokach L, Maimon O. Clustering methods. In: Maimon O, Rokach L, editors. Data mining and knowledge discovery handbook. 1st. New York, NY: Springer; 2005. pp. 321–352.
- Rosenfeld AG, Knight EP, Steffen A, Burke L, Daya M, DeVon HA. Symptom clusters in patients presenting to the emergency department with possible acute coronary syndrome differ by sex, age, and discharge diagnosis. Heart & Lung. 2015;44:368–375. doi: 10.1016/j.hrtlng.2015.05.008.
- Ryan CJ, Devon HA, Horne R, King KB, Milner K, Moser DK, et al. Zerwic JJ. Symptom clusters in acute myocardial infarction. American Journal of Critical Care. 2006;15:337.
- Shahrbanian S, Duquette P, Kuspinar A, Mayo NE. Contribution of symptom clusters to multiple sclerosis consequences. Quality of Life Research. 2015;24:617–629. doi: 10.1007/s11136-014-0804-7.
- Shimanouchi S, Kamei T, Hayashi M. Home care for the frail elderly based on urinary incontinence level. Public Health Nursing. 2000;17:468–473. doi: 10.1046/j.1525-1446.2000.00468.x.
- Siroux V, Basagaña X, Boudier A, Pin I, Garcia-Aymerich J, Vesin A, et al. Sunyer J. Identifying adult asthma phenotypes using a clustering approach. European Respiratory Journal. 2011;38:310–317. doi: 10.1183/09031936.00120810.
- Sokal RR, Sneath PH. Principles of numerical taxonomy. San Francisco, CA: W.H. Freeman; 1963.
- Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, et al. Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:8418–8423. doi: 10.1073/pnas.0932692100.
- SPSS Inc. A scalable component to segment your customers more effectively (White paper, technical report) Chicago, IL: Author; 2001. The SPSS TwoStep cluster component.
- Stuifbergen AK. Patterns of functioning in families with a chronically ill parent: An exploratory study. Research in Nursing & Health. 1990;13:35–44. doi: 10.1002/nur.4770130107.
- Tan P, Steinbach M, Kumar V. Cluster analysis. In: Tan P, Steinbach M, Kumar V, editors. Introduction to data mining. 1st. Boston, MA: Addison-Wesley Longman; 2005. pp. 7–65.
- Thorpe K, Loo R. Critical-thinking types among nursing and management undergraduates. Nurse Education Today. 2003;23:566–574. doi: 10.1016/S0260-6917(03)00102-3.
- van ‘t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, et al. Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.
- Vasiliadou A, Karvountzis GG, Soumilas A, Roumeliotis D, Theodosopoulou E. Occupational low-back pain in nursing staff in a Greek Hospital. Journal of Advanced Nursing. 1995;21:125–130. doi: 10.1046/j.1365-2648.1995.21010125.x.