2023 May 4. Online ahead of print. doi: 10.1016/j.amepre.2023.05.002

A Novel Approach to Developing Disease and Outcome−Specific Social Risk Indices

Michael Korvink 1, Laura H Gunn 2,3,4, German Molina 5, Dani Hackner 6, John Martin 1
PMCID: PMC10156642  PMID: 37149108

Abstract

Introduction

A variety of industry composite indices are employed within health research in risk-adjusted outcome measures and to assess health-related social needs. During the COVID-19 pandemic, the relationships among risk adjustment, clinical outcomes, and composite indices of social risk have become relevant topics for research and healthcare operations. Despite their widespread use, composite indices often comprise correlated variables and may therefore be affected by duplicated information among their underlying risk factors.

Methods

A novel approach is proposed that assigns outcome- and disease group−driven weights to social risk variables to form disease and outcome−specific social risk indices. The approach is demonstrated on the county-level Centers for Disease Control and Prevention social vulnerability factors. The method uses a subset of principal components reweighed through Poisson rate regressions while controlling for county-level patient mix. The analyses use 6,135,302 unique patient encounters from 2021 across 7 disease strata.

Results

The reweighed index shows reduced root mean squared error in explaining county-level mortality in 5 of the 7 disease strata and equivalent performance in the remaining 2 strata, using the current Centers for Disease Control and Prevention Social Vulnerability Index as the benchmark.

Conclusions

A robust method is provided, designed to overcome challenges with current social risk indices, by accounting for redundancy and assigning more meaningful disease and outcome−specific variable weights.

INTRODUCTION

There is increasing demand in the healthcare industry to understand and adjust for social factors that may be associated with disparities in health outcomes.1, 2, 3, 4, 5 A more informed understanding of the relationship between social determinants and clinical outcomes may improve fairness in the risk adjustment of metrics within regulatory programs that impact Inpatient Prospective Payment System payment.5, 6, 7, 8, 9 Other programs, although not tied to hospital payment, publish grades or hospital rankings that can affect an organization's industry reputation.10, 11, 12, 13, 14

Outcome measures included in these programs are risk-adjusted using patient comorbidities and other patient characteristics; however, none adjust for social factors known to be associated with disparities in health outcomes. Hospitals serving historically marginalized communities may be unfairly penalized in such programs.15, 16, 17 Such redistribution of resources brings to light policy issues related to social determinants of equity, which Camara Jones aptly defines as, “interventions on the structures, policies, practices, norms, and values, that differently distribute resources and risks…”.15 , 16 , 18

A wide range of social risk indices has been put forth in the literature, such as the Centers for Disease Control and Prevention Agency for Toxic Substances and Disease Registry Social Vulnerability Index (SVI), Minority Health Social Vulnerability Index, Social Deprivation Index, and Area Deprivation Index. Such indices are comprised of social risk variables across various socioeconomic and sociodemographic domains.19, 20, 21, 22, 23, 24 Although absent in hospital ranking programs, the utility of such indices in risk adjustment to help explain aspects of disparity as they relate to health outcomes has also been shown.22 , 25, 26, 27, 28, 29

Index development continues to mature, overcoming various methodologic concerns along the way. A primary concern is the potential to over- or under-represent the influence of underlying aspects of disparity owing to the aggregation of variables with similar information content into a singular index. As Krieger et al. state in reference to the use of overall indices, “One concern is that combining measures of income and education into one index… can conflate pathways and obscure each component's distinct-and conceivably different-contribution to specified health outcomes.”30 More robust methods have been shown in recent research, such as the use of principal component analysis (PCA) to identify latent aspects of social risk spanning the range of variables used within the composite. Singh et al., for example, developed an index on the basis of the first principal component derived from a curated set of SES variables.25

The second concern relates to the methods used to weigh variables, or principal components in the case of PCA, within an overall composite. The National Quality Forum (NQF) has put forth 10 recommendations relating to the use of social risk factors within clinical measurement.5, 31 Among these recommendations, the NQF suggests that social risk factors should demonstrate (1) “clinical/conceptual relationship with the outcome of interest”; (2) “empirical association with the outcome of interest”; and (3) “contribution of unique variation in the outcome (i.e., not redundant).”31 Braveman et al. corroborate this position by advocating for the use of outcome- and social group−specific measures of SES.32 Kolak et al. employed a PCA index method but recommend the use of multiple components (i.e., principal components), rather than a singular index, to capture the unique association between the latent aspects of social risk and the outcome being measured.26

It has long been established that disease states can be sensitive to social risk factors. For instance, although housing air quality and other social factors may correlate with asthma hazards,33 , 34 the models may fail without a disease-specific approach.35 By identifying the social risk factors most relevant to each health outcome, policies can be designed to overcome the disparities in those health outcomes by factor. Thus, such outcome-specific identification can help to allocate resources efficiently where disparities due to such factors may be most influential to enhance health outcomes. Although it is true that those factors may relate to health outcomes in intricate, complex forms, it is essential to identify root causes to address these disparities.

Robust social indices should employ methods to ensure that unique aspects are appropriately captured and meaningfully weighed in the context of health outcomes and disease states, in line with how clinical patient characteristics are weighed differently when constructing outcome and disease−specific risk indicators, such as validated clinical determinants of cardiovascular disease events.36 , 37 With these best practices in mind, a methodologic approach to index formulation is proposed that (1) identifies unique latent aspects of social risk and (2) appropriately weighs aspects of social risk on the basis of their unique association to an outcome within a specific disease group. Although the methodologic approach in this study was applied to variables used within the SVI, the approach is generalizable and can be applied to other sets of social factors, such as those included within other composite indices (Area Deprivation Index, Minority Health Social Vulnerability Index, Social Deprivation Index, and others).

METHODS

Study Sample

Acute inpatient hospitalizations from the year 2021 were extracted from the Premier Healthcare Database (PHD), a private all-payer administrative database.38 All patient data are deidentified, so this study is exempt from IRB approval. To show the proposed approach, clinical cohorts frequently used within hospital regulatory and private hospital ratings programs were evaluated, including acute myocardial infarction (AMI), heart failure (HF), perinatal and related conditions (PR), pneumonia (PN), stroke (STK), and total hip and knee arthroplasty (THA/TKA).7, 10, 12, 13 These cohorts were identified by the ICD-10 principal diagnosis associated with the hospitalization. A coronavirus disease (COVID-19) cohort was also included, given its relevance to current research. A principal or secondary diagnosis of COVID-19 was used to identify COVID-19 hospitalizations. The number of hospitalizations varied by cohort (Table 1).

Table 1.

County and Patient Hospitalization Counts by Disease Group

Disease group | PHD counties (n) | PHD percentage of all U.S. counties | PHD hospitalizations (n)
AMI | 2,843 | 90% | 364,094
COVID-19 | 3,023 | 96% | 1,398,265
HF | 2,852 | 91% | 774,251
PN | 2,872 | 91% | 738,269
PR | 2,896 | 92% | 2,207,649
STK | 2,829 | 90% | 429,232
THA/TKA | 2,629 | 84% | 223,542

AMI, acute myocardial infarction; HF, heart failure; PN, pneumonia; PR, perinatal and related conditions; STK, stroke; THA/TKA, total hip and knee arthroplasty.

The extracted data elements included patient age, sex, Federal Information Processing Standards (FIPS) county code of the primary patient residence, Clinical Classification Software Refined (CCSR)39 grouping of the principal ICD-10 code associated with the hospitalization, and an indicator of mortality during hospitalization.

County factors describing social risk across 3,142 counties were extracted from the 2018 Centers for Disease Control and Prevention vulnerability data set.23 These 15 variables represent aspects of social risk across 4 data domains (Table 2).23 The overall SVI, an equally weighted aggregation of these social factors, was also extracted.23, 40

Table 2.

CDC SVI Variable Descriptions by Domain

Variable name | Variable description
SES
 ep_pov | Percentage of persons below poverty
 ep_unemp | Percentage of civilians (aged 16+ years) unemployed
 ep_pci_r (a) | Per capita income
 ep_nohsdp | Percentage of persons with no high-school diploma (aged 25+ years)
Household composition/disability
 ep_age_65 | Percentage of persons aged ≥65 years
 ep_age_17 | Percentage of persons aged ≤17 years
 ep_disabl | Percentage of civilian non-institutionalized population with a disability
 ep_sngpnt | Percentage of single-parent households with children aged <18 years
Minority status and language
 ep_minrty | Percentage of minority population (all persons except White, non-Hispanic)
 ep_limeng | Percentage of persons (aged 5+ years) who speak English less than well
Housing type and transportation
 ep_munit | Percentage housing in structures with 10 or more units
 ep_mobile | Percentage of mobile homes
 ep_crowd | Percentage of households with more people than rooms
 ep_noveh | Percentage of households with no vehicle available
 ep_groupq | Percentage of persons in institutionalized group quarters

Source: CDC SVI Documentation 2018 | Place and Health | Agency for Toxic Substances and Disease Registry. Accessed January 21, 2022.

(a) Log transformed and reversed.

CDC, Centers for Disease Control and Prevention; SVI, Social Vulnerability Index.

The per capita income variable (ep_pci), which exhibits a strong rightward skew, was the only variable used in the SVI that was not reported as a percentage and was expected to be inversely correlated with social vulnerability. The ep_pci variable was therefore log transformed and reversed by subtracting each log value from the maximum log value in the data set, with the resulting variable renamed ep_pci_r. The values for ep_pov, ep_unemp, ep_pci, and rpl_themes were missing for Rio Arriba County, NM (FIPS county code 35039), which was the only county excluded from the analyses owing to missing data. The resulting SVI county data set comprises 3,141 counties.
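The log-and-reverse transformation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the income values are hypothetical, and the natural logarithm is an assumption because the text does not state the base.

```python
import math

def log_reverse(values):
    """Log-transform each value, then subtract it from the maximum log value.

    Higher raw values (higher income) map to lower transformed values, so the
    result points in the same direction as the other vulnerability variables.
    The natural logarithm is an assumption; the source does not state the base.
    """
    logs = [math.log(v) for v in values]
    max_log = max(logs)
    return [max_log - lv for lv in logs]

# Hypothetical per capita incomes (US dollars) for four counties.
ep_pci = [18_000, 27_000, 41_000, 72_000]
ep_pci_r = log_reverse(ep_pci)
```

Under this construction the highest-income county receives a value of 0, and every other county receives a positive value that grows as income falls.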

Patient CCSR groupings, included as control variables, were highly imbalanced for many cohorts, with most of the patient volume represented by a small subset of CCSR categories. To reduce dimensionality, CCSR groupings with an occurrence frequency <0.05% within each disease stratum were set to an Other CCSR category. The percentages of data grouped to the Other CCSR category were 0.43% (AMI), 15.4% (COVID-19), 1.16% (HF), 6.72% (PN), 17.6% (PR), 11.5% (STK), and 5.36% (THA/TKA) for each of the 7 cohorts evaluated in this study.
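One way to collapse infrequent categories is sketched below. The function name, the category labels, and the example threshold are hypothetical; the default of 0.0005 corresponds to the 0.05% cutoff stated above, and the function would be applied separately within each disease stratum.

```python
from collections import Counter

def collapse_rare(categories, min_share=0.0005, other_label="Other CCSR"):
    """Replace categories whose relative frequency is below min_share.

    min_share=0.0005 mirrors the 0.05% cutoff in the text; the function name
    and the label string are illustrative assumptions.
    """
    counts = Counter(categories)
    total = len(categories)
    return [c if counts[c] / total >= min_share else other_label
            for c in categories]

# Hypothetical stratum with three made-up category labels. A larger cutoff
# (20%) is used only so that the effect is visible on ten records.
stratum = ["CCSR_A"] * 6 + ["CCSR_B"] * 3 + ["CCSR_C"] * 1
collapsed = collapse_rare(stratum, min_share=0.20)
```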

Statistical Analysis

A disease and outcome−specific SVI (DOS-SVI) was produced on the basis of an outcome-driven reweighing of the 15 individual SVI factors while controlling for patient mix within each county. A risk-adjustment model was first developed at the patient hospitalization level to assess patient-level risk, which was subsequently aggregated to the county level. A second, county-level model was then designed to measure the association between the 15 social risk factors and county-level mortality while controlling for expected mortality extracted from the previous hospitalization-level model.
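The two-stage structure can be written compactly as below. The notation is not from the original article and is offered only as a sketch: the logit link, the intercept terms, and all symbol names are assumptions, and the principal components used in the second stage are those introduced later in this section (K = 9 in the reported analysis).

```latex
% Stage 1 (patient level, fit per disease stratum):
% p_i is the mortality probability of hospitalization i, s(.) is a smooth term
\operatorname{logit}(p_i) = \alpha + s(\mathrm{age}_i) + \gamma_{\mathrm{sex}_i} + \delta_{\mathrm{CCSR}_i}

% County aggregation: observed and expected deaths for county c
O_c = \sum_{i \in c} y_i, \qquad E_c = \sum_{i \in c} \hat{p}_i

% Stage 2 (county level): Poisson rate regression with \log E_c as the offset
O_c \sim \mathrm{Poisson}(\mu_c), \qquad
\log \mu_c = \log E_c + \beta_0 + \sum_{k=1}^{K} \beta_k \, \mathrm{PC}_{ck}
```

Because the offset enters with a fixed coefficient of 1, each \beta_k describes how the ratio of observed to expected deaths changes with the k-th component.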

Patient-level risk was estimated through generalized additive models, stratified by clinical cohort. For each generalized additive model, the binary occurrence of mortality was regressed on patient age, sex, and CCSR grouping. Without loss of generality, other patient-level characteristics can be added to the model, if available. Age was modeled as a thin plate regression spline owing to its potential nonlinear relationship with the probability of mortality for some disease strata. Observed mortality outcomes and fitted mortality probabilities were then summed by county, using the FIPS county code, and these sums served as inputs to the subsequent county-level model. The county-level sums of observed outcomes represent actual county mortality counts by stratum, whereas the county-level sums of fitted probabilities represent the corresponding expected mortality counts.
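The county aggregation step can be illustrated with a short sketch. The record layout and the function name are assumptions, and the outcomes and fitted probabilities are hypothetical values rather than output from a fitted model.

```python
from collections import defaultdict

def aggregate_by_county(records):
    """Sum observed deaths and fitted probabilities per FIPS county code.

    records: iterable of (fips, died, fitted_probability) tuples, one per
    hospitalization. This tuple layout is an assumption for illustration.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for fips, died, prob in records:
        observed[fips] += died    # actual county mortality count
        expected[fips] += prob    # expected county mortality count
    return dict(observed), dict(expected)

# Hypothetical hospitalizations in two counties.
records = [
    ("37063", 1, 0.20),
    ("37063", 0, 0.05),
    ("37183", 0, 0.10),
    ("37183", 0, 0.02),
    ("37183", 1, 0.40),
]
observed, expected = aggregate_by_county(records)
```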

Total observed and expected cases by county were linked to the SVI variables, such that each observation represents a unique FIPS county code. Although the SVI data include county-level information for 3,141 distinct counties, patient data were only available for a subset of the total U.S. counties by cohort within the PHD (Table 1).

As shown in Figure 1, the SVI variables exhibit varying degrees of multicollinearity. To reduce information redundancy, social vulnerability principal components were extracted and used as inputs in the analysis. To extract additional sources of variability of mortality counts at the county level, Poisson rate regression models were fit for each disease stratum, regressing total observed county-level cases of mortality on 9 principal components, with expected cases as the offset variable. The choice of 9 principal components was made through an arbitrary inclusion threshold of 90% of the cumulative SVI variance, with 9 principal components explaining 91.46% of the variation of the SVI variables.
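The component-selection rule (keep the smallest number of components whose cumulative variance share reaches the threshold) can be sketched as follows. The eigenvalues are hypothetical and are not those of the SVI variables; the function name is an assumption.

```python
def n_components_for(eigenvalues, threshold=0.90):
    """Return (k, share): the smallest k whose cumulative variance share
    reaches the threshold, and the share explained by those k components."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k, cumulative / total
    # Fallback for floating-point edge cases when threshold is close to 1.
    return len(eigenvalues), 1.0

# Hypothetical eigenvalues of a correlation matrix with four variables.
k, share = n_components_for([5.0, 3.0, 1.5, 0.5], threshold=0.90)
```

In this toy case the first two components explain 80% of the variance and the first three explain 95%, so three components are retained.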

Figure 1.

Correlation matrix of social risk factors.

ep_pov denotes the percentage of persons below poverty, ep_unemp denotes the percentage of civilians (aged 16+ years) unemployed, ep_pci_r denotes per capita income (log transformed and reversed), ep_nohsdp denotes the percentage of persons with no high school diploma (aged 25+ years), ep_age_65 denotes the percentage of persons aged ≥65 years, ep_age_17 denotes the percentage of persons aged ≤17 years, ep_disabl denotes the percentage of civilian non-institutionalized population with a disability, ep_sngpnt denotes the percentage of single-parent households with children aged <18 years, ep_minrty denotes the percentage of minority population (all persons except White, non-Hispanic), ep_limeng denotes the percentage of persons (aged 5+ years) who speak English less than well, ep_munit denotes the percentage housing in structures with 10 or more units, ep_mobile denotes the percentage of mobile homes, ep_crowd denotes the percentage of households with more people than rooms, ep_noveh denotes the percentage of households with no vehicle available, and ep_groupq denotes the percentage of persons in institutionalized group quarters.

The coefficients for each of the principal components were normalized, multiplied by their respective principal components, and summed, resulting in the reweighed overall composites. Analyses were conducted using R statistical software, Version 3.6.2 (R Foundation for Statistical Computing).41
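A minimal sketch of the composite calculation follows. The text does not state how the coefficients were normalized, so dividing by the sum of absolute values is an assumption made here for illustration; the scores and coefficients are hypothetical.

```python
def composite_index(scores, coefficients):
    """Weighted sum of principal component scores for each county.

    scores: one row of component scores per county.
    coefficients: fitted regression coefficients, one per component.
    Normalizing by the sum of absolute values is an assumption; the source
    only says the coefficients were normalized.
    """
    norm = sum(abs(b) for b in coefficients)
    weights = [b / norm for b in coefficients]
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]

# Hypothetical scores on three components for two counties.
scores = [[1.0, -2.0, 0.5],
          [0.0, 1.0, 1.0]]
coefficients = [0.3, -0.1, 0.1]   # hypothetical fitted coefficients
index = composite_index(scores, coefficients)
```

Keeping the sign of each coefficient means that a component associated with lower mortality pulls the index down rather than up.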

To measure the benefit of the DOS-SVI, 2 benchmarks were calculated. The first benchmark, designed to measure the impact of excluding social risk factors altogether, was calculated as a model-free estimate through the root mean squared error (RMSE) between the county-level expected and observed cases—referred to as the null model in this paper. The second benchmark was based on disease-specific Poisson rate regression models, using the SVI as the single explanatory variable while adjusting for expected county-level mortality. The resulting RMSE was designed to measure the fit of the current SVI compared with the proposed alternative within a common Poisson model.
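Both benchmarks can be illustrated on toy data. The sketch below computes the model-free RMSE and then fits a Poisson rate regression with a single covariate and the log of expected cases as the offset. It uses a hand-written Newton-Raphson update rather than the R routines used in the study, and all counts and index values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def fit_poisson_rate(observed, expected, x, iterations=50):
    """Fit log(mu) = log(expected) + b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        mu = [e * math.exp(b0 + b1 * xi) for e, xi in zip(expected, x)]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(o - m for o, m in zip(observed, mu))
        g1 = sum((o - m) * xi for o, m, xi in zip(observed, mu, x))
        # Fisher information (2 x 2), inverted in closed form.
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical county data: expected deaths, an index percentile, observed deaths.
expected = [10.0, 20.0, 15.0, 30.0, 25.0]
svi = [0.1, 0.3, 0.5, 0.7, 0.9]
observed = [8, 19, 16, 36, 33]

rmse_null = rmse(observed, expected)          # benchmark 1: no social factors
b0, b1 = fit_poisson_rate(observed, expected, svi)
fitted = [e * math.exp(b0 + b1 * s) for e, s in zip(expected, svi)]
rmse_svi = rmse(observed, fitted)             # benchmark 2: single-index model
```

The same fitting logic extends to several covariates, as in the principal component model, although a general linear solver would then replace the closed-form 2 x 2 inverse.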

RESULTS

Table 3 lists the RMSEs for the proposed model and the 2 benchmark values across strata. Among the 3 models, the principal component model reduces RMSE across all the 7 cohorts compared with the null model. In comparison with the benchmark model, the principal component model reduces RMSE for 5 of the 7 cohorts, with equivalent performance in the remaining 2 cohorts. The most salient shifts from the null model can be seen in the COVID-19 and PN cohorts, with RMSE reduced from 48.85 to 40.94 and 17.61 to 13.85, respectively.

Table 3.

RMSE by Model Type and Disease Group

Disease group | Null model RMSE | SVI RMSE | Principal component RMSE | RMSE percent reduction (a)
AMI | 4.33 | 4.23 | 3.93 | 7.09%
COVID-19 | 48.85 | 50.00 | 40.94 | 18.1%
HF | 6.68 | 6.64 | 6.00 | 9.64%
PN | 17.61 | 17.90 | 13.85 | 22.6%
PR | 0.28 | 0.27 | 0.27 | 0.0%
STK | 7.87 | 7.96 | 7.09 | 10.9%
THA/TKA | 0.24 | 0.23 | 0.23 | 0.0%

(a) RMSE reduction column corresponds to the change from the SVI to principal component model.

AMI, acute myocardial infarction; HF, heart failure; PN, pneumonia; PR, perinatal and related conditions; RMSE, root mean squared error; STK, stroke; SVI, Social Vulnerability Index; THA/TKA, total hip and knee arthroplasty.
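The percent reduction column in Table 3 can be reproduced from the two RMSE columns as the relative change from the SVI model to the principal component model. The short check below uses the rounded values published in the table; the function and variable names are illustrative.

```python
def percent_reduction(svi_rmse, pc_rmse):
    """Relative RMSE reduction (in percent) from the SVI model to the
    principal component model."""
    return 100.0 * (svi_rmse - pc_rmse) / svi_rmse

# (SVI RMSE, principal component RMSE) pairs as reported in Table 3.
table3 = {
    "AMI": (4.23, 3.93),
    "COVID-19": (50.00, 40.94),
    "HF": (6.64, 6.00),
    "PN": (17.90, 13.85),
    "PR": (0.27, 0.27),
    "STK": (7.96, 7.09),
    "THA/TKA": (0.23, 0.23),
}
reductions = {group: percent_reduction(svi, pc) for group, (svi, pc) in table3.items()}
```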

The DOS-SVI deviated from the SVI at varying degrees depending on the disease stratum being evaluated. The SVI is a percentile rank of the domain aggregates, and therefore to compare fairly with the DOS-SVI, the difference in the percentile rank of the DOS-SVI and SVI is evaluated in Appendix Figure 1 (available online). This comparison shows that the AMI, COVID-19, PN, PR, and STK cohorts aligned most closely with the SVI, with residual standard deviations of 0.19, 0.23, 0.26, 0.28, and 0.28, respectively. The HF and THA/TKA cohorts varied to a greater degree with SDs of 0.34 and 0.36, respectively.
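The percentile-rank comparison can be sketched as below. The rank formula (rank minus 1, divided by N minus 1) is assumed here to match the convention used for the SVI, and the exact definition of the residual standard deviation reported above may differ from the sample standard deviation used in this sketch. All values are hypothetical.

```python
import statistics

def percentile_rank(values):
    """Percentile rank in [0, 1]: share of the other values that are smaller.

    Equivalent to (rank - 1) / (N - 1) with ties given the lowest rank;
    this convention is an assumption.
    """
    n = len(values)
    return [sum(other < v for other in values) / (n - 1) for v in values]

# Hypothetical DOS-SVI scores and SVI percentile ranks for four counties.
dos_svi = [0.2, 1.5, -0.3, 0.9]
svi_percentile = [0.25, 0.75, 0.10, 0.90]

dos_percentile = percentile_rank(dos_svi)
differences = [d - s for d, s in zip(dos_percentile, svi_percentile)]
spread = statistics.stdev(differences)   # summary of disagreement between indices
```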

Because the principal components are orthogonal, each component captures a different latent aspect of social risk. The scree plot in Appendix Figure 2 (available online) shows the cumulative variability of the underlying data explained by the principal components derived from the SVI variables, with the first 2 principal components explaining more than 53% of the variance of the 15 variables. Appendix Figure 3 (available online) shows the correlation between each of the derived principal components and the raw SVI variables.

The correlation between the derived DOS-SVIs and the raw social risk variables (Appendix Figures 4–10, available online) can help with the interpretability of the results. The AMI, COVID-19, and PR cohorts have a clear set of social risk factors associated with mortality risk. The AMI DOS-SVI is correlated with the percentage of households with more people than rooms (r=0.754, p<0.001) and unemployment (r=0.732, p<0.001). The COVID-19 DOS-SVI is correlated with income (r=0.667, p<0.001) and the percentage of households with more people than rooms (r=0.678, p<0.001). The PR DOS-SVI is associated with income (r=0.724, p<0.001), percentage of mobile homes (r=0.724, p<0.001), disability (r=0.743, p<0.001), unemployment (r=0.703, p<0.001), poverty (r=0.757, p<0.001), and lack of a high-school diploma (r=0.618, p<0.001). The PN DOS-SVI is correlated with the percentage of households with more people than rooms (r=0.816, p<0.001). The HF, STK, and THA/TKA DOS-SVIs are not strongly correlated with any raw risk factor when judged against an arbitrary correlation threshold of 0.6.

The beta coefficients for each principal component can give additional insight into the differing disease-specific association between the unique latent aspects of social risk and inpatient mortality (Appendix Table 1, available online). In addition to the residual index difference shown in Appendix Figure 1 (available online), a geographic representation of the county-level indices is provided in Appendix Figure 11 (available online), using the state of North Carolina as an example.

In-sample counties (i.e., counties with patient volume in the PHD) comprised a large proportion of the national total (Table 1), whereas out-of-sample counties had a relatively lower total population (Appendix Figure 12, available online). The DOS-SVI distributions for in- and out-of-sample counties are similar across disease strata (Appendix Figure 13, available online).

Significance of the disease-specific coefficients and the correlation between social risk factors and their respective indices emphasize the utility of disease-specific variable weights because an index comprising equally weighted variables would suffer from the impacts of multicollinearity when capturing the unique disease-specific association between mortality and aspects of social risk.

County population size is a factor in the alignment between the SVI and DOS-SVI. Appendix Table 1 (available online) shows the RMSE between the SVI and DOS-SVI across counties grouped by population quartiles. Apart from the HF disease group, the first and fourth quartiles have the lowest RMSE relative to the second and third quartiles, indicating stronger alignment at the population extremes. In the case of HF, the alignment between the 2 indices generally increases as county population size increases. The full set of DOS-SVI and their respective percentiles by cohort type are included in Appendix Tables 3−9 (available online).

DISCUSSION

Enhanced approaches for weighing the underlying aspects of social risk factors can help explain variation in mortality outcomes after controlling for patient age, sex, and disease group. In this application of disease-specific indices, the associations between social risk factors and mortality within disease strata are consistent with previous research, corroborating the importance of considering social risk in the context of disease strata.27, 42, 43, 44, 45, 46 Although it is common in the literature to consider disease-specific clinical risk factors, it is still less common to consider social vulnerability factors or indices that are also disease specific. The proposed method addresses this gap in the literature. The approach put forth in this study is a computationally tractable model that is generalizable to other disease groups and outcomes. In addition, the approach can be implemented at different levels, such as the census block or patient level, provided that the data are available. This approach can also serve to reveal and/or empirically demonstrate the heterogeneity of mechanisms linking social determinants of health with specific diseases or health outcomes, especially because such associations are not expected to be homogeneous across diseases or health outcomes. This aligns with the NQF call for disease-specific metrics.5 When appropriate, the index formulation method shown in this study can be used to adjust for social factors within risk-adjusted outcomes and as a distribution to stratify populations.

Accounting for social factors in risk adjustment is necessary from a benchmarking perspective to ensure that hospitals and physicians are not unfairly penalized for serving historically disadvantaged communities.15, 16, 17 This is especially important in pay for performance and other publicly reported programs.10, 11, 12, 13, 14

Limitations

Despite these advantages, there are limitations to the proposed approach. Counties can be heterogeneous, and therefore county-level SVI variables can mask considerable disparities within counties.47 In these cases, the resulting principal component weights will be diluted through the county-level aggregation. To mitigate this limitation, future studies may need to include census tract−level variables.

Although the principal component model overcomes challenges with correlated variables, there is greater opacity in interpreting the principal components themselves. Conversely, the extracted PCA weights are disease-specific, aligning with NQF's recommendation, and further combine highly correlated variables, making the resulting composite a mixture of latent components with disease-agnostic undefined weightings by uncorrelated latent factor.

Finally, although the clinical mechanisms in which risk factors affect specific diseases are not explored, the proposed data-driven approach offers a unique opportunity to build social vulnerability indices that are disease appropriate and avoid the dilution implicit in one-size-fits-all approaches. The proposed approach can in fact be used as a stepping stone to explore such mechanisms upon demonstrating the relevant risk factors and associated principal components.

CONCLUSIONS

The proposed index formulation method has been designed to account for areas of increased disease-specific mortality risk associated with social risk factors. Regressing observed county-level mortality frequencies on a broader set of social risk factors while controlling for patient clinical and demographic characteristics reduces errors in risk-adjusted mortality estimates more than the county-level SVI does. Social risk indices such as DOS-SVI and SVI may be associated with areas of risk surrounding a hospital or within a population associated with a health system. Applying the DOS-SVI index formulation method may enable further study of specific regions or populations adversely affected by social risk factors or potentially neglected by health structures. In turn, the DOS-SVI approach may enhance control for social factors in risk-adjustment modeling for population health or reveal specific gaps in healthcare operations.

ACKNOWLEDGMENTS

The Premier Healthcare Database is considered exempt from IRB oversight as dictated by Title 45 Code of Federal Regulations, Part 46 of the U.S., specifically 45 CFR 46.101(b)(4). In accordance with the Health Insurance Portability and Accountability Act Privacy Rule, disclosed data from the Premier Healthcare Database are considered deidentified per 45 CFR 164.506(d)(2)(ii)(B) through the Expert Determination method.

The authors received no specific funding for this work.

JM and MK are employed by and have stock in Premier, Inc. No other financial disclosures were reported.

CRediT AUTHOR STATEMENT

Michael Korvink: Conceptualization, Methodology, Formal analysis, Data curation, Writing–original draft, Writing–review & editing. Laura H. Gunn: Conceptualization, Methodology, Writing–review & editing, Supervision. German Molina: Conceptualization, Methodology, Writing–review & editing, Supervision. Dani Hackner: Writing–review & editing, Supervision. John Martin: Conceptualization, Methodology, Writing–original draft, Writing–review & editing, Supervision.

Footnotes

Supplemental materials associated with this article can be found in the online version at https://doi.org/10.1016/j.amepre.2023.05.002.

Articles from American Journal of Preventive Medicine are provided here courtesy of Elsevier
