Abstract
Objective
To measure hospital quality based on routine data available in many health care systems, including those of the United States, Germany, the United Kingdom, Scandinavia, and Switzerland.
Data Sources and Study Setting
We use the Swiss Medical Statistics of Hospitals, an administrative hospital dataset of all inpatient stays in acute care hospitals in Switzerland for the years 2017–2019.
Study Design
We study hospital quality based on quality indicators used by leading agencies in five countries (the United States, the United Kingdom, Germany, Austria, and Switzerland) for two high‐volume elective procedures: inguinal hernia repair and hip replacement surgery. We assess how least absolute shrinkage and selection operator (LASSO), a supervised machine learning technique for variable selection, and Mundlak corrections that account for unobserved heterogeneity between hospitals can be used to improve risk adjustment and correct for imbalances in patient risks across hospitals.
Data Collection/Extraction Methods
The Swiss Federal Statistical Office collects annual data on all acute care inpatient stays including basic socio‐demographic patient attributes and case‐level diagnosis and procedure codes.
Principal Findings
We find that LASSO‐selected and Mundlak‐corrected hospital random effects logit models outperform the logistic regression models commonly used for risk adjustment. Besides their more favorable statistical properties, they have superior in‐ and out‐of‐sample explanatory power. Moreover, we find that Mundlak‐corrected logits and the more complex LASSO‐selected models identify the same hospitals as high‐ or low‐quality, offering public health authorities a valuable alternative to standard logistic regression models. Our analysis shows that hospitals vary considerably in the quality they provide to patients.
Conclusion
We find that routine hospital data can be used to measure clinically relevant quality indicators that help patients make informed hospital choices.
Keywords: hospital, machine learning, quality of care, risk adjustment for clinical outcomes, surgery
What is known on this topic
To support patients in their hospital choices, the public disclosure of hospital quality measures has become a top priority in many health care systems.
To account for systematic differences in patient risks across hospitals, public health authorities typically apply simple risk adjustment models.
Logistic regression models that account for case‐mix differences between hospitals are the most popular approach for risk adjusting patient outcomes.
What this study adds
We propose two novel approaches that are more suitable to adjust hospital performance for differences in patient risk than the standard logistic regression approach.
Our approach is tailored to practitioners and public health authorities who intend to compare hospital quality based on routine inpatient data.
Hospitals vary considerably in the quality they deliver to patients who undergo inguinal hernia repair and hip replacement surgery.
1. INTRODUCTION
Many developed countries including Germany, the Netherlands, Belgium, Switzerland, and the United States (Affordable Care Act marketplaces & Medicare) have adopted elements of “managed competition” in their health care systems. 1 , 2 , 3 In such systems, patients choose the health plan that fits their needs, and they typically have some discretion over who treats them. 4 Choice, the rationale goes, can have a disciplining effect on providers of care and insurers because “poor” providers will lose patients/clients unless they enhance their quality efforts or lower their costs. 5 , 6
The public disclosure of quality information has become a top priority in many health care systems. Supporting patients in their provider choices can contribute to a more efficient allocation of patients to the best providers in the system and ultimately enhance the quality of care. 7 , 8 , 9 However, these mechanisms can only unfold if two basic conditions are met. First, quality measures need to account for differences in patient risks between providers. Some hospitals, such as university hospitals, treat more complex cases than other hospitals. Directly comparing hospitals would unfairly penalize hospitals with more complex cases and provide a distorted quality signal. Thus, accounting for differences in patient risk is necessary to ensure a fair provider comparison.
Second, quality measures need to be meaningful and easy‐to‐understand for patients. A growing body of research documents that patients have difficulties understanding quality information 10 , 11 , 12 and as a result often base their provider choices on non‐clinical factors such as travel distance or amenities. 4 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 Moreover, the quality indicators that are available might be of little use to inform provider choices. For example, mortality rates are largely meaningless for patients undergoing high‐volume, low‐risk procedures such as hernia repair or hip replacement because patient death is an extreme event that rarely occurs during these procedures.
In this article, we study how to measure hospital quality based on routine data available in most developed countries. We examine alternatives to the “standard approach” of risk adjusting clinical outcomes with logistic regression models, which is commonly used by researchers and public health authorities including, among others, the Centers for Medicare & Medicaid Services (CMS) in the United States and the Allgemeine Ortskrankenkasse (AOK) in Germany. 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 Specifically, we assess how least absolute shrinkage and selection operator (LASSO), a supervised machine learning technique for variable selection, and Mundlak corrections in hospital random effects logits can be applied to account for risk selection and thus make hospitals more comparable. Moreover, we investigate to what degree hospitals vary in the quality they deliver to patients. Our analysis is confined to two high‐volume, low‐risk elective procedures: inguinal hernia repair and primary hip replacement. We analyze clinically meaningful events including complications, infections, and revisions, as they provide direct information on patient health based on clinical outcomes rather than indirect measures such as nurse‐to‐patient ratios or adherence to clinical guidelines. 32 , 33 Finally, to make the resulting quality indicators more accessible to patients, we translate hospital performance to a simple 3‐star rating.
2. RISK SELECTION
In the Swiss health care system, the federal states (“Cantons”) determine which hospitals can bill their services to mandatory health insurance. Coverage is broad, so that only a negligible fraction of hospitals is not part of these cantonal “hospital lists.” Patients with mandatory health insurance have free choice among all listed hospitals across Switzerland, that is, access to inpatient care is not restricted by the region of residence or plan choice. 34
Free choice, however, has important implications for the comparison of hospitals. With choice comes selection because certain types of patients select into certain hospitals. As a result, patient risks are unevenly distributed across hospitals—what we refer to as “risk selection” hereafter. Some hospitals might benefit from a desirable risk pool, while others predominantly treat complex cases (e.g., chronically ill, older patients). Such differences in the risk structure likely affect patient outcomes such as complication or infection rates.
Figure 1 illustrates the uneven distribution of risk factors across hospital types in the Swiss inpatient sector for inguinal hernia repair surgeries. We classify the 123 acute hospitals that performed inguinal hernia repair in 2019 into one of the five size‐based hospital categories defined by the Federal Statistical Office: university hospitals, center hospitals, specialized clinics as well as large and small regional hospitals. To give a flavor of the differences in patient risks, we compare average Elixhauser comorbidities per hospital type, which can be constructed based on the International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD‐10) diagnosis codes. 35 , 36 Figure 1 shows the share of inguinal hernia repair patients suffering from specific comorbidities for each hospital type: brighter colors indicate a higher share of comorbid patients. The figure underlines that patient risks are not evenly distributed across hospitals: university and center hospitals treat more patients with comorbidities, which is an indication of higher risks of adverse outcomes and thus of risk selection.
FIGURE 1. The share of patients undergoing inguinal hernia repair who suffer from comorbidities by hospital type for the years 2017–2019.
3. METHODS
3.1. Data
We use the Swiss Medical Statistics of Hospitals, which is a comprehensive administrative hospital dataset of all inpatient stays in acute care hospitals in Switzerland. 37 The data are routinely collected and publicly available. Similar datasets are available in other countries such as the United States (e.g., CMS data), the Netherlands (e.g., National Basic Register of Hospital Care), the United Kingdom (e.g., Hospital Episode Statistics), France (e.g., French Hospital Discharge Database), Sweden (e.g., National Patient Register), Finland (e.g., Care Register for Health Care), Denmark (e.g., National Patient Register), and Norway (e.g., Norwegian Patient Registry).
The Medical Statistics of Hospitals includes patient demographics (e.g., age, gender), socioeconomic characteristics (e.g., type of insurance), administrative information (e.g., hospital type), and medical information (e.g., main and secondary diagnoses and procedures). Diagnoses are coded based on the ICD‐10 German Modification (ICD‐10‐GM). Procedures are coded with the Swiss classification of surgical interventions (CHOP). Unique patient identifiers allow us to follow patients over time to capture adverse events during re‐hospitalizations. We enrich the dataset by using secondary diagnoses to construct the Charlson 38 and Elixhauser 36 comorbidity indices as well as binary indicators for the underlying comorbidities and the number of comorbidities using the R package “comorbidity.” 39
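As an illustration, the comorbidity measures can be derived from long‐format diagnosis data roughly as follows (a minimal sketch; the column names are hypothetical and the argument names follow recent versions of the comorbidity package, so they may differ in older releases):

```r
library(comorbidity)

# df_icd: long format, one row per (stay_id, icd10_code) pair;
# column names are hypothetical. Quan's ICD-10 mapping yields one
# binary indicator per Elixhauser comorbidity.
elix <- comorbidity(
  x = df_icd, id = "stay_id", code = "icd10_code",
  map = "elixhauser_icd10_quan", assign0 = FALSE
)

# Weighted index (van Walraven weights) and a simple comorbidity count
elix_index <- score(elix, weights = "vw", assign0 = FALSE)
n_comorb <- rowSums(elix[, setdiff(names(elix), "stay_id")])
```

The same call with `map = "charlson_icd10_quan"` would produce the Charlson indicators instead.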
3.2. Study sample
The data analyzed in this study include all hospitalizations and re‐hospitalizations of patients with a primary hip replacement due to coxarthrosis or chronic arthritis or an inguinal hernia repair surgery over the period 2017–2019, that is, the pre‐COVID years. Because of the long follow‐up periods of certain quality measures (see Section 3.3), we analyze hospital quality for primary hip replacement in 2018. In 2018, 118 hospitals performed 18,028 primary hip replacements due to coxarthrosis or chronic arthritis. While the number of hip replacements remained constant over time (18,086 in 2017, 18,028 in 2018, and 18,175 in 2019), the number of inguinal hernia repairs decreased markedly (17,219 in 2017, 15,634 in 2018, and 12,061 performed by 123 hospitals in 2019). Since 2019, mandatory insurance (with some exceptions) only reimburses inguinal hernia repair if performed in the ambulatory setting. Due to likely changes in the patient risk structure after the reform, we confine the analysis of inguinal hernia repair to patients treated in 2019.
3.3. Quality measures
We assess hospital quality for the two surgeries based on outcome indicators. For both indications, we collected a set of evidence‐based, disease‐specific indicators that are used by public health agencies in five countries (the United States, the United Kingdom, Germany, Austria, and Switzerland). The final set of indicators results from a two‐step selection procedure. In the first step, we systematically extracted all indicators used by leading agencies in the United States (Centers for Medicare & Medicaid Services, CMS; Agency for Healthcare Research and Quality, AHRQ; National Quality Forum, NQF), the United Kingdom (National Health Service, NHS; National Joint Registry, NJR; Getting It Right First Time, GIRFT), Germany (Allgemeine Ortskrankenkasse, AOK; Institut für Qualitätssicherung und Transparenz im Gesundheitswesen, IQTIG), Austria (Austrian Inpatient Quality Indicators, A‐IQI), and Switzerland (Bundesamt für Gesundheit, BAG; Nationaler Verein für Qualitätsentwicklung in Spitälern und Kliniken, ANQ; Swissnoso). In the second step, we narrowed down the set of indicators to outcome measures. Such measures are considered the “gold standard” in quality measurement because they directly measure patient outcomes instead of indirect signals such as process quality or hospital endowment (e.g., number of beds). The final set is restricted to the most frequently used indicators that can be constructed from routine data available in most developed countries.
We measure quality of inguinal hernia repair surgery based on the following two quality dimensions: general complications during initial stay (e.g., cardiac arrest) and surgical complications within 90 days after the surgery (e.g., vascular complications). The four final quality indicators for primary hip replacement include general complications during initial stay (e.g., myocardial infarction), surgical complications within 90 days after the surgery (e.g., luxation), infections within 365 days after the surgery, and revisions within 365 days after the surgery. Tables 5 and 6 in Appendix S1 provide a detailed account of all adverse events as well as the procedure and diagnostic codes used to construct the quality indicators for inguinal hernia repair. Tables 7–10 in Appendix S1 show the same for the quality indicators for hip replacement surgery.
To measure hospital quality, we compute quality indices for standard logistic regression models as:

$$QI_h = \frac{O_h}{E_h} \qquad (1)$$

where $O_h$ is the observed event rate (e.g., complications) in hospital $h$ and $E_h$ is the risk‐adjusted expected rate in the same hospital. The expected rate $E_h$ amounts to the average predicted probability over all cases treated in hospital $h$. It can be interpreted as the share of adverse events that would have occurred had the same patients from hospital $h$ been treated in the average hospital instead.
As is common in hierarchical models, we compute the quality indices for models with hospital random effects as 40:

$$QI_h = \frac{P_h}{E_h} \qquad (2)$$

where $P_h$ is the average predicted adverse event rate for patients treated in hospital $h$ (including the hospital random effect $u_h$) and $E_h$ is the average predicted event rate if the same patients were treated in the average hospital. The average hospital has a random effect of zero (i.e., $u_h = 0$), so that $E_h$ is simply estimated by averaging the predicted event rates in hospital $h$ without the hospital random effect. 40 , 41 , 42
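For concreteness, both indices can be computed from model predictions along these lines (a minimal sketch in R, assuming a fitted lme4 random effects model `fit_re` and a data frame `df` with hospital identifiers and observed binary outcomes; all object names are hypothetical):

```r
library(lme4)
library(dplyr)

# P_h: average prediction per hospital including the random effect u_h
df$p_with_re <- predict(fit_re, type = "response")
# E_h: prediction for the "average hospital" (u_h = 0), i.e., re.form = NA
df$p_no_re <- predict(fit_re, re.form = NA, type = "response")

quality <- df %>%
  group_by(hospital_id) %>%
  summarise(
    O_h = mean(outcome),     # observed adverse event rate
    P_h = mean(p_with_re),   # predicted rate incl. hospital random effect
    E_h = mean(p_no_re),     # expected rate in the average hospital
    QI_standard = O_h / E_h, # Equation (1), standard logit
    QI_re       = P_h / E_h  # Equation (2), random effects logit
  )
```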
3.4. Regression‐based risk adjustment
The section on risk selection above illustrated that the risk profiles of patients differ considerably between hospitals. The main issue with risk selection is that the direct comparison of hospitals not only captures differences in quality but also imbalances in patient risks. Therefore, risk adjustment is necessary. Since individual patient outcomes (e.g., the occurrence of a complication) are binary events, we estimate logistic regression models that account for risk factors that can explain individual patients' risk of experiencing these adverse events (see Table 11 in Appendix S1).
We compare two implementations of logit models: standard logits with patient‐level covariates and Mundlak‐corrected hospital random effects logits with patient‐level covariates and hospital means of covariates. Hospitals may differ from one another in many risk factors—some of which are observable to the researcher, such as age, comorbidities, and gender, whereas others are not (e.g., health behaviors). If the unobservable risk factors are randomly distributed across hospitals, hospitals can be compared based on risk‐adjusted patient outcomes and the corresponding variation across hospitals can be interpreted as quality differences. 43 , 44 The validity of standard logit models for quality measurement hinges precisely on this assumption. If, however, the unobservable risk factors are systematically concentrated in certain hospitals, then standard logits yield biased estimates of (relative) hospital quality. 45 , 46 , 47
A possible way to relax the assumption that hospital random effects are uncorrelated with patient‐level covariates is to use Mundlak‐corrected hospital random effects models, which include hospital means of the covariates as additional explanatory variables. 44 , 48 In Mundlak‐corrected models, the coefficients of patient‐level covariates capture individual patients' risks, and the coefficients of hospital means capture the risk profile of a hospital's entire patient population. The inclusion of the hospital means can correct for omitted variable bias due to unobserved heterogeneity when unobserved hospital‐level risk factors that are captured in the random effects are correlated with patient‐level covariates. Mundlak‐corrected models are consistent in the presence of unobserved heterogeneity under a standard strict exogeneity assumption of explanatory variables (including the hospital means) conditional on unobserved effects. 49 Thus, Mundlak‐corrected logit specifications are preferable to the standard logit when observations are grouped at the hospital level and hospitals vary in unobservable quality.
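A Mundlak‐corrected hospital random effects logit can be estimated by simply augmenting the patient‐level specification with hospital means of the covariates, for example (a minimal sketch with lme4; the variable names and covariate set are illustrative, not our exact specification):

```r
library(lme4)
library(dplyr)

# Add hospital means of the patient-level risk adjusters (Mundlak terms)
df <- df %>%
  group_by(hospital_id) %>%
  mutate(mean_age = mean(age), mean_n_comorb = mean(n_comorb)) %>%
  ungroup()

# Hospital random intercept + patient-level covariates + hospital means
fit_mundlak <- glmer(
  complication ~ age + female + n_comorb +   # individual patient risk
    mean_age + mean_n_comorb +               # Mundlak correction terms
    (1 | hospital_id),
  data = df, family = binomial(link = "logit")
)
```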
Besides the statistical properties, researchers must decide which covariates to include in their risk adjustment model. We follow two approaches to specify the set of risk adjusters: First, we estimate a series of hand‐picked specifications for both the standard logits and Mundlak‐corrected logits. Second, we employ LASSO, a supervised machine learning technique designed for “automatic” variable selection. 50 , 51 In essence, the LASSO holds the promise to detect non‐linearities and complex interactions, which could have easily been overlooked in hand‐picked models to explain adverse patient events. Hence, in comparison to hand‐picked specifications, the LASSO automatically selects the best predictors and thus may soften misspecification issues biasing the hospital comparison.
The set of hand‐picked specifications includes 26 models combining age, gender, type of insurance plan, nationality, place before admission, and the number of Elixhauser or Charlson comorbidities as well as separate indicators for the underlying comorbidities. The Elixhauser and Charlson indices and comorbidities are never used together in the same specification. Table 12 in Appendix S1 summarizes all model specifications in more detail. To tune the LASSO based on 10‐fold cross‐validation, we use the complete pool of candidate variables (in levels), polynomials of continuous variables, discretized categorical variables, and restricted cubic splines of continuous variables. We include all terms that received non‐zero coefficients in the final cross‐validated LASSO model. Once the relevant predictors are selected, we estimate standard logit specifications based on the corresponding selection of variables (“LASSO‐selected logits”). Furthermore, we combine the LASSO selection with the Mundlak correction to leverage the benefits of both (“LASSO‐selected Mundlak‐corrected logits”).
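The LASSO step can be sketched with cross‐validated glmnet, keeping the terms that receive non‐zero coefficients (a sketch assuming `X` is a numeric matrix collecting the candidate terms—levels, polynomials, dummies, and spline bases—and `y` is the binary outcome; generated column names may need backticks before entering a formula):

```r
library(glmnet)

set.seed(42)
# 10-fold cross-validated LASSO logit over the full candidate matrix
cv_fit <- cv.glmnet(x = X, y = y, family = "binomial", alpha = 1, nfolds = 10)

# Keep the terms with non-zero coefficients at the CV-selected penalty
# (lambda.min here; lambda.1se would yield a sparser model)
b <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")

# The selected terms then enter a standard or Mundlak-corrected logit
form <- reformulate(selected, response = "complication")
```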
To discriminate between the different specifications, we select the most suitable model based on the Akaike information criterion (AIC), 52 the out‐of‐sample performance, and the modified Hosmer–Lemeshow test. 53 All models, the AIC, and the Hosmer–Lemeshow test are estimated on the full data spanning the years 2017–2019. The AIC is a measure of the relative goodness of fit of different models that penalizes the number of parameters and thus favors simpler over more complex models with the same fit, partly safeguarding against overfitting the data. Moreover, model selection based on the AIC has been shown to largely align with leave‐one‐out cross‐validation, which maximizes the out‐of‐sample expected log‐likelihood. 54
To assess the out‐of‐sample prediction accuracy of all “hand‐picked” and LASSO specifications, we compute the mean absolute error (MAE) using 10‐fold cross‐validation. The MAE is computed by averaging the absolute differences between the average predicted probability per hospital and the actual adverse event rate in the same hospital.
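The hospital‐level MAE can be computed from out‐of‐fold predictions roughly as follows (a sketch; `fit_model()` stands in for any of the candidate specifications and is hypothetical):

```r
library(dplyr)

set.seed(42)
df$fold <- sample(rep(1:10, length.out = nrow(df)))
df$p_cv <- NA_real_

# Out-of-fold predictions: fit on nine folds, predict the held-out fold
for (k in 1:10) {
  fit_k <- fit_model(df[df$fold != k, ])  # hypothetical fitting wrapper
  df$p_cv[df$fold == k] <-
    predict(fit_k, newdata = df[df$fold == k, ], type = "response")
}

# MAE: mean absolute gap between predicted and observed hospital rates
mae <- df %>%
  group_by(hospital_id) %>%
  summarise(err = abs(mean(p_cv) - mean(outcome))) %>%
  summarise(mae = mean(err)) %>%
  pull(mae)
```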
The modified Hosmer–Lemeshow test checks for significant departures of mean residuals from zero across groups or percentiles of outcomes and explanatory variables. We apply the modified Hosmer–Lemeshow test over intervals of predicted event probabilities as well as age groups and the number of comorbidities to assess how well different models identify patients who vary in event risk. To ensure a “fair” comparison between the standard logits and hospital random effects models, the residuals are based on predictions without hospital random effects.
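In spirit, the check compares mean residuals across groups of patients, for example deciles of predicted risk (a simplified sketch without the formal test statistic):

```r
library(dplyr)

# Residuals from predictions without the hospital random effect
df$resid <- df$outcome - df$p_no_re

# Mean residual per decile of predicted risk; values far from zero flag
# patient groups whose event risk the model systematically misjudges
hl_groups <- df %>%
  mutate(risk_group = ntile(p_no_re, 10)) %>%
  group_by(risk_group) %>%
  summarise(mean_resid = mean(resid), n = n())
```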
4. RESULTS
4.1. What is the best risk adjustment model?
The comparison of the different model specifications yields three main findings: First, Mundlak‐corrected hospital random effects logits outperform standard logistic regression models. Besides the more desirable statistical properties, the Mundlak‐corrected logits provide a better in‐sample fit to the data (see Figures 4–9 in Appendix S1). They also have a higher out‐of‐sample prediction accuracy: Figures 10–15 in Appendix S1 show that the standard logits are generally outperformed in terms of out‐of‐sample MAE. The modified Hosmer–Lemeshow plots based on predictions without random effects did not show significant differences between the standard logits and the Mundlak‐corrected logits, which confirms that the random effects and the Mundlak terms did not come at the cost of less accurate coefficient estimates at the individual level.
Second, LASSO‐selected logits tend to have higher overall and group‐specific prediction accuracy than the hand‐picked standard logits. Likewise, the LASSO‐selected Mundlak‐corrected logits outperform the hand‐picked Mundlak‐corrected logits both in terms of overall as well as group‐specific fit to the data.
Third, the LASSO‐selected Mundlak‐corrected logits have the lowest or second lowest AIC of all candidate specifications for both indications (Figures 4–9 in Appendix S1). Furthermore, hand‐picked Mundlak‐corrected logits that adjust for patient demographics (age, gender) and Elixhauser comorbidities (i.e., specifications 21–26) provide an overall prediction accuracy on par with the LASSO‐selected Mundlak‐corrected logits. This might be an especially valuable finding for public health authorities that need to justify the hospital ranking to the public and wish to use interpretable models. The main drawback of the LASSOs is that they typically produce complex specifications as, for example, higher order polynomials are selected into the model. Moreover, the selection of risk factors can change from year to year, potentially leading to confusion. In contrast, the hand‐picked Mundlak‐corrected random effects logits are easy to implement, understand, and communicate.
Besides the comparison of the different models, our analysis is informative regarding the risk factors that should be accounted for when comparing hospitals. First, we find that models including Elixhauser comorbidities show lower AIC values, indicating better fit, than identical models that adjust for Charlson comorbidities. Including the number of Elixhauser comorbidities and its non‐linear transformations decreases the AIC even when used together with binary indicators for the Elixhauser comorbidities. Second, the inclusion of socio‐demographic variables (insurance plan, nationality, and place before admission) does not improve model fit over identical specifications without these variables. Hence, they are not included in our preferred risk adjustment specification. However, socioeconomic differences can be correlated with unobservable risk factors (e.g., health behaviors) that vary between hospitals, so that such variables may nevertheless belong in the risk adjustment model even when they do not improve the fit to the data. This might be especially relevant in health care systems with restricted access to care for specific patient groups such as low‐income families and ethnic minorities.
In summary, LASSO‐selected (Mundlak‐corrected) logits and hand‐picked Mundlak‐corrected hospital random effects models outperform standard logistic regression models, which tend to fail to capture relevant differences in the risk structure between hospitals.
4.2. LASSO‐selected versus Mundlak‐corrected logit?
In the last section, we argued that the gains in prediction accuracy of LASSO‐selected specifications come at the cost of higher complexity. Practitioners might therefore wonder whether it is worthwhile to pay that “price.” To answer this question, we compare the hand‐picked Mundlak‐corrected logits with the LASSO‐selected (Mundlak‐corrected) logits and discuss how the model choice affects the hospital comparison. We find that practitioners pay only a small price when opting for the simpler, yet more transparent alternative—the two approaches paint an almost identical picture of which are the “good” and “bad” hospitals in the market.
Figure 2 plots the adjusted rate of surgical complications within 90 days (i.e., the quality index) of the Mundlak‐corrected logit (x‐axis) against the LASSO‐selected Mundlak‐corrected logit (y‐axis) for hospitals performing hip replacement surgery in 2018. The graph shows that the two risk adjustment models widely agree on the overall rating of hospitals, as the points closely fluctuate around the 45‐degree line (intraclass correlation coefficient [ICC]: 0.896). A similar pattern emerges when comparing LASSO‐selected Mundlak‐corrected logits with Mundlak‐corrected logits for all the other quality indicators for hip replacement surgery. Hence, the choice between the LASSO‐selected models and the hand‐picked Mundlak‐corrected logits does not systematically influence the hospital rating in the case of Switzerland.
FIGURE 2. The quality index for the Mundlak‐corrected hospital random effects logit (x‐axis) versus the LASSO‐selected Mundlak‐corrected logit (y‐axis) for surgical complications within 90 days after hip replacement surgery in 2018. The quality indices are computed as $QI_h = P_h/E_h$, where $P_h$ is the average predicted adverse event rate for patients treated in hospital $h$ and $E_h$ is the average predicted event rate if the same patients were treated in the average hospital. ICC, intraclass correlation coefficient.
4.3. Does quality vary between hospitals?
In the last step of the analysis, we turn to the question of whether hospitals differ in the quality they deliver to patients. We analyze the variation in hospital quality based on hand‐picked Mundlak‐corrected hospital random effects logit models that account for age, gender, and comorbidities (i.e., specification 22 for hip replacement and specification 23 for hernia repair).
Figure 3 illustrates the variation in hospital quality for the case of general complications during initial stay for primary hip replacement surgeries. Each point in the graph shows the risk‐adjusted quality index of a single hospital, and the size of the points represents the annual case numbers. The solid vertical line corresponds to the index of the average hospital. Hospitals to the left of this line provide above‐average quality (index <1), whereas hospitals with an index above one provide below‐average quality. Hospitals vary considerably in the quality they provide to patients. While we observe hospitals with an index of as little as 0.36, the hospital with the highest index has a value of 3.30, which is equivalent to a 230% higher complication rate than expected in the average hospital.
FIGURE 3. The risk‐adjusted quality index for general complications during initial stay for primary hip replacement in 2018, using the Mundlak‐corrected hospital random effects logit (specification 22). The quality index is computed as $QI_h = P_h/E_h$, where $P_h$ is the average predicted adverse event rate for patients treated in hospital $h$ and $E_h$ is the average predicted event rate if the same patients were treated in the average hospital. Each point represents a hospital that treated ≥50 cases in 2018 (as higher case numbers are associated with learning and better patient outcomes). The vertical solid line shows the index for “the average hospital.” Hospitals to the left of this line provide above‐average quality (index <1) and hospitals to the right provide below‐average quality (index >1). The red dashed lines show the 25th and 75th percentiles of the index.
To make the quality measures more accessible to patients, we translate the index to a simple 3‐star rating. Hospitals with an index value in the lowest 25% of values are classified as 3‐star hospitals. Conversely, hospitals in the top quartile receive a 1‐star rating. Hospitals in between form the 2‐star category (i.e., 50% of hospitals). Although the choice of the cutoff values is arbitrary, we recommend a mapping to a more intuitive scale because patients tend to have difficulties understanding ratios (i.e., the quality index) and thus might altogether ignore quality measures in their hospital choice. The dashed lines in Figure 3 mark the 25th and 75th percentiles that separate the star categories. Three‐star hospitals on average have a 47% lower complication rate than the average hospital (average index: 0.53), and the average 1‐star hospital shows a 68% higher complication rate compared to the average hospital (average index: 1.68). Thus, patients treated in 1‐star hospitals face systematically more complications and other adverse events than those treated in 3‐star clinics.
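For illustration, the mapping reduces to cutting the index distribution at its quartiles (a minimal sketch, reusing the hypothetical `quality` data frame with the `QI_re` column from the sketch above):

```r
# Quartiles of the quality index distribution
q <- quantile(quality$QI_re, probs = c(0.25, 0.75))

# Lowest quartile of the index (fewest adverse events) -> 3 stars,
# middle 50% -> 2 stars, highest quartile -> 1 star
quality$stars <- cut(
  quality$QI_re,
  breaks = c(-Inf, q[[1]], q[[2]], Inf),
  labels = c("3 stars", "2 stars", "1 star")
)
```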
The variation in quality, however, is not confined to this single indicator. We observe pronounced quality differences across all quality indicators and both indications (see Figures 16–20 and Table 13 in Appendix S1). Moreover, we observe a positive volume–outcome relationship for hip replacement surgeries, that is, a higher hospital caseload tends to be associated with being a 3‐star hospital. For hernia repair surgery, on the other hand, we do not observe any benefits from higher volumes.
5. DISCUSSION AND CONCLUSION
Every year, thousands of patients undergo elective surgery. Low quality of care can seriously harm patients and result in unnecessary costs for the health care system. Quality competition between hospitals can only work if patients have access to reliable and easy‐to‐understand quality information. In this paper, we proposed a systematic procedure to measure hospital quality based on routine data and assessed hospital quality for two elective procedures: inguinal hernia repair and primary hip replacement. Our analysis showed that Mundlak‐corrected and LASSO‐selected models were better at accounting for risk selection between hospitals than the standard logistic regression models commonly used in hospital comparisons. Applying the LASSO, a supervised machine learning method, to “automatically” select risk adjusters resulted in models with higher explanatory power and better identification of high‐risk cases than a wide variety of hand‐picked candidate specifications. Besides offering more favorable statistical properties than the standard logits, Mundlak‐corrected hospital random effects logits also explained a larger portion of the hospital‐level variation. Finally, our analysis provides evidence of considerable variation in quality across hospitals for both indications in the Swiss inpatient sector. Hence, hospital choices can have severe clinical implications for patients.
This study comes with different limitations. First, there is no such thing as a perfect risk adjustment based on routine data. 7 , 55 Although our approach addresses key shortcomings of standard logistic risk adjustment models, hospitals might still vary in unobservable patient risks, which we do not account for. Additional data on, for example, disease severity or patient mobility scores could further improve risk adjustment and de‐bias the hospital comparison.
Second, we selected models based on both their in‐ and out‐of‐sample prediction accuracy. However, we had minor convergence issues when cross‐validating the LASSO‐selected Mundlak‐corrected model for one of the quality indicators in the case of hernia repair. We thus suggest that future work focus more systematically on the out‐of‐sample performance of different risk adjustment models.
Third, although hospital routine data have important properties for quality measurement—they are readily available, comprehensive, and timely—they also come with certain limitations. Many outcomes are delayed and may only occur after hospital discharge. Specifically, we do not observe adverse events treated in outpatient care (e.g., infections). Thus, the hospital comparison would be biased if hospitals differ in the management of subsequent adverse events. Another potential drawback is that “coding practices” may differ between hospitals. 56 Certain hospitals could code adverse events less stringently than others. Although coding regulations typically leave little room for “creative” coding, this issue deserves further study in future research.
Fourth, mapping quality indices to simple 3‐star ratings can be problematic if quality varies little between hospitals. In that case, the translation to a simpler scale is not informative, and researchers should altogether dismiss the corresponding indicators. Furthermore, hospitals “at the margin” might rightfully complain about being underrated despite providing almost identical quality to hospitals just above the cutoff. Indeed, the choice of the cutoff values (e.g., quantiles) is arbitrary. To alleviate this limitation, the star rating should ideally not be based entirely on point estimates but also incorporate the uncertainty around the estimated hospital quality. For example, stars could only be given to hospitals that differ significantly in their quality from other hospitals across quality segments (e.g., hospitals in the top 25% vs. the middle 50% of the index distribution). Note, however, that methods to consistently estimate confidence intervals for the predicted‐to‐expected ratios used in this paper have not been developed yet, and bootstrapping procedures tend to have poor coverage rates in simulation studies. 57
Our results have relevance for several stakeholders in the health care system. Public health authorities might be particularly interested in the finding that relatively simple hand‐picked Mundlak‐corrected specifications showed prediction accuracy similar to the more complex LASSO‐selected models. Moreover, since both model types provided an almost identical answer to the question of which are the low‐ and high‐quality hospitals in the market, the hand‐picked Mundlak‐corrected models have even more practical appeal. For example, this could be relevant for the Agency for Healthcare Research and Quality (AHRQ), which currently applies a LASSO‐selected logit for risk adjustment. 58 Complex models are harder to defend, and hospitals might rightfully claim that it is unclear how they are evaluated and thus dismiss the validity of the ranking altogether. Finally, since there is growing interest in composite measures of hospital performance, as used by the CMS on its Hospital Compare website, our quality indices can easily be combined into a single measure. Different weighting strategies could be applied to combine the indicators into an overall rating: equal weighting, weights according to expert opinion, or empirical weights. 7 , 10 , 59 , 60
Patients benefit from access to quality information because it allows them to undergo surgery in high‐quality hospitals with fewer harmful or potentially life‐threatening events. This should both improve patient health and enhance quality competition in the health care sector. In addition, communicating the quality indicators on intuitive scales like the 3‐star rating should increase the probability that hospital choices are guided by quality information instead of travel distance or the size of the parking lot. 4 , 10 , 13 , 61 , 62 Lastly, while existing quality metrics are often based on mortality, such “extreme” outcomes are of second‐order importance in high‐volume, low‐risk elective procedures. Patients and/or their referring physicians can thus base their future provider decisions on quality dimensions that are more relevant to them.
Payers benefit from reliable quality measures to allocate resources and steer patient streams to high‐quality hospitals. This might be particularly relevant in health care systems where health authorities/states decide which hospitals can bill their services to mandatory health insurance. Moreover, quality information could prove valuable in price negotiations for supplementary insurance and in value‐based purchasing programs, which have gained popularity in many health care systems over the last decade. 63 , 64
Supporting information
Appendix S1. Supporting Information.
ACKNOWLEDGMENTS
We are grateful for the financial support by SWICA Health Insurance and the productive meetings with Maria Trottmann, Delia Meyer, Aleksandra Jäger, Christian Frey, Harry Scheja, Daniel Rochat, and Eva Blozik. Special thanks also go to Joëlle Riedweg, Silvia Thomann, Niklaus Bernet, and Rahel Röösli for their valuable contribution, especially in the extraction and construction of the quality indicators. Finally, we would like to thank Stefan Boes, Daniel Ammann, and the participants at the European Health Economics Association Conference (EuHEA) in Oslo for the helpful inputs and comments. All authors have approved the final version of the manuscript. All errors are our own.
Bilger J, Pletscher M, Müller T. Separating the wheat from the chaff: How to measure hospital quality in routine data? Health Serv Res. 2024;59(2):e14282. doi: 10.1111/1475-6773.14282
REFERENCES
- 1. van de Ven WPMM, Beck K, Buchner F, et al. Preconditions for efficiency and affordability in competitive healthcare markets: are they fulfilled in Belgium, Germany, Israel, the Netherlands and Switzerland? Health Policy. 2013;109(3):226‐245.
- 2. Enthoven AC. The history and principles of managed competition. Health Aff (Millwood). 1993;12 Suppl:24‐48. doi: 10.1377/hlthaff.12.suppl_1.24
- 3. Enthoven AC. Managed competition: an agenda for action. Health Aff (Millwood). 1988;7(3):25‐47. doi: 10.1377/hlthaff.7.3.25
- 4. Smith H, Currie C, Chaiwuttisak P, Kyprianou A. Patient choice modelling: how do patients choose their hospitals? Health Care Manag Sci. 2018;21(2):259‐268. doi: 10.1007/s10729-017-9399-1
- 5. Chandra A, Finkelstein A, Sacarny A, Syverson C. Health care exceptionalism? Performance and allocation in the US health care sector. Am Econ Rev. 2016;106(8):2110‐2144. doi: 10.1257/aer.20151080
- 6. Cutler DM. Where are the health care entrepreneurs? The failure of organizational innovation in health care. Innov Policy Econ. 2011;11:1‐28. doi: 10.1086/655816
- 7. Dimick JB, Staiger DO, Osborne NH, Nicholas LH, Birkmeyer JD. Composite measures for rating hospital quality with major surgery. Health Serv Res. 2012;47(5):1861‐1879. doi: 10.1111/j.1475-6773.2012.01407.x
- 8. Wang J, Hockenberry J, Chou S‐Y, Yang M. Do bad report cards have consequences? Impacts of publicly reported provider quality information on the CABG market in Pennsylvania. J Health Econ. 2011;30(2):392‐407. doi: 10.1016/j.jhealeco.2010.11.006
- 9. Peters E, Dieckmann N, Dixon A, Hibbard JH, Mertz CK. Less is more in presenting quality information to consumers. Med Care Res Rev. 2007;64(2):169‐190. doi: 10.1177/10775587070640020301
- 10. Emmert M, Schlesinger M. Hospital quality reporting in the United States: does report card design and incorporation of patient narrative comments affect hospital choice? Health Serv Res. 2017;52(3):933‐958. doi: 10.1111/1475-6773.12519
- 11. Hibbard JH, Greene J, Daniel D. What is quality anyway? Performance reports that clearly communicate to consumers the meaning of quality of care. Med Care Res Rev. 2010;67(3):275‐293. doi: 10.1177/1077558709356300
- 12. Kolstad JT, Chernew ME. Quality and consumer decision making in the market for health insurance and health care services. Med Care Res Rev. 2009;66(1 Suppl):28S‐52S. doi: 10.1177/1077558708325887
- 13. de Cruppé W, Geraedts M. Hospital choice in Germany from the patient's perspective: a cross‐sectional study. BMC Health Serv Res. 2017;17(1):720. doi: 10.1186/s12913-017-2712-3
- 14. Moscone F, Tosetti E, Vittadini G. Social interaction in patients' hospital choice: evidence from Italy. J R Stat Soc A Stat Soc. 2012;175(2):453‐472. doi: 10.1111/j.1467-985X.2011.01008.x
- 15. Sivey P. The effect of waiting time and distance on hospital choice for English cataract patients. Health Econ. 2012;21(4):444‐456. doi: 10.1002/hec.1720
- 16. Dijs‐Elsinga J, Otten W, Versluijs MM, et al. Choosing a hospital for surgery: the importance of information on quality of care. Med Decis Making. 2010;30(5):544‐555. doi: 10.1177/0272989X09357474
- 17. Epstein AJ. Effects of report cards on referral patterns to cardiac surgeons. J Health Econ. 2010;29(5):718‐731. doi: 10.1016/j.jhealeco.2010.06.002
- 18. Fung CH, Lim Y‐W, Mattke S, Damberg C, Shekelle PG. Systematic review: the evidence that publishing patient care performance data improves quality of care. Ann Intern Med. 2008;148(2):111‐123. doi: 10.7326/0003-4819-148-2-200801150-00006
- 19. Borah BJ. A mixed logit model of health care provider choice: analysis of NSS data for rural India. Health Econ. 2006;15(9):915‐932. doi: 10.1002/hec.1166
- 20. Jha AK, Epstein AM. The predictive accuracy of the New York State coronary artery bypass surgery report‐card system. Health Aff (Millwood). 2006;25(3):844‐855. doi: 10.1377/hlthaff.25.3.844
- 21. Federal Office of Public Health. Qualitätsindikatoren der Schweizer Akutspitäler 2020. 2022. Accessed April 20, 2022. https://spitalstatistik.bagapps.ch/data/download/qip20_publikation.pdf?v=1649076288
- 22. Henderson M, Han F, Perman C, Haft H, Stockwell I. Predicting avoidable hospital events in Maryland. Health Serv Res. 2022;57(1):192‐199. doi: 10.1111/1475-6773.13891
- 23. Nwachukwu C, Trieu H, Land T, Li W, Zhang Z. Measuring the relative contribution of social risk factors in an enhanced risk‐adjustment model for unplanned hospital readmissions. Health Serv Res. 2020;55(S1):108‐109. doi: 10.1111/1475-6773.13485
- 24. Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz. Austrian Inpatient Quality Indicators (A‐IQI): Bericht 2019. 2019. https://www.sozialministerium.at/dam/sozialministeriumat/Anlagen/Themen/Gesundheit/Gesundheitssystem/Gesundheitssystem‐und‐Qualit%C3%A4tssicherung/Ergebnisqualit%C3%A4tsmessung/A‐IQI_Bericht_2019.pdf
- 25. Haute Autorité de santé. Hospital mortality indicators: foreign experience, literature teachings and guidelines: to support public decision‐making and the development of indicators in France. 2017. Accessed December 8, 2022. https://www.has-sante.fr/upload/docs/application/pdf/2020-12/hospital_mortality_indicators_report.pdf
- 26. Nimptsch U. Disease‐specific trends of comorbidity coding and implications for risk adjustment in hospital administrative data. Health Serv Res. 2016;51(3):981‐1001. doi: 10.1111/1475-6773.12398
- 27. Lim E, Cheng Y, Reuschel C, et al. Risk‐adjusted in‐hospital mortality models for congestive heart failure and acute myocardial infarction: value of clinical laboratory data and race/ethnicity. Health Serv Res. 2015;50(Suppl 1):1351‐1371. doi: 10.1111/1475-6773.12325
- 28. Cohen ME, Ko CY, Bilimoria KY, et al. Optimizing ACS NSQIP modeling for evaluation of surgical quality and risk: patient risk adjustment, procedure mix adjustment, shrinkage adjustment, and surgical focus. J Am Coll Surg. 2013;217(2):336‐346.e1. doi: 10.1016/j.jamcollsurg.2013.02.027
- 29. Steinberg SM, Popa MR, Michalek JA, Bethel MJ, Ellison EC. Comparison of risk adjustment methodologies in surgical quality improvement. Surgery. 2008;144(4):662‐667; discussion 662‐7. doi: 10.1016/j.surg.2008.06.010
- 30. AOK‐Bundesverband, Forschungs‐ und Entwicklungsinstitut für das Sozial‐ und Gesundheitswesen Sachsen‐Anhalt, HELIOS Klinikum Emil von Behring Berlin‐Zehlendorf, Wissenschaftliches Institut der AOK, HELIOS Kliniken GmbH. Qualitätssicherung der stationären Versorgung mit Routinedaten (QSR): Abschlussbericht. WIdO; 2007.
- 31. Pine M, Jordan HS, Elixhauser A, et al. Enhancement of claims data to improve risk adjustment of hospital mortality. JAMA. 2007;297(1):71‐76. doi: 10.1001/jama.297.1.71
- 32. Donabedian A. Quality assessment and monitoring. Retrospect and prospect. Eval Health Prof. 1983;6(3):363‐375. doi: 10.1177/016327878300600309
- 33. Donabedian A. The quality of care: how can it be assessed? JAMA. 1988;260(12):1743. doi: 10.1001/jama.1988.03410120089033
- 34. van Kleef R, McGuire TG. Risk adjustment, risk sharing and premium regulation in health insurance markets: theory and practice. Academic Press; 2018.
- 35. Menendez ME, Neuhaus V, van Dijk CN, Ring D. The Elixhauser comorbidity method outperforms the Charlson index in predicting inpatient death after orthopaedic surgery. Clin Orthop Relat Res. 2014;472(9):2878‐2886.
- 36. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998;36(1):8‐27. doi: 10.1097/00005650-199801000-00004
- 37. Federal Statistical Office. Medizinische Statistik der Krankenhäuser. Published October 17, 2016. Accessed May 26, 2023. https://dam-api.bfs.admin.ch/hub/api/dam/assets/7369/master
- 38. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373‐383.
- 39. Gasparini A. comorbidity: an R package for computing comorbidity scores. JOSS. 2018;3(23):648. doi: 10.21105/joss.00648
- 40. Dimick JB, Staiger DO, Birkmeyer JD. Ranking hospitals on surgical mortality: the importance of reliability adjustment. Health Serv Res. 2010;45(6 Pt 1):1614‐1629. doi: 10.1111/j.1475-6773.2010.01158.x
- 41. Arling G, Reeves M, Ross J, et al. Estimating and reporting on the quality of inpatient stroke care by Veterans Health Administration Medical Centers. Circ Cardiovasc Qual Outcomes. 2012;5(1):44‐51. doi: 10.1161/CIRCOUTCOMES.111.961474
- 42. MacKenzie TA, Grunkemeier GL, Grunwald GK, et al. A primer on using shrinkage to compare in‐hospital mortality between centers. Ann Thorac Surg. 2015;99(3):757‐761. doi: 10.1016/j.athoracsur.2014.11.039
- 43. Busse R, Klazinga N, Panteli D, Quentin W, eds. Improving Healthcare Quality in Europe: Characteristics, Effectiveness and Implementation of Different Strategies. World Health Organisation; 2019.
- 44. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. MIT Press; 2010.
- 45. Brzoska P, Sauzet O, Breckenkamp J. Unobserved heterogeneity and the comparison of coefficients across nested logistic regression models: how to avoid comparing apples and oranges. Int J Public Health. 2017;62(4):517‐520.
- 46. Mood C. Logistic regression: why we cannot do what we think we can do, and what we can do about it. Eur Sociol Rev. 2010;26(1):67‐82.
- 47. Ayis S. Quantifying the impact of unobserved heterogeneity on inference from the logistic model. Commun Stat Theory Methods. 2009;38(13):2164‐2177.
- 48. Cameron AC, Trivedi PK. Microeconometrics: Methods and Applications. Cambridge University Press; 2005.
- 49. Wooldridge JM. Correlated random effects models with unbalanced panels. J Econom. 2019;211(1):137‐150. doi: 10.1016/j.jeconom.2018.12.010
- 50. Belloni A, Chernozhukov V, Hansen C. High‐dimensional methods and inference on structural and treatment effects. J Econ Perspect. 2014;28(2):29‐50. doi: 10.1257/jep.28.2.29
- 51. Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press; 2015.
- 52. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike. Springer; 1998:199‐213.
- 53. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. John Wiley & Sons; 2013.
- 54. Burnham KP, Anderson DR. Multimodel inference. Sociol Methods Res. 2004;33(2):261‐304. doi: 10.1177/0049124104268644
- 55. Preyra C. Coding response to a case‐mix measurement system based on multiple diagnoses. Health Serv Res. 2004;39(4 Pt 1):1027‐1045. doi: 10.1111/j.1475-6773.2004.00270.x
- 56. Pongpirul K, Robinson C. Hospital manipulations in the DRG system: a systematic scoping review. Asian Biomed. 2013;7(3):301‐310. doi: 10.5372/1905-7415.0703.180
- 57. Austin PC. The failure of four bootstrap procedures for estimating confidence intervals for predicted‐to‐expected ratios for hospital profiling. BMC Med Res Methodol. 2022;22(1):271. doi: 10.1186/s12874-022-01739-x
- 58. Agency for Healthcare Research and Quality. AHRQ Quality Indicator Empirical Methods. 2021. https://qualityindicators.ahrq.gov/measures/qi_resources
- 59. Dimick JB, Staiger DO, Hall BL, Ko CY, Birkmeyer JD. Composite measures for profiling hospitals on surgical morbidity. Ann Surg. 2013;257(1):67‐72. doi: 10.1097/SLA.0b013e31827b6be6
- 60. Staiger DO, Dimick JB, Baser O, Fan Z, Birkmeyer JD. Empirically derived composite measures of surgical performance. Med Care. 2009;47(2):226‐233. doi: 10.1097/mlr.0b013e3181847574
- 61. Donelan K, Rogers RS, Eisenhauer A, Mort E, Agnihotri AK. Consumer comprehension of surgeon performance data for coronary bypass procedures. Ann Thorac Surg. 2011;91(5):1400‐1405; discussion 1405‐6. doi: 10.1016/j.athoracsur.2011.01.019
- 62. Hibbard JH, Sofaer S. Best Practices in Public Reporting No. 1: How to Effectively Present Health Care Performance Data to Consumers. https://www.researchgate.net/profile/judith‐hibbard‐2/publication/266041920_best_practices_in_public_reporting_no_1_how_to_effectively_present_health_care_performance_data_to_consumers/links/54d8c78d0cf25013d03f4d46/best‐practices‐in‐public‐reporting‐no‐1‐how‐to‐effectively‐present‐health‐care‐performance‐data‐to‐consumers.pdf
- 63. Chee TT, Ryan AM, Wasfy JH, Borden WB. Current state of value‐based purchasing programs. Circulation. 2016;133(22):2197‐2205. doi: 10.1161/CIRCULATIONAHA.115.010268
- 64. Ryan AM. Will value‐based purchasing increase disparities in care? N Engl J Med. 2013;369(26):2472‐2474. doi: 10.1056/nejmp1312654