Abstract
With health care reform passing in the United States, much effort is directed toward developing and disseminating comparative information on standardized processes of care for health care providers. We propose the use of Bayesian multilevel item response theory models to estimate hospital quality from multiple process measures and to assess geographical variation in hospital quality. Our approach fully incorporates the nesting structure of measures, patients, hospitals, and various levels of geographical units to provide a summary of hospital quality. A national dataset of patients treated for a heart attack, heart failure, or pneumonia illustrates our methods. We find considerable geographical differences in hospital quality for these conditions, with variations across census regions and states accounting for slightly more than 10% of the total variation. Some states performed well for all three conditions (e.g., the respective posterior probabilities of performing better than the national average were close to 1 in Iowa, New Jersey, South Dakota, and Wisconsin). In contrast, the quality of other states varied across conditions (e.g., the corresponding posterior probability was close to 1 in Massachusetts for heart attack and heart failure quality, but less than 0.5 for pneumonia care). Our framework provides a comprehensive approach to assessment of hospital performance at both regional and national levels, and might be informative for policy development.
Keywords: Hierarchical models, hospital profiling, multiple outcomes, pay-for-performance, process measures
1. Introduction
The study of the quality of health care providers, such as hospitals, is a central activity of health services and outcomes research. Quality of care is an abstract and multidimensional construct that cannot be measured directly. Instead, measurable indicators are used to characterize three dimensions of “quality of care”: structure, process, and outcome, a framework attributed to Donabedian (1966). Structural measures are characteristics of the care provider, such as nursing ratios and the presence of residency programs, and outcome measures refer to responses that characterize patients’ health status, such as survival.
Process measures refer to what providers do to and for patients. These include documented adherence to established best practices, such as the use of aspirin at arrival for a heart attack (acute myocardial infarction, AMI). Process measures may exist for many conditions where outcome measures do not exist or where outcomes may have limited applicability due to sample size issues or infrequent endpoints. Another attractive feature of process measures involves their purported transparency. They are actionable, giving providers immediate guidance as to where to focus improvement effort.
With the increasing availability of evidence-based practice guidelines and the advent of health care reform in the United States, much effort is now directed at developing and disseminating comparative data involving process measures for quality improvement. The Joint Commission on Accreditation of Health Care Organizations (JCAHO) has disseminated monthly performance data from JCAHO accredited hospitals (www.jointcommission.org/PerformanceMeasurement). Evaluation of providers based on their process measures has also been the basis of pay-for-performance initiatives. These initiatives use retrospectively collected information on the use of evidence-based therapies to provide financial awards to hospitals that show high-quality performance in several acute areas (see Premier Hospital Quality Incentive Demonstration Project: Premier Inc. 2007).
Prior studies have documented substantial geographical variations, across census regions, divisions, and states, in the process of care for AMI and several other acute conditions (e.g., Krumholz et al. 1998, O’Connor et al. 1999, Jencks et al. 2000, Krumholz et al. 2003, Kaul and Peterson 2007). These studies suggest that some of the variation might be attributed to differences in patterns of care, such as how treatment is practiced, the availability of resources for some procedures, and the rules and regulations across geographical units. In addition, variations in the use of evidence-based medical therapies, such as the use of β-blockers in the presence of AMI, persist after adjustment for patient, physician, and hospital characteristics, even though these therapies can be prescribed at low cost and published evidence demonstrating their effectiveness is widely available. Fully understanding the extent and patterns of geographical variation would help in implementing policies to reduce the gap between evidence and practice.
There are two main challenges in studying geographical variations using hospital performance data on process measures. One is how to best summarize hospital quality from multiple performance measures. For a given medical condition, quality indicators (Table 1) often consist of multiple items/measures. From a clinical perspective, developing effective composite measures allows for continuous measurement across providers through aggregation by patient and for an examination of all aspects of recommended care at the community/population level. Composite measures thus can provide a different and integrated view of the reliability of the care system as a whole, encouraging and facilitating systems-level changes by highlighting the need for better care coordination and accountability across multiple providers. From a statistical perspective, analyzing individual measures separately ignores the often strong correlation among these measures and does not provide an overall picture of quality. It thus becomes necessary to combine all the measures in some fashion to summarize the quality of care, especially as the number of developed process measures continues to grow (Performance measurement 2006; Schwartz et al. 2008).
Table 1:
Description of Process Measures
| Condition | Measure |
|---|---|
| Acute Myocardial Infarction (AMI) | Aspirin at arrival |
| Aspirin prescribed at discharge | |
| ACE inhibitor or ARB for LVSD | |
| Beta blocker prescribed at discharge | |
| Beta blocker at arrival | |
| Congestive Heart Failure (CHF) | Discharge instructions |
| LVF assessment | |
| ACE inhibitor or ARB for LVSD | |
| Adult smoking cessation advice/counseling | |
| Community Acquired Pneumonia (CAP) | Oxygen assessment |
| Pneumococcal vaccination status | |
| Blood culture before first antibiotic | |
| Adult smoking cessation advice/counseling | |
| Initial antibiotic received within 4 hrs of arrival | |
Note: ACE: angiotensin-converting enzyme; ARB: angiotensin receptor blocker; LVSD: left ventricular systolic dysfunction.
Some existing methods aggregate performance measures using a raw average weighted by the number of eligible patients for each measure (Section 3.2). This approach has been referred to as the opportunity-based approach (Premier Inc. 2007), as it assesses the number of met opportunities. A simple average of measure-specific estimates with equal weights across measures has also been utilized, as has the all-or-none scoring approach (Nolan and Berwick 2006). Each of these methods is simple to implement and popular among practitioners, yet the corresponding composite scores have some pitfalls. For example, both the weighted and simple average score methods may permit high performance on some measures to mask poor performance on other measures that may be critical to quality. More comments can be found in O’Brien et al. (2007) and Teixeira-Pinto and Normand (2008).
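To make the contrast concrete, the following sketch (all counts hypothetical) computes the opportunity-based and simple-average composites for a single hospital, and shows how the weighted score can let high volume on well-performed measures mask poor performance on a low-volume measure:

```python
import numpy as np

# Hypothetical counts for one hospital: eligible patients (n) and
# successes (y) on three process measures.
n = np.array([40, 10, 50])
y = np.array([38, 2, 48])

# Opportunity-based score: total met opportunities over total opportunities,
# i.e., a raw average weighted by the number of eligible patients per measure.
opportunity = y.sum() / n.sum()            # 88/100 = 0.88

# Simple average: mean of the per-measure rates, equal weight per measure.
simple = (y / n).mean()                    # mean(0.95, 0.20, 0.96) ~ 0.70

# The high volume on measures 1 and 3 masks the 20% performance on
# measure 2 in the opportunity-based score; the simple average penalizes
# the poor measure more heavily. (All-or-none scoring would require
# patient-level eligibility data, which these aggregate counts lack.)
```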
A more principled approach is to use a model-based score constructed using item response theory (IRT) models (van der Linden and Hambleton 1997). IRT models (also referred to as latent trait models) were originally developed in the fields of psychometrics and educational testing. In this approach, multiple quality indicators, such as process compliance, are assumed to be related to an underlying (unobserved) latent variable summarizing quality, and this latent variable is the primary focus of interest. Analogous to the factor analysis models used for multivariate continuous data, this type of model potentially allows quality to be estimated with high statistical efficiency by combining information from multiple categorical variables into a single latent parameter (Skrondal and Rabe-Hesketh 2004).
The second challenge is that hospital performance data often exhibit a multilevel or hierarchical structure. That is, quality measures are nested within patients, patients are nested within hospitals, and hospitals are nested within geographical units, which are themselves multilevel (e.g., states are nested within census regions). Some simple methods, such as using the 20%-80% percentile range to describe variability, or using raw average rates and ranks to compare performance at the state or local level, ignore this statistical dependence and fail to estimate the inter- and intra-unit (e.g., hospital or geographical unit) variation. The corresponding results may be more susceptible to random error (Goldstein and Spiegelhalter 1996; Normand and Shahian 2007).
In principle, integrating geographical variation into the study of hospital performance data requires statistical modeling strategies that can simultaneously incorporate the contributions of patient, hospital, and geographical characteristics. A general statistical tool for jointly describing effects operating at several levels of aggregation is multilevel or hierarchical modeling (Snijders and Bosker 1999; Raudenbush and Bryk 2002). Each level of a multilevel model represents the effects of each type of unit, such as census regions, states, hospitals, or individual patients. At each level, the systematic effects of measured covariates are represented analytically by regression coefficients and the magnitude of the “random” effects (effects of unmeasured characteristics of the units that cause unexplained residual variation at that level) by variance components. The random effects are of considerable interest: unexplained variation signals opportunities for further explanation of important mediators at a given level and for quality improvement in local regions or hospitals. As an analytical approach with a growing range of applications in health services and outcomes research, multilevel models have been effectively used to model variations in quality of care and to profile providers (Goldstein and Spiegelhalter 1996; Normand, Glickman, and Gatsonis 1997; Aguilar and West 1998; Daniels and Gatsonis 1999; Draper and Gittoes 2004).
In this paper, we propose the use of multilevel item response theory models (Fox and Glas 2001) to assess geographical variations in hospital processes of care. Our research extends previous work in modeling multivariate multilevel data to summarize quality from hospital process performance measures. We use a fully Bayesian estimation approach to obtain posterior estimates of the functions of the parameters quantifying hospital quality and its summary within geographical units, facilitating easy comparison of performance at different levels. In Section 2 we briefly introduce IRT models and describe the multilevel IRT models for assessing geographical variations in hospital care. Section 3 illustrates our approach using a national dataset of hospital performance in several acute conditions. We conclude with recommendations and future work in Section 4.
2. Item Response Theory Models for Hospital Quality of Care
2.1. Background
Our underlying premise is that any hospital quality estimate should be determined by the hospital’s success in providing therapies or processes of care from multiple measures, with higher rates of success corresponding to better quality, and should differentiate each measure’s ability to discriminate hospitals. Following the scheme in Landrum and Normand (2000), Normand, Wolf, and McNeil (2008), and Teixeira-Pinto and Normand (2008), we lay out some basics of IRT models for hospital quality. More details can be found in the references.
Let nht denote the number of eligible patients for the t-th therapy at the h-th hospital (h = 1, …, H and t = 1, …, T), yht the number of eligible patients at the h-th hospital who received therapy t, and pht the probability that a patient receives therapy t at the h-th hospital given that the patient is eligible for the therapy. An example of a two-parameter IRT model (van der Linden and Hambleton 1997) is
yht ~ Binomial(nht, pht),  g(pht) = at + btθh,  (1)
where g(·) denotes the link function (logit or probit), at and bt are the “difficulty” and “discriminating” parameters (i.e., the two parameters) of the t-th measure, and θh is the latent quality of hospital h. Specifically, at represents how difficult it is to achieve the t-th measure, and bt represents a measure-specific discriminating weight. A process measure that is less homogeneous among hospitals has a larger value of bt because it can discriminate better among hospitals (Section 3.3.2). The constraint bt > 0 is specified for identifying the parameters. Under the probit link, the model is also referred to as the two-parameter Normal-Ogive model. The values for θh range from −∞ to ∞, and a larger value of θh corresponds to better hospital quality. Assuming a standardized normal distribution for θh accommodates the correlation of patient data within hospitals. This can be seen from the fact that Cov(btθh, bt′θh) = btbt′ > 0 for therapies t and t′ in hospital h.
Note that model (1) is specified for the counts yht rather than the observed rates yht/nht, so that the sampling variation of the rate estimates is taken into consideration. A limitation of model (1) is that it does not fully account for the within-patient correlation that arises because some patients are eligible for multiple measures, and this will likely result in underestimation of the variance of the latent score. This can be accommodated by specifying an extended model for each patient response, but doing so requires more detailed clinical data, which are much less likely to be accessible than the hospital-level data in our illustrative example.
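A minimal simulation sketch of model (1) under the logit link (all parameter values hypothetical) illustrates how a single latent θh induces the positive within-hospital correlation among measures noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 2000                                   # number of hospitals
a = np.array([2.0, 1.0, 3.0])              # hypothetical "difficulty" parameters
b = np.array([0.6, 1.2, 0.8])              # discriminating weights, bt > 0
theta = rng.normal(size=H)                 # latent quality, theta_h ~ N(0, 1)

# Model (1) with the logit link: logit(p_ht) = a_t + b_t * theta_h.
p = 1.0 / (1.0 + np.exp(-(a[None, :] + b[None, :] * theta[:, None])))
n = rng.integers(30, 200, size=(H, 3))     # eligible patients per measure
y = rng.binomial(n, p)                     # y_ht ~ Binomial(n_ht, p_ht)

# The shared latent variable induces positive correlation between the
# observed rates of different therapies within hospitals.
rates = y / n
corr = np.corrcoef(rates, rowvar=False)    # 3 x 3 correlation matrix
```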
2.2. Methods: multilevel IRT models for hospital quality within geographical units
Fox and Glas (2001) proposed a multilevel extension of IRT models to incorporate the hierarchical structure of student performance data in educational testing. For example, this type of model can be used to make inferences about the relationships between some explanatory variables and the student performance within and between schools, simultaneously handling student level relationships and taking account of the way that students are grouped in schools.
We apply this modeling technique in health profiling studies. On the basis of model (1), a multilevel IRT model further assumes a random-effects model (Laird and Ware 1982) for the latent variable, as
θh = Xhβ + Zhγh + εh,  (2)
where for hospital h, Xh is a design matrix for the fixed effects common to all hospitals, β is a vector of coefficients for the fixed effects, Zh is a design matrix for the hospital-specific random effects, γh is a vector of random-effect coefficients, often assumed to follow independent and identical normal distributions across hospitals, and εh is the error term for the residual variation unexplained by both the fixed and random effects, also often assumed to be independent and identically normally distributed across hospitals.
In profiling studies, hospitals can be classified into groups with common characteristics described by Xh’s or Zh’s. To study the geographical variations in hospital performance, it is natural to group hospitals based on their geographical locations, such as census regions, divisions, states, or counties, and use the multilevel IRT model to characterize the variations within and between these geographical units.
In this paper, we are interested in studying the distribution of hospital quality across census regions and states. Using r (r = 1, …, 4), s (s = 1, …, 50), and h to index the 4 regions, 50 states, and hospitals included in the data, Equation (2) becomes
θh(s) = βr + γs(r) + εh(s),  (3)
where βr is the fixed region effect (i.e., the deviation of the average hospital quality in region r from the national average), γs(r) is the random state effect (i.e., the deviation between the average hospital quality in state s and the average quality of region r) with γs(r) ~ N(0, τ2), εh(s) is the residual hospital-level variation with εh(s) ~ N(0, σ2), and the notations s(r) and h(s) are used because states are nested within regions and hospitals are nested within states. Models incorporating other geographical units such as counties can be built in a similar manner.
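The decomposition in Equation (3) can be sketched by simulation; the effect sizes below are hypothetical and chosen only to illustrate how the geographic share of total variation in hospital quality is computed:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([0.2, 0.2, -0.3, -0.1])     # hypothetical fixed region effects
region_of_state = np.repeat([0, 1, 2, 3], [9, 12, 16, 13])  # 50 states
tau, sigma = 0.25, 1.0                      # state- and hospital-level sd

gamma = rng.normal(0.0, tau, size=50)       # gamma_s(r) ~ N(0, tau^2)
state_mean = beta[region_of_state] + gamma  # beta_r + gamma_s(r)
# theta_h(s) = beta_r + gamma_s(r) + eps_h(s), 60 hospitals per state:
theta = state_mean[:, None] + rng.normal(0.0, sigma, size=(50, 60))

# Share of total variation in hospital quality attributable to geography
# (regions + states) versus residual hospital-level variation.
geo_share = state_mean.var() / theta.var()
```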
Equations (1) and (3) assume that the difficulty and discriminating parameters are invariant across geographical units. Because the process measures are derived from standardized therapies, we assume that the ability of different measures to characterize the underlying performance does not differ across geographical locations. Equation (3) also assumes that the quality of hospitals in the same state is correlated while that of hospitals in different states is not. Our justification is that hospitals in the same state are more likely to be subject to the same department of health regulations, which might impact their patterns of care. Equation (3) further assumes equal state-level variation across regions and equal hospital-level variation across states. A more general assumption is γs(r) ~ N(0, τr2) and εh(s) ~ N(0, σs2), so that both the state- and hospital-level variations can vary. However, the more general model might lead to identifiability or over-shrinkage issues because of the complex variance component structure. For ease of illustration, therefore, we use the model described by Equations (1) and (3).
Constraints need to be imposed on multilevel IRT models in order to identify the parameters. In Equation (3), we set Σrβr = 0 and σ2 = 1. The former sets the average quality to 0 on the latent quality scale, and the latter standardizes the hospital-level variation. Note that setting σ2 equal to another constant will only proportionally change the parameter estimates but not the statistical inferences (Appendix). In our application, we find that these constraints work better than some alternative approaches (e.g., setting one of the βr’s to 0, or fixing certain values for the at’s and bt’s) in terms of stabilizing the model estimation.
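The claim that rescaling σ2 only proportionally changes the estimates can be checked directly: multiplying the residual variance by a constant c is absorbed by rescaling the discriminating parameter, leaving the linear predictor of model (1), and hence the likelihood, unchanged (values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 1.5, 0.8                            # hypothetical item parameters
theta = rng.normal(0.0, 1.0, size=5)       # latent quality under sigma^2 = 1

c = 4.0                                    # suppose sigma^2 = c instead
theta_c = np.sqrt(c) * theta               # same hospitals on the rescaled scale
b_c = b / np.sqrt(c)                       # b absorbs the rescaling

# Identical linear predictors, hence identical likelihood and inferences.
same = np.allclose(a + b * theta, a + b_c * theta_c)
```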
We use a fully Bayesian estimation approach for the multilevel IRT model. Bayesian methods can incorporate our prior beliefs or information about the studies of interest through prior distributions for the parameters such as the at’s, bt’s, βr’s, and τ2 in Equations (1) and (3). In addition, because parameters are considered as random variables, functions of parameters can be obtained from the Bayesian estimation process. For each state, we obtain the posterior probability that its average quality is better than the national average, and its expected fraction of superior hospitals when compared across the nation. This allows more flexibility for inferences and predictions from the model, whereas the typical frequentist approach focuses only on the mean and variance estimates of the parameters. We use the general Bayesian package WinBUGS (Spiegelhalter et al. 2003) for the model estimation via the Gibbs sampling approach. Some code examples are included in the Appendix.
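As a sketch of the posterior summaries just described, suppose we had M posterior draws of βr + γs(r) for one state on a scale where the national average is 0; the draws below are simulated for illustration, not actual model output, and the fraction-of-superior-hospitals calculation assumes the hospital-level sd is fixed at 1 as in our identification constraint:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 10000                                   # number of posterior draws
# Hypothetical posterior draws of beta_r + gamma_s(r) for one state.
state_avg = rng.normal(0.15, 0.08, size=M)

# Posterior probability that the state's average quality exceeds the
# national average (0 on the latent scale).
p_better = (state_avg > 0.0).mean()

# Expected fraction of the state's hospitals whose quality exceeds the
# national average, integrating over the posterior (hospital sd = 1).
hospital_quality = state_avg + rng.normal(0.0, 1.0, size=M)
frac_superior = (hospital_quality > 0.0).mean()
```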
3. Application: U.S. Hospital Compare Data
3.1. Data
We use a database comprising various performance measures collected from hospitals that volunteered to participate in the Hospital Compare Program (www.qualitynet.org) sponsored by the Centers for Medicare and Medicaid Services (CMS). The dataset includes performance measures from about 4000 US hospitals for care delivered to patients over 65 years old from October 1, 2005 through September 30, 2006. We focus on three clinical conditions: AMI, congestive heart failure (CHF), and community acquired pneumonia (CAP). We report on 14 clinical performance measures (Table 1), which are widely used by CMS and JCAHO for measuring hospital quality.
For each hospital, the dataset includes the number of sampled patients who were admitted and eligible for the therapies (nht) as well as the number of eligible patients who received the therapy (yht). The guidelines for defining the eligibility of a patient are given by the specifications of the CMS and JCAHO. For example, a patient with pneumonia is eligible for smoking cessation advice/counseling if he/she has a history of smoking cigarettes any time during the year prior to hospital arrival. The geographical covariates included in the analysis are the 4 census regions (Northeast (NE), Midwest (MW), South (S), West (W)) and 50 states.
3.2. Model fitting
To stabilize the model estimation, hospitals with more than 30 eligible patients in at least one of the selected measures are included in the IRT analysis for each condition (Teixeira-Pinto and Normand 2008). The numbers of hospitals included for AMI, CHF, and CAP are 2336, 3399, and 3675, respectively. We use the model with the logit link because using the probit link often results in trapping of the Gibbs chain in WinBUGS, as probit models can sometimes suffer from numerical difficulties in posterior sampling (Spiegelhalter et al. 2003).
We use diffuse, proper prior distributions for the model parameters, aiming to obtain objective inferences. We use normal priors with relatively large variances (e.g., N(0, 102), N(0, 103), and N(0, 104)) for the difficulty parameters at and fixed regional effect parameters βr. For the discriminating parameters bt, we use both log-normal (LN) and half-normal (HN) priors to satisfy the constraint that bt > 0 (Fox and Glas 2001). The tested priors include LN(0, 104), LN(0, 103), LN(0, 102), and HN(0, 103). For the random state-effect variance parameter τ2, the priors include the typical inverse-gamma distribution (IG(a, b)) with very small shape parameter a and inverse-scale parameter b (e.g., a = b = 10−4, 10−3, 10−2), so that the prior would have little effect on the posterior distribution. However, such priors tend to support low values of τ. Therefore we also consider τ2 ~ Pareto(.5, .01), so that the implied prior on τ is rather uniform on (0, ∞), and τ ~ HN(0, 0.26), implying that 95% of the prior mass for τ lies between 0 and 1. All of the above priors have been suggested for fitting multilevel models (Spiegelhalter et al. 2004).
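The calibration of the HN(0, 0.26) prior can be verified numerically by Monte Carlo (treating 0.26 as the variance parameter of the half-normal, i.e., the absolute value of a N(0, 0.26) variate):

```python
import numpy as np

rng = np.random.default_rng(5)
# HN(0, 0.26): |N(0, 0.26)| draws; scale = sqrt(0.26) ~ 0.51.
tau_draws = np.abs(rng.normal(0.0, np.sqrt(0.26), size=1_000_000))

# The 95th percentile of the prior for tau is approximately 1, so about
# 95% of the prior mass lies between 0 and 1.
q95 = np.quantile(tau_draws, 0.95)
```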
The parameter estimates are rather insensitive to the prior distributions across a reasonable range of specifications (the sensitivity analysis results for the AMI data are shown in Table 9 in the Appendix). Final model estimates are based on the following prior distributions: N(0, 103) for the at’s, LN(0, 103) for the bt’s, N(0, 103) for the βr’s, and IG(10−3, 10−3) for τ2. The posterior summaries of the parameters are based on a chain of 105 iterations after a burn-in of 105 iterations. We examine the convergence of the Gibbs chain by checking the trace plots as well as the Gelman-Rubin statistic (Gelman and Rubin 1992) from running multiple chains with different initial values.
Table 9:
Sensitivity Analysis of Various Priors for AMI data
| Prior | a1 | a2 | a3 | a4 | a5 | b1 | b2 | b3 | b4 | b5 | β1 | β2 | β3 | τ2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N (0, 104) for at’s | 3.378 | 3.397 | 1.660 | 3.198 | 2.789 | .762 | 1.123 | .611 | 1.087 | .838 | .242 | .181 | −.373 | .0493 |
| N (0, 102) for at’s | 3.388 | 3.412 | 1.668 | 3.213 | 2.801 | .761 | 1.123 | .611 | 1.086 | .837 | .258 | .179 | −.395 | .0539 |
| LN (0, 104) for bt’s | 3.379 | 3.399 | 1.661 | 3.200 | 2.790 | .761 | 1.123 | .611 | 1.086 | .838 | .239 | .179 | −.387 | .0507 |
| LN (0, 102) for bt’s | 3.374 | 3.392 | 1.657 | 3.193 | 2.785 | .760 | 1.121 | .610 | 1.085 | .836 | .247 | .177 | −.384 | .0537 |
| HN (0, 103) for bt’s | 3.386 | 3.409 | 1.667 | 3.211 | 2.799 | .762 | 1.123 | .611 | 1.086 | .838 | .244 | .172 | −.400 | .0494 |
| N (0, 104) for β’s | 3.379 | 3.399 | 1.661 | 3.2 | 2.79 | .761 | 1.123 | .611 | 1.086 | .838 | .239 | .179 | −.387 | .0507 |
| N (0, 102) for β’s | 3.387 | 3.411 | 1.668 | 3.212 | 2.8 | .761 | 1.122 | .610 | 1.085 | .837 | .244 | .184 | −.394 | .0506 |
| IG (10−4, 10−4) for τ2 | 3.385 | 3.409 | 1.667 | 3.209 | 2.798 | .761 | 1.122 | .611 | 1.085 | .837 | .246 | .194 | −.392 | .0533 |
| IG (10−3, 10−3) for τ2 | 3.379 | 3.399 | 1.661 | 3.200 | 2.790 | .761 | 1.123 | .611 | 1.086 | .838 | .239 | .179 | −.387 | .0507 |
| IG(10−2, 10−2) for τ2 | 3.388 | 3.413 | 1.669 | 3.214 | 2.801 | .759 | 1.120 | .609 | 1.083 | .836 | .256 | .187 | −.405 | .0543 |
| Pareto(.5,.01) for τ2 | 3.383 | 3.406 | 1.665 | 3.207 | 2.796 | .761 | 1.123 | .611 | 1.086 | .838 | .265 | .160 | −.394 | .0564 |
| HN (0, .26) for τ | 3.381 | 3.402 | 1.663 | 3.203 | 2.793 | .760 | 1.121 | .610 | 1.084 | .836 | .265 | .160 | −.395 | .0601 |
Note: “LN”, “HN”, and “IG” denote lognormal, half-normal, and inverse-gamma distributions.
We use two Bayesian techniques for model assessment and diagnostics. The first is for model comparison. We fit several IRT models with simpler structure than Equations (1) and (3), the latter referred to as the Hospital+Region+State model. These simpler models include one with no geographical factors (Equation (1) only, referred to as the Hospital-only model); one with only fixed region effects (Equation (1) and dropping γs(r) in Equation (3), referred to as the Hospital+Region model); one with only random state effects (Equation (1) and dropping βr in Equation (3), referred to as the Hospital+(Random)State model); and one including only fixed state effects, referred to as the Hospital+(Fixed)State model. Other simpler models are also considered, including those constraining the discriminating parameters by assuming all bt’s are equal in Equation (1), and one-parameter IRT models (i.e., setting all bt’s to 1 in Equation (1)). The deviance information criterion (DIC) (Spiegelhalter et al. 2002), which summarizes model predictive accuracy, is computed to compare these models. Models with lower DIC values indicate better out-of-sample predictive performance and are generally preferred.
We also fit a model assuming unequal variance of the state averages across regions, i.e., γs(r) ~ N(0, τr2). Across all conditions, the results (not shown) suggest that the variance in the southern states is relatively smaller, while that in the western states is relatively larger, but the credible intervals for the variance estimates overlap. The DIC results do not suggest a significant improvement in model fit, nor does the new model qualitatively change the average quality estimates at the region or state level. We are also concerned that assuming region-specific variances may lead to some over-shrinkage if the state averages are rather close in certain regions. Therefore we retain the equal-variance model as in Equation (3).
Second, we use posterior predictive checking (Gelman, Meng, and Stern 1996) to assess the model fit. The main idea of this method is to compare the observed data with replicated data simulated under the model. Let Yrep represent the vector of replicates of the hospital data yht. The distribution of Yrep given the observed data Y and postulated model is
p(Yrep | Y) = ∫ p(Yrep | Θ)p(Θ | Y) dΘ,  (4)
where Θ is the vector of parameters in the specified model. Sampling from (4), we replicate 1000 datasets and calculate the empirical distributions of several summary statistics characterizing the main features of the data. Denote the summary statistics of the replicated and observed data as T(Yrep) and T(Y), respectively. We compare T(Yrep) and T(Y) by computing Bayesian p-values, estimated as the proportion of times the statistics in the replicated data are more extreme than in the observed data, that is, pB = Pr(T(Yrep) ≥ T(Y) | Y). Bayesian p-values that are close to 0 or 1 (e.g., < 0.01 or > 0.99) indicate poor model fit.
The chosen summary statistics include the mean and different percentiles of the distribution of the raw-weighted average score (RWAS) in the national sample, and those within the samples of different regions and states. The RWAS is defined as RWASh = Σt yht / Σt nht for each hospital h. The choice of the percentiles of the RWAS is motivated by the fact that these percentiles are often used to identify hospitals with superior or inferior performance (Premier Inc. 2007). Additionally, we use posterior predictive checking to analyze hospital fit and measure fit by computing the empirical distributions of Yrep for each hospital and each measure. We report for each measure the number of hospitals that have extreme pB-values and for each hospital the number of measures with extreme pB-values.
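The checking procedure can be sketched as follows, with simulated data standing in for the actual fitted model; in the real analysis each replicate would use a fresh posterior draw of Θ rather than a fixed success probability:

```python
import numpy as np

rng = np.random.default_rng(4)
H, T = 200, 4
n = rng.integers(30, 150, size=(H, T))     # hypothetical eligibility counts
p_true = 0.85                              # hypothetical success probability
y = rng.binomial(n, p_true)                # "observed" data

# RWAS_h = sum_t y_ht / sum_t n_ht for each hospital h.
rwas = y.sum(axis=1) / n.sum(axis=1)
T_obs = np.median(rwas)                    # summary statistic T(Y)

# 1000 replicated datasets and the Bayesian p-value
# p_B = Pr(T(Y_rep) >= T(Y) | Y).
T_rep = np.array([np.median(rng.binomial(n, p_true).sum(axis=1)
                            / n.sum(axis=1)) for _ in range(1000)])
p_B = (T_rep >= T_obs).mean()
# Since the replicates come from the data-generating model, the replicated
# medians should bracket the observed one, indicating no misfit.
```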
3.3. Results
3.3.1. Descriptive statistics
Table 2 shows some of the associated statistics of RWAS for the national and regional samples. The mean level of RWAS is higher for AMI than for CHF and CAP, and the range is also narrower for AMI. Hospitals tend to have better performance data, with overall higher success rates and lower variation, for AMI than for the other two conditions; this might be due to the fact that AMI process measures have been the primary targets for improving the quality of care in recent years. For all three conditions, the RWAS in the midwestern and northeastern regions are similar, and they are generally higher than those in the southern and western regions. Figures 1 and 2 contain box plots illustrating the empirical distributions of RWAS by region and state, respectively. There are considerable geographical variations in RWAS, but this variability is still much less than the inter-hospital variability. In addition, the southern and western regions tend to have more hospitals with relatively low RWAS.
Table 2:
Descriptive statistics of raw-weighted averages scores (RWAS)
| Sample | AMI | CHF | CAP | ||||||
|---|---|---|---|---|---|---|---|---|---|
| N | Mean | Range | N | Mean | Range | N | Mean | Range | |
| National | 2336 | .930 | (.783, .996) | 3399 | .766 | (.397, .972) | 3675 | .838 | (.664, .959) |
| Midwest | 528 | .944 | (.815, .998) | 874 | .791 | (.407, .983) | 985 | .858 | (.683, .969) |
| Northeast | 472 | .941 | (.831, .998) | 557 | .804 | (.510, .971) | 571 | .847 | (.674, .953) |
| South | 872 | .916 | (.770, .994) | 1377 | .752 | (.347, .970) | 1449 | .834 | (.664, .958) |
| West | 464 | .930 | (.738, .994) | 591 | .726 | (.380, .956) | 670 | .811 | (.639, .942) |
Note: N: the number of hospitals; Range: the 2.5% and 97.5% percentiles of RWAS.
Figure 1:

Box plots of hospital raw-weighted average scores (RWAS) by census region.
Figure 2:

Box plots of hospital RWAS by state, sorted by the state-level median.
Table 3 provides the Spearman pairwise correlations among the fractions of the eligible patients receiving therapies (yht/nht) and the composite quality estimates including both the RWAS and latent scores. Within each condition, the selected process measures are positively correlated with each other, and they are also strongly correlated with both the RWAS and latent estimates.
Table 3:
Spearman pairwise correlation coefficients among raw fractions and composite scores (RWAS and θh)
| Acute Myocardial Infarction | (1) | (2) | (3) | (4) | (5) | RWAS | θh |
|---|---|---|---|---|---|---|---|
| Aspirin at arrival | 1 | ||||||
| Aspirin prescribed at discharge | .56 | 1 | |||||
| ACE inhibitor or ARB for LVSD | .35 | .38 | 1 | ||||
| Beta blocker prescribed at discharge | .50 | .66 | .47 | 1 | |||
| Beta blocker at arrival | .65 | .53 | .42 | .64 | 1 | ||
| RWAS | .77 | .79 | .59 | .82 | .84 | 1 | |
| θh | .75 | .82 | .55 | .85 | .83 | .99 | 1 |
| Congestive Heart Failure | (1) | (2) | (3) | (4) | RWAS | θh | |
| Discharge instructions | 1 | ||||||
| LVF assessment | .41 | 1 | |||||
| ACE inhibitor or ARB for LVSD | .36 | .44 | 1 | ||||
| Adult smoking cessation advice/counseling | .50 | .38 | .27 | 1 | |||
| RWAS | .89 | .72 | .52 | .59 | 1 | ||
| θh | .94 | .68 | .47 | .58 | .95 | 1 | |
| Community Acquired Pneumonia | (1) | (2) | (3) | (4) | (5) | RWAS | θh |
| Oxygen assessment | 1 | ||||||
| Pneumococcal vaccination status | .21 | 1 | |||||
| Blood culture before first antibiotic | .07 | .23 | 1 | ||||
| Adult smoking cessation advice/counseling | .07 | .41 | .08 | 1 | |||
| Initial antibiotic received within 4 hrs of arrival | .17 | .38 | .25 | .13 | 1 | ||
| RWAS | .27 | .90 | .31 | .53 | .66 | 1 | |
| θh | .25 | .96 | .27 | .52 | .52 | .97 | 1 |
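As a sketch of how correlations such as those in Table 3 are computed, the snippet below simulates hospital-level data from a two-parameter logistic model and computes the Spearman correlation between one measure's raw fraction and the RWAS. All parameter values and sample sizes here are hypothetical, loosely patterned on the AMI estimates.

```python
# Illustrative sketch (hypothetical data, not the study sample).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_hosp = 200
quality = rng.normal(size=n_hosp)                     # latent hospital quality
a = np.array([3.4, 3.4, 1.7, 3.2, 2.8])               # difficulty, per measure
b = np.array([0.8, 1.1, 0.6, 1.1, 0.8])               # discrimination, per measure

p = 1.0 / (1.0 + np.exp(-(a + b * quality[:, None]))) # success probabilities
n = rng.integers(20, 200, size=(n_hosp, 5))           # eligible patients
y = rng.binomial(n, p)                                # patients receiving therapy

frac = y / n                                          # per-measure raw fractions
rwas = y.sum(axis=1) / n.sum(axis=1)                  # raw-weighted average score

rho, _ = spearmanr(frac[:, 0], rwas)                  # measure (1) vs. RWAS
print(round(rho, 2))
```

Because every measure is driven by the same latent quality, the raw fractions and the composite score are positively correlated, as in Table 3.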
3.3.2. Model estimates
In our analysis, a lower value of the difficulty parameter at indicates that, averaged across all hospitals, a patient has a lower probability of receiving therapy t. Processes that are easier to implement might have larger at, and measures with the highest at's exhibit the strongest “ceiling effects”. The discriminating parameter bt indicates how much the t-th measure contributes to the final latent score: higher values imply a steeper slope for the corresponding logit and therefore a greater ability to discriminate among hospitals. For example, Figure 4 (Appendix) displays the smoothed probability of receiving CHF therapies as a function of hospital latent quality; in item response theory, these logistic curves are known as item characteristic curves (Baker and Kim 2004). Table 10 (Appendix) presents the estimates of the difficulty and discriminating parameters. For AMI, aspirin and β-blocker prescribed at discharge have the higher discriminating power; for CHF, “discharge instructions” has the highest ability (i.e., the largest slope in Figure 4) to discriminate among hospitals; and for CAP, “pneumococcal vaccination status” discriminates hospital quality better than the other measures.
Figure 4: Probability of receiving therapies as a function of the hospital latent quality; measure 1: discharge instructions; 2: LVF assessment; 3: ACE inhibitor; 4: smoking cessation.
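The item characteristic curve can be evaluated directly from the fitted parameters. The sketch below uses the posterior means from Table 10 for two measures: one highly discriminating (“discharge instructions”, CHF) and one with a strong ceiling effect (“oxygen assessment”, CAP); the grid of quality values is illustrative.

```python
# Two-parameter logistic item characteristic curve:
# P(receive therapy t | quality theta) = invlogit(a_t + b_t * theta).
import numpy as np

def icc(theta, a_t, b_t):
    """Probability of receiving therapy t at latent quality theta."""
    return 1.0 / (1.0 + np.exp(-(a_t + b_t * theta)))

theta = np.linspace(-3.0, 3.0, 13)
# Posterior means from Table 10: "discharge instructions" (CHF) ...
p_discharge = icc(theta, a_t=0.333, b_t=1.414)
# ... and "oxygen assessment" (CAP), which has a strong ceiling effect.
p_oxygen = icc(theta, a_t=5.442, b_t=0.638)
print(p_discharge.round(3))
print(p_oxygen.round(3))
```

The first curve rises steeply through the range of hospital quality, while the second stays near 1 for all hospitals, which is why it discriminates poorly.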
Table 10:
Posterior means and 95% credible intervals (CI) for the difficulty and discriminating parameters
| Measure | at | 95% CI | bt | 95% CI |
|---|---|---|---|---|
| AMI | ||||
| Aspirin at arrival | 3.371 | (3.305, 3.451) | .760 | (.727, .794) |
| Aspirin prescribed at discharge | 3.387 | (3.291, 3.504) | 1.121 | (1.076, 1.167) |
| ACE or ARB for LVSD | 1.655 | (1.601, 1.719) | .610 | (.579, .641) |
| β-blocker prescribed at discharge | 3.189 | (3.095, 3.302) | 1.084 | (1.044, 1.127) |
| β-blocker at arrival | 2.782 | (2.711, 2.869) | .836 | (.803, .871) |
| CHF | ||||
| Discharge instructions | .333 | (.263, .408) | 1.414 | (1.376, 1.453) |
| LVF assessment | 2.377 | (2.333, 2.424) | .860 | (.836, .885) |
| ACE or ARB for LVSD | 1.570 | (1.546, 1.596) | .457 | (.440, .475) |
| Adult smoking cessation advice | 1.920 | (1.863, 1.980) | 1.090 | (1.054, 1.128) |
| CAP | ||||
| Oxygen assessment | 5.442 | (5.365, 5.504) | .638 | (.605, .672) |
| Pneumococcal vaccination status | .860 | (.742, .939) | 1.099 | (1.071, 1.127) |
| Blood culture before first antibiotic | 2.215 | (2.180, 2.242) | .286 | (.268, .305) |
| Adult smoking cessation advice | 1.742 | (1.662, 1.798) | .734 | (.710, .759) |
| Initial antibiotic received | 1.252 | (1.213, 1.277) | .342 | (.332, .353) |
Multilevel IRT models further allow us to quantify the variation of hospital quality across different levels of geographical units. Table 4 presents the estimates of the variance components from Equation (3). For the fixed region effects, the variance is estimated by the hospital-weighted average of the squared βr's, Σr nrβr²/N, where nr is the number of hospitals in region r and N is the total number of hospitals. This tends to slightly overestimate the region contribution because some sampling error is incorporated into the region component. The variance of the random state effects is estimated by τ², and the variance of the residual hospital effects is fixed at 1. The proportion of variation explained by each source is estimated as the ratio of its variance to the sum of the variances from all sources. For example, for AMI the proportion of variation explained by the region effects is .072/(.072 + .051 + 1) = 6.4%, and that by the state effects is .051/(.072 + .051 + 1) = 4.5%. For all three conditions, the geographical variation (including both region and state) of hospital quality accounts for slightly more than 10% of the total variation.
Table 4:
Analysis of geographical variations in hospital quality
| Source of variation | Variance | AMI Est. | AMI Prop. explained | CHF Est. | CHF Prop. explained | CAP Est. | CAP Prop. explained |
|---|---|---|---|---|---|---|---|
| Region | Σr nrβr²/N | .072 | 6.4% | .053 | 4.7% | .036 | 3.2% |
| State | τ² | .051 | 4.5% | .072 | 6.4% | .084 | 7.5% |
| Hospital | σ² | 1 | 89.1% | 1 | 88.9% | 1 | 89.3% |
Note: nr is the number of hospitals in region r, and N is the total number of hospitals in the sample.
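The proportions in Table 4 follow directly from the variance-components arithmetic described above; a minimal sketch, using the AMI estimates:

```python
# Proportion of variation explained by each source = its variance over
# the total (region + state + hospital), as in Table 4.
def proportions(var_region, var_state, var_hospital=1.0):
    total = var_region + var_state + var_hospital
    return {"region": var_region / total,
            "state": var_state / total,
            "hospital": var_hospital / total}

ami = proportions(0.072, 0.051)   # AMI variance estimates from Table 4
print({k: round(v, 3) for k, v in ami.items()})
```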
Table 5 presents posterior estimates of βr’s, which can be used to represent the average hospital quality in different census regions. It also provides the estimates of regional differences, using the northeastern region as the reference group. For all three conditions, hospital performances in the Midwest are not significantly different from those in the Northeast, yet those in either the South or West of the U.S. are generally worse. These results are consistent with those shown by descriptive statistics of RWAS.
Table 5:
Regional averages of hospital quality and their differences
| Region | AMI | 95% CI | CHF | 95% CI | CAP | 95% CI |
|---|---|---|---|---|---|---|
| Midwest (MW) | .255 | (.104, .412) | .204 | (.039, .384) | .250 | (.085, .400) |
| Northeast (NE) | .163 | (−.022, .308) | .290 | (.140, .438) | .179 | (−.014, .365) |
| South (S) | −.386 | (−.535, −.232) | −.102 | (−.242, .035) | −.146 | (−.287, 0) |
| West (W) | −.032 | (−.194, .113) | −.392 | (−.566, −.261) | −.282 | (−.420, −.155) |
| MW-NE | .092 | (−.157, .389) | −.086 | (−.336, .206) | .071 | (−.212, .361) |
| S-NE | −.549 | (−.783, −.294) | −.393 | (−.585, −.165) | −.325 | (−.587, −.017) |
| W-NE | −.196 | (−.432, .102) | −.682 | (−.974, −.462) | −.461 | (−.751, −.209) |
Note: MW-NE/S-NE/W-NE: the difference between Midwest/South/West and Northeast regions.
For each state, the average performance can be quantified as in Equation (3). Figure 3 plots the posterior means of the state-level averages and their corresponding 95% credible intervals. It demonstrates appreciable variation in overall hospital quality across states for these conditions, as can be seen from the fact that the 95% credible intervals for some states do not overlap (e.g., New Hampshire vs. Georgia for AMI, Maine vs. Washington for CHF, and Indiana vs. Louisiana for CAP). Judged solely by posterior means, the states with the best overall process performance for AMI, CHF, and CAP are Iowa, New Jersey, and New Jersey, respectively, while the corresponding states with the worst performance are Mississippi, New Mexico, and California. To account for the uncertainty in the posterior means, we can also compare state performance using the coverage probability of the posterior intervals (Teixeira-Pinto and Normand 2008). Table 6 presents the posterior probability that the average hospital quality of each state is greater than the national average (i.e., P(βr(s) + γs > 0 | y)). A state with a large probability has good average hospital quality (e.g., Massachusetts for AMI and CHF). On the other hand, there are a dozen states with posterior probabilities close to 0, indicating their average hospital quality is almost surely below the national average (e.g., Alabama for all 3 conditions).
Figure 3: Posterior estimates of the state-average hospital quality, sorted by posterior mean. The national average is set to 0.
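A coverage probability of the kind reported in Table 6 is simply the fraction of MCMC draws in which the state-level average exceeds the national average of 0. The sketch below uses hypothetical posterior draws of the region effect βr and state effect γs, not the study's output.

```python
# Sketch: P(state average > national average) from MCMC output.
# The draws below are hypothetical stand-ins for draws of beta_r, gamma_s.
import numpy as np

rng = np.random.default_rng(1)
n_draws = 4000
beta_r = rng.normal(0.2, 0.10, n_draws)    # hypothetical region-effect draws
gamma_s = rng.normal(0.1, 0.15, n_draws)   # hypothetical state-effect draws

state_avg = beta_r + gamma_s               # state-level average, per draw
p_better = float(np.mean(state_avg > 0))   # national average is set to 0
print(round(p_better, 2))
```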
Table 6:
Posterior probability of state-average better than national average and relative number of top 10% hospitals within each state
| State | AMI: P(better than national avg.) | AMI: fraction of top 10% | CHF: P(better than national avg.) | CHF: fraction of top 10% | CAP: P(better than national avg.) | CAP: fraction of top 10% |
|---|---|---|---|---|---|---|
| AK | .40 | .00 | .03 | .00 | .08 | .00 |
| AL | .00 | .07 | .01 | .05 | .03 | .12 |
| AR | .00 | .00 | .05 | .07 | .13 | .06 |
| AZ | .06 | .09 | .01 | .08 | .00 | .08 |
| CA | .00 | .10 | .00 | .07 | .00 | .03 |
| CO | 1 | .18 | .05 | .03 | .53 | .07 |
| CT | .95 | .08 | .95 | .07 | .88 | .17 |
| DE | .10 | .20 | .46 | .20 | .30 | .00 |
| FL | .00 | .07 | .94 | .15 | .00 | .07 |
| GA | .00 | .02 | .00 | .04 | .00 | .06 |
| HI | .04 | .08 | .00 | .00 | .00 | .00 |
| IA | 1 | .32 | 1 | .20 | 1 | .28 |
| ID | .75 | .25 | .19 | .09 | .41 | .17 |
| IL | .35 | .12 | .97 | .11 | .00 | .08 |
| IN | .77 | .15 | .88 | .16 | .99 | .16 |
| KS | .91 | .15 | .00 | .04 | .33 | .13 |
| KY | .00 | .02 | .00 | .08 | .74 | .09 |
| LA | .04 | .10 | .11 | .11 | .00 | .05 |
| MA | 1 | .24 | .97 | .10 | .46 | .07 |
| MD | .00 | .00 | .78 | .04 | .12 | .02 |
| ME | .88 | .11 | .97 | .15 | .93 | .13 |
| MI | .99 | .16 | 1 | .14 | .90 | .15 |
| MN | .99 | .21 | .10 | .09 | .98 | .18 |
| MO | .67 | .06 | .51 | .12 | .83 | .12 |
| MS | .00 | .04 | .00 | .03 | .02 | .06 |
| MT | .76 | .22 | .03 | .07 | .93 | .25 |
| NC | .01 | .09 | .74 | .09 | .88 | .12 |
| ND | .97 | .25 | .97 | .17 | .97 | .27 |
| NE | .94 | .12 | .93 | .04 | 1 | .24 |
| NH | .93 | .21 | .87 | .11 | .90 | .17 |
| NJ | .99 | .19 | 1 | .32 | 1 | .32 |
| NM | .20 | .00 | .00 | .00 | .00 | .00 |
| NV | .11 | .06 | .04 | .11 | .00 | .05 |
| NY | .24 | .07 | .71 | .07 | .02 | .04 |
| OH | .88 | .12 | 1 | .21 | .99 | .13 |
| OK | .01 | .13 | .26 | .12 | .99 | .20 |
| OR | .59 | .00 | .00 | .03 | .00 | .00 |
| PA | .08 | .07 | .22 | .08 | .63 | .09 |
| RI | .71 | .11 | .93 | .10 | .52 | .10 |
| SC | .04 | .09 | .61 | .06 | .76 | .02 |
| SD | .96 | .40 | .99 | .44 | 1 | .39 |
| TN | .00 | .06 | .12 | .09 | .83 | .14 |
| TX | .00 | .09 | .00 | .09 | .00 | .10 |
| UT | .27 | .08 | .20 | .05 | .14 | .06 |
| VA | .00 | .07 | .60 | .08 | .03 | .04 |
| VT | .78 | .00 | .93 | .00 | .91 | .00 |
| WA | .68 | .10 | .00 | .04 | .03 | .04 |
| WI | .99 | .13 | 1 | .15 | 1 | .18 |
| WV | .00 | .11 | .90 | .20 | .28 | .07 |
| WY | .41 | .00 | .00 | .00 | .28 | .00 |
The model estimates reveal unequal distributions of hospital quality across geographical units. Consequently, hospitals classified as top- or bottom-performing in the national sample are likely to be concentrated in some geographical areas. We determine the posterior probability that each hospital places in the top 10% of the national sample, that is, P(θrsh > θ90th | y), where θ90th is the 90th percentile of the θrsh's. We sum these probabilities within each geographical unit, that is, Σh P(θrsh > θ90th | y), as an indicator of the relative number of superior hospitals for that unit. Unlike the state-average performance indicated by P(βr(s) + γs > 0 | y), this metric summarizes the quality of an area at the higher end of the distribution. As expected, the southern and western regions have fewer top-performing hospitals than the Midwest and Northeast (results not shown). Table 6 also presents the fractions of superior hospitals by state. There are quite a few states for which the estimated relative number of top hospitals is close to 0 (e.g., Arkansas, New Mexico, and Wyoming for all three conditions), indicating that their top hospitals have little chance of being ranked among the nation's best.
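The top-10% summary described above can be sketched as follows. The posterior draws of hospital quality and the state labels are simulated here, not taken from the study.

```python
# Sketch: per-hospital posterior probability of ranking in the national
# top 10%, then summed within state (hypothetical draws and labels).
import numpy as np

rng = np.random.default_rng(2)
n_draws, n_hosp = 1000, 500
# Hypothetical posterior draws of hospital quality (draws x hospitals).
theta = rng.normal(0.0, 1.0, (n_draws, n_hosp)) + rng.normal(0.0, 0.3, n_hosp)

# 90th percentile of hospital quality within each posterior draw.
cutoff = np.quantile(theta, 0.9, axis=1)[:, None]
p_top = np.mean(theta > cutoff, axis=0)          # P(top 10%) per hospital

state = rng.integers(0, 50, n_hosp)              # hypothetical state labels
expected_top = np.bincount(state, weights=p_top, minlength=50)
print(round(expected_top.sum(), 1))              # total ~ 10% of hospitals
```

Summing `expected_top` across states recovers roughly 10% of the sample, so states whose sums are near 0 contribute almost none of the nation's top-ranked hospitals.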
3.3.3. Model diagnostics
Table 7 shows the DIC results from two-parameter models with different specifications. The Hospital+(Fixed) State model has an apparently larger DIC than the others and is not preferred; this might be due to the penalty caused by the extra state fixed-effect parameters. Otherwise, the models yield very similar DICs, demonstrating the closeness of their out-of-sample predictive power for yht. Because region and state do not constitute a major source of variation in hospital quality (Table 4), the predicted values from models fully or partially ignoring these factors do not differ much from those of the model accounting for them. However, the Hospital+Region+State model allows us to explicitly characterize the geographical variations in hospital quality and to provide average quality estimates for different geographical units. On the other hand, two-parameter models with equal discriminating parameters and one-parameter models show much worse fit (results not shown). For example, the Hospital-only model with equal discriminating parameters yields a DIC of 60334 for AMI, significantly larger than that of any of the two-parameter models for AMI in Table 7.
Table 7:
Deviance information criterion for various models
| Model | AMI | CHF | CAP |
|---|---|---|---|
| Hospital only | 58800 | 142518 | 139239 |
| Hospital+Region | 58801 | 142525 | 139237 |
| Hospital+(Random) State | 58801 | 142527 | 139235 |
| Hospital+(Fixed) State | 58851 | 142733 | 139363 |
| Hospital+Region+State | 58797 | 142529 | 139238 |
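For reference, the DIC (Spiegelhalter et al. 2002) combines the posterior mean deviance with a complexity penalty: DIC = D̄ + pD, where pD = D̄ − D(θ̄). A minimal sketch for a single binomial observation, with hypothetical posterior draws standing in for MCMC output:

```python
# DIC sketch: Dbar is the posterior mean deviance, Dhat the deviance at
# the posterior mean, pD = Dbar - Dhat, DIC = Dbar + pD. Hypothetical data.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
y, n = 85, 100                          # hypothetical data: 85 of 100 treated
p_draws = rng.beta(86, 16, 2000)        # hypothetical posterior draws of p

dev = -2.0 * binom.logpmf(y, n, p_draws)         # deviance at each draw
dbar = dev.mean()                                # posterior mean deviance
dhat = -2.0 * binom.logpmf(y, n, p_draws.mean()) # deviance at posterior mean
pd = dbar - dhat                                 # effective no. of parameters
dic = dbar + pd
print(round(dic, 1))
```

Since the deviance is convex in p here, pD is positive, roughly one effective parameter for this single-probability model.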
The general pattern of the posterior predictive checking results is that the mean of the replicated statistics is close to the observed value, yet some of the Bayesian p-values can be extreme (i.e., outside (.01, .99)). Table 11 in the Appendix lists the checking results using the mean and percentiles of the RWAS from the national sample as the checking statistics. Counting the number of extreme pB-values, the model seems to fit the AMI data reasonably well, while it performs worse for the CHF and CAP data. The checking results (not shown) for the RWAS statistics using regional or state samples yield similar patterns. Table 8 presents the percentage of hospitals at each measure with extreme pB-values for yht, as an indication of measure fit. For the CHF and CAP data, the model fits some measures better than others. Regarding hospital fit for AMI, 95.9% of the hospitals have at most 1 measure with extreme pB-values, and less than 1% have more than 2 such measures. For CHF, 72.8% of the hospitals have at most 1 measure with extreme pB-values, and less than 2% have extreme pB-values for all 4 measures. For CAP, 78.1% of the hospitals have at most 1 measure with extreme pB-values, and less than 1% have extreme pB-values for all 4 measures.
Table 11:
Posterior predictive checking results using statistics of the national sample
| Statistic | AMI: T(Y) | AMI: mean T(Yrep) | AMI: pB | CHF: T(Y) | CHF: mean T(Yrep) | CHF: pB | CAP: T(Y) | CAP: mean T(Yrep) | CAP: pB |
|---|---|---|---|---|---|---|---|---|---|
| Mean | .9301 | .9307 | .896 | .7678 | .7742 | 1 | .8397 | .8373 | 0 |
| 5th percentile | .8182 | .8177 | .467 | .4914 | .5117 | 1 | .7013 | .6968 | .012 |
| 10th percentile | .8571 | .8571 | .556 | .5841 | .5920 | .998 | .7373 | .7348 | .051 |
| 15th percentile | .8802 | .8805 | .573 | .6337 | .6389 | .996 | .7610 | .7596 | .156 |
| 20th percentile | .8978 | .8968 | .22 | .6645 | .6713 | .999 | .7798 | .7779 | .044 |
| 25th percentile | .9085 | .9089 | .632 | .6909 | .6979 | 1 | .7940 | .7929 | .144 |
| 30th percentile | .9164 | .9185 | .982 | .7138 | .7211 | 1 | .8065 | .8059 | .245 |
| 35th percentile | .9250 | .9268 | .983 | .7345 | .7418 | 1 | .8187 | .8174 | .08 |
| 40th percentile | .9326 | .9339 | .967 | .7532 | .7607 | 1 | .8300 | .8279 | .006 |
| 45th percentile | .9387 | .9403 | .985 | .7733 | .7786 | 1 | .84 | .8376 | .003 |
| 50th percentile | .9451 | .9462 | .929 | .7914 | .7959 | 1 | .8489 | .8471 | .012 |
| 55th percentile | .95 | .9516 | .997 | .8079 | .8124 | 1 | .8577 | .8562 | .022 |
| 60th percentile | .955 | .9568 | .999 | .8234 | .8285 | 1 | .8669 | .8650 | .006 |
| 65th percentile | .9607 | .9618 | .97 | .8402 | .8447 | 1 | .875 | .8740 | .137 |
| 70th percentile | .9662 | .9664 | .665 | .8569 | .8609 | 1 | .8857 | .8832 | .001 |
| 75th percentile | .9703 | .9709 | .831 | .8732 | .8777 | 1 | .8957 | .8928 | 0 |
| 80th percentile | .9749 | .9752 | .742 | .8924 | .8952 | .992 | .9056 | .9028 | 0 |
| 85th percentile | .9788 | .9798 | .979 | .9106 | .9128 | .986 | .9159 | .9134 | .001 |
| 90th percentile | .9850 | .9846 | .26 | .9299 | .9322 | .975 | .9296 | .9256 | 0 |
| 95th percentile | .9917 | .9904 | .003 | .9571 | .9557 | .102 | .9458 | .9419 | 0 |
Note: T(Y): statistic of the observed data; mean T(Yrep): the mean of the statistic over the replicates; pB: Bayesian p-value.
Table 8:
Percentage of hospitals with extreme pB-values at each measure
| Measure | % |
|---|---|
| AMI | |
| Aspirin at arrival | 4.8 |
| Aspirin prescribed at discharge | 5.1 |
| ACE or ARB for LVSD | 6.5 |
| β-blocker prescribed at discharge | 1.5 |
| β-blocker at arrival | 5.1 |
| CHF | |
| Discharge instructions | 11.5 |
| LVF assessment | 41.0 |
| ACE or ARB for LVSD | 23.1 |
| Adult smoking cessation advice | 23.9 |
| CAP | |
| Oxygen assessment | 13.0 |
| Pneumococcal vaccination status | 3.5 |
| Blood culture before first antibiotic | 11.7 |
| Adult smoking cessation advice | 31.7 |
| Initial antibiotic received | 32.8 |
Most of the extreme pB-values occur for the cases where the raw fractions are either very high (e.g., close to 1) or very low, suggesting that the underlying normality assumptions in Equation (3) might be less capable of accommodating these extreme values. The quantile-quantile plots (not shown) of posterior means of γs(r) and ϵh(s) indicate that the normality assumptions seem to be questionable for CHF and CAP data. Improvement might be made by assuming a more flexible distribution (e.g., t-distribution or skewed normal/t-distribution) for the random-effects or error components in Equation (3) (Lee and Thompson 2008).
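A Bayesian p-value of the kind used in this section is the posterior probability that the checking statistic of a replicated dataset is at least as large as the observed one. The sketch below uses a hypothetical posterior and a raw fraction as the checking statistic.

```python
# Posterior predictive check sketch: pB = P(T(Y_rep) >= T(Y) | Y),
# with one replicate drawn per posterior draw. Hypothetical data.
import numpy as np

rng = np.random.default_rng(4)
y_obs, n = 90, 100                      # observed: 90 of 100 treated
p_draws = rng.beta(91, 11, 2000)        # hypothetical posterior draws of p

y_rep = rng.binomial(n, p_draws)        # one replicate per posterior draw
p_b = float(np.mean(y_rep / n >= y_obs / n))   # checking statistic: fraction
extreme = p_b < 0.01 or p_b > 0.99      # flagged as extreme, as in Sec. 3.3.3
print(round(p_b, 2), extreme)
```

Here the posterior is consistent with the observed fraction, so pB lands comfortably inside (.01, .99); a hospital whose observed fraction sits far in a tail of the posterior predictive distribution would be flagged.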
4. Discussion
We study the geographical variations in the process dimension of hospital quality using multilevel item response theory models. A major advantage of this approach is that the model can summarize information from multiple quality measures into a single latent quality score and provide estimates of the hospital performance for different levels of geographical units, accounting for the multilevel structure of the data. Using a fully Bayesian approach, we can compare hospital quality within and across geographical units. Model inferences are informative in reflecting the patterns of health care at both the national and local settings, and might be useful for both the federal and local health policy makers in their joint effort in improving the quality of hospitals in the nation.
The presence of geographical variations in the quality of care is a universal pattern in the American health care system. Consistent with past literature, our results demonstrate considerable geographical variations in hospital performance, indexed by the use of evidence-based therapies for several acute conditions, with census regions and states as the main geographical units. However, a large part of the residual variation in hospital quality persists, which might be associated with hospital-level factors such as volume and availability of specialists. Our results also show that overall hospital quality in some geographical units is clearly different from that in others, implying that interventions focused on the quality of hospitals in a few regions of the country (where a disproportionate share of low-performing hospitals are located) could effectively improve the overall quality of hospital care in the nation. In pay-for-performance initiatives, top hospitals in states with overall worse quality are unlikely to receive funding support from the current reward programs because they can rarely be ranked among the nation's top hospitals. Should the mechanism for allocating funding then be adjusted to account for such geographical variations?
Given the existing geographical variations in hospital quality, a central policy question involves the identification of concrete causes and, if possible, the implementation of policy to reduce the disparities. Future substantive research should involve the identification and collection of variables reflecting the different patterns of care, availability of resources, and rules and regulations across geographical units as potential explanatory variables for the geographical variation. These variables must also be actionable, so that intervention policy can be implemented.
From a methodological perspective, our study has several limitations that deserve future research. The analysis does not include patient-level characteristics or other hospital factors because of the limited information in the dataset; this might reduce the predictive power of the model. Another limitation is that we perform the analysis in a cross-sectional manner, although recent literature demonstrates steady improvements in hospital quality over time (Williams et al. 2005). Future research could examine the temporal trend of geographical variations in hospital performance using extensions of the multilevel IRT models for longitudinal data. Another extension involves relaxing the distributional assumptions of the multilevel IRT models to better accommodate extreme values of performance data. Finally, comparing IRT models with other types of models for multiple categorical variables (e.g., multivariate binomial regressions) in terms of model fit would be of interest.
Acknowledgments
The authors thank the Colorado Medical Foundation for their assistance with preparation of the data and Dr. Harlan Krumholz for his constructive comments. The work was supported in part by grant MH54693 from the National Institute of Mental Health.
Appendix
The following shows that specifying the value of σ² (the hospital-level variance) imposes an identifiability constraint on the model and does not change the statistical inferences. To see this, suppose c is a positive constant; from Equations (1) and (2) we have
logit(pht) = at + btθh = at + (bt/c)(cθh) ≡ at + bt′θh′, (5)
where bt′ = bt/c and θh′ = cθh.
If we set σ′² = 10, which implies c² = 10, then bt′ = bt/√10, βr′ = √10βr, γs(r)′ = √10γs(r), and τ′² = 10τ². Clearly, the new parameter estimates are proportionally rescaled but do not qualitatively change the statistical inferences (e.g., the variation explained by different geographical units as in Table 4).
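A quick numeric check of this rescaling argument, with illustrative parameter values:

```python
# Rescaling the latent score by c while dividing b_t by c leaves the
# logits, and hence the likelihood, unchanged. Parameter values are
# illustrative only.
import numpy as np

c = np.sqrt(10.0)                       # sigma'^2 = 10 implies c^2 = 10
a_t, b_t = 2.5, 0.9                     # example item parameters
theta = np.linspace(-2.0, 2.0, 5)       # example hospital quality scores

logit_orig = a_t + b_t * theta
logit_scaled = a_t + (b_t / c) * (c * theta)
print(np.allclose(logit_orig, logit_scaled))
```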
The following Table 9 lists the parameter estimates (posterior means) under the different priors examined for the multilevel IRT model of AMI. Our primary analyses use the prior distributions N(0, 10³) for at, LN(0, 10³) for bt, N(0, 10³) for βr, and IG(10⁻³, 10⁻³) for τ², as highlighted in Table 9. We change the prior for only one set of parameters at a time, leaving the others fixed. For example, the model in the first row of the table has prior N(0, 10⁴) for at, but the priors for the other parameters remain LN(0, 10³) for bt, N(0, 10³) for βr, and IG(10⁻³, 10⁻³) for τ².
The parameter estimates for the at's and bt's appear reasonably stable under the various choices of prior distributions. There are some differences in the β's and τ², but they are minor. Overall, the results suggest that the considered multilevel IRT models are rather insensitive to these priors. Further, the geographical variations (Table 4 in the main text) and the predictions of hospital- and state-level performance (Figure 3 and Table 6 in the main text) do not qualitatively change under the different priors. The parameter estimates for the multilevel models of the CHF and CAP measures are also stable under the prior distributions used in the sensitivity analysis (results not shown).
The following describes a sample of the AMI data and the WinBUGS code used to fit the multilevel IRT model (the Hospital+Region+State model, Equations (1) and (3)). The complete data include 2336 hospitals and 5 measures for AMI. In Table 12, the variable hosp identifies the hospital and the variable measure identifies the measure. The variable hnumer is the number of eligible patients who received the therapy associated with each measure, and hdenom is the total number of patients in a hospital who are eligible for a given measure. The variables region2-region4 are 3 dummy variables indicating the region in which the hospital is located, and state indicates the state.
Table 12:
AMI Data
| hosp | measure | hnumer | hdenom | region2 | region3 | region4 | state |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 245 | 258 | 1 | 0 | 0 | 7 |
| 2 | 1 | 180 | 191 | 1 | 0 | 0 | 7 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1000 | 1 | 70 | 71 | 0 | 1 | 0 | 48 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1000 | 3 | 28 | 34 | 0 | 1 | 0 | 48 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2336 | 5 | 102 | 113 | −1 | −1 | −1 | 50 |
WinBUGS code:
model {
for( i in 1 : M ) {
hnumer[i] ~ dbin(p[hosp[i],measure[i]],hdenom[i]);
logit(p[hosp[i],measure[i]]) <- a[measure[i]]+b[measure[i]]*
(beta[1]*region2[hosp[i]]+beta[2]*region3[hosp[i]]+beta[3]*region4[hosp[i]]+gamma[state[i]]+epsilon[hosp[i]]);
}
for(j in 1:N) {
epsilon[j] ~ dnorm(0,1);
}
for(m in 1:W) {
gamma[m] ~ dnorm(0, tau_gamma);
}
for (k in 1:S) {
a[k] ~ dnorm(0, .001);
b[k] ~ dlnorm(0, .001);
}
for (l in 1:T) {
beta[l] ~ dnorm(0, .001);
}
tau_gamma ~ dgamma(.001, .001);
sigma_gamma <- 1/tau_gamma;
}
DATA list(M = 11656, W=50, N=2336, S=5, T=3)
INITIAL VALUES list(a=c(1,1,1,1,1), b=c(0.5,0.5,0.5,0.5,0.5), beta=c(−0.1,−0.2,−0.2), tau_gamma=10)
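As a cross-check (not the authors' code), the same logit and binomial likelihood can be evaluated in Python. The difficulty and discrimination values below are the AMI posterior means from Table 10, but the region, state, and hospital effects passed in are hypothetical.

```python
# Sketch: evaluating the multilevel IRT binomial log-likelihood for one
# hospital's records, mirroring the WinBUGS logit above. The random-effect
# values supplied are hypothetical.
import numpy as np
from scipy.stats import binom

a = np.array([3.371, 3.387, 1.655, 3.189, 2.782])   # difficulty (Table 10)
b = np.array([0.760, 1.121, 0.610, 1.084, 0.836])   # discrimination (Table 10)

def loglik(hnumer, hdenom, measure, region_eff, state_eff, hosp_eff):
    """Binomial log-likelihood of a hospital's records.

    measure holds 0-based measure indices; the *_eff arguments are that
    hospital's beta_r, gamma_s, and epsilon_h components."""
    eta = a[measure] + b[measure] * (region_eff + state_eff + hosp_eff)
    p = 1.0 / (1.0 + np.exp(-eta))
    return float(binom.logpmf(hnumer, hdenom, p).sum())

# First record of Table 12 (hospital 1, measure 1: 245 of 258 treated),
# with hypothetical effects beta_r = 0.16, gamma_s = 0.05, epsilon_h = 0.
ll = loglik(np.array([245]), np.array([258]), np.array([0]), 0.16, 0.05, 0.0)
print(round(ll, 2))
```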
Contributor Information
Yulei He, Department of Health Care Policy, Harvard Medical School, 180 Longwood Avenue, Boston, MA, 02115.
Robert E. Wolf, Department of Health Care Policy, Harvard Medical School, 180 Longwood Avenue, Boston, MA, 02115.
Sharon-Lise T. Normand, Department of Health Care Policy, Harvard Medical School; Department of Biostatistics, Harvard School of Public Health, 180 Longwood Ave, Boston, MA, 02115.
References
- [1] Aguilar O and West M (1998), “Analysis of hospital quality monitors using hierarchical time series models”, in Gatsonis C, Kass RE, Carlin B, Carriquiry A, Gelman A, Verdinelli I, and West M (eds.), Case Studies in Bayesian Statistics, Vol. IV, Springer: New York, 287–302.
- [2] Baker FB and Kim SH (2004), Item Response Theory: Parameter Estimation Techniques, Marcel Dekker, New York.
- [3] Daniels MJ and Gatsonis C (1999), “Hierarchical generalized linear models in the analysis of variations in health care utilization”, Journal of the American Statistical Association, 94, 29–42.
- [4] Donabedian A (1966), “Evaluating the quality of medical care”, Milbank Memorial Fund Quarterly, 44, 166–203.
- [5] Draper D and Gittoes M (2004), “Statistical analysis of performance indicators in UK higher education”, Journal of the Royal Statistical Society, Series A (Statistics in Society), 167, 449–474.
- [6] Fox JP and Glas CAW (2001), “Bayesian estimation of a multilevel IRT model using Gibbs sampling”, Psychometrika, 66, 269–286.
- [7] Gelman A, Meng XL, and Stern HS (1996), “Posterior predictive assessment of model fitness via realized discrepancies (with discussion)”, Statistica Sinica, 6, 733–807.
- [8] Gelman A and Rubin DB (1992), “Inference from iterative simulation using multiple sequences (with discussion)”, Statistical Science, 7, 457–511.
- [9] Goldstein H and Spiegelhalter DJ (1996), “League tables and their limitations: statistical issues in comparisons of institutional performance (with discussion)”, Journal of the Royal Statistical Society, Series A (Statistics in Society), 159, 385–443.
- [10] Institute of Medicine (2006), Performance Measurement: Accelerating Improvement, Washington, DC: The National Academies Press.
- [11] Jencks SF, Cuerdon T, Burwen DR, et al. (2000), “Quality of medical care delivered to Medicare beneficiaries: a profile at state and national levels”, Journal of the American Medical Association, 284, 1670–1676.
- [12] Johnson VE and Albert JH (1999), Ordinal Data Modeling, Springer-Verlag, New York.
- [13] Kaul P and Peterson ED (2007), “The cardiovascular world is definitely not flat”, Circulation, 115, 158–160.
- [14] Krumholz HM, Chen J, Rathore SS, et al. (2003), “Regional variation in the treatment and outcomes of myocardial infarction: investigating New England's advantage”, American Heart Journal, 146, 242–249.
- [15] Krumholz HM, Radford MJ, Wang Y, et al. (1998), “National use and effectiveness of β-blockers for the treatment of elderly patients after acute myocardial infarction”, Journal of the American Medical Association, 280, 623–629.
- [16] Laird NM and Ware JH (1982), “Random-effects models for longitudinal data”, Biometrics, 38, 963–974.
- [17] Landrum MB, Bronskill SE, and Normand SLT (2000), “Analytic methods for constructing cross-sectional profiles of health care providers”, Health Services and Outcomes Research Methodology, 1, 23–47.
- [18] Lee KJ and Thompson SG (2008), “Flexible parametric models for random-effects distributions”, Statistics in Medicine, 27, 418–434.
- [19] Nolan T and Berwick DM (2006), “All-or-none measurement raises the bar on performance”, Journal of the American Medical Association, 295, 1168–1170.
- [20] Normand SLT, Glickman M, and Gatsonis CA (1997), “Statistical methods for profiling providers of medical care: issues and applications”, Journal of the American Statistical Association, 92, 803–814.
- [21] Normand SLT and Shahian DM (2007), “Statistical and clinical aspects of hospital outcomes profiling”, Statistical Science, 22, 206–226.
- [22] Normand SLT, Wolf RE, and McNeil BJ (2008), “Discriminating quality of hospital care in the United States”, Medical Decision Making, 28, 308–322.
- [23] O'Brien SM, Shahian DM, DeLong ER, Normand SLT, Edwards FH, Ferraris VA, Haan CK, Rich JB, Shewan CM, Dokholyan RS, Anderson RP, and Peterson ED (2007), “Quality measurement in adult cardiac surgery: part 2, statistical considerations in composite measure scoring and provider rating”, Annals of Thoracic Surgery, 83, S13–S26.
- [24] O'Connor GT, Quinton HB, Traven ND, et al. (1999), “Geographical variation in the treatment of acute myocardial infarction: the Cooperative Cardiovascular Project”, Journal of the American Medical Association, 281, 627–633.
- [25] Premier, Inc. (2007), “Centers for Medicare and Medicaid Services (CMS)/Premier Hospital Quality Incentive Demonstration project: findings from year 2”, accessed from http://www.premierinc.com/quality-safety/tools-services/p4p/hqi/resources/hqi-whitepaper-year2.pdf.
- [26] Raudenbush SW and Bryk AS (2002), Hierarchical Linear Models: Applications and Data Analysis Methods, Newbury Park, CA: Sage Publications.
- [27] Shwartz M, Ren J, Pekoz EA, Wang X, Cohen AB, and Restuccia JD (2008), “Estimating a composite measure of hospital quality from the Hospital Compare database: differences when using a Bayesian hierarchical latent variable model versus denominator-based weights”, Medical Care, 46, 778–785.
- [28] Skrondal A and Rabe-Hesketh S (2004), Generalized Latent Variable Modeling, Chapman & Hall/CRC.
- [29] Snijders TAB and Bosker RJ (1999), Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Newbury Park, CA: Sage Publications.
- [30] Spiegelhalter DJ, Best NG, Carlin BP, and van der Linde A (2002), “Bayesian measures of model complexity and fit”, Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64, 583–639.
- [31] Spiegelhalter DJ, Thomas A, Best N, and Lunn D (2003), WinBUGS User Manual, Version 1.4, Cambridge: Medical Research Council Biostatistics Unit.
- [32] Spiegelhalter DJ, Abrams KR, and Myles JP (2004), Bayesian Approaches to Clinical Trials and Health-Care Evaluation, Wiley, Chichester.
- [33] Teixeira-Pinto A and Normand SLT (2008), “Statistical methodology for classifying units on the basis of multiple-related measures”, Statistics in Medicine, 27, 1329–1350.
- [34] van der Linden WJ and Hambleton RK (1997), Handbook of Modern Item Response Theory, Springer: New York.
- [35] Williams SC, Schmaltz SP, Morton DJ, Koss RG, and Loeb JM (2005), “Quality of care in U.S. hospitals as reflected by standardized measures, 2002–2004”, The New England Journal of Medicine, 353, 255–264.
